polaris-datainsight-doc-extract

v1.0.0

Extract structured data from Office documents (DOCX, PPTX, XLSX, HWP, HWPX) using the Polaris AI DataInsight Doc Extract API. Use when the user wants to pars...

Security Scan
VirusTotal: Benign
OpenClaw: Suspicious (medium confidence)
Purpose & Capability
The skill's stated purpose (extracting DOCX/PPTX/XLSX/HWP/HWPX via Polaris DataInsight) matches the runtime instructions (POST to datainsight-api.polarisoffice.com with x-po-di-apikey). However, the registry metadata lists no required environment variables or primary credential while the SKILL.md explicitly requires POLARIS_DATAINSIGHT_API_KEY — an incoherence between claimed requirements and runtime needs.
Instruction Scope
The SKILL.md stays within the stated purpose: it shows how to POST the file, parse the ZIP response, and return structured JSON. It does not instruct reading unrelated system files, harvesting other environment variables, or sending data to third parties besides the documented Polaris endpoints. It does instruct the agent to invoke the skill broadly when users mention document extraction, which is consistent with the skill's function.
Install Mechanism
This is instruction-only (no install spec, no code files). That minimizes install-time risk because nothing is downloaded or written by the skill itself.
Credentials
The SKILL.md requires an API key via the environment variable POLARIS_DATAINSIGHT_API_KEY and shows the x-po-di-apikey header, but the skill metadata did not declare any required env vars or a primary credential. Requesting a service API key is proportionate to the purpose, but the metadata omission is a mismatch that could mislead users about what secrets are needed and expected.
Persistence & Privilege
always is false and there are no install scripts or indications the skill will modify agent-wide settings or other skills. The skill does not request permanent platform privileges beyond normal autonomous invocation.
What to consider before installing
This skill appears to do what it says (send a document to Polaris DataInsight and parse the returned ZIP), but the registry metadata failing to declare the required POLARIS_DATAINSIGHT_API_KEY is a red flag you should resolve before installing. Actions to take before use:

  1. Ask the publisher to update the registry metadata to list POLARIS_DATAINSIGHT_API_KEY as a required credential and to provide a homepage/source for verification.
  2. Verify the API hostname (datainsight-api.polarisoffice.com) and the service terms/privacy policy on the official Polaris/Polaris Office site.
  3. Only supply an API key you control; avoid putting long-lived/high-privilege credentials into shared environments. Prefer scoped or ephemeral keys if Polaris supports them.
  4. Test with non-sensitive documents first to confirm where data is transmitted and how results are returned.
  5. If you cannot confirm the publisher or metadata, treat the skill as untrusted and do not expose sensitive documents or secrets to it.


MIT-0

Polaris AI DataInsight — Doc Extract Skill

Use the Polaris AI DataInsight Doc Extract API to extract text, images, tables, charts, shapes, equations, and more from Word, PowerPoint, Excel, HWP, and HWPX files, returning everything as a structured unifiedSchema JSON. A single API call gives you the full document structure without any manual parsing.


When to Use This Skill

  • The user wants to extract text, tables, charts, or images from DOCX, PPTX, XLSX, HWP, or HWPX files
  • The user needs to understand a document's structure (page count, element types, position data, etc.)
  • The extracted data will be used in a RAG pipeline, data analysis workflow, or automation task
  • Table data needs to be converted to CSV, or chart data needs to be broken down into series and labels
  • The user needs to parse special elements like headers, footers, equations, or shapes

What This Skill Does

  1. Authentication — Authenticates with the Polaris DataInsight API via the x-po-di-apikey header.
  2. Upload and extract — Sends the file as a multipart/form-data POST request and extracts the full document structure.
  3. Parse ZIP response — The API returns a ZIP file; extract it and load the unifiedSchema JSON inside.
  4. Deliver structured data — Returns a JSON organized by page and element type (text, table, chart, image, shape, equation, etc.).
  5. Support multiple usage patterns — Handles full text extraction, table-to-CSV conversion, RAG chunk generation, and more.

How to Use

Prerequisites

Get an API Key: Sign up at https://datainsight.polarisoffice.com and generate your API key.

Authentication: Include the API key as a header on every request.

Header: x-po-di-apikey: $POLARIS_DATAINSIGHT_API_KEY

Set the environment variable:

export POLARIS_DATAINSIGHT_API_KEY="your-api-key-here"

Limits

| Item | Limit |
| --- | --- |
| Supported formats | HWP, HWPX, DOCX, PPTX, XLSX |
| Max file size | 25 MB |
| Timeout | 10 minutes |
| Rate limit | 10 requests per minute |
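
The limits above can be checked locally before a file is uploaded, which avoids spending a rate-limited request on a file the API would reject. The following is a minimal sketch, not part of the Polaris API: the function and constant names are hypothetical, and it assumes "25 MB" means 25 * 1024 * 1024 bytes and that the format is identified by file extension.

```python
from pathlib import Path

# Limits copied from the table above; the names are illustrative only.
SUPPORTED_EXTENSIONS = {".hwp", ".hwpx", ".docx", ".pptx", ".xlsx"}
MAX_FILE_SIZE_BYTES = 25 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes


def check_upload(file_path: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    path = Path(file_path)
    problems = []
    # Assumption: the service decides the format from the file extension.
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {path.suffix or '(none)'}")
    if not path.is_file():
        problems.append("file not found")
    elif path.stat().st_size > MAX_FILE_SIZE_BYTES:
        problems.append(f"file is {path.stat().st_size} bytes, over the 25 MB limit")
    return problems
```

Calling check_upload("report.docx") before the upload and skipping the request when the returned list is non-empty keeps obviously invalid files from counting against the rate limit.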

Basic Usage

Endpoint:

POST https://datainsight-api.polarisoffice.com/api/v1/datainsight/doc-extract

Extract a document with Python:

import io
import json
import os
import zipfile

import requests

API_URL = "https://datainsight-api.polarisoffice.com/api/v1/datainsight/doc-extract"

def extract_document(file_path: str, api_key: str) -> dict:
    # Upload the file as multipart/form-data; the key goes in the x-po-di-apikey header.
    with open(file_path, "rb") as f:
        response = requests.post(
            API_URL,
            headers={"x-po-di-apikey": api_key},
            files={"file": f},
            timeout=600,  # seconds; matches the documented 10-minute limit
        )

    if response.status_code != 200:
        raise RuntimeError(f"API error: {response.status_code} - {response.text}")

    # The response body is a ZIP archive; load the first JSON file inside it.
    with zipfile.ZipFile(io.BytesIO(response.content)) as z:
        json_files = [name for name in z.namelist() if name.endswith(".json")]
        if not json_files:
            raise RuntimeError("No JSON found in ZIP")
        with z.open(json_files[0]) as jf:
            return json.load(jf)

# Example usage
api_key = os.environ["POLARIS_DATAINSIGHT_API_KEY"]
schema = extract_document("report.docx", api_key)
print(f"Extracted {schema['totalPages']} pages")

Extract with curl:

curl -X POST "https://datainsight-api.polarisoffice.com/api/v1/datainsight/doc-extract" \
  -H "x-po-di-apikey: $POLARIS_DATAINSIGHT_API_KEY" \
  -F "file=@example.docx" \
  --output result.zip

unzip result.zip -d result/
cat result/*.json | python -m json.tool

Advanced Usage

Response Structure (unifiedSchema)

Root:

{
  "docName": "sample.docx",
  "totalPages": 3,
  "pages": [ ... ]
}

Page (pages[]):

{
  "pageNum": 1,
  "pageWidth": 595.3,
  "pageHeight": 842.0,
  "extractionSummary": {
    "text": 5, "image": 2, "table": 1, "chart": 1
  },
  "elements": [ ... ]
}
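
The per-page extractionSummary can serve as a quick sanity check on a parsed schema. The sketch below counts elements by type across all pages and sums the per-page summaries so the two can be compared. It assumes that extractionSummary holds per-type element counts for its page, which the sample suggests but the source does not state; both function names are hypothetical.

```python
from collections import Counter


def count_element_types(schema: dict) -> Counter:
    """Count elements by their "type" field across every page."""
    counts = Counter()
    for page in schema.get("pages", []):
        for el in page.get("elements", []):
            counts[el.get("type", "unknown")] += 1
    return counts


def summed_extraction_summary(schema: dict) -> Counter:
    """Add up the extractionSummary dicts of all pages.

    Assumption: each value is the number of elements of that type on the page.
    """
    totals = Counter()
    for page in schema.get("pages", []):
        totals.update(page.get("extractionSummary", {}))
    return totals
```

If the two counters differ for a document, that is a hint that the assumption about extractionSummary does not hold for some element types, rather than proof of an extraction error.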

Element types (elements[].type):

| type | Description |
| --- | --- |
| text | Text block |
| image | Image |
| table | Table |
| chart | Chart |
| shape | Shape |
| equation | Equation |
| header / footer | Header / Footer |

Common element structure:

{
  "type": "text",
  "id": "te1",
  "boundaryBox": { "left": 40, "top": 80, "right": 300, "bottom": 120 },
  "content": { "text": "Body content here" }
}
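
The boundaryBox fields shown above are enough to derive an element's size and a rough top-to-bottom ordering. This is a sketch under assumptions the source does not confirm: the origin is taken to be the top-left corner of the page (so top is smaller than bottom, as in the sample), and the units are taken to match pageWidth and pageHeight. The helper names are hypothetical.

```python
def box_size(element: dict) -> tuple[float, float]:
    """Return (width, height) computed from an element's boundaryBox."""
    box = element.get("boundaryBox", {})
    width = box.get("right", 0) - box.get("left", 0)
    height = box.get("bottom", 0) - box.get("top", 0)
    return width, height


def in_reading_order(elements: list[dict]) -> list[dict]:
    """Sort elements top-to-bottom, then left-to-right.

    Assumption: smaller "top" values are closer to the top of the page.
    """
    def key(el: dict) -> tuple[float, float]:
        box = el.get("boundaryBox", {})
        return (box.get("top", 0), box.get("left", 0))

    return sorted(elements, key=key)
```

For the sample element above, box_size returns (260, 40). Multi-column layouts would need a smarter ordering than this simple sort.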

Table content:

{
  "content": {
    "html": "<table>...</table>",
    "csv": "Header1,Header2\nValue1,Value2",
    "json": [
      {
        "metrics": { "rowaddr": 0, "coladdr": 0, "rowspan": 1, "colspan": 1 },
        "para": [{ "content": [{ "text": "Cell content" }] }]
      }
    ]
  }
}
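
When the html and csv forms are not convenient, the json cell list can be turned into a two-dimensional grid. The sketch below assumes that rowaddr and coladdr are zero-based row and column indices and that a cell's text is the concatenation of the text runs in its para entries. Both are inferred from the sample, not documented behavior, and table_to_grid is a hypothetical helper.

```python
def table_to_grid(table_content: dict) -> list[list[str]]:
    """Build a row-major grid of cell text from a table's "json" cell list.

    Merged cells keep their text in the top-left position only; the other
    positions they cover are left as empty strings.
    """
    cells = table_content.get("json", [])
    if not cells:
        return []

    # Grid size: furthest row/column reached by any cell, including spans.
    n_rows = max(c["metrics"]["rowaddr"] + c["metrics"].get("rowspan", 1) for c in cells)
    n_cols = max(c["metrics"]["coladdr"] + c["metrics"].get("colspan", 1) for c in cells)
    grid = [[""] * n_cols for _ in range(n_rows)]

    for cell in cells:
        metrics = cell["metrics"]
        # Join the runs within a paragraph, and paragraphs with newlines.
        text = "\n".join(
            "".join(run.get("text", "") for run in para.get("content", []))
            for para in cell.get("para", [])
        )
        grid[metrics["rowaddr"]][metrics["coladdr"]] = text
    return grid
```

A grid built this way keeps merged-cell information that the csv string may flatten, at the cost of a little more code.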

Chart content:

{
  "content": {
    "chart_type": "column",
    "title": "Annual Sales Comparison",
    "x_axis_labels": ["Q1", "Q2", "Q3", "Q4"],
    "series_names": ["2023", "2024"],
    "series_values": [[100, 200, 150, 300], [120, 220, 180, 320]],
    "csv": "Quarter,2023,2024\nQ1,100,120\nQ2,200,220"
  }
}
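
The chart fields above can be reshaped into one record per x-axis label, which is often easier to load into a dataframe or database than parallel lists. This sketch assumes that series_values[i] belongs to series_names[i] and that each series is aligned with x_axis_labels by position, as the sample csv suggests; chart_to_records is a hypothetical helper.

```python
def chart_to_records(chart_content: dict) -> list[dict]:
    """Turn parallel chart lists into one dict per x-axis label."""
    labels = chart_content.get("x_axis_labels", [])
    names = chart_content.get("series_names", [])
    values = chart_content.get("series_values", [])

    records = []
    for i, label in enumerate(labels):
        record = {"label": label}
        for name, series in zip(names, values):
            # Use None when a series is shorter than the label list.
            record[name] = series[i] if i < len(series) else None
        records.append(record)
    return records
```

For the sample chart above, the first record is {"label": "Q1", "2023": 100, "2024": 120}.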

Usage Patterns

Extract all text:

def get_all_text(schema: dict) -> str:
    texts = []
    for page in schema.get("pages", []):
        for el in page.get("elements", []):
            if el["type"] == "text" and el.get("content", {}).get("text"):
                texts.append(el["content"]["text"])
    return "\n".join(texts)

Extract tables as CSV:

def get_tables_as_csv(schema: dict) -> list:
    tables = []
    for page in schema.get("pages", []):
        for el in page.get("elements", []):
            if el["type"] == "table":
                csv_data = el.get("content", {}).get("csv", "")
                if csv_data:
                    tables.append(csv_data)
    return tables
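
The CSV strings returned by get_tables_as_csv can be parsed with Python's standard csv module, which handles quoted fields that contain commas or line breaks. A minimal sketch follows. The helper names are hypothetical, and csv_to_dicts assumes the first row of the table is a header row, which will not be true for every table.

```python
import csv
import io


def csv_to_rows(csv_text: str) -> list[list[str]]:
    """Parse a content.csv string into a list of rows."""
    # newline="" lets the csv module handle line endings itself.
    return list(csv.reader(io.StringIO(csv_text, newline="")))


def csv_to_dicts(csv_text: str) -> list[dict]:
    """Parse a content.csv string into dicts keyed by the first row.

    Assumption: the first row holds the column headers.
    """
    return list(csv.DictReader(io.StringIO(csv_text, newline="")))
```

Using the csv module instead of str.split(",") avoids breaking cells whose text itself contains a comma.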

Generate RAG chunks:

def make_rag_chunks(schema: dict) -> list:
    chunks = []
    doc_name = schema.get("docName", "")
    for page in schema.get("pages", []):
        for el in page.get("elements", []):
            text = el.get("content", {}).get("text") or el.get("content", {}).get("csv") or ""
            if text.strip():
                chunks.append({
                    "source": doc_name,
                    "page": page["pageNum"],
                    "type": el["type"],
                    "text": text.strip()
                })
    return chunks

Example

User: "Extract all table data from this DOCX report as CSV."

Output:

import os
schema = extract_document("report.docx", os.environ["POLARIS_DATAINSIGHT_API_KEY"])
tables = get_tables_as_csv(schema)
for i, csv_data in enumerate(tables):
    print(f"=== Table {i+1} ===")
    print(csv_data)

Printed result:

=== Table 1 ===
Quarter,Revenue,Cost
Q1,1200,800
Q2,1500,900

=== Table 2 ===
Item,Amount
Labor,500
Operations,300

Inspired by: Polaris Office DataInsight API documentation and workflow.


Tips

  • The response is always a ZIP file. Do not try to parse response.content directly as JSON — use zipfile.ZipFile to extract it first.
  • content.csv is available for both table and chart elements, making it the most convenient format for data extraction.
  • The rate limit is 10 requests per minute. When processing multiple files, add a delay (e.g., time.sleep(6)) between calls.
  • Use boundaryBox to determine where each element sits on the page — useful for layout analysis.
  • Always store the API key in an environment variable (POLARIS_DATAINSIGHT_API_KEY) and never hardcode it.
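
The rate-limit tip can be sketched as a small batch helper that spaces calls at least six seconds apart (60 seconds divided by 10 requests). This is one possible approach, not part of the Polaris API. The function name extract_many is hypothetical, and the extract, sleep, and clock parameters are injected so the pacing logic can be exercised without network access or real waiting.

```python
import time
from typing import Callable, Iterable

# 10 requests per minute -> at most one call every 6 seconds.
MIN_INTERVAL_SECONDS = 60 / 10


def extract_many(
    paths: Iterable[str],
    extract: Callable[[str], dict],
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call extract(path) for each path, keeping calls at least 6 s apart."""
    results = {}
    last_call = None
    for path in paths:
        if last_call is not None:
            # Wait only for whatever is left of the 6-second window.
            wait = MIN_INTERVAL_SECONDS - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results[path] = extract(path)
    return results
```

With the extract_document function from Basic Usage, a call could look like extract_many(paths, lambda p: extract_document(p, api_key)). Measuring from the start of the previous call means a slow extraction does not add an unnecessary extra delay.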

Common Use Cases

  • Document search systems: Extract full text and store it in a vector database for semantic search
  • Automated report analysis: Collect table and chart data from PPTX/DOCX reports for analysis
  • HWP digitization: Convert HWP/HWPX documents into structured, machine-readable data
  • RAG pipeline setup: Split documents into chunks for use in LLM-based Q&A systems
  • Data migration: Move table and chart data from legacy Office documents into a database

License & Terms

  • Skill Definition: This SKILL.md file is provided under the Apache 2.0 license.
  • Service Access: Usage of the DataInsight API requires a valid subscription or license key.
  • Restrictions: Unauthorized redistribution of the API endpoints or bypassing authentication is strictly prohibited.
  • Support: For licensing inquiries, visit https://datainsight.polarisoffice.com.
