Alibabacloud Docmind Parse

Alibaba Cloud DocMind intelligent document parsing tool. Supports PDF, Word, PPT, Excel, images and more, outputting structured Markdown/JSON/HTML. Offers two invocation modes — V2 API direct access and Alibaba Cloud POP — with automatic routing based on credential availability. Use when the user needs to parse documents, extract content (text/tables/images), convert documents to Markdown, or mentions "docmind", "document parsing", "parse file", etc.

Install

openclaw skills install @sdk-team/alibabacloud-docmind-parse

DocMind Document Parsing

Two Invocation Modes

  1. Free Mode (V2 API Direct): Configure the endpoint via the DOCMIND_V2_ENDPOINT environment variable; limited daily free quota.
  2. Alibaba Cloud POP Mode: Credentials are obtained automatically through the default credential chain; 3,000 pages per month free, pay-as-you-go beyond that.

Routing Strategy

  • When the Alibaba Cloud default credential chain is available, prefer POP Mode.
  • When credentials are unavailable but DOCMIND_V2_ENDPOINT is configured, use V2 Free Mode.
  • When the free quota is exhausted, prompt the user to activate the Alibaba Cloud DocMind service.
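
The routing rules above can be sketched as a small pure function. This is only an illustration, not the script's actual implementation: `choose_mode` and its parameters are hypothetical names, and credential detection is reduced to a boolean supplied by the caller.

```python
from typing import Optional


def choose_mode(requested: str, has_credentials: bool, v2_endpoint: Optional[str]) -> str:
    """Return "pop" or "v2" following the routing rules described above.

    Hypothetical helper: `has_credentials` stands in for whatever check is
    performed against the Alibaba Cloud default credential chain.
    """
    if requested in ("pop", "v2"):
        # An explicit --mode always wins over automatic routing.
        return requested
    if requested != "auto":
        raise ValueError(f"unknown mode: {requested!r}")
    if has_credentials:
        return "pop"  # credentials available: prefer POP Mode
    if v2_endpoint:
        return "v2"  # no credentials, but an endpoint is configured
    raise RuntimeError("no credentials and DOCMIND_V2_ENDPOINT is not configured")
```

Keeping the decision free of I/O makes it easy to check each branch in isolation.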

Environment Variables

| Variable | Description | Required |
| --- | --- | --- |
| DOCMIND_V2_ENDPOINT | V2 API service endpoint (domain or IP). Defaults to docmind.aliyuncs.com | Optional |

POP Mode automatically obtains credentials through the Alibaba Cloud default credential chain (environment variables, config files, ECS RAM roles, etc.) — no manual management needed.
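
For the V2 side, resolving DOCMIND_V2_ENDPOINT into a base URL might look like the sketch below. The helper name `v2_base_url` is hypothetical, and prepending `https://` to a bare domain or IP is an assumption, not something the skill documents.

```python
import os
from typing import Mapping

DEFAULT_V2_ENDPOINT = "docmind.aliyuncs.com"  # default stated in the table above


def v2_base_url(env: Mapping[str, str] = os.environ) -> str:
    """Build a base URL from DOCMIND_V2_ENDPOINT (a bare domain or IP)."""
    endpoint = env.get("DOCMIND_V2_ENDPOINT", "").strip() or DEFAULT_V2_ENDPOINT
    if "://" not in endpoint:
        # Assumption: a bare domain or IP is reached over HTTPS.
        endpoint = "https://" + endpoint
    return endpoint.rstrip("/")
```

Passing the environment as a mapping lets the function be exercised without touching the real process environment.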


Usage

python scripts/docmind_parse.py <file_path_or_url> [options]

Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| <file_path_or_url> | Local file path or file URL | Required |
| --mode | Invocation mode: auto, v2, pop | auto |
| --enhancement | Enhancement mode: VLM, LLM, DIGITAL, OCR, AUTO | None |
| --output | Output format: markdown, json, html | markdown |
| --pages | Page range to parse, e.g. 1-5 | All |
| --output-file | Output file path | Stdout |
| --head-foot | Parse headers and footers | Off |
| --user-prompt | Custom user prompt | None |
| --option | Document parsing options | None |
| --markdown-table | Table output format: html, markdown | None |
| --markdown-image | Image output format: html, markdown | None |
| --file-ext | File extension (alternative to fileName) | Auto-detected |

Examples

# Parse a URL (auto-select mode)
python scripts/docmind_parse.py https://example.com/doc.pdf

# Parse with VLM enhancement
python scripts/docmind_parse.py https://example.com/doc.pdf --enhancement VLM

# Parse the first 5 pages, output to a Markdown file
python scripts/docmind_parse.py ./report.pdf --pages 1-5 --output-file result.md

# Parse a local file via V2 mode (auto base64 encoding)
python scripts/docmind_parse.py ./contract.pdf --mode v2

# Parse with custom table/image output formats
python scripts/docmind_parse.py https://example.com/doc.pdf --markdown-table markdown --markdown-image html

# Parse headers and footers with a custom prompt
python scripts/docmind_parse.py https://example.com/doc.pdf --head-foot --user-prompt "Extract all footnotes"

# Force Alibaba Cloud POP mode
python scripts/docmind_parse.py ./contract.pdf --mode pop

V2 API Direct Access (Free Mode)

Supports both URL and local file (base64) upload. The request body is organized into four blocks: Document, Processing, Output, and Notification.

Submit Endpoint

POST {DOCMIND_V2_ENDPOINT}/skill/submit

Full request schema:

{
  "document": {
    "fileUrl": "https://example.com/doc.pdf",
    "fileBase64": "<base64-encoded file content, alternative to fileUrl>",
    "fileName": "doc.pdf",
    "fileNameExtension": "pdf"
  },
  "processing": {
    "enhancementMode": "VLM",
    "pageIndex": "1-5",
    "headFoot": false,
    "userPrompt": "Custom prompt",
    "option": "parsing-option"
  },
  "output": {
    "outputFormat": ["markdown"],
    "markdownTable": ["html"],
    "markdownImage": ["html"],
    "docExtraParameters": {"key": "value"},
    "extraParameters": "audio-video-extra-params",
    "ossConfig": {
      "bucket": "my-bucket",
      "endpoint": "oss-cn-hangzhou.aliyuncs.com",
      "accessKeyId": "...",
      "accessKeySecret": "...",
      "securityToken": "..."
    }
  },
  "notification": {
    "enableEventCallback": false
  }
}

document.fileUrl and document.fileBase64 are mutually exclusive. When parsing a local file via V2 mode, the script automatically reads and base64-encodes the file. fileNameExtension is auto-detected from the file extension when not explicitly provided.
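
As a rough sketch of the rule above, the `document` block can be built so that exactly one of `fileUrl` and `fileBase64` is set. `build_document_block` is a hypothetical helper written for illustration; how the real script detects URLs and extensions is not documented here.

```python
import base64
import os
from typing import Any, Dict


def build_document_block(source: str) -> Dict[str, Any]:
    """Build the "document" block for either a URL or a local file path."""
    is_url = source.startswith(("http://", "https://"))
    # For URLs, drop any query string before extracting the file name.
    name = os.path.basename(source.split("?", 1)[0] if is_url else source)
    document: Dict[str, Any] = {"fileName": name}

    # Derive fileNameExtension from the name when one is present.
    ext = os.path.splitext(name)[1].lstrip(".").lower()
    if ext:
        document["fileNameExtension"] = ext

    if is_url:
        document["fileUrl"] = source
    else:
        # Local file: read the bytes and send them base64-encoded instead.
        with open(source, "rb") as fh:
            document["fileBase64"] = base64.b64encode(fh.read()).decode("ascii")
    return document
```

The result would be placed under the `document` key of the submit request, next to the `processing`, `output`, and `notification` blocks.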

Response:

{
  "success": true,
  "data": { "bizId": "docmind-20260519-xxxx" }
}

Query Endpoint

POST {DOCMIND_V2_ENDPOINT}/skill/query

{
  "bizId": "docmind-20260519-xxxx",
  "layoutStepSize": 100,
  "layoutNum": 0
}

Response (on success):

{
  "success": true,
  "data": {
    "status": "success",
    "processing": 100.0,
    "layouts": [ ... ]
  }
}

Rate limiting: max 5 tasks per second per IP, global limit 20.
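
A polling loop over the query endpoint could look like the following sketch. Several things here are assumptions rather than documented behavior: `layoutNum` is treated as a start offset advanced by `layoutStepSize`, a status beginning with "fail" is treated as terminal, and the `post` callable is a hypothetical transport that sends a JSON body to a path and returns the decoded reply.

```python
import time
from typing import Any, Callable, Dict, List

# Hypothetical transport signature: post(path, json_body) -> decoded JSON reply.
PostFn = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def fetch_layouts(post: PostFn, biz_id: str, step: int = 100,
                  interval: float = 2.0, max_requests: int = 300) -> List[Dict[str, Any]]:
    """Poll /skill/query until the task succeeds, then page through its layouts."""
    layouts: List[Dict[str, Any]] = []
    offset = 0
    for _ in range(max_requests):
        reply = post("/skill/query",
                     {"bizId": biz_id, "layoutStepSize": step, "layoutNum": offset})
        if not reply.get("success"):
            raise RuntimeError(f"query failed: {reply}")
        data = reply.get("data") or {}
        status = str(data.get("status", "")).lower()
        if status.startswith("fail"):
            raise RuntimeError(f"parsing failed for {biz_id}")
        if status != "success":
            time.sleep(interval)  # still processing: wait, then ask again
            continue
        page = data.get("layouts") or []
        layouts.extend(page)
        if len(page) < step:
            return layouts  # a short page means nothing is left
        offset += step
    raise TimeoutError(f"gave up on {biz_id} after {max_requests} requests")
```

With the `requests` library, `post` could be as simple as a function that calls `requests.post(base_url + path, json=body, timeout=30).json()`.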


Alibaba Cloud POP Invocation

Three-step async workflow using the default credential chain to initialize the client:

  1. Submit task - SubmitDocParserJob / SubmitDocParserJobAdvance
  2. Query status - QueryDocParserStatus (poll until success/fail)
  3. Get result - GetDocParserResult (incremental retrieval via LayoutNum + LayoutStepSize pagination)

POP endpoint: docmind-api.cn-hangzhou.aliyuncs.com, API version: 2022-07-11

import os

from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_tea_openapi import models as open_api_models
from alibabacloud_docmind_api20220711.client import Client as DocMindClient

cred = CredClient()
config = open_api_models.Config(
    credential=cred,
    endpoint="docmind-api.cn-hangzhou.aliyuncs.com",
    user_agent="AlibabaCloud-Agent-Skills/alibabacloud-docmind-parse/" + os.environ.get("SKILL_SESSION_ID", "unknown")
)
client = DocMindClient(config)
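
The three-step workflow could then be driven roughly as follows. This is a sketch under assumptions: `parse_via_pop` is a hypothetical helper, `models` is expected to be the `alibabacloud_docmind_api20220711.models` module, status strings are compared case-insensitively, and `LayoutNum` is treated as a start offset. Check the request and response field names against the installed SDK version before relying on them.

```python
import time
from typing import Any, List


def parse_via_pop(client: Any, models: Any, file_url: str, file_name: str,
                  step: int = 100, interval: float = 2.0, max_polls: int = 150) -> List[Any]:
    """Submit a URL, poll until done, then page through the result layouts."""
    # Step 1: submit the task and remember its job id.
    submitted = client.submit_doc_parser_job(
        models.SubmitDocParserJobRequest(file_url=file_url, file_name=file_name)
    )
    job_id = submitted.body.data.id

    # Step 2: poll the status until it reports success or failure.
    for _ in range(max_polls):
        reply = client.query_doc_parser_status(models.QueryDocParserStatusRequest(id=job_id))
        status = str(reply.body.data.status).lower()
        if status == "success":
            break
        if status.startswith("fail"):
            raise RuntimeError(f"job {job_id} failed")
        time.sleep(interval)
    else:
        raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")

    # Step 3: fetch layouts page by page (assumes data is a dict with "layouts").
    layouts: List[Any] = []
    offset = 0
    while True:
        result = client.get_doc_parser_result(
            models.GetDocParserResultRequest(id=job_id, layout_step_size=step, layout_num=offset)
        )
        page = (result.body.data or {}).get("layouts") or []
        layouts.extend(page)
        if len(page) < step:
            return layouts
        offset += step
```

Passing `client` and `models` in as arguments keeps the sketch independent of how the client was constructed; local files would go through SubmitDocParserJobAdvance instead.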

Quota

| Mode | Free Quota | After Exhaustion |
| --- | --- | --- |
| V2 API Direct | Limited daily quota | Prompt to activate Alibaba Cloud service |
| Alibaba Cloud POP | 3,000 pages/month | Automatic pay-as-you-go |

When quota is exhausted, prompt the user to visit https://docmind.console.aliyun.com/doc-overview to activate the service.


Error Handling

| Error Code | Meaning | Resolution |
| --- | --- | --- |
| QuotaExhausted / Throttling | Quota exhausted or rate-limited | Prompt to activate Alibaba Cloud service |
| FileUrlLegal | Invalid file URL | Verify the URL is publicly accessible |
| InvalidFileFormat | Unsupported file format | Show the list of supported formats |
| FileSizeExceeded | File too large | V2 limit 5 MB, POP limit 150 MB |
| OssAccessDeniedError / HTTP 403 | URL points to a private or restricted OSS resource | See pre-validation rules below |
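
One way to act on these codes is a simple lookup, sketched below. `RESOLUTIONS` and `resolution_for` are hypothetical names, and the message wording is illustrative; only the error codes and limits come from the table.

```python
from typing import Optional

_ACTIVATE = "Prompt the user to activate the Alibaba Cloud DocMind service."

# Error code -> suggested next step, mirroring the table above.
RESOLUTIONS = {
    "QuotaExhausted": _ACTIVATE,
    "Throttling": _ACTIVATE,
    "FileUrlLegal": "Verify the URL is publicly accessible.",
    "InvalidFileFormat": "Show the list of supported formats.",
    "FileSizeExceeded": "File too large: V2 limit 5 MB, POP limit 150 MB.",
    "OssAccessDeniedError": "Ask for a public URL or a local copy of the file.",
}


def resolution_for(error_code: Optional[str], http_status: Optional[int] = None) -> str:
    """Pick a resolution from the error code, falling back to the HTTP status."""
    if error_code in RESOLUTIONS:
        return RESOLUTIONS[error_code]
    if http_status == 403:
        # HTTP 403 is handled the same way as OssAccessDeniedError.
        return RESOLUTIONS["OssAccessDeniedError"]
    return f"Unrecognized error: {error_code or http_status}"
```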

Pre-validation and Exception Handling

Before invoking the script, the Agent MUST perform the following checks:

  1. Local file path validation: If the input is a local file path, verify that the file exists and is readable before calling the script. If the file does not exist, stop immediately and ask the user to verify the path. The Agent must NEVER create, fabricate, or substitute a file to bypass a missing-file error.

  2. Private URL detection: If the input is a URL and V2 mode returns OssAccessDeniedError or HTTP 403, the URL is likely private or requires authentication. In this case:

    • Inform the user that the URL is not publicly accessible.
    • Ask the user to either provide a publicly accessible URL, or download the file and re-run with the local path. With --mode v2, the script base64-encodes and uploads the local file automatically.

  3. Network unreachable: If the script fails with a connection error (e.g. ConnectionError, ConnectionRefused), check that:

    • The DOCMIND_V2_ENDPOINT (default docmind.aliyuncs.com) is reachable from the current environment.
    • Proxy settings or firewall rules are not blocking outbound HTTPS traffic.
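
The local-file check in rule 1 can be sketched as below, combined with the size limits from the error table. `precheck_local_file` is a hypothetical helper, and treating 1 MB as 1024 × 1024 bytes is an assumption. It only reports problems and never creates or substitutes a file.

```python
import os
from typing import Optional

# Per-mode size limits from the error table (assumes 1 MB = 1024 * 1024 bytes).
SIZE_LIMITS = {"v2": 5 * 1024 * 1024, "pop": 150 * 1024 * 1024}


def precheck_local_file(path: str, mode: str) -> Optional[str]:
    """Return None when the file may be passed on, else a message for the user."""
    if not os.path.isfile(path):
        return f"File not found: {path}. Please verify the path."
    if not os.access(path, os.R_OK):
        return f"File is not readable: {path}."
    limit = SIZE_LIMITS.get(mode)
    if limit is not None and os.path.getsize(path) > limit:
        return f"File exceeds the {limit // (1024 * 1024)} MB limit for {mode} mode."
    return None
```

A non-None return value is the cue to stop and ask the user, rather than calling the script.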

Supported File Formats

  • Documents: PDF, Word (doc/docx), PPT (ppt/pptx), Excel (xls/xlsx/xlsm)
  • Images: JPG, JPEG, PNG, BMP, GIF
  • Other: Markdown, HTML, EPUB, MOBI, RTF, TXT
  • Audio/Video (POP mode only): MP4, MKV, AVI, MOV, WMV, MP3, WAV, AAC
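
A format check based on this list might look like the following sketch. The extension spellings for Markdown and HTML (`md`, `html`) are assumptions, and `is_supported` is a hypothetical helper.

```python
import os

# Extensions accepted in both modes (md/html spellings are assumed).
COMMON_EXTENSIONS = {
    "pdf", "doc", "docx", "ppt", "pptx", "xls", "xlsx", "xlsm",
    "jpg", "jpeg", "png", "bmp", "gif",
    "md", "html", "epub", "mobi", "rtf", "txt",
}
# Audio/video formats are listed as POP mode only.
POP_ONLY_EXTENSIONS = {"mp4", "mkv", "avi", "mov", "wmv", "mp3", "wav", "aac"}


def is_supported(file_name: str, mode: str) -> bool:
    """Check a file name's extension against the supported-format list."""
    ext = os.path.splitext(file_name)[1].lstrip(".").lower()
    if ext in COMMON_EXTENSIONS:
        return True
    return mode == "pop" and ext in POP_ONLY_EXTENSIONS
```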

Output Formats

Markdown Output

Each layout block's markdownContent field is concatenated; tables are embedded as HTML tables.

JSON Output

Raw layouts structured data, including type/subType, text, markdownContent, pageNum, index, pos and other fields.
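
The Markdown output described above amounts to joining one field across the layout blocks. A minimal sketch, assuming blocks arrive in reading order and that blocks without `markdownContent` are skipped (`layouts_to_markdown` is a hypothetical name, and joining without a separator is also an assumption):

```python
from typing import Any, Dict, Iterable


def layouts_to_markdown(layouts: Iterable[Dict[str, Any]]) -> str:
    """Concatenate the markdownContent of each layout block, in order."""
    parts = [block.get("markdownContent") or "" for block in layouts]
    return "".join(parts)
```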


Observability

All outbound HTTP requests (V2 API and Alibaba Cloud POP SDK) set the following User-Agent header:

AlibabaCloud-Agent-Skills/alibabacloud-docmind-parse/{session-id}

The {session-id} segment is read from the SKILL_SESSION_ID environment variable at runtime. If the variable is not set, the value falls back to unknown.

| Component | Value |
| --- | --- |
| UA template | AlibabaCloud-Agent-Skills/alibabacloud-docmind-parse/{session-id} |
| session-id source | SKILL_SESSION_ID env var |
| Fallback | unknown |
| Env var example | export SKILL_SESSION_ID=agent-abc123-session-xyz |

All skill invocations sharing the same SKILL_SESSION_ID value are correlated as a single session in server-side logs.
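
Building the header value is a one-liner. The sketch below mirrors the template and fallback described above; `build_user_agent` is a hypothetical helper name.

```python
import os
from typing import Mapping

UA_PREFIX = "AlibabaCloud-Agent-Skills/alibabacloud-docmind-parse/"


def build_user_agent(env: Mapping[str, str] = os.environ) -> str:
    """Return the User-Agent value, falling back to "unknown" when unset."""
    return UA_PREFIX + env.get("SKILL_SESSION_ID", "unknown")
```

For V2 requests the value would be sent as `headers={"User-Agent": build_user_agent()}`; for POP it is the `user_agent` field of the client config.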


Dependency Installation

# POP mode
pip install "alibabacloud-docmind-api20220711>=1.0.0" \
            "alibabacloud-credentials>=1.0.0" \
            "alibabacloud-tea-openapi>=0.3.8" \
            "alibabacloud-tea-util>=0.3.0" \
            "alibabacloud-gateway-pop>=0.1.0"

# V2 mode
pip install "requests>=2.20.0"