MinerU PDF Extractor

Extract PDF content to Markdown using MinerU API. Supports formulas, tables, OCR. Provides both local file and online URL parsing methods.

MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
VirusTotal
Benign
OpenClaw
Benign
high confidence
Purpose & Capability
The skill's name/description (PDF → Markdown using MinerU) matches the included scripts and docs: they call MinerU API endpoints, upload files to presigned OSS URLs, poll results and download a ZIP with parsed Markdown. One inconsistency: the registry metadata at the top states "Required env vars: none", but the SKILL.md and all scripts clearly require an API token (MINERU_TOKEN or MINERU_API_KEY). This is likely an authoring/metadata omission rather than malicious behavior, but users should be aware the token is required.
Instruction Scope
The runtime instructions and scripts operate within stated scope: reading a local PDF path (when using local flow), validating/sanitizing inputs, calling MinerU API endpoints under MINERU_BASE_URL, uploading to presigned OSS URLs and downloading results from the official CDN host. Scripts include input sanitization, ZIP validation and directory traversal checks. They do not attempt to read unrelated system files or send data to unexpected external endpoints. Minor tooling note: scripts optionally pipe responses to `python3 -m json.tool` for pretty-printing but SKILL.md does not list python3 as a recommended/required tool.
Install Mechanism
There is no install spec; this is an instruction-only skill with included shell scripts. Nothing in the bundle downloads arbitrary code at install time. Risk is low from the install mechanism itself. However, running the provided scripts will execute code included in the repo, so users should review them before executing.
Credentials
The scripts require a single service credential (MINERU_TOKEN or MINERU_API_KEY) and optionally MINERU_BASE_URL. That is proportional for a MinerU API integration. The only notable mismatch is registry metadata claiming no required env vars while SKILL.md and scripts require the token—this should be corrected. No unrelated secrets or broad cloud credentials (AWS, GCP, etc.) are requested.
Persistence & Privilege
The skill does not request permanent/always-on privileges, does not alter other skills or system-wide configs, and is user-invocable only. Default autonomous invocation is allowed (platform normal) but the skill itself does not request elevated persistence.
Assessment
This skill appears to be what it claims: a set of scripts that call the MinerU API to parse PDFs. Before installing or running it:

1. Set MINERU_TOKEN (or MINERU_API_KEY). SKILL.md requires it even though the top-level registry metadata omits it.
2. Review the included shell scripts, and only run them if you trust the source and the MinerU endpoints listed (mineru.net, mineru.oss-cn-shanghai.aliyuncs.com, cdn-mineru.openxlab.org.cn).
3. The scripts use curl and unzip (and may use jq or python3 if present); install those if you want improved JSON handling.
4. Treat your MinerU token as sensitive: do not expose it in public repos or logs, and consider least-privilege options with the provider.
5. If you will process sensitive PDFs, verify the provider's privacy policy before uploading.

Overall: coherent and low-risk for its stated purpose, with only the metadata-accuracy and tooling notes above to fix.


Current version: v1.0.5


SKILL.md

MinerU PDF Extractor

Extract PDF documents to structured Markdown using the MinerU API. Supports formula recognition, table extraction, and OCR.

Note: This is a community skill, not an official MinerU product. You need to obtain your own API key from MinerU.


📁 Skill Structure

mineru-pdf-extractor/
├── SKILL.md                          # English documentation
├── SKILL_zh.md                       # Chinese documentation
├── docs/                             # Documentation
│   ├── Local_File_Parsing_Guide.md   # Local PDF parsing detailed guide (English)
│   ├── Online_URL_Parsing_Guide.md   # Online PDF parsing detailed guide (English)
│   ├── MinerU_本地文档解析完整流程.md  # Local parsing complete guide (Chinese)
│   └── MinerU_在线文档解析完整流程.md  # Online parsing complete guide (Chinese)
└── scripts/                          # Executable scripts
    ├── local_file_step1_apply_upload_url.sh    # Local parsing Step 1
    ├── local_file_step2_upload_file.sh         # Local parsing Step 2
    ├── local_file_step3_poll_result.sh         # Local parsing Step 3
    ├── local_file_step4_download.sh            # Local parsing Step 4
    ├── online_file_step1_submit_task.sh        # Online parsing Step 1
    └── online_file_step2_poll_result.sh        # Online parsing Step 2

🔧 Requirements

Required Environment Variables

Scripts read the MinerU token automatically from environment variables (set either one):

# Option 1: Set MINERU_TOKEN
export MINERU_TOKEN="your_api_token_here"

# Option 2: Set MINERU_API_KEY
export MINERU_API_KEY="your_api_token_here"
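The fallback order (MINERU_TOKEN first, then MINERU_API_KEY, as described in the Notes section below) can be sketched as a small POSIX-sh helper. This is an illustration of the documented behavior, not the scripts' exact code:

```shell
# Resolve the MinerU token with the documented fallback order:
# MINERU_TOKEN first, then MINERU_API_KEY; fail if neither is set.
resolve_token() {
    token="${MINERU_TOKEN:-${MINERU_API_KEY:-}}"
    if [ -z "$token" ]; then
        echo "Error: set MINERU_TOKEN or MINERU_API_KEY" >&2
        return 1
    fi
    printf '%s\n' "$token"
}
```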

Required Command-Line Tools

  • curl - For HTTP requests (usually pre-installed)
  • unzip - For extracting results (usually pre-installed)

Optional Tools

  • jq - For enhanced JSON parsing and security (recommended but not required)
    • If not installed, scripts will use fallback methods
    • Install: apt-get install jq (Debian/Ubuntu) or brew install jq (macOS)
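The jq-with-fallback approach might look like the sketch below. This is illustrative, not the scripts' exact code; it assumes responses carry their fields under a top-level "data" object, and the sed fallback further assumes the value contains no escaped quotes:

```shell
# Extract a string field from a JSON response, preferring jq when
# available and falling back to a simpler (less robust) sed pattern.
json_get() {
    key="$1"; json="$2"
    if command -v jq >/dev/null 2>&1; then
        printf '%s\n' "$json" | jq -r ".data.$key"
    else
        printf '%s\n' "$json" | sed -n "s/.*\"$key\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p"
    fi
}
```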

Optional Configuration

# Set API base URL (default is pre-configured)
export MINERU_BASE_URL="https://mineru.net/api/v4"

💡 Get Token: Visit https://mineru.net/apiManage/docs to register and obtain an API Key


📄 Feature 1: Parse Local PDF Documents

For locally stored PDF files. Requires 4 steps.

Quick Start

cd scripts/

# Step 1: Apply for upload URL
./local_file_step1_apply_upload_url.sh /path/to/your.pdf
# Output: BATCH_ID=xxx UPLOAD_URL=xxx

# Step 2: Upload file
./local_file_step2_upload_file.sh "$UPLOAD_URL" /path/to/your.pdf

# Step 3: Poll for results
./local_file_step3_poll_result.sh "$BATCH_ID"
# Output: FULL_ZIP_URL=xxx

# Step 4: Download results
./local_file_step4_download.sh "$FULL_ZIP_URL" result.zip extracted/
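Each step script prints its results as KEY=value lines. Because presigned upload URLs can contain shell-special characters such as & and =, extracting values with a small helper is safer than eval. A minimal sketch (the helper name get_kv is illustrative, not from the scripts):

```shell
# Print the value of KEY from the KEY=value lines the step scripts emit.
get_kv() {
    sed -n "s/^$1=//p"
}

# Example with step-1 style output (a literal string for illustration):
out="BATCH_ID=xyz
UPLOAD_URL=https://example.com/u?a=1&b=2"
printf '%s\n' "$out" | get_kv BATCH_ID    # prints: xyz
```

In real use, capture the script output once, e.g. `out=$(./local_file_step1_apply_upload_url.sh your.pdf)`, then pull each key from `$out` the same way.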

Script Descriptions

local_file_step1_apply_upload_url.sh

Apply for upload URL and batch_id.

Usage:

./local_file_step1_apply_upload_url.sh <pdf_file_path> [language] [layout_model]

Parameters:

  • language: ch (Chinese), en (English), auto (auto-detect), default ch
  • layout_model: doclayout_yolo (fast), layoutlmv3 (accurate), default doclayout_yolo

Output:

BATCH_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
UPLOAD_URL=https://mineru.oss-cn-shanghai.aliyuncs.com/...

local_file_step2_upload_file.sh

Upload PDF file to the presigned URL.

Usage:

./local_file_step2_upload_file.sh <upload_url> <pdf_file_path>

local_file_step3_poll_result.sh

Poll extraction results until completion or failure.

Usage:

./local_file_step3_poll_result.sh <batch_id> [max_retries] [retry_interval_seconds]

Output:

FULL_ZIP_URL=https://cdn-mineru.openxlab.org.cn/pdf/.../xxx.zip
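The max_retries / retry_interval_seconds parameters map onto a standard poll loop. A generic sketch of that pattern (an illustration, not the script's actual implementation):

```shell
# Poll-until-done pattern: retry a command up to max_retries times,
# sleeping interval seconds between attempts.
poll() {
    max_retries="$1"; interval="$2"; shift 2
    i=0
    while [ "$i" -lt "$max_retries" ]; do
        if "$@"; then
            return 0          # command succeeded: stop polling
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    return 1                  # gave up after max_retries attempts
}
```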

local_file_step4_download.sh

Download result ZIP and extract.

Usage:

./local_file_step4_download.sh <zip_url> [output_zip_filename] [extract_directory_name]

Output Structure:

extracted/
├── full.md              # 📄 Markdown document (main result)
├── images/              # 🖼️ Extracted images
├── content_list.json    # Structured content
└── layout.json          # Layout analysis data
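After extraction it is worth confirming the main result is actually there before doing anything else with it. A small check, assuming the output layout shown above (the helper name check_result is illustrative):

```shell
# Confirm the main Markdown result exists and is non-empty in the
# extraction directory passed as $1.
check_result() {
    if [ -s "$1/full.md" ]; then
        echo "OK: $1/full.md is present"
    else
        echo "Error: full.md missing or empty in $1" >&2
        return 1
    fi
}
```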

Detailed Documentation

📚 Complete Guide: See docs/Local_File_Parsing_Guide.md


🌐 Feature 2: Parse Online PDF Documents (URL Method)

For PDF files already available online (e.g., arXiv, websites). Requires only 2 steps, making it simpler and faster.

Quick Start

cd scripts/

# Step 1: Submit parsing task (provide URL directly)
./online_file_step1_submit_task.sh "https://arxiv.org/pdf/2410.17247.pdf"
# Output: TASK_ID=xxx

# Step 2: Poll results and auto-download/extract
./online_file_step2_poll_result.sh "$TASK_ID" extracted/

Script Descriptions

online_file_step1_submit_task.sh

Submit parsing task for online PDF.

Usage:

./online_file_step1_submit_task.sh <pdf_url> [language] [layout_model]

Parameters:

  • pdf_url: Complete URL of the online PDF (required)
  • language: ch (Chinese), en (English), auto (auto-detect), default ch
  • layout_model: doclayout_yolo (fast), layoutlmv3 (accurate), default doclayout_yolo

Output:

TASK_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

online_file_step2_poll_result.sh

Poll extraction results, automatically download and extract when complete.

Usage:

./online_file_step2_poll_result.sh <task_id> [output_directory] [max_retries] [retry_interval_seconds]

Output Structure:

extracted/
├── full.md              # 📄 Markdown document (main result)
├── images/              # 🖼️ Extracted images
├── content_list.json    # Structured content
└── layout.json          # Layout analysis data

Detailed Documentation

📚 Complete Guide: See docs/Online_URL_Parsing_Guide.md


📊 Comparison of Two Parsing Methods

| Feature         | Local PDF Parsing | Online PDF Parsing                            |
|-----------------|-------------------|-----------------------------------------------|
| Steps           | 4 steps           | 2 steps                                       |
| Upload required | ✅ Yes            | ❌ No                                         |
| Average time    | 30-60 seconds     | 10-20 seconds                                 |
| Use case        | Local files       | Files already online (arXiv, websites, etc.)  |
| File size limit | 200MB             | Limited by source server                      |
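When scripting around both flows, the choice between them can be automated with a tiny dispatcher (an illustrative sketch; the function name choose_flow is not from the scripts):

```shell
# Route an input to the right flow: URLs take the 2-step online flow,
# anything else is treated as a local file path (4-step flow).
choose_flow() {
    case "$1" in
        http://*|https://*) echo online ;;
        *)                  echo local ;;
    esac
}
```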

⚙️ Advanced Usage

Batch Process Local Files

for pdf in /path/to/pdfs/*.pdf; do
    echo "Processing: $pdf"
    
    # Step 1
    result=$(./local_file_step1_apply_upload_url.sh "$pdf" 2>&1)
    batch_id=$(echo "$result" | grep BATCH_ID | cut -d= -f2)
    upload_url=$(echo "$result" | grep UPLOAD_URL | cut -d= -f2-)  # -f2- keeps '=' chars inside the presigned URL
    
    # Step 2
    ./local_file_step2_upload_file.sh "$upload_url" "$pdf"
    
    # Step 3
    zip_url=$(./local_file_step3_poll_result.sh "$batch_id" | grep FULL_ZIP_URL | cut -d= -f2-)  # -f2- keeps '=' chars in the URL
    
    # Step 4
    filename=$(basename "$pdf" .pdf)
    ./local_file_step4_download.sh "$zip_url" "${filename}.zip" "${filename}_extracted"
done

Batch Process Online Files

for url in \
  "https://arxiv.org/pdf/2410.17247.pdf" \
  "https://arxiv.org/pdf/2409.12345.pdf"; do
    echo "Processing: $url"
    
    # Step 1
    result=$(./online_file_step1_submit_task.sh "$url" 2>&1)
    task_id=$(echo "$result" | grep TASK_ID | cut -d= -f2)
    
    # Step 2
    filename=$(basename "$url" .pdf)
    ./online_file_step2_poll_result.sh "$task_id" "${filename}_extracted"
done

⚠️ Notes

  1. Token Configuration: Scripts prioritize MINERU_TOKEN, fall back to MINERU_API_KEY if not found
  2. Token Security: Do not hard-code tokens in scripts; use environment variables
  3. URL Accessibility: For online parsing, ensure the provided URL is publicly accessible
  4. File Limits: Each file should stay under 200MB and at most 600 pages
  5. Network Stability: Ensure stable network when uploading large files
  6. Security: This skill includes input validation and sanitization to prevent JSON injection and directory traversal attacks
  7. Optional jq: Installing jq provides enhanced JSON parsing and additional security checks
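The directory-traversal check mentioned in note 6 can also be applied independently before extracting any downloaded ZIP. A minimal sketch (the helper name entries_are_safe is illustrative, not from the scripts):

```shell
# Reject archive entry names (one per line on stdin) that contain
# ".." path components or absolute paths ("zip slip" traversal).
entries_are_safe() {
    while IFS= read -r entry; do
        case "$entry" in
            /*|../*|*/../*|*/..|..) return 1 ;;
        esac
    done
    return 0
}
```

For example: `unzip -Z1 result.zip | entries_are_safe || echo "refusing to extract"` (`unzip -Z1` lists entry names one per line).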

📚 Reference Documentation

| Document                         | Description                                                   |
|----------------------------------|---------------------------------------------------------------|
| docs/Local_File_Parsing_Guide.md | Detailed curl commands and parameters for local PDF parsing   |
| docs/Online_URL_Parsing_Guide.md | Detailed curl commands and parameters for online PDF parsing  |


Skill Version: 1.0.0
Release Date: 2026-02-18
Community Skill - Not affiliated with MinerU official
