Install
openclaw skills install ade-mineru-api-skillsMinerU document extraction CLI that converts PDFs, images, and web pages into Markdown, HTML, LaTeX, or DOCX via the MinerU API. Supports single/batch extraction, web crawling, async tasks, and piped workflows.
openclaw skills install ade-mineru-api-skillscurl -fsSL https://cdn-mineru.openxlab.org.cn/open-api-cli/install.sh | sh
irm https://cdn-mineru.openxlab.org.cn/open-api-cli/install.ps1 | iex
mineru version
Before using, configure your API token (get one from https://mineru.net):
mineru auth # Interactive token setup
export MINERU_TOKEN="your-token" # Or set via environment variable
Token resolution order: --token flag > MINERU_TOKEN env > ~/.mineru/config.yaml.
The extract command accepts the following input types:
.pdf) — primary use case, supports scanned and digital PDFs.png, .jpg, .jpeg, .webp, .gif,.bmp) — use --ocr for best results on scanned content.docx) — Microsoft Word documentsThe crawl command accepts any HTTP/HTTPS URL and extracts web page content.
--no-table to disable.--no-formula to disable.ch (Chinese). Use --language en for English documents.--model vlm for complex layouts, --model pipeline for speed.mineru extract report.pdf # PDF → Markdown to stdout
mineru extract report.pdf -o ./out/ # Save to file
mineru extract report.pdf -f md,docx # Multiple formats
mineru crawl https://example.com/article # Web page → Markdown
mineru auth or set MINERU_TOKENmineru extract <file-or-url> for documentsmineru crawl <url> for web pages-o directoryConvert PDFs, images, and other documents to Markdown or other formats.
mineru extract report.pdf # Markdown to stdout
mineru extract report.pdf -f html # HTML to stdout
mineru extract report.pdf -o ./out/ # Save to directory
mineru extract report.pdf -o ./out/ -f md,docx # Multiple formats
mineru extract *.pdf -o ./results/ # Batch extract
mineru extract --list files.txt -o ./results/ # Batch from file list
mineru extract https://example.com/doc.pdf # Extract from URL
cat doc.pdf | mineru extract --stdin -o ./out/ # From stdin
| Flag | Short | Default | Description |
|---|---|---|---|
--output | -o | (stdout) | Output path (file or directory) |
--format | -f | md | Output formats: md, json, html, latex, docx (comma-separated) |
--model | (auto) | Model: vlm, pipeline, html | |
--ocr | false | Enable OCR for scanned documents | |
--no-formula | false | Disable formula recognition | |
--no-table | false | Disable table recognition | |
--language | ch | Document language | |
--pages | (all) | Page range, e.g. 1-10,15 | |
--timeout | 300/1800 | Timeout in seconds (single/batch) | |
--list | Read input list from file (one path per line) | ||
--stdin-list | false | Read input list from stdin | |
--stdin | false | Read file content from stdin | |
--stdin-name | stdin.pdf | Filename hint for stdin mode | |
--concurrency | 0 | Batch concurrency (0 = server default) |
Fetch web pages and convert to Markdown.
mineru crawl https://example.com/article # Markdown to stdout
mineru crawl https://example.com/article -f html # HTML to stdout
mineru crawl https://example.com/article -o ./out/ # Save to file
mineru crawl url1 url2 -o ./pages/ # Batch crawl
mineru crawl --list urls.txt -o ./pages/ # Batch from file list
| Flag | Short | Default | Description |
|---|---|---|---|
--output | -o | (stdout) | Output path |
--format | -f | md | Output formats: md, json, html (comma-separated) |
--timeout | 300/1800 | Timeout in seconds (single/batch) | |
--list | Read URL list from file (one per line) | ||
--stdin-list | false | Read URL list from stdin | |
--concurrency | 0 | Batch concurrency |
mineru auth # Interactive token setup
mineru auth --verify # Verify current token is valid
mineru auth --show # Show current token source and masked value
Query the status of a previously submitted extraction task.
mineru status <task-id> # Check status once
mineru status <task-id> --wait # Wait for completion
mineru status <task-id> --wait -o ./out/ # Wait and download results
mineru status <task-id> --wait --timeout 600 # Custom timeout
| Flag | Short | Default | Description |
|---|---|---|---|
--wait | false | Wait for task completion | |
--output | -o | Download results to directory when done | |
--timeout | 300 | Max wait time in seconds |
mineru version # Show version, commit, build date, Go version, OS/arch
These flags apply to all commands:
| Flag | Short | Description |
|---|---|---|
--token | API token (overrides env and config) | |
--base-url | API base URL (for private deployments) | |
--verbose | -v | Verbose mode, print HTTP details |
-o flag: result goes to stdout; status/progress messages go to stderr-o flag: result saved to file/directory; progress messages on stderr-o to specify output directorydocx): cannot output to stdout, must use -o.md filemineru extract report.pdf -o ./output/
# Output: ./output/report.md + ./output/images/
mineru extract scanned.pdf --ocr --pages "1-5" -o ./out/
mineru extract paper.pdf -f md,html,docx -o ./out/
# Output: ./out/paper.md, ./out/paper.html, ./out/paper.docx
# files.txt contains one path per line
mineru extract --list files.txt -o ./results/
mineru extract paper.pdf -f latex -o ./out/
# Output: ./out/paper.tex
mineru extract english-report.pdf --language en -o ./out/
mineru extract resume.docx -o ./out/
# Output: ./out/resume.md
# Download and extract in one pipeline
curl -sL https://example.com/doc.pdf | mineru extract --stdin --stdin-name doc.pdf
mineru crawl https://example.com/docs/guide -o ./docs/
echo -e "https://example.com/page1\nhttps://example.com/page2" | mineru crawl --stdin-list -o ./pages/
# Extract and pipe to another tool
mineru extract report.pdf | wc -w # Word count
mineru extract report.pdf | grep "keyword" # Search content
mineru extract report.pdf -f json | jq '.[]' # Parse structured output
When using this skill on behalf of the user:
mineru extract "report 01.pdf", NOT mineru extract report 01.pdf.mineru extract.mineru extract file.docx.-o), only one text format can be output at a time. If the user wants multiple formats, suggest adding -o.When the user does NOT specify an output path (-o), the agent MUST generate a default output directory to prevent file overwrites. Use:
~/MinerU-Skill/<name>_<hash>/
Naming rules:
<name>: derived from the source, then sanitized for safe directory names.
https://arxiv.org/pdf/2509.22186 → 2509.22186)report.pdf → report)space, (, ), [, ], &, ', ", !, #, $, `) with _. Collapse consecutive _ into one. Keep alphanumeric, -, _, ., and CJK characters.<hash>: first 6 characters of the MD5 hash of the full original source path or URL (before sanitization). This ensures:
Examples:
| Source | <name> | Output directory |
|---|---|---|
https://arxiv.org/pdf/2509.22186 | 2509.22186 | ~/MinerU-Skill/2509.22186_a3f2b1/ |
https://arxiv.org/pdf/2509.200 | 2509.200 | ~/MinerU-Skill/2509.200_c7e9d4/ |
./report.pdf | report | ~/MinerU-Skill/report_8b1a3f/ |
./report 01.pdf | report_01 | ~/MinerU-Skill/report_01_f4a1c2/ |
./My Doc (final).pdf | My_Doc_final | ~/MinerU-Skill/My_Doc_final_b9e3d7/ |
./个人简介.docx | 个人简介 | ~/MinerU-Skill/个人简介_d2a8f5/ |
How the agent should generate the hash:
echo -n "https://arxiv.org/pdf/2509.22186" | md5sum | cut -c1-6
Or on macOS:
echo -n "https://arxiv.org/pdf/2509.22186" | md5 | cut -c1-6
When the user specifies -o: use the user's path as-is, do NOT override with the default directory.
| Code | Meaning | Recovery |
|---|---|---|
| 0 | Success | — |
| 1 | General API or unknown error | Check network connectivity; retry; use --verbose for details |
| 2 | Invalid parameters / usage error | Check command syntax and flag values |
| 3 | Authentication error | Run mineru auth to reconfigure token, or check token expiration |
| 4 | File too large or page limit exceeded | Split the file or use --pages to extract a subset |
| 5 | Extraction failed | The document may be corrupted or unsupported; try a different --model |
| 6 | Timeout | Increase with --timeout; large files may need 600+ seconds |
| 7 | Quota exceeded | Check API quota at https://mineru.net; wait or upgrade plan |
mineru auth or set MINERU_TOKEN env variable--timeout 600 (seconds)-o flag; docx cannot stream to stdout--base-url https://your-server.com/api--model vlm for complex layouts, or --ocr for scanned documents--no-formula is NOT set; try --model vlm for better formula support~/.mineru/config.yaml after mineru authgithub.com/OpenDataLab/mineru-open-sdk)