Install
openclaw skills install paper-parser-skill
CLI tool for automated academic paper processing: search, download, and parse academic papers from arXiv into AI-friendly Markdown using the MinerU API.
[!IMPORTANT] External Data Processing: This skill transmits PDF files and paper metadata to MinerU (opendatalab) for layout analysis and Markdown conversion. Please ensure you trust the service and understand their data handling policies before providing an API token in the configuration file.
Security & Provenance:
Before Installing:
MINERU_API_TOKEN grants MinerU access to receive and process uploaded PDFs. Use a dedicated, revocable token with minimal scope.
[!WARNING] Installing from PyPI executes third-party code. Use a virtual environment if you want to limit blast radius.
pip install paper-parser-skill==v0.1.3
Default path: ~/.paper-parser/config.yaml
[!IMPORTANT]
`MINERU_API_TOKEN` is required for parsing functionality. Get a token at mineru.net.
PAPER_WORKSPACE: "~/paper-parser-workspace"
MINERU_API_TOKEN: "your_token_here" # Required for parsing
MINERU_API_BASE_URL: "https://mineru.net/api/v4"
MINERU_API_TIMEOUT: 600
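A quick sanity check of these settings before the first run can catch a forgotten placeholder token. The sketch below is not part of paper-parser-skill: `check_config`, `resolve_workspace`, and `PLACEHOLDER_TOKEN` are hypothetical names, and the HTTPS and positive-timeout checks are precautions assumed here rather than rules the tool is known to enforce.

```python
from pathlib import Path

# The sample value from the config above; a real token must replace it.
PLACEHOLDER_TOKEN = "your_token_here"


def check_config(cfg: dict) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []

    token = str(cfg.get("MINERU_API_TOKEN") or "").strip()
    if not token or token == PLACEHOLDER_TOKEN:
        problems.append("MINERU_API_TOKEN is missing or still the placeholder")

    # Assumed precaution: the token travels to this URL, so plain http is flagged.
    base_url = str(cfg.get("MINERU_API_BASE_URL") or "")
    if base_url and not base_url.startswith("https://"):
        problems.append("MINERU_API_BASE_URL should use https")

    # Assumed precaution: bool is rejected explicitly because it is a subclass of int.
    timeout = cfg.get("MINERU_API_TIMEOUT")
    if timeout is not None:
        is_number = isinstance(timeout, (int, float)) and not isinstance(timeout, bool)
        if not is_number or timeout <= 0:
            problems.append("MINERU_API_TIMEOUT should be a positive number")

    return problems


def resolve_workspace(cfg: dict) -> Path:
    """Expand a leading ~ in PAPER_WORKSPACE to the user's home directory."""
    return Path(str(cfg["PAPER_WORKSPACE"])).expanduser()
```

The `cfg` dictionary would typically come from `yaml.safe_load` (PyYAML) applied to the contents of `~/.paper-parser/config.yaml`.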
Alias: pp
| Command | Argument | Description |
|---|---|---|
| `pp search` | `<query>` | Search arXiv papers |
| `pp download` | `<id/query>` | Download PDF and metadata |
| `pp path` | `<id/query>` | Get local workspace path |
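When an agent drives `pp search`, it usually needs the arXiv IDs rather than the full listing. The sketch below pulls them out of captured output. It assumes each result line contains `Id: <arxiv_id>` (for example `1. Id: 2312.10997 Title: ...`); the real output format may differ, and `extract_ids` is a hypothetical helper, not part of the CLI.

```python
import re

# Modern arXiv identifiers look like YYMM.NNNNN, optionally followed by a
# version suffix such as v2. The "Id:" prefix is an assumed output convention.
ARXIV_ID = re.compile(r"\bId:\s*(\d{4}\.\d{4,5}(?:v\d+)?)")


def extract_ids(search_output: str) -> list[str]:
    """Return arXiv IDs in order of appearance, without duplicates."""
    ids = []
    for match in ARXIV_ID.finditer(search_output):
        if match.group(1) not in ids:
            ids.append(match.group(1))
    return ids
```

The input text could be captured with `subprocess.run(["pp", "search", query], capture_output=True, text=True).stdout`.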
[!TIP] Recommended for agent/automation use:
`pp submit` + `pp check` (non-blocking async workflow). `pp parse` and `pp all` block the process until cloud processing completes, which can take several minutes and may time out. Prefer the async approach when calling from an agent or pipeline.
| Command | Argument | Options | Description |
|---|---|---|---|
| `pp submit` | `<id/path>` | `--force` | [Async ✅] Submit PDF for parsing and return immediately. Idempotent: safe to call repeatedly. If already submitted and pending, checks status instead of re-uploading. |
| `pp check` | `<id/path>` | (none) | [Async ✅] Check parse status once. Downloads results automatically when done. |
| `pp parse` | `<id/path>` | `--force` | [Blocking ⚠️] Parse PDF synchronously. Blocks until complete. May time out on slow jobs. |
| `pp all` | `<id/query>` | `--force` | [Blocking ⚠️] Full workflow (Search → Download → Parse). Blocks until complete. May time out on slow jobs. |
# Step 1: Search for papers by keyword → pick an arXiv ID from results
pp search "retrieval augmented generation"
# → 1. Id: 2312.10997 Title: Retrieval-Augmented Generation for ...
# → 2. Id: 2401.00123 Title: ...
# Step 2: Submit for parsing and return immediately
# (PDF is downloaded automatically if not already cached)
pp submit 2312.10997
# → ⬇️ Downloading PDF...
# → ✅ Submitted! batch_id: xxxxxxxx
# (minutes later, the agent or user calls again)
# Step 3: Check status — downloads & extracts results automatically when done
pp check 2312.10997
# → "⏳ Still processing (state: running, 45s since submission)."
# → or "✅ Parsing complete! 📂 Results in: ~/paper-parser-workspace/2312.10997"
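The submit-then-check steps can also be scripted from a pipeline. The following is a minimal sketch, not the tool's own API: `run_pp` and `wait_for_parse` are hypothetical helpers, completion is detected by looking for the text "Parsing complete" from the sample output (an assumption, since the exit codes and failure messages of `pp check` are not documented here), and the 30-second interval and 40-check cap are arbitrary.

```python
import subprocess
import time


def run_pp(*args: str) -> str:
    """Run a pp subcommand and return its stdout and stderr as one string."""
    result = subprocess.run(["pp", *args], capture_output=True, text=True)
    return result.stdout + result.stderr


def wait_for_parse(paper_id, run=run_pp, sleep=time.sleep,
                   interval=30.0, max_checks=40):
    """Submit once, then poll `pp check` until the output reports completion.

    Returns True when the completion marker is seen, or False when
    max_checks polls pass without it.
    """
    run("submit", paper_id)  # documented as idempotent, so a retry is safe
    for attempt in range(max_checks):
        if attempt:
            sleep(interval)  # wait between polls, but not before the first one
        output = run("check", paper_id)
        if "Parsing complete" in output:  # assumed marker from the sample output
            return True
    return False
```

Passing `run` and `sleep` as parameters keeps the loop testable without the CLI installed. A call such as `wait_for_parse("2312.10997")` uses the real `pp` command.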
PAPER_WORKSPACE/
└── <arxiv_id>/
    ├── paper.pdf
    ├── title.md
    ├── summary.md
    ├── .parse_task.json   ← async task state (batch_id, status, timestamps)
    └── markdowns/
        ├── 01_Introduction.md
        └── images/
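Given this layout, downstream code can gather the parsed sections for one paper. The sketch below assumes section files sit directly under `markdowns/` and carry zero-padded numeric prefixes (as in `01_Introduction.md`), so that sorting by filename gives reading order. `section_files` and `merge_sections` are hypothetical helpers, not functions provided by the package.

```python
from pathlib import Path


def section_files(workspace, arxiv_id):
    """Return the parsed section files for one paper, sorted by filename."""
    markdown_dir = Path(workspace).expanduser() / arxiv_id / "markdowns"
    # glob("*.md") skips the images/ subdirectory and yields nothing
    # when the paper has not been parsed yet.
    return sorted(markdown_dir.glob("*.md"))


def merge_sections(workspace, arxiv_id):
    """Concatenate all section files into one Markdown string."""
    parts = [
        path.read_text(encoding="utf-8").strip()
        for path in section_files(workspace, arxiv_id)
    ]
    return "\n\n".join(parts)
```

For example, `merge_sections("~/paper-parser-workspace", "2312.10997")` would return the whole paper as a single string, or an empty string if no sections exist.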
requests, click, PyYAML, arxiv, rapidfuzz
A MinerU API token is required in the config.yaml file. Get one at mineru.net.