{"skill":{"slug":"pdf-extractor-skill","displayName":"Pdf Extractor Skill","summary":"Extract text and LaTeX formulas from academic PDFs in English and Chinese, outputting structured Markdown with math, tables, and images preserved.","description":"# PDF Extractor Skill\n\nExtract text and mathematical formulas from academic PDF papers. Supports both English and Chinese content.\n\n## When to Use This Skill\n\nUse this skill when:\n- User needs to extract text and LaTeX formulas from PDF papers\n- User mentions \"PDF转文本\" (PDF to text), \"PDF提取公式\" (extract formulas from PDF), or \"论文OCR\" (paper OCR)\n- User wants to convert academic papers to Markdown format\n\n## Tool Selection\n\n| Tool | Best For | Languages | Math Quality |\n|------|----------|-----------|--------------|\n| **Marker** (recommended) | Chinese and English papers, complex formulas | Chinese + English | Excellent |\n| **Nougat** | English-only papers, arXiv | English only | Excellent |\n\n**Marker is recommended**: it handles mixed Chinese and English text and gives better formula recognition.\n\n---\n\n## Environment Setup\n\n**Conda Environment**: `pdf-extractor`\n**Python Path**: `D:\\anaconda3\\envs\\pdf-extractor\\python.exe`\n\n### Key Dependencies\n- PyTorch 2.10.0+cu128 (CUDA 12.8)\n- marker-pdf (Surya OCR + Texify)\n- nougat-ocr 0.1.17\n- transformers\n\n## Important: Keep This Skill Self-Contained (No Extra Installs)\n\nThis skill is expected to run using ONLY the existing `pdf-extractor` conda environment and the scripts in `scripts/`.\n\nRules:\n- Do NOT run `pip install ...` or `conda install ...`, and do NOT download additional libraries during extraction.\n- If a dependency is missing (e.g., Nougat crashes due to missing `torchvision`), do NOT try to fix it by installing packages. Switch tools (prefer Marker) or report the environment issue.\n- Slow runtime is normal for Marker (especially with `--ark-code-latest`). Prefer splitting the PDF rather than changing tools or adding dependencies.\n\nRecommended approach for long PDFs:\n- Use `--page-range` (0-based) to extract per page or small page batches.\n- Merge the resulting markdown files afterward (simple concatenation is fine). 
Keep the combined file in the same folder as the per-page outputs so image links remain valid.\n\nExample (per-page extraction with LLM mode):\n```bash\nD:\\anaconda3\\envs\\pdf-extractor\\python.exe C:\\Users\\cr\\.config\\opencode\\skills\\pdf-extractor\\scripts\\pdf2md_marker.py \"paper.pdf\" --ark-code-latest --page-range \"0\" -o \"out/page_01.md\"\n```\n\n---\n\n## Tool 1: Marker (Recommended - Chinese and English Support)\n\n### Command Line\n\n```bash\n# Convert a Chinese paper (Chinese and English are supported by default)\nD:\\anaconda3\\envs\\pdf-extractor\\python.exe C:\\Users\\cr\\.config\\opencode\\skills\\pdf-extractor\\scripts\\pdf2md_marker.py \"论文.pdf\"\n\n# Specify the output path\nD:\\anaconda3\\envs\\pdf-extractor\\python.exe C:\\Users\\cr\\.config\\opencode\\skills\\pdf-extractor\\scripts\\pdf2md_marker.py \"paper.pdf\" -o \"output.md\"\n\n# Force OCR (for scanned PDFs)\nD:\\anaconda3\\envs\\pdf-extractor\\python.exe C:\\Users\\cr\\.config\\opencode\\skills\\pdf-extractor\\scripts\\pdf2md_marker.py \"scanned.pdf\" --force-ocr\n\n# Use the Volcano Ark (火山方舟) Coding Plan (OpenAI-compatible) to improve conversion quality (more stable tables, formulas, and cross-page structure)\n# Note: ark-code-latest is used by default; the backend automatically routes to a suitable model\nD:\\anaconda3\\envs\\pdf-extractor\\python.exe C:\\Users\\cr\\.config\\opencode\\skills\\pdf-extractor\\scripts\\pdf2md_marker.py \"paper.pdf\" --ark-code-latest\n\n# Run only the first page as a quick check (0-based page index)\nD:\\anaconda3\\envs\\pdf-extractor\\python.exe C:\\Users\\cr\\.config\\opencode\\skills\\pdf-extractor\\scripts\\pdf2md_marker.py \"paper.pdf\" --ark-code-latest --page-range \"0\" -o \"out_first_page.md\"\n\n# To customize (not recommended), you can also set --openai-base-url/--openai-api-key/--openai-model manually\n\n# Specify languages\nD:\\anaconda3\\envs\\pdf-extractor\\python.exe C:\\Users\\cr\\.config\\opencode\\skills\\pdf-extractor\\scripts\\pdf2md_marker.py \"paper.pdf\" --languages Chinese English Japanese\n```\n\n### Python API\n\n```python\nimport sys\nsys.path.insert(0, r'C:\\Users\\cr\\.config\\opencode\\skills\\pdf-extractor\\scripts')\nfrom pdf2md_marker import convert_pdf, convert_pdf_cli\n\n# Simple usage\noutput_file = convert_pdf_cli('论文.pdf', 'output.md')\n\n# Full 
API\nmarkdown_text, metadata = convert_pdf(\n    'paper.pdf',\n    output_dir='./output',\n    force_ocr=False,\n    batch_multiplier=2,\n    languages=['Chinese', 'English']\n)\nprint(markdown_text)\n```\n\n### Marker Options\n\n| Option | Description |\n|--------|-------------|\n| `-o, --output` | Output file (.md) or directory |\n| `--force-ocr` | Force OCR even for text PDFs |\n| `--batch-multiplier` | Batch size multiplier (default: 2) |\n| `--languages` | Languages in document (default: Chinese English) |\n| `--page-range` | Pages to process, 0-based (e.g., \"0\") |\n| `--ark-code-latest` | Use the Volcano Ark Coding Plan (OpenAI-compatible) LLM to improve tables, formulas, and cross-page structure |\n| `--openai-base-url`, `--openai-api-key`, `--openai-model` | Manual LLM endpoint settings (not recommended) |\n\n---\n\n## Tool 2: Nougat (English-Only Papers)\n\n### Command Line\n\n```bash\n# Convert entire PDF\nD:\\anaconda3\\envs\\pdf-extractor\\python.exe C:\\Users\\cr\\.config\\opencode\\skills\\pdf-extractor\\scripts\\pdf2latex.py \"paper.pdf\"\n\n# Convert specific pages\nD:\\anaconda3\\envs\\pdf-extractor\\python.exe C:\\Users\\cr\\.config\\opencode\\skills\\pdf-extractor\\scripts\\pdf2latex.py \"paper.pdf\" -p 0-5\n\n# Custom output\nD:\\anaconda3\\envs\\pdf-extractor\\python.exe C:\\Users\\cr\\.config\\opencode\\skills\\pdf-extractor\\scripts\\pdf2latex.py \"paper.pdf\" -o output.mmd\n\n# Save each page separately\nD:\\anaconda3\\envs\\pdf-extractor\\python.exe C:\\Users\\cr\\.config\\opencode\\skills\\pdf-extractor\\scripts\\pdf2latex.py \"paper.pdf\" --per-page\n```\n\n### Python API\n\n```python\nimport sys\nsys.path.insert(0, r'C:\\Users\\cr\\.config\\opencode\\skills\\pdf-extractor\\scripts')\nfrom pdf2latex import load_model, process_pdf, save_results\n\n# Load model (uses GPU if available)\nmodel, device = load_model()\n\n# Process PDF\nresults = process_pdf('paper.pdf', model, device)\n\n# Save as single markdown file\nsave_results(results, 'output.mmd')\n\n# Or save per page\nsave_results(results, 'output_pages/', format='pages')\n```\n\n### Nougat Options\n\n| Option | Description |\n|--------|-------------|\n| `-o, --output` | Output file or directory |\n| `-p, --pages` | Page range (e.g., \"0-5\" or \"1,3,5\") |\n| `-m, --model` | Model tag (default: 
0.1.0-base) |\n| `--dpi` | Render DPI (default: 300) |\n| `--cpu` | Force CPU mode |\n| `--per-page` | Save each page separately |\n\n---\n\n## Output Format\n\nBoth tools output Markdown with LaTeX math:\n- Text is extracted as regular markdown\n- Mathematical formulas are in LaTeX format:\n  - Inline: `$formula$`\n  - Display: `$$formula$$`\n- Tables, figures, and references are preserved\n- Marker also extracts images to a separate folder\n\n---\n\n## Comparison\n\n| Feature | Marker | Nougat |\n|---------|--------|--------|\n| Chinese Support | ✓ Excellent | ✗ Poor |\n| English Support | ✓ Excellent | ✓ Excellent |\n| Math Formulas | ✓ (Texify) | ✓ (Native) |\n| Table Extraction | ✓ | ✓ |\n| Image Extraction | ✓ | ✗ |\n| Speed (RTX 4060) | ~2 min/page | ~10-15 sec/page |\n| OCR Quality | Excellent | Good |\n\n---\n\n## Troubleshooting\n\n### Import Errors\nMake sure you're using the correct Python:\n```bash\nD:\\anaconda3\\envs\\pdf-extractor\\python.exe your_script.py\n```\n\n### CUDA Out of Memory\nTry CPU mode (Nougat) or reduce batch size (Marker):\n```bash\n# Nougat: use CPU\nD:\\anaconda3\\envs\\pdf-extractor\\python.exe pdf2latex.py paper.pdf --cpu\n\n# Marker: reduce batch multiplier\nD:\\anaconda3\\envs\\pdf-extractor\\python.exe pdf2md_marker.py paper.pdf --batch-multiplier 1\n```\n\n### Chinese Characters Not Recognized\nUse Marker instead of Nougat for Chinese documents.\n\n### Slow Processing\n- Marker is slower but more accurate (uses multiple ML models)\n- For faster processing on English-only papers, use Nougat\n- Ensure GPU is being used (check CUDA availability)\n\n---\n\n## Model Information\n\n**Marker Models** (downloaded automatically):\n- Surya OCR: Text detection and recognition\n- Texify: Math formula recognition\n- Layout analysis models\n\n**Nougat Base Model** (1.31 GB):\n- Location: `C:\\Users\\cr\\.cache\\torch\\hub\\nougat-0.1.0-base`\n- Best for: Standard academic papers, arXiv papers\n\n---\n\n## Example 
Workflow\n\n```python\nimport sys\nsys.path.insert(0, r'C:\\Users\\cr\\.config\\opencode\\skills\\pdf-extractor\\scripts')\n\ndef extract_paper(pdf_path, is_chinese=True):\n    \"\"\"\n    Extract text and formulas from academic paper.\n    \n    Args:\n        pdf_path: Path to PDF file\n        is_chinese: True for Chinese papers, False for English only\n    \n    Returns:\n        Extracted markdown text\n    \"\"\"\n    if is_chinese:\n        from pdf2md_marker import convert_pdf\n        text, _ = convert_pdf(pdf_path, languages=['Chinese', 'English'])\n    else:\n        from pdf2latex import load_model, process_pdf\n        model, device = load_model()\n        results = process_pdf(pdf_path, model, device)\n        text = '\\n\\n'.join([t for _, t in results])\n    \n    return text\n\n# Usage\ntext = extract_paper('中文论文.pdf', is_chinese=True)\nprint(text)\n```\n","tags":{"latest":"1.0.0"},"stats":{"comments":0,"downloads":283,"installsAllTime":2,"installsCurrent":2,"stars":0,"versions":1},"createdAt":1771913646779,"updatedAt":1779077244995},"latestVersion":{"version":"1.0.0","createdAt":1771913646779,"changelog":"PDF Extractor Skill 1.0.0 – Initial Release\n\n- Provides extraction of text and LaTeX math formulas from academic PDF papers in both English and Chinese.\n- Supports two tools: Marker (recommended for mixed/Chinese documents) and Nougat (for English/arXiv papers).\n- Outputs Markdown with preserved formulas, tables, images (Marker), and references.\n- Fully self-contained: runs in a pre-configured conda environment—no runtime installations allowed.\n- Detailed command-line and Python API examples, usage tips for long PDFs, and troubleshooting guidance included.","license":null},"metadata":null,"owner":{"handle":"a851445115","userId":"s178pfv7rsgfxnsafyhxym896x885m60","displayName":"Rui 
Chen","image":"https://avatars.githubusercontent.com/u/58624449?v=4"},"moderation":{"isSuspicious":false,"isMalwareBlocked":false,"verdict":"clean","reasonCodes":["review.llm_review"],"summary":"Review: review.llm_review","engineVersion":"v2.4.24","updatedAt":1779955504287}}