Install
openclaw skills install toji-doc-extract

Comprehensively extract text, tables, images, and metadata from PDF and Word (.docx) documents. Use when a user shares a document and wants it parsed, analyzed, summarized, or processed. Handles native PDFs, scanned PDFs (via OCR), and Word documents.

Extracts everything useful from PDF and Word documents so you can process, summarize, or act on the content.
- Python venv: ~/.openclaw/venvs/doctools
- Python packages: pymupdf, pdfplumber, python-docx (pre-installed)
- System tools: tesseract (OCR), pandoc (optional), pdftotext (from poppler), in /opt/homebrew/bin/

scripts/extract.py is the main extractor. Always use the venv:
~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py <file> [options]

| Flag | Description |
|---|---|
| `--format markdown` | Human-readable output (default) |
| `--format json` | Structured JSON for programmatic use |
| `--output-dir <dir>` | Where to save extracted images (default: `<filename>_extracted/` next to the file) |
| `--ocr` | Run Tesseract OCR on pages with no extractable text (scanned docs) |

~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py ~/Desktop/report.pdf
~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py ~/Desktop/contract.docx
~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py ~/Desktop/scan.pdf --ocr
~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py ~/Desktop/data.pdf --format json
~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py ~/Desktop/report.pdf --output-dir ~/Desktop/report_images
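
If you call the extractor from another Python script rather than a shell, the commands above could be assembled as in the following sketch. The interpreter and script paths are copied from the examples. The helper names `build_command` and `run_extract` are hypothetical, and the sketch assumes the extractor writes its result to stdout, which this page does not confirm.

```python
import subprocess
from pathlib import Path

# Paths copied from the example commands; adjust them for your machine.
VENV_PYTHON = Path("~/.openclaw/venvs/doctools/bin/python3").expanduser()
EXTRACT_SCRIPT = Path(
    "/Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py"
)


def build_command(file, fmt="markdown", output_dir=None, ocr=False):
    """Assemble the extract.py argument list from the documented flags."""
    if fmt not in ("markdown", "json"):
        raise ValueError(f"unsupported format: {fmt!r}")
    cmd = [
        str(VENV_PYTHON),
        str(EXTRACT_SCRIPT),
        str(Path(file).expanduser()),
        "--format",
        fmt,
    ]
    if output_dir is not None:
        cmd += ["--output-dir", str(Path(output_dir).expanduser())]
    if ocr:
        cmd.append("--ocr")
    return cmd


def run_extract(file, **options):
    """Run the extractor and return its stdout (assumption: output goes to stdout)."""
    result = subprocess.run(
        build_command(file, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.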
Markdown mode gives you:
JSON mode gives you:

- type: pdf or docx
- file: source path
- metadata: document properties
- text: full concatenated text
- pages_text: per-page text (PDF only)
- tables: array of {page, index, rows, cols, data[][]}
- images: array of {page, index, path, format, width, height}
- errors: any non-fatal issues encountered

OCR text is included with --ocr or when a page has no text.

Always use claude-opus-4-6 (opus) when analyzing extracted content: summarizing, answering questions, processing tables, etc.
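
As a rough illustration of consuming the JSON fields listed above, the sketch below reduces a parsed result to a short summary. The function name `summarize_extraction` is hypothetical, and the sample dictionary is invented to match the documented field names; it is not real extractor output. The sketch assumes `rows` and `cols` are integer counts and `pages_text` is a list of strings.

```python
def summarize_extraction(doc):
    """Return a short summary of a parsed --format json result."""
    tables = doc.get("tables", [])
    images = doc.get("images", [])
    return {
        "type": doc.get("type"),
        # pages_text is documented as PDF only, so default to an empty list.
        "pages": len(doc.get("pages_text", [])),
        "characters": len(doc.get("text", "")),
        "tables": [(t["page"], t["rows"], t["cols"]) for t in tables],
        "image_paths": [img["path"] for img in images],
        "errors": doc.get("errors", []),
    }


# Hypothetical sample shaped like the documented fields.
sample = {
    "type": "pdf",
    "file": "report.pdf",
    "metadata": {"title": "Example"},
    "text": "Hello World",
    "pages_text": ["Hello", "World"],
    "tables": [
        {"page": 2, "index": 0, "rows": 2, "cols": 2,
         "data": [["a", "b"], ["1", "2"]]},
    ],
    "images": [
        {"page": 1, "index": 0, "path": "report_extracted/img0.png",
         "format": "png", "width": 100, "height": 50},
    ],
    "errors": [],
}

summary = summarize_extraction(sample)
print(summary)
```

In practice the dictionary would come from `json.loads` applied to the extractor's JSON output.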
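
- When the output feeds another program, use --format json and parse it programmatically.
- For encrypted PDFs, decrypt first: qpdf --decrypt input.pdf output.pdf
- For .doc (old Word format), convert to .docx first. pandoc cannot read legacy .doc files, so use a converter that can, such as LibreOffice: soffice --headless --convert-to docx input.doc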
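
The note about legacy .doc files suggests a small pre-flight check before calling the extractor. The sketch below is one way to do that using only the file extension. `check_input` is a hypothetical helper, not part of the skill.

```python
from pathlib import Path

# Extensions the skill is documented to handle, mapped to the JSON "type" value.
SUPPORTED = {".pdf": "pdf", ".docx": "docx"}


def check_input(path):
    """Return 'pdf' or 'docx', or raise if the file cannot be extracted as-is."""
    suffix = Path(path).suffix.lower()
    if suffix == ".doc":
        raise ValueError(f"{path}: legacy .doc, convert to .docx before extracting")
    if suffix not in SUPPORTED:
        raise ValueError(f"{path}: unsupported file type {suffix!r}")
    return SUPPORTED[suffix]
```

This only inspects the file name. It does not detect encrypted PDFs, which still need decrypting first.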