Toji Doc Extractor

Comprehensively extract text, tables, images, and metadata from PDF and Word (.docx) documents. Use when a user shares a document and wants it parsed, analyzed, summarized, or processed. Handles native PDFs, scanned PDFs (via OCR), and Word documents.

Install

openclaw skills install toji-doc-extract

doc-extract

Extracts everything useful from PDF and Word documents so you can process, summarize, or act on the content.

When to Use

  • User shares or references a PDF or DOCX file
  • Need to read, summarize, or analyze a document
  • Extracting tables for data processing
  • Pulling images out of a document
  • Checking document metadata (author, date, title)
  • OCR on scanned/image-based PDFs

Requirements

  • Python venv: ~/.openclaw/venvs/doctools
  • Libraries: pymupdf, pdfplumber, python-docx (pre-installed)
  • CLI tools: tesseract (OCR), pandoc (optional), pdftotext (from poppler)
  • The CLI tools are installed in: /opt/homebrew/bin/

Script

scripts/extract.py — the main extractor. Always use the venv:

~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py <file> [options]

Options

Flag                 Description
--format markdown    Human-readable output (default)
--format json        Structured JSON for programmatic use
--output-dir <dir>   Where to save extracted images (default: <filename>_extracted/ next to file)
--ocr                Run Tesseract OCR on pages with no extractable text (scanned docs)
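
When the extractor is called from another Python program, the flags above can be assembled into an argument list. This is a minimal sketch, not part of the skill: build_extract_command is a hypothetical helper, and the interpreter and script paths are simply the ones quoted in this document.

```python
from pathlib import Path
from typing import List, Optional

# Paths quoted in this document; adjust them for your own installation.
VENV_PYTHON = Path.home() / ".openclaw/venvs/doctools/bin/python3"
EXTRACT_SCRIPT = Path(
    "/Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py"
)


def build_extract_command(
    file: str,
    fmt: str = "markdown",
    output_dir: Optional[str] = None,
    ocr: bool = False,
) -> List[str]:
    """Hypothetical helper: assemble an argv list from the documented flags."""
    if fmt not in ("markdown", "json"):
        raise ValueError(f"unsupported format: {fmt!r}")
    cmd = [str(VENV_PYTHON), str(EXTRACT_SCRIPT), file, "--format", fmt]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if ocr:
        cmd.append("--ocr")
    return cmd
```

The list can be passed to subprocess.run(cmd, capture_output=True, text=True, check=True). Because no shell is involved, a leading ~ in the file path is not expanded; call os.path.expanduser on it first. That the script prints its result to standard output is an assumption here, not something this document states.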

Common Workflows

Extract and read a PDF

~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py ~/Desktop/report.pdf

Extract a Word doc

~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py ~/Desktop/contract.docx

Scanned PDF (OCR mode)

~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py ~/Desktop/scan.pdf --ocr

Get structured JSON output

~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py ~/Desktop/data.pdf --format json

Save images to a specific folder

~/.openclaw/venvs/doctools/bin/python3 /Users/kong/.openclaw/workspace/skills/doc-extract/scripts/extract.py ~/Desktop/report.pdf --output-dir ~/Desktop/report_images

Output Structure

Markdown mode gives you:

  • Document metadata (title, author, date)
  • Summary (page count, table count, image count)
  • Full extracted text
  • Tables rendered as Markdown tables
  • Image paths for extracted images
  • Any warnings/errors

JSON mode gives you:

  • type: pdf or docx
  • file: source path
  • metadata: document properties
  • text: full concatenated text
  • pages_text: per-page text (PDF only)
  • tables: array of {page, index, rows, cols, data[][]}
  • images: array of {page, index, path, format, width, height}
  • errors: any non-fatal issues encountered
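
To show how the JSON fields fit together, here is a small sketch that condenses a parsed result into a few counts. It assumes pages_text is a list of strings and that every table and image entry carries the keys listed above; summarize_result is a hypothetical helper, and the sample is made up to match that shape rather than captured from the extractor. Whether page numbers start at 0 or 1 is not stated in this document, so the code treats them as opaque values.

```python
import json
from collections import Counter
from typing import Any, Dict


def summarize_result(result: Dict[str, Any]) -> Dict[str, Any]:
    """Condense a parsed JSON result into a few counts and paths."""
    tables = result.get("tables", [])
    images = result.get("images", [])
    return {
        "type": result.get("type"),
        # pages_text is PDF only, so this is 0 for DOCX results.
        "pages": len(result.get("pages_text", [])),
        "tables_per_page": dict(Counter(t["page"] for t in tables)),
        "image_paths": [img["path"] for img in images],
        "errors": result.get("errors", []),
    }


# Made-up sample shaped like the fields listed above; not real extractor output.
sample = json.loads("""
{
  "type": "pdf",
  "file": "report.pdf",
  "metadata": {"title": "Example"},
  "text": "page one page two",
  "pages_text": ["page one", "page two"],
  "tables": [
    {"page": 2, "index": 0, "rows": 2, "cols": 2, "data": [["a", "b"], ["1", "2"]]}
  ],
  "images": [
    {"page": 1, "index": 0, "path": "report_extracted/img0.png",
     "format": "png", "width": 10, "height": 10}
  ],
  "errors": []
}
""")

summary = summarize_result(sample)
print(summary["tables_per_page"])
```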

Extraction Strategy

  1. PDF native text → pymupdf (fast, accurate for digital PDFs)
  2. PDF tables → pdfplumber (table detection based on ruling lines and text alignment)
  3. PDF images → pymupdf image extraction
  4. PDF scanned pages → tesseract OCR (only triggered with --ocr, and only on pages that have no extractable text)
  5. DOCX text → python-docx (preserves paragraph styles/headings)
  6. DOCX tables → python-docx table parser
  7. DOCX images → extracted from document relationships
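
The routing in the list above can be sketched as a lookup by file extension plus a per-page OCR check. This is an illustration only and does not reproduce extract.py: STRATEGY, plan_for, and pages_needing_ocr are hypothetical names, and the OCR rule assumes that OCR runs only when --ocr is given and only on pages whose native text is empty.

```python
from pathlib import Path
from typing import Dict, List

# Backend per content kind, as listed in the strategy above.
STRATEGY: Dict[str, Dict[str, str]] = {
    ".pdf": {"text": "pymupdf", "tables": "pdfplumber", "images": "pymupdf"},
    ".docx": {
        "text": "python-docx",
        "tables": "python-docx",
        "images": "document relationships",
    },
}


def plan_for(path: str) -> Dict[str, str]:
    """Return the extraction backends for a file, chosen by its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in STRATEGY:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}")
    return STRATEGY[suffix]


def pages_needing_ocr(pages_text: List[str], ocr_enabled: bool) -> List[int]:
    """Return 0-based indices of pages with no extractable text.

    Assumption: nothing is sent to OCR unless --ocr was requested, and a page
    counts as scanned when its native text is empty or whitespace only.
    """
    if not ocr_enabled:
        return []
    return [i for i, text in enumerate(pages_text) if not text.strip()]
```

Limiting OCR to text-free pages keeps mixed documents fast, since digital pages skip the much slower Tesseract pass.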

Model

Always use claude-opus-4-6 (opus) when analyzing extracted content — summarizing, answering questions, processing tables, etc.

Tips

  • For large documents, use --format json and parse programmatically
  • Images are saved to disk — reference the paths in follow-up tasks
  • Tables come out as 2D arrays — easy to convert to CSV or analyze
  • For password-protected PDFs, unlock first with: qpdf --password=<password> --decrypt input.pdf output.pdf
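  • For .doc (old Word format), convert to .docx first. pandoc cannot read the legacy .doc format; on macOS use: textutil -convert docx input.doc -output output.docx (or LibreOffice: soffice --headless --convert-to docx input.doc)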
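
As an example of the table tip above, a 2D array from a table's data field can be written out as CSV with the standard library. This is a minimal sketch: table_to_csv is a hypothetical helper, the sample rows are made up, and the handling of None cells assumes that empty cells may arrive as null in the JSON.

```python
import csv
import io
from typing import List, Optional


def table_to_csv(data: List[List[Optional[str]]]) -> str:
    """Serialize a 2D array of cells as CSV text; None cells become empty."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in data:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Made-up table in the shape of a `data` field from JSON mode.
rows = [["Item", "Price"], ["Widget, large", "9.50"], ["Gadget", None]]
print(table_to_csv(rows))
```

The csv module quotes cells that contain commas, so values such as "Widget, large" survive a round trip into a spreadsheet.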