PyMuPDF PDF Parser Clawdbot Skill

Fast local PDF parsing with PyMuPDF (fitz) for Markdown/JSON outputs and optional images/tables. Use when speed matters more than robustness, or as a fallback while heavier parsers are unavailable. Default to single-PDF parsing with per-document output folders.

MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
Name/description, README, SKILL.md, and the included script all describe and implement fast local PDF parsing to Markdown/JSON/images/tables using PyMuPDF. There are no unrelated binaries, configs, or environment variables required.
Instruction Scope
SKILL.md tells the agent to run the included script on a local PDF and write outputs to a per-document directory. The script reads only the provided PDF and writes output files; it does not reference external endpoints, other config paths, or secret material.
Install Mechanism
No automated install spec is included (instruction-only plus a local script). The only runtime dependency is PyMuPDF (fitz), which the README instructs to install via pip. No downloads from untrusted URLs or archive extraction are present.
Credentials
The skill declares no required environment variables or credentials and the code does not access environment secrets. The dependency on PyMuPDF is appropriate and proportionate to the stated functionality.
Persistence & Privilege
The skill sets always: false and uses the standard user-invocable/autonomous settings. It does not request permanent presence, does not modify other skills' configs, and operates only on files provided at runtime.
Assessment
This skill appears coherent and local-only: it runs a bundled Python script that reads a PDF and writes Markdown/JSON/images to disk and does not contact external services or require credentials. Before installing or running it: (1) install PyMuPDF from a trusted source (pip from PyPI) and inspect the script yourself; (2) run the tool in a sandbox or with unprivileged user access if processing untrusted PDFs—PDF parsing libraries have had vulnerabilities in the past; (3) note the skill's registry metadata lists no homepage/source verification—if you want stronger assurance, fetch the repository linked in the README and verify commit history; (4) if you need robust table/layout extraction consider the larger parser the README references. Overall there are no red flags in the files provided, but treat untrusted PDFs and third-party pip installs with standard caution.


Current version: v1.0.0

SKILL.md

PyMuPDF PDF

Overview

Parse PDFs locally using PyMuPDF for fast, lightweight extraction into Markdown by default, with optional JSON and image/table outputs in a per-document directory.
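
The extraction described above can be sketched roughly as follows. This is a minimal illustration, not the bundled script: the names `pages_to_markdown` and `parse_pdf` and the "## Page N" heading layout are hypothetical, and the only PyMuPDF calls assumed are `fitz.open` and `page.get_text("text")`.

```python
from typing import Iterable, Protocol


class PageLike(Protocol):
    """Anything with a PyMuPDF-style get_text method (hypothetical helper type)."""

    def get_text(self, option: str = "text") -> str: ...


def pages_to_markdown(pages: Iterable[PageLike]) -> str:
    """Join per-page plain text under '## Page N' headings (assumed layout)."""
    parts = []
    for number, page in enumerate(pages, start=1):
        text = page.get_text("text").strip()
        parts.append(f"## Page {number}\n\n{text}\n")
    return "\n".join(parts)


def parse_pdf(path: str) -> str:
    """Open a PDF with PyMuPDF and return a simple Markdown rendering."""
    import fitz  # PyMuPDF; imported lazily so the helper above works without it

    with fitz.open(path) as doc:
        # Iterating a PyMuPDF document yields its pages in order.
        return pages_to_markdown(doc)
```

Keeping the formatting step separate from the file-opening step makes it possible to try the Markdown layout on stand-in page objects before PyMuPDF is installed.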

Prereqs / when to read references

If you hit import errors (PyMuPDF not installed) or Nix libstdc++ issues, read:

  • references/pymupdf-notes.md
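
Before opening the reference notes, a quick check like the one below may help tell a missing install apart from other problems. It is only a sketch: the helper names are hypothetical, and it assumes PyMuPDF is importable as either `pymupdf` or the legacy `fitz` module name.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(name) is not None


def pymupdf_available() -> bool:
    """Check both the newer 'pymupdf' name and the legacy 'fitz' name."""
    return any(module_available(name) for name in ("pymupdf", "fitz"))


if __name__ == "__main__":
    if pymupdf_available():
        print("PyMuPDF appears to be installed.")
    else:
        print("PyMuPDF not found; try: python -m pip install pymupdf")
```

This only confirms that the module can be located. A broken native library, such as the Nix libstdc++ case mentioned above, would still surface at import time.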

Quick start (single PDF)

# Run from the skill directory
./scripts/pymupdf_parse.py /path/to/file.pdf \
  --format md \
  --outroot ./pymupdf-output

Options

  • --format md|json|both (default: md)
  • --images to extract images
  • --tables to extract a simple line-based table JSON (quick/rough)
  • --outroot DIR to change output root
  • --lang adds a language hint into JSON output metadata
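
For orientation, the options above could map onto an `argparse` definition along these lines. This is an assumed reconstruction from the option list, not the actual contents of scripts/pymupdf_parse.py; the positional argument name `pdf` and the help strings are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (assumed, not the real script)."""
    parser = argparse.ArgumentParser(description="Parse a PDF with PyMuPDF.")
    parser.add_argument("pdf", help="path to the input PDF")
    parser.add_argument("--format", choices=["md", "json", "both"], default="md",
                        help="output format (default: md)")
    parser.add_argument("--images", action="store_true", help="extract images")
    parser.add_argument("--tables", action="store_true",
                        help="extract rough line-based tables as JSON")
    parser.add_argument("--outroot", default="./pymupdf-output",
                        help="output root directory")
    parser.add_argument("--lang", default=None,
                        help="language hint stored in JSON output metadata")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```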

Output conventions

  • Create ./pymupdf-output/<pdf-basename>/ by default.
  • Markdown output: output.md
  • JSON output: output.json (includes lang)
  • Images: images/ subdir
  • Tables: tables.json (rough line-based)
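
The layout above can be computed with `pathlib`, as in the sketch below. The function name `output_paths` is hypothetical, and the sketch assumes `<pdf-basename>` means the file name without its `.pdf` extension, which the source does not state explicitly.

```python
from pathlib import Path


def output_paths(pdf_path: str, outroot: str = "./pymupdf-output") -> dict:
    """Return the per-document output locations for a given PDF (assumed layout)."""
    # Assumption: the per-document folder is named after the PDF's stem.
    doc_dir = Path(outroot) / Path(pdf_path).stem
    return {
        "dir": doc_dir,
        "markdown": doc_dir / "output.md",
        "json": doc_dir / "output.json",
        "images": doc_dir / "images",
        "tables": doc_dir / "tables.json",
    }


if __name__ == "__main__":
    for name, path in output_paths("/path/to/file.pdf").items():
        print(f"{name}: {path}")
```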

Notes

  • PyMuPDF is fast but less robust on complex PDFs.
  • For more robust parsing, use a heavy-duty OCR parser (e.g., MinerU) if installed.
