PDF Utils

v1.0.1

PDF Utils enables OCR of image-based PDFs, extraction of arXiv IDs from text or OCR output, and scriptable PDF tasks like merging, splitting, and rendering.

by Lu Wang (@wangwllu)

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for wangwllu/pdf-utils.

Install the skill "PDF Utils" (wangwllu/pdf-utils) from ClawHub.
Skill page: https://clawhub.ai/wangwllu/pdf-utils
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install pdf-utils

ClawHub CLI

npx clawhub@latest install pdf-utils

Security Scan

VirusTotal: Benign
OpenClaw: Benign (high confidence)

Purpose & Capability
The name/description (OCR, arXiv extraction, merge/split/render) matches the provided scripts and docs. The code only requires PyMuPDF, pytesseract, Pillow and the tesseract binary (all relevant to OCR and PDF processing). No unrelated binaries, env vars, or config paths are requested.
Instruction Scope
SKILL.md and the scripts are focused on local PDF processing. The scripts read PDFs, optionally OCR pages, extract arXiv identifiers, and (optionally) download PDFs from arxiv.org. They do not read arbitrary system credentials or other unrelated filesystem locations. Note: some scripts invoke subprocesses (curl for downloads and tesseract --list-langs) and will perform network downloads when the --download flag is used, which is consistent with the documented behavior.
Install Mechanism
This is an instruction-only skill (no install spec). SKILL.md recommends installing tesseract via brew and Python packages via pip. That is expected for OCR functionality but requires the user to run external installers (brew/pip) and to install tesseract language packs; ensure you run these from trusted package sources. No archive downloads or arbitrary URLs are used by an install step.
Credentials
The skill declares no required environment variables or credentials. The code does not attempt to access secrets or unrelated environment variables. Network access is used only to fetch papers from arxiv.org when the download option is selected.
Persistence & Privilege
The skill does not request always:true and does not modify other skills or global agent configuration. It runs as user-invocable code and will only create files/directories where the CLI is instructed to (e.g., output dir for downloads or OCR text).
Assessment
This skill appears coherent and implements what it claims. Before installing or using it:

  • Review the scripts and run them on non-sensitive sample PDFs to confirm their behavior.
  • Be aware that OCR requires installing the tesseract binary and language packs (SKILL.md suggests brew).
  • The extract_refs download option uses curl to fetch PDFs from arxiv.org. Enable downloads only when you want network activity, and make sure you trust the source.
  • The scripts write output files (papers/, temporary PNGs, OCR text) to the locations you specify, so run them in directories you control.
  • If you need higher assurance, inspect or run the included tests and review the small subprocess calls (curl, tesseract), which are expected for this functionality.

Like a lobster shell, security has layers — review code before you run it.

Latest version: v1.0.1 (2 versions)
Downloads: 212
Stars: 0
Updated: 1mo ago
License: MIT-0

PDF Utils

Use this skill for local, scriptable PDF processing. It is a stable 1.x skill for OCR, arXiv reference mining, and repeatable PyMuPDF workflows. Prefer the built-in pdf tool for AI-style reading, summarization, question-answering, and semantic analysis of PDF content.

Choose the right tool

  • Use the built-in pdf tool for summary, Q&A, extraction by meaning, or general document understanding.
  • Use scripts/extract_refs.py when the PDF already has extractable text and you need arXiv IDs or batch downloads.
  • Use scripts/ocr_pdf.py when the PDF is scanned/image-based and text extraction is poor or empty.
  • Use scripts/pdf_ops.py for repeatable local PDF operations such as merge, split, and rendering a page to an image.
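
The choice between the text path and the OCR path can be made mechanical by measuring how much text each page already yields. The sketch below is an illustration only and is not part of the skill: `choose_script` and its `min_chars_per_page` threshold are hypothetical, and the page texts are assumed to come from whatever extractor you already use (for example PyMuPDF's `page.get_text()`).

```python
from typing import Sequence


def choose_script(page_texts: Sequence[str], min_chars_per_page: int = 50) -> str:
    """Suggest a script based on how much text the pages already contain.

    Hypothetical helper: the threshold is an assumption for illustration,
    not a value taken from the skill.
    """
    if not page_texts:
        # Nothing extractable at all, so treat the PDF as image-based.
        return "scripts/ocr_pdf.py"
    # Count non-whitespace characters so blank-looking pages score zero.
    total = sum(len("".join(text.split())) for text in page_texts)
    average = total / len(page_texts)
    if average >= min_chars_per_page:
        return "scripts/extract_refs.py"
    return "scripts/ocr_pdf.py"
```

A mixed document (some text pages, some scanned pages) can land on either side of the threshold, so treat the result as a hint and inspect the pages when in doubt.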

Core workflows

Extract arXiv IDs from a text PDF

Run:

python3 scripts/extract_refs.py paper.pdf

If needed, download the referenced papers:

python3 scripts/extract_refs.py paper.pdf --download --out ~/papers/
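
For readers who want to see what arXiv ID extraction can look like in code, here is a minimal sketch. It is not the implementation in scripts/extract_refs.py; `find_arxiv_ids` and the pattern are assumptions for illustration. The pattern covers the new-style form (`YYMM.NNNN` or `YYMM.NNNNN`) and the old-style form (`archive/YYMMNNN`), and drops any version suffix such as `v5`.

```python
import re

# Heuristic pattern (an assumption, not the skill's own regex).
ARXIV_ID = re.compile(
    r"(?<![\d.])(\d{4}\.\d{4,5})(?:v\d+)?(?!\d)"          # new style, e.g. 1706.03762v5
    r"|\b([a-z-]+(?:\.[A-Z]{2})?/\d{7})(?:v\d+)?(?!\d)"   # old style, e.g. hep-th/9901001
)


def find_arxiv_ids(text: str) -> list[str]:
    """Return unique arXiv IDs in order of first appearance, without versions."""
    found: list[str] = []
    for match in ARXIV_ID.finditer(text):
        arxiv_id = match.group(1) or match.group(2)
        if arxiv_id not in found:
            found.append(arxiv_id)
    return found
```

Because the pattern is purely textual, it can also match unrelated numbers shaped like an ID, so spot-check the results against the reference list.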

OCR a scanned PDF

Run OCR on all pages:

python3 scripts/ocr_pdf.py paper.pdf --all

To OCR and immediately extract arXiv IDs from the OCR output:

python3 scripts/ocr_pdf.py paper.pdf --all --extract-refs
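
OCR output can contain stray spaces inside identifiers (for example `1706. 03762`), which makes ID matching miss them. The sketch below shows one possible cleanup step before matching. It is an assumption for illustration, not something the skill is documented to do, and `normalize_ocr_ids` is a hypothetical name.

```python
import re

# Matches a dot with optional surrounding whitespace, but only when it sits
# between a 4-digit group and a 4- or 5-digit group (the new-style ID shape).
_SPACED_DOT = re.compile(r"(?<=\d{4})\s*\.\s*(?=\d{4,5}(?!\d))")


def normalize_ocr_ids(text: str) -> str:
    """Collapse whitespace around the dot in ID-shaped numbers.

    Hypothetical helper: a heuristic that may also touch unrelated
    numbers with the same shape.
    """
    return _SPACED_DOT.sub(".", text)
```

Ordinary decimals such as `3.14` are left alone because the digit groups are too short to match.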

Dependencies

Install these before using OCR features:

brew install tesseract
brew install tesseract-lang
pip3 install pytesseract Pillow pymupdf --break-system-packages
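
To confirm the installation before running OCR, a small check like the following can help. It is a sketch using only the Python standard library; `check_ocr_dependencies` is a hypothetical helper, not part of the skill. Note that the import names differ from the pip package names (`pymupdf` is imported as `fitz`, `Pillow` as `PIL`).

```python
import importlib.util
import shutil


def check_ocr_dependencies() -> dict[str, bool]:
    """Return a mapping of dependency name to whether it appears installed."""
    # The tesseract binary must be on PATH for pytesseract to call it.
    status = {"tesseract": shutil.which("tesseract") is not None}
    # pip package name -> importable module name
    modules = {"pytesseract": "pytesseract", "Pillow": "PIL", "pymupdf": "fitz"}
    for package, module in modules.items():
        status[package] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_ocr_dependencies().items():
        print(f"{'ok' if ok else 'MISSING':8}{name}")
```

This only checks that the pieces can be found. It does not verify that the required tesseract language packs are installed.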

Read more only if needed

  • Read references/usage.md for CLI examples, programmatic API notes, PDF ops usage, and known limits.
  • Read the scripts directly if you need to patch behavior or reuse helper functions.

Practical guidance

  • For very large PDFs, OCR in page ranges or batches instead of all at once.
  • For handwritten or low-resolution scans, expect OCR quality to drop.
  • If a PDF yields partial references, inspect the reference pages first instead of assuming extraction is complete.
  • For merge, split, or page rendering, try scripts/pdf_ops.py before writing one-off snippets.
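
The batching advice for large PDFs comes down to splitting the page count into fixed-size ranges. Here is a minimal sketch, where `page_batches` and the default batch size of 20 are assumptions for illustration. How a range is passed to scripts/ocr_pdf.py is not shown here, so check references/usage.md or the script itself for the actual options.

```python
def page_batches(page_count: int, batch_size: int = 20) -> list[tuple[int, int]]:
    """Split a document into inclusive, 1-based (first, last) page ranges."""
    if page_count < 0 or batch_size < 1:
        raise ValueError("page_count must be >= 0 and batch_size must be >= 1")
    return [
        (first, min(first + batch_size - 1, page_count))
        for first in range(1, page_count + 1, batch_size)
    ]
```

For a 45-page document with the default size, this gives the ranges 1 to 20, 21 to 40, and 41 to 45.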
