pdf-ocr-extraction

v1.0.3

Extract text from image-based or scanned PDFs using Tesseract OCR.

MIT-0
License: MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
VirusTotal
Benign
OpenClaw
Benign
high confidence
Purpose & Capability
Name, description, required binaries (tesseract, python3), and Python dependencies (pypdfium2, pytesseract, Pillow) are exactly what you'd expect for a local Tesseract-based PDF OCR tool. No unrelated credentials, binaries, or config paths are requested.
Instruction Scope
SKILL.md contains concrete instructions to render PDF pages to images, OCR them, and clean up temporary files in /tmp — this is appropriate for the purpose. Note: example code uses predictable filenames (/tmp/page_{i}.png) which can be vulnerable to race/symlink attacks on multi-user systems; it also assumes language packs are present and instructs not to auto-download them. The skill will read full document contents (expected for OCR) — treat sensitive PDFs accordingly.
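The render, OCR, and clean-up flow described above could look roughly like the following. This is a minimal sketch, not the skill's actual code: the function name `ocr_pdf`, its parameters, and the `pdf_ocr_` prefix are hypothetical, and it assumes pypdfium2, pytesseract, Pillow, and the requested Tesseract language pack are already installed. It writes page images into a private temporary directory instead of predictable /tmp/page_{i}.png paths.

```python
import tempfile
from pathlib import Path


def ocr_pdf(pdf_path: str, lang: str = "eng", scale: float = 2.0) -> list[str]:
    """Return OCR text for each page of pdf_path, one string per page.

    Hypothetical helper for illustration; not taken from the skill itself.
    """
    # Imported lazily so this module can be loaded before the
    # third-party dependencies are installed.
    import pypdfium2 as pdfium
    import pytesseract

    texts = []
    pdf = pdfium.PdfDocument(pdf_path)
    try:
        # TemporaryDirectory creates a randomly named directory that only the
        # current user can access, and deletes it with its contents on exit.
        with tempfile.TemporaryDirectory(prefix="pdf_ocr_") as tmpdir:
            for i in range(len(pdf)):
                # Render the page to a PIL image; scale > 1 raises resolution,
                # which usually helps OCR accuracy on scanned pages.
                image = pdf[i].render(scale=scale).to_pil()
                png_path = Path(tmpdir) / f"page_{i}.png"
                image.save(png_path)
                # pytesseract accepts a file path or a PIL image.
                texts.append(pytesseract.image_to_string(str(png_path), lang=lang))
    finally:
        pdf.close()
    return texts
```

Calling `ocr_pdf("scan.pdf", lang="eng")` would return one string per page; the file name here is only a placeholder.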
Install Mechanism
Install metadata uses a system package for tesseract and a 'uv' entry to install pypdfium2, pytesseract, and Pillow. This is proportional to the task. 'uv' corresponds to installing Python packages (moderate trust surface because wheels/binaries may include native code), but no arbitrary downloads or unfamiliar hosts are specified.
Credentials
No environment variables, credentials, or config paths are requested. The absence of secrets is consistent with a purely local OCR tool.
Persistence & Privilege
The skill is not forced-available (always: false) and does not request persistent system-wide privileges or modify other skills. It can be invoked autonomously by the agent (platform default) — this is normal but means agents could OCR documents if given access.
Assessment
This skill appears to do what it says, but take these precautions before installing or running it:
- Verify and install tesseract from your distro/vendor and make sure required language packs (e.g., eng, chi_sim) are present; the skill will not auto-download them.
- Install Python packages from trusted sources (PyPI via pip) and pin versions if you care about supply-chain consistency. pypdfium2 and pytesseract include native code; review wheels if you require extra assurance.
- Run OCR in a restricted environment (container or dedicated VM) if processing untrusted PDFs, as OCR returns document text which may contain sensitive data.
- Improve the example by using secure temporary-file APIs (tempfile.NamedTemporaryFile or mkstemp) rather than predictable filenames in /tmp to avoid symlink/race attacks.
- Because the skill's source is unknown, review or run the example code locally before granting any automated agent access; do not expose sensitive documents to an unreviewed or autonomous agent without oversight.
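The temporary-file recommendation can be illustrated with the standard library alone. The sketch below uses `tempfile.mkstemp`; the helper name `write_private_temp` and the `page_` prefix are hypothetical choices for illustration, not part of the skill.

```python
import os
import tempfile


def write_private_temp(data: bytes, suffix: str = ".png") -> str:
    """Write data to a newly created temp file and return its path.

    The caller is responsible for deleting the file when done.
    """
    # mkstemp creates the file exclusively (it fails rather than reuse an
    # existing name) with permissions restricted to the current user, so a
    # pre-planted symlink at a guessable path cannot redirect the write.
    fd, path = tempfile.mkstemp(prefix="page_", suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
    except BaseException:
        # Do not leave a partial file behind if the write fails.
        os.unlink(path)
        raise
    return path
```

A caller would typically wrap the use of the returned path in `try` / `finally` and call `os.unlink(path)` at the end, which mirrors the clean-up step the scan describes.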

Like a lobster shell, security has layers — review code before you run it.

latestvk973jwbgf7bjrv2xwhyb9sjwjh837d14

Runtime requirements

📄 Clawdis
Bins: tesseract, python3
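Since the listing requires these binaries and the scan notes that language packs are not downloaded automatically, a quick preflight check may be useful. The following is a sketch under assumptions: the helper names are hypothetical, and the parsing assumes `tesseract --list-langs` prints one header line followed by one language code per line (some older releases write this listing to stderr, which the code falls back to).

```python
import shutil
import subprocess


def missing_binaries(names):
    """Return the entries of `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def installed_languages():
    """Return the language codes reported by `tesseract --list-langs`."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Assumption: the first line is a header and the rest are language codes.
    lines = (result.stdout or result.stderr).splitlines()
    return [line.strip() for line in lines[1:] if line.strip()]
```

For example, `missing_binaries(["tesseract", "python3"])` returns an empty list when both are available, and `"chi_sim" in installed_languages()` indicates whether the Simplified Chinese pack is present.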

Install

Install Python dependencies (pypdfium2, pytesseract, Pillow)
uv tool install pypdfium2 pytesseract Pillow
