Extract PDF Text
Extract text from PDF files using PyMuPDF. Parse tables, forms, and complex layouts. Supports OCR for scanned documents.
MIT-0 · Free to use, modify, and redistribute. No attribution required.
by Iván (@ivangdavila)
Security Scan
OpenClaw
Benign
high confidence
Purpose & Capability
Name/description (extract text, parse tables/forms, support OCR) match the instructions and examples, which use PyMuPDF and optionally pytesseract/Tesseract. Required binary (python3) and pip install guidance for PyMuPDF are appropriate.
Instruction Scope
SKILL.md and included docs only show local operations (opening files, rendering pages, OCR). Examples reference opening user-supplied PDF paths and handling passwords for encrypted PDFs — all within the stated scope. No instructions to read unrelated system files, transmit data externally, or modify other skills.
Install Mechanism
This is instruction-only (no install spec that downloads arbitrary artifacts). The metadata suggests installing PyMuPDF via pip and the docs recommend installing Tesseract from common package managers — standard, low-risk guidance. No obscure URLs or archive extraction are present.
Credentials
The skill requests no environment variables, credentials, or config paths. Examples do show authenticating password-protected PDFs (example passwords are demonstrative only). There are no unrelated credential requests.
Persistence & Privilege
Skill is not always-enabled, is user-invocable, and is instruction-only (no code persisted or installed by the skill). It doesn't request persistent system presence or modify other skills or system-wide settings.
Assessment
This skill is an offline how-to for using PyMuPDF and (optionally) Tesseract OCR, and it appears internally consistent. Before using: (1) install Python packages in a virtualenv to avoid system-wide changes; (2) install Tesseract from your OS package manager if you need OCR; (3) review example code if you plan to copy/paste (the examples open files you provide and include illustrative hardcoded passwords; do not ship secrets in code); (4) treat PDFs you process as potentially hostile content (run on trusted hosts or in sandboxes if files come from untrusted sources). If you need confirmation of any hidden behavior, request a version that includes runnable code for review (this skill is instruction-only, so nothing executes automatically).
Like a lobster shell, security has layers: review code before you run it.
Current version: v1.0.2
Runtime requirements
📄 Clawdis
OS: Linux · macOS · Windows
Bins: python3
SKILL.md
When to Use
Agent needs to extract text from PDFs. Use PyMuPDF (fitz) for fast local extraction. Works with text-based documents, scanned pages with OCR, forms, and complex layouts.
Quick Reference
| Topic | File |
|---|---|
| Code examples | examples.md |
| OCR setup | ocr.md |
| Troubleshooting | troubleshooting.md |
Core Rules
1. Install PyMuPDF First
pip install PyMuPDF
Import as fitz (historical name):
import fitz # PyMuPDF
2. Basic Text Extraction
import fitz
doc = fitz.open("document.pdf")
text = ""
for page in doc:
    text += page.get_text()
doc.close()
3. Pick the Right Method
| PDF Type | Method |
|---|---|
| Text-based | page.get_text() — fast, accurate |
| Scanned | OCR with pytesseract — slower |
| Mixed | Check each page, use OCR when needed |
4. Check for Text Before OCR
def needs_ocr(page):
    text = page.get_text().strip()
    return len(text) < 50  # Likely scanned if very little text
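For a mixed document, the same check can be run over every page up front to see how much OCR work lies ahead. The sketch below is only an illustration and not part of the skill: `pages_needing_ocr` and its `min_chars` parameter are hypothetical names, and the function accepts any iterable of objects that expose `get_text()`, such as an open `fitz` document.

```python
def pages_needing_ocr(doc, min_chars=50):
    """Return 1-based numbers of pages whose extractable text is shorter than min_chars.

    `doc` is any iterable of page-like objects with a get_text() method
    (for example an open PyMuPDF document). The 50-character default mirrors
    the threshold used in needs_ocr above and is a heuristic, not a rule.
    """
    flagged = []
    for number, page in enumerate(doc, start=1):
        if len(page.get_text().strip()) < min_chars:
            flagged.append(number)
    return flagged
```

An empty result suggests the whole document can be handled with `page.get_text()` alone, so Tesseract would not be needed for that file.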
5. Handle Errors Gracefully
try:
    doc = fitz.open(path)
except fitz.FileDataError:
    print("Invalid or corrupted PDF")
else:
    # Encrypted PDFs open without raising; check needs_pass, then authenticate
    if doc.needs_pass and not doc.authenticate("secret"):
        print("Wrong password")
Extraction Traps
| Trap | What Happens | Fix |
|---|---|---|
| OCR on text PDF | Slow + worse accuracy | Check get_text() first |
| Forget to close doc | Memory leak | Use with or doc.close() |
| Assume page order | Wrong reading flow | Use sort=True in get_text() |
| Ignore encoding | Garbled characters | get_text() returns Unicode str; write output files with encoding="utf-8" |
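Two of the fixes in the table, closing the document and sorting text into reading order, can be combined. The following is a minimal sketch under stated assumptions: `extract_in_reading_order` is a hypothetical helper name, and the commented usage assumes PyMuPDF is installed and that `document.pdf` exists.

```python
def extract_in_reading_order(doc):
    """Join the text of every page, requesting blocks sorted by position.

    `doc` is any iterable of page-like objects whose get_text() accepts
    a `sort` keyword, as PyMuPDF pages do.
    """
    return "\n".join(page.get_text("text", sort=True) for page in doc)


# Intended use (assumes PyMuPDF is installed and document.pdf exists):
#   import fitz
#   with fitz.open("document.pdf") as doc:  # closed automatically on exit
#       text = extract_in_reading_order(doc)
```

Sorting helps when a PDF stores its text in a different order than it appears on the page, though multi-column layouts may still need per-block handling.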
Scope
This skill provides instructions for using PyMuPDF to extract PDF text.
This skill ONLY:
- Gives code examples for PyMuPDF
- Explains OCR setup when needed
- Troubleshoots common issues
This skill NEVER:
- Accesses files without user request
- Sends data externally
- Modifies original PDFs
Security & Privacy
All processing is local:
- PyMuPDF runs entirely on your machine
- No external API calls
- No data leaves your system
Output Formats
Plain Text
text = page.get_text()
Structured (dict)
blocks = page.get_text("dict")["blocks"]
for b in blocks:
    if b["type"] == 0:  # text block
        for line in b["lines"]:
            for span in line["spans"]:
                print(span["text"], span["size"])
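The span sizes in the dict output can be put to work, for instance to guess which text on a page is a heading. The sketch below is a rough heuristic rather than anything the skill prescribes: `spans_with_sizes` and `likely_headings` are hypothetical names, and the input is assumed to be the dictionary returned by `page.get_text("dict")`.

```python
def spans_with_sizes(page_dict):
    """Yield (text, size) for each text span in a get_text("dict") result."""
    for block in page_dict["blocks"]:
        if block["type"] != 0:  # skip non-text (image) blocks, which have no lines
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                yield span["text"], span["size"]


def likely_headings(page_dict):
    """Return the non-blank span texts set in the largest font size on the page."""
    spans = [(text, size) for text, size in spans_with_sizes(page_dict) if text.strip()]
    if not spans:
        return []
    biggest = max(size for _, size in spans)
    return [text for text, size in spans if size == biggest]
```

On a page where body text and titles share one size, this returns every span, so it is best treated as a starting point for layout analysis.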
JSON
import json
data = page.get_text("json")
parsed = json.loads(data)
Full Example
import fitz
def extract_pdf(path):
    """Extract text from PDF, with OCR fallback for scanned pages."""
    doc = fitz.open(path)
    results = []
    for i, page in enumerate(doc):
        text = page.get_text()
        method = "text"
        # If very little text, might be scanned
        if len(text.strip()) < 50:
            # OCR would go here (see ocr.md)
            method = "needs_ocr"
        results.append({
            "page": i + 1,
            "text": text,
            "method": method
        })
    doc.close()
    return {
        "pages": len(results),
        "content": results,
        "word_count": sum(len(r["text"].split()) for r in results)
    }
# Usage
result = extract_pdf("document.pdf")
print(f"Extracted {result['word_count']} words from {result['pages']} pages")
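The full example marks scanned pages as `needs_ocr` but leaves the OCR step open. One way to fill that gap, sketched here under assumptions, is to pass the OCR step in as a function. `extract_with_fallback` and `ocr_page` are hypothetical names; `ocr_page` stands for whatever callable you build from the setup in ocr.md, taking a page and returning its recognized text.

```python
def extract_with_fallback(doc, ocr_page, min_chars=50):
    """Extract text per page, calling ocr_page(page) when little text is found.

    `doc` is any iterable of page-like objects with get_text();
    `ocr_page` is a caller-supplied callable (hypothetical here) that
    returns OCR text for a page.
    """
    results = []
    for number, page in enumerate(doc, start=1):
        text = page.get_text()
        method = "text"
        if len(text.strip()) < min_chars:  # likely a scanned page
            text = ocr_page(page)
            method = "ocr"
        results.append({"page": number, "text": text, "method": method})
    return results
```

Keeping the OCR step injectable means pages with real text never touch Tesseract, and the extraction loop can be exercised without Tesseract installed.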
Feedback
- Useful? clawhub star extract-pdf-text
- Stay updated: clawhub sync
