Install
openclaw skills install extract-pdf-text
Extract text from PDF files using PyMuPDF. Parse tables, forms, and complex layouts. Supports OCR for scanned documents.
Use this skill when an agent needs to extract text from PDFs. It uses PyMuPDF (fitz) for fast local extraction and works with text-based documents, scanned pages (via OCR), forms, and complex layouts.
| Topic | File |
|---|---|
| Code examples | examples.md |
| OCR setup | ocr.md |
| Troubleshooting | troubleshooting.md |
pip install PyMuPDF
Import as fitz (historical name):
import fitz # PyMuPDF
import fitz
doc = fitz.open("document.pdf")
text = ""
for page in doc:
    text += page.get_text()
doc.close()
| PDF Type | Method |
|---|---|
| Text-based | page.get_text() — fast, accurate |
| Scanned | OCR with pytesseract — slower |
| Mixed | Check each page, use OCR when needed |
def needs_ocr(page):
    text = page.get_text().strip()
    return len(text) < 50  # Likely scanned if very little text
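The per-page check above extends naturally to a whole document. The following is a minimal sketch, assuming the page texts have already been collected (for example with `[page.get_text() for page in doc]`). The names `classify_pdf` and `min_chars` are hypothetical, and the 50-character threshold is simply the heuristic from `needs_ocr`, not a tuned value.

```python
def classify_pdf(page_texts, min_chars=50):
    """Label a document "text", "scanned", or "mixed" from its per-page text.

    page_texts: list of strings, one per page (e.g. from page.get_text()).
    min_chars:  hypothetical threshold; pages with less stripped text than
                this are assumed to be scanned images.
    """
    if not page_texts:
        return "empty"
    # True for every page that has too little extractable text
    sparse = [len(t.strip()) < min_chars for t in page_texts]
    if all(sparse):
        return "scanned"
    if any(sparse):
        return "mixed"
    return "text"


# Illustrative input: one page with real text, one nearly blank page.
sample = ["Quarterly report. " * 10, "   \n"]
print(classify_pdf(sample))
```

A "mixed" result suggests running OCR only on the sparse pages rather than on the whole file.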
try:
    doc = fitz.open(path)
except fitz.FileDataError:
    print("Invalid or corrupted PDF")
else:
    # Encrypted PDFs open without raising; unlock them explicitly
    if doc.needs_pass and not doc.authenticate("secret"):
        print("Wrong password")
| Trap | What Happens | Fix |
|---|---|---|
| OCR on text PDF | Slow + worse accuracy | Check get_text() first |
| Forget to close doc | Memory leak | Use with or doc.close() |
| Assume text is in reading order | Wrong reading flow | Use sort=True in get_text() |
| Ignore encoding | Garbled characters | PyMuPDF returns Unicode; write files as UTF-8 |
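For the encoding trap, garbling often appears only when the extracted text is written out with a platform-default encoding. Here is a minimal sketch, assuming the text is an ordinary Python string as returned by `get_text()`. The helpers `save_text` and `load_text` are hypothetical names, and the sample string stands in for real extracted text.

```python
from pathlib import Path
import tempfile


def save_text(text, out_path):
    """Write extracted text as UTF-8 regardless of the platform default."""
    Path(out_path).write_text(text, encoding="utf-8")


def load_text(in_path):
    """Read the text back using the same explicit encoding."""
    return Path(in_path).read_text(encoding="utf-8")


# Stand-in for text returned by page.get_text()
extracted = "Café: 5 µm ± 2"

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "page.txt"
    save_text(extracted, out)
    round_trip = load_text(out)

print(round_trip == extracted)
```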
This skill provides instructions for using PyMuPDF to extract PDF text.
This skill ONLY:
This skill NEVER:
All processing is local:
text = page.get_text()
blocks = page.get_text("dict")["blocks"]
for b in blocks:
    if b["type"] == 0:  # text block
        for line in b["lines"]:
            for span in line["spans"]:
                print(span["text"], span["size"])
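Span sizes from the "dict" output can serve as a rough signal for headings. The sketch below assumes the largest font on a page marks a heading, which will not hold for every layout. `largest_spans` is a hypothetical helper, and `sample_blocks` is a hand-written stand-in for `page.get_text("dict")["blocks"]`.

```python
def largest_spans(blocks):
    """Return the text of spans whose font size equals the page maximum."""
    spans = [
        span
        for b in blocks
        if b.get("type") == 0  # text blocks only; image blocks have no "lines"
        for line in b["lines"]
        for span in line["spans"]
        if span["text"].strip()  # skip whitespace-only spans
    ]
    if not spans:
        return []
    top = max(s["size"] for s in spans)
    return [s["text"].strip() for s in spans if s["size"] == top]


# Hand-written stand-in for page.get_text("dict")["blocks"]
sample_blocks = [
    {"type": 0, "lines": [{"spans": [{"text": "Introduction", "size": 18.0}]}]},
    {"type": 0, "lines": [{"spans": [{"text": "Body text here.", "size": 10.5}]}]},
    {"type": 1},  # image block
]
print(largest_spans(sample_blocks))
```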
import json
data = page.get_text("json")
parsed = json.loads(data)
import fitz
def extract_pdf(path):
    """Extract text from PDF, with OCR fallback for scanned pages."""
    doc = fitz.open(path)
    results = []
    for i, page in enumerate(doc):
        text = page.get_text()
        method = "text"
        # If very little text, might be scanned
        if len(text.strip()) < 50:
            # OCR would go here (see ocr.md)
            method = "needs_ocr"
        results.append({
            "page": i + 1,
            "text": text,
            "method": method
        })
    doc.close()
    return {
        "pages": len(results),
        "content": results,
        "word_count": sum(len(r["text"].split()) for r in results)
    }
# Usage
result = extract_pdf("document.pdf")
print(f"Extracted {result['word_count']} words from {result['pages']} pages")
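Since `extract_pdf` only flags sparse pages rather than running OCR, a caller will usually want the list of flagged page numbers. This is a minimal sketch, assuming the result dictionary has the shape returned above. `pages_needing_ocr` is a hypothetical helper, and `sample_result` is hand-written illustrative data, not real output.

```python
def pages_needing_ocr(result):
    """Return the 1-based page numbers that extract_pdf flagged for OCR."""
    return [p["page"] for p in result["content"] if p["method"] == "needs_ocr"]


# Hand-written stand-in for the dictionary returned by extract_pdf()
sample_result = {
    "pages": 3,
    "content": [
        {"page": 1, "text": "A page with plenty of extractable text.", "method": "text"},
        {"page": 2, "text": "", "method": "needs_ocr"},
        {"page": 3, "text": " \n", "method": "needs_ocr"},
    ],
    "word_count": 7,
}

todo = pages_needing_ocr(sample_result)
print(f"{len(todo)} of {sample_result['pages']} pages need OCR: {todo}")
```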
clawhub star extract-pdf-text
clawhub sync