Extract PDF Text

Extract text from PDF files using PyMuPDF. Parse tables, forms, and complex layouts. Supports OCR for scanned documents.

MIT-0 · Free to use, modify, and redistribute. No attribution required.
by Iván (@ivangdavila)
Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
Name/description (extract text, parse tables/forms, support OCR) match the instructions and examples, which use PyMuPDF and optionally pytesseract/Tesseract. Required binary (python3) and pip install guidance for PyMuPDF are appropriate.
Instruction Scope
SKILL.md and included docs only show local operations (opening files, rendering pages, OCR). Examples reference opening user-supplied PDF paths and handling passwords for encrypted PDFs — all within the stated scope. No instructions to read unrelated system files, transmit data externally, or modify other skills.
Install Mechanism
This is instruction-only (no install spec that downloads arbitrary artifacts). The metadata suggests installing PyMuPDF via pip and the docs recommend installing Tesseract from common package managers — standard, low-risk guidance. No obscure URLs or archive extraction are present.
Credentials
The skill requests no environment variables, credentials, or config paths. Examples do show authenticating password-protected PDFs (example passwords are demonstrative only). There are no unrelated credential requests.
Persistence & Privilege
Skill is not always-enabled, is user-invocable, and is instruction-only (no code persisted or installed by the skill). It doesn't request persistent system presence or modify other skills or system-wide settings.
Assessment
This skill is an offline how-to for using PyMuPDF and, optionally, Tesseract OCR, and it appears internally consistent. Before using: (1) install Python packages in a virtualenv to avoid system-wide changes; (2) install Tesseract from your OS package manager if you need OCR; (3) review example code if you plan to copy and paste it (the examples open files you provide and include illustrative hardcoded passwords; do not ship secrets in code); (4) treat PDFs you process as potentially hostile content, and run extraction in a sandbox or on an isolated host if files come from untrusted sources. If you need confirmation of any hidden behavior, request a version that includes runnable code for review (this skill is instruction-only, so nothing executes automatically).


Current version: v1.0.2

Runtime requirements

📄 Clawdis
OS: Linux · macOS · Windows
Bins: python3

SKILL.md

When to Use

Agent needs to extract text from PDFs. Use PyMuPDF (fitz) for fast local extraction. Works with text-based documents, scanned pages with OCR, forms, and complex layouts.

Quick Reference

| Topic | File |
| --- | --- |
| Code examples | examples.md |
| OCR setup | ocr.md |
| Troubleshooting | troubleshooting.md |

Core Rules

1. Install PyMuPDF First

pip install PyMuPDF

Import as fitz (historical name):

import fitz  # PyMuPDF

2. Basic Text Extraction

import fitz

# The with block closes the document even if extraction raises
with fitz.open("document.pdf") as doc:
    text = "".join(page.get_text() for page in doc)

3. Pick the Right Method

| PDF Type | Method |
| --- | --- |
| Text-based | page.get_text() (fast, accurate) |
| Scanned | OCR with pytesseract (slower) |
| Mixed | Check each page, use OCR when needed |

4. Check for Text Before OCR

def needs_ocr(page):
    text = page.get_text().strip()
    return len(text) < 50  # Likely scanned if very little text
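
A character count alone also flags genuinely blank pages. One possible refinement, sketched here, is to require at least one embedded image before sending a page to OCR. The helper name `looks_scanned` and the `min_chars` parameter are illustrative and not part of the skill; `page` is assumed to be a PyMuPDF page, which provides `get_text()` and `get_images()`.

```python
def looks_scanned(page, min_chars=50):
    """Heuristic: little extractable text AND at least one embedded image.

    `page` is expected to be a PyMuPDF page (fitz.Page). The 50-character
    default mirrors the threshold used above; tune it for your documents.
    """
    has_little_text = len(page.get_text().strip()) < min_chars
    has_images = len(page.get_images()) > 0
    return has_little_text and has_images
```

This remains a heuristic: a text PDF page that holds only a figure and a short caption would also be flagged.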

5. Handle Errors Gracefully

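try:
    doc = fitz.open(path)
except fitz.FileDataError:
    print("Invalid or corrupted PDF")
else:
    # Encrypted PDFs open without raising; check needs_pass, then authenticate
    if doc.needs_pass and not doc.authenticate("secret"):  # example password only
        print("Wrong password")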

Extraction Traps

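| Trap | What Happens | Fix |
| --- | --- | --- |
| OCR on text PDF | Slow + worse accuracy | Check get_text() first |
| Forget to close doc | Memory leak | Use a with block or call doc.close() |
| Assume reading order | Wrong reading flow | Use sort=True in get_text() |
| Ignore encoding | Garbled characters | get_text() returns Unicode str; write files with encoding="utf-8" |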
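
As a rough illustration of the last two rows, the sketch below extracts each page with `sort=True` and writes the result with an explicit UTF-8 encoding. The function name `save_text` and the form-feed page separator are assumptions made for this example; `doc` is assumed to be an open PyMuPDF document, and `sort=True` needs a PyMuPDF release recent enough to support that argument.

```python
from pathlib import Path


def save_text(doc, out_path):
    """Write the text of every page to out_path as UTF-8; return the page count.

    sort=True asks PyMuPDF to order text blocks top-to-bottom, left-to-right
    instead of using the raw order stored in the file.
    """
    pages = [page.get_text("text", sort=True) for page in doc]
    # Form feed between pages is an arbitrary choice for this sketch.
    Path(out_path).write_text("\f".join(pages), encoding="utf-8")
    return len(pages)
```

Coordinate sorting does not guarantee a correct reading order for multi-column layouts, so spot-check the output on such documents.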

Scope

This skill provides instructions for using PyMuPDF to extract PDF text.

This skill ONLY:

  • Gives code examples for PyMuPDF
  • Explains OCR setup when needed
  • Troubleshoots common issues

This skill NEVER:

  • Accesses files without user request
  • Sends data externally
  • Modifies original PDFs

Security & Privacy

All processing is local:

  • PyMuPDF runs entirely on your machine
  • No external API calls
  • No data leaves your system

Output Formats

Plain Text

text = page.get_text()

Structured (dict)

blocks = page.get_text("dict")["blocks"]
for b in blocks:
    if b["type"] == 0:  # text block
        for line in b["lines"]:
            for span in line["spans"]:
                print(span["text"], span["size"])
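
Span sizes can feed simple layout heuristics. The sketch below collects the text of lines whose largest span meets a size threshold, as a crude way to spot headings. It works on the dictionary returned by `page.get_text("dict")`; the function name `find_headings` and the 14-point default are illustrative assumptions.

```python
def find_headings(page_dict, min_size=14.0):
    """Return the text of lines whose largest span is at least min_size points.

    page_dict is the value returned by page.get_text("dict").
    """
    headings = []
    for block in page_dict["blocks"]:
        if block["type"] != 0:  # skip image blocks
            continue
        for line in block["lines"]:
            spans = line["spans"]
            if spans and max(s["size"] for s in spans) >= min_size:
                text = "".join(s["text"] for s in spans).strip()
                if text:
                    headings.append(text)
    return headings
```

Usage would look like `find_headings(page.get_text("dict"))`; a better threshold is usually derived from the body font size of the document at hand.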

JSON

import json
data = page.get_text("json")
parsed = json.loads(data)

Full Example

import fitz

def extract_pdf(path):
    """Extract text from PDF, with OCR fallback for scanned pages."""
    doc = fitz.open(path)
    results = []
    
    for i, page in enumerate(doc):
        text = page.get_text()
        method = "text"
        
        # If very little text, might be scanned
        if len(text.strip()) < 50:
            # OCR would go here (see ocr.md)
            method = "needs_ocr"
        
        results.append({
            "page": i + 1,
            "text": text,
            "method": method
        })
    
    doc.close()
    return {
        "pages": len(results),
        "content": results,
        "word_count": sum(len(r["text"].split()) for r in results)
    }

# Usage
result = extract_pdf("document.pdf")
print(f"Extracted {result['word_count']} words from {result['pages']} pages")
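
The OCR step left open in the example above could be filled in along these lines. This is a sketch under assumptions: it presumes `pytesseract`, Pillow, and a Tesseract binary are installed, and the names `tesseract_ocr` and `extract_page_text` as well as the 300 dpi setting are choices made here, not taken from ocr.md. The OCR function is passed in as a parameter so it can be swapped out or tested without Tesseract.

```python
def tesseract_ocr(page, dpi=300):
    """Render a PyMuPDF page to an image and run Tesseract on it."""
    # Imported lazily so the rest of the module works without OCR installed.
    import pytesseract
    from PIL import Image

    pix = page.get_pixmap(dpi=dpi)  # RGB pixmap without alpha by default
    img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    return pytesseract.image_to_string(img)


def extract_page_text(page, ocr=tesseract_ocr, min_chars=50):
    """Return (text, method); fall back to OCR when little text is found."""
    text = page.get_text()
    if len(text.strip()) >= min_chars:
        return text, "text"
    return ocr(page), "ocr"
```

Inside `extract_pdf`, the loop body could then call `text, method = extract_page_text(page)` in place of the threshold check.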

Feedback

  • Useful? clawhub star extract-pdf-text
  • Stay updated: clawhub sync
