Install
openclaw skills install pdf-reader
Extract text from PDF files with automatic OCR fallback for scanned/image-based PDFs. Use when: (1) a user sends a PDF file and the framework did not auto-inject its text content, (2) the injected text is empty or garbled, (3) a PDF file exists on disk and needs text extraction, (4) the user mentions "read PDF", "extract PDF", "PDF content", "scan PDF", or "OCR". Handles both text-layer PDFs (fast, via pdftotext) and scanned/image PDFs (via tesseract OCR). Supports Chinese and English by default; other languages are configurable.
Extract text from any PDF, whether text-layer or scanned image.
PDF received
├─ Has text layer? ──→ pdftotext (fast, high quality)
│ └─ Text too sparse? ──→ Fall back to OCR
└─ Detected as scan? ──→ Skip text, go straight to OCR
OCR path: pdftoppm (render pages to images) → tesseract (recognize text)
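The decision flow above can be sketched in shell roughly as follows. This is a minimal illustration, not the bundled script: the function names `needs_ocr`, `ocr_pdf`, and `extract_pdf`, the use of `pdfinfo` for the page count, and the 50-characters-per-page threshold are all assumptions made for this example. Only `pdftotext`, `pdftoppm`, and `tesseract` come from the flow itself.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; the real pdf-extract.sh may work differently.

# Assumed threshold: fewer non-whitespace characters per page than this
# is treated as "text too sparse".
MIN_CHARS_PER_PAGE=50

# needs_ocr TEXT PAGES: succeeds (exit 0) when the extracted text is too sparse.
needs_ocr() {
  local text="$1" pages="$2" stripped
  stripped=$(printf '%s' "$text" | tr -d '[:space:]')
  [ "${#stripped}" -lt $(( pages * MIN_CHARS_PER_PAGE )) ]
}

# ocr_pdf FILE [LANGS] [DPI]: render each page with pdftoppm, then OCR it with tesseract.
ocr_pdf() {
  local pdf="$1" langs="${2:-chi_sim+eng}" dpi="${3:-300}" tmp page
  tmp=$(mktemp -d) || return 1
  pdftoppm -r "$dpi" -png "$pdf" "$tmp/page" || { rm -rf "$tmp"; return 1; }
  for page in "$tmp"/page*.png; do
    [ -e "$page" ] || continue
    tesseract "$page" stdout -l "$langs" 2>/dev/null
  done
  rm -rf "$tmp"
}

# extract_pdf FILE: try the text layer first, fall back to OCR when it is sparse.
extract_pdf() {
  local pdf="$1" text pages
  text=$(pdftotext -layout "$pdf" - 2>/dev/null)
  pages=$(pdfinfo "$pdf" 2>/dev/null | awk '/^Pages:/ {print $2}')
  if needs_ocr "$text" "${pages:-1}"; then
    ocr_pdf "$pdf"
  else
    printf '%s\n' "$text"
  fi
}
```

With these definitions sourced, `extract_pdf /path/to/file.pdf` would print the text layer when one exists and OCR output otherwise. Measuring sparseness per page rather than per file is a design choice of this sketch, so that a long scan with a few stray characters still falls back to OCR.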
Run the bundled script via exec:
bash <skill-dir>/scripts/pdf-extract.sh /path/to/file.pdf
Save to file:
bash <skill-dir>/scripts/pdf-extract.sh /path/to/file.pdf --output /tmp/result.txt
Then read /tmp/result.txt with the read tool.
Use this when no <file> text content was injected and only the file path is visible (for example, a path under /root/.openclaw/media/inbound/...).
Example:
# Extract and save
bash <skill-dir>/scripts/pdf-extract.sh "/root/.openclaw/media/inbound/document.pdf" -o /tmp/pdf-text.txt
# Then use read tool on /tmp/pdf-text.txt
| Flag | Description | Default |
|---|---|---|
| --lang | Tesseract languages (validated against allowlist) | chi_sim+eng |
| --dpi | Image resolution for OCR | 300 |
| --output / -o | Save to file instead of stdout | stdout |
| --ocr-only | Force OCR, skip text extraction | off |
| --text-only | Text extraction only, no OCR fallback | off |
| --auto-install | Auto-install missing tools (poppler, tesseract) | off |
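As a rough illustration of how flags like these are commonly parsed in a bash script, here is a minimal sketch. The function name `parse_args` and the variable names (`LANGS`, `DPI`, `OUTPUT`, `MODE`, `AUTO_INSTALL`, `INPUT`) are hypothetical; the bundled script may parse its arguments differently. The defaults mirror the table above.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; names are assumptions, not the script's internals.

parse_args() {
  # Defaults taken from the flag table.
  LANGS="chi_sim+eng"; DPI=300; OUTPUT=""; MODE="auto"; AUTO_INSTALL=0; INPUT=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --lang|--dpi|--output|-o)
        # Flags that take a value: make sure the value is actually present.
        [ "$#" -ge 2 ] || { echo "Missing value for $1" >&2; return 2; }
        case "$1" in
          --lang) LANGS="$2" ;;
          --dpi)  DPI="$2" ;;
          *)      OUTPUT="$2" ;;
        esac
        shift 2 ;;
      --ocr-only)     MODE="ocr";  shift ;;
      --text-only)    MODE="text"; shift ;;
      --auto-install) AUTO_INSTALL=1; shift ;;
      -*)             echo "Unknown flag: $1" >&2; return 2 ;;
      *)              INPUT="$1"; shift ;;
    esac
  done
  [ -n "$INPUT" ] || { echo "No input PDF given" >&2; return 2; }
}
```

For example, `parse_args file.pdf --ocr-only --dpi 400 -o /tmp/out.txt` would leave `MODE=ocr`, `DPI=400`, and `OUTPUT=/tmp/out.txt`, with the language left at its default.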
By default, the script does not install packages automatically. If tools are missing, it prints install instructions and exits.
To enable auto-install, pass --auto-install:
bash <skill-dir>/scripts/pdf-extract.sh file.pdf --auto-install
This installs poppler-utils and tesseract-ocr via apt-get, yum, or brew as needed.
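The tool check and package-manager selection described above could look something like the following. This is a sketch under assumptions: the helper names `missing_tools` and `detect_pkg_manager` are hypothetical, and the order in which package managers are probed is a guess rather than the script's documented behavior.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; function names and probe order are assumptions.

# missing_tools TOOL...: print each tool that is not found on PATH, one per line.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# detect_pkg_manager: print the first available package manager, or "none".
detect_pkg_manager() {
  local pm
  for pm in apt-get yum brew; do
    if command -v "$pm" >/dev/null 2>&1; then
      printf '%s\n' "$pm"
      return 0
    fi
  done
  printf 'none\n'
  return 1
}
```

Calling `missing_tools pdftotext pdftoppm tesseract` prints nothing when everything is installed. A script following the default behavior would print install instructions and exit when the list is non-empty, and would only act on the result of `detect_pkg_manager` when --auto-install was passed.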
Pre-installing the tools is recommended (run once on the server):
apt-get install -y poppler-utils tesseract-ocr tesseract-ocr-chi-sim
Default: Chinese Simplified + English (chi_sim+eng).
The --lang parameter is validated against a strict allowlist of official tesseract language codes. Invalid or malformed values are rejected.
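Allowlist validation of this kind can be sketched as below. The function name `validate_lang` is hypothetical, and `ALLOWED_LANGS` is a deliberately short sample of real tesseract codes, not the script's full allowlist. The shape check runs first so that malformed input (spaces, semicolons, a stray `+`) is rejected before any splitting happens.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; the allowlist below is a small sample.
ALLOWED_LANGS="eng chi_sim chi_tra jpn kor fra deu spa"

# validate_lang SPEC: succeeds only if SPEC is one or more allowlisted
# codes joined by "+", for example "chi_sim+eng".
validate_lang() {
  local spec="$1" code
  local -a codes
  # Shape check: lowercase letters and underscores, separated by single "+".
  local re='^[a-z_]+(\+[a-z_]+)*$'
  [[ $spec =~ $re ]] || return 1
  IFS='+' read -r -a codes <<< "$spec"
  for code in "${codes[@]}"; do
    case " $ALLOWED_LANGS " in
      *" $code "*) ;;          # code is on the allowlist
      *) return 1 ;;           # unknown code: reject the whole value
    esac
  done
  return 0
}
```

Here `validate_lang "chi_sim+eng"` succeeds, while `validate_lang "eng;rm -rf /"` and `validate_lang "eng+"` both fail. Rejecting anything outside the allowlist keeps user input from reaching the tesseract or package-manager command lines unchecked.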
Other languages:
# Japanese + English
bash <skill-dir>/scripts/pdf-extract.sh file.pdf --lang jpn+eng
# Korean
bash <skill-dir>/scripts/pdf-extract.sh file.pdf --lang kor
Tesseract language packs are auto-installed based on --lang.
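On Debian and Ubuntu, tesseract language packs are named `tesseract-ocr-<code>` with underscores replaced by hyphens, as in `tesseract-ocr-chi-sim` above. A mapping from a --lang value to package names might be sketched like this. The function name `lang_packages` is hypothetical, and whether the script derives package names this way is an assumption.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; assumes Debian/Ubuntu package naming.

# lang_packages SPEC: print one apt package name per language code in SPEC.
lang_packages() {
  local spec="$1" code
  local -a codes
  IFS='+' read -r -a codes <<< "$spec"
  for code in "${codes[@]}"; do
    # chi_sim -> tesseract-ocr-chi-sim
    printf 'tesseract-ocr-%s\n' "${code//_/-}"
  done
}
```

For instance, `lang_packages jpn+eng` prints `tesseract-ocr-jpn` and `tesseract-ocr-eng` on separate lines. To see which language packs are already installed, run `tesseract --list-langs`.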