Install
openclaw skills install pdf-extract-skill
OpenClaw PDF extraction skill using OpenDataLoader. Use when the user wants to extract and process PDF content for RAG, embeddings, or coordinate-based citations.
To improve maintainability and allow targeted calls to specific .md files, this skill relies on helper documents:
Usage rules:
This skill maximizes PDF reading quality for OpenClaw in ClawHub using OpenDataLoader PDF.
Pillars:
Use this skill when the user needs to:
Do not use this skill for:
Because the MCP does not exist yet, this skill must operate through the CLI only:
Do not create complex wrappers or intermediate services unless strictly needed.
Always validate before conversion:
Quick checks:
If Java fails on Windows, reopen the terminal and verify PATH.
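The validation step can be sketched as a small shell pre-flight. This is a minimal sketch, not part of the skill itself: the function names `java_major` and `preflight` are hypothetical, and the Java 11 minimum is the one stated in the troubleshooting notes of this document.

```sh
# Hypothetical helper: extract the major version from a Java version string,
# e.g. "17.0.2" -> 17, legacy "1.8.0_301" -> 8.
java_major() {
  local v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;  # legacy 1.x scheme: drop the leading "1."
  esac
  printf '%s\n' "${v%%[!0-9]*}"  # keep only the leading digits
}

# Hypothetical pre-flight: returns non-zero if Java 11+ or the CLI is missing.
preflight() {
  local version major
  if ! command -v java >/dev/null 2>&1; then
    echo "java not found on PATH" >&2
    return 1
  fi
  # java -version prints its banner on stderr, e.g.: openjdk version "17.0.2" ...
  version="$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)"
  major="$(java_major "$version")"
  if [ -z "$major" ] || [ "$major" -lt 11 ]; then
    echo "Java 11+ required, found: ${version:-unknown}" >&2
    return 1
  fi
  if ! command -v opendataloader-pdf >/dev/null 2>&1; then
    echo "opendataloader-pdf not found on PATH" >&2
    return 1
  fi
}
```

A caller could then run `preflight && opendataloader-pdf ./pdfs/ -o ./output -f markdown`, so conversion only starts once both checks pass.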
Always process multiple files in a single invocation to avoid JVM startup overhead per call.
Recommended example: opendataloader-pdf file1.pdf file2.pdf ./folder/ -o ./output -f json,markdown
Suggested response:
Template: "Processing completed. N PDFs were converted to ./output with json,markdown format. If you want, I can now extract specific pages or enable OCR for scanned files."
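A batch run along these lines could be wrapped as shown here. This is only a sketch: `convert_batch`, `count_pdfs`, and the `ODL_BIN` override are hypothetical names introduced for illustration, and the PDF count assumes folders are scanned recursively, which may not match what the CLI actually does.

```sh
# Hypothetical helper: count PDF files among the given files and folders.
# Approximate: an input listed twice (file plus its folder) is counted twice.
count_pdfs() {
  find "$@" -type f -iname '*.pdf' 2>/dev/null | wc -l | tr -d '[:space:]'
}

# Hypothetical wrapper: convert all inputs in ONE invocation (one JVM start).
# usage: convert_batch OUTPUT_DIR FORMATS INPUT...
convert_batch() {
  if [ "$#" -lt 3 ]; then
    echo "usage: convert_batch OUTPUT_DIR FORMATS INPUT..." >&2
    return 2
  fi
  local out="$1" fmt="$2"
  shift 2
  # ODL_BIN is an assumed override for dry runs; it defaults to the real CLI.
  "${ODL_BIN:-opendataloader-pdf}" "$@" -o "$out" -f "$fmt" || return 1
  echo "Processing completed. $(count_pdfs "$@") PDFs were converted to $out with $fmt format."
}
```

For example, `convert_batch ./output json,markdown file1.pdf file2.pdf ./folder/` issues the recommended command once and then prints the summary line from the template.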
opendataloader-pdf ./pdfs/ -o ./output -f markdown
opendataloader-pdf ./pdfs/ -o ./output -f json,markdown
opendataloader-pdf report.pdf -o ./output -f json --pages "1,3,5-7"
opendataloader-pdf report.pdf -o ./output -f markdown --sanitize
opendataloader-pdf report.pdf -o ./output -f markdown --keep-line-breaks
opendataloader-pdf report.pdf -o ./output -f json --image-output external
opendataloader-pdf report.pdf -o ./output -f json --image-output embedded
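Before passing a value to --pages, a wrapper might sanity-check it. The sketch below assumes the syntax is limited to what the "1,3,5-7" example shows: comma-separated page numbers and dash ranges. The function name `valid_pages` is hypothetical, and the real CLI may accept other forms.

```sh
# Hypothetical check: succeed only for specs like "4", "1,3" or "1,3,5-7".
# Assumption: pages are plain numbers or N-M ranges separated by commas.
valid_pages() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$'
}

# Usage idea:
#   pages="1,3,5-7"
#   valid_pages "$pages" && opendataloader-pdf report.pdf -o ./output -f json --pages "$pages"
```

The check only validates the shape of the string; it does not know how many pages the PDF has.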
Use it when:
Standard: opendataloader-pdf-hybrid --port 5002
Forced OCR: opendataloader-pdf-hybrid --port 5002 --force-ocr
Multi-language OCR: opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "es,en"
With image descriptions: opendataloader-pdf-hybrid --port 5002 --enrich-picture-description
Hybrid auto mode: opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf ./folder/ -o ./output -f json,markdown
With timeout and fallback: opendataloader-pdf --hybrid docling-fast --hybrid-timeout 120000 --hybrid-fallback file1.pdf ./folder/ -o ./output -f json
Image descriptions enabled (--hybrid-mode full required): opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf ./folder/ -o ./output -f json,markdown
Critical note: If the backend starts with --enrich-picture-description, the client must use --hybrid-mode full to include descriptions in output.
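The pairing rule in the critical note can be encoded so the client flags always follow the backend flags. This is a minimal sketch, assuming the caller knows the arguments the backend was started with; `hybrid_client_flags` is a hypothetical helper, and it uses only flags that appear in this document.

```sh
# Hypothetical helper: given the backend's arguments, print matching client flags.
# If the backend enriches picture descriptions, the client needs --hybrid-mode full.
hybrid_client_flags() {
  local flags="--hybrid docling-fast" arg
  for arg in "$@"; do
    case "$arg" in
      --enrich-picture-description) flags="$flags --hybrid-mode full" ;;
    esac
  done
  printf '%s\n' "$flags"
}

# Usage idea (left unquoted on purpose so the flags split into separate words):
#   opendataloader-pdf $(hybrid_client_flags --port 5002 --enrich-picture-description) \
#     file1.pdf ./folder/ -o ./output -f json,markdown
```

Deriving the client flags from the backend flags in one place keeps the two from drifting apart, which is the failure the note warns about.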
Problem: Java not found. Solution: install Java 11+ and verify with java -version.
Problem: Hybrid backend connection error. Solution: start opendataloader-pdf-hybrid in another terminal and verify port 5002.
Problem: Processing is too slow. Solution: process in batches, increase the hybrid timeout, and check that the backend has enough RAM.
Problem: Mixed columns. Solution: use default reading mode (xycut) and try --use-struct-tree for tagged PDFs.
Problem: Poor table quality. Solution: use json output together with hybrid mode.
This skill uses and credits the excellent OpenDataLoader project: https://opendataloader.org/
Official documentation used for this version: https://opendataloader.org/docs