Install
openclaw skills install odl-pdf
Convert any PDF to Markdown, JSON, or HTML using OpenDataLoader PDF, billed as the #1 ranked open-source PDF parser. Supports digital PDFs, scanned PDFs with OCR, and complex layouts with table extraction and reading-order detection. Use when a user shares a PDF and wants it parsed into readable text, structured data, or searchable content.
OpenDataLoader requires Java. After installing Java, create a symlink in a directory on your PATH so Python subprocesses can find it:
# Find your Java install
ls ~/jdk-*/bin/java 2>/dev/null || ls /opt/jdk*/bin/java 2>/dev/null
# Create symlink (the target directory must exist and be on your PATH)
mkdir -p ~/.local/bin
ln -sf /path/to/java/bin/java ~/.local/bin/java
pip install opendataloader-pdf
Or use the auto-install script (handles Java + Python automatically):
curl -fsSL https://raw.githubusercontent.com/opendataloader-project/opendataloader-pdf/main/scripts/install.sh | bash
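Whichever install route you take, it can help to confirm that Java is actually visible to Python before running a conversion. The following is a minimal sketch using only the standard library; the helper names `find_java` and `java_version` are illustrative and not part of OpenDataLoader.

```python
import shutil
import subprocess


def find_java(executable="java"):
    """Return the absolute path of the Java launcher on PATH, or None if absent."""
    return shutil.which(executable)


def java_version(java_path):
    """Return the first line printed by `java -version` (Java writes it to stderr)."""
    result = subprocess.run(
        [java_path, "-version"], capture_output=True, text=True, check=False
    )
    lines = (result.stderr or result.stdout).strip().splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    path = find_java()
    if path is None:
        print("java not found on PATH; create the symlink or extend PATH")
    else:
        print(f"java found at {path}: {java_version(path)}")
```

If the script reports that Java is missing even though a JDK is installed, the symlink target or the PATH of the Python process is the likely culprit.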
# Basic: Markdown + JSON output
pdf2md document.pdf ./output
# HTML + JSON output
pdf2md document.pdf ./output html,json
# Markdown only
pdf2md document.pdf ./output markdown
import opendataloader_pdf
opendataloader_pdf.convert(
    input_path="document.pdf",
    output_dir="./output",
    format="markdown,json",
)
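To convert a whole folder, one option is a small wrapper around the call above. This is a sketch under stated assumptions: it calls `convert` once per file with the same keyword arguments shown in the example, and the `converter` parameter is a hypothetical injection point (not part of the library) that lets the helper be exercised without Java installed.

```python
from pathlib import Path


def collect_pdfs(folder):
    """Return every .pdf file directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def convert_folder(folder, output_dir, fmt="markdown,json", converter=None):
    """Convert each PDF in `folder`, one call per file, and return the paths processed.

    `converter` is a hypothetical hook for testing; when omitted, the helper
    falls back to opendataloader_pdf.convert.
    """
    if converter is None:
        # Imported lazily so this module loads even without the package installed.
        import opendataloader_pdf
        converter = opendataloader_pdf.convert

    pdfs = collect_pdfs(folder)
    for pdf in pdfs:
        converter(input_path=str(pdf), output_dir=str(output_dir), format=fmt)
    return pdfs
```

Calling `convert_folder("./inbox", "./output")` would process every PDF in the hypothetical `./inbox` directory and write results to `./output`.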
| Type | Example | OCR Needed |
|---|---|---|
| Digital PDF | Text-based PDFs | No |
| Scanned PDF | Image-only scans | Yes (built-in) |
| Tagged PDF | Accessibility PDFs | No |
| Multi-column | Academic papers | No |
| Tables | Data reports | No |
Clean text with heading hierarchy, bullet lists, and paragraph structure.
{
  "file name": "document.pdf",
  "number of pages": 5,
  "author": "Author Name",
  "kids": [
    {
      "type": "heading",
      "level": "Doctitle",
      "page number": 1,
      "bounding box": [100.0, 744.5, 404.0, 773.1],
      "font": "Helvetica-Bold",
      "font size": 24.0,
      "content": "Document Title"
    },
    {
      "type": "paragraph",
      "page number": 1,
      "bounding box": [100.0, 676.8, 316.3, 713.0],
      "font": "Helvetica",
      "font size": 14.0,
      "content": "Paragraph text..."
    }
  ]
}
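The JSON output above lends itself to simple post-processing. As a rough sketch, the snippet below walks the `kids` list and collects headings into an outline. It relies only on the keys visible in the sample (`kids`, `type`, `page number`, `content`); that nested elements also keep their children under `kids` is an assumption, and the helper names are illustrative.

```python
import json


def iter_elements(node):
    """Yield every element under `node`, depth-first.

    Assumes nested elements, if any, also store children under "kids".
    """
    for child in node.get("kids", []):
        yield child
        yield from iter_elements(child)


def outline(document):
    """Return (page number, content) pairs for every heading element."""
    return [
        (el.get("page number"), el.get("content", ""))
        for el in iter_elements(document)
        if el.get("type") == "heading"
    ]


# A trimmed copy of the sample output, with only the keys this sketch reads.
sample = json.loads("""
{
  "file name": "document.pdf",
  "number of pages": 5,
  "kids": [
    {"type": "heading", "level": "Doctitle", "page number": 1, "content": "Document Title"},
    {"type": "paragraph", "page number": 1, "content": "Paragraph text..."}
  ]
}
""")

print(outline(sample))
```

For a real file, replace the inline sample with `json.load` on the JSON file written to the output directory.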
clawhub install pdf-to-markdown
# Install dependencies
pip install opendataloader-pdf
# Make script executable
chmod +x scripts/pdf2md
PDF Input
│
▼
OpenDataLoader PDF (JVM)
│
├── PDFBox ──► Text extraction + layout analysis
├── veraPDF ──► PDF validation + structure
└── Tesseract ──► OCR (scanned PDFs)
│
▼
Output: Markdown / JSON / HTML
| Metric | Value |
|---|---|
| Overall extraction accuracy | 0.90 |
| Table extraction accuracy | 0.93 |
| Processing speed (local) | 0.05s/page |
Benchmarks were run on 200 real-world PDFs, including multi-column layouts and scientific papers.