ODL PDF to Markdown

Convert any PDF to Markdown, JSON, and HTML using OpenDataLoader, an open-source PDF parser that ranked #1 against 9 other parsers on a benchmark of 200 real-world PDFs. It extracts text from digital PDFs, scanned PDFs (via built-in OCR), and complex multi-column layouts, and it extracts data-heavy tables with 0.93 table accuracy. It outputs clean Markdown for RAG pipelines, bounding-box JSON for source citations, and HTML for rich rendering. No API key is required, and it is fully self-hosted. Use it when extracting text from research papers, parsing scanned documents, converting tables from reports, building RAG knowledge bases, or processing batches of PDFs for AI pipelines.

duplicate of @adelpro/odl-pdf

Install

openclaw skills install odl-pdf-to-markdown

ODL PDF to Markdown

The #1 ranked open-source PDF parser. Convert any PDF to Markdown, JSON, and HTML with bounding-box coordinates for precise source citations in RAG pipelines.

Why ODL PDF?

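| Feature             | Others      | ODL PDF  |
|---------------------|-------------|----------|
| Benchmark accuracy  | 0.75 avg    | 0.90     |
| Table accuracy      | 0.70 avg    | 0.93     |
| Bounding-box JSON   | No          | Yes      |
| Hybrid AI mode      | No          | Yes      |
| Built-in OCR        | Extra setup | Yes      |
| Multi-column layout | Basic       | XY-Cut++ |
| Self-hosted         | Yes         | Yes      |
| API key needed      | Often       | No       |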

Strong Points

  • #1 in benchmarks — 0.90 overall extraction accuracy, 0.93 table accuracy across 200 real-world PDFs

  • Bounding-box JSON — every element (heading, paragraph, table, image) has pixel coordinates for RAG citations

  • Hybrid AI mode — routes complex pages to AI backend for charts, formulas, and borderless tables

  • Built-in OCR — 80+ languages, works with scanned PDFs at 300 DPI+

  • XY-Cut++ reading order — correct reading order for multi-column academic papers

  • No API key — fully self-hosted, open-source Apache 2.0

  • Multiple formats — Markdown (clean text), JSON (structured), HTML (rich layout)

  • Markdown — clean readable text with correct reading order

  • JSON — structured data with bounding boxes, font sizes, page numbers

  • HTML — rich HTML output preserving layout

  • OCR — built-in OCR for scanned PDFs (80+ languages)

  • Tables — complex table extraction (0.93 accuracy)

  • Reading order — XY-Cut++ algorithm for multi-column layouts

  • No API key — fully self-hosted, open-source

Requirements

Java 11+ (symlink setup)

OpenDataLoader requires Java. After installing, create a symlink so Python subprocesses can find it:

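# Find your Java install
ls ~/jdk-*/bin/java 2>/dev/null || ls /opt/jdk*/bin/java 2>/dev/null

# Create the symlink (replace /path/to/java with the directory found above)
mkdir -p ~/.local/bin
ln -sf /path/to/java/bin/java ~/.local/bin/java

# Confirm that java resolves; ~/.local/bin must be on your PATH
java -version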
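
Since the symlink exists so that Python subprocesses can find Java, it can help to check discoverability from Python itself before running a conversion. The following is a minimal sketch using only the standard library; the helper names `find_java` and `java_version_banner` are hypothetical and are not part of OpenDataLoader.

```python
import shutil
import subprocess
from typing import Optional


def find_java() -> Optional[str]:
    """Return the path of the first `java` found on PATH, or None if absent."""
    return shutil.which("java")


def java_version_banner(java_path: str) -> str:
    """Return the raw `java -version` banner (the JVM usually prints it to stderr)."""
    result = subprocess.run(
        [java_path, "-version"],
        capture_output=True,
        text=True,
        check=False,
    )
    return (result.stderr or result.stdout).strip()


java_path = find_java()
if java_path is None:
    print("java not found on PATH; check the symlink and your PATH setting")
else:
    print(java_path)
    print(java_version_banner(java_path))
```

If the first branch fires even though the symlink exists, the likely cause is that `~/.local/bin` is not on the `PATH` seen by the Python process.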

Python 3.10+

pip install opendataloader-pdf

Or use the auto-install script (handles Java + Python automatically):

curl -fsSL https://raw.githubusercontent.com/opendataloader-project/opendataloader-pdf/main/scripts/install.sh | bash

Usage

CLI

# Basic — markdown + json output
pdf2md document.pdf ./output

# HTML + JSON output
pdf2md document.pdf ./output html,json

# Markdown only
pdf2md document.pdf ./output markdown

Python

import opendataloader_pdf

opendataloader_pdf.convert(
    input_path="document.pdf",
    output_dir="./output",
    format="markdown,json"
)
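
For a folder of PDFs, one option is to loop over the files and call `convert` once per file with the same keyword arguments as above. This is a sketch under stated assumptions, not the library's own batch interface: `collect_pdfs` and `convert_all` are hypothetical helpers, and the converter is passed in as a parameter so the loop can be exercised without Java installed.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def collect_pdfs(root: Path) -> List[Path]:
    """Return every PDF under root (recursively), sorted for a stable order."""
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"
    )


def convert_all(
    pdfs: Iterable[Path],
    output_dir: Path,
    convert: Callable[..., object],
    fmt: str = "markdown,json",
) -> List[Path]:
    """Convert each PDF in turn and return the list of files that failed.

    `convert` is injected; in real use, pass `opendataloader_pdf.convert`.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    failed: List[Path] = []
    for pdf in pdfs:
        try:
            # Same keyword arguments as the single-file call.
            convert(input_path=str(pdf), output_dir=str(output_dir), format=fmt)
        except Exception:
            # Deliberately broad: one bad file should not abort the batch.
            failed.append(pdf)
    return failed
```

In real use this would be called as `convert_all(collect_pdfs(Path("./pdfs")), Path("./output"), opendataloader_pdf.convert)`, where `./pdfs` is an illustrative directory name. Returning the failures rather than raising makes it easy to retry or inspect problem files after the run.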

Supported Input Formats

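| Type         | Example            | OCR Needed     |
|--------------|--------------------|----------------|
| Digital PDF  | Text-based PDFs    | No             |
| Scanned PDF  | Image-only scans   | Yes (built-in) |
| Tagged PDF   | Accessibility PDFs | No             |
| Multi-column | Academic papers    | No             |
| Tables       | Data reports       | No             |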

Output Formats

Markdown

Clean text with heading hierarchy, bullet lists, and paragraph structure.

JSON

{
  "file name": "document.pdf",
  "number of pages": 5,
  "author": "Author Name",
  "kids": [
    {
      "type": "heading",
      "level": "Doctitle",
      "page number": 1,
      "bounding box": [100.0, 744.5, 404.0, 773.1],
      "font": "Helvetica-Bold",
      "font size": 24.0,
      "content": "Document Title"
    },
    {
      "type": "paragraph",
      "page number": 1,
      "bounding box": [100.0, 676.8, 316.3, 713.0],
      "font": "Helvetica",
      "font size": 14.0,
      "content": "Paragraph text..."
    }
  ]
}
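
To turn this JSON into source citations, a pipeline can walk the `kids` array and keep the page number and bounding box next to each piece of text. The sketch below works on a trimmed copy of the sample above. The helper names are hypothetical, and the recursion assumes that container elements nest their children under their own `kids` key, which the sample does not show.

```python
import json
from typing import Any, Dict, Iterator, List

# Trimmed copy of the sample output shown above.
SAMPLE = """
{
  "file name": "document.pdf",
  "number of pages": 5,
  "kids": [
    {
      "type": "heading",
      "page number": 1,
      "bounding box": [100.0, 744.5, 404.0, 773.1],
      "content": "Document Title"
    },
    {
      "type": "paragraph",
      "page number": 1,
      "bounding box": [100.0, 676.8, 316.3, 713.0],
      "content": "Paragraph text..."
    }
  ]
}
"""


def iter_elements(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every element under `node`, depth first.

    Assumption: nested containers keep children under "kids"; a flat list
    like the one in the sample is handled the same way.
    """
    for kid in node.get("kids", []):
        if isinstance(kid, dict):
            yield kid
            yield from iter_elements(kid)


def to_citation(doc: Dict[str, Any], element: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce one element to the fields a RAG citation typically needs."""
    return {
        "file": doc.get("file name"),
        "page": element.get("page number"),
        "bbox": element.get("bounding box"),
        "type": element.get("type"),
        "text": element.get("content", ""),
    }


doc = json.loads(SAMPLE)
citations: List[Dict[str, Any]] = [
    to_citation(doc, el) for el in iter_elements(doc) if el.get("content")
]
for c in citations:
    print(c["file"], c["page"], c["bbox"], c["text"])
```

Storing the citation dictionary as metadata alongside each chunk lets a retrieval result point back to an exact region of the source page.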

Installation

OpenClaw Skill Install

clawhub install pdf-to-markdown

Manual Install

# Install dependencies
pip install opendataloader-pdf

# Make script executable
chmod +x scripts/pdf2md

Architecture

PDF Input
    │
    ▼
OpenDataLoader PDF (JVM)
    │
    ├── PDFBox    ──► Text extraction + layout analysis
    ├── veraPDF   ──► PDF validation + structure
    └── Tesseract ──► OCR (scanned PDFs)
    │
    ▼
Output: Markdown / JSON / HTML

Benchmark

OpenDataLoader ranked #1 against 9 other open-source and commercial PDF parsers on a test set of 200 real-world PDFs:

| Metric                        | Score      |
|-------------------------------|------------|
| Overall extraction accuracy   | 0.90       |
| Table extraction accuracy     | 0.93       |
| Processing speed (local mode) | 0.05s/page |
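
As a rough planning aid, the quoted local-mode speed can be turned into a batch time estimate. This is simple arithmetic on the 0.05 s/page figure and nothing more: actual throughput will depend on hardware, and pages that need OCR or the hybrid AI mode may well take longer. The function name is hypothetical.

```python
SECONDS_PER_PAGE_LOCAL = 0.05  # local-mode figure quoted in the benchmark


def estimate_seconds(
    total_pages: int, seconds_per_page: float = SECONDS_PER_PAGE_LOCAL
) -> float:
    """Estimate wall-clock seconds for a batch of pages at a fixed rate."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    return total_pages * seconds_per_page


# Example: 40 documents averaging 25 pages each.
print(f"{estimate_seconds(40 * 25):.1f} s")
```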

Common Use Cases

  • RAG pipelines — convert PDFs to chunkable markdown
  • Document parsing — extract text from research papers
  • Accessibility — convert PDFs to structured data
  • Data extraction — pull tables from reports
  • Content migration — PDF to markdown for wikis/docs
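
For the RAG use case, "chunkable" usually means splitting the Markdown at headings so each chunk carries its section title. The sketch below is one simple way to do that, assuming the output uses `#`-style headings. It does not special-case `#` lines inside fenced code blocks, and `chunk_by_heading` is a hypothetical helper rather than part of OpenDataLoader.

```python
import re
from typing import List, Tuple

# One to six '#' characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_heading(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) chunks at '#'-style headings.

    Text before the first heading is returned with an empty heading.
    """
    chunks: List[Tuple[str, str]] = []
    title = ""
    body: List[str] = []

    def flush() -> None:
        # Skip the empty preamble, but keep headings that have no body.
        if title or any(line.strip() for line in body):
            chunks.append((title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2), []
        else:
            body.append(line)
    flush()
    return chunks


sample = "intro\n\n# Methods\nWe parse PDFs.\n\n## Results\nTables extracted.\n"
for heading, text in chunk_by_heading(sample):
    print(repr(heading), "->", repr(text))
```

Long sections may still exceed an embedding model's input limit, so a real pipeline would likely add a second pass that splits oversized chunks by paragraph.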

See Also