opendataloader-pdf

v1.0.0

Use when parsing PDFs for RAG pipelines, extracting structured data from PDFs, or converting PDFs to Markdown/JSON with bounding boxes for AI processing


by empty_4399 @emptyguo

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for emptyguo/opendataloader-pdf.


Install the skill "opendataloader-pdf" (emptyguo/opendataloader-pdf) from ClawHub.
Skill page: https://clawhub.ai/emptyguo/opendataloader-pdf
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install opendataloader-pdf

ClawHub CLI


npx clawhub@latest install opendataloader-pdf

Security Scan

VirusTotal

Benign


OpenClaw

Benign

high confidence

✓

Purpose & Capability

Name/description (PDF parsing for RAG, bounding boxes, Markdown/JSON output) align with the SKILL.md: it documents CLI/Python/Node APIs, supported modes (fast/hybrid/OCR), and expected outputs. Required system dependencies (Java, Python/Node) are reasonable for PDF parsing/OCR pipelines.

✓

Instruction Scope

SKILL.md only instructs installing the package(s), running conversion commands, and configuring mode/ocr/languages. It references input file paths and output directories (expected for this purpose). It does not instruct reading unrelated system files, exporting secrets, or sending data to unexpected external endpoints. The only potential scope caveat: 'hybrid' mode and 'start server' are mentioned but not detailed — those could change data flows depending on implementation, so users should verify hybrid behavior before enabling.

✓

Install Mechanism

This is an instruction-only skill with no install spec. The SKILL.md recommends pip/npm installs (standard registries). No embedded download URLs or archive extraction steps in the skill itself. Installing from PyPI/npm is a common, low-risk approach — verify package provenance when installing.

✓

Credentials

The skill declares no required environment variables, credentials, or config paths. The SKILL.md does not reference secret env vars. This is proportionate for a local PDF-extraction tool.

✓

Persistence & Privilege

The skill's `always` setting is false, and the skill does not request persistent system presence or modify other skills. It does not require elevated privileges or access to other agents' configs.

Assessment

This skill appears coherent and focused on local PDF extraction. Before installing:

1) Verify the opendataloader-pdf package on PyPI/npm and confirm the upstream GitHub/source and release integrity.
2) Be aware that hybrid mode or any server mode may change data flows (it could call external services or require models); read the hybrid-mode docs and any config for remote endpoints or API keys before enabling.
3) Run installations in an isolated environment (virtualenv/container) and test on non-sensitive documents first.
4) Ensure Java 11+ and any OCR dependencies are installed from trusted sources.
5) If you need guarantees about data staying local, confirm implementation details for hybrid/OCR modes in the project's docs or source code.

Like a lobster shell, security has layers — review code before you run it.


MIT-0

This skill is based on OpenDataLoader PDF, licensed under Apache License 2.0.

OpenDataLoader PDF

PDF parser for AI-ready data extraction. Open-source. Highest overall score in the benchmark comparison below (0.90 overall accuracy, in hybrid mode).

Prerequisites

Java 11+ required (run java -version to verify)
Python 3.10+ or Node.js 18+
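
The Java prerequisite can also be checked from a script before the first conversion. The following is a minimal sketch, not part of the package: the helper names `parse_java_major` and `java_is_ready` are hypothetical, and the parser assumes the usual `version "..."` line that `java -version` prints.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy numbering: "1.8.0_281" means Java 8.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_is_ready(minimum: int = 11) -> bool:
    """True when a `java` executable on PATH reports at least `minimum`."""
    if shutil.which("java") is None:
        return False
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    major = parse_java_major(result.stderr or result.stdout)
    return major is not None and major >= minimum
```

Calling `java_is_ready()` before a batch job gives an early, readable failure instead of a "java not found" error in the middle of a run.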

Installation

# Python
pip install -U opendataloader-pdf

# Python with hybrid AI mode (for complex documents)
pip install -U "opendataloader-pdf[hybrid]"

# Node.js
npm install @opendataloader/pdf

Quick Start

Python

import opendataloader_pdf

opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    format="markdown,json"  # Output formats: markdown, json, html
)

Node.js

import { convert } from '@opendataloader/pdf';

await convert(['file1.pdf', 'file2.pdf'], {
  outputDir: 'output/',
  format: 'markdown,json'
});

CLI

# Fast local mode (0.05s/page)
opendataloader-pdf file1.pdf file2.pdf -o output/

# With hybrid AI mode (higher accuracy for complex docs)
opendataloader-pdf --hybrid docling-fast file1.pdf

Mode Selection

| Document Type | Mode | Install command or flag |
| --- | --- | --- |
| Standard digital PDF | Fast (default) | `pip install opendataloader-pdf` |
| Complex tables | Hybrid | `pip install "opendataloader-pdf[hybrid]"` + start server |
| Scanned PDFs | Hybrid + OCR | `opendataloader-pdf-hybrid --force-ocr` |
| Multi-language scanned | Hybrid + OCR | `--ocr-lang "zh,en"` |
| Mathematical formulas | Hybrid + formula | `--enrich-formula` |
| Charts needing description | Hybrid + picture | `--enrich-picture-description` |
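
The table can be read as a small decision procedure. The sketch below encodes it as a lookup; the function `recommend_mode` and its parameters are hypothetical, and combining several flags in one result (for example `--force-ocr` together with `--ocr-lang`) is an assumption, since the table lists each flag on its own row. It does not say which executable each flag belongs to.

```python
def recommend_mode(scanned=False, languages=None, complex_tables=False,
                   formulas=False, describe_charts=False):
    """Map document traits to the mode and flags listed in the table.

    Returns a (mode, flags) tuple. Flag spellings are copied from the
    table; how they may be combined is an assumption of this sketch.
    """
    flags = []
    if scanned:
        flags.append("--force-ocr")
        if languages:
            # The table's example value is "zh,en": a comma-separated list.
            flags.extend(["--ocr-lang", ",".join(languages)])
    if formulas:
        flags.append("--enrich-formula")
    if describe_charts:
        flags.append("--enrich-picture-description")
    needs_hybrid = bool(flags) or complex_tables
    return ("hybrid" if needs_hybrid else "fast", flags)
```

For a scanned Chinese/English document, `recommend_mode(scanned=True, languages=["zh", "en"])` yields the hybrid mode with the OCR flags from the table.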

Benchmark Comparison

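| Engine | Overall accuracy | Table accuracy | Speed (s/page) |
| --- | --- | --- | --- |
| opendataloader [hybrid] | 0.90 | 0.93 | 0.43 |
| opendataloader (local) | 0.72 | 0.49 | 0.05 |
| docling | 0.86 | 0.89 | 0.73 |
| marker | 0.83 | 0.81 | 53.93 |
| pymupdf4llm | 0.57 | 0.40 | 0.09 |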
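
To make the speed and accuracy trade-off concrete, the figures above can be turned into a rough time estimate. This is a back-of-the-envelope sketch: it assumes processing time scales linearly with page count and ignores startup cost, and the helper names are hypothetical.

```python
# (overall accuracy, seconds per page), copied from the benchmark table.
BENCHMARKS = {
    "opendataloader [hybrid]": (0.90, 0.43),
    "opendataloader (local)": (0.72, 0.05),
    "docling": (0.86, 0.73),
    "marker": (0.83, 53.93),
    "pymupdf4llm": (0.57, 0.09),
}


def estimated_seconds(engine, pages):
    """Estimate total time, assuming time grows linearly with pages."""
    return BENCHMARKS[engine][1] * pages


def fastest_engine_meeting(min_overall):
    """Return the fastest engine whose overall accuracy is at least min_overall."""
    candidates = [
        (speed, name)
        for name, (overall, speed) in BENCHMARKS.items()
        if overall >= min_overall
    ]
    return min(candidates)[1] if candidates else None
```

Under these assumptions a 1,000-page corpus takes roughly 50 seconds in local mode and roughly 430 seconds in hybrid mode, so local mode suits a first pass and hybrid mode suits documents where table accuracy matters.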

Key Features

Bounding boxes for every element (for source citations in RAG)
XY-Cut++ reading order for multi-column layouts
100% local - no data sent to cloud
AI safety filters - prompt injection protection
Table extraction - borderless tables via hybrid mode
OCR - 80+ languages via hybrid mode
Formula extraction - LaTeX output
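
As an illustration of how bounding boxes support source citations in RAG, the sketch below turns parsed elements into chunks that carry their page and box. It assumes the elements are already loaded as a list of dictionaries using the field names given under Output Formats (`type`, `content`, `bbox`, `page`); the actual nesting of the JSON file is not shown here, and `to_cited_chunks` is a hypothetical helper.

```python
def to_cited_chunks(elements, source):
    """Build retrieval chunks that keep a citation back to the PDF.

    `elements` is assumed to be a flat list of dicts with the keys
    type, content, bbox, and page.
    """
    chunks = []
    section = None
    for element in elements:
        if element.get("type") == "heading":
            # Remember the latest heading so each chunk knows its section.
            section = element.get("content")
            continue
        text = (element.get("content") or "").strip()
        if not text:
            continue  # e.g. images without text content
        chunks.append({
            "text": text,
            "section": section,
            "citation": {
                "source": source,
                "page": element.get("page"),
                "bbox": element.get("bbox"),
            },
        })
    return chunks
```

Storing the citation next to the text lets a retrieval layer point the reader at the exact region of the page an answer came from.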

LangChain Integration

pip install langchain-opendataloader-pdf

from langchain_opendataloader_pdf import OpenDataLoaderPDF

loader = OpenDataLoaderPDF(file_path="document.pdf")
docs = loader.load()

Output Formats

markdown: Structured text with heading hierarchy
json: Element-level with bounding boxes, page numbers, types
html: Rich formatted output

JSON output includes:

type: paragraph, heading, table, image, etc.
content: text content
bbox: [left, bottom, right, top] in PDF points
page: page number
heading_level: for headings
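
Because `bbox` is given as [left, bottom, right, top] in PDF points, drawing a highlight on a rendered page image needs a unit change and a vertical flip. The sketch below assumes the PDF origin is at the bottom-left with y increasing upward, and that the page height in points is known from elsewhere (792 for US Letter, used here only as an example). `bbox_to_pixels` is a hypothetical helper.

```python
def bbox_to_pixels(bbox, page_height_pt, dpi=150):
    """Convert [left, bottom, right, top] in PDF points to
    (x0, y0, x1, y1) in pixels with the origin at the top-left."""
    left, bottom, right, top = bbox
    scale = dpi / 72.0  # one PDF point is 1/72 inch
    x0 = left * scale
    x1 = right * scale
    # Flip the y axis: image rows count down from the top of the page.
    y0 = (page_height_pt - top) * scale
    y1 = (page_height_pt - bottom) * scale
    return tuple(round(value) for value in (x0, y0, x1, y1))
```

For example, a box of `[72, 648, 540, 720]` on a 792-point-tall page rendered at 144 DPI maps to `(144, 144, 1080, 288)`.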

Common Issues

| Issue | Solution |
| --- | --- |
| "java not found" | Install JDK 11+ from adoptium.net |
| Slow repeated calls | Batch files into a single call; each call spawns a JVM |
| Poor table accuracy | Use hybrid mode (`--hybrid docling-fast`) |
| Scanned PDF not extracted | Use hybrid mode with `--force-ocr` |
| Non-English OCR does not work | Specify the languages, e.g. `--ocr-lang "zh,en"` |
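
The batching advice can be sketched as follows. The helper gathers every PDF under a directory and hands the whole list to a single `convert` call, using the keyword arguments shown in Quick Start. The names `collect_pdfs` and `convert_in_one_call` are hypothetical, and the converter is passed in as a parameter so the sketch runs without the package installed.

```python
from pathlib import Path


def collect_pdfs(root):
    """Return a sorted list of PDF paths (as strings) under `root`."""
    return sorted(str(path) for path in Path(root).rglob("*.pdf"))


def convert_in_one_call(root, output_dir, convert):
    """Call `convert` exactly once for all PDFs found under `root`.

    `convert` is expected to accept input_path, output_dir, and format,
    as in the Quick Start example.
    """
    paths = collect_pdfs(root)
    if paths:
        # One call for the whole batch instead of one call per file.
        convert(input_path=paths, output_dir=output_dir, format="markdown,json")
    return paths
```

With the package installed, this would be invoked as `convert_in_one_call("invoices/", "output/", opendataloader_pdf.convert)`, where `invoices/` is a placeholder directory. Quick Start also shows a folder passed directly in `input_path`; an explicit list is mainly useful when the files need filtering first.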
