opendataloader-pdf

v1.0.0

Use when parsing PDFs for RAG pipelines, extracting structured data from PDFs, or converting PDFs to Markdown/JSON with bounding boxes for AI processing

by empty_4399 (@emptyguo)
License: MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
The name and description (PDF parsing for RAG, bounding boxes, Markdown/JSON output) align with SKILL.md, which documents the CLI, Python, and Node APIs, the supported modes (fast/hybrid/OCR), and the expected outputs. The required system dependencies (Java, Python/Node) are reasonable for PDF parsing and OCR pipelines.
Instruction Scope
SKILL.md only instructs the user to install the package(s), run conversion commands, and configure the mode, OCR, and languages. It references input file paths and output directories, as expected for this purpose. It does not instruct reading unrelated system files, exporting secrets, or sending data to unexpected external endpoints. The one scope caveat: 'hybrid' mode and 'start server' are mentioned but not detailed. Either could change data flows depending on implementation, so users should verify hybrid behavior before enabling it.
Install Mechanism
This is an instruction-only skill with no install spec. SKILL.md recommends pip/npm installs from the standard registries. The skill itself contains no embedded download URLs or archive extraction steps. Installing from PyPI/npm is a common, low-risk approach, but verify package provenance when installing.
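One way to act on the provenance advice is to compare a downloaded archive's SHA-256 digest against the hash published on the registry's file listing. The sketch below is illustrative only: the function names sha256_of_file and matches_published_hash are hypothetical helpers, not part of opendataloader-pdf, and the expected hash must be copied by hand from a source you trust.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 16):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are not loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's digest equals the published hex digest.

    expected_hex is assumed to be the SHA-256 value copied from the
    registry page; surrounding whitespace and letter case are normalized.
    """
    expected = expected_hex.strip().lower()
    return hmac.compare_digest(sha256_of_file(path), expected)
```

For Python packages, pip can enforce the same check itself through its hash-checking mode (--require-hashes with pinned hashes in a requirements file), which may be preferable to a manual comparison.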
Credentials
The skill declares no required environment variables, credentials, or config paths. The SKILL.md does not reference secret env vars. This is proportionate for a local PDF-extraction tool.
Persistence & Privilege
The always setting is false, and the skill does not request persistent system presence or modify other skills. It does not require elevated privileges or access to other agents' configs.
Assessment
This skill appears coherent and focused on local PDF extraction. Before installing:
1) Verify the opendataloader-pdf package on PyPI/npm and confirm the upstream GitHub/source and release integrity.
2) Be aware that hybrid mode or any server mode may change data flows (it could call external services or require models); read the hybrid-mode docs and any config for remote endpoints or API keys before enabling.
3) Run installations in an isolated environment (virtualenv/container) and test on non-sensitive documents first.
4) Ensure Java 11+ and any OCR dependencies are installed from trusted sources.
5) If you need guarantees about data staying local, confirm implementation details for hybrid/OCR modes in the project's docs or source code.
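The isolated-environment and Java 11+ checks above can be automated before any install. This is a minimal sketch under stated assumptions: parse_java_major, in_virtualenv, and check_environment are hypothetical helper names, the parser assumes the usual `java -version` banner format (for example `openjdk version "11.0.21"` or the legacy `"1.8.0_392"`), and it does not touch opendataloader-pdf itself.

```python
import re
import shutil
import subprocess
import sys


def parse_java_major(version_output):
    """Return the Java major version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Legacy numbering: "1.8.0_392" means Java 8.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def in_virtualenv():
    """Return True when the interpreter runs inside a venv/virtualenv."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def check_environment(min_java=11):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not in_virtualenv():
        problems.append("not running inside a virtual environment")
    java = shutil.which("java")
    if java is None:
        problems.append("java not found on PATH")
    else:
        # `java -version` normally writes its banner to stderr.
        result = subprocess.run([java, "-version"], capture_output=True, text=True)
        major = parse_java_major(result.stderr + result.stdout)
        if major is None or major < min_java:
            problems.append(f"Java {min_java}+ required, found {major}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Run as a script, it prints one warning per unmet precondition and nothing when both checks pass. It says nothing about whether hybrid or server modes keep data local; that still requires reading the project's docs or source.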

