Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

OpenDataLoader PDF

v1.0.0

Parse PDFs into Markdown, JSON, or HTML with OCR, table extraction, and AI-enriched descriptions for building RAG pipelines and knowledge bases.

by mingyuan (@zmy1006-sudo)

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for zmy1006-sudo/opendataloader-pdf-zmy.

Install the skill "OpenDataLoader PDF" (zmy1006-sudo/opendataloader-pdf-zmy) from ClawHub.
Skill page: https://clawhub.ai/zmy1006-sudo/opendataloader-pdf-zmy
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install opendataloader-pdf-zmy

ClawHub CLI

npx clawhub@latest install opendataloader-pdf-zmy
Security Scan
VirusTotal: Benign
OpenClaw: Suspicious (medium confidence)
Purpose & Capability
Name/description match the provided instructions and examples (PDF→Markdown/JSON/HTML, OCR, table extraction, hybrid AI backend). However, the registry metadata lists 'source: unknown' and no homepage, while SKILL.md claims a GitHub repo and pip package names. This mismatch makes the package origin harder to verify.
Instruction Scope
SKILL.md instructs installing and running a pip package and a hybrid backend (opendataloader-pdf-hybrid) that listens on a port, and examples use local file system operations (expected). But the docs reference environment variables and services (JAVA_HOME, OPENDATALOADER_HYBRID_URL, and example use of OpenAIEmbeddings) that are not declared in the skill metadata — the agent may rely on secrets or network endpoints not surfaced to the registry.
Install Mechanism
This is an instruction-only skill (no install spec in registry). The SKILL.md explicitly tells users to pip install opendataloader-pdf and related packages; that will fetch third-party code from PyPI (or another index) at runtime. While normal for a library, the registry provides no pinned source or checksum and the registry metadata doesn't link to the claimed GitHub repo, so verifying the package before installation requires manual checking.
Credentials
The skill declares no required env vars or credentials in registry metadata, but the documentation references JAVA_HOME, OPENDATALOADER_HYBRID_URL, and examples call OpenAIEmbeddings (which typically requires an API key). This is a mismatch: sensitive environment variables or API keys may be needed in practice but are not declared, making it unclear what secrets the agent or user must provide.
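Because the variables named above are not declared in the registry metadata, a quick preflight check can show which of them are actually set before anything is run. This is a minimal sketch, not part of the skill: the helper name `missing_env_vars` is hypothetical, and `OPENAI_API_KEY` is an assumption based on the usual convention for OpenAIEmbeddings.

```python
import os

# Names taken from the scan text; OPENAI_API_KEY is an assumption
# (the usual variable read by OpenAI clients), not a declared requirement.
EXPECTED_VARS = ("JAVA_HOME", "OPENDATALOADER_HYBRID_URL", "OPENAI_API_KEY")


def missing_env_vars(names=EXPECTED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    for name in missing_env_vars():
        print(f"not set: {name}")
```

The check only reports names; it never prints values, so it is safe to run in a shared terminal.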
Persistence & Privilege
always is false and there are no install hooks declared. The skill does instruct starting a hybrid backend that listens on a port (network exposure) but it does not request permanent agent-level privileges in the registry metadata.
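To see whether a backend is already listening before or after starting it, a plain TCP probe is enough. The sketch below uses only the standard library; the function name `port_is_open` is hypothetical, and the default port 5002 is the one quoted elsewhere on this page.

```python
import socket


def port_is_open(host="127.0.0.1", port=5002, timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    state = "open" if port_is_open() else "closed"
    print(f"127.0.0.1:5002 is {state}")
```

Probing 127.0.0.1 only shows local reachability; whether the port is exposed to other machines depends on the address the backend binds to, which this page does not state.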
What to consider before installing
The skill looks like a legitimate PDF parser, but verify before you install or run anything:
1) Confirm the package source: find the opendataloader-pdf project on PyPI/GitHub and inspect the repository and release artifacts (the registry metadata currently lists no homepage/source).
2) Expect to run pip install, which will fetch third-party code. Only install from a trusted upstream and review the code if possible.
3) The hybrid backend opens a local port (default 5002). Run it in a sandbox or controlled environment and ensure it does not inadvertently expose files or network access.
4) Be prepared to supply environment variables (JAVA_HOME, OPENDATALOADER_HYBRID_URL, and likely an LLM API key such as OPENAI_API_KEY). Treat those keys as sensitive and only provide them if you trust the package.
5) If you need higher assurance, ask the publisher for the canonical repository URL, versioned releases, and checksums, or run the package in an isolated VM/container and audit its network activity and files.
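If the publisher does supply checksums, comparing them against a downloaded release artifact takes a few lines of standard-library Python. This is a sketch under assumptions: the helper names are hypothetical, SHA-256 is assumed as the digest, and the artifact is assumed to have been fetched without installing it (for example with `pip download --no-deps`).

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=65536):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's SHA-256 digest with a published hex checksum."""
    expected = expected_hex.strip().lower()
    return hmac.compare_digest(sha256_of_file(path), expected)
```

pip can also enforce hashes itself through its hash-checking mode (`--require-hashes` with a requirements file), which avoids a separate script once trusted hashes are known.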

Like a lobster shell, security has layers — review code before you run it.

Tags: ai, document, latest, parser, pdf, rag
87 downloads
0 stars
1 version
Updated 2w ago
v1.0.0
MIT-0

OpenDataLoader PDF Skill

Quick Install

# Basic (CPU, ~20 pages/sec)
pip install -U opendataloader-pdf

# Hybrid mode (AI-enhanced, for complex docs, ~2 pages/sec)
pip install -U "opendataloader-pdf[hybrid]"

# LangChain integration
pip install langchain-opendataloader-pdf

Requirements: Java 11+ (for hybrid mode), Python 3.10+


Core Usage Patterns

1. Parse PDF → Markdown (best for RAG chunking)

from opendataloader_pdf import convert

convert(
    input_path=["file1.pdf", "folder/"],
    output_dir="output/",
    format="markdown"  # clean text, LLM-ready
)

2. Parse PDF → JSON (with bounding boxes for citations)

from opendataloader_pdf import convert

convert(
    input_path=["report.pdf"],
    output_dir="output/",
    format="json",           # structured data + coordinates
    image_output="embedded"  # "off" | "embedded" | "external"
)
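Once the JSON output exists, the coordinates can be pulled out for citation purposes with a small recursive walk. The sketch below is hedged: the output schema is not documented on this page, so the key names `"bounding box"`, `"page number"`, and `"content"`, and the sample document itself, are assumptions for illustration only. Inspect a real output file and adjust the key names before relying on it.

```python
import json


def collect_boxes(node, key="bounding box"):
    """Recursively gather every dict that carries `key`.

    The default key name is an assumption about the output schema.
    """
    found = []
    if isinstance(node, dict):
        if key in node:
            found.append(node)
        for value in node.values():
            found.extend(collect_boxes(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_boxes(item, key))
    return found


if __name__ == "__main__":
    # Hypothetical sample; real output may be shaped differently.
    sample = json.loads("""
    {"kids": [
        {"type": "heading", "page number": 1, "content": "Results",
         "bounding box": [72.0, 700.0, 300.0, 720.0]},
        {"type": "paragraph", "page number": 1, "content": "Accuracy rose.",
         "bounding box": [72.0, 650.0, 540.0, 690.0]}
    ]}
    """)
    for element in collect_boxes(sample):
        print(element["page number"], element["bounding box"], element["content"])
```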

3. LangChain + RAG Pipeline

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
# Current import path; older LangChain releases used langchain.text_splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = OpenDataLoaderPDFLoader(file_path="document.pdf", format="text")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
# → embed → vector store → RAG
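To make the `chunk_size=1000, chunk_overlap=200` settings concrete, here is a deliberately simplified fixed-window splitter. It is not how RecursiveCharacterTextSplitter works (that class prefers to break on separators such as paragraphs and sentences); the function name `sliding_chunks` is hypothetical and the sketch only illustrates what the two numbers mean.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each window advances
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


if __name__ == "__main__":
    chunks = sliding_chunks("x" * 2500)
    print([len(chunk) for chunk in chunks])
```

Each window starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next. That repetition is what keeps a sentence that straddles a boundary retrievable from at least one chunk.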

CLI Commands

# Basic: single file or folder
opendataloader-pdf file1.pdf file2.pdf folder/

# Complex tables / nested structure (hybrid mode)
opendataloader-pdf --hybrid docling-fast file1.pdf

# Start hybrid backend first, then:
opendataloader-pdf-hybrid --port 5002
# (in another terminal)
opendataloader-pdf --hybrid docling-fast file1.pdf

# OCR for scanned PDFs
opendataloader-pdf-hybrid --port 5002 --force-ocr file1.pdf

# Math formula extraction (LaTeX)
opendataloader-pdf-hybrid --enrich-formula
opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf

# Chart/image AI description
opendataloader-pdf-hybrid --enrich-picture-description
opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf

# Security: sanitize prompt injection
opendataloader-pdf file1.pdf --sanitize
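For batch jobs it can help to assemble these command lines in one place and inspect them before anything is executed. The sketch below only builds an argument list from the flags shown above (`--hybrid`, `--hybrid-mode`, `--sanitize`); the helper name `build_cli_args` is hypothetical, and placing flags before the input paths is an assumption about what the CLI accepts.

```python
import shlex


def build_cli_args(inputs, hybrid=None, hybrid_mode=None, sanitize=False):
    """Assemble an opendataloader-pdf argv list from the flags listed above."""
    args = ["opendataloader-pdf"]
    if hybrid:
        args += ["--hybrid", hybrid]
    if hybrid_mode:
        args += ["--hybrid-mode", hybrid_mode]
    if sanitize:
        args.append("--sanitize")
    args += list(inputs)  # files or folders come last (ordering is an assumption)
    return args


if __name__ == "__main__":
    argv = build_cli_args(["file1.pdf"], hybrid="docling-fast", hybrid_mode="full")
    # Print the command for review instead of running it.
    print(shlex.join(argv))
```

Passing the resulting list to `subprocess.run(argv, check=True)` would execute it; keeping the build step separate makes it easy to log or review commands first.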

Output Format Selection Guide

| Document Type | Recommended Format | Mode |
|---|---|---|
| Standard digital PDF | markdown | Basic |
| Complex/nested tables | json | Hybrid |
| Scanned PDFs | any + --force-ocr | Hybrid |
| Math formulas | markdown + --enrich-formula | Hybrid |
| Charts needing description | markdown + --enrich-picture-description | Hybrid |
| Medical reports (cite-able) | json | Hybrid |
| RAG knowledge base | markdown | Basic or Hybrid |

Benchmark Results (v2.0)

| Metric | Score |
|---|---|
| Overall Accuracy | 0.90 |
| Reading Order | 0.94 |
| Table Accuracy | 0.93 |
| Heading Accuracy | 0.83 |

License: Apache 2.0 | GitHub: opendataloader-project/opendataloader-pdf
