universal-pdf-vision-parser

Extract multilingual document content and language learning notes (French, German, Japanese, Spanish, etc.) from PDFs using multimodal vision (Qwen-VL-Max)....

MIT-0 · Free to use, modify, and redistribute. No attribution required.

⭐ 0 · 249 · 5 current installs · 6 all-time installs

byM Z@MingEnsiie

MIT-0

Security Scan

VirusTotal

Benign

View report →

OpenClaw

Suspicious

high confidence

Purpose & Capability

The skill's name, description, SKILL.md, and code all align: converting PDF pages to images and sending them to Qwen‑VL‑Max for transcription. However, the registry metadata claims no required env vars or credentials while SKILL.md and the script require a DashScope API key (either via --api-key or DASHSCOPE_API_KEY). This metadata omission is an incoherence worth flagging.

✓

Instruction Scope

The runtime instructions and the script remain within the stated purpose: render PDF pages to PNG, base64-encode them, send them plus a transcription prompt to a multimodal API, and write Markdown. The agent is not instructed to read unrelated files or system state.

ℹ

Install Mechanism

There is no formal install spec in the registry (instruction-only), but SKILL.md tells the user to pip install pymupdf and dashscope. That is typical for a Python-based, instruction-only skill, but the lack of declared dependencies in the registry is another metadata inconsistency.

Credentials

The code expects an API key (DASHSCOPE_API_KEY or CLI --api-key) to call an external service; this is proportionate to the function. The concern is that the registry lists no required credentials. Also note that the skill transmits full-page base64 images to a third-party API — that is necessary for the stated purpose but has privacy/breach implications for sensitive documents.

✓

Persistence & Privilege

The skill does not request always:true, does not modify other skills or system-wide settings, and does not persist credentials beyond setting dashscope.api_key at runtime. No elevated or permanent privileges are requested.

What to consider before installing

This skill appears to do what it says (convert PDF pages to images and send them to Qwen‑VL‑Max for transcription), but there are two issues to consider before installing: - Metadata mismatch: The registry claims no required credentials, but the SKILL.md and script require a DashScope API key (DASHSCOPE_API_KEY or --api-key) and Python packages. Confirm the registry/provider and why credentials/dependencies were omitted. - Data exposure: The skill uploads full page images (base64 PNGs) to an external service. Do not run it on sensitive or confidential PDFs unless you trust the DashScope endpoint and have reviewed its privacy/billing/retention policies. Consider using local OCR alternatives for sensitive data. Recommended actions: - Verify the skill's source and author (no homepage and unknown source are risk indicators). - Confirm API key scope and permissions (least-privilege) and monitor billing/usage for unexpected activity. - Test with non-sensitive documents first and inspect network activity if possible. - If you need stronger assurance, ask the publisher to update registry metadata to declare required env vars and dependencies, and provide a canonical homepage or repo.

Like a lobster shell, security has layers — review code before you run it.

Current versionv1.0.0

Download zip

latestvk97fdffcq9s0g6q9k39chbqrxd827x5c

License

MIT-0

Free to use, modify, and redistribute. No attribution required.

Termshttps://spdx.org/licenses/MIT-0.html

SKILL.md

Universal PDF Vision Parser Skill

Version: 0.1

This skill is a high-end multilingual document digitizer. It uses multimodal vision to 'look' at each PDF page, making it perfect for language learning notes, bilingual documents, and complex layouts that standard OCR fails to capture.

Prerequisites

DashScope API Key: A valid key from Alibaba Cloud Bailian with qwen-vl-max access.
Environment:

pip install pymupdf dashscope

Usage

Basic Command

python scripts/vision_parse.py --pdf <path_to_pdf> --out <path_to_output.md> --api-key <YOUR_API_KEY> --max-pages 2

--max-pages: (Optional) Max pages to process. Defaults to 2. Set to -1 for all pages.

Agentic Workflow

Visual Scanning: Converts PDF pages to 300 DPI PNGs.
Expert Transcription: Qwen-VL-Max identifies the language and transcribes terms, translations, and explanations.
Markdown Structuring: Automatically formats content with bold keywords, italicized meanings, and clean tables.

Examples

User: "Convert this German-Chinese note to markdown: notes.pdf"

Agent Action:

python scripts/vision_parse.py --pdf notes.pdf --out notes.md

Files

2 total

Select a file

Select a file to preview.

Comments

Loading comments…