Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using it.

Pdf Vision

v1.0.0

Extract text content from image-based/scanned PDFs using multiple vision APIs with automatic fallback. Supports Xflow (qwen3-vl-plus) and ZhipuAI (GLM-4.6V-Flash).


Install


Install with OpenClaw

Best for remote or guided setup. Copy the prompt below, then paste it into OpenClaw to install lpq6/pdf-vision.

Install the skill "Pdf Vision" (lpq6/pdf-vision) from ClawHub.
Skill page: https://clawhub.ai/lpq6/pdf-vision
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line


Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install pdf-vision

ClawHub CLI


npx clawhub@latest install pdf-vision
Security Scan
VirusTotal: Pending
OpenClaw: Suspicious (medium confidence)
Purpose & Capability
The core files and SKILL.md align with the stated purpose: converting PDF pages to images and calling vision-capable models (Xflow / ZhipuAI). However, the repository also includes scripts unrelated to PDF extraction (scripts/create_github_repo.py) which try to locate/use a GITHUB_TOKEN (including by parsing ~/.bashrc) and instruct the user about pushing code to GitHub. That GitHub-oriented functionality is not described in the skill metadata or SKILL.md and is unnecessary for PDF extraction.
Instruction Scope
SKILL.md and the main scripts only instruct reading your OpenClaw config (~/.openclaw/openclaw.json), converting PDFs to images (pypdfium2), and calling configured model endpoints; this is appropriate. But create_github_repo.py attempts to read a GITHUB_TOKEN from environment variables and, failing that, parses ~/.bashrc to find one, which is outside the documented scope. test_skill.sh also references a user-specific file path (/home/lpq/.openclaw/workspace/林佩权课表.pdf), indicating leftover developer-specific test artifacts. These extras expand the runtime surface beyond what the skill promises.
Install Mechanism
There is no install specification (instruction-only skill) and no remote download or archive extraction. The package contains local Python and shell scripts only. No network-installed code is fetched at install time by the skill itself.
Credentials
The skill legitimately reads API keys from ~/.openclaw/openclaw.json for the vision providers, which is proportional. However, create_github_repo.py looks for GITHUB_TOKEN (env or inside ~/.bashrc) without that credential being declared or documented in SKILL.md. That is unexpected for an OCR skill and could lead to accidental use of a shell-stored token if the helper script is executed.
Persistence & Privilege
The skill does not request always:true and does not modify other skills or system-wide settings. The scripts create temporary files under /tmp (documented) and otherwise run on demand. There is no autonomous persistence or privilege escalation requested.
What to consider before installing
This skill's main OCR functionality appears coherent and reasonable: it converts PDF pages to images, reads your OpenClaw config for provider baseUrls/apiKeys, and posts image data to those endpoints. However, two red flags should be addressed before installing or running it:

  1. Unrelated GitHub helper script: scripts/create_github_repo.py is unrelated to PDF extraction and will attempt to find and use a GITHUB_TOKEN (from the environment or by parsing ~/.bashrc) to call the GitHub API. Only run that script if you understand and trust it; otherwise remove or ignore it. Storing tokens in shell RC files is risky; prefer a dedicated credential store.
  2. Residual test artifacts: test_skill.sh contains a hardcoded, author-local PDF path; review and update or delete it to avoid accidental execution that references your filesystem.

Recommended actions before installation:

  • Inspect and (if not needed) delete or move scripts/create_github_repo.py from the skill directory.
  • Search the skill for any other helper scripts that access credentials or user files and remove or sandbox them.
  • Run the main extraction script in a sandboxed environment (isolated user account or container) first and verify it only reads ~/.openclaw/openclaw.json and /tmp files.
  • Ensure your OpenClaw config stores only the credentials you intend to use and is not world-readable.

If the repository owner clarifies that the GitHub script is intentionally included (e.g., as a packaging convenience) and documents it in SKILL.md, and if you plan to use it only in a controlled way, the concern is lower. If you cannot get that clarification, treat the extra scripts as suspicious and remove them before use.

Like a lobster shell, security has layers — review code before you run it.

latest: vk978thg4zhg88achbxdrxxedzn841w3c
72 downloads · 0 stars · 1 version · Updated 3w ago
v1.0.0
MIT-0

PDF Vision Extraction Skill (Enhanced)

Overview

This skill handles image-based or scanned PDFs that contain no selectable text. It supports multiple vision APIs with automatic fallback:

Primary Models

  • Xflow: qwen3-vl-plus (your primary vision model)
  • ZhipuAI: glm-4.6v-flash (free vision model with fallback support)
  • Fallback: glm-5 (text-only, but may work with some image prompts)
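
For reference, here is the fallback order as a minimal Python sketch (model IDs taken from this README; the actual script's internal list may differ):

FALLBACK_CHAIN = [
    "openai/qwen3-vl-plus",    # Xflow primary (Xflow is configured under the "openai" provider key)
    "zhipuai/glm-4.6v-flash",  # free vision fallback
    "zhipuai/glm-5",           # last resort: text-only, may reject image input
]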

Unlike traditional PDF text extraction tools (pdftotext, pdfplumber) which only work on text-based PDFs, this skill can process:

  • Scanned documents
  • Image-only PDFs
  • Photographed documents
  • Handwritten notes (with limitations)
  • Complex layouts with tables and formatting

Supported Models

Vision-Capable Models

Provider   Model            Type           Context   Free
Xflow      qwen3-vl-plus    Vision + Text  131K      ❌ (paid)
ZhipuAI    glm-4.6v-flash   Vision + Text  32K       ✅
ZhipuAI    glm-5            Text-only*     128K      —

Additional Text Models (for fallback)

Provider   Model                Context   Free
ZhipuAI    glm-4-flash-250414   128K      ✅
ZhipuAI    cogview-3-flash      32K       ✅

*Note: glm-5 is primarily text-only but may handle image prompts in some cases.

Prerequisites

1. API Configuration

Your OpenClaw must be configured with both providers:

Xflow Configuration (already set up):

  • models.providers.openai.baseUrl: https://apis.iflow.cn/v1
  • models.providers.openai.apiKey: Your Xflow API key

ZhipuAI Configuration (update token):

  • models.providers.zhipuai.baseUrl: https://open.bigmodel.cn/api/paas/v4
  • models.providers.zhipuai.apiKey: Your ZhipuAI API token
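
As a minimal sketch, assuming the config is plain JSON at the documented path, a script can read these keys like this (the skill's own scripts may differ in detail):

import json
from pathlib import Path

# Load the OpenClaw config documented above.
config = json.loads((Path.home() / ".openclaw" / "openclaw.json").read_text())

providers = config["models"]["providers"]
xflow_base = providers["openai"]["baseUrl"]    # https://apis.iflow.cn/v1
xflow_key = providers["openai"]["apiKey"]
zhipu_base = providers["zhipuai"]["baseUrl"]   # https://open.bigmodel.cn/api/paas/v4
zhipu_key = providers["zhipuai"]["apiKey"]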

2. Required System Tools

  • pypdfium2 Python library (for PDF to image conversion)
  • curl (for API calls)
  • base64 (for image encoding)

3. Python Libraries (already installed)

pypdfium2

Usage

Automatic Fallback Mode (Default)

Uses Xflow first, falls back to ZhipuAI if needed:

./scripts/pdf_vision.py --pdf-path /path/to/document.pdf

Specific Model Selection

Force a specific model for cost or performance reasons:

# Use free GLM-4.6V-Flash model
./scripts/pdf_vision.py --pdf-path document.pdf --model zhipuai/glm-4.6v-flash

# Use specific Xflow model  
./scripts/pdf_vision.py --pdf-path document.pdf --model openai/qwen3-vl-plus

# Short form (auto-detects provider)
./scripts/pdf_vision.py --pdf-path document.pdf --model glm-4.6v-flash

Structured Data Extraction

./scripts/pdf_vision.py --pdf-path invoice.pdf --prompt "Extract as JSON: vendor, date, total" --model glm-4.6v-flash
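
To consume that output programmatically, you can parse the script's stdout; a hedged sketch, assuming pdf_vision.py prints the model's reply verbatim (models sometimes wrap JSON in markdown fences, so strip those first):

import json
import re
import subprocess

result = subprocess.run(
    ["./scripts/pdf_vision.py", "--pdf-path", "invoice.pdf",
     "--prompt", "Extract as JSON: vendor, date, total",
     "--model", "glm-4.6v-flash"],
    capture_output=True, text=True, check=True,
)
# Strip optional ```json ... ``` fences the model may add around its answer.
raw = re.sub(r"^```(?:json)?\s*|\s*```$", "", result.stdout.strip())
invoice = json.loads(raw)
print(invoice["vendor"], invoice["date"], invoice["total"])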

Multi-page PDF Handling

# Process page 3 specifically
./scripts/pdf_vision.py --pdf-path book.pdf --page 3 --output page3.txt
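
Since the script handles one page per invocation (see Limitations), a whole document can be processed with a small driver loop; a sketch, assuming --page is 1-based as in the example above:

import subprocess
import pypdfium2 as pdfium

pdf_path = "book.pdf"
num_pages = len(pdfium.PdfDocument(pdf_path))  # pypdfium2 documents support len()

for page in range(1, num_pages + 1):
    subprocess.run(
        ["./scripts/pdf_vision.py", "--pdf-path", pdf_path,
         "--page", str(page), "--output", f"page{page}.txt"],
        check=True,
    )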

Configuration

Configuration File

The skill reads configuration from your OpenClaw config file (~/.openclaw/openclaw.json):

  • models.providers.openai.baseUrl & apiKey
  • models.providers.zhipuai.baseUrl & apiKey

Output Format

Returns extracted text content as a string. For structured data requests, the AI model will format output according to your prompt instructions.

Examples

Cost-Optimized Extraction (Free Model)

Command: --model glm-4.6v-flash
Use case: When you want to use free vision capabilities
Result: Good quality extraction at no cost

High-Quality Extraction (Premium Model)

Command: --model qwen3-vl-plus
Use case: When you need maximum accuracy and complex layout understanding
Result: Best possible extraction quality

Automatic Fallback (Recommended)

Command: No --model flag
Use case: Production environments where reliability is key
Result: Uses best available model, falls back gracefully

Model Comparison

GLM-4.6V-Flash (Free)

  • ✅ Completely free
  • ✅ Good Chinese text recognition
  • ✅ Decent table structure preservation
  • ⚠️ Lower context window (32K vs 131K)
  • ⚠️ May struggle with very complex layouts

Qwen3-VL-Plus (Premium)

  • ✅ Superior image understanding
  • ✅ Excellent table and structure recognition
  • ✅ Larger context window (131K)
  • ✅ Better handling of mixed languages
  • ❌ Requires paid API access

Limitations

  • Single page processing: Currently processes one page at a time
  • Image quality: Better results with higher resolution scans
  • Complex layouts: May struggle with very dense or overlapping text
  • Handwriting: Limited accuracy with handwritten content
  • File size: Large PDFs may exceed API token limits

Technical Implementation

The skill follows this workflow:

  1. PDF to Image: Converts specified PDF page to PNG using pypdfium2
  2. Model Selection: Chooses model based on user preference or fallback logic
  3. API Call: Sends image + prompt to selected vision API endpoint
  4. Response Parsing: Extracts and returns the AI-generated text content
  5. Fallback: If primary model fails, tries alternative models
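
A condensed Python sketch of this workflow, assuming the providers expose OpenAI-style /chat/completions endpoints that accept base64 data-URI images (the skill's actual scripts shell out to curl and may differ in detail):

import base64
import json
import urllib.request

import pypdfium2 as pdfium

def page_to_png(pdf_path, page_index=0, out="/tmp/pdf_vision_page.png"):
    # Step 1: render one PDF page to PNG via pypdfium2 (to_pil() needs Pillow).
    page = pdfium.PdfDocument(pdf_path)[page_index]
    page.render(scale=2.0).to_pil().save(out)
    return out

def call_vision(base_url, api_key, model, png_path, prompt):
    # Steps 3-4: send image + prompt, return the model's text reply.
    with open(png_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def extract(pdf_path, prompt, providers):
    # Steps 2 and 5: walk the fallback chain until a model succeeds.
    png = page_to_png(pdf_path)
    for base_url, api_key, model in providers:
        try:
            return call_vision(base_url, api_key, model, png, prompt)
        except Exception as err:
            print(f"{model} failed ({err}); trying next model")
    raise RuntimeError("all models failed")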

For debugging, temporary files are created in /tmp/:

  • /tmp/pdf_vision_page.png - converted image
  • /tmp/pdf_vision_payload_*.json - API request payload
  • /tmp/pdf_vision_response_*.json - API response

Integration Notes

This skill complements the standard pdf skill:

  • Use pdf skill for text-based PDFs (faster, no API cost)
  • Use pdf-vision skill for image-based/scanned PDFs (requires vision API)

Both skills can be used together in a fallback pattern:

  1. Try pdf skill first
  2. If no text extracted, fall back to pdf-vision skill
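
A sketch of that pattern in Python, using pdfplumber for the text-based attempt and shelling out to this skill only when no selectable text comes back:

import subprocess

import pdfplumber

def extract_text(pdf_path):
    # 1. Try ordinary text extraction first (fast, no API cost).
    with pdfplumber.open(pdf_path) as pdf:
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)
    if text.strip():
        return text
    # 2. No selectable text found: fall back to the vision-based skill.
    result = subprocess.run(
        ["./scripts/pdf_vision.py", "--pdf-path", pdf_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout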

Cost Optimization Tips

  1. Use GLM-4.6V-Flash for routine tasks - it's free and quite capable
  2. Reserve Qwen3-VL-Plus for complex documents - when you need maximum accuracy
  3. Test both models on your document types - choose based on your quality requirements
  4. Monitor API usage - track which models you're using most

Update Your GLM API Token

Replace the placeholder token in your config:

# Replace YOUR_ACTUAL_GLM_TOKEN with your real token
sed -i 's/YOUR_GLM_API_TOKEN_HERE/YOUR_ACTUAL_GLM_TOKEN/g' ~/.openclaw/openclaw.json
