Install
openclaw skills install pdf-visionExtract text content from image-based/scanned PDFs using multiple vision APIs with automatic fallback. Supports Xflow (qwen3-vl-plus) and ZhipuAI (GLM-4.6V-Flash, GLM-5) vision models. This skill converts PDF pages to images and uses AI vision capabilities to extract structured text, tables, and content from scanned documents that cannot be processed with traditional text extraction methods.
openclaw skills install pdf-visionThis skill handles image-based or scanned PDFs that contain no selectable text. It supports multiple vision APIs with automatic fallback:
qwen3-vl-plus (your primary vision model)glm-4.6v-flash (free vision model with fallback support)glm-5 (text-only, but may work with some image prompts)Unlike traditional PDF text extraction tools (pdftotext, pdfplumber) which only work on text-based PDFs, this skill can process:
| Provider | Model | Type | Context | Free |
|---|---|---|---|---|
| Xflow | qwen3-vl-plus | Vision + Text | 131K | ❌ |
| ZhipuAI | glm-4.6v-flash | Vision + Text | 32K | ✅ |
| ZhipuAI | glm-5 | Text-only* | 128K | ❌ |
| Provider | Model | Context | Free |
|---|---|---|---|
| ZhipuAI | glm-4-flash-250414 | 128K | ✅ |
| ZhipuAI | cogview-3-flash | 32K | ✅ |
*Note: glm-5 is primarily text-only but may handle image prompts in some cases.
Your OpenClaw must be configured with both providers:
Xflow Configuration (already set up):
models.providers.openai.baseUrl: https://apis.iflow.cn/v1models.providers.openai.apiKey: Your Xflow API keyZhipuAI Configuration (update token):
models.providers.zhipuai.baseUrl: https://open.bigmodel.cn/api/paas/v4models.providers.zhipuai.apiKey: Your ZhipuAI API tokenpypdfium2 Python library (for PDF to image conversion)curl (for API calls)base64 (for image encoding)pypdfium2
Uses Xflow first, falls back to ZhipuAI if needed:
./scripts/pdf_vision.py --pdf-path /path/to/document.pdf
Force a specific model for cost or performance reasons:
# Use free GLM-4.6V-Flash model
./scripts/pdf_vision.py --pdf-path document.pdf --model zhipuai/glm-4.6v-flash
# Use specific Xflow model
./scripts/pdf_vision.py --pdf-path document.pdf --model openai/qwen3-vl-plus
# Short form (auto-detects provider)
./scripts/pdf_vision.py --pdf-path document.pdf --model glm-4.6v-flash
./scripts/pdf_vision.py --pdf-path invoice.pdf --prompt "Extract as JSON: vendor, date, total" --model glm-4.6v-flash
# Process page 3 specifically
./scripts/pdf_vision.py --pdf-path book.pdf --page 3 --output page3.txt
The skill reads configuration from your OpenClaw config file (~/.openclaw/openclaw.json):
models.providers.openai.baseUrl & apiKeymodels.providers.zhipuai.baseUrl & apiKeyReturns extracted text content as a string. For structured data requests, the AI model will format output according to your prompt instructions.
Command: --model glm-4.6v-flash
Use case: When you want to use free vision capabilities
Result: Good quality extraction at no cost
Command: --model qwen3-vl-plus
Use case: When you need maximum accuracy and complex layout understanding
Result: Best possible extraction quality
Command: No --model flag
Use case: Production environments where reliability is key
Result: Uses best available model, falls back gracefully
The skill follows this workflow:
pypdfium2For debugging, temporary files are created in /tmp/:
/tmp/pdf_vision_page.png - converted image/tmp/pdf_vision_payload_*.json - API request payload/tmp/pdf_vision_response_*.json - API responseThis skill complements the standard pdf skill:
pdf skill for text-based PDFs (faster, no API cost)pdf-vision skill for image-based/scanned PDFs (requires vision API)Both skills can be used together in a fallback pattern:
pdf skill firstpdf-vision skillReplace the placeholder token in your config:
# Replace YOUR_ACTUAL_GLM_TOKEN with your real token
sed -i 's/YOUR_GLM_API_TOKEN_HERE/YOUR_ACTUAL_GLM_TOKEN/g' ~/.openclaw/openclaw.json