Install
openclaw skills install pdf-ocr-layout基于智谱 GLM-OCR、GLM-4.7 及 GLM-4.6V 的多模态文档深度解析工具。 Use when: - 需要高精度提取文档(PDF/图片)中的表格并转换为 Markdown 格式 - 需要从文档页面中自动裁剪并提取插图、图表为独立文件 - 需要对提取的图表进行深度语义理解(基于 GLM-4.6V 视觉分析) - 需要对提取的表格数据进行逻辑分析(基于 GLM-4.7 文本分析) 核心架构: 1. 视觉提取:GLM-OCR 2. 语义理解:GLM-4.7 (纯文本/表格) + GLM-4.6V (多模态/图像)
openclaw skills install pdf-ocr-layoutThis tool builds a high-precision document parsing pipeline: using GLM-OCR for layout element extraction, calling GLM-4.7 for logical interpretation of table data, and calling GLM-4.6V for multimodal visual interpretation of images and charts.
This Skill consists of two core script stages, orchestrated through glm_ocr_pipeline.py:
scripts/glm_ocr_extract.py)scripts/glm_understanding.py)# Run complete pipeline: extraction -> cropping -> understanding analysis, supports input in .pdf, .jpg, .png and other formats
python scripts/glm_ocr_pipeline.py \
--file_path "/data/report_page.jpg" \
--output_dir "/data/output"
| Parameter | Type | Required | Description |
|---|---|---|---|
| file_path | string | ✅ | Absolute path to input file (supports .pdf, .png, .jpg) |
| output_dir | string | ✅ | Result output directory (used to save cropped images and JSON reports) |
The tool returns a list containing layout elements and their deep understanding:
[
{
"type": "table",
"bbox": [100, 200, 500, 600],
"content_info": "| Revenue | Q1 |\n|---|---|\n| 100M | ... |",
"deep_understanding": "(Generated by GLM-4.7) This table shows Q1 2024 revenue data. Combined with the 'market expansion strategy' mentioned in paragraph 3 of the body text, it can be seen that..."
},
{
"type": "image",
"bbox": [100, 700, 500, 900],
"content_info": "/data/output/images/report_page_img_2.png",
"deep_understanding": "(Generated by GLM-4.6V) This is a system architecture diagram. Visually, it shows the flow of clients connecting to servers through a Load Balancer. Combined with the title 'Fig 3' and context, this diagram is mainly used to illustrate..."
}
]
ZHIPU_API_KEY must be configuredzhipuai, pillow, beautifulsoup4All understanding is based on the complete layout logic of the document (Markdown Context), not isolated fragment analysis.
Multi-page PDFs default to processing the first page. For batch processing, please extend the loop logic at the script level.