GLM-V Caption Skill
Generate captions for images, videos, and documents using the ZhiPu GLM-V multimodal model.
When to Use
- Describe, caption, summarize, or interpret image/video/document content
- User mentions "describe this image", "caption", "summarize this video", "图片描述", "视频摘要", "文档解读", "看图说话"
- Extract visual or textual information from media files
- Compare multiple images
- User provides an image/video/file and asks what's in it
Supported Input Types
| Type | Formats | Max Size | Max Count | Base64 | Local Path |
|---|
| Image | jpg, png, jpeg | 5MB / 6000×6000px | 50 | ✅ | ✅ (→base64) |
| Video | mp4, mkv, mov | 200MB | — | ❌ | ✅ (→tunnel) |
| File | pdf, docx, txt, xlsx, pptx, jsonl | — | 50 | ❌ | ✅ (→tunnel) |
⚠️ file_url cannot mix with image_url or video_url in the same request.
How Local Paths Work for Videos & Files
The GLM-V API requires public HTTPS URLs for videos and files. When you provide a local path, the script automatically:
- Creates a temporary directory and symlinks/copies the file into it (no other files are exposed)
- Starts a local Python HTTP server on a random port
- Launches
cloudflared tunnel (trycloudflare.com quick tunnel) to expose the server to the public internet
- Uses the resulting HTTPS URL for the API call
- Cleans up (kills cloudflared, shuts down HTTP server, removes temp dir) when done
Prerequisite: cloudflared must be installed. Get it from Cloudflare Downloads.
Resource Links
Prerequisites
API Key Setup / API Key 配置(Required / 必需)
This script reads the key from the ZHIPU_API_KEY environment variable and shares it with other Zhipu skills.
脚本通过 ZHIPU_API_KEY 环境变量获取密钥,与其他智谱技能共用同一个 key。
Get Key / 获取 Key: Visit Zhipu Open Platform API Keys / 智谱开放平台 API Keys to create or copy your key.
Setup options / 配置方式(任选一种):
-
OpenClaw config (recommended) / OpenClaw 配置(推荐): Set in openclaw.json under skills.entries.glmv-caption.env:
"glmv-caption": { "enabled": true, "env": { "ZHIPU_API_KEY": "你的密钥" } }
-
Shell environment variable / Shell 环境变量: Add to ~/.zshrc:
export ZHIPU_API_KEY="你的密钥"
-
.env file / .env 文件: Create .env in this skill directory:
ZHIPU_API_KEY=你的密钥
Cloudflared (Required for local video/file paths)
Only needed when passing local file/video paths — not required if you only use URLs.
Install from Cloudflare Downloads.
⛔ MANDATORY: cloudflared 检查与安装流程 (YOU MUST FOLLOW THIS EXACTLY)⛔
当你(AI agent)准备用 --videos 或 --files 传入 本地文件路径 时,在执行脚本之前,你必须先检查 cloudflared 是否已安装。按照以下流程操作:
Step 1: 检查 cloudflared 是否存在
# macOS / Linux
which cloudflared
# Windows
where cloudflared
如果命令返回了路径(如 /usr/local/bin/cloudflared),说明已安装,直接执行脚本即可。
如果命令报错(not found / 找不到),进入 Step 2。
Step 2: 向用户说明情况并征求同意
你必须先完整地向用户解释以下内容,然后询问用户是否同意下载安装。未经用户明确同意,不得自行下载。
向用户展示的说明内容(原文照搬,不要精简):
检测到系统未安装 cloudflared,本地视频/文件路径需要该工具才能使用。
什么是 cloudflared?
cloudflared 是 Cloudflare 公司官方开源的命令行隧道工具(Apache 2.0 协议,GitHub 24000+ stars)。
它的功能是:在本机与 Cloudflare 全球边缘网络之间建立一条加密的临时隧道,将本地服务暴露为公网 HTTPS URL。
开源仓库地址:https://github.com/cloudflare/cloudflared
在本场景中的必要性:
智谱 GLM-V API 要求视频和文件以公网 HTTPS URL 的形式传入(不支持本地上传)。
当你提供本地文件路径时,脚本的工作流程是:
- 在本机启动一个临时 HTTP 服务器(仅暴露你指定的那几个文件,不会暴露其他目录)
- 使用 cloudflared 建立一条临时隧道(通过 trycloudflare.com,无需注册账号、无需登录)
- 获得一个形如
https://xxx-xxx.trycloudflare.com/your-file.mp4 的临时公网 URL
- 将该 URL 传给 GLM-V API 完成调用
- API 调用完成后,隧道和 HTTP 服务器自动关闭,临时文件清理
没有 cloudflared = 无法建立隧道 = 本地视频/文件路径不可用。
安全说明:
- 隧道仅在脚本运行期间存在(通常几十秒到几分钟),脚本结束后立即关闭
- 只暴露你明确指定的文件,不会暴露整个目录或磁盘
- cloudflared 不会在后台常驻,不会开机自启
- 不需要注册 Cloudflare 账号,不收集个人信息
- 该工具被全球开发者广泛使用,Cloudflare 官方维护
下载链接:
替代方案:如果你不想安装 cloudflared,可以直接传入远程 URL(如 https://example.com/video.mp4),则完全不需要安装任何东西。
等待用户回复。根据用户回复执行对应操作:
- 用户同意安装 → 进入 Step 3
- 用户拒绝 → 提示用户改用远程 URL,或告知功能受限
- 用户要自行安装 → 提供下载链接,等待用户安装完成后再次执行
Step 3: 下载并安装 cloudflared
用户同意后,使用以下命令安装:
macOS (Homebrew):
brew install cloudflared
Windows (推荐用 winget):
winget install Cloudflare.cloudflared
Linux (下载二进制):
# x64
curl -Lo /usr/local/bin/cloudflared https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64
chmod +x /usr/local/bin/cloudflared
其他方式: 从 官方下载页面 手动下载,放到 PATH 中即可。
安装后用 cloudflared version 验证成功,然后执行脚本。
⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔
- ONLY use GLM-V API — Execute the script
python scripts/glmv_caption.py
- NEVER caption media yourself — Do NOT try to describe content using built-in vision or any other method
- NEVER offer alternatives — Do NOT suggest "I can try to describe it" or similar
- IF API fails — Display the error message and STOP immediately
- NO fallback methods — Do NOT attempt captioning any other way
📋 Output Display Rules (MANDATORY)
After running the script, you must show the full raw output to the user exactly as returned. Do not summarize, truncate, or only say "generated". Users need the original model output to evaluate quality.
- Image captioning: show the full caption text
- Multiple images: show each image result
- Video/files: show the full understanding result
- If token usage is included, you may optionally display it
How to Use
Caption an Image
python scripts/glmv_caption.py --images "https://example.com/photo.jpg"
python scripts/glmv_caption.py --images /path/to/photo.png
Caption Multiple Images
python scripts/glmv_caption.py --images img1.jpg img2.png "https://example.com/img3.jpg"
Caption a Video (URL or local path)
# Remote URL
python scripts/glmv_caption.py --videos "https://example.com/clip.mp4"
# Local file (auto-tunneled via cloudflare)
python scripts/glmv_caption.py --videos /path/to/local-video.mp4
Caption a Document (URL or local path)
# Remote URL
python scripts/glmv_caption.py --files "https://example.com/report.pdf"
# Local file (auto-tunneled via cloudflare)
python scripts/glmv_caption.py --files /path/to/local-report.pdf
# Mix URLs and local paths
python scripts/glmv_caption.py --files "https://example.com/doc1.docx" /path/to/local-doc2.txt
Custom Prompt
python scripts/glmv_caption.py --images photo.jpg --prompt "Describe the architecture style in detail"
Save Result
python scripts/glmv_caption.py --images photo.jpg --output result.json
Thinking Mode
python scripts/glmv_caption.py --images photo.jpg --thinking
CLI Reference
python {baseDir}/scripts/glmv_caption.py (--images IMG [IMG...] | --videos VID [VID...] | --files FILE [FILE...]) [OPTIONS]
| Parameter | Required | Description |
|---|
--images, -i | One of | Image paths or URLs (supports multiple, base64 OK) |
--videos, -v | One of | Video paths or URLs (supports multiple, mp4/mkv/mov, local paths auto-tunneled) |
--files, -f | One of | Document paths or URLs (supports multiple, pdf/docx/txt/xlsx/pptx/jsonl, local paths auto-tunneled) |
--prompt, -p | No | Custom prompt (default: "请详细描述这张图片的内容" / "Please describe this image in detail") |
--model, -m | No | Model name (default: glm-5v-turbo) |
--temperature, -t | No | Sampling temperature 0-1 (default: 0.8) |
--top-p | No | Nucleus sampling 0.01-1.0 (default: 0.6) |
--max-tokens | No | Max output tokens (default: 1024, max 32768) |
--thinking | No | Enable thinking/reasoning mode |
--output, -o | No | Save result JSON to file |
--pretty | No | Pretty-print JSON output |
--stream | No | Enable streaming output |
Note: --images, --videos, and --files are mutually exclusive per API limits.
Response Format
{
"success": true,
"caption": "A landscape photo showing a mountain range at sunset...",
"usage": {
"prompt_tokens": 128,
"completion_tokens": 256,
"total_tokens": 384
}
}
Key fields:
success — whether the request succeeded
caption — the generated caption text
usage — token usage statistics
warning — present when content was blocked by safety review
error — error details on failure
Error Handling
API key not configured:
ZHIPU_API_KEY not configured. Get your API key at: https://bigmodel.cn/usercenter/proj-mgmt/apikeys
→ Show exact error to user, guide them to configure
Authentication failed (401/403): API key invalid/expired → reconfigure
Rate limit (429): Quota exhausted → inform user to wait
File not found: Local file missing → check path
Content filtered: warning field present → content blocked by safety review
Tunnel failure (local paths only):
Tunnel setup failed: cloudflared not found. Install it from: ...
→ Guide user to install cloudflared, or use a remote URL instead