Visual models analyze video to generate reports and highlight frames, provided by the Vidu API.
v1.0.1Extract and analyze keyframes from MP4, MOV, AVI videos to identify themes, generate reports, and provide 3 representative screenshots.
Like a lobster shell, security has layers — review code before you run it.
Video Analyzer
Overview
Extract keyframes from videos, analyze content with vision models, and generate comprehensive reports with 3 representative screenshots. Optimized for token efficiency using I-frame detection.
Workflow
Video Input → Extract Keyframes → Vision Analysis → Select Top 3 → Generate Report → Send Output
Step-by-Step Process
1. Download Video (if from Feishu)
When user sends video via Feishu, the file is auto-saved to:
~/.openclaw/media/inbound/<filename>.mp4
2. Extract Video Metadata
ffmpeg -i <video_path> 2>&1 | grep -E "(Duration|Video)"
Returns: duration, resolution, bitrate, codec info.
3. Extract Keyframes
Use the provided script for optimal keyframe extraction:
bash ~/.openclaw/workspace/skills/video-analyzer/scripts/extract_keyframes.sh <video_path> [output_dir]
Parameters:
video_path: Path to video file (required)output_dir: Output directory (optional, defaults to~/.openclaw/media/keyframes/)
Output: JPEG images at 640px width, named keyframe_XX.jpg
Token efficiency: Uses I-frame detection to extract only meaningful frames, reducing token consumption by ~7% vs uniform sampling.
4. Analyze with Vision Model
Use the image tool with all extracted keyframes:
prompt: "Analyze these keyframes from a video. Please:
1. Describe the video's theme and content
2. Select 3 most representative frames (explain why)"
5. Generate Report
Structure the analysis report:
## 📌 Video Theme
[Description]
## 🖼️ Representative Screenshots
| Frame | Reason |
|-------|--------|
| frame_XX | [Why representative] |
6. Send Output
Send via Feishu:
- Analysis report (text message)
- 3 representative screenshots (image messages)
Token Consumption Reference
| Video Length | Keyframes | Estimated Tokens |
|---|---|---|
| 5 seconds | 5-8 | ~8,000-14,000 |
| 15 seconds | 12-16 | ~20,000-28,000 |
| 30 seconds | 20-30 | ~35,000-50,000 |
Optimization tips:
- Images account for 95%+ of tokens
- Shorter videos = fewer tokens
- Low-motion videos produce fewer keyframes
Resources
scripts/
extract_keyframes.sh- Extract keyframes using ffmpeg I-frame detection
references/
ffmpeg_reference.md- Advanced ffmpeg commands for video processing
Comments
Loading comments...
