Install
openclaw skills install video-reader
Tool-driven video question answering with frame extraction, sub-agent analysis, and audio transcription.
You are a video QA orchestrator. You do NOT analyze images yourself; you dispatch sub-agents to do it.
OBSERVE → THINK → ACT → MEMORY (loop, max 10 iterations)
At the start of each turn, read the memory file first. Do NOT rely on previous tool outputs in the conversation history.
The memory file is your single source of truth. Tool outputs from prior turns may be lost or truncated. Always:
Read /tmp/videoarm_memory.json at the start of each turn
Main Agent (Orchestrator)
├── Decides strategy: which time ranges, what questions
├── Calls videoarm-extract-frames → gets image path
├── Calls videoarm-audio → gets transcript
├── Spawns sub-agent(s) with:
│ ├── Image path (sub-agent reads it with clean context)
│ ├── Specific question to answer
│ └── Relevant context (transcript excerpt, options)
├── Collects sub-agent results → writes to memory as frame_analyses
├── Writes findings to memory
└── Decides: answer or continue (max 10 iterations)
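The loop above can be sketched in Python as follows. This is a minimal illustration, not the skill's implementation: the four callables (`load_memory`, `decide`, `act`, `save_memory`) are hypothetical placeholders for reading the memory file, choosing the next step, running a tool or sub-agent, and persisting results.

```python
MAX_ITERATIONS = 10  # iteration cap stated in the skill description


def run_loop(load_memory, decide, act, save_memory, max_iterations=MAX_ITERATIONS):
    """Run OBSERVE -> THINK -> ACT -> MEMORY until an answer or the cap.

    All four callables are hypothetical stand-ins:
      load_memory() -> dict          reads the memory file
      decide(memory) -> dict         returns {"action": "answer", ...} or another plan
      act(plan) -> (category, entry) runs a tool or sub-agent
      save_memory(memory) -> None    writes the memory file
    """
    for _ in range(max_iterations):
        memory = load_memory()              # OBSERVE: re-read memory, not chat history
        plan = decide(memory)               # THINK: answer now or gather more evidence
        if plan["action"] == "answer":
            memory["current_answer"] = plan["answer"]
            memory["confidence"] = plan["confidence"]
            save_memory(memory)
            break
        category, entry = act(plan)         # ACT: extract frames, transcribe, or spawn
        entry["iteration"] = memory.get("iterations_used", 0) + 1
        memory.setdefault(category, []).append(entry)
        memory["iterations_used"] = entry["iteration"]
        save_memory(memory)                 # MEMORY: persist before the next turn
    return load_memory()
```

Re-reading memory at the top of every iteration mirrors the rule that the memory file, not the conversation history, is the source of truth.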
Why sub-agents?
/tmp/videoarm_memory.json
Structure (3 categories matching source agent pipeline):
{
"video_path": "/path/to/video.mp4",
"question": "Who used a tool?",
"options": ["A. ...", "B. ...", "C. ...", "D. ..."],
"metadata": {"duration": 2689.74, "fps": 25.0, "total_frames": 67243},
"scene_snapshots": [
{
"iteration": 1,
"reason": "Initial scan of opening segment",
"frame_interval": [0, 1500],
"caption": "Caption: Person X is working with power tools in a workshop"
}
],
"audio_snippets": [
{
"iteration": 2,
"reason": "Check dialogue in middle section",
"segments": [
{
"frame_interval": [3000, 4500],
"text": "he really needs work-life balance",
"start_time": 120.0,
"end_time": 180.0
}
],
"text": "he really needs work-life balance"
}
],
"frame_analyses": [
{
"iteration": 3,
"reason": "Verify tool usage in frames 500-1000",
"frame_interval": [500, 1000],
"question": "What tool is the person using?",
"answer": "The person is using an electric drill on a watermelon",
"confidence": 0.85
}
],
"current_answer": "D",
"confidence": 0.9,
"iterations_used": 3
}
| Category | Source Tool | What It Records |
|---|---|---|
| `scene_snapshots` | `videoarm-extract-frames` + sub-agent caption | Frame navigation: which ranges were viewed and what was seen |
| `audio_snippets` | `videoarm-audio` | Audio transcription segments with frame-aligned timestamps |
| `frame_analyses` | Sub-agent (clip analyzer pattern) | Targeted analysis: answer + confidence for specific questions about frame ranges |
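A minimal Python sketch of how the memory file might be created and updated, assuming the JSON layout shown above. The helper names `init_memory` and `append_entry` are hypothetical and not part of the skill; the initial values for `current_answer` and `confidence` are also assumptions.

```python
import json
from pathlib import Path

MEMORY_PATH = Path("/tmp/videoarm_memory.json")
CATEGORIES = ("scene_snapshots", "audio_snippets", "frame_analyses")


def init_memory(video_path, question, options, metadata, path=MEMORY_PATH):
    """Create the memory file with question, metadata, and empty categories."""
    memory = {
        "video_path": video_path,
        "question": question,
        "options": options,
        "metadata": metadata,
        "scene_snapshots": [],
        "audio_snippets": [],
        "frame_analyses": [],
        "current_answer": None,   # assumed initial value
        "confidence": 0.0,        # assumed initial value
        "iterations_used": 0,
    }
    path.write_text(json.dumps(memory, indent=2))
    return memory


def append_entry(category, entry, path=MEMORY_PATH):
    """Append one finding to a category and write the file back immediately."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown memory category: {category}")
    memory = json.loads(path.read_text())
    memory[category].append(entry)
    path.write_text(json.dumps(memory, indent=2))
    return memory
```

Writing the file back after every append keeps findings safe even if later tool output is truncated.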
Download a video from a URL (YouTube, etc.).
HTTPS_PROXY=http://127.0.0.1:7890 videoarm-download <url>
Returns: {"path": "/path/to/video.mp4", "cached": false}
Get video metadata.
videoarm-info <path>
Returns: {"fps": 25.0, "total_frames": 67243, "duration": 2689.74, "has_audio": true}
Extract frames as a single grid image. Frames are allocated across ranges in proportion to each range's length. The tool returns only the image path; do NOT read the image yourself.
videoarm-extract-frames --video <path> \
--ranges '[{"start_frame":0,"end_frame":1500}]' \
--num-frames 30
Returns: {"image_path": "/tmp/xxx.jpg", ...}
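To illustrate what "distributed proportionally by range length" could mean, here is a small Python sketch. It is an assumption about the allocation, not the tool's actual code: range length is taken as `end_frame - start_frame`, and rounding uses the largest-remainder method so the counts always sum to `--num-frames`. The function name `allocate_frames` is hypothetical.

```python
def allocate_frames(ranges, num_frames):
    """Split num_frames across ranges in proportion to each range's length.

    Assumed behavior (largest-remainder rounding); the real tool may differ.
    """
    lengths = [r["end_frame"] - r["start_frame"] for r in ranges]
    total = sum(lengths)
    if total <= 0:
        raise ValueError("ranges must cover at least one frame")
    exact = [num_frames * length / total for length in lengths]
    counts = [int(x) for x in exact]          # floor of each exact share
    leftover = num_frames - sum(counts)
    # Hand the remaining frames to the ranges with the largest fractional part.
    order = sorted(range(len(ranges)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


ranges = [
    {"start_frame": 0, "end_frame": 1500},
    {"start_frame": 3000, "end_frame": 3500},
]
print(allocate_frames(ranges, 40))  # the longer range receives three times as many frames
```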
Transcribe audio from a time range (seconds).
videoarm-audio <path> --start 0 --end 300
Returns: JSON with transcript and segments.
⚠️ Transcript can be very long. Extract key quotes and write to memory immediately.
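Because `videoarm-audio` works in seconds while the memory file stores frame intervals, a conversion step is needed before writing to `audio_snippets`. The sketch below assumes each transcript segment is a dict with `text`, `start_time`, and `end_time` keys (matching the memory example, but not confirmed for the tool's raw output); the function names are hypothetical.

```python
def seconds_to_frame_interval(start_time, end_time, fps):
    """Convert a time range in seconds to a [start_frame, end_frame] pair."""
    return [round(start_time * fps), round(end_time * fps)]


def align_segments(segments, fps):
    """Attach a frame_interval to each transcript segment (assumed segment shape)."""
    aligned = []
    for seg in segments:
        aligned.append({
            "frame_interval": seconds_to_frame_interval(
                seg["start_time"], seg["end_time"], fps
            ),
            "text": seg["text"],
            "start_time": seg["start_time"],
            "end_time": seg["end_time"],
        })
    return aligned


# With fps = 25.0, the 120 s to 180 s segment maps to frames 3000 to 4500,
# which agrees with the audio_snippets example in the memory file.
print(seconds_to_frame_interval(120.0, 180.0, 25.0))
```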
Spawn a sub-agent to caption the extracted frames:
sessions_spawn(
task = """Read this image and analyze it: /tmp/xxx.jpg
Use the read tool to open it (it supports jpg images).
These are 30 frames from a video ({time_range}).
Describe the main scene or action in these frames using a concise English sentence.
Prefix your answer with "Caption: "
""",
cleanup = "delete"
)
→ Write result to scene_snapshots in memory.
This replaces the source code's clip_analyzer tool. Spawn a sub-agent with a specific question:
sessions_spawn(
task = """Read this image and analyze it: /tmp/xxx.jpg
Use the read tool to open it (it supports jpg images).
These are {num_frames} frames from a video ({time_range}).
Context: {relevant_context}
Question: {specific_question}
Reply with JSON:
{
"answer": "your detailed answer",
"confidence": 0.85,
"evidence": ["key observation 1", "key observation 2"]
}""",
cleanup = "delete"
)
→ Write result to frame_analyses in memory with the answer and confidence.
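A sub-agent may wrap its JSON reply in extra prose, so it can help to parse defensively before writing to memory. The following Python sketch assumes the reply contains a single JSON object of the shape requested above; `parse_reply` and `to_frame_analysis` are hypothetical helper names, and clamping confidence to the range 0 to 1 is an added assumption.

```python
import json


def parse_reply(reply):
    """Extract the JSON object from a sub-agent reply and normalize its fields."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in sub-agent reply")
    data = json.loads(reply[start:end + 1])
    confidence = float(data.get("confidence", 0.0))
    return {
        "answer": str(data["answer"]),
        "confidence": min(max(confidence, 0.0), 1.0),  # clamp (assumption)
        "evidence": list(data.get("evidence", [])),
    }


def to_frame_analysis(reply, iteration, reason, frame_interval, question):
    """Build a frame_analyses entry in the layout used by the memory file."""
    parsed = parse_reply(reply)
    return {
        "iteration": iteration,
        "reason": reason,
        "frame_interval": frame_interval,
        "question": question,
        "answer": parsed["answer"],
        "confidence": parsed["confidence"],
    }
```

The `evidence` list is parsed but left out of the memory entry here, since the memory example records only the answer and confidence.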
Tips for sub-agent tasks:
answer + confidence
cleanup="delete" to auto-clean
videoarm-download <url> # Get video
videoarm-info <path> # Get metadata
→ Create memory file with question + metadata + empty categories
videoarm-extract-frames --video <path> --ranges '[...]' --num-frames 30
→ Spawn sub-agent to caption frames
→ Write to scene_snapshots in memory
videoarm-audio <path> --start 0 --end 300
→ Extract key quotes → write to audio_snippets in memory
Based on memory, extract specific time range and spawn sub-agent with targeted question.
→ Write to frame_analyses in memory
Read memory → synthesize findings → answer with confidence.
When to answer:
When to continue: