Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

Video Reader

v4.1.1

Tool-driven video question answering with frame extraction, sub-agent analysis, and audio transcription

1 star · 111 downloads · 0 current · 0 all-time
by Qianke Meng (@qiankemeng)

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for qiankemeng/video-reader.

Prompt Preview: Install & Setup
Install the skill "Video Reader" (qiankemeng/video-reader) from ClawHub.
Skill page: https://clawhub.ai/qiankemeng/video-reader
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Install by bare skill slug:

openclaw skills install video-reader

ClawHub CLI


npx clawhub@latest install video-reader
Security Scan

VirusTotal: Suspicious (View report →)
OpenClaw: Suspicious (medium confidence)
Purpose & Capability
The name/description (video question answering with frame extraction and transcription) matches the code and runtime instructions: tools for download, metadata, frame extraction, and audio transcription are present. However, the skill metadata declares no required binaries or environment variables, while the code clearly expects external binaries (ffmpeg, optional yt-dlp) and reads multiple environment variables (WHISPER_API_KEY, WHISPER_BASE_URL, WHISPER_MODEL, VISION_API_KEY/OPENAI_API_KEY, ANTHROPIC_API_KEY). That mismatch between declared requirements and actual dependencies is unexpected and should be resolved before trusting the skill.
Instruction Scope
SKILL.md confines runtime actions to video download/inspect/extract/transcribe and spawning sub-agents to analyze image grids; it instructs the agent to use /tmp/videoarm_memory.json as the single source of truth and to spawn isolated sub-agents via sessions_spawn. It does not instruct the agent to read arbitrary system files or contact exfiltration endpoints. The memory file usage and sub-agent dispatch are explicit and scoped to the skill's purpose.
Install Mechanism
There is no install spec in the skill manifest (instruction-only), but the bundle includes a full Python package (pyproject.toml, CLI scripts, requirements). That means the package will not be auto-installed by the platform; manual installation is required to get dependencies (opencv, faster-whisper, ffmpeg, yt-dlp). This is reasonable but increases the chance users will miss required system binaries or optional components. No suspicious remote download URLs or archive extraction were found in the install artifacts.
⚠️ Credentials
The manifest lists no required environment variables or primary credential, yet the code and docs read or expect multiple credential-like env vars (WHISPER_API_KEY, WHISPER_BASE_URL, VISION_API_KEY/OPENAI_API_KEY, ANTHROPIC_API_KEY, HTTPS_PROXY, VIDEOARM_SESSION_ID). In particular, videoarm_audio.py currently requires WHISPER_API_KEY and returns an error if it is not set, contradicting README statements that local faster-whisper works without API keys. Asking for API keys or base URLs (and implicitly supporting OpenAI/Anthropic/Groq endpoints) is reasonable for optional cloud transcription/vision backends, but the manifest does not declare these needs, and the code will attempt network API calls whenever an API key or base URL is supplied. Do not provide secrets until you confirm which backend (local vs. remote) will be used.
⚠️ Persistence & Privilege
The skill writes logs and cache under ~/.videoarm and creates files under ~/.openclaw/workspace/tmp and /tmp/videoarm_memory.json. The provided cleaning tool (videoarm-clean) can delete files in ~/.openclaw/workspace/tmp and the VideoARM memory file; that may remove other workspace artifacts if run with broad arguments. The skill does not set always:true and does not modify other skills' configs, but its file I/O footprint in user home and OpenClaw workspace is significant and could affect other local agent state if cleaning tools are used carelessly.
Scan Findings in Context
[pre-scan-injection-signals] expected: No pre-scan injection signals were detected. However, a static absence of findings does not resolve the factual contradictions between the manifest and the code described above (e.g., undeclared environment variables and expected binaries).
What to consider before installing
This skill appears to implement a real video QA system, but there are important mismatches and operational risks to weigh before installing or running it:

  • Credentials & env vars: The skill bundle and docs mention both local Whisper (faster-whisper) and remote transcription APIs, but videoarm_audio.py currently requires WHISPER_API_KEY (and will call a remote transcription endpoint) unless you run a local whisper server. Do NOT paste your OpenAI/Anthropic/Groq API keys into environment variables for this skill until you confirm whether it will use a local model or an external API. Prefer testing first with a non-sensitive account or an isolated environment.
  • Missing declared requirements: The manifest declares no required binaries or env vars, yet the code needs ffmpeg (required), optionally yt-dlp (for downloads), and Python packages (opencv, faster-whisper). Install and test in a sandbox or VM, and run videoarm-doctor to verify dependencies before giving the skill access to important files.
  • Local file writes & cleanup: The skill creates ~/.videoarm and writes logs and cached videos; its cleaner can delete ~/.openclaw/workspace/tmp, which could remove other OpenClaw workspace files. If you care about other workspace data, avoid running the cleaner with broad flags, or inspect the cleaner code first.
  • Data exposure via sub-agents: The orchestrator spawns sub-agents and writes frame-grid images to workspace tmp for those sub-agents to read. If the video contains sensitive content you do not want shared with remote models, ensure the sub-agent/image tools operate locally and that no remote vision/transcription endpoints are configured.

What to do next:

  1. Inspect videoarm_audio.py and videoarm_local_whisper to confirm whether transcription runs locally or requires an API key in your deployment.
  2. Run videoarm-doctor in a safe environment to see which dependencies are missing.
  3. If you must provide API keys, create scoped/test keys and run in an isolated account.
  4. If you want to use only local models, confirm the local server path and leave WHISPER_API_KEY/WHISPER_BASE_URL unset.
  5. Consider running the skill inside a disposable container/VM to validate behavior and filesystem changes before using it on your regular workstation, as sketched below.
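
A minimal sandboxing sketch for step 5, assuming Docker is available; the base image, mount, and install path are illustrative assumptions, not part of the skill:

# Disposable container for validating the skill before real use
docker run --rm -it -v "$PWD/test-videos:/videos:ro" python:3.12-slim bash
# Inside the container: install the system binary the code expects,
# install the bundled package, then check dependencies.
apt-get update && apt-get install -y ffmpeg
pip install /path/to/video-reader-bundle   # bundle path is hypothetical
videoarm-doctor                            # reports missing dependencies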

Like a lobster shell, security has layers — review code before you run it.

latest: vk977wbpr7qjh4k3ycb5a5e2c1183xmm5
111 downloads · 1 star · 2 versions
Updated 2w ago · v4.1.1 · MIT-0

VideoARM Skill — Tool-Driven Video QA

You are a video QA orchestrator. You do NOT analyze images yourself — you dispatch sub-agents to do it.

Core Philosophy

OBSERVE → THINK → ACT → MEMORY (loop, max 10 iterations)

  • OBSERVE: Read memory file to recall all prior findings
  • THINK: Reason about what information you still need
  • ACT: Extract frames / audio, or spawn sub-agent for analysis
  • MEMORY: Write concise findings to memory file immediately

Critical: Context Rebuild

Each turn, read memory file first. Do NOT rely on previous tool outputs in conversation history.

The memory file is your single source of truth. Tool outputs from prior turns may be lost or truncated. Always:

  1. Read /tmp/videoarm_memory.json at the start of each turn
  2. Use memory contents to decide next action
  3. Write new findings to memory immediately after each tool/sub-agent result (see the sketch below)
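
A minimal sketch of this read-then-append pattern, assuming jq is available (the appended values are illustrative):

# OBSERVE: reload the single source of truth at the start of the turn
cat /tmp/videoarm_memory.json

# MEMORY: append a new finding immediately after a tool or sub-agent result
jq '.scene_snapshots += [{"iteration": 1, "reason": "Initial scan", "frame_interval": [0, 1500], "caption": "Caption: ..."}]' \
  /tmp/videoarm_memory.json > /tmp/videoarm_memory.json.tmp \
  && mv /tmp/videoarm_memory.json.tmp /tmp/videoarm_memory.json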

Architecture: Orchestrator + Workers

Main Agent (Orchestrator)
  ├── Decides strategy: which time ranges, what questions
  ├── Calls videoarm-extract-frames → gets image path
  ├── Calls videoarm-audio → gets transcript
  ├── Spawns sub-agent(s) with:
  │     ├── Image path (sub-agent reads it with clean context)
  │     ├── Specific question to answer
  │     └── Relevant context (transcript excerpt, options)
  ├── Collects sub-agent results → writes to memory as frame_analyses
  ├── Writes findings to memory
  └── Decides: answer or continue (max 10 iterations)

Why sub-agents?

  • Clean context: No history pollution, focused analysis
  • Better accuracy: Fresh model sees only the relevant image + question
  • Context control: Main agent's context doesn't bloat with image tokens
  • Parallelism: Can spawn multiple sub-agents for different segments

Memory File: /tmp/videoarm_memory.json

Structure (3 categories matching source agent pipeline):

{
  "video_path": "/path/to/video.mp4",
  "question": "Who used a tool?",
  "options": ["A. ...", "B. ...", "C. ...", "D. ..."],
  "metadata": {"duration": 2689.74, "fps": 25.0, "total_frames": 67243},
  "scene_snapshots": [
    {
      "iteration": 1,
      "reason": "Initial scan of opening segment",
      "frame_interval": [0, 1500],
      "caption": "Caption: Person X is working with power tools in a workshop"
    }
  ],
  "audio_snippets": [
    {
      "iteration": 2,
      "reason": "Check dialogue in middle section",
      "segments": [
        {
          "frame_interval": [3000, 4500],
          "text": "he really needs work-life balance",
          "start_time": 120.0,
          "end_time": 180.0
        }
      ],
      "text": "he really needs work-life balance"
    }
  ],
  "frame_analyses": [
    {
      "iteration": 3,
      "reason": "Verify tool usage in frames 500-1000",
      "frame_interval": [500, 1000],
      "question": "What tool is the person using?",
      "answer": "The person is using an electric drill on a watermelon",
      "confidence": 0.85
    }
  ],
  "current_answer": "D",
  "confidence": 0.9,
  "iterations_used": 3
}

Memory Categories

Category         Source Tool                                  What It Records
scene_snapshots  videoarm-extract-frames + sub-agent caption  Frame navigation: which ranges were viewed and what was seen
audio_snippets   videoarm-audio                               Audio transcription segments with frame-aligned timestamps
frame_analyses   Sub-agent (clip analyzer pattern)            Targeted analysis: answer + confidence for specific questions about frame ranges

Available Tools

1. videoarm-download

Download a video from a URL (YouTube, etc.).

HTTPS_PROXY=http://127.0.0.1:7890 videoarm-download <url>

Returns: {"path": "/path/to/video.mp4", "cached": false}

2. videoarm-info

Get video metadata.

videoarm-info <path>

Returns: {"fps": 25.0, "total_frames": 67243, "duration": 2689.74, "has_audio": true}

3. videoarm-extract-frames

Extract frames as a grid image. Frames are distributed proportionally across ranges by range length. Returns path only — do NOT read it yourself.

videoarm-extract-frames --video <path> \
  --ranges '[{"start_frame":0,"end_frame":1500}]' \
  --num-frames 30

Returns: {"image_path": "/tmp/xxx.jpg", ...}
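
As a worked example of the proportional distribution (the second range below is illustrative): two ranges covering 1500 and 500 frames with --num-frames 30 receive about 22–23 and 7–8 frames respectively (30 × 1500/2000 = 22.5 and 30 × 500/2000 = 7.5; exact rounding depends on the tool):

videoarm-extract-frames --video <path> \
  --ranges '[{"start_frame":0,"end_frame":1500},{"start_frame":3000,"end_frame":3500}]' \
  --num-frames 30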

4. videoarm-audio

Transcribe audio from a time range (seconds).

videoarm-audio <path> --start 0 --end 300

Returns: JSON with transcript and segments.

⚠️ Transcript can be very long. Extract key quotes and write to memory immediately.
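
Note that videoarm-audio takes seconds while the other tools use frame indices; convert with seconds = frame / fps. At the 25 fps of the metadata example above, the frame interval [3000, 4500] maps to 120–180 seconds:

videoarm-audio <path> --start 120 --end 180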

Sub-Agent Dispatch Patterns

Scene Snapshot (after extracting frames)

Spawn a sub-agent to caption the extracted frames:

sessions_spawn(
  task = """Read this image and analyze it: /tmp/xxx.jpg

Use the read tool to open it (it supports jpg images).

These are 30 frames from a video ({time_range}).

Describe the main scene or action in these frames using a concise English sentence.
Prefix your answer with "Caption: "
""",
  cleanup = "delete"
)

→ Write result to scene_snapshots in memory.

Clip Analyzer (targeted question about frames)

This replaces the source code's clip_analyzer tool. Spawn a sub-agent with a specific question:

sessions_spawn(
  task = """Read this image and analyze it: /tmp/xxx.jpg

Use the read tool to open it (it supports jpg images).

These are {num_frames} frames from a video ({time_range}).
Context: {relevant_context}

Question: {specific_question}

Reply with JSON:
{
  "answer": "your detailed answer",
  "confidence": 0.85,
  "evidence": ["key observation 1", "key observation 2"]
}""",
  cleanup = "delete"
)

→ Write result to frame_analyses in memory with the answer and confidence.

Tips for sub-agent tasks:

  • Give specific questions, not vague ones
  • Include relevant context (audio transcript excerpts, character names from earlier findings)
  • Ask for structured JSON output with answer + confidence
  • Set cleanup="delete" to auto-clean

Workflow Example

Turn 1: Initialize

videoarm-download <url>        # Get video
videoarm-info <path>           # Get metadata

→ Create memory file with question + metadata + empty categories
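
A minimal sketch of that initialization, mirroring the memory structure above (path and values are illustrative):

cat > /tmp/videoarm_memory.json <<'EOF'
{
  "video_path": "/path/to/video.mp4",
  "question": "Who used a tool?",
  "options": ["A. ...", "B. ...", "C. ...", "D. ..."],
  "metadata": {"duration": 2689.74, "fps": 25.0, "total_frames": 67243},
  "scene_snapshots": [],
  "audio_snippets": [],
  "frame_analyses": [],
  "current_answer": null,
  "confidence": 0.0,
  "iterations_used": 0
}
EOF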

Turn 2: First Sample

videoarm-extract-frames --video <path> --ranges '[...]' --num-frames 30

→ Spawn sub-agent to caption frames → Write to scene_snapshots in memory

Turn 3: Audio (if needed)

videoarm-audio <path> --start 0 --end 300

→ Extract key quotes → write to audio_snippets in memory

Turn 4: Focused Analysis

Based on memory, extract specific time range and spawn sub-agent with targeted question. → Write to frame_analyses in memory

Turn 5: Answer

Read memory → synthesize findings → answer with confidence.

Strategy Guidelines

  • Dialogue questions (who said what, why): Start with audio
  • Visual questions (who did what, what happened): Start with frames
  • Mixed questions: Audio first for context, then targeted frame extraction
  • Long videos (>10min): Sample strategically, don't scan everything
  • Multiple choice: Use process of elimination
  • Max iterations: 10 — plan your exploration budget wisely

Decision Making

When to answer:

  • Confidence > 0.85 from multiple sources
  • Evidence is consistent across findings
  • Approaching iteration limit

When to continue:

  • Confidence < 0.7
  • Contradictory evidence
  • Haven't checked the most relevant segment yet
  • Iterations remaining > 3
