Tom Video Understanding

v1.0.0

Local video comprehension skill. Uses ffmpeg to extract audio and frames, FunASR for speech recognition, and qwen3-vl for image understanding.


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt below, then paste it into OpenClaw to install tomuiv/tom-video-understanding.

Prompt preview: Install & Setup
Install the skill "Tom Video Understanding" (tomuiv/tom-video-understanding) from ClawHub.
Skill page: https://clawhub.ai/tomuiv/tom-video-understanding
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install tom-video-understanding

ClawHub CLI


npx clawhub@latest install tom-video-understanding
Security Scan
VirusTotal: Benign
OpenClaw: Benign (medium confidence)
Purpose & Capability
The name/description (local video comprehension) matches the instructions: ffmpeg for audio/frames extraction, FunASR for Chinese ASR, and qwen3-vl via Ollama for image understanding. None of the required actions or tools appear unrelated to the stated purpose.
Instruction Scope
SKILL.md confines actions to extracting audio/frames, running FunASR in a conda env, and querying a local Ollama model. It does reference a specific ModelScope cache path (C:/Users/TOM/.cache/modelscope) and suggests copying files if paths contain Chinese characters — this is Windows- and user-specific and may need adjustment. The doc also allows optional "Summary/Analysis → Cloud LLM API (if needed)", which would send derived data externally if used; that is outside the local-only flow and should be considered separately.
Install Mechanism
This is instruction-only with no install spec or packaged downloads in the skill itself. That reduces risk. Note: models (FunASR/ModelScope and qwen3-vl) and Ollama are expected to be pulled/downloaded at runtime by the user, which involves network activity and large binary downloads but is not performed by the skill bundle itself.
Credentials
The skill declares no required env vars or credentials. The instructions set MODELSCOPE_CACHE to a specific, user-named path (C:/Users/TOM/...), which is an implementation detail rather than a request for secrets, but it assumes a specific user environment. The skill also mentions optionally using a cloud LLM for summaries; that would require credentials and configuration supplied by the user, but they are not requested by the skill itself.
Persistence & Privilege
The skill does not request always-on presence and makes no claims to modify other skills or system-wide configs. It is user-invocable and can be run locally as-needed.
Assessment
This skill appears to do what it says: extract audio/frames and run local models. Before using it, ensure:

  • You have ffmpeg, conda, and Ollama installed and you trust the sources from which models will be downloaded (model downloads use the network and can be large).
  • If you plan to use the optional cloud LLM step, confirm which endpoint and credentials you'll use and avoid sending sensitive video/audio to unknown cloud services.
  • You update the Windows-specific ModelScope cache path and any example mirrors (the README's OLLAMA_BASE_URL example is a placeholder) to suit your environment.
  • You verify Ollama's model provenance (qwen3-vl) and the FunASR model identifiers before pulling.

If the skill ever asked to read unrelated system files, requested unrelated credentials, or included opaque download URLs or install scripts, treat it as suspicious and do not run it without deeper review.


Latest version: vk970ydcv4dksdwv6tr691bksex84ntn5
166 downloads · 0 stars · 1 version · Updated 2w ago
v1.0.0 · MIT-0

Video Understanding

Use this skill when you need to understand the content of a video.

Prerequisites

  • FunASR conda environment (asr-local) must be activated for audio processing
  • Ollama must be running with qwen3-vl:8b model available
  • ffmpeg must be in PATH
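
If you want to verify these prerequisites up front, a small pre-flight script can check each one. A minimal sketch in Python, assuming the conda environment is named asr-local and Ollama listens on its default local port (11434):

# preflight_check.py - minimal sketch; the env name and Ollama port are assumptions
import shutil
import subprocess
import urllib.request

def check_prerequisites():
    # ffmpeg must be resolvable from PATH
    if shutil.which("ffmpeg") is None:
        print("ffmpeg not found in PATH")

    # the 'asr-local' conda environment should exist (name taken from the prerequisites above)
    envs = subprocess.run(["conda", "env", "list"], capture_output=True, text=True)
    if "asr-local" not in envs.stdout:
        print("conda environment 'asr-local' not found")

    # Ollama default endpoint; adjust if your daemon runs elsewhere
    try:
        with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=5) as resp:
            if "qwen3-vl" not in resp.read().decode():
                print("qwen3-vl model not pulled in Ollama")
    except OSError:
        print("Ollama does not appear to be running on localhost:11434")

if __name__ == "__main__":
    check_prerequisites()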

Workflow

Step 1: Extract Audio

ffmpeg -i "video.mp4" -vn -acodec pcm_s16le -ar 16000 -ac 1 "audio.wav" -y

Note: If the path contains Chinese characters, copy audio.wav to a path without Chinese characters before running ASR.
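
One way to handle the Chinese-character path issue automatically is to copy the WAV to an ASCII-only temporary location before handing it to FunASR. A minimal sketch (the temp-directory layout is an assumption, not part of the skill):

# copy_for_asr.py - sketch: move audio to an ASCII-safe path before ASR
import shutil
import tempfile
from pathlib import Path

def ascii_safe_copy(audio_path: str) -> str:
    """Copy audio to a temp dir if its path contains non-ASCII characters."""
    path = Path(audio_path)
    try:
        str(path).encode("ascii")
        return str(path)  # already ASCII-only, use as-is
    except UnicodeEncodeError:
        tmp_dir = Path(tempfile.gettempdir()) / "asr_input"  # assumed temp location
        tmp_dir.mkdir(exist_ok=True)
        target = tmp_dir / "audio.wav"
        shutil.copy2(path, target)
        return str(target)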

Step 2: Extract Key Frames

mkdir frames
ffmpeg -i "video.mp4" -vf "fps=1/10" -q:v 2 "frames/frame_%03d.jpg" -y

Step 3: Speech Recognition (FunASR)

conda run -n asr-local python -c "
import os
# point ModelScope at the local model cache (author's Windows path; adjust for your machine)
os.environ['MODELSCOPE_CACHE'] = 'C:/Users/TOM/.cache/modelscope'
from funasr import AutoModel
# Paraformer large model for 16 kHz Mandarin ASR
model = AutoModel(
    model='iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
    model_revision='v2.0.4',
    disable_update=True,
    ncpu=4
)
# replace AUDIO_PATH with the extracted (ASCII-safe) WAV file
result = model.generate(input='AUDIO_PATH')
print(result)
"

Step 4: Image Understanding (qwen3-vl)

ollama run qwen3-vl:8b "Describe this image in detail: /path/to/frame.jpg"
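
The CLI call works for one-off checks, but when describing many frames it can be more convenient to call Ollama's local HTTP API and pass each frame as a base64-encoded image. A sketch assuming the default endpoint at http://localhost:11434:

# describe_frames.py - sketch: describe extracted frames via Ollama's /api/generate
import base64
import glob
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint (assumption)

def describe_frame(image_path: str, model: str = "qwen3-vl:8b") -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    payload = {
        "model": model,
        "prompt": "Describe this image in detail.",
        "images": [image_b64],
        "stream": False,
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    for frame in sorted(glob.glob("frames/frame_*.jpg")):
        print(frame, describe_frame(frame), sep="\n", end="\n\n")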

Step 5: Combine Results

  • Audio transcription → FunASR (local, Chinese speech recognition)
  • Key frames → qwen3-vl:8b via Ollama (local image understanding)
  • Summary/Analysis → Cloud LLM API (if needed)
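
A minimal sketch of how these pieces might be stitched into a single report that you can then summarize locally, or pass to a cloud LLM only if you choose to. The transcript_from_result and describe_frame helpers are the hypothetical ones sketched in Steps 3 and 4:

# combine.py - sketch: merge the transcript and frame descriptions into one report
def build_report(transcript: str, frame_descriptions: dict[str, str]) -> str:
    lines = ["== Audio transcript (FunASR) ==", transcript, "", "== Key frames (qwen3-vl) =="]
    for frame, description in sorted(frame_descriptions.items()):
        lines.append(f"[{frame}] {description}")
    return "\n".join(lines)

# usage, with the helpers from the earlier sketches:
# transcript = transcript_from_result(result)                                  # Step 3 sketch
# descriptions = {f: describe_frame(f) for f in glob.glob("frames/*.jpg")}     # Step 4 sketch
# print(build_report(transcript, descriptions))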

Important Notes

  • Image reading via the Read tool does NOT provide image understanding; always use qwen3-vl
  • For Chinese audio, FunASR is preferred over Whisper
  • Check for existing subtitle files (.txt, .srt, .vtt) before running ASR (see the sketch after this list)
  • ModelScope cache at C:/Users/TOM/.cache/modelscope holds the FunASR models
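
The subtitle check mentioned above can be a simple lookup for a sibling file with the same stem; a minimal sketch:

# find_subtitles.py - sketch: look for an existing subtitle file next to the video
from pathlib import Path
from typing import Optional

def existing_subtitles(video_path: str) -> Optional[Path]:
    video = Path(video_path)
    for ext in (".srt", ".vtt", ".txt"):
        candidate = video.with_suffix(ext)  # e.g. video.mp4 -> video.srt
        if candidate.exists():
            return candidate
    return None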
