Install
openclaw skills install tom-video-understanding
Local video comprehension skill. Use ffmpeg to extract audio and frames, FunASR for speech recognition, and qwen3-vl for image understanding.
Use this skill when you need to understand the content of a video.
The conda environment (asr-local) must be activated for audio processing.
ffmpeg -i "video.mp4" -vn -acodec pcm_s16le -ar 16000 -ac 1 "audio.wav" -y
Note: If the path contains Chinese characters, copy audio.wav to a path without Chinese characters before running ASR.
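The note above can be automated. The following is a minimal sketch, not part of the skill itself: the helper name `ascii_safe_copy` and the fixed target file name are assumptions made for illustration. It returns the original path when it is already ASCII-only and otherwise copies the file into a separate working directory.

```python
import shutil
import tempfile
from pathlib import Path


def ascii_safe_copy(src, workdir=None):
    """Return a path to `src` whose full path is ASCII-only (hypothetical helper).

    If the resolved path is already ASCII, `src` is returned unchanged.
    Otherwise the file is copied into `workdir` under a fixed ASCII name.
    """
    src = Path(src)
    if str(src.resolve()).isascii():
        return src
    # Assumption: the system temp directory is itself ASCII-only. Pass an
    # explicit `workdir` if the user profile path contains non-ASCII characters.
    target_dir = Path(workdir) if workdir else Path(tempfile.mkdtemp(prefix="asr_"))
    target_dir.mkdir(parents=True, exist_ok=True)
    dst = target_dir / ("audio" + src.suffix)
    shutil.copyfile(src, dst)
    return dst
```

The returned path is the one to pass as the ASR input.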
mkdir frames
ffmpeg -i "video.mp4" -vf "fps=1/10" -q:v 2 "frames/frame_%03d.jpg" -y
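The two ffmpeg steps can also be driven from Python. This sketch reuses the flags from the commands above; the function names and the `every_seconds` parameter are illustrative assumptions, and `-y` is moved to the front of the argument list, where ffmpeg accepts it as a global option.

```python
import subprocess
from pathlib import Path


def audio_cmd(video, wav="audio.wav"):
    """ffmpeg arguments for 16 kHz mono 16-bit PCM audio, as in the command above."""
    return ["ffmpeg", "-y", "-i", str(video), "-vn",
            "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", str(wav)]


def frames_cmd(video, out_dir="frames", every_seconds=10):
    """ffmpeg arguments for one JPEG frame every `every_seconds` seconds."""
    pattern = str(Path(out_dir) / "frame_%03d.jpg")
    return ["ffmpeg", "-y", "-i", str(video),
            "-vf", f"fps=1/{every_seconds}", "-q:v", "2", pattern]


def extract(video, wav="audio.wav", out_dir="frames"):
    """Run both extraction steps; raises CalledProcessError if ffmpeg fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(audio_cmd(video, wav), check=True)
    subprocess.run(frames_cmd(video, out_dir), check=True)
```

Passing the arguments as a list avoids shell quoting problems with spaces in file names.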
conda run -n asr-local python -c "
import os
os.environ['MODELSCOPE_CACHE'] = 'C:/Users/TOM/.cache/modelscope'
from funasr import AutoModel
model = AutoModel(
    model='iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
    model_revision='v2.0.4',
    disable_update=True,
    ncpu=4
)
result = model.generate(input='AUDIO_PATH')
print(result)
"
ollama run qwen3-vl:8b "Describe this image in detail: /path/to/frame.jpg"
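To describe every extracted frame rather than a single one, the command above can be looped over the frames directory. This is a sketch under stated assumptions: the helper names are hypothetical, the frames are assumed to follow the `frame_*.jpg` naming used earlier, and `ollama` is assumed to be on the PATH with the `qwen3-vl:8b` model already pulled.

```python
import subprocess
from pathlib import Path

MODEL = "qwen3-vl:8b"  # model tag taken from the command above


def describe_cmd(frame, model=MODEL):
    """Build the `ollama run` arguments for one frame (absolute path in the prompt)."""
    prompt = f"Describe this image in detail: {Path(frame).resolve().as_posix()}"
    return ["ollama", "run", model, prompt]


def describe_frames(frames_dir="frames"):
    """Run the model once per frame, in file-name order; returns {file name: description}."""
    descriptions = {}
    for frame in sorted(Path(frames_dir).glob("frame_*.jpg")):
        done = subprocess.run(describe_cmd(frame), capture_output=True,
                              text=True, encoding="utf-8", check=True)
        descriptions[frame.name] = done.stdout.strip()
    return descriptions
```

Each call starts a separate `ollama run` process, which is simple but not the fastest option for videos with many frames.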
Read tool does NOT provide image understanding - always use qwen3-vl
Set MODELSCOPE_CACHE to C:/Users/TOM/.cache/modelscope for FunASR models
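Once the ASR output and the frame descriptions are available, they could be merged into one text report. The sketch below assumes that the FunASR result is a list of dictionaries each carrying a `text` field, and that the frame descriptions come as a mapping from file name to description; both shapes and the function names are assumptions for illustration, not part of the skill.

```python
def transcript_from_result(result):
    """Join the 'text' fields of a FunASR-style result list (assumed shape)."""
    parts = []
    for item in result or []:
        text = item.get("text", "") if isinstance(item, dict) else ""
        if text.strip():
            parts.append(text.strip())
    return "\n".join(parts)


def build_report(result, frame_descriptions):
    """Combine the transcript and per-frame descriptions into one plain-text report."""
    transcript = transcript_from_result(result) or "(no speech recognized)"
    lines = ["Transcript:", transcript, "", "Frames:"]
    for name in sorted(frame_descriptions):
        lines.append(f"{name}: {frame_descriptions[name]}")
    return "\n".join(lines)
```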