Install

```bash
openclaw skills install video-captions
```

Generate professional captions and subtitles with multi-engine transcription, word-level timing, styling presets, and burn-in.

**When to use:** User needs captions or subtitles for video content. The agent handles transcription, timing, formatting, styling, translation, and burn-in across all major formats and platforms.
**Reference files:**

| Topic | File |
|---|---|
| Transcription engines | engines.md |
| Output formats | formats.md |
| Styling presets | styling.md |
| Platform requirements | platforms.md |
**Engine selection:**

| Scenario | Engine | Why |
|---|---|---|
| Default (recommended) | Whisper local | 100% offline, no data leaves machine |
| Apple Silicon | MLX Whisper | Native acceleration, still local |
| Word timestamps | whisper-timestamped | DTW alignment, still local |
Default: Whisper local (turbo model). See engines.md for optional cloud alternatives.
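The engine table above can be applied mechanically. A minimal sketch of that selection logic (the `mlx_whisper` CLI comes from the `mlx-whisper` package; the variable name is illustrative):

```shell
# Prefer MLX Whisper on Apple Silicon when installed, else plain local Whisper.
if [ "$(uname -sm)" = "Darwin arm64" ] && command -v mlx_whisper >/dev/null 2>&1; then
  engine=mlx_whisper
else
  engine=whisper
fi
echo "Selected engine: $engine"
# Then run e.g.: "$engine" video.mp4 --model turbo --output_format srt
```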
**Format by platform:**

| Platform | Format | Notes |
|---|---|---|
| YouTube | VTT or SRT | VTT preferred |
| Netflix/Pro | TTML | Strict timing rules |
| Social (TikTok, IG) | Burn-in (ASS) | Embedded in video |
| General | SRT | Universal compatibility |
| Karaoke/effects | ASS | Advanced styling |
Ask user's target platform if not specified.
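The platform table can be encoded as a small helper; the function name and platform keys here are illustrative, not part of the skill:

```shell
# Map a target platform to the caption format from the table above.
format_for_platform() {
  case "$1" in
    youtube)          echo vtt ;;    # VTT preferred
    netflix)          echo ttml ;;   # strict timing rules
    tiktok|instagram) echo ass ;;    # burn-in via ASS styling
    *)                echo srt ;;    # universal fallback
  esac
}
format_for_platform youtube   # prints "vtt"
```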
Line rules:

- Netflix-compliant (default): max 42 characters per line, max 2 lines per caption.
- Social media: shorter lines sized for vertical video, burned in with large fonts.
- Break lines at natural phrase boundaries (after punctuation, before conjunctions).
- Never separate: an article or adjective from its noun, a person's name, or a prepositional phrase.
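One way to enforce a 42-character line limit mechanically, assuming a POSIX `fold` (this wraps at word boundaries but does not know linguistic phrase boundaries, so treat it as a first pass):

```shell
# Word-safe wrap at the 42-character Netflix limit (-s breaks at spaces).
echo "This subtitle line is definitely too long to fit on a single caption line" \
  | fold -s -w 42
```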
Use word timestamps for karaoke effects, word-by-word highlighting, and precise burn-in sync. Enable with `--word_timestamps True`.

For multi-speaker content:
Use [Speaker 1] or [Name] if known:

```
JOHN: What do you think?
```

Before delivering, verify timing, line length, and speaker labels.
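A hypothetical pre-delivery check that flags overlong caption lines (the sample SRT file and the 42-character threshold are illustrative):

```shell
# Flag caption text lines over 42 characters; timestamp lines are exempt.
printf '1\n00:00:01,000 --> 00:00:03,000\nShort line\nThis caption line is much much much longer than forty-two characters\n' > captions.srt
awk '!/-->/ && length($0) > 42 { print "line " NR ": " $0 }' captions.srt
```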
```bash
# Auto-detect language, output SRT
whisper video.mp4 --model turbo --output_format srt

# Specify language
whisper video.mp4 --model turbo --language es --output_format srt

# Multiple formats
whisper video.mp4 --model turbo --output_format all
```
```bash
# Using whisper-timestamped
whisper_timestamped video.mp4 --model large-v3 --output_format srt

# With VAD pre-processing (reduces hallucinations)
whisper_timestamped video.mp4 --vad silero --accurate
```
```bash
# Generate SRT first, then burn in with a style
ffmpeg -i video.mp4 -vf "subtitles=video.srt:force_style='FontName=Arial,FontSize=24,PrimaryColour=&HFFFFFF,OutlineColour=&H000000,Outline=2,Shadow=1,Alignment=2'" output.mp4

# TikTok/Instagram style (centered, bold)
ffmpeg -i video.mp4 -vf "subtitles=video.srt:force_style='FontName=Montserrat-Bold,FontSize=32,PrimaryColour=&HFFFFFF,OutlineColour=&H000000,Outline=3,Shadow=0,Alignment=10,MarginV=50'" output.mp4

# Netflix style (bottom, clean)
ffmpeg -i video.mp4 -vf "subtitles=video.srt:force_style='FontName=Netflix Sans,FontSize=48,PrimaryColour=&HFFFFFF,OutlineColour=&H000000,Outline=2,Shadow=1,Alignment=2'" output.mp4
```
```bash
# Transcribe + translate to English
whisper video.mp4 --model turbo --task translate --output_format srt

# SRT to VTT
ffmpeg -i video.srt video.vtt

# SRT to ASS (for styling)
ffmpeg -i video.srt video.ass
```
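For environments without ffmpeg, the SRT-to-VTT conversion is mechanical: VTT is essentially SRT with a `WEBVTT` header and dots instead of commas in timestamps. A sketch (the sample cue is illustrative):

```shell
# Create a tiny sample cue, then convert by rewriting timestamps.
printf '1\n00:00:01,000 --> 00:00:03,500\nHello world\n' > demo.srt
{ printf 'WEBVTT\n\n'; sed 's/\([0-9][0-9]:[0-9][0-9]:[0-9][0-9]\),\([0-9]\{3\}\)/\1.\2/g' demo.srt; } > demo.vtt
head -3 demo.vtt
```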
Quick reference:

- Set `--language` explicitly for mixed content
- Use `--max_line_width 42` for Netflix compliance
- Use a high video bitrate when burning in (`-b:v 8M`)
- VTT output: `whisper video.mp4 --output_format vtt`
- Burn-in without re-encoding audio: `ffmpeg -i video.mp4 -vf "subtitles=video.ass" -c:a copy output.mp4`
- Speaker labels: `[SPEAKER]: text`
- Sound descriptions: `[music]`, `[laughter]`
- Translation: `--task translate` for English

Default: 100% LOCAL processing. No network calls.
| Endpoint | Data Sent | When Used |
|---|---|---|
| Whisper (local) | None (local) | Default — always |
| api.assemblyai.com | Audio file | Only if user sets ASSEMBLYAI_API_KEY |
| api.deepgram.com | Audio file | Only if user sets DEEPGRAM_API_KEY |
Cloud APIs are documented as alternatives but never used unless user explicitly provides API keys and requests cloud processing. By default, all processing stays on your machine.
Default workflow is 100% offline: transcription, styling, and burn-in all run on the local machine.

Cloud APIs are OPTIONAL and OPT-IN: they are used only if the user sets `ASSEMBLYAI_API_KEY` or `DEEPGRAM_API_KEY`.

This skill does NOT:

- Send audio or video anywhere by default
- Require any API keys
- Make network calls unless cloud processing is explicitly requested
Related skills — install with `clawhub install <slug>` if user confirms:

- `ffmpeg` — video/audio processing
- `video` — general video tasks
- `video-edit` — video editing
- `audio` — audio processing

```bash
clawhub star video-captions
clawhub sync
```