Install
openclaw skills install video-subtitle-extractor
Cross-platform video subtitle extraction using ASR (speech-to-text). Downloads audio from video URLs via yt-dlp, transcribes it with openai-whisper (small/medium/large-v3), and applies LLM-based text calibration for Chinese financial and technical content. Use when: (1) extracting subtitles from Bilibili, YouTube, or any yt-dlp-supported platform, (2) the video has no built-in subtitles, (3) users say "下载字幕" (download subtitles), "提取字幕" (extract subtitles), "语音转文字" (speech to text), "视频转文字" (video to text), "字幕提取" (subtitle extraction), or "ASR转写" (ASR transcription), (4) transcribing audio files to text, (5) working with Chinese-language video content that requires high-accuracy transcription. Automatically handles dependency installation (ffmpeg, yt-dlp, openai-whisper) and model downloads.
Cross-platform ASR subtitle extraction pipeline. Downloads audio from any yt-dlp-compatible video platform, transcribes with openai-whisper, and applies LLM-based text calibration for Chinese content.
Tested & verified on Windows 11 with real Bilibili videos (medium model, ~95% accuracy for Chinese).
# One-command full pipeline
python scripts/run.py <video_url> --model medium --language zh --output-dir ./output
# Download audio only
python scripts/download_audio.py <video_url> <output_dir>
# Transcribe existing audio
python scripts/transcribe.py <audio_file> --model medium --language zh
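For more than one video, the full-pipeline command can be driven from a small wrapper. The following is a minimal sketch, assuming it is run from the skill's root directory so that scripts/run.py resolves; the helper names `build_run_command` and `run_batch` and the `video_NNN` directory layout are hypothetical and not part of the skill.

```python
import subprocess
import sys
from pathlib import Path
from typing import Dict, List


def build_run_command(url: str, model: str = "medium", language: str = "zh",
                      output_dir: str = "./output") -> List[str]:
    """Build the argv list for scripts/run.py using the flags shown above."""
    return [
        sys.executable, "scripts/run.py", url,
        "--model", model,
        "--language", language,
        "--output-dir", output_dir,
    ]


def run_batch(urls: List[str], root: str = "./output") -> Dict[str, int]:
    """Run the pipeline once per URL and return the exit code for each URL."""
    results = {}
    for index, url in enumerate(urls, start=1):
        # Hypothetical layout: one subdirectory per video so outputs do not collide.
        out_dir = Path(root) / f"video_{index:03d}"
        out_dir.mkdir(parents=True, exist_ok=True)
        completed = subprocess.run(build_run_command(url, output_dir=str(out_dir)))
        results[url] = completed.returncode
    return results
```

A non-zero exit code in the returned mapping marks a video that needs a retry, for example with cookies or a smaller model.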
Use this skill when:
python scripts/install_deps.py
Detects the OS and installs ffmpeg (winget/brew/apt), yt-dlp (pip), and openai-whisper (pip). On Windows, it locates ffmpeg even when the binary is not on PATH.
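The OS-to-package-manager mapping described above could look roughly like this. It is a sketch under assumptions, not the contents of install_deps.py: the function names are hypothetical, and the winget package ID (`Gyan.FFmpeg`) is an assumption about which ffmpeg build gets installed.

```python
import platform
import shutil
from typing import List, Optional

# Install commands per OS, following the winget/brew/apt split described above.
# The exact package IDs and flags are assumptions for illustration.
FFMPEG_INSTALL = {
    "Windows": ["winget", "install", "--id", "Gyan.FFmpeg", "-e"],
    "Darwin": ["brew", "install", "ffmpeg"],
    "Linux": ["sudo", "apt-get", "install", "-y", "ffmpeg"],
}


def ffmpeg_install_command(system: Optional[str] = None) -> List[str]:
    """Return the install command for the given OS (defaults to the current one)."""
    system = system or platform.system()
    if system not in FFMPEG_INSTALL:
        raise RuntimeError(f"Unsupported OS: {system}")
    return FFMPEG_INSTALL[system]


def ffmpeg_on_path() -> bool:
    """True when an ffmpeg executable is discoverable through PATH."""
    return shutil.which("ffmpeg") is not None
```

Checking `ffmpeg_on_path()` first avoids reinstalling; the real script additionally searches locations outside PATH on Windows, which this sketch does not attempt.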
Run scripts/download_audio.py <url> [output_dir].
Uses yt-dlp to extract the best available audio format (m4a preferred). Supports Bilibili, YouTube, and 1800+ yt-dlp-compatible platforms. The script automatically detects ffmpeg even when not in system PATH.
If the download fails, the video may require cookies. Try:
yt-dlp --cookies-from-browser chrome <url>
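The same download can be expressed through yt-dlp's Python API. This is a minimal sketch, assuming the m4a-first preference maps to the format selector below; `build_ydl_options` and `download_audio` are hypothetical names, and the actual selector used by scripts/download_audio.py may differ.

```python
from pathlib import Path
from typing import Optional


def build_ydl_options(output_dir: str, cookies_browser: Optional[str] = None) -> dict:
    """Build yt-dlp options that prefer m4a audio and fall back to other formats."""
    options = {
        # m4a audio first, then any audio-only stream, then the best combined format.
        "format": "bestaudio[ext=m4a]/bestaudio/best",
        "outtmpl": str(Path(output_dir) / "%(id)s.%(ext)s"),
        "noplaylist": True,
    }
    if cookies_browser:
        # Python-API counterpart of `--cookies-from-browser <browser>`.
        options["cookiesfrombrowser"] = (cookies_browser,)
    return options


def download_audio(url: str, output_dir: str, cookies_browser: Optional[str] = None) -> str:
    """Download the audio track and return the path yt-dlp wrote it to."""
    import yt_dlp  # imported lazily so the options helper works without yt-dlp

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    with yt_dlp.YoutubeDL(build_ydl_options(output_dir, cookies_browser)) as ydl:
        info = ydl.extract_info(url, download=True)
        return ydl.prepare_filename(info)
```

Calling `download_audio(url, "./output", cookies_browser="chrome")` mirrors the cookie retry shown above.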
Run scripts/transcribe.py <audio> --model <size> --language <lang>.
Models are downloaded automatically on first use; the Disk column shows the space each one needs:
| Model | RAM | Disk | Speed | Quality | Best For |
|---|---|---|---|---|---|
| small | ~2GB | 461MB | ~475 fps | ~90% | Quick tests |
| medium | ~5GB | 1.42GB | ~165 fps | ~95% ✅ | Recommended |
| large-v3 | ~10GB | 2.88GB | ~80 fps | ~97% | Best accuracy |
| large-v3-turbo | ~6GB | 1.6GB | ~120 fps | ~96% | Good balance |
⚠️ Windows note: With <16GB RAM, large-v3 may be killed (SIGKILL). Fall back to medium.
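The fallback rule can be applied before transcription starts rather than after a kill. The sketch below copies the approximate RAM figures from the table; `pick_model` is a hypothetical helper, and the step-down check against total RAM is an added assumption, not documented behavior of transcribe.py.

```python
# Approximate RAM needs in GB, copied from the model table.
MODEL_RAM_GB = {"small": 2, "medium": 5, "large-v3-turbo": 6, "large-v3": 10}


def pick_model(requested: str, total_ram_gb: float) -> str:
    """Return the requested model, or a smaller one when RAM looks insufficient."""
    if requested not in MODEL_RAM_GB:
        raise ValueError(f"Unknown model: {requested}")
    # Documented rule: large-v3 is risky below 16 GB of total RAM, so use medium.
    if requested == "large-v3" and total_ram_gb < 16:
        requested = "medium"
    # Assumed extra check: step down until the model's approximate need fits.
    limit = MODEL_RAM_GB[requested]
    for name in sorted(MODEL_RAM_GB, key=MODEL_RAM_GB.get, reverse=True):
        if MODEL_RAM_GB[name] <= limit and MODEL_RAM_GB[name] < total_ram_gb:
            return name
    raise MemoryError(f"No model fits in {total_ram_gb} GB of RAM")
```

The total RAM value is passed in explicitly so the helper has no dependency on a system-inspection library.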
Output formats: txt, srt, vtt, json (default: all).
See references/asr_models.md for full model comparison.
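Of the output formats, srt is the one with the strictest layout. The following sketch shows how timed segments map to SRT blocks, assuming each segment is a dict with `start`, `end` (seconds), and `text` keys, which is the shape openai-whisper returns under `result["segments"]`. The function names are hypothetical.

```python
from typing import Iterable, Mapping


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: Iterable[Mapping]) -> str:
    """Render segments as numbered SRT blocks separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```

VTT differs mainly in its `WEBVTT` header line and in using a period instead of a comma before the milliseconds.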
After transcription, read the .txt output and apply corrections. Key calibration categories:
See references/calibration_guide.md for the full 30+ pattern library.
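Recurring homophone errors can also be caught with a deterministic pass before or after the LLM step. The sketch below is a plain dictionary substitution, not the LLM-based calibration this skill describes, and the two correction pairs are hypothetical examples for illustration; they are not taken from calibration_guide.md.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical ASR homophone errors in financial vocabulary (illustration only).
CORRECTIONS: Dict[str, str] = {
    "市赢率": "市盈率",  # price-to-earnings ratio
    "股息绿": "股息率",  # dividend yield
}


def calibrate(text: str, corrections: Dict[str, str] = CORRECTIONS) -> Tuple[str, List[Tuple[str, str]]]:
    """Apply known corrections and return the new text plus a log of the changes."""
    if not corrections:
        return text, []
    # Longest keys first so a longer phrase wins over a shorter one it contains.
    keys = sorted(corrections, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    applied: List[Tuple[str, str]] = []

    def replace(match: "re.Match[str]") -> str:
        wrong = match.group(0)
        applied.append((wrong, corrections[wrong]))
        return corrections[wrong]

    return pattern.sub(replace, text), applied
```

Returning the change log makes it possible to show the reader which terms were altered.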
Present the calibrated text. Always include:
| Platform | Status | Notes |
|---|---|---|
| Bilibili | ✅ | Audio-only streams available without login. 720P+ video needs cookies. |
| YouTube | ✅ | Full support. Cookies may improve format selection. |
| Douyin/TikTok | ✅ | Via yt-dlp |
| All yt-dlp sites | ✅ | 1800+ supported platforms |
scripts/transcribe.py is designed for backend extensibility:
- MODEL_SIZES dict
- transcribe_<backend>() function

Planned backends: faster-whisper (CTranslate2), whisper.cpp (native C++), Cloud APIs (AssemblyAI, iFlytek).
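One way such an extension point can be wired is a registry keyed by backend name. This is a sketch under assumptions: the real structure of MODEL_SIZES and the signature of the transcribe_<backend>() functions in scripts/transcribe.py are not shown in this document, so `register_backend`, `BACKENDS`, the dispatcher, and the `echo` stand-in backend are all hypothetical.

```python
from typing import Callable, Dict, List

# Assumed shape: backend name -> model sizes that backend accepts.
MODEL_SIZES: Dict[str, List[str]] = {}
# Backend name -> function taking (audio_path, model, language).
BACKENDS: Dict[str, Callable[[str, str, str], dict]] = {}


def register_backend(name: str, sizes: List[str]):
    """Decorator that records a backend's model sizes and its transcribe function."""
    def decorator(func: Callable[[str, str, str], dict]):
        MODEL_SIZES[name] = list(sizes)
        BACKENDS[name] = func
        return func
    return decorator


def transcribe(audio_path: str, backend: str, model: str, language: str = "zh") -> dict:
    """Validate the backend and model, then dispatch to the registered function."""
    if backend not in BACKENDS:
        raise ValueError(f"Unknown backend: {backend}")
    if model not in MODEL_SIZES[backend]:
        raise ValueError(f"Model {model!r} is not available for {backend}")
    return BACKENDS[backend](audio_path, model, language)


@register_backend("echo", ["tiny"])
def transcribe_echo(audio_path: str, model: str, language: str) -> dict:
    """Stand-in backend that returns a placeholder result instead of running ASR."""
    return {"text": f"[{model}/{language}] {audio_path}", "segments": []}
```

With this layout, adding a backend touches only its own function and registration line; the dispatcher stays unchanged.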
| Problem | Solution |
|---|---|
| SIGKILL during transcription | Model too large. Use --model medium or --model small. |
| yt-dlp download fails | Update yt-dlp: pip install -U yt-dlp. Try with cookies. |
| "No subtitles found" | Expected. This skill uses ASR, not built-in captions. |
| ffmpeg not found | Run install_deps.py (handles Windows non-PATH detection). |
| GPU not utilized | openai-whisper uses the GPU only when a CUDA-enabled PyTorch build is installed; otherwise it runs on CPU. Install a CUDA build of PyTorch, or use faster-whisper. |
| Video Duration | Model | Time | RAM Peak | Accuracy |
|---|---|---|---|---|
| 6 min (Bilibili) | small | ~1m 17s | ~2.5GB | ~90% |
| 6 min (Bilibili) | medium | ~4m 30s | ~6GB | ~95% |
| 13 min (Bilibili) | medium | ~8m | ~6.5GB | ~95% |
Tested on Windows 11, Intel i7, 16GB RAM. Performance may vary by CPU speed.
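The benchmark rows can be turned into a rough time estimate for longer videos. The sketch below computes the real-time factor (processing time divided by audio duration) for each row and averages it per model; it assumes processing time scales linearly with duration on comparable hardware, which three data points cannot confirm. The function names are hypothetical.

```python
# (model, audio seconds, processing seconds), taken from the benchmark table.
BENCHMARKS = [
    ("small", 6 * 60, 77),    # 6 min video, ~1m 17s
    ("medium", 6 * 60, 270),  # 6 min video, ~4m 30s
    ("medium", 13 * 60, 480), # 13 min video, ~8m
]


def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Processing time per second of audio; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


def estimate_seconds(model: str, audio_seconds: float) -> float:
    """Estimate processing time using the mean real-time factor for the model."""
    factors = [
        real_time_factor(audio, processing)
        for name, audio, processing in BENCHMARKS
        if name == model
    ]
    if not factors:
        raise ValueError(f"No benchmark data for model: {model}")
    return audio_seconds * sum(factors) / len(factors)
```

On these figures the medium model runs at roughly 0.6 to 0.75 times real time, so a one-hour video would take on the order of 40 minutes on the test machine.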