Video Subtitle Extractor


Cross-platform video subtitle extraction using ASR (speech-to-text). Downloads audio from video URLs via yt-dlp, transcribes it with openai-whisper (small/medium/large-v3), and applies LLM-based text calibration for Chinese financial/technical content. Use when: (1) extracting subtitles from Bilibili, YouTube, or any other yt-dlp-supported platform, (2) the video has no built-in subtitles, (3) the user says "下载字幕" (download subtitles), "提取字幕" (extract subtitles), "语音转文字" (speech to text), "视频转文字" (video to text), "字幕提取" (subtitle extraction), or "ASR转写" (ASR transcription), (4) audio files need to be transcribed to text, (5) Chinese-language video content requires high-accuracy transcription. Automatically handles dependency installation (ffmpeg, yt-dlp, openai-whisper) and model downloads.

Install

openclaw skills install video-subtitle-extractor

Video Subtitle Extractor 🎬→📝

Cross-platform ASR subtitle extraction pipeline. Downloads audio from any yt-dlp-compatible video platform, transcribes with openai-whisper, and applies LLM-based text calibration for Chinese content.

Tested & verified on Windows 11 with real Bilibili videos (medium model, ~95% accuracy for Chinese).

Quick Start

# One-command full pipeline
python scripts/run.py <video_url> --model medium --language zh --output-dir ./output

# Download audio only
python scripts/download_audio.py <video_url> <output_dir>

# Transcribe existing audio
python scripts/transcribe.py <audio_file> --model medium --language zh

When to Use This Skill

Use this skill when:

The video has no built-in subtitles (Bilibili, YouTube, etc.)
You need high-accuracy Chinese transcription (~95% with medium model)
You want multiple output formats (TXT, SRT, VTT, JSON)
You need LLM-assisted text calibration for financial/technical terms
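The user says: "下载字幕" (download subtitles), "提取字幕" (extract subtitles), "语音转文字" (speech to text), "视频转文字" (video to text), "字幕提取" (subtitle extraction), or "ASR转写" (ASR transcription)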

Workflow

Step 0: Install Dependencies (once)

python scripts/install_deps.py

Detects the OS and installs ffmpeg (via winget, brew, or apt), yt-dlp (via pip), and openai-whisper (via pip). On Windows, it also locates ffmpeg when the binary is not on PATH.
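The OS detection described above could look roughly like the sketch below. It is not the actual contents of install_deps.py: the `FFMPEG_INSTALL_COMMANDS` mapping, the exact package-manager arguments, and both function names are assumptions made for illustration.

```python
import platform
import shutil

# Hypothetical mapping from platform.system() values to an ffmpeg install
# command. The real install_deps.py may use different arguments.
FFMPEG_INSTALL_COMMANDS = {
    "Windows": ["winget", "install", "ffmpeg"],
    "Darwin": ["brew", "install", "ffmpeg"],
    "Linux": ["sudo", "apt-get", "install", "-y", "ffmpeg"],
}


def ffmpeg_install_command(system=None):
    """Return the install command for the given (or current) OS."""
    system = system or platform.system()
    if system not in FFMPEG_INSTALL_COMMANDS:
        raise ValueError(f"Unsupported OS: {system}")
    return FFMPEG_INSTALL_COMMANDS[system]


def missing_tools(tools=("ffmpeg", "yt-dlp")):
    """List the command-line tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    print("Missing tools:", missing_tools())
    print("ffmpeg install command:", " ".join(ffmpeg_install_command()))
```

Checking with `shutil.which` only covers tools on PATH; the Windows non-PATH detection mentioned above would need an additional search of known install directories.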

Step 1: Download Audio

Run scripts/download_audio.py <url> [output_dir].

Uses yt-dlp to extract the best available audio format (m4a preferred). Supports Bilibili, YouTube, and 1800+ yt-dlp-compatible platforms. The script automatically detects ffmpeg even when not in system PATH.

If download fails: the video may require cookies. Try:

yt-dlp --cookies-from-browser chrome <url>
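For callers who prefer the yt-dlp Python API over the command line, the download step might be sketched as follows. This is an illustration, not the code in scripts/download_audio.py: the function names, the output template, and the format selector string are assumptions, chosen to match the "m4a preferred" behaviour described above.

```python
from pathlib import Path


def build_ydl_options(output_dir, cookies_browser=None, ffmpeg_location=None):
    """Build yt-dlp options that prefer an m4a audio-only stream."""
    options = {
        # Prefer m4a audio, then any audio-only stream, then the best format.
        "format": "bestaudio[ext=m4a]/bestaudio/best",
        "outtmpl": str(Path(output_dir) / "%(id)s.%(ext)s"),
        "noplaylist": True,
    }
    if cookies_browser:
        # Python-API equivalent of --cookies-from-browser <browser>.
        options["cookiesfrombrowser"] = (cookies_browser,)
    if ffmpeg_location:
        # Point yt-dlp at ffmpeg when it is not on PATH.
        options["ffmpeg_location"] = ffmpeg_location
    return options


def download_audio(url, output_dir, **kwargs):
    """Download the audio track and return the path of the saved file."""
    import yt_dlp  # imported lazily so the option builder works without it

    with yt_dlp.YoutubeDL(build_ydl_options(output_dir, **kwargs)) as ydl:
        info = ydl.extract_info(url, download=True)
        return ydl.prepare_filename(info)
```

Retrying with `cookies_browser="chrome"` after a failed first attempt mirrors the cookie fallback suggested above.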

Step 2: ASR Transcription

Run scripts/transcribe.py <audio> --model <size> --language <lang>.

Models are downloaded automatically on first use; make sure the disk space listed below is available:

Model	RAM	Disk	Speed	Quality	Best For
`small`	~2GB	461MB	~475 fps	~90%	Quick tests
`medium`	~5GB	1.42GB	~165 fps	~95% ✅	Recommended
`large-v3`	~10GB	2.88GB	~80 fps	~97%	Best accuracy
`large-v3-turbo`	~6GB	1.6GB	~120 fps	~96%	Good balance

⚠️ Memory note (observed on Windows): with less than 16GB RAM, large-v3 may exhaust memory and be terminated by the OS (reported as SIGKILL on Linux/macOS). Fall back to medium.
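The fallback advice can be turned into a small selection helper. The sketch below copies the approximate RAM figures from the table above; the function name `pick_model` and the 2GB headroom default are assumptions for illustration, not part of the skill's scripts.

```python
# Approximate RAM needs in GB, copied from the model table above.
MODEL_RAM_GB = {"small": 2, "medium": 5, "large-v3-turbo": 6, "large-v3": 10}

# Preference order by the table's Quality column, best first.
PREFERENCE = ["large-v3", "large-v3-turbo", "medium", "small"]


def pick_model(available_gb, headroom_gb=2.0):
    """Return the best model whose RAM need plus headroom fits in memory."""
    for name in PREFERENCE:
        if MODEL_RAM_GB[name] + headroom_gb <= available_gb:
            return name
    # Nothing fits comfortably: fall back to the smallest listed model.
    return "small"


if __name__ == "__main__":
    for free_gb in (3, 7.5, 8, 16):
        print(f"{free_gb} GB free -> {pick_model(free_gb)}")
```

Note that `available_gb` means free memory, not installed memory, so a 16GB machine with other applications running may still land on medium.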

Output formats: txt, srt, vtt, json (default: all).
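As an illustration of how the SRT output can be derived, the sketch below converts Whisper-style segments (dicts with `start`, `end`, and `text`, as returned by openai-whisper's `transcribe`) into SRT text. The helper names are assumptions; scripts/transcribe.py may structure this differently.

```python
def format_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def transcribe_to_srt(audio_path, model_name="medium", language="zh"):
    """Transcribe an audio file with openai-whisper and return SRT text."""
    import whisper  # imported lazily; requires openai-whisper and ffmpeg

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, language=language)
    return segments_to_srt(result["segments"])
```

VTT output differs mainly in its `WEBVTT` header and in using a period instead of a comma before the milliseconds.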

See references/asr_models.md for full model comparison.

Step 3: LLM Text Calibration

After transcription, read the .txt output and apply corrections. Key calibration categories:

Homophone fixes (同音字): 硬钢→硬扛, 模→磨, 骨→股
Company/product names: Deepseat→DeepSeek, 中繼續創→中际旭创, HPM→HBM
Financial terms: 抛押→抛压, 互盘→护盘, 筹码, K线收14星→K线收十字星
Common substitutions: 跟锋→跟风, 微转→微赚, 落带为安→落袋为安
Traditional→Simplified: If the model outputs Traditional Chinese, convert it to Simplified
Structural cleanup: Add paragraph breaks at topic shifts, format as prose

See references/calibration_guide.md for the full 30+ pattern library.
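The unambiguous multi-character patterns above can be applied deterministically before (or alongside) the LLM pass. The sketch below is one possible approach, assuming a hypothetical `CORRECTIONS` table built from a few of the listed examples. Single-character homophones such as 模→磨 or 骨→股 are deliberately left out, since they depend on context and are better left to the LLM.

```python
from collections import Counter

# Hypothetical correction table using examples from the list above.
# Only multi-character patterns that are safe to replace blindly.
CORRECTIONS = {
    "homophone": {"硬钢": "硬扛"},
    "names": {"Deepseat": "DeepSeek", "中繼續創": "中际旭创"},
    "financial": {"抛押": "抛压", "互盘": "护盘"},
    "common": {"跟锋": "跟风", "微转": "微赚", "落带为安": "落袋为安"},
}


def apply_corrections(text, corrections=CORRECTIONS):
    """Replace known mis-transcriptions and count fixes per category."""
    counts = Counter()
    for category, table in corrections.items():
        for wrong, right in table.items():
            hits = text.count(wrong)
            if hits:
                text = text.replace(wrong, right)
                counts[category] += hits
    return text, dict(counts)


if __name__ == "__main__":
    fixed, counts = apply_corrections("Deepseat大涨，先落带为安，别跟锋")
    print(fixed)
    print(counts)
```

The per-category counts feed directly into the correction summary that the delivery step asks for.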

Step 4: Deliver Results

Present the calibrated text. Always include:

Model used (small, medium, large-v3, or large-v3-turbo) and quality notes
Any sections with low confidence or unclear audio
Summary of corrections applied (counts by category)
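One way to assemble these items is sketched below. It assumes Whisper-style segment dicts, which in openai-whisper carry `avg_logprob` and `no_speech_prob` fields; the thresholds mirror that library's default `logprob_threshold` and `no_speech_threshold`, while the function names and the note layout are illustrative assumptions.

```python
def flag_low_confidence(segments, logprob_threshold=-1.0, no_speech_threshold=0.6):
    """Return (start, end, text) for segments that look unreliable."""
    flagged = []
    for seg in segments:
        unsure = seg.get("avg_logprob", 0.0) < logprob_threshold
        maybe_silence = seg.get("no_speech_prob", 0.0) > no_speech_threshold
        if unsure or maybe_silence:
            flagged.append((seg["start"], seg["end"], seg["text"].strip()))
    return flagged


def delivery_note(model, counts, flagged):
    """Summarize the model, correction counts, and flagged segments."""
    lines = [f"Model: {model}"]
    lines.append(f"Corrections applied: {sum(counts.values())}")
    for category, n in sorted(counts.items()):
        lines.append(f"  {category}: {n}")
    lines.append(f"Low-confidence segments: {len(flagged)}")
    for start, end, text in flagged:
        lines.append(f"  [{start:.1f}s-{end:.1f}s] {text}")
    return "\n".join(lines)
```

The `counts` argument is expected to be a category-to-count mapping such as `{"names": 1, "common": 2}`.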

Platform Support

Platform	Status	Notes
Bilibili	✅	Audio-only streams available without login. 720P+ video needs cookies.
YouTube	✅	Full support. Cookies may improve format selection.
Douyin/TikTok	✅	Via yt-dlp
All yt-dlp sites	✅	1800+ supported platforms

Extending with New ASR Models

scripts/transcribe.py is designed for backend extensibility:

Add model info to MODEL_SIZES dict
Implement transcribe_<backend>() function
Add CLI flag in argparse

Planned backends: faster-whisper (CTranslate2), whisper.cpp (native C++), Cloud APIs (AssemblyAI, iFlytek).
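The three extension steps could fit together along the lines of the sketch below. The layout of `MODEL_SIZES`, the `BACKENDS` registry, and the `--backend` flag are assumptions about how scripts/transcribe.py might be organized, not its actual contents. The faster-whisper call uses that library's `WhisperModel.transcribe`, which returns a segment generator and an info object.

```python
import argparse

# Hypothetical shape for MODEL_SIZES: supported model names per backend.
MODEL_SIZES = {
    "whisper": ["small", "medium", "large-v3", "large-v3-turbo"],
    "faster-whisper": ["small", "medium", "large-v3"],
}


def transcribe_whisper(audio, model, language):
    import whisper  # openai-whisper

    return whisper.load_model(model).transcribe(audio, language=language)["text"]


def transcribe_faster_whisper(audio, model, language):
    from faster_whisper import WhisperModel

    segments, _info = WhisperModel(model).transcribe(audio, language=language)
    return "".join(seg.text for seg in segments)


# Registry mapping the CLI flag value to its transcribe_<backend>() function.
BACKENDS = {
    "whisper": transcribe_whisper,
    "faster-whisper": transcribe_faster_whisper,
}


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Transcribe an audio file")
    parser.add_argument("audio")
    parser.add_argument("--backend", choices=sorted(BACKENDS), default="whisper")
    parser.add_argument("--model", default="medium")
    parser.add_argument("--language", default="zh")
    args = parser.parse_args(argv)
    if args.model not in MODEL_SIZES[args.backend]:
        parser.error(f"{args.backend} does not support model {args.model}")
    return args


if __name__ == "__main__":
    args = parse_args()
    print(BACKENDS[args.backend](args.audio, args.model, args.language))
```

Keeping the backend imports inside each function means a missing optional dependency only fails when that backend is actually selected.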

Troubleshooting

Problem	Solution
Process killed (SIGKILL or out of memory) during transcription	Model too large for available RAM. Use `--model medium` or `--model small`.
yt-dlp download fails	Update yt-dlp: `pip install -U yt-dlp`. Try with cookies.
"No subtitles found"	Expected. This skill uses ASR, not built-in captions.
ffmpeg not found	Run `install_deps.py` (handles Windows non-PATH detection).
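GPU not utilized	openai-whisper uses the GPU only when a CUDA-enabled PyTorch build is installed; otherwise it runs on CPU. Install a CUDA build of PyTorch, or install `faster-whisper` for GPU.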

Performance Benchmarks (Tested)

Video Duration	Model	Time	RAM Peak	Accuracy
6 min (Bilibili)	small	~1m 17s	~2.5GB	~90%
6 min (Bilibili)	medium	~4m 30s	~6GB	~95%
13 min (Bilibili)	medium	~8m	~6.5GB	~95%

Tested on Windows 11, Intel i7, 16GB RAM. Performance may vary by CPU speed.
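These figures can be turned into a rough real-time factor (processing time divided by audio duration) for estimating longer jobs. The sketch below uses only the three benchmark rows above; the function names are illustrative, and any estimate assumes hardware comparable to the test machine.

```python
# (model, audio seconds, processing seconds) from the benchmark table above.
BENCHMARKS = [
    ("small", 6 * 60, 1 * 60 + 17),
    ("medium", 6 * 60, 4 * 60 + 30),
    ("medium", 13 * 60, 8 * 60),
]


def real_time_factor(audio_seconds, processing_seconds):
    """Processing time per second of audio; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


def estimate_minutes(model, audio_minutes):
    """Estimate processing time using the slowest measured factor."""
    factors = [real_time_factor(a, p) for m, a, p in BENCHMARKS if m == model]
    if not factors:
        raise ValueError(f"No benchmark for model {model}")
    return audio_minutes * max(factors)


if __name__ == "__main__":
    for model, audio, processing in BENCHMARKS:
        rtf = real_time_factor(audio, processing)
        print(f"{model}: {audio // 60} min audio, factor {rtf:.2f}")
    print(f"60 min with medium: about {estimate_minutes('medium', 60):.0f} min")
```

On these numbers, small runs at roughly 0.21x real time and medium at roughly 0.6x to 0.75x, so a one-hour video with medium would take on the order of 45 minutes on similar hardware.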