Video Subtitle Extractor


Cross-platform video subtitle extraction using ASR (speech-to-text). Downloads audio from video URLs via yt-dlp, transcribes it with openai-whisper (small, medium, large-v3, or large-v3-turbo), and applies LLM-based text calibration for Chinese financial and technical content. Use when: (1) extracting subtitles from Bilibili, YouTube, or any other yt-dlp-supported platform, (2) the video has no built-in subtitles, (3) the user says "下载字幕" (download subtitles), "提取字幕" (extract subtitles), "语音转文字" (speech to text), "视频转文字" (video to text), "字幕提取" (subtitle extraction), or "ASR转写" (ASR transcription), (4) audio files need to be transcribed to text, (5) Chinese-language video content requires high-accuracy transcription. Automatically handles dependency installation (ffmpeg, yt-dlp, openai-whisper) and model downloads.

Install

openclaw skills install video-subtitle-extractor

Video Subtitle Extractor 🎬→📝

Cross-platform ASR subtitle extraction pipeline. Downloads audio from any yt-dlp-compatible video platform, transcribes with openai-whisper, and applies LLM-based text calibration for Chinese content.

Tested & verified on Windows 11 with real Bilibili videos (medium model, ~95% accuracy for Chinese).

Quick Start

# One-command full pipeline
python scripts/run.py <video_url> --model medium --language zh --output-dir ./output

# Download audio only
python scripts/download_audio.py <video_url> <output_dir>

# Transcribe existing audio
python scripts/transcribe.py <audio_file> --model medium --language zh

When to Use This Skill

Use this skill when:

  1. The video has no built-in subtitles (Bilibili, YouTube, etc.)
  2. You need high-accuracy Chinese transcription (~95% with medium model)
  3. You want multiple output formats (TXT, SRT, VTT, JSON)
  4. You need LLM-assisted text calibration for financial/technical terms
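  5. The user says: "下载字幕" (download subtitles), "提取字幕" (extract subtitles), "语音转文字" (speech to text), "视频转文字" (video to text), "字幕提取" (subtitle extraction), or "ASR转写" (ASR transcription)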

Workflow

Step 0: Install Dependencies (once)

python scripts/install_deps.py

Detects the operating system and installs ffmpeg (via winget, brew, or apt), yt-dlp (pip), and openai-whisper (pip). On Windows, it locates ffmpeg even when the binary is not on PATH.
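The detection logic inside install_deps.py is not shown here, so the following is only a minimal sketch of one way to locate ffmpeg when it is missing from PATH. The function name `find_ffmpeg` and the `extra_dirs` parameter are hypothetical, and the caller is assumed to supply the fallback directories.

```python
import os
import shutil
from pathlib import Path
from typing import Iterable, Optional


def find_ffmpeg(extra_dirs: Iterable[str] = ()) -> Optional[str]:
    """Return a path to an ffmpeg executable, or None if none is found.

    Hypothetical helper: `extra_dirs` lists install locations to probe
    when ffmpeg is not reachable through PATH.
    """
    # 1. Normal case: ffmpeg is already on PATH.
    on_path = shutil.which("ffmpeg")
    if on_path:
        return on_path

    # 2. Fallback: look inside directories that are not on PATH.
    exe_name = "ffmpeg.exe" if os.name == "nt" else "ffmpeg"
    for directory in extra_dirs:
        candidate = Path(directory) / exe_name
        if candidate.is_file():
            return str(candidate)
    return None
```

A caller might pass a directory such as `C:\ffmpeg\bin` (an illustrative path, not one the skill is known to use) and hand the result to yt-dlp through its `--ffmpeg-location` option.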

Step 1: Download Audio

Run scripts/download_audio.py <url> [output_dir].

Uses yt-dlp to extract the best available audio format (m4a preferred). Supports Bilibili, YouTube, and the 1800+ other sites that yt-dlp handles. The script detects ffmpeg automatically, even when it is not on the system PATH.

If download fails: the video may require cookies. Try:

yt-dlp --cookies-from-browser chrome <url>
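For illustration, the download step could be assembled roughly as below. This is a sketch, not the contents of download_audio.py: `build_ytdlp_command` and its parameters are hypothetical names, and the output template is an assumption. The yt-dlp options themselves (`-f`, `--no-playlist`, `-o`, `--ffmpeg-location`, `--cookies-from-browser`) are standard ones.

```python
from typing import List, Optional


def build_ytdlp_command(
    url: str,
    output_dir: str,
    ffmpeg_path: Optional[str] = None,
    cookies_browser: Optional[str] = None,
) -> List[str]:
    """Build a yt-dlp command line that fetches audio only (hypothetical helper)."""
    cmd = [
        "yt-dlp",
        # Prefer an m4a audio stream, otherwise take the best audio available.
        "-f", "bestaudio[ext=m4a]/bestaudio",
        "--no-playlist",
        # Assumed naming scheme: <output_dir>/<video id>.<extension>
        "-o", f"{output_dir}/%(id)s.%(ext)s",
    ]
    if ffmpeg_path:
        # Point yt-dlp at an ffmpeg binary that is not on PATH.
        cmd += ["--ffmpeg-location", ffmpeg_path]
    if cookies_browser:
        # Retry path for videos that require a logged-in session.
        cmd += ["--cookies-from-browser", cookies_browser]
    cmd.append(url)
    return cmd
```

The list can be passed to `subprocess.run(cmd, check=True)`. If the first attempt fails, calling the builder again with `cookies_browser="chrome"` reproduces the retry suggested above.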

Step 2: ASR Transcription

Run scripts/transcribe.py <audio> --model <size> --language <lang>.

Models are auto-downloaded on first use (disk space required):

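| Model | RAM | Disk | Speed | Quality | Best For |
|---|---|---|---|---|---|
| small | ~2GB | 461MB | ~475 fps | ~90% | Quick tests |
| medium | ~5GB | 1.42GB | ~165 fps | ~95% | Recommended |
| large-v3 | ~10GB | 2.88GB | ~80 fps | ~97% | Best accuracy |
| large-v3-turbo | ~6GB | 1.6GB | ~120 fps | ~96% | Good balance |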

⚠️ Windows note: With less than 16GB of RAM, large-v3 may run out of memory and be terminated mid-run (on Linux and macOS this shows up as SIGKILL). Fall back to medium.
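The RAM figures listed for each model can drive an automatic choice. The sketch below is an assumption about how that might be done, not part of the skill: `pick_model`, `MODEL_RAM_GB`, and the 2 GB headroom default are all hypothetical, and the RAM values are the approximate ones quoted for each model.

```python
from typing import List, Optional, Tuple

# (model name, approximate RAM in GB), ordered from highest to lowest quality.
MODEL_RAM_GB: List[Tuple[str, float]] = [
    ("large-v3", 10),
    ("large-v3-turbo", 6),
    ("medium", 5),
    ("small", 2),
]


def pick_model(available_ram_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Return the highest-quality model that fits, or None if none fits.

    `headroom_gb` is an assumed safety margin for the OS and other processes.
    """
    for name, ram_gb in MODEL_RAM_GB:
        if available_ram_gb >= ram_gb + headroom_gb:
            return name
    return None
```

With about 7.5 GB free this selects medium, which lines up with the fallback advice for machines that cannot hold large-v3.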

Output formats: txt, srt, vtt, json (default: all).
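As an illustration of the SRT output, the sketch below converts transcription segments into SubRip text. It assumes each segment is a dict with `start`, `end` (seconds), and `text` keys, which is the shape openai-whisper returns under `result["segments"]`. The function names are hypothetical and the skill's own writer may differ.

```python
from typing import Dict, Iterable


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: Iterable[Dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        cues.append(f"{index}\n{time_range}\n{seg['text'].strip()}\n")
    return "\n".join(cues)
```

VTT differs mainly in using a `WEBVTT` header line and a period instead of a comma before the milliseconds, so the same helper can be adapted with small changes.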

See references/asr_models.md for full model comparison.

Step 3: LLM Text Calibration

After transcription, read the .txt output and apply corrections. Key calibration categories:

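  1. Homophone fixes (同音字): 硬钢→硬扛 (tough it out), 模→磨 (grind sideways), 骨→股 (stock)
  2. Company/product names: Deepseat→DeepSeek, 中繼續創→中际旭创 (Zhongji Innolight), HPM→HBM
  3. Financial terms: 抛押→抛压 (selling pressure), 护盘 (propping up the market; not 互盘), 筹码 (chips, i.e. share holdings), K线收十字星 (the candlestick closes as a doji; not 14星)
  4. Common substitutions: 跟锋→跟风 (follow the crowd), 微转→微赚 (small gain), 落带为安→落袋为安 (lock in profits)
  5. Traditional→Simplified: If the model outputs Traditional Chinese, convert it to Simplified Chinese
  6. Structural cleanup: Add paragraph breaks at topic shifts and format the text as prose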

See references/calibration_guide.md for the full 30+ pattern library.
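A literal find-and-replace pass can pre-apply the unambiguous patterns before the LLM reviews the text. The sketch below is a crude stand-in for that pre-pass, not the skill's calibration logic: `CALIBRATION_RULES` and `calibrate` are hypothetical names, and only multi-character patterns from the categories above are included. Single-character fixes such as 模→磨, and short patterns like 微转 that can occur inside correct phrases, depend on context and are deliberately left to the LLM.

```python
from collections import Counter
from typing import Dict, Tuple

# Hypothetical rule table: category -> {misheard text: corrected text}.
CALIBRATION_RULES: Dict[str, Dict[str, str]] = {
    "homophone": {"硬钢": "硬扛"},
    "names": {"Deepseat": "DeepSeek", "中繼續創": "中际旭创", "HPM": "HBM"},
    "financial": {"抛押": "抛压", "互盘": "护盘", "14星": "十字星"},
    "substitution": {"跟锋": "跟风", "落带为安": "落袋为安"},
}


def calibrate(text: str) -> Tuple[str, Counter]:
    """Apply literal replacements and count the corrections per category."""
    counts: Counter = Counter()
    for category, rules in CALIBRATION_RULES.items():
        for wrong, right in rules.items():
            hits = text.count(wrong)
            if hits:
                text = text.replace(wrong, right)
                counts[category] += hits
    return text, counts
```

The returned counter gives the per-category totals that the delivery step asks for.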

Step 4: Deliver Results

Present the calibrated text. Always include:

  • Model used (small, medium, large-v3, or large-v3-turbo) and quality notes
  • Any sections with low confidence or unclear audio
  • Summary of corrections applied (counts by category)
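One way to find the low-confidence sections is to inspect the per-segment statistics that openai-whisper includes in its JSON output (`avg_logprob` and `no_speech_prob`). The sketch below is an assumption about how to use them. `flag_low_confidence` is a hypothetical helper, and its thresholds are borrowed from the default values of whisper's `logprob_threshold` (-1.0) and `no_speech_threshold` (0.6), which are starting points rather than tuned values.

```python
from typing import Dict, Iterable, List, Tuple


def flag_low_confidence(
    segments: Iterable[Dict],
    logprob_floor: float = -1.0,
    no_speech_ceiling: float = 0.6,
) -> List[Tuple[float, float, str]]:
    """Return (start, end, text) for segments that look unreliable."""
    flagged = []
    for seg in segments:
        low_logprob = seg.get("avg_logprob", 0.0) < logprob_floor
        likely_silence = seg.get("no_speech_prob", 0.0) > no_speech_ceiling
        if low_logprob or likely_silence:
            flagged.append((seg["start"], seg["end"], seg["text"].strip()))
    return flagged
```

The flagged time ranges can be listed alongside the calibrated text so the reader knows which passages to check against the audio.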

Platform Support

| Platform | Notes |
|---|---|
| Bilibili | Audio-only streams available without login. 720P+ video needs cookies. |
| YouTube | Full support. Cookies may improve format selection. |
| Douyin/TikTok | Via yt-dlp |
| All yt-dlp sites | 1800+ supported platforms |

Extending with New ASR Models

scripts/transcribe.py is designed for backend extensibility:

  1. Add model info to MODEL_SIZES dict
  2. Implement transcribe_<backend>() function
  3. Add CLI flag in argparse

Planned backends: faster-whisper (CTranslate2), whisper.cpp (native C++), Cloud APIs (AssemblyAI, iFlytek).
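The three extension steps can be pictured with a small dispatch sketch. The real layout of transcribe.py is not shown here, so everything below is an assumption: the shape of `MODEL_SIZES`, the `BACKENDS` table, the `--backend` flag, and the toy `echo` backend are hypothetical and exist only to show the wiring. The openai-whisper calls (`whisper.load_model`, `model.transcribe`) are the library's own.

```python
import argparse
from typing import Callable, Dict, List

# Step 1 (hypothetical layout): model names each backend accepts.
MODEL_SIZES: Dict[str, List[str]] = {
    "whisper": ["small", "medium", "large-v3", "large-v3-turbo"],
    "echo": ["demo"],  # toy backend, no real ASR
}


def transcribe_whisper(audio_path: str, model: str, language: str) -> dict:
    """Default backend: openai-whisper, imported lazily."""
    import whisper
    return whisper.load_model(model).transcribe(audio_path, language=language)


# Step 2: a new backend is one more transcribe_<backend>() function.
def transcribe_echo(audio_path: str, model: str, language: str) -> dict:
    """Toy backend that returns a placeholder result."""
    return {"text": f"[{model}/{language}] {audio_path}", "segments": []}


BACKENDS: Dict[str, Callable[[str, str, str], dict]] = {
    "whisper": transcribe_whisper,
    "echo": transcribe_echo,
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Transcribe an audio file")
    parser.add_argument("audio")
    # Step 3: expose the backend choice on the command line.
    parser.add_argument("--backend", choices=sorted(BACKENDS), default="whisper")
    parser.add_argument("--model", default="medium")
    parser.add_argument("--language", default="zh")
    return parser


def run(argv: List[str]) -> dict:
    args = build_parser().parse_args(argv)
    if args.model not in MODEL_SIZES[args.backend]:
        raise ValueError(f"{args.model!r} is not a {args.backend} model")
    return BACKENDS[args.backend](args.audio, args.model, args.language)
```

Keeping the whisper import inside its function means a machine that only uses another backend does not need openai-whisper installed.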

Troubleshooting

| Problem | Solution |
|---|---|
| SIGKILL during transcription | Model too large for available RAM. Use --model medium or --model small. |
| yt-dlp download fails | Update yt-dlp: pip install -U yt-dlp. Try with cookies. |
| "No subtitles found" | Expected. This skill uses ASR, not built-in captions. |
| ffmpeg not found | Run install_deps.py (handles Windows non-PATH detection). |
| GPU not utilized | openai-whisper runs on the CPU unless a CUDA-enabled PyTorch build is installed. Install one, or use faster-whisper for GPU. |

Performance Benchmarks (Tested)

| Video Duration | Model | Time | RAM Peak | Accuracy |
|---|---|---|---|---|
| 6 min (Bilibili) | small | ~1m 17s | ~2.5GB | ~90% |
| 6 min (Bilibili) | medium | ~4m 30s | ~6GB | ~95% |
| 13 min (Bilibili) | medium | ~8m | ~6.5GB | ~95% |

Tested on Windows 11, Intel i7, 16GB RAM. Performance may vary by CPU speed.
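These measurements can be turned into a rough time estimate through the real-time factor (processing time divided by audio duration). The sketch below uses only the three benchmark rows reported for this machine. The helper names are hypothetical, and the estimate is unlikely to transfer to other hardware.

```python
from typing import List, Tuple

# (model, audio seconds, processing seconds) from the benchmark rows.
BENCHMARKS: List[Tuple[str, int, int]] = [
    ("small", 6 * 60, 77),        # 6 min video, ~1m 17s
    ("medium", 6 * 60, 270),      # 6 min video, ~4m 30s
    ("medium", 13 * 60, 8 * 60),  # 13 min video, ~8m
]


def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Processing time per second of audio; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


def estimate_seconds(model: str, audio_seconds: float) -> float:
    """Conservative estimate using the slowest factor measured for the model."""
    factors = [real_time_factor(a, p) for m, a, p in BENCHMARKS if m == model]
    if not factors:
        raise ValueError(f"no benchmark for model {model!r}")
    return audio_seconds * max(factors)
```

On these numbers the medium model runs at a factor of roughly 0.62 to 0.75, so a 30-minute video would take about 22.5 minutes on comparable hardware, while small runs at about 0.21.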