YouTube Transcribe

v1.0.0

Transcribe YouTube videos with smart fallback: extracts captions first (fast, free), falls back to local Whisper transcription when no captions available. Au...

0· 333· 1 versions· 3 current· 3 all-time· Updated 13h ago· MIT-0
byLogan@iml885203

Install

openclaw skills install youtube-transcribe

YouTube Transcribe

Smart YouTube video transcription with automatic fallback:

  1. Captions first — extracts existing subtitles (manual or auto-generated) via yt-dlp. Fast, free, no compute.
  2. Whisper fallback — when no captions exist, downloads audio and transcribes locally with the best available Whisper backend.

When to Use

Use this skill when the user wants to:

  • Get a transcript or text version of a YouTube video
  • Understand what a YouTube video says without watching it
  • Summarize, analyze, or take notes from a YouTube video
  • Extract subtitles or captions from a video

Triggers

  • "transcribe this YouTube video"
  • "what does this video say"
  • "get the transcript of [YouTube URL]"
  • "summarize this YouTube video" (transcribe first, then process)
  • Any YouTube URL shared with a request to understand its content

Requirements

Required:

  • yt-dlp — for caption extraction and audio download
  • python3

For Whisper fallback (when no captions available):

  • ffmpeg — for audio processing
  • One of these Whisper backends (auto-detected in priority order):
    1. mlx-whisper — Apple Silicon native, fastest on Mac (pip install mlx-whisper)
    2. faster-whisper — CTranslate2 backend, fast on CUDA/CPU (pip install faster-whisper)
    3. openai-whisper — Original Whisper, universal fallback (pip install openai-whisper)

Usage

Basic — transcribe a video

python3 {baseDir}/scripts/transcribe.py "https://www.youtube.com/watch?v=VIDEO_ID"

Specify language for captions

python3 {baseDir}/scripts/transcribe.py "URL" --language zh

Force Whisper (skip caption check)

python3 {baseDir}/scripts/transcribe.py "URL" --force-whisper

JSON output

python3 {baseDir}/scripts/transcribe.py "URL" --format json

Save to file

python3 {baseDir}/scripts/transcribe.py "URL" --output transcript.txt

Options

FlagDefaultDescription
--languageautoPreferred subtitle/transcription language (e.g. zh, en, ja)
--formattextOutput format: text, json, srt, vtt
--outputstdoutSave transcript to file
--force-whisperfalseSkip caption extraction, go straight to Whisper
--backendautoWhisper backend: auto, mlx, faster-whisper, whisper
--modelautoWhisper model size: auto, large-v3, medium, small, base, tiny

Environment Variables

VariableDescription
YT_WHISPER_BACKENDOverride Whisper backend selection
YT_WHISPER_MODELOverride Whisper model size

Auto-Detection

Whisper Backend (priority order)

  1. MLX Whisper — detected via import mlx_whisper. Best for Apple Silicon.
  2. faster-whisper — detected via import faster_whisper. Best for CUDA GPU, good on CPU.
  3. OpenAI Whisper — detected via import whisper. Universal fallback.

Model Size (based on available RAM)

RAMModelVRAM/RAM Usage
≥16GBlarge-v3~6-10GB
≥8GBmedium~5GB
≥4GBsmall~2.5GB
<4GBbase~1.5GB

Caption Language Priority

When --language is not specified, captions are searched in this order:

  1. Video's original language
  2. Chinese variants: zh-Hant, zh-Hans, zh-TW, zh-CN, zh
  3. English: en
  4. Any available language

Output Formats

text (default)

Plain text transcript, one continuous block.

json

{
  "video_id": "ZSnYlbIYpjs",
  "title": "Video Title",
  "channel": "Channel Name",
  "duration": 708,
  "language": "zh",
  "method": "captions",
  "transcript": [
    {"start": 0.0, "end": 5.2, "text": "..."},
    ...
  ],
  "full_text": "Complete transcript as single string"
}

srt / vtt

Standard subtitle formats with timestamps.

Version tags

latestvk97a5fq2d8xdrp1777f2tmk4j982gbcd

Runtime requirements

🎬 Clawdis
Binspython3, yt-dlp

Install

Install yt-dlp (brew)
Bins: yt-dlp
brew install yt-dlp
Install ffmpeg (brew)
Bins: ffmpeg
brew install ffmpeg