Install
openclaw skills install youtube-transcript-native-node
Fetch a clean plain-text transcript from a YouTube video: native Node.js, zero npm dependencies, auditable in 5 minutes. Use when the user asks to transcribe, summarize, or extract captions from a YouTube URL. Wraps the `yt-dlp` binary (must be on PATH); writes subtitles to a temp dir, parses the .vtt, strips timestamps and HTML tags, prints clean text. No API keys required.
Minimal, auditable YouTube transcript fetcher.
Native Node.js. Zero npm dependencies. Two files. Small enough to audit in a few minutes.
Wraps the external yt-dlp binary, which must be installed and on PATH.
- Accepts only http(s) YouTube URLs on youtube.com, www.youtube.com, m.youtube.com, or youtu.be.
- Validates --lang as a simple subtitle language code before invoking yt-dlp.
- Spawns yt-dlp with an argv array and no shell; it does not execute user-provided commands.
- Runs the yt-dlp subprocess with a 120-second timeout.
- child_process warnings are expected because this skill intentionally wraps the trusted yt-dlp binary. The operator owns the PATH/binary supply-chain trust boundary.

Trigger phrases: "transcribe this YouTube video", "get the transcript", "pull the captions", "summarize this video" (paired with a follow-up summarization step).
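The URL allowlist and language-code check described above could be sketched roughly as follows. This is an illustration, not the code in scripts/fetch.mjs: the helper names `isAllowedYouTubeUrl` and `isValidLang` are hypothetical, and the exact pattern for a "simple" language code is an assumption.

```javascript
// Hypothetical helpers; the real scripts/fetch.mjs may be structured differently.
const ALLOWED_HOSTS = new Set([
  "youtube.com",
  "www.youtube.com",
  "m.youtube.com",
  "youtu.be",
]);

// Accept only http(s) URLs whose host is on the allowlist.
function isAllowedYouTubeUrl(input) {
  let parsed;
  try {
    parsed = new URL(input);
  } catch {
    return false; // not a parseable absolute URL
  }
  const schemeOk = parsed.protocol === "http:" || parsed.protocol === "https:";
  return schemeOk && ALLOWED_HOSTS.has(parsed.hostname.toLowerCase());
}

// Assumed shape of a "simple" language code: 2-3 letters plus an optional
// region or script suffix, e.g. "en", "es", "pt-BR".
function isValidLang(lang) {
  return /^[A-Za-z]{2,3}(-[A-Za-z0-9]{2,8})?$/.test(lang);
}
```

Because the language pattern requires a leading letter, a value such as `--exec` can never reach yt-dlp as if it were an option, which complements spawning without a shell.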
Use this when:
Do NOT use this when:
The script is in scripts/fetch.mjs.
Basic transcript (English, plain text):
node "<skill-dir>/scripts/fetch.mjs" --url "https://www.youtube.com/watch?v=VIDEO_ID"
Different language:
node "<skill-dir>/scripts/fetch.mjs" --url "https://www.youtube.com/watch?v=VIDEO_ID" --lang es
Keep timestamps in the output:
node "<skill-dir>/scripts/fetch.mjs" --url "https://www.youtube.com/watch?v=VIDEO_ID" --timestamps
JSON output (title + transcript + metadata):
node "<skill-dir>/scripts/fetch.mjs" --url "https://www.youtube.com/watch?v=VIDEO_ID" --json
(Where <skill-dir> is typically workspace/skills/youtube-transcript-native-node/.)
| Flag | Values | Default | Purpose |
|---|---|---|---|
| --url | YouTube URL | (required) | Video to fetch the transcript for |
| --lang | language code | en | Subtitle language (e.g. en, es, de) |
| --timestamps | flag | off | Keep [hh:mm:ss] prefixes in plain-text output |
| --json | flag | off | Output JSON: { url, title, lang, auto, timestamps, transcript } |
| --no-dedup | flag | off | Disable the rolling-window phrase dedup that runs on auto-captions. Use when the speaker deliberately repeats 3+ word phrases verbatim and you don't want those collapsed. |
| -h, --help | flag | n/a | Show help |
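For readers who want to see how these flags fit together, here is a minimal sketch of a dependency-free argument parser that follows the defaults in the table. The function name `parseFlags` and the shape of the returned object are assumptions for illustration; the actual script may parse its arguments differently.

```javascript
// Hypothetical parser for the flags listed in the table; defaults follow the table.
function parseFlags(argv) {
  const opts = {
    url: null,
    lang: "en",
    timestamps: false,
    json: false,
    dedup: true, // --no-dedup switches this off
    help: false,
  };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--url" || arg === "--lang") {
      const value = argv[++i]; // these two flags consume the next token
      if (value === undefined) throw new Error(`${arg} requires a value`);
      opts[arg.slice(2)] = value;
    } else if (arg === "--timestamps") {
      opts.timestamps = true;
    } else if (arg === "--json") {
      opts.json = true;
    } else if (arg === "--no-dedup") {
      opts.dedup = false;
    } else if (arg === "-h" || arg === "--help") {
      opts.help = true;
    } else {
      throw new Error(`unknown argument: ${arg}`);
    }
  }
  if (!opts.help && !opts.url) throw new Error("--url is required");
  return opts;
}
```

In a real script this would typically be called as `parseFlags(process.argv.slice(2))`.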
None. This skill uses no API keys, no env vars, and reads no secrets.
It does require the yt-dlp binary to be installed and on PATH.
Install yt-dlp:
winget install yt-dlp
brew install yt-dlp
Verify with:
yt-dlp --version
YouTube's auto-generated captions emit a 3-line scrolling window where each phrase appears across multiple overlapping cues. Concatenating the cues yields literal triplicate spam (e.g. "I'm about to show you I'm about to show you I'm about to show you...").
When auto: true and --timestamps is off, this skill runs a multi-pass dedup that collapses any consecutive identical 3- to 15-word phrase down to one copy. This typically reduces transcript size by 60-70% with no loss of information.
The dedup is intentionally conservative:
- Runs only when auto: true (manual captions don't have the artifact)
- Runs only when --timestamps is off (timestamps mode keeps cues separate, so dedup would scramble alignment)
- Repetition of a single word (e.g. "critical critical critical") is preserved (minimum match is 3 words)

If a speaker deliberately repeats a 3+ word phrase verbatim and you want it preserved, use --no-dedup to skip the pass entirely.
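To make the dedup idea concrete, here is a simplified sketch of collapsing immediately repeated phrases of 3 to 15 words, repeated until nothing changes. It is not the skill's actual implementation: the function name `dedupPhrases`, the whitespace tokenization, and the longest-phrase-first ordering are all assumptions.

```javascript
// Collapse immediately repeated n-word phrases (minLen..maxLen words) to one copy.
// Simplified illustration; the skill's real pass may differ in detail.
function dedupPhrases(text, minLen = 3, maxLen = 15) {
  let words = text.split(/\s+/).filter(Boolean);
  let changed = true;
  while (changed) { // keep making passes until a full pass collapses nothing
    changed = false;
    for (let n = maxLen; n >= minLen; n--) {
      const out = [];
      let i = 0;
      while (i < words.length) {
        // Does words[i .. i+n) repeat verbatim at words[i+n .. i+2n)?
        let repeats = i + 2 * n <= words.length;
        for (let k = 0; repeats && k < n; k++) {
          if (words[i + k] !== words[i + n + k]) repeats = false;
        }
        if (repeats) {
          i += n; // drop the first copy; the second copy is examined next
          changed = true;
        } else {
          out.push(words[i]);
          i += 1;
        }
      }
      words = out;
    }
  }
  return words.join(" ");
}
```

With this sketch, `dedupPhrases("I'm about to show you I'm about to show you I'm about to show you how")` returns `"I'm about to show you how"`, while `"critical critical critical"` is returned unchanged because no 3-word phrase repeats.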
Default (plain text): the cleaned transcript, one cue per line, timestamps and HTML tags stripped. Suitable for piping into a summarizer or saving to a file.
With --timestamps: each line is prefixed with [hh:mm:ss] so the user can locate moments in the video.
With --json: a single JSON object on stdout:
{
  "url": "https://www.youtube.com/watch?v=...",
  "title": "Video title from yt-dlp",
  "lang": "en",
  "auto": false,
  "timestamps": false,
  "transcript": "full cleaned transcript as a single string"
}
auto is true when only auto-generated captions were available.
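A downstream tool that consumes the --json output may want to check the documented fields before using them. The sketch below assumes the field types shown in the example object above (strings for url, title, lang, and transcript; booleans for auto and timestamps). The helper name `parseTranscriptJson` is hypothetical.

```javascript
// Parse the stdout of a --json run and verify the documented fields are present.
// Field types are inferred from the example object, not from the script itself.
function parseTranscriptJson(stdout) {
  const data = JSON.parse(stdout);
  const expected = {
    url: "string",
    title: "string",
    lang: "string",
    auto: "boolean",
    timestamps: "boolean",
    transcript: "string",
  };
  for (const [key, type] of Object.entries(expected)) {
    if (typeof data[key] !== type) {
      throw new TypeError(`expected "${key}" to be a ${type}`);
    }
  }
  return data;
}
```

A caller could then branch on `result.auto` to note that the text came from auto-generated captions before summarizing it.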
When invoking this skill:
- Default to --lang en unless the user is clearly working in another language.
- Use --json when you plan to feed the result into another tool; use plain text when surfacing to the user.
- If no captions are found for the requested language, retry with --lang en or check the video on YouTube's UI.

How it works:
1. Creates a temp dir under os.tmpdir() via fs.mkdtempSync
2. Runs yt-dlp with --write-subs, --write-auto-subs, --sub-lang, --skip-download, --print-json writing into the temp dir
3. Parses the .vtt file: strips WEBVTT header, cue-id lines, timing lines, HTML tags, and consecutive duplicates
4. Prints plain text, plain text with [hh:mm:ss] prefixes, or JSON

Changelog:
- 1.0.2: Public-release hardening: add a 120-second yt-dlp timeout and a 2,000,000-character transcript output guard.
- 1.0.1: Security/audit polish: document the yt-dlp trust boundary, YouTube host allowlist, no-shell spawn behavior, and language-code validation.
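The .vtt parsing step (strip the WEBVTT header, cue-id lines, timing lines, HTML tags, and consecutive duplicates) might look roughly like the sketch below. It is an illustration under assumptions, not the skill's parser: the function name `vttToLines` is hypothetical, and it assumes cues are separated by blank lines as the WebVTT format prescribes.

```javascript
// Convert WebVTT text into cleaned caption lines.
// Hypothetical helper; the real scripts/fetch.mjs may parse differently.
function vttToLines(vtt) {
  // A timing line looks like "00:00:01.000 --> 00:00:03.000" (hours optional).
  const timing = /^(\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->\s+/;
  const out = [];
  // Normalize line endings, then split into blank-line-separated blocks.
  for (const block of vtt.replace(/\r\n?/g, "\n").split(/\n{2,}/)) {
    const rows = block.split("\n");
    const t = rows.findIndex((row) => timing.test(row));
    if (t === -1) continue; // WEBVTT header, NOTE, or STYLE block: no timing line
    // Rows before index t are cue ids; rows after it are the cue text.
    for (const row of rows.slice(t + 1)) {
      const text = row.replace(/<[^>]+>/g, "").trim(); // drop inline tags
      if (text && text !== out[out.length - 1]) out.push(text); // skip consecutive duplicates
    }
  }
  return out;
}
```

Returning an array keeps one caption line per element, so a caller can join with newlines for plain-text output or feed the lines into a later dedup pass.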