YouTube Transcript Native Node

Fetch a clean plain-text transcript from a YouTube video — native Node.js, zero npm dependencies, auditable in 5 minutes. Use when the user asks to transcribe, summarize, or extract captions from a YouTube URL. Wraps the `yt-dlp` binary (must be on PATH); writes subtitles to a temp dir, parses the .vtt, strips timestamps and HTML tags, prints clean text. No API keys required.

Install

openclaw skills install youtube-transcript-native-node

YouTube Transcript (Native Node)

Minimal, auditable YouTube transcript fetcher.

Native Node.js. Zero npm dependencies. Two files. Small enough to audit in a few minutes.

Wraps the external yt-dlp binary, which must be installed and on PATH.

Security behavior

  • Accepts only http(s) YouTube URLs on youtube.com, www.youtube.com, m.youtube.com, or youtu.be.
  • Validates --lang as a simple subtitle language code before invoking yt-dlp.
  • Spawns yt-dlp with an argv array and no shell; it does not execute user-provided commands.
  • Bounds the yt-dlp subprocess with a 120-second timeout.
  • Creates and removes a temporary subtitle directory under the OS temp path.
  • Refuses to print transcripts larger than 2,000,000 characters.
  • Reads no API keys, env secrets, credential files, or OpenClaw config.
  • Static-analysis note: child_process warnings are expected because this skill intentionally wraps the trusted yt-dlp binary. The operator owns the PATH/binary supply-chain trust boundary.
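The URL and language checks listed above could look roughly like the following. This is a minimal sketch, not the code in scripts/fetch.mjs: the helper names `isAllowedYouTubeUrl` and `isValidLang` are hypothetical, and the exact shape of a "simple subtitle language code" is an assumption.

```javascript
// Hypothetical helpers; the host allowlist mirrors the bullets above, but the
// real script may implement these checks differently.
const ALLOWED_HOSTS = new Set([
  'youtube.com',
  'www.youtube.com',
  'm.youtube.com',
  'youtu.be',
]);

function isAllowedYouTubeUrl(input) {
  let parsed;
  try {
    parsed = new URL(input);
  } catch {
    return false; // not a parseable absolute URL
  }
  const schemeOk = parsed.protocol === 'http:' || parsed.protocol === 'https:';
  // Exact hostname match, so "youtube.com.evil.example" is rejected.
  return schemeOk && ALLOWED_HOSTS.has(parsed.hostname);
}

// Assumed pattern: 2-3 letters plus optional short subtags (en, es, pt-BR).
// Requiring a leading letter also rejects values that start with "-" and
// could otherwise be read by yt-dlp as an option.
const LANG_RE = /^[A-Za-z]{2,3}(-[A-Za-z0-9]{2,8})*$/;

function isValidLang(lang) {
  return typeof lang === 'string' && LANG_RE.test(lang);
}

console.log(isAllowedYouTubeUrl('https://youtu.be/VIDEO_ID'));              // true
console.log(isAllowedYouTubeUrl('https://youtube.com.evil.example/watch')); // false
console.log(isValidLang('pt-BR'));                                          // true
console.log(isValidLang('--exec'));                                         // false
```

Matching the parsed hostname against a fixed set, rather than searching the raw string for "youtube.com", is what keeps look-alike domains out.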

When to use

Trigger phrases: "transcribe this YouTube video", "get the transcript", "pull the captions", "summarize this video" (paired with a follow-up summarization step).

Use this when:

  • The user gives you a YouTube URL and wants the spoken text
  • You need clean plain text for downstream summarization, search, or quoting
  • The video has either creator-uploaded subtitles or auto-generated captions

Do NOT use this when:

  • The video has no subtitles in any language (this skill does not transcribe audio; it only extracts existing captions)
  • The user wants a different platform (Vimeo, TikTok, podcasts); yt-dlp may support some of them, but this skill accepts only YouTube URLs
  • The video is a live stream that has not ended yet
  • The content is privacy-sensitive (yt-dlp sends requests to youtube.com)

How to run

The script is in scripts/fetch.mjs.

Basic transcript (English, plain text):

node "<skill-dir>/scripts/fetch.mjs" --url "https://www.youtube.com/watch?v=VIDEO_ID"

Different language:

node "<skill-dir>/scripts/fetch.mjs" --url "https://www.youtube.com/watch?v=VIDEO_ID" --lang es

Keep timestamps in the output:

node "<skill-dir>/scripts/fetch.mjs" --url "https://www.youtube.com/watch?v=VIDEO_ID" --timestamps

JSON output (title + transcript + metadata):

node "<skill-dir>/scripts/fetch.mjs" --url "https://www.youtube.com/watch?v=VIDEO_ID" --json

(Where <skill-dir> is typically workspace/skills/youtube-transcript-native-node/.)

All flags

| Flag | Values | Default | Purpose |
|------|--------|---------|---------|
| --url | YouTube URL | (required) | Video to fetch the transcript for |
| --lang | language code | en | Subtitle language (e.g. en, es, de) |
| --timestamps | flag | off | Keep [hh:mm:ss] prefixes in plain-text output |
| --json | flag | off | Output JSON: { url, title, lang, auto, timestamps, transcript } |
| --no-dedup | flag | off | Disable the rolling-window phrase dedup that runs on auto-captions. Use when the speaker deliberately repeats 3+ word phrases verbatim and you don't want those collapsed. |
| -h, --help | flag | | Show help |
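For readers auditing the flag handling, here is one way flags like these could be parsed without dependencies. It is a hedged sketch: the function name `parseFlags` and the option-object shape are hypothetical, and only the flag names and defaults are taken from the list above.

```javascript
// Hypothetical parser for illustration; not the parser in scripts/fetch.mjs.
function parseFlags(argv) {
  const opts = {
    url: null,
    lang: 'en',
    timestamps: false,
    json: false,
    dedup: true,
    help: false,
  };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    switch (arg) {
      case '--url':
      case '--lang': {
        const value = argv[++i]; // these two flags consume the next token
        if (value === undefined) throw new Error(`${arg} requires a value`);
        opts[arg.slice(2)] = value;
        break;
      }
      case '--timestamps': opts.timestamps = true; break;
      case '--json': opts.json = true; break;
      case '--no-dedup': opts.dedup = false; break;
      case '-h':
      case '--help': opts.help = true; break;
      default:
        throw new Error(`unknown argument: ${arg}`);
    }
  }
  if (!opts.help && !opts.url) throw new Error('--url is required');
  return opts;
}

console.log(parseFlags(['--url', 'https://youtu.be/VIDEO_ID', '--lang', 'es', '--json']));
```

In a real script the input would be `process.argv.slice(2)`. Rejecting unknown arguments outright keeps stray tokens from ever reaching yt-dlp.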

Credentials

None. This skill uses no API keys or env vars and reads no secrets.

It does require the yt-dlp binary to be installed and on PATH.

Install yt-dlp:

  • Windows: winget install yt-dlp
  • macOS: brew install yt-dlp
  • Cross-platform fallback: install from the official yt-dlp project instructions if package managers are unavailable

Verify with:

yt-dlp --version

Auto-caption rolling-window dedup

YouTube's auto-generated captions emit a 3-line scrolling window where each phrase appears across multiple overlapping cues. Concatenating the cues yields literal triplicate spam (e.g. "I'm about to show you I'm about to show you I'm about to show you...").

When auto is true and --timestamps is off, this skill runs a multi-pass dedup that collapses consecutive identical phrases of 3 to 15 words down to one copy. This typically reduces transcript size by 60-70%. The removed text is duplicated caption output, so the only content at risk is a deliberate verbatim repeat by the speaker.

The dedup is intentionally conservative:

  • Only fires when auto: true (manual captions don't have the artifact)
  • Only fires when --timestamps is off (timestamps mode keeps cues separate so dedup would scramble alignment)
  • Only collapses consecutive repeats; non-adjacent repetition (a phrase used at minute 1 and again at minute 5) is preserved
  • Single-word repetition ("critical critical critical") is preserved (minimum match is 3 words)

If a speaker deliberately repeats a 3+ word phrase verbatim and you want it preserved, use --no-dedup to skip the pass entirely.
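To make the collapsing rule concrete, the sketch below shows one way to remove consecutive repeats of 3- to 15-word phrases. It is an illustration under assumptions, not the skill's actual algorithm: the names `samePhrase` and `collapseRepeats` are hypothetical, and this version flattens all whitespace into single spaces.

```javascript
// Illustrative sketch only; the loop structure is an assumption.
function samePhrase(words, a, b, n) {
  for (let k = 0; k < n; k++) {
    if (words[a + k] !== words[b + k]) return false;
  }
  return true;
}

function collapseRepeats(text, minWords = 3, maxWords = 15) {
  let words = text.split(/\s+/).filter(Boolean);
  let changed = true;
  while (changed) { // multi-pass: repeat until nothing collapses
    changed = false;
    for (let n = maxWords; n >= minWords; n--) {
      const out = [];
      let i = 0;
      while (i < words.length) {
        const hasRepeat =
          i + 2 * n <= words.length && samePhrase(words, i, i + n, n);
        if (hasRepeat) {
          i += n; // drop this copy; the identical copy that follows is kept
          changed = true;
        } else {
          out.push(words[i]);
          i += 1;
        }
      }
      words = out;
    }
  }
  return words.join(' ');
}

const spam = "I'm about to show you I'm about to show you I'm about to show you how";
console.log(collapseRepeats(spam)); // I'm about to show you how
```

Because only adjacent copies are compared, a phrase that recurs later in the transcript is left alone. One caveat of this particular sketch: six identical words in a row count as a repeated 3-word phrase and would be reduced to three.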

Output format

Default (plain text): the cleaned transcript, one cue per line, timestamps and HTML tags stripped. Suitable for piping into a summarizer or saving to a file.

With --timestamps: each line is prefixed with [hh:mm:ss] so the user can locate moments in the video.
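The two plain-text modes can be pictured with a small WebVTT reader like the one below. This is a sketch under assumptions rather than the script's parser: `parseVtt` is a hypothetical name, and it treats any line between a timing line and the next blank line as cue text, which skips the header and cue-id lines without naming them.

```javascript
// Hypothetical minimal VTT reader; real-world files may need more handling.
const TIMING_RE = /^(?:(\d{2,}):)?(\d{2}):(\d{2})\.\d{3}\s+-->/;

function parseVtt(vtt, { timestamps = false } = {}) {
  const lines = [];
  let inCue = false;
  let cueStart = null;
  let previous = null;
  for (const raw of vtt.split(/\r?\n/)) {
    const line = raw.trim();
    if (line === '') {
      inCue = false; // a blank line ends the current cue
      continue;
    }
    const m = TIMING_RE.exec(line);
    if (m) {
      // WebVTT allows mm:ss.ttt when the hour is zero; normalise to hh:mm:ss.
      cueStart = `${m[1] ?? '00'}:${m[2]}:${m[3]}`;
      inCue = true;
      continue;
    }
    if (!inCue) continue; // header, metadata, or cue-id line
    const text = line.replace(/<[^>]+>/g, '').trim(); // strip tags
    if (!text || text === previous) continue; // drop consecutive duplicates
    previous = text;
    lines.push(timestamps ? `[${cueStart}] ${text}` : text);
  }
  return lines.join('\n');
}

const sample = [
  'WEBVTT',
  'Kind: captions',
  'Language: en',
  '',
  '00:00:01.000 --> 00:00:03.000 align:start position:0%',
  'Hello <c>world</c>',
  '',
  '00:00:03.000 --> 00:00:05.000',
  'Hello world',
  'second line',
].join('\n');

console.log(parseVtt(sample));
console.log(parseVtt(sample, { timestamps: true }));
```

The first call prints two lines, "Hello world" and "second line". The second prints the same lines prefixed with [00:00:01] and [00:00:03].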

With --json: a single JSON object on stdout:

{
  "url": "https://www.youtube.com/watch?v=...",
  "title": "Video title from yt-dlp",
  "lang": "en",
  "auto": false,
  "timestamps": false,
  "transcript": "full cleaned transcript as a single string"
}

auto is true when only auto-generated captions were available.
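A downstream tool might consume the JSON output along these lines. The sketch is hedged: `readTranscriptJson` and the derived `needsCaveat` and `wordCount` fields are hypothetical, and the sample string only stands in for real stdout from the script.

```javascript
// Field names follow the example object above; the helper is hypothetical.
const REQUIRED_FIELDS = ['url', 'title', 'lang', 'auto', 'timestamps', 'transcript'];

function readTranscriptJson(stdout) {
  const data = JSON.parse(stdout);
  for (const key of REQUIRED_FIELDS) {
    if (!(key in data)) throw new Error(`missing field: ${key}`);
  }
  return {
    ...data,
    // Assumed convenience fields for a follow-up summarization step.
    needsCaveat: data.auto === true, // auto-captions may contain recognition errors
    wordCount: data.transcript.split(/\s+/).filter(Boolean).length,
  };
}

// Made-up stand-in for the stdout of `fetch.mjs --json`.
const fakeStdout = JSON.stringify({
  url: 'https://www.youtube.com/watch?v=VIDEO_ID',
  title: 'Example title',
  lang: 'en',
  auto: true,
  timestamps: false,
  transcript: 'hello world from captions',
});

const result = readTranscriptJson(fakeStdout);
console.log(result.needsCaveat, result.wordCount); // true 4
```

Checking `auto` before quoting lets the caller tell the user that the text comes from auto-generated captions.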

Agent usage pattern

When invoking this skill:

  1. Pass the full YouTube URL — don't try to construct a different URL form.
  2. Default to --lang en unless the user is clearly working in another language.
  3. Use --json when you plan to feed the result into another tool; use plain text when surfacing to the user.
  4. Save long transcripts to a file when useful; quote sparingly and summarize before pasting unless the user asked for the raw text.

Troubleshooting

  • "yt-dlp not found on PATH" → install yt-dlp (see Credentials section above), then re-open your shell so PATH refreshes.
  • "no subtitles available for lang=<x>" → the video doesn't have captions in that language. Try --lang en, or check in the YouTube player which caption languages the video offers.
  • "yt-dlp exited with code N" → the URL may be private, region-locked, age-restricted, or removed. The yt-dlp stderr is forwarded so you can see what went wrong.
  • HTTP 429 from yt-dlp → YouTube rate-limited the IP; wait a few minutes and retry.
  • ".vtt file not produced" → yt-dlp ran but didn't write subtitles; usually means no captions exist. Re-run with the same args; if it persists, the video has no captions.
  • Garbled or interleaved auto-caption lines → YouTube's auto-captions emit overlapping cues; this script de-duplicates consecutive identical text, but unusual cases may still look choppy. That's a YouTube data issue, not a script bug.

What this skill does

  • Validates the URL and flags
  • Creates a fresh temp directory under os.tmpdir() via fs.mkdtempSync
  • Spawns yt-dlp with --write-subs, --write-auto-subs, --sub-lang, --skip-download, --print-json writing into the temp dir
  • Parses the resulting .vtt file: strips WEBVTT header, cue-id lines, timing lines, HTML tags, and consecutive duplicates
  • Prints either plain text, plain text with [hh:mm:ss] prefixes, or JSON
  • Removes the temp directory (best-effort) on exit
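The steps above can be sketched as a single function. Treat it as an outline under assumptions, not the contents of scripts/fetch.mjs: `buildYtDlpArgs` and `fetchSubtitles` are hypothetical names, running yt-dlp with the temp directory as its working directory is one assumed way of "writing into the temp dir", and the `--` separator is an extra guard that the source does not mention.

```javascript
// Hypothetical helper: assembles the argv from the yt-dlp flags listed above.
function buildYtDlpArgs(url, lang) {
  return [
    '--write-subs',
    '--write-auto-subs',
    '--sub-lang', lang,
    '--skip-download',
    '--print-json',
    '--', // end of options, so the URL can never be parsed as a flag
    url,
  ];
}

async function fetchSubtitles(url, lang = 'en', timeoutMs = 120_000) {
  // Dynamic imports keep this sketch loadable from both ESM and CommonJS.
  const { spawn } = await import('node:child_process');
  const fs = await import('node:fs');
  const os = await import('node:os');
  const path = await import('node:path');

  const outDir = fs.mkdtempSync(path.join(os.tmpdir(), 'yt-transcript-'));
  try {
    const child = spawn('yt-dlp', buildYtDlpArgs(url, lang), {
      cwd: outDir,          // assumption: subtitles land in the working directory
      shell: false,         // argv array only, no shell interpretation
      timeout: timeoutMs,   // Node kills the child when this elapses
      stdio: ['ignore', 'pipe', 'inherit'], // forward yt-dlp stderr as-is
    });
    let stdout = '';
    child.stdout.setEncoding('utf8');
    child.stdout.on('data', (chunk) => { stdout += chunk; });
    const { code, signal } = await new Promise((resolve, reject) => {
      child.on('error', reject); // e.g. yt-dlp not found on PATH
      child.on('close', (c, s) => resolve({ code: c, signal: s }));
    });
    if (signal) throw new Error(`yt-dlp was killed by ${signal}`);
    if (code !== 0) throw new Error(`yt-dlp exited with code ${code}`);
    const vttName = fs.readdirSync(outDir).find((f) => f.endsWith('.vtt'));
    if (!vttName) throw new Error('.vtt file not produced');
    return {
      info: JSON.parse(stdout),
      vtt: fs.readFileSync(path.join(outDir, vttName), 'utf8'),
    };
  } finally {
    fs.rmSync(outDir, { recursive: true, force: true }); // best-effort cleanup
  }
}

console.log(buildYtDlpArgs('https://youtu.be/VIDEO_ID', 'en').join(' '));
```

A caller would `await fetchSubtitles(url, lang)` after validating both inputs, then parse the returned `vtt` string. The `finally` block removes the temp directory whether yt-dlp succeeds, fails, or times out.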

What this skill does NOT do

  • Does not download audio or video
  • Does not transcribe audio (no Whisper, no STT) — captions only
  • Does not modify any configuration
  • Does not write any files outside the temp directory it creates and removes
  • Does not call any web API directly (only yt-dlp talks to YouTube)
  • Does not auto-update yt-dlp

Changelog

  • 1.0.2: Public-release hardening: add a 120-second yt-dlp timeout and a 2,000,000-character transcript output guard.
  • 1.0.1: Security/audit polish: document the yt-dlp trust boundary, YouTube host allowlist, no-shell spawn behavior, and language-code validation.