Install
openclaw skills install video-clip-skill

Clips a YouTube video locally using yt-dlp and ffmpeg. Supports auto-highlight detection, translation, and CapCut-style karaoke subtitle burning.

Requires yt-dlp, ffmpeg, and python3. Check with command -v.
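For example, a quick pre-flight check (a minimal sketch; the loop and message text are illustrative):

for dep in yt-dlp ffmpeg python3; do
  command -v "$dep" >/dev/null || echo "missing dependency: $dep"
done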
The ASS karaoke generator is bundled with this plugin. Locate it once at the start (this only searches for the plugin's own bundled file):
ASS_SCRIPT=$(find ~/.claude/plugins -path '*/clip-local/*/scripts/ass-karaoke.py' 2>/dev/null | head -1)
When the user does NOT specify start/end times (e.g., "幫我剪這個影片的精華" ["cut the highlights of this video for me"] or "clip the best parts"), detect the highlights automatically. First, fetch the video metadata:
yt-dlp --print title --print duration_string --print language \
--no-playlist --no-warnings --force-ipv4 "<URL>"
The third line is the original language code (e.g., en, en-US, ja, zh-Hant). Use the base code (before -) for subtitle download.
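A minimal sketch of capturing that output and deriving the base code; the variable names and the sed/parameter-expansion approach are illustrative, not part of the skill:

META=$(yt-dlp --print title --print duration_string --print language \
  --no-playlist --no-warnings --force-ipv4 "<URL>")
FULL_LANG=$(printf '%s\n' "$META" | sed -n 3p)   # e.g. en-US or zh-Hant
SUB_LANG=${FULL_LANG%%-*}                        # base code, e.g. en or zh

Then download the auto-generated subtitles: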
yt-dlp --write-auto-sub --sub-lang "<LANG>*" --sub-format vtt --skip-download \
--no-playlist --no-warnings --force-ipv4 \
--extractor-args 'youtube:player-client=default,mweb' \
-o "subs" "<URL>"
Replace <LANG> with the base language code from step 1 (e.g., en, ja). The * wildcard matches variants like en-orig. Do NOT use YouTube's auto-translated subs — they are low quality. All translation is done by you.
When clipping a portion (e.g., 10–130s), filter the VTT to only include cues whose timestamps fall within the range. Keep the original absolute timestamps — do NOT adjust them. The --offset flag in ass-karaoke.py handles the time shift.
When filtering, strip any extra metadata from timestamp lines (e.g., align:start position:0%) — keep only HH:MM:SS.mmm --> HH:MM:SS.mmm. The ASS parser regex expects clean timestamp lines.
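A minimal sketch of that filtering step, assuming a 10–130 s clip and an input file named subs.en.vtt (both illustrative); it keeps only cues whose start time falls inside the range and rewrites each timestamp line without the positioning metadata:

awk -v lo=10 -v hi=130 '
BEGIN { print "WEBVTT"; print "" }
/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\.[0-9][0-9][0-9] --> / {
  split($1, t, /[:.]/)
  start = t[1] * 3600 + t[2] * 60 + t[3] + t[4] / 1000
  keep = (start >= lo && start <= hi)
  if (keep) print $1, "-->", $3      # drop align:/position: metadata
  next
}
keep { print }                        # text lines and blank separators of kept cues
' subs.en.vtt > clip.vtt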
Write and execute a Python script that:

- parses the original VTT into cues (HH:MM:SS.mmm --> HH:MM:SS.mmm timestamp line + text lines)
- pairs each cue with your translation, one per cue
- writes a translated VTT that keeps the original timestamps exactly

Example structure:
import re

# Parse original VTT
with open("clip.vtt") as f:
    content = f.read()

cues = re.findall(r'(\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3})\n((?:(?!\d{2}:\d{2}).+\n?)*)', content)

# Translations — fill this list with your translations, one per cue
translations = [
    "translated line 1",
    "translated line 2",
    # ...
]

# Write translated VTT
with open("clip_translated.vtt", "w") as f:
    f.write("WEBVTT\n\n")
    for (timestamp, _), translation in zip(cues, translations):
        f.write(f"{timestamp}\n{translation.strip()}\n\n")
Generate this script with the translations list filled in, then execute it. This ensures timestamps stay exact and VTT format is correct.
python3 "$ASS_SCRIPT" <original.vtt> -o subs.ass -t <translated.vtt> --offset <START_SECONDS>
-t        translated VTT (shown below the karaoke line)
--offset  clip start time (adjusts timestamps relative to the clip start)

The script handles YouTube rolling-caption dedup, CJK per-character splitting, and bilingual layout.
Get both video + audio URLs in one call:
URLS=$(yt-dlp --get-url -f 'bv[height<=720]+ba/b[height<=720]' \
--no-playlist --no-warnings --force-ipv4 \
--extractor-args 'youtube:player-client=default,mweb' "<URL>")
VIDEO_URL=$(echo "$URLS" | head -1)
AUDIO_URL=$(echo "$URLS" | tail -1)
Then clip with ffmpeg:
-vf "ass=subs.ass" -c:v libx264 -preset fast -crf 23 -c:a aac -b:a 128k -movflags +faststart-c copy -avoid_negative_ts make_zero-ss <START> before each -i-map 0:v:0 -map 1:a:0If yt-dlp finds no auto-subs and user has GROQ_API_KEY set:
If yt-dlp finds no auto-subs and the user has GROQ_API_KEY set:

- Download audio only: yt-dlp -f ba -x --audio-format mp3 --postprocessor-args 'ffmpeg:-ac 1 -ar 16000 -b:a 64k'
- Transcribe it: POST https://api.groq.com/openai/v1/audio/transcriptions with model=whisper-large-v3 and response_format=verbose_json.

If GROQ_API_KEY is not set, inform the user that no subtitles are available and ask how to proceed (clip without subs, or set the key).
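A minimal sketch of that fallback, assuming the extracted audio ends up as audio.mp3 (the -o name is illustrative) and using the endpoint's OpenAI-compatible multipart form:

yt-dlp -f ba -x --audio-format mp3 \
  --postprocessor-args 'ffmpeg:-ac 1 -ar 16000 -b:a 64k' \
  --no-playlist --no-warnings --force-ipv4 -o audio "<URL>"

curl -s https://api.groq.com/openai/v1/audio/transcriptions \
  -H "Authorization: Bearer $GROQ_API_KEY" \
  -F model=whisper-large-v3 \
  -F response_format=verbose_json \
  -F file=@audio.mp3 > transcript.json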
Troubleshooting:

- If YouTube refuses the download (sign-in or bot check), pass --cookies cookies.txt to the yt-dlp commands.
- If CJK characters in the burned subtitles render as boxes, install a CJK font, e.g. brew install font-noto-sans-cjk-tc (macOS).