Install
openclaw skills install remotion-word-highlight-subtitles

Add word-level highlighted subtitles to local short videos using Whisper word timestamps and Remotion rendering.
This skill turns a local video into a subtitled video using the reusable "fine version": Whisper word timestamps plus Remotion-rendered current-word highlighting. Use this instead of plain SRT burn-in unless the user explicitly asks for simple static subtitles.
Detailed usage guide, README, and effect preview: https://github.com/0x00000003/remotion-word-highlight-subtitles
Use this skill for requests that ask to add word-by-word highlighted subtitles to a local short video.
If the user only gives one video path, write the output next to the source video.
- Input: a local .mp4, .mov, .m4v, or other audio/video file with usable audio.
- Output: <source-stem>_remotion逐词高亮字幕.mp4 in the same directory unless the user names another target.
- Whisper flags: --language zh --word_timestamps True --output_format json.
- Captions: every caption and token carries startMs and endMs. Long Whisper segments must be split by punctuation, timing gaps, duration, or length before rendering.
- Layout: start from height * 0.28 bottom padding, then adjust only if the source framing clearly needs it.
- Probe the source with ffprobe for width, height, fps, duration, and audio presence.
- Model: turbo is a good default when available. Use --initial_prompt when the glossary contains high-risk names or English terms.
- Convert the transcript to public/captions.json using scripts/whisper_json_to_captions.py, passing corrections with --replace or --replace-phrase. If Whisper returns long paragraphs, keep the helper's caption splitting enabled.
- Stage the source as public/input.mp4 or encode an H.264 compatibility copy as public/input-h264.mp4.
- Verify with ffprobe; extract and inspect a still from the rendered video to confirm the final encoded file kept the approved subtitle look.

For the Whisper pass, use this shape, adjusting model and paths as needed:
whisper "/absolute/path/video.mp4" --model turbo --language zh --word_timestamps True --output_format json --output_dir "/absolute/path"
For names and mixed Chinese/English topics, add a short prompt rather than relying on Whisper to infer them:
whisper "/absolute/path/video.mp4" \
--model turbo \
--language zh \
--word_timestamps True \
--output_format json \
--initial_prompt "本视频可能出现这些词:Cursor、Kimi 2.5、马斯克、AI大模型、转推、套壳。" \
--output_dir "/absolute/path"
If Whisper fails on the video container, extract audio first:
ffmpeg -y -i "/absolute/path/video.mp4" -vn -ac 1 -ar 16000 "/absolute/path/video_audio.wav"
Then run Whisper on the WAV and keep the output naming clear.
After Whisper finishes, print or read the segment transcript before building captions. Treat this as a required gate, not a nice-to-have.
Check especially:
- Names and English terms such as Cursor, Kimi 2.5, ChatGPT, Claude, OpenAI

Create a short correction map and apply it before rendering. Use --replace for single-token fixes and --replace-phrase for words split across adjacent Whisper tokens.
Example:
python3 scripts/whisper_json_to_captions.py \
"/absolute/path/transcript.json" \
"/absolute/path/remotion-project/public/captions.json" \
--replace-phrase "科舍=Cursor" \
--replace-phrase "KMI 2.5=Kimi 2.5" \
--replace-phrase "Kimi 2.5=Kimi 2.5" \
--replace-phrase "AI却=AI圈" \
--replace-phrase "死腿=转推" \
--merge-term "Cursor" \
--merge-term "Kimi 2.5" \
--keyword "AI" \
--keyword "Kimi 2.5"
If a correction is uncertain, prefer rerunning Whisper with a better --initial_prompt or inspecting the relevant audio/video moment before deciding. Report the correction map in the final response.
Remotion should load public/captions.json with this shape:
[
  {
    "text": "我们用手机随便拍张照片",
    "startMs": 0,
    "endMs": 1600,
    "tokens": [
      { "text": "我们", "startMs": 0, "endMs": 300, "keyword": false },
      { "text": "手机", "startMs": 440, "endMs": 700, "keyword": true }
    ]
  }
]
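As TypeScript, the same shape reads as follows. This is a sketch; the type names are illustrative, not part of the helper's output:

// Illustrative types for public/captions.json.
type CaptionToken = {
  text: string;      // display text for one word
  startMs: number;   // word becomes active
  endMs: number;     // word stops being active
  keyword: boolean;  // tinted as a keyword when true
};

type Caption = {
  text: string;      // full caption line
  startMs: number;
  endMs: number;
  tokens: CaptionToken[];
};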
Use the helper script:
python3 scripts/whisper_json_to_captions.py \
"/absolute/path/transcript.json" \
"/absolute/path/remotion-project/public/captions.json" \
--keyword "提示词" \
--keyword "Codex" \
--replace-phrase "错识别词=正确词" \
--max-caption-chars 28 \
--max-caption-duration-ms 4200 \
--split-gap-ms 260 \
--min-punctuation-caption-ms 900
Run the transcript QA pass before this conversion command. If any correction changes adjacent tokens into one display term, also pass that final term with --merge-term or a same-text --replace-phrase so the highlight appears as a clean word instead of broken characters.
The helper splits long Whisper segments by visible length, duration, punctuation, and word timing gaps. Keep this behavior on for short-video subtitles; otherwise a single Whisper paragraph can become an unreadable multi-line caption.
The caption layer should (a component sketch follows this list):

- Use OffthreadVideo for the source video.
- Load captions.json with delayRender, continueRender, and staticFile.
- Show the caption whose window satisfies currentMs >= startMs && currentMs < endMs.
- Mark a token active while currentMs is within the token's timing.
- Keep letterSpacing: 0.
- Avoid WebkitTextStroke as a thick Chinese subtitle outline. It easily eats the white fill at small resolutions. Prefer multi-direction textShadow; if WebkitTextStroke is used at all, keep it at or below 1.5px and verify a still.
- If a background box is used, keep its opacity <= 0.16 with tight padding, and verify it does not look like a dark banner.

Reject and revise any still where the caption has muddy/gray text, a thick black halo, a large black box, clipped words, awkward wrapping, or placement over the mouth/chin in a talking-head video.
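A minimal sketch of such a caption layer, using the Caption types above. The component name and inline color values are illustrative; only public/captions.json and public/input.mp4 come from this skill's conventions:

import React, { useEffect, useState } from "react";
import {
  AbsoluteFill,
  OffthreadVideo,
  continueRender,
  delayRender,
  staticFile,
  useCurrentFrame,
  useVideoConfig,
} from "remotion";

// Caption/CaptionToken are the types sketched in the captions.json section.
export const CaptionedVideo: React.FC = () => {
  const frame = useCurrentFrame();
  const { fps, height } = useVideoConfig();
  const [captions, setCaptions] = useState<Caption[] | null>(null);
  // Block rendering until captions.json has loaded.
  const [handle] = useState(() => delayRender("load captions"));

  useEffect(() => {
    fetch(staticFile("captions.json"))
      .then((res) => res.json())
      .then((data: Caption[]) => {
        setCaptions(data);
        continueRender(handle);
      });
  }, [handle]);

  const currentMs = (frame / fps) * 1000;
  // Pick the caption whose window contains the current time.
  const caption = captions?.find(
    (c) => currentMs >= c.startMs && currentMs < c.endMs
  );

  return (
    <AbsoluteFill>
      <OffthreadVideo src={staticFile("input.mp4")} />
      {caption && (
        <div
          style={{
            position: "absolute",
            bottom: Math.round(height * 0.28),
            width: "100%",
            textAlign: "center",
          }}
        >
          {caption.tokens.map((token, i) => {
            // A token is active while currentMs falls inside its timing.
            const active =
              currentMs >= token.startMs && currentMs < token.endMs;
            return (
              <span
                key={i}
                style={{
                  color: active
                    ? "#FFE45C"
                    : token.keyword
                    ? "#D6FFF8"
                    : "#FFFFFF",
                  letterSpacing: 0,
                }}
              >
                {token.text}
              </span>
            );
          })}
        </div>
      )}
    </AbsoluteFill>
  );
};

In the real component, replace the inline values with the style constants below and add the textShadow outline.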
Use these style constants as the baseline:
const captionBottom = Math.round(height * 0.28);
const captionFontSize = Math.round(height * 0.032);
const captionMaxWidth = Math.round(width * 0.88);
const activeColor = "#FFE45C";
const keywordColor = "#D6FFF8";
const outlinePx = Math.min(3, Math.max(1.5, captionFontSize * 0.055));
Use a clean shadow outline rather than a heavy stroke:
const textShadow = [
`${outlinePx}px 0 0 rgba(0, 0, 0, 0.96)`,
`-${outlinePx}px 0 0 rgba(0, 0, 0, 0.96)`,
`0 ${outlinePx}px 0 rgba(0, 0, 0, 0.96)`,
`0 -${outlinePx}px 0 rgba(0, 0, 0, 0.96)`,
`0 ${outlinePx * 1.4}px ${outlinePx * 1.4}px rgba(0, 0, 0, 0.72)`,
].join(", ");
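Wired into a token span, the constants above might be applied like this. A sketch; the white base color and the helper name are assumptions:

import type { CSSProperties } from "react";

// Style for one caption token, built from the constants above.
const tokenStyle = (active: boolean, keyword: boolean): CSSProperties => ({
  color: active ? activeColor : keyword ? keywordColor : "#FFFFFF",
  fontSize: captionFontSize,
  letterSpacing: 0,
  textShadow, // multi-direction outline instead of a heavy stroke
});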
For a 720x1280 vertical video, this is roughly paddingBottom: 358, fontSize: 41, maxWidth: 634, and a 2px outline. For the original 2160x2974 source that inspired this skill, the equivalent bottom padding was about 830.
If the user does not specify keywords, infer a small set from the transcript:
- English product terms: Codex, Remotion, Whisper
- Chinese topic words: 提示词, 照片, 手机, 封面

Keep keyword highlighting sparse. Too many colored words make the active yellow word less clear.
Prefer:
<source-stem>_remotion逐词高亮字幕.mp4
If iterating for the same source, add _v2, _v3, etc. Do not overwrite the user's source file.
Tell the user the output path, mention that it used the reusable Remotion + Whisper word timestamp flow, and include the transcript correction map or say no corrections were needed. If any verification step could not be run, say that plainly.