{"skill":{"slug":"remotion-word-highlight-subtitles","displayName":"Remotion Word Highlight Subtitles","summary":"Add word-level highlighted subtitles to local short videos using Whisper word timestamps and Remotion rendering.","description":"---\nname: remotion-word-highlight-subtitles\ndescription: Add word-level highlighted subtitles to local short videos using Whisper word timestamps and Remotion rendering.\nversion: 0.1.4\nmetadata:\n  openclaw:\n    requires:\n      bins:\n        - python3\n        - ffmpeg\n        - ffprobe\n        - whisper\n        - node\n        - npm\n    emoji: \"🎬\"\n---\n\n# Remotion Word Highlight Subtitles\n\n## Overview\n\nThis skill turns a local video into a subtitled video using the reusable \"fine version\": Whisper word timestamps plus Remotion-rendered current-word highlighting. Use this instead of plain SRT burn-in unless the user explicitly asks for simple static subtitles.\n\nDetailed usage guide, README, and effect preview: https://github.com/0x00000003/remotion-word-highlight-subtitles\n\n## Trigger Phrases\n\nUse this skill for requests like:\n\n- \"给这个视频添加字幕：/path/to/video.mp4\"\n- \"给这个视频加逐词高亮字幕\"\n- \"按之前 Remotion 那个字幕方案处理这个视频\"\n- \"重新转写 word timestamps，然后加当前词高亮字幕\"\n\nIf the user only gives one video path, write the output next to the source video.\n\n## Defaults\n\n- Source: local `.mp4`, `.mov`, `.m4v`, or audio/video file with usable audio.\n- Output: `<source-stem>_remotion逐词高亮字幕.mp4` in the same directory unless the user names another target.\n- Transcription: Whisper with word timestamps, Chinese by default: `--language zh --word_timestamps True --output_format json`.\n- Transcript QA: mandatory before rendering. Build a tiny glossary from the user prompt, folder/file names, visible context, and topic; correct obvious ASR mistakes in product/model/person/place names, English terms, numbers, and homophones.\n- Caption data: Remotion caption chunks built from Whisper words, with token-level `startMs` and `endMs`. Long Whisper segments must be split by punctuation, timing gaps, duration, or length before rendering.\n- Caption position: keep all captions in a screenshot-friendly lower band, slightly above the bottom UI area. Start around `height * 0.28` bottom padding, then adjust only if the source framing clearly needs it.\n- Visual style: bold Chinese UI font, white base text, current token yellow, optional sparse keyword accent, and black shadow/outline for readability.\n- Subtitle sizing: treat font size as a hard visual gate, not a loose style hint. For landscape 1080p footage, target `height * 0.056` (about `60px`) and never render below `56px`; for vertical `720x1280`, target `40-46px`. For other sizes, scale from height but enforce the same perceived size in still QA.\n- Display punctuation: default subtitles are clean Chinese display text. Strip visible sentence punctuation from caption text and token text before rendering, including `。 ， 、 ？ ！ ： ； , . ? ! : ;`. Keep punctuation only for splitting/timing unless the user explicitly asks for verbatim punctuation.\n- Verification: check the rendered file exists, keeps audio, matches the source duration closely, and visually inspect stills before accepting the style.\n\n## Workflow\n\n1. Inspect the source with `ffprobe` for width, height, fps, duration, and audio presence.\n2. Build a small domain glossary before transcription. Include likely product names, model names, people, places, mixed Chinese/English terms, numbers, and terms hinted by the folder or filename.\n3. Run Whisper word timestamp transcription. Prefer a cached/local model that works on the machine; `turbo` is a good default when available. Use `--initial_prompt` when the glossary contains high-risk names or English terms.\n4. Do a mandatory transcript QA pass before rendering. Read every segment, compare it with the video's topic/context, and create a correction map for obvious ASR mistakes. Do not render while known mistakes remain.\n5. Convert the Whisper JSON to `public/captions.json` using `scripts/whisper_json_to_captions.py`, passing corrections with `--replace` or `--replace-phrase`. Apply transcript corrections, merge terms, strip display punctuation, and keep the helper's caption splitting enabled if Whisper returns long paragraphs.\n6. Build or reuse a small Remotion project in the video's folder. Copy the video to `public/input.mp4` or encode an H.264 compatibility copy as `public/input-h264.mp4`.\n7. Set the Remotion composition to the exact source width, height, fps, and duration in frames.\n8. Before the full render, render and inspect at least two stills from the Remotion composition: one long/two-line caption and one dense keyword/current-token caption. Fix muddy text, excessive outline, large dark backing, bad line breaks, too-small text, visible display punctuation, poor placement, or remaining typo/wrong-character subtitles before rendering the final video.\n9. Render with Remotion to the output path next to the source.\n10. Verify the final output with `ffprobe`; extract and inspect a still from the rendered video to confirm the final encoded file kept the approved subtitle look.\n\n## Whisper Command\n\nUse this shape, adjusting model and paths as needed:\n\n```bash\nwhisper \"/absolute/path/video.mp4\" --model turbo --language zh --word_timestamps True --output_format json --output_dir \"/absolute/path\"\n```\n\nFor names and mixed Chinese/English topics, add a short prompt rather than relying on Whisper to infer them:\n\n```bash\nwhisper \"/absolute/path/video.mp4\" \\\n  --model turbo \\\n  --language zh \\\n  --word_timestamps True \\\n  --output_format json \\\n  --initial_prompt \"本视频可能出现这些词：Cursor、Kimi 2.5、马斯克、AI大模型、转推、套壳。\" \\\n  --output_dir \"/absolute/path\"\n```\n\nIf Whisper fails on the video container, extract audio first:\n\n```bash\nffmpeg -y -i \"/absolute/path/video.mp4\" -vn -ac 1 -ar 16000 \"/absolute/path/video_audio.wav\"\n```\n\nThen run Whisper on the WAV and keep the output naming clear.\n\n## Mandatory Transcript QA\n\nAfter Whisper finishes, print or read the segment transcript before building captions. Treat this as a required gate, not a nice-to-have.\n\nCheck especially:\n\n- product, model, company, person, and place names\n- mixed Chinese/English terms such as `Cursor`, `Kimi 2.5`, `ChatGPT`, `Claude`, `OpenAI`\n- numbers, dates, model versions, and acronyms\n- Chinese homophones that are plausible but wrong in context, such as \"却\" versus \"圈\" or \"死腿\" versus \"转推\"\n- low-confidence words if the Whisper JSON includes probabilities\n\nCreate a short correction map and apply it before rendering. Use `--replace` for single-token fixes and `--replace-phrase` for words split across adjacent Whisper tokens.\n\nExample:\n\n```bash\npython3 scripts/whisper_json_to_captions.py \\\n  \"/absolute/path/transcript.json\" \\\n  \"/absolute/path/remotion-project/public/captions.json\" \\\n  --replace-phrase \"科舍=Cursor\" \\\n  --replace-phrase \"KMI 2.5=Kimi 2.5\" \\\n  --replace-phrase \"Kimi 2.5=Kimi 2.5\" \\\n  --replace-phrase \"AI却=AI圈\" \\\n  --replace-phrase \"死腿=转推\" \\\n  --merge-term \"Cursor\" \\\n  --merge-term \"Kimi 2.5\" \\\n  --keyword \"AI\" \\\n  --keyword \"Kimi 2.5\"\n```\n\nIf a correction is uncertain, prefer rerunning Whisper with a better `--initial_prompt` or inspect the relevant audio/video moment before deciding. Report the correction map in the final response.\n\n## Caption JSON\n\nRemotion should load `public/captions.json` with this shape:\n\n```json\n[\n  {\n    \"text\": \"我们用手机随便拍张照片\",\n    \"startMs\": 0,\n    \"endMs\": 1600,\n    \"tokens\": [\n      { \"text\": \"我们\", \"startMs\": 0, \"endMs\": 300, \"keyword\": false },\n      { \"text\": \"手机\", \"startMs\": 440, \"endMs\": 700, \"keyword\": true }\n    ]\n  }\n]\n```\n\nUse the helper script:\n\n```bash\npython3 scripts/whisper_json_to_captions.py \\\n  \"/absolute/path/transcript.json\" \\\n  \"/absolute/path/remotion-project/public/captions.json\" \\\n  --keyword \"提示词\" \\\n  --keyword \"Codex\" \\\n  --replace-phrase \"错识别词=正确词\" \\\n  --max-caption-chars 28 \\\n  --max-caption-duration-ms 4200 \\\n  --split-gap-ms 260 \\\n  --min-punctuation-caption-ms 900\n```\n\nRun the transcript QA pass before this conversion command. If any correction changes adjacent tokens into one display term, also pass that final term with `--merge-term` or a same-text `--replace-phrase` so the highlight appears as a clean word instead of broken characters.\n\nBy default, `scripts/whisper_json_to_captions.py` strips visible display punctuation from final caption text and tokens while still using punctuation as a split signal. Do not pass `--keep-display-punctuation` unless the user explicitly asks for verbatim transcript punctuation. If building JSON manually, remove punctuation-only tokens and trim trailing punctuation before rendering.\n\nThe helper splits long Whisper segments by visible length, duration, punctuation, and word timing gaps. Keep this behavior on for short-video subtitles; otherwise a single Whisper paragraph can become an unreadable multi-line caption.\n\n## Remotion Caption Layer Requirements\n\nThe caption layer should:\n\n- Use `OffthreadVideo` for the source video.\n- Load `captions.json` with `delayRender`, `continueRender`, and `staticFile`.\n- Find the active caption by `currentMs >= startMs && currentMs < endMs`.\n- Highlight the active token when `currentMs` is within the token's timing.\n- Use keyword coloring only as a secondary accent; the current spoken token is the main effect.\n- Keep `letterSpacing: 0`.\n- Render clean display subtitles without sentence-final punctuation. Do not show comma, period, question mark, exclamation mark, colon, semicolon, or dunhao tokens unless the user asked for verbatim punctuation.\n- Treat caption size as a hard gate. For 1080p landscape, the rendered subtitle should be around `60px` and never below `56px`; if a still looks like small UI annotation text, enlarge and rerender.\n- Keep all captions at the normal position; do not include special screenshot-sentence placement unless the user explicitly asks.\n- Do not use `WebkitTextStroke` as a thick Chinese subtitle outline. It easily eats the white fill at small resolutions. Prefer multi-direction `textShadow`; if `WebkitTextStroke` is used at all, keep it at or below `1.5px` and verify a still.\n- Do not use a large semi-transparent rounded black rectangle behind the whole caption by default. If the footage truly needs backing, use a very subtle per-line backing, opacity `<= 0.16`, tight padding, and verify it does not look like a dark banner.\n\nReject and revise any still where the caption has muddy/gray text, a thick black halo, a large black box, clipped words, awkward wrapping, visible sentence punctuation, text that is too small for the frame, or placement over the mouth/chin in a talking-head video.\n\nUse these style constants as the baseline:\n\n```tsx\nconst captionBottom = Math.round(height * 0.28);\nconst isLandscape = width >= height;\nconst baseCaptionFontSize = Math.round(height * (isLandscape ? 0.056 : 0.032));\nconst minCaptionFontSize = isLandscape && height >= 1000 ? 56 : Math.round(height * 0.032);\nconst captionFontSize = Math.max(baseCaptionFontSize, minCaptionFontSize);\nconst captionMaxWidth = Math.round(width * 0.88);\nconst activeColor = \"#FFE45C\";\nconst keywordColor = \"#D6FFF8\";\nconst outlinePx = Math.min(3, Math.max(1.5, captionFontSize * 0.055));\n```\n\nUse a clean shadow outline rather than a heavy stroke:\n\n```tsx\nconst textShadow = [\n  `${outlinePx}px 0 0 rgba(0, 0, 0, 0.96)`,\n  `-${outlinePx}px 0 0 rgba(0, 0, 0, 0.96)`,\n  `0 ${outlinePx}px 0 rgba(0, 0, 0, 0.96)`,\n  `0 -${outlinePx}px 0 rgba(0, 0, 0, 0.96)`,\n  `0 ${outlinePx * 1.4}px ${outlinePx * 1.4}px rgba(0, 0, 0, 0.72)`,\n].join(\", \");\n```\n\nFor a `1920x1080` landscape video, this yields about `fontSize: 60`, and it must not fall below `56`. For a `720x1280` vertical video, this is roughly `paddingBottom: 358`, `fontSize: 41`, `maxWidth: 634`, and a `2px` outline. For the original `2160x2974` source that inspired this skill, the equivalent bottom padding was about `830`.\n\n## Keyword Handling\n\nIf the user does not specify keywords, infer a small set from the transcript:\n\n- product/tool names: `Codex`, `Remotion`, `Whisper`\n- content objects: `提示词`, `照片`, `手机`, `封面`\n- place names or nouns that carry the video's hook\n- numbers or outcomes that are important to the promise\n\nKeep keyword highlighting sparse. Too many colored words makes the active yellow word less clear.\n\n## Output Naming\n\nPrefer:\n\n```text\n<source-stem>_remotion逐词高亮字幕.mp4\n```\n\nIf iterating for the same source, add `_v2`, `_v3`, etc. Do not overwrite the user's source file.\n\n## Final Response\n\nTell the user the output path, mention that it used the reusable Remotion + Whisper word timestamp flow, and include the transcript correction map or say no corrections were needed. If any verification step could not be run, say that plainly.\n","tags":{"latest":"0.1.4","remotion":"0.1.4","short-video":"0.1.4","subtitles":"0.1.4","video":"0.1.4","whisper":"0.1.4"},"stats":{"comments":0,"downloads":547,"installsAllTime":20,"installsCurrent":2,"stars":1,"versions":5},"createdAt":1777821015082,"updatedAt":1778763410822},"latestVersion":{"version":"0.1.4","createdAt":1778763193796,"changelog":"Enforce larger subtitle sizing and strip display punctuation by default.","license":"MIT-0"},"metadata":{"setup":[],"os":null,"systems":null},"owner":{"handle":"0x00000003","userId":"s170qhqbpmn4wvbe5p927d7hd183mmwq","displayName":"0x00000003","image":"https://avatars.githubusercontent.com/u/7299193?v=4"},"moderation":null}