Install
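openclaw skills install @hitjcl/ai-science-video-studio
An end-to-end automation skill for producing AI science explainer videos. It combines a digital human avatar (Google Flow / SadTalker), AI voice cloning (F5-TTS MLX), Pillow content slides, word-by-word karaoke subtitles (Pillow + FFmpeg), and professional-grade audio/video QA into an 8-stage automated pipeline. Coverage: script planning → digital human generation → TTS voice cloning → slide rendering → subtitle rendering → audio repair → final compositing (FFmpeg xfade + acrossfade + alimiter) → professional QA review. Trigger phrases: AI科普视频 (AI science video), 制作科普视频 (make a science explainer video), 做个AI讲解视频 (make an AI explainer video), 生成科技短视频 (generate a short tech video). **Recommended hardware: Mac mini M-series with 16GB RAM**, using Apple Silicon MLX to accelerate voice cloning and local rendering.
Full pipeline for producing explainer / educational videos that combine a digital human avatar (intro/outro) with animated content slides (body), narrated in a cloned personal voice (F5-TTS MLX), with karaoke-style subtitles throughout.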
The pipeline follows an 8-stage workflow:
Script Planning → Digital Human → TTS Voice → Content Slides
→ Subtitles → Audio Repair → Final Compositing → QA Review
Default configuration is tuned for 1280×720 (16:9), 24fps, CRF 20 encoding, a single presenter avatar, and Mandarin Chinese narration. All parameters are adjustable.
Trigger on any of the following intents:
Do NOT use this skill for:
- ai-short-film-studio
- sadtalker-pip-compositing
- google-flow-automation

Create a script.json file defining the video structure with exactly 5 segments:
{
"intro": { "type": "digital_human", "engine": "google_flow", "duration": 10, "narration": "开场旁白...", "flow_prompt": "..." },
"content_1": { "type": "slides", "engine": "pillow", "duration": 30, "narration": "正文第一段旁白..." },
"content_2": { "type": "slides", "engine": "pillow", "duration": 25, "narration": "正文第二段旁白..." },
"content_3": { "type": "slides", "engine": "pillow", "duration": 29, "narration": "正文第三段旁白..." },
"outro": { "type": "digital_human", "engine": "google_flow", "duration": 10, "narration": "结尾旁白...", "flow_prompt": "..." }
}
Rules:
- digital_human type (talking avatar)
- slides type (animated content screens)

For detailed script format specification, see references/script_format.md.
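The five-segment layout above can be checked before any rendering starts. The following is a minimal sketch, assuming the key names and segment types shown in the example script.json; `validate_script` is a hypothetical helper and not part of the skill's scripts, and the authoritative rules live in references/script_format.md.

```python
import json

# Segment order and types as they appear in the example script.json.
EXPECTED_TYPES = {
    "intro": "digital_human",
    "content_1": "slides",
    "content_2": "slides",
    "content_3": "slides",
    "outro": "digital_human",
}


def validate_script(script: dict) -> float:
    """Check the five-segment layout and return the total planned duration in seconds."""
    if list(script.keys()) != list(EXPECTED_TYPES.keys()):
        raise ValueError(f"expected segments {list(EXPECTED_TYPES)}, got {list(script)}")
    total = 0.0
    for name, segment in script.items():
        if segment.get("type") != EXPECTED_TYPES[name]:
            raise ValueError(f"{name}: type should be {EXPECTED_TYPES[name]!r}")
        if not segment.get("narration"):
            raise ValueError(f"{name}: narration is missing or empty")
        duration = segment.get("duration", 0)
        if duration <= 0:
            raise ValueError(f"{name}: duration must be positive")
        total += duration
    return total


# Placeholder narration text; durations match the example above.
example = json.loads("""
{
  "intro":     {"type": "digital_human", "engine": "google_flow", "duration": 10, "narration": "..."},
  "content_1": {"type": "slides", "engine": "pillow", "duration": 30, "narration": "..."},
  "content_2": {"type": "slides", "engine": "pillow", "duration": 25, "narration": "..."},
  "content_3": {"type": "slides", "engine": "pillow", "duration": 29, "narration": "..."},
  "outro":     {"type": "digital_human", "engine": "google_flow", "duration": 10, "narration": "..."}
}
""")

total_seconds = validate_script(example)
```

With the example durations, the planned total is 104 seconds before any cross-fade overlap is subtracted.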
Two approaches are available. Prefer Google Flow for standalone talking-head segments; use SadTalker for picture-in-picture overlay on content slides.
Use the google-flow-automation skill to generate intro/outro videos:
- intro.mp4 and outro.mp4

Key parameters:
Use the sadtalker-pip-compositing skill when the digital human should appear as a
circular picture-in-picture overlay on content slides.
Steps:
- scripts/fix_sadtalker_numpy.py for numpy 2.x compatibility
- device='mps'

PiP Parameters:
| Parameter | Value |
|---|---|
| Size | 120×120 (final) |
| Position | bottom-left, 20px margin |
| Mask | PIL circular, radius 60px |
| Overlay | overlay=20:H-h-20:shortest=1 |
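The PiP parameters can be combined into one FFmpeg filter graph. This is a sketch under stated assumptions, not the skill's own command: input 0 is taken to be the content video, input 1 the SadTalker avatar video, and input 2 a circular grayscale mask PNG of the same size as the scaled avatar (supplied as a looped image input). The function names are hypothetical.

```python
def pip_top_left(frame_height: int, size: int = 120, margin: int = 20) -> tuple:
    """Pixel position of the PiP's top-left corner for a bottom-left placement."""
    return (margin, frame_height - size - margin)


def pip_filter_graph(size: int = 120, margin: int = 20) -> str:
    """Build a filter_complex string: scale avatar, apply circular alpha, overlay bottom-left."""
    return (
        f"[1:v]scale={size}:{size}[face];"
        # alphamerge: first input supplies color, second input supplies alpha
        f"[face][2:v]alphamerge[pip];"
        f"[0:v][pip]overlay={margin}:H-h-{margin}:shortest=1[out]"
    )


graph = pip_filter_graph()
position = pip_top_left(720)
```

On a 720-pixel-high frame, the expression `H-h-20` places the top edge of a 120-pixel overlay at y = 580.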
Use F5-TTS MLX on Apple Silicon for personal voice cloning:
from f5_tts_mlx.generate import generate
# For content narration (MUST use estimate_duration=True!)
audio = generate(
text="旁白文本...",
ref_audio_path="/path/to/ref_voice.mp3",
ref_audio_text="参考音频的文本内容",
steps=64,
cfg_strength=2.5,
speed=1.0,
estimate_duration=True, # CRITICAL for Chinese!
)
CRITICAL — estimate_duration=True:
Without this parameter, F5-TTS generates extremely short audio for Chinese text
(0.5-0.9 seconds per sentence). With it, the model estimates a target duration and
generates audio of the proper length.
Parameter table:
| Parameter | Intro/Outro | Content |
|---|---|---|
| steps | 64 | 64 |
| cfg_strength | 2.5 | 2.5 |
| speed | 0.45 | 1.0 |
| estimate_duration | No | Yes (critical!) |
Post-processing:
After generation, compute the actual-vs-target duration ratio and apply atempo
to fine-tune timing:
# Example: actual 11.98s, target 10.0s → atempo=1.198
ffmpeg -i generated.wav -filter:a "atempo=1.198" output.wav
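The ratio in the example above is simply actual duration divided by target duration. A small sketch of that arithmetic follows; `atempo_chain` is a hypothetical helper. It also splits large corrections into several stages, on the conservative assumption that each atempo stage should stay within 0.5 to 2.0, a limit older FFmpeg builds enforce.

```python
def atempo_chain(actual_s: float, target_s: float) -> str:
    """Return an FFmpeg audio filter string that retimes actual_s of audio to target_s."""
    if actual_s <= 0 or target_s <= 0:
        raise ValueError("durations must be positive")
    ratio = actual_s / target_s  # >1 speeds the audio up, <1 slows it down
    factors = []
    # Keep every stage inside [0.5, 2.0]; chained stages multiply together.
    while ratio > 2.0:
        factors.append(2.0)
        ratio /= 2.0
    while ratio < 0.5:
        factors.append(0.5)
        ratio /= 0.5
    factors.append(ratio)
    return ",".join(f"atempo={factor:.3f}" for factor in factors)


single_stage = atempo_chain(11.98, 10.0)
two_stage = atempo_chain(25.0, 10.0)
```

The returned string can be passed to `-filter:a` as in the command above.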
When F5-TTS is unavailable or produces garbled output:
edge-tts --voice zh-CN-YunxiNeural --text "旁白文本" --write-media output.wav
Voice selection:
| Purpose | Voice |
|---|---|
| Content narration (male) | zh-CN-YunxiNeural |
| Patch/correction (female) | zh-CN-XiaoxiaoNeural |
Render animated content slides using Pillow frame-by-frame rendering + FFmpeg pipe.
Use scripts/render_slides.py as the template. The script should:
Key rendering parameters:
The script is at scripts/render_slides.py. Customize the content per video topic
while keeping the rendering engine intact.
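The frame-by-frame approach can be outlined without reproducing scripts/render_slides.py. The sketch below is illustrative only: it assumes a linear progressive text reveal and raw RGB frames piped to FFmpeg on stdin, using the default resolution, frame rate and CRF stated in this document. Both function names are hypothetical.

```python
import math
from typing import List

WIDTH, HEIGHT, FPS = 1280, 720, 24


def visible_chars(frame_index: int, total_chars: int, reveal_seconds: float, fps: int = FPS) -> int:
    """Number of characters to draw at a given frame under a linear progressive reveal."""
    if reveal_seconds <= 0:
        return total_chars
    progress = min(1.0, frame_index / (reveal_seconds * fps))
    return min(total_chars, math.floor(progress * total_chars))


def ffmpeg_pipe_command(output_path: str, width: int = WIDTH, height: int = HEIGHT, fps: int = FPS) -> List[str]:
    """FFmpeg arguments that read raw RGB24 frames from stdin and encode them with libx264."""
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "rgb24",
        "-s", f"{width}x{height}", "-r", str(fps),
        "-i", "-",  # frames arrive on stdin
        "-c:v", "libx264", "-crf", "20", "-pix_fmt", "yuv420p",
        output_path,
    ]


command = ffmpeg_pipe_command("content_1.mp4")
```

In a full renderer, each Pillow image would be drawn with the first `visible_chars(...)` characters of the slide text and written to the FFmpeg process's stdin as `image.tobytes()`.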
Generate karaoke-style subtitles as transparent PNG frames overlaid on the final video.
The process:
Audio (.wav)
→ Whisper small/medium transcription
→ Word-level timestamps (segments + words)
→ Text correction mapping (fix Whisper mis-transcriptions)
→ Pillow frame-by-frame PNG rendering (transparent BG)
→ FFmpeg overlay onto video
Use scripts/render_subtitles.py as the rendering engine.
Subtitle style specification (intro and outro MUST match):
| Property | Value |
|---|---|
| Font | STHeiti Medium (macOS: /System/Library/Fonts/STHeiti Medium.ttc) |
| Size | 44px |
| Spoken text color | Orange (#FF6B2B) |
| Unspoken text color | White (#FFFFFF) |
| Outline | 2px black |
| Background bar | Semi-transparent black rgba(0,0,0,160) |
| Display mode | Per-sentence (each sentence appears and disappears independently) |
| Highlight mode | Word-by-word (karaoke-style progressive highlight) |
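The word-by-word highlight reduces to splitting each sentence into a spoken part and an unspoken part at a given playback time. The following is a minimal sketch, assuming word timestamps arrive as (text, start, end) tuples in seconds and that Chinese words are joined without spaces. The names are hypothetical and the actual logic in scripts/render_subtitles.py may differ.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds)

SPOKEN_COLOR = "#FF6B2B"    # orange, from the style table
UNSPOKEN_COLOR = "#FFFFFF"  # white, from the style table


def split_highlight(words: List[Word], t: float) -> Tuple[str, str]:
    """Return (spoken, unspoken) text for one sentence at playback time t."""
    spoken = "".join(text for text, start, _ in words if start <= t)
    unspoken = "".join(text for text, start, _ in words if start > t)
    return spoken, unspoken


def sentence_visible(words: List[Word], t: float) -> bool:
    """Per-sentence display: show the sentence only between its first and last word."""
    return bool(words) and words[0][1] <= t <= words[-1][2]


sentence = [("人工", 0.0, 0.4), ("智能", 0.4, 0.9), ("很", 0.9, 1.1), ("有趣", 1.1, 1.6)]
spoken, unspoken = split_highlight(sentence, 0.5)
```

A renderer would draw `spoken` in the orange color and `unspoken` in white on the same baseline for every frame in which `sentence_visible` is true.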
Text correction mapping: Always maintain a correction dictionary to fix Whisper mis-transcriptions of technical terms and proper names:
corrections = {
"材领": "才林",
"Anthropy": "Anthropic",
"Cloud Code": "Claude Code",
}
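One way to apply such a dictionary is plain substring replacement over the transcribed text. This is a sketch, not the skill's implementation: `apply_corrections` is a hypothetical helper, and replacing longer keys first is a design choice made here so that overlapping entries behave predictably.

```python
def apply_corrections(text: str, corrections: dict) -> str:
    """Replace known Whisper mis-transcriptions with the intended terms."""
    # Longest keys first, so a short key never clobbers part of a longer one.
    for wrong in sorted(corrections, key=len, reverse=True):
        text = text.replace(wrong, corrections[wrong])
    return text


corrections = {
    "材领": "才林",
    "Anthropy": "Anthropic",
    "Cloud Code": "Claude Code",
}

fixed = apply_corrections("Anthropy released Cloud Code", corrections)
```

Only the displayed text changes; the word-level timestamps from Whisper are kept as they are.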
CRITICAL — Consistency rule: Intro and outro subtitles MUST use the exact same rendering engine (Pillow) with identical style properties. Never mix Pillow and ASS/other formats — FFmpeg on macOS lacks libass support.
Common audio issues and their fixes. Run scripts/audio_analyzer.py for automated
detection before proceeding.
| Issue | Symptom | Root Cause | Fix |
|---|---|---|---|
| Right channel dropout | Crunching noise at specific timestamps | Source right channel flickers 20+ times | `channelmap=map=FL-FL\|FL-FR` (copy the left channel to both outputs) |
| Silence gaps | Sudden "click" in music | AI-generated BGM has gaps (100-400ms) | 250ms fade-out/in at each gap boundary |
| Audio truncation | Sound stops abruptly | Segment extracted from wrong time range | Use original source file, re-extract |
| Channel mismatch | Concat fails or silent segments | Mono vs stereo mismatch | Unify all to 48000Hz stereo |
| Clipping | Peak near 32768 (16-bit max) | Volume stacking at concatenation points | alimiter with limit=-0.9dB |
| TTS mispronunciation | Garbled Chinese characters | TTS engine multi-phoneme errors | Re-generate with F5-TTS or edge-tts patch |
Channel fix command:
ffmpeg -i input.wav -af "channelmap=map=FL-FL|FL-FR" -ar 48000 -ac 2 output.wav
Gap smoothing approach:
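For each audio gap >80ms detected by scripts/audio_analyzer.py, split the audio at the gap, fade each side, and re-concatenate. Chaining both afade filters on the unsplit stream does not work, because afade silences everything after a fade-out and everything before a fade-in.
# Replace GAP_START and GAP_END with the detected times in seconds; FADE_OUT_START = GAP_START - 0.25
ffmpeg -i audio.wav -filter_complex "[0:a]atrim=end=GAP_START,afade=t=out:st=FADE_OUT_START:d=0.25[head];[0:a]atrim=start=GAP_START:end=GAP_END,asetpts=PTS-STARTPTS[gap];[0:a]atrim=start=GAP_END,asetpts=PTS-STARTPTS,afade=t=in:st=0:d=0.25[tail];[head][gap][tail]concat=n=3:v=0:a=1[out]" -map "[out]" patched.wav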
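Gap detection itself can be sketched as a scan for runs of near-silent samples. The code below is an illustration, not the logic of scripts/audio_analyzer.py: the amplitude threshold, the mono 16-bit sample list, and the function name `find_gaps` are all assumptions. Only silent runs with audio on both sides are reported.

```python
from typing import List, Sequence, Tuple


def find_gaps(
    samples: Sequence[int],
    sample_rate: int,
    threshold: int = 50,
    min_gap_s: float = 0.08,
) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) for interior silent runs longer than min_gap_s."""
    gaps = []
    run_start = None  # index where the current silent run began
    for index, sample in enumerate(samples):
        if abs(sample) <= threshold:
            if run_start is None:
                run_start = index
        else:
            # A loud sample ends the run; skip runs that started at index 0 (leading silence).
            if run_start is not None and run_start > 0:
                if (index - run_start) / sample_rate > min_gap_s:
                    gaps.append((run_start / sample_rate, index / sample_rate))
            run_start = None
    return gaps


# Synthetic signal at 1 kHz: 100 ms tone, 200 ms gap, 100 ms tone, 50 ms gap, 100 ms tone.
signal = [1000] * 100 + [0] * 200 + [1000] * 100 + [0] * 50 + [1000] * 100
detected = find_gaps(signal, sample_rate=1000)
```

Each reported pair gives the gap start and end times needed to place the 250ms fades.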
Assemble all segments with professional transitions using FFmpeg.
Compositing order:
intro.mp4 → content_1.mp4 → content_2.mp4 → content_3.mp4 → outro.mp4
Use scripts/compose_final.py for automated assembly.
Encoding parameters:
| Parameter | Value | Rationale |
|---|---|---|
| Codec | libx264 | Maximum compatibility |
| Resolution | 1280×720 | 16:9 standard |
| Frame rate | 24fps | Cinematic feel |
| Rate control | CRF 20 | High-quality unified encoding |
| Pixel format | yuv420p | Universal compatibility |
| Audio codec | AAC 192kbps | 48000Hz stereo |
| Limiter | alimiter limit=-0.9dB | Prevent clipping |
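When segments are joined with 0.5s cross-fades, each xfade offset depends on the accumulated length of everything before it, minus the overlaps already consumed. A small sketch of that arithmetic, using the segment durations from the example script.json; the function names are hypothetical.

```python
from typing import List


def xfade_offsets(durations: List[float], fade: float = 0.5) -> List[float]:
    """Offset (seconds) at which each successive xfade should start."""
    offsets = []
    elapsed = 0.0
    for duration in durations[:-1]:
        # Every joined clip contributes its length minus one overlapping fade.
        elapsed += duration - fade
        offsets.append(elapsed)
    return offsets


def composed_duration(durations: List[float], fade: float = 0.5) -> float:
    """Length of the final video after all cross-fades overlap."""
    return sum(durations) - fade * (len(durations) - 1)


segments = [10, 30, 25, 29, 10]  # intro, content_1..3, outro
offsets = xfade_offsets(segments)
final_length = composed_duration(segments)
```

For these durations the four offsets are 9.5, 39.0, 63.5 and 92.0 seconds, and the composed video runs 102.0 seconds rather than 104, because each of the four transitions overlaps by 0.5s.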
Transition effects:
- xfade=transition=fade:duration=0.5:offset=<time> (cross-fade, eliminates hard cuts)
- acrossfade=d=0.5:curve=tri (triangular cross-fade, smooth audio joins)

Pre-compositing checklist:
Run a systematic quality review before delivering the final video.
QA dimensions and inspection methods:
| Dimension | Check | Method |
|---|---|---|
| Video transitions | Hard cuts at boundaries? | Extract transition zone, frame-by-frame review |
| Video encoding | Consistent bitrate across segments? | ffprobe bitrate check |
| Audio artifacts | Noise, pops, silence gaps? | Second-by-second mean/peak analysis |
| Audio joins | Smooth at concatenation points? | Acrossfade spectral analysis |
| Audio clipping | Peaks near 32768? | Peak detection (>32000 = danger) |
| Subtitle sync | Subtitles aligned with speech? | Whisper word-level timestamp verification |
| Subtitle consistency | Intro and outro styles match? | Visual comparison of 7 style properties |
| Pronunciation | Chinese pronunciation accurate? | Whisper transcription cross-validation |
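The clipping check compares raw 16-bit peaks against a sample-value threshold, while the limiter ceiling is stated in dB. Converting between the two makes the relationship explicit. This is a sketch with hypothetical helper names, assuming 16-bit signed PCM where full scale is 32768.

```python
import math

FULL_SCALE_16BIT = 32768


def peak_dbfs(peak_sample: int) -> float:
    """Convert a 16-bit peak sample value to dB relative to full scale."""
    if peak_sample == 0:
        return float("-inf")
    return 20 * math.log10(abs(peak_sample) / FULL_SCALE_16BIT)


def db_to_linear(db: float) -> float:
    """Convert a dB value to a linear amplitude ratio."""
    return 10 ** (db / 20)


def is_clipping_risk(peak_sample: int, danger: int = 32000) -> bool:
    """Apply the '>32000 = danger' rule from the QA table."""
    return abs(peak_sample) > danger


danger_level_db = peak_dbfs(32000)
limiter_ceiling = db_to_linear(-0.9)
```

A peak of 32000 sits at about -0.21 dBFS, which is above the -0.9dB limiter ceiling (a linear ratio of about 0.90, or roughly 29543 in 16-bit samples). A track that has passed through the limiter should therefore stay well below the danger threshold.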
Verification commands:
# Audio per-second analysis
python3 scripts/audio_analyzer.py output.mp4
# Video quality check
ffprobe -v error -select_streams v:0 \
-show_entries stream=codec_name,width,height,r_frame_rate,bit_rate \
output.mp4
# Subtitle sync verification
whisper output.mp4 --model small --language zh --word_timestamps True
For the complete QA checklist, see references/qa_checklist.md.
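Output file naming: AI科普第{N}期_{主题}_v{N}.mp4 (AI Science Episode {N}, {topic}, version {N})
Example: AI科普第一期_SadTalker画中画_v11.mp4 (Episode 1, SadTalker picture-in-picture, version 11)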
Intermediate files:
| File | Purpose |
|---|---|
| content_video.mp4 / content_with_pip_v{N}.mp4 | Content with optional PiP |
| sadtalker_output.mp4 / intro.mp4 / outro.mp4 | Digital human outputs |
| content_audio.wav / ref_audio_24k.wav | Audio files |
| subs_s{N}/frame_{N}.png | Subtitle frames |
| build_v{N}.py / build_v{N}_fixed.py | Build scripts |
| merge_final.sh / concat_v{N}.txt | Merge scripts |
| Metric | Target |
|---|---|
| Video resolution | 1280×720 (16:9) |
| Frame rate | 24fps |
| Video bitrate | CRF 20 (~200-400 kbps) |
| Audio sample rate | 48000Hz stereo |
| Audio bitrate | AAC 192kbps |
| Audio peak | < -0.9dB (no clipping) |
| Segment transition | 0.5s xfade + acrossfade |
| Subtitle alignment | Whisper word-level timestamps |
For the complete pitfalls reference, see references/pitfalls.md. Key highlights:
F5-TTS Chinese too short: Always set estimate_duration=True for Chinese content narration. Without it, audio is only 0.5-0.9s per sentence.
Alpha channel compositing: When using alphamerge, the human RGBA video is the color source (first input), and the circular mask PNG is the alpha (second input). Reversing them produces a white circle with no human visible.
Concat format mismatch: Different segments may have different sample rates (16000 vs 48000Hz) or channel counts (mono vs stereo). Unify all segments to 48000Hz stereo before concatenation.
Subtitle rendering engine inconsistency: Always use Pillow for both intro and outro subtitles. FFmpeg on macOS lacks libass, making ASS-subtitle filters unavailable.
AI-generated BGM gaps: Google Flow's AI-generated background music may contain silence gaps (100-400ms). Smooth them with a 250ms fade-out and fade-in at each gap boundary.
- scripts/render_slides.py: Pillow-based content slide frame renderer (1280×720, dark IDE theme, progressive text reveal)
- scripts/render_subtitles.py: Karaoke-style subtitle PNG renderer (word-by-word orange highlight, transparent BG, STHeiti 44px)
- scripts/compose_final.py: End-to-end FFmpeg compositing (xfade + acrossfade + alimiter + CRF20 unified encoding)
- scripts/audio_analyzer.py: Audio QA analysis tool (second-by-second mean/peak detection, gap finder, clipping detector)
- references/script_format.md: Complete script.json format specification and examples
- references/qa_checklist.md: Detailed 8-dimension QA review checklist
- references/pitfalls.md: Comprehensive list of known pitfalls with root causes and fixes