AI Science Explainer Video Studio (for Mac mini 16GB)

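A skill that automates the full production pipeline for AI science explainer videos. It combines a digital human avatar (Google Flow / SadTalker), AI voice cloning (F5-TTS MLX), Pillow content slides, word-by-word karaoke subtitles (Pillow + FFmpeg), and professional audio/video QA into an 8-stage automated pipeline: script planning → digital human generation → TTS voice cloning → slide rendering → subtitle rendering → audio repair → final compositing (FFmpeg xfade + acrossfade + alimiter) → professional QA review. Trigger phrases: AI科普视频, 制作科普视频, 做个AI讲解视频, 生成科技短视频. **Recommended hardware: Mac mini (M-series) with 16GB RAM**, using Apple Silicon MLX acceleration for voice cloning and local rendering.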

Cambrian Intelligence (寒武纪智能) @hitjcl

Install

openclaw skills install @hitjcl/ai-science-video-studio

AI Science Video Studio: Automated AI Science Explainer Video Production Skill

Overview

Full pipeline for producing explainer / educational videos that combine a digital human avatar (intro/outro) with animated content slides (body), voice-narrated by a cloned personal voice (F5-TTS MLX), with karaoke-style subtitles throughout.

The pipeline follows an 8-stage workflow:

text

Script Planning → Digital Human → TTS Voice → Content Slides
    → Subtitles → Audio Repair → Final Compositing → QA Review

Default configuration is tuned for 1280×720 (16:9), 24fps, CRF 20 encoding, a single presenter avatar, and Mandarin Chinese narration. All parameters are adjustable.

When to Use This Skill

Trigger on any of the following intents:

User asks to create an "AI科普" (AI science explainer) video
User wants an educational/explainer video with digital human + slides format
User mentions combining a talking avatar with content slides
User needs the full pipeline: script → voice clone → slides → subtitles → compositing
User says "做一个讲解视频", "生成科普视频", "制作AI讲解类视频"

Do NOT use this skill for:

Pure short films / drama without educational content → use ai-short-film-studio
Only SadTalker PiP compositing without slides → use sadtalker-pip-compositing
Only Google Flow video generation → use google-flow-automation

Pipeline Stages

Stage 1: Script Planning

Create a script.json file defining the video structure with exactly 5 segments:

json

{
  "intro":      { "type": "digital_human", "engine": "google_flow", "duration": 10, "narration": "开场旁白...", "flow_prompt": "..." },
  "content_1":  { "type": "slides",        "engine": "pillow",      "duration": 30, "narration": "正文第一段旁白..." },
  "content_2":  { "type": "slides",        "engine": "pillow",      "duration": 25, "narration": "正文第二段旁白..." },
  "content_3":  { "type": "slides",        "engine": "pillow",      "duration": 29, "narration": "正文第三段旁白..." },
  "outro":      { "type": "digital_human", "engine": "google_flow", "duration": 10, "narration": "结尾旁白...", "flow_prompt": "..." }
}

Rules:

intro and outro use digital_human type (talking avatar)
content segments use slides type (animated content screens)
Each segment must specify: type, engine, duration (seconds), narration text
Narration text should be ≤15 seconds worth of speech per segment (~60 Chinese characters)
Duration field is the target video length (not TTS length — TTS naturally sets the pace)

For detailed script format specification, see references/script_format.md.
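
The rules above can be checked mechanically before any rendering starts. The following is a minimal sketch, not part of the bundled scripts: `validate_script`, `estimate_speech_seconds`, and the 4 characters-per-second rate are assumptions derived from the "~60 Chinese characters in 15 seconds" guideline.

```python
REQUIRED_KEYS = {"type", "engine", "duration", "narration"}
SEGMENT_ORDER = ["intro", "content_1", "content_2", "content_3", "outro"]
CHARS_PER_SECOND = 4.0  # assumption: ~60 characters in ~15 s, per the rule above


def estimate_speech_seconds(narration: str) -> float:
    """Rough speech-length estimate that counts CJK characters only."""
    cjk = sum(1 for ch in narration if "\u4e00" <= ch <= "\u9fff")
    return cjk / CHARS_PER_SECOND


def validate_script(script: dict) -> list[str]:
    """Return human-readable problems; an empty list means the script passed."""
    problems = []
    if list(script) != SEGMENT_ORDER:
        problems.append(f"expected segments {SEGMENT_ORDER}, got {list(script)}")
    for name, seg in script.items():
        missing = REQUIRED_KEYS - seg.keys()
        if missing:
            problems.append(f"{name}: missing {sorted(missing)}")
            continue
        # intro/outro are talking-avatar segments, everything else is slides
        expected_type = "digital_human" if name in ("intro", "outro") else "slides"
        if seg["type"] != expected_type:
            problems.append(f"{name}: type should be {expected_type!r}")
        if estimate_speech_seconds(seg["narration"]) > 15:
            problems.append(f"{name}: narration likely exceeds 15 s of speech")
    return problems
```

Load script.json with `json.load` and pass the resulting dict to `validate_script`; an empty list means every rule held. The character-rate estimate is deliberately coarse, so treat a warning as a prompt to listen to the TTS output rather than as a hard failure.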

Stage 2: Digital Human Generation (Intro & Outro)

Two approaches are available. Prefer Google Flow for standalone talking-head segments; use SadTalker for picture-in-picture overlay on content slides.

Option A: Google Flow CDP Automation

Use the google-flow-automation skill to generate intro/outro videos:

Launch Chrome with remote debugging on port 9222
Navigate to labs.google/fx/tools/flow
Upload avatar reference image
Enter the Chinese prompt from script.json
Wait ~3-5 minutes for 10-second video generation
Download as intro.mp4 and outro.mp4

Key parameters:

Avatar: upload the user's preferred reference image (portrait photo)
Prompt: in Chinese, describe the scene and delivery style
Model: Omni Flash, 16:9 aspect ratio, 10s duration
Account: the user's Google account credentials (handled by Chrome profile)

Option B: SadTalker MPS (for PiP on content)

Use the sadtalker-pip-compositing skill when the digital human should appear as a circular picture-in-picture overlay on content slides.

Steps:

Run scripts/fix_sadtalker_numpy.py for numpy 2.x compatibility
Extract avatar image + TTS audio
Run SadTalker 3-stage inference with device='mps'
Create circular mask (120×120) with PIL
FFmpeg overlay onto content at bottom-left corner

PiP Parameters:

Parameter	Value
Size	120×120 (final)
Position	bottom-left, 20px margin
Mask	PIL circular, radius 60px
Overlay	`overlay=20:H-h-20:shortest=1`

Stage 3: Voice Generation (TTS)

Primary: F5-TTS MLX Voice Cloning

Use F5-TTS MLX on Apple Silicon for personal voice cloning:

python

from f5_tts_mlx.generate import generate

# For content narration (MUST use estimate_duration=True!)
audio = generate(
    text="旁白文本...",
    ref_audio_path="/path/to/ref_voice.mp3",
    ref_audio_text="参考音频的文本内容",
    steps=64,
    cfg_strength=2.5,
    speed=1.0,
    estimate_duration=True,  # CRITICAL for Chinese!
)

CRITICAL: estimate_duration=True. Without this parameter, F5-TTS generates extremely short audio for Chinese text (0.5-0.9 seconds per sentence). With it, the model estimates a target duration and generates audio of the proper length.

Parameter table:

Parameter	Intro/Outro	Content
steps	64	64
cfg_strength	2.5	2.5
speed	0.45	1.0
estimate_duration	No	Yes (critical!)

Post-processing: After generation, compute the actual-vs-target duration ratio and apply atempo to fine-tune timing:

bash

# Example: actual 11.98s, target 10.0s → atempo=1.198
ffmpeg -i generated.wav -filter:a "atempo=1.198" output.wav
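
The ratio in the example is simply actual duration divided by target duration. A small sketch of that arithmetic follows; `atempo_chain` is a hypothetical helper, and splitting large ratios into several stages kept inside [0.5, 2.0] is an assumption made so the filter string also works on older FFmpeg builds that restrict each atempo instance to that range.

```python
def atempo_chain(actual_s: float, target_s: float) -> str:
    """Build an FFmpeg atempo filter string that maps actual_s onto target_s."""
    if actual_s <= 0 or target_s <= 0:
        raise ValueError("durations must be positive")
    ratio = actual_s / target_s  # >1 speeds the audio up, <1 slows it down
    factors = []
    # Split extreme ratios into stages that each stay inside [0.5, 2.0].
    while ratio > 2.0:
        factors.append(2.0)
        ratio /= 2.0
    while ratio < 0.5:
        factors.append(0.5)
        ratio /= 0.5
    factors.append(ratio)
    return ",".join(f"atempo={f:.4g}" for f in factors)


print(atempo_chain(11.98, 10.0))  # atempo=1.198
```

The returned string can be passed directly as the value of `-filter:a`. Large corrections are audible, so a ratio far from 1.0 is usually a sign that the narration should be regenerated instead.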

Fallback: edge-tts

When F5-TTS is unavailable or produces garbled output:

bash

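edge-tts --voice zh-CN-YunxiNeural --text "旁白文本" --write-media output.mp3
ffmpeg -i output.mp3 -ar 48000 -ac 2 output.wav  # edge-tts writes MP3 data, so convert to the pipeline's 48000Hz stereo WAV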

Voice selection:

Purpose	Voice
Content narration (male)	zh-CN-YunxiNeural
Patch/correction (female)	zh-CN-XiaoxiaoNeural

Stage 4: Content Slide Rendering

Render animated content slides using Pillow frame-by-frame rendering + FFmpeg pipe.

Use scripts/render_slides.py as the template. The script should:

Accept narration text split into lines
Render each frame with progressively "typed" text (one new line per frame)
Use terminal/IDE aesthetic: dark background (#1a1a2e), green/white text, monospace font
Output 1280×720, 24fps PNG frames via FFmpeg pipe
Sync frame count to the TTS audio duration

Key rendering parameters:

Resolution: 1280×720
Frame rate: 24fps
Background: dark (#1a1a2e or pure black for terminal look)
Text: green (#00ff41) for code, white for explanatory text
Font: SF Mono or Menlo for code sections; STHeiti for Chinese text

The script is at scripts/render_slides.py. Customize the content per video topic while keeping the rendering engine intact.
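
Syncing the frame count to the audio and pacing the text reveal both come down to simple arithmetic. The sketch below is an illustration only and does not reproduce scripts/render_slides.py: `frame_count` and `visible_lines` are hypothetical names, and spreading the lines evenly across the segment is an assumed pacing policy.

```python
import math

FPS = 24


def frame_count(audio_seconds: float, fps: int = FPS) -> int:
    """Frames needed to cover the audio, rounded up so the video never ends early."""
    # round() guards against float noise such as 3.0000000000000004
    return math.ceil(round(audio_seconds * fps, 6))


def visible_lines(frame_index: int, total_frames: int, total_lines: int) -> int:
    """How many narration lines are revealed at a given frame (evenly paced)."""
    if total_frames <= 0 or total_lines <= 0:
        return 0
    frames_per_line = total_frames / total_lines
    return min(total_lines, int(frame_index / frames_per_line) + 1)
```

For a 10-second narration with four lines, `frame_count(10.0)` gives 240 frames and a new line appears every 60 frames. A renderer would call `visible_lines` once per frame and draw only that many lines.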

Stage 5: Subtitle Rendering

Generate karaoke-style subtitles as transparent PNG frames overlaid on the final video.

The process:

text

Audio (.wav)
  → Whisper small/medium transcription
  → Word-level timestamps (segments + words)
  → Text correction mapping (fix Whisper mis-transcriptions)
  → Pillow frame-by-frame PNG rendering (transparent BG)
  → FFmpeg overlay onto video

Use scripts/render_subtitles.py as the rendering engine.

Subtitle style specification (intro and outro MUST match):

Property	Value
Font	STHeiti Medium (macOS: `/System/Library/Fonts/STHeiti Medium.ttc`)
Size	44px
Spoken text color	Orange (#FF6B2B)
Unspoken text color	White (#FFFFFF)
Outline	2px black
Background bar	Semi-transparent black `rgba(0,0,0,160)`
Display mode	Per-sentence (each sentence appears and disappears independently)
Highlight mode	Word-by-word (karaoke-style progressive highlight)
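
The per-sentence display and word-by-word highlight modes can be expressed as two small functions over word timestamps. This is a sketch under assumptions: the `Word` tuple, `split_spoken`, and `sentence_visible` are hypothetical names, and words are joined without spaces, which suits Chinese text.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    start: float  # seconds
    end: float    # seconds


def split_spoken(words: list[Word], t: float) -> tuple[str, str]:
    """Split one sentence into (spoken, unspoken) text at playback time t."""
    spoken = "".join(w.text for w in words if w.start <= t)
    unspoken = "".join(w.text for w in words if w.start > t)
    return spoken, unspoken


def sentence_visible(words: list[Word], t: float) -> bool:
    """A sentence is shown only between its first word's start and last word's end."""
    return bool(words) and words[0].start <= t <= words[-1].end
```

For each frame at time `t = frame_index / 24`, a renderer would draw the spoken part in orange (#FF6B2B) and the unspoken part in white (#FFFFFF) whenever `sentence_visible` is true, and emit a fully transparent frame otherwise.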

Text correction mapping: Always maintain a correction dictionary to fix Whisper mis-transcriptions of technical terms and proper names:

python

corrections = {
    "材领": "才林",
    "Anthropy": "Anthropic",
    "Cloud Code": "Claude Code",
}
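
One way to apply such a dictionary is plain substring replacement on each transcribed segment. The sketch below assumes that approach; `apply_corrections` is a hypothetical helper, and replacing longer keys first is a precaution against one key being a prefix of another.

```python
corrections = {
    "材领": "才林",
    "Anthropy": "Anthropic",
    "Cloud Code": "Claude Code",
}


def apply_corrections(text: str, corrections: dict[str, str]) -> str:
    """Replace each mis-transcribed phrase, longest keys first."""
    for wrong in sorted(corrections, key=len, reverse=True):
        text = text.replace(wrong, corrections[wrong])
    return text


print(apply_corrections("Anthropy 发布了 Cloud Code", corrections))
# Anthropic 发布了 Claude Code
```

Applying the mapping to segment text rather than to individual words matters for entries such as "Cloud Code", which Whisper may return as two separate words with their own timestamps.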

CRITICAL: Consistency rule. Intro and outro subtitles MUST use the exact same rendering engine (Pillow) with identical style properties. Never mix Pillow and ASS/other formats; FFmpeg builds on macOS are not guaranteed to include libass, so ASS subtitle filters may be unavailable.

Stage 6: Audio Repair

Common audio issues and their fixes. Run scripts/audio_analyzer.py for automated detection before proceeding.

Issue	Symptom	Root Cause	Fix
Right channel dropout	Crunching noise at specific timestamps	Source right channel flickers 20+ times	`channelmap=map=FL-FL|FL-FR` (copy the left channel to both outputs)
Silence gaps	Sudden "click" in music	AI-generated BGM has gaps (100-400ms)	250ms fade-out/in at each gap boundary
Audio truncation	Sound stops abruptly	Segment extracted from wrong time range	Use original source file, re-extract
Channel mismatch	Concat fails or silent segments	Mono vs stereo mismatch	Unify all to 48000Hz stereo
Clipping	Peak near 32767 (16-bit max)	Volume stacking at concatenation points	`alimiter` with `limit=-0.9dB`
TTS mispronunciation	Garbled Chinese characters	TTS engine multi-phoneme errors	Re-generate with F5-TTS or edge-tts patch

Channel fix command:

bash

ffmpeg -i input.wav -af "channelmap=map=FL-FL|FL-FR" -ar 48000 -ac 2 output.wav

Gap smoothing approach: For each audio gap >80ms detected by scripts/audio_analyzer.py:

bash

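# Split at the gap start, fade out the head, fade in the tail after the gap, then re-concatenate.
# Substitute literal seconds: FADE_OUT_START = GAP_START - 0.25, GAP_LEN = GAP_END - GAP_START (afade does not evaluate arithmetic).
ffmpeg -i audio.wav -filter_complex "[0:a]atrim=end=GAP_START,afade=t=out:st=FADE_OUT_START:d=0.25[head];[0:a]atrim=start=GAP_START,asetpts=PTS-STARTPTS,afade=t=in:st=GAP_LEN:d=0.25[tail];[head][tail]concat=n=2:v=0:a=1[out]" -map "[out]" patched.wav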
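
Detecting gaps longer than 80ms amounts to finding long runs of near-silent samples. The following sketch shows the idea on decoded mono 16-bit samples; it does not reproduce scripts/audio_analyzer.py, and `find_gaps` with its amplitude threshold of 64 is an assumption chosen for illustration.

```python
def find_gaps(samples, sample_rate, threshold=64, min_gap_s=0.08):
    """Return (start_s, end_s) for each interior run of near-silence longer than min_gap_s."""
    gaps = []
    run_start = None
    for i, s in enumerate(samples):
        if abs(s) <= threshold:
            if run_start is None:
                run_start = i  # a silent run begins here
        else:
            # A run counts only if it sits between audible audio (not leading silence).
            if run_start is not None and run_start > 0:
                if (i - run_start) / sample_rate > min_gap_s:
                    gaps.append((run_start / sample_rate, i / sample_rate))
            run_start = None
    # Trailing silence is ignored: it is the end of the clip, not a gap.
    return gaps
```

Each returned pair supplies the gap start and end times needed for the fade-out/fade-in fix. Samples can be read with the standard `wave` module and unpacked with `array.array("h")` for 16-bit PCM.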

Stage 7: Final Compositing

Assemble all segments with professional transitions using FFmpeg.

Compositing order:

text

intro.mp4 → content_1.mp4 → content_2.mp4 → content_3.mp4 → outro.mp4

Use scripts/compose_final.py for automated assembly.

Encoding parameters:

Parameter	Value	Rationale
Codec	libx264	Maximum compatibility
Resolution	1280×720	16:9 standard
Frame rate	24fps	Cinematic feel
Rate control	CRF 20	High-quality unified encoding
Pixel format	yuv420p	Universal compatibility
Audio codec	AAC 192kbps	48000Hz stereo
Limiter	alimiter limit=-0.9dB	Prevent clipping

Transition effects:

Video: xfade=transition=fade:duration=0.5:offset=<time> (cross-fade, eliminates hard cuts)
Audio: acrossfade=d=0.5:c1=tri:c2=tri (triangular cross-fade on both sides of the join, smooth audio transitions)

Pre-compositing checklist:

All segments re-encoded to CRF 20 (unified quality)
All audio normalized to 48000Hz stereo
Subtitle overlays applied to each segment
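Transition offsets computed: each xfade offset is the cumulative duration of the preceding segments minus 0.5s for every transition so far (the first offset is the intro length minus 0.5s)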
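
Because every cross-fade overlaps two segments, each xfade offset has to be measured on the shortened output timeline. The sketch below works through that arithmetic and assembles a filter graph; it is not the contents of scripts/compose_final.py, and `xfade_offsets`, `build_filter_graph`, and the `[vN]`/`[aN]` labels are hypothetical. It uses acrossfade's `c1` and `c2` options to select the triangular curve.

```python
def xfade_offsets(durations, fade=0.5):
    """Offset of each transition, measured on the running output timeline."""
    offsets = []
    elapsed = 0.0
    for d in durations[:-1]:
        elapsed += d - fade  # each cross-fade shortens the timeline by `fade`
        offsets.append(round(elapsed, 3))
    return offsets


def build_filter_graph(durations, fade=0.5):
    """Chain xfade (video) and acrossfade (audio) over all inputs, then limit peaks."""
    parts = []
    v_prev, a_prev = "[0:v]", "[0:a]"
    for i, offset in enumerate(xfade_offsets(durations, fade), start=1):
        v_out, a_out = f"[v{i}]", f"[a{i}]"
        parts.append(
            f"{v_prev}[{i}:v]xfade=transition=fade:duration={fade}:offset={offset}{v_out}"
        )
        parts.append(f"{a_prev}[{i}:a]acrossfade=d={fade}:c1=tri:c2=tri{a_out}")
        v_prev, a_prev = v_out, a_out
    parts.append(f"{a_prev}alimiter=limit=-0.9dB[aout]")
    return ";".join(parts), v_prev, "[aout]"


print(xfade_offsets([10, 30, 25, 29, 10]))  # [9.5, 39.0, 63.5, 92.0]
```

The returned graph goes to `-filter_complex`, and the two labels go to `-map`. Use the measured duration of each rendered segment (for example from ffprobe) rather than the target durations in script.json, since TTS pacing can shift them.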

Stage 8: Professional QA Review

Run a systematic quality review before delivering the final video.

QA dimensions and inspection methods:

Dimension	Check	Method
Video transitions	Hard cuts at boundaries?	Extract transition zone, frame-by-frame review
Video encoding	Consistent bitrate across segments?	`ffprobe` bitrate check
Audio artifacts	Noise, pops, silence gaps?	Second-by-second mean/peak analysis
Audio joins	Smooth at concatenation points?	Acrossfade spectral analysis
Audio clipping	Peaks near 32767 (16-bit max)?	Peak detection (>32000 = danger)
Subtitle sync	Subtitles aligned with speech?	Whisper word-level timestamp verification
Subtitle consistency	Intro and outro styles match?	Visual comparison against every property in the Stage 5 style table
Pronunciation	Chinese pronunciation accurate?	Whisper transcription cross-validation

Verification commands:

bash

# Audio per-second analysis
python3 scripts/audio_analyzer.py output.mp4

# Video quality check
ffprobe -v error -select_streams v:0 \
  -show_entries stream=codec_name,width,height,r_frame_rate,bit_rate \
  output.mp4

# Subtitle sync verification
whisper output.mp4 --model small --language zh --word_timestamps True

For the complete QA checklist, see references/qa_checklist.md.
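
The second-by-second mean/peak analysis and the ">32000 = danger" rule can be illustrated on decoded mono 16-bit samples. This is a sketch of the idea only, not the bundled scripts/audio_analyzer.py; `per_second_stats` and `clipping_seconds` are hypothetical names.

```python
def per_second_stats(samples, sample_rate):
    """Return (mean_abs, peak_abs) for each one-second window of the samples."""
    stats = []
    for i in range(0, len(samples), sample_rate):
        window = [abs(s) for s in samples[i:i + sample_rate]]
        stats.append((sum(window) / len(window), max(window)))
    return stats


def clipping_seconds(samples, sample_rate, danger=32000):
    """Indices of the seconds whose peak exceeds the danger threshold."""
    return [
        sec
        for sec, (_, peak) in enumerate(per_second_stats(samples, sample_rate))
        if peak > danger
    ]
```

A second whose mean drops sharply compared with its neighbours hints at a silence gap, while any index returned by `clipping_seconds` points at a join that needs the limiter or a lower gain.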

File Naming Convention

text

AI科普第{N}期_{主题}_v{N}.mp4

Example: AI科普第一期_SadTalker画中画_v11.mp4

Intermediate files:

File	Purpose
`content_video.mp4` / `content_with_pip_v{N}.mp4`	Content with optional PiP
`sadtalker_output.mp4` / `intro.mp4` / `outro.mp4`	Digital human outputs
`content_audio.wav` / `ref_audio_24k.wav`	Audio files
`subs_s{N}/frame_{N}.png`	Subtitle frames
`build_v{N}.py` / `build_v{N}_fixed.py`	Build scripts
`merge_final.sh` / `concat_v{N}.txt`	Merge scripts

Quality Targets

Metric	Target
Video resolution	1280×720 (16:9)
Frame rate	24fps
Video bitrate	CRF 20 (~200-400 kbps)
Audio sample rate	48000Hz stereo
Audio bitrate	AAC 192kbps
Audio peak	< -0.9dB (no clipping)
Segment transition	0.5s xfade + acrossfade
Subtitle alignment	Whisper word-level timestamps

Critical Pitfalls

For the complete pitfalls reference, see references/pitfalls.md. Key highlights:

F5-TTS Chinese too short: Always set estimate_duration=True for Chinese content narration. Without it, audio is only 0.5-0.9s per sentence.
Alpha channel compositing: When using alphamerge, the human RGBA video is the color source (first input), and the circular mask PNG is the alpha (second input). Reversing them produces a white circle with no human visible.
Concat format mismatch: Different segments may have different sample rates (16000 vs 48000Hz) or channel counts (mono vs stereo). Unify all segments to 48000Hz stereo before concatenation.
Subtitle rendering engine inconsistency: Always use Pillow for both intro and outro subtitles. FFmpeg builds on macOS are not guaranteed to include libass, in which case ASS subtitle filters are unavailable.
AI-generated BGM gaps: Google Flow's AI-generated background music may contain silence gaps (100-400ms). Smooth them with a 250ms fade-out and fade-in at each gap boundary.

Bundled Resources

Scripts

scripts/render_slides.py — Pillow-based content slide frame renderer (1280×720, dark IDE theme, progressive text reveal)
scripts/render_subtitles.py — Karaoke-style subtitle PNG renderer (word-by-word orange highlight, transparent BG, STHeiti 44px)
scripts/compose_final.py — End-to-end FFmpeg compositing (xfade + acrossfade + alimiter + CRF20 unified encoding)
scripts/audio_analyzer.py — Audio QA analysis tool (second-by-second mean/peak detection, gap finder, clipping detector)

References

references/script_format.md — Complete script.json format specification and examples
references/qa_checklist.md — Detailed 8-dimension QA review checklist
references/pitfalls.md — Comprehensive list of known pitfalls with root causes and fixes