PPT Audio To Video

v0.1.0

Convert narration audio plus slide decks into a narrated video. Use when the user has an audio-only `mp4/m4a/mp3/wav` and a `ppt/pptx/pdf` deck, and needs sl...

by Zhaofeng (@lzfxxx)
Security Scan
VirusTotal
Benign
OpenClaw
Benign
high confidence
Purpose & Capability
The skill's name and description match the included scripts and SKILL.md: extracting text from PPTX, producing slide images, transcribing audio, building a timing CSV, and rendering via ffmpeg are all expected behaviors. One minor inconsistency: the registry metadata lists no required binaries or environment variables, but SKILL.md explicitly requires ffmpeg/ffprobe/pdftoppm and whisper-cpp (plus a whisper model). This is reasonable for the task, but the metadata omission means the platform won't automatically surface those runtime dependencies.
Instruction Scope
SKILL.md stays on task: it references only slide files, audio files, timing CSVs, local model downloads, and standard tools. The instructions do suggest downloading a model from Hugging Face and using platform-specific exporters (Keynote/PowerPoint/soffice), which is expected for slide rendering; there are no instructions to read unrelated system files or to transmit data to unexpected endpoints beyond the model download URL.
Install Mechanism
This is instruction-only with two small helper scripts; there is no install spec. The SKILL.md suggests installing via brew and downloading a model with curl from Hugging Face — those are normal but are network operations performed at runtime by the user/agent, not packaged installers. No arbitrary or obfuscated remote code downloads are embedded in the skill files.
Credentials
The skill requests no environment variables or credentials. The files and tools it uses (local PPTX/PDF/images/audio, whisper model binary, ffmpeg) are proportional to the stated functionality. There are no requests for API keys, cloud credentials, or unrelated secrets.
Persistence & Privilege
The always flag is false, and the skill does not request special persistent privileges or modify other skills or system configs. It runs local scripts and spawns ffmpeg as expected; autonomous invocation is allowed by platform default but is not combined with other red flags here.
Assessment
This skill appears to do what it says and operates on local files. Before installing or running it:

  1. Ensure your environment has the required tools (ffmpeg, ffprobe, pdftoppm/poppler, whisper-cpp), since the metadata does not declare them.
  2. Expect to download a Whisper model binary from Hugging Face; check the license, size, and URL before downloading.
  3. The scripts operate on files you provide and write outputs (CSV, ffconcat, MP4) to working directories; review output paths before running.
  4. If you need to run on sensitive audio, note that transcription happens locally if you use whisper-cpp; if you substitute a cloud ASR, review that service's privacy policy.
  5. Review the two bundled Python scripts (they are short and readable) if you want full assurance; they do not perform network calls or access unrelated credentials.


latest vk973y881d2a798frkx5yp7ymh582r7zy
207 downloads · 0 stars · 1 version · Updated 1 month ago
v0.1.0 · MIT-0

PPT Audio To Video

Use this skill when the source video has narration audio but no usable slide visuals, and the final deliverable should be a slide-based lecture video.

Resolve bundled scripts relative to this skill directory. If the runtime has already opened this SKILL.md, prefer paths like scripts/extract_slide_outline.py and scripts/render_from_timing_csv.py instead of machine-specific absolute paths.

Core workflow

  1. Inventory inputs.

    • Confirm which of these exist: audio-only mp4/m4a/mp3/wav, ppt/pptx, pdf, and any pre-rendered slide images.
    • Prefer an existing pdf or image directory for rendering. Treat pptx as the source of slide text and as a fallback for export.
  2. Prepare tools.

    • Required for deterministic steps: ffmpeg, ffprobe, pdftoppm.
    • Required for transcription: whisper-cli from whisper-cpp plus a multilingual model such as ggml-small.bin.
    • If only pptx exists and no pdf/images exist, prefer Keynote or PowerPoint export on macOS. Use soffice only as a fallback, because profile and rendering issues are common.
  3. Produce slide images.

    • If pdf exists, render it to images:
      pdftoppm -png -r 200 "$PDF" "$OUTDIR/slide"
      
    • If only pptx exists, export to pdf or slide images with Keynote or PowerPoint, then continue from pdf.
    • Keep slide filenames ordered and stable, such as slide-01.png, slide-02.png, ...
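pdftoppm pads page numbers only to the width of the page count, so a short deck can yield slide-1.png while a longer one yields slide-01.png. A small helper (hypothetical, not bundled with the skill) can normalize names so lexical order matches slide order:

```python
import re
from pathlib import Path

def normalize_slide_names(outdir, width=2):
    """Zero-pad numeric suffixes so lexical sort matches slide order,
    e.g. slide-1.png -> slide-01.png. Already-padded names are left alone."""
    for path in sorted(Path(outdir).glob("slide-*.png")):
        m = re.match(r"slide-(\d+)\.png$", path.name)
        if m:
            padded = f"slide-{int(m.group(1)):0{width}d}.png"
            if padded != path.name:
                path.rename(path.with_name(padded))
```

Run it once on the output directory before building the timing CSV, so slide numbers and filenames stay in lockstep.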
  4. Extract slide text.

    • Run:
      python3 scripts/extract_slide_outline.py \
        --pptx "$PPTX" \
        --out "$WORKDIR/slide_outline.csv"
      
    • Use the output to identify slide titles, distinctive keywords, and section changes.
  5. Extract clean audio for ASR.

    • For audio-only mp4, extract mono wav:
      ffmpeg -y -i "$AUDIO_MP4" -ar 16000 -ac 1 -c:a pcm_s16le "$WORKDIR/audio.wav"
      
    • If the source is already wav/mp3/m4a, convert to the same mono wav form if needed.
  6. Transcribe with whisper-cli.

    • Example:
      whisper-cli \
        -m "$MODEL" \
        -f "$WORKDIR/audio.wav" \
        -l zh \
        -ocsv -osrt -of "$WORKDIR/transcript"
      
    • Prefer transcript.csv for downstream parsing. transcript.srt is useful for manual review.
    • If GPU allocation fails on macOS, retry with -ng to force CPU mode.
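For downstream parsing, transcript.csv can be loaded with a sketch like this. It assumes whisper-cpp's -ocsv layout of start,end,text with millisecond timestamps; verify against the header of your build's output before relying on it:

```python
import csv

def load_segments(path):
    """Parse whisper-cpp -ocsv output into (start_sec, end_sec, text) tuples.
    Assumes columns start,end,text with times in milliseconds; skips the
    header row and any blank lines."""
    segments = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if not row or not row[0].strip().isdigit():
                continue  # header or blank line
            start_ms, end_ms = int(row[0]), int(row[1])
            text = row[2].strip() if len(row) > 2 else ""
            segments.append((start_ms / 1000.0, end_ms / 1000.0, text))
    return segments
```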
  7. Build slide_timings.csv.

    • Do not average slide durations unless the user explicitly asks for it.
    • Read the transcript and slide outline together, then create a monotonic timing plan by topic changes, section boundaries, and unique keywords.
    • Use this schema:
      slide,start_sec,end_sec,duration_sec,reason
      1,0.000,15.000,15.000,opening title and agenda
      2,15.000,100.000,85.000,architecture overview starts here
      
    • Keep slide numbers sequential and ensure duration_sec = end_sec - start_sec.
    • Validate that the last end_sec matches the audio duration or is within a small tolerance.
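The invariants above can be checked mechanically. A minimal validator sketch (not the bundled render script, just the checks this step describes):

```python
import csv

def validate_timings(path, audio_duration, tol=0.5):
    """Check slide_timings.csv: sequential slide numbers, monotonic
    non-overlapping times, duration_sec == end_sec - start_sec, and a
    final end_sec within `tol` seconds of the audio duration."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    prev_end = 0.0
    for i, row in enumerate(rows, start=1):
        start, end = float(row["start_sec"]), float(row["end_sec"])
        assert int(row["slide"]) == i, f"slide numbers must be sequential (row {i})"
        assert abs(start - prev_end) < 1e-6, f"gap or overlap before slide {i}"
        assert end > start, f"non-positive duration at slide {i}"
        assert abs(float(row["duration_sec"]) - (end - start)) < 1e-6, \
            f"duration mismatch at slide {i}"
        prev_end = end
    assert abs(prev_end - audio_duration) <= tol, \
        "last end_sec does not match the audio duration"
```

Run it after every manual edit to the CSV; a failed assertion points at the offending row before ffmpeg ever runs.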
  8. Render the final video.

    • Run:
      python3 scripts/render_from_timing_csv.py \
        --images "$SLIDE_IMAGES_DIR" \
        --timings "$WORKDIR/slide_timings.csv" \
        --audio "$WORKDIR/audio.wav" \
        --output "$OUT_VIDEO"
      
    • The script generates an ffconcat file, validates timing continuity, and calls ffmpeg to encode the final mp4.
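As a rough illustration of what the script's ffconcat step produces (a sketch of the playlist format, not the bundled script's actual code):

```python
def build_ffconcat(image_paths, durations):
    """Emit an ffconcat v1.0 playlist. The concat demuxer may ignore the
    duration of the final entry, so the last image is listed twice: a
    common workaround to make the final slide hold for its full duration."""
    lines = ["ffconcat version 1.0"]
    for path, dur in zip(image_paths, durations):
        lines.append(f"file '{path}'")
        lines.append(f"duration {dur:.3f}")
    if image_paths:
        lines.append(f"file '{image_paths[-1]}'")
    return "\n".join(lines) + "\n"
```

The resulting file is then muxed with the audio along the lines of ffmpeg -f concat -safe 0 -i slides.ffconcat -i audio.wav (exact encoder flags are up to the render script).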
  9. Verify and iterate.

    • Check output duration with ffprobe.
    • If a slide cuts too early or too late, edit only the affected rows in slide_timings.csv and rerun the render script.
    • Keep the transcript, outline, and timing CSV as reproducible working files.

Heuristics for timing alignment

  • Keep section-divider slides brief; they usually hold for 5-20 seconds.
  • Use the first segment that clearly switches topic as the next slide start.
  • Prefer exact topic transitions over title-word matching. ASR often distorts proper nouns and product names.
  • Let the model infer timings, but keep the render step deterministic through slide_timings.csv.
  • When confidence is low, produce a first-cut video and tell the user which slide boundaries likely need review.
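One way to turn these heuristics into a first-cut plan, sketched under the assumption that segments come from the transcript CSV and per-slide keywords from the slide outline (the function and its inputs are illustrative, not part of the bundled scripts):

```python
def propose_boundaries(segments, slide_keywords):
    """For each slide after the first, pick the start time of the first
    transcript segment mentioning one of its distinctive keywords.
    segments: list of (start_sec, end_sec, text);
    slide_keywords: one keyword list per slide, in slide order."""
    boundaries = [0.0]  # slide 1 always starts at zero
    cursor = 0
    for keywords in slide_keywords[1:]:
        hit = None
        for idx in range(cursor, len(segments)):
            start, _end, text = segments[idx]
            if any(k.lower() in text.lower() for k in keywords):
                hit, cursor = start, idx + 1
                break
        boundaries.append(hit)  # None => flag this boundary for manual review
    return boundaries
```

Scanning forward from a cursor keeps the plan monotonic, and a None result is exactly the "low confidence" case to surface to the user.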

Common commands

Install dependencies on macOS if missing:

brew install ffmpeg poppler whisper-cpp

Typical multilingual model download:

mkdir -p .models
curl -L 'https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.bin' -o .models/ggml-small.bin

Bundled scripts

  • scripts/extract_slide_outline.py Extract slide text from pptx into CSV or JSON for timing analysis.
  • scripts/render_from_timing_csv.py Validate a timing CSV, generate an ffconcat, and render the final video with ffmpeg.
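For reference, the core of a pptx text extractor can be sketched with only the standard library, since a pptx file is a zip archive whose slides live at ppt/slides/slideN.xml with text runs in a:t elements (the bundled script may be implemented differently; this just shows the idea):

```python
import re
import zipfile
import xml.etree.ElementTree as ET

A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"

def extract_slide_text(pptx_path):
    """Pull visible text runs (<a:t> elements) from each slide's XML.
    Returns {slide_number: [text_run, ...]} sorted by slide number."""
    slides = {}
    with zipfile.ZipFile(pptx_path) as z:
        for name in z.namelist():
            m = re.match(r"ppt/slides/slide(\d+)\.xml$", name)
            if m:
                root = ET.fromstring(z.read(name))
                runs = [t.text for t in root.iter(f"{A_NS}t") if t.text]
                slides[int(m.group(1))] = runs
    return dict(sorted(slides.items()))
```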
