PPT Audio To Video

v0.1.0

Convert narration audio plus slide decks into a narrated video. Use when the user has an audio-only `mp4/m4a/mp3/wav` and a `ppt/pptx/pdf` deck, and needs sl...

by Zhaofeng (@lzfxxx)
Security Scan
VirusTotal
Benign
OpenClaw
Benign
high confidence
Purpose & Capability
The skill's name and description match the included scripts and SKILL.md: extracting text from PPTX, producing slide images, transcribing audio, building a timing CSV, and rendering via ffmpeg are all expected behaviors. One minor inconsistency: the registry metadata lists no required binaries or environment variables, but SKILL.md explicitly requires ffmpeg/ffprobe/pdftoppm and whisper-cpp (plus a whisper model). This is reasonable for the task, but the metadata omission means the platform won't automatically surface those runtime dependencies.
Instruction Scope
SKILL.md stays on task: it references only slide files, audio files, timing CSVs, local model downloads, and standard tools. The instructions do suggest downloading a model from Hugging Face and using platform-specific exporters (Keynote/PowerPoint/soffice), which is expected for slide rendering; there are no instructions to read unrelated system files or to transmit data to unexpected endpoints beyond the model download URL.
Install Mechanism
This is instruction-only with two small helper scripts; there is no install spec. The SKILL.md suggests installing via brew and downloading a model with curl from Hugging Face — those are normal but are network operations performed at runtime by the user/agent, not packaged installers. No arbitrary or obfuscated remote code downloads are embedded in the skill files.
Credentials
The skill requests no environment variables or credentials. The files and tools it uses (local PPTX/PDF/images/audio, whisper model binary, ffmpeg) are proportional to the stated functionality. There are no requests for API keys, cloud credentials, or unrelated secrets.
Persistence & Privilege
The always flag is false, and the skill does not request special persistent privileges or modify other skills or system configs. It runs local scripts and spawns ffmpeg as expected; autonomous invocation is allowed by platform default but is not combined with other red flags here.
Assessment
This skill appears to do what it says and operates on local files. Before installing or running it:

  1. Ensure your environment has the required tools (ffmpeg, ffprobe, pdftoppm/poppler, whisper-cpp), since the metadata does not declare them.
  2. Expect to download a Whisper model binary from Hugging Face; check the license, size, and URL before downloading.
  3. The scripts operate on files you provide and write outputs (CSV, ffconcat, MP4) to working directories; review output paths before running.
  4. If you need to run on sensitive audio, note that transcription happens locally if you use whisper-cpp; if you substitute a cloud ASR, review that service's privacy policy.
  5. Review the two bundled Python scripts (they are short and readable) if you want full assurance; they do not perform network calls or access unrelated credentials.


latest vk973y881d2a798frkx5yp7ymh582r7zy
207 downloads · 0 stars · 1 version · Updated 1 month ago
v0.1.0 · MIT-0

PPT Audio To Video

Use this skill when the source video has narration audio but no usable slide visuals, and the final deliverable should be a slide-based lecture video.

Resolve bundled scripts relative to this skill directory. If the runtime has already opened this SKILL.md, prefer paths like scripts/extract_slide_outline.py and scripts/render_from_timing_csv.py instead of machine-specific absolute paths.

Core workflow

  1. Inventory inputs.

    • Confirm which of these exist: audio-only mp4/m4a/mp3/wav, ppt/pptx, pdf, and any pre-rendered slide images.
    • Prefer an existing pdf or image directory for rendering. Treat pptx as the source of slide text and as a fallback for export.
  2. Prepare tools.

    • Required for deterministic steps: ffmpeg, ffprobe, pdftoppm.
    • Required for transcription: whisper-cli from whisper-cpp plus a multilingual model such as ggml-small.bin.
    • If only pptx exists and no pdf/images exist, prefer Keynote or PowerPoint export on macOS. Use soffice only as a fallback, because profile and rendering issues are common.
  3. Produce slide images.

    • If pdf exists, render it to images:
      pdftoppm -png -r 200 "$PDF" "$OUTDIR/slide"
      
    • If only pptx exists, export to pdf or slide images with Keynote or PowerPoint, then continue from pdf.
    • Keep slide filenames ordered and stable, such as slide-01.png, slide-02.png, ...
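pdftoppm pads page numbers only to the width of the page count, so a short deck can yield slide-1.png while a longer one yields slide-01.png. A small helper (hypothetical, not bundled with the skill) can normalize names so lexical order matches slide order:

```python
import re
from pathlib import Path

def normalize_slide_names(outdir, width=2):
    """Zero-pad numeric suffixes so lexical sort matches slide order,
    e.g. slide-1.png -> slide-01.png. Already-padded names are left alone."""
    for path in sorted(Path(outdir).glob("slide-*.png")):
        m = re.match(r"slide-(\d+)\.png$", path.name)
        if m:
            padded = f"slide-{int(m.group(1)):0{width}d}.png"
            if padded != path.name:
                path.rename(path.with_name(padded))
```

Run it once on the output directory before building the timing CSV, so slide numbers and filenames stay in lockstep.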
  4. Extract slide text.

    • Run:
      python3 scripts/extract_slide_outline.py \
        --pptx "$PPTX" \
        --out "$WORKDIR/slide_outline.csv"
      
    • Use the output to identify slide titles, distinctive keywords, and section changes.
  5. Extract clean audio for ASR.

    • For audio-only mp4, extract mono wav:
      ffmpeg -y -i "$AUDIO_MP4" -ar 16000 -ac 1 -c:a pcm_s16le "$WORKDIR/audio.wav"
      
    • If the source is already wav/mp3/m4a, convert to the same mono wav form if needed.
  6. Transcribe with whisper-cli.

    • Example:
      whisper-cli \
        -m "$MODEL" \
        -f "$WORKDIR/audio.wav" \
        -l zh \
        -ocsv -osrt -of "$WORKDIR/transcript"
      
    • Prefer transcript.csv for downstream parsing. transcript.srt is useful for manual review.
    • If GPU allocation fails on macOS, retry with -ng to force CPU mode.
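For downstream parsing, transcript.csv can be loaded with a sketch like this. It assumes whisper-cpp's -ocsv layout of start,end,text with millisecond timestamps; verify against the header of your build's output before relying on it:

```python
import csv

def load_segments(path):
    """Parse whisper-cpp -ocsv output into (start_sec, end_sec, text) tuples.
    Assumes columns start,end,text with times in milliseconds; skips the
    header row and any blank lines."""
    segments = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if not row or not row[0].strip().isdigit():
                continue  # header or blank line
            start_ms, end_ms = int(row[0]), int(row[1])
            text = row[2].strip() if len(row) > 2 else ""
            segments.append((start_ms / 1000.0, end_ms / 1000.0, text))
    return segments
```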
  7. Build slide_timings.csv.

    • Do not average slide durations unless the user explicitly asks for it.
    • Read the transcript and slide outline together, then create a monotonic timing plan by topic changes, section boundaries, and unique keywords.
    • Use this schema:
      slide,start_sec,end_sec,duration_sec,reason
      1,0.000,15.000,15.000,opening title and agenda
      2,15.000,100.000,85.000,architecture overview starts here
      
    • Keep slide numbers sequential and ensure duration_sec = end_sec - start_sec.
    • Validate that the last end_sec matches the audio duration or is within a small tolerance.
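The invariants above can be checked mechanically. A minimal validator sketch (not the bundled render script, just the checks this step describes):

```python
import csv

def validate_timings(path, audio_duration, tol=0.5):
    """Check slide_timings.csv: sequential slide numbers, monotonic
    non-overlapping times, duration_sec == end_sec - start_sec, and a
    final end_sec within `tol` seconds of the audio duration."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    prev_end = 0.0
    for i, row in enumerate(rows, start=1):
        start, end = float(row["start_sec"]), float(row["end_sec"])
        assert int(row["slide"]) == i, f"slide numbers must be sequential (row {i})"
        assert abs(start - prev_end) < 1e-6, f"gap or overlap before slide {i}"
        assert end > start, f"non-positive duration at slide {i}"
        assert abs(float(row["duration_sec"]) - (end - start)) < 1e-6, \
            f"duration mismatch at slide {i}"
        prev_end = end
    assert abs(prev_end - audio_duration) <= tol, \
        "last end_sec does not match the audio duration"
```

Run it after every manual edit to the CSV; a failed assertion points at the offending row before ffmpeg ever runs.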
  8. Render the final video.

    • Run:
      python3 scripts/render_from_timing_csv.py \
        --images "$SLIDE_IMAGES_DIR" \
        --timings "$WORKDIR/slide_timings.csv" \
        --audio "$WORKDIR/audio.wav" \
        --output "$OUT_VIDEO"
      
    • The script generates an ffconcat file, validates timing continuity, and calls ffmpeg to encode the final mp4.
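As a rough illustration of what the script's ffconcat step produces (a sketch of the playlist format, not the bundled script's actual code):

```python
def build_ffconcat(image_paths, durations):
    """Emit an ffconcat v1.0 playlist. The concat demuxer may ignore the
    duration of the final entry, so the last image is listed twice: a
    common workaround to make the final slide hold for its full duration."""
    lines = ["ffconcat version 1.0"]
    for path, dur in zip(image_paths, durations):
        lines.append(f"file '{path}'")
        lines.append(f"duration {dur:.3f}")
    if image_paths:
        lines.append(f"file '{image_paths[-1]}'")
    return "\n".join(lines) + "\n"
```

The resulting file is then muxed with the audio along the lines of ffmpeg -f concat -safe 0 -i slides.ffconcat -i audio.wav (exact encoder flags are up to the render script).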
  9. Verify and iterate.

    • Check output duration with ffprobe.
    • If a slide cuts too early or too late, edit only the affected rows in slide_timings.csv and rerun the render script.
    • Keep the transcript, outline, and timing CSV as reproducible working files.

Heuristics for timing alignment

  • Keep section-divider slides brief; they usually hold for 5-20 seconds.
  • Use the first segment that clearly switches topic as the next slide start.
  • Prefer exact topic transitions over title-word matching. ASR often distorts proper nouns and product names.
  • Let the model infer timings, but keep the render step deterministic through slide_timings.csv.
  • When confidence is low, produce a first-cut video and tell the user which slide boundaries likely need review.
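One way to turn these heuristics into a first-cut plan, sketched under the assumption that segments come from the transcript CSV and per-slide keywords from the slide outline (the function and its inputs are illustrative, not part of the bundled scripts):

```python
def propose_boundaries(segments, slide_keywords):
    """For each slide after the first, pick the start time of the first
    transcript segment mentioning one of its distinctive keywords.
    segments: list of (start_sec, end_sec, text);
    slide_keywords: one keyword list per slide, in slide order."""
    boundaries = [0.0]  # slide 1 always starts at zero
    cursor = 0
    for keywords in slide_keywords[1:]:
        hit = None
        for idx in range(cursor, len(segments)):
            start, _end, text = segments[idx]
            if any(k.lower() in text.lower() for k in keywords):
                hit, cursor = start, idx + 1
                break
        boundaries.append(hit)  # None => flag this boundary for manual review
    return boundaries
```

Scanning forward from a cursor keeps the plan monotonic, and a None result is exactly the "low confidence" case to surface to the user.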

Common commands

Install dependencies on macOS if missing:

brew install ffmpeg poppler whisper-cpp

Typical multilingual model download:

mkdir -p .models
curl -L 'https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.bin' -o .models/ggml-small.bin

Bundled scripts

  • scripts/extract_slide_outline.py Extract slide text from pptx into CSV or JSON for timing analysis.
  • scripts/render_from_timing_csv.py Validate a timing CSV, generate an ffconcat, and render the final video with ffmpeg.
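For reference, the core of a pptx text extractor can be sketched with only the standard library, since a pptx file is a zip archive whose slides live at ppt/slides/slideN.xml with text runs in a:t elements (the bundled script may be implemented differently; this just shows the idea):

```python
import re
import zipfile
import xml.etree.ElementTree as ET

A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"

def extract_slide_text(pptx_path):
    """Pull visible text runs (<a:t> elements) from each slide's XML.
    Returns {slide_number: [text_run, ...]} sorted by slide number."""
    slides = {}
    with zipfile.ZipFile(pptx_path) as z:
        for name in z.namelist():
            m = re.match(r"ppt/slides/slide(\d+)\.xml$", name)
            if m:
                root = ET.fromstring(z.read(name))
                runs = [t.text for t in root.iter(f"{A_NS}t") if t.text]
                slides[int(m.group(1))] = runs
    return dict(sorted(slides.items()))
```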
