video-audio-replace

Replace video audio with TTS voice while preserving original timing. Includes subtitle generation from video using Whisper. Uses ElevenLabs or Edge TTS, aligning each TTS segment to the original subtitle timing.

MIT-0 · Free to use, modify, and redistribute. No attribution required.
0 · 247 · 1 current installs · 1 all-time installs
by marc@synthere
Security Scan
VirusTotal: Benign
OpenClaw: Benign (medium confidence)
Purpose & Capability
Name/description match the included scripts: generating subtitles (Whisper), creating TTS segments (ElevenLabs or Edge), aligning and replacing audio with ffmpeg. Required libraries (requests, faster-whisper, edge-tts) and ffmpeg usage are proportionate to the stated purpose.
Instruction Scope
Runtime instructions and scripts operate on local video/audio files, call ffmpeg/sox, and send text to the declared TTS APIs (api.elevenlabs.io / Edge TTS). They do not read unrelated system files or attempt broad environment discovery. However, SKILL.md and the code assume the presence of ELEVENLABS_API_KEY when using the ElevenLabs engine, while the registry metadata lists no required env vars; this is inconsistent.
Install Mechanism
This is an instruction-only skill with bundled Python scripts and no installer that downloads arbitrary code. _meta.json lists pip packages; all dependencies are standard public packages. No downloads from untrusted URLs or extracted archives are present.
Credentials
The only credential used is ELEVENLABS_API_KEY (optional if you use Edge TTS), which is appropriate for the ElevenLabs integration. However, the registry metadata reported no required env vars while the code clearly checks ELEVENLABS_API_KEY and will exit if ElevenLabs is selected; the metadata should be corrected. Also, the default ElevenLabs voice constant is a long alphanumeric string (likely a voice ID), which could be confusing; ensure it is not a misplaced secret.
Persistence & Privilege
The always flag is false; the skill does not request persistent presence or modify other skills or system-wide settings. It runs as a local tool operating on user-supplied files.
Assessment
This skill appears to do what it says: it uses Whisper for subtitles and ElevenLabs or Edge for TTS, then aligns and merges audio using ffmpeg. Before installing or running:

  1. If you plan to use ElevenLabs, set ELEVENLABS_API_KEY in your environment (or run with --engine edge to avoid sending text to an external API).
  2. Review and install the listed Python packages in a virtualenv; run in an isolated environment if you're concerned about dependencies.
  3. Verify the default ElevenLabs voice constant is not a leaked secret (it looks like a voice ID, but confirm).
  4. Be aware that using ElevenLabs will send your subtitle text to their API; do not upload sensitive content.
  5. If you maintain the skill, update its metadata so required env vars are declared accurately.


Current version: v1.0.0
latest: vk974hsesw6qsgwqhbfv4pqdys581yt0r

License

MIT-0
Free to use, modify, and redistribute. No attribution required.

SKILL.md

Video Audio Replace

Replace a video's original audio with TTS-generated voice while maintaining precise timing alignment. Also supports generating subtitles from video using Whisper.

Full Workflow

Step 1: Generate subtitles from video (optional)

If you don't have an SRT file, generate one from the video using the included script:

# Generate subtitles from video (uses faster-whisper, free, local)
python3 generate_subtitles.py video.mp4 -o subtitles.srt -l zh

Or manually with Python:

# Using faster-whisper (recommended, local, free)
pip install faster-whisper srt

python3 << 'EOF'
from datetime import timedelta

import srt
from faster_whisper import WhisperModel

# Load a small local model; "base" is fast on CPU, larger models are more accurate
model = WhisperModel("base", device="cpu", compute_type="int8")
segments, info = model.transcribe("input_video.mp4", language="zh")

# Build subtitle entries; the srt library handles timestamp formatting
subs = [
    srt.Subtitle(
        index=i,
        start=timedelta(seconds=seg.start),
        end=timedelta(seconds=seg.end),
        content=seg.text.strip(),
    )
    for i, seg in enumerate(segments, 1)
]

with open("subtitles.srt", "w", encoding="utf-8") as f:
    f.write(srt.compose(subs))
EOF

Step 2: Replace audio with TTS

Use the generated SRT to create a new video with TTS voice.

When to use

  • Dubbing videos with AI-generated voice
  • Converting subtitle files to voice-over
  • Creating multilingual video versions

Requirements

API Keys (choose one)

  • ElevenLabs: Set ELEVENLABS_API_KEY environment variable
  • Edge TTS (free, no key needed): Use --engine edge

System dependencies

  • ffmpeg
  • sox (optional, for advanced processing)
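Before running, you can confirm the system dependencies above are on PATH; the tool names match the list, but the check itself is a generic sketch, not part of the skill.

```python
import shutil

def check_deps(tools=("ffmpeg", "sox")) -> dict:
    """Map each tool name to its resolved path, or None if it is missing."""
    return {tool: shutil.which(tool) for tool in tools}

missing = [tool for tool, path in check_deps().items() if path is None]
if missing:
    print("Missing system dependencies:", ", ".join(missing))
```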

Usage

Basic usage (ElevenLabs)

video-audio-replace --video input.mp4 --srt subtitles.srt --output output.mp4 --voice "Liam"

Using Edge TTS (free, no API key)

video-audio-replace --video input.mp4 --srt subtitles.srt --output output.mp4 --engine edge --voice "zh-CN-YunxiNeural"

Options

Option         Description                    Default
--video        Input video file               Required
--srt          SRT subtitle file              Required
--output       Output video file              input_tts.mp4
--voice        Voice ID or name               Liam (ElevenLabs)
--engine       TTS engine: elevenlabs, edge   elevenlabs
--speed-range  Speed adjustment range         0.85-1.15
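The --speed-range value is given as "min-max". A parser for that format might look like the following; the helper name and validation are assumptions for illustration, not the skill's actual code.

```python
def parse_speed_range(spec: str = "0.85-1.15") -> tuple[float, float]:
    """Parse a "min-max" speed spec into a (lo, hi) float pair."""
    lo, hi = (float(part) for part in spec.split("-"))
    if not 0 < lo <= hi:
        raise ValueError(f"invalid speed range: {spec}")
    return lo, hi
```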

Examples

English voice (ElevenLabs)

video-audio-replace --video 2028.mp4 --srt 2028.srt --output 2028_final.mp4 --voice "Liam"

Chinese voice (Edge TTS)

video-audio-replace --video video.mp4 --srt subs.srt --output result.mp4 --engine edge --voice "zh-CN-YunxiNeural"

How it works

  1. Extract original audio from video
  2. Split audio into segments based on subtitle timestamps
  3. Generate TTS audio for each subtitle segment
  4. Adjust TTS speed (within 0.85-1.15x) to match original segment duration
  5. Add silence padding to fill any remaining time gap
  6. Merge all segments preserving original timing gaps
  7. Replace video audio with aligned TTS audio
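Steps 4 and 5 above reduce to a small calculation: clamp the tempo factor to the allowed range, then pad with silence if the adjusted audio is still shorter than its slot. The function name and exact clamping policy are illustrative assumptions, not the bundled implementation, which may behave differently at the edges.

```python
def fit_segment(tts_dur: float, slot_dur: float,
                lo: float = 0.85, hi: float = 1.15) -> tuple[float, float]:
    """Return (tempo_factor, silence_pad_seconds) to fit TTS audio into a slot.

    tempo_factor > 1 speeds the audio up (e.g. via ffmpeg's atempo filter);
    if the clip is still longer than the slot at the maximum speed,
    the pad is 0 and the segment will overrun slightly.
    """
    factor = min(max(tts_dur / slot_dur, lo), hi)
    new_dur = tts_dur / factor
    pad = max(0.0, slot_dur - new_dur)
    return factor, pad
```

For example, a 1.0 s TTS clip in a 2.0 s slot is slowed at the 0.85 floor and padded with roughly 0.82 s of silence.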

Available Voices

ElevenLabs (requires API key)

  • Liam - Energetic male (recommended)
  • Sarah - Professional female
  • Brian - Deep resonant male
  • Run curl with your API key to list all voices
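The "list all voices" call in the last bullet can also be made with Python's standard library. The endpoint (GET /v1/voices with an xi-api-key header) follows ElevenLabs' public API, but treat the details here as a sketch and check their documentation.

```python
import json
import urllib.request

API_URL = "https://api.elevenlabs.io/v1/voices"

def build_voices_request(api_key: str) -> urllib.request.Request:
    """Build a GET request for the ElevenLabs voices endpoint."""
    return urllib.request.Request(API_URL, headers={"xi-api-key": api_key})

def list_voices(api_key: str) -> list[tuple[str, str]]:
    """Fetch (voice_id, name) pairs; needs network access and a valid key."""
    with urllib.request.urlopen(build_voices_request(api_key)) as resp:
        return [(v["voice_id"], v["name"]) for v in json.load(resp)["voices"]]
```

Pass your ELEVENLABS_API_KEY value to list_voices(...) to see which voice names (such as "Liam") your account can use.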

Edge TTS (free)

  • Chinese: zh-CN-XiaoxiaoNeural, zh-CN-YunxiNeural, zh-CN-YunyangNeural
  • English: en-US-JennyNeural, en-US-GuyNeural
  • Many more languages are available; run edge-tts --list-voices for the full list

Files

5 total
