VibeVoice TTS
Local Spanish TTS using Microsoft VibeVoice. Generate natural voice audio from text, optimized for WhatsApp voice messages.
MIT-0 · Free to use, modify, and redistribute. No attribution required.
by Hoddix (@javier887)
Security Scan
OpenClaw
Benign
high confidence
Purpose & Capability
Name/description (local Spanish TTS using Microsoft VibeVoice) match the provided scripts and README: it expects a local VibeVoice repo, Python + torch, and ffmpeg to produce .ogg/.mp3/.wav audio.
Instruction Scope
SKILL.md and the scripts instruct the user to clone the official Microsoft VibeVoice repo and create a venv. The runtime Python snippet calls from_pretrained('microsoft/VibeVoice-Realtime-0.5B'), which will attempt to download the model weights from Hugging Face if they are not already present. This network activity and large download are not explicitly documented in SKILL.md. Otherwise the script stays within the TTS scope and only reads the provided text and local voice .pt files.
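To know in advance whether that from_pretrained call will hit the network, one option is to look for the model in the local Hugging Face cache. The sketch below assumes the default hub cache layout (`~/.cache/huggingface/hub`, overridable through `HF_HOME` or `HF_HUB_CACHE`) and that the script does not pass a custom cache directory; `hub_cache_dir` and `model_is_cached` are hypothetical helper names, not part of the skill.

```python
import os
from pathlib import Path


def hub_cache_dir(env=None):
    """Return the Hugging Face hub cache directory implied by the environment."""
    env = os.environ if env is None else env
    if "HF_HUB_CACHE" in env:
        return Path(env["HF_HUB_CACHE"]).expanduser()
    hf_home = env.get("HF_HOME", "~/.cache/huggingface")
    return Path(hf_home).expanduser() / "hub"


def model_is_cached(repo_id, env=None):
    """True if at least one snapshot of repo_id exists in the hub cache."""
    # The hub cache stores model repos as models--<org>--<name>/snapshots/<rev>/
    folder = "models--" + repo_id.replace("/", "--")
    snapshots = hub_cache_dir(env) / folder / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())


if __name__ == "__main__":
    repo = "microsoft/VibeVoice-Realtime-0.5B"
    if model_is_cached(repo):
        print(f"{repo} found in the local cache")
    else:
        print(f"{repo} not cached; the first run will probably download it")
```

A cached snapshot directory does not prove that every weight file is complete, so treat a positive result as a hint rather than a guarantee.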
Install Mechanism
There is no automated install spec; the manual install steps clone the official GitHub repo and pip-install dependencies. This is a low-risk, expected install pattern (no obscure URLs or archives). Note: pip installing torch/torchaudio can be heavyweight and may pull CUDA-specific packages depending on environment.
Credentials
The skill requests no credentials or special env vars. It uses optional env vars (VIBEVOICE_DIR, VIBEVOICE_VOICE, VIBEVOICE_SPEED) which are appropriate for configuration. No unrelated secrets or system paths are requested.
Persistence & Privilege
Skill does not request always:true and does not modify other skills or system-wide settings. It's instruction-only plus a CLI script that runs locally — no elevated persistence or privilege escalations are apparent.
Assessment
This skill is internally consistent with its stated purpose, but before installing consider: (1) The runtime will likely download large model weights from Hugging Face (microsoft/VibeVoice-Realtime-0.5B) unless you already have them locally, so expect heavy network use and large disk usage. (2) Installing torch/torchaudio can be large and may require CUDA/tooling matching your GPU; follow official install docs for your environment. (3) The skill runs local Python code which will execute on your machine, so only install from trusted sources and inspect the VibeVoice repo you clone. (4) No credentials are required, but ensure you have sufficient GPU/VRAM, disk space, and bandwidth. If you want to be stricter, clone and verify the upstream microsoft/VibeVoice repository yourself and run the script in an isolated environment (container or dedicated VM).
Like a lobster shell, security has layers: review code before you run it.
Current version: v1.0.0
Runtime requirements
🎙️ Clawdis
Bins: ffmpeg, python3
SKILL.md
VibeVoice TTS
Local text-to-speech using Microsoft's VibeVoice model. Generates natural Spanish voice audio, perfect for WhatsApp voice messages.
Quick Start
# Basic usage
{baseDir}/scripts/vv.sh "Hola, esto es una prueba" -o /tmp/audio.ogg
# From file
{baseDir}/scripts/vv.sh -f texto.txt -o /tmp/audio.ogg
# Different voice
{baseDir}/scripts/vv.sh "Texto" -v en-Wayne -o /tmp/audio.ogg
# Adjust speed (0.5-2.0)
{baseDir}/scripts/vv.sh "Texto" -s 1.2 -o /tmp/audio.ogg
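When the script is driven from Python rather than a shell, the same invocations can be assembled as an argument list. This is a minimal sketch that uses only the flags shown above (`-v`, `-s`, `-o`); `vv_command` and `synthesize` are hypothetical helpers, and `base_dir` stands in for the `{baseDir}` placeholder.

```python
import subprocess
from pathlib import Path


def vv_command(base_dir, text, output, voice=None, speed=None):
    """Build the argv list for scripts/vv.sh without running it."""
    cmd = [str(Path(base_dir) / "scripts" / "vv.sh"), text]
    if voice is not None:
        cmd += ["-v", voice]
    if speed is not None:
        cmd += ["-s", str(speed)]
    cmd += ["-o", str(output)]
    return cmd


def synthesize(base_dir, text, output, voice=None, speed=None):
    """Run vv.sh and raise CalledProcessError if it exits with an error."""
    subprocess.run(vv_command(base_dir, text, output, voice, speed), check=True)
```

Passing the text as a separate list element avoids shell quoting problems with accents, quotes, or exclamation marks in the message.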
Configuration
| Setting | Default | Description |
|---|---|---|
| Voice | sp-Spk1_man | Spanish male voice (slight Mexican accent) |
| Speed | 1.15 | 15% faster than normal |
| Format | .ogg | Opus codec for WhatsApp |
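The security scan mentions three optional environment variables (VIBEVOICE_DIR, VIBEVOICE_VOICE, VIBEVOICE_SPEED). How the script actually reads them is not shown, so the following is only a sketch of one plausible resolution order: environment variable first, then the defaults from the table. `resolve_settings` is a hypothetical name, and the `~/VibeVoice` fallback comes from the Requirements section.

```python
import os
from pathlib import Path

# Defaults copied from the Configuration table above.
DEFAULT_VOICE = "sp-Spk1_man"
DEFAULT_SPEED = 1.15


def resolve_settings(env=None):
    """Return settings from VIBEVOICE_* variables, falling back to defaults."""
    env = os.environ if env is None else env
    speed = float(env.get("VIBEVOICE_SPEED", DEFAULT_SPEED))
    # The Quick Start documents 0.5-2.0 as the accepted speed range.
    if not 0.5 <= speed <= 2.0:
        raise ValueError(f"speed must be between 0.5 and 2.0, got {speed}")
    return {
        "dir": Path(env.get("VIBEVOICE_DIR", "~/VibeVoice")).expanduser(),
        "voice": env.get("VIBEVOICE_VOICE", DEFAULT_VOICE),
        "speed": speed,
    }
```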
Available Voices
Spanish:
- sp-Spk1_man: Male, slight Mexican accent (default)
English:
- en-Wayne: Male
- en-Denise: Female
- Other voices in ~/VibeVoice/demo/voices/streaming_model/
Output Formats
- .ogg: Opus codec (WhatsApp compatible, recommended)
- .mp3: MP3 format
- .wav: Uncompressed WAV
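The mapping from output extension to codec can be sketched as an ffmpeg command builder. The encoder names (`libopus`, `libmp3lame`, `pcm_s16le`) are standard ffmpeg encoders, but whether vv.sh uses exactly these options is an assumption; `ffmpeg_command` is a hypothetical helper.

```python
from pathlib import Path

# Extension -> ffmpeg audio encoder (assumed; vv.sh may pass other options).
CODECS = {".ogg": "libopus", ".mp3": "libmp3lame", ".wav": "pcm_s16le"}


def ffmpeg_command(src, dst):
    """Build an ffmpeg argv list that converts src to the format implied by dst."""
    ext = Path(dst).suffix.lower()
    if ext not in CODECS:
        raise ValueError(f"unsupported output format: {ext or '(none)'}")
    # -y overwrites an existing output file without prompting.
    return ["ffmpeg", "-y", "-i", str(src), "-c:a", CODECS[ext], str(dst)]
```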
For WhatsApp
Always use .ogg format with asVoice=true in the message tool:
# Generate
{baseDir}/scripts/vv.sh "Tu mensaje aquí" -o /tmp/mensaje.ogg
# Send via message tool
message action=send channel=whatsapp to="+34XXXXXXXXX" filePath=/tmp/mensaje.ogg asVoice=true
Requirements
- GPU: NVIDIA with ~2GB VRAM
- VibeVoice: Installed at ~/VibeVoice
- ffmpeg: For audio conversion
- Python 3.10+: With torch, torchaudio
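A quick preflight check for the non-GPU requirements might look like the sketch below. It only uses the standard library, so it does not verify torch, torchaudio, or VRAM; `preflight` is a hypothetical helper and the default path is the one listed above.

```python
import shutil
import sys
from pathlib import Path


def preflight(vibevoice_dir="~/VibeVoice"):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10 or newer is required")
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    if not Path(vibevoice_dir).expanduser().is_dir():
        problems.append(f"VibeVoice directory not found: {vibevoice_dir}")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("missing:", problem)
```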
Performance
- RTF: ~0.24x (generates faster than realtime)
- 1 minute of audio ≈ 15 seconds to generate
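The two figures agree: the real-time factor (RTF) is generation time divided by audio duration, so 60 s of audio at RTF 0.24 takes about 14.4 s. A small worked example, with the ~10 s model load from the Notes added as an optional one-off cost (`estimate_seconds` is a hypothetical helper):

```python
def estimate_seconds(audio_seconds, rtf=0.24, model_load_seconds=0.0):
    """Estimate wall-clock generation time from audio duration and RTF."""
    return audio_seconds * rtf + model_load_seconds


if __name__ == "__main__":
    print(round(estimate_seconds(60), 1))                         # warm run
    print(round(estimate_seconds(60, model_load_seconds=10), 1))  # first run
```

Actual timings depend on the GPU, so these numbers are only a rough planning aid.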
Notes
- The first run loads the model (~10s); subsequent runs are faster
- Audio rule: Only send voice if the user requests it or sends a voice message themselves
- Keep text under 1500 chars for best quality
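For longer inputs, one way to respect the 1500-character guideline is to split the text at sentence boundaries and synthesize each chunk separately. This is a sketch of that idea, not something the skill does itself; `split_text` is a hypothetical helper, and sentences longer than the limit are simply cut at the limit.

```python
import re


def split_text(text, limit=1500):
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?…])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-cut into pieces.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to the script on its own, and the resulting audio files sent in order or concatenated with ffmpeg.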
