Install
openclaw skills install local-piper-tts-multilang-secure

Local (offline) text-to-speech via Piper TTS. Self-contained setup, automatic language detection, per-call voice selection. Extensible to any language.
Purpose: generate audio files (OGG/Opus by default) from text, fully offline. No sending is performed by the skill — sending is handled by the agent after the file is ready.
- setup() — installs Piper into an isolated venv, no system-wide changes
- Per-call voice selection via the voice parameter
- downloadVoices() — no models bundled, choose what you need; each voice is an .onnx model
- removeVoice() — clean up voices you no longer want

First-run flow

Follow this sequence exactly when the user asks to use TTS for the first time in a setup context.
const s = await status();
If s.stage is not-setup or no-piper: ask the user for confirmation, then call setup() and call status() again after setup completes.
If s.stage is no-model (Piper installed but no .onnx files):
3a. Offer English defaults: Explain that two English voices are available as defaults (~65 MB each):
- en_US-ryan-medium — male, American
- en_US-amy-medium — female, American

Ask which they want, or both: "Which English voice(s) should I download? Ryan (male), Amy (female), or both?"
3b. Ask about other languages: After the English choice, ask: "Do you need any other languages? For example German, French, Spanish, Polish, Italian, Portuguese, Russian… Just tell me and I'll check what's available."
If the user names a language, look up the available models at https://github.com/rhasspy/piper/blob/master/VOICES.md and list the options. Download whatever the user picks using the same downloadVoices() call.
3c. Download everything at once:
const result = await downloadVoices(['en_US-ryan-medium', 'en_US-amy-medium', /* + any others */]);
// result.downloaded — stems that downloaded successfully
// result.failed — [{stem, error}] for any that failed
Downloading voices requires internet access. Each voice takes ~1–2 min to download on a typical connection.
If any downloads fail, report the failed stems (from result.failed) to the user and offer to retry or pick a different voice.
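As a sketch, a small helper like the following (hypothetical, not part of the skill's API) could turn the result object from downloadVoices() into a short report for the user:

```javascript
// Hypothetical helper: format the { downloaded, failed } result object
// returned by downloadVoices() into a line-per-voice report.
function summarizeDownloads(result) {
  const lines = result.downloaded.map((stem) => `downloaded: ${stem}`);
  for (const { stem, error } of result.failed) {
    lines.push(`failed: ${stem} (${error})`);
  }
  return lines.join('\n');
}
```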
After downloading, generate a short audio sample for each downloaded voice and send it to the user.
For each voice, use a greeting in the voice's language:
"Hello, I'm [name]. How can I help you today?""Hallo, ich heiße [Name]. Wie kann ich Ihnen helfen?""Bonjour, je m'appelle [prénom]. Comment puis-je vous aider?""Hola, me llamo [nombre]. ¿Cómo puedo ayudarte?""Cześć, mam na imię [imię]. Jak mogę Ci pomóc?""Ciao, mi chiamo [nome]. Come posso aiutarti?""Olá, meu nome é [nome]. Como posso ajudar?""Привет, меня зовут [имя]. Чем могу помочь?"Replace [name] with the voice name (e.g. Ryan, Amy, Thorsten).
const sample = await tts({ text: 'Hello, I\'m Ryan. How can I help you today?', voice: 'en_US-ryan-medium' });
// send sample.path to the user as a voice message
Send all samples, then ask: "Which voice do you prefer? Or shall I download a different one?"
After the user picks a voice, ask: "How fast should I speak? Normal is 100%. Some options: 125% (faster), 115% (slightly faster), 100% (normal), 80% (slower) — or tell me a percentage."
Always present speed as a percentage to the user. Never mention lengthScale directly.
lengthScale is the internal duration multiplier — lower = faster. To convert: lengthScale = 1 / (speed% / 100).
Examples: 125% → lengthScale 0.8, 115% → ≈0.87, 100% → 1.0, 80% → 1.25.
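The conversion rule above can be sketched as a one-line helper (the function name is illustrative, not part of the skill's API):

```javascript
// Convert a user-facing speed percentage to Piper's lengthScale:
// lengthScale = 1 / (speed% / 100). Lower lengthScale = faster speech.
function speedToLengthScale(percent) {
  return 1 / (percent / 100);
}
```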
Generate a short sample at the chosen speed so the user can hear the difference:
const sample = await tts({ text: 'This is how I sound at this speed.', voice: 'chosen-voice', lengthScale: 0.8 });
// send sample.path to the user
Confirm with the user, then offer to save it permanently: "Should I save this as your default speed? It'll be used automatically every session."
If the user agrees:
await saveConfig({ lengthScale: 0.8 });
Once saved, tts() reads it from config.json in the skill directory automatically — no need to pass lengthScale on every call.
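Assuming saveConfig() simply persists the fields it is given (the exact file shape is an assumption, not confirmed here), the stored config.json would be a minimal JSON object:

```json
{
  "lengthScale": 0.8
}
```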
Once confirmed, remember both voice and lengthScale for the session. Pass them to every subsequent tts() call unless the user asks to change them.
Always call status() before the first tts() call in a session to determine what is needed.
| stage | Meaning | What to do |
|---|---|---|
| ready | Fully installed, at least one voice model present | Proceed with tts() |
| not-setup | Piper not installed | Ask user for confirmation, then call setup() |
| no-piper | Venv exists but piper binary missing | Ask user for confirmation, then call setup() |
| no-model | Piper installed but no voice model downloaded | Follow Steps 3–5 of first-run flow above |
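The table above can be sketched as a session-start gate. This is an illustrative helper, not part of the skill: status() and setup() are the documented calls, while confirm() stands in for however the agent asks the user for permission; dependencies are passed in so the sketch stays self-contained.

```javascript
// Gate the first tts() call of a session on status(), per the stage table.
async function ensureReady({ status, setup, confirm }) {
  const s = await status();
  if (s.stage === 'ready') return true;
  if (s.stage === 'not-setup' || s.stage === 'no-piper') {
    // Always ask before installing anything.
    if (!(await confirm('Install Piper now?'))) return false;
    await setup();
    return (await status()).stage === 'ready';
  }
  // stage === 'no-model': run the first-run voice download flow (Steps 3–5)
  return false;
}
```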
IMPORTANT: Always ask the user for confirmation before calling setup().
It installs the piper-tts package from PyPI into a venv inside the skill directory.
tts() accepts: text, optional format ("ogg" or "wav", default ogg), optional voice (model stem), optional lengthScale (speech speed, default 1.0).

To list installed voices, call listVoices() — returns stems of all installed .onnx models.
Never assume a fixed list; it varies per user and installation.
Auto-detection (no voice param):
The script detects language from the text using character and script analysis.
Auto-detection is best-effort. For reliable results with a specific language, always pass the voice parameter explicitly.
Explicit override: set PIPER_VOICE_MODEL env var to a full .onnx path (overrides everything).
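For intuition, detection of this kind can be sketched as a naive script-range check. This is illustrative only, not the skill's actual detector, and it covers just a few languages:

```javascript
// Naive language guess from character ranges (illustrative sketch only).
function guessLanguage(text) {
  if (/[\u0400-\u04FF]/.test(text)) return 'ru'; // Cyrillic block
  if (/[ąćęłńśźż]/u.test(text)) return 'pl';     // Polish diacritics
  if (/[äöüß]/u.test(text)) return 'de';         // German umlauts / eszett
  return 'en';                                    // fallback
}
```

This is exactly why the document calls auto-detection best-effort: Latin-script languages overlap heavily, so an explicit voice parameter is always more reliable.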
When the user requests a specific voice or language:
- Call listVoices() to see what is installed
- Pass voice to tts(), e.g. voice: "en_US-amy-medium"
- If the voice is missing, download it first with downloadVoices([stem])

To switch back to auto-detect, omit the voice parameter.
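The steps above can be sketched as one helper. It is illustrative, not part of the skill: listVoices(), downloadVoices(), and tts() are the documented calls, passed in here so the sketch stays self-contained.

```javascript
// Use the requested voice, downloading it first if it is not installed yet.
async function ttsWithVoice(text, stem, api) {
  const installed = await api.listVoices();
  if (!installed.includes(stem)) {
    const res = await api.downloadVoices([stem]);
    const failure = res.failed.find((f) => f.stem === stem);
    if (failure) throw new Error(`could not download ${stem}: ${failure.error}`);
  }
  return api.tts({ text, voice: stem });
}
```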
The user may say things like "I don't like this voice, use a female one" or "Download a German voice". When this happens:
- Look up a suitable model stem (e.g. de_DE-thorsten-medium) and call downloadVoices([stem])
- Verify with listVoices() — the new voice is immediately usable

The user may say "remove that voice" or "I don't need the German voice anymore". When this happens:
- Call listVoices() to confirm which voices are installed
- Call removeVoice(stem) — e.g. removeVoice('de_DE-thorsten-medium'); returns { removed, filesDeleted } on success

Never remove the last remaining voice without warning the user that TTS will stop working.
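A guard for that last rule can be sketched as follows. It is illustrative, not part of the skill: listVoices() and removeVoice() are the documented calls, warn() stands in for messaging the user, and all three are passed in so the sketch stays self-contained.

```javascript
// Refuse to silently remove the last remaining voice.
async function safeRemoveVoice(stem, { listVoices, removeVoice, warn }) {
  const voices = await listVoices();
  if (voices.length <= 1 && voices.includes(stem)) {
    await warn('Removing the last voice will disable TTS entirely.');
    return null; // wait for explicit user confirmation before removing
  }
  return removeVoice(stem); // { removed, filesDeleted } on success
}
```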
The user may say things like "speak faster", "too slow", or "speed it up". When this happens:
- Convert the requested speed: lengthScale = 1 / (speed% / 100)
- Apply it per call: await tts({ text: '...', voice: 'current-voice', lengthScale: 0.8 })
- Offer to persist it with saveConfig({ lengthScale: 0.8 })
- Keep using that lengthScale for all subsequent tts() calls in the session

Output location

Audio files are written to OPENCLAW_WORKSPACE/tts/ if the OPENCLAW_WORKSPACE env var is set, otherwise to ~/.openclaw/workspace/tts/.

Dependencies

- python3 (3.8+) — required for setup() to create the venv
- ffmpeg — for WAV → OGG/Opus conversion
- espeak-ng — system library used by Piper internally; setup() checks for it and warns if missing.
Install: sudo apt install espeak-ng (Debian/Ubuntu), sudo dnf install espeak-ng (Fedora),
brew install espeak (macOS).

Each voice is a .onnx + .onnx.json model pair stored in the skill directory.

Uninstall

rm -rf ~/.openclaw/skills/local-piper-tts-multilang-secure
This removes everything: skill code, venv, and all voice models.