# Local TTS Workflow

Version 1.0.0. OpenClaw text-to-speech workflow for an OpenAI-compatible TTS server, including remote/self-hosted deployments such as vLLM Omni.
Use this skill to debug the actual speech pipeline and to prepare text so the model reads it sanely.
Do not hardcode `127.0.0.1` blindly. Read the active OpenClaw config first and use the current `messages.tts.openai.baseUrl` as the source of truth.
Current known deployment in this workspace: `http://100.66.193.127:8000/v1`.
## Core rule: normalize numbers before synthesis
If text is meant to be spoken aloud, do not leave Arabic numerals in the final TTS input.
Convert them into words first.
Examples:
- Chinese output: write 一 二 三, not 123
- English output: write one two three, not 123
This rule matters because the TTS model can go weird or read digits badly when fed raw numerals.
When preparing spoken text, normalize:
- dates
- times
- counts
- version-like strings if they will be read aloud
- mixed Chinese/English numeric snippets
If preserving exact machine-readable formatting matters, keep one copy for display and a separate normalized copy for TTS.
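The display-copy/TTS-copy split above can be sketched as a small normalization pass. This is a minimal illustration only: it reads every digit run out loud, digit by digit, which is a safe default for IDs and version-like strings. A real pipeline would use a dedicated library (e.g. `num2words`) and handle dates, times, counts, and mixed Chinese/English text properly.

```python
import re

# Minimal sketch of pre-TTS number normalization (English output).
# Digit runs are spoken digit by digit; real pipelines should use a
# proper number-to-words library and locale-aware rules.

_ONES = ["zero", "one", "two", "three", "four",
         "five", "six", "seven", "eight", "nine"]

def digits_to_words(match: re.Match) -> str:
    # "123" -> "one two three"
    return " ".join(_ONES[int(d)] for d in match.group())

def normalize_for_tts(text: str) -> str:
    # Keep the original string for display; feed only this copy to TTS.
    return re.sub(r"\d+", digits_to_words, text)

print(normalize_for_tts("Retry 3 times on port 8000"))
# -> Retry three times on port eight zero zero zero
```

Keep the un-normalized string around for logs and UI; only the normalized copy should reach `/v1/audio/speech`.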
## Workflow
### 1. Verify the server before touching OpenClaw
Read `~/.openclaw/openclaw.json` first and extract:
- `messages.tts.provider`
- `messages.tts.openai.baseUrl`
- `messages.tts.openai.model`
- `messages.tts.openai.voice`
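Extracting those fields can be sketched as below. The key layout follows the list above; adjust the lookups if your `openclaw.json` differs.

```python
import json
from pathlib import Path

# Sketch: pull the active TTS settings out of the OpenClaw config so
# tests hit the *configured* endpoint instead of a hardcoded 127.0.0.1.

def extract_tts_settings(cfg: dict) -> dict:
    tts = cfg.get("messages", {}).get("tts", {})
    openai = tts.get("openai", {})
    return {
        "provider": tts.get("provider"),
        "baseUrl": openai.get("baseUrl"),
        "model": openai.get("model"),
        "voice": openai.get("voice"),
    }

cfg_path = Path("~/.openclaw/openclaw.json").expanduser()
if cfg_path.exists():
    print(extract_tts_settings(json.loads(cfg_path.read_text())))
```

Use the returned `baseUrl` as the target for every curl test that follows.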
Check the basics against the actual configured host:

```bash
curl http://100.66.193.127:8000/health
curl http://100.66.193.127:8000/v1/models
```
Confirm that the intended TTS model exists.
If the server is task-gated, ensure TTS is enabled:

```bash
MLX_AUDIO_SERVER_TASKS=tts uv run python server.py
```
### 2. Prove the raw TTS endpoint works
Always isolate the server from the client stack.
Minimal non-streaming test:

```bash
curl http://100.66.193.127:8000/v1/audio/speech \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "/models/lj-qwen3-tts/",
    "voice": "lj",
    "input": "你好,这是一次性返回完整音频的测试。",
    "format": "wav",
    "stream": false
  }' \
  --output sample.wav
```
Basic streaming test:

```bash
curl http://100.66.193.127:8000/v1/audio/speech \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "/models/lj-qwen3-tts/",
    "voice": "lj",
    "input": "你好,这是实时流式语音合成测试。",
    "format": "wav",
    "stream": true,
    "stream_format": "audio"
  }' \
  | ffplay -i -
```
If direct curl works but OpenClaw does not, the bug is probably in the TTS integration or provider selection layer, not the TTS backend.
### 3. Distinguish server failure from integration failure
Use this rule:
- Direct curl fails → fix the local TTS server first
- Direct curl works, but OpenClaw sounds wrong or falls back → inspect OpenClaw provider selection, fallback, and request shape
- OpenClaw sends requests but voice/mode is wrong → inspect fields like `model`, `voice`, `instructions`, `ref_audio`, `ref_text`, and streaming flags
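The decision rule above fits in a tiny triage helper. The function and its messages are illustrative only, not part of OpenClaw.

```python
# Illustrative triage for the rule above: given what was observed,
# name the layer to debug next. Hypothetical helper, not OpenClaw code.

def triage(direct_curl_ok: bool, openclaw_output_ok: bool) -> str:
    if not direct_curl_ok:
        return "fix the local TTS server first"
    if not openclaw_output_ok:
        return "inspect OpenClaw provider selection, fallback, and request shape"
    return "pipeline healthy; check model/voice/mode fields if output is odd"

print(triage(direct_curl_ok=True, openclaw_output_ok=False))
```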
### 4. Know the four TTS modes
Use the right request shape for the right model type.
#### Base speaker
Use built-in speaker playback.
Typical shape:
- model type: `base`
- no full `ref_audio` + `ref_text`
- `voice.id` means a built-in speaker name
#### Base clone
Use clone-style synthesis.
Typical shape:
- model type: `base`
- must provide both `ref_audio` and `ref_text`, or supply a consent voice identity that resolves to both

Hard rule: do not attempt clone with only `ref_audio`.
#### CustomVoice
Use a model with prebuilt custom speakers.
Typical shape:
- model type: `custom_voice`
- `voice` may be accepted either as a plain string or as `{"id": "..."}`, depending on the server
- for this workspace, `lj-qwen3-tts` (`/models/lj-qwen3-tts/`) must use speaker/voice `lj`
- do not send clone payloads
#### VoiceDesign
Use style-description-driven synthesis.
Typical shape:
- model type: `voice_design`
- must provide `instructions`
- do not send `voice`, `ref_audio`, or `ref_text`
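The four shapes above can be captured in a small request builder that refuses invalid combinations. Field names mirror the lists above, but the exact wire format is server-specific; verify against `references/tts-api.md` before relying on this sketch.

```python
# Sketch: build a /v1/audio/speech payload per mode and enforce the
# hard rules listed above. Mode labels and field names follow this doc.

def build_payload(mode: str, text: str, *, voice=None,
                  ref_audio=None, ref_text=None, instructions=None) -> dict:
    payload = {"input": text}
    if mode == "base_speaker":
        payload["voice"] = {"id": voice}          # built-in speaker name
    elif mode == "base_clone":
        if not (ref_audio and ref_text):
            raise ValueError("clone needs both ref_audio and ref_text")
        payload.update(ref_audio=ref_audio, ref_text=ref_text)
    elif mode == "custom_voice":
        payload["voice"] = voice                  # e.g. "lj" for lj-qwen3-tts
    elif mode == "voice_design":
        if not instructions:
            raise ValueError("voice_design requires instructions")
        payload["instructions"] = instructions    # no voice/ref_audio/ref_text
    else:
        raise ValueError(f"unknown mode: {mode}")
    return payload
```

A builder like this makes the "wrong request semantics" failures in section 9 impossible to send in the first place.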
### 5. Treat streaming as a real transport choice
This server supports real incremental generation, not fake post-hoc slicing.
Important behavior:
- if `stream` is omitted, the server defaults to streaming behavior
- `stream=false` forces a full non-streaming response
- `stream_format="audio"` returns playable audio bytes
- `stream_format="event"` returns SSE events with base64 chunks
If a client expects one mode and the server is returning the other, you will get confusing results fast.
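For the `stream_format="event"` case, chunks arrive as SSE `data:` lines carrying base64 audio. A minimal decoder might look like the following; the JSON field name (`"audio"`) is an assumption here, so check the real event schema in `references/tts-api.md`.

```python
import base64
import json

# Sketch: decode one SSE "data:" line carrying a base64 audio chunk.
# The "audio" field name is assumed; see references/tts-api.md for the
# actual event schema.

def decode_sse_line(line: str) -> bytes:
    if not line.startswith("data:"):
        return b""   # ignore comments, keepalives, other fields
    event = json.loads(line[len("data:"):].strip())
    return base64.b64decode(event["audio"])

sample = 'data: {"audio": "' + base64.b64encode(b"RIFF").decode() + '"}'
print(decode_sse_line(sample))  # -> b'RIFF'
```

Concatenating the decoded chunks (after any header handling the server defines) yields the playable stream.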
### 6. Use consent uploads properly
For consent-based clone flows, upload voice material through `/v1/audio/voice_consents`.
Always include `ref_text` with the recording; treat it as required even when a workflow tries to skip it.
If later synthesis depends on stored consent voices, verify that the saved identity actually maps to both:
- reference audio
- reference text
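That verification is a one-line check. The record shape below is an assumption based on the rules above, not the server's actual consent schema.

```python
# Sketch: confirm a stored consent voice identity resolves to both
# reference audio and reference text before attempting clone synthesis.
# The record layout is hypothetical; see references/tts-api.md.

def consent_ready_for_clone(record: dict) -> bool:
    return bool(record.get("ref_audio")) and bool(record.get("ref_text"))

print(consent_ready_for_clone({"ref_audio": "a.wav", "ref_text": "hello"}))
# -> True
```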
### 7. OpenClaw-specific debugging pattern
When OpenClaw TTS appears broken:
- Confirm `messages.tts` points at the actual configured endpoint in `openclaw.json`
- Confirm the intended model exists in `/v1/models` or is otherwise accepted by the server
- Confirm the selected provider is really the OpenAI-compatible path and not the Microsoft fallback
- Test direct `curl` with the same effective model/voice/mode assumptions
- Inspect whether OpenClaw is falling back to another provider
- If using `[[tts:...]]`, verify whether single-reply override keys (`model`, `voice`, maybe `provider`) are enabled and are being honored
- If needed, compare raw request shape with a dump proxy
If OpenClaw reaches the server successfully, the next question is usually: which mode did it actually request?
### 8. Preferred test ladder
Use this order:
1. `GET /health`
2. `GET /v1/models`
3. direct non-streaming TTS test
4. direct streaming TTS test
5. consent upload test if clone is involved
6. OpenAI client compatibility test if relevant
7. OpenClaw integration test
8. dump-proxy / log inspection only if still ambiguous
### 9. Common conclusions
#### Server good, integration bad
Typical signs:
- manual `curl` returns playable audio
- OpenClaw output sounds like a fallback voice or the wrong mode
- provider selection is inconsistent

Conclusion: fix the integration, not inference.
#### Text normalization bug
Typical signs:
- synthesis succeeds technically
- numbers are read awkwardly, skipped, or glitched
Conclusion: normalize the spoken text first. Do not blame the transport layer for a prompt-content problem.
#### Mode mismatch
Typical signs:
- clone request sent to a CustomVoice model
- VoiceDesign called without `instructions`
- only `ref_audio` present for Base clone

Conclusion: wrong request semantics for the chosen model type.
### 10. Use the reference doc when exact fields matter
Read `references/tts-api.md` when you need exact behavior for:
- `/v1/audio/speech`
- `/v1/audio/voice_consents`
- streaming vs non-streaming
- `stream_format="audio"` vs `stream_format="event"`
- mode selection and response headers
- consent storage semantics
- exact model/request mismatch errors
Do not assume generic OpenAI TTS docs fully match this local server.
## Resources
- `references/tts-api.md` — exact local API behavior, streaming semantics, mode rules, consent upload flow, and common error conditions