Install
openclaw skills install local-tts-workflow
OpenClaw text-to-speech workflow for an OpenAI-compatible TTS server, including remote/self-hosted deployments such as vLLM Omni. Use when configuring, testing, debugging, or validating `/v1/audio/speech`, single-reply `[[tts:...]]` overrides, custom voice behavior, streaming vs non-streaming behavior, mode selection (base speaker, base clone, custom voice, voice design), local model-path fallback, and OpenClaw TTS integration. Also use when preparing text for speech output so numbers are normalized into spoken words instead of raw Arabic digits.
Use this skill to debug the actual speech pipeline and to prepare text so the model reads it sanely.
Do not hardcode 127.0.0.1 blindly. Read the active OpenClaw config first and use the current messages.tts.openai.baseUrl as the source of truth.
Current known deployment in this workspace: http://127.0.0.1:8000/v1.
Current local model-path fallback worth remembering: if the server did not pull a model by registry name, it may be loading directly from a local path such as ./models/qwen3-tts-0.6b-mlx.
When exact route shape matters, the local OpenAPI document is available at:
http://localhost:8000/openapi.json
Use this OpenAPI doc as a schema/reference source to compare this local mlx-audio server against OpenAI’s API. Do not treat it as a health check.
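As a rough sketch of using the OpenAPI document as a schema reference, the helper below lists the request-body fields and their defaults for `/v1/audio/speech`. It assumes the server publishes a standard OpenAPI 3 layout (a JSON request body, optionally behind a single local `$ref`); the function names and the `SpeechRequest` schema name in the usage note are hypothetical, not taken from the server.

```python
import json
import urllib.request


def fetch_openapi(url: str = "http://localhost:8000/openapi.json") -> dict:
    """Download and parse the OpenAPI document (a schema lookup, not a health check)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def speech_request_fields(doc: dict, path: str = "/v1/audio/speech") -> dict:
    """Return {field_name: default_or_None} for the POST JSON body of `path`.

    Assumes an OpenAPI 3 layout; adjust the lookups if the local schema differs.
    """
    schema = doc["paths"][path]["post"]["requestBody"]["content"]["application/json"]["schema"]
    # Follow one local "$ref" such as "#/components/schemas/<Name>".
    if "$ref" in schema:
        node = doc
        for part in schema["$ref"].lstrip("#/").split("/"):
            node = node[part]
        schema = node
    return {name: spec.get("default") for name, spec in schema.get("properties", {}).items()}
```

Calling `speech_request_fields(fetch_openapi())` should give a quick view of which names and defaults the local server declares, which can then be compared by hand against OpenAI's documented parameters.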
If text is meant to be spoken aloud, do not leave Arabic numerals in the final TTS input.
Convert them into words first.
Examples:
- 一 二 三, not 123
- one two three, not 123
This rule matters because the TTS model can go weird or read digits badly when fed raw numerals.
When preparing spoken text, normalize:
If preserving exact machine-readable formatting matters, keep one copy for display and a separate normalized copy for TTS.
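A minimal sketch of the digit rule, following the digit-by-digit style of the examples above ("one two three", "一 二 三"). The function name and the two-language table are illustrative assumptions. It only rewrites runs of ASCII digits and does not attempt cardinal readings, decimals, dates, or units.

```python
import re

# Digit-by-digit readings for the two languages used in the examples.
DIGIT_WORDS = {
    "en": ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"],
    "zh": ["零", "一", "二", "三", "四", "五", "六", "七", "八", "九"],
}


def normalize_digits(text: str, lang: str = "en") -> str:
    """Replace every run of ASCII digits with space-separated digit words."""
    words = DIGIT_WORDS[lang]

    def spell(match: re.Match) -> str:
        return " ".join(words[int(ch)] for ch in match.group())

    return re.sub(r"[0-9]+", spell, text)


# Keep the machine-readable copy for display and a separate copy for TTS.
display_text = "Gate 42 opens at 9"
spoken_text = normalize_digits(display_text)
```

Here `display_text` stays untouched for rendering, while `spoken_text` is what would go into the `input` field of the speech request.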
Read ~/.openclaw/openclaw.json first and extract:
- messages.tts.provider
- messages.tts.openai.baseUrl
- messages.tts.openai.model
- messages.tts.openai.voice
Check the basics against the actual configured host:
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models
Confirm that the intended TTS model exists.
If the model does not appear by pulled registry name, do not assume TTS is broken — this server may be loading a local-path model such as ./models/qwen3-tts-0.6b-mlx.
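The config-first checks above could be scripted roughly as follows. This is a sketch under stated assumptions: that `openclaw.json` parses as strict JSON, that `/health` lives at the server root while `/v1/models` lives under `baseUrl` (as in the curl commands above), and that `/v1/models` returns an OpenAI-style `{"data": [{"id": ...}]}` payload. All function names are hypothetical.

```python
import json
from pathlib import Path
from urllib.parse import urlsplit, urlunsplit


def load_config(path: Path = Path.home() / ".openclaw" / "openclaw.json") -> dict:
    """Read the OpenClaw config; assumes strict JSON (no comments)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def tts_settings(config: dict) -> dict:
    """Extract the four TTS keys that act as the source of truth."""
    tts = config["messages"]["tts"]
    openai = tts.get("openai", {})
    return {
        "provider": tts.get("provider"),
        "baseUrl": openai.get("baseUrl"),
        "model": openai.get("model"),
        "voice": openai.get("voice"),
    }


def probe_urls(base_url: str) -> dict:
    """Derive the health and model-list URLs from the configured baseUrl."""
    parts = urlsplit(base_url)
    root = urlunsplit((parts.scheme, parts.netloc, "", "", ""))
    return {"health": root + "/health", "models": base_url.rstrip("/") + "/models"}


def model_listed(models_response: dict, model: str) -> bool:
    """True if `model` appears as an id in an OpenAI-style model list."""
    return any(entry.get("id") == model for entry in models_response.get("data", []))
```

A `False` from `model_listed` is a prompt to check for a local-path model, not proof that TTS is broken.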
If the server is task-gated, ensure TTS is enabled:
MLX_AUDIO_SERVER_TASKS=tts uv run python server.py
Always isolate the server from the client stack.
Minimal non-streaming test:
curl http://127.0.0.1:8000/v1/audio/speech \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"model": "/models/lj-qwen3-tts/",
"voice": "lj",
"input": "你好,这是一次性返回完整音频的测试。",
"response_format": "wav",
"stream": false
}' \
--output sample.wav
Basic streaming test:
curl http://127.0.0.1:8000/v1/audio/speech \
-H 'Content-Type: application/json' \
-X POST \
-d '{
"model": "/models/lj-qwen3-tts/",
"voice": "lj",
"input": "你好,这是实时流式语音合成测试。",
"response_format": "wav",
"stream": true,
"streaming_interval": 2.0
}' \
| ffplay -i -
If direct curl works but OpenClaw does not, the bug is probably in the TTS integration or provider selection layer, not the TTS backend.
Use this rule:
- model, voice, instructions, ref_audio, ref_text, and streaming flags
Use the right request shape for the right model type.
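One way to apply the field comparison from the rule above is to diff a request that works under direct curl against the request OpenClaw effectively sent. The sketch below assumes both payloads have already been captured as dicts (for example from server logs or a proxy, which this document does not describe); `diff_requests` and `COMPARED_FIELDS` are illustrative names.

```python
# Fields named in the rule above; "streaming flags" is taken to mean
# the stream and streaming_interval request fields.
COMPARED_FIELDS = (
    "model", "voice", "instructions", "ref_audio", "ref_text",
    "stream", "streaming_interval",
)

ABSENT = "<absent>"  # marker for a field that is missing from a payload


def diff_requests(working: dict, failing: dict) -> dict:
    """Return {field: (working_value, failing_value)} for every field that differs."""
    diffs = {}
    for field in COMPARED_FIELDS:
        a = working.get(field, ABSENT)
        b = failing.get(field, ABSENT)
        if a != b:
            diffs[field] = (a, b)
    return diffs
```

An empty result suggests the two requests are equivalent on these fields, so the difference is more likely in transport or provider selection than in the payload.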
Use built-in speaker playback.
Typical shape:
- base
- ref_audio + ref_text
- voice.id means built-in speaker name
Use clone-style synthesis.
Typical shape:
- base
- ref_audio and ref_text, or supply a consent voice identity that resolves to both
Hard rule: do not attempt clone with only ref_audio.
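The hard rule can be enforced client-side before any request is sent. The following is a minimal sketch covering only the explicit `ref_audio` + `ref_text` route; the consent-identity route is not modeled, `build_clone_request` is a hypothetical helper, and `ref_audio` is treated as an opaque string because its expected encoding is not stated here.

```python
def build_clone_request(model: str, text: str, ref_audio: str, ref_text: str) -> dict:
    """Build a clone-style payload, refusing to proceed without both references."""
    if not ref_audio or not ref_text:
        raise ValueError("clone requests need both ref_audio and ref_text")
    return {
        "model": model,
        "input": text,
        "ref_audio": ref_audio,  # opaque to this helper; format is server-defined
        "ref_text": ref_text,
    }
```

Failing early like this keeps a missing `ref_text` from being mistaken for a server or model fault.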
Use a model with prebuilt custom speakers.
Typical shape:
- custom_voice
- voice may be accepted either as a plain string or as {"id":"..."} depending on the server
- lj-qwen3-tts / /models/lj-qwen3-tts/ must use speaker/voice lj
Use style-description-driven synthesis.
Typical shape:
- voice_design
- instructions
- voice, ref_audio, or ref_text
This server supports real incremental generation, not fake post-hoc slicing.
Important behavior:
- stream defaults to false
- response_format defaults to mp3
- streaming_interval defaults to 2.0
- required fields: model and input
- other accepted fields: instruct, voice, speed, gender, pitch, lang_code, ref_audio, ref_text, temperature, top_p, top_k, repetition_penalty, response_format, stream, streaming_interval, max_tokens, and verbose
Do not assume OpenAI parity on names or defaults; check the local OpenAPI schema first.
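To avoid relying on implicit defaults, a client can spell them out in every request. The sketch below bakes the defaults listed above into the payload and writes the response body to a file chunk by chunk; the helper names are hypothetical, and the chunked read makes no claim about how the server frames streamed audio.

```python
import json
import urllib.request

# Defaults as listed above; sending them explicitly makes requests self-describing.
DOCUMENTED_DEFAULTS = {"stream": False, "response_format": "mp3", "streaming_interval": 2.0}


def speech_payload(model: str, text: str, **overrides) -> dict:
    """Build a /v1/audio/speech body with explicit defaults, then apply overrides."""
    payload = {"model": model, "input": text, **DOCUMENTED_DEFAULTS}
    payload.update(overrides)
    return payload


def save_speech(url: str, payload: dict, out_path: str, chunk_size: int = 8192) -> None:
    """POST the payload and write the audio bytes to out_path as they arrive."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as out:
        while chunk := resp.read(chunk_size):
            out.write(chunk)
```

For example, `save_speech(base_url + "/audio/speech", speech_payload(model, text, stream=True, response_format="wav"), "sample.wav")` would mirror the streaming curl test, with `base_url` and `model` taken from the OpenClaw config.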
For consent-based clone flows, upload voice material through /v1/audio/voice_consents.
Use ref_text with the recording. That is not optional in spirit, even if a workflow tries to pretend otherwise.
If later synthesis depends on stored consent voices, verify that the saved identity actually maps to both ref_audio and ref_text.
When OpenClaw TTS appears broken:
- Confirm that messages.tts points at the actual configured endpoint in openclaw.json
- Confirm that the configured model appears in /v1/models or is otherwise accepted by the server; if not, check whether it is a local-path-backed deployment such as ./models/qwen3-tts-0.6b-mlx
- Reproduce the request with direct curl using the same effective model/voice/mode assumptions
- If the reply uses [[tts:...]], verify whether single-reply override keys (model, voice, maybe provider) are enabled and are being honored
If OpenClaw reaches the server successfully, the next question is usually: which mode did it actually request?
Use this order:
- GET /health
- GET /v1/models
Typical signs:
- curl returns playable audio
Conclusion: fix integration, not inference.
Typical signs:
Conclusion: normalize the spoken text first. Do not blame the transport layer for a prompt-content problem.
Typical signs:
- instructions
- ref_audio present for Base clone
Conclusion: wrong request semantics for the chosen model type.
Read references/tts-api.md when you need exact behavior for:
- /v1/audio/speech
- /v1/audio/voice_consents
- stream_format="audio" vs stream_format="event"
Do not assume generic OpenAI TTS docs fully match this local server.
references/tts-api.md — exact local API behavior, streaming semantics, mode rules, consent upload flow, and common error conditions