Install

openclaw skills install claw-qwen3-tts

High-quality text-to-speech using Qwen3-TTS: 10 built-in speakers with emotional instruct control, voice cloning (from 3 s of audio), natural-language voice design, 10+ languages, persistent named voices, and delivery of audio via Telegram/WhatsApp as native voice messages. Auto-detects GPU hardware (CUDA, ROCm, Intel XPU, CPU).

You have access to a powerful text-to-speech system that can generate human-quality speech with 10 built-in speakers, design new voices from descriptions, clone existing voices from audio samples, and send audio via Telegram/WhatsApp as native voice messages.
If the skill is not yet installed (no ~/clawd/skills/qwen3-tts directory), run:
bash <(curl -fsSL https://raw.githubusercontent.com/daMustermann/claw-qwen3-tts/main/install.sh)
Or if already cloned but not set up (no .venv/ directory):
bash ~/clawd/skills/qwen3-tts/install.sh
This auto-detects the GPU (CUDA, ROCm, Intel XPU, or CPU-only), creates a Python venv, and installs all dependencies. It takes 5–15 minutes on first run.
Before any TTS operation, ensure the server is running:
# Start (idempotent — won't restart if already running)
bash ~/clawd/skills/qwen3-tts/scripts/start_server.sh
# Check health
bash ~/clawd/skills/qwen3-tts/scripts/health_check.sh
# Stop (when done)
bash ~/clawd/skills/qwen3-tts/scripts/stop_server.sh
The server runs at http://localhost:8880.
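Since start_server.sh returns before the model is fully loaded, it can help to block until the health endpoint answers. A minimal sketch, assuming /health responds with HTTP 200 once the server is ready (as health_check.sh implies):

```shell
#!/usr/bin/env bash
# Poll the health endpoint until it answers, or give up after N tries.
wait_for_health() {
  local url="$1" tries="${2:-30}" i
  for ((i = 1; i <= tries; i++)); do
    # -f makes curl fail on HTTP errors, so only a healthy server succeeds
    if curl -fsS -o /dev/null "$url"; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Typical usage after starting the server:
# bash ~/clawd/skills/qwen3-tts/scripts/start_server.sh
# wait_for_health http://localhost:8880/health || echo "server failed to start" >&2
```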
| Model ID | Use Case | Notes |
|---|---|---|
| custom-voice-1.7b | High-quality TTS with built-in speakers (default) | Best quality, ~5 GB VRAM |
| custom-voice-0.6b | Fast TTS with built-in speakers | Lightweight, ~2 GB VRAM |
| voice-design | Design new voices from natural-language descriptions | Uses the VoiceDesign model |
| base-1.7b | Basic TTS (auto-corrected to custom-voice-1.7b) | Use custom-voice-* instead |
| base-0.6b | Basic TTS (auto-corrected to custom-voice-0.6b) | Use custom-voice-* instead |
Important: On the /v1/audio/speech endpoint, base-* and voice-design models are automatically corrected to the corresponding custom-voice-* model. Always prefer custom-voice-1.7b or custom-voice-0.6b for speech generation.
The custom-voice-* models include 10 built-in voices:
Chelsie · Ethan · Aidan · Serena · Ryan · Vivian · Claire · Lucas · Eleanor · Benjamin
You can discover speakers dynamically: curl http://localhost:8880/v1/speakers
When to use: User asks to speak text, read something aloud, generate audio, do a voiceover, narrate, or say something.
curl -X POST http://localhost:8880/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "custom-voice-1.7b",
"input": "TEXT_HERE",
"voice": "default",
"speaker": "Chelsie",
"language": "en",
"instruct": "",
"response_format": "wav"
}' \
--output ~/clawd/skills/qwen3-tts/output/speech.wav
Parameters:
| Parameter | Required | Default | Description |
|---|---|---|---|
| model | no | custom-voice-1.7b | TTS model to use |
| input | yes | — | The text to synthesize |
| voice | no | default | "default" for built-in speakers, or a saved voice name (e.g. "Angie") |
| speaker | no | Chelsie | Built-in speaker name (used only when voice is "default") |
| language | no | en | Language code: en, zh, ja, ko, de, fr, ru, pt, es, it |
| instruct | no | "" | Emotional/style instruction (see below) |
| response_format | no | wav | Output format: wav, mp3, ogg, flac |
| speed | no | 1.0 | Speech speed multiplier |
Language codes: en, zh, ja, ko, de, fr, ru, pt, es, it — or full names like English, Chinese, German, etc.
Instruct examples (controls tone, emotion, and style):
- "Speak happily and with excitement"
- "Whisper softly, as if telling a secret"
- "Read this in a calm, professional news anchor tone"
- "用愤怒的语气" (Speak angrily; instructions work in the target language too)
- "" (empty string = neutral default)

When voice is a saved name: If you pass "voice": "Angie" and a voice named "Angie" exists, the server uses voice cloning with the saved reference audio instead of a built-in speaker. The speaker field is ignored in this case.
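As a concrete sketch of a saved-voice request with an instruct: the voice name "Angie" is hypothetical, and jq is used only to build the JSON body so quoting inside input/instruct stays safe.

```shell
#!/usr/bin/env bash
# Build the request body with jq; $input and $instruct are escaped automatically.
payload=$(jq -n \
  --arg input "Meet me at the harbor at dawn." \
  --arg instruct "Whisper softly, as if telling a secret" \
  '{model: "custom-voice-1.7b", input: $input, voice: "Angie",
    language: "en", instruct: $instruct, response_format: "ogg"}')
echo "$payload"

# Then post it to the speech endpoint (requires the running server):
# curl -sS -X POST http://localhost:8880/v1/audio/speech \
#   -H "Content-Type: application/json" \
#   -d "$payload" \
#   --output ~/clawd/skills/qwen3-tts/output/secret.ogg
```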
When to use: User wants to create a custom voice, describe how a character should sound, design a persona's voice.
curl -X POST http://localhost:8880/v1/audio/voice-design \
-H "Content-Type: application/json" \
-d '{
"model": "voice-design",
"input": "TEXT_TO_SPEAK",
"voice_description": "DESCRIBE THE VOICE IN NATURAL LANGUAGE",
"language": "en",
"response_format": "wav"
}' \
--output ~/clawd/skills/qwen3-tts/output/designed.wav
Parameters:
| Parameter | Required | Default | Description |
|---|---|---|---|
| model | no | voice-design | Must be voice-design |
| input | yes | — | Text to synthesize with the designed voice |
| voice_description | yes | — | Natural-language description of the desired voice |
| language | no | en | Target language |
| response_format | no | wav | Output format |
Example descriptions:
- "A warm, deep male voice with a slight British accent, calm and authoritative, like a BBC presenter in his 40s"
- "A young, energetic female voice, bright and cheerful, with a slight rasp"
- "An old wizard with a slow, mysterious, gravelly voice"

The response includes an X-Voice-Id header; capture it to save the voice (see §4).
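Capturing that header works by dumping the response headers with curl's -D flag and parsing the file afterwards. A sketch (the helper is hypothetical; only the parsing part runs without the server):

```shell
#!/usr/bin/env bash
# Print the value of the X-Voice-Id header from a `curl -D` dump.
# Header names are matched case-insensitively; trailing \r is stripped.
extract_voice_id() {
  awk -F': ' 'tolower($1) == "x-voice-id" { sub(/\r$/, "", $2); print $2 }' "$1"
}

# Usage against the running server:
# hdrs=$(mktemp)
# curl -sS -D "$hdrs" -X POST http://localhost:8880/v1/audio/voice-design \
#   -H "Content-Type: application/json" \
#   -d '{"model": "voice-design", "input": "Hello there.",
#        "voice_description": "A warm, deep male voice", "language": "en"}' \
#   --output ~/clawd/skills/qwen3-tts/output/designed.wav
# voice_id=$(extract_voice_id "$hdrs")
```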
When to use: User provides a reference audio clip and wants to generate new speech in that voice.
curl -X POST http://localhost:8880/v1/audio/voice-clone \
-F "reference_audio=@/path/to/reference.wav" \
-F "reference_text=Transcript of the reference audio" \
-F "input=New text to speak in the cloned voice" \
-F "language=en" \
-F "response_format=wav" \
--output ~/clawd/skills/qwen3-tts/output/cloned.wav
Parameters:
| Parameter | Required | Default | Description |
|---|---|---|---|
| reference_audio | yes | — | Audio file to clone the voice from |
| input | yes | — | New text to synthesize in the cloned voice |
| reference_text | no | "" | Transcription of the reference audio (improves quality) |
| language | no | en | Target language |
| response_format | no | wav | Output format |
Guidelines:
- Providing the reference_text transcription significantly improves results
- If reference_text is empty, the server uses x-vector-only mode (audio features only)

The response includes an X-Voice-Id header; capture it to save the voice (see §4).
YOU MUST FOLLOW THESE RULES:
After EVERY voice-design or voice-clone request, ask the user:
"Would you like to save this voice for future use? What name should I give it?"
If the user says yes, capture the X-Voice-Id from the response headers and save it:
curl -X POST http://localhost:8880/v1/voices \
-H "Content-Type: application/json" \
-d '{
"name": "USER_CHOSEN_NAME",
"source_voice_id": "VOICE_ID_FROM_X_VOICE_ID_HEADER",
"description": "Description of the voice",
"tags": ["tag1", "tag2"],
"language": "en"
}'
When user requests TTS with a voice name (e.g. "say this with Angie"): pass "voice": "Angie" in the /v1/audio/speech request.

When user asks to list voices:
curl http://localhost:8880/v1/voices
Present the results as a formatted list with name, description, source, language, tags, and usage count. Voices are sorted by usage count (most used first).
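A jq one-liner can do that formatting. The response shape assumed here (a top-level "voices" array with name, language, description, and usage_count fields) is a guess based on the fields listed above; adjust the filter to whatever /v1/voices actually returns.

```shell
#!/usr/bin/env bash
# Format the voice list as one readable line per voice (assumed response shape).
format_voices() {
  jq -r '.voices[] | "\(.name) (\(.language)) - \(.description) [used \(.usage_count)x]"'
}

# curl -s http://localhost:8880/v1/voices | format_voices
```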
When user asks to delete a voice: Confirm with the user first, then:
curl -X DELETE http://localhost:8880/v1/voices/VOICE_NAME
When user asks to rename a voice:
curl -X PATCH http://localhost:8880/v1/voices/OLD_NAME \
-H "Content-Type: application/json" \
-d '{"name": "NEW_NAME"}'
When user asks to update a voice's metadata (description, tags, language):
curl -X PATCH http://localhost:8880/v1/voices/VOICE_NAME \
-H "Content-Type: application/json" \
-d '{"description": "Updated description", "tags": ["new", "tags"]}'
Voice names are case-insensitive but stored in the casing the user provided.
No duplicate names allowed. If a name already exists, the save will fail (409). Ask the user for a different name or offer to delete the existing one first.
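The duplicate-name case can be handled by inspecting the HTTP status code. A sketch using curl's -w '%{http_code}'; the name and voice id are placeholders, and the assumption is that success returns a 2xx status:

```shell
#!/usr/bin/env bash
# Save a designed/cloned voice; surface the duplicate-name (409) case explicitly.
save_voice() {
  local name="$1" voice_id="$2" status
  status=$(curl -s -o /dev/null -w '%{http_code}' \
    -X POST http://localhost:8880/v1/voices \
    -H "Content-Type: application/json" \
    -d "{\"name\": \"$name\", \"source_voice_id\": \"$voice_id\"}")
  case "$status" in
    2??) return 0 ;;  # saved successfully
    409) echo "A voice named '$name' already exists; choose another name or delete it first." >&2
         return 1 ;;
    *)   echo "save failed (HTTP $status)" >&2
         return 1 ;;
  esac
}

# save_voice "Captain Hook" "$voice_id"
```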
Voice profiles are stored locally in ~/clawd/skills/qwen3-tts/voices/ and persist across server restarts. Each voice consists of:
- <name>.json (metadata)
- <name>.pt (embedding tensor)
- <name>_sample.wav (reference audio sample, used for re-cloning)

When to use: User needs audio in a specific format, or you need to prepare audio for messaging.
curl -X POST http://localhost:8880/v1/audio/convert \
-F "audio=@input.wav" \
-F "target_format=mp3" \
--output output.mp3
Supported formats: wav, mp3, ogg (Opus), flac
You can also use the shell script directly:
bash ~/clawd/skills/qwen3-tts/scripts/convert_to_ogg_opus.sh input.wav output.ogg
When to use: User is interacting via Telegram, or explicitly asks to send audio to a Telegram chat.
curl -X POST http://localhost:8880/v1/audio/send/telegram \
-H "Content-Type: application/json" \
-d '{
"audio_file": "/path/to/audio.wav",
"chat_id": "CHAT_ID",
"bot_token": "BOT_TOKEN",
"caption": "Optional caption"
}'
- bot_token is optional if already configured in config.json
- Sent via Telegram's sendVoice API

When to use: User is interacting via WhatsApp, or explicitly asks to send audio there.
curl -X POST http://localhost:8880/v1/audio/send/whatsapp \
-H "Content-Type: application/json" \
-d '{
"audio_file": "/path/to/audio.wav",
"phone_number_id": "PHONE_ID",
"recipient": "+14155551234",
"access_token": "ACCESS_TOKEN"
}'
- phone_number_id and access_token are optional if already configured in config.json

Use these to dynamically discover available models and speakers:
# List all available TTS models
curl http://localhost:8880/v1/models
# List built-in speakers
curl http://localhost:8880/v1/speakers
# Server health check (device info, voice count, version)
curl http://localhost:8880/health
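To pull just the model IDs out of /v1/models, a short jq filter suffices. This assumes the OpenAI-style response shape ({"data": [{"id": ...}]}); verify against the actual server output.

```shell
#!/usr/bin/env bash
# Extract model IDs from an OpenAI-style /v1/models response (assumed shape).
list_model_ids() {
  jq -r '.data[].id'
}

# curl -s http://localhost:8880/v1/models | list_model_ids
```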
After generating speech:
After saving a voice: confirm to the user, e.g. "Saved! You can now use it with voice: Captain Hook."

After sending via Telegram/WhatsApp:
When choosing a speaker: If the user doesn't specify, default to "Chelsie". If they describe the kind of voice they want (but not a full voice-design request), pick the most fitting built-in speaker.
When choosing a model: Default to custom-voice-1.7b. Only use custom-voice-0.6b if the user asks for speed, or if the system has limited VRAM/memory.
The agent can update ~/clawd/skills/qwen3-tts/config.json to set:
- the default model: custom-voice-1.7b or custom-voice-0.6b

If config.json doesn't exist, copy the template:
cp ~/clawd/skills/qwen3-tts/config.json.template ~/clawd/skills/qwen3-tts/config.json
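To flip the default model from a script, jq can rewrite the file in place. The "default_model" key name is an assumption; check config.json.template for the real key names before relying on this.

```shell
#!/usr/bin/env bash
# Set the default model in a JSON config file (key name "default_model" assumed).
set_default_model() {
  local cfg="$1" model="$2" tmp
  tmp=$(mktemp)
  # jq rewrites to a temp file first so a failed run never truncates the config
  jq --arg m "$model" '.default_model = $m' "$cfg" > "$tmp" && mv "$tmp" "$cfg"
}

# set_default_model ~/clawd/skills/qwen3-tts/config.json custom-voice-0.6b
```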