Fully offline Qwen3 TTS for your agent

High-quality text-to-speech using Qwen3-TTS. 10 built-in speakers with emotional instruct control, voice cloning (3s of audio), natural-language voice design, 10+ languages, persistent named voices, and delivering audio via Telegram/WhatsApp as native voice messages. Auto-detects GPU hardware (CUDA, ROCm, Intel XPU, CPU).

Audits

Pending

ClawScanReview

Agentic behavior and permission review.

Static analysisPending

Pattern checks against bundled files.

VirusTotalPass

Multi-engine malware detections and file reputation.

Install

openclaw skills install claw-qwen3-tts

Qwen3-TTS Skill

You have access to a powerful text-to-speech system that can generate human-quality speech with 10 built-in speakers, design new voices from descriptions, clone existing voices from audio samples, and send audio via Telegram/WhatsApp as native voice messages.

First-Time Setup

If the skill is not yet installed (no ~/clawd/skills/qwen3-tts directory), run:

bash <(curl -fsSL https://raw.githubusercontent.com/daMustermann/claw-qwen3-tts/main/install.sh)

Or if already cloned but not set up (no .venv/ directory):

bash ~/clawd/skills/qwen3-tts/install.sh

This auto-detects the GPU (CUDA, ROCm, Intel XPU, or CPU-only), creates a Python venv, and installs all dependencies. It takes 5–15 minutes on first run.

Starting & Stopping the Server

Before any TTS operation, ensure the server is running:

# Start (idempotent — won't restart if already running)
bash ~/clawd/skills/qwen3-tts/scripts/start_server.sh

# Check health
bash ~/clawd/skills/qwen3-tts/scripts/health_check.sh

# Stop (when done)
bash ~/clawd/skills/qwen3-tts/scripts/stop_server.sh

The server runs at http://localhost:8880.

Available Models

Model ID	Use Case	Notes
`custom-voice-1.7b`	High-quality TTS with built-in speakers — default	Best quality, ~5 GB VRAM
`custom-voice-0.6b`	Fast TTS with built-in speakers	Lightweight, ~2 GB VRAM
`voice-design`	Design new voices from natural language descriptions	Uses VoiceDesign model
`base-1.7b`	Basic TTS (auto-corrected to `custom-voice-1.7b`)	Use `custom-voice-*` instead
`base-0.6b`	Basic TTS (auto-corrected to `custom-voice-0.6b`)	Use `custom-voice-*` instead

Important: On the /v1/audio/speech endpoint, base-* and voice-design models are automatically corrected to the corresponding custom-voice-* model. Always prefer custom-voice-1.7b or custom-voice-0.6b for speech generation.

Built-in Speakers

The custom-voice-* models include 10 built-in voices:

Chelsie · Ethan · Aidan · Serena · Ryan · Vivian · Claire · Lucas · Eleanor · Benjamin

You can discover speakers dynamically: curl http://localhost:8880/v1/speakers

Capabilities

1. Generate Speech from Text

When to use: User asks to speak text, read something aloud, generate audio, do a voiceover, narrate, or say something.

curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "custom-voice-1.7b",
    "input": "TEXT_HERE",
    "voice": "default",
    "speaker": "Chelsie",
    "language": "en",
    "instruct": "",
    "response_format": "wav"
  }' \
  --output ~/clawd/skills/qwen3-tts/output/speech.wav

Parameters:

Parameter	Required	Default	Description
`model`	no	`custom-voice-1.7b`	TTS model to use
`input`	yes	—	The text to synthesize
`voice`	no	`default`	`"default"` for built-in speakers, or a saved voice name (e.g. `"Angie"`)
`speaker`	no	`Chelsie`	Built-in speaker name (only when `voice` is `"default"`)
`language`	no	`en`	Language code: en, zh, ja, ko, de, fr, ru, pt, es, it
`instruct`	no	`""`	Emotional/style instruction (see below)
`response_format`	no	`wav`	Output format: wav, mp3, ogg, flac
`speed`	no	`1.0`	Speech speed multiplier

Language codes: en, zh, ja, ko, de, fr, ru, pt, es, it — or full names like English, Chinese, German, etc.

Instruct examples (controls tone, emotion, and style):

"Speak happily and with excitement"
"Whisper softly, as if telling a secret"
"Read this in a calm, professional news anchor tone"
"用愤怒的语气" (Speak angrily — works in target language too)
"" (empty string = neutral default)

When voice is a saved name: If you pass "voice": "Angie" and a voice named "Angie" exists, the server uses voice cloning with the saved reference audio instead of a built-in speaker. The speaker field is ignored in this case.

2. Design a New Voice

When to use: User wants to create a custom voice, describe how a character should sound, design a persona's voice.

curl -X POST http://localhost:8880/v1/audio/voice-design \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voice-design",
    "input": "TEXT_TO_SPEAK",
    "voice_description": "DESCRIBE THE VOICE IN NATURAL LANGUAGE",
    "language": "en",
    "response_format": "wav"
  }' \
  --output ~/clawd/skills/qwen3-tts/output/designed.wav

Parameters:

Parameter	Required	Default	Description
`model`	no	`voice-design`	Must be `voice-design`
`input`	yes	—	Text to synthesize with the designed voice
`voice_description`	yes	—	Natural language description of the desired voice
`language`	no	`en`	Target language
`response_format`	no	`wav`	Output format

Example descriptions:

"A warm, deep male voice with a slight British accent, calm and authoritative, like a BBC presenter in his 40s"
"A young, energetic female voice, bright and cheerful, with a slight rasp"
"An old wizard with a slow, mysterious, gravelly voice"

The response includes a X-Voice-Id header — capture it to save the voice (see §4).

3. Clone a Voice

When to use: User provides a reference audio clip and wants to generate new speech in that voice.

curl -X POST http://localhost:8880/v1/audio/voice-clone \
  -F "reference_audio=@/path/to/reference.wav" \
  -F "reference_text=Transcript of the reference audio" \
  -F "input=New text to speak in the cloned voice" \
  -F "language=en" \
  -F "response_format=wav" \
  --output ~/clawd/skills/qwen3-tts/output/cloned.wav

Parameters:

Parameter	Required	Default	Description
`reference_audio`	yes	—	Audio file to clone the voice from
`input`	yes	—	New text to synthesize in the cloned voice
`reference_text`	no	`""`	Transcription of the reference audio (improves quality)
`language`	no	`en`	Target language
`response_format`	no	`wav`	Output format

Guidelines:

Minimum 3 seconds of reference audio
Recommended 10–30 seconds for best quality
Providing an accurate reference_text transcription significantly improves results
Supports cross-language cloning (clone from English → speak in Japanese)
If reference_text is empty, uses x-vector-only mode (audio features only)

The response includes a X-Voice-Id header — capture it to save the voice (see §4).

4. ⭐ CRITICAL: Voice Save Prompting Rules

YOU MUST FOLLOW THESE RULES:

After EVERY voice-design or voice-clone request, ask the user:

"Would you like to save this voice for future use? What name should I give it?"

If the user says yes, capture the X-Voice-Id from the response headers and save it:

curl -X POST http://localhost:8880/v1/voices \
  -H "Content-Type: application/json" \
  -d '{
    "name": "USER_CHOSEN_NAME",
    "source_voice_id": "VOICE_ID_FROM_X_VOICE_ID_HEADER",
    "description": "Description of the voice",
    "tags": ["tag1", "tag2"],
    "language": "en"
  }'

When user requests TTS with a voice name (e.g. "say this with Angie"):
- Use "voice": "Angie" in the /v1/audio/speech request
- The server automatically loads the saved reference audio and uses voice cloning
- If the name doesn't exist, tell the user and offer to design or clone one
When user asks to list voices:
```
curl http://localhost:8880/v1/voices
```
Present the results as a formatted list with name, description, source, language, tags, and usage count. Voices are sorted by usage count (most used first).
When user asks to delete a voice: Confirm with the user first, then:
```
curl -X DELETE http://localhost:8880/v1/voices/VOICE_NAME
```

When user asks to rename a voice:

curl -X PATCH http://localhost:8880/v1/voices/OLD_NAME \
  -H "Content-Type: application/json" \
  -d '{"name": "NEW_NAME"}'

When user asks to update a voice's metadata (description, tags, language):

curl -X PATCH http://localhost:8880/v1/voices/VOICE_NAME \
  -H "Content-Type: application/json" \
  -d '{"description": "Updated description", "tags": ["new", "tags"]}'

Voice names are case-insensitive but stored in the casing the user provided.
No duplicate names allowed. If a name already exists, the save will fail (409). Ask the user for a different name or offer to delete the existing one first.
Voice profiles are stored locally in ~/clawd/skills/qwen3-tts/voices/ and persist across server restarts. Each voice consists of:
- <name>.json — metadata
- <name>.pt — embedding tensor
- <name>_sample.wav — reference audio sample (used for re-cloning)

5. Convert Audio Formats

When to use: User needs audio in a specific format, or you need to prepare audio for messaging.

curl -X POST http://localhost:8880/v1/audio/convert \
  -F "audio=@input.wav" \
  -F "target_format=mp3" \
  --output output.mp3

Supported formats: wav, mp3, ogg (Opus), flac

You can also use the shell script directly:

bash ~/clawd/skills/qwen3-tts/scripts/convert_to_ogg_opus.sh input.wav output.ogg

6. Send via Telegram (PTT Voice Message)

When to use: User is interacting via Telegram, or explicitly asks to send audio to a Telegram chat.

curl -X POST http://localhost:8880/v1/audio/send/telegram \
  -H "Content-Type: application/json" \
  -d '{
    "audio_file": "/path/to/audio.wav",
    "chat_id": "CHAT_ID",
    "bot_token": "BOT_TOKEN",
    "caption": "Optional caption"
  }'

bot_token is optional if already configured in config.json
Audio is auto-converted to OGG/Opus and sent via Telegram's sendVoice API
Displays as a native PTT waveform voice message in the chat

7. Send via WhatsApp (PTT Voice Message)

When to use: User is interacting via WhatsApp, or explicitly asks to send audio there.

curl -X POST http://localhost:8880/v1/audio/send/whatsapp \
  -H "Content-Type: application/json" \
  -d '{
    "audio_file": "/path/to/audio.wav",
    "phone_number_id": "PHONE_ID",
    "recipient": "+14155551234",
    "access_token": "ACCESS_TOKEN"
  }'

phone_number_id and access_token are optional if already configured in config.json
Audio is auto-converted to OGG/Opus and sent as a native WhatsApp voice message

8. Discovery Endpoints

Use these to dynamically discover available models and speakers:

# List all available TTS models
curl http://localhost:8880/v1/models

# List built-in speakers
curl http://localhost:8880/v1/speakers

# Server health check (device info, voice count, version)
curl http://localhost:8880/health

How to Respond

After generating speech:

Tell the user the audio has been generated
Provide the output file path
If it was voice-design or voice-clone, always ask to save the voice (Rule §4.1)
If the user is on Telegram/WhatsApp, offer to send it as a voice message

After saving a voice:

Confirm the name and tell the user they can use it anytime with that name
Example: "Voice saved as 'Captain Hook'! You can reference it anytime with voice: Captain Hook."

After sending via Telegram/WhatsApp:

Confirm successful delivery

When choosing a speaker: If the user doesn't specify, default to "Chelsie". If they describe the kind of voice they want (but not a full voice-design request), pick the most fitting built-in speaker.

When choosing a model: Default to custom-voice-1.7b. Only use custom-voice-0.6b if the user asks for speed, or if the system has limited VRAM/memory.

Configuration

The agent can update ~/clawd/skills/qwen3-tts/config.json to set:

Telegram: bot token and default chat ID
WhatsApp: phone number ID and access token
Default model: custom-voice-1.7b or custom-voice-0.6b
Default audio format: wav, mp3, ogg, flac
Device override: auto, cuda:0, xpu:0, cpu

If config.json doesn't exist, copy the template:

cp ~/clawd/skills/qwen3-tts/config.json.template ~/clawd/skills/qwen3-tts/config.json