{"skill":{"slug":"claw-qwen3-tts","displayName":"Fully offline Qwen3 TTS for your agent","summary":"High-quality text-to-speech using Qwen3-TTS. 10 built-in speakers with emotional instruct control, voice cloning (3s of audio), natural-language voice design...","description":"---\nname: qwen3-tts\ndescription: >\n  High-quality text-to-speech using Qwen3-TTS. 10 built-in speakers with\n  emotional instruct control, voice cloning (3s of audio), natural-language\n  voice design, 10+ languages, persistent named voices, and delivering audio\n  via Telegram/WhatsApp as native voice messages. Auto-detects GPU hardware\n  (CUDA, ROCm, Intel XPU, CPU).\nversion: \"1.0\"\nauthor: daMustermann\nrepository: https://github.com/daMustermann/claw-qwen3-tts\nlicense: MIT\nrequires:\n  - python>=3.10\n  - ffmpeg\n  - sox\n  - git\ntags:\n  - tts\n  - audio\n  - voice\n  - speech\n  - voice-cloning\n  - voice-design\n  - telegram\n  - whatsapp\n  - clawhub\n---\n\n# Qwen3-TTS Skill\n\nYou have access to a powerful text-to-speech system that can generate human-quality speech with 10 built-in speakers, design new voices from descriptions, clone existing voices from audio samples, and send audio via Telegram/WhatsApp as native voice messages.\n\n## First-Time Setup\n\nIf the skill is not yet installed (no `~/clawd/skills/qwen3-tts` directory), run:\n\n```bash\nbash <(curl -fsSL https://raw.githubusercontent.com/daMustermann/claw-qwen3-tts/main/install.sh)\n```\n\nOr if already cloned but not set up (no `.venv/` directory):\n\n```bash\nbash ~/clawd/skills/qwen3-tts/install.sh\n```\n\nThis auto-detects the GPU (CUDA, ROCm, Intel XPU, or CPU-only), creates a Python venv, and installs all dependencies. 
It takes 5–15 minutes on first run.\n\n## Starting & Stopping the Server\n\nBefore any TTS operation, ensure the server is running:\n\n```bash\n# Start (idempotent — won't restart if already running)\nbash ~/clawd/skills/qwen3-tts/scripts/start_server.sh\n\n# Check health\nbash ~/clawd/skills/qwen3-tts/scripts/health_check.sh\n\n# Stop (when done)\nbash ~/clawd/skills/qwen3-tts/scripts/stop_server.sh\n```\n\nThe server runs at `http://localhost:8880`.\n\n---\n\n## Available Models\n\n| Model ID | Use Case | Notes |\n|----------|----------|-------|\n| `custom-voice-1.7b` | High-quality TTS with built-in speakers — **default** | Best quality, ~5 GB VRAM |\n| `custom-voice-0.6b` | Fast TTS with built-in speakers | Lightweight, ~2 GB VRAM |\n| `voice-design` | Design new voices from natural language descriptions | Uses VoiceDesign model |\n| `base-1.7b` | Basic TTS (auto-corrected to `custom-voice-1.7b`) | Use `custom-voice-*` instead |\n| `base-0.6b` | Basic TTS (auto-corrected to `custom-voice-0.6b`) | Use `custom-voice-*` instead |\n\n> **Important:** On the `/v1/audio/speech` endpoint, `base-*` and `voice-design` models are automatically corrected to the corresponding `custom-voice-*` model. Always prefer `custom-voice-1.7b` or `custom-voice-0.6b` for speech generation.\n\n## Built-in Speakers\n\nThe `custom-voice-*` models include 10 built-in voices:\n\n> **Chelsie** · **Ethan** · **Aidan** · **Serena** · **Ryan** · **Vivian** · **Claire** · **Lucas** · **Eleanor** · **Benjamin**\n\nYou can discover speakers dynamically: `curl http://localhost:8880/v1/speakers`\n\n---\n\n## Capabilities\n\n### 1. 
Generate Speech from Text\n\n**When to use:** User asks to speak text, read something aloud, generate audio, do a voiceover, narrate, or say something.\n\n```bash\ncurl -X POST http://localhost:8880/v1/audio/speech \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"custom-voice-1.7b\",\n    \"input\": \"TEXT_HERE\",\n    \"voice\": \"default\",\n    \"speaker\": \"Chelsie\",\n    \"language\": \"en\",\n    \"instruct\": \"\",\n    \"response_format\": \"wav\"\n  }' \\\n  --output ~/clawd/skills/qwen3-tts/output/speech.wav\n```\n\n**Parameters:**\n\n| Parameter | Required | Default | Description |\n|-----------|----------|---------|-------------|\n| `model` | no | `custom-voice-1.7b` | TTS model to use |\n| `input` | **yes** | — | The text to synthesize |\n| `voice` | no | `default` | `\"default\"` for built-in speakers, or a **saved voice name** (e.g. `\"Angie\"`) |\n| `speaker` | no | `Chelsie` | Built-in speaker name (only when `voice` is `\"default\"`) |\n| `language` | no | `en` | Language code: en, zh, ja, ko, de, fr, ru, pt, es, it |\n| `instruct` | no | `\"\"` | Emotional/style instruction (see below) |\n| `response_format` | no | `wav` | Output format: wav, mp3, ogg, flac |\n| `speed` | no | `1.0` | Speech speed multiplier |\n\n**Language codes:** `en`, `zh`, `ja`, `ko`, `de`, `fr`, `ru`, `pt`, `es`, `it` — or full names like `English`, `Chinese`, `German`, etc.\n\n**Instruct examples** (controls tone, emotion, and style):\n- `\"Speak happily and with excitement\"`\n- `\"Whisper softly, as if telling a secret\"`\n- `\"Read this in a calm, professional news anchor tone\"`\n- `\"用愤怒的语气\"` (Speak angrily — works in target language too)\n- `\"\"` (empty string = neutral default)\n\n**When voice is a saved name:** If you pass `\"voice\": \"Angie\"` and a voice named \"Angie\" exists, the server uses voice cloning with the saved reference audio instead of a built-in speaker. The `speaker` field is ignored in this case.\n\n### 2. 
Design a New Voice\n\n**When to use:** User wants to create a custom voice, describe how a character should sound, or design a persona's voice.\n\n```bash\ncurl -X POST http://localhost:8880/v1/audio/voice-design \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"voice-design\",\n    \"input\": \"TEXT_TO_SPEAK\",\n    \"voice_description\": \"DESCRIBE THE VOICE IN NATURAL LANGUAGE\",\n    \"language\": \"en\",\n    \"response_format\": \"wav\"\n  }' \\\n  --output ~/clawd/skills/qwen3-tts/output/designed.wav\n```\n\n**Parameters:**\n\n| Parameter | Required | Default | Description |\n|-----------|----------|---------|-------------|\n| `model` | no | `voice-design` | Must be `voice-design` |\n| `input` | **yes** | — | Text to synthesize with the designed voice |\n| `voice_description` | **yes** | — | Natural language description of the desired voice |\n| `language` | no | `en` | Target language |\n| `response_format` | no | `wav` | Output format |\n\n**Example descriptions:**\n- `\"A warm, deep male voice with a slight British accent, calm and authoritative, like a BBC presenter in his 40s\"`\n- `\"A young, energetic female voice, bright and cheerful, with a slight rasp\"`\n- `\"An old wizard with a slow, mysterious, gravelly voice\"`\n\nThe response includes an `X-Voice-Id` header; capture it to save the voice (see §4).\n\n### 3. 
Clone a Voice\n\n**When to use:** User provides a reference audio clip and wants to generate new speech in that voice.\n\n```bash\ncurl -X POST http://localhost:8880/v1/audio/voice-clone \\\n  -F \"reference_audio=@/path/to/reference.wav\" \\\n  -F \"reference_text=Transcript of the reference audio\" \\\n  -F \"input=New text to speak in the cloned voice\" \\\n  -F \"language=en\" \\\n  -F \"response_format=wav\" \\\n  --output ~/clawd/skills/qwen3-tts/output/cloned.wav\n```\n\n**Parameters:**\n\n| Parameter | Required | Default | Description |\n|-----------|----------|---------|-------------|\n| `reference_audio` | **yes** | — | Audio file to clone the voice from |\n| `input` | **yes** | — | New text to synthesize in the cloned voice |\n| `reference_text` | no | `\"\"` | Transcription of the reference audio (improves quality) |\n| `language` | no | `en` | Target language |\n| `response_format` | no | `wav` | Output format |\n\n**Guidelines:**\n- Minimum **3 seconds** of reference audio\n- Recommended **10–30 seconds** for best quality\n- Providing an accurate `reference_text` transcription significantly improves results\n- Supports **cross-language cloning** (clone from English → speak in Japanese)\n- If `reference_text` is empty, the server uses x-vector-only mode (audio features only)\n\nThe response includes an `X-Voice-Id` header; capture it to save the voice (see §4).\n\n### 4. ⭐ CRITICAL: Voice Save Prompting Rules\n\n**YOU MUST FOLLOW THESE RULES:**\n\n1. **After EVERY voice-design or voice-clone request**, ask the user:\n   > \"Would you like to save this voice for future use? What name should I give it?\"\n\n2. 
**If the user says yes**, capture the `X-Voice-Id` from the response headers and save it:\n   ```bash\n   curl -X POST http://localhost:8880/v1/voices \\\n     -H \"Content-Type: application/json\" \\\n     -d '{\n       \"name\": \"USER_CHOSEN_NAME\",\n       \"source_voice_id\": \"VOICE_ID_FROM_X_VOICE_ID_HEADER\",\n       \"description\": \"Description of the voice\",\n       \"tags\": [\"tag1\", \"tag2\"],\n       \"language\": \"en\"\n     }'\n   ```\n\n3. **When user requests TTS with a voice name** (e.g. \"say this with Angie\"):\n   - Use `\"voice\": \"Angie\"` in the `/v1/audio/speech` request\n   - The server automatically loads the saved reference audio and uses voice cloning\n   - If the name doesn't exist, tell the user and offer to design or clone one\n\n4. **When user asks to list voices:**\n   ```bash\n   curl http://localhost:8880/v1/voices\n   ```\n   Present the results as a formatted list with name, description, source, language, tags, and usage count. Voices are sorted by usage count (most used first).\n\n5. **When user asks to delete a voice:** Confirm with the user first, then:\n   ```bash\n   curl -X DELETE http://localhost:8880/v1/voices/VOICE_NAME\n   ```\n\n6. **When user asks to rename a voice:**\n   ```bash\n   curl -X PATCH http://localhost:8880/v1/voices/OLD_NAME \\\n     -H \"Content-Type: application/json\" \\\n     -d '{\"name\": \"NEW_NAME\"}'\n   ```\n\n7. **When user asks to update a voice's metadata** (description, tags, language):\n   ```bash\n   curl -X PATCH http://localhost:8880/v1/voices/VOICE_NAME \\\n     -H \"Content-Type: application/json\" \\\n     -d '{\"description\": \"Updated description\", \"tags\": [\"new\", \"tags\"]}'\n   ```\n\n8. **Voice names are case-insensitive** but stored in the casing the user provided.\n\n9. **No duplicate names allowed.** If a name already exists, the save will fail (409). Ask the user for a different name or offer to delete the existing one first.\n\n10. 
**Voice profiles are stored locally** in `~/clawd/skills/qwen3-tts/voices/` and persist across server restarts. Each voice consists of:\n    - `<name>.json` — metadata\n    - `<name>.pt` — embedding tensor\n    - `<name>_sample.wav` — reference audio sample (used for re-cloning)\n\n### 5. Convert Audio Formats\n\n**When to use:** User needs audio in a specific format, or you need to prepare audio for messaging.\n\n```bash\ncurl -X POST http://localhost:8880/v1/audio/convert \\\n  -F \"audio=@input.wav\" \\\n  -F \"target_format=mp3\" \\\n  --output output.mp3\n```\n\nSupported formats: **wav**, **mp3**, **ogg** (Opus), **flac**\n\nYou can also use the shell script directly:\n```bash\nbash ~/clawd/skills/qwen3-tts/scripts/convert_to_ogg_opus.sh input.wav output.ogg\n```\n\n### 6. Send via Telegram (PTT Voice Message)\n\n**When to use:** User is interacting via Telegram, or explicitly asks to send audio to a Telegram chat.\n\n```bash\ncurl -X POST http://localhost:8880/v1/audio/send/telegram \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"audio_file\": \"/path/to/audio.wav\",\n    \"chat_id\": \"CHAT_ID\",\n    \"bot_token\": \"BOT_TOKEN\",\n    \"caption\": \"Optional caption\"\n  }'\n```\n\n- `bot_token` is optional if already configured in `config.json`\n- Audio is auto-converted to OGG/Opus and sent via Telegram's `sendVoice` API\n- Displays as a native PTT waveform voice message in the chat\n\n### 7. 
Send via WhatsApp (PTT Voice Message)\n\n**When to use:** User is interacting via WhatsApp, or explicitly asks to send audio there.\n\n```bash\ncurl -X POST http://localhost:8880/v1/audio/send/whatsapp \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"audio_file\": \"/path/to/audio.wav\",\n    \"phone_number_id\": \"PHONE_ID\",\n    \"recipient\": \"+14155551234\",\n    \"access_token\": \"ACCESS_TOKEN\"\n  }'\n```\n\n- `phone_number_id` and `access_token` are optional if already configured in `config.json`\n- Audio is auto-converted to OGG/Opus and sent as a native WhatsApp voice message\n\n### 8. Discovery Endpoints\n\nUse these to dynamically discover available models and speakers:\n\n```bash\n# List all available TTS models\ncurl http://localhost:8880/v1/models\n\n# List built-in speakers\ncurl http://localhost:8880/v1/speakers\n\n# Server health check (device info, voice count, version)\ncurl http://localhost:8880/health\n```\n\n---\n\n## How to Respond\n\n**After generating speech:**\n1. Tell the user the audio has been generated\n2. Provide the output file path\n3. If it was voice-design or voice-clone, **always ask to save the voice** (Rule §4.1)\n4. If the user is on Telegram/WhatsApp, offer to send it as a voice message\n\n**After saving a voice:**\n- Confirm the name and tell the user they can use it anytime with that name\n- Example: *\"Voice saved as 'Captain Hook'! You can reference it anytime with `voice: Captain Hook`.\"*\n\n**After sending via Telegram/WhatsApp:**\n- Confirm successful delivery\n\n**When choosing a speaker:** If the user doesn't specify, default to `\"Chelsie\"`. If they describe the kind of voice they want (but not a full voice-design request), pick the most fitting built-in speaker.\n\n**When choosing a model:** Default to `custom-voice-1.7b`. 
Only use `custom-voice-0.6b` if the user asks for speed, or if the system has limited VRAM/memory.\n\n---\n\n## Configuration\n\nThe agent can update `~/clawd/skills/qwen3-tts/config.json` to set:\n- **Telegram:** bot token and default chat ID\n- **WhatsApp:** phone number ID and access token\n- **Default model:** `custom-voice-1.7b` or `custom-voice-0.6b`\n- **Default audio format:** wav, mp3, ogg, flac\n- **Device override:** auto, cuda:0, xpu:0, cpu\n\nIf `config.json` doesn't exist, copy the template:\n```bash\ncp ~/clawd/skills/qwen3-tts/config.json.template ~/clawd/skills/qwen3-tts/config.json\n```\n","topics":["Text-to-Speech","Audio","Telegram","WhatsApp","Voice Cloning"],"tags":{"latest":"1.0.0"},"stats":{"comments":0,"downloads":274,"installsAllTime":10,"installsCurrent":1,"stars":0,"versions":1},"createdAt":1771641605632,"updatedAt":1778992002097},"latestVersion":{"version":"1.0.0","createdAt":1771641605632,"changelog":"- Initial release of Qwen3-TTS skill for high-quality text-to-speech.\n- Features 10 built-in speakers, voice cloning from 3s+ audio, natural-language voice design, and emotional control.\n- Supports 10+ languages with persistent, named voices.\n- Audio can be delivered as native messages in Telegram and WhatsApp.\n- Auto-detects available GPU/CPU hardware for optimal performance.\n- Includes detailed setup instructions and usage guidelines for speech generation, voice design, and cloning.","license":null},"metadata":null,"owner":{"handle":"damustermann","userId":"s17dq6y70pdxjemzj90rprm4wd8857qp","displayName":"Julian","image":"https://avatars.githubusercontent.com/u/61767855?v=4"},"moderation":{"isSuspicious":false,"isMalwareBlocked":false,"verdict":"clean","reasonCodes":["review.llm_review"],"summary":"Review: review.llm_review","engineVersion":"v2.4.24","updatedAt":1779943861887}}