{"skill":{"slug":"qwen3-tts-local-inference","displayName":"qwen3-tts-local-inference","summary":"Generate speech from text using Qwen3-TTS via direct Python inference — no server required. Use when: (1) converting text to speech / synthesising audio, (2)...","description":"---\r\nname: qwen3-tts-local-inference\r\ndescription: >\r\n  Generate speech from text using Qwen3-TTS via direct Python inference — no\r\n  server required. Use when: (1) converting text to speech / synthesising audio,\r\n  (2) creating voiceovers or spoken content, (3) cloning a voice from reference\r\n  audio, (4) generating TTS with built-in speakers or custom voice descriptions.\r\n  Supports custom-voice (9 speakers), voice-design (natural language), and\r\n  voice-clone (~3 s reference). Outputs .wav files. Both 0.6B (small, default)\r\n  and 1.7B (large) models available. Runs entirely offline after model download.\r\n---\r\n\r\n# Qwen3-TTS — Local Inference (No Server)\r\n\r\nRun Qwen3-TTS directly in Python — no HTTP server, no REST API. 
Call a script\r\nor import the engine in your own code.\r\n\r\n## Quick reference\r\n\r\n| Mode | What it does | Key args |\r\n|------|-------------|----------|\r\n| **custom-voice** | 9 built-in speakers, optional emotion/style | `--speaker`, `--instruct` |\r\n| **voice-design** | Describe the voice in natural language | `--instruct` (required) |\r\n| **voice-clone** | Clone from ~3 s reference audio | `--ref-audio`, `--ref-text` |\r\n\r\n**Available Speakers**\r\n\r\nThe CustomVoice model includes 9 premium voices:\r\n\r\n| Speaker | Language | Description |\r\n|---------|----------|-------------|\r\n| Vivian | Chinese | Bright, slightly edgy young female |\r\n| Serena | Chinese | Warm, gentle young female |\r\n| Uncle_Fu | Chinese | Seasoned male, low mellow timbre |\r\n| Dylan | Chinese (Beijing) | Youthful Beijing male, clear |\r\n| Eric | Chinese (Sichuan) | Lively Chengdu male, husky |\r\n| Ryan | English | Dynamic male, rhythmic |\r\n| Aiden | English | Sunny American male |\r\n| Ono_Anna | Japanese | Playful female, light nimble |\r\n| Sohee | Korean | Warm female, rich emotion |\r\n\r\n**Languages:** Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian, Auto\r\n\r\n---\r\n\r\n## 1 — Setup\r\n\r\nInstall dependencies once (from the skill directory):\r\n\r\n```bash\r\nbash scripts/setup.sh\r\n```\r\n\r\nTo download the models to a custom location:\r\n\r\n```bash\r\npython scripts/download_models.py --model-dir /path/to/models\r\n```\r\n\r\nModels are stored under `{baseDir}/models/` by default. 
Override with\r\n`QWEN_TTS_MODEL_DIR` env var or `--model-dir` flag.\r\n\r\n---\r\n\r\n## 2 — Generate speech (CLI)\r\n\r\n### Custom Voice (default)\r\n\r\n```bash\r\ncd {baseDir}\r\npython scripts/tts.py \"Hello, how are you today?\" --speaker Ryan --language English\r\n```\r\n\r\nWith emotion/style instruction:\r\n\r\n```bash\r\npython scripts/tts.py \"Great news everyone!\" --speaker Aiden --instruct \"cheerful and energetic\"\r\n```\r\n\r\n### Voice Design\r\n\r\nDescribe the voice in natural language:\r\n\r\n```bash\r\npython scripts/tts.py \"Welcome to our show!\" \\\r\n  --mode voice-design \\\r\n  --language English \\\r\n  --instruct \"Warm, confident female voice in her 30s with a slight British accent\"\r\n```\r\n\r\n### Voice Clone\r\n\r\nClone a voice from a short (~3 s) reference audio clip:\r\n\r\n```bash\r\npython scripts/tts.py \"This is spoken in the cloned voice.\" \\\r\n  --mode voice-clone \\\r\n  --language English \\\r\n  --ref-audio path/to/reference.wav \\\r\n  --ref-text \"Transcript of the reference audio.\"\r\n```\r\n\r\n### Common options\r\n\r\n| Flag | Purpose |\r\n|------|---------|\r\n| `-o output.wav` | Save to exact file path instead of auto-named file |\r\n| `--output-dir DIR` | Override output directory (default: `tts_output/`) |\r\n| `--model-dir DIR` | Override model directory |\r\n| `--json` | Print result as JSON |\r\n| `-v` | Verbose logging |\r\n\r\n---\r\n\r\n## 3 — Python API\r\n\r\nUse the engine directly in code:\r\n\r\n```python\r\nimport sys\r\nsys.path.insert(0, \"{baseDir}/scripts\")\r\n\r\nfrom inference import TTSInferenceEngine\r\n\r\nengine = TTSInferenceEngine(\r\n    model_dir=\"{baseDir}/models\",   # optional, uses default if omitted\r\n    output_dir=\"./tts_output\",       # optional\r\n)\r\n\r\nresult = engine.generate_custom_voice(\r\n    text=\"Hello world!\",\r\n    language=\"English\",\r\n    speaker=\"Ryan\",\r\n    instruct=\"calm and professional\",\r\n)\r\nprint(result)\r\n# {\"file\": 
\"tts_output/custom_voice_20260218_...wav\", \"duration_s\": 1.23, \"inference_s\": 4.56}\r\n```\r\n\r\nAvailable methods:\r\n- `engine.generate_custom_voice(text, language, speaker, instruct)`\r\n- `engine.generate_voice_design(text, language, instruct)`\r\n- `engine.generate_voice_clone(text, language, ref_audio, ref_text)`\r\n- `engine.status()`: returns loaded variant, device, paths\r\n\r\n---\r\n\r\n## 4 — Configuration\r\n\r\nAll settings are controlled via environment variables. Set them before running.\r\n\r\n| Variable | Default | Description |\r\n|----------|---------|-------------|\r\n| `QWEN_TTS_MODEL_SIZE` | `small` | `small` (0.6B) or `large` (1.7B) |\r\n| `QWEN_TTS_MODEL_DIR` | `{baseDir}/models` | Where model weights are stored |\r\n| `QWEN_TTS_DEVICE` | auto (`cuda:0` or `cpu`) | Inference device |\r\n| `QWEN_TTS_DTYPE` | auto (`bfloat16` / `float32`) | Model precision |\r\n| `QWEN_TTS_OUTPUT_DIR` | `./tts_output` | Where generated .wav files are saved |\r\n\r\nSwitch to the 1.7B model (Windows `cmd` syntax; in bash or zsh, use `export` instead of `set`):\r\n\r\n```bat\r\nset QWEN_TTS_MODEL_SIZE=large\r\npython scripts/tts.py \"Hello world\"\r\n```\r\n\r\nUse a custom model directory (again in Windows `cmd` syntax):\r\n\r\n```bat\r\nset QWEN_TTS_MODEL_DIR=D:\\my-models\\qwen-tts\r\npython scripts/tts.py \"Hello world\"\r\n```\r\n\r\n---\r\n\r\n## Important notes\r\n\r\n- **Small model (0.6B) is the default.** It uses less RAM and is faster.\r\n  Switch to `large` (1.7B) for higher quality.\r\n- **CPU inference is slow.** Expect 30-120 s per sentence for the 1.7B model.\r\n  The 0.6B model is roughly 2x faster.\r\n- Only **one model variant** is loaded at a time. Switching modes (e.g.\r\n  custom-voice to voice-clone) triggers a model swap.\r\n- Output `.wav` files land in `tts_output/` by default.\r\n- Models are downloaded to `{baseDir}/models/` by default. 
Run\r\n  `download_models.py --size all` to pre-download both sizes for offline use.\r\n- Voice Design mode has **no 0.6B variant** — it always uses the 1.7B model\r\n  regardless of `QWEN_TTS_MODEL_SIZE`.\r\n","topics":["Text-to-Speech","Audio"],"tags":{"latest":"0.0.1"},"stats":{"comments":0,"downloads":251,"installsAllTime":9,"installsCurrent":0,"stars":0,"versions":1},"createdAt":1771450630984,"updatedAt":1778491578254},"latestVersion":{"version":"0.0.1","createdAt":1771450630984,"changelog":"- Initial release of Qwen3-TTS local inference skill.\n- Generate high-quality speech from text using Qwen3-TTS directly in Python without requiring a server.\n- Supports three modes: 9-speaker custom voices, natural language voice design, and voice cloning from reference audio.\n- Compatible with Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian.\n- Exposes both a CLI and Python API for flexible use.\n- Includes model management, configuration via environment variables, and outputs generated audio as .wav files.","license":null},"metadata":null,"owner":{"handle":"jithinm","userId":"s177ccfsaf4phwp0j5raxbjz8s83exd1","displayName":"JithinM","image":"https://avatars.githubusercontent.com/u/12230840?v=4"},"moderation":null}