{"skill":{"slug":"qwen3-audio","displayName":"Qwen3 Audio","summary":"High-performance audio library for Apple Silicon with text-to-speech (TTS) and speech-to-text (STT).","description":"---\nname: qwen3-audio\ndescription: \"High-performance audio library for Apple Silicon with text-to-speech (TTS) and speech-to-text (STT).\"\nversion: \"0.0.3\"\n---\n\n# Qwen3-Audio\n\n## Overview\n\nQwen3-Audio is a high-performance audio processing library optimized for Apple Silicon (M1/M2/M3/M4). It delivers fast, efficient TTS and STT with support for multiple models, languages, and audio formats.\n\n## Prerequisites\n\n- Python 3.10+\n- Apple Silicon Mac (M1/M2/M3/M4)\n\n### Environment checks\n\nBefore using any capability, verify that all items in `./references/env-check-list.md` are complete.\n\n## Capabilities\n\n### Text to Speech\n```bash\nuv run --python \".venv/bin/python\" \"./scripts/mlx-audio.py\" tts --text \"hello world\" --output \"/path_to_save.wav\"\n```\n\n**Returns (JSON):**\n```json\n{\n  \"audio_path\": \"/path_to_save.wav\",\n  \"duration\": 1.234,\n  \"sample_rate\": 24000\n}\n```\n\n### Voice Cloning\nClone any voice using a reference audio sample. 
Provide the wav file and its transcript:\n```bash\nuv run --python \".venv/bin/python\" \"./scripts/mlx-audio.py\" tts --text \"hello world\" --output \"/path_to_save.wav\" --ref_audio \"sample_audio.wav\" --ref_text \"This is what my voice sounds like.\"\n```\n- `--ref_audio`: reference audio to clone\n- `--ref_text`: transcript of the reference audio\n\n### Use Created Voice (Shortcut)\nUse a voice created with `voice create` by its ID:\n```bash\nuv run --python \".venv/bin/python\" \"./scripts/mlx-audio.py\" tts --text \"hello world\" --output \"/path_to_save.wav\" --ref_voice \"my-voice-id\"\n```\nThis automatically loads `ref_audio` and `ref_text` from the voice profile.\n\n### CustomVoice (Emotion Control)\nUse predefined voices with emotion/style instructions:\n```bash\nuv run --python \".venv/bin/python\" \"./scripts/mlx-audio.py\" tts --text \"hello world\" --output \"/path_to_save.wav\" --speaker \"Ryan\" --language \"English\" --instruct \"Very happy and excited.\"\n```\n\n### VoiceDesign (Create Any Voice)\nCreate any voice from a text description:\n```bash\nuv run --python \".venv/bin/python\" \"./scripts/mlx-audio.py\" tts --text \"hello world\" --output \"/path_to_save.wav\" --language \"English\" --instruct \"A cheerful young female voice with high pitch and energetic tone.\"\n```\n\n### Automatic Speech Recognition (STT)\n```bash\nuv run --python \".venv/bin/python\" \"./scripts/mlx-audio.py\" stt --audio \"/sample_audio.wav\" --output \"/path_to_save.txt\" --output-format srt\n```\n- Test audio: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav\n- `--output-format`: \"txt\" | \"ass\" | \"srt\" | \"all\"\n\n**Returns (JSON):**\n```json\n{\n  \"text\": \"transcribed text content\",\n  \"duration\": 10.5,\n  \"sample_rate\": 16000,\n  \"files\": [\"/path_to_save.txt\", \"/path_to_save.srt\"]\n}\n```\n\n### Voice Management\n\nVoices are stored in the `voices/` directory at the skill root level. 
Each voice has its own folder containing:\n- `ref_audio.wav` - Reference audio file\n- `ref_text.txt` - Reference text transcript\n- `ref_instruct.txt` - Voice style description\n\n#### Create a Voice\nCreate a reusable voice profile using the VoiceDesign model. The `--instruct` parameter is required and describes the voice style:\n```bash\nuv run --python \".venv/bin/python\" \"./scripts/mlx-audio.py\" voice create --text \"This is a sample voice reference text.\" --instruct \"A warm, friendly female voice with a professional tone.\" --language \"English\"\n```\nOptional: `--id \"my-voice-id\"` to specify a custom voice ID.\n\n**Returns (JSON):**\n```json\n{\n  \"id\": \"abc12345\",\n  \"ref_audio\": \"/path/to/skill/voices/abc12345/ref_audio.wav\",\n  \"ref_text\": \"This is a sample voice reference text.\",\n  \"instruct\": \"A warm, friendly female voice with a professional tone.\",\n  \"duration\": 3.456,\n  \"sample_rate\": 24000\n}\n```\n\n#### List Voices\nList all created voice profiles:\n```bash\nuv run --python \".venv/bin/python\" \"./scripts/mlx-audio.py\" voice list\n```\n\n**Returns (JSON):**\n```json\n[\n  {\n    \"id\": \"abc12345\",\n    \"ref_audio\": \"/path/to/skill/voices/abc12345/ref_audio.wav\",\n    \"ref_text\": \"This is a sample voice reference text.\",\n    \"instruct\": \"A warm, friendly female voice with a professional tone.\",\n    \"duration\": 3.456,\n    \"sample_rate\": 24000\n  }\n]\n```\n\n#### Use a Created Voice\nAfter creating a voice, use it for TTS with the `--ref_voice` parameter. The stored instruct (style description) is loaded automatically:\n```bash\nuv run --python \".venv/bin/python\" \"./scripts/mlx-audio.py\" tts --text \"New text to speak\" --output \"/output.wav\" --ref_voice \"abc12345\"\n```\n\n## Predefined Speakers (CustomVoice)\n\nFor `Qwen3-TTS-12Hz-1.7B/0.6B-CustomVoice` models, the supported speakers and their descriptions are listed below. We recommend using each speaker's native language for best quality. 
Each speaker can still speak any language supported by the model.\n\n| Speaker | Voice Description | Native Language |\n| --- | --- | --- |\n| Vivian | Bright, slightly edgy young female voice. | Chinese |\n| Serena | Warm, gentle young female voice. | Chinese |\n| Uncle_Fu | Seasoned male voice with a low, mellow timbre. | Chinese |\n| Dylan | Youthful Beijing male voice with a clear, natural timbre. | Chinese (Beijing Dialect) |\n| Eric | Lively Chengdu male voice with a slightly husky brightness. | Chinese (Sichuan Dialect) |\n| Ryan | Dynamic male voice with strong rhythmic drive. | English |\n| Aiden | Sunny American male voice with a clear midrange. | English |\n| Ono_Anna | Playful Japanese female voice with a light, nimble timbre. | Japanese |\n| Sohee | Warm Korean female voice with rich emotion. | Korean |\n\n\n### Released Models\n\n| Model | Features | Language Support | Instruction Control |\n|---|---|---|---|\n| Qwen3-TTS-12Hz-1.7B-VoiceDesign | Performs voice design based on user-provided descriptions. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✅ |\n| Qwen3-TTS-12Hz-1.7B-CustomVoice | Provides style control over target timbres via user instructions; supports 9 premium timbres covering various combinations of gender, age, language, and dialect. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✅ |\n| Qwen3-TTS-12Hz-1.7B-Base | Base model capable of 3-second rapid voice clone from user audio input; can be used for fine-tuning (FT) other models. 
| Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian |  |\n","tags":{"latest":"0.1.1"},"stats":{"comments":0,"downloads":850,"installsAllTime":0,"installsCurrent":0,"stars":0,"versions":2},"createdAt":1772513985779,"updatedAt":1778994700369},"latestVersion":{"version":"0.1.1","createdAt":1772602178164,"changelog":"Voice profile management updated to require and support style descriptions.\n\n- Voice profiles now include a mandatory instruct (style description) field.\n- voices/ directory structure updated: each voice now contains ref_instruct.txt.\n- voice create command requires --instruct to describe voice style (used with VoiceDesign model).\n- Listing or using voices now shows and applies the instruct field automatically.\n- Documentation updated to reflect new requirements and workflow for voice profile creation and use.","license":null},"metadata":null,"owner":{"handle":"darknoah","userId":"s17ej6pe6hc67ygkxpdwcwpfd583gryy","displayName":"noah","image":"https://avatars.githubusercontent.com/u/13805682?v=4"},"moderation":{"isSuspicious":false,"isMalwareBlocked":false,"verdict":"clean","reasonCodes":["review.llm_review"],"summary":"Review: review.llm_review","engineVersion":"v2.4.24","updatedAt":1780089753398}}