Install
openclaw skills install qwen3-audio
High-performance audio library for Apple Silicon with text-to-speech (TTS) and speech-to-text (STT).
Qwen3-Audio is a high-performance audio processing library optimized for Apple Silicon (M1/M2/M3/M4). It delivers fast, efficient TTS and STT with support for multiple models, languages, and audio formats.
Before using any capability, verify that all items in ./references/env-check-list.md are complete.
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav"
Returns (JSON):
{
"audio_path": "/path_to_save.wav",
"duration": 1.234,
"sample_rate": 24000
}
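As a minimal sketch of consuming this result from Python, the helper below parses the JSON shown above into a typed record. It assumes the command prints nothing but that JSON object on standard output and that `duration` is in seconds; `TtsResult` and `parse_tts_result` are hypothetical names, not part of the skill.

```python
import json
from dataclasses import dataclass


@dataclass
class TtsResult:
    audio_path: str
    duration: float   # assumed to be seconds
    sample_rate: int  # samples per second


def parse_tts_result(stdout: str) -> TtsResult:
    """Parse the JSON object printed by the tts command (assumed to be the whole stdout)."""
    data = json.loads(stdout)
    return TtsResult(
        audio_path=data["audio_path"],
        duration=float(data["duration"]),
        sample_rate=int(data["sample_rate"]),
    )


# The sample below copies the documented return value.
sample = '{"audio_path": "/path_to_save.wav", "duration": 1.234, "sample_rate": 24000}'
result = parse_tts_result(sample)
print(result.audio_path, result.duration, result.sample_rate)
```

In a real run, the `stdout` argument would come from something like `subprocess.run(cmd, capture_output=True, text=True, check=True).stdout`, where `cmd` is the command line above split into a list.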
Clone any voice using a reference audio sample. Provide the wav file and its transcript:
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --ref_audio "sample_audio.wav" --ref_text "This is what my voice sounds like."
ref_audio: reference audio to clone
ref_text: transcript of the reference audio
Use a voice created with the voice create command by passing its ID:
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --ref_voice "my-voice-id"
This automatically loads ref_audio and ref_text from the voice profile.
Use predefined voices with emotion/style instructions:
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --speaker "Ryan" --language "English" --instruct "Very happy and excited."
Create any voice from a text description:
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --language "English" --instruct "A cheerful young female voice with high pitch and energetic tone."
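The `tts` variants above differ only in which optional flags they pass, so one way to script them is a small argument builder. This is a sketch, not part of the skill: `build_tts_command` is a hypothetical helper, the flags are copied from the commands above, and the rule that `--ref_audio` and `--ref_text` must appear together is an assumption drawn from the cloning instructions.

```python
import shlex
from typing import List, Optional


def build_tts_command(
    text: str,
    output: str,
    ref_audio: Optional[str] = None,
    ref_text: Optional[str] = None,
    ref_voice: Optional[str] = None,
    speaker: Optional[str] = None,
    language: Optional[str] = None,
    instruct: Optional[str] = None,
) -> List[str]:
    """Build the argv list for the tts subcommand from the flags shown above."""
    # Assumption: cloning needs both the sample and its transcript.
    if (ref_audio is None) != (ref_text is None):
        raise ValueError("ref_audio and ref_text must be given together")
    cmd = [
        "uv", "run", "--python", ".venv/bin/python",
        "./scripts/mlx-audio.py", "tts",
        "--text", text, "--output", output,
    ]
    optional = {
        "--ref_audio": ref_audio,
        "--ref_text": ref_text,
        "--ref_voice": ref_voice,
        "--speaker": speaker,
        "--language": language,
        "--instruct": instruct,
    }
    # Append only the flags that were actually supplied, in a fixed order.
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, value]
    return cmd


cmd = build_tts_command(
    "hello world",
    "/path_to_save.wav",
    speaker="Ryan",
    language="English",
    instruct="Very happy and excited.",
)
print(shlex.join(cmd))
```

Passing the list form to `subprocess.run` avoids shell quoting problems with instructions that contain spaces or punctuation.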
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" stt --audio "/sample_audio.wav" --output "/path_to_save.txt" --output-format srt
Test audio: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav
output-format: "txt" | "ass" | "srt" | "all"
Returns (JSON):
{
"text": "transcribed text content",
"duration": 10.5,
"sample_rate": 16000,
"files": ["/path_to_save.txt", "/path_to_save.srt"]
}
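When several output formats are written, it can help to index the `files` list by format. The sketch below does that for the JSON shown above; it assumes each file's extension matches its format name, and `files_by_format` is a hypothetical helper rather than something the skill provides.

```python
import json
from pathlib import PurePosixPath
from typing import Dict


def files_by_format(stdout: str) -> Dict[str, str]:
    """Map each output format (file extension without the dot) to its path."""
    data = json.loads(stdout)
    return {PurePosixPath(p).suffix.lstrip("."): p for p in data["files"]}


# The sample below copies the documented return value.
sample = """{
  "text": "transcribed text content",
  "duration": 10.5,
  "sample_rate": 16000,
  "files": ["/path_to_save.txt", "/path_to_save.srt"]
}"""
outputs = files_by_format(sample)
print(outputs)
```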
Voices are stored in the voices/ directory at the skill root level. Each voice has its own folder containing:
- ref_audio.wav - Reference audio file
- ref_text.txt - Reference text transcript
- ref_instruct.txt - Voice style description

Create a reusable voice profile using the VoiceDesign model. The --instruct parameter is required and describes the voice style:
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" voice create --text "This is a sample voice reference text." --instruct "A warm, friendly female voice with a professional tone." --language "English"
Optional: --id "my-voice-id" to specify a custom voice ID.
Returns (JSON):
{
"id": "abc12345",
"ref_audio": "/path/to/skill/voices/abc12345/ref_audio.wav",
"ref_text": "This is a sample voice reference text.",
"instruct": "A warm, friendly female voice with a professional tone.",
"duration": 3.456,
"sample_rate": 24000
}
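Because each profile is just a folder under voices/ with the three files described earlier, a profile can also be read straight from disk. The sketch below assumes the folder name equals the voice ID (as the `ref_audio` path above suggests) and that the two text files are UTF-8 plain text; `load_voice_profile` is a hypothetical helper, and the demo builds a throwaway directory instead of touching a real skill install.

```python
import tempfile
from pathlib import Path
from typing import Dict


def load_voice_profile(voices_dir: Path, voice_id: str) -> Dict[str, str]:
    """Read one voice profile from voices_dir/<voice_id>/."""
    folder = voices_dir / voice_id
    audio = folder / "ref_audio.wav"
    if not audio.is_file():
        raise FileNotFoundError(f"no reference audio for voice {voice_id!r}")
    return {
        "id": voice_id,
        "ref_audio": str(audio),
        "ref_text": (folder / "ref_text.txt").read_text(encoding="utf-8").strip(),
        "instruct": (folder / "ref_instruct.txt").read_text(encoding="utf-8").strip(),
    }


# Demo against a temporary directory that mimics the documented layout.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "voices" / "abc12345"
    folder.mkdir(parents=True)
    (folder / "ref_audio.wav").write_bytes(b"")  # placeholder, not real audio
    (folder / "ref_text.txt").write_text(
        "This is a sample voice reference text.\n", encoding="utf-8"
    )
    (folder / "ref_instruct.txt").write_text(
        "A warm, friendly female voice with a professional tone.\n", encoding="utf-8"
    )
    profile = load_voice_profile(Path(tmp) / "voices", "abc12345")

print(profile["ref_text"])
```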
List all created voice profiles:
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" voice list
Returns (JSON):
[
{
"id": "abc12345",
"ref_audio": "/path/to/skill/voices/abc12345/ref_audio.wav",
"ref_text": "This is a sample voice reference text.",
"instruct": "A warm, friendly female voice with a professional tone.",
"duration": 3.456,
"sample_rate": 24000
}
]
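To look a profile up by ID before passing it to `--ref_voice`, the list output can be turned into a dictionary. This is a sketch under the assumption that standard output contains only the JSON array shown above; `index_voices` is a hypothetical name.

```python
import json
from typing import Dict, List


def index_voices(stdout: str) -> Dict[str, dict]:
    """Index the voice list output by voice ID."""
    voices: List[dict] = json.loads(stdout)
    return {voice["id"]: voice for voice in voices}


# The sample below copies the documented return value.
sample = """[
  {
    "id": "abc12345",
    "ref_audio": "/path/to/skill/voices/abc12345/ref_audio.wav",
    "ref_text": "This is a sample voice reference text.",
    "instruct": "A warm, friendly female voice with a professional tone.",
    "duration": 3.456,
    "sample_rate": 24000
  }
]"""
voices = index_voices(sample)
print(sorted(voices))
```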
After creating a voice, use it for TTS with the --ref_voice parameter. The stored instruct is loaded automatically:
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "New text to speak" --output "/output.wav" --ref_voice "abc12345"
For Qwen3-TTS-12Hz-1.7B/0.6B-CustomVoice models, the supported speakers and their descriptions are listed below. We recommend using each speaker's native language for best quality. Each speaker can still speak any language supported by the model.
| Speaker | Voice Description | Native Language |
|---|---|---|
| Vivian | Bright, slightly edgy young female voice. | Chinese |
| Serena | Warm, gentle young female voice. | Chinese |
| Uncle_Fu | Seasoned male voice with a low, mellow timbre. | Chinese |
| Dylan | Youthful Beijing male voice with a clear, natural timbre. | Chinese (Beijing Dialect) |
| Eric | Lively Chengdu male voice with a slightly husky brightness. | Chinese (Sichuan Dialect) |
| Ryan | Dynamic male voice with strong rhythmic drive. | English |
| Aiden | Sunny American male voice with a clear midrange. | English |
| Ono_Anna | Playful Japanese female voice with a light, nimble timbre. | Japanese |
| Sohee | Warm Korean female voice with rich emotion. | Korean |
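Since the recommendation is to match each speaker to its native language, a script can pick candidates from the table above. The sketch below hard-codes the table as a dictionary; `SPEAKERS` and `speakers_for` are hypothetical names, and matching on the language prefix (so that "Chinese" also covers the dialect entries) is an illustrative choice.

```python
from typing import Dict, List

# Speaker name -> native language, copied from the table above.
SPEAKERS: Dict[str, str] = {
    "Vivian": "Chinese",
    "Serena": "Chinese",
    "Uncle_Fu": "Chinese",
    "Dylan": "Chinese (Beijing Dialect)",
    "Eric": "Chinese (Sichuan Dialect)",
    "Ryan": "English",
    "Aiden": "English",
    "Ono_Anna": "Japanese",
    "Sohee": "Korean",
}


def speakers_for(language: str) -> List[str]:
    """Return speakers whose native language starts with the given name."""
    return [name for name, native in SPEAKERS.items() if native.startswith(language)]


english = speakers_for("English")
chinese = speakers_for("Chinese")
print(english, chinese)
```

An empty result means no predefined speaker is native to that language; any speaker can still be used, at some possible cost in quality.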
| Model | Features | Language Support | Instruction Control |
|---|---|---|---|
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | Performs voice design based on user-provided descriptions. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✅ |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | Provides style control over target timbres via user instructions; supports 9 premium timbres covering various combinations of gender, age, language, and dialect. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✅ |
| Qwen3-TTS-12Hz-1.7B-Base | Base model capable of 3-second rapid voice clone from user audio input; can be used for fine-tuning (FT) other models. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | |
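One plausible reading of the table, sketched in Python: reference audio points to the Base model, a predefined speaker to CustomVoice, and a bare description to VoiceDesign. How the CLI actually selects a model is not stated here, so `pick_model` and its precedence order are assumptions for illustration only.

```python
from typing import Optional


def pick_model(
    ref_audio: Optional[str] = None,
    speaker: Optional[str] = None,
    instruct: Optional[str] = None,
) -> Optional[str]:
    """Suggest a model from the table for the given inputs (assumed precedence)."""
    if ref_audio is not None:
        # Voice cloning from user audio is the Base model's listed feature.
        return "Qwen3-TTS-12Hz-1.7B-Base"
    if speaker is not None:
        # Predefined timbres with style instructions belong to CustomVoice.
        return "Qwen3-TTS-12Hz-1.7B-CustomVoice"
    if instruct is not None:
        # A description with no speaker or sample matches VoiceDesign.
        return "Qwen3-TTS-12Hz-1.7B-VoiceDesign"
    # The table does not say which model serves plain text-only requests.
    return None


print(pick_model(speaker="Ryan", instruct="Very happy and excited."))
```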