Qwen3-TTS MLX

v2.1.0

Local Qwen3-TTS speech synthesis on Apple Silicon via MLX. Use for offline narration, audiobooks, video voiceovers, and multilingual TTS.

License: MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan

VirusTotal: Benign
OpenClaw: Suspicious (medium confidence)

[!] Purpose & Capability
Name/description and the scripts align with a local TTS tool using mlx-audio and MLX models; however the SKILL.md emphasizes 'offline' usage while the code calls generate_audio with model names (e.g. mlx-community/...) that will typically be fetched from remote model repositories (Hugging Face/MLX) unless pre-downloaded. The documentation does not explain model download, disk/storage needs (~1–2+ GB per model), or how to operate fully offline, which is a meaningful mismatch.
[!] Instruction Scope
The SKILL.md instructs installing mlx-audio and ffmpeg and running the provided scripts; the scripts read user JSON and audio files and write outputs (expected). The runtime instructions do not disclose that generate_audio may download models or contact remote endpoints; that omission gives the agent broader network/IO behavior than the 'offline' claim implies. The batch script also monkeypatches transformers.AutoTokenizer to set a fix flag — a side-effect that mutates third-party library behavior in-process (benign but notable).
Install Mechanism
There is no formal install spec; SKILL.md recommends 'pip install mlx-audio' and 'brew install ffmpeg'. Using pip means pulling code from PyPI (or whichever index the environment uses). This is an expected pattern for Python scripts but carries typical supply-chain risks: the package 'mlx-audio' and its dependencies should be reviewed and/or pinned. No arbitrary download URLs or archived extracts are included in the skill files themselves.
Credentials
The skill requests no environment variables, credentials, or config paths. The scripts operate only on user-provided files and local outputs, which is proportionate for a TTS tool.
Persistence & Privilege
The skill is not always-enabled and does not request persistent elevated privileges or attempt to modify other skills or system-wide agent settings. It runs as a user-invoked CLI tool — expected behavior.
What to consider before installing
This package is coherent with a local TTS tool, but take these precautions before installing:

  • Expect network activity and large downloads unless you pre-download models: the scripts refer to model identifiers (mlx-community/...), which are normally fetched from remote model hubs. If you truly need offline-only operation, confirm how models are downloaded and pre-stage the weights locally.
  • Review and pin the 'mlx-audio' package (and its dependencies) before pip install to reduce supply-chain risk. Consider installing in a virtualenv or sandbox.
  • Be prepared for significant disk usage (models are ~1–2+ GB each) and possible memory/GPU requirements on your Mac.
  • Voice cloning processes user audio; consider privacy and consent implications before processing other people's voice samples.
  • The batch script monkeypatches transformers.AutoTokenizer to set a flag, which mutates third-party behavior in-process. It's likely harmless, but you may want to review or remove that patch if you prefer not to change library internals.
  • If you need stronger assurance, inspect or run the code in an isolated environment, monitor network connections during first runs, and verify model sources and licenses (and that they are allowed for your use case).
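To act on the pre-staging advice, it can help to check whether a model's weights are already on disk before running anything. The following is a minimal sketch, assuming mlx-audio fetches models through the Hugging Face hub cache (an assumption, not something this page confirms) and that the cache uses the usual `models--<org>--<name>/snapshots/` layout. `hub_cache_dir` and `is_model_cached` are hypothetical helper names.

```python
import os
from pathlib import Path
from typing import Optional


def hub_cache_dir() -> Path:
    """Return the Hugging Face hub cache directory, honoring common overrides."""
    if os.environ.get("HF_HUB_CACHE"):
        return Path(os.environ["HF_HUB_CACHE"])
    hf_home = os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
    return Path(hf_home) / "hub"


def is_model_cached(repo_id: str, cache_dir: Optional[Path] = None) -> bool:
    """True if a non-empty snapshots folder exists for repo_id (assumed cache layout)."""
    root = cache_dir if cache_dir is not None else hub_cache_dir()
    snapshots = root / ("models--" + repo_id.replace("/", "--")) / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())


if __name__ == "__main__":
    repo = "mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit"
    state = "cached" if is_model_cached(repo) else "not cached (first run would download)"
    print(repo, "->", state)
```

If the check passes, setting the environment variable `HF_HUB_OFFLINE=1` before running is one way to make the Hugging Face client refuse network access, which makes an unexpected download fail loudly instead of happening silently.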




SKILL.md

Qwen3-TTS MLX

Run Qwen3-TTS locally on Apple Silicon (M1/M2/M3/M4) using MLX. Supports 10 languages (plus automatic language detection), 9 built-in voices, voice cloning, and voice design from text descriptions.

When to Use

  • Generate speech fully offline on a Mac (after the one-time model download on first run)
  • Produce narration, audiobooks, podcasts, or video voiceovers
  • Create multilingual TTS with controllable style and emotion
  • Clone any voice from a short audio sample
  • Design custom voices from text descriptions

Quick Start

Install

pip install mlx-audio
brew install ffmpeg

Basic Usage

python scripts/run_tts.py custom-voice \
  --text "Hello, welcome to local text to speech." \
  --voice Ryan \
  --output output.wav

With Style Control

python scripts/run_tts.py custom-voice \
  --text "Breaking news: local AI model achieves human-level speech." \
  --voice Uncle_Fu \
  --instruct "news anchor tone, calm and authoritative" \
  --output news.wav

Model Variants

| Variant | Model | Size | Memory | Use Case |
| --- | --- | --- | --- | --- |
| CustomVoice | mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit | ~1 GB | ~4 GB | Built-in voices + style control (recommended) |
| VoiceDesign | mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit | ~2 GB | ~5 GB | Create voices from text descriptions |
| Base | mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit | ~1 GB | ~4 GB | Voice cloning from reference audio |
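
The table implies one default model per mode, with --model available as an override. A minimal sketch of that lookup follows. The pairing of CLI subcommands with model names is inferred from this document, and `resolve_model` is a hypothetical helper, not a function from the skill's scripts.

```python
# Mode -> default model, as inferred from the Model Variants table.
DEFAULT_MODELS = {
    "custom-voice": "mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit",
    "voice-design": "mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit",
    "voice-clone": "mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit",
}


def resolve_model(mode, override=None):
    """Return the override if given, otherwise the default model for the mode."""
    if override:
        return override
    if mode not in DEFAULT_MODELS:
        raise ValueError(f"unknown mode: {mode!r}")
    return DEFAULT_MODELS[mode]


if __name__ == "__main__":
    for mode in DEFAULT_MODELS:
        print(f"{mode:13s} -> {resolve_model(mode)}")
```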

Supported Languages

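| Language | Code | Notes |
| --- | --- | --- |
| Auto-detect | auto | Default, detects from text |
| Chinese | Chinese | Mandarin |
| English | English | |
| Japanese | Japanese | |
| Korean | Korean | |
| French | French | |
| German | German | |
| Spanish | Spanish | |
| Portuguese | Portuguese | |
| Italian | Italian | |
| Russian | Russian | |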
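
The table spells codes with a leading capital, while the Python API examples in this document pass lowercase values such as "english". Whether the tools treat the two spellings identically is not stated here, so a small normalizer can remove the guesswork on the caller's side. This is a sketch only; `normalize_lang_code` is a hypothetical helper that maps any casing back to the table's spelling.

```python
# Codes exactly as listed in the Supported Languages table.
SUPPORTED_LANG_CODES = (
    "auto", "Chinese", "English", "Japanese", "Korean", "French",
    "German", "Spanish", "Portuguese", "Italian", "Russian",
)


def normalize_lang_code(value):
    """Map a user-supplied code to the table's spelling, ignoring case and padding."""
    wanted = value.strip().lower()
    for code in SUPPORTED_LANG_CODES:
        if code.lower() == wanted:
            return code
    raise ValueError(f"unsupported language code: {value!r}")


if __name__ == "__main__":
    print(normalize_lang_code("english"))
    print(normalize_lang_code(" AUTO "))
```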

Built-in Voices

| Voice | Language | Character |
| --- | --- | --- |
| Vivian | Chinese | Female, bright, young |
| Serena | Chinese | Female, gentle, soft |
| Uncle_Fu | Chinese | Male, authoritative, news anchor |
| Dylan | Chinese | Male, Beijing dialect |
| Eric | Chinese | Male, Sichuan dialect |
| Ryan | English | Male, energetic |
| Aiden | English | Male, clear, neutral |
| Ono_Anna | Japanese | Female |
| Sohee | Korean | Female |

Voice Selection Guide:

| Scenario | Recommended Voice |
| --- | --- |
| Chinese news/narration | Uncle_Fu |
| Chinese casual/lively | Eric |
| Chinese female, professional | Vivian |
| Chinese female, storytelling | Serena |
| English energetic content | Ryan |
| English neutral/educational | Aiden |
| Japanese content | Ono_Anna |
| Korean content | Sohee |
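
For scripted pipelines, the built-in voice list can be kept as data so that a voice is chosen by language instead of being hard-coded. The sketch below copies the Built-in Voices table; `voices_for_language` is a hypothetical helper introduced only for illustration.

```python
# Voice -> (language, character), copied from the Built-in Voices table.
BUILT_IN_VOICES = {
    "Vivian": ("Chinese", "Female, bright, young"),
    "Serena": ("Chinese", "Female, gentle, soft"),
    "Uncle_Fu": ("Chinese", "Male, authoritative, news anchor"),
    "Dylan": ("Chinese", "Male, Beijing dialect"),
    "Eric": ("Chinese", "Male, Sichuan dialect"),
    "Ryan": ("English", "Male, energetic"),
    "Aiden": ("English", "Male, clear, neutral"),
    "Ono_Anna": ("Japanese", "Female"),
    "Sohee": ("Korean", "Female"),
}


def voices_for_language(language):
    """Return built-in voice names for a language, in table order."""
    wanted = language.strip().lower()
    return [name for name, (lang, _) in BUILT_IN_VOICES.items() if lang.lower() == wanted]


if __name__ == "__main__":
    for lang in ("Chinese", "English", "Japanese", "Korean"):
        print(lang, "->", ", ".join(voices_for_language(lang)))
```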

Modes

1) CustomVoice

Use built-in voices with optional emotion/style control via --instruct.

python scripts/run_tts.py custom-voice \
  --text "This is amazing news!" \
  --voice Vivian \
  --instruct "excited and happy" \
  --output excited.wav

Style instruction examples:

  • "calm and warm" - Soft, friendly delivery
  • "news anchor, authoritative" - Professional broadcast style
  • "excited and energetic" - High energy, enthusiastic
  • "sad and melancholic" - Emotional, somber tone
  • "whispering, intimate" - Quiet, close-mic feel
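
To compare several of these styles on the same sentence, the CLI call can be assembled programmatically. This is a minimal sketch, assuming scripts/run_tts.py accepts the flags shown in this document; `build_custom_voice_cmd` is a hypothetical helper and the output file names are illustrative.

```python
import sys


def build_custom_voice_cmd(text, voice="Vivian", instruct=None, output=None,
                           script="scripts/run_tts.py"):
    """Build an argument list for the custom-voice mode (flags as documented here)."""
    cmd = [sys.executable, script, "custom-voice", "--text", text, "--voice", voice]
    if instruct:
        cmd += ["--instruct", instruct]
    if output:
        cmd += ["--output", output]
    return cmd


if __name__ == "__main__":
    styles = ["calm and warm", "excited and energetic", "whispering, intimate"]
    for index, style in enumerate(styles):
        cmd = build_custom_voice_cmd(
            "This is amazing news!", voice="Vivian",
            instruct=style, output=f"style_{index}.wav",
        )
        # Only prints the command; pass cmd to subprocess.run(cmd, check=True) to execute.
        print(cmd)
```

Passing the arguments as a list avoids shell quoting problems when a style string contains commas or quotes.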

2) VoiceDesign

Create a completely new voice by describing it in natural language.

python scripts/run_tts.py voice-design \
  --text "Welcome to our podcast." \
  --instruct "warm, mature male narrator with low pitch and gentle tone" \
  --output podcast_intro.wav

Voice description examples:

  • "young cheerful female with high pitch"
  • "elderly wise male with deep resonant voice"
  • "professional female news anchor, clear articulation"
  • "friendly young male, casual and relaxed"

3) VoiceClone

Clone any voice from a reference audio sample (5-10 seconds recommended).

python scripts/run_tts.py voice-clone \
  --text "This is my cloned voice speaking new content." \
  --ref_audio reference.wav \
  --ref_text "The exact transcript of the reference audio" \
  --output cloned.wav

Tips for voice cloning:

  • Use clean audio without background noise
  • 5-10 seconds of speech works best
  • Provide accurate transcript of the reference
  • Reference and output language should match
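
The 5-10 second recommendation is easy to check before cloning. The sketch below uses only the standard library `wave` module, so it assumes the reference is an uncompressed PCM WAV file; `check_reference_length` is a hypothetical helper, not part of the skill.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM WAV file in seconds (frames divided by frame rate)."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / float(wav.getframerate())


def check_reference_length(path, low=5.0, high=10.0):
    """Return (within_recommended_range, duration_seconds) for a reference clip."""
    duration = wav_duration_seconds(path)
    return low <= duration <= high, duration


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        ok, seconds = check_reference_length(sys.argv[1])
        verdict = "ok" if ok else "outside the recommended 5-10 s"
        print(f"{sys.argv[1]}: {seconds:.1f} s ({verdict})")
```

Other formats such as MP3 or M4A would first need converting to WAV, for example with the ffmpeg install from the Quick Start.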

CLI Parameters

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| --text | Yes | - | Text to synthesize |
| --voice | No | Vivian | Built-in voice (CustomVoice only) |
| --lang_code | No | auto | Language code |
| --instruct | No | - | Style control or voice description |
| --speed | No | 1.0 | Speech speed multiplier |
| --temperature | No | 0.7 | Sampling temperature (higher = more variation) |
| --model | No | (per mode) | Override default model |
| --output | No | - | Output file path |
| --out-dir | No | ./outputs | Output directory when --output not set |
| --ref_audio | VoiceClone | - | Reference audio file |
| --ref_text | VoiceClone | - | Reference audio transcript |
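
For readers wrapping the CLI in their own tooling, the table can be mirrored as an argparse definition. This is an illustrative reconstruction from the table alone, not the actual parser in scripts/run_tts.py. In particular, which flags each subcommand really accepts is an assumption.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the CLI Parameters table."""
    parser = argparse.ArgumentParser(prog="run_tts.py")
    modes = parser.add_subparsers(dest="mode", required=True)
    for mode in ("custom-voice", "voice-design", "voice-clone"):
        sub = modes.add_parser(mode)
        sub.add_argument("--text", required=True, help="Text to synthesize")
        sub.add_argument("--lang_code", default="auto", help="Language code")
        sub.add_argument("--instruct", help="Style control or voice description")
        sub.add_argument("--speed", type=float, default=1.0)
        sub.add_argument("--temperature", type=float, default=0.7)
        sub.add_argument("--model", help="Override the per-mode default model")
        sub.add_argument("--output", help="Output file path")
        sub.add_argument("--out-dir", default="./outputs")
        if mode == "custom-voice":
            sub.add_argument("--voice", default="Vivian")
        if mode == "voice-clone":
            # The table marks these as required for VoiceClone only.
            sub.add_argument("--ref_audio", required=True)
            sub.add_argument("--ref_text", required=True)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["custom-voice", "--text", "Hello"])
    print(vars(args))
```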

Python API

Using generate_audio (recommended)

from mlx_audio.tts.generate import generate_audio

# CustomVoice with style control
generate_audio(
    text="Hello from Qwen3-TTS!",
    model="mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit",
    voice="Ryan",
    lang_code="english",
    instruct="friendly and warm",
    output_path=".",
    file_prefix="hello",
    audio_format="wav",
    join_audio=True,
    verbose=True,
)

Using Model directly

from mlx_audio.tts.utils import load
import soundfile as sf
import numpy as np

# Load model
model = load("mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit")

# Generate audio (returns a generator)
audio_chunks = []
for chunk in model.generate_custom_voice(
    text="Hello from Qwen3-TTS.",
    speaker="Ryan",
    language="english",
    instruct="clear, steady delivery"
):
    if hasattr(chunk, 'audio') and chunk.audio is not None:
        audio_chunks.append(chunk.audio)

# Combine and save
audio = np.concatenate(audio_chunks)
sf.write("output.wav", audio, 24000)
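
The example above saves the result with soundfile. Where that dependency is unwanted, the standard library can write the same 24,000 Hz output as 16-bit PCM. The sketch below assumes the audio has already been flattened to a sequence of mono floats in the range [-1.0, 1.0]. That range is an assumption about the model output, not something this document states, and `write_pcm16_wav` is a hypothetical helper.

```python
import struct
import wave


def write_pcm16_wav(path, samples, sample_rate=24000):
    """Write mono float samples (assumed in [-1.0, 1.0]) as a 16-bit PCM WAV file."""
    frames = bytearray()
    count = 0
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
        count += 1
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return count


if __name__ == "__main__":
    import math
    # One second of a 440 Hz tone as stand-in data.
    tone = [0.3 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(24000)]
    print(write_pcm16_wav("tone.wav", tone), "samples written")
```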

VoiceDesign

from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="Welcome to the show.",
    model="mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit",
    instruct="warm, friendly female narrator with medium pitch",
    lang_code="english",
    output_path=".",
    file_prefix="voice_design",
    join_audio=True,
)

VoiceClone

from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="New content in the cloned voice.",
    model="mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio",
    output_path=".",
    file_prefix="cloned",
    join_audio=True,
)

Batch Processing

Use scripts/batch_dubbing.py for processing multiple lines:

python scripts/batch_dubbing.py \
  --input dubbing.json \
  --out-dir outputs

See references/dubbing_format.md for the JSON format.

Performance

| Metric | Value |
| --- | --- |
| Sample rate | 24,000 Hz |
| Real-time factor | ~0.7x (faster than real-time) |
| Peak memory | ~4-6 GB |
| First run | Downloads model (~1-2 GB) |
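
These figures translate into rough planning numbers. A real-time factor of about 0.7 means generation takes roughly 0.7 seconds per second of audio. The sketch below works through that arithmetic and the resulting file size, assuming 16-bit mono PCM WAV output (an assumption; the actual sample format written by the scripts is not stated here). Both function names are hypothetical.

```python
SAMPLE_RATE_HZ = 24_000   # from the Performance table
REAL_TIME_FACTOR = 0.7    # approximate, from the Performance table
WAV_HEADER_BYTES = 44     # canonical PCM WAV header size


def estimate_generation_seconds(audio_seconds, rtf=REAL_TIME_FACTOR):
    """Processing time = audio duration x real-time factor."""
    return audio_seconds * rtf


def estimate_wav_bytes(audio_seconds, sample_rate=SAMPLE_RATE_HZ, bytes_per_sample=2):
    """Approximate size of a mono PCM WAV file of the given duration."""
    return int(audio_seconds * sample_rate * bytes_per_sample) + WAV_HEADER_BYTES


if __name__ == "__main__":
    minutes = 10
    seconds = minutes * 60
    print(f"{minutes} min of audio: ~{estimate_generation_seconds(seconds):.0f} s to generate")
    print(f"{minutes} min of audio: ~{estimate_wav_bytes(seconds) / 1e6:.1f} MB as 16-bit mono WAV")
```

Actual throughput depends on the chip, the model variant, and memory pressure, so treat the result as an order-of-magnitude estimate.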

Troubleshooting

| Issue | Solution |
| --- | --- |
| Slow generation | Use 4-bit CustomVoice model |
| Unnatural pauses | Add punctuation, keep sentences short |
| Wrong language detected | Specify --lang_code explicitly |
| Voice cloning quality | Use cleaner reference audio, accurate transcript |
| Tokenizer warnings | Harmless, can be ignored |
| Out of memory | Close other apps, use 4-bit model |
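
For the "Unnatural pauses" advice, long input can be split at sentence boundaries before synthesis so that each call receives a short, punctuated unit. This is a minimal sketch using a simple regular expression; it is not how the skill's scripts segment text, and `split_sentences` is a hypothetical helper.

```python
import re

# Split after Latin sentence punctuation followed by whitespace,
# or directly after full-width CJK sentence punctuation.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def split_sentences(text):
    """Return non-empty sentences with surrounding whitespace removed."""
    pieces = _SENTENCE_BOUNDARY.split(text.strip())
    return [piece.strip() for piece in pieces if piece and piece.strip()]


def overlong(sentences, max_chars=200):
    """Return the sentences longer than max_chars, as candidates for manual rewording."""
    return [s for s in sentences if len(s) > max_chars]


if __name__ == "__main__":
    sample = "Hello there. How are you? Fine!"
    for sentence in split_sentences(sample):
        print(repr(sentence))
```

The 200-character threshold is an arbitrary illustration, not a limit documented for the model.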
