mimo-v2.5-tts|Xiaomi Official Version

API key required

MiMo V2.5 TTS 语音合成。使用小米 MiMo V2.5 TTS 系列模型生成语音。当需要将文字转为语音、发送语音消息、朗读内容、或用户要求「说出来」「语音回复」时激活此 skill。支持预置音色、音色设计、音色克隆三种模式，支持自然语言控制、导演模式，支持语气、情绪、方言的风格标签控制，预置音色支持唱歌。

Install

openclaw skills install mimo-v25-tts

MiMo V2.5 TTS 语音合成 / Speech Synthesis (中文/English)

使用小米 MiMo V2.5 TTS 系列模型生成语音。支持中英文、预置音色、音色设计、音色克隆、情绪风格、方言、唱歌。 Generate speech using Xiaomi MiMo V2.5 TTS models. Supports Chinese/English, preset voices, voice design, voice cloning, emotion, dialect, and singing.

脚本目录 / Scripts path: $SKILLS_PATH/mimo-v2-5-tts/scripts/

$SKILLS_PATH 说明 / Note: skills 目录路径，因部署环境而异 / Path varies by deployment environment.

模型选择 / Model Selection

V2.5 系列提供三种模型，根据使用场景选择： The V2.5 series offers three models for different use cases:

模型 ID / Model ID	用途 / Purpose	音色来源 / Voice Source	特殊能力 / Special
`mimo-v2.5-tts`	预置音色语音合成 / Preset voice TTS	内置精品音色 / Built-in high-quality	支持唱歌 / Singing
`mimo-v2.5-tts-voicedesign`	文本描述定制音色 / Voice design via text	文本描述生成 / Text description	—
`mimo-v2.5-tts-voiceclone`	音频样本复刻音色 / Voice cloning	音频样本 / Audio sample	—

选择建议 / Recommendation:

快速生成语音、唱歌 → mimo-v2.5-tts（预置音色 / preset voice）
需要独特音色 → mimo-v2.5-tts-voicedesign（文本生成 / text-to-voice）
模仿特定声音 → mimo-v2.5-tts-voiceclone（样本复刻 / sample cloning）

注意 / Note: TTS 有随机性，同样输入效果可能不同，可以多生成几次挑选 / TTS has randomness — generate multiple times to pick the best result.

环境依赖 / Dependencies

环境变量 / Env Var	说明 / Description	必需 / Required
`MIMO_API_KEY`	MiMo API 密钥 / MiMo API key	是 / Yes

依赖 / Dependency	说明 / Description	必需 / Required
`python3`	运行脚本 / Run scripts	是 / Yes
`openai`	`pip install openai`	是 / Yes
`ffmpeg`	格式转换、长文本拼接 / Format conversion, long text concat	仅拼接 / Concat only
`curl`	飞书 API 调用 / Feishu API calls	仅飞书 / Feishu only

预置音色 / Preset Voices

使用 mimo-v2.5-tts 模型时必须明确指定音色。 Must specify a voice when using the mimo-v2.5-tts model.

音色名 / Name	Voice ID	语言 / Lang	性别 / Gender	风格 / Style
冰糖	`冰糖`	中文 / Chinese	女性 / Female	活泼少女 / Lively girl
茉莉	`茉莉`	中文 / Chinese	女性 / Female	知性女声 / Elegant woman
苏打	`苏打`	中文 / Chinese	男性 / Male	阳光少年 / Sunny youth
白桦	`白桦`	中文 / Chinese	男性 / Male	成熟男声 / Mature man
Mia	`Mia`	English	Female	Lively girl
Chloe	`Chloe`	English	Female	Sweet Dreamy
Milo	`Milo`	English	Male	Sunny boy
Dean	`Dean`	English	Male	Steady Gentle

自然语言控制 / Natural Language Control

所有模型都支持自然语言控制。 All models support natural language style control.

通过自然语言描述调整语气、情绪等风格。所有模型均可通过 --context 参数传入指令： Use natural language to control tone, emotion, etc. Pass via --context parameter:

mimo-v2.5-tts / mimo-v2.5-tts-voiceclone: 调整指定音色下的风格 / Adjust style within a voice
mimo-v2.5-tts-voicedesign: 同时控制音色和风格 / Control both voice and style

能力特点 / Capabilities:

多风格切换 / Multi-style switching: 同一段语音内完成播报→低语→嘶吼 / switch between announcement, whisper, and roar
多情绪混合 / Mixed emotions: "压抑的愤怒" suppressed anger、"带着哽咽的笑意" tearful smile
多粒度控制 / Multi-granularity: 段落→句子→词→字 / paragraph → sentence → word → character

示例 / Examples:

用轻快上扬的语调向领导报喜，语速稍快，带着查到成绩后压抑不住的激动与小骄傲，声音明亮有活力。
Speak to your boss with a cheerful, upward tone, slightly fast, with barely contained excitement and pride.

看着刚解决的难题成果忍不住得意忘形地惊呼，声音高亢明亮，语速偏快，语气中带着满满的自信与难以置信。
Can't help but exclaim triumphantly at the solved problem — bright, high-pitched, confident, disbelieving.

导演模式 / Director Mode

自然语言控制的特殊用法「导演模式」：从角色、场景、指导三个维度刻画人物与声线。 A special form of natural language control — describe character, scene, and direction.

【角色 / Character】 人物身份、性格底色 / Identity, personality traits, speaking habits
【场景 / Scene】 此刻发生了什么 / What's happening, who they're talking to
【指导 / Direction】 语速、气息、停顿、重音 / Speed, breath, pauses, emphasis, resonance

示例 / Example:

角色：百年门阀岑家的现任大当家。自出生便被过继给祖庙的守门老人抚养，被塑造成一尊完美无瑕、绝情断欲的家族图腾。
Character: The current head of the ancient Cen family clan. Raised by a temple keeper to become a flawless, emotionless family icon.

场景：在祠堂的阴影里引诱着那个不顾一切来找她的男人。她要用最冷硬的阶级壁垒，绞杀对方也绞杀自己刚刚萌芽的感情。
Scene: In the ancestral hall's shadows, tempting the man who came for her despite everything. She will use cold class barriers to kill both him and her budding feelings.

指导：冰冷、慵懒却极具威压的低音御姐。
Direction: Cold, lazy but oppressive low-toned voice.

音频标签控制 / Audio Tag Control

mimo-v2.5-tts 和 mimo-v2.5-tts-voiceclone 支持音频标签。在文本任意位置用括号描述语气/情绪/声音动作。 mimo-v2.5-tts and mimo-v2.5-tts-voiceclone support audio tags. Use brackets anywhere in text to describe tone/emotion/sound.

中文支持全角 （）、半角 ()、方括号 [] / Chinese supports （） () []；英文支持 () [] / English supports () [].

（紧张，深呼吸）呼……冷静，冷静。不就是一个面试吗……
Nervous, deep breath... Calm down. It's just an interview...

（极其疲惫，有气无力）师傅……到地方了叫我一声……
Exhausted, weak: Driver... wake me up when we arrive...

(heavy breathing) Just... give me... a second.
（喘着粗气）等...等我一下...

整体风格标签 / Global Style Tags

在文本开头添加 (风格) 标签指定整体风格。 Add a style tag at the beginning to set the overall style.

唱歌 / Singing: 必须 (唱歌)歌词 / Must start with (singing)lyrics

类别 / Category	常用风格 / Common Styles
基础情绪 / Basic emotion	`开心 happy` `悲伤 sad` `愤怒 angry` `恐惧 fearful` `惊讶 surprised` `兴奋 excited` `委屈 wronged` `平静 calm` `冷漠 cold`
复合情绪 / Compound	`怅然 wistful` `欣慰 relieved` `无奈 helpless` `愧疚 guilty` `释然 resigned` `动情 emotional`
整体语调 / Tone	`温柔 gentle` `高冷 aloof` `活泼 lively` `严肃 serious` `慵懒 lazy` `俏皮 playful` `深沉 deep`
音色定位 / Voice	`磁性 magnetic` `醇厚 mellow` `清亮 clear` `空灵 ethereal` `甜美 sweet` `沙哑 hoarse`
人设腔调 / Character	`夹子音 baby voice` `御姐音 mature woman` `正太音 boyish` `大叔音 uncle` `台湾腔 Taiwanese accent`
方言 / Dialect	`东北话 Dongbei` `四川话 Sichuan` `河南话 Henan` `粤语 Cantonese`
唱歌 / Singing	`唱歌` `sing` `singing`

经典组合 / Classic combos: (怅然/wistful) 这么多年过去了... (慵懒/lazy) 再让我睡五分钟... (东北话/Dongbei) 哎呀妈呀...

音色描述编写 / Voice Description Guide

当使用 mimo-v2.5-tts-voicedesign 进行文本描述定制音色时： When using mimo-v2.5-tts-voicedesign to design a voice via text:

音色描述是嗓子的身份卡，只描写声音本身。 A voice description is the identity card of a voice — describe the voice itself, not the scene or action.

必写项 / Required:

身份锚点 / Identity anchor: 年龄段+性别 / Age + gender
声音质感 / Voice quality: 气息、共鸣、吐字 / Breath, resonance, articulation
语速节奏 / Pace: 稳/快/慢 / Steady/fast/slow
情绪底色 / Emotional baseline: 高亢/松弛/温软/克制 / Bright/relaxed/warm/restrained

推荐 / Recommended: 5. 风格标签 / Style tag: 拍卖师/美食评论家/播音员 / Auctioneer/food critic/announcer 6. 辨识度小癖好 / Signature quirk: 闭眼吸气/字尾颤音 / Eyes-closed inhale/trembling endings

硬约束 / Rules:

一到两句话，白描式 / 1-2 sentences, plain description
不写场景、动作 / No scenes or actions
不用真实演员或 IP 角色名 / No real actors or IP character names
默认普通话或英文 / Default Mandarin or English

样例 / Examples:

中年男性，节奏极快，情绪高亢，拍卖师风格。
Middle-aged male, very fast pace, excited tone, auctioneer style.

青年男性，电竞解说风格，语速极快且连贯。
Young male, esports commentator style, extremely fast and fluent.

中年男性，法庭陈词风格，声线沉稳偏正式。
Middle-aged male, courtroom speech style, steady and formal.

内容与标签增强 / Content & Tag Enhancement

当用户没有直接提供文本时，应自行编写；当只有文本没有情绪细节时，应插入合适的标签。 When the user doesn't provide text, write it yourself. When text has no emotion details, add appropriate tags.

硬规则 / Hard rules:

文本情绪必须和音色契合 / Text emotion must match the voice
长度 2-5 句 / 2-5 sentences, one paragraph
标签是调味，不是主菜 / Tags are seasoning, not the main dish
标点有表演意义 / Punctuation has performance meaning
标签语言跟随正文 / Tag language follows the text language

推荐标签（中文）/ Recommended Tags (Chinese):

类别 / Category	标签 / Tags
节奏 / Pacing	`[停顿 pause]` `[长停顿 long pause]` `[急促 urgent]` `[语速加快 speed up]`
情绪 / Emotion	`[轻声 whisper]` `[低语 murmur]` `[叹气 sigh]` `[哽咽 choked]` `[强调 emphasis]` `[笑 laugh]`

推荐标签（英文）/ Recommended Tags (English):

Category	Tags
Pacing	`[pause]` `[long pause]` `[fast]` `[drawn out]`
Emotion	`[whispering]` `[sighs]` `[inhale]` `[choked up]` `[emphasis]` `[laughs]`

Python 脚本用法 / Python Script Usage

脚本 / Script	模型 / Model	用途 / Purpose
`mimo_tts.py`	`mimo-v2.5-tts`	预置音色语音合成 / Preset voice TTS
`mimo_tts_voicedesign.py`	`mimo-v2.5-tts-voicedesign`	文本描述定制音色 / Voice design via text
`mimo_tts_voiceclone.py`	`mimo-v2.5-tts-voiceclone`	音频样本复刻音色 / Voice cloning

预置音色 / Preset Voice TTS (mimo_tts.py)

python3 mimo_tts.py --text "你好，今天天气真不错。" --voice "冰糖"

python3 mimo_tts.py --context "用温柔的语气，语速稍慢" --text "没关系，慢慢来，我等你。" --voice "冰糖" --output comfort.wav

python3 mimo_tts.py --text "（紧张，深呼吸）呼……冷静，冷静。" --voice "冰糖" --output interview.wav

python3 mimo_tts.py --text "(唱歌)原谅我这一生不羁放纵爱自由" --voice "冰糖" --output singing.wav

python3 mimo_tts.py --text "I just... (sighs deeply) I don't know anymore." --voice "Mia" --output english.wav

音色设计 / Voice Design (mimo_tts_voicedesign.py)

python3 mimo_tts_voicedesign.py --context "Give me a young male tone." --text "Yes, I had a sandwich."

音色克隆 / Voice Cloning (mimo_tts_voiceclone.py)

python3 mimo_tts_voiceclone.py --voice-file voice.mp3 --text "Yes, I had a sandwich." --output clone.wav

python3 mimo_tts_voiceclone.py --voice-file voice.mp3 --context "用温柔的语气" --text "没关系" --output directed.wav

长文本处理 / Long Text Handling

V2.5 常规场景无需分段，仅超过 2500 字才需分段拼接。 No need to split for most cases. Only split when exceeding 2500 characters.

# 拼接方案 / Concatenation:
echo "file 'part1.wav'" > list.txt && echo "file 'part2.wav'" >> list.txt
ffmpeg -y -f concat -safe 0 -i list.txt -c copy combined.wav

飞书语音消息发送 / Feishu Voice Message

仅当需要将 TTS 语音发送到飞书时才用 / Only use when sending TTS audio to Feishu.

环境依赖 / Dependencies

环境变量 / Env Var	来源 / Source	说明 / Description
`FEISHU_APP_ID`	飞书开放平台 / Feishu Open Platform	应用 App ID
`FEISHU_APP_SECRET`	飞书开放平台 / Feishu Open Platform	应用 App Secret

依赖 / Dep	说明 / Description	必需 / Required
`ffmpeg`	WAV 转 Opus + 获取音频时长 / Convert WAV to Opus + get duration	是 / Yes
`curl`	调用飞书 API / Call Feishu API	是 / Yes

私聊发送 / Private Chat (open_id)

python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py --text "好的" --voice "冰糖" --output /tmp/voice.wav
bash $SKILLS_PATH/mimo-v2-5-tts/scripts/feishu_send_audio.sh /tmp/voice.wav open_id ou_xxxxxx

群聊发送 / Group Chat (chat_id)

bash $SKILLS_PATH/mimo-v2-5-tts/scripts/feishu_send_audio.sh /tmp/voice.wav chat_id oc_xxxxxx

feishu_send_audio.sh 内部流程 / Internal flow: wav → opus (ffmpeg) → 获取 token → 上传文件 → 发送 audio 消息