mimo-v2.5-tts|Xiaomi Official Version

API key required

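MiMo V2.5 TTS speech synthesis. Generates speech with Xiaomi's MiMo V2.5 TTS model series. Activate this skill when text needs to be converted to speech, a voice message needs to be sent, content needs to be read aloud, or the user asks to "say it out loud" or for a "voice reply". Supports three modes (preset voices, voice design, and voice cloning), natural language control and director mode, and style tags for tone, emotion, and dialect. Preset voices also support singing.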

Install

openclaw skills install mimo-v25-tts

MiMo V2.5 TTS Speech Synthesis

Generate speech using the Xiaomi MiMo V2.5 TTS model series. Supports Chinese and English, preset voices, voice design, voice cloning, emotional styles, dialects, and singing.

Scripts path: $SKILLS_PATH/mimo-v2-5-tts/scripts/

Note: $SKILLS_PATH is the path to the skills directory; it varies by deployment environment.

Model Selection

The V2.5 series offers three models; choose according to the use case:

| Model ID | Purpose | Voice source | Special capability |
| --- | --- | --- | --- |
| mimo-v2.5-tts | Preset-voice TTS | Built-in high-quality voices | Singing |
| mimo-v2.5-tts-voicedesign | Voice design from a text description | Generated from a text description | |
| mimo-v2.5-tts-voiceclone | Voice cloning from an audio sample | Audio sample | |

Recommendation:

  • Quick speech generation or singing → mimo-v2.5-tts (preset voice)
  • A unique voice is needed → mimo-v2.5-tts-voicedesign (voice generated from text)
  • Imitating a specific voice → mimo-v2.5-tts-voiceclone (cloned from a sample)
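
The recommendation above can be expressed as a small decision helper. This is only a sketch: the function name and its boolean parameters are hypothetical, and only the three model IDs come from this document.

```python
def choose_model(has_audio_sample: bool = False,
                 wants_custom_voice: bool = False,
                 singing: bool = False) -> str:
    """Return a model ID following the recommendation list above."""
    if singing:
        # Singing is listed only for the preset-voice model.
        return "mimo-v2.5-tts"
    if has_audio_sample:
        # Imitate a specific voice from an audio sample.
        return "mimo-v2.5-tts-voiceclone"
    if wants_custom_voice:
        # Generate a unique voice from a text description.
        return "mimo-v2.5-tts-voicedesign"
    # Default: quick speech generation with a preset voice.
    return "mimo-v2.5-tts"


print(choose_model(wants_custom_voice=True))
```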

Note: TTS output is non-deterministic, so the same input can sound different from run to run. Generate several times and pick the best result.

Dependencies

| Environment variable | Description | Required |
| --- | --- | --- |
| MIMO_API_KEY | MiMo API key | Yes |

| Dependency | Description | Required |
| --- | --- | --- |
| python3 | Runs the scripts | Yes |
| openai | Python package (pip install openai) | Yes |
| ffmpeg | Format conversion, long-text concatenation | Only for concatenation |
| curl | Feishu API calls | Only for Feishu |
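
Before running the scripts, it can help to check these requirements up front. The following is a minimal sketch using only the Python standard library; the function name and the labels in the returned list are illustrative and not part of the skill.

```python
import importlib.util
import os
import shutil


def missing_requirements(need_concat: bool = False,
                         need_feishu: bool = False) -> list:
    """List the requirements from the tables above that are not satisfied."""
    missing = []
    if not os.environ.get("MIMO_API_KEY"):
        missing.append("env:MIMO_API_KEY")
    # find_spec returns None when the package is not installed.
    if importlib.util.find_spec("openai") is None:
        missing.append("pip:openai")
    # ffmpeg is needed for concatenation and for the Feishu sender script.
    if (need_concat or need_feishu) and shutil.which("ffmpeg") is None:
        missing.append("bin:ffmpeg")
    if need_feishu and shutil.which("curl") is None:
        missing.append("bin:curl")
    return missing


print(missing_requirements(need_concat=True))
```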

Preset Voices

A voice must be specified explicitly when using the mimo-v2.5-tts model.

| Name | Voice ID | Language | Gender | Style |
| --- | --- | --- | --- | --- |
| 冰糖 | 冰糖 | Chinese | Female | Lively girl |
| 茉莉 | 茉莉 | Chinese | Female | Elegant woman |
| 苏打 | 苏打 | Chinese | Male | Sunny youth |
| 白桦 | 白桦 | Chinese | Male | Mature man |
| Mia | Mia | English | Female | Lively girl |
| Chloe | Chloe | English | Female | Sweet, dreamy |
| Milo | Milo | English | Male | Sunny boy |
| Dean | Dean | English | Male | Steady, gentle |
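
For callers that pick a voice programmatically, the table can be mirrored in a lookup. This is a sketch: the dictionary name and the helper are hypothetical, while the voice IDs and attributes are copied from the table above.

```python
# Voice ID -> (language, gender, style), copied from the preset voice table.
PRESET_VOICES = {
    "冰糖": ("Chinese", "Female", "Lively girl"),
    "茉莉": ("Chinese", "Female", "Elegant woman"),
    "苏打": ("Chinese", "Male", "Sunny youth"),
    "白桦": ("Chinese", "Male", "Mature man"),
    "Mia": ("English", "Female", "Lively girl"),
    "Chloe": ("English", "Female", "Sweet, dreamy"),
    "Milo": ("English", "Male", "Sunny boy"),
    "Dean": ("English", "Male", "Steady, gentle"),
}


def find_voices(language=None, gender=None):
    """Return voice IDs matching the given language and/or gender."""
    return [
        voice_id
        for voice_id, (lang, sex, _style) in PRESET_VOICES.items()
        if (language is None or lang == language)
        and (gender is None or sex == gender)
    ]


print(find_voices(language="English", gender="Male"))
```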

Natural Language Control

All models support natural language style control.

Describe the tone, emotion, and other style attributes in natural language and pass the instruction through the --context parameter, which every model accepts:

  • mimo-v2.5-tts / mimo-v2.5-tts-voiceclone: adjusts the style within the specified voice
  • mimo-v2.5-tts-voicedesign: controls both the voice and the style

Capabilities:

  • Multi-style switching: move from announcement to whisper to roar within a single utterance
  • Mixed emotions: "压抑的愤怒" (suppressed anger), "带着哽咽的笑意" (a smile through choked-back tears)
  • Multi-granularity control: paragraph → sentence → word → character

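Examples:

用轻快上扬的语调向领导报喜,语速稍快,带着查到成绩后压抑不住的激动与小骄傲,声音明亮有活力。
Report good news to your boss in a light, rising tone, slightly fast, with the irrepressible excitement and touch of pride of someone who has just looked up their results; the voice is bright and energetic.

看着刚解决的难题成果忍不住得意忘形地惊呼,声音高亢明亮,语速偏快,语气中带着满满的自信与难以置信。
Looking at the result of a problem just solved, exclaim with unrestrained glee; the voice is high and bright, the pace on the fast side, the tone full of confidence and disbelief.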

Director Mode

Director mode is a special form of natural language control that shapes the character and voice along three dimensions: character, scene, and direction.

  • Character: identity, core personality, speaking habits
  • Scene: what is happening at this moment and who is being addressed
  • Direction: pace, breath, pauses, stress, resonance

Example:

角色:百年门阀岑家的现任大当家。自出生便被过继给祖庙的守门老人抚养,被塑造成一尊完美无瑕、绝情断欲的家族图腾。
Character: The current head of the Cen family, a clan with a century of power. Given at birth to the old gatekeeper of the ancestral temple to be raised, she was molded into a flawless family totem, stripped of emotion and desire.

场景:在祠堂的阴影里引诱着那个不顾一切来找她的男人。她要用最冷硬的阶级壁垒,绞杀对方也绞杀自己刚刚萌芽的感情。
Scene: In the shadows of the ancestral hall, she lures the man who risked everything to find her. With the coldest, hardest class barriers, she means to strangle his feelings as well as her own, which have only just begun to bud.

指导:冰冷、慵懒却极具威压的低音御姐。
Direction: A cold, languid, yet highly commanding low-pitched mature female voice.
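
A director-mode instruction like the one above can be assembled from its three parts before being passed via --context. The sketch below assumes that one labeled line per dimension is an acceptable layout; the exact formatting the model expects is not specified here, and the function name is hypothetical.

```python
def director_context(character: str, scene: str, direction: str,
                     chinese: bool = True) -> str:
    """Join character, scene, and direction into one --context string."""
    if chinese:
        labels, sep = ("角色", "场景", "指导"), ":"
    else:
        labels, sep = ("Character", "Scene", "Direction"), ": "
    parts = (character, scene, direction)
    # One labeled line per dimension (layout is an assumption).
    return "\n".join(
        f"{label}{sep}{part.strip()}" for label, part in zip(labels, parts)
    )


print(director_context(
    "A veteran night-shift radio host.",
    "Closing the show at 3 a.m., speaking to a single caller.",
    "Slow, warm, long pauses.",
    chinese=False,
))
```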

Audio Tag Control

mimo-v2.5-tts and mimo-v2.5-tts-voiceclone support audio tags. Put a bracketed description of the tone, emotion, or vocal action anywhere in the text.

Chinese text supports full-width parentheses （）, half-width parentheses (), and square brackets []; English text supports () and [].

(紧张,深呼吸)呼……冷静,冷静。不就是一个面试吗……
(nervous, deep breath) Phew... Calm down, calm down. It's just an interview...

(极其疲惫,有气无力)师傅……到地方了叫我一声……
(extremely exhausted, weak) Driver... wake me up when we arrive...

(heavy breathing) Just... give me... a second.
(喘着粗气)等...等我一下...
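
To inspect which audio tags a piece of text carries (for example, to log them before synthesis), the bracket forms listed above can be matched with a regular expression. This is a sketch for client-side inspection only; it does not reflect how the model itself parses tags, and it will also match ordinary parenthesized asides.

```python
import re

# Half-width (), full-width （）, and square brackets [], without nesting.
TAG_PATTERN = re.compile(r"\(([^()]*)\)|（([^（）]*)）|\[([^\[\]]*)\]")


def extract_tags(text: str) -> list:
    """Return the bracketed descriptions found in the text, in order."""
    tags = []
    for match in TAG_PATTERN.finditer(text):
        # Exactly one of the three groups is set for each match.
        content = next(g for g in match.groups() if g is not None)
        tags.append(content.strip())
    return tags


print(extract_tags("I just... (sighs deeply) I don't know anymore."))
```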

Global Style Tags

Add a (style) tag at the beginning of the text to set the overall style.

Singing: the text must start with the singing tag, followed by the lyrics: (唱歌)lyrics or (singing)lyrics.

| Category | Common styles |
| --- | --- |
| Basic emotion | 开心 happy, 悲伤 sad, 愤怒 angry, 恐惧 fearful, 惊讶 surprised, 兴奋 excited, 委屈 wronged, 平静 calm, 冷漠 cold |
| Compound emotion | 怅然 wistful, 欣慰 relieved, 无奈 helpless, 愧疚 guilty, 释然 resigned, 动情 emotional |
| Overall tone | 温柔 gentle, 高冷 aloof, 活泼 lively, 严肃 serious, 慵懒 lazy, 俏皮 playful, 深沉 deep |
| Voice quality | 磁性 magnetic, 醇厚 mellow, 清亮 clear, 空灵 ethereal, 甜美 sweet, 沙哑 hoarse |
| Character voice | 夹子音 baby voice, 御姐音 mature woman, 正太音 boyish, 大叔音 uncle, 台湾腔 Taiwanese accent |
| Dialect | 东北话 Dongbei, 四川话 Sichuan, 河南话 Henan, 粤语 Cantonese |
| Singing | 唱歌, sing, singing |

Classic combinations:

  • (怅然/wistful) 这么多年过去了...
  • (慵懒/lazy) 再让我睡五分钟...
  • (东北话/Dongbei) 哎呀妈呀...
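
Prefixing the text with a global style tag is simple string handling. The sketch below uses hypothetical helper names; it follows the (唱歌)lyrics form shown above and places the tag directly before the text with no separating space.

```python
def with_style(text: str, style: str) -> str:
    """Prefix the text with one overall style tag in half-width parentheses."""
    return f"({style}){text}"


def as_song(lyrics: str, chinese: bool = True) -> str:
    """Prefix lyrics with the singing tag, which must come first."""
    tag = "唱歌" if chinese else "singing"
    return with_style(lyrics, tag)


print(with_style("再让我睡五分钟...", "慵懒"))
print(as_song("原谅我这一生不羁放纵爱自由"))
```

Since singing is listed only for preset voices, text produced by as_song is meant for the mimo-v2.5-tts model.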

Voice Description Guide

When using mimo-v2.5-tts-voicedesign to design a voice from a text description:

A voice description is the identity card of a voice: describe only the voice itself, not the scene or action.

Required:

  1. Identity anchor: age group + gender
  2. Voice quality: breath, resonance, articulation
  3. Pace: steady / fast / slow
  4. Emotional baseline: bright / relaxed / warm / restrained

Recommended:

  5. Style tag: auctioneer, food critic, announcer
  6. Signature quirk: an inhale with eyes closed, a tremolo at the end of words

Rules:

  • One or two sentences of plain description
  • No scenes or actions
  • No names of real actors or IP characters
  • Mandarin or English by default

Examples:

中年男性,节奏极快,情绪高亢,拍卖师风格。
Middle-aged male, very fast pace, excited tone, auctioneer style.

青年男性,电竞解说风格,语速极快且连贯。
Young male, esports commentator style, extremely fast and fluent.

中年男性,法庭陈词风格,声线沉稳偏正式。
Middle-aged male, courtroom speech style, steady and formal.
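
The checklist above maps naturally onto a small builder that joins the required and recommended items into one plain sentence. This is a sketch with hypothetical parameter names; it only concatenates the parts and does not enforce the rules on length or content.

```python
def voice_description(identity: str, quality: str, pace: str, baseline: str,
                      style: str = None, quirk: str = None) -> str:
    """Join the required items and any recommended ones into one sentence."""
    fields = [identity, quality, pace, baseline, style, quirk]
    # Skip optional items that were not supplied.
    return ", ".join(f.strip() for f in fields if f) + "."


print(voice_description(
    "Middle-aged male",
    "steady, chesty resonance",
    "steady pace",
    "restrained",
    style="courtroom speech style",
))
```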

Content & Tag Enhancement

When the user does not supply the text directly, write it yourself; when there is text but no emotional detail, insert suitable tags.

Hard rules:

  1. The emotion of the text must match the voice
  2. Length: 2-5 sentences in a single paragraph
  3. Tags are seasoning, not the main dish
  4. Punctuation carries performance meaning
  5. Tag language follows the language of the text

Recommended tags (Chinese):

| Category | Tags |
| --- | --- |
| Pacing | [停顿] pause, [长停顿] long pause, [急促] urgent, [语速加快] speed up |
| Emotion | [轻声] whisper, [低语] murmur, [叹气] sigh, [哽咽] choked, [强调] emphasis, [笑] laugh |

Recommended tags (English):

| Category | Tags |
| --- | --- |
| Pacing | [pause] [long pause] [fast] [drawn out] |
| Emotion | [whispering] [sighs] [inhale] [choked up] [emphasis] [laughs] |
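
Rule 5 (tag language follows the text language) is easy to check mechanically. The sketch below is a rough heuristic, not part of the skill: it treats text as Chinese if it contains any CJK ideograph and only looks at square-bracket tags.

```python
import re

CJK = re.compile(r"[\u4e00-\u9fff]")
SQUARE_TAG = re.compile(r"\[([^\[\]]+)\]")


def mismatched_tags(text: str) -> list:
    """Return square-bracket tags whose language differs from the body text."""
    body = SQUARE_TAG.sub("", text)           # text with tags removed
    body_is_chinese = bool(CJK.search(body))
    return [
        tag for tag in SQUARE_TAG.findall(text)
        if bool(CJK.search(tag)) != body_is_chinese
    ]


print(mismatched_tags("[叹气] 这么多年过去了。[pause] 算了。"))
```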

Python Script Usage

| Script | Model | Purpose |
| --- | --- | --- |
| mimo_tts.py | mimo-v2.5-tts | Preset-voice TTS |
| mimo_tts_voicedesign.py | mimo-v2.5-tts-voicedesign | Voice design from a text description |
| mimo_tts_voiceclone.py | mimo-v2.5-tts-voiceclone | Voice cloning from an audio sample |

Preset Voice TTS (mimo_tts.py)

python3 mimo_tts.py --text "你好,今天天气真不错。" --voice "冰糖"

python3 mimo_tts.py --context "用温柔的语气,语速稍慢" --text "没关系,慢慢来,我等你。" --voice "冰糖" --output comfort.wav

python3 mimo_tts.py --text "(紧张,深呼吸)呼……冷静,冷静。" --voice "冰糖" --output interview.wav

python3 mimo_tts.py --text "(唱歌)原谅我这一生不羁放纵爱自由" --voice "冰糖" --output singing.wav

python3 mimo_tts.py --text "I just... (sighs deeply) I don't know anymore." --voice "Mia" --output english.wav

Voice Design (mimo_tts_voicedesign.py)

python3 mimo_tts_voicedesign.py --context "Give me a young male tone." --text "Yes, I had a sandwich."

Voice Cloning (mimo_tts_voiceclone.py)

python3 mimo_tts_voiceclone.py --voice-file voice.mp3 --text "Yes, I had a sandwich." --output clone.wav

python3 mimo_tts_voiceclone.py --voice-file voice.mp3 --context "用温柔的语气" --text "没关系" --output directed.wav
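
When the scripts are driven from Python rather than a shell, building the argument list explicitly avoids quoting problems with tags and punctuation. The sketch below covers only mimo_tts.py and only the flags shown in the commands above; the function name is hypothetical, and it assumes the flag order does not matter.

```python
def preset_tts_command(text: str, voice: str, context: str = None,
                       output: str = None,
                       script: str = "mimo_tts.py") -> list:
    """Build an argv list for the preset-voice script."""
    cmd = ["python3", script, "--text", text, "--voice", voice]
    if context:
        cmd += ["--context", context]
    if output:
        cmd += ["--output", output]
    return cmd


cmd = preset_tts_command("没关系,慢慢来,我等你。", "冰糖",
                         context="用温柔的语气,语速稍慢",
                         output="comfort.wav")
print(cmd)
```

The resulting list can be passed to subprocess.run(cmd, check=True) to run the script.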

Long Text Handling

V2.5 does not need text to be split in ordinary use. Split and concatenate only when the text exceeds 2500 characters.

# Concatenation:
echo "file 'part1.wav'" > list.txt && echo "file 'part2.wav'" >> list.txt
ffmpeg -y -f concat -safe 0 -i list.txt -c copy combined.wav
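
For text over the 2500-character threshold, one possible approach is to split at sentence boundaries, synthesize each chunk, and then write the list file that the ffmpeg concat command above reads. The following is a sketch under those assumptions; the helper names are hypothetical and the choice of sentence-ending punctuation is illustrative.

```python
import re

# A run of text followed by its closing punctuation, or stray punctuation.
SENTENCE = re.compile(r"[^。！？.!?]+[。！？.!?]*|[。！？.!?]+")


def split_text(text: str, limit: int = 2500) -> list:
    """Split text into chunks of at most `limit` characters, by sentence."""
    chunks, current = [], ""
    for piece in SENTENCE.findall(text):
        # Hard-split a single sentence that is itself longer than the limit.
        while len(piece) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece[:limit])
            piece = piece[limit:]
        if current and len(current) + len(piece) > limit:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks


def concat_list(paths: list) -> str:
    """Return the contents of list.txt for ffmpeg's concat demuxer."""
    return "".join(f"file '{path}'\n" for path in paths)


chunks = split_text("你好。" * 1000)
print([len(chunk) for chunk in chunks])
print(concat_list(["part1.wav", "part2.wav"]), end="")
```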

Feishu Voice Message

Use this only when TTS audio needs to be sent to Feishu.

Dependencies

| Environment variable | Source | Description |
| --- | --- | --- |
| FEISHU_APP_ID | Feishu Open Platform | App ID of the application |
| FEISHU_APP_SECRET | Feishu Open Platform | App Secret of the application |

| Dependency | Description | Required |
| --- | --- | --- |
| ffmpeg | Converts WAV to Opus and reads the audio duration | Yes |
| curl | Calls the Feishu API | Yes |

Private Chat (open_id)

python3 $SKILLS_PATH/mimo-v2-5-tts/scripts/mimo_tts.py --text "好的" --voice "冰糖" --output /tmp/voice.wav
bash $SKILLS_PATH/mimo-v2-5-tts/scripts/feishu_send_audio.sh /tmp/voice.wav open_id ou_xxxxxx

Group Chat (chat_id)

bash $SKILLS_PATH/mimo-v2-5-tts/scripts/feishu_send_audio.sh /tmp/voice.wav chat_id oc_xxxxxx

Internal flow of feishu_send_audio.sh: WAV → Opus (ffmpeg) → get token → upload file → send audio message
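
When the send step is triggered from Python, the two invocations above differ only in the ID type and the receiver ID, so a small wrapper can validate the ID type before calling the script. This is a sketch; the function name is hypothetical and it only builds the argument list in the order shown in the commands above.

```python
def feishu_send_command(wav_path: str, id_type: str, receive_id: str,
                        script: str = "feishu_send_audio.sh") -> list:
    """Build an argv list for feishu_send_audio.sh."""
    # Only the two ID types used in this document are accepted.
    if id_type not in ("open_id", "chat_id"):
        raise ValueError(f"unsupported id type: {id_type!r}")
    return ["bash", script, wav_path, id_type, receive_id]


print(feishu_send_command("/tmp/voice.wav", "chat_id", "oc_xxxxxx"))
```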