vociemaster

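Professional-grade AI voice-over assistant for short videos, with multi-role voice mapping, automatic speech-rate adjustment, and BGM suggestions.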

MIT-0 · Free to use, modify, and redistribute. No attribution required.

by @xiaocaijic

Security Scan

VirusTotal

Benign

OpenClaw

Benign

high confidence

ℹ

Purpose & Capability

The name/description, SKILL.md and helper.py all align around generating TTS via SenseAudio and merging segments. The only minor mismatch is that SKILL.md shows examples using curl + jq to call the API, while the included helper.py uses Python's urllib to call the same API. Requiring curl/jq is reasonable for the shell examples but not strictly necessary for the provided helper.py.

✓

Instruction Scope

SKILL.md stays within the stated purpose: it instructs splitting multi-role scripts, calling SenseAudio's TTS endpoint, decoding hex audio, and optionally concatenating segments with ffmpeg. It does not request unrelated files, other credentials, or hidden external endpoints.

✓

Install Mechanism

There is no install spec (instruction-only skill with one helper script), so nothing arbitrary is downloaded or installed. This lowers install risk. The presence of a local helper.py is expected and self-contained (standard library only).

✓

Credentials

Only TTS_API_KEY is required, which is appropriate for a TTS integration. No unrelated secrets, config paths, or additional credentials are requested.

✓

Persistence & Privilege

Skill is not marked always:true and does not attempt to modify other skills or system-wide settings. It only requires transient environment access (TTS_API_KEY) at runtime and writes audio outputs to local files as expected.

Assessment

This skill appears to do what it says: it will send your provided text to SenseAudio's API and save returned audio locally. Before installing, ensure you: (1) are comfortable providing your TTS_API_KEY (the key is used to call api.senseaudio.cn and is read from the TTS_API_KEY env var), (2) understand that audio data and input text are transmitted to SenseAudio (check your privacy/account permissions), (3) have ffmpeg installed if you want automatic merging of segments, and (4) note that SKILL.md examples use curl+jq but the included helper.py uses Python's standard library — curl/jq are only needed if you run the shell examples yourself. If you don't trust SenseAudio or the source of the skill, don't provide your API key.

Like a lobster shell, security has layers — review code before you run it.

Current version: v1.0.0

latest: vk979asmfd4fs06v54srpeskg7h8316n5

License

MIT-0

Free to use, modify, and redistribute. No attribution required.

Terms: https://spdx.org/licenses/MIT-0.html

Runtime requirements

Bins: curl, jq, ffmpeg

Env: TTS_API_KEY

SKILL.md

VoiceMaster

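Goal

Generate ready-to-deliver AI voice-over for short-video scripts. Prefer a single mp3 file as output; when segments cannot be merged locally, return per-segment download cards and state the segment order explicitly. Use the official SenseAudio API documentation: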

https://senseaudio.cn/docs/api-key
https://senseaudio.cn/docs/text_to_speech_api
https://senseaudio.cn/docs/voice_api

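If TTS_API_KEY is not set in the current environment, first ask the user to provide an API key or to set the environment variable in the terminal. Do not write the key into SKILL.md, script source, or commit history.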
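The key check described above could look roughly like the following. This is a minimal sketch, not the code in helper.py; the function name require_api_key and the wording of the error message are assumptions made for illustration.

```python
import os


def require_api_key(env=None):
    """Return the key from TTS_API_KEY, or raise with a hint for the user.

    `env` defaults to os.environ; passing a dict makes the check testable.
    The function name and message are illustrative assumptions.
    """
    env = os.environ if env is None else env
    key = env.get("TTS_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "TTS_API_KEY is not set. Provide an API key or export it "
            "in the terminal before running."
        )
    return key
```

Reading the key only from the environment keeps it out of source files and commit history, which is the constraint stated above.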

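Built-in voice mapping

Always prefer a voice_id that the user specifies explicitly. When none is specified, choose the closest voice from the table below based on the script's tone, role setup, and platform style, and keep the role mapping stable within a project.

While the current SenseAudio key has limited permissions, default only to the following voices, which are confirmed to be available:

child_0001_b: cute child, steady
male_0004_a: refined Taoist master, steady
male_0018_a: husky young man, emotional

Do not default to VIP / SVIP voices whose authorization is unconfirmed. If the API returns 403 no access to the specified voice, fall back to child_0001_b first instead of retrying unauthorized voices.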

VOICE_MAP:
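  gentle female voice: child_0001_b
  intellectual narrator: male_0004_a
  news anchor: male_0004_a
  passionate male voice: male_0018_a
  steady documentary: male_0004_a
  youthful and energetic: child_0001_b
  e-commerce promotion: child_0001_b
  children's companion: child_0001_b
  suspense whisper: male_0018_a
  healing story: child_0001_b
  refined Taoist master: male_0004_a
  husky young man: male_0018_a
  cute child (steady): child_0001_b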
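As a rough illustration of the lookup order (explicit voice_id first, then the style table, then the general fallback child_0001_b), here is a minimal sketch. The function name resolve_voice and the English style labels are assumptions; only the voice IDs come from the mapping above, and only a few entries are reproduced.

```python
# A few entries from VOICE_MAP; the English labels are illustrative renderings.
VOICE_MAP = {
    "news anchor": "male_0004_a",
    "steady documentary": "male_0004_a",
    "suspense whisper": "male_0018_a",
    "e-commerce promotion": "child_0001_b",
}

GENERAL_FALLBACK = "child_0001_b"


def resolve_voice(style, explicit=None):
    """Pick a voice_id: explicit choice wins, then the style map, then the fallback."""
    if explicit:
        return explicit
    return VOICE_MAP.get(style, GENERAL_FALLBACK)
```

Keeping the table in one dictionary makes it easy to hold the role mapping stable across a project.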

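Rules for multi-role scripts:

Recognize explicit speaker markers such as <role name>:, Narrator:, and Host:.
Build the role -> voice_id mapping once per role and reuse it for every later segment.
If the current key can use only a few voices, let several roles share one voice; successful generation comes first, and voice differences should not be forced.
A single SenseAudio request can use only one voice_id, so a multi-role script cannot be submitted as one block.
A multi-role script must first be split by role into sub-segments, each containing exactly one speaker.
Call TTS once per role sub-segment, then concatenate the resulting short clips in their original order.
If mixed dialogue is submitted as one block instead of being requested role by role, the result will sound like everyone shares the same voice, which does not count as successful multi-role voice-over.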
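The splitting and mapping rules above can be sketched as follows. This is an assumption-laden illustration: the regular expression, the default role name Narrator, and the round-robin assignment in assign_voices are not taken from helper.py.

```python
import re

# "<role>:" at the start of a line; accepts ASCII and full-width colons.
SPEAKER_LINE = re.compile(r"^\s*([^:：\n]{1,20})[:：]\s*(.+)$")


def split_by_speaker(script, default_role="Narrator"):
    """Return ordered (role, text) pairs, one speaker per pair."""
    segments = []
    for line in script.splitlines():
        if not line.strip():
            continue
        match = SPEAKER_LINE.match(line)
        if match:
            segments.append([match.group(1).strip(), match.group(2).strip()])
        elif segments:
            # An unmarked line continues the previous speaker's text.
            segments[-1][1] += " " + line.strip()
        else:
            segments.append([default_role, line.strip()])
    return [(role, text) for role, text in segments]


def assign_voices(segments, voices):
    """Map each role to a voice once; roles share voices when voices run out."""
    mapping = {}
    for role, _ in segments:
        if role not in mapping:
            mapping[role] = voices[len(mapping) % len(voices)]
    return mapping
```

Each returned pair would then be sent as its own TTS request. Note that a narrative line containing a colon near its start (for example "He said: hello") would be misread as a speaker marker, so real scripts may need a stricter pattern.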

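Input preparation

Collect the following before running:

The full script text.
The target style, for example gentle, news-like, motivational, suspense, narrative talking-head, or sales.
The number of roles and how they relate.
Speech rate (speed). When unspecified, estimate it from the style, but keep it within 0.5 to 2.0.
Pitch (pitch). When unspecified, use 0.
Output file name and output path. When unspecified, use voicemaster-output.mp3.

Default parameters: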

format: mp3
sample_rate: 44100
speed: 1.0
pitch: 0

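Suggested speech rates:

Gentle narration, healing, emotional content: 0.88 to 0.98
News broadcast, explainer content: 0.98 to 1.08
Motivational montage, fast-paced short video: 1.05 to 1.18
Livestream sales, promotional pitching: 1.12 to 1.25

When available voices are limited, apply the following compensation strategy by default:

News, documentary, narration: prefer male_0004_a, speed 0.92 to 1.0
Emotional stories, heartfelt scripts, suspense talking-head: prefer male_0018_a, speed 0.88 to 0.98
Sales, light content, child-like scenes: prefer child_0001_b, speed 0.96 to 1.08
General fallback: child_0001_b
Do not hide a voice downgrade from the user: state explicitly in the result that "the current key can only use authorized voices"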
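One way to turn the suggested ranges into a concrete speed value is sketched below. The style keys and the choice of the range midpoint are assumptions; the ranges, the default of 1.0, and the 0.5 to 2.0 limit come from the text above.

```python
# Suggested ranges from the list above; the short style keys are illustrative.
SPEED_RANGES = {
    "gentle": (0.88, 0.98),
    "news": (0.98, 1.08),
    "motivational": (1.05, 1.18),
    "sales": (1.12, 1.25),
}


def pick_speed(style=None, requested=None):
    """Use the requested speed if given, else the style midpoint, else 1.0."""
    if requested is not None:
        speed = float(requested)
    elif style in SPEED_RANGES:
        low, high = SPEED_RANGES[style]
        speed = round((low + high) / 2, 2)
    else:
        speed = 1.0
    # Clamp to the allowed range of 0.5 to 2.0.
    return min(2.0, max(0.5, speed))
```

Clamping after every branch means a user-supplied value outside the allowed range is corrected rather than rejected; whichever value is used should still be reported back to the user.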

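Segmentation strategy

When the text exceeds 500 characters, it must be split automatically into logical segments that are requested separately, to avoid timeouts.

Rules:

Split preferentially at paragraph breaks, scene changes, speaker changes, and full stops.
Keep each segment between 180 and 450 characters, and avoid splitting a sentence.
Number each segment in original order, starting from 01.
In multi-role scenes, do not split a single exchange of dialogue across segments.
After all segments are generated, prefer helper.py concat to merge them into one mp3.
If a segment still contains lines from more than one role, keep subdividing until each segment has a single role, then call TTS.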
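A greedy sentence-packing approach is one simple way to follow these rules. The sketch below is an assumption about how this could be done, not the logic in helper.py: it enforces only the 450-character upper bound, does not enforce the 180-character lower bound, and leaves a single sentence longer than 450 characters unsplit.

```python
import re

# A sentence is a run of non-terminator characters plus its trailing terminators.
SENTENCE = re.compile(r"[^。！？.!?]+[。！？.!?]*")


def chunk_text(text, limit=500, max_len=450):
    """Split text longer than `limit` into chunks of at most `max_len` characters."""
    if len(text) <= limit:
        return [text]
    chunks, current = [], ""
    for sentence in SENTENCE.findall(text):
        if current and len(current) + len(sentence) > max_len:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return [c.strip() for c in chunks if c.strip()]


def segment_names(chunks):
    """Two-digit, order-preserving file names starting from 01."""
    return [f"segment-{i:02d}.mp3" for i, _ in enumerate(chunks, start=1)]
```

Paragraph, scene, and speaker boundaries would need to be handled before this step, for example by chunking each single-speaker block separately.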

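API request template

The API endpoint is fixed:

https://api.senseaudio.cn/v1/t2a_v2

Prefer building the JSON with jq to avoid escaping errors. Default to non-streaming mode so the hex audio can be retrieved and written to disk directly.

Important: call the official API with the minimal request body first; do not attach every optional field from the start. If the API returns 400 input content type is not supported, first suspect that the request body structure does not match the current official protocol, rather than continuing to switch voices.

Request template: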

jq -n \
  --arg text "$TEXT" \
  --arg voice_id "$VOICE_ID" \
  --arg model "SenseAudio-TTS-1.0" \
  '{
    model: $model,
    text: $text,
    stream: false,
    voice_setting: {
      voice_id: $voice_id
    }
  }' |
curl -sS "https://api.senseaudio.cn/v1/t2a_v2" \
  -H "Authorization: Bearer $TTS_API_KEY" \
  -H "Content-Type: application/json" \
  --data-binary @-

Once the minimal request body works, add the following optional fields step by step, one category at a time:

voice_setting.speed
voice_setting.pitch
audio_setting.format
audio_setting.sample_rate
audio_setting.bitrate
audio_setting.channel

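For multi-segment requests, keep the context metadata locally; it does not need to be sent to the official API. Keep model, voice_setting, and audio_setting identical across segments.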
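For readers who prefer Python's standard library over curl and jq, the same minimal request could be built roughly like this. The function name build_request is an assumption; the endpoint, model name, headers, and field names are those shown in the template above, and the optional speed and pitch fields are added only when supplied.

```python
import json
import urllib.request

API_URL = "https://api.senseaudio.cn/v1/t2a_v2"


def build_request(text, voice_id, api_key, speed=None, pitch=None):
    """Build (but do not send) a non-streaming TTS request."""
    voice_setting = {"voice_id": voice_id}
    # Optional fields are left out unless explicitly requested.
    if speed is not None:
        voice_setting["speed"] = speed
    if pitch is not None:
        voice_setting["pitch"] = pitch
    body = {
        "model": "SenseAudio-TTS-1.0",
        "text": text,
        "stream": False,
        "voice_setting": voice_setting,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The request would then be sent with urllib.request.urlopen(req, timeout=60), where the timeout value is an arbitrary choice. Separating construction from sending makes it easy to inspect the body when debugging a 400 response.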

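SenseAudio response handling

A successful non-streaming SenseAudio TTS response is JSON in which data.audio holds the hex-encoded audio data. Handle it as follows:

First check whether base_resp.status_code is 0.
On success, read data.audio and decode it from hexadecimal into a binary audio file.
Use extra_info.audio_length and extra_info.audio_sample_rate as the result receipt.
If data.audio is empty or status_code is non-zero, return base_resp.status_msg directly.

Response structure for reference: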

{
  "data": {
    "audio": "hex-encoded audio",
    "status": 2
  },
  "extra_info": {
    "audio_length": 3500,
    "audio_sample_rate": 44100
  },
  "base_resp": {
    "status_code": 0,
    "status_msg": "success"
  }
}
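Decoding a response of this shape can be sketched as below. The function name decode_tts_response and the fallback error text are assumptions; the field names follow the response structure shown above.

```python
def decode_tts_response(payload):
    """Return audio bytes from a parsed response, or raise with status_msg."""
    base = payload.get("base_resp") or {}
    audio_hex = (payload.get("data") or {}).get("audio") or ""
    if base.get("status_code") != 0 or not audio_hex:
        # Surface the server's message; the fallback text is an assumption.
        raise RuntimeError(base.get("status_msg") or "empty audio in response")
    return bytes.fromhex(audio_hex)
```

The returned bytes can be written to disk with open(path, "wb"). When the call raises, keeping the raw JSON alongside the error makes later protocol comparisons easier.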

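Handling returned results

Call helper.py synthesize to send the request and save the hex audio locally.
For segmented long scripts, save each segment as segment-01.mp3, segment-02.mp3, and so on.
In multi-role scenarios, save finer-grained per-role segments, for example segment-01-narrator.mp3 and segment-02-youngman.mp3.
After all segments finish, call helper.py concat to produce the final file.
If ffmpeg is unavailable, keep the segment files and return their paths to the user in order.
If the response has no data.audio, return the complete raw JSON as well so that protocol changes can be compared later.

Example: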

python helper.py synthesize ^
  --text-file segment-01.txt ^
  --voice-id male_0004_a ^
  --speed 0.96 ^
  --pitch 0 ^
  --output outputs\segment-01.mp3

python helper.py concat ^
  --output outputs\final.mp3 ^
  outputs\segment-01.mp3 outputs\segment-02.mp3 outputs\segment-03.mp3
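The source does not show how helper.py concat works internally. As one plausible approach, the sketch below prepares an ffmpeg concat-demuxer invocation and falls back to returning the ordered segment list when ffmpeg is not on PATH. All function names and the list file name segments.txt are assumptions.

```python
import shutil


def concat_list_text(paths):
    """Contents of an ffmpeg concat list file: one entry per segment, in order."""
    # Paths containing single quotes would need escaping; not handled here.
    return "".join(f"file '{p}'\n" for p in paths)


def ffmpeg_concat_command(list_file, output):
    """Command that joins the listed files without re-encoding."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", output]


def plan_merge(paths, output, list_file="segments.txt"):
    """Describe the merge step, or return ordered segments if ffmpeg is missing."""
    if shutil.which("ffmpeg") is None:
        return {"can_merge": False, "segments": list(paths)}
    return {
        "can_merge": True,
        "list_text": concat_list_text(paths),
        "command": ffmpeg_concat_command(list_file, output),
    }
```

A caller would write list_text to the list file and run the command with subprocess.run(command, check=True). Stream copy assumes all segments share the same format and sample rate, which is why the audio settings should stay identical across segments.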

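helper.py usage

Call the local script only when:

The hex audio in a SenseAudio response needs to be written to disk.
A long script needs several segments merged locally.
Output file names and directory layout need to be unified.

Command summary: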

python helper.py synthesize --text-file <segment.txt> --voice-id <voice_id> --output <file.mp3>
python helper.py concat --output <final.mp3> <segment1.mp3> <segment2.mp3> ...

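Output requirements

After the voice-over is complete, always report:

The voice_id used and why it matches the style.
The actual speed and pitch.
The path of the final audio file.
If the script was segmented, the number of segments and whether they were merged successfully.
If the request failed, the status_code and status_msg returned by SenseAudio.
If the voice was downgraded to an authorized voice because of plan permissions, the reason for the downgrade and the voice_id finally used.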

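Background music suggestions

After the voice-over is complete, always recommend 2 to 3 background music directions based on the script's emotion, and avoid giving only broad labels. Prefer the following mapping:

Sadness, regret, reminiscence: ambient piano, minimalist strings, slow lo-fi
Motivation, growth, comeback: cinematic inspirational, driving pop rock, corporate uplift
Warmth, healing, parent-child: light acoustic guitar, warm piano, light jazz with brushed drums
Suspense, plot twists, storytelling: dark pulses, sparse percussion, electronic ambient tension
Sales, promotion, product recommendation: bright electronic pop, rhythmic house, light funk groove

When giving suggestions, also state:

The shot pacing it suits.
Whether it should sit at low volume under the voice to avoid masking it.
Whether drum hits or riser effects are needed at transitions.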

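Operating principles

Prioritize consistency of tone and roles over raw speed.
Make reasonable estimates for unspecified style parameters, but state in the reply which defaults were used.
For long text, enable segmentation by default; do not force an overlong script through in a single request.
If ffmpeg is missing locally, return the sequentially numbered segments and state that automatic merging was not performed.
If the user has not provided a voice_id verified as available, try in this order: male_0004_a, male_0018_a, child_0001_b.
Multi-role voice-over genuinely requires requesting each role separately and concatenating the results; writing several role labels into the text is not enough.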
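The fallback order for unverified voices can be sketched as a small loop. The names VoiceAccessError and first_working_voice are assumptions, and the synthesize callable stands in for whatever function actually sends the request; only the order of the three voice IDs comes from the principles above.

```python
FALLBACK_ORDER = ["male_0004_a", "male_0018_a", "child_0001_b"]


class VoiceAccessError(Exception):
    """Raised by the caller's synthesize function on a 403 voice-access error."""


def first_working_voice(synthesize, text, order=FALLBACK_ORDER):
    """Try each voice once, in order; return (voice_id, audio) for the first success."""
    for voice_id in order:
        try:
            return voice_id, synthesize(text, voice_id)
        except VoiceAccessError:
            # Do not retry an unauthorized voice; move to the next candidate.
            continue
    raise RuntimeError("no authorized voice available")
```

Returning the voice_id together with the audio lets the final report state which voice was actually used and why a downgrade happened.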

Files

4 total
