Polyphone TTS

Other

Fix Chinese polyphone (多音字) mispronunciation in TTS by auto-detecting ambiguous characters and applying pinyin annotations. Use when users complain about wrong pronunciation, need precise tone control, or are synthesizing text with characters like 行/干/量/好/了/得/地/的/着/过. Triggers on "读音不对", "这个字读错了", "多音字", "标注拼音", "银行行长", "绕口令", or any request to correct TTS pronunciation.

Install

openclaw skills install @scikkk/polyphone

SenseAudio Polyphone TTS (多音字)

Precise pronunciation control for Chinese TTS via pinyin annotation. The dictionary parameter lets you override how specific characters are read. This is essential for polyphones (多音字), whose reading the model may guess wrong.

The dictionary parameter works only with cloned voices on the SenseAudio-TTS-1.5 model. System voices (e.g., male_0004_a) do not support it.

Step 1: Scan for Polyphones

When the user provides text, scan it for these common polyphones and flag any that appear:

Character	Readings	Context clues
行	háng (行业/银行/行列) / xíng (行走/行动/可行)	银行、行长、行业 → háng
干	gān (干净/干燥) / gàn (干活/干部)	干部、干活 → gàn
量	liáng (量体温/测量) / liàng (数量/重量)	数量、质量 → liàng
铺	pū (铺床/铺路) / pù (店铺/铺子)	店铺、铺面 → pù
好	hǎo (好的/很好) / hào (好奇/爱好)	爱好、好学 → hào
了	le (吃了/来了) / liǎo (了解/了结)	了解、了不起 → liǎo
得	de (跑得快) / dé (得到) / děi (得去)	得到 → dé；必须 ("must") → děi
地	de (慢慢地) / dì (土地/地方)	adverbial use (慢慢地) → de
的	de (我的) / dí (的确) / dì (目的)	目的 → dì；的确 → dí
着	zhe (看着) / zháo (着火) / zhuó (着装)	着火、着急 → zháo；着装 → zhuó
长	cháng (长度/很长) / zhǎng (成长/行长)	行长、生长 → zhǎng
重	zhòng (重量/重要) / chóng (重复/重新)	重复、重新 → chóng
中	zhōng (中间/中国) / zhòng (中奖/中毒)	中奖、中毒 → zhòng
还	hái (还有/还是) / huán (还钱/归还)	还钱、偿还 → huán
发	fā (发现/发展) / fà (头发/理发)	头发、理发 → fà
数	shù (数字/数量) / shǔ (数数/数一数)	数数、数落 → shǔ
参	cān (参加/参考) / shēn (人参/党参)	人参、党参 → shēn
差	chā (差别/差距) / chà (差不多) / chāi (出差)	出差 → chāi；差不多 → chà

Show the user which polyphones were found and your best guess at the intended reading, then ask them to confirm or correct before synthesizing.
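A minimal bash sketch of the scan, assuming UTF-8 input. The helper name scan_polyphones is hypothetical; it only reports which characters from the table occur and how often, so choosing a reading still relies on the context clues and the user's confirmation.

```shell
# Hypothetical helper: list polyphones from the table above that occur in a text.
# Prints one "<char><TAB><count>" line per polyphone found, in table order.
POLYPHONES=(行 干 量 铺 好 了 得 地 的 着 长 重 中 还 发 数 参 差)

scan_polyphones() {
  local text=$1 ch rest count
  for ch in "${POLYPHONES[@]}"; do
    rest=${text//"$ch"/}    # delete every occurrence of ch
    # All lengths use the same unit (bytes or chars, depending on locale),
    # so the length difference divided by len(ch) is the occurrence count.
    count=$(( (${#text} - ${#rest}) / ${#ch} ))
    if (( count > 0 )); then
      printf '%s\t%d\n' "$ch" "$count"
    fi
  done
}
```

For example, scan_polyphones '银行行长' prints 行 with a count of 2 and 长 with a count of 1.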

Example:

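Polyphones detected:
- "行" (character 2): 银行 → suggested háng [hang2] ✓ or xíng [xing2]?
- "行" (character 3): 行长 → suggested háng [hang2] ✓ or xíng [xing2]?
- "长" (character 4): 行长 → suggested zhǎng [zhang3] ✓ or cháng [chang2]?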

Step 2: Build the Dictionary

Convert confirmed readings into the dictionary array. Each entry covers one phrase containing one or more polyphones:

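For each original phrase, write the replacement by putting the [pinyin] tag immediately before each polyphone and leaving every other character unchanged.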

Pinyin format: [initial + final + tone number], e.g., [hang2], [xing2], [zhang3]
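Because a wrong replacement silently changes what gets read, it helps to check each entry before sending it. The bash sketch below (check_entry is a hypothetical name) assumes the format stated above: tags are lowercase pinyin plus a tone digit 1-4, each tag precedes a character, and deleting all tags must give back the original phrase.

```shell
# Hypothetical checker for one dictionary entry.
# The source does not document how a neutral tone is written, so only
# tone digits 1-4 are accepted here.
check_entry() {
  local original=$1 replacement=$2 stripped
  case $replacement in
    *]) echo "tag must precede a character: $replacement" >&2; return 1 ;;
  esac
  # Delete well-formed tags such as [hang2]; any leftover bracket is malformed.
  stripped=$(printf '%s' "$replacement" | sed 's/\[[a-z][a-z]*[1-4]]//g')
  case $stripped in
    *[][]*) echo "malformed pinyin tag: $replacement" >&2; return 1 ;;
  esac
  # Tags may only be added; every original character must survive in order.
  if [ "$stripped" != "$original" ]; then
    echo "characters dropped or changed: $replacement" >&2
    return 1
  fi
}
```

For example, check_entry 好奇心 '[hao4]奇心' fails, because the tag replaced 好 instead of preceding it.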

Example:

original: 银行行长
replacement: 银[hang2]行[hang2]行[zhang3]长
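Writing tags by hand makes it easy to drop a character. A hypothetical helper, build_replacement, assembles the string one character at a time: pass a plain character as-is, or char:pinyin to prefix that character with a tag.

```shell
# Hypothetical helper: build a replacement string from per-character arguments.
# "行:hang2" becomes "[hang2]行"; a bare "银" is copied unchanged.
build_replacement() {
  local arg out=""
  for arg in "$@"; do
    if [[ $arg == *:* ]]; then
      out+="[${arg#*:}]${arg%%:*}"    # tag goes before the character
    else
      out+=$arg
    fi
  done
  printf '%s\n' "$out"
}
```

build_replacement 银 行:hang2 行:hang2 长:zhang3 prints 银[hang2]行[hang2]行[zhang3]长.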

Build the full dictionary array:

"dictionary": [
  {"original": "银行行长", "replacement": "银[hang2]行[hang2]行[zhang3]长"},
  {"original": "好奇心", "replacement": "[hao4]好奇心"}
]
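If the array is assembled in a script rather than by hand, a sketch like the following emits it from original/replacement pairs. build_dictionary and json_escape are hypothetical names; the escaping covers only backslashes and double quotes, so it assumes the phrases contain no control characters.

```shell
# Hypothetical helper: escape a string for use inside a JSON string literal.
# Handles backslashes and double quotes only (no control characters).
json_escape() {
  local s=$1
  s=${s//\\/\\\\}    # escape backslashes first
  s=${s//\"/\\\"}    # then double quotes
  printf '%s' "$s"
}

# Hypothetical helper: print a JSON "dictionary" array.
# Arguments alternate: original1 replacement1 original2 replacement2 ...
build_dictionary() {
  local sep="" out="["
  while (( $# >= 2 )); do
    out+="${sep}{\"original\": \"$(json_escape "$1")\", \"replacement\": \"$(json_escape "$2")\"}"
    sep=", "
    shift 2
  done
  printf '%s]\n' "$out"
}
```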

Each original should be a short phrase (3–8 chars) that uniquely identifies the occurrence in context. Avoid single-character originals — they may match unintended occurrences.
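Both rules can be checked mechanically. This sketch (char_count and check_original are hypothetical) counts characters by deleting UTF-8 continuation bytes, which gives the same answer in any locale, and rejects an original that is outside 3-8 characters or that occurs zero or several times in the text.

```shell
# Hypothetical helper: count UTF-8 characters by deleting continuation bytes (0x80-0xBF).
char_count() {
  local n
  n=$(printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c)
  echo $((n))    # arithmetic expansion trims any padding from wc
}

# Hypothetical helper: succeed if ORIGINAL is 3-8 characters and occurs exactly once in TEXT.
check_original() {
  local text=$1 original=$2 len rest count
  len=$(char_count "$original")
  if (( len < 3 || len > 8 )); then
    echo "length $len outside 3-8: $original" >&2
    return 1
  fi
  rest=${text//"$original"/}
  count=$(( (${#text} - ${#rest}) / ${#original} ))
  if (( count != 1 )); then
    echo "found $count times, not a unique match: $original" >&2
    return 1
  fi
}
```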

Step 3: Synthesize

The user must provide a cloned voice ID. If they don't have one, remind them that dictionary requires a cloned voice and suggest using the senseaudio-voice-cloner skill first.
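A sketch of two hypothetical helpers for this step: preflight stops early when the API key or cloned voice ID is missing, and build_request writes the request body with jq --arg/--argjson, so text containing an apostrophe cannot break the single-quoted JSON used in the curl command below. The field names mirror that command; the helper names and the request.json file are assumptions.

```shell
# Hypothetical pre-flight check: fail early if the API key or voice ID is missing.
preflight() {
  local voice_id=${1:-}
  if [ -z "${SENSEAUDIO_API_KEY:-}" ]; then
    echo "SENSEAUDIO_API_KEY is not set" >&2
    return 1
  fi
  if [ -z "$voice_id" ]; then
    echo "dictionary needs a cloned voice; run senseaudio-voice-cloner first" >&2
    return 1
  fi
}

# Hypothetical request builder: jq escapes the text, so quotes and newlines are safe.
# Arguments: TEXT VOICE_ID DICTIONARY_JSON (a JSON array string).
build_request() {
  jq -n \
    --arg text "$1" \
    --arg voice_id "$2" \
    --argjson dictionary "$3" \
    '{
      model: "SenseAudio-TTS-1.5",
      text: $text,
      stream: false,
      voice_setting: {voice_id: $voice_id},
      audio_setting: {format: "mp3"},
      dictionary: $dictionary
    }'
}

# Example:
#   preflight "$VOICE_ID" &&
#     build_request "$TEXT" "$VOICE_ID" "$DICTIONARY" > request.json &&
#     curl -s -X POST https://api.senseaudio.cn/v1/t2a_v2 \
#       -H "Authorization: Bearer $SENSEAUDIO_API_KEY" \
#       -H "Content-Type: application/json" \
#       --data-binary @request.json -o response.json
```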

curl -s -X POST https://api.senseaudio.cn/v1/t2a_v2 \
  -H "Authorization: Bearer $SENSEAUDIO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "SenseAudio-TTS-1.5",
    "text": "<TEXT>",
    "stream": false,
    "voice_setting": {
      "voice_id": "<CLONED_VOICE_ID>"
    },
    "audio_setting": {
      "format": "mp3"
    },
    "dictionary": <DICTIONARY_ARRAY>
  }' -o response.json

jq -r '.data.audio' response.json | xxd -r -p > output.mp3

Check base_resp.status_code == 0 before decoding.
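One way to enforce this is to fold the check into the decode step. A minimal sketch, assuming the response fields used above (base_resp.status_code, and data.audio as a hex string, as the xxd -r -p call implies); decode_if_ok is a hypothetical name.

```shell
# Hypothetical helper: decode the hex audio only if the API reported success.
# Usage: decode_if_ok response.json output.mp3
decode_if_ok() {
  local response=$1 out=$2 code
  code=$(jq -r '.base_resp.status_code' "$response")
  if [ "$code" != "0" ]; then
    echo "synthesis failed (status_code=$code):" >&2
    jq '.base_resp' "$response" >&2    # show the whole error object
    return 1
  fi
  jq -r '.data.audio' "$response" | xxd -r -p > "$out"
}
```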

Step 4: Iterate

After the user listens, they may find additional mispronunciations. Update the dictionary array and re-synthesize. Keep the previous response.json until the new one succeeds.
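A sketch of that pattern, assuming the new request is saved with -o response.new.json instead of overwriting response.json. The hypothetical helper promote_if_ok moves the new file into place only when its status_code is 0.

```shell
# Hypothetical helper: replace the current response only if the new one succeeded.
# Usage: curl ... -o response.new.json && promote_if_ok response.new.json response.json
promote_if_ok() {
  local new=$1 current=$2
  if [ "$(jq -r '.base_resp.status_code' "$new")" = "0" ]; then
    mv -- "$new" "$current"
  else
    echo "keeping previous $current; new request failed" >&2
    return 1
  fi
}
```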

Report: file path, duration in ms (jq '.extra_info.audio_length' response.json), character count, and which dictionary entries were applied.
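A hypothetical report helper is sketched below. Duration comes from extra_info.audio_length as stated above. The character count is computed locally from the input text, and an entry counts as "applied" when its original phrase occurs in the text; both are assumptions, since the API may report its own figures.

```shell
# Hypothetical report helper.
# Usage: report TEXT DICTIONARY_JSON response.json output.mp3
report() {
  local text=$1 dictionary=$2 response=$3 out=$4 n original
  # Count UTF-8 characters by deleting continuation bytes (0x80-0xBF).
  n=$(printf '%s' "$text" | LC_ALL=C tr -d '\200-\277' | wc -c)
  echo "file: $out"
  echo "duration_ms: $(jq -r '.extra_info.audio_length' "$response")"
  echo "characters: $((n))"
  echo "dictionary entries applied:"
  while IFS= read -r original; do
    if [[ $text == *"$original"* ]]; then
      echo "  $original"
    fi
  done < <(printf '%s' "$dictionary" | jq -r '.[].original')
}
```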