Pronunciation Coach

v1.0.0

Foreign language pronunciation coach — listen to standard TTS pronunciation, record yourself, get word-by-word feedback on what was wrong, then practice targ...

1 version · License: MIT-0

Install

openclaw skills install pronunciation

SenseAudio Pronunciation Coach

Listen → Record → Compare → Drill. The loop that actually improves pronunciation.

Step 1: Choose Practice Material

Three input modes:

A. Direct input: The user pastes a word, phrase, or sentence.

B. Scene presets: Offer these if the user isn't sure what to practice:

Scene | Sample phrase
Airport check-in | "I'd like a window seat, please."
Ordering at a restaurant | "Could I have the menu, please?"
Business meeting | "Let me walk you through the agenda."
Hotel check-in | "I have a reservation under my name."
Shopping | "Do you have this in a different size?"
Asking for directions | "Excuse me, how do I get to the station?"

C. Topic-based: The user says "practice the th sound" or "practice the difference between r and l"; generate 5 sentences targeting that phoneme.

Also ask for the target language (default: English).

Step 2: Generate Standard Pronunciation

Produce two versions — slow for learning, normal for natural rhythm:

# Slow version (speed 0.75)
curl -s -X POST https://api.senseaudio.cn/v1/t2a_v2 \
  -H "Authorization: Bearer $SENSEAUDIO_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"SenseAudio-TTS-1.0\",
    \"text\": \"<TEXT>\",
    \"stream\": false,
    \"voice_setting\": { \"voice_id\": \"<VOICE_ID>\", \"speed\": 0.75 },
    \"audio_setting\": { \"format\": \"mp3\" }
  }" -o slow.json
# .data.audio holds the audio as a hex string; xxd -r -p turns it back into binary MP3
jq -r '.data.audio' slow.json | xxd -r -p > standard_slow.mp3

# Normal version (speed 1.0)
curl -s -X POST https://api.senseaudio.cn/v1/t2a_v2 \
  -H "Authorization: Bearer $SENSEAUDIO_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"SenseAudio-TTS-1.0\",
    \"text\": \"<TEXT>\",
    \"stream\": false,
    \"voice_setting\": { \"voice_id\": \"<VOICE_ID>\", \"speed\": 1.0 },
    \"audio_setting\": { \"format\": \"mp3\" }
  }" -o normal.json
jq -r '.data.audio' normal.json | xxd -r -p > standard_normal.mp3

Voice selection by language:

  • English: female_0006_a (clear, neutral accent)
  • Chinese: female_0008_c (standard Mandarin)
  • Default: female_0006_a
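The two requests above differ only in speed, and the inline JSON breaks if the practice text contains a double quote. The sketch below wraps the call in helpers and builds the body with jq so the text is escaped safely. The function names `voice_for_language`, `tts_payload`, and `synthesize` are hypothetical, not part of the SenseAudio API; the request shape is copied from the examples above.

```shell
# Pick a voice ID from the list above; anything unlisted falls back to the default.
voice_for_language() {
  case "$1" in
    zh) echo "female_0008_c" ;;
    *)  echo "female_0006_a" ;;
  esac
}

# Build the request body with jq so quotes inside the text are escaped correctly.
tts_payload() {  # usage: tts_payload <text> <voice_id> <speed>
  jq -n --arg text "$1" --arg voice "$2" --argjson speed "$3" '{
    model: "SenseAudio-TTS-1.0",
    text: $text,
    stream: false,
    voice_setting: { voice_id: $voice, speed: $speed },
    audio_setting: { format: "mp3" }
  }'
}

# Call the endpoint and decode the hex-encoded audio into <out>.mp3.
synthesize() {  # usage: synthesize <text> <voice_id> <speed> <out>
  tts_payload "$1" "$2" "$3" |
    curl -s -X POST https://api.senseaudio.cn/v1/t2a_v2 \
      -H "Authorization: Bearer $SENSEAUDIO_API_KEY" \
      -H "Content-Type: application/json" \
      -d @- |
    jq -r '.data.audio' | xxd -r -p > "$4.mp3"
}
```

For example, `synthesize "$TEXT" "$(voice_for_language en)" 0.75 standard_slow` would produce the slow version, and the same call with `1.0` and `standard_normal` the normal one.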

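Tell the user: "The slow and normal-speed versions are ready. Listen to the slow version first to hear how each sound is formed, then to the normal version for natural rhythm. When you're ready, record yourself repeating it and send me the recording."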

Step 3: Transcribe User Recording

When the user uploads their recording:

curl -s -X POST https://api.senseaudio.cn/v1/audio/transcriptions \
  -H "Authorization: Bearer $SENSEAUDIO_API_KEY" \
  -F "file=@<USER_RECORDING>" \
  -F "model=sense-asr-pro" \
  -F "response_format=verbose_json" \
  -F "language=<LANGUAGE_CODE>" \
  -F "timestamp_granularities[]=word" \
  > asr_result.json

Language codes: English → en, Chinese → zh, Japanese → ja, French → fr, Spanish → es

Extract the transcript: jq -r '.text' asr_result.json
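Since the request asks for word-level timestamps, the per-word timing can be pulled out alongside the transcript. This is a sketch only: it assumes the `verbose_json` response has a top-level `words` array with `word`, `start`, and `end` fields, which this document does not confirm, so check a real response before relying on it. `word_timings` is a hypothetical helper name.

```shell
# List each recognized word with its start and end time, tab-separated.
# ASSUMPTION: verbose_json exposes a top-level "words" array of {word, start, end}.
# The "?" makes jq print nothing (instead of failing) when "words" is absent.
word_timings() {  # usage: word_timings <asr_result.json>
  jq -r '.words[]? | [.word, .start, .end] | @tsv' "$1"
}
```

Running `word_timings asr_result.json` prints one word per line, which makes it easier to point the user at the exact moment a word went wrong.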

Step 4: Word-by-Word Comparison (LLM task)

Compare the ASR transcript against the original text yourself. Align words and identify mismatches:

Comparison approach:

  1. Tokenize both original and ASR output into words
  2. Use sequence alignment (like diff) to match them
  3. Flag words where ASR output differs from original
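This step is meant to be done by the model itself, but a deterministic pre-pass can make the word count reliable. The sketch below follows the three steps literally using `diff`. It assumes GNU diff (for the `--*-line-format` options) and awk are available, neither of which is in the listed runtime requirements; `tokenize`, `compare_words`, and `accuracy` are hypothetical helper names.

```shell
# Split text into lowercase words, one per line (apostrophes kept so "I'd" stays whole).
tokenize() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -cs "[:alnum:]'" '\n' | sed '/^$/d'
}

# Print one line per word of the original text: "OK <word>" or "MISS <word>".
# Words that only appear in the ASR output are dropped from the report.
compare_words() {  # usage: compare_words <original> <asr_transcript>
  ref=$(mktemp) hyp=$(mktemp)
  tokenize "$1" > "$ref"
  tokenize "$2" > "$hyp"
  diff --unchanged-line-format='OK %L' \
       --old-line-format='MISS %L' \
       --new-line-format='' "$ref" "$hyp" || true  # diff exits 1 when texts differ
  rm -f "$ref" "$hyp"
}

# Summarize as "<correct>/<total>" over the words of the original text.
accuracy() {  # usage: accuracy <original> <asr_transcript>
  compare_words "$1" "$2" | awk '/^OK / {ok++} END {printf "%d/%d\n", ok, NR}'
}
```

The MISS lines tell you which original words to diagnose; what the ASR heard instead still has to be read from the transcript.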

Diagnosis output format:

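Read-along analysis:

✓ "I'd like a"  - correct
✗ "window"      - recognized as "winder" (possible problem with the final -ow sound)
✓ "seat"        - correct
✗ "please"      - recognized as "pleas" (the final /z/ may not be clear enough)

Accuracy: 4/6 words (67%)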

Common phoneme issues for Chinese speakers (English):

Misrecognized as | Likely problem | Phoneme
"free" for "three" | th → f | /θ/
"light" for "right" | r → l confusion | /r/
"wery" for "very" | v → w | /v/
"sit" for "seat" | short vs long vowel | /ɪ/ vs /iː/
"fink" for "think" | th → f | /θ/
dropped final consonant | final stop deletion | /t/, /d/, /k/

When a word is misrecognized, infer the likely phoneme issue and name it specifically.
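For the literal pairs in the table, the lookup can be mechanical. The sketch below is a hypothetical helper (`guess_issue` is not part of the skill) that covers only those pairs plus a simple final-stop check; anything else returns "unknown" and still needs the model's judgment.

```shell
# Map an (expected, heard) word pair to a likely issue from the table above.
guess_issue() {  # usage: guess_issue <expected> <heard>
  expected=$1 heard=$2
  case "$expected:$heard" in
    three:free|think:fink) echo "/θ/: th → f" ; return ;;
    right:light)           echo "/r/: r → l confusion" ; return ;;
    very:wery)             echo "/v/: v → w" ; return ;;
    seat:sit)              echo "/ɪ/ vs /iː/: short vs long vowel" ; return ;;
  esac
  # Final stop deletion: the heard word is the expected word minus a final t, d, or k.
  case "$expected" in
    "$heard"[tdk]) echo "final stop deletion" ;;
    *)             echo "unknown" ;;
  esac
}
```

For instance, `guess_issue three free` reports the /θ/ substitution, while `guess_issue window winder` falls through to "unknown".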

Step 5: Targeted Drill

For each identified problem phoneme, generate a focused drill set:

Phoneme drill library:

Phoneme | Drill words
/θ/ (voiceless th) | think, three, through, both, teeth
/ð/ (voiced th) | weather, breathe
/r/ | red, right, road, very, sorry, around, mirror
/r/ vs /l/ | right/light, road/load, rice/lice, pray/play
/v/ | very, voice, love, live, over, never, river
/iː/ vs /ɪ/ | seat/sit, beat/bit, sheep/ship, feel/fill
final /t/ | cat, hat, right, night, about, what, that
final /d/ | road, said, good, food, bad, head

Present 3–5 drill words and generate slow TTS for each.
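Picking the drill words can be scripted as a lookup. The sketch below hard-codes a few rows of the library; `drill_words` and the keys `final-t` and `final-d` are hypothetical labels chosen for illustration, and the minimal-pair rows are left out.

```shell
# Print up to N drill words (default 5) for a phoneme, one per line.
drill_words() {  # usage: drill_words <phoneme> [count]
  case "$1" in
    "/θ/")     words="think three through both teeth" ;;
    "/r/")     words="red right road very sorry around mirror" ;;
    "/v/")     words="very voice love live over never river" ;;
    "final-t") words="cat hat right night about what that" ;;
    "final-d") words="road said good food bad head" ;;
    *)         words="" ;;
  esac
  [ -n "$words" ] || return 0
  # Intentionally unquoted so the list splits into one word per line.
  printf '%s\n' $words | head -n "${2:-5}"
}
```

Each printed word can then be sent through the slow (speed 0.75) TTS request from Step 2, writing one MP3 per word.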

Step 6: Track Progress

Save session results to pronunciation_progress.json in the current directory:

{
  "sessions": [
    {
      "date": "<ISO date>",
      "text": "<practice text>",
      "accuracy": 0.67,
      "errors": ["window (/oʊ/)", "please (final /z/)"],
      "phonemes_drilled": ["/oʊ/", "/z/"]
    }
  ]
}
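Appending to the file is safer with jq than with hand-edited JSON. The sketch below adds one record in the format shown above and creates the file on first use. `append_session` is a hypothetical helper; the errors and phonemes are passed as JSON array strings.

```shell
# Append one session record to the progress file, creating the file if needed.
append_session() {  # usage: append_session <file> <text> <accuracy> <errors_json> <phonemes_json>
  file=$1
  [ -s "$file" ] || echo '{"sessions": []}' > "$file"
  tmp=$(mktemp)
  # Write to a temp file first so a jq failure never truncates the progress file.
  jq --arg date "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
     --arg text "$2" \
     --argjson accuracy "$3" \
     --argjson errors "$4" \
     --argjson phonemes "$5" \
     '.sessions += [{date: $date, text: $text, accuracy: $accuracy,
                     errors: $errors, phonemes_drilled: $phonemes}]' \
     "$file" > "$tmp" && mv "$tmp" "$file"
}
```

A call might look like `append_session pronunciation_progress.json "$TEXT" 0.67 '["please (final /z/)"]' '["/z/"]'`.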

After 3+ sessions, show a summary:

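Weak-spot analysis (last 5 sessions):

/θ/ (th)  ████████░░  4 errors  ← focus here
/r/       ████░░░░░░  2 errors
/iː/      ██░░░░░░░░  1 error

Suggestion: focus on the th sound. A useful cue to give the user: "Place the tip of your tongue between your upper and lower teeth and blow gently."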
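The counts behind such a summary can be derived from the progress file with jq. The sketch below assumes that `phonemes_drilled` lists the phonemes that went wrong in each session, which the schema above suggests but does not state; `weak_phonemes` is a hypothetical helper name.

```shell
# Count how often each phoneme appears in the last 5 sessions, most frequent first.
# ASSUMPTION: "phonemes_drilled" lists the phonemes that went wrong in that session.
weak_phonemes() {  # usage: weak_phonemes <pronunciation_progress.json>
  jq -r '.sessions[-5:]
         | map(.phonemes_drilled[])
         | group_by(.)
         | map({phoneme: .[0], count: length})
         | sort_by(-.count)
         | .[] | "\(.phoneme)\t\(.count)"' "$1"
}
```

The output is one tab-separated `phoneme count` pair per line, ready to be rendered as the bar chart above.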

Iteration

After each round, ask: "Go again, or try a different sentence?" Keep the loop going until the user is satisfied or accuracy reaches 90% or higher.

Version tags

latest: vk975a8801nt5nt7hg3pfgmtd0582z5na

Runtime requirements

Bins: curl, jq, xxd
Env: SENSEAUDIO_API_KEY
Primary env: SENSEAUDIO_API_KEY