Install
openclaw skills install zxkane-audio-transcriber-funasrThis skill should be used when the user explicitly asks to "transcribe a meeting", "transcribe audio", "transcribe a meeting recording", "convert audio to text", "generate meeting minutes from audio", "do speech-to-text", "transcribe with speaker diarization", "identify speakers in audio", "transcribe Chinese audio", "transcribe English audio", "transcribe Japanese audio", "multi-speaker transcription", "transcribe a podcast", "transcribe podcast episode", "transcribe an interview", "convert podcast to text", "podcast to transcript", or mentions FunASR, Paraformer, SenseVoice, Whisper, MiMo, MiMo-V2.5-ASR, meeting transcription, podcast transcription, or speaker diarization. Supports multi-speaker meeting and podcast transcription in Chinese, English, Japanese, Korean, Cantonese, and 99 languages (via Whisper), plus Xiaomi MiMo-V2.5-ASR (8B, local GPU) for stronger proper-noun and code-switching accuracy. Automatic speaker diarization via CAM++, hotword biasing (FunASR path), LLM cleanup. FunASR works on GPU and CPU; MiMo requires a local CUDA GPU with >=20GB VRAM.
openclaw skills install zxkane-audio-transcriber-funasrTranscribe multi-speaker audio into structured Markdown with automatic speaker diarization, hotword biasing, and optional LLM cleanup. Two ASR engine families are available: FunASR (Paraformer / SenseVoice / Whisper — fast, cheap, GPU or CPU, 99 languages) and MiMo-V2.5-ASR (Xiaomi's 8B model, local GPU only, stronger on proper nouns and code-switching). Both share the same VAD + speaker-clustering stack.
All scripts run directly from the plugin directory — no copying needed. Define this shorthand at the start of every session:
SCRIPTS=${CLAUDE_PLUGIN_ROOT}/skills/audio-transcribe/scripts
--lang | Model | Languages | Hotword |
|---|---|---|---|
zh (default) | SeACo-Paraformer | Chinese (CER 1.95%) | Yes |
zh-basic | Paraformer-large | Chinese | No |
en | Paraformer-en | English | No |
auto | SenseVoiceSmall | Auto-detect: zh/en/ja/ko/yue | No |
whisper | Whisper-large-v3-turbo | 99 languages | No |
mimo | MiMo-V2.5-ASR (local 8B, GPU-only) | zh/en/code-switch/dialects | No |
All presets include speaker diarization (CAM++) and VAD (FSMN).
mimo reuses the FSMN VAD + CAM++ stack around MiMo's text output.
Diarization caveat:
autoandwhisperdo not output per-sentence timestamps, so speaker diarization does not work with these presets. Usezh,zh-basic,en, ormimowhen speaker identification is needed (e.g., podcasts, meetings).
Before starting transcription, always ask the user:
"Do you have any of the following to improve accuracy?"
- Attendee / guest list — for hotwords and speaker mapping
- Meeting agenda or episode topic — for hotwords (terms, names)
- Reference documents (show notes, prior notes) — for speaker identification and ASR correction
Adapt defaults by recording type:
--lang zh, ask about supporting files--lang zh, --num-speakers 2, always ask for
host + guest names, suggest --speaker-context for roles
(do NOT use --lang auto — it lacks timestamps for speaker diarization)⚠️
--speakersmust use the speaker's real name, not a podcast alias. The value passed to--speakersis used verbatim as the speaker label in the output transcript. Always derive it from the host/guest's actual name (e.g. from a shownotes "Host:" field), not from the podcast feed name or title.Example: if shownotes lists "Host: 张三(张三的播客)", pass
--speakers '张三'— not the alias "张三的播客". Add both the real name and the alias tohotwords.txtso ASR can recognise both forms.When both
--speakersand--referenceare supplied, the script detects this mistake at startup and prints anACTION REQUIREDblock naming the suggested real name. If you see that block, stop the run and re-invoke with the corrected--speakersvalue before Phase 3 — the warning does not abort the pipeline.
If the user provides supporting materials:
hotwords.txt (include both real name and alias)speaker-context.json--referenceAUTO_YES=1 bash $SCRIPTS/setup_env.sh
# Or force CPU: AUTO_YES=1 bash $SCRIPTS/setup_env.sh cpu
The setup script patches FunASR's spectral clustering for O(N²·k) performance. Without this, recordings over ~1 hour hang for hours during speaker clustering.
Output files are written to the current working directory.
LLM cleanup (Phase 3) is opt-in. By default, transcription runs locally
without contacting any external service. To enable LLM-powered ASR correction
and speaker name refinement, pass --model <model-id>. Use LLM cleanup when:
⚠️ Data Privacy: When LLM cleanup is enabled via
--model, transcript excerpts are sent to external LLM providers (AWS Bedrock, Anthropic, or OpenAI depending on the model ID). Use--skip-llmor omit--modelto keep all data local. For Bedrock, boto3 uses the standard AWS credential chain (IAM role, SSO,~/.aws/credentials, env vars).
# Chinese meeting with hotwords (local-only, no LLM)
python3 $SCRIPTS/transcribe.py meeting.wav \
--lang zh --num-speakers 9 --hotwords hotwords.txt
# English meeting with speaker names
python3 $SCRIPTS/transcribe.py meeting.wav \
--lang en --speakers "Alice,Bob,Carol,Dave"
# Auto-detect language (zh/en/ja/ko/yue)
python3 $SCRIPTS/transcribe.py meeting.wav \
--lang auto --num-speakers 6
# Whisper for any language
python3 $SCRIPTS/transcribe.py meeting.wav \
--lang whisper --num-speakers 4
# Enable LLM cleanup for polished output (requires --model)
# Bedrock (uses AWS credential chain: IAM role, SSO, ~/.aws/credentials)
python3 $SCRIPTS/transcribe.py meeting.wav \
--lang zh --num-speakers 9 --hotwords hotwords.txt \
--provider bedrock --model us.anthropic.claude-sonnet-4-6
# Bedrock "global" cross-region profile (recent AWS deployments)
python3 $SCRIPTS/transcribe.py meeting.wav \
--provider bedrock --model global.anthropic.claude-sonnet-4-6
# Bedrock via litellm-style wrapper (supported; prefix is stripped for boto3)
python3 $SCRIPTS/transcribe.py meeting.wav \
--provider bedrock --model amazon-bedrock/global.anthropic.claude-sonnet-4-6
# Anthropic API (requires ANTHROPIC_API_KEY env var)
python3 $SCRIPTS/transcribe.py meeting.wav \
--provider anthropic --model claude-sonnet-4-6
# OpenAI-compatible API (requires OPENAI_API_KEY env var)
python3 $SCRIPTS/transcribe.py meeting.wav \
--provider openai --model gpt-4o
# Full pipeline with all supporting files + LLM (best quality)
python3 $SCRIPTS/transcribe.py episode.m4a \
--lang zh --num-speakers 2 \
--hotwords hotwords.txt \
--speakers "关羽,张飞" \
--speaker-context speaker-context.json \
--reference show-notes.md \
--model us.anthropic.claude-sonnet-4-6
# Resume interrupted LLM cleanup
python3 $SCRIPTS/transcribe.py meeting.wav \
--skip-transcribe --model us.anthropic.claude-sonnet-4-6
If the transcript has swapped speaker labels (common with podcasts), the verification script can detect and fix mismatches using LLM analysis:
# Dry-run: check if host/guest are swapped
python3 $SCRIPTS/verify_speakers.py podcast_raw_transcript.json \
--speakers "关羽,张飞" \
--speaker-context speaker-context.json
# Apply the fix
python3 $SCRIPTS/verify_speakers.py podcast_raw_transcript.json \
--speakers "关羽,张飞" \
--speaker-context speaker-context.json --fix
# Multi-speaker meeting: full reassignment
python3 $SCRIPTS/verify_speakers.py meeting_raw_transcript.json \
--speakers "Alice,Bob,Carol,Dave" \
--speaker-context speaker-context.json --fix
# Then regenerate the markdown with corrected labels
python3 $SCRIPTS/transcribe.py original.m4a \
--skip-transcribe --clean-cache
The script analyzes the first 5 minutes (configurable with --minutes)
and auto-detects podcast (2 speakers, swap detection) vs meeting
(N speakers, full reassignment).
The script automatically converts input audio to 16kHz mono FLAC and validates that no audio is lost (detects silent truncation).
| Format | 4h14m meeting | Quality | Recommendation |
|---|---|---|---|
| FLAC | 219MB | Lossless | Default, safest |
| Opus | 55MB | Lossy | Risk of truncation on long files |
| WAV | 465MB | Lossless | Works but larger |
| Original M4A | 173MB | Source | Also works directly |
Do NOT split long recordings — splitting breaks speaker ID consistency.
--lang mimo runs Xiaomi's
MiMo-V2.5-ASR locally on a
CUDA GPU. Use it when:
Requirements:
setup_env.sh).flash-attn==2.7.4.post1 compile
(needs nvcc from the CUDA toolkit, takes 10–30 min).Install (opt-in):
# One-time: install MiMo on top of the standard environment
AUTO_YES=1 INSTALL_MIMO=1 \
MIMO_WEIGHTS_PATH=/mnt/models/hf \
bash $SCRIPTS/setup_env.sh
Run:
python3 $SCRIPTS/transcribe.py podcast.m4a \
--lang mimo --num-speakers 2 \
--mimo-weights-path /mnt/models/hf
Resume after failure:
python3 $SCRIPTS/transcribe.py podcast.m4a \
--lang mimo --resume-mimo --mimo-weights-path /mnt/models/hf
Limitations:
--hotwords is ignored).| Flag | Purpose |
|---|---|
--lang | zh (default), zh-basic, en, auto, whisper |
--hotwords | Hotword file or string — biases ASR (zh only) |
--reference F | Reference file for LLM ASR correction |
--num-speakers N | Expected speaker count (improves diarization) |
--speakers "A,B,C" | Assign real names by first-appearance order |
--speaker-context F | JSON with per-speaker roles for LLM |
--no-detect-gender | Disable automatic speaker gender detection (CAM++ gender classifier) |
--speaker-genders "A:female,B:male" | Override per-speaker gender (also accepts positional female,male) |
--audio-format | flac (default), opus, wav |
--device cpu | Force CPU mode |
--batch-size N | Adjust for memory (60 for CPU, 100 if GPU OOM) |
--phase1-only | Exit after Phase 1 (VAD + ASR + diarization), skip Phase 2 + 3 |
--json-out PATH | Write raw transcript JSON to explicit path (overrides default naming) |
--skip-transcribe | Resume from saved *_raw_transcript.json |
--skip-llm | Skip LLM cleanup (default when --model is omitted) |
--model ID | Enable LLM cleanup with this model (auto-detects Bedrock/Anthropic/OpenAI) |
--title "..." | Output document title |
--clean-cache | Delete LLM chunk cache after completion |
--output PATH | Custom output file path |
--model-cache-dir | ModelScope model cache directory (~3GB, default: ~/.cache/modelscope/) |
--mimo-audio-tag | MiMo language hint: <chinese> (default), <english>, <auto> |
--mimo-batch N | Concurrent VAD segments per MiMo call (default 1; H100/80GB can go higher) |
--mimo-weights-path DIR | Cache dir for MiMo weights (default: $HF_HOME → ~/.cache/huggingface) |
--resume-mimo | Resume MiMo Phase 1 from *_mimo_partial.json after a mid-run failure |
<stem>-transcript.md — Final Markdown with speaker labels and timestamps<stem>_raw_transcript.json — Raw Phase 1 output (for resume/analysis)FunASR's CAM++ may merge acoustically similar speakers. To improve:
--num-speakers N — Hint expected count--hotwords — Include participant names (Chinese names work best)--speaker-context — Provide per-person keywords for LLM splitting*_raw_transcript.json for unique phrasesEnabled by default: each detected speaker is classified as male / female
via 3D-Speaker's CAM++ gender classifier (iic/speech_campplus_two_class_gender_16k).
The result appears next to each name in the Speaker List table and is
injected into the LLM cleanup prompt so pronouns (他/她, he/she) get corrected.
Precedence when combined:
--speaker-genders "Alice:female,Bob:male" (explicit CLI) — always wins主播(女):韩梅梅 or Host (male): Alice — override autoDisable with --no-detect-gender if you don't need gender and want to save
the ~500 MB model download and extra inference time.
Long recordings on resource-constrained machines may hit exec timeouts
or OOM kills. See references/pipeline-details.md for workarounds:
systemd-run or nohup--lang zh-basic (lighter model)references/pipeline-details.md — Architecture, model specs, benchmarks,
speaker role verification, hotword effectiveness, clustering patchscripts/transcribe.py — Main transcription pipelinescripts/verify_speakers.py — Speaker label verification & fixscripts/llm_utils.py — Shared LLM infrastructure (Bedrock/Anthropic/OpenAI)scripts/setup_env.sh — Environment setup (venv + deps + patch)