Funasr Transcribe

v1.5.1

This skill should be used when the user explicitly asks to "transcribe a meeting", "transcribe audio", "transcribe a meeting recording", "convert audio to te...

by @zxkane

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for zxkane/zxkane-audio-transcriber-funasr.

Install the skill "Funasr Transcribe" (zxkane/zxkane-audio-transcriber-funasr) from ClawHub.
Skill page: https://clawhub.ai/zxkane/zxkane-audio-transcriber-funasr
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Required binaries: python3, ffmpeg
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Canonical install target

openclaw skills install zxkane/zxkane-audio-transcriber-funasr

ClawHub CLI

npx clawhub@latest install zxkane-audio-transcriber-funasr

Security Scan

Capability signals

Crypto · Requires sensitive credentials

These labels describe what authority the skill may exercise. They are separate from suspicious or malicious moderation verdicts.

VirusTotal

Benign

OpenClaw

Suspicious

medium confidence

✓ Purpose & Capability

Name/description match the included code and instructions: the scripts convert audio, run FunASR models, perform diarization, post-process transcripts, and optionally call external LLMs for cleanup. Required binaries (python3, ffmpeg) are appropriate.

ℹ Instruction Scope

SKILL.md describes a four-phase pipeline and includes explicit LLM system prompts and instructions to collect supporting files (hotwords, speaker-context). This is expected for an opt-in 'LLM cleanup' feature, but SKILL.md contains large system_prompt templates that will be sent to LLMs, and a pre-scan flagged 'system-prompt-override'. It also references CLAUDE_PLUGIN_ROOT as an environment variable/shorthand without declaring it in requires.env. This is likely a platform-provided variable, but it is worth verifying.

Install Mechanism

There is no registry install spec, but the included setup_env.sh will pip-install FunASR, modelscope, boto3 (and suggests anthropic/openai optionally), install ffmpeg via apt/brew (invoking sudo), create a venv, and then run patch_clustering.py which modifies FunASR's installed package file(s) in site-packages. Patching third-party site-packages is functional for long-audio performance but is a notable elevation of risk: it changes code outside the skill's own directory and may require sudo or affect system-wide packages. Review and run the patch only in a controlled environment (container/VM/isolated venv).

✓ Credentials

No required secrets are declared. Optional env vars (AWS_REGION, ANTHROPIC_API_KEY, OPENAI_API_KEY, OPENAI_BASE_URL) are reasonable for the documented opt-in LLM cleanup. The code uses the standard AWS credential chain (no explicit keys required). No unrelated credentials or surprising env demands are requested.

Persistence & Privilege

The skill is not 'always' and does not demand elevated platform privileges, but setup_env.sh will install packages and patch an installed library (FunASR) in site-packages. That behavior modifies other installed code and can be considered a persistent change beyond the skill's own files; it increases blast radius and should be approved by the user or performed in an isolated environment. The scripts also support a non-interactive --yes mode, which would apply the patch automatically when invoked by setup_env.sh.

Scan Findings in Context

[system-prompt-override] expected: The SKILL.md and verify_* scripts include long system_prompt templates for calling LLMs and explicitly ask the LLM to respond in precise formats (VERDICT/JSON). Pattern scanner flagged 'system-prompt-override' — this is expected because the skill implements LLM-based speaker verification/cleanup, but you should audit the prompts to ensure they don't instruct model behavior outside the intended cleanup scope.

What to consider before installing

This skill appears to do what it says (local FunASR transcription plus optional LLM cleanup), but take precautions before running the provided setup scripts:

1) Run setup_env.sh and the clustering patch only inside an isolated environment (container, VM, or dedicated Python virtualenv), because the patch modifies installed FunASR files in site-packages and setup_env.sh may use sudo to install ffmpeg.
2) Inspect patch_clustering.py and confirm you are comfortable with it editing third-party package files; prefer to run it interactively so you can review the target path.
3) Only enable Phase 3 LLM cleanup if you understand that transcript excerpts (and any hotwords or speaker names you supply) will be sent to external LLM providers. Provide API keys only to providers you trust, and consider redacting sensitive PII before sending.
4) Verify that CLAUDE_PLUGIN_ROOT and any other platform variables referenced by SKILL.md exist on your platform.
5) If you have low tolerance for changing system packages, consider running the skill on a disposable host, or skip the clustering patch (with the tradeoff of slower clustering for very long recordings).

Like a lobster shell, security has layers — review code before you run it.

Runtime requirements

🎙️ Clawdis

Bins: python3, ffmpeg

latest: vk975smepbf1n3v14wnekez97zd85jav3

230 downloads

0 stars

6 versions

Updated 1d ago

MIT-0

FunASR Meeting & Podcast Transcription

Transcribe multi-speaker audio into structured Markdown with automatic speaker diarization, hotword biasing, and optional LLM cleanup.

All scripts run directly from the plugin directory — no copying needed. Define this shorthand at the start of every session:

SCRIPTS=${CLAUDE_PLUGIN_ROOT}/skills/funasr-transcribe/scripts
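
Because CLAUDE_PLUGIN_ROOT is expected to come from the host platform, it may be worth confirming that it is set before relying on the shorthand. The following is a minimal Python sketch of that check; the function name `resolve_scripts_dir` is illustrative and not part of the skill.

```python
import os
from pathlib import Path


def resolve_scripts_dir(env=os.environ):
    """Return the skill's scripts directory, or raise if CLAUDE_PLUGIN_ROOT is unset."""
    root = env.get("CLAUDE_PLUGIN_ROOT", "").strip()
    if not root:
        raise RuntimeError("CLAUDE_PLUGIN_ROOT is not set; cannot locate the skill scripts")
    # Same directory layout as the SCRIPTS shorthand above.
    return Path(root) / "skills" / "funasr-transcribe" / "scripts"
```

Calling `resolve_scripts_dir()` with no arguments reads the real environment; passing a plain dict makes the check easy to try out in isolation.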

Supported Languages

`--lang`	Model	Languages	Hotword
`zh` (default)	SeACo-Paraformer	Chinese (CER 1.95%)	Yes
`zh-basic`	Paraformer-large	Chinese	No
`en`	Paraformer-en	English	No
`auto`	SenseVoiceSmall	Auto-detect: zh/en/ja/ko/yue	No
`whisper`	Whisper-large-v3-turbo	99 languages	No

All presets include speaker diarization (CAM++) and VAD (FSMN).

Diarization caveat: the auto and whisper presets do not output per-sentence timestamps, so speaker diarization does not work with them even though the diarization model is included. Use zh, zh-basic, or en when speaker identification is needed (e.g., podcasts, meetings).
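
The table and the caveat above can be read as a small lookup problem. The sketch below encodes them as data and filters presets by requirement. The helper `pick_presets` and the short language codes are illustrative, and treating whisper as accepting any language code is a simplification of "99 languages".

```python
# Preset facts copied from the table and caveat above; the helper itself is illustrative.
PRESETS = {
    "zh":       {"languages": {"zh"}, "hotwords": True,  "diarization": True},
    "zh-basic": {"languages": {"zh"}, "hotwords": False, "diarization": True},
    "en":       {"languages": {"en"}, "hotwords": False, "diarization": True},
    "auto":     {"languages": {"zh", "en", "ja", "ko", "yue"}, "hotwords": False, "diarization": False},
    "whisper":  {"languages": None, "hotwords": False, "diarization": False},  # None: assumed to accept any code
}


def pick_presets(language, need_speakers=False, need_hotwords=False):
    """Return the --lang presets that satisfy the given requirements, in table order."""
    matches = []
    for name, preset in PRESETS.items():
        if preset["languages"] is not None and language not in preset["languages"]:
            continue
        if need_speakers and not preset["diarization"]:
            continue
        if need_hotwords and not preset["hotwords"]:
            continue
        matches.append(name)
    return matches
```

For example, `pick_presets("ja", need_speakers=True)` returns an empty list, which reflects the caveat: no preset currently combines Japanese with working speaker labels.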

Workflow

Before starting transcription, always ask the user:

1. Audio file: path to the recording (required)
2. Type: meeting, podcast, or interview? (affects defaults)
3. Language: what language is spoken? (default: Chinese)
4. Number of speakers: how many participants? (improves diarization)
5. Speaker names: for podcasts, host + guest names; for meetings, the attendee list
6. Supporting files: ask "Do you have any of the following to improve accuracy?"
   - Attendee / guest list, for hotwords and speaker mapping
   - Meeting agenda or episode topic, for hotwords (terms, names)
   - Reference documents (show notes, prior notes), for speaker identification and ASR correction

Adapt defaults by recording type:

Meeting: default --lang zh, ask about supporting files
Podcast / interview: default --lang zh, --num-speakers 2, always ask for host + guest names, suggest --speaker-context for roles (do NOT use --lang auto — it lacks timestamps for speaker diarization)

⚠️ --speakers must use the speaker's real name, not a podcast alias. The value passed to --speakers is used verbatim as the speaker label in the output transcript. Always derive it from the host/guest's actual name (e.g. from a shownotes "Host:" field), not from the podcast feed name or title.

Example: if shownotes lists "Host: 张三（张三的播客）", pass --speakers '张三' — not the alias "张三的播客". Add both the real name and the alias to hotwords.txt so ASR can recognise both forms.

When both --speakers and --reference are supplied, the script checks at startup for an alias passed as a speaker name and, if it finds one, prints an ACTION REQUIRED block naming the suggested real name. The warning does not abort the pipeline, so if you see that block, stop the run yourself and re-invoke with the corrected --speakers value before Phase 3.
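
The same check can be done by hand before invoking the script. The sketch below is not the script's actual detection logic, which is not shown here. It assumes the reference text contains lines shaped like the "Host: 张三（张三的播客）" example, and the function name `suggest_real_names` is hypothetical.

```python
import re

# Matches lines such as "Host: 张三（张三的播客）", with full-width or ASCII parentheses.
# The "Role: Name（Alias）" layout is an assumption taken from the example above.
_ROLE_LINE = re.compile(r"^\s*\w+:\s*([^（(\n]+?)\s*[（(]([^）)\n]+)[）)]", re.MULTILINE)


def suggest_real_names(speakers, reference_text):
    """Map each --speakers value that is really an alias to the suggested real name."""
    alias_to_name = {
        alias.strip(): name.strip()
        for name, alias in _ROLE_LINE.findall(reference_text)
    }
    return {s: alias_to_name[s] for s in speakers if s in alias_to_name}
```

An empty result means none of the supplied speaker values matched a known alias in the reference text; it does not prove the names are correct.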

If the user provides supporting materials:

Extract participant names and key terms → create hotwords.txt (include both real name and alias)
Extract per-person context → create speaker-context.json
Pass original reference document with --reference
Use all three together for best results
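
Building hotwords.txt from several sources is mostly a merge-and-de-duplicate step. The sketch below assumes the file holds one hotword per line in UTF-8; that layout is an assumption, so check it against the skill's own documentation. Both helper names are illustrative. The schema of speaker-context.json is not described here, so it is deliberately left out.

```python
from pathlib import Path


def build_hotwords(*term_groups):
    """Merge term groups into one ordered, de-duplicated list of non-empty terms."""
    seen, merged = set(), []
    for group in term_groups:
        for term in group:
            term = term.strip()
            if term and term not in seen:
                seen.add(term)
                merged.append(term)
    return merged


def write_hotwords(path, terms):
    # Assumption: one hotword per line, UTF-8 encoded.
    Path(path).write_text("\n".join(terms) + "\n", encoding="utf-8")
```

A typical call would pass real names and aliases first, then agenda terms, for instance `build_hotwords(["张三", "张三的播客"], agenda_terms)`, where `agenda_terms` is whatever list was extracted from the supporting files.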

Quick Start

1. Environment Setup

AUTO_YES=1 bash $SCRIPTS/setup_env.sh
# Or force CPU:  AUTO_YES=1 bash $SCRIPTS/setup_env.sh cpu

The setup script patches FunASR's spectral clustering to run in O(N²·k) time. Without the patch, speaker clustering on recordings longer than about 1 hour can stall for hours.

2. Run Transcription

Output files are written to the current working directory.

LLM cleanup (Phase 3) is opt-in. By default, transcription runs locally without contacting any external service. To enable LLM-powered ASR correction and speaker name refinement, pass --model <model-id>. Use LLM cleanup when:

The raw transcript has many ASR errors (names, technical terms)
You need polished, publication-ready output
Speaker names need to be refined from context

⚠️ Data Privacy: When LLM cleanup is enabled via --model, transcript excerpts are sent to external LLM providers (AWS Bedrock, Anthropic, or OpenAI depending on the model ID). Use --skip-llm or omit --model to keep all data local. For Bedrock, boto3 uses the standard AWS credential chain (IAM role, SSO, ~/.aws/credentials, env vars).

# Chinese meeting with hotwords (local-only, no LLM)
python3 $SCRIPTS/transcribe_funasr.py meeting.wav \
    --lang zh --num-speakers 9 --hotwords hotwords.txt

# English meeting with speaker names
python3 $SCRIPTS/transcribe_funasr.py meeting.wav \
    --lang en --speakers "Alice,Bob,Carol,Dave"

# Auto-detect language (zh/en/ja/ko/yue); speaker diarization does not work with this preset
python3 $SCRIPTS/transcribe_funasr.py meeting.wav \
    --lang auto --num-speakers 6

# Whisper for any language; speaker diarization does not work with this preset
python3 $SCRIPTS/transcribe_funasr.py meeting.wav \
    --lang whisper --num-speakers 4

# Enable LLM cleanup for polished output (requires --model)
# Bedrock (uses AWS credential chain: IAM role, SSO, ~/.aws/credentials)
python3 $SCRIPTS/transcribe_funasr.py meeting.wav \
    --lang zh --num-speakers 9 --hotwords hotwords.txt \
    --model us.anthropic.claude-sonnet-4-6

# Anthropic API (requires ANTHROPIC_API_KEY env var)
python3 $SCRIPTS/transcribe_funasr.py meeting.wav \
    --model claude-sonnet-4-6

# OpenAI-compatible API (requires OPENAI_API_KEY env var)
python3 $SCRIPTS/transcribe_funasr.py meeting.wav \
    --model gpt-4o

# Full pipeline with all supporting files + LLM (best quality)
python3 $SCRIPTS/transcribe_funasr.py episode.m4a \
    --lang zh --num-speakers 2 \
    --hotwords hotwords.txt \
    --speakers "关羽,张飞" \
    --speaker-context speaker-context.json \
    --reference show-notes.md \
    --model us.anthropic.claude-sonnet-4-6

# Resume interrupted LLM cleanup
python3 $SCRIPTS/transcribe_funasr.py meeting.wav \
    --skip-transcribe --model us.anthropic.claude-sonnet-4-6
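
Since the provider is chosen from the model ID, a pre-flight check can catch a missing API key before a long transcription starts. The rule below is only a guess inferred from the three example IDs above, not the detection logic in llm_utils.py, and the names `guess_provider` and `missing_env` are hypothetical.

```python
def guess_provider(model_id):
    """Heuristic guess based only on the example model IDs shown above."""
    if ".anthropic." in model_id or model_id.startswith("anthropic."):
        return "bedrock"      # e.g. us.anthropic.claude-sonnet-4-6
    if model_id.startswith("claude"):
        return "anthropic"    # e.g. claude-sonnet-4-6
    return "openai"           # e.g. gpt-4o, or another OpenAI-compatible model


REQUIRED_ENV = {
    "bedrock": [],                        # standard AWS credential chain, no explicit key
    "anthropic": ["ANTHROPIC_API_KEY"],
    "openai": ["OPENAI_API_KEY"],
}


def missing_env(model_id, env):
    """List the environment variables the guessed provider needs but env lacks."""
    return [name for name in REQUIRED_ENV[guess_provider(model_id)] if not env.get(name)]
```

Passing `os.environ` as `env` checks the live environment. A non-empty result is a hint to export the key, or to drop --model and stay local.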

3. Verify Speaker Labels

If the transcript has swapped speaker labels (common with podcasts), the verification script can detect and fix mismatches using LLM analysis:

# Dry-run: check if host/guest are swapped
python3 $SCRIPTS/verify_speakers.py podcast_raw_transcript.json \
    --speakers "关羽,张飞" \
    --speaker-context speaker-context.json

# Apply the fix
python3 $SCRIPTS/verify_speakers.py podcast_raw_transcript.json \
    --speakers "关羽,张飞" \
    --speaker-context speaker-context.json --fix

# Multi-speaker meeting: full reassignment
python3 $SCRIPTS/verify_speakers.py meeting_raw_transcript.json \
    --speakers "Alice,Bob,Carol,Dave" \
    --speaker-context speaker-context.json --fix

# Then regenerate the markdown with corrected labels
python3 $SCRIPTS/transcribe_funasr.py original.m4a \
    --skip-transcribe --clean-cache

The script analyzes the first 5 minutes (configurable with --minutes) and auto-detects podcast (2 speakers, swap detection) vs meeting (N speakers, full reassignment).

Audio Preprocessing

The script automatically converts input audio to 16kHz mono FLAC and validates that no audio is lost (detects silent truncation).

Format	Size (4h14m meeting)	Quality	Recommendation
FLAC	219MB	Lossless	Default, safest
Opus	55MB	Lossy	Risk of truncation on long files
WAV	465MB	Lossless	Works but larger
Original M4A	173MB	Source	Also works directly

Do NOT split long recordings — splitting breaks speaker ID consistency.
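
To reproduce the conversion and the truncation check by hand, something like the following should be close. It uses standard ffmpeg and ffprobe options, but the exact flags the script passes and its tolerance are not documented here, so the 1-second tolerance and all three function names are assumptions.

```python
import subprocess


def flac_command(src, dst):
    """Build an ffmpeg command that converts src to 16 kHz mono FLAC."""
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", "-c:a", "flac", dst]


def probe_duration(path):
    """Return the container duration in seconds as reported by ffprobe."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())


def looks_truncated(source_seconds, converted_seconds, tolerance=1.0):
    """True when the converted file is shorter than the source by more than tolerance."""
    return (source_seconds - converted_seconds) > tolerance
```

A manual run would call `subprocess.run(flac_command(src, dst), check=True)` and then compare `probe_duration(src)` with `probe_duration(dst)` using `looks_truncated`.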

Key Flags

Flag	Purpose
`--lang`	`zh` (default), `zh-basic`, `en`, `auto`, `whisper`
`--hotwords`	Hotword file or string — biases ASR (zh only)
`--reference F`	Reference file for LLM ASR correction
`--num-speakers N`	Expected speaker count (improves diarization)
`--speakers "A,B,C"`	Assign real names by first-appearance order
`--speaker-context F`	JSON with per-speaker roles for LLM
`--audio-format`	`flac` (default), `opus`, `wav`
`--device cpu`	Force CPU mode
`--batch-size N`	Adjust for memory (60 for CPU, 100 if GPU OOM)
`--phase1-only`	Exit after Phase 1 (VAD + ASR + diarization), skip Phase 2 + 3
`--json-out PATH`	Write raw transcript JSON to explicit path (overrides default naming)
`--skip-transcribe`	Resume from saved `*_raw_transcript.json`
`--skip-llm`	Skip LLM cleanup (default when `--model` is omitted)
`--model ID`	Enable LLM cleanup with this model (auto-detects Bedrock/Anthropic/OpenAI)
`--title "..."`	Output document title
`--clean-cache`	Delete LLM chunk cache after completion
`--output PATH`	Custom output file path
`--model-cache-dir`	ModelScope model cache directory (~3GB, default: `~/.cache/modelscope/`)
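
The "first-appearance order" rule for --speakers is easy to misread, so here is a small sketch of what such an assignment looks like. It is an illustration, not the script's code. The diarization labels are placeholders, and the "Speaker N" fallback for unnamed speakers is an assumption.

```python
def names_by_first_appearance(labels, names):
    """Assign names to diarization labels in the order the labels first occur."""
    mapping = {}
    for label in labels:
        if label not in mapping:
            index = len(mapping)
            # Assumption: speakers beyond the supplied names get a generic label.
            mapping[label] = names[index] if index < len(names) else f"Speaker {index + 1}"
    return mapping
```

With this rule, the first name goes to whoever speaks first, not to whoever is listed first in the show notes. If a guest opens the recording before the host does, the labels come out swapped.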

Outputs

<stem>-transcript.md — Final Markdown with speaker labels and timestamps
<stem>_raw_transcript.json — Raw Phase 1 output (for resume/analysis)

Speaker Diarization Tips

FunASR's CAM++ may merge acoustically similar speakers. To improve:

--num-speakers N — Hint expected count
--hotwords — Include participant names (Chinese names work best)
--speaker-context — Provide per-person keywords for LLM splitting
Keyword matching — Search *_raw_transcript.json for unique phrases
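
For the keyword-matching tip, a schema-agnostic search avoids guessing at the JSON layout, which is not documented here. The sketch below walks any parsed JSON value and reports where a phrase occurs; the function name `find_phrase` is illustrative.

```python
import json


def find_phrase(node, phrase, path="$"):
    """Return (json_path, text) pairs for every string value containing phrase."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits += find_phrase(value, phrase, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits += find_phrase(value, phrase, f"{path}[{index}]")
    elif isinstance(node, str) and phrase in node:
        hits.append((path, node))
    return hits


def search_file(json_path, phrase):
    """Load a raw transcript JSON file and search it for phrase."""
    with open(json_path, encoding="utf-8") as handle:
        return find_phrase(json.load(handle), phrase)
```

The returned paths show which entry a phrase came from, so a phrase only one participant would say can be traced back to its diarization label.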

CPU-only / Low-Memory Machines

Long recordings on resource-constrained machines may hit exec timeouts or OOM kills. See references/pipeline-details.md for workarounds:

Detach from agent timeouts with systemd-run or nohup
Prevent OOM via swap and/or --lang zh-basic (lighter model)
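
As an alternative to nohup or systemd-run, a long run can also be detached from Python. The sketch below swaps in `subprocess.Popen` with `start_new_session=True` (POSIX only) and is not taken from references/pipeline-details.md. Whether it survives a particular agent's timeout depends on how that agent terminates child processes.

```python
import subprocess


def launch_detached(argv, log_path):
    """Start argv in its own session, appending stdout and stderr to log_path."""
    log = open(log_path, "ab")
    try:
        return subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # POSIX only: new session, separate from the caller's process group
        )
    finally:
        log.close()  # the child keeps its own copy of the file descriptor
```

The returned process object exposes `pid`, so progress can be followed afterwards by tailing the log file.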

Additional Resources

references/pipeline-details.md — Architecture, model specs, benchmarks, speaker role verification, hotword effectiveness, clustering patch
scripts/transcribe_funasr.py — Main transcription pipeline
scripts/verify_speakers.py — Speaker label verification & fix
scripts/llm_utils.py — Shared LLM infrastructure (Bedrock/Anthropic/OpenAI)
scripts/setup_env.sh — Environment setup (venv + deps + patch)
