Install
openclaw skills install omnivoice

All-in-one voice identity toolkit: speaker identification, voice library management, voice cloning, and speech-to-text. The only OpenClaw skill with speaker identification: it recognizes WHO is speaking, not just WHAT they said.

10 operations: identify speakers, manage a voice library (CRUD), clone voices, transcribe audio, voice swap, and persona voice replies.

Activate when the user sends voice/audio, asks to identify a speaker, manage a voice library, clone someone's voice, or transcribe audio, or wants voice-based Q&A in a specific person's voice.

Triggers: voice, audio, transcribe, 转文字, 语音, identify speaker, who is speaking, 这是谁的声音, 声纹识别, voice clone, 克隆声音, 模仿声音, voice library, 声音库, voice swap, 声音换皮.
Ten operations across four capabilities: identify (认) · manage (存) · transcribe (听) · clone (说).
| Component | Install | Purpose |
|---|---|---|
| Whisper | pip install openai-whisper | Speech-to-text |
| Speaker ID | pip install transformers librosa | Speaker identification (UniSpeech-SAT) |
| CosyVoice2 | SiliconFlow API (SF_API_KEY) | Voice cloning |
| ffmpeg | System package | Audio conversion |
Voice references are stored in voice-refs/ at workspace root.
Metadata lives in TOOLS.md under a "Voice Library" section.
See references/voice-library-format.md for format spec.
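As a rough illustration of the storage layout, the sketch below groups reference files by speaker. It assumes the `<name>-ref<N>.<ext>` naming used elsewhere in this document; the helper name `list_voice_refs` is hypothetical and not part of the skill's scripts.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming convention: <name>-ref<N>.<ext>, e.g. alice-ref1.wav
REF_PATTERN = re.compile(r"^(?P<name>.+)-ref(?P<index>\d*)\.(?P<ext>[^.]+)$")


def list_voice_refs(root="voice-refs"):
    """Return {speaker_name: [reference file paths]} for files under root."""
    library = defaultdict(list)
    # Same glob shape as the one the identification step is described as using.
    for path in sorted(Path(root).glob("*-ref*.*")):
        match = REF_PATTERN.match(path.name)
        if match:
            library[match.group("name")].append(path)
    return dict(library)
```

Files that do not follow the naming convention are simply ignored, so stray notes or transcripts in the folder would not show up as speakers.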
Input: audio → Output: who is speaking (or "unknown")
python3 scripts/voice_identify.py <audio_file> [--threshold 0.75]
Compares audio against all voice-refs/*-ref*.* using UniSpeech-SAT x-vector embeddings.
First run downloads model (~360MB) to /tmp/hf_models/.
Accuracy: Reliably separates male/female voices. Same-gender speakers need ≥5s audio for best results. Threshold 0.75 is default; raise to 0.85 for stricter matching.
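To make the role of the threshold concrete, here is a minimal sketch of the final decision step, assuming each library speaker already has a similarity score. The function name `pick_speaker` and the `>=` comparison are assumptions for illustration; the actual logic in voice_identify.py may differ.

```python
def pick_speaker(scores, threshold=0.75):
    """Return the best-scoring speaker, or "unknown" if no score reaches threshold.

    scores: dict mapping speaker name -> similarity score (higher is closer).
    """
    if not scores:
        return "unknown"
    best = max(scores, key=scores.get)
    # Assumption: a score equal to the threshold counts as a match.
    return best if scores[best] >= threshold else "unknown"
```

With scores of 0.82 for one speaker and 0.41 for another, the default threshold would return the first speaker, while a stricter threshold of 0.85 would return "unknown".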
Input: audio + speaker name → stores in voice library
1. Save the audio as voice-refs/<name>-ref1.<ext>
2. Transcribe it: whisper <audio> --model small --output_format txt --output_dir /tmp
3. Add an entry to TOOLS.md (see format in references/)
4. Add the speaker to SPEAKER_MAP in voice_identify.py

Good reference audio: 10-15s of clear speech, minimal noise, natural pace. 5s minimum.
List: read the TOOLS.md voice library section + ls voice-refs/
Update: replace the file in voice-refs/, update the TOOLS.md entry
Delete: remove the file from voice-refs/, remove the TOOLS.md entry, remove from SPEAKER_MAP

Input: text + library speaker → Output: audio in that speaker's voice
set -a; source <env_file_with_SF_API_KEY>; set +a
python3 scripts/cosyvoice_clone.py \
--text "Text to speak" \
--ref voice-refs/<speaker>-ref1.<ext> \
--ref-text "What is said in reference audio" \
--output /tmp/clone_output.wav
Long reference (>15s): truncate first with ffmpeg -y -i <ref> -t 15 -ar 24000 -ac 1 /tmp/ref_trimmed.wav.
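The trimming step can also be scripted. The sketch below builds the same ffmpeg invocation as an argument list and optionally runs it; the helper names `trim_reference_cmd` and `trim_reference` are hypothetical, and ffmpeg is assumed to be on PATH.

```python
import subprocess


def trim_reference_cmd(ref_path, out_path="/tmp/ref_trimmed.wav", seconds=15):
    """Build the ffmpeg command: first `seconds` of audio, 24 kHz, mono."""
    return [
        "ffmpeg", "-y",          # overwrite the output file if it exists
        "-i", str(ref_path),     # input reference audio
        "-t", str(seconds),      # keep only the first N seconds
        "-ar", "24000",          # resample to 24 kHz
        "-ac", "1",              # downmix to mono
        str(out_path),
    ]


def trim_reference(ref_path, out_path="/tmp/ref_trimmed.wav", seconds=15):
    """Run the trim and return the output path; raises if ffmpeg fails."""
    subprocess.run(trim_reference_cmd(ref_path, out_path, seconds), check=True)
    return out_path
```

Passing the arguments as a list avoids shell quoting problems when a reference file name contains spaces.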
Input: audio → Output: text
whisper <audio_file> --model small --output_format txt --output_dir /tmp --language <lang>
Languages: zh (Chinese), en (English), ja (Japanese). Omit for auto-detect.
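A small sketch of how the optional language flag might be handled when the command is assembled programmatically. The helper name `whisper_cmd` is hypothetical; the flags are the ones shown in the command above.

```python
def whisper_cmd(audio_file, language=None, model="small", output_dir="/tmp"):
    """Build the whisper CLI command; omit --language to let whisper auto-detect."""
    cmd = [
        "whisper", str(audio_file),
        "--model", model,
        "--output_format", "txt",
        "--output_dir", output_dir,
    ]
    if language:
        # e.g. "zh", "en", "ja"
        cmd += ["--language", language]
    return cmd
```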
Input: audio → Output: who said what
Run Op 5 and Op 1 in parallel, report both results together.
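One way to run the two steps side by side is a thread pool around subprocess calls. This is only a sketch: the helper names are hypothetical, and it assumes both commands print their result to stdout, which has not been verified for either tool.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_cmd(cmd):
    """Run one command and return its stdout as stripped text."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def run_in_parallel(commands):
    """Run several commands concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        return list(pool.map(run_cmd, commands))
```

Calling `run_in_parallel` with the whisper command and the voice_identify.py command for the same audio file would return the two outputs in that order, ready to be reported together. Threads are sufficient here because the work happens in the child processes.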
Input: two audio files → Output: same person or not
python3 scripts/voice_identify.py <audio_1> --threshold 0.75
python3 scripts/voice_identify.py <audio_2> --threshold 0.75
Compare the top-ranked speaker from both runs. If they match → same person. For direct pairwise comparison without a library, extract embeddings and compute cosine similarity (see voice_identify.py internals).
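For the direct pairwise case, cosine similarity between two embedding vectors can be sketched as below. The embeddings are assumed to be plain lists of floats already extracted from each audio file; the function names and the reuse of the 0.75 threshold are illustrative assumptions, not the script's actual internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def same_speaker(emb_1, emb_2, threshold=0.75):
    """Assumed decision rule: same person if similarity reaches the threshold."""
    return cosine_similarity(emb_1, emb_2) >= threshold
```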
Input: audio + library speaker → Output: same words, different voice
Input: audio question + library speaker → Output: AI answer in that speaker's voice
Input: text question + library speaker → Output: AI answer in that speaker's voice
set -a; source <env_file>; set +a
bash scripts/feishu_send_audio.sh <wav_file> <receive_id>
Converts wav → opus, uploads, sends as voice message.
Requires FEISHU_APP_ID + FEISHU_APP_SECRET env vars.
ffmpeg -y -i <video_file> -vn -ar 24000 -ac 1 /tmp/extracted_audio.wav