Install
openclaw skills install audio-speaker-tools
Speaker separation, voice comparison, and audio processing tools. Use when working with multi-speaker audio, voice cloning, or speaker verification tasks including: (1) separating speakers from audio files via Demucs and pyannote diarization, (2) comparing voice samples for speaker verification or voice clone quality assessment using Resemblyzer, (3) extracting audio segments, (4) preparing samples for ElevenLabs voice cloning, or (5) validating speaker diarization results.
Tools for speaker separation, voice comparison, and audio processing using Demucs, pyannote, and Resemblyzer.
This skill provides three main workflows:
Run once to create the venv and install dependencies:
bash scripts/setup_venv.sh
Default venv location: ./.venv
Requirements:
ffmpeg (brew install ffmpeg)
Hugging Face token (HF_TOKEN)

diarize_and_slice_mps.py
Separate speakers from multi-speaker audio:
# Basic usage
HF_TOKEN=<your-hf-token> \
/path/to/venv/bin/python scripts/diarize_and_slice_mps.py \
--input audio.mp3 \
--outdir /path/to/output \
--prefix MyShow
# With speaker constraints
HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
--input audio.mp3 \
--outdir ./out \
--min-speakers 2 \
--max-speakers 5 \
--pad-ms 100
Process:
Output:
<prefix>_speaker1.wav, <prefix>_speaker2.wav, etc. (one per detected speaker)
diarization.rttm (time-stamped speaker segments)
segments.jsonl (JSON segments metadata)
meta.json (pipeline info and speaker index)

Important:
Pass the token via the HF_TOKEN env var, never as a CLI arg
Demucs output directory: ./separated/

compare_voices.py
Measure similarity between two voice samples using Resemblyzer:
# Basic comparison
python scripts/compare_voices.py \
--audio1 sample1.wav \
--audio2 sample2.wav
# JSON output
python scripts/compare_voices.py \
--audio1 reference.wav \
--audio2 clone.wav \
--threshold 0.85 \
--json
# Exit code = 0 if pass, 1 if fail
Scores:
< 0.75 = Different speakers
0.75-0.84 = Likely same speaker
0.85+ = Excellent match (ideal for voice cloning validation)

Use cases:
See: references/scoring-guide.md for detailed interpretation
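The score bands above can be written as a small lookup. The sketch below is illustrative only. It assumes the score is a cosine similarity between two speaker embeddings, which is a common way to compare Resemblyzer embeddings, but this page does not say that compare_voices.py computes it exactly this way. The names cosine_similarity and interpret_score are hypothetical and are not part of the skill.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def interpret_score(score):
    """Map a similarity score to the bands listed above (hypothetical helper)."""
    if score >= 0.85:
        return "excellent match"
    if score >= 0.75:
        return "likely same speaker"
    return "different speakers"


if __name__ == "__main__":
    # Toy 3-dimensional vectors; real speaker embeddings are much longer.
    score = cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
    print(score, interpret_score(score))
```

Scores that fall between the listed bands (for example 0.845) are treated here as "likely same speaker". That boundary choice is an assumption of the sketch.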
Use ffmpeg directly for segment extraction:
# Extract 10-second segment starting at 5 seconds
ffmpeg -i input.mp3 -ss 5 -t 10 -c copy output.mp3
# Extract vocals only with Demucs (before diarization)
demucs --two-stems vocals --out ./separated input.mp3
Goal: Get a clean, single-speaker sample for ElevenLabs voice cloning
# 1. Separate speakers
HF_TOKEN=<your-hf-token> python scripts/diarize_and_slice_mps.py \
--input podcast.mp3 --outdir ./out --prefix Podcast
# 2. Review speaker files (out/Podcast_speaker1.wav, etc.)
# 3. Select best sample (5-30s, clean speech)
ffmpeg -i out/Podcast_speaker2.wav -ss 10 -t 20 -c copy sample.wav
# 4. Upload to ElevenLabs as instant voice clone
See: references/elevenlabs-cloning.md for best practices
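When step 3 is scripted, the 5-30s guideline can be checked before ffmpeg is called. The following is a minimal sketch, not part of the skill. build_extract_cmd is a hypothetical helper that only assembles the same ffmpeg arguments used in the step above and rejects durations outside the range given there.

```python
import shlex

MIN_SAMPLE_S = 5   # lower bound taken from step 3 above
MAX_SAMPLE_S = 30  # upper bound taken from step 3 above


def build_extract_cmd(src, start_s, duration_s, dst):
    """Return the ffmpeg argv for a stream-copied cut of src (hypothetical helper)."""
    if start_s < 0:
        raise ValueError("start_s must be non-negative")
    if not MIN_SAMPLE_S <= duration_s <= MAX_SAMPLE_S:
        raise ValueError(
            f"duration must be {MIN_SAMPLE_S}-{MAX_SAMPLE_S}s, got {duration_s}"
        )
    return [
        "ffmpeg", "-i", src,
        "-ss", str(start_s),
        "-t", str(duration_s),
        "-c", "copy",
        dst,
    ]


if __name__ == "__main__":
    cmd = build_extract_cmd("out/Podcast_speaker2.wav", 10, 20, "sample.wav")
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(cmd))
```

To run the command, pass the returned list to subprocess.run(cmd, check=True). Passing a list avoids shell quoting problems with unusual file names.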
Goal: Measure how well a cloned voice matches the original
# 1. Generate test audio with ElevenLabs clone
# (done via ElevenLabs web UI or API)
# 2. Compare clone vs. reference
python scripts/compare_voices.py \
--audio1 original_sample.wav \
--audio2 elevenlabs_clone.wav \
--threshold 0.85 \
--json
# 3. Interpret score:
# 0.85+ = excellent, publish-ready
# 0.80-0.84 = acceptable, may need tweaking
# < 0.80 = poor, try different sample or settings
See: references/scoring-guide.md for troubleshooting low scores
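Because compare_voices.py is documented to exit with 0 on pass and 1 on fail, step 2 can gate an automated pipeline. The sketch below is one possible wrapper under several assumptions: the script lives at scripts/compare_voices.py relative to the working directory, the current interpreter is the venv's Python, and the flags are the ones shown above. clone_passes and its runner parameter are hypothetical names added for illustration.

```python
import subprocess
import sys


def clone_passes(reference, clone, threshold=0.85, runner=subprocess.run):
    """Run compare_voices.py and report pass/fail from its exit code."""
    cmd = [
        sys.executable, "scripts/compare_voices.py",  # assumed script location
        "--audio1", reference,
        "--audio2", clone,
        "--threshold", str(threshold),
        "--json",
    ]
    result = runner(cmd, capture_output=True, text=True)
    # 0 = pass and 1 = fail per the documentation above; anything else is
    # treated as an error. Note that an uncaught Python exception also exits
    # with 1, so inspect result.stderr when a failure is unexpected.
    if result.returncode not in (0, 1):
        raise RuntimeError(f"compare_voices.py failed: {result.stderr}")
    return result.returncode == 0
```

The runner argument exists only so the function can be exercised without real audio files. In normal use it is left at its default.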
Goal: Separate and identify speakers in a conversation
# 1. Run diarization
HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
--input meeting.mp3 --outdir ./out --prefix Meeting
# 2. Check detected speakers (meta.json)
cat out/meta.json
# 3. Compare speaker pairs to confirm separation
python scripts/compare_voices.py \
--audio1 out/Meeting_speaker1.wav \
--audio2 out/Meeting_speaker2.wav
# Expected: < 0.75 if separation worked correctly
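As a further sanity check, the diarization.rttm file can be summarized as total talk time per speaker. The sketch below relies on the standard RTTM layout, in which SPEAKER rows carry the onset in the fourth field, the duration in the fifth, and the speaker label in the eighth. talk_time_by_speaker is a hypothetical helper, and the sample rows are made up for illustration rather than real pipeline output.

```python
from collections import defaultdict


def talk_time_by_speaker(rttm_text):
    """Sum segment durations (seconds) per speaker label in RTTM text."""
    totals = defaultdict(float)
    for line in rttm_text.splitlines():
        fields = line.split()
        # RTTM SPEAKER row: type file chan onset duration ortho stype name conf slat
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        totals[fields[7]] += float(fields[4])
    return dict(totals)


if __name__ == "__main__":
    # Made-up sample rows; in practice, read the text from diarization.rttm.
    sample = (
        "SPEAKER meeting 1 0.000 4.500 <NA> <NA> SPEAKER_00 <NA> <NA>\n"
        "SPEAKER meeting 1 4.500 3.000 <NA> <NA> SPEAKER_01 <NA> <NA>\n"
        "SPEAKER meeting 1 7.500 2.250 <NA> <NA> SPEAKER_00 <NA> <NA>\n"
    )
    for speaker, seconds in sorted(talk_time_by_speaker(sample).items()):
        print(f"{speaker}: {seconds:.2f}s")
```

A speaker with only a second or two of total talk time is often a sign of over-segmentation, in which case the --min-speakers and --max-speakers flags are worth adjusting.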
To force CPU for diarization: --device cpu
pyannote/speaker-diarization-3.1 on HF
HF_TOKEN env var, never CLI arg
--two-stems vocals
export HF_TOKEN=<your-token>
HF_TOKEN=<your-token> python script.py ...
demucs --two-stems vocals input.mp3
references/scoring-guide.md troubleshooting section
--min-speakers and --max-speakers flags
python -c "import torch; print(torch.backends.mps.is_available())"
--device cpu
setup_venv.sh to reinstall PyTorch
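The notes above ask for the token to come from the HF_TOKEN environment variable rather than a command-line argument. One reason is that arguments are visible in process listings. A script that follows this rule might read the token as sketched below. get_hf_token is a hypothetical helper and is not taken from the skill's scripts.

```python
import os


def get_hf_token(environ=None):
    """Return the Hugging Face token from HF_TOKEN, or raise if it is unset."""
    env = os.environ if environ is None else environ
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before running")
    return token
```

Failing early with a clear message is usually easier to debug than an authentication error raised later, when the diarization model is downloaded.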