Install

openclaw skills install audio-speaker-tools

Tools for speaker separation, voice comparison, and audio processing using Demucs, pyannote, and Resemblyzer. Use when working with multi-speaker audio, voice cloning, or speaker verification tasks.
This skill provides three main workflows: preparing a clean voice-cloning sample, validating a cloned voice against its reference, and separating the speakers in a multi-speaker recording. Each is described below.

Setup

Run once to create the venv and install dependencies:
bash scripts/setup_venv.sh
Default venv location: ./.venv
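As a quick sanity check, you can confirm that the main dependencies import from the venv's interpreter. This is a minimal sketch that assumes setup_venv.sh installs PyTorch, Demucs, pyannote.audio, and Resemblyzer into the default ./.venv location:

# Confirm the venv interpreter can import the core packages
./.venv/bin/python -c "import torch, demucs, pyannote.audio, resemblyzer; print('venv OK, torch', torch.__version__)"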
Requirements:
- ffmpeg (brew install ffmpeg)
- A Hugging Face access token for the gated pyannote diarization model, passed as HF_TOKEN
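Before running the scripts, it only takes a moment to confirm both requirements are in place (the token value below is a placeholder):

# Check that ffmpeg is on PATH and export the Hugging Face token for this shell
ffmpeg -version | head -n 1
export HF_TOKEN=<your-hf-token>
python -c "import os; print('HF_TOKEN is set' if os.getenv('HF_TOKEN') else 'HF_TOKEN is missing')"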
diarize_and_slice_mps.py

Separate speakers from multi-speaker audio:

# Basic usage
HF_TOKEN=<your-hf-token> \
/path/to/venv/bin/python scripts/diarize_and_slice_mps.py \
--input audio.mp3 \
--outdir /path/to/output \
--prefix MyShow
# With speaker constraints
HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
--input audio.mp3 \
--outdir ./out \
--min-speakers 2 \
--max-speakers 5 \
--pad-ms 100
Process:
- Runs pyannote speaker diarization on the input (using the MPS backend when available)
- Slices the audio into per-speaker segments, padding each segment by --pad-ms milliseconds
- Concatenates the segments into one WAV file per detected speaker
- Writes the diarization and segment metadata next to the audio
Output:
- <prefix>_speaker1.wav, <prefix>_speaker2.wav, etc. (one per detected speaker)
- diarization.rttm (time-stamped speaker segments)
- segments.jsonl (JSON segment metadata)
- meta.json (pipeline info and speaker index)
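To get a quick sense of how much audio each detected speaker contributed, you can check the per-speaker files with ffprobe. A small sketch, using the output directory and MyShow prefix from the basic-usage example above:

# Print the duration of each per-speaker WAV
for f in /path/to/output/MyShow_speaker*.wav; do
  printf '%s\t' "$f"
  ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "$f"
done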
Important:
- Pass the Hugging Face token via the HF_TOKEN env var, never as a CLI arg.
- Demucs output (when vocals are separated first) is written under ./separated/.

compare_voices.py

Measure similarity between two voice samples using Resemblyzer:
# Basic comparison
python scripts/compare_voices.py \
--audio1 sample1.wav \
--audio2 sample2.wav
# JSON output
python scripts/compare_voices.py \
--audio1 reference.wav \
--audio2 clone.wav \
--threshold 0.85 \
--json
# Exit code = 0 if pass, 1 if fail
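Because the script reports pass/fail through its exit code, you can gate automation on the comparison directly. A minimal sketch using the flags shown above:

# Fail the step if similarity falls below the threshold
if python scripts/compare_voices.py --audio1 reference.wav --audio2 clone.wav --threshold 0.85; then
  echo "similarity at or above threshold"
else
  echo "similarity below threshold" >&2
fi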
Scores:
- < 0.75 = Different speakers
- 0.75-0.84 = Likely same speaker
- 0.85+ = Excellent match (ideal for voice cloning validation)

Use cases:
- Validating an ElevenLabs voice clone against its reference sample
- Confirming that diarization actually produced distinct speakers
- General speaker verification between two recordings
See: references/scoring-guide.md for detailed interpretation
Audio processing

Use ffmpeg directly for segment extraction:
# Extract 10-second segment starting at 5 seconds
ffmpeg -i input.mp3 -ss 5 -t 10 -c copy output.mp3
# Extract vocals only with Demucs (before diarization)
demucs --two-stems vocals --out ./separated input.mp3
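If you pre-separate vocals, the vocal stem can then be fed to diarization instead of the raw mix. A sketch, noting that the exact output path depends on the Demucs model (htdemucs shown here) and on the input file name:

# Diarize the Demucs vocal stem rather than the original mix
HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
  --input ./separated/htdemucs/input/vocals.wav \
  --outdir ./out --prefix Clean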
Workflow 1: Prepare a voice-cloning sample

Goal: Get a clean, single-speaker sample for ElevenLabs voice cloning
# 1. Separate speakers
HF_TOKEN=<your-hf-token> python scripts/diarize_and_slice_mps.py \
--input podcast.mp3 --outdir ./out --prefix Podcast
# 2. Review speaker files (out/Podcast_speaker1.wav, etc.)
# 3. Select best sample (5-30s, clean speech)
ffmpeg -i out/Podcast_speaker2.wav -ss 10 -t 20 -c copy sample.wav
# 4. Upload to ElevenLabs as instant voice clone
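Optionally, re-encode the clip to a plain mono PCM WAV before upload. This is an assumption on my part rather than an ElevenLabs requirement; mono 44.1 kHz PCM is simply a widely accepted, safe format:

# Re-encode the selected clip to mono, 44.1 kHz PCM WAV
ffmpeg -i sample.wav -ac 1 -ar 44100 sample_mono.wav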
See: references/elevenlabs-cloning.md for best practices
Workflow 2: Validate a voice clone

Goal: Measure how well a cloned voice matches the original
# 1. Generate test audio with ElevenLabs clone
# (done via ElevenLabs web UI or API)
# 2. Compare clone vs. reference
python scripts/compare_voices.py \
--audio1 original_sample.wav \
--audio2 elevenlabs_clone.wav \
--threshold 0.85 \
--json
# 3. Interpret score:
# 0.85+ = excellent, publish-ready
# 0.80-0.84 = acceptable, may need tweaking
# < 0.80 = poor, try different sample or settings
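If you are testing several candidate clones, the same comparison can run in a loop. A sketch in which clones/ is a hypothetical directory of generated samples:

# Compare each candidate clone against the same reference
for f in clones/*.wav; do
  echo "== $f"
  python scripts/compare_voices.py --audio1 original_sample.wav --audio2 "$f" --threshold 0.85 \
    || echo "below threshold"
done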
See: references/scoring-guide.md for troubleshooting low scores
Workflow 3: Diarize a multi-speaker conversation

Goal: Separate and identify speakers in a conversation
# 1. Run diarization
HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
--input meeting.mp3 --outdir ./out --prefix Meeting
# 2. Check detected speakers (meta.json)
cat out/meta.json
# 3. Compare speaker pairs to confirm separation
python scripts/compare_voices.py \
--audio1 out/Meeting_speaker1.wav \
--audio2 out/Meeting_speaker2.wav
# Expected: < 0.75 if separation worked correctly
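With more than two detected speakers, it can help to compare every pair; any pair scoring 0.75 or higher suggests two outputs may belong to the same person. A sketch using the file pattern from step 1:

# Compare every pair of per-speaker files
files=(out/Meeting_speaker*.wav)
for ((i=0; i<${#files[@]}; i++)); do
  for ((j=i+1; j<${#files[@]}; j++)); do
    echo "== ${files[i]} vs ${files[j]}"
    python scripts/compare_voices.py --audio1 "${files[i]}" --audio2 "${files[j]}"
  done
done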
Notes

- To force CPU for diarization, pass --device cpu.
- The diarization model is pyannote/speaker-diarization-3.1 on Hugging Face; accept its terms, then supply your token via the HF_TOKEN env var, never as a CLI arg (export HF_TOKEN=<your-token>, or HF_TOKEN=<your-token> python script.py ...).
- To isolate vocals with Demucs, use --two-stems vocals (demucs --two-stems vocals input.mp3).
- For low similarity scores, see the troubleshooting section of references/scoring-guide.md.
- If the wrong number of speakers is detected, constrain diarization with the --min-speakers and --max-speakers flags.
- To check whether the MPS backend is available, run python -c "import torch; print(torch.backends.mps.is_available())". If it is not, use --device cpu or rerun setup_venv.sh to reinstall PyTorch.
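If MPS turns out to be unavailable, the diarization run can simply be repeated on the CPU, reusing the flags documented above:

# Rerun diarization on the CPU backend
HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
  --input audio.mp3 --outdir ./out --device cpu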