Audio Speaker Tools

Speaker separation, voice comparison, and audio processing tools. Use when working with multi-speaker audio, voice cloning, or speaker verification tasks inc...

MIT-0 · Free to use, modify, and redistribute. No attribution required.

by @cmfinlan

Security Scan

VirusTotal: Benign

OpenClaw: Suspicious (medium confidence)

Purpose & Capability

The code and SKILL.md implement speaker separation (pyannote/Demucs) and voice comparison (Resemblyzer), which matches the advertised purpose. However, the registry metadata claims no required env vars or credentials, while the runtime instructions and scripts require a HuggingFace token (HF_TOKEN) for model downloads. Users should be aware of this inconsistency: HF_TOKEN is not declared in the metadata.

Instruction Scope

SKILL.md and the scripts are explicit about workflows and tools (ffmpeg, Demucs, pyannote, Resemblyzer). The scripts convert audio, run diarization, export RTTM/segments, and compute embeddings — all within the stated domain. They will download pretrained models from HuggingFace (Pipeline.from_pretrained), which involves network activity to HF and requires the HF_TOKEN for gated model access. Minor inconsistency: SKILL.md says 'never as CLI arg' for HF_TOKEN, but diarize_and_slice_mps.py accepts a --token argument (it still defaults to HF_TOKEN). No other unexpected file reads, hidden endpoints, or data exfiltration code was found in the provided sources.

Install Mechanism

This is instruction-plus-scripts (no platform install spec). A provided setup_venv.sh will create a virtualenv and pip-install packages including torch, demucs, pyannote.audio, resemblyzer, pydub, librosa. Installing via pip from PyPI is expected for this functionality but the script does not pin versions or verify package sources (no checksums). Installing PyTorch and ML packages can be slow and may pull many wheels — this is normal for the task but has typical supply‑chain risk if you don’t audit or pin versions.

Credentials

The runtime requires a HuggingFace token (HF_TOKEN) to load gated pyannote models; the scripts also accept HUGGINGFACE_TOKEN. The registry metadata lists no required env vars — this omission is an inconsistency and increases the chance a user will overlook the sensitive HF_TOKEN requirement. No other credentials are requested, which is proportionate to the purpose, but HF_TOKEN is sensitive and necessary for core functionality.

Persistence & Privilege

The skill does not request permanent/always-on privileges (always:false), does not modify other skills, and does not claim to persist credentials on the agent. It runs as a set of scripts invoked by the user; nothing here implies elevated platform privileges.

What to consider before installing

Before installing or running this skill:

- Expect to provide a HuggingFace token (HF_TOKEN). The registry metadata does not list it, so do not overlook this sensitive credential. Use a token with minimal privileges, and rotate or revoke it when done.
- Audit setup_venv.sh and its dependencies: it pip-installs many ML packages without version pins or checksums. Run it in an isolated environment (container or VM) and consider pinning package versions before installing.
- Be aware that the scripts download pretrained models from HuggingFace (network I/O). If you need to avoid external downloads, mirror and verify the required models first.
- The functionality (separation, comparison, and preparing samples for ElevenLabs) is consistent with the code, but cloning and verifying voices has privacy and legal implications. Ensure you have consent for any voice processing or uploads to third parties (e.g., ElevenLabs).
- If you use HF_TOKEN, avoid passing it on the CLI or exposing it in shared logs; the scripts accept --token, but SKILL.md recommends the env var. Consider using a secrets manager, and run the tool in an environment where stdout/stderr are not exposed.
- For higher assurance, ask the publisher for provenance (source repo, homepage, signed releases) and for pinned dependency versions or a lockfile. Run a first test in an isolated VM with non-sensitive sample audio.

Current version: v1.0.0

License

MIT-0

Free to use, modify, and redistribute. No attribution required.

Terms: https://spdx.org/licenses/MIT-0.html

SKILL.md

Audio Speaker Tools

Tools for speaker separation, voice comparison, and audio processing using Demucs, pyannote, and Resemblyzer.

Overview

This skill provides three main workflows:

Speaker separation - Extract per-speaker audio from multi-speaker recordings
Voice comparison - Measure speaker similarity between two audio files
Audio processing - Segment extraction and voice isolation

Prerequisites

Setup Virtual Environment

Run once to create the venv and install dependencies:

bash scripts/setup_venv.sh

Default venv location: ./.venv

Requirements:

Python 3.9+
ffmpeg (on macOS: brew install ffmpeg)
HuggingFace token (set as env var HF_TOKEN)
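
The requirements above can be checked before running any script. The following is a minimal sketch, not part of the skill: `check_prerequisites` is a hypothetical helper, and its `python_version` and `which` parameters exist only so the checks can be exercised without touching the real system.

```python
import shutil
import sys


def check_prerequisites(env, python_version=sys.version_info[:2], which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # The skill states Python 3.9+ as its minimum.
    if tuple(python_version) < (3, 9):
        problems.append("Python 3.9+ is required")
    # ffmpeg must be discoverable on PATH for audio conversion.
    if which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    # The token is read from the environment, never from argv.
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set")
    return problems


if __name__ == "__main__":
    import os

    for problem in check_prerequisites(os.environ):
        print("missing:", problem)
```

Running the file directly prints one line per unmet requirement and nothing when everything is in place.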

Scripts

1. Speaker Separation: `diarize_and_slice_mps.py`

Separate speakers from multi-speaker audio:

# Basic usage
HF_TOKEN=<your-hf-token> \
  /path/to/venv/bin/python scripts/diarize_and_slice_mps.py \
  --input audio.mp3 \
  --outdir /path/to/output \
  --prefix MyShow

# With speaker constraints
HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
  --input audio.mp3 \
  --outdir ./out \
  --min-speakers 2 \
  --max-speakers 5 \
  --pad-ms 100

Process:

Converts input to 16kHz mono WAV
Runs Demucs vocal/background separation (optional, for cleaner input)
Runs pyannote speaker diarization (MPS-accelerated)
Extracts concatenated per-speaker WAV files

Output:

<prefix>_speaker1.wav, <prefix>_speaker2.wav, etc. (one per detected speaker)
diarization.rttm (time-stamped speaker segments)
segments.jsonl (JSON segments metadata)
meta.json (pipeline info and speaker index)
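
To get a quick sense of how much each speaker talks, diarization.rttm can be summarized directly. The sketch below assumes the file follows the standard RTTM SPEAKER record layout (onset in field 4, duration in field 5, speaker label in field 8); `speaker_durations` is a hypothetical helper and the labels in `SAMPLE` are illustrative, not taken from real output.

```python
from collections import defaultdict


def speaker_durations(rttm_text):
    """Sum speech time in seconds per speaker label from RTTM text."""
    totals = defaultdict(float)
    for line in rttm_text.splitlines():
        fields = line.split()
        # Standard SPEAKER records look like:
        # SPEAKER <file> <chan> <onset> <dur> <NA> <NA> <speaker> <NA> <NA>
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        totals[fields[7]] += float(fields[4])
    return dict(totals)


# Illustrative records only; real labels depend on the diarization run.
SAMPLE = """\
SPEAKER audio 1 0.50 2.00 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER audio 1 2.50 1.25 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER audio 1 4.00 1.00 <NA> <NA> SPEAKER_00 <NA> <NA>
"""

print(speaker_durations(SAMPLE))  # {'SPEAKER_00': 3.0, 'SPEAKER_01': 1.25}
```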

Important:

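Always pass the HF token via the HF_TOKEN env var rather than as a CLI argument (the script accepts --token, but command-line arguments can leak through shell history and process listings)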
MPS first, CPU fallback - Script prefers Metal GPU, falls back to CPU if unavailable
Default output: ./separated/
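
When the script is launched from another Python program, the token guidance above translates to putting the token in the child's environment and keeping it out of the argument list. This is a sketch under assumptions: `build_diarize_call` and `run_diarize` are hypothetical wrappers, and only flags shown in this document are used.

```python
import os
import subprocess
import sys


def build_diarize_call(input_path, outdir, token, prefix=None, base_env=None):
    """Return (argv, env) with the token present only in env, never in argv."""
    argv = [
        sys.executable,
        "scripts/diarize_and_slice_mps.py",
        "--input", input_path,
        "--outdir", outdir,
    ]
    if prefix:
        argv += ["--prefix", prefix]
    # Copy the environment so the caller's mapping is not mutated.
    env = dict(os.environ if base_env is None else base_env)
    env["HF_TOKEN"] = token
    return argv, env


def run_diarize(input_path, outdir, token, prefix=None):
    """Run the diarization script and return its exit code."""
    argv, env = build_diarize_call(input_path, outdir, token, prefix)
    return subprocess.run(argv, env=env, check=False).returncode
```

Because the token never appears in `argv`, it will not show up in process listings, although it is still visible to anything that can read the child process environment.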

2. Voice Comparison: `compare_voices.py`

Measure similarity between two voice samples using Resemblyzer:

# Basic comparison
python scripts/compare_voices.py \
  --audio1 sample1.wav \
  --audio2 sample2.wav

# JSON output
python scripts/compare_voices.py \
  --audio1 reference.wav \
  --audio2 clone.wav \
  --threshold 0.85 \
  --json

# Exit code = 0 if pass, 1 if fail

Scores:

< 0.75 = Different speakers
0.75-0.84 = Likely same speaker
0.85+ = Excellent match (ideal for voice cloning validation)
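
The score bands above can be applied programmatically. The sketch below assumes the score is a cosine similarity between two speaker embeddings; `cosine_similarity` and `score_band` are hypothetical helpers, and treating values between 0.84 and 0.85 as "likely same speaker" is an assumption, since the list does not cover that gap.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cannot compare a zero-length embedding")
    return dot / norm


def score_band(score):
    """Map a similarity score to the bands listed in this document."""
    if score >= 0.85:
        return "excellent match"
    if score >= 0.75:
        return "likely same speaker"
    return "different speakers"


print(score_band(0.80))  # likely same speaker
```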

Use cases:

Voice clone quality assessment (compare clone vs. original)
Speaker verification (authenticate speaker identity)
Validate speaker separation (confirm separated speakers are distinct)

See: references/scoring-guide.md for detailed interpretation

3. Audio Trimming

Use ffmpeg directly for segment extraction:

# Extract 10-second segment starting at 5 seconds
ffmpeg -i input.mp3 -ss 5 -t 10 -c copy output.mp3

# Extract vocals only with Demucs (before diarization)
demucs --two-stems vocals --out ./separated input.mp3
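
If trimming is driven from Python, the ffmpeg command above can be built as an argument list. This is a sketch: `ffmpeg_trim_cmd` is a hypothetical helper. With `-c copy` ffmpeg does not re-encode, so cut points land on packet boundaries; setting `stream_copy=False` drops that flag and lets ffmpeg re-encode for more precise cuts.

```python
def ffmpeg_trim_cmd(src, dst, start_s, duration_s, stream_copy=True):
    """Build an ffmpeg argument list that extracts duration_s seconds from start_s."""
    if start_s < 0 or duration_s <= 0:
        raise ValueError("start_s must be >= 0 and duration_s must be > 0")
    cmd = ["ffmpeg", "-i", src, "-ss", str(start_s), "-t", str(duration_s)]
    if stream_copy:
        # Copy the stream without re-encoding (fast, but less precise cuts).
        cmd += ["-c", "copy"]
    cmd.append(dst)
    return cmd


print(" ".join(ffmpeg_trim_cmd("input.mp3", "output.mp3", 5, 10)))
# ffmpeg -i input.mp3 -ss 5 -t 10 -c copy output.mp3
```

The returned list can be passed to `subprocess.run` as-is, which avoids shell quoting issues with file names that contain spaces.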

Workflows

Workflow 1: Extract Clean Voice Sample for Cloning

Goal: Get a clean, single-speaker sample for ElevenLabs voice cloning

# 1. Separate speakers
HF_TOKEN=<your-hf-token> python scripts/diarize_and_slice_mps.py \
  --input podcast.mp3 --outdir ./out --prefix Podcast

# 2. Review speaker files (out/Podcast_speaker1.wav, etc.)

# 3. Select best sample (5-30s, clean speech)
ffmpeg -i out/Podcast_speaker2.wav -ss 10 -t 20 -c copy sample.wav

# 4. Upload to ElevenLabs as instant voice clone

See: references/elevenlabs-cloning.md for best practices

Workflow 2: Validate Voice Clone Quality

Goal: Measure how well a cloned voice matches the original

# 1. Generate test audio with ElevenLabs clone
# (done via ElevenLabs web UI or API)

# 2. Compare clone vs. reference
python scripts/compare_voices.py \
  --audio1 original_sample.wav \
  --audio2 elevenlabs_clone.wav \
  --threshold 0.85 \
  --json

# 3. Interpret score:
#    0.85+ = excellent, publish-ready
#    0.80-0.84 = acceptable, may need tweaking
#    < 0.80 = poor, try different sample or settings

See: references/scoring-guide.md for troubleshooting low scores

Workflow 3: Multi-Speaker Conversation Analysis

Goal: Separate and identify speakers in a conversation

# 1. Run diarization
HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
  --input meeting.mp3 --outdir ./out --prefix Meeting

# 2. Check detected speakers (meta.json)
cat out/meta.json

# 3. Compare speaker pairs to confirm separation
python scripts/compare_voices.py \
  --audio1 out/Meeting_speaker1.wav \
  --audio2 out/Meeting_speaker2.wav

# Expected: < 0.75 if separation worked correctly
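
With more than two speakers, step 3 has to be repeated for every pair. The sketch below enumerates the pairs and flags any that score at or above 0.75, which would suggest two output files contain the same voice. `confused_pairs` is a hypothetical helper, and `score_fn` stands in for whatever produces the score (for example, a wrapper around compare_voices.py); its interface is an assumption.

```python
from itertools import combinations


def confused_pairs(speaker_files, score_fn, threshold=0.75):
    """Return (file_a, file_b, score) for every pair scoring at or above threshold."""
    flagged = []
    for file_a, file_b in combinations(sorted(speaker_files), 2):
        score = score_fn(file_a, file_b)
        if score >= threshold:
            flagged.append((file_a, file_b, score))
    return flagged


# Illustrative scores only, not real measurements.
FAKE_SCORES = {
    ("s1.wav", "s2.wav"): 0.62,
    ("s1.wav", "s3.wav"): 0.81,
    ("s2.wav", "s3.wav"): 0.58,
}

print(confused_pairs(["s3.wav", "s1.wav", "s2.wav"], lambda a, b: FAKE_SCORES[(a, b)]))
# [('s1.wav', 's3.wav', 0.81)]
```

An empty result means every pair fell below the threshold, which is the expected outcome when separation worked.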

Technical Notes

Device Acceleration

pyannote diarization: MPS (Metal) by default, CPU fallback
Resemblyzer: CPU only (no GPU acceleration)
Demucs: MPS by default when available

To force CPU for diarization: --device cpu
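
The "MPS first, CPU fallback" behavior can be sketched as follows. This illustrates the described policy and is not the script's actual code: `pick_device` and `mps_is_available` are hypothetical names, and the availability probe returns False when PyTorch is not installed.

```python
def mps_is_available():
    """Report whether PyTorch is installed and exposes a usable MPS backend."""
    try:
        import torch
    except ImportError:
        return False
    mps = getattr(torch.backends, "mps", None)
    return bool(mps is not None and mps.is_available())


def pick_device(requested=None, mps_available=False):
    """Prefer MPS unless CPU is requested or MPS is unavailable."""
    if requested == "cpu":
        return "cpu"
    return "mps" if mps_available else "cpu"


if __name__ == "__main__":
    print(pick_device(mps_available=mps_is_available()))
```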

Audio Formats

Input: Any format supported by ffmpeg (wav, mp3, flac, m4a, etc.)
Processing: Internally converted to 16kHz mono WAV for diarization
Output: WAV format (44.1kHz stereo preserved from source)
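
The 16kHz mono conversion can be reproduced and verified outside the script. The sketch below is an assumption about how such a conversion might be done, not the script's own code: `to_16k_mono_cmd` builds an ffmpeg argument list using the standard `-ac` (channels) and `-ar` (sample rate) options, and `wav_format` reads a WAV header with the standard-library `wave` module. `demo` writes one second of silence so the check can run without ffmpeg.

```python
import os
import tempfile
import wave


def to_16k_mono_cmd(src, dst):
    """Build an ffmpeg argument list that converts src to 16 kHz mono audio."""
    return ["ffmpeg", "-i", src, "-ac", "1", "-ar", "16000", dst]


def wav_format(path):
    """Return sample rate, channel count, and duration in seconds of a WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        frames = wf.getnframes()
    return {"rate": rate, "channels": channels, "seconds": frames / rate}


def demo():
    """Write one second of 16-bit silence and read its format back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "silence.wav")
        with wave.open(path, "wb") as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)
            wf.setframerate(16000)
            wf.writeframes(b"\x00\x00" * 16000)
        return wav_format(path)


print(demo())  # {'rate': 16000, 'channels': 1, 'seconds': 1.0}
```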

HuggingFace Token

Required for: pyannote speaker diarization
Access: Accept the user conditions for the gated repo pyannote/speaker-diarization-3.1 on HuggingFace
Storage: Any secure secrets manager
Usage: Always pass via the HF_TOKEN env var, never as a CLI argument

Sample Quality Tips

Shorter is better: 5-30s clean samples often score higher than 60+ second samples
Clean audio: Remove background noise with Demucs --two-stems vocals
Single speaker: Ensure isolated voice, not mixed conversation
Good recording: Studio mic > phone mic for voice comparison accuracy

References

elevenlabs-cloning.md - Best practices for ElevenLabs instant voice cloning (model settings, sample selection, proven configurations)
scoring-guide.md - How to interpret Resemblyzer similarity scores (thresholds, use cases, troubleshooting)

Common Issues

"Missing HF token" error

Export token before running: export HF_TOKEN=<your-token>
Or pass inline: HF_TOKEN=<your-token> python script.py ...

Low voice comparison scores for same speaker

Try shorter, cleaner samples (5-30s)
Use Demucs to isolate vocals: demucs --two-stems vocals input.mp3
Ensure consistent recording quality (same mic, environment)
See references/scoring-guide.md troubleshooting section

Diarization not detecting all speakers

Adjust --min-speakers and --max-speakers flags
Check audio quality (clear speech, minimal overlap)
Try longer audio (30+ seconds) for better speaker modeling

MPS/Metal acceleration not working

Ensure PyTorch with MPS support: python -c "import torch; print(torch.backends.mps.is_available())"
Fall back to CPU: --device cpu
Re-run setup_venv.sh to reinstall PyTorch
