Audio Speaker Tools
Pass. Audited by ClawScan on May 10, 2026.
Overview
This skill appears to do what it says—local speaker separation and voice comparison—but it handles sensitive voice data, uses a HuggingFace token, and installs unpinned ML packages.
This skill is reasonable for local audio diarization and voice comparison. Before installing, use a dedicated virtual environment, protect your HuggingFace token, choose output paths carefully, and only process or upload voice samples when you have permission from the speaker.
Findings (5)
Artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.
Installing the skill may pull changing code from package repositories into a local virtual environment.
The setup script installs several third-party packages without version pins or a lockfile. This is expected for the audio/ML purpose, but package contents and versions may change over time.
pip install --quiet torch torchvision torchaudio ... pip install --quiet demucs pyannote.audio pydub resemblyzer librosa
Run setup only in a dedicated virtual environment, review packages if possible, and consider pinning versions before production or sensitive use.
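The pinning advice above can be sketched as a small helper that records whichever of the setup script's packages are actually installed in the virtual environment. The package list mirrors the install command; any package not yet installed is simply skipped. This is an illustrative sketch, not part of the skill:

```python
# Sketch: emit requirements.txt pin lines for the skill's packages, using
# the versions currently installed in this virtual environment.
from importlib import metadata

# Names taken from the setup script's pip install commands.
SKILL_PACKAGES = ["torch", "torchaudio", "demucs", "pyannote.audio",
                  "pydub", "resemblyzer", "librosa"]

def pin_lines(versions):
    """Turn a {name: version} mapping into requirements.txt pin lines."""
    return [f"{name}=={ver}" for name, ver in sorted(versions.items())]

def installed_versions(names):
    """Look up installed versions, skipping packages that are not present."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found

if __name__ == "__main__":
    # Paste the output into requirements.txt, then install with
    # `pip install -r requirements.txt` for reproducible setups.
    print("\n".join(pin_lines(installed_versions(SKILL_PACKAGES))))
```

Run this once after a setup you have reviewed, commit the resulting file, and future installs will pull the same versions.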
A HuggingFace token may be exposed through shell history or process listings if passed as a CLI argument.
The diarization workflow uses a HuggingFace token to access the pyannote model. This is purpose-aligned, but it is a credential and the script also permits passing it on the command line.
ap.add_argument("--token", default=os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_TOKEN")) ... Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", token=args.token)
Prefer the documented HF_TOKEN environment variable or a secrets manager, and use a least-privilege token where possible.
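One way to follow this recommendation is to make the script refuse a command-line token outright and accept only the environment variables it already documents. The flag and variable names below come from the quoted snippet; resolve_token is a hypothetical helper, not the skill's own code:

```python
# Sketch: resolve the HuggingFace token from the environment only, so it
# never appears in shell history or process listings.
import argparse
import os
import sys

def resolve_token(argv=None):
    ap = argparse.ArgumentParser()
    ap.add_argument("--token", default=None,
                    help="deprecated: export HF_TOKEN instead of passing a value here")
    args = ap.parse_args(argv)
    if args.token is not None:
        # A value passed here is already visible in `ps` output; reject it.
        sys.exit("refusing --token on the command line; export HF_TOKEN instead")
    token = os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_TOKEN")
    if not token:
        sys.exit("set HF_TOKEN (or HUGGINGFACE_TOKEN) in the environment")
    return token
```

The token still reaches Pipeline.from_pretrained exactly as before; only the delivery path changes.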
If pointed at the wrong output directory or filenames, generated files could overwrite existing local outputs.
The script invokes ffmpeg with the -y flag, which silently overwrites any existing file at the output path. This is expected for audio processing, but users should be deliberate about input and output paths.
cmd = ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", "-vn", dst] ... subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
Use a dedicated output folder, review paths before running, and avoid protected or shared directories.
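The path advice can be sketched as a wrapper that confines outputs to a dedicated folder and swaps ffmpeg's -y for -n, so existing files are never overwritten. OUTPUT_DIR and both helper names are illustrative assumptions, not part of the skill:

```python
# Sketch: build the same mono/16 kHz conversion command the skill uses, but
# with overwrite protection and a fixed output folder.
from pathlib import Path

OUTPUT_DIR = Path("audio_out")   # dedicated output folder (assumed convention)

def safe_output_path(src: str, out_dir: Path = OUTPUT_DIR) -> Path:
    """Derive the destination path and refuse to clobber an existing file."""
    dst = out_dir / (Path(src).stem + ".16k.wav")
    if dst.exists():
        raise FileExistsError(f"refusing to overwrite {dst}")
    return dst

def build_cmd(src: str, dst: Path) -> list:
    # -n makes ffmpeg itself refuse to overwrite existing files
    # (the opposite of the -y flag quoted above).
    return ["ffmpeg", "-n", "-i", src, "-ac", "1", "-ar", "16000", "-vn", str(dst)]
```

Pass the result to subprocess.run(cmd, check=True) as in the original script; the only behavioral change is that a pre-existing output aborts the run instead of being replaced.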
A user's or speaker's voice sample may leave the local machine if the ElevenLabs workflow is followed.
The workflow can involve sending selected voice samples to an external voice-cloning provider. The upload is user-directed and consistent with the stated purpose, but voice samples are sensitive biometric data.
# 4. Upload to ElevenLabs as instant voice clone
Upload only authorized voice samples, confirm consent, and review the provider's privacy and retention terms.
A similarity score may be mistaken for definitive identity proof.
The guide presents practical thresholds for speaker verification and authentication. This is aligned with the skill, but users could over-rely on a similarity score for security decisions.
Voice Authentication (Security) - Strict: 0.85+ (low false positive rate)
Use these scores as advisory evidence only, not as the sole control for authentication or high-stakes identity decisions.
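As a sketch of advisory use, a similarity score can be mapped to a label that always calls for a second factor rather than granting access directly. The 0.85 strict threshold comes from the guide; the 0.70 band and the label wording are illustrative assumptions:

```python
# Sketch: treat a speaker-similarity score as one advisory signal among
# several, never as a final authentication decision on its own.
def classify_similarity(score: float) -> str:
    """Map a similarity score to an advisory label for a human reviewer."""
    if score >= 0.85:          # the guide's "strict" verification threshold
        return "strong match (advisory) - still require a second factor"
    if score >= 0.70:          # assumed intermediate band, not from the guide
        return "possible match - request additional verification"
    return "no reliable match"
```

Downstream logic should combine this label with other evidence (device, location, explicit user confirmation) before any high-stakes decision.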
