Audio Speaker Tools
Pass. Audited by ClawScan on May 10, 2026.
Overview
This skill appears to do what it says—local speaker separation and voice comparison—but it handles sensitive voice data, uses a HuggingFace token, and installs unpinned ML packages.
This skill is reasonable for local audio diarization and voice comparison. Before installing, use a dedicated virtual environment, protect your HuggingFace token, choose output paths carefully, and only process or upload voice samples when you have permission from the speaker.
Findings (5)
Artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.
Installing the skill may pull changing code from package repositories into a local virtual environment.
The setup script installs several third-party packages without version pins or a lockfile. This is expected for the audio/ML purpose, but package contents and versions may change over time.
pip install --quiet torch torchvision torchaudio ... pip install --quiet demucs pyannote.audio pydub resemblyzer librosa
Run setup only in a dedicated virtual environment, review packages if possible, and consider pinning versions before production or sensitive use.
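The pinning advice above can be sketched as a small helper that records whichever of the setup script's packages are actually installed in the virtual environment. The package list mirrors the install command; any package not yet installed is simply skipped. This is an illustrative sketch, not part of the skill:

```python
# Sketch: emit requirements.txt pin lines for the skill's packages, using
# the versions currently installed in this virtual environment.
from importlib import metadata

# Names taken from the setup script's pip install commands.
SKILL_PACKAGES = ["torch", "torchaudio", "demucs", "pyannote.audio",
                  "pydub", "resemblyzer", "librosa"]

def pin_lines(versions):
    """Turn a {name: version} mapping into requirements.txt pin lines."""
    return [f"{name}=={ver}" for name, ver in sorted(versions.items())]

def installed_versions(names):
    """Look up installed versions, skipping packages that are not present."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found

if __name__ == "__main__":
    # Paste the output into requirements.txt, then install with
    # `pip install -r requirements.txt` for reproducible setups.
    print("\n".join(pin_lines(installed_versions(SKILL_PACKAGES))))
```

Run this once after a setup you have reviewed, commit the resulting file, and future installs will pull the same versions.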
A HuggingFace token may be exposed through shell history or process listings if passed as a CLI argument.
The diarization workflow uses a HuggingFace token to access the pyannote model. This is purpose-aligned, but it is a credential and the script also permits passing it on the command line.
ap.add_argument("--token", default=os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_TOKEN")) ... Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", token=args.token)
Prefer the documented HF_TOKEN environment variable or a secrets manager, and use a least-privilege token where possible.
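One way to follow this recommendation is to make the script refuse a command-line token outright and accept only the environment variables it already documents. The flag and variable names below come from the quoted snippet; resolve_token is a hypothetical helper, not the skill's own code:

```python
# Sketch: resolve the HuggingFace token from the environment only, so it
# never appears in shell history or process listings.
import argparse
import os
import sys

def resolve_token(argv=None):
    ap = argparse.ArgumentParser()
    ap.add_argument("--token", default=None,
                    help="deprecated: export HF_TOKEN instead of passing a value here")
    args = ap.parse_args(argv)
    if args.token is not None:
        # A value passed here is already visible in `ps` output; reject it.
        sys.exit("refusing --token on the command line; export HF_TOKEN instead")
    token = os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_TOKEN")
    if not token:
        sys.exit("set HF_TOKEN (or HUGGINGFACE_TOKEN) in the environment")
    return token
```

The token still reaches Pipeline.from_pretrained exactly as before; only the delivery path changes.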
If pointed at the wrong output directory or filenames, generated files could overwrite existing local outputs.
The script invokes ffmpeg with the -y flag, which silently overwrites any existing file at the output path. This is expected for audio processing, but users should be deliberate about input and output paths.
cmd = ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", "-vn", dst] ... subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
Use a dedicated output folder, review paths before running, and avoid protected or shared directories.
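The path advice can be sketched as a wrapper that confines outputs to a dedicated folder and swaps ffmpeg's -y for -n, so existing files are never overwritten. OUTPUT_DIR and both helper names are illustrative assumptions, not part of the skill:

```python
# Sketch: build the same mono/16 kHz conversion command the skill uses, but
# with overwrite protection and a fixed output folder.
from pathlib import Path

OUTPUT_DIR = Path("audio_out")   # dedicated output folder (assumed convention)

def safe_output_path(src: str, out_dir: Path = OUTPUT_DIR) -> Path:
    """Derive the destination path and refuse to clobber an existing file."""
    dst = out_dir / (Path(src).stem + ".16k.wav")
    if dst.exists():
        raise FileExistsError(f"refusing to overwrite {dst}")
    return dst

def build_cmd(src: str, dst: Path) -> list:
    # -n makes ffmpeg itself refuse to overwrite existing files
    # (the opposite of the -y flag quoted above).
    return ["ffmpeg", "-n", "-i", src, "-ac", "1", "-ar", "16000", "-vn", str(dst)]
```

Pass the result to subprocess.run(cmd, check=True) as in the original script; the only behavioral change is that a pre-existing output aborts the run instead of being replaced.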
A user's or speaker's voice sample may leave the local machine if the ElevenLabs workflow is followed.
The workflow can involve sending selected voice samples to an external voice-cloning provider. The upload is user-directed and consistent with the stated purpose, but voice samples are sensitive biometric data.
# 4. Upload to ElevenLabs as instant voice clone
Upload only authorized voice samples, confirm consent, and review the provider's privacy and retention terms.
A similarity score may be mistaken for definitive identity proof.
The guide presents practical thresholds for speaker verification and authentication. This is aligned with the skill, but users could over-rely on a similarity score for security decisions.
Voice Authentication (Security) - Strict: 0.85+ (low false positive rate)
Use these scores as advisory evidence only, not as the sole control for authentication or high-stakes identity decisions.
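As a sketch of advisory use, a similarity score can be mapped to a label that always calls for a second factor rather than granting access directly. The 0.85 strict threshold comes from the guide; the 0.70 band and the label wording are illustrative assumptions:

```python
# Sketch: treat a speaker-similarity score as one advisory signal among
# several, never as a final authentication decision on its own.
def classify_similarity(score: float) -> str:
    """Map a similarity score to an advisory label for a human reviewer."""
    if score >= 0.85:          # the guide's "strict" verification threshold
        return "strong match (advisory) - still require a second factor"
    if score >= 0.70:          # assumed intermediate band, not from the guide
        return "possible match - request additional verification"
    return "no reliable match"
```

Downstream logic should combine this label with other evidence (device, location, explicit user confirmation) before any high-stakes decision.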
