Voice Clone Bot

Security checks across malware telemetry and agentic risk

Overview

This voice-cloning skill is mostly purpose-aligned, but it grants broad local installation, persistence, filesystem-write, and voice-biometric authority without enough user control or safety boundaries.

Install only if you are comfortable with a local ML service that can download and run third-party code, register itself globally, run a background daemon, store large model files, and process sensitive voice samples. Use it only with explicit speaker consent, avoid exposing the local API beyond localhost, do not pass arbitrary output directories, and review or pin dependencies before production use.

SkillSpector

By NVIDIA

Vulnerability Patterns

Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Supply Chain: Unpinned Dependencies, External Script Fetching, Obfuscated Code
Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
Output Handling: Unvalidated Output Injection, Cross-Context Output, Unbounded Output
Trigger Abuse: Overly Broad Trigger, Shadow Command Trigger, Keyword Baiting Trigger

Findings (36)

Lp3

Medium

Category: MCP Least Privilege
Confidence: 90%
Finding: The skill declares no permissions while instructing use of shell execution, file access, environment variables, and network-dependent auto-installation behavior. This creates a transparency and policy-enforcement gap: hosts may invoke the skill believing it is low-risk while it can download code, read local paths, and execute commands.

Tp4

High

Category: MCP Tool Poisoning
Confidence: 97%
Finding: The documented purpose is simple voice cloning, but the behavior includes self-installation, repository cloning, dependency management, model downloads, daemon lifecycle management, and destructive uninstall actions. This mismatch can mislead operators and users about the true trust boundary and allows significant host modification beyond the expected audio-generation task.

Description-Behavior Mismatch

Medium

Confidence: 95%
Finding: The installer modifies host-level agent state by creating a symlink in the global skills directory, which exceeds the narrowly described voice-cloning behavior in the manifest. Persistent self-registration is dangerous because it changes future agent behavior and trust boundaries without explicit user consent, making the skill easier to invoke unexpectedly or survive beyond the current install session.

Description-Behavior Mismatch

Low

Confidence: 84%
Finding: The script creates a persistent model directory under the user's home directory, which is behavior not disclosed by the manifest and leaves long-lived artifacts on the host. While model caching can be legitimate for TTS systems, doing so silently in a global location can surprise users, consume disk space, and create persistence beyond the expected lifecycle of a single skill run.

Context-Inappropriate Capability

Medium

Confidence: 97%
Finding: Auto-registering the skill by symlinking into the agent host's global plugin routing changes platform-wide execution behavior in a way unrelated to synthesizing speech itself. In the context of a voice-cloning skill, this is more concerning because the manifest describes an end-user media function, not privileged modification of host plugin discovery or routing state.

Context-Inappropriate Capability

Medium

Confidence: 88%
Finding: Installing whatever is listed in requirements.txt allows arbitrary third-party code execution during package installation and may introduce dependencies unrelated to the advertised voice-cloning function. This is especially risky when combined with network access and no pinning, review, or user warning, because malicious or compromised packages can run setup hooks at install time.

Intent-Code Divergence

Medium

Confidence: 81%
Finding: The script describes the environment as a strict isolated sandbox, but it also writes to global host directories and alters registration state, which is misleading about the actual security boundaries. Misrepresenting isolation can cause operators to grant more trust than warranted and overlook persistent host modifications.

Context-Inappropriate Capability

High

Confidence: 99%
Finding: The script `source`s a configuration file chosen by `TTS_CONFIG_FILE` or `.env`, which causes arbitrary shell code in that file to execute in the current process. Because this happens before any TTS logic, an attacker who can influence that file path or contents can run commands, alter environment variables, or hijack subsequent execution on the host.

Context-Inappropriate Capability

Medium

Confidence: 97%
Finding: The script sources a config file from TTS_CONFIG_FILE or .env using shell source, which executes arbitrary shell code in that file rather than merely parsing configuration values. Because environment variables can redirect CONFIG_FILE to attacker-controlled content, running the uninstall script can trigger unintended command execution with the user's privileges, which is unrelated to the skill's voice-cloning purpose and significantly increases risk.
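One way to avoid the code-execution path described in the two findings above is to read the config file as data instead of sourcing it. The following is a minimal sketch, not the skill's actual code: the function name `parse_env_text` and the sample port value are hypothetical, and it assumes the file only needs simple `KEY=VALUE` lines.

```python
import re

# Accept only simple KEY=VALUE assignments; anything else is rejected, never executed.
_ASSIGNMENT = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)=(.*)$")


def parse_env_text(text: str) -> dict:
    """Parse KEY=VALUE lines as plain data (hypothetical helper, not from the skill)."""
    values = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = _ASSIGNMENT.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: not a KEY=VALUE assignment")
        key, value = match.group(1), match.group(2).strip()
        # Strip one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key] = value  # stored as a literal string; "$(...)" is never expanded
    return values
```

With this approach a line such as `rm -rf ...` in the config raises an error instead of running, and a value like `$(whoami)` stays an inert string.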

Context-Inappropriate Capability

Medium

Confidence: 98%
Finding: The /clone endpoint allows the caller to supply req.output_dir and the server will create that directory and write the generated file there without restricting it to a safe base path. This enables arbitrary filesystem writes within the server account's permissions, which can overwrite or place files outside the intended generated_audio workspace and becomes especially risky in a network-exposed API handling untrusted input.
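A common mitigation for this class of issue is to resolve the requested directory against a fixed base and reject anything that escapes it. The sketch below illustrates the idea under stated assumptions: `resolve_output_dir` and its parameters are hypothetical names, and the base directory would be whatever workspace the server treats as `generated_audio`. It requires Python 3.9 or later for `Path.is_relative_to`.

```python
from pathlib import Path


def resolve_output_dir(requested: str, base_dir: str) -> Path:
    """Return a directory under base_dir, or raise if the request escapes it.

    Hypothetical helper: names and layout are assumptions, not the skill's code.
    """
    base = Path(base_dir).resolve()
    # Joining an absolute `requested` discards `base`, so the check below catches it.
    candidate = (base / requested).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"output_dir escapes the allowed base: {requested!r}")
    return candidate
```

A handler could call this before any `os.makedirs` call and return a client error when it raises, so that `../` sequences, absolute paths, and symlinked escapes are refused rather than written to.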

Description-Behavior Mismatch

High

Confidence: 98%
Finding: The skill advertises voice cloning from a reference sample, but the ChatTTS backend ignores `ref_audio` entirely and generates generic speech. In a voice-cloning skill, this is a security-relevant integrity issue because users and upstream agents may believe biometric voice cloning occurred when it did not, leading to deceptive output, consent problems, and misapplication of a non-cloning backend in sensitive contexts.

Description-Behavior Mismatch

Critical

Confidence: 99%
Finding: The OpenVoice backend claims a two-step pipeline of base TTS plus tone conversion, but the implementation never synthesizes the text into `tmp_path` before attempting voice conversion. Converting an empty temporary file can produce failures, undefined behavior in downstream libraries, or misleading outputs while the system still presents the result as successful voice cloning, which is especially dangerous in a biometric speech skill.
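A small guard between the two pipeline stages would at least make this failure loud instead of silent. The sketch below is illustrative only: `require_nonempty_file` and the `min_bytes` threshold are assumptions, and it does not fix the missing synthesis step, it only refuses to continue when the intermediate file is absent or empty.

```python
import os


def require_nonempty_file(path: str, min_bytes: int = 1) -> int:
    """Return the file size, or raise if the file is missing or too small.

    Hypothetical guard to run on the intermediate audio before tone conversion.
    """
    try:
        size = os.path.getsize(path)
    except OSError as exc:
        raise RuntimeError(f"intermediate audio missing: {path}") from exc
    if size < min_bytes:
        raise RuntimeError(
            f"intermediate audio is {size} bytes, expected at least {min_bytes}"
        )
    return size
```

Calling this on `tmp_path` before conversion would turn an empty temporary file into an explicit error that the API can report, rather than a result presented as successful cloning.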

Vague Triggers

High

Confidence: 92%
Finding: The activation guidance is overly broad, including generic phrases like 'speak' and context-based assumptions about when audio is 'appropriate.' That can cause the skill to trigger unexpectedly, increasing the chance of unnecessary voice cloning, unintended media generation, and execution of its high-privilege backend behaviors.

Vague Triggers

High

Confidence: 90%
Finding: The 'When to use this skill' section permits activation based on implication rather than explicit consent and does not define safeguards for selecting or reusing reference audio. In a voice-cloning context, ambiguous activation is especially risky because biometric voice data may be processed without a sufficiently clear user authorization step.

Missing User Warnings

High

Confidence: 98%
Finding: The instructions describe cloning from a user's reference audio without any warning or consent workflow for privacy, impersonation, and biometric-data risks. Because voice prints are sensitive and easily abused for impersonation or fraud, omission of a consent and safety notice materially increases the danger in this specific skill context.
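One possible shape for a consent step is an explicit flag that defaults to "no" and is checked before any reference audio is touched. This is a sketch under assumptions: `CloneRequest`, `speaker_consent`, and `authorize_clone` are hypothetical names, and a real workflow would also need to record who consented and for what use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloneRequest:
    """Hypothetical request model; the field names are illustrative."""
    text: str
    ref_audio: str
    speaker_consent: bool = False  # must be set explicitly by the caller


def authorize_clone(req: CloneRequest) -> None:
    """Raise unless the request names reference audio and carries explicit consent."""
    if not req.ref_audio:
        raise ValueError("a reference audio path is required")
    if not req.speaker_consent:
        raise PermissionError("explicit speaker consent is required before cloning")
```

Defaulting the flag to `False` means that activation by implication, as described in the trigger findings, fails closed instead of proceeding.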

Missing User Warnings

Medium

Confidence: 92%
Finding: The document describes cloning a user's voice from their original recording and passing the reference audio path through the system, but provides no warning, consent model, retention policy, or abuse safeguards for biometric voice data. In a voice-cloning skill, omission of privacy and impersonation warnings materially increases the risk of non-consensual cloning, misuse of saved recordings, and unsafe downstream integration.

Missing User Warnings

Medium

Confidence: 90%
Finding: The installer performs filesystem changes, including symlink registration and global directory creation, without explicit confirmation or a clear warning about persistence. Silent host modification is dangerous because users may not realize the skill has altered future system behavior or left durable artifacts outside the project directory.

Missing User Warnings

Medium

Confidence: 91%
Finding: The script installs packages from the network without an explicit warning that external code will be downloaded and executed in the virtual environment. Even in a venv, package installation can run arbitrary install-time code and introduce supply-chain risk, so failing to disclose this materially weakens informed consent and safe operation.

Missing User Warnings

Medium

Confidence: 83%
Finding: The script deletes files and directories recursively without any confirmation, dry-run mode, or prominent pre-action warning. In an uninstall context this raises the chance of accidental destructive actions, especially if paths are modified in future changes or if users invoke the script without understanding that generated audio and environment contents will be removed.
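A dry-run default combined with a containment check is one way to reduce the risk of accidental deletion. The following sketch assumes a hypothetical helper `remove_path` and a known project root; it is not the uninstall script's logic. It requires Python 3.9 or later for `Path.is_relative_to`.

```python
import shutil
from pathlib import Path


def remove_path(target: str, project_root: str, dry_run: bool = True) -> bool:
    """Delete target only if it lies strictly inside project_root.

    Hypothetical helper. Returns True only when something was actually removed.
    """
    root = Path(project_root).resolve()
    path = Path(target).resolve()
    # Refuse the root itself and anything outside it.
    if path == root or not path.is_relative_to(root):
        raise ValueError(f"refusing to remove path outside project: {path}")
    if not path.exists():
        return False
    if dry_run:
        print(f"would remove {path}")  # report only; nothing is deleted
        return False
    if path.is_dir():
        shutil.rmtree(path)
    else:
        path.unlink()
    return True
```

Because `dry_run` defaults to `True`, a caller has to opt in to destructive behavior, which gives the user a chance to review the list of paths first.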

Missing User Warnings

Medium

Confidence: 90%
Finding: The script forcefully terminates processes on the configured port with kill -9 and also uses pkill -f "python app.py", which may kill unrelated user processes matching that pattern. Without confirmation or careful identification of the managed service, this can cause denial of service or data loss in other applications, and sourcing TTS_SERVER_PORT from config further increases the chance of targeting the wrong process.
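A narrower alternative to pattern-based killing is to have the service record its own PID at startup and signal only that PID on shutdown. The sketch below assumes such a PID file exists; `read_pid_file` and `stop_managed_service` are hypothetical names. A stale PID file can still point at a reused PID, so a production version would also want to verify the process identity.

```python
import os
import signal
from typing import Optional


def read_pid_file(pid_file: str) -> Optional[int]:
    """Return the PID recorded in pid_file, or None if missing or malformed."""
    try:
        with open(pid_file, encoding="utf-8") as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        return None
    return pid if pid > 0 else None


def stop_managed_service(pid_file: str) -> bool:
    """Send SIGTERM to the recorded PID only; return True if a signal was sent."""
    pid = read_pid_file(pid_file)
    if pid is None:
        return False
    try:
        os.kill(pid, signal.SIGTERM)  # graceful request rather than kill -9
    except ProcessLookupError:
        return False  # stale PID file: the process is already gone
    return True
```

Compared with `pkill -f "python app.py"`, this touches at most one process, and it lets that process shut down cleanly.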

Missing User Warnings

Medium

Confidence: 97%
Finding: This is the same underlying issue as SDI-2: user input directly controls where files are written via os.makedirs(req.output_dir, exist_ok=True) and later file creation under that directory. An attacker can direct output to sensitive or unexpected locations, potentially planting files, filling disk in privileged paths, or interfering with other application data.

Unbounded Output

Medium

Category: Output Handling
Content:
- **Do NOT** manually start `python app.py` or manage the backend. The `run_tts.sh` script auto-detects, auto-installs, and auto-starts everything.
- **First run is slow** (~30-60 seconds) because it downloads model weights and loads them into memory. Subsequent calls are fast.
- **Long texts work automatically.** The engine splits text into sentences, synthesizes each chunk, and stitches them seamlessly. No length limit.
## Controlling voice characteristics
Confidence: 86%
Finding: No length limit
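An input cap applied before synthesis is the usual way to bound output size. The sketch below is an assumption-laden illustration: `MAX_TEXT_CHARS` is an arbitrary placeholder value, not a limit taken from the skill, and `validate_text` is a hypothetical helper.

```python
MAX_TEXT_CHARS = 5000  # assumed cap for illustration; tune to the deployment


def validate_text(text: str, max_chars: int = MAX_TEXT_CHARS) -> str:
    """Return the trimmed text, or raise if it is empty or exceeds max_chars."""
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("text must not be empty")
    if len(cleaned) > max_chars:
        raise ValueError(
            f"text is {len(cleaned)} characters; the limit is {max_chars}"
        )
    return cleaned
```

Rejecting oversized input up front bounds both synthesis time and the size of the audio file written to disk.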

Unpinned Dependencies

Low

Category: Supply Chain
Content: fastapi uvicorn pydantic requests
Confidence: 96%
Finding: fastapi

Unpinned Dependencies

Low

Category: Supply Chain
Content: fastapi uvicorn pydantic requests torch
Confidence: 96%
Finding: uvicorn

Unpinned Dependencies

Low

Category: Supply Chain
Content: fastapi uvicorn pydantic requests torch torchaudio
Confidence: 95%
Finding: pydantic
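The unpinned-dependency findings above can be checked mechanically before installation. The following sketch lists requirement lines that lack an exact `==` pin. It is a simplified check under stated assumptions: `unpinned_requirements` is a hypothetical helper, it skips option lines that start with `-`, and it does not cover URL or VCS requirements. Hash pinning would be stricter still.

```python
import re

# A name, optional extras, then an exact "==" version pin.
_PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?\s*==\s*[^\s;]+")


def unpinned_requirements(text: str) -> list:
    """Return requirement lines that are not pinned to an exact version."""
    result = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):
            continue  # skip blanks and pip option lines such as -r or --hash
        if not _PINNED.match(line):
            result.append(line)
    return result
```

Run against a file containing only bare names such as `fastapi` and `uvicorn`, every line would be reported, which matches the scanner's findings.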

VirusTotal

67/67 vendors flagged this skill as clean.
