mmVoiceMaker

Security checks across malware telemetry and agentic risk

Overview

This voice tool appears purpose-built for MiniMax text-to-speech work, but it needs review because it can upload sensitive voice recordings and delete cloud voice assets without strong user-facing safeguards.

Install only if you are comfortable sending text, prompts, and voice recordings to MiniMax under your own API key. Use voice cloning only with recordings you own or have explicit permission to process, avoid putting secrets in prompt or preview text, do not echo or screenshot API keys, and do not run cleanup_unused_voices(dry_run=False) unless you intend to delete all custom cloned/designed voices in the account.

SkillSpector

By NVIDIA

Vulnerability Patterns

Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Supply Chain: Unpinned Dependencies, External Script Fetching, Obfuscated Code
Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
Behavioral AST: exec() Call, eval() Call, Dynamic Import
Taint Tracking: Direct Taint Flow, Variable-Mediated Taint Flow, Credential Exfiltration Chain

Findings (29)

subprocess module call

Medium

Category: Dangerous Code Execution
Content: def run_ffmpeg_command(args: List[str], timeout: int = 300) -> bool: """Execute ffmpeg command with given arguments""" try: result = subprocess.run( ["ffmpeg"] + args, capture_output=True, text=True,
Confidence: 82%
Finding: result = subprocess.run( ["ffmpeg"] + args, capture_output=True, text=True, timeout=timeout )

subprocess module call

Medium

Category: Dangerous Code Execution
Content: try: # Use ffprobe to get format info result = subprocess.run( [ "ffprobe", "-v", "quiet", "-print_format", "json",
Confidence: 80%
Finding: result = subprocess.run( [ "ffprobe", "-v", "quiet", "-print_format", "json", "-show_format", "-show_streams",

Tainted flow: 'url' from os.getenv (line 161, credential/environment) → requests.get (network output)

Critical

Category: Data Flow
Content: Returns: Saved file path """ response = requests.get(url, timeout=timeout) response.raise_for_status() with open(output_path, "wb") as f:
Confidence: 96%
Finding: response = requests.get(url, timeout=timeout)
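One way to narrow this flow is to validate the URL before it reaches requests.get. The following is a minimal sketch, not the skill's code: the function name is_allowed_download_url and the allowlisted host are hypothetical, since the report does not state which hosts the skill legitimately downloads from.

```python
from urllib.parse import urlparse

# Hypothetical allowlist (assumption); replace with the hosts the API actually returns.
ALLOWED_HOSTS = {"example-audio-cdn.invalid"}


def is_allowed_download_url(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Return True only for https URLs whose host is explicitly allowlisted."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False
    # hostname is lowercased by urlparse and excludes port and credentials;
    # it is None when the URL has no network location.
    return parsed.hostname in allowed_hosts
```

A caller would run this check first and refuse to fetch when it returns False, so an environment-supplied URL cannot redirect the download to an arbitrary scheme or host.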

Intent-Code Divergence

Medium

Confidence: 93%
Finding: The function documentation frames this as TTS segment generation, but the implementation executes an arbitrary caller-supplied function from `segment["tts_function"]`. In a skill/plugin context, that turns structured data into executable behavior and can allow untrusted code paths or unexpected side effects if segment data is influenced by users or other agents.
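A common mitigation for this pattern is to keep callables out of segment data and resolve a backend by name from a fixed registry. The sketch below illustrates the idea under assumptions: TTS_BACKENDS, register_backend, synthesize_segment, and the "backend" key are all hypothetical names, not part of the scanned skill.

```python
from typing import Callable, Dict

# Hypothetical registry (assumption): names map to vetted synthesis functions.
TTS_BACKENDS: Dict[str, Callable[[str], bytes]] = {}


def register_backend(name: str, func: Callable[[str], bytes]) -> None:
    """Register a trusted synthesis function under a plain string name."""
    TTS_BACKENDS[name] = func


def synthesize_segment(segment: dict) -> bytes:
    """Look up the backend by name instead of calling a callable stored in the segment."""
    name = segment.get("backend")
    if not isinstance(name, str) or name not in TTS_BACKENDS:
        raise ValueError(f"unknown TTS backend: {name!r}")
    return TTS_BACKENDS[name](segment["text"])
```

With this shape, segment data stays plain data: a segment influenced by a user or another agent can only select among functions the code has already registered.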

Description-Behavior Mismatch

Medium

Confidence: 92%
Finding: The module exposes a bulk-deletion helper that iterates over all cloned and designed voices and deletes them with a single call when dry_run=False. Even if intended as maintenance functionality, this is a destructive capability that exceeds the narrowly described synthesis/post-processing purpose and creates unnecessary risk of accidental or unauthorized mass deletion if invoked by an agent or user without strong safeguards.

Context-Inappropriate Capability

Medium

Confidence: 94%
Finding: cleanup_unused_voices is effectively a mass-deletion convenience function, but it does not actually determine whether voices are unused; it deletes every custom voice returned by the API. In an agent context, convenience wrappers for destructive operations are dangerous because they lower the barrier to irreversible data loss from prompt mistakes, misuse, or overly broad automation.
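A safeguard of the kind this finding calls for could look like the sketch below: the caller passes an explicit list of voice IDs, dry run is the default, and a real deletion requires a confirmation string that names the exact count. The function guarded_cleanup and its parameters are hypothetical; delete_voice stands in for whatever single-voice deletion call the skill exposes.

```python
from typing import Callable, Iterable, List


def guarded_cleanup(
    voice_ids: Iterable[str],
    delete_voice: Callable[[str], None],
    dry_run: bool = True,
    confirm: str = "",
) -> List[str]:
    """Delete only the listed voices, and only when the caller confirms the exact count."""
    targets = list(voice_ids)
    if dry_run:
        return targets  # report what would be deleted; delete nothing
    expected = f"delete {len(targets)} voices"
    if confirm != expected:
        raise PermissionError(f"confirmation must be exactly {expected!r}")
    for voice_id in targets:
        delete_voice(voice_id)
    return targets
```

Requiring the count in the confirmation string means an agent or user cannot trigger deletion without first seeing how many voices are affected.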

Missing User Warnings

Medium

Confidence: 90%
Finding: The voice cloning workflow accepts a user-supplied audio file and sends it to external processing via `quick_clone_voice`, but the CLI provides no explicit notice that potentially sensitive biometric voice data will be uploaded to a third-party API. This can lead to uninformed disclosure of personal or regulated data, especially because voiceprints are uniquely identifying and harder to revoke than passwords.
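An explicit notice of the kind this finding describes might be sketched as follows. The function confirm_voice_upload and the wording of the notice are assumptions for illustration; the ask parameter exists so the prompt can be replaced in non-interactive use.

```python
from typing import Callable


def confirm_voice_upload(audio_path: str, ask: Callable[[str], str] = input) -> bool:
    """Show an upload notice and return True only on an explicit 'yes'."""
    notice = (
        f"'{audio_path}' contains voice data that will be uploaded to the "
        "MiniMax API for cloning.\n"
        "Continue only if you own this recording or have the speaker's "
        "explicit permission.\n"
        "Type 'yes' to upload: "
    )
    answer = ask(notice)
    return answer.strip().lower() == "yes"
```

A CLI would call this before the cloning request and abort when it returns False, so anything other than a typed "yes" leaves the recording on the local machine.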

Missing User Warnings

Low

Confidence: 82%
Finding: The voice design command sends arbitrary user prompt text to an external service through `design_voice` without clearly disclosing that the description leaves the local environment. While prompts are less sensitive than raw biometric audio, they may still contain confidential character, branding, or personal information that users do not expect to be transmitted externally.

Missing User Warnings

Medium

Confidence: 95%
Finding: The documentation exposes voice cloning and source-audio upload workflows without any notice about consent, privacy, retention, or the fact that biometric voice data is sent to a third-party API. In a voice-maker skill, this omission is security-relevant because users may upload sensitive recordings or clone voices without understanding legal and privacy risks, increasing the chance of unauthorized biometric data processing.

Missing User Warnings

Low

Confidence: 82%
Finding: The deletion and cleanup commands describe removing custom voices but do not clearly warn that these actions may be irreversible. In a tool that manages user-created voice assets, unclear destructive-operation guidance can lead to accidental permanent loss of cloned or designed voices.

Missing User Warnings

Medium

Confidence: 93%
Finding: The guide documents voice cloning workflows but provides no warning that users may be uploading biometric voice data or that they must have the speaker’s consent. In an agent-facing skill, this omission can normalize cloning third-party voices without authorization, creating privacy, impersonation, and compliance risk even if the underlying feature is legitimate.

Missing User Warnings

Low

Confidence: 80%
Finding: The environment check mentions API connectivity testing and use of MINIMAX_VOICE_API_KEY without clearly warning that running the command may contact an external service using configured credentials. This is a transparency and operational-safety issue: users or agents may trigger outbound requests unexpectedly in restricted or privacy-sensitive environments.

Missing User Warnings

Medium

Confidence: 95%
Finding: The guide explicitly instructs users to echo the MiniMax API key to the terminal, which encourages disclosure of a sensitive secret. Terminal output may be visible to others, captured in screen recordings, shell logs, or CI logs, making credential exposure more likely in a skill centered on authenticated API usage.
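A check that confirms the key is configured without printing any part of it could be sketched as below. The helper name describe_api_key is hypothetical; MINIMAX_VOICE_API_KEY is the variable name the report mentions.

```python
import os


def describe_api_key(env_var: str = "MINIMAX_VOICE_API_KEY") -> str:
    """Report whether the key is set, without revealing any of its characters."""
    value = os.environ.get(env_var, "")
    if not value:
        return f"{env_var} is not set"
    return f"{env_var} is set ({len(value)} characters)"
```

Printing the result of this helper gives the same troubleshooting signal as echoing the variable (set or not set, plausible length) while keeping the secret out of terminals, logs, and screenshots.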

Missing User Warnings

Medium

Confidence: 92%
Finding: The voice cloning examples demonstrate uploading and cloning a speaker's voice without any surrounding guidance on consent, authorization, biometric sensitivity, or retention of uploaded samples. In a skill specifically designed for voice cloning, omission of these safeguards can normalize misuse and lead users to clone voices they do not own or have permission to use, creating privacy, impersonation, and fraud risks.

Missing User Warnings

Medium

Confidence: 85%
Finding: The examples show direct deletion of cloned and designed voices, including a batch cleanup that deletes all custom voices, without any warning that the action may be irreversible. This can cause accidental destructive operations, especially because the file presents copy-paste runnable examples and the cleanup example encourages broad deletion with minimal friction.

Natural-Language Policy Violations

Medium

Confidence: 93%
Finding: The guidance explicitly instructs voice matching based on gender as a default character-trait factor, without requiring user preference, necessity, or sensitivity checks. In a voice-cloning and synthesis skill, this can lead the agent to infer or impose gendered attributes on speakers and generate potentially inappropriate, biased, or misrepresentative outputs, especially for ambiguous or non-binary roles.

Missing User Warnings

Medium

Confidence: 90%
Finding: The troubleshooting example prints the first part of the API key to stdout for debugging. Even partial credential disclosure can leak into terminal history, logs, CI output, screen recordings, or support screenshots, increasing the chance of secret exposure.

Missing User Warnings

Medium

Confidence: 94%
Finding: This guide provides step-by-step voice cloning workflows without any warning about consent, authorization, or privacy obligations for uploaded recordings. In a voice-cloning skill, that omission materially increases the risk of impersonation, non-consensual biometric voice use, and misuse of third-party recordings, especially because the examples make cloning look routine and immediately usable.

Missing User Warnings

Medium

Confidence: 82%
Finding: The deletion and batch cleanup examples invoke destructive operations, including removing all custom voices, without clearly warning that deletion may be irreversible or difficult to recover from. This can lead to accidental loss of user-created assets and operational disruption, particularly because the examples normalize running the actual delete call immediately after the dry run.

Missing User Warnings

Medium

Confidence: 92%
Finding: The code logs raw segment text to stdout during processing, which can expose sensitive or private input content in terminal history, CI/CD logs, agent transcripts, or centralized log collection systems. In a voice synthesis workflow, segment text may contain unpublished scripts, personal data, secrets, or regulated content, so unconditional logging creates a real confidentiality risk.
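Progress output can usually identify a segment without reproducing its text. The sketch below logs an index, a length, and a short digest instead; segment_log_label is a hypothetical helper, and the digest is included only for correlating log lines, so it can be dropped where inputs are short or guessable.

```python
import hashlib


def segment_log_label(index: int, text: str) -> str:
    """Build a log label that identifies a segment without including its text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
    return f"segment {index}: {len(text)} chars, sha256 prefix {digest}"
```

Logging this label in place of the raw text keeps processing traceable while keeping scripts, personal data, and secrets out of terminal history and collected logs.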

Missing User Warnings

Medium

Confidence: 84%
Finding: This helper silently fetches remote data and persists it to disk without any built-in warning, confirmation, provenance checks, or validation that the payload is actually safe audio. While not inherently malicious, this behavior increases the chance that untrusted remote content is stored and later consumed by other tools in the pipeline, especially in a voice-processing skill that may automatically handle external media.
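A lightweight validation step can reject payloads that are obviously not audio before they are written to disk. The sketch below checks the leading bytes against a few well-known signatures; looks_like_audio is a hypothetical helper, the format list is an assumption (the report does not say which formats the skill downloads), and a signature match is a heuristic rather than proof that a file is safe.

```python
def looks_like_audio(header: bytes) -> bool:
    """Heuristic check of the first bytes of a payload against common audio signatures."""
    if len(header) < 12:
        return False
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return True  # WAV container
    if header[:4] == b"fLaC":
        return True  # FLAC stream marker
    if header[:3] == b"ID3":
        return True  # MP3 with an ID3v2 tag
    if header[0] == 0xFF and (header[1] & 0xE0) == 0xE0:
        return True  # MPEG audio frame sync (for example MP3 without a tag)
    return False
```

A download helper could read the first 12 bytes of the response body, call this function, and refuse to save the file when it returns False.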

Missing User Warnings

Medium

Confidence: 91%
Finding: This function uploads local audio files to a third-party API using the configured bearer token, but there is no explicit consent prompt, privacy disclosure, or indication to callers that biometric voice data leaves the local environment. Because voice samples are sensitive personal data and may enable identification or cloning misuse, silent transmission creates a real privacy and compliance risk even if the feature is expected by the skill's purpose.

Missing User Warnings

Medium

Confidence: 92%
Finding: This prompt-audio upload sends another local recording to the external service without any built-in warning or consent mechanism. In the context of a voice-cloning skill, prompt audio is especially sensitive because it is short, targeted voice-reference material that can directly improve impersonation quality, so undisclosed transmission increases privacy and misuse risk.

Missing User Warnings

Medium

Confidence: 88%
Finding: The cloning request transmits preview text, file identifiers, and cloning options to a remote API, again without any explicit disclosure in code or interface. While sending this data is functionally necessary for the skill, preview text may contain sensitive or proprietary content, and in a voice-cloning workflow the overall operation has elevated abuse potential because it helps create reusable synthetic identities.

Unpinned Dependencies

Low

Category: Supply Chain
Content: # MiniMax Voice Maker Skill Dependencies requests>=2.28.0 websockets>=10.0 ffmpeg-python>=0.2.0
Confidence: 95%
Finding: requests>=2.28.0
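A simple check for this pattern is to flag any requirement line that lacks an exact `==` pin. The sketch below is illustrative only: unpinned_requirements is a hypothetical helper and does not parse the full requirements-file syntax (options, URLs, environment markers).

```python
from typing import List


def unpinned_requirements(lines: List[str]) -> List[str]:
    """Return requirement lines that are not pinned with '=='."""
    flagged = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and "==" not in line:
            flagged.append(line)
    return flagged
```

Run against the three dependency lines quoted in this finding, the helper would flag all of them, since each uses a `>=` lower bound rather than an exact version.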

VirusTotal

66/66 vendors flagged this skill as clean.
