MiniMax Multimodal (Speech + Image)

Security checks across malware telemetry and agentic risk

Overview

This skill is a straightforward MiniMax speech and image API wrapper, with expected but privacy-sensitive uploads and account actions that users should understand before use.

Install only if you are comfortable using a MiniMax API key, consuming Token Plan quota, and sending selected prompts, audio, and images to MiniMax. Use voice cloning only for voices you are authorized to process, set MINIMAX_REGION deliberately, and double-check voice IDs before running delete.

SkillSpector

By NVIDIA

Vulnerability Patterns

Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
MCP Least Privilege: Underdeclared Capability, Wildcard Permission, Missing Permission Declaration
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
Privilege Escalation: Excessive Permissions, Sudo/Root Execution, Credential Access

Findings (9)

Lp3

Medium

Category: MCP Least Privilege
Confidence: 92%
Finding: The skill documentation indicates use of environment variables and external network access, but no permissions are declared. This creates a transparency and governance gap: users or platforms may not realize the skill can read secrets and send data to third-party APIs, increasing the risk of unintended data exposure or unauthorized external communication.

Tp4

High

Category: MCP Tool Poisoning
Confidence: 88%
Finding: The stated description emphasizes speech synthesis, voice cloning, voice design, and image generation/editing, but the documented behavior also includes voice listing, voice deletion, and asynchronous task management/querying. Undisclosed management and destructive capabilities can surprise users and reviewers, making it easier for a skill to perform operations beyond the expected scope, including modifying or deleting remote resources.

Description-Behavior Mismatch

Medium

Confidence: 79%
Finding: The file exposes voice listing, retrieval, and deletion capabilities that are broader than the declared skill scope. Extra undeclared actions increase attack surface and may enable destructive or privacy-impacting operations that users and reviewers do not expect from the manifest.

Missing User Warnings

Medium

Confidence: 95%
Finding: Voice cloning is explicitly documented as uploading local audio to external endpoints, yet there is no warning about consent, biometric privacy, ownership, or third-party transmission. Because voice data is sensitive biometric-like information, users may unknowingly submit another person's voice or regulated personal data to an external service, creating privacy, legal, and impersonation risks.

Missing User Warnings

Low

Confidence: 90%
Finding: The image-editing workflow accepts local files or URLs but does not warn that the source image may be transmitted to an external service. This can lead users to expose private, copyrighted, or sensitive images without understanding that they leave the local environment.

Missing User Warnings

Medium

Confidence: 91%
Finding: When a local image file is provided, the code base64-encodes the file and sends it to the remote MiniMax API as a data URL without any explicit warning, confirmation, or visibility at the upload point. In a multimodal skill, users may reasonably treat a local reference image as local-only input, so this creates a meaningful privacy and data-handling risk if sensitive images are uploaded unintentionally.
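
The upload path described in this finding can be illustrated with a minimal Python sketch. It is not the skill's actual code: the function names `to_data_url` and `prepare_reference_image` and the `confirm` callback are hypothetical, and only the standard library is used. The sketch shows how a local file becomes a base64 data URL and where an explicit confirmation step could sit before anything leaves the machine.

```python
import base64
import mimetypes
from pathlib import Path
from typing import Callable


def to_data_url(path: Path) -> str:
    """Encode a local file as a base64 data URL (the form that would be sent remotely)."""
    mime, _ = mimetypes.guess_type(path.name)
    mime = mime or "application/octet-stream"  # fallback for unknown extensions
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def prepare_reference_image(path: Path, confirm: Callable[[str], bool]) -> str:
    """Return a data URL only after the caller has confirmed the upload.

    `confirm` is a hypothetical hook: it receives a human-readable message
    and returns True to proceed or False to abort.
    """
    size = path.stat().st_size
    message = f"Upload {path.name} ({size} bytes) to the remote image API?"
    if not confirm(message):
        raise PermissionError("upload declined by user")
    return to_data_url(path)
```

In an interactive tool, `confirm` might wrap `input()`; in an agent setting it might surface the message to the user before the request is built.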

Missing User Warnings

Medium

Confidence: 88%
Finding: Voice cloning uploads a local audio file to a remote third-party service, which can transmit sensitive biometric voice data off-device without any explicit notice or consent flow in the code path. In this skill context, voice cloning makes the privacy risk more significant because the uploaded content may uniquely identify a person and be reused to synthesize speech.

Missing User Warnings

Medium

Confidence: 87%
Finding: The delete operation can permanently remove a remote voice resource with a single command and no confirmation or safety interlock. In a skill handling user-created voices, accidental or induced deletion can cause data loss and service disruption.
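
One possible safety interlock for this finding is to require the user to re-type the voice ID before the delete call is made. The Python sketch below is an assumption about how such a guard could look, not the skill's implementation: `delete_voice_guarded` and the injected `delete_fn` callback are hypothetical names, and the real remote call is deliberately left out.

```python
from typing import Callable


def delete_voice_guarded(
    voice_id: str,
    typed_confirmation: str,
    delete_fn: Callable[[str], None],
) -> bool:
    """Call delete_fn only if the user re-typed the exact voice ID.

    Returns True when the delete was issued, False when it was blocked.
    """
    if not voice_id or typed_confirmation.strip() != voice_id:
        return False  # mismatch or empty ID: nothing is deleted
    delete_fn(voice_id)
    return True
```

Passing the delete operation in as `delete_fn` keeps the guard independent of any particular API client and makes it easy to exercise without touching remote resources.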

Natural-Language Policy Violations

Low

Confidence: 75%
Finding: Defaulting to the CN endpoint unless an environment variable is set can route user data to a region the user did not knowingly choose. Because this skill processes text and potentially sensitive voice/audio data, implicit region selection creates a transparency and compliance risk even if it is not a code-execution flaw.
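
A way to avoid the implicit default is to refuse to run until the region is set explicitly. The following Python sketch assumes that `MINIMAX_REGION` accepts the values `cn` and `global`; those values, the `resolve_endpoint` helper, and the placeholder hosts are illustrative assumptions, since the report does not state the skill's accepted values or real endpoint URLs.

```python
import os
from typing import Mapping, Optional

# Placeholder hosts only; the real MiniMax endpoints are not stated in this report.
ENDPOINTS = {
    "cn": "https://cn.api.example.invalid",
    "global": "https://global.api.example.invalid",
}


def resolve_endpoint(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the endpoint for MINIMAX_REGION, failing loudly if it is unset or unknown."""
    env = os.environ if env is None else env
    region = env.get("MINIMAX_REGION", "").strip().lower()
    if region not in ENDPOINTS:
        allowed = ", ".join(sorted(ENDPOINTS))
        raise ValueError(
            f"MINIMAX_REGION must be set explicitly to one of: {allowed}"
        )
    return ENDPOINTS[region]
```

Raising an error instead of falling back means data is never routed to a region the user did not choose, at the cost of one extra setup step.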

VirusTotal

67/67 vendors flagged this skill as clean.
