Music Craft — MiniMax

Security checks across malware telemetry and agentic risk

Overview

The skill is broadly a real MiniMax music workflow, but it needs review because it can automatically install packages and run remote model code beyond normal media generation.

Review before installing. Use this only if you are comfortable sending music prompts, lyrics, URLs, images, and audio to MiniMax or related services. Run it in a virtual environment or sandbox, preinstall dependencies yourself, and avoid the auto-install YouTube path. Keep MINIMAX_API_KEY narrowly scoped and rotate it if it is exposed. Choose output paths carefully, because some helpers can overwrite existing files.

SkillSpector

By NVIDIA

Vulnerability Patterns

Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Privilege Escalation: Excessive Permissions, Sudo/Root Execution, Credential Access
Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
Behavioral AST: exec() Call, eval() Call, Dynamic Import

Findings (37)

subprocess module call

Medium

Category: Dangerous Code Execution
Content:
    def install_yt_dlp():
        """Install yt-dlp via pip."""
        print("Installing yt-dlp...", file=sys.stderr)
        subprocess.run(
            [sys.executable, '-m', 'pip', 'install', 'yt-dlp', '--quiet', '--break-system-packages'],
            check=True
        )
Confidence: 96%
Finding: subprocess.run( [sys.executable, '-m', 'pip', 'install', 'yt-dlp', '--quiet', '--break-system-packages'], check=True )
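
One alternative to installing at runtime, sketched here on the assumption that the operator preinstalls dependencies, is to detect yt-dlp and stop with instructions when it is missing. The names find_yt_dlp and require_yt_dlp are hypothetical and do not come from the skill.

```python
import importlib.util
import shutil


def find_yt_dlp():
    """Report how yt-dlp is available ('cli' or 'module'), or None. Never installs."""
    if shutil.which("yt-dlp"):
        return "cli"
    if importlib.util.find_spec("yt_dlp") is not None:
        return "module"
    return None


def require_yt_dlp():
    """Hypothetical stand-in for install_yt_dlp(): fail loudly instead of installing."""
    found = find_yt_dlp()
    if found is None:
        raise RuntimeError(
            "yt-dlp is not installed. Install it yourself inside a virtual "
            "environment (for example: python -m pip install yt-dlp) and retry."
        )
    return found
```

This keeps package installation a deliberate operator step and removes the need for --break-system-packages.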

subprocess module call

Medium

Category: Dangerous Code Execution
Content:
    captions = []
    for kf_path in keyframes:
        try:
            proc = subprocess.run(
                ['mmx', 'vision', 'describe', kf_path],
                capture_output=True, text=True, timeout=60,
            )
Confidence: 82%
Finding: proc = subprocess.run( ['mmx', 'vision', 'describe', kf_path], capture_output=True, text=True, timeout=60, )

Description-Behavior Mismatch

Medium

Confidence: 90%
Finding: The file advertises itself as a single entry point for audio, video, image, and YouTube analysis, which is materially broader than the skill metadata focused on music generation. This kind of capability mismatch is dangerous because it can cause users or higher-level agents to invoke undisclosed modalities and network/data-processing behaviors they did not knowingly authorize.

Description-Behavior Mismatch

Medium

Confidence: 95%
Finding: The orchestrator includes web lyrics lookup and YouTube download behavior that are not disclosed by the skill metadata. Hidden external retrieval expands the attack and privacy surface because user content and inferred song-identification data may leave the local environment unexpectedly.

Context-Inappropriate Capability

Medium

Confidence: 72%
Finding: The script is described primarily as local image analysis, but the optional --vlm path broadens behavior by invoking an external CLI that may perform richer processing and potentially transmit data depending on mmx configuration. In a skill handling user-supplied images, this can create an unexpected privacy and trust-boundary violation.

Context-Inappropriate Capability

Medium

Confidence: 97%
Finding: Auto-installing yt-dlp via pip gives the script software installation capability beyond its stated download/conversion role. In a skill or agent setting, this is risky because it performs network-based package retrieval and code installation at runtime without administrative review, increasing supply-chain and environment-tampering exposure.

Description-Behavior Mismatch

Medium

Confidence: 84%
Finding: The file implements optional VLM captioning of video frames, which is materially beyond the stated music-generation purpose of the skill. That scope expansion increases the chance of unexpected data handling and privacy exposure, especially because video frames can contain sensitive visual content unrelated to music generation.

Context-Inappropriate Capability

Medium

Confidence: 88%
Finding: The skill spawns an external vision CLI to describe sampled frames, which is not justified by the declared music-craft use case and may transmit image data outside the local process boundary. In a security review, undisclosed external analysis of user media is risky because it can leak private content and violate least-privilege expectations.

Missing User Warnings

Medium

Confidence: 88%
Finding: The README explicitly encourages workflows using YouTube URLs and a third-party lyrics API but does not warn that user-provided content, metadata, or derived prompts may be transmitted to external services and may implicate copyright, privacy, or terms-of-service constraints. In a skill that processes reference audio and remote URLs, omission of these disclosures can lead users to unknowingly send protected or personal material to external providers.

Missing User Warnings

Medium

Confidence: 92%
Finding: The skill instructs creation of directories and writing prompts, lyrics, analyses, and generated media to the local filesystem without a strong upfront disclosure of side effects or a mandatory confirmation step. In an agent setting, this can lead to silent persistence of sensitive user data, clutter, overwrites, or writing into unintended paths if variables are mis-specified.
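
A minimal sketch of one mitigation for unintended paths and overwrites: resolve every requested output under a single base directory and refuse anything that escapes it or already exists. The function name resolve_output and its parameters are hypothetical, and Path.is_relative_to requires Python 3.9 or later.

```python
from pathlib import Path


def resolve_output(base_dir, requested, overwrite=False):
    """Resolve `requested` under `base_dir`; reject escapes and silent overwrites."""
    base = Path(base_dir).resolve()
    target = (base / requested).resolve()
    # An absolute or ../ path resolves outside `base` and is rejected here.
    if not target.is_relative_to(base):
        raise ValueError(f"{requested!r} escapes {base}")
    if target.exists() and not overwrite:
        raise FileExistsError(f"{target} already exists; pass overwrite=True to replace it")
    return target
```

An agent would call this before every write, so a mis-specified variable produces an error instead of a file in an unexpected location.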

Missing User Warnings

Medium

Confidence: 95%
Finding: The skill clearly contemplates sending user audio, lyrics, prompts, images, and possibly derived analysis to MiniMax and other remote services, but does not provide a clear privacy/data-handling warning before those transfers. That creates a real risk of users unknowingly uploading copyrighted, personal, or sensitive media to third-party APIs.

Missing User Warnings

Medium

Confidence: 94%
Finding: The documentation includes a curl example that sends the API key in an Authorization header over the network without any nearby warning about handling secrets safely. Even though this is normal for authenticated API use, omitting guidance about trusted endpoints, shell history, logs, and secret leakage can lead users to expose credentials accidentally.
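
As an illustration of the secret-handling guidance this finding asks for, the sketch below reads the key from the environment, so it never appears in shell history, and redacts it before anything is logged. The helper names are hypothetical, and the Bearer scheme is an assumption, since the report only says the key travels in an Authorization header.

```python
import os


def auth_headers(env=None):
    """Build the Authorization header from MINIMAX_API_KEY without echoing the key."""
    env = os.environ if env is None else env
    key = env.get("MINIMAX_API_KEY", "").strip()
    if not key:
        raise RuntimeError("MINIMAX_API_KEY is not set")
    # Assumption: the API expects a Bearer token.
    return {"Authorization": f"Bearer {key}"}


def redact(headers):
    """Return a copy of the headers that is safe to print or log."""
    return {
        name: ("<redacted>" if name.lower() == "authorization" else value)
        for name, value in headers.items()
    }
```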

Natural-Language Policy Violations

Medium

Confidence: 89%
Finding: The example workflow hard-codes a language transformation ('French original -> Spanish translation') as part of the recommended cover flow without explicitly requiring user confirmation of the target language. In a music-generation skill, this can cause the agent to override user intent, produce unwanted multilingual output, or mishandle copyrighted or sensitive lyrical content by translating it without clear authorization.

Missing User Warnings

Medium

Confidence: 89%
Finding: This example instructs operators to fetch a user's draft lyrics from a URL and send the full contents to MiniMax's external lyrics API, but it does not require an explicit consent or disclosure step at the point of transfer. That creates a real privacy risk because users may not realize their potentially sensitive or unpublished text is being transmitted to a third-party service for processing.

Missing User Warnings

Medium

Confidence: 91%
Finding: The cover preprocess workflow sends a user audio URL/voice memo to MiniMax without an explicit warning at the moment of use that personal audio will be uploaded to an external provider. Voice recordings can contain sensitive biometric and contextual information, so silent third-party transmission is a meaningful privacy and compliance issue.

Missing User Warnings

Medium

Confidence: 94%
Finding: The workflow explicitly sends user song audio references and potentially full lyrics to a third-party API, but the documentation does not require user consent, disclose that content leaves the local environment, or describe retention/privacy implications. In a music tool, songs and lyrics may be copyrighted, private, or user-created, so silent transmission creates a real privacy and compliance risk.

Missing User Warnings

Low

Confidence: 87%
Finding: The workflow instructs users to download media and create multiple temporary files under /tmp without warning that user content will be stored locally and may persist or be accessible to other local processes depending on system configuration. This is less severe than remote exfiltration, but it still exposes potentially sensitive or copyrighted content to unintended local disclosure or mishandling.
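
One way to limit local exposure, shown as a sketch and not as the skill's actual behavior, is to do intermediate work in a per-run temporary directory that only the current user can access and that is deleted when the work finishes. The function name and the "music-craft-" prefix are hypothetical.

```python
import tempfile
from pathlib import Path


def process_in_private_tmp(work):
    """Run `work(tmpdir)` in an owner-only temp directory that is removed afterwards."""
    # TemporaryDirectory uses mkdtemp, which creates the directory so that only
    # the creating user can read, write, or search it.
    with tempfile.TemporaryDirectory(prefix="music-craft-") as tmp:
        return work(Path(tmp))
```

Anything the caller wants to keep has to be copied out explicitly before `work` returns, which makes persistence a visible decision.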

Missing User Warnings

Medium

Confidence: 96%
Finding: The web lyrics lookup sends artist/title and potentially Whisper-derived transcript text to an external service for matching, but this file does not present a clear privacy warning at the point of use. That can leak copyrighted or sensitive audio-derived text and listening history metadata without informed user consent.

Missing User Warnings

Medium

Confidence: 87%
Finding: The VLM image option states it will call an external vision CLI, but there is no clear privacy warning that image content may be transmitted off-box to a remote model provider. For album art or user images, this can disclose personal, proprietary, or licensed content unexpectedly.

Missing User Warnings

Medium

Confidence: 80%
Finding: When --vlm is used, the script sends user-selected image content to the mmx vision describe tool without an in-context privacy warning. If mmx is backed by a remote service or logs inputs, sensitive artwork or personal photos could be exposed unexpectedly.
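
A hedged sketch of an in-context gate for the --vlm path: the mmx command is the one quoted in this report, while the describe_image wrapper, its consent flag, and the injectable run parameter are hypothetical.

```python
import subprocess


def describe_image(path, consent, run=subprocess.run):
    """Call `mmx vision describe` only after explicit consent for this file."""
    if not consent:
        raise PermissionError(f"refusing to send {path} to mmx: no consent given")
    proc = run(
        ["mmx", "vision", "describe", path],
        capture_output=True, text=True, timeout=60,
    )
    return proc.stdout.strip()
```

The caller is expected to obtain consent after telling the user that image content may leave the machine, depending on how mmx is configured.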

Natural-Language Policy Violations

Medium

Confidence: 95%
Finding: The script is explicitly designed to infer emotional and psychological state from voice audio, including labels such as vulnerable, desperate, wistful, longing, and aggressive, without any built-in consent gating, user disclosure, or limitation language. In this skill context, that is more dangerous because it operationalizes sensitive trait inference from biometric-style voice data and then turns those inferences into downstream generation instructions, increasing privacy and profiling risk.

Missing User Warnings

Medium

Confidence: 97%
Finding: `AutoModel.from_pretrained(..., trust_remote_code=True)` permits execution of repository-supplied Python during model load, and loading from Hugging Face may fetch code over the network. In a skill that processes user-selected inputs and may run in semi-automated environments, this expands the trust boundary to a remote model repository and can result in arbitrary code execution if the model repo is malicious or compromised.
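
A minimal sketch of narrowing this trust boundary, assuming the model in question can be pinned to a commit that someone has reviewed: remote code stays off by default and the revision must be a full commit hash, not a movable branch name. The helper names are hypothetical; revision and trust_remote_code are existing from_pretrained parameters.

```python
import re

_COMMIT = re.compile(r"[0-9a-f]{40}")


def safe_load_kwargs(revision, allow_remote_code=False):
    """Keyword arguments for from_pretrained() pinned to one reviewed commit."""
    if not _COMMIT.fullmatch(revision):
        raise ValueError("pin `revision` to a full 40-character commit hash")
    return {"revision": revision, "trust_remote_code": allow_remote_code}


def load_model(model_id, revision, allow_remote_code=False):
    """Load a model with the pinned, explicit settings from safe_load_kwargs()."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModel
    return AutoModel.from_pretrained(
        model_id, **safe_load_kwargs(revision, allow_remote_code)
    )
```

Pinning does not make repository code safe on its own. It only ensures that the code executed is the code that was reviewed.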

Missing User Warnings

Medium

Confidence: 95%
Finding: The script silently attempts to install yt-dlp with only a brief stderr notice, which is unsafe behavior for a tool that may run under automation. This can surprise operators, mutate the runtime environment, and pull external code unexpectedly, making policy bypass and supply-chain abuse more likely.

Missing User Warnings

Medium

Confidence: 89%
Finding: Using ffmpeg with -y forces overwrite of the destination path without confirmation. Because the output path is user-supplied, this can destroy existing files in writable locations and is more concerning in agent workflows where paths may be composed automatically or passed through from other components.
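
The overwrite behavior could be made opt-in along the following lines. This is a sketch with a hypothetical helper name and a deliberately minimal argument list, not the skill's real ffmpeg invocation. It relies on ffmpeg's -n flag, which makes ffmpeg exit instead of overwriting an existing output file.

```python
from pathlib import Path


def ffmpeg_convert_cmd(src, dst, overwrite=False):
    """Build an ffmpeg command that refuses to clobber `dst` unless asked to."""
    if Path(dst).exists() and not overwrite:
        raise FileExistsError(f"{dst} already exists")
    # -n makes ffmpeg exit rather than overwrite; -y is used only on request.
    return ["ffmpeg", "-y" if overwrite else "-n", "-i", str(src), str(dst)]
```

The Python check gives a clear early error, and -n covers the case where the file appears between the check and the ffmpeg run.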

Missing User Warnings

Low

Confidence: 84%
Finding: For WAV inputs, shutil.copy2 writes directly to the computed destination path and can overwrite an existing file without explicit warning. This is primarily a data-loss issue rather than code execution, but it is still unsafe behavior when the path may be influenced by upstream automation or user mistakes.
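
For the WAV copy path, a no-clobber copy can be sketched with exclusive-create mode, which fails if the destination already exists instead of truncating it. copy_no_clobber is a hypothetical replacement for the shutil.copy2 call and is not taken from the skill.

```python
import shutil


def copy_no_clobber(src, dst):
    """Copy src to dst, raising FileExistsError if dst already exists."""
    # Mode "xb" creates the file exclusively, so an existing file is never truncated.
    with open(src, "rb") as fin, open(dst, "xb") as fout:
        shutil.copyfileobj(fin, fout)
    shutil.copystat(src, dst)  # keep timestamps and mode, as copy2 would
    return dst
```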

VirusTotal

VirusTotal findings are pending for this skill version.
