Youtube Audio

Security checks across malware telemetry and agentic risk

Overview

This skill is not clearly malicious, but it is broader than its YouTube-audio name suggests and can send media, prompts, and token-backed actions to a remote video service.

Install only if you are comfortable using NemoVideo as a remote cloud media editor, not just a simple audio extractor. Use non-sensitive media, prefer a dedicated low-privilege NEMO_TOKEN, watch credit usage, and require explicit confirmation before uploads, prompt forwarding, rendering, or export.
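One way a host agent could enforce the "explicit confirmation" advice is a small gate that runs before any remote operation. The sketch below assumes four hypothetical action names (`upload`, `forward_prompt`, `render`, `export`) that stand in for whatever the skill actually calls. They are illustrative only, not NemoVideo's or the skill's real API.

```python
from typing import Callable

# Hypothetical action names for operations that send data to, or spend credits on,
# the remote video service. Illustrative only; not the skill's real API.
REMOTE_ACTIONS = {"upload", "forward_prompt", "render", "export"}


class ConfirmationRequired(Exception):
    """Raised when a remote action is attempted without explicit user approval."""


def guard_remote_action(action: str, detail: str, confirm: Callable[[str], bool]) -> bool:
    """Return True if the action may proceed; raise if the user declines a remote action."""
    if action not in REMOTE_ACTIONS:
        return True  # actions that stay local need no confirmation
    prompt = f"Send to remote service: {action} ({detail})? [y/N] "
    if not confirm(prompt):
        raise ConfirmationRequired(f"user declined remote action: {action}")
    return True
```

In an interactive session, `confirm` could be `lambda p: input(p).strip().lower() == "y"`. Passing the callback in keeps the gate testable and lets the host agent substitute its own approval UI.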

SkillSpector

By NVIDIA
Vulnerability Patterns
  • Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
  • Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
  • Trigger Abuse: Overly Broad Trigger, Shadow Command Trigger, Keyword Baiting Trigger
  • MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
  • Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
Findings (6)

Description-Behavior Mismatch

Severity: High
Confidence: 95%
Finding
The skill is presented as a narrow YouTube-audio extractor, but the documented behavior exposes a much broader remote video editing and rendering pipeline with uploads, timeline manipulation, and export. This mismatch can mislead users and host agents into sending broader content and commands to a third-party service than expected, increasing the risk of unauthorized data handling and scope creep.

Description-Behavior Mismatch

Severity: Medium
Confidence: 92%
Finding
The manifest and introductory text imply that the input is a YouTube URL, while the workflow also supports direct file uploads and generic media URL ingestion. That discrepancy broadens the data the skill can exfiltrate to the backend and may violate user expectations or bypass policy controls written for URL-only handling.

Context-Inappropriate Capability

Severity: Medium
Confidence: 89%
Finding
The skill instructs the agent to obtain tokens, create sessions, and manage credit-related backend interactions even though the advertised purpose is simple audio extraction. This expands privileges and external account interaction beyond the minimum necessary, creating opportunities for unauthorized service use, hidden billing exposure, and opaque authentication flows.

Intent-Code Divergence

Severity: Medium
Confidence: 90%
Finding
The documentation repeatedly claims audio extraction, yet it also promises 1080p MP4 rendered video outputs. This inconsistency can cause users and higher-level agents to misunderstand what content will be generated, stored, or transmitted, which is especially risky when media processing occurs on a remote backend.

Vague Triggers

Severity: Medium
Confidence: 87%
Finding
The catch-all rule routes nearly any unmatched prompt into the SSE action, effectively granting broad backend command execution for vague or unrelated user input. In a skill backed by a remote editing service, that permissive routing increases the chance of unintended data transmission, unexpected actions, and abuse of capabilities outside the advertised scope.
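The contrast with a catch-all rule can be shown with a router that dispatches only on an explicit match and otherwise hands the prompt back to the host agent. This is a minimal sketch: the `extract_audio` intent and its regular expression are hypothetical, because the skill's actual trigger list is not reproduced in this report.

```python
import re
from typing import Optional

# Hypothetical intent pattern; the skill's real triggers are not shown in this report.
INTENT_PATTERNS = {
    "extract_audio": re.compile(r"\b(extract|download|get)\b.*\baudio\b", re.IGNORECASE),
}


def route(prompt: str) -> Optional[str]:
    """Return the matched intent name, or None so the host agent handles the prompt.

    Unlike a catch-all rule, an unmatched prompt is never forwarded to the backend.
    """
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(prompt):
            return intent
    return None
```

Returning `None` instead of defaulting to the SSE action means vague or unrelated input stays with the host agent unless a specific, reviewed pattern opts it in.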

Missing User Warnings

Severity: Medium
Confidence: 96%
Finding
The skill omits a clear upfront warning that user prompts, files, and media URLs are sent to a third-party cloud service for processing. This is a meaningful privacy and data-governance issue because users may disclose sensitive media or text under the assumption that processing is local or narrowly scoped.

VirusTotal

65/65 vendors reported this skill as clean.
