Caption Generator From Photo

Security checks across malware telemetry and agentic risk

Overview

This skill implements a genuine cloud-based photo-to-video captioning workflow, but it asks the agent to use remote authentication and broader media-editing capabilities than its photo-focused description discloses.

Review before installing. Use only media you are comfortable uploading to nemovideo.ai, prefer a scoped or disposable NEMO_TOKEN, and be aware the skill may handle video/audio editing workflows in addition to photo captioning. The concerns are about scope and disclosure, not confirmed malware.

SkillSpector

By NVIDIA
Vulnerability Patterns
  • Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
  • Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
  • Trigger Abuse: Overly Broad Trigger, Shadow Command Trigger, Keyword Baiting Trigger
  • MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
  • Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
Findings (5)

Description-Behavior Mismatch

Severity: Medium
Confidence: 94%
Finding
The skill is presented as a narrow photo-to-captioned-video tool, but its documented behavior exposes a much broader remote media-editing surface including video/audio upload, state inspection, timeline manipulation, and export. That mismatch can mislead users and host platforms about what data types and operations are actually permitted, increasing the chance of over-collection, unintended processing, and unsafe delegation to a more powerful backend than advertised.

Description-Behavior Mismatch

Severity: Medium
Confidence: 97%
Finding
The manifest and description imply image uploads only, but the skill later declares support for video and audio formats such as mp4, mov, mp3, and wav. This discrepancy creates a capability-declaration gap that can cause users or reviewers to authorize a photo tool while it actually accepts and transmits richer media, which may carry more sensitive content and metadata.
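One way a host or reviewer could close this gap is to enforce an image-only allowlist before anything is uploaded. The following is a minimal sketch, not the skill's own behavior: the extension set and the function names `is_photo_upload` and `filter_uploads` are assumptions chosen for illustration.

```python
from pathlib import Path

# Assumption: extensions a photo-only tool would plausibly accept.
ALLOWED_IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def is_photo_upload(path: str) -> bool:
    """Return True only when the file extension is on the image allowlist."""
    return Path(path).suffix.lower() in ALLOWED_IMAGE_EXTENSIONS


def filter_uploads(paths):
    """Split candidate uploads into (allowed, rejected) lists, preserving order."""
    allowed = [p for p in paths if is_photo_upload(p)]
    rejected = [p for p in paths if not is_photo_upload(p)]
    return allowed, rejected
```

With such a check, files like `clip.mp4` or `voice.wav` would be rejected locally instead of being transmitted under a photo-only authorization. An extension check is only a first filter; it does not inspect file contents.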

Context-Inappropriate Capability

Severity: Low
Confidence: 91%
Finding
The skill instructs the agent to infer the local installation platform from filesystem paths and transmit it in headers, even though that information is not needed for basic photo caption generation. This unnecessarily discloses environmental metadata to a third-party backend and can support fingerprinting, product telemetry, or differential handling without user awareness.
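A host that wants to limit this disclosure could drop every outgoing header that is not required for the request. The sketch below is illustrative only: the set of essential headers and the example header name `X-Client-Platform` are hypothetical, since the report does not name the headers the skill actually sends.

```python
# Assumption: the only headers a basic upload request needs.
ESSENTIAL_HEADERS = {"authorization", "content-type"}


def strip_nonessential_headers(headers: dict) -> dict:
    """Keep only allowlisted headers; header names are compared case-insensitively."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() in ESSENTIAL_HEADERS
    }


# Hypothetical request headers, including an environment-derived platform hint.
outgoing = {
    "Authorization": "Bearer <token>",
    "Content-Type": "application/json",
    "X-Client-Platform": "inferred-from-install-path",
}
cleaned = strip_nonessential_headers(outgoing)
```

Here `cleaned` retains only the `Authorization` and `Content-Type` entries, so the inferred platform value never leaves the machine. A backend that requires the extra header would reject such requests, which is itself useful information for a reviewer.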

Vague Triggers

Severity: Medium
Confidence: 92%
Finding
Routing 'Everything else' to the generation/edit SSE path is an overly broad catch-all that can cause unrelated or ambiguous user messages to trigger remote processing and editing actions. In a skill already exposing broader-than-advertised media operations, this increases the risk of unintended uploads, backend actions, or confusion about what the agent will do.
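The alternative to a catch-all is explicit intent matching with a safe default. The following is a rough sketch under stated assumptions: the trigger patterns, the intent names, and the `ask_user` fallback are hypothetical and are not taken from the skill.

```python
import re

# Hypothetical explicit triggers; anything not matched here is NOT sent to the backend.
EXPLICIT_TRIGGERS = {
    "caption": re.compile(r"\b(caption|subtitle)s?\b", re.IGNORECASE),
    "export": re.compile(r"\b(export|download|render)\b", re.IGNORECASE),
}


def route(message: str) -> str:
    """Return the first matching intent, or 'ask_user' when nothing matches."""
    for intent, pattern in EXPLICIT_TRIGGERS.items():
        if pattern.search(message):
            return intent
    # Safe default: ask for clarification instead of starting remote processing.
    return "ask_user"
```

Under this scheme, "Add captions to this photo" routes to `caption`, while an unrelated message such as "What's the weather?" routes to `ask_user` and triggers no upload or edit.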

Missing User Warnings

Severity: Medium
Confidence: 89%
Finding
The skill directs the agent to automatically use an environment token or obtain an anonymous token and then send authenticated requests to a remote API, while explicitly hiding technical details from the user. This weakens informed consent around credential use and remote data transmission, and may cause users to unknowingly process files and prompts through a third-party service under implicit authentication.
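A consent gate in front of credential use would address this. The sketch below assumes a simple yes/no prompt; the function names are hypothetical, and only `NEMO_TOKEN` and nemovideo.ai come from this report.

```python
def build_consent_prompt(env: dict, host: str = "nemovideo.ai") -> str:
    """Describe, in plain terms, which credential would be used and where data goes."""
    if env.get("NEMO_TOKEN"):
        source = "the NEMO_TOKEN environment variable"
    else:
        source = "a newly requested anonymous token"
    return f"This will upload your file to {host} using {source}. Continue? [y/N]"


def token_if_consented(env: dict, answer: str):
    """Return the environment token only after an explicit 'y' or 'yes'.

    Returns None when consent is refused, or when consent is given but no
    NEMO_TOKEN is set (the caller would then need to obtain an anonymous token).
    """
    if answer.strip().lower() not in {"y", "yes"}:
        return None
    return env.get("NEMO_TOKEN")
```

In practice the `env` argument would be `os.environ`; passing a plain dictionary keeps the sketch easy to test. The point of the design is that no authenticated request is built until the user has seen where the data is going and which credential is involved.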

VirusTotal

65/65 vendors flagged this skill as clean.
