Caption Generator From Photo

Security checks across malware telemetry and agentic risk

Overview

This skill is a real cloud photo-to-video captioning workflow, but it asks the agent to use remote authentication and broader media-editing capabilities than its photo-focused description clearly discloses.

Review before installing. Use only media you are comfortable uploading to nemovideo.ai, prefer a scoped or disposable NEMO_TOKEN, and be aware the skill may handle video/audio editing workflows in addition to photo captioning. The concerns are about scope and disclosure, not confirmed malware.

SkillSpector

By NVIDIA

Vulnerability Patterns

Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
Trigger Abuse: Overly Broad Trigger, Shadow Command Trigger, Keyword Baiting Trigger
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands

Findings (5)

Description-Behavior Mismatch

Medium

Confidence: 94%
Finding: The skill is presented as a narrow photo-to-captioned-video tool, but its documented behavior exposes a much broader remote media-editing surface including video/audio upload, state inspection, timeline manipulation, and export. That mismatch can mislead users and host platforms about what data types and operations are actually permitted, increasing the chance of over-collection, unintended processing, and unsafe delegation to a more powerful backend than advertised.

Description-Behavior Mismatch

Medium

Confidence: 97%
Finding: The manifest and description imply image uploads only, but the skill later declares support for video and audio formats such as mp4, mov, mp3, and wav. This discrepancy creates a capability-declaration gap that can cause users or reviewers to authorize a photo tool while it actually accepts and transmits richer media, which may carry more sensitive content and metadata.
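One way a host or reviewer could close this gap is to enforce a photo-only allowlist before anything is uploaded. The sketch below is illustrative only: the function name `is_photo_upload` and the extension set are assumptions, since this report does not list the image formats the skill actually declares.

```python
from pathlib import Path

# Assumed allowlist for a photo-only tool; the skill's real list is not given in this report.
PHOTO_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def is_photo_upload(filename: str) -> bool:
    """Return True only when the file extension is on the photo allowlist."""
    return Path(filename).suffix.lower() in PHOTO_EXTENSIONS


# Video and audio formats named in the finding are rejected.
for name in ("holiday.JPG", "clip.mp4", "voice.wav"):
    print(name, is_photo_upload(name))
```

An extension check is a coarse filter, not content validation, but it makes the declared scope and the accepted inputs agree.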

Context-Inappropriate Capability

Low

Confidence: 91%
Finding: The skill instructs the agent to infer the local installation platform from filesystem paths and transmit it in headers, even though that information is not needed for basic photo caption generation. This unnecessarily discloses environmental metadata to a third-party backend and can support fingerprinting, product telemetry, or differential handling without user awareness.

Vague Triggers

Medium

Confidence: 92%
Finding: Routing 'Everything else' to the generation/edit SSE path is an overly broad catch-all that can cause unrelated or ambiguous user messages to trigger remote processing and editing actions. In a skill already exposing broader-than-advertised media operations, this increases the risk of unintended uploads, backend actions, or confusion about what the agent will do.
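A narrower alternative to a catch-all is to route only messages that match an explicit trigger and to return nothing otherwise, so the agent can ask the user instead of starting remote processing. The following is a minimal sketch under that assumption; the trigger phrases, action names, and `route` function are hypothetical and do not come from the skill.

```python
from typing import Optional

# Hypothetical trigger phrases; the skill's real routing table is not shown in this report.
EXPLICIT_TRIGGERS = {
    "caption this photo": "generate",
    "export video": "export",
}


def route(message: str) -> Optional[str]:
    """Map a message to an action only on an explicit match; never fall through."""
    text = message.strip().lower()
    for phrase, action in EXPLICIT_TRIGGERS.items():
        if phrase in text:
            return action
    # No match: the caller should ask for clarification rather than call the backend.
    return None
```

With this shape, an unrelated message such as a weather question yields `None` rather than an upload or edit request.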

Missing User Warnings

Medium

Confidence: 89%
Finding: The skill directs the agent to automatically use an environment token or obtain an anonymous token and then send authenticated requests to a remote API, while explicitly hiding technical details from the user. This weakens informed consent around credential use and remote data transmission, and may cause users to unknowingly process files and prompts through a third-party service under implicit authentication.
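The consent gap described here could be narrowed by asking the user before any credential is used. The sketch below assumes a hypothetical `confirm` callback supplied by the host and a hypothetical `resolve_token` helper; only the `NEMO_TOKEN` variable name and the nemovideo.ai destination come from this report. Unlike the skill's described behavior, it does not fall back to an anonymous token.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_token(
    confirm: Callable[[str], bool],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return NEMO_TOKEN only if it is set and the user explicitly approves its use."""
    token = env.get("NEMO_TOKEN")
    if not token:
        # No silent anonymous-token fallback in this sketch.
        return None
    prompt = "Use NEMO_TOKEN to send your media to nemovideo.ai?"
    return token if confirm(prompt) else None
```

A caller would pass a function that shows the prompt and returns the user's answer; a `None` result means no authenticated request should be made.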

VirusTotal

65/65 vendors flagged this skill as clean.
