Video To Text Ai Free

Security checks across malware telemetry and agentic risk

Overview

This skill appears to be a real cloud video-processing integration, but it does more than simple transcription and may send media or broad edit commands to a third-party backend.

Install only if you are comfortable with a third-party cloud video editor, not just a local transcript extractor. Avoid confidential recordings unless you trust NemoVideo, protect the NEMO_TOKEN, and explicitly confirm uploads, URL ingestion, edits, and exports before proceeding.
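The advice to confirm uploads, URL ingestion, edits, and exports can be pictured as a small consent gate. The sketch below is illustrative only: `run_with_consent`, `SENSITIVE_ACTIONS`, and the callback signatures are hypothetical names, not part of the skill or of any NemoVideo API.

```python
from typing import Callable

# Hypothetical action labels, taken from the four operations named in the overview.
SENSITIVE_ACTIONS = {"upload", "url_ingest", "edit", "export"}


def run_with_consent(
    action: str,
    perform: Callable[[], str],
    confirm: Callable[[str], bool],
) -> str:
    """Run `perform` only after the user approves a sensitive action."""
    if action in SENSITIVE_ACTIONS:
        prompt = f"Send data to the third-party backend for '{action}'?"
        if not confirm(prompt):
            # The user declined, so nothing leaves the machine.
            return "cancelled"
    return perform()
```

In this sketch a declined prompt returns `"cancelled"` without calling `perform`, and actions outside the sensitive set run without a prompt.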

SkillSpector

By NVIDIA
Vulnerability Patterns
  • Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
  • Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
  • Trigger Abuse: Overly Broad Trigger, Shadow Command Trigger, Keyword Baiting Trigger
  • MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
  • Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
Findings (7)

Description-Behavior Mismatch

Severity: High
Confidence: 96%
Finding
The skill is presented as a simple video-to-text transcription tool, but the instructions expose a much broader remote video editing and rendering pipeline, including timeline manipulation, overlays, audio/BGM changes, and export operations. This scope mismatch is dangerous because it expands the agent's effective permissions and data flows beyond what a user would reasonably expect, increasing the chance of unauthorized remote processing or misuse of uploaded media.

Description-Behavior Mismatch

Severity: Medium
Confidence: 93%
Finding
The skill promises transcript export, but the documented result is a rendered 1080p MP4 video, which is materially different from a text transcript. This can mislead users about what data will be produced, stored, and transmitted, and may cause them to send media under false assumptions about the processing purpose.

Context-Inappropriate Capability

Severity: Medium
Confidence: 89%
Finding
Allowing URL-based ingestion introduces a broader attack and privacy surface than user-uploaded local files alone, including server-side fetching of arbitrary remote content. For a transcription skill, this capability is not clearly necessary and could be abused to pull in unintended or sensitive resources, depending on backend protections.
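One common mitigation for this class of risk is to validate a URL before any server-side fetch. The following is a minimal sketch, assuming a host allowlist is acceptable; `is_url_allowed` and `ALLOWED_HOSTS` are hypothetical, and nothing in the scan indicates whether the backend performs a check like this.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical allowlist; the skill's documentation names no such list.
ALLOWED_HOSTS = {"media.example.com"}


def is_url_allowed(url: str) -> bool:
    """Return True only for https URLs whose host is explicitly allowlisted."""
    try:
        parsed = urlparse(url)
        host = parsed.hostname
    except ValueError:
        # Malformed URLs (for example, a broken IPv6 literal) are rejected.
        return False
    if parsed.scheme != "https" or host is None:
        return False
    try:
        # Literal IP addresses are refused outright, which covers loopback,
        # private, and link-local targets such as cloud metadata endpoints.
        ipaddress.ip_address(host)
        return False
    except ValueError:
        pass  # Not an IP literal; fall through to the hostname check.
    return host in ALLOWED_HOSTS
```

This check inspects only the URL string. A real fetcher would also need to verify the addresses the hostname resolves to and handle redirects, since either can point an allowlisted-looking request at an internal resource.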

Intent-Code Divergence

Severity: Medium
Confidence: 91%
Finding
The documentation frames the skill as transcription-focused, but later authorizes general editing behaviors such as adding BGM and manipulating a timeline. This inconsistency weakens user consent and makes it easier for the agent to perform broader cloud actions than the user intended when invoking a seemingly narrow transcription tool.

Vague Triggers

Severity: Medium
Confidence: 88%
Finding
Routing 'everything else' into the SSE edit path creates an overly broad catch-all trigger that can capture unrelated user requests and send them to a remote backend. In the context of a skill with hidden broader capabilities, this increases the likelihood of unintended activation, data disclosure, and actions outside the user's expectations.
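The alternative to a catch-all is to route only recognized requests and ask the user about everything else. The sketch below assumes simple phrase matching; `Route`, `TRANSCRIBE_PHRASES`, and `route_request` are hypothetical and do not reproduce the skill's actual trigger list or routing logic.

```python
from enum import Enum


class Route(Enum):
    TRANSCRIBE = "transcribe"
    ASK_USER = "ask_user"


# Hypothetical phrases chosen for illustration only.
TRANSCRIBE_PHRASES = ("transcribe", "video to text", "extract transcript")


def route_request(message: str) -> Route:
    """Send only recognized transcription requests to the backend path."""
    text = message.lower()
    if any(phrase in text for phrase in TRANSCRIBE_PHRASES):
        return Route.TRANSCRIBE
    # Unrecognized requests fall back to a clarifying question
    # instead of being forwarded to a remote edit endpoint.
    return Route.ASK_USER
```

With this default, a request such as "add background music" yields `Route.ASK_USER` and is not sent to the remote backend.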

Vague Triggers

Severity: Medium
Confidence: 86%
Finding
Generic trigger phrases such as 'convert my video files', or fragments of everyday language, can collide with normal conversation and accidentally activate the skill. Because activation may lead to cloud processing and session creation, ambiguous triggers create privacy and consent risks disproportionate to the narrow advertised purpose.

Missing User Warnings

Severity: Medium
Confidence: 95%
Finding
The user-facing description emphasizes convenience but does not clearly warn that uploaded media, prompts, and session data are sent to a third-party cloud backend for processing. For a media-processing skill handling potentially sensitive audio/video, this is a significant transparency and privacy failure that can undermine informed consent.

VirusTotal

62/62 vendors flagged this skill as clean.
