qwencloud-vision

Security checks across malware telemetry and agentic risk

Overview

This cloud vision skill mostly matches its stated purpose, but it also includes update/install and agent-configuration behavior that goes beyond image and video analysis.

Install only if you are comfortable with a cloud provider processing the images, videos, OCR documents, and prompts you submit. Avoid using it on secrets, IDs, financial documents, private screenshots, or regulated data unless you have approval. Read the update-check and agent-compatibility sections carefully before allowing any npx install or CLAUDE.md/AGENTS.md modification.

SkillSpector

By NVIDIA

Vulnerability Patterns

Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
Taint Tracking: Direct Taint Flow, Variable-Mediated Taint Flow, Credential Exfiltration Chain
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection

Findings (26)

Tainted flow: 'req' from os.getenv (line 687, credential/environment) → urllib.request.urlopen (network output)

Critical

Category: Data Flow
Content:
    """Download a file from *url* to *dest*, creating parent dirs as needed."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    req = urllib.request.Request(url, headers={"User-Agent": "qwencloud-ai/1.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        dest.write_bytes(resp.read())
    return dest
Confidence: 88%
Finding: with urllib.request.urlopen(req, timeout=timeout) as resp:

Context-Inappropriate Capability

Medium

Confidence: 95%
Finding: The skill includes an update workflow that tells the agent to install another skill with `npx skills add` and then run local control scripts. For an image/video understanding skill, this is functionality expansion unrelated to the core user task, and it creates a supply-chain and local-state-modification risk because the agent is instructed to fetch and install additional code automatically.

Description-Behavior Mismatch

Medium

Confidence: 94%
Finding: The document instructs the agent to scan sibling directories for other qwencloud skills and modify user or agent configuration files to register them. That behavior exceeds the stated purpose of a vision-analysis skill and creates an unnecessary capability for persistence and environment modification, which could be abused to expand future auto-invocation surface without clear user intent.

Context-Inappropriate Capability

Medium

Confidence: 96%
Finding: The compatibility guidance gives a vision skill instructions to append or replace blocks in CLAUDE.md or AGENTS.md, effectively granting it a configuration-management role unrelated to image or video understanding. Even though it says to ask the user first, embedding file-modification behavior in the skill increases the chance of privilege creep, accidental persistence, and unauthorized expansion of agent behavior.

Context-Inappropriate Capability

Medium

Confidence: 86%
Finding: The guide instructs environment manipulation and shell-level troubleshooting that extends beyond the skill’s declared purpose of vision analysis. While some setup guidance is normal, directing proxy changes, SSL certificate fixes, and broader execution fallback behavior increases operational scope and can normalize unsafe system changes without clear user consent or bounded necessity.

Context-Inappropriate Capability

Medium

Confidence: 91%
Finding: The fallback cascade authorizes broad code reading, reimplementation, and 'autonomous resolution' by inspecting scripts and references beyond what is necessary for a narrowly scoped vision skill. This materially expands agent autonomy and creates a path for unintended capability escalation, including generating new execution artifacts and adapting behavior outside the audited interface.

Intent-Code Divergence

Medium

Confidence: 97%
Finding: The prompt guide explicitly tells the model to 'Show your complete reasoning before the final answer,' which is an instruction to disclose chain-of-thought. That can conflict with model safety policies around hidden reasoning and can be used as a semantic jailbreak pattern that encourages policy-violating internal rationale exposure. In a vision skill, this is unnecessary because the task can be completed with concise reasoning summaries or stepwise conclusions without exposing internal deliberations.

Context-Inappropriate Capability

Medium

Confidence: 94%
Finding: This file implements update-checking, state tracking, and prompting/install guidance that are unrelated to the advertised image/video analysis purpose of the skill. More importantly, it discovers and executes a repository-local check_update.py from skill-controlled directories, creating a hidden execution path and persistence-like behavior that expands the trust boundary beyond the declared vision functionality.

Description-Behavior Mismatch

Medium

Confidence: 95%
Finding: The script explicitly includes the model's chain-of-thought in the returned JSON via the "reasoning" field. Exposing internal reasoning can leak sensitive intermediate analysis, prompt-derived data, or hidden decision traces beyond the skill's stated purpose of image/video understanding, and many providers discourage or prohibit surfacing raw chain-of-thought.
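One way to picture the alternative to this behavior is a result builder that leaves the reasoning trace out unless the caller asks for it. This is a minimal sketch, not the skill's code: the function name `build_result` and the `include_reasoning` flag are hypothetical, and only the "reasoning" field name comes from the finding.

```python
import json
from typing import Optional


def build_result(answer: str, reasoning: Optional[str],
                 include_reasoning: bool = False) -> str:
    """Serialize a response, omitting the reasoning trace unless requested.

    Hypothetical helper: the skill's real return structure is not shown
    in this report beyond the "reasoning" field.
    """
    result = {"answer": answer}
    # Only surface the trace when the caller explicitly opted in.
    if include_reasoning and reasoning:
        result["reasoning"] = reasoning
    return json.dumps(result, ensure_ascii=False)
```

With the default arguments the returned JSON contains only the answer, so a caller has to opt in before any intermediate analysis is surfaced.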

Description-Behavior Mismatch

Low

Confidence: 89%
Finding: The code prints both reasoning_content and answer_content to stderr during streaming as an undocumented side effect. This can leak sensitive model output into logs, terminals, agent traces, or supervising systems even when the caller did not request disclosure, increasing the chance of unintended data exposure.
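The side effect described here could be made opt-in rather than implicit. The sketch below assumes each streamed chunk is a dict carrying `reasoning_content` and/or `answer_content` keys; that chunk shape, the function name `consume_stream`, and the `echo` flag are assumptions for illustration, not the skill's actual interface.

```python
import sys
from typing import Iterable, Mapping, TextIO


def consume_stream(chunks: Iterable[Mapping[str, str]],
                   echo: bool = False,
                   log: TextIO = sys.stderr) -> str:
    """Collect the streamed answer; write to `log` only when echo is True.

    Reasoning chunks are never echoed or returned in this sketch.
    """
    answer_parts = []
    for chunk in chunks:
        text = chunk.get("answer_content") or ""
        answer_parts.append(text)
        if echo and text:
            log.write(text)
    return "".join(answer_parts)
```

By default nothing reaches stderr, so logs and agent traces receive model output only when the caller has asked for it.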

Missing User Warnings

Low

Confidence: 89%
Finding: The skill tells the agent to append a placeholder API key entry to `.env` without an upfront warning that this modifies a project configuration file. While the placeholder itself is not a secret, silent or implicit config-file mutation can break local workflows, overwrite expectations about secret handling, and normalize unsafe automated edits to sensitive files.
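A guarded version of this edit might look like the following sketch, which appends a placeholder only when the key is absent and a confirmation callback agrees. The helper name `add_env_placeholder`, the `confirm` callback, and the key name `EXAMPLE_API_KEY` in the usage line are hypothetical; the report does not state which variable the skill writes.

```python
from pathlib import Path
from typing import Callable


def add_env_placeholder(env_path: Path, key: str,
                        confirm: Callable[[str], bool]) -> bool:
    """Append `KEY=` to env_path if absent and approved. Returns True if written."""
    existing = env_path.read_text(encoding="utf-8") if env_path.exists() else ""
    # Leave the file untouched when the key is already defined.
    if any(line.split("=", 1)[0].strip() == key for line in existing.splitlines()):
        return False
    # Ask before mutating a project configuration file.
    if not confirm(f"Append a placeholder for {key} to {env_path}?"):
        return False
    # Avoid gluing the new entry onto an unterminated last line.
    prefix = "" if existing == "" or existing.endswith("\n") else "\n"
    with env_path.open("a", encoding="utf-8") as fh:
        fh.write(f"{prefix}{key}=\n")
    return True
```

A caller could pass something like `lambda msg: input(msg + " [y/N] ").lower() == "y"` as `confirm`, so the file is never modified without an explicit answer.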

Missing User Warnings

Medium

Confidence: 94%
Finding: The update path directs the agent to run a package installation command and local state-changing scripts without a prominent safety warning. This is dangerous because it can introduce unreviewed code into the environment and alter local skill state, all under a 'mandatory' post-execution flow that pressures automatic execution.

Missing User Warnings

Medium

Confidence: 84%
Finding: The guide encourages users to send local images, videos, OCR content, and invoice data to a third-party cloud vision API, including automatic base64 conversion and upload flows, but it does not prominently warn that this transfers potentially sensitive local files off-device. In a vision skill context, users are especially likely to submit screenshots, documents, IDs, invoices, or internal media, so omission of privacy and data-handling cautions can lead to unintended exposure of confidential or regulated data.

Missing User Warnings

Medium

Confidence: 95%
Finding: The examples instruct users to inline local image data and send it, along with prompts, to a remote third-party API, but this file does not itself warn about privacy, confidentiality, or data-transfer implications. In a vision skill, users may reasonably test with screenshots, documents, or videos containing sensitive information, so omission of an explicit warning increases the risk of accidental data disclosure.

Missing User Warnings

Medium

Confidence: 95%
Finding: The curl guidance instructs users to send image, video, and prompt content to an external API endpoint but does not warn that potentially sensitive visual data will leave the local environment. In a vision-analysis skill, this omission is especially relevant because screenshots, documents, and videos often contain personal, confidential, or regulated information.

Missing User Warnings

Medium

Confidence: 93%
Finding: The examples show sending images to a third-party cloud OCR endpoint, including document types such as receipts, ID cards, contracts, and tickets, but provide no warning about privacy, consent, retention, or handling of sensitive data. In an OCR skill, this is materially risky because users may reasonably submit highly sensitive documents and the documentation normalizes remote transmission without safeguards.

Natural-Language Policy Violations

High

Confidence: 99%
Finding: This is a genuine vulnerability: the guide instructs the model to reveal 'complete chain-of-thought reasoning,' which directly pressures it to expose internal reasoning traces. Such prompts are commonly used to bypass reasoning non-disclosure safeguards and may increase the chance of broader policy leakage or compliance with unsafe prompt patterns. The visual-reasoning context does not justify full reasoning disclosure; the skill only needs accurate answers and brief evidence summaries.

Missing User Warnings

Medium

Confidence: 92%
Finding: The documentation shows examples that send image and video URLs to a third-party API but does not warn users that referenced media may contain sensitive data or that remote URLs can disclose private content, metadata, or internal resources. In a vision-analysis skill, this omission is more dangerous because users are specifically encouraged to submit screenshots, documents, charts, and videos, which commonly contain confidential information.

Missing User Warnings

Medium

Confidence: 95%
Finding: This code sends user prompts and referenced image/video content to a remote DashScope/Qwen API for processing, but the script provides no explicit runtime warning or consent mechanism about external transmission of potentially sensitive local files or extracted content. In a vision-analysis skill, users may pass screenshots, documents, receipts, or videos containing confidential data, so silent upload to a third party creates a real privacy and data-handling risk.
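The kind of consent mechanism this finding says is missing could be as small as a check that runs before any request is built. The following is a sketch under stated assumptions: `confirm_upload`, its `ask` callback, and the endpoint string in the usage note are hypothetical and do not appear in the skill.

```python
from pathlib import Path
from typing import Callable, Sequence


def confirm_upload(paths: Sequence[Path], endpoint: str,
                   ask: Callable[[str], bool]) -> bool:
    """Return True if there are no local files to send or the user approves."""
    # Only local files need consent; remote URLs are not handled here.
    local = [p for p in paths if p.is_file()]
    if not local:
        return True
    listing = ", ".join(f"{p.name} ({p.stat().st_size} bytes)" for p in local)
    return ask(f"Send {listing} to {endpoint}?")
```

A script could call this with the parsed input paths and abort when it returns False, which gives the user one explicit notice naming the files and the destination before anything leaves the machine.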

Missing User Warnings

Medium

Confidence: 94%
Finding: The streaming path also transmits the full request payload to the external API, again without a clear warning that user data will leave the local environment for remote processing. Because this skill is specifically designed for image/video understanding, the streamed mode may expose especially sensitive visual data while giving the user no additional disclosure compared to non-streaming mode.

Missing User Warnings

Medium

Confidence: 91%
Finding: The custom OSS upload path sends local file contents to a user-configured external bucket and returns a presigned URL, but it does so without an explicit warning or consent prompt at the moment of transfer. In a vision skill, users are likely to supply sensitive screenshots, images, or videos, so silent off-platform upload materially increases data exposure risk.

Missing User Warnings

Medium

Confidence: 87%
Finding: The helper downloads remote content and writes it to disk without warning, logging, or validation. In this skill context, where URLs may reference third-party media, silent persistence of network-retrieved files can surprise users, store malicious or oversized content locally, and compound SSRF/file-write risks if inputs are user-controlled.
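To make the missing validation concrete, here is a minimal sketch of a download helper that checks the URL scheme and caps the response size before writing to disk. The names `check_url` and `safe_download` and the 50 MB limit are assumptions chosen for illustration; only the request construction mirrors the snippet quoted in the first finding.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

MAX_BYTES = 50 * 1024 * 1024  # hypothetical cap, not taken from the skill


def check_url(url: str) -> str:
    """Reject anything that is not an http(s) URL with a host."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"refusing to download from {url!r}")
    return url


def safe_download(url: str, dest: Path, timeout: float = 30.0) -> Path:
    """Download *url* to *dest* after a scheme check, enforcing a size cap."""
    check_url(url)
    dest.parent.mkdir(parents=True, exist_ok=True)
    req = urllib.request.Request(url, headers={"User-Agent": "qwencloud-ai/1.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # Read one byte past the cap so an oversized body is detectable.
        data = resp.read(MAX_BYTES + 1)
    if len(data) > MAX_BYTES:
        raise ValueError("response exceeds size cap")
    dest.write_bytes(data)
    return dest
```

This sketch does not address every SSRF concern raised in the finding: an http(s) URL can still point at an internal address, so a real fix would likely also need host allow-listing or address checks.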

Missing User Warnings

Medium

Confidence: 82%
Finding: The skill constructs multimodal content from local image/video inputs and sends it to a remote model service, but the user-facing behavior does not clearly warn that local files are transmitted off-host. In a vision skill this data flow is expected, but the lack of explicit notice can still cause privacy and compliance issues if users provide sensitive screenshots, documents, or videos.

Missing User Warnings

Medium

Confidence: 90%
Finding: The --upload-files option uploads local media to temporary remote storage (oss://) without a strong warning at the point of use. Even if temporary, this expands the exposure surface beyond direct API submission and can create privacy, retention, and access-control risks for sensitive visual inputs.

Ssd 1

Medium

Confidence: 96%
Finding: The template acts as a semantic jailbreak by embedding an instruction to reveal full reasoning in a reusable prompt guide. Because this is reference material for a vision skill, downstream users or agents may repeatedly propagate the unsafe pattern into production prompts, making the issue systemic rather than isolated. The context mildly increases risk because visual reasoning tasks are exactly where users may be tempted to ask for step-by-step hidden reasoning, so the guide normalizes an unsafe practice.

VirusTotal

66/66 vendors flagged this skill as clean.