Qwen3-tts

Security checks across malware telemetry and agentic risk

Overview

The skill provides real text-to-speech functionality, but its advertised offline/local behavior is inconsistent with remote HTTP modes that can transmit text and expose an unauthenticated server.

Install only if you intend to use local ML TTS and, possibly, a remote TTS server. Keep QWEN_TTS_REMOTE unset for private/offline use, avoid sending sensitive text to remote endpoints, bind any server to localhost or a tightly controlled private network, and add access controls before exposing it.
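The guidance above can be turned into a small pre-flight check. The following is a minimal sketch, not the skill's own code: it assumes QWEN_TTS_REMOTE holds an HTTP URL when set, and the helper names `is_loopback_host` and `remote_mode_status` are hypothetical.

```python
import ipaddress
import os
from urllib.parse import urlparse


def is_loopback_host(host: str) -> bool:
    """Return True if host is 'localhost' or a loopback IP literal."""
    if host.lower() == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a DNS name): treat as non-local.
        return False


def remote_mode_status(env=None) -> str:
    """Classify the configuration as 'local', 'loopback', or 'remote'.

    Assumption: QWEN_TTS_REMOTE, when set, contains the server URL.
    """
    env = os.environ if env is None else env
    url = env.get("QWEN_TTS_REMOTE", "").strip()
    if not url:
        return "local"  # variable unset or empty: no text leaves the host
    host = urlparse(url).hostname or ""
    return "loopback" if is_loopback_host(host) else "remote"
```

A wrapper could call `remote_mode_status()` before synthesis and refuse, or ask for confirmation, whenever the result is "remote" and the text may be sensitive.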

SkillSpector

By NVIDIA

Vulnerability Patterns

Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
MCP Least Privilege: Underdeclared Capability, Wildcard Permission, Missing Permission Declaration
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration

Findings (10)

Lp3

Medium

Category: MCP Least Privilege
Confidence: 81%
Finding: The skill documentation advertises shell execution, environment-variable use, and network access (model download and mirror configuration), yet no permissions are declared. This creates a transparency and policy-enforcement gap: users or orchestrators may invoke a skill believing it is low-privilege when it can actually reach the network and execute setup/runtime commands.

Tp4

High

Category: MCP Tool Poisoning
Confidence: 93%
Finding: The declared behavior says the skill is a fully local/offline TTS wrapper around a specific 1.7B model, but the analyzed implementation reportedly also exposes a FastAPI server, supports remote client/server operation, and permits arbitrary model selection while defaulting to a different model. This mismatch is dangerous because it can expand the attack surface from a local single-purpose tool into a network-accessible service with externally influenced model loading, defeating user expectations and potentially enabling SSRF-like fetches, unauthorized exposure, or policy bypass.

Description-Behavior Mismatch

Medium

Confidence: 96%
Finding: The server lets any client supply an arbitrary model identifier and passes it to from_pretrained, which can trigger network access and loading of unapproved artifacts at runtime. This contradicts the stated offline-only behavior and expands the trust boundary from a fixed local TTS model to attacker-influenced external model selection, creating supply-chain and availability risk.

Context-Inappropriate Capability

Medium

Confidence: 92%
Finding: Including a client-controlled model field in the TTS request gives callers a capability beyond plain text-to-speech generation: they can cause the server to switch models and potentially fetch or initialize large external artifacts. In this context, that makes the endpoint more dangerous because an internet-facing TTS API should not expose backend model-loading behavior directly to untrusted users.
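One common mitigation for a client-controlled model field is a server-side allowlist, so that only pre-approved identifiers ever reach the model loader. The following is a minimal sketch under stated assumptions: the identifier `approved/tts-model` and the names `ALLOWED_MODELS`, `DEFAULT_MODEL`, and `resolve_model` are hypothetical and do not come from the analyzed skill.

```python
from typing import Optional

# Hypothetical identifiers: replace with the models the operator has vetted.
DEFAULT_MODEL = "approved/tts-model"
ALLOWED_MODELS = frozenset({DEFAULT_MODEL})


def resolve_model(requested: Optional[str]) -> str:
    """Map a client-supplied model field to an approved identifier.

    A missing or empty value falls back to the default; anything not on
    the allowlist is rejected instead of being passed to the loader.
    """
    if not requested:
        return DEFAULT_MODEL
    if requested not in ALLOWED_MODELS:
        raise ValueError(f"model not permitted: {requested!r}")
    return requested
```

Only the value returned by `resolve_model` would then be handed to the loading call, and a rejected request can be answered with a client error rather than triggering a download.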

Description-Behavior Mismatch

High

Confidence: 98%
Finding: The script explicitly documents that a remote server URL is required, which contradicts the skill metadata claiming the TTS runs entirely offline after initial model download. This misrepresentation is security-relevant because users may provide sensitive text under the assumption it never leaves the host, but the content is transmitted to a network service instead.

Description-Behavior Mismatch

High

Confidence: 99%
Finding: The core synthesis path sends user text, language, voice description, and instructions to an external HTTP endpoint instead of performing local inference. In the context of a skill advertised as an offline alternative to cloud TTS, this creates an unexpected data exfiltration path and breaks the trust boundary users were told to expect.

Context-Inappropriate Capability

Medium

Confidence: 90%
Finding: Importing and depending on the requests library for the primary TTS path indicates a networked architecture that is unjustified by the skill's 'offline local TTS' description. While a network dependency alone is not always dangerous, here it materially increases risk because it enables silent transmission of user-provided speech content contrary to the stated design.

Description-Behavior Mismatch

Medium

Confidence: 90%
Finding: The skill description emphasizes local/offline TTS, but the script also supports remote HTTP synthesis via --remote and QWEN_TTS_REMOTE. This mismatch can mislead users and higher-level agents into sending text to an external service when they expect purely local processing, creating unintended data exfiltration risk.

Context-Inappropriate Capability

Medium

Confidence: 94%
Finding: The remote TTS path transmits user-provided text, speaker settings, and instructions to an arbitrary server URL. In a skill positioned as a local/offline alternative, this broad network capability meaningfully increases the chance of privacy leaks, especially if agents or users rely on the offline claim when handling sensitive content.

Missing User Warnings

Medium

Confidence: 98%
Finding: The instructions explicitly recommend starting the server with `--host 0.0.0.0`, exposing it on all network interfaces, and provide no warning about authentication, access control, or transport security. If the server has no auth and accepts arbitrary synthesis requests, any reachable host on the network could use or abuse the service, potentially exposing sensitive text submitted for synthesis or enabling unauthorized resource consumption.
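A safer default is to bind to the loopback interface and require a shared secret on every request. The following is a minimal sketch, assuming a bearer token carried in an Authorization header; `DEFAULT_HOST` and `is_authorized` are hypothetical names, and the scanned server is not known to implement anything like this.

```python
import hmac
from typing import Optional

# Loopback only: other machines on the network cannot reach the port.
DEFAULT_HOST = "127.0.0.1"


def is_authorized(header_value: Optional[str], expected_token: str) -> bool:
    """Check an 'Authorization: Bearer <token>' header value.

    Fails closed when either the header or the configured token is
    missing, and compares in constant time to avoid timing leaks.
    """
    if not header_value or not expected_token:
        return False
    scheme, _, presented = header_value.partition(" ")
    if scheme.lower() != "bearer":
        return False
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```

A request handler would call `is_authorized` before doing any synthesis and reject the request otherwise. A token sent over plain HTTP is readable on the wire, so transport security still matters once the server leaves localhost.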

VirusTotal

66/66 vendors flagged this skill as clean.
