Qwen3 TTS Instruct

Security checks across malware telemetry and agentic risk

Overview

This is a coherent cloud text-to-speech skill. It discloses its third-party processing, but some of its usage instructions are worded too strongly; users should restrict it to explicit voice-generation requests.

Install only if you are comfortable using an Alibaba Cloud DashScope API key, installing unpinned Python dependencies in the skill venv, and sending text for synthesis to Alibaba Cloud. Configure the agent to invoke it only when you want audio output, avoid sending secrets or sensitive text, and require explicit user intent before translating content for TTS.
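One way to gauge the unpinned-dependency exposure mentioned above is to list requirement lines that lack an exact `==` pin. The sketch below is a minimal illustration only: the function name `find_unpinned` and the sample requirement lines are hypothetical and not taken from the skill, and the check deliberately ignores extras, environment markers, hashes, and URL requirements.

```python
import re

# Matches a requirement pinned to an exact version, e.g. "requests==2.32.3".
# Assumption: simple "name==version" lines; markers, extras and URLs are not handled.
_PINNED = re.compile(r"^[A-Za-z0-9._-]+==[A-Za-z0-9._!+-]+$")


def find_unpinned(requirement_lines):
    """Return requirement lines that are not pinned with an exact '==' version."""
    unpinned = []
    for raw in requirement_lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not _PINNED.match(line):
            unpinned.append(line)
    return unpinned


# Hypothetical requirements, not taken from the skill itself.
sample = ["dashscope", "requests==2.32.3", "numpy>=1.26  # loose", ""]
print(find_unpinned(sample))
```

Any line the function returns could be replaced with an exact version before installing into the skill venv.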

SkillSpector

By NVIDIA
Vulnerability Patterns
  • Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
  • Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
  • Privilege Escalation: Excessive Permissions, Sudo/Root Execution, Credential Access
  • Supply Chain: Unpinned Dependencies, External Script Fetching, Obfuscated Code
  • Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
Findings (3)

Natural-Language Policy Violations

Severity: Medium
Confidence: 88%
Finding
The skill instructs the agent to always translate foreign-language text before TTS, even when the user may have asked for verbatim speech or pronunciation of the original text. This can silently alter user content, causing integrity and consent issues, and may also result in unintended disclosure if sensitive text is transformed and sent to a third-party service in another language. The danger is amplified because the instruction is framed as mandatory rather than optional user-directed behavior.
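The mandatory-translation issue described above could be mitigated by making translation opt-in. The following is a minimal sketch, not the skill's actual code: `prepare_tts_text`, its parameters, and the `translate` callback are hypothetical names introduced for illustration.

```python
from typing import Callable, Optional


def prepare_tts_text(
    text: str,
    translate_requested: bool = False,
    translate: Optional[Callable[[str], str]] = None,
) -> str:
    """Return the text to synthesize, translating only on explicit request.

    Hypothetical helper: by default the user's text is passed through verbatim.
    """
    if not translate_requested:
        return text  # verbatim speech is the default
    if translate is None:
        raise ValueError("translation requested but no translator was supplied")
    return translate(text)


# Stand-in translator for demonstration only.
def fake_translate(s: str) -> str:
    return f"[translated] {s}"


print(prepare_tts_text("Bonjour"))
print(prepare_tts_text("Bonjour", translate_requested=True, translate=fake_translate))
```

Keeping verbatim pass-through as the default means user content is altered only when the user has asked for it.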

Ssd 1

Severity: Medium
Confidence: 90%
Finding
These instructions attempt to modify the agent's behavior and emotional framing beyond the technical purpose of TTS, explicitly steering it toward submissive or affect-conditioned responses. That is a prompt-level control risk: the skill is trying to shape the assistant's broader behavior, not just audio generation, which can undermine system safety policies and user expectations. In context, this is more concerning because it is presented as a 'SYSTEM MEMORY UPDATE,' implying persistent authority it should not have.

Ssd 1

Severity: Medium
Confidence: 92%
Finding
The skill mandates that every voice response call this skill and requires persona/mood selection based on interaction context, including deferential emotional framing. This overreach can hijack agent routing decisions, suppress safer alternatives, and bias responses toward manipulative or inappropriate personas. Because the skill is a TTS utility, instructions to always invoke it are not justified by its function, and they increase the likelihood of policy bypass or coercive interaction patterns.
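A narrower routing rule than the "always invoke" instruction criticized above might look like the following. This is a sketch under stated assumptions: the `Turn` fields and `should_invoke_tts` are hypothetical and do not correspond to any real agent framework API.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """Hypothetical summary of one user turn, as the agent host might record it."""
    user_asked_for_audio: bool
    contains_sensitive_text: bool = False


def should_invoke_tts(turn: Turn) -> bool:
    """Invoke the TTS skill only on explicit request and never for sensitive text."""
    return turn.user_asked_for_audio and not turn.contains_sensitive_text


print(should_invoke_tts(Turn(user_asked_for_audio=True)))
print(should_invoke_tts(Turn(user_asked_for_audio=False)))
print(should_invoke_tts(Turn(True, contains_sensitive_text=True)))
```

The decision depends only on user intent and content sensitivity, not on persona or mood, which keeps routing separate from the emotional framing the finding objects to.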

VirusTotal

66/66 vendors reported this skill as clean.
