Voice TTS/ASR

Security checks across malware telemetry and agentic risk

Overview

This voice skill mostly matches its stated ASR/TTS purpose, but it needs review because it automatically moves and deletes voice files, reads Telegram credentials, evaluates local config as code, and wraps transcripts in instructions that direct the agent.

Review before installing, especially for sensitive deployments. Use this skill only if you are comfortable with all of the following: voice content is processed by local Whisper and by the cloud Edge TTS service; Telegram bot tokens are read from the OpenClaw config or from the environment; Telegram messages are sent to whatever chat IDs are supplied; and inbound audio is copied and then deleted. Safer changes would be strict JSON config parsing, transcript-only ASR output, opt-in non-destructive archiving, restricted output directories, and explicit channel allowlists.
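
As an illustration of the strict JSON config parsing suggested above, here is a minimal sketch. It assumes the config is a flat JSON object; the function name `load_config` and the key names in `ALLOWED_KEYS` are hypothetical and are not taken from the skill's code.

```python
import json
from pathlib import Path

# Hypothetical schema: these key names are assumptions for illustration only.
ALLOWED_KEYS = {"bot_token", "chat_ids", "archive_inbound"}


def load_config(path):
    """Parse the config strictly as data; file contents are never evaluated as code."""
    raw = Path(path).read_text(encoding="utf-8")
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON input
    if not isinstance(data, dict):
        raise ValueError("config root must be a JSON object")
    unknown = set(data) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return data
```

Because `json.loads` only builds plain data structures, a config file containing an expression or a function call fails to parse instead of executing.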

SkillSpector

By NVIDIA
Vulnerability Patterns
  • Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
  • Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
  • Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
  • MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
  • Privilege Escalation: Excessive Permissions, Sudo/Root Execution, Credential Access
Findings (10)

Description-Behavior Mismatch

Medium
Confidence
82% confidence
Finding
The skill states that after successful transcription it copies inbound audio into the agent workspace and deletes the original file. That expands the skill from transient ASR/TTS into persistent data movement and destructive file modification, which creates privacy, retention, and integrity risks if users/operators were not expecting stored voice archives or automatic deletion.

Context-Inappropriate Capability

Medium
Confidence
84% confidence
Finding
The batch utility processes all unhandled voice files under ~/.openclaw/media/inbound/, which is broader than a single-message response workflow. Broad directory scanning can unintentionally process unrelated or stale user audio, increasing the chance of over-collection, privacy violations, and accidental actions on files outside the immediate user request context.

Description-Behavior Mismatch

Medium
Confidence
91% confidence
Finding
After successful transcription, the ASR tool performs filesystem side effects unrelated to speech recognition by copying audio into an agent workspace and deleting the original. This expands the skill's effective privileges beyond its stated purpose and creates an unexpected persistence/movement channel for user media, which is risky in an agent environment handling sensitive voice data.

Context-Inappropriate Capability

Medium
Confidence
94% confidence
Finding
The code writes into a workspace path derived from an environment variable and then deletes the source file, giving a voice-processing utility the ability to modify agent-controlled storage. In a multi-tool agent context, writing attacker-influenced or sensitive content into the workspace can affect later agent behavior or expose private data, while deletion makes recovery harder.

Missing User Warnings

Medium
Confidence
78% confidence
Finding
The documentation describes sending Telegram voice messages and resolving bot tokens from config or environment, but does not include a clear privacy/security warning about transmitting generated speech to an external service or handling sensitive credentials. This can lead operators to expose user content or misuse tokens without informed consent or appropriate safeguards.
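
One safeguard that would narrow where generated speech can be sent is an explicit chat allowlist checked before any Telegram call. The sketch below is an assumption about how such a check could look; `check_chat_allowed` is a hypothetical helper, not part of the skill.

```python
def check_chat_allowed(chat_id, allowlist):
    """Return the chat ID as a string if it is allowlisted, else raise PermissionError."""
    # Compare as strings so 123 and "123" are treated as the same chat.
    allowed = {str(c) for c in allowlist}
    if str(chat_id) not in allowed:
        raise PermissionError(f"chat {chat_id} is not in the configured allowlist")
    return str(chat_id)
```

A caller would run this check first and only then hand the returned ID to whatever function performs the send.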

Missing User Warnings

Medium
Confidence
83% confidence
Finding
Automatic archival and deletion modify user data, yet the docs present this behavior without a prominent caution. Hidden or underemphasized destructive behavior is risky because operators may deploy the skill expecting read-only transcription while it actually moves and removes source files.

Missing User Warnings

Medium
Confidence
88% confidence
Finding
The tool silently deletes the original inbound audio after copying it, with no user-facing warning, confirmation, or transactional safeguards. Even if intended as archival, this destructive behavior can cause data loss, break auditability, and surprise operators who expect ASR to be read-only.
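
A non-destructive alternative could keep the source file by default and delete it only on explicit request, after the copy has been verified. The following is a minimal sketch under those assumptions; `archive_audio` and its `delete_source` flag are hypothetical names, not the skill's actual interface.

```python
import hashlib
import shutil
from pathlib import Path


def _sha256(path):
    """Hash a file's contents so the copy can be compared with the source."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def archive_audio(src, archive_dir, delete_source=False):
    """Copy src into archive_dir; remove src only if delete_source is True."""
    src = Path(src)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    dest = archive_dir / src.name
    if dest.exists():
        # Refuse to overwrite an earlier archive entry with the same name.
        raise FileExistsError(f"archive target already exists: {dest}")
    shutil.copy2(src, dest)
    if _sha256(src) != _sha256(dest):
        dest.unlink()
        raise OSError("copy verification failed; source left untouched")
    if delete_source:  # opt-in; the default keeps the original in place
        src.unlink()
    return dest
```

With the default arguments the call is purely additive, which matches the read-only behavior operators tend to expect from an ASR step.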

Natural-Language Policy Violations

Medium
Confidence
82% confidence
Finding
The output wrapper hardcodes Chinese-language and response-format instructions regardless of the user's request or agent policy. This is an integrity issue because a utility that should return transcription data instead injects behavior-shaping directives that can override expected downstream handling.

Missing User Warnings

Medium
Confidence
88% confidence
Finding
The script accepts a user-controlled --output path, resolves it, creates parent directories, and then renames a generated temporary file into that location with no path restriction or safety checks. In an agent/tooling context, this enables arbitrary file writes anywhere the process has permission, which can overwrite application data, drop files into sensitive locations, or be chained into more serious compromise depending on runtime privileges.
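
The kind of path restriction this finding calls for can be sketched as follows, assuming the deployment designates a single directory for generated audio. `resolve_output` and `allowed_root` are hypothetical names chosen for illustration.

```python
from pathlib import Path


def resolve_output(user_path, allowed_root):
    """Resolve user_path under allowed_root and reject anything that escapes it."""
    root = Path(allowed_root).resolve()
    # Joining with an absolute user_path discards root, so the check below
    # also rejects absolute paths outside the allowed directory.
    candidate = (root / user_path).resolve()
    try:
        candidate.relative_to(root)
    except ValueError:
        raise ValueError(f"output path escapes {root}: {user_path}") from None
    return candidate
```

The function only validates and returns a path; creating parent directories and writing the file would remain the caller's job, performed on the returned value rather than on the raw `--output` argument.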

Ssd 1

Medium
Confidence
96% confidence
Finding
The script wraps untrusted transcription output inside imperative instructions telling the downstream agent what it 'must' do, then appends attacker-controlled audio content. This creates a prompt-injection bridge from spoken input into agent control flow, allowing a user to steer follow-on actions such as forced tool usage or response formatting through what should be passive ASR output.
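
A transcript-only output format would return the recognized text as labeled data with no imperative wrapper. The sketch below is one possible shape, not the skill's actual output; the function name and the JSON field names are assumptions.

```python
import json


def format_transcript(text, language=None):
    """Serialize the transcript as data; no instructions are added around it."""
    payload = {
        "type": "transcript",
        "untrusted": True,  # signals that the text came from user-supplied audio
        "language": language,
        "text": text,
    }
    # ensure_ascii=False keeps non-Latin transcripts readable in the output.
    return json.dumps(payload, ensure_ascii=False)
```

This does not remove prompt-injection risk by itself, since the downstream agent still has to treat the `text` field as content rather than as commands, but it stops the tool from adding its own directives on top of untrusted speech.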

VirusTotal

66/66 vendors flagged this skill as clean.
