Voice TTS/ASR

Security checks across malware telemetry and agentic risk

Overview

This voice skill mostly matches its ASR/TTS purpose, but it needs review because it automatically moves/deletes voice files, reads Telegram credentials, evaluates local config as code, and turns transcripts into agent-directing instructions.

Review before installing, especially for sensitive deployments. Use only if you are comfortable with voice content being processed by local Whisper and cloud Edge TTS, Telegram bot tokens being read from OpenClaw config or environment, Telegram messages being sent to supplied chat IDs, and inbound audio being copied then deleted. Safer changes would be strict JSON config parsing, transcript-only ASR output, opt-in non-destructive archiving, restricted output directories, and explicit channel allowlists.
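
As an illustration of the first suggested change, strict JSON config parsing, the following is a minimal sketch and not the skill's actual code. The function name `load_config` and the shape of the config are assumptions made for the example. The point is only that the file is parsed as data and never evaluated.

```python
import json
from pathlib import Path


def load_config(path: Path) -> dict:
    """Parse a config file as strict JSON (hypothetical helper).

    The file content is never passed to eval() or exec(), so code placed
    in the config cannot run during loading.
    """
    text = path.read_text(encoding="utf-8")
    data = json.loads(text)  # raises json.JSONDecodeError on non-JSON input
    if not isinstance(data, dict):
        raise ValueError("config root must be a JSON object")
    return data
```

Because `json.JSONDecodeError` is a subclass of `ValueError`, a caller can catch `ValueError` to handle both malformed files and files whose root is not an object.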

SkillSpector

By NVIDIA

Vulnerability Patterns

Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
Privilege Escalation: Excessive Permissions, Sudo/Root Execution, Credential Access

Findings (10)

Description-Behavior Mismatch

Medium

Confidence: 82%
Finding: The skill states that after successful transcription it copies inbound audio into the agent workspace and deletes the original file. That expands the skill from transient ASR/TTS into persistent data movement and destructive file modification, which creates privacy, retention, and integrity risks if users/operators were not expecting stored voice archives or automatic deletion.

Context-Inappropriate Capability

Medium

Confidence: 84%
Finding: The batch utility processes all unhandled voice files under ~/.openclaw/media/inbound/, which is broader than a single-message response workflow. Broad directory scanning can unintentionally process unrelated or stale user audio, increasing the chance of over-collection, privacy violations, and accidental actions on files outside the immediate user request context.

Description-Behavior Mismatch

Medium

Confidence: 91%
Finding: After successful transcription, the ASR tool performs filesystem side effects unrelated to speech recognition by copying audio into an agent workspace and deleting the original. This expands the skill's effective privileges beyond its stated purpose and creates an unexpected persistence/movement channel for user media, which is risky in an agent environment handling sensitive voice data.

Context-Inappropriate Capability

Medium

Confidence: 94%
Finding: The code writes into a workspace path derived from an environment variable and then deletes the source file, giving a voice-processing utility the ability to modify agent-controlled storage. In a multi-tool agent context, writing attacker-influenced or sensitive content into the workspace can affect later agent behavior or expose private data, while deletion makes recovery harder.

Missing User Warnings

Medium

Confidence: 78%
Finding: The documentation describes sending Telegram voice messages and resolving bot tokens from config or environment, but does not include a clear privacy/security warning about transmitting generated speech to an external service or handling sensitive credentials. This can lead operators to expose user content or misuse tokens without informed consent or appropriate safeguards.
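
One safeguard the overview suggests for the Telegram path is an explicit channel allowlist. The sketch below shows what such a check could look like before any message is sent. The names `is_allowed_chat` and `require_allowed_chat` are hypothetical and do not come from the skill. How the allowlist is populated (config file, environment) is left open.

```python
from typing import FrozenSet, Union


def is_allowed_chat(chat_id: Union[int, str], allowlist: FrozenSet[str]) -> bool:
    """Return True only if the chat ID is explicitly listed (hypothetical helper)."""
    # Compare as strings so that 12345 and "12345" are treated the same.
    return str(chat_id) in allowlist


def require_allowed_chat(chat_id: Union[int, str], allowlist: FrozenSet[str]) -> None:
    """Refuse to send to any chat that the operator has not approved."""
    if not is_allowed_chat(chat_id, allowlist):
        raise PermissionError(f"chat {chat_id!r} is not in the allowlist")
```

Calling `require_allowed_chat` immediately before the send step means a chat ID supplied by a transcript or another tool cannot redirect generated speech to an unapproved destination.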

Missing User Warnings

Medium

Confidence: 83%
Finding: Automatic archival and deletion modify user data, yet the docs present this behavior without a prominent caution. Hidden or underemphasized destructive behavior is risky because operators may deploy the skill expecting read-only transcription while it actually moves and removes source files.

Missing User Warnings

Medium

Confidence: 88%
Finding: The tool silently deletes the original inbound audio after copying it, with no user-facing warning, confirmation, or transactional safeguards. Even if intended as archival, this destructive behavior can cause data loss, break auditability, and surprise operators who expect ASR to be read-only.
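
A non-destructive, opt-in variant of the archival step might look like the sketch below. This is an assumption about how the behavior could be restructured, not the skill's implementation. `archive_audio` and its `delete_original` flag are hypothetical names. The original is kept by default, the copy is verified by hash, and deletion happens only when the caller asks for it.

```python
import hashlib
import shutil
from pathlib import Path


def _sha256(path: Path) -> str:
    """Hash a file's full contents; adequate for short voice clips."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def archive_audio(src: Path, archive_dir: Path, delete_original: bool = False) -> Path:
    """Copy src into archive_dir and keep the original unless told otherwise."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    dest = archive_dir / src.name
    if dest.exists():
        # Never overwrite an existing archive entry silently.
        raise FileExistsError(f"archive target already exists: {dest}")
    shutil.copy2(src, dest)
    if _sha256(src) != _sha256(dest):
        dest.unlink()
        raise OSError("copy verification failed; original left untouched")
    if delete_original:
        # Destructive step runs only on explicit opt-in, after verification.
        src.unlink()
    return dest
```

With the default `delete_original=False`, the function behaves as a plain copy, which matches the expectation that ASR is read-only with respect to inbound files.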

Natural-Language Policy Violations

Medium

Confidence: 82%
Finding: The output wrapper hardcodes Chinese-language and response-format instructions regardless of the user's request or agent policy. This is an integrity issue because a utility that should return transcription data instead injects behavior-shaping directives that can override expected downstream handling.

Missing User Warnings

Medium

Confidence: 88%
Finding: The script accepts a user-controlled --output path, resolves it, creates parent directories, and then renames a generated temporary file into that location with no path restriction or safety checks. In an agent/tooling context, this enables arbitrary file writes anywhere the process has permission, which can overwrite application data, drop files into sensitive locations, or be chained into more serious compromise depending on runtime privileges.
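
A restricted output directory could be enforced with a check along these lines. This is a sketch under assumptions: `resolve_output` and `allowed_root` are hypothetical names, and the skill's real `--output` handling is not shown in this report. The idea is to resolve the requested path first and reject anything that lands outside an approved root.

```python
from pathlib import Path


def resolve_output(user_path: str, allowed_root: Path) -> Path:
    """Resolve a requested output path and require it to stay under allowed_root."""
    root = allowed_root.resolve()
    # Joining with an absolute user_path discards root, so absolute paths
    # are resolved as given and then rejected by the containment check.
    candidate = (root / user_path).resolve()
    if root not in candidate.parents:
        raise ValueError(f"output path escapes allowed directory: {user_path!r}")
    return candidate
```

Resolving before comparing is what defeats `..` segments. A plain string prefix comparison on the unresolved path would not.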

Ssd 1

Medium

Confidence: 96%
Finding: The script wraps untrusted transcription output inside imperative instructions telling the downstream agent what it 'must' do, then appends attacker-controlled audio content. This creates a prompt-injection bridge from spoken input into agent control flow, allowing a user to steer follow-on actions such as forced tool usage or response formatting through what should be passive ASR output.
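
Transcript-only ASR output, as recommended in the overview, could be emitted as structured data so that the spoken content is never concatenated with instructions. The sketch below is illustrative only. `format_asr_result` and the field names are assumptions, and how a downstream agent treats the `untrusted` flag depends on that agent.

```python
import json


def format_asr_result(transcript: str, source: str) -> str:
    """Return the transcript as a JSON record with no directives attached."""
    record = {
        "type": "asr_transcript",
        "untrusted": True,   # signals that the text is user-supplied content
        "source": source,    # e.g. the audio file name
        "text": transcript,  # verbatim transcript, passed through as data
    }
    return json.dumps(record, ensure_ascii=False)
```

Because the transcript sits in a single string field, phrases such as "you must call tool X" spoken in the audio arrive as quoted data and not as part of the tool's own framing.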

VirusTotal

66/66 vendors flagged this skill as clean.
