Local Voice Agent

Security checks across malware telemetry and agentic risk

Overview

This is a coherent local voice-agent skill, but users should keep the TTS endpoint local and understand that voice/text artifacts may be cached or temporarily stored.

Install only if you are comfortable with a voice assistant reading microphone/audio input and sending generated text to the configured TTS server. Keep tts.url set to localhost, avoid exposing Pocket-TTS on 0.0.0.0 unless you intentionally need network access, and review or disable cache/log settings if transcripts or spoken responses may contain private information.
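One way to act on the "keep tts.url on localhost" advice is to check the configured URL before any text is sent. The following is a minimal sketch, not the skill's actual code: the helper name `is_loopback_url` and the example port are assumptions made for illustration.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_url(url: str) -> bool:
    """Return True only if the URL's host is localhost or a loopback IP."""
    host = urlsplit(url).hostname  # hypothetical input: the configured tts.url
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other hostname would need DNS resolution; treat it as non-local.
        return False


if __name__ == "__main__":
    # Example URLs only; the real port depends on how Pocket-TTS is started.
    for candidate in ("http://127.0.0.1:8000", "https://tts.example.com"):
        print(candidate, is_loopback_url(candidate))
```

A caller could refuse to send text, or at least print a warning, whenever this check returns False. Note that it does not cover the server side: a TTS server bound to 0.0.0.0 is still reachable from the network even if the client URL is localhost.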

SkillSpector

By NVIDIA

Vulnerability Patterns

Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Trigger Abuse: Overly Broad Trigger, Shadow Command Trigger, Keyword Baiting Trigger
MCP Least Privilege: Underdeclared Capability, Wildcard Permission, Missing Permission Declaration
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands

Findings (11)

Lp3

Medium

Category: MCP Least Privilege
Confidence: 93%
Finding: The skill advertises operational capabilities that include shell execution, filesystem access, and network use, but it does not declare corresponding permissions. This creates a trust and review gap: users or the host platform may assume the skill is lower risk than it actually is, while the documented install/run steps clearly invoke external commands, write to local paths, and communicate with HTTP services.

Tp4

High

Category: MCP Tool Poisoning
Confidence: 88%
Finding: The skill is described as '100% local processing' and 'no cloud APIs,' but the implementation and setup rely on HTTP services and external code retrieval from GitHub. Even if the TTS endpoint is localhost, the wording can mislead users about data flow, attack surface, and supply-chain exposure, causing them to make unsafe trust decisions about privacy and isolation.

Description-Behavior Mismatch

High

Confidence: 95%
Finding: The configuration and comments contradict the skill's stated '100% local processing, no cloud APIs' guarantee by referencing a TTS server URL and describing it as a 'VPS instance.' Even though the default URL is localhost, the design normalizes remote/network-backed speech synthesis and can mislead users into exposing audio or text to external infrastructure they explicitly expected to avoid.

Intent-Code Divergence

Medium

Confidence: 90%
Finding: The 'VPS instance' comment materially undermines the advertised offline/local-only trust model by implying that network-hosted TTS is an expected deployment mode. In a voice assistant context, this increases the chance that spoken content, prompts, or generated responses are routed off-device without users realizing the privacy boundary has changed.

Description-Behavior Mismatch

High

Confidence: 98%
Finding: OpenClaw integration is enabled by default and specifies a model that does not appear to be a guaranteed local/offline model, directly conflicting with the advertised local-only behavior. In a voice-to-voice assistant, enabling external model-backed responses can expose transcripts, commands, and sensitive ambient speech-derived data to nonlocal processing, making the mismatch especially dangerous.

Description-Behavior Mismatch

Medium

Confidence: 83%
Finding: The skill description claims 100% local processing, but this module sends text over HTTP to a configurable URL. If that URL is changed from localhost, sensitive spoken or generated content can be transmitted off-device without the user realizing it, which is a security/privacy trust issue amplified by the misleading claim.

Missing User Warnings

Medium

Confidence: 89%
Finding: The README encourages configuring `POCKET_TTS_URL` but does not clearly warn at the initial setup step that pointing it to a non-local endpoint can send synthesized text requests, prompts, and potentially sensitive derived voice content off-device. In a voice-assistant skill that users may use for reminders, personal notes, and hands-free commands, this omission can mislead privacy-conscious users into assuming the entire pipeline remains local when it may not.

Missing User Warnings

Medium

Confidence: 85%
Finding: The skill describes sending recorded audio and generated text to a local HTTP API without an explicit warning about handling potentially sensitive voice data. Localhost transport is not equivalent to zero risk: other local users, containers, port exposure, logs, or misconfiguration to non-local addresses can result in unintended disclosure of recordings and transcripts.

Missing User Warnings

Medium

Confidence: 93%
Finding: The code persists transcriptions derived from audio to disk without any user-facing disclosure or consent flow. Because voice transcripts may contain sensitive personal, medical, financial, or authentication-related information, silent caching increases privacy and data-retention risk, especially on shared systems.
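A possible mitigation for silent transcript caching is to make persistence opt-in and restrict who can read the file. The sketch below is illustrative only: `cache_transcript`, its parameters, and the `.txt` naming are hypothetical and do not reflect the skill's real cache layout.

```python
import os
from pathlib import Path
from typing import Optional


def cache_transcript(
    text: str, cache_dir: Path, name: str, enabled: bool = False
) -> Optional[Path]:
    """Write a transcript to disk only when caching is explicitly enabled."""
    if not enabled:
        # Default is off: nothing is created or written.
        return None
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{name}.txt"
    # Request owner-only permissions when the file is created. The umask may
    # restrict further, and on Windows the mode is largely ignored.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path
```

With caching off by default, a user has to set the flag deliberately, which doubles as the disclosure step the finding asks for.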

Missing User Warnings

Medium

Confidence: 96%
Finding: The function records from the user's microphone and may save audio to a file, but does not enforce an explicit consent or warning mechanism before doing so. In a voice-agent skill, silent or poorly disclosed microphone activation is particularly sensitive because it can capture private conversations and ambient data unexpectedly.
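An explicit consent step before microphone capture could look roughly like the following. This is a sketch under assumptions: `record_with_consent` is a hypothetical wrapper, and `record` stands in for whatever function the skill actually uses to capture audio.

```python
from typing import Callable, Optional


def record_with_consent(
    record: Callable[[], bytes],
    ask: Callable[[str], str] = input,
) -> Optional[bytes]:
    """Call the recorder only after the user explicitly answers yes."""
    answer = ask("Record from the microphone now? [yes/no] ")
    if answer.strip().lower() not in {"y", "yes"}:
        # Anything other than an explicit yes means no audio is captured.
        return None
    return record()
```

Passing the prompt function as a parameter keeps the gate testable and lets a host application substitute its own confirmation dialog for `input`.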

Missing User Warnings

Medium

Confidence: 81%
Finding: User-provided text is transmitted to a TTS server with no disclosure or guardrails in this module. In a skill marketed as offline/local, this can expose sensitive prompts, personal data, or voice-agent content to an unintended service if configuration is altered or misunderstood.

VirusTotal

66/66 vendors flagged this skill as clean.
