Voice Chat Skill

Security checks across malware telemetry and agentic risk

Overview

This voice-chat skill mostly does what it claims, but it needs review because spoken audio can be sent to cloud speech services while parts of the documentation imply local handling.

Install only if you are comfortable granting microphone access and potentially sending speech or transcribed text to Google, ElevenLabs, or other configured providers. Prefer local/offline modes for sensitive conversations, avoid copying the shell=True playback sample, and treat console transcript output as potentially sensitive.

SkillSpector

By NVIDIA
Vulnerability Patterns
  • Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
  • Output Handling: Unvalidated Output Injection, Cross-Context Output, Unbounded Output
  • Tool Misuse: Tool Parameter Abuse, Chaining Abuse, Unsafe Defaults
  • MCP Least Privilege: Underdeclared Capability, Wildcard Permission, Missing Permission Declaration
  • MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
Findings (11)

Lp3

Medium
Category
MCP Least Privilege
Confidence
93% confidence
Finding
The skill demonstrates capabilities including network access, shell execution, file writes, and environment-variable use, but does not declare permissions or clearly constrain them. This reduces transparency and weakens review/runtime policy enforcement, making it easier for risky behavior to be introduced or overlooked.

Tp4

High
Category
MCP Tool Poisoning
Confidence
89% confidence
Finding
The documented behavior does not fully match the code examples: the default STT example uses Google's online recognition service, external APIs are included, and subprocess/audio-launch behavior goes beyond a simple local voice chat description. Behavior-description mismatch is security-relevant because users may grant trust or permissions based on incomplete expectations.

Intent-Code Divergence

Medium
Confidence
97% confidence
Finding
The security notes claim that voice data is processed locally, but the default implementation sends audio to Google's online speech recognition service. This is a direct privacy and trust issue because users may expose microphone data under a false assumption of local-only handling.

Missing User Warnings

Medium
Confidence
95% confidence
Finding
The documentation explicitly promotes cloud-based speech recognition and AI API integration but does not warn that user audio or transcribed text may be sent to third-party services. For a voice-chat skill, that omission is significant because microphone input is highly sensitive and users may reasonably assume processing is local unless told otherwise.

Missing User Warnings

Medium
Confidence
96% confidence
Finding
The skill sends captured microphone audio to Google's remote speech-recognition service without an explicit consent flow, privacy notice, or clear indication that spoken content leaves the local machine. Because the application continuously captures user speech in an interactive loop, sensitive spoken data could be transmitted off-device unexpectedly, creating a genuine privacy and data-exposure risk.

Missing User Warnings

Medium
Confidence
94% confidence
Finding
In Google recognition mode, captured microphone audio is sent to Google's speech recognition service, but the program does not present a clear, prior privacy notice or obtain explicit user consent before transmission. Because this skill handles live voice input, it may inadvertently send sensitive spoken data to a third party, which is especially relevant in a voice-chat context.
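
The missing consent step described in these findings could be addressed with a small gate that runs before any audio leaves the machine. The following is a minimal sketch, not the skill's actual code: the function name `confirm_cloud_use`, the `CLOUD_PROVIDERS` set, and the provider labels are all assumptions made for illustration.

```python
from typing import Callable

# Hypothetical labels for backends that send audio or text off-device.
CLOUD_PROVIDERS = {"google", "elevenlabs"}


def confirm_cloud_use(provider: str, ask: Callable[[str], str] = input) -> bool:
    """Return True only if the backend is local or the user explicitly agrees."""
    if provider.lower() not in CLOUD_PROVIDERS:
        # Local/offline backends transmit nothing, so no prompt is needed.
        return True
    answer = ask(f"Audio will be sent to {provider}. Continue? [y/N] ")
    # Anything other than an explicit yes (including empty input) is a refusal.
    return answer.strip().lower() in {"y", "yes"}
```

Calling a gate like this once before the capture loop starts, and falling back to a local mode on refusal, would make the transmission visible to the user. The `ask` parameter exists only so the prompt can be replaced in tests.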

Ssd 3

Medium
Confidence
91% confidence
Finding
The conversation loop stores all user utterances and AI responses in memory and prints the full history in plain text at the end of the session. In a voice assistant, users may speak passwords, personal details, or other sensitive information, so echoing the transcript back to the console increases the risk of shoulder-surfing, terminal logging exposure, and accidental disclosure.
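
One way to reduce this exposure is to print a content-free summary by default and show the full text only on request. The sketch below assumes the history is a sequence of `(speaker, text)` pairs; that shape and the name `format_history` are hypothetical, since the report does not show the skill's data structure.

```python
from typing import Iterable, List, Tuple

Turn = Tuple[str, str]  # (speaker, text); assumed shape, not taken from the skill


def format_history(history: Iterable[Turn], show_text: bool = False) -> List[str]:
    """Build end-of-session lines, hiding utterance text unless show_text is set."""
    lines = []
    for index, (speaker, text) in enumerate(history, start=1):
        if show_text:
            lines.append(f"{index}. {speaker}: {text}")
        else:
            # Only the length is reported, so the terminal never echoes content.
            lines.append(f"{index}. {speaker}: [{len(text)} characters hidden]")
    return lines
```

With the default arguments, a turn such as `("user", "my password is hunter2")` is rendered as `1. user: [22 characters hidden]`.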

External Transmission

Medium
Category
Data Exfiltration
Content
}
    }
    
    response = requests.post(url, json=data, headers=headers)
    if response.status_code == 200:
        with open("output.mp3", "wb") as f:
            f.write(response.content)
Confidence
90% confidence
Finding
requests.post(url, json=

External Transmission

Medium
Category
Data Exfiltration
Content
"""Use ElevenLabs TTS"""
    import requests
    
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
    headers = {
        "xi-api-key": api_key or os.environ.get("ELEVENLABS_API_KEY"),
        "Content-Type": "application/json"
Confidence
88% confidence
Finding
https://api.elevenlabs.io/

Unvalidated Output Injection

High
Category
Output Handling
Content
if result.returncode == 0:
            print(f"✅ Audio file generated: {output_file}")
            # Play the audio
            subprocess.run(["start", output_file], shell=True)
        else:
            print(f"❌ TTS failed: {result.stderr}")
    finally:
Confidence
98% confidence
Finding
subprocess.run(["start", output

Tool Parameter Abuse

High
Category
Tool Misuse
Content
if result.returncode == 0:
            print(f"✅ Audio file generated: {output_file}")
            # Play the audio
            subprocess.run(["start", output_file], shell=True)
        else:
            print(f"❌ TTS failed: {result.stderr}")
    finally:
Confidence
97% confidence
Finding
subprocess.run(["start", output_file], shell=True
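
The flagged playback call can be replaced with one that never passes the file name through a shell. The following is a hedged sketch rather than the skill's implementation: `playback_argv` and `play_audio` are hypothetical names, and it assumes `open` (macOS) or `xdg-open` (Linux desktops) is available on non-Windows systems.

```python
import os
import subprocess
import sys
from pathlib import Path
from typing import List, Optional


def playback_argv(path: str, platform: str = sys.platform) -> Optional[List[str]]:
    """Return an argv list for opening the file, or None when os.startfile applies."""
    # An absolute path cannot be mistaken for a command-line option.
    resolved = str(Path(path).resolve())
    if platform.startswith("win"):
        return None  # Windows: os.startfile opens the file without cmd.exe
    if platform == "darwin":
        return ["open", resolved]
    return ["xdg-open", resolved]  # assumption: a desktop Linux environment


def play_audio(path: str) -> None:
    """Open an existing audio file with the default player, without a shell."""
    if not Path(path).is_file():
        raise FileNotFoundError(path)
    argv = playback_argv(path)
    if argv is None:
        os.startfile(str(Path(path).resolve()))  # Windows-only standard-library call
    else:
        # A list argument with the default shell=False is executed directly,
        # so shell metacharacters in the file name are not interpreted.
        subprocess.run(argv, check=False)
```

Separating `playback_argv` from `play_audio` keeps the command construction inspectable without launching a player.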

VirusTotal

66/66 vendors reported this skill as clean.
