Discord Voice Using Deepgram

Security checks across malware telemetry and agentic risk

Overview

This is a real Discord voice integration, but it lets spoken Discord input reach the agent's normal toolset and logs transcripts, so it needs careful review before use.

Install only in Discord servers and channels you control. Configure numeric Discord IDs for primaryUser or allowedUsers, keep autoJoinChannel off unless needed, tell channel participants that audio may be sent to Deepgram and transcripts may be logged, and avoid enabling voice-triggered access to powerful agent tools without separate approval or a restricted tool profile.
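
The configuration advice above can be checked mechanically before deployment. The following is a minimal Python sketch, assuming the plugin's settings are available as a dictionary with the keys primaryUser, allowedUsers, and autoJoinChannel named above. The function name review_voice_config and the shape of the dictionary are assumptions for illustration, not the plugin's actual schema. It treats a Discord ID as a decimal string of 17 to 20 digits, which is an approximation of the snowflake format.

```python
import re

# Assumption: Discord IDs ("snowflakes") are decimal strings of 17 to 20 digits.
SNOWFLAKE = re.compile(r"\d{17,20}")


def review_voice_config(config: dict) -> list[str]:
    """Return human-readable warnings for a hypothetical plugin config dict."""
    warnings = []

    primary = config.get("primaryUser")
    allowed = config.get("allowedUsers") or []

    # Flag configs that name no speakers at all.
    if primary is None and not allowed:
        warnings.append("no primaryUser or allowedUsers configured")

    # Every configured speaker should be a numeric ID, not a username or tag.
    ids = ([primary] if primary is not None else []) + list(allowed)
    for value in ids:
        if not (isinstance(value, str) and SNOWFLAKE.fullmatch(value)):
            warnings.append(f"not a numeric Discord ID: {value!r}")

    # Any truthy value (True or a channel ID) counts as auto-join being on.
    if config.get("autoJoinChannel"):
        warnings.append("autoJoinChannel is enabled")

    return warnings
```

An empty list means none of the three checks fired. It does not mean the deployment is safe, since the tool-access and logging concerns in the findings below are outside what a config check can see.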

SkillSpector

By NVIDIA

Vulnerability Patterns

Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
MCP Least Privilege: Underdeclared Capability, Wildcard Permission, Missing Permission Declaration
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands

Findings (8)

Lp3

Medium

Category: MCP Least Privilege
Confidence: 90%
Finding: The skill documentation declares required secrets and clearly relies on Discord and Deepgram network access, but it describes no explicit permissions declaration covering those capabilities. This creates a transparency and governance problem: operators may install a networked, credential-using voice plugin without an explicit permission review, increasing the chance of unexpected data egress or misuse of tokens.

Tp4

High

Category: MCP Tool Poisoning
Confidence: 95%
Finding: The documented purpose understates materially important behavior: transcripts are routed into the agent/LLM, remote control surfaces exist via RPC/tools/CLI, speaker control can be changed, and the skill may auto-join channels. These hidden control and data-flow capabilities expand the attack surface and can lead administrators to enable a plugin without understanding that voice data can trigger autonomous actions or that other interfaces can control Discord behavior.

Context-Inappropriate Capability

Medium

Confidence: 94%
Finding: The code explicitly tells the embedded agent handling live voice transcripts that it has access to all normal tools and skills, and a comment shows a prior lane restriction was removed. In a voice-channel context, any participant whose speech is transcribed may be able to trigger powerful agent capabilities indirectly, creating a confused-deputy / over-privileged execution path far beyond simple conversational voice reply behavior.
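
One way to narrow this path is to give voice-originated turns a smaller tool set than the agent's default. The sketch below is illustrative only: the tool names, the VOICE_SAFE_TOOLS allowlist, and the tools_for_session helper are hypothetical and do not correspond to the plugin's code or to any agent framework's API.

```python
# Hypothetical allowlist: tools considered safe to trigger from transcribed speech.
VOICE_SAFE_TOOLS = frozenset({"web_search", "get_time"})


def tools_for_session(all_tools: dict, source: str) -> dict:
    """Return the tools an agent turn may use, keyed by tool name.

    `all_tools` maps tool names to callables; `source` is "voice" for turns
    that originate from a voice transcript and anything else otherwise.
    """
    if source != "voice":
        return dict(all_tools)
    # Voice turns only see tools that are explicitly allowlisted.
    return {name: fn for name, fn in all_tools.items() if name in VOICE_SAFE_TOOLS}
```

An allowlist is used here instead of a denylist so that tools added to the agent later are excluded from voice turns until someone reviews them.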

Missing User Warnings

Medium

Confidence: 96%
Finding: The skill explains the audio pipeline but does not present a prominent privacy warning that Discord voice audio and resulting transcripts are sent to Deepgram, a third party, for processing. In a voice-channel context this is sensitive because bystanders or other participants may be captured, and users may not realize their speech leaves Discord and is additionally processed by an agent system.

Missing User Warnings

Medium

Confidence: 91%
Finding: The manifest clearly indicates that live Discord voice audio is streamed to Deepgram for STT/TTS, but it does not present any explicit user-facing privacy notice, consent language, or data-handling warning. In a voice-channel context, this creates a real privacy and compliance risk because users may be recorded and transmitted to a third-party service without clear disclosure.

Missing User Warnings

Medium

Confidence: 91%
Finding: The plugin logs full transcribed speech content verbatim, which can capture sensitive spoken data such as credentials, personal information, or private conversations. Because this is a Discord voice skill operating on live channel audio, the context increases privacy risk: users may not reasonably expect their speech to be persisted in logs outside Discord itself.
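
A lower-risk logging pattern records that a transcript arrived without persisting what was said. The following Python sketch assumes a hypothetical helper, transcript_log_fields, and an operator-controlled opt-in flag; neither comes from the plugin.

```python
import hashlib


def transcript_log_fields(speaker_id: str, transcript: str, log_content: bool = False) -> dict:
    """Build log fields for a transcript; the text itself is included only on opt-in."""
    fields = {
        "speaker_id": speaker_id,
        "chars": len(transcript),
        # A short digest lets operators correlate events without storing the words.
        "sha256_prefix": hashlib.sha256(transcript.encode("utf-8")).hexdigest()[:12],
    }
    if log_content:
        fields["text"] = transcript
    return fields
```

The digest of a short phrase can be recovered by guessing, so it should be treated as a correlation aid and not as anonymization.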

Missing User Warnings

Medium

Confidence: 95%
Finding: This code initializes external STT/TTS providers and is designed to send live voice audio and generated speech to third-party services, but there is no user-facing notice or consent flow at the point the bot joins, listens, or processes speech. In a Discord voice-channel context, this can capture bystanders and other participants who may not realize their audio is being transmitted off-platform, creating a real privacy and compliance risk.
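
A notice-before-listening step could look roughly like the sketch below. It is a hedged illustration in Python, not the plugin's behavior: the ListeningGate class and the send callable are hypothetical, and wiring them to real Discord join and audio events is left out.

```python
class ListeningGate:
    """Tracks whether a disclosure notice was posted before audio is processed."""

    def __init__(self, notice: str):
        self.notice = notice
        self.announced = False

    def announce(self, send) -> None:
        # `send` is any callable that delivers text to the channel (hypothetical).
        send(self.notice)
        self.announced = True

    def may_process_audio(self) -> bool:
        # Audio handling code would check this before forwarding frames to STT.
        return self.announced
```

For example, a caller might construct the gate with a notice such as "Audio in this channel is sent to Deepgram for transcription.", call announce when the bot joins, and drop audio frames while may_process_audio returns False.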

Missing User Warnings

Medium

Confidence: 96%
Finding: The session state and provider setup support recording and processing user audio, and later code transcribes buffered or streaming audio via external STT services without any explicit disclosure to users in the voice channel. Because this is continuous voice interaction in a shared channel, the skill context makes the issue more serious: multiple users' speech may be captured and sent to a third party unexpectedly.

VirusTotal

64/64 vendors flagged this skill as clean.
