Local Whisper

Security checks across malware telemetry and agentic risk

Overview

This skill mostly matches a local speech-to-text purpose, but it makes strong local/private claims while including under-disclosed cloud transcription paths and a helper script with a code-injection risk from crafted filenames.

Review this carefully before installing. Use it only if you are comfortable auditing and forcing a local backend, and avoid setting OpenAI or Groq credentials in the environment used by the skill unless you intentionally want cloud transcription. Do not run the large-file helper on untrusted filenames until its path handling is fixed.
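The helper script's source is not reproduced in this report, so the following is only a sketch of the general remedy for filename-driven injection: keep the program text constant and hand the filename to the child process as a separate argv item, without a shell. The function name `file_size_via_child` and the size lookup are hypothetical stand-ins for whatever the helper actually does with large files.

```python
import subprocess
import sys


def file_size_via_child(audio_path: str) -> int:
    """Ask a child process for a file's size, passing the path as plain data."""
    # The script text is a constant; the filename is never interpolated into it.
    script = "import os, sys; print(os.path.getsize(sys.argv[1]))"
    result = subprocess.run(
        [sys.executable, "-c", script, audio_path],  # list form, no shell involved
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())
```

Because the path travels as an argument rather than as part of a command string, quotes, semicolons, or `$(...)` sequences in a filename are treated as ordinary characters.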

SkillSpector

By NVIDIA

Vulnerability Patterns

Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Supply Chain: Unpinned Dependencies, External Script Fetching, Obfuscated Code
Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
MCP Least Privilege: Underdeclared Capability, Wildcard Permission, Missing Permission Declaration
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection

Findings (17)

Lp3

Medium

Category: MCP Least Privilege
Confidence: 91%
Finding: The skill advertises installation and runtime behavior that use shell commands, local file writes, environment access, and network activity, yet it declares no permissions or equivalent capability disclosures. This creates a transparency and trust problem: users may enable a skill believing it is narrowly local/private while it can install packages, download large models, expose a daemon, and modify startup behavior.

Tp4

High

Category: MCP Tool Poisoning
Confidence: 95%
Finding: The documented purpose emphasizes private, local, Apple-Silicon-only transcription with no API costs, but the analyzed behavior indicates support for cloud transcription, generic HTTP/CLI workflows, translation, and CPU-based alternatives. This mismatch is security-relevant because users may trust the privacy claims and route sensitive voice data through the skill without realizing it may support external transmission or broader functionality than advertised.

Description-Behavior Mismatch

Medium

Confidence: 92%
Finding: The requirements include cloud transcription client libraries (openai and groq) even though the skill is described as private local speech-to-text using MLX Whisper. This creates a real privacy and trust-boundary risk because downstream code can route audio or transcripts to external services contrary to user expectations, especially in a messaging context involving Telegram and WhatsApp voice data.

Description-Behavior Mismatch

Medium

Confidence: 95%
Finding: The /transcribe endpoint accepts JSON requests containing arbitrary local file paths and will read and process any file that exists on the host. Even though the server binds to 127.0.0.1, any local process can abuse it as a file-access proxy to probe file existence, read metadata such as size, and potentially coerce downstream decoders/transcription libraries into opening sensitive or unexpected files outside the intended Telegram/WhatsApp workflow.
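One common mitigation for this kind of file-access proxy is to confine requests to a single inbox directory and a short list of audio extensions. The sketch below assumes such a directory exists; `resolve_audio_path`, `ALLOWED_SUFFIXES`, and the extension list are illustrative names and values, not taken from the skill's code.

```python
from pathlib import Path

# Hypothetical allowlist of audio extensions for voice messages.
ALLOWED_SUFFIXES = {".ogg", ".oga", ".opus", ".mp3", ".m4a", ".wav"}


def resolve_audio_path(requested: str, allowed_root: Path) -> Path:
    """Return a resolved path inside allowed_root, or raise."""
    root = allowed_root.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    candidate = (root / requested).resolve()
    if not candidate.is_relative_to(root):
        raise PermissionError("path is outside the allowed audio directory")
    if candidate.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError("not an accepted audio file type")
    if not candidate.is_file():
        raise FileNotFoundError("no such audio file")
    return candidate
```

The containment check runs before the existence check so that a caller cannot use the error type to probe for files outside the allowed directory.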

Description-Behavior Mismatch

Medium

Confidence: 93%
Finding: The script is presented as local/private transcription, but it preferentially sends requests to a configurable HTTP daemon via CLAWD_WHISPER_URL. Even though the payload contains a file path rather than raw audio bytes, this still discloses sensitive local filesystem information and may cause the daemon to access user data in ways the user did not expect, especially if the URL is changed to a non-local endpoint.
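A caller could reduce this risk by refusing any daemon URL that does not point at the loopback interface. The following is a minimal sketch of such a check; `is_loopback_url` is a hypothetical helper, and the report does not show whether the script performs any validation of `CLAWD_WHISPER_URL` today.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_url(url: str) -> bool:
    """Return True only for http(s) URLs whose host is localhost or a loopback IP."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        return False  # malformed URL, e.g. an unterminated IPv6 literal
    if parts.scheme not in ("http", "https") or host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # any other hostname is treated as non-local
```

A script could read `os.environ.get("CLAWD_WHISPER_URL")` and skip the HTTP path entirely when this check fails. Hostnames other than `localhost` are rejected outright rather than resolved, which avoids trusting DNS.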

Description-Behavior Mismatch

High

Confidence: 95%
Finding: The skill is described as local/private speech-to-text, but the implementation explicitly supports OpenAI and Groq cloud backends. In this context, that mismatch is security-relevant because users may provide sensitive voice messages expecting local-only processing, while the code can instead send audio off-device to third parties.
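One way to keep cloud credentials away from the skill is to launch it with a scrubbed environment. The sketch below removes `OPENAI_API_KEY` and `GROQ_API_KEY`, the variables the official `openai` and `groq` Python clients read by default; whether the skill reads exactly these names is an assumption, and `local_only_env` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional

# Default credential variables for the openai and groq client libraries.
# Assumption: the skill relies on these defaults.
CLOUD_CREDENTIAL_VARS = ("OPENAI_API_KEY", "GROQ_API_KEY")


def local_only_env(base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with cloud API keys removed."""
    env = dict(os.environ if base is None else base)
    for name in CLOUD_CREDENTIAL_VARS:
        env.pop(name, None)
    return env
```

The result can be passed as `env=local_only_env()` to `subprocess.run` when starting the skill. Since the requirements list `python-dotenv`, keys may also be loaded from a `.env` file, which this scrubbing does not cover.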

Context-Inappropriate Capability

High

Confidence: 97%
Finding: The code opens a local audio file and submits it to external OpenAI or Groq transcription APIs, which causes user audio to leave the local device. Given the skill's stated purpose of private local transcription for messaging apps, this creates a meaningful confidentiality risk and a strong chance of violating user expectations or policy requirements.

Intent-Code Divergence

Medium

Confidence: 90%
Finding: The module documentation frames the component as an MLX/local transcription module, but the documented support for cloud APIs contradicts that privacy-oriented positioning. This inconsistency increases the likelihood of insecure deployment and user misunderstanding about where sensitive audio data is processed.

Description-Behavior Mismatch

Medium

Confidence: 89%
Finding: The CLI and file header describe a local Whisper fallback, but the exposed `--backend` option explicitly allows remote providers such as `openai` and `groq`. In a privacy-focused messaging transcription skill, this mismatch can cause users or downstream automation to send sensitive voice data to third-party services unexpectedly, creating confidentiality and compliance risk.
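Automation that wraps the CLI could reject the remote backends before the command ever runs. This sketch checks an argv list for `--backend openai` or `--backend groq` (the two remote values named in the finding); `assert_local_backend` is a hypothetical wrapper, and the CLI's default backend when the flag is omitted is not stated in this report.

```python
REMOTE_BACKENDS = frozenset({"openai", "groq"})


def assert_local_backend(argv: list) -> None:
    """Raise ValueError if argv selects a remote transcription backend."""
    for i, arg in enumerate(argv):
        value = None
        if arg == "--backend" and i + 1 < len(argv):
            value = argv[i + 1]  # form: --backend NAME
        elif arg.startswith("--backend="):
            value = arg.split("=", 1)[1]  # form: --backend=NAME
        if value is not None and value.lower() in REMOTE_BACKENDS:
            raise ValueError(f"remote backend {value!r} is not permitted")
```

This is a denylist, so it only guards against the two providers listed here. A stricter wrapper would require an explicit local backend name once the skill's accepted values have been audited.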

Intent-Code Divergence

Medium

Confidence: 87%
Finding: The skill metadata and inline documentation emphasize private local transcription, yet the implementation supports non-local providers. This is dangerous because users may rely on the privacy claim when processing Telegram or WhatsApp audio, while the code path can route content to external services, undermining trust and potentially exposing sensitive communications.

Missing User Warnings

Medium

Confidence: 91%
Finding: The script sends the user-supplied audio file path to an HTTP endpoint without explicit notice or confirmation. This can leak sensitive path information such as usernames, project names, or directory structure, and if the daemon is remote or attacker-controlled, it may trigger unauthorized access attempts against local files referenced by the path.

Missing User Warnings

Medium

Confidence: 96%
Finding: At the call site, the code uploads the provided audio file to an external API without any explicit user-facing warning, confirmation, or disclosure. In a skill advertised for private local transcription of Telegram and WhatsApp audio, silent transmission of message audio to third parties is particularly risky because it exposes potentially sensitive communications.
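A disclosure step of the kind this finding asks for could be as small as an explicit prompt that defaults to "no". The sketch below is illustrative only; `confirm_upload` and its `ask` parameter are hypothetical and do not exist in the skill.

```python
from typing import Callable


def confirm_upload(path: str, provider: str,
                   ask: Callable[[str], str] = input) -> bool:
    """Return True only if the user explicitly agrees to the upload."""
    answer = ask(f"Send {path!r} to {provider} for transcription? [y/N] ")
    # Anything other than an explicit yes, including empty input, means no.
    return answer.strip().lower() in ("y", "yes")
```

The `ask` parameter lets non-interactive callers supply their own policy, for example a function that always returns `"n"` when running unattended.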

Unpinned Dependencies

Low

Category: Supply Chain
Content:
    # Local Whisper - Speech to text
    # Core
    python-dotenv>=1.0.0
    # OpenAI Whisper API
    openai>=1.12.0
Confidence: 80%
Finding: python-dotenv>=1.0.0

Unpinned Dependencies

Low

Category: Supply Chain
Content:
    python-dotenv>=1.0.0
    # OpenAI Whisper API
    openai>=1.12.0
    # Groq API (optional - fast & cheap cloud)
    groq>=0.4.0
Confidence: 82%
Finding: openai>=1.12.0

Unpinned Dependencies

Low

Category: Supply Chain
Content:
    openai>=1.12.0
    # Groq API (optional - fast & cheap cloud)
    groq>=0.4.0
    # Local faster-whisper (optional - CPU-based)
    faster-whisper>=1.0.0
Confidence: 82%
Finding: groq>=0.4.0

Unpinned Dependencies

Low

Category: Supply Chain
Content:
    groq>=0.4.0
    # Local faster-whisper (optional - CPU-based)
    faster-whisper>=1.0.0
    # MLX Lightning Whisper (Apple Silicon - fastest local option)
    # Only works on macOS with M1/M2/M3/M4
Confidence: 78%
Finding: faster-whisper>=1.0.0
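The four findings above all flag `>=` specifiers, which set only a lower bound and let any newer release be installed. As a rough illustration, the sketch below lists requirement lines that are not pinned with an exact `==` version; `unpinned_requirements` is a hypothetical helper and handles only simple `name==version` lines, not the full requirements-file grammar.

```python
import re

# A line counts as pinned only if it is "name==exact.version" with no wildcard,
# optionally followed by an environment marker after ";".
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]*\])?\s*==\s*[^\s;*]+\s*(;.*)?$")


def unpinned_requirements(text: str) -> list:
    """Return requirement lines that lack an exact == pin."""
    found = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            found.append(line)
    return found
```

Exact pins narrow the supply-chain window; adding hashes (pip's `--require-hashes` mode) goes further by also fixing the artifact contents.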

Known Vulnerable Dependency: python-dotenv (1 advisory): CVE-2026-28684 (python-dotenv: Symlink following in set_key allows arbitrary file overwrite via )

Low

Category: Supply Chain
Confidence: 87%
Finding: python-dotenv

VirusTotal

64/64 vendors flagged this skill as clean.
