Audio Transcribe

Security checks across malware telemetry and agentic risk

Overview

This appears to be a legitimate audio transcription skill, but it runs sensitive speaker-gender inference by default and can send transcript/context data to cloud LLM providers when LLM features are used.

Install only if you are comfortable with local processing of recordings and with optional cloud LLM processing when enabled. For private, regulated, or internal meetings, omit --model, avoid the LLM speaker-verification helper unless approved, keep reference and speaker-context files minimal, and use --no-detect-gender unless speaker gender labels are explicitly needed.

SkillSpector

By NVIDIA

Vulnerability Patterns

Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
MCP Least Privilege: Underdeclared Capability, Wildcard Permission, Missing Permission Declaration
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands

Findings (12)

Lp3

Medium

Category: MCP Least Privilege
Confidence: 94%
Finding: The skill advertises significant capabilities including shell execution, file read/write, environment access, and optional network use, but it does not declare permissions explicitly. This weakens review and sandboxing because operators may approve a seemingly simple transcription skill without realizing it can install packages, modify local files, and send transcript data to external providers. In this context the risk is elevated because the skill handles sensitive meeting/podcast audio and references credentials/environment configuration.

Tp4

High

Category: MCP Tool Poisoning
Confidence: 91%
Finding: The documented purpose undersells materially sensitive behavior: gender inference from audio/reference text, LLM-based identity remapping, transcript rewriting, dependency installation, and patching third-party source code. This is dangerous because users and reviewers may consent to transcription but not to biometric/identity inference, content mutation, or code modification on the host, leading to privacy, integrity, and supply-chain risk. The skill context makes this more serious because meeting recordings and attendee metadata are often confidential and may contain personal data.

Context-Inappropriate Capability

Medium

Confidence: 84%
Finding: This script rewrites third-party code inside site-packages, which creates a software supply-chain integrity risk: it mutates installed dependencies outside normal package-management flows and can silently alter future behavior of all consumers of that library. In this skill context, the purpose is performance optimization for transcription, but directly editing dependency files is still riskier than shipping a vendored patch or maintained fork because it bypasses normal update, verification, and rollback mechanisms.
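One way an operator might check whether installed dependencies were edited in place is to compare each file against the hash recorded at install time. The sketch below is illustrative only and is not part of the scanned skill. It assumes the package was installed from a wheel, so that its RECORD metadata carries per-file hashes, and the helper names `record_digest` and `modified_files` are hypothetical.

```python
import base64
import hashlib
from importlib import metadata


def record_digest(data: bytes, mode: str = "sha256") -> str:
    """Encode a digest the way wheel RECORD files do: urlsafe base64, no padding."""
    digest = hashlib.new(mode, data).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def modified_files(dist_name: str) -> list[str]:
    """Return installed files whose contents no longer match the recorded hash."""
    changed = []
    for entry in metadata.files(dist_name) or []:
        if entry.hash is None:
            # Some entries (RECORD itself, for example) carry no hash; skip them.
            continue
        try:
            with open(entry.locate(), "rb") as handle:
                data = handle.read()
        except OSError:
            # A missing or unreadable file is reported as changed.
            changed.append(str(entry))
            continue
        if record_digest(data, entry.hash.mode) != entry.hash.value:
            changed.append(str(entry))
    return changed
```

Calling `modified_files("some-package")` with the name of an installed distribution lists files that differ from what the installer recorded. An empty list does not prove integrity, since a patcher could also rewrite RECORD, so this is a detection aid rather than a guarantee.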

Description-Behavior Mismatch

Medium

Confidence: 96%
Finding: This file adds speaker gender inference and gender extraction from reference text, which goes beyond the stated transcription and diarization purpose of the skill. Inferring demographic attributes from audio or notes introduces privacy and compliance risk because users may not expect or consent to sensitive-attribute processing during a transcription workflow.

Context-Inappropriate Capability

High

Confidence: 99%
Finding: The code actively classifies speaker gender from audio segments and merges that with textual gender hints, creating a sensitive demographic profile per speaker ID. This is dangerous because gender inference is not necessary for transcription, may be inaccurate or discriminatory, and can expose the system to privacy, fairness, and misuse risks if outputs are stored, displayed, or acted upon.
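A common mitigation for this class of finding is to make sensitive inference opt-in rather than opt-out. The following is a minimal sketch of that idea using `argparse`, not the skill's actual CLI. The report only confirms that `--no-detect-gender` exists; the program name `transcribe`, the positional `audio` argument, and the `--detect-gender` flag are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical program name and arguments; only --no-detect-gender
    # is mentioned in the report.
    parser = argparse.ArgumentParser(prog="transcribe")
    parser.add_argument("audio", help="path to the recording")
    parser.add_argument(
        "--detect-gender",
        # BooleanOptionalAction (Python 3.9+) also generates --no-detect-gender.
        action=argparse.BooleanOptionalAction,
        default=False,  # inference stays off unless explicitly requested
        help="infer speaker gender labels (sensitive; off by default)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("gender inference enabled:", args.detect_gender)
```

With this layout, existing invocations that pass `--no-detect-gender` keep working, while a plain invocation no longer produces gender labels.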

Context-Inappropriate Capability

Medium

Confidence: 95%
Finding: These tests explicitly exercise speaker gender inference, extraction, merging, CLI parsing, and markdown/prompt propagation. In a transcription skill, inferring and surfacing a sensitive attribute like gender expands the skill beyond core speech-to-text functionality and normalizes collection and disclosure of personal traits without any visible necessity or consent boundary in this file.

Context-Inappropriate Capability

Medium

Confidence: 91%
Finding: The script performs speaker gender inference and then feeds those labels into LLM prompts as authoritative guidance for pronoun correction, even though gender classification is not necessary to provide transcription. This creates avoidable sensitive-attribute processing and can misgender people, which is a privacy and safety issue made more significant because the inferred data is persisted into outputs and external prompt context.

Missing User Warnings

Medium

Confidence: 93%
Finding: The documentation explicitly encourages feeding attendee names, roles, agendas, and speaker-context into the LLM cleanup stage, but it does not clearly warn that this may transmit sensitive meeting metadata and transcript content to an external model provider. In a transcription skill for meetings and podcasts, that omission materially increases privacy and confidentiality risk because users may unknowingly send internal business or personal data off-host.

Missing User Warnings

Medium

Confidence: 82%
Finding: The code sends system prompts and user content to external LLM providers (Bedrock, Anthropic, OpenAI-compatible APIs) without any built-in consent, disclosure, redaction, or provider restriction in this layer. In a transcription skill, prompts may contain sensitive meeting or podcast text, so silent transmission to third-party services can create privacy, compliance, and data-handling risks, especially when the OpenAI-compatible path may target arbitrary endpoints.
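The provider-restriction gap noted here could be narrowed with a host allowlist checked before any request is made. The sketch below shows one possible shape for such a check. It is an assumption-laden illustration, not code from the skill: the function name `is_allowed_endpoint` and the host `llm.example.com` are hypothetical placeholders.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; an operator would list the approved provider hosts.
ALLOWED_HOSTS = frozenset({"llm.example.com"})


def is_allowed_endpoint(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Accept only HTTPS URLs whose host exactly matches an approved entry."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # Malformed URLs (for example, a broken IPv6 literal) are rejected.
        return False
    host = (parts.hostname or "").lower()
    return parts.scheme == "https" and host in allowed_hosts
```

Exact host matching is deliberate: a suffix or substring test would let a lookalike such as `llm.example.com.attacker.example` through.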

Missing User Warnings

Medium

Confidence: 94%
Finding: The module processes and derives sensitive personal data by inferring gender from voice and extracting gender from reference text, yet there is no visible user-facing warning, consent flow, or transparency mechanism in this component. In the context of a transcription skill, this makes the behavior more dangerous because users reasonably expect speech-to-text and diarization, not hidden demographic inference.

Natural-Language Policy Violations

Medium

Confidence: 93%
Finding: The tests codify behavior that accepts gender labels from automatic inference, reference text, and CLI input, then merges and renders them in output. That creates a pathway for processing and disclosing sensitive attributes in what is effectively default behavior, which can cause privacy harm, misgendering, and inappropriate enrichment of transcripts unrelated to the user's transcription request.

Missing User Warnings

Medium

Confidence: 95%
Finding: Transcript chunks, speaker names, context, and reference materials are sent to external LLM providers during cleanup and speaker verification without an explicit privacy warning or consent gate at the point of use. Because meeting and podcast transcripts can contain confidential or personal information, silent transmission to third parties materially increases data exposure risk.
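A consent gate at the point of use can be as small as a wrapper that refuses to call the provider unless the caller has explicitly opted in. The sketch below illustrates the pattern under stated assumptions: `send_with_consent`, `ConsentRequired`, and the `send` callable are hypothetical names, and how consent is collected (flag, prompt, or configuration) is left to the integrator.

```python
from typing import Callable


class ConsentRequired(RuntimeError):
    """Raised when data would leave the host without explicit consent."""


def send_with_consent(
    payload: str,
    send: Callable[[str], str],
    *,
    consent_given: bool,
    provider: str,
) -> str:
    """Forward payload to the provider only if consent was explicitly given."""
    if not consent_given:
        raise ConsentRequired(
            f"Refusing to send {len(payload)} characters to {provider} "
            "without explicit consent"
        )
    return send(payload)
```

Because the check sits directly in front of the network call, cleanup and speaker-verification paths would share a single gate instead of each needing its own warning.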

VirusTotal

66/66 vendors flagged this skill as clean.
