Security audit

Douyin Video Transcribe

Security checks across malware telemetry and agentic risk

Overview

The skill's stated transcription purpose is plausible, but it can automatically start or create a persistent Docker Whisper service, and it may send audio through configured cloud transcription paths without clear per-run consent.

Install only if you are comfortable with the agent downloading video/audio, writing local media and transcript files, running ffmpeg/ffprobe, and starting Docker containers. Prefer pre-provisioning and pinning the Whisper container yourself, avoid configuring cloud ASR keys unless you intend audio to leave the machine, and remove or stop the whisper-asr container when finished.

SkillSpector

By NVIDIA
Vulnerability Patterns
  • Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
  • Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
  • Behavioral AST: exec() Call, eval() Call, Dynamic Import
  • MCP Least Privilege: Underdeclared Capability, Wildcard Permission, Missing Permission Declaration
  • MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
Findings (12)

subprocess module call

Medium
Category
Dangerous Code Execution
Content
self.DOCKER_IMAGE
        ]
        
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
        if result.returncode == 0:
            return True
        else:
Confidence
92% confidence
Finding
result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)

subprocess module call

Medium
Category
Dangerous Code Execution
Content
if container_status in ("exited", "created"):
            # Container exists but is not running; start it
            print(f"🔄 启动已有容器 {self.CONTAINER_NAME}...")  # "Starting existing container ..."
            result = subprocess.run(
                ["docker", "start", self.CONTAINER_NAME],
                capture_output=True, text=True, timeout=30
            )
Confidence
90% confidence
Finding
result = subprocess.run( ["docker", "start", self.CONTAINER_NAME], capture_output=True, text=True, timeout=30 )

Lp3

Medium
Category
MCP Least Privilege
Confidence
89% confidence
Finding
The skill instructs use of network access, shell commands, and file read/write behavior but does not declare any corresponding permissions or capability boundaries. That mismatch weakens user visibility and enforcement, making it easier for the skill to perform downloads, create files, and invoke local tools without clear consent or policy review.

Tp4

High
Category
MCP Tool Poisoning
Confidence
82% confidence
Finding
A description-behavior mismatch is security-relevant because users and reviewers may approve the skill for a narrow transcription purpose while it actually performs broader or different actions, such as container management or external service use. That breaks trust assumptions and can lead to unintended data exposure, broader execution privileges, and unsafe deployment decisions.

Description-Behavior Mismatch

Medium
Confidence
78% confidence
Finding
The documentation expands scope from Douyin-only processing to other platforms via yt-dlp, which introduces additional network retrieval and media-handling behavior outside the declared purpose. Scope drift matters because it can bypass user expectations, expand attack surface, and trigger legal or operational risks associated with broader scraping/downloading tools.

Context-Inappropriate Capability

Medium
Confidence
82% confidence
Finding
The skill can invoke Docker, ffmpeg, and ffprobe on the host, which expands it from pure transcription logic into host-tool orchestration. In an agent setting, that broader execution capability increases attack surface and may violate least-privilege expectations if users or operators do not realize the skill can start or rely on containerized services and parse attacker-supplied media.

Description-Behavior Mismatch

Medium
Confidence
97% confidence
Finding
The helper does more than access a local ASR endpoint: it provisions the service by starting or creating Docker containers automatically. In skill context, that materially broadens privilege and trust boundaries, allowing a transcription request to trigger software deployment and execution on the host.

Context-Inappropriate Capability

Medium
Confidence
96% confidence
Finding
Executing Docker CLI commands to inspect, start, and create containers gives the skill host-management capability that exceeds its stated purpose of transcription. In an agent setting, this is dangerous because a seemingly harmless media-processing request can alter host state and run external software, increasing risk of misuse or unexpected persistence.
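
One way to narrow this capability is to keep the status check read-only and start the container only after an explicit confirmation. The following is a minimal sketch under that assumption, not the skill's actual code: the function names `container_status` and `ensure_running`, the `confirm` callback, and the injectable `run` parameter are hypothetical. It uses only the `docker inspect` and `docker start` commands.

```python
import subprocess
from typing import Callable, Optional


def container_status(name: str, run: Callable = subprocess.run) -> Optional[str]:
    """Return the container state (e.g. "running", "exited"), or None if unknown.

    This only inspects; it never changes host state.
    """
    cmd = ["docker", "inspect", "--format", "{{.State.Status}}", name]
    try:
        result = run(cmd, capture_output=True, text=True, timeout=30)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # Docker is not installed or did not answer in time.
        return None
    if result.returncode != 0:
        # Typically means the container does not exist.
        return None
    return result.stdout.strip() or None


def ensure_running(
    name: str,
    confirm: Callable[[str], bool],
    run: Callable = subprocess.run,
) -> bool:
    """Start an existing stopped container only if `confirm` approves.

    Hypothetical policy: never create a new container here; a missing
    container is reported as "not running" and left for the user to provision.
    """
    status = container_status(name, run)
    if status == "running":
        return True
    if status in ("exited", "created") and confirm(f"Start existing container {name}?"):
        result = run(["docker", "start", name], capture_output=True, text=True, timeout=30)
        return result.returncode == 0
    return False
```

In an interactive setting, `confirm` could wrap a yes/no prompt; in an agent setting, it could be wired to whatever approval mechanism the host provides.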

Missing User Warnings

Medium
Confidence
91% confidence
Finding
The skill tells the agent to download remote media and save outputs locally, but it does not warn users about file creation, storage location, or overwrite behavior. This can lead to unanticipated disk writes, accidental overwrites, or persistence of sensitive media/transcripts on the host system.
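
The overwrite concern can be illustrated with a small sketch that refuses to clobber existing files and reports where the output landed. This assumes transcripts are written as UTF-8 text files; the function name `write_transcript_no_overwrite` and the `-1`, `-2` suffix scheme are hypothetical, not taken from the skill.

```python
from pathlib import Path


def write_transcript_no_overwrite(out_dir: Path, stem: str, text: str) -> Path:
    """Write `text` to out_dir/<stem>.txt without overwriting existing files.

    If the target exists, fall back to <stem>-1.txt, <stem>-2.txt, and so on.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    candidate = out_dir / f"{stem}.txt"
    counter = 1
    while True:
        try:
            # Mode "x" fails with FileExistsError instead of truncating.
            with open(candidate, "x", encoding="utf-8") as fh:
                fh.write(text)
        except FileExistsError:
            candidate = out_dir / f"{stem}-{counter}.txt"
            counter += 1
            continue
        # Tell the user exactly where the file was stored.
        print(f"Wrote transcript to {candidate.resolve()}")
        return candidate
```

Using exclusive-create mode avoids the race between checking for a file and opening it, which a separate `exists()` check would leave open.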

Missing User Warnings

Medium
Confidence
94% confidence
Finding
The skill sends audio to an HTTP ASR endpoint without a privacy warning, which is risky because spoken content may contain sensitive personal or business information. Even when the endpoint is localhost, users should be told that audio leaves the transcription process itself and crosses a service boundary for processing.
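
A caller could at least distinguish loopback endpoints from remote ones before sending audio, and warn accordingly. The sketch below uses only the standard library; the function name `is_loopback_endpoint` and the example URLs in the usage note are hypothetical and do not describe the skill's real configuration.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_endpoint(url: str) -> bool:
    """Return True if the URL's host is `localhost` or a loopback IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, so it is some other hostname: treat as remote.
        return False
```

For example, a wrapper might call `is_loopback_endpoint("http://localhost:9000/asr")` and print a notice that audio is about to leave the machine whenever the result is `False`.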

Missing User Warnings

Medium
Confidence
94% confidence
Finding
These code paths send local audio content to third-party cloud transcription APIs whenever corresponding API keys are configured, but there is no explicit consent or warning at the call site. Because the input may contain sensitive voice data, uploading it off-device without a clear user notice creates a real privacy and data-governance risk.

Missing User Warnings

High
Confidence
99% confidence
Finding
The fallback logic automatically tries cloud providers after local transcription methods fail, which can silently move user audio off the local system. In a transcription skill handling potentially sensitive media, this context makes the issue more dangerous because a user asking for local transcription may unknowingly have their content uploaded to external services.
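
A consent-gated fallback chain is one way to keep audio local by default. The following is a sketch under stated assumptions, not the skill's implementation: the `Backend` tuple shape, the `allow_cloud` flag, and the function name `transcribe_with_consent` are all hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical shape: (backend name, whether it uploads audio, transcribe function).
Backend = Tuple[str, bool, Callable[[str], str]]


def transcribe_with_consent(
    audio_path: str,
    backends: Sequence[Backend],
    allow_cloud: bool = False,
) -> Optional[str]:
    """Try backends in order; skip cloud backends unless allow_cloud is True."""
    for name, is_cloud, transcribe in backends:
        if is_cloud and not allow_cloud:
            # Never upload audio without explicit per-run consent.
            print(f"Skipping cloud backend {name!r}: no per-run consent given")
            continue
        try:
            return transcribe(audio_path)
        except Exception as exc:
            # A failed backend falls through to the next permitted one.
            print(f"Backend {name!r} failed: {exc}")
    return None
```

With `allow_cloud` left at its default, a failure of every local backend yields `None` rather than a silent upload, so the caller can ask the user before retrying with cloud providers enabled.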

VirusTotal

63/63 vendors flagged this skill as clean.