Step Audiobook

Security checks across malware telemetry and agentic risk

Overview

This is a disclosed Step-based audiobook workflow, but it can send manuscripts and voice samples to external APIs and store generated artifacts locally.

Install only if you are comfortable sending story text, role/casting metadata, TTS text, and reference voice audio to Step or any LLM endpoint you configure. Use a limited or test STEP_API_KEY, keep generated voice/audio artifacts in a private workspace, review voice-library.yaml and clone-review.yaml before cloning, and run clone_selected_voices.py --dry-run before any paid clone.

SkillSpector

By NVIDIA

Vulnerability Patterns

Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Trigger Abuse: Overly Broad Trigger, Shadow Command Trigger, Keyword Baiting Trigger
MCP Least Privilege: Underdeclared Capability, Wildcard Permission, Missing Permission Declaration
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands

Findings (21)

Lp3

Medium

Category: MCP Least Privilege
Confidence: 92%
Finding: The skill documentation describes capabilities to read/write local files, access environment variables, invoke shell tools, and send data to external APIs, but it does not declare permissions accordingly. This creates a transparency and consent problem: callers may invoke the skill without understanding that local content and secrets such as STEP_API_KEY can be accessed and that files may be modified or transmitted off-host.

Description-Behavior Mismatch

Medium

Confidence: 95%
Finding: The file hard-codes a remote LLM endpoint and API-key-based provider configuration even though the skill is described as local. In this context, book text, character profiles, and casting inputs may be transmitted off-device to a third-party service, creating a privacy and deployment-boundary mismatch that users may not expect.

Description-Behavior Mismatch

Medium

Confidence: 92%
Finding: In refresh-only mode, the script reads prior clone-review decisions and can persist them back into the shared voice-library.yaml via sync_clone_review_decisions_to_library() and save_yaml(). That couples a recommendation/reporting workflow with mutation of long-lived library state, so stale, mistaken, or maliciously edited review files can silently change future cloning behavior.

Description-Behavior Mismatch

Medium

Confidence: 94%
Finding: During a full run, the script again synchronizes decisions from clone-review.yaml into the persistent library and saves the modified library if changes occurred. Because clone-review content is external input, this creates an unintended state-changing side effect in a tool whose primary purpose is recommendation, enabling tampering or accidental policy drift across subsequent runs.

Missing User Warnings

Medium

Confidence: 93%
Finding: The document states that refreshing will write `confirm_clone / skip` decisions back into `voice-library.yaml`, which is persistent state, but it does not prominently warn the operator that this action mutates shared library data. In a skill that manages a reusable voice library, silent write-back can cause unintended durable configuration changes, especially if users think they are only regenerating derived files for a single run.
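
One way to make this write-back visible is to compute the pending changes first and require an explicit opt-in before mutating the library. The following is a minimal sketch under stated assumptions: the function names `plan_decision_sync` and `apply_decision_sync`, the `decision` key, and the dictionary shapes are hypothetical and are not taken from the skill's code. Only the `confirm_clone` and `skip` values come from the documentation quoted above.

```python
from typing import Dict, List, Optional, Tuple

# Decision values named in the skill documentation.
ALLOWED_DECISIONS = {"confirm_clone", "skip"}

Change = Tuple[str, Optional[str], str]


def plan_decision_sync(library: Dict[str, dict], review: Dict[str, str]) -> List[Change]:
    """Return (voice_id, old, new) for each decision that would change.

    This function never mutates `library`; it only reports a preview.
    """
    changes: List[Change] = []
    for voice_id, new in sorted(review.items()):
        if new not in ALLOWED_DECISIONS:
            raise ValueError(f"unexpected decision {new!r} for {voice_id!r}")
        if voice_id not in library:
            continue  # ignore review entries that have no library record
        old = library[voice_id].get("decision")
        if old != new:
            changes.append((voice_id, old, new))
    return changes


def apply_decision_sync(library: Dict[str, dict], changes: List[Change],
                        allow_write: bool = False) -> Dict[str, dict]:
    """Apply previewed changes only when the operator explicitly opted in."""
    if changes and not allow_write:
        raise PermissionError(
            f"{len(changes)} decision(s) would be written to the shared library; "
            "re-run with an explicit opt-in to persist them"
        )
    for voice_id, _old, new in changes:
        library[voice_id]["decision"] = new
    return library
```

Separating the preview from the write lets a refresh-only run print the pending changes without touching persistent state.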

Missing User Warnings

Medium

Confidence: 94%
Finding: The documentation states that raw story text is sent to an external LLM endpoint by default, but it does not warn users that their content leaves the local environment and may be processed by a third party. In an audiobook workflow, source text may contain unpublished manuscripts, licensed material, or personal data, so silent default transmission creates a real confidentiality and compliance risk.

Missing User Warnings

Medium

Confidence: 92%
Finding: The document states that user reference audio is sent to an external model/API for analysis, but it does not include an explicit warning that biometric voice data and potentially spoken content may leave the local environment. Because voice samples are sensitive personal data and may include identifying or confidential information, omission of disclosure and consent guidance creates a real privacy and compliance risk.

Missing User Warnings

Medium

Confidence: 77%
Finding: The documentation explicitly instructs users to place reference audio into managed folders and describes automatic moving, creation, and archival of raw audio and derived analysis files, but it does not warn that these files are retained long-term and may contain sensitive biometric voice data. In this skill context, that omission matters because operators may assume temporary processing, while the workflow persists both original and normalized audio plus analysis artifacts across multiple locations.

Missing User Warnings

High

Confidence: 91%
Finding: The document states that user reference audio is sent to an external model (`step-audio-r1.1`) for analysis but does not prominently disclose that this may transmit sensitive voice biometrics and speech content to a third-party service. In an audiobook voice-cloning workflow, this is especially risky because voice samples are highly sensitive personal data and users may not realize they are leaving the local environment.

Missing User Warnings

Medium

Confidence: 84%
Finding: The documentation instructs the user to execute a paid voice-cloning action via `--confirm-paid-action` but does not clearly warn that this triggers an external service call with financial consequences. In an agent skill context, omission of an explicit cost/consent warning can cause unintended billable actions or cloning of voices without sufficiently deliberate user approval.
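
A deliberate approval step could treat the two documented flags, `--dry-run` and `--confirm-paid-action`, as a three-way decision in which the billable path is never the default. The sketch below illustrates that idea only; it does not describe how `clone_selected_voices.py` actually parses its arguments, and the helper names `build_parser` and `decide_mode` are hypothetical.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Sketch of a paid-action gate (not the skill's real CLI)."
    )
    parser.add_argument("--dry-run", action="store_true",
                        help="show what would be cloned without calling the API")
    parser.add_argument("--confirm-paid-action", action="store_true",
                        help="acknowledge that cloning is a billable external call")
    return parser


def decide_mode(argv: List[str]) -> str:
    """Return 'dry-run', 'refuse', or 'paid' for the given arguments."""
    args = build_parser().parse_args(argv)
    if args.dry_run:
        # A dry run always wins, even if the confirmation flag is also present.
        return "dry-run"
    if not args.confirm_paid_action:
        # No explicit acknowledgement means no billable request is sent.
        return "refuse"
    return "paid"
```

In this sketch, a run without flags returns `refuse`, so an agent that invokes the script without arguments cannot trigger a charge by accident.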

Missing User Warnings

Medium

Confidence: 95%
Finding: The documentation states that story text, role analysis, and related content are sent to an external Step API endpoint, but it does not clearly warn users that potentially sensitive manuscript content or voice-related data may leave the local environment. In a skill handling private stories and reference audio, omission of a clear privacy/transmission notice can cause users to disclose sensitive data unintentionally.

Missing User Warnings

Medium

Confidence: 93%
Finding: The workflow documents a paid voice-cloning flow, including a confirmation flag for paid action, but it does not prominently warn about financial cost, consent requirements, or the sensitivity of cloning a person's voice. Users could trigger a billable and potentially privacy-invasive action without fully understanding the implications.

Missing User Warnings

Medium

Confidence: 92%
Finding: The script sends raw story content and later a serialized provisional script to an external LLM via call_openai_compatible_json, but the file itself provides no consent gate, warning, or restriction on provider destination. In this skill context, stories may contain unpublished manuscripts, personal data, or proprietary text, so silent transmission to a configurable remote base_url creates a real confidentiality risk.
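
A restriction on provider destination could be as small as a host allowlist checked before any request is built. The sketch below assumes the allowlist contains only `api.stepfun.com`, the default endpoint host named elsewhere in this report. The function `check_destination` is hypothetical and is not part of the skill.

```python
from urllib.parse import urlsplit

# Assumption: only the default Step host is trusted unless the operator adds more.
ALLOWED_HOSTS = frozenset({"api.stepfun.com"})


def check_destination(base_url: str, allowed_hosts=ALLOWED_HOSTS) -> str:
    """Return the host of `base_url` if it is HTTPS and allowlisted, else raise."""
    parts = urlsplit(base_url)
    if parts.scheme != "https":
        raise ValueError(f"refusing non-HTTPS endpoint: {base_url!r}")
    host = (parts.hostname or "").lower()
    if host not in allowed_hosts:
        raise ValueError(f"endpoint host {host!r} is not in the allowlist")
    return host
```

Calling this check before `call_openai_compatible_json` would make a changed `base_url` fail loudly instead of silently redirecting manuscript text.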

Missing User Warnings

Medium

Confidence: 92%
Finding: The HTTP error path reads the full remote response body and embeds it directly into the raised RuntimeError. If the upstream LLM/proxy includes sensitive request echoes, internal diagnostics, credentials, stack traces, or tenant data in error bodies, that data can be propagated into logs, UIs, or higher-level exception handlers without redaction. In a skill that sends prompts to external LLM services, this increases the chance of accidental sensitive-data disclosure.
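
A common mitigation is to truncate and redact the response body before it reaches an exception message. The following is a minimal sketch, not the skill's code: the names `summarize_error_body` and `raise_http_error`, the 200-character limit, and the secret pattern are all assumptions. The pattern covers only two obvious token shapes and does not replace proper log hygiene.

```python
import re

MAX_ERROR_BODY_CHARS = 200  # assumed limit; tune for your logging setup

# Matches "Bearer <token>" and "api_key ... <token>" style fragments.
_SECRET_PATTERN = re.compile(r"(?i)(bearer\s+|api[_-]?key\W+)([A-Za-z0-9._-]{8,})")


def summarize_error_body(body: bytes, limit: int = MAX_ERROR_BODY_CHARS) -> str:
    """Decode, redact obvious secrets, and truncate a remote error body."""
    text = body.decode("utf-8", errors="replace")
    text = _SECRET_PATTERN.sub(lambda m: m.group(1) + "[REDACTED]", text)
    if len(text) > limit:
        omitted = len(text) - limit
        text = text[:limit] + f"... [{omitted} more chars omitted]"
    return text


def raise_http_error(status: int, body: bytes) -> None:
    """Raise a RuntimeError that carries only the summarized body."""
    raise RuntimeError(
        f"LLM request failed with HTTP {status}: {summarize_error_body(body)}"
    )
```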

Missing User Warnings

Medium

Confidence: 93%
Finding: These functions build prompts that embed raw user-provided text, provisional scripts, and candidate data for LLM processing, which are likely sent to the configured external API. Without explicit disclosure, consent, or data-minimization controls, sensitive manuscript content or personal data could be unintentionally transmitted to a third party.

Missing User Warnings

Medium

Confidence: 90%
Finding: The script serializes source text or structured story content, including role evidence and sample dialogue, and sends it to an external OpenAI-compatible service. If the input contains unpublished manuscripts, personal data, or sensitive prompts, this can leak confidential content to third-party infrastructure without any consent gate or minimization visible in this file.

Missing User Warnings

Medium

Confidence: 88%
Finding: The selection stage sends role profiles plus candidate metadata to an external LLM, potentially exposing internal voice-library data, annotations, and operational status fields. In the audiobook skill context, these artifacts may include proprietary voice descriptions, workflow notes, and asset identifiers, so external transmission expands the confidentiality boundary and can disclose more than users expect.
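
Data minimization for this stage could mean projecting each candidate onto an explicit set of shareable fields before the prompt is built. In the sketch below, every field name (`voice_id`, `gender`, `age_range`, `style`, and the excluded `internal_notes`, `status`, `asset_path`) is a hypothetical placeholder, since the report does not list the real voice-library schema.

```python
from typing import Dict, Iterable, List

# Assumption: only these descriptive fields are needed for casting inference.
SHAREABLE_FIELDS = ("voice_id", "gender", "age_range", "style")


def minimize_candidate(candidate: Dict[str, object],
                       fields: Iterable[str] = SHAREABLE_FIELDS) -> Dict[str, object]:
    """Keep only allowlisted keys; anything not listed stays local."""
    return {key: candidate[key] for key in fields if key in candidate}


def minimize_candidates(candidates: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Apply the same projection to every candidate in the list."""
    return [minimize_candidate(c) for c in candidates]
```

An allowlist is chosen over a denylist here so that fields added to the library later are withheld by default.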

Missing User Warnings

Medium

Confidence: 97%
Finding: The script base64-encodes local voice audio and sends it to the external Step API for analysis, but there is no user-facing consent, warning, or policy check in this code path. Because voice recordings are biometric and often sensitive, silently transmitting them to a third party can create privacy, compliance, and data-governance exposure.
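
A consent check can sit directly in front of the encoding step, so that no payload exists unless the operator has agreed to the upload. This is a sketch under assumptions: `encode_voice_sample` and its `consent_given` parameter are hypothetical, and how consent is collected (flag, prompt, or config) is left open.

```python
import base64


def encode_voice_sample(audio_bytes: bytes, consent_given: bool) -> str:
    """Base64-encode voice audio for upload, but only after explicit consent."""
    if not consent_given:
        raise PermissionError(
            "voice audio is sensitive biometric data; explicit consent is "
            "required before it is prepared for an external API"
        )
    return base64.b64encode(audio_bytes).decode("ascii")
```

For example, `encode_voice_sample(b"RIFF", consent_given=True)` returns `"UklGRg=="`, while the same call with `consent_given=False` raises before any encoding happens.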

Missing User Warnings

Medium

Confidence: 86%
Finding: The code sends segment input_text and instruction content to an external TTS service, which may include sensitive manuscript or user data, without any visible consent gate, disclosure, redaction, or classification checks in this workflow. In an audiobook skill, such transmission is expected functionally, but the absence of safeguards increases privacy and data-handling risk if confidential text is processed unintentionally.

External Transmission

Medium

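Category: Data Exfiltration
Content:
- Step-specific capabilities continue to use Step's native interfaces, for example `step-audio-r1.1`, the official voice interface, the voice-cloning interface, and `stepaudio-2.5-tts`
- Parts that need long-text understanding, role analysis, and casting inference use `step-3.5` by default
- The current default `step-3.5` call goes through Step's `step_plan` reasoning interface, with the default endpoint `https://api.stepfun.com/step_plan/v1`
- The LLM understanding layer above is implemented through a configurable compatibility layer where possible, so that it can later be swapped for other compatible LLMs as needed
- Intermediate artifacts from each stage are written to disk by default, which makes manual review and editing easier and lets the program resume stage by stage
Confidence: 94%
Finding: https://api.stepfun.com/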

External Transmission

Medium

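Category: Data Exfiltration
Content: The point that needs emphasis here is that not every capability of the skill can be swapped for another model.
- Replaceable by default: LLM understanding steps such as long-text structuring, role extraction, and casting inference
- These steps currently default to `step-3.5` and go through `https://api.stepfun.com/step_plan/v1` by default
- Not replaced by default: `step-audio-r1.1` audio analysis, fetching Step official voices, Step voice cloning, and `stepaudio-2.5-tts` synthesis
- In other words, `audiobook` is currently a combined architecture of "Step audio capabilities + a replaceable LLM reasoning layer", rather than one that abstracts every capability into a fully replaceable generic provider
Confidence: 93%
Finding: https://api.stepfun.com/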

VirusTotal

67/67 vendors flagged this skill as clean.
