Skill v3.3.3

ClawScan security

local-voice-reply · ClawHub's context-aware review of the artifact, metadata, and declared behavior.

Scanner verdict

Benign · Mar 15, 2026, 8:11 PM

Verdict: benign
Confidence: high
Model: gpt-5-mini
Summary: The skill implements a local TTS/Opus reply server whose code and runtime requirements mostly match its description; it requests no credentials and runs locally, though it persists uploaded voices and may download model assets on first run.
Guidance: This skill appears to be what it claims: a local TTS server that produces Opus/Ogg output. Before installing, be aware of the following:
(1) ffmpeg must be on PATH, and the heavyweight Python dependencies (torch, torchaudio, chatterbox-tts) must be installed. Initial startup may download large model files and use significant disk and GPU/CPU resources.
(2) Uploaded voice samples and generated audio are persisted locally under the skill's folders and, by default, in ~/.openclaw/media/outbound. Only register voice samples you trust.
(3) SKILL.md mentions helper scripts that are not present in the bundle. Confirm whether those scripts are provided separately or should be replaced by your own invocation.
(4) The service can read files referenced by its manifest if that file is edited, so avoid placing sensitive files under the skill's voice/manifest paths.
If you need network isolation, prevent ChatterboxTTS.from_pretrained() from downloading by pre-providing model artifacts or by blocking outbound network access during startup.
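The prerequisites in point (1) can be checked before the server is started. The following is a minimal sketch, not part of the skill: the function name `missing_requirements` is hypothetical, and the import name `chatterbox` for the chatterbox-tts package is an assumption that should be verified against the installed distribution.

```python
import importlib.util
import shutil


def missing_requirements(
    tools=("ffmpeg",),
    modules=("torch", "torchaudio", "chatterbox"),  # "chatterbox" is an assumed import name
):
    """Return the names of required tools and modules that cannot be found."""
    missing = []
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            missing.append(tool)
    for module in modules:
        # find_spec locates a top-level module without importing it,
        # so this check does not load torch or trigger any model download.
        if importlib.util.find_spec(module) is None:
            missing.append(module)
    return missing


if __name__ == "__main__":
    problems = missing_requirements()
    print("missing:", ", ".join(problems) if problems else "nothing")
```

Because the check only looks up executables and module specs, it can run on a machine with outbound network access blocked.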

Review Dimensions

Purpose & Capability: ok. The name and description (local Opus/Ogg voice replies for Feishu/Discord) align with the included FastAPI server and TTS engine. Required tools (ffmpeg, Python libraries including torch/torchaudio/chatterbox-tts) are proportional to the stated functionality.
Instruction Scope: note. SKILL.md instructs running the local uvicorn server and calling the /speak endpoints, saving outputs under .openclaw/media/outbound; those instructions are consistent with the code. One small mismatch: SKILL.md references control scripts (scripts/send_voice_reply.ps1 and scripts/generate_cuda_voice.ps1) that are not present in the file manifest, which may be an omission or a packaging error. The skill persists uploaded voices and cache data under its own server folders and writes outputs into the user's .openclaw media directory (or TARVIS_VOICE_OUTPUT_DIR).
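The script mismatch noted above can be confirmed against a local copy of the bundle. This is a minimal sketch: the two relative paths come from the review, while the function name `missing_scripts` and the `skill_root` argument (wherever the skill was unpacked) are hypothetical.

```python
from pathlib import Path

# Script paths named in the review, relative to the skill's root folder.
REFERENCED_SCRIPTS = (
    "scripts/send_voice_reply.ps1",
    "scripts/generate_cuda_voice.ps1",
)


def missing_scripts(skill_root, referenced=REFERENCED_SCRIPTS):
    """Return the referenced script paths that do not exist under skill_root."""
    root = Path(skill_root)
    return [rel for rel in referenced if not (root / rel).is_file()]
```

An empty list means every referenced script is present; any returned path should be obtained separately or replaced by your own invocation.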
Install Mechanism: note. There is no install spec (the skill is instruction-only) and the server code is bundled with the skill, so install risk is low. However, the runtime requires large Python packages (torch/torchaudio/chatterbox-tts) and ffmpeg; ChatterboxTTS.from_pretrained() may download model artifacts over the network on first run, which is expected but can be large.
Credentials: ok. No required credentials or secret env vars. Optional env vars (TARVIS_VOICE_OUTPUT_DIR, TARVIS_VOICE_DEVICE, TARVIS_VOICE_FFMPEG_TIMEOUT_SEC, TARVIS_VOICE_PHRASE_RAM_CACHE_ITEMS) are relevant to operation and proportionate. The code reads only these environment variables (plus the standard log level).
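To see which of these optional variables are set in a given shell before launching the server, something like the sketch below could be used. The variable names are taken from the review; the helper name `env_overrides` is hypothetical, and the skill's own defaults are not reproduced here because the review does not state them.

```python
import os

# Optional variables named in the review; none of them is a credential.
OPTIONAL_ENV_VARS = (
    "TARVIS_VOICE_OUTPUT_DIR",
    "TARVIS_VOICE_DEVICE",
    "TARVIS_VOICE_FFMPEG_TIMEOUT_SEC",
    "TARVIS_VOICE_PHRASE_RAM_CACHE_ITEMS",
)


def env_overrides(environ=None):
    """Map each optional variable to its value, or None when it is unset."""
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in OPTIONAL_ENV_VARS}


if __name__ == "__main__":
    for name, value in env_overrides().items():
        shown = value if value is not None else "(unset, skill default applies)"
        print(f"{name} = {shown}")
```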
Persistence & Privilege: ok. The skill persists uploaded voice samples under server/voices/, cache data under server/voice_cache/, and writes generated .opus files to the configured output directory (default: ~/.openclaw/media/outbound/voice-server-v3). It does not request always:true or global privileges; persistence is limited to its own directories and the configured output path.
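Since the skill leaves files on disk, it can be useful to list what has accumulated. The sketch below is one way to do that under stated assumptions: the default path is the one quoted in the review, the helper names `output_dir` and `persisted_files` are hypothetical, and it is assumed (not confirmed by the review) that TARVIS_VOICE_OUTPUT_DIR replaces the default path entirely when set.

```python
import os
from pathlib import Path

# Default output location as quoted in the review.
DEFAULT_OUTPUT_DIR = Path.home() / ".openclaw" / "media" / "outbound" / "voice-server-v3"


def output_dir(environ=None):
    """Resolve the output directory, assuming the env var overrides the default."""
    env = os.environ if environ is None else environ
    override = env.get("TARVIS_VOICE_OUTPUT_DIR")
    return Path(override) if override else DEFAULT_OUTPUT_DIR


def persisted_files(root):
    """Return (relative path, size in bytes) for every file under root, sorted by path."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


if __name__ == "__main__":
    for rel_path, size in persisted_files(output_dir()):
        print(f"{size:>10}  {rel_path}")
```

The same `persisted_files` call can be pointed at the skill's server/voices/ and server/voice_cache/ folders to review uploaded samples and cached data.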