Voice Assistant
Review
Audited by ClawScan on May 10, 2026.
Overview
The skill appears to provide the advertised voice assistant, but its local web UI renders transcript text as raw HTML, which could let crafted transcript or model output execute script in the browser page.
Review before installing. The voice-provider and gateway data flows are expected for this skill, but avoid speaking sensitive content unless those services are approved. The browser UI should be fixed to escape transcript text before rendering, and transcript logging should be reduced or made opt-in.
Findings (5)
Artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.
A malicious or compromised response could run JavaScript in the voice UI, manipulate the session, or interact with the local page and WebSocket state.
Transcript or agent text is inserted into the DOM as HTML rather than as escaped text. Because that text can come from speech transcription or model output, crafted HTML such as event handlers could execute in the local browser page.
```javascript
line.innerHTML = `<span class="role ${role}">${role === "user" ? "You" : "Agent"}:</span>${text}`;
```
Render transcript content with textContent or createTextNode, or sanitize it with a trusted sanitizer such as DOMPurify before insertion. Consider adding a restrictive Content Security Policy.
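If the client-side rendering cannot be changed immediately, a defense-in-depth option is to HTML-escape transcript text on the Python server before it is sent over the WebSocket. This is a sketch only: the function name and message shape are assumptions, not the skill's actual code, and the escaping should be removed once the UI switches to textContent, to avoid double-encoding.

```python
import html
import json

def safe_transcript_message(role: str, text: str) -> str:
    """Build a WebSocket payload with HTML-escaped transcript text.

    Escaping neutralizes crafted markup such as <img onerror=...> so it
    renders as literal text even if the client inserts it via innerHTML.
    """
    return json.dumps({"role": role, "text": html.escape(text)})

# A hostile "transcript" becomes inert literal text, not executable HTML.
payload = safe_transcript_message("user", '<img src=x onerror="alert(1)">')
```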
Spoken commands may carry the same power as typed commands, including tool use, depending on the user's OpenClaw configuration.
The skill intentionally routes spoken text into the existing OpenClaw agent, including whatever tools and memory that gateway exposes. This is core to the stated purpose, but voice transcription errors or accidental speech could still trigger agent actions if the gateway allows them.
It's the same agent with all its context, tools, and memory — just with a voice.
Keep tool-approval safeguards enabled on the OpenClaw gateway, review sensitive transcriptions before action when possible, and avoid using voice mode for high-impact tasks without confirmation.
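The "review before action" advice can be made concrete with a small confirmation gate in front of the agent call. Everything below is hypothetical: the keyword list, function names, and return strings are illustrative and not part of the skill's code.

```python
# Hypothetical confirmation gate: hold transcripts that look high-impact
# until the user explicitly confirms them. The keyword list is illustrative.
HIGH_IMPACT_KEYWORDS = {"delete", "deploy", "pay", "transfer", "rm", "drop"}

def needs_confirmation(transcript: str) -> bool:
    """Return True if the spoken command mentions a high-impact action."""
    words = {w.strip(".,!?").lower() for w in transcript.split()}
    return bool(words & HIGH_IMPACT_KEYWORDS)

def route_transcript(transcript: str, confirmed: bool = False) -> str:
    """Forward safe or confirmed transcripts; hold the rest for review."""
    if needs_confirmation(transcript) and not confirmed:
        return "held: say 'confirm' to run this command"
    return "forwarded to agent"
```

A gate like this also limits the blast radius of transcription errors: a mis-heard phrase that happens to contain a dangerous verb is held rather than executed.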
Installing users must trust the local server with their Deepgram or ElevenLabs API keys.
The server loads provider API keys from the environment and sends them to Deepgram or ElevenLabs. This is expected for the STT/TTS integrations and there is no evidence of unrelated credential transmission.
```python
headers = {"Authorization": f"Token {DEEPGRAM_KEY}"}
...
headers = {"xi-api-key": ELEVENLABS_KEY}
```
Use least-privilege provider keys where available, store them only in the local .env file, avoid committing .env, and rotate keys if the machine or logs are exposed.
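The header shapes quoted above show how the server authenticates to each provider. A minimal sketch of least-exposure key handling follows; the environment variable names mirror the snippet, while the function name and fallback order are assumptions.

```python
import os

def provider_headers() -> dict:
    """Build auth headers for whichever provider key is configured.

    Keys are read from the environment at call time and never logged;
    only the local .env file (kept out of version control) should
    define them.
    """
    deepgram_key = os.environ.get("DEEPGRAM_KEY")
    elevenlabs_key = os.environ.get("ELEVENLABS_KEY")
    if deepgram_key:
        return {"Authorization": f"Token {deepgram_key}"}
    if elevenlabs_key:
        return {"xi-api-key": elevenlabs_key}
    raise RuntimeError("no STT/TTS provider key configured")
```

Failing loudly when no key is present is preferable to sending empty credentials, which can surface as confusing provider-side errors.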
Private spoken content and assistant responses may be processed by the configured third-party voice providers and the OpenClaw gateway.
The documented architecture sends microphone audio, transcripts, and generated response text across the browser, local server, external STT/TTS providers, and the OpenClaw gateway. This is purpose-aligned but sensitive.
Browser Mic → WebSocket → STT (Deepgram / ElevenLabs) → Text → OpenClaw Gateway (/v1/chat/completions, streaming) → Response Text → TTS (Deepgram Aura / ElevenLabs) → Audio chunks → Browser Speaker
Use providers and gateway endpoints you trust, verify OPENCLAW_GATEWAY_URL before use, and avoid speaking secrets or regulated data unless the configured services are approved for that data.
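The gateway-verification advice can be enforced with a small allowlist check before any audio session starts. The variable name OPENCLAW_GATEWAY_URL comes from the review text; the allowlist contents and function name are illustrative assumptions.

```python
from urllib.parse import urlparse

# Illustrative allowlist: only hosts the user has explicitly approved.
ALLOWED_GATEWAY_HOSTS = {"localhost", "127.0.0.1"}

def check_gateway_url(url: str) -> str:
    """Reject OPENCLAW_GATEWAY_URL values that are not http(s) to an approved host."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    if parts.hostname not in ALLOWED_GATEWAY_HOSTS:
        raise ValueError(f"gateway host not approved: {parts.hostname!r}")
    return url
```

Checking the URL once at startup, before the microphone pipeline opens, prevents transcripts from being streamed to an unexpected endpoint.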
Sensitive content spoken to the assistant may appear in terminal or runtime logs.
Final speech transcripts are written to process logs. This is useful for debugging, but it can preserve sensitive spoken content outside the immediate voice session if logs are retained or shared.
```python
log.info(f"STT final: {transcript}")
...
log.info(f"STT final: {text}")
```
Redact or disable transcript logging by default, move full transcript logs to debug mode, and inform users if logs may be collected.
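The logging fix can be as small as routing transcripts through a redacting helper that only emits full text at DEBUG level. A sketch under assumptions: the logger name and message wording are illustrative, not taken from the skill.

```python
import logging

log = logging.getLogger("voice_assistant")  # logger name is an assumption

def log_transcript(transcript: str) -> None:
    """Log transcript length at INFO; emit full text only at DEBUG.

    Routine logs stay free of spoken content while preserving a
    debugging path that the user must opt into.
    """
    log.info("STT final: %d chars (enable DEBUG for content)", len(transcript))
    log.debug("STT final: %s", transcript)
```

At the default INFO level the spoken words never reach the log stream; only their length does.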
