Voice Assistant
Review
Audited by ClawScan on May 10, 2026.
Overview
The skill appears to provide the advertised voice assistant, but its local web UI renders transcript text as raw HTML, which could let crafted transcript or model output execute script in the browser page.
Review before installing. The voice-provider and gateway data flows are expected for this skill, but avoid speaking sensitive content unless those services are approved. The browser UI should be fixed to escape transcript text before rendering, and transcript logging should be reduced or made opt-in.
Findings (5)
Artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.
A malicious or compromised response could run JavaScript in the voice UI, manipulate the session, or interact with the local page and WebSocket state.
Transcript or agent text is inserted into the DOM as HTML rather than as escaped text. Because that text can come from speech transcription or model output, crafted HTML such as event handlers could execute in the local browser page.
```javascript
line.innerHTML = `<span class="role ${role}">${role === "user" ? "You" : "Agent"}:</span>${text}`;
```
Render transcript content with textContent or createTextNode, or sanitize it with a trusted sanitizer such as DOMPurify before insertion. Consider adding a restrictive Content Security Policy.
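If the client-side rendering cannot be changed immediately, a defense-in-depth option is to HTML-escape transcript text on the Python server before it is sent over the WebSocket. This is a sketch only: the function name and message shape are assumptions, not the skill's actual code, and the escaping should be removed once the UI switches to textContent, to avoid double-encoding.

```python
import html
import json

def safe_transcript_message(role: str, text: str) -> str:
    """Build a WebSocket payload with HTML-escaped transcript text.

    Escaping neutralizes crafted markup such as <img onerror=...> so it
    renders as literal text even if the client inserts it via innerHTML.
    """
    return json.dumps({"role": role, "text": html.escape(text)})

# A hostile "transcript" becomes inert literal text, not executable HTML.
payload = safe_transcript_message("user", '<img src=x onerror="alert(1)">')
```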
Spoken commands may carry the same power as typed commands, including tool use, depending on the user's OpenClaw configuration.
The skill intentionally routes spoken text into the existing OpenClaw agent, including whatever tools and memory that gateway exposes. This is core to the stated purpose, but voice transcription errors or accidental speech could still trigger agent actions if the gateway allows them.
It's the same agent with all its context, tools, and memory — just with a voice.
Keep tool-approval safeguards enabled on the OpenClaw gateway, review sensitive transcriptions before action when possible, and avoid using voice mode for high-impact tasks without confirmation.
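The "review before action" advice can be made concrete with a small confirmation gate in front of the agent call. Everything below is hypothetical: the keyword list, function names, and return strings are illustrative and not part of the skill's code.

```python
# Hypothetical confirmation gate: hold transcripts that look high-impact
# until the user explicitly confirms them. The keyword list is illustrative.
HIGH_IMPACT_KEYWORDS = {"delete", "deploy", "pay", "transfer", "rm", "drop"}

def needs_confirmation(transcript: str) -> bool:
    """Return True if the spoken command mentions a high-impact action."""
    words = {w.strip(".,!?").lower() for w in transcript.split()}
    return bool(words & HIGH_IMPACT_KEYWORDS)

def route_transcript(transcript: str, confirmed: bool = False) -> str:
    """Forward safe or confirmed transcripts; hold the rest for review."""
    if needs_confirmation(transcript) and not confirmed:
        return "held: say 'confirm' to run this command"
    return "forwarded to agent"
```

A gate like this also limits the blast radius of transcription errors: a mis-heard phrase that happens to contain a dangerous verb is held rather than executed.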
Installing users must trust the local server with their Deepgram or ElevenLabs API keys.
The server loads provider API keys from the environment and sends them to Deepgram or ElevenLabs. This is expected for the STT/TTS integrations and there is no evidence of unrelated credential transmission.
```python
headers = {"Authorization": f"Token {DEEPGRAM_KEY}"}
...
headers = {"xi-api-key": ELEVENLABS_KEY}
```
Use least-privilege provider keys where available, store them only in the local .env file, avoid committing .env, and rotate keys if the machine or logs are exposed.
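The header shapes quoted above show how the server authenticates to each provider. A minimal sketch of least-exposure key handling follows; the environment variable names mirror the snippet, while the function name and fallback order are assumptions.

```python
import os

def provider_headers() -> dict:
    """Build auth headers for whichever provider key is configured.

    Keys are read from the environment at call time and never logged;
    only the local .env file (kept out of version control) should
    define them.
    """
    deepgram_key = os.environ.get("DEEPGRAM_KEY")
    elevenlabs_key = os.environ.get("ELEVENLABS_KEY")
    if deepgram_key:
        return {"Authorization": f"Token {deepgram_key}"}
    if elevenlabs_key:
        return {"xi-api-key": elevenlabs_key}
    raise RuntimeError("no STT/TTS provider key configured")
```

Failing loudly when no key is present is preferable to sending empty credentials, which can surface as confusing provider-side errors.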
Private spoken content and assistant responses may be processed by the configured third-party voice providers and the OpenClaw gateway.
The documented architecture sends microphone audio, transcripts, and generated response text across the browser, local server, external STT/TTS providers, and the OpenClaw gateway. This is purpose-aligned but sensitive.
Browser Mic → WebSocket → STT (Deepgram / ElevenLabs) → Text → OpenClaw Gateway (/v1/chat/completions, streaming) → Response Text → TTS (Deepgram Aura / ElevenLabs) → Audio chunks → Browser Speaker
Use providers and gateway endpoints you trust, verify OPENCLAW_GATEWAY_URL before use, and avoid speaking secrets or regulated data unless the configured services are approved for that data.
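The gateway-verification advice can be enforced with a small allowlist check before any audio session starts. The variable name OPENCLAW_GATEWAY_URL comes from the review text; the allowlist contents and function name are illustrative assumptions.

```python
from urllib.parse import urlparse

# Illustrative allowlist: only hosts the user has explicitly approved.
ALLOWED_GATEWAY_HOSTS = {"localhost", "127.0.0.1"}

def check_gateway_url(url: str) -> str:
    """Reject OPENCLAW_GATEWAY_URL values that are not http(s) to an approved host."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    if parts.hostname not in ALLOWED_GATEWAY_HOSTS:
        raise ValueError(f"gateway host not approved: {parts.hostname!r}")
    return url
```

Checking the URL once at startup, before the microphone pipeline opens, prevents transcripts from being streamed to an unexpected endpoint.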
Sensitive content spoken to the assistant may appear in terminal or runtime logs.
Final speech transcripts are written to process logs. This is useful for debugging, but it can preserve sensitive spoken content outside the immediate voice session if logs are retained or shared.
```python
log.info(f"STT final: {transcript}")
...
log.info(f"STT final: {text}")
```
Redact or disable transcript logging by default, move full transcript logs to debug mode, and inform users if logs may be collected.
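The logging fix can be as small as routing transcripts through a redacting helper that only emits full text at DEBUG level. A sketch under assumptions: the logger name and message wording are illustrative, not taken from the skill.

```python
import logging

log = logging.getLogger("voice_assistant")  # logger name is an assumption

def log_transcript(transcript: str) -> None:
    """Log transcript length at INFO; emit full text only at DEBUG.

    Routine logs stay free of spoken content while preserving a
    debugging path that the user must opt into.
    """
    log.info("STT final: %d chars (enable DEBUG for content)", len(transcript))
    log.debug("STT final: %s", transcript)
```

At the default INFO level the spoken words never reach the log stream; only their length does.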
