Install
openclaw skills install clack
Deploy and manage Clack, a voice relay server for OpenClaw. Bridges voice input (WebSocket) through STT → OpenClaw agent → TTS, enabling real-time voice conv...
WebSocket relay server that enables real-time voice conversations with an OpenClaw agent.
Flow: Client audio (PCM 16kHz/16-bit/mono) → STT → OpenClaw Gateway → TTS → PCM audio back to client.
Per-session provider selection: The client can independently choose STT and TTS providers per call — any combination of on-device (Apple speech frameworks) and server-side providers (ElevenLabs, OpenAI, Deepgram). The server auto-detects all available providers based on configured API keys and exposes them via /info.
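To make the audio format in the flow above concrete, here is a minimal sketch of producing PCM 16kHz/16-bit/mono bytes and splitting them into frames. Little-endian byte order and the 20 ms frame size are assumptions; the source states neither.

```python
import math
import struct

SAMPLE_RATE = 16_000   # Hz, from the flow description
BYTES_PER_SAMPLE = 2   # 16-bit samples
CHANNELS = 1           # mono


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to signed 16-bit PCM bytes (little-endian assumed)."""
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def split_frames(pcm, frame_ms=20):
    """Split PCM bytes into fixed-duration frames; the last frame may be shorter."""
    frame_bytes = SAMPLE_RATE * frame_ms // 1000 * BYTES_PER_SAMPLE * CHANNELS
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]


# One second of a 440 Hz tone at half amplitude.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
pcm = float_to_pcm16(tone)
frames = split_frames(pcm)
```

At this format one second of audio is 32,000 bytes, so a 20 ms frame is 640 bytes.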
Prerequisite: the chatCompletions endpoint is enabled on the OpenClaw gateway.
Run the setup script. It creates a venv, installs dependencies, prompts for API keys, configures a systemd service, and optionally sets up SSL.
sudo bash scripts/setup.sh
The script auto-detects your OpenClaw gateway config and interactively prompts for provider API keys (ElevenLabs, OpenAI, Deepgram — all optional). On re-runs, existing keys can be kept, updated, or deleted.
bash scripts/setup.sh [--port 9878] [--domain clack.example.com]
| Flag | Default | Description |
|---|---|---|
| --port | 9878 | Relay server port |
| --domain | (none) | Domain for SSL setup (enables WSS) |
All connections are encrypted. The app supports two modes:
Domain with SSL (recommended):
bash scripts/setup.sh --domain clack.yourdomain.com
# → wss://clack.yourdomain.com/voice
Requires a DNS A record pointing the domain to your server IP. The setup script auto-configures SSL via Caddy. You can use a free domain from DuckDNS or your own.
Tailscale:
# Install Tailscale on your server, then connect from the app using your Tailscale IP
# → ws://100.x.x.x:9878/voice (encrypted at network level)
No domain or SSL setup needed. Tailscale encrypts all traffic at the network layer. Install Tailscale on both your server and phone, then use the server's Tailscale IP in the app.
Security note: Port 9878 should be firewalled from the public internet. Only allow access via localhost (for Caddy reverse proxy) and Tailscale. The app does not support unencrypted public connections.
The gateway must have chatCompletions enabled. Apply this config patch:
{"http": {"endpoints": {"chatCompletions": {"enabled": true}}}}
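How the gateway applies a config patch is not described here. As an illustration only, the sketch below assumes merge semantics: nested objects are merged and everything else is replaced. The `existing` config and its `otherSetting` key are hypothetical.

```python
import json


def apply_patch(config, patch):
    """Return a copy of `config` with `patch` merged in (dicts merge, other values replace)."""
    merged = dict(config)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_patch(merged[key], value)
        else:
            merged[key] = value
    return merged


patch = json.loads('{"http": {"endpoints": {"chatCompletions": {"enabled": true}}}}')

# Hypothetical existing gateway config, for illustration only.
existing = {"http": {"otherSetting": "kept", "endpoints": {"chatCompletions": {"enabled": False}}}}
updated = apply_patch(existing, patch)
```

Under that assumption, unrelated settings next to `endpoints` survive the patch and only the `enabled` flag changes.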
clack status # Check service status
clack restart # Restart the server
clack logs # Tail logs
clack pair # Generate a new pairing code
clack update # Pull latest code and restart
clack setup # Re-run interactive setup (add SSL later, update keys, etc.)
clack uninstall # Remove service and venv
📱 iOS: Available on the App Store (or build from source at github.com/fbn3799/clack-app)
🤖 Android: Coming soon!
All endpoints except GET /health and POST /pair require a valid auth token (RELAY_AUTH_TOKEN). Tokens are verified using constant-time HMAC comparison to prevent timing attacks.
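The server's own verification code is not shown here. A minimal sketch of a constant-time token check in Python could use `hmac.compare_digest` from the standard library. The function name `token_is_valid` is hypothetical, and generating the token with `secrets.token_hex` is an assumption about its format.

```python
import hmac
import secrets


def token_is_valid(presented: str, expected: str) -> bool:
    """Compare tokens in constant time so timing does not leak how many characters match."""
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


# Assumption: a 32-character token made of hex digits.
expected_token = secrets.token_hex(16)
```

A plain `==` comparison can return as soon as the first differing character is found, which is the timing signal a constant-time comparison avoids.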
Pairing: an authenticated client generates a one-time pairing code (GET /pair), and a new device redeems it for an auth token (POST /pair, rate-limited).

All user-facing text inputs are sanitized before processing:
- Transcripts: length is capped (CLACK_MAX_INPUT_CHARS), echo detection filters feedback loops, and hallucination detection discards nonsense STT output

Each voice call creates a clack:<uuid> session in OpenClaw. These are small, isolated sessions, one per call, so voice conversations don't pollute your main agent context.
The session picker in the iOS app provides context injection only. When you select a session key, it is added as text context to the LLM prompt — it does not change routing. All voice calls still create their own clack:<uuid> session.
Users can provide persistent context that gets injected into the system prompt for every voice call. This lets the AI know about the user's preferences, notes, or any background information.
- WebSocket: send {"type": "set_context", "text": "..."} during a voice session
- HTTP: PUT /context?token=...&text=... or POST /context with JSON body {"text": "..."}

Context is sanitized before saving: only natural-language characters are kept (letters, numbers, common punctuation). IP addresses and domains are stripped. The server returns the sanitized text in the response so the app can show the user exactly what will be sent as context.
Context persists across calls and server restarts. Clear it via DELETE /context or by sending an empty set_context message.
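The exact sanitization rules are not given here. The sketch below is one possible reading of "letters, numbers, common punctuation, with IP addresses and domains stripped". The regular expressions, the set of allowed punctuation, and the name `sanitize_context` are all assumptions.

```python
import re

# Assumed patterns; the server's real rules may differ.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DOMAIN = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)
DISALLOWED = re.compile(r"[^\w\s.,!?'\-:;()]")


def sanitize_context(text):
    """Strip IPv4 addresses and domain-like tokens, then drop characters outside the allowed set."""
    text = IPV4.sub("", text)
    text = DOMAIN.sub("", text)
    text = DISALLOWED.sub("", text)
    return re.sub(r"\s+", " ", text).strip()
```

The domain pattern is deliberately crude: it also removes text such as "word.another" typed without a space after the period, so a production version would need a more careful rule.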
The relay maintains a shared history file across calls for continuity. History is stored as JSON in CLACK_HISTORY_DIR (default: /var/lib/clack/history).
- History length is capped (CLACK_MAX_HISTORY)
- History is readable via GET /history, clearable via DELETE /history

For testing audio round-trips without using LLM credits:
- Server-wide: set the CLACK_ECHO_MODE=true environment variable
- Per session: send {"type":"start","config":{"echo":true}} from the client

In echo mode, transcribed text is echoed back through TTS instead of being sent to the LLM. Audio is peak-normalized with capped gain to ensure consistent playback volume.
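Peak normalization with capped gain can be sketched as follows. The target peak (0.9 of full scale), the gain cap (4x), little-endian sample order, and the function name are assumptions; the source does not state the server's actual values.

```python
import struct


def normalize_peak(pcm, target_peak=0.9, max_gain=4.0):
    """Scale 16-bit PCM so its loudest sample reaches target_peak, never amplifying beyond max_gain."""
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm[:count * 2])
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return pcm  # silence: nothing to scale
    gain = min(target_peak * 32767 / peak, max_gain)
    scaled = [max(-32768, min(32767, round(s * gain))) for s in samples]
    return struct.pack(f"<{count}h", *scaled)
```

The cap matters for very quiet input: without it, near-silent audio would be amplified until background noise reaches the target peak.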
STT and TTS providers can be configured independently per session. The server auto-detects all available providers at startup based on which API keys are set (ELEVENLABS_API_KEY, OPENAI_API_KEY, DEEPGRAM_API_KEY).
- Call GET /info to discover available providers
- Set sttProvider and ttsProvider in the session config

| STT | TTS | Use case |
|---|---|---|
| ElevenLabs | ElevenLabs | Full cloud — best quality |
| On-device | ElevenLabs | Save STT costs, keep premium voices |
| On-device | On-device | Fully local — zero API usage, works offline |
| OpenAI | Deepgram | Mix providers freely |
Cost optimization: Use on-device STT (free, unlimited) with a premium cloud TTS voice — get great output quality while eliminating transcription costs entirely. Or go fully on-device for zero API spend.
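As a sketch of per-session selection, a client could pick the first provider it prefers from what the server reports and put the result in the start message. The shape of the /info response is not documented here, so the function takes plain lists of provider names. The preference order is arbitrary, and the identifier for on-device mode is not given, so it is left out.

```python
import json

# Server-side provider names as used by STT_PROVIDER / TTS_PROVIDER.
STT_PREFERENCE = ["deepgram", "openai", "elevenlabs"]  # arbitrary order, for illustration
TTS_PREFERENCE = ["elevenlabs", "openai", "deepgram"]  # arbitrary order, for illustration


def pick_provider(preferences, available):
    """Return the first preferred provider that is available, or None."""
    for name in preferences:
        if name in available:
            return name
    return None


def build_start_message(stt_available, tts_available, voice=None):
    """Build the JSON start message, setting only the options that were resolved."""
    config = {}
    stt = pick_provider(STT_PREFERENCE, stt_available)
    tts = pick_provider(TTS_PREFERENCE, tts_available)
    if stt is not None:
        config["sttProvider"] = stt
    if tts is not None:
        config["ttsProvider"] = tts
    if voice is not None:
        config["voice"] = voice
    return json.dumps({"type": "start", "config": config})
```

Leaving a key out of the config presumably falls back to the server defaults (STT_PROVIDER, TTS_PROVIDER, TTS_VOICE), though that behavior is an assumption.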
When STT is set to on-device, the client sends transcribed text instead of audio:
{"type": "text_input", "text": "What's the weather like?"}
When TTS is set to on-device, the server returns response_text only and skips audio synthesis.
Text input length is capped (CLACK_MAX_INPUT_CHARS); transcripts exceeding this are truncated.

| Endpoint | Method | Auth | Description |
|---|---|---|---|
| GET /health | GET | No | Health check; returns service status |
| POST /pair | POST | No | Redeem pairing code → get auth token (rate-limited) |
| GET /pair | GET | Yes | Generate one-time pairing code |
| GET /info | GET | Yes | Server info: agent name, available STT/TTS providers |
| GET /voices | GET | Yes | List available TTS voices |
| GET /sessions | GET | Yes | List active sessions |
| GET /history | GET | Yes | Get conversation history |
| DELETE /history | DELETE | Yes | Clear conversation history |
| GET /context | GET | Yes | Get current user context |
| PUT /context | PUT | Yes | Set user context (query param text) |
| POST /context | POST | Yes | Set user context (JSON body {"text": "..."}) |
| DELETE /context | DELETE | Yes | Clear user context |
| WebSocket /voice | WS | Yes | Voice relay connection |
Endpoint: ws://<host>:<port>/voice?token=<RELAY_AUTH_TOKEN>
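A small helper for assembling the endpoint URL, shown as a sketch. The function name and the example host values are placeholders. URL-encoding the token is a precaution in case it ever contains characters that are not query-safe.

```python
from urllib.parse import urlencode, urlunsplit


def voice_url(host, port=9878, token="", secure=False):
    """Build the /voice WebSocket URL; use secure=True for wss:// behind a domain with SSL."""
    scheme = "wss" if secure else "ws"
    netloc = f"{host}:{port}" if port else host
    query = urlencode({"token": token})
    return urlunsplit((scheme, netloc, "/voice", query, ""))


# Placeholder values for illustration.
tailscale_url = voice_url("100.64.0.1", 9878, "abc123")
domain_url = voice_url("clack.example.com", None, "abc123", secure=True)
```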
Client → server messages:

| Message | Format | Description |
|---|---|---|
| {"type":"start","config":{...}} | JSON | Start session. Config: voice, systemPrompt, echo, sttProvider, ttsProvider |
| Binary frames | bytes | Raw PCM audio (16kHz, 16-bit, mono) |
| {"type":"text_input","text":"..."} | JSON | Local speech mode: send text directly |
| {"type":"end_speech"} | JSON | Signal end of speech, triggers processing |
| {"type":"interrupt"} | JSON | Cancel current TTS playback |
| {"type":"ping"} | JSON | Keepalive |
| {"type":"set_context","text":"..."} | JSON | Set user context (sanitized before saving) |
| {"type":"auth","token":"..."} | JSON | Authenticate (alternative to query param) |
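The order of client messages for one spoken turn can be sketched without a network connection. This generator yields what a client would send: a start message, the audio as binary frames, then end_speech. The 640-byte frame size (20 ms at 16kHz/16-bit/mono) is an assumption, and a real client would presumably wait for the server's ready message before streaming audio.

```python
import json


def client_messages(pcm, config, frame_bytes=640):
    """Yield the messages for one spoken turn, in the order a client would send them."""
    yield json.dumps({"type": "start", "config": config})
    for offset in range(0, len(pcm), frame_bytes):
        yield pcm[offset:offset + frame_bytes]  # binary frame of raw PCM
    yield json.dumps({"type": "end_speech"})


# 1500 bytes of silence split into frames of 640, 640, and 220 bytes.
messages = list(client_messages(b"\x00" * 1500, {"voice": "aria"}))
```

Text messages are `str` and audio frames are `bytes`, which matches how most WebSocket libraries distinguish text from binary frames.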
Server → client messages:

| Message | Format | Description |
|---|---|---|
| {"type":"ready"} | JSON | Session ready |
| {"type":"auth_ok"} / {"type":"auth_failed"} | JSON | Auth result |
| {"type":"processing","stage":"..."} | JSON | Stage: transcribing, thinking, speaking, filtered |
| {"type":"transcript","text":"...","final":true} | JSON | STT result |
| {"type":"response_text","text":"..."} | JSON | LLM text response |
| {"type":"response_start","format":"pcm_16000"} | JSON | Audio stream starting |
| Binary frames | bytes | TTS audio (PCM 16kHz, 16-bit, mono) |
| {"type":"response_end"} | JSON | Audio stream done |
| {"type":"tts_cancelled"} | JSON | TTS playback was interrupted |
| {"type":"context_updated","text":"..."} | JSON | Context saved; text contains the sanitized version |
| {"type":"context_cleared"} | JSON | Context was cleared |
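On the receiving side, a client needs to tell JSON events from binary audio and track whether an audio stream is open. The sketch below processes an already-received list of messages. The `TurnResult` type and the assumption that binary frames only count between response_start and response_end are illustrative, not taken from the server code.

```python
import json
from dataclasses import dataclass, field


@dataclass
class TurnResult:
    transcript: str = ""
    response_text: str = ""
    audio: bytearray = field(default_factory=bytearray)
    stages: list = field(default_factory=list)
    cancelled: bool = False


def consume(messages):
    """Fold a sequence of server messages (str for JSON, bytes for audio) into a TurnResult."""
    result = TurnResult()
    streaming = False
    for message in messages:
        if isinstance(message, (bytes, bytearray)):
            if streaming:
                result.audio.extend(message)
            continue
        event = json.loads(message)
        kind = event.get("type")
        if kind == "processing":
            result.stages.append(event.get("stage"))
        elif kind == "transcript" and event.get("final"):
            result.transcript = event["text"]
        elif kind == "response_text":
            result.response_text = event["text"]
        elif kind == "response_start":
            streaming = True
        elif kind == "response_end":
            streaming = False
        elif kind == "tts_cancelled":
            streaming = False
            result.cancelled = True
    return result
```

When TTS is on-device, the same loop still works: the server sends response_text only, so `audio` stays empty.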
Each call runs in its own clack:<uuid> session.

20 built-in ElevenLabs voices are available. Default: Will. Pass a voice name or ID in the session config:
{"type": "start", "config": {"voice": "aria"}}
Available aliases: will, aria, roger, sarah, laura, charlie, george, callum, river, liam, charlotte, alice, matilda, jessica, eric, chris, brian, daniel, lily, bill.
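How the server resolves a voice name is not described here. As a sketch, a client could normalize a requested voice against the alias list and pass anything else through unchanged as a voice ID. Case-insensitive matching and the function name `resolve_voice` are assumptions.

```python
VOICE_ALIASES = {
    "will", "aria", "roger", "sarah", "laura", "charlie", "george", "callum",
    "river", "liam", "charlotte", "alice", "matilda", "jessica", "eric",
    "chris", "brian", "daniel", "lily", "bill",
}


def resolve_voice(requested, default="will"):
    """Return a known alias in lowercase, the default if nothing was requested, or the raw value as an ID."""
    if not requested or not requested.strip():
        return default
    candidate = requested.strip()
    lowered = candidate.lower()
    return lowered if lowered in VOICE_ALIASES else candidate
```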
| Variable | Default | Description |
|---|---|---|
| RELAY_AUTH_TOKEN | (none) | Required. Client auth token (32-char) |
| OPENCLAW_GATEWAY_URL | http://127.0.0.1:18789 | OpenClaw Gateway URL |
| OPENCLAW_GATEWAY_TOKEN | (none) | Gateway bearer token |
| STT_PROVIDER | elevenlabs | STT provider (elevenlabs, openai, deepgram) |
| TTS_PROVIDER | elevenlabs | TTS provider (elevenlabs, openai, deepgram) |
| TTS_VOICE | Will | Default voice (name or ID) |
| VOICE_RELAY_PORT | 9878 | Server port |
| CLACK_ECHO_MODE | false | Enable echo test mode server-wide |
| CLACK_MAX_INPUT_CHARS | 300 | Max transcript length (chars) |
| CLACK_HISTORY_DIR | /var/lib/clack/history | History file storage directory |
| CLACK_MAX_HISTORY | 50 | Max conversation history messages |
| CLACK_AGENT_NAME | Storm | Agent name shown in the iOS app |
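For reference, reading a few of these variables with their documented defaults might look like the sketch below. The `RelaySettings` type and `load_settings` function are hypothetical and only cover a subset of the table. Treating only the string "true" as enabling echo mode is also an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RelaySettings:
    auth_token: str
    gateway_url: str
    port: int
    echo_mode: bool
    max_input_chars: int
    max_history: int


def load_settings(env=None):
    """Read relay settings from a mapping (defaults to os.environ), applying documented defaults."""
    env = os.environ if env is None else env
    token = env.get("RELAY_AUTH_TOKEN")
    if not token:
        raise ValueError("RELAY_AUTH_TOKEN is required")
    return RelaySettings(
        auth_token=token,
        gateway_url=env.get("OPENCLAW_GATEWAY_URL", "http://127.0.0.1:18789"),
        port=int(env.get("VOICE_RELAY_PORT", "9878")),
        echo_mode=env.get("CLACK_ECHO_MODE", "false").strip().lower() == "true",
        max_input_chars=int(env.get("CLACK_MAX_INPUT_CHARS", "300")),
        max_history=int(env.get("CLACK_MAX_HISTORY", "50")),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests.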
Provider API keys (ELEVENLABS_API_KEY, OPENAI_API_KEY, DEEPGRAM_API_KEY) are stored in config.json with restricted file permissions, not as environment variables. The setup script manages these interactively.