Local Voice Agent

v1.0.2

Complete offline voice-to-voice AI assistant for OpenClaw (Whisper.cpp STT + Pocket-TTS). 100% local processing, no cloud APIs, no costs. Use for hands-free...

MIT-0
License: MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
VirusTotal
Benign
OpenClaw
Benign
high confidence
Purpose & Capability
The name and description (offline voice agent) match the code and runtime requirements: the wrapper scripts call whisper-cli, ffmpeg, and a local Pocket‑TTS HTTP server, and install.sh installs/copies the skill into the OpenClaw workspace. The required binaries (whisper-cli, python3, ffmpeg) are appropriate for the stated purpose.
Instruction Scope
SKILL.md and the scripts stay within the stated voice pipeline (record → STT → AI → TTS → playback). They instruct cloning whisper.cpp and optionally running a Pocket‑TTS server. One runtime capability to note: the TTS client (lib/tts.py) POSTs to whatever URL is configured (localhost by default), so if you point the config at a remote TTS endpoint, the skill will send text to that server. The OpenClaw integration is a placeholder: it reads an optional session_key from config but does not implement remote OpenClaw calls in the provided code.
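To make the runtime note concrete, here is a minimal sketch of the behavior described above: a client that POSTs text to the configured TTS URL, with a guard that refuses non-local endpoints. The function names, the plain-text body, and the guard itself are illustrative assumptions, not the actual lib/tts.py API.

```python
from urllib.parse import urlparse
from urllib.request import Request

def is_local_endpoint(url: str) -> bool:
    """True only when the configured TTS URL points at this machine."""
    host = urlparse(url).hostname or ""
    return host in ("localhost", "127.0.0.1", "::1")

def build_tts_request(url: str, text: str) -> Request:
    """Build (without sending) the kind of POST a TTS client would issue.

    Refusing non-local URLs is the safeguard the review recommends; the
    real client simply sends to whatever URL the config contains.
    """
    if not is_local_endpoint(url):
        raise ValueError(f"refusing non-local TTS endpoint: {url}")
    return Request(url, data=text.encode("utf-8"),
                   headers={"Content-Type": "text/plain"}, method="POST")
```

With the default config this builds a request to localhost; changing the URL to a remote host is exactly the point at which text would leave the device.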
Install Mechanism
There is no automated package install from untrusted URLs; install.sh clones whisper.cpp from GitHub only on user approval and copies files into the user's OpenClaw workspace. No obscure download hosts, archive extraction, or arbitrary remote binaries are present in the package.
Credentials
The skill requests no environment variables or external credentials by default. It does read config/voices.yaml (which contains an optional openclaw.session_key field) and writes caches (~/.cache/voice-agent) and logs (~/.local/log/voice-agent.log). Because the TTS URL is configurable, changing it to a remote server would expose generated text and possibly user transcripts — keep TTS set to localhost to stay fully local.
Persistence & Privilege
Skill is not always-enabled and does not request elevated/system-wide privileges. install.sh copies the skill into the user's OpenClaw workspace and suggests a PATH change, which is normal for a skill. It does not modify other skills' configurations or system authentication.
Assessment
This package appears to do what it claims (local Whisper.cpp STT + Pocket‑TTS). Before installing:

1. Open config/voices.yaml and ensure openclaw.session_key is empty unless you intentionally provide a session token.
2. Keep tts.url set to http://localhost:5000 unless you trust a remote TTS server; changing it will send text/audio data off‑device.
3. The skill creates cache files (~/.cache/voice-agent) and a log (~/.local/log/voice-agent.log) that may contain transcripts; delete or secure them if needed.
4. Run install.sh manually and review its file‑copy operations (it copies into ~/.openclaw/workspace/skills/voice-agent).
5. The install may clone and build whisper.cpp and download models from GitHub; review those third‑party repos if you need to.

For extra caution, run the skill in a sandboxed account or VM and verify network activity (ensure Pocket‑TTS runs on localhost) before granting broader access.
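The first two checks can be automated with a naive line scan of config/voices.yaml. This sketch deliberately avoids a YAML dependency, so it only handles the flat `key: value` lines shown in the Configuration section; a real audit would use a proper YAML parser.

```python
def audit_voices_yaml(text: str) -> list:
    """Flag the two risky settings in voices.yaml before installing:
    a non-empty openclaw.session_key and a non-localhost tts url.
    Naive line scan; assumes simple 'key: value' lines."""
    warnings = []
    for line in text.splitlines():
        key, _, value = line.strip().partition(":")
        value = value.strip()
        if key == "session_key" and value:
            warnings.append("openclaw.session_key is set; clear it unless intended")
        if key == "url" and value and \
                "localhost" not in value and "127.0.0.1" not in value:
            warnings.append(f"tts url is non-local ({value}); text will leave this device")
    return warnings
```

Running it over a default config should return an empty list; any warning means re-check items 1 and 2 above.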

Like a lobster shell, security has layers — review code before you run it.

latest: vk976rkk2hx28dvwfx68jqez4hn844fj3


Runtime requirements

🎤 Clawdis
Bins: whisper-cli, python3, ffmpeg

SKILL.md

Voice Agent - OpenClaw Skill

Complete voice-to-voice AI assistant for hands-free operation.

Architecture

User Voice → Whisper STT → Text → OpenClaw AI → Text → Pocket-TTS → Voice Response
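The two local stages of this pipeline can be sketched as command/payload builders. Paths assume the whisper.cpp layout set up in the Prerequisites below; the TTS field names are assumptions about Pocket-TTS, not its documented request format.

```python
import os

def stt_command(wav_path: str, model: str = "tiny",
                whisper_dir: str = "~/.local/whisper.cpp") -> list:
    """Argument list for the STT stage, mirroring the whisper-cli
    invocation shown in the Prerequisites (build/bin + models/ layout)."""
    root = os.path.expanduser(whisper_dir)
    return [os.path.join(root, "build", "bin", "whisper-cli"),
            "-m", os.path.join(root, "models", f"ggml-{model}.bin"),
            "-f", wav_path]

def tts_payload(text: str, voice: str = "peter voice") -> dict:
    """Request body for the TTS stage; field names are illustrative
    assumptions, not the Pocket-TTS API specification."""
    return {"text": text, "voice": voice}
```

A wrapper script would run `subprocess.run(stt_command("clip.wav"), ...)`, hand the transcript to the AI, then POST `tts_payload(response)` to the configured TTS URL.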

Prerequisites

1. Whisper.cpp (Speech-to-Text)

# Clone and build
git clone https://github.com/ggerganov/whisper.cpp ~/.local/whisper.cpp
cd ~/.local/whisper.cpp
make -j4

# Download tiny model (fast, low-resource)
bash ./models/download-ggml-model.sh tiny

Test:

./build/bin/whisper-cli -m models/ggml-tiny.bin -f samples/jfk.wav

2. Pocket-TTS (Text-to-Speech)

Option A: Use existing server

export POCKET_TTS_URL="http://localhost:5000"

Option B: Install locally

# Clone your Pocket-TTS server
cd /path/to/pockettts
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python3 -m app.main --host 127.0.0.1 --port 5000  # bind to localhost so audio/text stays on-device

3. FFmpeg (Audio Conversion)

sudo apt-get install -y ffmpeg

Quick Start

Voice Command (One-shot)

# Record → Transcribe → Process → Speak
./bin/voice-agent "What's the weather today?"

Interactive Mode

# Continuous voice conversation
./bin/voice-agent --interactive

Voice File Processing

# Transcribe existing audio file
./bin/voice-to-text recording.wav

# Generate voice from text
./bin/text-to-voice "Hello world!" output.wav

Configuration

Edit config/voices.yaml:

# Default voices
stt:
  model: tiny  # tiny, small, medium (larger = more accurate, slower)
  language: en  # en, ne, hi, etc.

tts:
  url: http://localhost:5000
  voice: peter voice  # Your custom voice
  format: wav  # wav, mp3

# Performance
performance:
  threads: 4  # CPU threads for Whisper
  realtime: true  # Faster-than-realtime processing
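How missing keys in voices.yaml are handled isn't specified; a reasonable sketch is to overlay the user's file onto defaults matching the example above. The merge behavior and the DEFAULTS dict are assumptions, not the skill's actual loader.

```python
# Defaults mirroring the example voices.yaml above (assumed, not the
# skill's actual built-in defaults).
DEFAULTS = {
    "stt": {"model": "tiny", "language": "en"},
    "tts": {"url": "http://localhost:5000", "voice": "peter voice", "format": "wav"},
    "performance": {"threads": 4, "realtime": True},
}

def merge_config(user: dict) -> dict:
    """Overlay user-supplied sections onto the defaults, so a partial
    voices.yaml (e.g. only stt.model) still yields a complete config."""
    merged = {section: dict(values) for section, values in DEFAULTS.items()}
    for section, values in user.items():
        merged.setdefault(section, {}).update(values)
    return merged
```

For example, a file that only sets `stt.model: small` keeps the localhost TTS URL and English language defaults.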

API Endpoints

POST /v1/voice/command

Voice command processing:

curl -X POST "http://localhost:5000/v1/voice/command" \
  -F "audio=@recording.wav" \
  -F "action=openclaw"

Response:

{
  "transcription": "What's the weather today?",
  "response_text": "The weather in Kathmandu is partly cloudy, 22 degrees Celsius.",
  "audio_response": "/tmp/response.wav"
}
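The `curl -F` call above sends a multipart/form-data body. If you want to hit the endpoint from Python with only the standard library, you have to assemble that body yourself; this is a minimal sketch of what curl constructs (the endpoint and field names come from the example above, the builder itself is generic).

```python
import uuid

def multipart_form(fields: dict, file_field: str, filename: str, data: bytes):
    """Build a multipart/form-data body equivalent to curl's -F flags:
    one text part per field, plus one file part carrying the audio bytes.
    Returns (body, content_type_header_value)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        (f'--{boundary}\r\nContent-Disposition: form-data; '
         f'name="{file_field}"; filename="{filename}"\r\n'
         f'Content-Type: audio/wav\r\n\r\n').encode() + data + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"
```

Pass the returned body and Content-Type to `urllib.request.Request` to reproduce the curl example without extra dependencies.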

GET /v1/voices

List available TTS voices:

curl http://localhost:5000/v1/voices

Use Cases

1. Daily Briefings (Voice)

./bin/voice-agent "Give me my morning briefing"

2. Voice Notes

./bin/voice-agent "Remind me to call Peter at 3 PM"

3. Hands-Free Coding

./bin/voice-agent "Show me the status of my git repository"

4. Accessibility

Perfect for users who prefer voice interaction or have mobility constraints.

Scripts

bin/voice-to-text

Convert speech to text:

./bin/voice-to-text input.wav
./bin/voice-to-text input.ogg  # Auto-converts with ffmpeg
./bin/voice-to-text input.mp4  # Extracts audio from video
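The auto-conversion step presumably normalizes any input to the 16 kHz mono WAV that whisper-cli expects (see Troubleshooting). A sketch of the ffmpeg argument list such a script would use; the exact flags the shipped script passes are an assumption.

```python
def ffmpeg_to_wav(src: str, dst: str) -> list:
    """ffmpeg arguments to normalize any audio/video input to the
    16 kHz mono WAV whisper-cli expects. -y overwrites an existing dst."""
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]
```

Run it with `subprocess.run(ffmpeg_to_wav("input.ogg", "out.wav"), check=True)` before handing the result to voice-to-text.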

bin/text-to-voice

Convert text to speech:

./bin/text-to-voice "Hello world!" output.wav
./bin/text-to-voice --voice "usha lama" "Namaste!" greeting.wav

bin/voice-agent

Full voice pipeline:

./bin/voice-agent "What time is it?"
./bin/voice-agent --interactive  # Conversation mode
./bin/voice-agent --file recording.wav  # Process file

Troubleshooting

Whisper.cpp Errors

"failed to read audio file"

  • Convert to WAV first: ffmpeg -i input.ogg -ar 16000 -ac 1 output.wav
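You can catch this error before invoking whisper-cli by checking the WAV header with the standard-library wave module. The 16 kHz/mono/16-bit requirement matches the ffmpeg conversion above; the helper name is ours.

```python
import wave

def check_whisper_ready(path: str) -> bool:
    """True if the file is the 16 kHz, mono, 16-bit PCM WAV that
    whisper-cli expects; convert with ffmpeg first otherwise."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)
```

If this returns False, run the ffmpeg command above and retry rather than feeding the file to whisper-cli.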

"model not found"

  • Download model: bash models/download-ggml-model.sh tiny

Pocket-TTS Errors

"Connection refused"

  • Start TTS server: python3 -m app.main
  • Check URL: export POCKET_TTS_URL="http://localhost:5000"

"Voice not found"

  • List voices: curl http://localhost:5000/v1/voices
  • Clone custom voice if needed

Performance Issues

Slow transcription

  • Use smaller model: tiny instead of small
  • Reduce audio sample rate: ffmpeg -i input.wav -ar 16000 output.wav

Slow TTS

  • Use shorter text
  • Generate in background

Examples

See examples/ directory for:

  • morning-briefing.sh - Automated voice briefing
  • voice-reminder.sh - Voice-based reminders
  • conversation-mode.sh - Interactive voice chat

Performance

Model    RAM     Speed (1 min audio)   Accuracy
tiny     500MB   ~30 sec               ~90%
small    1GB     ~60 sec               ~95%
medium   2GB     ~120 sec              ~98%

Recommendation: Start with tiny, upgrade to small if needed.
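The recommendation above can be expressed as a small heuristic over the table's figures: pick the most accurate model that fits your RAM budget (and, optionally, an accuracy floor). The numbers are copied from the table; the selection rule itself is just an illustration.

```python
# (RAM in MB, seconds per minute of audio, approx. accuracy) from the table above.
MODELS = {
    "tiny":   (500,  30,  0.90),
    "small":  (1000, 60,  0.95),
    "medium": (2000, 120, 0.98),
}

def pick_model(ram_mb: int, min_accuracy: float = 0.0) -> str:
    """Return the most accurate Whisper model that fits in ram_mb
    and meets the accuracy floor."""
    fitting = [(acc, name) for name, (ram, _, acc) in MODELS.items()
               if ram <= ram_mb and acc >= min_accuracy]
    if not fitting:
        raise ValueError("no model fits the constraints")
    return max(fitting)[1]
```

On a machine with ~600 MB to spare this picks tiny, matching the "start with tiny" advice; with 2 GB free and an accuracy floor it upgrades automatically.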

License

MIT-0 License - See LICENSE file

Credits

  • Whisper.cpp by Georgi Gerganov (ggerganov/whisper.cpp)
  • Pocket-TTS by Kyutai Labs (kyutai-labs/pocket-tts)
  • OpenClaw by OpenClaw Team (openclaw/openclaw)

Support
