Install
openclaw skills install local-voice-agent
Complete offline voice-to-voice AI assistant for OpenClaw (Whisper.cpp STT + Pocket-TTS). All processing runs locally, with no cloud APIs and no usage costs. Use it for hands-free operation, voice commands, accessibility, or custom voice cloning.
User Voice → Whisper STT → Text → OpenClaw AI → Text → Pocket-TTS → Voice Response
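The flow above is three stages applied in order. The Python sketch below models each stage as a plain callable to make the data passed between them explicit. The names `run_pipeline`, `transcribe`, `respond`, and `synthesize` are hypothetical and are not part of the skill; the stand-in stages do not call Whisper, OpenClaw, or Pocket-TTS.

```python
from typing import Callable


def run_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    respond: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Chain the three stages: voice -> text -> reply text -> voice."""
    text = transcribe(audio)   # speech-to-text stage (Whisper in this skill)
    reply = respond(text)      # assistant stage (OpenClaw in this skill)
    return synthesize(reply)   # text-to-speech stage (Pocket-TTS in this skill)


# Stand-in stages so the sketch runs without any of the real tools.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


result = run_pipeline(b"What time is it?", fake_stt, fake_agent, fake_tts)
print(result)
```

Keeping the stages as separate callables means any one of them can be swapped or tested on its own.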
# Clone and build
git clone https://github.com/ggerganov/whisper.cpp ~/.local/whisper.cpp
cd ~/.local/whisper.cpp
make -j4
# Download tiny model (fast, low-resource)
bash ./models/download-ggml-model.sh tiny
Test:
./build/bin/whisper-cli -m models/ggml-tiny.bin -f samples/jfk.wav
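To run the same test from a script, the command can be assembled as an argument list. This is a minimal sketch: the helper name `whisper_command` is hypothetical, the install root defaults to the clone path used above, and the `ggml-<model>.bin` file name pattern is inferred from `ggml-tiny.bin`. Only the `-m` and `-f` flags shown above are used.

```python
import os
from typing import List


def whisper_command(
    audio_file: str,
    model: str = "tiny",
    root: str = "~/.local/whisper.cpp",
) -> List[str]:
    """Build the whisper-cli argument list for one audio file."""
    root = os.path.expanduser(root)
    return [
        f"{root}/build/bin/whisper-cli",
        "-m", f"{root}/models/ggml-{model}.bin",  # assumed naming pattern
        "-f", audio_file,
    ]


print(whisper_command("samples/jfk.wav", root="/opt/whisper.cpp"))
```

The list can be passed to `subprocess.run(..., check=True)` once whisper.cpp is built and the model is downloaded.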
Option A: Use existing server
export POCKET_TTS_URL="http://localhost:5000"
Option B: Install locally
# Clone your Pocket-TTS server
cd /path/to/pockettts
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python3 -m app.main --host 0.0.0.0 --port 5000
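With either option, client code needs to know where the server is. The sketch below reads `POCKET_TTS_URL` and falls back to the address used in these examples. The helper names are hypothetical, and treating `http://localhost:5000` as the fallback is an assumption rather than documented behaviour of the skill.

```python
import os
from typing import Mapping, Optional

# Assumed fallback, taken from the examples in this document.
DEFAULT_TTS_URL = "http://localhost:5000"


def tts_base_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Pocket-TTS base URL without a trailing slash."""
    env = os.environ if env is None else env
    url = env.get("POCKET_TTS_URL", "").strip() or DEFAULT_TTS_URL
    return url.rstrip("/")


def voices_endpoint(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the URL of the voice listing endpoint (/v1/voices)."""
    return tts_base_url(env) + "/v1/voices"


print(voices_endpoint({"POCKET_TTS_URL": "http://192.168.1.10:5000/"}))
```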
sudo apt-get install -y ffmpeg
# Record → Transcribe → Process → Speak
./bin/voice-agent "What's the weather today?"
# Continuous voice conversation
./bin/voice-agent --interactive
# Transcribe existing audio file
./bin/voice-to-text recording.wav
# Generate voice from text
./bin/text-to-voice "Hello world!" output.wav
Edit config/voices.yaml:
# Default voices
stt:
  model: tiny  # tiny, small, medium (larger = more accurate, slower)
  language: en  # en, ne, hi, etc.
tts:
  url: http://localhost:5000
  voice: peter voice  # Your custom voice
  format: wav  # wav, mp3
# Performance
performance:
  threads: 4  # CPU threads for Whisper
  realtime: true  # Faster-than-realtime processing
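A typo in this file is easy to miss, so a small check before starting the agent can help. The sketch below assumes the YAML has already been loaded into a dictionary by a YAML parser of your choice. The function name `config_problems` is hypothetical, and the accepted values are only the ones listed in the comments above, which may not be the full set the skill supports.

```python
from typing import Any, Dict, List

KNOWN_MODELS = {"tiny", "small", "medium"}  # from the stt.model comment
KNOWN_FORMATS = {"wav", "mp3"}              # from the tts.format comment


def config_problems(cfg: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in a loaded voices.yaml."""
    problems = []
    model = cfg.get("stt", {}).get("model")
    if model not in KNOWN_MODELS:
        problems.append(f"stt.model: unknown model {model!r}")
    fmt = cfg.get("tts", {}).get("format")
    if fmt not in KNOWN_FORMATS:
        problems.append(f"tts.format: unknown format {fmt!r}")
    threads = cfg.get("performance", {}).get("threads")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(threads, bool) or not isinstance(threads, int) or threads < 1:
        problems.append(f"performance.threads: expected a positive integer, got {threads!r}")
    return problems


example = {
    "stt": {"model": "tiny", "language": "en"},
    "tts": {"url": "http://localhost:5000", "voice": "peter voice", "format": "wav"},
    "performance": {"threads": 4, "realtime": True},
}
print(config_problems(example))
```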
Voice command processing:
curl -X POST "http://localhost:5000/v1/voice/command" \
-F "audio=@recording.wav" \
-F "action=openclaw"
Response:
{
  "transcription": "What's the weather today?",
  "response_text": "The weather in Kathmandu is partly cloudy, 22 degrees Celsius.",
  "audio_response": "/tmp/response.wav"
}
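A client will usually want these three fields as typed values rather than raw JSON. The sketch below parses a response of the shape shown above. The class and function names are hypothetical, and it assumes the three fields shown are always present, which the example alone does not guarantee.

```python
import json
from dataclasses import dataclass


@dataclass
class VoiceCommandResult:
    transcription: str
    response_text: str
    audio_response: str  # path to the generated audio file


def parse_voice_response(raw: str) -> VoiceCommandResult:
    """Parse the JSON body returned by the voice command endpoint."""
    data = json.loads(raw)
    required = ("transcription", "response_text", "audio_response")
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError("response is missing fields: " + ", ".join(missing))
    return VoiceCommandResult(*(data[key] for key in required))


sample = """{
  "transcription": "What's the weather today?",
  "response_text": "The weather in Kathmandu is partly cloudy, 22 degrees Celsius.",
  "audio_response": "/tmp/response.wav"
}"""
parsed = parse_voice_response(sample)
print(parsed.transcription)
```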
List available TTS voices:
curl http://localhost:5000/v1/voices
./bin/voice-agent "Give me my morning briefing"
./bin/voice-agent "Remind me to call Peter at 3 PM"
./bin/voice-agent "Show me the status of my git repository"
Well suited to users who prefer voice interaction or have mobility constraints.
Convert speech to text:
./bin/voice-to-text input.wav
./bin/voice-to-text input.ogg # Auto-converts with ffmpeg
./bin/voice-to-text input.mp4 # Extracts audio from video
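How `voice-to-text` decides when to convert is not documented here. As an illustration only, the sketch below checks whether a file is already a 16 kHz mono WAV and, if not, builds an ffmpeg command with the same `-ar 16000 -ac 1` options that appear elsewhere in this document. Both function names are hypothetical.

```python
import wave
from typing import List

TARGET_RATE = 16000  # Hz, matching the -ar 16000 option used in this document


def is_16k_mono_wav(path: str) -> bool:
    """Return True if path is a WAV file that is already 16 kHz mono."""
    try:
        with wave.open(path, "rb") as wav:
            return wav.getframerate() == TARGET_RATE and wav.getnchannels() == 1
    except (wave.Error, EOFError):
        return False  # not a readable WAV file, so it needs converting


def ffmpeg_convert_command(src: str, dst: str) -> List[str]:
    """Build the ffmpeg call that resamples src to 16 kHz mono."""
    return ["ffmpeg", "-i", src, "-ar", str(TARGET_RATE), "-ac", "1", dst]


print(ffmpeg_convert_command("input.ogg", "output.wav"))
```

The check reads only the WAV header, so it is cheap to run before every transcription.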
Convert text to speech:
./bin/text-to-voice "Hello world!" output.wav
./bin/text-to-voice --voice "usha lama" "Namaste!" greeting.wav
Full voice pipeline:
./bin/voice-agent "What time is it?"
./bin/voice-agent --interactive # Conversation mode
./bin/voice-agent --file recording.wav # Process file
"failed to read audio file"
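Convert the file to 16 kHz mono WAV:
ffmpeg -i input.ogg -ar 16000 -ac 1 output.wav
"model not found"
Download the model:
bash models/download-ggml-model.sh tiny
"Connection refused"
Start the Pocket-TTS server and check the URL:
python3 -m app.main
export POCKET_TTS_URL="http://localhost:5000"
"Voice not found"
List the available voices:
curl http://localhost:5000/v1/voices
Slow transcription
Use tiny instead of small, or resample the audio to 16 kHz:
ffmpeg -i input.wav -ar 16000 output.wav
Slow TTS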
See the examples/ directory for:
morning-briefing.sh - Automated voice briefing
voice-reminder.sh - Voice-based reminders
conversation-mode.sh - Interactive voice chat

| Model | RAM | Speed (1 min audio) | Accuracy |
|---|---|---|---|
| tiny | 500MB | ~30 sec | ~90% |
| small | 1GB | ~60 sec | ~95% |
| medium | 2GB | ~120 sec | ~98% |
Recommendation: Start with tiny, and upgrade to small if transcription accuracy is not good enough.
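The table can also be used programmatically, for example to pick a model for a given machine or to estimate how long a recording will take. The sketch below encodes the approximate figures from the table. The function names are hypothetical, 1 GB is treated as 1000 MB, and real memory use and speed will vary with hardware.

```python
from typing import Optional

# (model, approx. RAM in MB, approx. seconds to transcribe 1 minute of audio)
# Figures copied from the table above; 1 GB is treated as 1000 MB.
MODELS = [
    ("tiny", 500, 30),
    ("small", 1000, 60),
    ("medium", 2000, 120),
]


def largest_model_that_fits(free_ram_mb: int) -> Optional[str]:
    """Return the largest listed model whose RAM figure fits, or None."""
    fitting = [name for name, ram_mb, _ in MODELS if ram_mb <= free_ram_mb]
    return fitting[-1] if fitting else None


def estimated_seconds(model: str, audio_minutes: float) -> float:
    """Estimate transcription time by scaling the per-minute figure."""
    for name, _, seconds_per_minute in MODELS:
        if name == model:
            return seconds_per_minute * audio_minutes
    raise ValueError(f"unknown model: {model}")


print(largest_model_that_fits(1500), estimated_seconds("small", 2.5))
```

Picking the largest model that fits favours accuracy; on a slow CPU the recommendation above to start with tiny may still be the better choice.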
MIT License - See LICENSE file