Voice messaging setup

v1.0.3

Full voice message setup (STT + TTS) for OpenClaw using faster-whisper and Edge TTS

by Dmitry Aksenkin (@aksenkin)
MIT-0
Security Scan

  • VirusTotal: Benign
  • OpenClaw: Benign (medium confidence)
Purpose & Capability
The name/description (STT + TTS using faster-whisper and Edge TTS) match the actions in SKILL.md: creating a venv, installing faster-whisper, creating a transcribe.py, and adding OpenClaw config entries for media.audio and messages.tts. Nothing requested or shown is unrelated to providing local transcription and TTS.
Instruction Scope
The instructions direct the agent to create files under ~/.openclaw/workspace/voice-messages, install packages into that venv, and modify ~/.openclaw/openclaw.json. These actions are expected for this purpose, but they do write to the user's home and update OpenClaw config — the user should review/backup that config before applying changes. The SKILL.md does not explicitly warn that model weights will be downloaded at runtime (faster-whisper/huggingface-hub), which is an important runtime behavior to be aware of.
Install Mechanism
No packaged install spec is present; the SKILL.md includes shell commands to create a Python venv and pip install faster-whisper. Using pip in an isolated venv is a reasonable install mechanism. The packages pulled (faster-whisper and its deps) come from PyPI/huggingface and are expected for transcription. There is no download from untrusted personal URLs or extract-from-URL steps in the manifest.
Credentials
The skill declares no environment variables or credentials, which is proportional. However, faster-whisper/huggingface-hub will perform network downloads of model artifacts (potentially large) and could prompt for HF auth if private models are used; the SKILL.md does not explicitly call this out. No unrelated secrets or config paths are requested.
Persistence & Privilege
The skill is instruction-only and not always-enabled; it does not request elevated privileges or modify other skills. It proposes editing the agent's openclaw.json configuration (its own runtime configuration), which is appropriate for enabling STT/TTS.
Assessment
This skill appears to do what it claims. Before running:

  1. Review and back up ~/.openclaw/openclaw.json (the instructions modify it).
  2. Expect pip to install large native packages (onnxruntime, ctranslate2; ffmpeg may also be needed) and faster-whisper to download model weights from the Hugging Face hub, with significant disk and network usage.
  3. Prefer running the install steps manually in a terminal so you can inspect output and resolve missing system packages.
  4. Confirm the TTS 'edge' provider behavior in your OpenClaw environment (some providers may still call external services).
  5. If you have security or bandwidth constraints, run this on an isolated machine or in a container.

Like a lobster shell, security has layers — review code before you run it.

latest · vk97a3kw1nntwqh15wvghqpakqh823tnb

License

MIT-0
Free to use, modify, and redistribute. No attribution required.

Runtime requirements

🎙️ Clawdis

SKILL.md

Voice Messages (STT + TTS) for OpenClaw 🎙️

Complete voice message setup using faster-whisper for transcription and Edge TTS for voice replies.

What we configure

  • STT (Speech-to-Text) — transcribe voice messages via faster-whisper
  • TTS (Text-to-Speech) — voice replies via Edge TTS
  • 🎯 Result: voice → text → reply with voice

Installation

1. Create virtual environment (venv)

On Ubuntu, create an isolated venv:

python3 -m venv ~/.openclaw/workspace/voice-messages

2. Install faster-whisper

Install the package into the venv:

~/.openclaw/workspace/voice-messages/bin/pip install faster-whisper

What gets installed:

  • faster-whisper — Python library for transcription
  • Dependencies: ctranslate2, onnxruntime, huggingface-hub, av, numpy, and others.
  • Size: ~250 MB

Transcription Script

Path and content

File: ~/.openclaw/workspace/voice-messages/transcribe.py

#!/usr/bin/env python3
import argparse
from faster_whisper import WhisperModel


def transcribe(audio_path: str, model_name: str = "small", lang: str = "en", device: str = "cpu") -> str:
    model = WhisperModel(
        model_name,
        device=device,
        compute_type="int8" if device == "cpu" else "float16",
    )
    segments, _ = model.transcribe(audio_path, language=lang, vad_filter=True)
    text = " ".join(seg.text.strip() for seg in segments if seg.text and seg.text.strip()).strip()
    return text


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--audio", required=True)
    p.add_argument("--model", default="small")
    p.add_argument("--lang", default="en")
    p.add_argument("--device", default="cpu", choices=["cpu", "cuda"])
    args = p.parse_args()

    text = transcribe(args.audio, args.model, args.lang, args.device)
    print(text if text else "")


if __name__ == "__main__":
    main()

What the script does:

  1. Accepts audio file path (--audio)
  2. Loads Whisper model (--model): small by default
  3. Sets language (--lang): en for English
  4. Transcribes with VAD filter (Voice Activity Detection)
  5. Outputs clean text to stdout

Make file executable:

chmod +x ~/.openclaw/workspace/voice-messages/transcribe.py
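The joining step in transcribe.py can be exercised without installing faster-whisper by standing in mock segments. The `Seg` class below is purely illustrative; real segments come from `model.transcribe()`.

```python
# Illustration only: the segment-joining expression from transcribe.py,
# run against mock segment objects instead of a real transcription.
class Seg:
    def __init__(self, text):
        self.text = text

segments = [Seg(" Hello."), Seg("  How are you?  "), Seg("   "), Seg(None)]

# Same expression the script uses: strip each segment, drop empty ones, join.
text = " ".join(s.text.strip() for s in segments if s.text and s.text.strip()).strip()
print(text)  # → Hello. How are you?
```

Empty and whitespace-only segments are filtered out, so the output stays clean even when VAD passes through silent chunks.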

OpenClaw Configuration

1. Configure STT (tools.media.audio)

Add to ~/.openclaw/openclaw.json:

{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio",
              "{{MediaPath}}",
              "--lang",
              "en",
              "--model",
              "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  }
}

Parameters:

| Parameter | Value | Description |
|---|---|---|
| `enabled` | `true` | Enable audio transcription |
| `maxBytes` | `20971520` | Max file size (20 MB) |
| `type` | `"cli"` | Model type: CLI command |
| `command` | Python path | Path to python in the venv |
| `args` | argument array | Arguments for the script |
| `{{MediaPath}}` | placeholder | Replaced with the audio file path |
| `timeoutSeconds` | `120` | Transcription timeout (2 minutes) |
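The CLI invocation can be sketched in Python. The substitution mechanics are an assumption for illustration (OpenClaw presumably expands `{{MediaPath}}` into the args before spawning the venv interpreter), and the media path is hypothetical.

```python
# Assumed behavior, not OpenClaw source: each arg has "{{MediaPath}}"
# replaced by the incoming audio file's path before the process starts.
args_template = [
    "~/.openclaw/workspace/voice-messages/transcribe.py",
    "--audio", "{{MediaPath}}",
    "--lang", "en",
    "--model", "small",
]
media_path = "/tmp/voice-note.ogg"  # hypothetical incoming file
argv = [a.replace("{{MediaPath}}", media_path) for a in args_template]
print(argv)

# maxBytes sanity check: 20971520 bytes is exactly 20 MiB.
assert 20971520 == 20 * 1024 * 1024
```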

2. Configure TTS (messages.tts)

Add to ~/.openclaw/openclaw.json:

{
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    }
  }
}

Parameters:

| Parameter | Value | Description |
|---|---|---|
| `auto` | `"inbound"` | Key mode! — reply with voice only on incoming voice messages |
| `provider` | `"edge"` | TTS provider (free, no API key) |
| `voice` | `"en-US-JennyNeural"` | Voice (see available voices below) |
| `lang` | `"en-US"` | Locale (en-US for US English) |

3. Full configuration example

{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio",
              "{{MediaPath}}",
              "--lang",
              "en",
              "--model",
              "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  },
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    },
    "ackReactionScope": "group-mentions"
  }
}
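Rather than hand-editing the config, the STT and TTS fragments can be folded into an existing one programmatically. This is a generic recursive-merge sketch, not an OpenClaw utility.

```python
def deep_merge(base: dict, patch: dict) -> dict:
    """Recursively merge patch into base; patch wins on leaf conflicts."""
    out = dict(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = value
    return out

existing = {"messages": {"ackReactionScope": "group-mentions"}}
tts_patch = {"messages": {"tts": {
    "auto": "inbound",
    "provider": "edge",
    "edge": {"voice": "en-US-JennyNeural", "lang": "en-US"},
}}}
merged = deep_merge(existing, tts_patch)
print(merged["messages"]["tts"]["provider"])  # → edge
```

Unrelated keys such as `ackReactionScope` survive the merge, which is the point of merging instead of overwriting.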

Apply Changes

Restart Gateway

# Method 1: via openclaw CLI
openclaw gateway restart

# Method 2: via systemd
systemctl --user restart openclaw-gateway

# Check status
systemctl --user status openclaw-gateway
# Should show: active (running)

Testing

Test STT (transcription)

Action: Send a voice message to your Telegram bot

Expected result:

[Audio] User text: [Telegram ...] <media:audio> Transcript: <transcribed text>

Example response:

[Audio] User text: [Telegram kd (@someuser) id:12345678 +5s ...] <media:audio> Transcript: Hello. How are you?

Test TTS (voice replies)

Action: After successful transcription, bot should send a voice reply

Expected result:

  • Voice file arrives in Telegram
  • Voice note (round bubble)

Expected behavior:

  • Incoming voice → bot replies with voice
  • Text messages → bot replies with text (this is normal!)

Available Edge TTS Voices

Female voices

| Voice | ID | Notes |
|---|---|---|
| Jenny | en-US-JennyNeural | ← current |
| Ana | en-US-AnaNeural | Softer |

Male voices

| Voice | ID | Notes |
|---|---|---|
| Roger | en-US-RogerNeural | More bass |

How to change voice:

jq '.messages.tts.edge.voice = "en-US-MichelleNeural"' \
  ~/.openclaw/openclaw.json > ~/.openclaw/openclaw.json.tmp
mv ~/.openclaw/openclaw.json.tmp ~/.openclaw/openclaw.json
systemctl --user restart openclaw-gateway
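An equivalent in Python, writing via a temp file in the same directory so the swap is atomic. The demo operates on a throwaway file; point `set_voice` at your real `~/.openclaw/openclaw.json` only after backing it up.

```python
import json
import os
import tempfile

def set_voice(cfg_path: str, voice: str) -> None:
    """Update messages.tts.edge.voice and replace the file atomically."""
    with open(cfg_path) as f:
        cfg = json.load(f)
    cfg.setdefault("messages", {}).setdefault("tts", {}).setdefault("edge", {})["voice"] = voice
    # Temp file in the same directory, then rename: atomic on POSIX filesystems.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(cfg_path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(cfg, f, indent=2)
    os.replace(tmp, cfg_path)

# Demo on a throwaway file, not your real config:
demo = os.path.join(tempfile.mkdtemp(), "openclaw.json")
with open(demo, "w") as f:
    json.dump({"messages": {"tts": {"edge": {"voice": "en-US-JennyNeural"}}}}, f)
set_voice(demo, "en-US-MichelleNeural")
with open(demo) as f:
    print(json.load(f)["messages"]["tts"]["edge"]["voice"])  # → en-US-MichelleNeural
```

Unlike the `jq` pipeline, a failed write here leaves the original config untouched.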

Additional Edge TTS Parameters

Adjusting speed, pitch, volume

{
  "messages": {
    "tts": {
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US",
        "rate": "+10%",
        "pitch": "-5%",
        "volume": "+5%"
      }
    }
  }
}

Supported ranges: rate -50% to +100%, pitch -50% to +50%, volume -100% to +100%. Note that JSON does not allow comments, so keep annotations out of the actual config file.
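A small validator for those signed percent strings, using the ranges above. This is a convenience sketch, not part of Edge TTS or OpenClaw.

```python
def pct_in_range(value: str, lo: int, hi: int) -> bool:
    """Check a signed percent string like '+10%' or '-5%' against [lo, hi]."""
    if not value.endswith("%"):
        return False
    try:
        pct = int(value[:-1])  # int() accepts a leading '+' or '-'
    except ValueError:
        return False
    return lo <= pct <= hi

assert pct_in_range("+10%", -50, 100)        # rate: OK
assert pct_in_range("-5%", -50, 50)          # pitch: OK
assert not pct_in_range("+150%", -100, 100)  # volume: out of range
```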

Troubleshooting

Problem: Voice not transcribed

Logs show:

[ERROR] Transcription failed

Possible causes:

  1. File too large — > 20 MB

    # Solution: Increase maxBytes in config
    maxBytes: 52428800  # 50 MB
    
  2. Timeout — transcription took > 2 minutes

    # Solution: Increase timeoutSeconds
    timeoutSeconds: 180  # 3 minutes
    
  3. Model not downloaded — first run

    # Solution: Wait while it downloads (1-2 minutes)
    # Models are cached in ~/.cache/huggingface/
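The size and timeout numbers in these fixes are straightforward to derive:

```python
# Deriving the troubleshooting values above.
def mb_to_bytes(mb: int) -> int:
    return mb * 1024 * 1024

print(mb_to_bytes(20))  # → 20971520  (the default maxBytes)
print(mb_to_bytes(50))  # → 52428800  (the suggested 50 MB limit)

minutes = 3
print(minutes * 60)     # → 180  (timeoutSeconds for a 3-minute limit)
```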
    

Problem: No voice reply

Possible causes:

  1. Reply too short (< 10 characters)

    • TTS skips very short replies
    • Solution: this is expected behavior
  2. auto: "inbound" but text message

    • TTS in inbound mode replies with voice only on voice messages
    • Text messages get text replies — this is correct!
  3. Edge TTS unavailable

    # Check
    curl -s "https://speech.platform.bing.com/consumer/api/v1/tts" | head -c 100
    # If error — temporarily unavailable
    

Performance

Transcription time (Raspberry Pi 4/ARM)

| Whisper Model | Est. time | Quality |
|---|---|---|
| tiny | ~5-10 sec | Low |
| base | ~10-20 sec | Medium |
| small | ~20-40 sec | High ← current |
| medium | ~40-80 sec | Very high |
| large | ~80-160 sec | Maximum |

Recommendation: For Raspberry Pi use small or base. medium/large will be very slow.

Where Whisper models are stored

~/.cache/huggingface/

Models download automatically on first run.

Done! 🎉

After completing these steps:

  1. ✅ faster-whisper installed in venv
  2. ✅ transcribe.py script created
  3. ✅ OpenClaw configured (STT + TTS)
  4. ✅ Gateway restarted
  5. ✅ Voice messages working

Now your Telegram bot:

  • 🎙️ Accepts voice → transcribes via faster-whisper
  • 🎤 Replies with voice → generates via Edge TTS
  • 💬 Accepts text → replies with text (as usual)


Created: 2026-03-01 for OpenClaw 2026.2.26
