Voice messaging setup

v1.0.3

Full voice message setup (STT + TTS) for OpenClaw using faster-whisper and Edge TTS

by Dmitry Aksenkin (@aksenkin)
MIT-0
Security Scan

  • VirusTotal: Benign
  • OpenClaw: Benign (medium confidence)
Purpose & Capability
The name/description (STT + TTS using faster-whisper and Edge TTS) match the actions in SKILL.md: creating a venv, installing faster-whisper, creating a transcribe.py, and adding OpenClaw config entries for media.audio and messages.tts. Nothing requested or shown is unrelated to providing local transcription and TTS.
Instruction Scope
The instructions direct the agent to create files under ~/.openclaw/workspace/voice-messages, install packages into that venv, and modify ~/.openclaw/openclaw.json. These actions are expected for this purpose, but they do write to the user's home and update OpenClaw config — the user should review/backup that config before applying changes. The SKILL.md does not explicitly warn that model weights will be downloaded at runtime (faster-whisper/huggingface-hub), which is an important runtime behavior to be aware of.
Install Mechanism
No packaged install spec is present; the SKILL.md includes shell commands to create a Python venv and pip install faster-whisper. Using pip in an isolated venv is a reasonable install mechanism. The packages pulled (faster-whisper and its deps) come from PyPI/huggingface and are expected for transcription. There is no download from untrusted personal URLs or extract-from-URL steps in the manifest.
Credentials
The skill declares no environment variables or credentials, which is proportional. However, faster-whisper/huggingface-hub will perform network downloads of model artifacts (potentially large) and could prompt for HF auth if private models are used; the SKILL.md does not explicitly call this out. No unrelated secrets or config paths are requested.
Persistence & Privilege
The skill is instruction-only and not always-enabled; it does not request elevated privileges or modify other skills. It proposes editing the agent's openclaw.json configuration (its own runtime configuration), which is appropriate for enabling STT/TTS.
Assessment
This skill appears to do what it claims. Before running:

  1. Review and back up ~/.openclaw/openclaw.json (the instructions modify it).
  2. Expect pip to install large native packages (onnxruntime, ctranslate2; ffmpeg may also be needed) and faster-whisper to download model weights from the Hugging Face hub, with significant disk and network usage.
  3. Prefer running the install steps manually in a terminal so you can inspect output and resolve missing system packages.
  4. Confirm the TTS 'edge' provider behavior in your OpenClaw environment (some providers may still call external services).
  5. If you have security or bandwidth constraints, run this on an isolated machine or in a container.

Like a lobster shell, security has layers — review code before you run it.

latest · vk97a3kw1nntwqh15wvghqpakqh823tnb

License

MIT-0
Free to use, modify, and redistribute. No attribution required.

Runtime requirements

🎙️ Clawdis

SKILL.md

Voice Messages (STT + TTS) for OpenClaw 🎙️

Complete voice message setup using faster-whisper for transcription and Edge TTS for voice replies.

What we configure

  • STT (Speech-to-Text) — transcribe voice messages via faster-whisper
  • TTS (Text-to-Speech) — voice replies via Edge TTS
  • 🎯 Result: voice → text → reply with voice

Installation

1. Create virtual environment (venv)

On Ubuntu, create an isolated venv:

python3 -m venv ~/.openclaw/workspace/voice-messages

2. Install faster-whisper

Install the package into the venv:

~/.openclaw/workspace/voice-messages/bin/pip install faster-whisper

What gets installed:

  • faster-whisper — Python library for transcription
  • Dependencies: ctranslate2, onnxruntime, huggingface-hub, av, numpy, and others.
  • Size: ~250 MB

Transcription Script

Path and content

File: ~/.openclaw/workspace/voice-messages/transcribe.py

#!/usr/bin/env python3
import argparse
from faster_whisper import WhisperModel


def transcribe(audio_path: str, model_name: str = "small", lang: str = "en", device: str = "cpu") -> str:
    model = WhisperModel(
        model_name,
        device=device,
        compute_type="int8" if device == "cpu" else "float16",
    )
    segments, _ = model.transcribe(audio_path, language=lang, vad_filter=True)
    text = " ".join(seg.text.strip() for seg in segments if seg.text and seg.text.strip()).strip()
    return text


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--audio", required=True)
    p.add_argument("--model", default="small")
    p.add_argument("--lang", default="en")
    p.add_argument("--device", default="cpu", choices=["cpu", "cuda"])
    args = p.parse_args()

    text = transcribe(args.audio, args.model, args.lang, args.device)
    print(text if text else "")


if __name__ == "__main__":
    main()

What the script does:

  1. Accepts audio file path (--audio)
  2. Loads Whisper model (--model): small by default
  3. Sets language (--lang): en for English
  4. Transcribes with VAD filter (Voice Activity Detection)
  5. Outputs clean text to stdout

Make file executable:

chmod +x ~/.openclaw/workspace/voice-messages/transcribe.py
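The joining step in transcribe.py can be exercised without installing faster-whisper by standing in mock segments. The `Seg` class below is purely illustrative; real segments come from `model.transcribe()`.

```python
# Illustration only: the segment-joining expression from transcribe.py,
# run against mock segment objects instead of a real transcription.
class Seg:
    def __init__(self, text):
        self.text = text

segments = [Seg(" Hello."), Seg("  How are you?  "), Seg("   "), Seg(None)]

# Same expression the script uses: strip each segment, drop empty ones, join.
text = " ".join(s.text.strip() for s in segments if s.text and s.text.strip()).strip()
print(text)  # → Hello. How are you?
```

Empty and whitespace-only segments are filtered out, so the output stays clean even when VAD passes through silent chunks.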

OpenClaw Configuration

1. Configure STT (tools.media.audio)

Add to ~/.openclaw/openclaw.json:

{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio",
              "{{MediaPath}}",
              "--lang",
              "en",
              "--model",
              "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  }
}

Parameters:

| Parameter | Value | Description |
|---|---|---|
| `enabled` | `true` | Enable audio transcription |
| `maxBytes` | `20971520` | Max file size (20 MB) |
| `type` | `"cli"` | Model type: CLI command |
| `command` | Python path | Path to python in the venv |
| `args` | argument array | Arguments for the script |
| `{{MediaPath}}` | placeholder | Replaced with the audio file path |
| `timeoutSeconds` | `120` | Transcription timeout (2 minutes) |
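The CLI invocation can be sketched in Python. The substitution mechanics are an assumption for illustration (OpenClaw presumably expands `{{MediaPath}}` into the args before spawning the venv interpreter), and the media path is hypothetical.

```python
# Assumed behavior, not OpenClaw source: each arg has "{{MediaPath}}"
# replaced by the incoming audio file's path before the process starts.
args_template = [
    "~/.openclaw/workspace/voice-messages/transcribe.py",
    "--audio", "{{MediaPath}}",
    "--lang", "en",
    "--model", "small",
]
media_path = "/tmp/voice-note.ogg"  # hypothetical incoming file
argv = [a.replace("{{MediaPath}}", media_path) for a in args_template]
print(argv)

# maxBytes sanity check: 20971520 bytes is exactly 20 MiB.
assert 20971520 == 20 * 1024 * 1024
```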

2. Configure TTS (messages.tts)

Add to ~/.openclaw/openclaw.json:

{
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    }
  }
}

Parameters:

| Parameter | Value | Description |
|---|---|---|
| `auto` | `"inbound"` | Key mode! — reply with voice only on incoming voice messages |
| `provider` | `"edge"` | TTS provider (free, no API key) |
| `voice` | `"en-US-JennyNeural"` | Voice (see available voices below) |
| `lang` | `"en-US"` | Locale (en-US for US English) |

3. Full configuration example

{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio",
              "{{MediaPath}}",
              "--lang",
              "en",
              "--model",
              "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  },
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    },
    "ackReactionScope": "group-mentions"
  }
}
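Rather than hand-editing the config, the STT and TTS fragments can be folded into an existing one programmatically. This is a generic recursive-merge sketch, not an OpenClaw utility.

```python
def deep_merge(base: dict, patch: dict) -> dict:
    """Recursively merge patch into base; patch wins on leaf conflicts."""
    out = dict(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = value
    return out

existing = {"messages": {"ackReactionScope": "group-mentions"}}
tts_patch = {"messages": {"tts": {
    "auto": "inbound",
    "provider": "edge",
    "edge": {"voice": "en-US-JennyNeural", "lang": "en-US"},
}}}
merged = deep_merge(existing, tts_patch)
print(merged["messages"]["tts"]["provider"])  # → edge
```

Unrelated keys such as `ackReactionScope` survive the merge, which is the point of merging instead of overwriting.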

Apply Changes

Restart Gateway

# Method 1: via openclaw CLI
openclaw gateway restart

# Method 2: via systemd
systemctl --user restart openclaw-gateway

# Check status
systemctl --user status openclaw-gateway
# Should show: active (running)

Testing

Test STT (transcription)

Action: Send a voice message to your Telegram bot

Expected result:

[Audio] User text: [Telegram ...] <media:audio> Transcript: <transcribed text>

Example response:

[Audio] User text: [Telegram kd (@someuser) id:12345678 +5s ...] <media:audio> Transcript: Hello. How are you?

Test TTS (voice replies)

Action: After successful transcription, bot should send a voice reply

Expected result:

  • Voice file arrives in Telegram
  • Voice note (round bubble)

Expected behavior:

  • Incoming voice → bot replies with voice
  • Text messages → bot replies with text (this is normal!)

Available Edge TTS Voices

Female voices

| Voice | ID | Notes |
|---|---|---|
| Jenny | en-US-JennyNeural | ← current |
| Ana | en-US-AnaNeural | Softer |

Male voices

| Voice | ID | Notes |
|---|---|---|
| Roger | en-US-RogerNeural | More bass |

How to change voice:

jq '.messages.tts.edge.voice = "en-US-MichelleNeural"' \
  ~/.openclaw/openclaw.json > ~/.openclaw/openclaw.json.tmp
mv ~/.openclaw/openclaw.json.tmp ~/.openclaw/openclaw.json
systemctl --user restart openclaw-gateway
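An equivalent in Python, writing via a temp file in the same directory so the swap is atomic. The demo operates on a throwaway file; point `set_voice` at your real `~/.openclaw/openclaw.json` only after backing it up.

```python
import json
import os
import tempfile

def set_voice(cfg_path: str, voice: str) -> None:
    """Update messages.tts.edge.voice and replace the file atomically."""
    with open(cfg_path) as f:
        cfg = json.load(f)
    cfg.setdefault("messages", {}).setdefault("tts", {}).setdefault("edge", {})["voice"] = voice
    # Temp file in the same directory, then rename: atomic on POSIX filesystems.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(cfg_path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(cfg, f, indent=2)
    os.replace(tmp, cfg_path)

# Demo on a throwaway file, not your real config:
demo = os.path.join(tempfile.mkdtemp(), "openclaw.json")
with open(demo, "w") as f:
    json.dump({"messages": {"tts": {"edge": {"voice": "en-US-JennyNeural"}}}}, f)
set_voice(demo, "en-US-MichelleNeural")
with open(demo) as f:
    print(json.load(f)["messages"]["tts"]["edge"]["voice"])  # → en-US-MichelleNeural
```

Unlike the `jq` pipeline, a failed write here leaves the original config untouched.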

Additional Edge TTS Parameters

Adjusting speed, pitch, volume

{
  "messages": {
    "tts": {
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US",
        "rate": "+10%",
        "pitch": "-5%",
        "volume": "+5%"
      }
    }
  }
}

Supported ranges: rate -50% to +100%, pitch -50% to +50%, volume -100% to +100%. Note that JSON does not allow comments, so keep annotations out of the actual config file.
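A small validator for those signed percent strings, using the ranges above. This is a convenience sketch, not part of Edge TTS or OpenClaw.

```python
def pct_in_range(value: str, lo: int, hi: int) -> bool:
    """Check a signed percent string like '+10%' or '-5%' against [lo, hi]."""
    if not value.endswith("%"):
        return False
    try:
        pct = int(value[:-1])  # int() accepts a leading '+' or '-'
    except ValueError:
        return False
    return lo <= pct <= hi

assert pct_in_range("+10%", -50, 100)        # rate: OK
assert pct_in_range("-5%", -50, 50)          # pitch: OK
assert not pct_in_range("+150%", -100, 100)  # volume: out of range
```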

Troubleshooting

Problem: Voice not transcribed

Logs show:

[ERROR] Transcription failed

Possible causes:

  1. File too large — > 20 MB

    # Solution: Increase maxBytes in config
    maxBytes: 52428800  # 50 MB
    
  2. Timeout — transcription took > 2 minutes

    # Solution: Increase timeoutSeconds
    timeoutSeconds: 180  # 3 minutes
    
  3. Model not downloaded — first run

    # Solution: Wait while it downloads (1-2 minutes)
    # Models are cached in ~/.cache/huggingface/
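The size and timeout numbers in these fixes are straightforward to derive:

```python
# Deriving the troubleshooting values above.
def mb_to_bytes(mb: int) -> int:
    return mb * 1024 * 1024

print(mb_to_bytes(20))  # → 20971520  (the default maxBytes)
print(mb_to_bytes(50))  # → 52428800  (the suggested 50 MB limit)

minutes = 3
print(minutes * 60)     # → 180  (timeoutSeconds for a 3-minute limit)
```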
    

Problem: No voice reply

Possible causes:

  1. Reply too short (< 10 characters)

    • TTS skips very short replies
    • Solution: this is expected behavior
  2. auto: "inbound" but text message

    • TTS in inbound mode replies with voice only on voice messages
    • Text messages get text replies — this is correct!
  3. Edge TTS unavailable

    # Check
    curl -s "https://speech.platform.bing.com/consumer/api/v1/tts" | head -c 100
    # If error — temporarily unavailable
    

Performance

Transcription time (Raspberry Pi 4/ARM)

| Whisper Model | Est. time | Quality |
|---|---|---|
| tiny | ~5-10 sec | Low |
| base | ~10-20 sec | Medium |
| small | ~20-40 sec | High ← current |
| medium | ~40-80 sec | Very high |
| large | ~80-160 sec | Maximum |

Recommendation: For Raspberry Pi use small or base. medium/large will be very slow.

Where Whisper models are stored

~/.cache/huggingface/

Models download automatically on first run.

Done! 🎉

After completing these steps:

  1. ✅ faster-whisper installed in venv
  2. ✅ transcribe.py script created
  3. ✅ OpenClaw configured (STT + TTS)
  4. ✅ Gateway restarted
  5. ✅ Voice messages working

Now your Telegram bot:

  • 🎙️ Accepts voice → transcribes via faster-whisper
  • 🎤 Replies with voice → generates via Edge TTS
  • 💬 Accepts text → replies with text (as usual)


Created: 2026-03-01 for OpenClaw 2026.2.26
