Install
openclaw skills install smallest-ai

Ultra-fast text-to-speech and speech-to-text via Smallest AI's Lightning v3.1 and Pulse models. Use when the user wants to generate speech, convert text to voice, read text aloud, create voice notes, transcribe audio to text, or clone a voice. TTS latency is under 100 ms; STT has a 64 ms TTFT. Supports 30+ languages, including Hindi and Spanish. Voices include sophia, robert, advika, vivaan, camilla, and 80+ more.
Text-to-speech (sub-100ms) via Lightning v3.1 and speech-to-text (64ms TTFT) via Pulse.
Set SMALLEST_API_KEY in your environment:
export SMALLEST_API_KEY="your_key_here"
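If you drive the scripts from your own Python tooling, a pre-flight check can fail early with a readable message when the key is missing. The following is a minimal sketch. It assumes only the variable name above, and the helper name get_api_key is hypothetical.

```python
import os


def get_api_key(env=None):
    """Return SMALLEST_API_KEY (hypothetical helper), or raise a clear error if unset."""
    env = os.environ if env is None else env
    key = env.get("SMALLEST_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            'SMALLEST_API_KEY is not set; run: export SMALLEST_API_KEY="your_key_here"'
        )
    return key
```

Passing a dict instead of os.environ makes the check easy to test in isolation.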
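| Setting | Default |
|---|---|
| defaultVoiceFemale | sophia (American English) |
| defaultVoiceMale | robert (American English) |
| defaultLanguage | en |
| defaultSpeed | 1.0 |
| defaultSampleRate | 24000 |

Follow these rules to select the voice: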
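- If the user asks for a male voice, use defaultVoiceMale.
- If the user asks for a female voice, use defaultVoiceFemale.
- If no preference is given, use defaultVoiceFemale (sophia by default).
- For Hindi content: advika (female) or vivaan (male).
- For Spanish content: camilla (female) or carlos (male).
- anitha (female) or raju (male).

Always pass the configured defaultLanguage, defaultSpeed, and defaultSampleRate as the --lang, --speed, and --rate flags unless the user overrides them.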
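The selection rules and default flags can be sketched as two small functions. This is a hypothetical illustration, not the skill's own code. The names DEFAULTS, LANGUAGE_VOICES, choose_voice, and default_flags are invented here. The language mapping assumes that advika/vivaan cover Hindi and camilla/carlos cover Spanish, as the voice table suggests. The anitha/raju rule is left out because its trigger condition is not stated.

```python
# Hypothetical config mirroring the documented defaults.
DEFAULTS = {
    "defaultVoiceFemale": "sophia",
    "defaultVoiceMale": "robert",
    "defaultLanguage": "en",
    "defaultSpeed": 1.0,
    "defaultSampleRate": 24000,
}

# Assumed language-to-voice mapping, inferred from the voice table.
LANGUAGE_VOICES = {
    "hi": {"female": "advika", "male": "vivaan"},
    "es": {"female": "camilla", "male": "carlos"},
}


def choose_voice(gender=None, lang=None, config=None):
    """Pick a voice id: language-specific voice first, then the gendered default."""
    config = config or DEFAULTS
    lang = lang or config["defaultLanguage"]
    # Anything other than an explicit "male" request falls back to the female default.
    gender = "male" if gender == "male" else "female"
    if lang in LANGUAGE_VOICES:
        return LANGUAGE_VOICES[lang][gender]
    return config["defaultVoiceMale"] if gender == "male" else config["defaultVoiceFemale"]


def default_flags(lang=None, speed=None, rate=None, config=None):
    """Build --lang/--speed/--rate, letting explicit user values override the config."""
    config = config or DEFAULTS
    return [
        "--lang", lang or config["defaultLanguage"],
        "--speed", str(config["defaultSpeed"] if speed is None else speed),
        "--rate", str(config["defaultSampleRate"] if rate is None else rate),
    ]
```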
Generate speech audio from text using the Lightning v3.1 model.
{baseDir}/scripts/tts.sh "Text to speak" --voice sophia --rate 24000 --speed 1.0 --lang en
Python version (requires pip install smallestai, or just requests):
python3 {baseDir}/scripts/tts.py "Text to speak" --voice sophia --speed 1.0 --lang en --out speech.wav
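When calling tts.sh from Python, you can pass the arguments as a list instead of a shell string. This avoids quoting problems with apostrophes or non-ASCII text such as Hindi. The sketch below is hypothetical. It assumes only the flags documented for tts.sh, and base_dir stands in for {baseDir}.

```python
import subprocess


def tts_argv(base_dir, text, voice="sophia", rate=24000, speed=1.0, lang="en", out=None):
    """Build the argument list for scripts/tts.sh from the documented flags."""
    argv = [
        f"{base_dir}/scripts/tts.sh", text,
        "--voice", voice,
        "--rate", str(rate),
        "--speed", str(speed),
        "--lang", lang,
    ]
    if out is not None:
        argv += ["--out", out]
    return argv


def run_tts(base_dir, text, **options):
    """Run tts.sh without a shell and return its stdout; raises on a non-zero exit."""
    result = subprocess.run(
        tts_argv(base_dir, text, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For example, `run_tts("/path/to/skill", "नमस्ते, कैसे हैं आप?", voice="advika", lang="hi")` reaches the script with the Hindi text intact, because no shell parses it.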
| Voice | Gender | Accent | Best For |
|---|---|---|---|
| sophia | Female | American | General use (default female) |
| robert | Male | American | Professional, reports (default male) |
| advika | Female | Indian | Hindi content, code-switch |
| vivaan | Male | Indian | Bilingual English/Hindi |
| camilla | Female | Mexican/Latin | Spanish content |
| zara | Female | American | Conversational |
| melody | Female | American | Storytelling, greetings |
| arjun | Male | Indian | English/Hindi bilingual |
| stella | Female | American | Expressive, warm |
80+ more voices available. List all with: {baseDir}/scripts/voices.sh
- --voice <id>: Voice identifier (default: sophia)
- --rate <hz>: Sample rate in Hz, one of 8000, 16000, 24000, or 44100 (default: 24000)
- --speed <n>: Playback speed, 0.5–2.0 (default: 1.0)
- --lang <code>: Language code (default: en). See {baseDir}/references/languages.md
- --out <path>: Output file (default: auto-named media/tts_<timestamp>.wav)

On success, the scripts print MEDIA: <filepath>, and OpenClaw sends that file as an audio attachment.
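A wrapper can check option values against the documented ranges before calling the script. It can also pull the output path from the MEDIA line. This is a minimal sketch that assumes the MEDIA line starts at the beginning of a line of stdout. The function names are hypothetical.

```python
ALLOWED_RATES = {8000, 16000, 24000, 44100}


def validate_tts_options(rate=24000, speed=1.0):
    """Raise ValueError if --rate or --speed falls outside the documented values."""
    if rate not in ALLOWED_RATES:
        raise ValueError(f"--rate must be one of {sorted(ALLOWED_RATES)}, got {rate}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError(f"--speed must be between 0.5 and 2.0, got {speed}")


def parse_media_path(stdout):
    """Return the path from the first 'MEDIA: <filepath>' line, or None if absent."""
    for line in stdout.splitlines():
        if line.startswith("MEDIA:"):
            return line[len("MEDIA:"):].strip()
    return None
```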
Supports 30+ languages. Pass --lang with an ISO language code:
{baseDir}/scripts/tts.sh "नमस्ते, कैसे हैं आप?" --voice advika --lang hi
{baseDir}/scripts/tts.sh "Bonjour le monde" --voice sophia --lang fr
{baseDir}/scripts/tts.sh "Hola, buenos días" --voice camilla --lang es
Code-switching (mixing languages) works automatically — no flag needed:
{baseDir}/scripts/tts.sh "Hey, मुझे meeting remind कर दो" --voice advika --lang hi
Transcribe audio files using the Pulse model. Supported formats: WAV, MP3, OGG, and FLAC.
{baseDir}/scripts/stt.sh /path/to/audio.wav
{baseDir}/scripts/stt.sh /path/to/audio.wav --diarize --timestamps --emotions
python3 {baseDir}/scripts/stt.py /path/to/audio.wav --diarize --timestamps --lang en
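A sketch of a Python helper that builds the stt.sh invocation shown above. It rejects files outside the supported formats before any upload happens. The check looks only at the file extension, which is an assumption, since the script may inspect the audio itself. stt_argv is a hypothetical name, and base_dir stands in for {baseDir}.

```python
from pathlib import Path

SUPPORTED_AUDIO = {".wav", ".mp3", ".ogg", ".flac"}


def stt_argv(base_dir, audio_path, lang="en", diarize=False, timestamps=False, emotions=False):
    """Build the argument list for scripts/stt.sh; checks the extension only."""
    suffix = Path(audio_path).suffix.lower()
    if suffix not in SUPPORTED_AUDIO:
        raise ValueError(f"unsupported audio format {suffix!r}; expected WAV, MP3, OGG or FLAC")
    argv = [f"{base_dir}/scripts/stt.sh", str(audio_path), "--lang", lang]
    if diarize:
        argv.append("--diarize")
    if timestamps:
        argv.append("--timestamps")
    if emotions:
        argv.append("--emotions")
    return argv
```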
- --lang <code>: Language (default: en)
- --diarize: Identify different speakers
- --timestamps: Word-level timing
- --emotions: Detect emotional tone

Returns JSON with a transcription field. With --diarize, the output also includes a speaker label for each word.
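Per-word speaker labels are often easier to read when grouped into speaker turns. The sketch below assumes a hypothetical JSON layout: a "words" list whose items carry "word" and "speaker" keys. Only the transcription field is documented here, so adjust the key names to the script's real output.

```python
import json


def speaker_turns(raw_json):
    """Group consecutive words with the same (assumed) "speaker" key into turns."""
    data = json.loads(raw_json)
    turns = []
    for item in data.get("words", []):
        speaker, word = item.get("speaker"), item["word"]
        if turns and turns[-1][0] == speaker:
            turns[-1][1].append(word)  # same speaker: extend the current turn
        else:
            turns.append((speaker, [word]))  # speaker changed: start a new turn
    return [(speaker, " ".join(words)) for speaker, words in turns]
```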
Trigger this skill when the user: