Install
openclaw skills install google-gemini-tts
Generate spoken audio from text using Google's Gemini TTS models. The default is Gemini 3.1 Flash TTS Preview; the script still supports the Gemini 2.5 Flash and Pro preview TTS models when you pass -m. Use it when an agent needs to convert text to speech, produce voice replies, narrate briefings or newsletters, create podcast-style two-speaker conversations, generate audio with expressive style control (whispers, pauses, accents, emotion), or output WAV files for voice-enabled workflows. It supports 30 prebuilt voices, 70+ languages, single and multi-speaker modes, and natural-language style prompts. It requires a GEMINI_API_KEY from Google AI Studio (the script also accepts GOOGLE_API_KEY as an alternative name for the same key).
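Because the key can arrive under either variable name, a caller may want a quick preflight check before invoking the script. The following is a minimal sketch; `have_api_key` is a hypothetical helper name and not part of the skill, and it only tests for presence without saying which variable the script itself prefers.

```shell
# Hypothetical helper: succeeds when either supported variable
# holds a non-empty key. It does not validate the key with Google.
have_api_key() {
  [ -n "${GEMINI_API_KEY:-${GOOGLE_API_KEY:-}}" ]
}

# Typical use before calling the wrapper:
#   have_api_key || { echo "Set GEMINI_API_KEY first" >&2; exit 1; }
```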
scripts/gemini_tts.sh: CLI wrapper around the Gemini REST API
# Show all options
scripts/gemini_tts.sh --help
# Single speaker, default voice (Kore)
scripts/gemini_tts.sh "Hello, welcome to the show!"
# Pick a voice
scripts/gemini_tts.sh -v Puck "This is Puck speaking."
# With style control
scripts/gemini_tts.sh -s "Say in a warm, calm tone:" "Take a deep breath."
# Save to a specific file
scripts/gemini_tts.sh -o /tmp/greeting.wav "Hey there!"
# Multi-speaker conversation
scripts/gemini_tts.sh --multi "Host:Kore,Guest:Puck" \
"Host: Welcome to the podcast! Guest: Thanks for having me."
The script prints the output WAV file path.
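Since the path is printed, a calling script can capture it for a later step such as playback or upload. This is a sketch under two assumptions: that the path is the last line the script writes to stdout, and that the script exits non-zero on failure. `TTS_CMD` and `tts_path` are hypothetical names introduced here for illustration.

```shell
# Hypothetical override point so the wrapper can be swapped out.
TTS_CMD="${TTS_CMD:-scripts/gemini_tts.sh}"

# Run the wrapper with the given arguments and echo the WAV path.
# Assumption: the path is the last line written to stdout.
tts_path() {
  out=$("$TTS_CMD" "$@") || return 1
  printf '%s\n' "$out" | tail -n 1
}

# Example (needs a valid API key, so it is left commented out):
#   wav=$(tts_path -v Puck "This is Puck speaking.") && echo "saved to $wav"
```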
| Model | Best for |
|---|---|
| gemini-3.1-flash-tts-preview (default) | Best default now: low-latency, natural output, expressive narration |
| gemini-2.5-flash-preview-tts | Backward-compatible fast preview model |
| gemini-2.5-pro-preview-tts | Long-form narration and higher-end creative work |
Current note: Gemini 3.1 Flash TTS Preview is live and should be the default path for this skill. Gemini 2.5 preview TTS models remain useful as compatibility fallbacks.
Preview model note:
gemini-3.1-flash-tts-preview is a preview model. If Google renames or retires it, pass -m gemini-2.5-flash-preview-tts as a fallback, or check the current model list.
Switch model examples:
scripts/gemini_tts.sh -m gemini-2.5-pro-preview-tts "Your text here"
scripts/gemini_tts.sh -m gemini-2.5-flash-preview-tts "Your text here"
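The fallback advice can be automated by trying the models in order. The sketch below assumes the script exits non-zero when a model is unavailable, which is not confirmed here; `TTS_CMD` and `tts_with_fallback` are hypothetical names.

```shell
# Hypothetical override point so the wrapper can be swapped out.
TTS_CMD="${TTS_CMD:-scripts/gemini_tts.sh}"

# Try each model in order and stop at the first one that succeeds.
# Assumption: the wrapper signals failure with a non-zero exit status.
tts_with_fallback() {
  for model in gemini-3.1-flash-tts-preview gemini-2.5-flash-preview-tts; do
    if "$TTS_CMD" -m "$model" "$@"; then
      return 0
    fi
    echo "model $model failed, trying next" >&2
  done
  return 1
}

# Example (commented out because it calls the real API):
#   tts_with_fallback -o /tmp/brief.wav "Your text here"
```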
Available prebuilt voices:
Zephyr, Puck, Charon, Kore, Fenrir, Leda, Orus, Aoede, Callirrhoe, Autonoe, Enceladus, Iapetus, Umbriel, Algieba, Despina, Erinome, Gacrux, Pulcherrima, Achird, Zubenelgenubi, Vindemiatrix, Sadachbia, Sadaltager, Sulafat, Laomedeia, Achernar, Schedar, Rasalgethi, Nashira, Enif
The same 30-voice library is shared between gemini-3.1-flash-tts-preview and the gemini-2.5-flash-preview-tts / gemini-2.5-pro-preview-tts fallbacks, so a voice you pick for the default model will still work if you drop back to a fallback via -m.
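A typo in a voice name is easy to catch locally before spending an API call. This sketch copies the voice list above into a variable and checks membership; `is_known_voice` is a hypothetical helper, and treating names as case-sensitive is an assumption.

```shell
# Voice names copied from the list above.
VOICES="Zephyr Puck Charon Kore Fenrir Leda Orus Aoede Callirrhoe Autonoe \
Enceladus Iapetus Umbriel Algieba Despina Erinome Gacrux Pulcherrima Achird \
Zubenelgenubi Vindemiatrix Sadachbia Sadaltager Sulafat Laomedeia Achernar \
Schedar Rasalgethi Nashira Enif"

# Succeeds only when $1 exactly matches one of the listed names.
# Assumption: names are case-sensitive, as capitalized above.
is_known_voice() {
  case " $VOICES " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
#   is_known_voice "$voice" || echo "unknown voice: $voice" >&2
```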
Gemini 3.1 Flash TTS reads plain transcripts naturally, but gives you two complementary ways to steer the delivery when you want more control.
Drop bracketed directions into the transcript. They modify what follows, can appear anywhere, and can stack or repeat across a single script:
[excitedly] Massive update today — [whispers] but keep it between us. [laughs]
Tags are open-ended; anything in [ ] is treated as a direction to the model. A useful starting set:
[excitedly], [bored], [reluctantly], [amazed], [curious], [mischievously], [panicked], [sarcastic], [serious], [tired], [trembling]
[very fast], [very slowly], [asmr], [deep and loud shouting], [whispers]
[gasp], [giggles], [sighs], [snorts], [cough], [laughs], [crying]
[like dracula], [like a dog], [singing], [sarcastically, one painfully slow word at a time]
For longer pieces where you want a consistent persona, prepend an AUDIO PROFILE / SCENE / DIRECTOR'S NOTES / TRANSCRIPT block. The four headers are load-bearing, because the model uses them to separate performance context from the script it should actually speak:
# AUDIO PROFILE: Jaz, London morning-show radio DJ
## THE SCENE: 10 PM, neon-lit studio, "ON AIR" tally blazing.
Jaz is bouncing on their heels, hands on the faders, infectious energy.
### DIRECTOR'S NOTES
Style: vocal smile always audible; punchy consonants; elongated vowels on excitement words.
Accent: Brixton, London.
Pace: energetic, bouncing cadence, no dead air.
#### TRANSCRIPT
[excitedly] Yes, massive vibes in the studio! [shouting] Turn it up!
Inline tags inside #### TRANSCRIPT override the baseline direction when you want a specific beat.
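When the same persona is reused across many transcripts, the four-header block can be assembled from its parts. The sketch below uses a hypothetical `build_prompt` helper; passing its output as the script's text argument is an assumption about how the wrapper forwards text to the model.

```shell
# Hypothetical helper: print the structured prompt for the given
# profile, scene, director's notes, and transcript.
build_prompt() {
  profile=$1 scene=$2 notes=$3 transcript=$4
  cat <<EOF
# AUDIO PROFILE: $profile
## THE SCENE: $scene
### DIRECTOR'S NOTES
$notes
#### TRANSCRIPT
$transcript
EOF
}

# Example (commented out because it calls the real API; assumes the
# wrapper passes the whole string through as the prompt):
#   scripts/gemini_tts.sh -o /tmp/jaz.wav "$(build_prompt \
#     "Jaz, London morning-show radio DJ" \
#     "10 PM, neon-lit studio" \
#     "Pace: energetic, no dead air." \
#     "[excitedly] Yes, massive vibes in the studio!")"
```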
Full prompting reference: Gemini speech-generation docs.
Up to 2 speakers. Use --multi "Name1:Voice1,Name2:Voice2" and make sure the speaker names in the text match.
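A mismatch between the --multi names and the transcript labels is easy to miss, so a local check can help. This sketch uses a hypothetical `missing_speakers` helper that looks for each declared name followed by a colon; it is a plain substring test, so a name that is a suffix of another label would also count as present.

```shell
# Hypothetical helper: print every speaker declared in the --multi spec
# ($1) that never appears as "Name:" in the transcript ($2).
# Prints nothing when all declared speakers are found.
missing_speakers() {
  spec=$1 transcript=$2
  old_ifs=$IFS
  IFS=,
  for pair in $spec; do
    name=${pair%%:*}          # "Host:Kore" -> "Host"
    case "$transcript" in
      *"$name:"*) ;;          # label found, nothing to report
      *) printf '%s\n' "$name" ;;
    esac
  done
  IFS=$old_ifs
}

# Example:
#   missing_speakers "Host:Kore,Guest:Puck" "Host: Welcome! Guest: Thanks."
```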
70+ languages are supported, including Arabic, Bengali, Chinese, English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Turkish, Ukrainian, Urdu, Vietnamese, and many more. See the Gemini speech-generation docs for the full locale list.
Basic smoke test once your API key is set:
export GEMINI_API_KEY=your_key_here # GOOGLE_API_KEY is also accepted
scripts/gemini_tts.sh -o /tmp/gemini-test.wav "This is a Gemini TTS smoke test."
file /tmp/gemini-test.wav
Expected result: a playable WAV file is created (24 kHz mono, 16-bit PCM WAV).
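Where `file` is unavailable or its output is awkward to parse, the header fields can be read directly. The sketch below assumes the canonical 44-byte PCM WAV header, with the fmt chunk immediately after the RIFF header; `wav_field` and `wav_summary` are hypothetical helpers, and the script's actual header layout is not confirmed here.

```shell
# Read a little-endian unsigned integer of $3 bytes at offset $2 in file $1.
wav_field() {
  value=0
  bits=0
  for byte in $(od -An -tu1 -j"$2" -N"$3" "$1"); do
    value=$(( value + ($byte << bits) ))
    bits=$(( bits + 8 ))
  done
  echo "$value"
}

# Summarize sample rate, channel count, and bit depth.
# Assumption: canonical header offsets (channels 22, rate 24, bits 34).
wav_summary() {
  rate=$(wav_field "$1" 24 4)
  channels=$(wav_field "$1" 22 2)
  depth=$(wav_field "$1" 34 2)
  echo "$rate Hz, $channels ch, $depth-bit"
}

# Example:
#   wav_summary /tmp/gemini-test.wav
```

For a file matching the expected result above, the summary should report 24000 Hz, 1 channel, and 16 bits.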