## Install

Generate speech from text using Kyutai Pocket TTS - a lightweight, CPU-friendly, streaming TTS engine with voice cloning. English only. No GPU required. Runs at ~6x real-time on an M4 MacBook Air.

Install the skill:

```sh
openclaw skills install lb-pocket-tts-skill
```

Install the library:

```sh
pip install pocket-tts
# or
uv add pocket-tts
```
## Generate

```sh
# Basic generation (default voice)
pocket-tts generate --text "Hello world"

# Custom voice (local file, URL, or safetensors)
pocket-tts generate --voice ./my_voice.wav
pocket-tts generate --voice "hf://kyutai/tts-voices/alba-mackenna/casual.wav"
pocket-tts generate --voice ./voice.safetensors

# Quality tuning
pocket-tts generate --temperature 0.7 --lsd-decode-steps 3
```

See docs/generate.md for the full CLI reference.
## Serve

```sh
# Start FastAPI server with web UI
pocket-tts serve

# Custom host/port
pocket-tts serve --host localhost --port 8080
```

See docs/serve.md for server options.
## Export voices

Convert audio files to .safetensors for faster loading:

```sh
# Single file
pocket-tts export-voice voice.mp3 voice.safetensors

# Batch conversion
pocket-tts export-voice voices/ embeddings/ --truncate
```

See docs/export_voice.md for export options.
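For finer control than the built-in batch mode (for example, filtering by extension or choosing output names), the single-file command can be driven from Python. A minimal sketch; the voices/ and embeddings/ paths mirror the CLI example above, and the commands are only executed when pocket-tts is actually on PATH:

```python
import shutil
import subprocess
from pathlib import Path

# Assumption: voices/ holds source clips, embeddings/ receives .safetensors files
src_dir, out_dir = Path("voices"), Path("embeddings")
out_dir.mkdir(exist_ok=True)

# Build one export-voice command per source clip
commands = [
    ["pocket-tts", "export-voice", str(clip), str(out_dir / (clip.stem + ".safetensors"))]
    for clip in sorted(src_dir.glob("*.mp3")) + sorted(src_dir.glob("*.wav"))
]

# Only execute when the CLI is installed
if shutil.which("pocket-tts"):
    for cmd in commands:
        subprocess.run(cmd, check=True)
```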
## Python API

```python
from pocket_tts import TTSModel
import scipy.io.wavfile

# Load model
model = TTSModel.load_model()

# Get voice state
voice = model.get_state_for_audio_prompt(
    "hf://kyutai/tts-voices/alba-mackenna/casual.wav"
)

# Generate audio
audio = model.generate_audio(voice, "Hello world!")

# Save
scipy.io.wavfile.write("output.wav", model.sample_rate, audio.numpy())
```
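Note that scipy writes float arrays as 32-bit float WAV, which some players handle poorly. A minimal sketch of converting float audio to 16-bit PCM before saving, assuming the model returns samples in [-1, 1]; the NumPy sine wave and the 24 kHz rate are stand-ins for the tensor returned by generate_audio and for model.sample_rate:

```python
import numpy as np
import scipy.io.wavfile

SAMPLE_RATE = 24000  # assumption: use model.sample_rate in real code

# Stand-in for audio.numpy(): one second of a 440 Hz tone, float32 in [-1, 1]
audio = np.sin(2 * np.pi * 440 * np.arange(SAMPLE_RATE) / SAMPLE_RATE).astype(np.float32)

# Clip to the valid range, then scale into the int16 range
pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)

scipy.io.wavfile.write("output_int16.wav", SAMPLE_RATE, pcm)
```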
Model loading options:

```python
model = TTSModel.load_model(
    config="b6369a24",      # Model variant
    temp=0.7,               # Temperature (0.5-1.0)
    lsd_decode_steps=1,     # Generation steps (1-5)
    eos_threshold=-4.0,     # End-of-sequence threshold
)
```
Voice prompts can come from an audio file, a URL, or a precomputed embedding:

```python
# From audio file/URL
voice = model.get_state_for_audio_prompt("./voice.wav")
voice = model.get_state_for_audio_prompt("hf://kyutai/tts-voices/alba-mackenna/casual.wav")

# From safetensors (fast loading)
voice = model.get_state_for_audio_prompt("./voice.safetensors")
```
Streaming generation:

```python
# Stream audio chunks
for chunk in model.generate_audio_stream(voice, "Long text..."):
    # Process/save/play each chunk as it is generated
    print(f"Chunk: {chunk.shape[0]} samples")
```
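When streaming, chunks can be appended to a WAV file as they arrive instead of buffering the whole utterance. A minimal sketch using the stdlib wave module; fake_stream is a stand-in for model.generate_audio_stream, and the 24 kHz mono format is an assumption (use model.sample_rate in real code):

```python
import wave
import numpy as np

SAMPLE_RATE = 24000  # assumption: use model.sample_rate in real code

def fake_stream(n_chunks=3, chunk_samples=2400):
    """Stand-in for model.generate_audio_stream(): yields float32 chunks in [-1, 1]."""
    for _ in range(n_chunks):
        yield np.zeros(chunk_samples, dtype=np.float32)

with wave.open("streamed.wav", "wb") as wav:
    wav.setnchannels(1)            # mono
    wav.setsampwidth(2)            # 16-bit PCM
    wav.setframerate(SAMPLE_RATE)
    for chunk in fake_stream():
        pcm = (np.clip(chunk, -1.0, 1.0) * 32767).astype(np.int16)
        wav.writeframes(pcm.tobytes())  # append each chunk as it arrives
```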
Multiple voices:

```python
# Preload multiple voices
voices = {
    "casual": model.get_state_for_audio_prompt("hf://kyutai/tts-voices/alba-mackenna/casual.wav"),
    "announcer": model.get_state_for_audio_prompt("./announcer.safetensors"),
}

# Use different voices
audio1 = model.generate_audio(voices["casual"], "Hey there!")
audio2 = model.generate_audio(voices["announcer"], "Breaking news!")
```

See docs/python-api.md for the complete API reference.
## Voices

Pre-made voices from hf://kyutai/tts-voices/:

- alba-mackenna/casual.wav (default, female)
- jessica-jian/casual.wav (female)
- voice-donations/Selfie.wav (male, marius)
- voice-donations/Butter.wav (male, javert)
- ears/p010/freeform_speech_01.wav (male, jean)
- vctk/p244_023.wav (female, fantine)
- vctk/p262_023.wav (female, eponine)
- vctk/p303_023.wav (female, azelma)

Or clone any voice from your own audio samples.
## Output

All commands output WAV files. Exported voice embeddings use .safetensors for instant loading.