Install
openclaw skills install salute-speech
Transcribe audio files using the Sber Salute Speech async API. Russian-first speech-to-text (STT) with support for ru-RU, en-US, kk-KZ, ky-KG, uz-UZ.
Transcribe audio/video files to text with timestamps via the Salute Speech async REST API.
SALUTE_AUTH_DATA must be set (Base64-encoded client_id:client_secret, or the raw authorization key from https://developers.sber.ru/studio/).

SSL verification is disabled (verify_ssl=False) because Sber's certificate chain is non-standard. This is expected.

| Audio encoding | Content-Type | Typical extensions |
|---|---|---|
| MP3 | audio/mpeg | .mp3 |
| PCM_S16LE | audio/wav | .wav |
| OPUS | audio/ogg | .ogg, .opus |
| FLAC | audio/flac | .flac |
| ALAW | audio/alaw | .alaw |
| MULAW | audio/mulaw | .mulaw |

Supported languages: ru-RU (Russian), en-US (English), kk-KZ (Kazakh), ky-KG (Kyrgyz), uz-UZ (Uzbek).
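The Base64 form of SALUTE_AUTH_DATA can be derived from a client ID and secret. The following is a minimal sketch, assuming the value is plain Base64 of the string client_id:client_secret as described above; the helper name make_auth_data and the credentials are hypothetical placeholders.

```python
import base64
import os


def make_auth_data(client_id: str, client_secret: str) -> str:
    """Return Base64 of "client_id:client_secret" as an ASCII string."""
    raw = f"{client_id}:{client_secret}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


# Placeholder credentials for illustration only (not real ones).
auth_data = make_auth_data("my-client-id", "my-client-secret")

# Setting it here makes it visible to child processes started from this script.
os.environ["SALUTE_AUTH_DATA"] = auth_data
```

If the developer studio already shows a ready-made authorization key, that value can be used directly and this step is unnecessary.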
Run salute_transcribe.py with uv and the appropriate arguments:

uv run --with requests {baseDir}/salute_transcribe.py \
  --file /path/to/audio.mp3 \
  --output_dir ~/.openclaw/workspace/transcriptions \
  --lang ru-RU

| Argument | Required | Default | Description |
|---|---|---|---|
| --file | Yes | (none) | Path to audio/video file |
| --output_dir | No | ~/.openclaw/workspace/transcribations | Output directory for results |
| --lang | No | ru-RU | Language code: ru-RU, en-US, kk-KZ, ky-KG, uz-UZ |
| --audio-encoding | No | MP3 | Codec: MP3, PCM_S16LE, OPUS, FLAC, ALAW, MULAW |
| --model | No | general | Recognition model: general or callcenter |
| --hyp-count | No | 1 | Number of alternative hypotheses: 1 or 2 |
| --max-wait-time | No | 300 | Max seconds to wait for async result |
| --print | No | off | Also print transcription to stdout |

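When the script is launched from another Python program, the arguments in the table can be assembled as a list rather than a shell string. This is a sketch only: build_command is a hypothetical helper, the paths are placeholders, and it covers just a subset of the flags listed above.

```python
import shlex


def build_command(script_path, audio_path, *, lang="ru-RU", audio_encoding="MP3",
                  model="general", max_wait_time=300, output_dir=None):
    """Assemble the uv invocation as an argument list (avoids shell quoting issues)."""
    cmd = [
        "uv", "run", "--with", "requests", str(script_path),
        "--file", str(audio_path),
        "--lang", lang,
        "--audio-encoding", audio_encoding,
        "--model", model,
        "--max-wait-time", str(max_wait_time),
    ]
    if output_dir is not None:
        # Pass --output_dir only when overriding the script's default.
        cmd += ["--output_dir", str(output_dir)]
    return cmd


# Hypothetical paths for illustration.
cmd = build_command("salute_transcribe.py", "call.wav",
                    audio_encoding="PCM_S16LE", model="callcenter",
                    max_wait_time=900)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run keeps file names with spaces intact without manual quoting.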
When the input file is not MP3, adjust content_type in the script, or add logic that picks it from the file extension. The current default is audio/mpeg (MP3); for .wav files use audio/wav, and so on.
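The extension-based logic mentioned above could look roughly like this. It is a sketch built from the format table in this document, not code taken from salute_transcribe.py; FORMATS and guess_format are hypothetical names. Note the assumption that the extension reflects the actual codec: a .wav file is not guaranteed to be PCM_S16LE, and an .ogg file is not guaranteed to be Opus.

```python
from pathlib import Path

# Extension -> (audio encoding, Content-Type), taken from the format table.
FORMATS = {
    ".mp3": ("MP3", "audio/mpeg"),
    ".wav": ("PCM_S16LE", "audio/wav"),
    ".ogg": ("OPUS", "audio/ogg"),
    ".opus": ("OPUS", "audio/ogg"),
    ".flac": ("FLAC", "audio/flac"),
    ".alaw": ("ALAW", "audio/alaw"),
    ".mulaw": ("MULAW", "audio/mulaw"),
}


def guess_format(path):
    """Return (audio_encoding, content_type) for a file, based on its extension."""
    ext = Path(path).suffix.lower()
    if ext not in FORMATS:
        raise ValueError(f"unsupported extension {ext!r}; pass the encoding explicitly")
    return FORMATS[ext]


encoding, content_type = guess_format("interview.WAV")
print(encoding, content_type)
```

Raising on unknown extensions, instead of silently falling back to audio/mpeg, makes a mismatch visible before the upload.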
For the input file meetingABC.mp3, the script produces:

| File | Description |
|---|---|
| meetingABC_recognition_orig.json | Raw API response (full JSON with all hypotheses, timing, confidence) |
| meetingABC_pretty.txt | Formatted human-readable transcript with timestamps |

Example of the formatted transcript:

[00:01 - 00:20]:
Ну, даже если сосредоточиться на идее узкой щели.
[00:20 - 00:45]:
Следующий фрагмент текста здесь.
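The timestamped layout shown above is easy to read back into structured segments. The sketch below assumes every segment starts with a header line of the form [MM:SS - MM:SS]: followed by one or more text lines; whether long recordings switch to an hours field is not stated here, so that case is not handled. parse_pretty is a hypothetical helper.

```python
import re

# Header line such as "[00:01 - 00:20]:" (minutes:seconds, assumed format).
HEADER = re.compile(r"^\[(\d+):(\d{2}) - (\d+):(\d{2})\]:$")


def parse_pretty(text):
    """Return a list of (start_seconds, end_seconds, text) tuples."""
    segments = []
    for line in text.splitlines():
        line = line.strip()
        match = HEADER.match(line)
        if match:
            m1, s1, m2, s2 = (int(g) for g in match.groups())
            segments.append([m1 * 60 + s1, m2 * 60 + s2, []])
        elif line and segments:
            # Text lines belong to the most recent header.
            segments[-1][2].append(line)
    return [(start, end, " ".join(lines)) for start, end, lines in segments]


sample = """[00:01 - 00:20]:
Ну, даже если сосредоточиться на идее узкой щели.
[00:20 - 00:45]:
Следующий фрагмент текста здесь.
"""
segments = parse_pretty(sample)
for start, end, text in segments:
    print(start, end, text)
```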
Long recordings may need --max-wait-time increased beyond 300s.

The callcenter model is optimized for telephony audio (8 kHz, mono).

The profanity filter is disabled (enable_profanity_filter=False).
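The --max-wait-time limit amounts to a deadline on polling the async task. The following sketch shows that pattern in isolation, without calling the real API: get_status is an injected callable, wait_for_result is a hypothetical name, and the status strings (NEW, RUNNING, DONE, ERROR, CANCELED) are assumptions about what the service reports, not values confirmed by this document.

```python
import time

# Assumed terminal states of an async recognition task.
TERMINAL = {"DONE", "ERROR", "CANCELED"}


def wait_for_result(get_status, max_wait_time=300, poll_interval=5.0):
    """Poll get_status() until a terminal state or until max_wait_time seconds pass."""
    deadline = time.monotonic() + max_wait_time
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task still {status!r} after {max_wait_time}s")
        time.sleep(poll_interval)


# Simulated task that finishes on the third poll.
fake_statuses = iter(["NEW", "RUNNING", "DONE"])
result = wait_for_result(lambda: next(fake_statuses), max_wait_time=10, poll_interval=0)
print(result)
```

With this structure, a larger --max-wait-time only moves the deadline; the polling interval and the work done per poll stay the same.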