Install
openclaw skills install local-transcription

Local speech-to-text transcription with Qwen ASR, routed across your Apple Silicon fleet. Transcribe meetings, voice notes, and podcasts on your own hardware. It works like Whisper but runs locally via MLX, with fleet routing, queue management, and dashboard visibility. 语音转文字 | transcripción de voz
You're helping someone run speech-to-text transcription on audio files (meetings, voice memos, podcast episodes, phone recordings) without sending anything to the cloud. Every audio file stays on their devices, and the fleet automatically picks the best node for each transcription.
Cloud speech-to-text transcription APIs charge per minute and send your audio to third-party servers. Meeting recordings contain sensitive business discussions. Voice notes contain personal thoughts. Podcast interviews contain unreleased content. None of that should leave your network. Local transcription keeps it private.
This skill routes speech-to-text transcription requests across your fleet of devices. If one machine is busy with a 3-hour transcription, the next speech-to-text request goes to a different device. Transcription queue management, health monitoring, and dashboard visibility — same infrastructure you'd get from a cloud speech-to-text API, running entirely on your hardware.
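The routing idea described above can be illustrated with a small sketch. This is not the router's actual algorithm: the `Node` class, the `queued_seconds` metric, and the node names are hypothetical, chosen only to show how a busy machine gets skipped in favor of an idle one.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical view of one fleet device as a router might see it."""
    name: str
    healthy: bool
    queued_seconds: float  # seconds of audio already waiting on this node (assumed metric)


def pick_node(nodes):
    """Return the healthy node with the least queued work, or None if none are healthy."""
    candidates = [n for n in nodes if n.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.queued_seconds)


fleet = [
    Node("studio", healthy=True, queued_seconds=10800.0),  # busy with a 3-hour file
    Node("macbook", healthy=True, queued_seconds=0.0),     # idle
    Node("mini", healthy=False, queued_seconds=0.0),       # failed its health check
]

chosen = pick_node(fleet)
print(chosen.name if chosen else "no healthy node")
```

The real router may weigh other signals (memory, model availability, hardware speed); least-queued-first is simply the easiest policy to reason about.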
pip install ollama-herd
herd # start the transcription router (port 11435)
herd-node # start on each transcription device
uv tool install "mlx-qwen3-asr[serve]" --python 3.14 # install speech-to-text model
Enable speech-to-text transcription:
curl -X POST http://localhost:11435/dashboard/api/settings \
-H "Content-Type: application/json" \
-d '{"transcription": true}'
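The same settings call can be made from Python. The sketch below uses only the standard library and mirrors the curl command above; the helper names `build_enable_request` and `enable_transcription` are illustrative, and the shape of the response body is not documented here, so only the HTTP status is returned.

```python
import json
import urllib.request

SETTINGS_URL = "http://localhost:11435/dashboard/api/settings"


def build_enable_request(url=SETTINGS_URL):
    """Build the POST request that turns transcription on (same payload as the curl call)."""
    body = json.dumps({"transcription": True}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def enable_transcription(url=SETTINGS_URL):
    """Send the request to a running router and return the HTTP status code."""
    with urllib.request.urlopen(build_enable_request(url), timeout=10) as resp:
        return resp.status
```

Call `enable_transcription()` once the router is running on port 11435.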
Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd
# Speech-to-text transcription of a meeting recording
curl -s http://localhost:11435/api/transcribe \
-F "audio=@meeting-recording.wav" | python3 -m json.tool
import httpx

def speech_to_text_transcription(audio_path):
    """Run speech-to-text transcription on an audio file."""
    with open(audio_path, "rb") as f:
        transcription_resp = httpx.post(
            "http://localhost:11435/api/transcribe",
            files={"audio": (audio_path, f)},
            timeout=300.0,
        )
    transcription_resp.raise_for_status()
    transcription_result = transcription_resp.json()
    return transcription_result["text"]

# Run speech-to-text transcription
transcription_text = speech_to_text_transcription("meeting.wav")
print(transcription_text)
def transcription_with_timestamps(audio_path):
    """Speech-to-text transcription returning timestamped chunks."""
    with open(audio_path, "rb") as f:
        transcription_resp = httpx.post(
            "http://localhost:11435/api/transcribe",
            files={"audio": (audio_path, f)},
            timeout=300.0,
        )
    transcription_resp.raise_for_status()
    transcription_result = transcription_resp.json()
    for transcription_chunk in transcription_result.get("chunks", []):
        print(f"[{transcription_chunk['start']:.1f}s - {transcription_chunk['end']:.1f}s] {transcription_chunk['text']}")
    return transcription_result
{
  "text": "Hello, this is a test of the speech-to-text transcription system.",
  "language": "English",
  "chunks": [
    {
      "text": "Hello, this is a test of the speech-to-text transcription system.",
      "start": 0.0,
      "end": 3.2,
      "chunk_index": 0,
      "language": "English"
    }
  ]
}
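One practical use of the timestamped chunks is turning them into subtitles. The sketch below converts a list of chunk objects into SRT text, relying only on the per-chunk `text`, `start`, and `end` fields shown in the sample response. The function names are illustrative, and the caller is assumed to pass in the chunk list from the response.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks):
    """Build SRT subtitle text from chunks that carry 'text', 'start', and 'end'."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start = format_srt_time(chunk["start"])
        end = format_srt_time(chunk["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{chunk['text'].strip()}\n")
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


sample_chunks = [
    {
        "text": "Hello, this is a test of the speech-to-text transcription system.",
        "start": 0.0,
        "end": 3.2,
    }
]
print(chunks_to_srt(sample_chunks))
```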
WAV, MP3, M4A, FLAC, MP4, OGG — any format FFmpeg supports. WAV files get a ~25% transcription speed boost via native fast-path.
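If you want to use the WAV fast path for files in other formats, you can convert them with FFmpeg before uploading. The sketch below builds a standard FFmpeg command. The 16 kHz mono target is a common choice for speech models and is an assumption here, not a documented requirement of the router. Conversion also takes time, so whether it is a net win for a given file is something to measure.

```python
import subprocess
from pathlib import Path


def wav_command(src, sample_rate=16000):
    """Return the ffmpeg argument list that converts src to a mono WAV next to it."""
    src = Path(src)
    dst = src.with_suffix(".wav")
    return [
        "ffmpeg", "-y",           # overwrite the output file if it exists
        "-i", str(src),           # input file in any format FFmpeg can decode
        "-ac", "1",               # downmix to one channel (assumed preference)
        "-ar", str(sample_rate),  # resample (16 kHz is an assumption)
        str(dst),
    ]


def convert_to_wav(src):
    """Convert src to WAV and return the output path; WAV inputs are returned unchanged."""
    if Path(src).suffix.lower() == ".wav":
        return str(src)
    cmd = wav_command(src)
    subprocess.run(cmd, check=True)  # requires ffmpeg on PATH
    return cmd[-1]
```

For example, `convert_to_wav("voice-note.m4a")` writes `voice-note.wav` alongside the original, ready to pass to `/api/transcribe`.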
| Header | Description |
|---|---|
| X-Fleet-Node | Which device performed the speech-to-text transcription |
| X-Fleet-Model | Transcription model used (qwen3-asr) |
| X-Transcription-Time | Transcription processing time in milliseconds |
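These headers can be read from any HTTP client's response object. The helper below is a minimal sketch that takes a plain mapping of headers (for example `dict(response.headers)`) and pulls out the three fields. The function name and the sample values are illustrative, and it assumes `X-Transcription-Time` arrives as a numeric string.

```python
def fleet_metadata(headers):
    """Extract fleet routing details from response headers (case-insensitive lookup)."""
    lowered = {key.lower(): value for key, value in headers.items()}
    elapsed_ms = lowered.get("x-transcription-time")
    return {
        "node": lowered.get("x-fleet-node"),
        "model": lowered.get("x-fleet-model"),
        # The header is documented in milliseconds; convert to seconds for display.
        "seconds": float(elapsed_ms) / 1000.0 if elapsed_ms is not None else None,
    }


# Hypothetical header values, for illustration only.
sample_headers = {
    "X-Fleet-Node": "mac-studio",
    "X-Fleet-Model": "qwen3-asr",
    "X-Transcription-Time": "4800",
}
print(fleet_metadata(sample_headers))
```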
Qwen3-ASR — state-of-the-art open-source speech-to-text transcription in 2026. ~5% word error rate, runs natively on Apple Silicon via MLX. The 0.6B transcription model uses ~1.2GB memory and transcribes at 0.08x real-time factor (a 10-minute recording completes transcription in ~48 seconds).
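The real-time factor makes it easy to estimate how long a job will take: multiply the audio duration by the factor. A small sketch of that arithmetic, using the 0.08 figure quoted above (actual speed depends on the device and on what else it is running):

```python
def estimated_seconds(audio_seconds, real_time_factor=0.08):
    """Estimate processing time as audio duration times the real-time factor."""
    return audio_seconds * real_time_factor


ten_minutes = 10 * 60
three_hours = 3 * 60 * 60
print(f"10-minute recording: ~{estimated_seconds(ten_minutes):.0f} s")
print(f"3-hour recording: ~{estimated_seconds(three_hours) / 60:.1f} min")
```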
The same router handles three other AI workloads alongside speech-to-text transcription. All endpoints are at http://localhost:11435:
# Chat completions (LLM)
curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-oss:120b","messages":[{"role":"user","content":"Hello"}]}'

# Image generation
curl -o image.png http://localhost:11435/api/generate-image \
  -H "Content-Type: application/json" \
  -d '{"model":"z-image-turbo","prompt":"a sunset","width":1024,"height":1024,"steps":4}'

# Embeddings
curl http://localhost:11435/api/embeddings \
  -d '{"model":"nomic-embed-text","prompt":"search query"}'
# Transcription stats (last 24h)
curl -s http://localhost:11435/dashboard/api/transcription-stats | python3 -m json.tool
# Fleet health (includes speech-to-text transcription activity)
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool
Dashboard at http://localhost:11435/dashboard — speech-to-text transcription queues show with [STT] badge alongside LLM and image queues.
Agent Setup Guide — complete reference for all 4 model types including speech-to-text transcription with Python, JavaScript, and curl examples.
Logs live under ~/.fleet-manager/
To inspect the router log: tail ~/.fleet-manager/logs/herd.jsonl
To install the speech-to-text model: uv tool install "mlx-qwen3-asr[serve]" --python 3.14