Install
openclaw skills install local-stt-workflow
Local speech-to-text workflow for an OpenAI-compatible STT server, typically on http://127.0.0.1:8000/v1. Use when configuring, testing, debugging, or validating audio transcription with `/v1/audio/transcriptions` or `/v1/audio/translations`, especially for OpenClaw audio pipelines, multipart upload compatibility, model registration, streaming SSE behavior, response_format handling, local model-path fallback, and “did the request reach the server or not?” investigations.
Use this skill to debug the full transcription path, not just the model.
Default assumption: the local STT server lives at http://127.0.0.1:8000/v1.
Current local model-path fallback worth remembering: if the server did not pull a model by name, it may be loading directly from a local path such as ./models/Qwen3-ASR-0.6B-bf16.
When exact route shape matters, the local OpenAPI document is available at:
http://localhost:8000/openapi.json
Use this OpenAPI doc as a schema/reference source to compare this local mlx-audio server against OpenAI’s API. Do not treat it as a health check.
Check the basics first:
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models
Confirm that the intended STT model exists, usually qwen3-asr.
If the model does not appear by pulled registry name, do not assume STT is broken — this server may be running a local-path model such as ./models/Qwen3-ASR-0.6B-bf16.
If the server is task-gated, ensure STT is enabled:
MLX_AUDIO_SERVER_TASKS=stt uv run python server.py
If the model is missing, register it before testing clients — but first check whether the server is intentionally loading from a local path and verify the exact accepted model IDs through /v1/models or http://localhost:8000/openapi.json.
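The model check above can be scripted. This is a minimal sketch using only the Python standard library. It assumes `/v1/models` returns an OpenAI-style body of the form `{"data": [{"id": ...}]}`, which this document does not confirm; verify the real shape against http://localhost:8000/openapi.json. The helper names are illustrative.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"  # default server address used in this document


def model_ids(models_payload: dict) -> list[str]:
    """Extract model IDs, ASSUMING an OpenAI-style {"data": [{"id": ...}]} body."""
    return [entry["id"] for entry in models_payload.get("data", []) if "id" in entry]


def fetch_model_ids(base_url: str = BASE_URL, timeout: float = 5.0) -> list[str]:
    """GET /v1/models and return the model IDs the server reports."""
    with urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

With the server running, `"qwen3-asr" in fetch_model_ids()` answers whether the registry name is accepted. If it is not, look for a local-path ID in the returned list before concluding that STT is broken.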
Always isolate the server from the client stack.
Minimal direct transcription test:
curl -X POST http://127.0.0.1:8000/v1/audio/transcriptions \
-F file=@sample.wav \
-F model=qwen3-asr \
-F response_format=json
Useful richer test:
curl -X POST http://127.0.0.1:8000/v1/audio/transcriptions \
-F file=@sample.wav \
-F model=qwen3-asr \
-F response_format=verbose_json \
-F 'timestamp_granularities[]=segment' \
-F 'timestamp_granularities[]=word'
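When curl is not convenient, the same direct test can be run from Python. The sketch below builds the multipart body by hand with the standard library and sends only `file` and `model`, the two fields listed in both the OpenAI and the local schema. The function names are illustrative, and the JSON response is an assumption based on the `{ "text": ... }` result that curl returns.

```python
import json
import os
import uuid
from urllib.request import Request, urlopen

URL = "http://127.0.0.1:8000/v1/audio/transcriptions"


def encode_multipart(fields: dict[str, str], filename: str, audio: bytes) -> tuple[bytes, str]:
    """Build a multipart/form-data body: text fields plus one `file` part."""
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in fields.items():
        part = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'
        chunks.append(part.encode())
    file_header = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        'Content-Type: application/octet-stream\r\n\r\n'
    )
    chunks.append(file_header.encode())
    chunks.append(audio)
    chunks.append(f"\r\n--{boundary}--\r\n".encode())
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, model: str = "qwen3-asr", url: str = URL) -> dict:
    """POST one audio file straight to the STT server, bypassing any client stack."""
    with open(path, "rb") as fh:
        body, content_type = encode_multipart({"model": model}, os.path.basename(path), fh.read())
    request = Request(url, data=body, headers={"Content-Type": content_type}, method="POST")
    with urlopen(request, timeout=120) as resp:
        return json.load(resp)  # assumes a JSON body such as {"text": ...}
```

Calling `transcribe("sample.wav")` exercises the server with no OpenClaw code in the path, which is the point of the isolation step.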
If direct curl works but OpenClaw does not, the bug is probably in the message ingestion or routing layer, not the STT backend.
Apply this rule hard. The distinction saves a shitload of time.
This server is designed around OpenAI-style multipart form upload.
Expected core fields for /v1/audio/transcriptions from the current local OpenAPI schema:
file, model, language, verbose, max_tokens, chunk_duration, frame_threshold, stream, context, prefill_step_size, text
This means the local server is not exposing the same form shape as OpenAI Whisper-style docs. Do not blindly assume response_format, prompt, or timestamp_granularities[] exist just because OpenAI supports them.
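One way to avoid guessing is to read the accepted form fields out of the local OpenAPI document and diff them against what a client sends. This is a sketch under assumptions: it expects an OpenAPI 3 document whose request body is declared under `multipart/form-data`, either inline or through a local `$ref` (as FastAPI-generated specs commonly do). The function names are illustrative, not part of the server.

```python
import json
from urllib.request import urlopen


def load_spec(url: str = "http://localhost:8000/openapi.json") -> dict:
    """Fetch the local OpenAPI document."""
    with urlopen(url, timeout=5) as resp:
        return json.load(resp)


def form_fields(spec: dict, path: str) -> set[str]:
    """Return the multipart form field names the spec declares for POST `path`."""
    content = spec["paths"][path]["post"]["requestBody"]["content"]
    schema = content["multipart/form-data"]["schema"]
    if "$ref" in schema:
        # Follow a local reference such as "#/components/schemas/<name>".
        node = spec
        for key in schema["$ref"].removeprefix("#/").split("/"):
            node = node[key]
        schema = node
    return set(schema.get("properties", {}))


def unsupported_fields(spec: dict, path: str, sent: list[str]) -> set[str]:
    """Fields a client sends that the local schema does not declare."""
    return set(sent) - form_fields(spec, path)
```

For example, `unsupported_fields(load_spec(), "/v1/audio/transcriptions", ["file", "model", "response_format"])` reports which of those names the running server does not declare.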
If a client is suspected of sending the wrong shape, inspect traffic with a temporary dump proxy or server logs.
Read references/stt-api.md when you need exact behavior for:
response_format=json|text|verbose_json|srt|vtt
stream=true SSE events
timestamp_granularities[]
include[]
Do not guess field support from generic OpenAI docs when this local server may intentionally differ.
Current notable mismatch: the local schema exposes context and text, plus chunking/prefill controls like chunk_duration, frame_threshold, and prefill_step_size, which are not the usual OpenAI STT field set.
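To inspect what stream=true actually emits, it helps to look at the raw SSE events before interpreting them. The sketch below is a generic parser for the standard `data:` framing of server-sent events. It makes no assumption about this server's event payloads, which are documented in references/stt-api.md. The function name is illustrative.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event as one string per event."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the payload.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Comment lines (":...") and other fields (event:, id:, retry:) are skipped.
    if buffer:
        # Debugging convenience: also surface a trailing event that was never terminated.
        yield "\n".join(buffer)
```

Feed it decoded lines from the streaming HTTP response and print each payload as-is. Only parse payloads as JSON once their shape has been confirmed against the reference.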
When OpenClaw STT appears broken:
Confirm tools.media.audio is configured, not messages.stt
Confirm the base URL is http://127.0.0.1:8000/v1
Confirm /v1/models responds
Confirm /v1/audio/transcriptions works with a direct request
If the server sees no /audio/transcriptions request at all, the problem is upstream of STT
If OpenClaw never hits the server, stop tweaking model params. That would be cargo-cult debugging.
Use this order:
1. GET /health
2. GET /v1/models
3. direct curl transcription with the same audio file
4. schema check against http://localhost:8000/openapi.json
Typical signs:
.m4a returns 500
Converting to mp3 or wav makes transcription succeed immediately
Conclusion: treat this as an input-container compatibility bug, not an ASR-quality failure. For now, transcode niche formats to mp3 or wav before testing recognition quality.
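The transcoding workaround can be automated. This sketch assumes ffmpeg is installed and on PATH, which this document does not state. It performs a plain container/codec conversion with ffmpeg's default settings, and the helper names are illustrative.

```python
import subprocess
from pathlib import Path
from typing import Optional


def ffmpeg_to_wav_command(src: str, dst: Optional[str] = None) -> list[str]:
    """Build an ffmpeg command that converts `src` to a .wav file."""
    out = dst or str(Path(src).with_suffix(".wav"))
    # -y overwrites an existing output file; -i names the input file.
    return ["ffmpeg", "-y", "-i", src, out]


def transcode_to_wav(src: str, dst: Optional[str] = None) -> str:
    """Run the conversion and return the output path; raises if ffmpeg fails."""
    cmd = ffmpeg_to_wav_command(src, dst)
    subprocess.run(cmd, check=True, capture_output=True)
    return cmd[-1]
```

Running `transcode_to_wav("voice.m4a")` before the transcription request separates container problems from recognition problems.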
Typical signs:
curl returns { "text": ... }
Conclusion: fix routing, not inference.
Typical signs:
curl works but app client does not
Conclusion: compare multipart field names and values.
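To compare field names, one option is to point the client at a throwaway endpoint that prints what it receives. The sketch below is a dump endpoint, not a forwarding proxy: it answers every POST with a stub body and never contacts the real server. The port 8001, the stub response, and all names are hypothetical choices for illustration.

```python
import email
from http.server import BaseHTTPRequestHandler, HTTPServer


def multipart_field_names(content_type: str, body: bytes) -> list:
    """List the form field names found in a multipart/form-data request body."""
    # Prepend the Content-Type header so the email parser knows the boundary.
    header = f"Content-Type: {content_type}\r\n\r\n".encode()
    message = email.message_from_bytes(header + body)
    if not message.is_multipart():
        return []
    return [part.get_param("name", header="content-disposition")
            for part in message.get_payload()]


class DumpHandler(BaseHTTPRequestHandler):
    """Accept any POST, print its path and form field names, return a stub body."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        print(self.path, multipart_field_names(self.headers.get("Content-Type", ""), body))
        payload = b'{"text": ""}'  # stub response, not a real transcription
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8001) -> None:
    """Block and serve the dump endpoint on localhost."""
    HTTPServer(("127.0.0.1", port), DumpHandler).serve_forever()
```

Run `serve()`, send the same audio once with curl and once with the app client, and compare the two printed field lists.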
Typical signs:
Conclusion: align expectations with references/stt-api.md.
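The check order described earlier (GET /health, GET /v1/models, direct transcription, then schema) can be sketched as a small runner that stops at the first failing step, so later steps are not debugged before earlier ones pass. The names here are illustrative. Treating any exception as a failure is a simplification for debugging.

```python
from typing import Callable, Optional
from urllib.request import urlopen


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """True if a GET to `url` returns a 2xx status; urlopen raises on 4xx/5xx."""
    with urlopen(url, timeout=timeout) as resp:
        return 200 <= resp.status < 300


def first_failure(checks: list[tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Run checks in order and return the name of the first one that fails, else None."""
    for name, check in checks:
        try:
            passed = check()
        except Exception:
            # Connection errors and HTTP errors both count as a failed step.
            passed = False
        if not passed:
            return name
    return None
```

A typical call would be `first_failure([("health", lambda: http_ok("http://127.0.0.1:8000/health")), ("models", lambda: http_ok("http://127.0.0.1:8000/v1/models"))])`, extended with a direct transcription step for the audio file under test.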
references/stt-api.md — exact local API behavior, schema, response formats, SSE events, limits, and compatibility notes