talkies

Self-hosted OpenAI-compatible speech service. /v1/audio/transcriptions fronts six open ASR models (Whisper, Parakeet, Canary); /v1/audio/speech fronts two TTS engines: Kokoro-82M (41 baked voices) and Qwen3-TTS-0.6B (CUDA-only voice cloning from user-mounted .wav reference clips). Same wire format as OpenAI: change the base URL and the model slug. Also includes stereo diarization, URL fetching, an MCP endpoint, and bearer auth.

Install

openclaw skills install talkies

talkies

Self-hosted speech service — ASR and TTS, one container. OpenAI-compatible wire shape on both endpoints; point an OpenAI client at it, change the model slug, done.

ASR (POST /v1/audio/transcriptions): six backends — whisper-large-v3, whisper-large-v3-turbo, parakeet-tdt-0.6b-v3, canary-180m-flash, canary-1b-flash, canary-qwen-2.5b.

TTS (POST /v1/audio/speech): two engines — kokoro-82m with 41 baked voices across en/es/fr/hi/it/pt, and qwen3-tts-0.6b for CUDA-only voice cloning from reference clips (three builtin samples plus any .wav you drop into /data/custom-voices/, including nested subdirs). Both discovered via GET /v1/audio/voices.

Extras: stereo diarization on transcription, URL file_path fetching, server-side file staging, MCP endpoint with 6 ASR-side tools, optional bearer-token auth.

For installation, configuration, and container setup, see references/setup.md.

When To Use

  • Transcribe audio files (any format ffmpeg decodes — WAV, MP3, M4A, FLAC, OGG, WebM, Opus, MP4 audio).
  • Generate SRT/VTT subtitles for video.
  • Transcribe podcasts, lectures, interviews, voicemails, calls.
  • Stereo two-mic recordings → per-speaker diarized output (L: / R: channel tagging).
  • German/French/Spanish ↔ English speech-to-text translation via Canary-1B-Flash.
  • Synthesize speech from text via Kokoro-82M — English (American + British), Spanish, French, Hindi, Italian, Portuguese.
  • Voice-clone speech via Qwen3-TTS-0.6B from a reference .wav you provide — drop into /data/custom-voices/, immediately appears under GET /v1/audio/voices with origin=custom.
  • Drop-in replacement for api.openai.com/v1/audio/transcriptions and api.openai.com/v1/audio/speech in existing client code.

When NOT To Use

  • Real-time / streaming output — both endpoints are request/response only.
  • Speaker identification from voice (only stereo-channel diarization is supported, not voice clustering).
  • Per-request prompt / temperature (transcribe) or instructions (speech) injection — fields accepted for compat, ignored.
  • Japanese / Chinese TTS — Kokoro upstream supports them but talkies filters those voices out (they need the misaki[ja] / misaki[zh] extras).
  • Kokoro on OpenAI aliases (alloy, echo, fable, onyx, nova, shimmer) — Kokoro exposes its native voice names only (af_*, bm_*, etc.). Map client-side. (Qwen3-TTS does ship alloy / echo / fable as builtin voice slugs, but they're voice-cloned samples, not OpenAI's voices — there's no audio compatibility.)
  • qwen3-tts-0.6b on CPU — voice cloning hard-fails without CUDA at load time. The faster_qwen3_tts upstream raises ValueError on non-CUDA devices; talkies surfaces this as a load failure on the first request.
  • qwen3-tts-0.6b speed parameter — Qwen3-TTS has no playback-rate control. Field is accepted for OpenAI compat but ignored (only Kokoro honors speed).
  • arm64 hosts — linux/amd64 only.

Setup

The container should already be running. Set the base URL:

export TALKIES_URL=http://localhost:8000

If the server has TALKIES_AUTH_TOKEN set, export it too:

export TALKIES_AUTH_TOKEN=<your-token>
# every request below needs: -H "Authorization: Bearer $TALKIES_AUTH_TOKEN"

Verify: curl $TALKIES_URL/healthz returns {"ok": true, "device": "...", "models": [...]}.

For install / configuration / env vars / CPU vs CUDA images / custom model registry, see references/setup.md.

Quick Start

# Discover what's available.
curl -s $TALKIES_URL/v1/models | jq

# Simplest transcribe — file upload, JSON response.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-large-v3-turbo" | jq

# Same call, but the audio lives at a URL — talkies downloads + caches it.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file_path=https://example.com/podcasts/ep-042.mp3" \
  -F "model=whisper-large-v3-turbo" | jq

# Full Whisper-shape JSON with per-segment + per-word timestamps.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-large-v3-turbo" \
  -F "response_format=verbose_json" | jq

# SRT subtitles.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@lecture.mp3" \
  -F "model=whisper-large-v3" \
  -F "response_format=srt" > lecture.srt

# Discover TTS voices, then synthesize an MP3.
curl -s $TALKIES_URL/v1/audio/voices | jq
curl -s $TALKIES_URL/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
        "model": "kokoro-82m",
        "input": "Hello from talkies.",
        "voice": "af_heart",
        "response_format": "mp3"
      }' \
  --output hello.mp3

Supported Models

ASR

| Slug | Family | CPU | CUDA | Languages | Strength |
| --- | --- | --- | --- | --- | --- |
| whisper-large-v3 | faster-whisper | yes | yes | 99 auto-detect | best accuracy, slowest |
| whisper-large-v3-turbo | faster-whisper | yes | yes | 99 auto-detect | sweet spot: fast, accurate |
| parakeet-tdt-0.6b-v3 | NeMo TDT | no | yes | English only | very fast on GPU |
| canary-180m-flash | NeMo Canary | yes | yes | English only (small) | smallest, runs anywhere |
| canary-1b-flash | NeMo Canary | no | yes | en/de/fr/es + translation | multilingual, translation |
| canary-qwen-2.5b | NeMo SALM | no | yes | English only | best English accuracy (no timestamps) |

Pick by use case:

  • General-purpose: whisper-large-v3-turbo.
  • English-only, max accuracy on GPU: canary-qwen-2.5b (but no per-segment timestamps).
  • Translation EN↔DE/FR/ES: canary-1b-flash (requires custom model registry — see Translation).

TTS

| Slug | Family | CPU | CUDA | Languages | Voices |
| --- | --- | --- | --- | --- | --- |
| kokoro-82m | Kokoro (in-process, 24 kHz) | yes | yes | en (US + UK), es, fr, hi, it, pt | 41 baked (discover via GET /v1/audio/voices) |
| qwen3-tts-0.6b | Qwen3-TTS (voice clone, 12 kHz) | no | yes | en, zh, ko, ja, fr, de, ru, es, it, pt, pl, nl, ar, vi, th, id, ms (17) | 3 builtin samples + any .wav under /data/custom-voices/ |

Pick by use case:

  • General-purpose multi-voice TTS: kokoro-82m — fast, 41 baked voices, runs on CPU.
  • Voice cloning from a reference clip: qwen3-tts-0.6b — drop a .wav into /data/custom-voices/, immediately usable. CUDA required.

canary-qwen-2.5b produces no segment/word timestamps — verbose_json.segments and .words come back empty, srt/vtt collapse to a single full-duration cue. Transcription itself is whole-file. Use a Whisper or Canary multitask slug if you need timing.

API — POST /v1/audio/transcriptions

Multipart form. Same field names as OpenAI's transcription endpoint where they overlap.

Request Fields

| Field | Required | Default | Notes |
| --- | --- | --- | --- |
| file | one of file/file_path | | Audio file. Capped at TALKIES_MAX_UPLOAD_BYTES (default 100 MB). |
| file_path | one of file/file_path | | Either a path under the staging area (/v1/files) or an http(s):// URL (downloaded + cached server-side). Not subject to the 100 MB upload cap; URL downloads capped by TALKIES_MAX_DOWNLOAD_BYTES (default 1 GiB). |
| model | yes | | One of the configured slugs (see GET /v1/models). Unknown → 404. |
| language | no | model default | ISO-639-1 code. Whisper auto-detects when omitted; Canary uses its default_source_lang. |
| response_format | no | json | json / text / verbose_json / srt / vtt. |
| timestamp_granularities[] | no | | Accepted for OpenAI compat but ignored: verbose_json always emits both segment + word. |
| prompt | no | | Accepted, ignored. |
| temperature | no | | Accepted, ignored. |
| diarization | no | false | Stereo-channel diarization. Requires 2-channel input; mono returns 400. |

Exactly one of file or file_path must be set — passing both or neither returns 400.
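The "exactly one of" rule is easy to pre-check on the client before spending a round trip. The sketch below is only an illustration of that rule; `check_audio_source` is a hypothetical helper name, not part of talkies.

```python
from typing import Optional


def check_audio_source(file: Optional[bytes], file_path: Optional[str]) -> None:
    """Raise ValueError unless exactly one of file / file_path is provided.

    Mirrors the documented server rule (both or neither -> 400) so a client
    can fail fast. Hypothetical helper, not a talkies API.
    """
    has_file = file is not None
    has_path = bool(file_path)
    if has_file == has_path:  # both set, or both missing
        raise ValueError("set exactly one of file or file_path")


# Valid: an upload, or a staged path / URL, but never both.
check_audio_source(b"RIFF...", None)
check_audio_source(None, "lectures/2026-03-15/lecture.mp3")
```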

Response Formats

| response_format | Content-Type | Shape |
| --- | --- | --- |
| json (default) | application/json | {"text": "..."} (just the transcript). |
| text | text/plain | The transcript as plain text. |
| verbose_json | application/json | Full Whisper shape: task, language, duration, text, segments[], words[]. |
| srt | application/x-subrip | SubRip subtitle file, one cue per VAD-segmented chunk. |
| vtt | text/vtt | WebVTT subtitle file, one cue per VAD-segmented chunk. |

json shape:

{ "text": " full transcript as a single string" }

verbose_json shape — segments and words are always present (empty arrays for backends with no alignment output):

{
  "task": "transcribe",
  "language": "en",
  "duration": 6.42,
  "text": " full transcript",
  "segments": [{ "id": 0, "start": 0.0, "end": 2.31, "text": " ...", "tokens": [], "temperature": 0.0, "avg_logprob": null, "compression_ratio": null, "no_speech_prob": null }],
  "words": [{ "word": " the", "start": 0.0, "end": 0.12 }]
}

Whisper-only confidence fields (avg_logprob, compression_ratio, no_speech_prob) are emitted as null regardless of backend so clients reading them don't crash. tokens is always [].
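If you already hold a verbose_json response, you can derive subtitles from its segments locally instead of issuing a second request with response_format=srt. This is a client-side sketch that assumes only the start, end, and text keys shown above; the function names are hypothetical, and the server's own SRT cue boundaries may differ.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(verbose: dict) -> str:
    """Build SRT text from the segments[] array of a verbose_json response."""
    cues = []
    for n, seg in enumerate(verbose["segments"], start=1):
        start, end = srt_timestamp(seg["start"]), srt_timestamp(seg["end"])
        cues.append(f"{n}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(cues)  # a blank line separates consecutive cues


response = {"segments": [{"id": 0, "start": 0.0, "end": 2.31, "text": " hello there"}]}
print(segments_to_srt(response))
```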

Stereo Diarization

Pass diarization=true and upload a 2-channel file. Left channel = speaker L, right channel = speaker R. Each channel is transcribed independently, the two timelines are merged chronologically by segment start time.

curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@interview-stereo.wav" \
  -F "model=whisper-large-v3-turbo" \
  -F "diarization=true" \
  -F "response_format=verbose_json" | jq

What changes:

  • verbose_json — every segment/word gets "channel": "L" or "R". Segments re-numbered after merge.
  • text / response_format=text — rebuilt as alternating turn lines: L: ...\nR: ...\n.... Consecutive same-channel segments collapsed into one line per turn.
  • srt / vtt — each cue prefixed with L: / R:.

Caveats:

  • Exactly 2 channels required. Mono → 400. >2 channels → 400.
  • Latency ~2× the mono case (model runs sequentially on each channel).
  • The technique is exact for true two-mic setups (interview rigs, podcast splits). It does NOT magically separate speakers from a single-mic recording that's been rendered to stereo.
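The merge described above (tag each channel, sort by segment start, renumber, collapse consecutive same-channel segments into turns) can be sketched as follows. This is an illustration of the described behavior, not the server's code; the function names are hypothetical and tie-breaking between equal start times is an assumption.

```python
def merge_channels(left: list[dict], right: list[dict]) -> list[dict]:
    """Tag per-channel segments, merge by start time, and renumber ids."""
    tagged = [dict(s, channel="L") for s in left] + [dict(s, channel="R") for s in right]
    tagged.sort(key=lambda s: s["start"])  # stable sort: L wins exact ties (assumption)
    return [dict(s, id=i) for i, s in enumerate(tagged)]


def turn_lines(segments: list[dict]) -> str:
    """Collapse consecutive same-channel segments into one line per turn."""
    turns: list[tuple[str, list[str]]] = []
    for seg in segments:
        text = seg["text"].strip()
        if turns and turns[-1][0] == seg["channel"]:
            turns[-1][1].append(text)
        else:
            turns.append((seg["channel"], [text]))
    return "\n".join(f"{ch}: {' '.join(parts)}" for ch, parts in turns)


left = [{"start": 0.0, "end": 1.0, "text": " hi"}, {"start": 3.0, "end": 4.0, "text": " cool"}]
right = [{"start": 1.5, "end": 2.5, "text": " not bad"}]
print(turn_lines(merge_channels(left, right)))
```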

Translation

Canary multitask models can translate speech → text in a non-source language. canary-1b-flash covers en↔de, en↔fr, en↔es. The task is baked into the model slug, not passed per-request — you add a translation-specific slug via custom models.json (see Customizing the model registry):

{
  "models": {
    "canary-1b-flash-de2en": {
      "repo": "nvidia/canary-1b-flash",
      "executor": "canary_multitask",
      "default_source_lang": "de",
      "default_target_lang": "en",
      "default_task": "s2t_translation",
      "languages": ["de"]
    }
  }
}

Then call it normally — text carries the English translation:

curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@german-clip.wav" \
  -F "model=canary-1b-flash-de2en" | jq

canary-180m-flash is English-ASR-only — don't point a translation slug at it. canary-qwen-2.5b is English ASR only too.

Long Files + VAD Chunking

Audio longer than 30 s (TALKIES_VAD_CHUNK_THRESHOLD) gets sliced through Silero VAD into ≤28 s speech regions before being handed to the backend. Timestamps are re-assembled by offsetting each chunk's segment/word timings — you get one continuous segments list spanning the whole file.

No client-side change. Long files just work. Verify by checking duration in verbose_json.
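The timestamp re-assembly amounts to adding each chunk's start offset to its chunk-local timings. A minimal sketch, assuming each chunk arrives as an (offset_seconds, segments) pair in file order; the `reassemble` name and the input shape are illustrative, not the server's internals.

```python
def reassemble(chunks: list[tuple[float, list[dict]]]) -> list[dict]:
    """Shift chunk-local segment times onto the whole-file timeline."""
    merged: list[dict] = []
    for offset, segments in chunks:
        for seg in segments:
            merged.append({
                **seg,
                "id": len(merged),  # renumber across chunks
                "start": round(seg["start"] + offset, 3),
                "end": round(seg["end"] + offset, 3),
            })
    return merged


chunks = [
    (0.0, [{"start": 0.0, "end": 2.0, "text": " first region"}]),
    (30.5, [{"start": 0.5, "end": 1.5, "text": " second region"}]),
]
for seg in reassemble(chunks):
    print(seg["id"], seg["start"], seg["end"], seg["text"])
```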

Error Contract

| Status | Shape | When |
| --- | --- | --- |
| 200 | per response_format | success |
| 400 | {"detail": "..."} | bad audio, mono+diarization, >2 ch+diarization, both/neither of file/file_path, invalid file_path, URL download failure (DNS, HTTP error, size exceeded, SSRF blocked) |
| 401 | {"detail": "..."} | only when TALKIES_AUTH_TOKEN is set: missing/wrong bearer. Includes WWW-Authenticate: Bearer. |
| 404 | {"detail": "..."} | unknown model slug, file_path references missing file, DELETE /api/ps/{slug} on unloaded model, /v1/files/{path} GET/DELETE on missing |
| 413 | {"detail": "..."} | upload exceeded TALKIES_MAX_UPLOAD_BYTES (multipart file and PUT /v1/files/{path} only, not file_path URL) |
| 422 | {"detail": [...]} | Pydantic validation (missing fields, wrong types) |
| 500 | {"detail": "..."} | unhandled backend failure |
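Because detail is a string for most errors but a list for 422, client code benefits from one normalizing helper. The sketch below assumes the 422 list items follow the usual FastAPI/Pydantic shape (loc and msg keys); that shape is an assumption, since this document only specifies {"detail": [...]}. `error_message` is a hypothetical name.

```python
import json


def error_message(status: int, body: str) -> str:
    """Turn a talkies error response into a single readable line."""
    try:
        detail = json.loads(body).get("detail")
    except (ValueError, AttributeError):  # not JSON, or not a JSON object
        return f"HTTP {status}: {body.strip()}"
    if isinstance(detail, list):  # 422: list of validation errors
        parts = []
        for item in detail:
            if isinstance(item, dict):
                loc = ".".join(str(p) for p in item.get("loc", []))
                msg = str(item.get("msg", ""))
                parts.append(f"{loc}: {msg}" if loc else msg)
            else:
                parts.append(str(item))
        detail = "; ".join(parts)
    return f"HTTP {status}: {detail}"


print(error_message(404, '{"detail": "unknown model slug"}'))
```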

API — POST /v1/audio/speech (TTS)

JSON body (not multipart). Returns the encoded audio bytes in the body with the matching Content-Type — no JSON envelope.

curl -s $TALKIES_URL/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
        "model": "kokoro-82m",
        "input": "The quick brown fox jumps over the lazy dog.",
        "voice": "af_heart",
        "response_format": "mp3",
        "speed": 1.0
      }' \
  --output fox.mp3

Request Body

| Field | Required | Default | Notes |
| --- | --- | --- | --- |
| model | yes | | TTS model slug. kokoro-82m or qwen3-tts-0.6b. Unknown → 404. ASR slug → 400. |
| input | yes | | Text to synthesize. Empty / whitespace-only → 400. No fixed length cap; for very long inputs split client-side. |
| voice | no | model default_voice (af_heart for kokoro-82m; alloy for qwen3-tts-0.6b) | Voice catalog per model: call GET /v1/audio/voices and filter by .model. Unknown → 400 with catalog listed. |
| response_format | no | mp3 | mp3 / opus / aac / flac / wav / pcm. |
| speed | no | 1.0 | Playback rate, Kokoro only. Clamped to [0.25, 4.0]. Ignored by qwen3-tts-0.6b (no speed control in Qwen3-TTS). |
| instructions | no | | Accepted, ignored (neither engine has an instruction-conditioning input). |
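A client can assemble the JSON body and apply the same constraints locally, which makes the effective speed visible before the request is sent. This is a sketch of the documented rules only (non-empty input, speed clamped to [0.25, 4.0]); `speech_payload` is a hypothetical helper.

```python
from typing import Optional


def speech_payload(model: str, text: str, voice: Optional[str] = None,
                   response_format: str = "mp3", speed: float = 1.0) -> dict:
    """Build a /v1/audio/speech body, mirroring the documented constraints."""
    if not text.strip():
        raise ValueError("input must not be empty or whitespace-only")
    payload = {
        "model": model,
        "input": text,
        "response_format": response_format,
        # Same range the server clamps to; only Kokoro acts on it.
        "speed": min(4.0, max(0.25, float(speed))),
    }
    if voice is not None:  # omit to get the model's default_voice
        payload["voice"] = voice
    return payload


print(speech_payload("kokoro-82m", "Hello from talkies.", voice="af_heart", speed=9))
```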

Output Formats

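response_format picks the encoder applied to the engine's raw mono PCM. ffmpeg does the conversion in-process; no temp files. The 24 kHz figures listed for wav and pcm describe Kokoro output; Qwen3-TTS audio originates at 12 kHz.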

| response_format | Content-Type | Codec / container | Notes |
| --- | --- | --- | --- |
| mp3 (default) | audio/mpeg | libmp3lame, 128 kbps CBR | Most universal. |
| opus | audio/ogg | libopus, 64 kbps VBR, Ogg container | Best quality-per-byte for speech. |
| aac | audio/aac | AAC-LC, 128 kbps, ADTS | iOS-friendly. |
| flac | audio/flac | FLAC | Lossless. |
| wav | audio/wav | PCM s16le, 24 kHz mono, RIFF header | Lossless, largest. |
| pcm | application/octet-stream | Raw PCM s16le, 24 kHz mono; no container, no header | Real-time chaining. Caller must know sample rate / format. |
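Since pcm output carries no header, the caller has to supply the sample rate when playing or storing it. The sketch below wraps raw s16le mono bytes in a WAV container using Python's standard wave module, with per-model rates taken from the model table in this document (24 kHz for Kokoro, 12 kHz for Qwen3-TTS). The helper name and the lookup table are illustrative.

```python
import io
import wave

# Source sample rates per model, as listed in this document.
SAMPLE_RATES = {"kokoro-82m": 24_000, "qwen3-tts-0.6b": 12_000}


def pcm_to_wav(pcm: bytes, model: str) -> bytes:
    """Wrap raw 16-bit mono PCM from response_format=pcm in a WAV header."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # s16le -> 2 bytes per sample
        w.setframerate(SAMPLE_RATES[model])
        w.writeframes(pcm)
    return buf.getvalue()


silence = b"\x00\x00" * 12_000  # one second of silence at 12 kHz
wav_bytes = pcm_to_wav(silence, "qwen3-tts-0.6b")
print(len(wav_bytes))
```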

Voices

curl -s $TALKIES_URL/v1/audio/voices | jq

Returns {"voices": [{"voice", "model", "default", "origin"}]}. The origin field is only present for engines that distinguish baked-in vs user-supplied voices (currently qwen3-tts-0.6b: "builtin" for image-baked samples, "custom" for /data/custom-voices/ mounts). Kokoro entries omit origin.

Kokoro voices encode <lang_code><gender>_<name>:

| Prefix | Language |
| --- | --- |
| af_ / am_ | American English (female / male) |
| bf_ / bm_ | British English (female / male) |
| ef_ / em_ | Spanish |
| ff_ | French |
| hf_ / hm_ | Hindi |
| if_ / im_ | Italian |
| pf_ / pm_ | Portuguese (Brazilian) |

41 voices ship in the image. Japanese (jf_* / jm_*) and Chinese (zf_* / zm_*) are filtered out because they need the optional misaki[ja] / misaki[zh] extras (MeCab + pypinyin chains).
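The naming scheme can be decoded mechanically, which is handy for grouping the catalog by language in a UI. A sketch built from the prefix table above; `parse_kokoro_voice` is a hypothetical helper, and treating unlisted prefixes (such as the filtered jf_ / zf_ families) as errors is a choice made here, not documented server behavior.

```python
LANGUAGES = {
    "a": "American English", "b": "British English", "e": "Spanish",
    "f": "French", "h": "Hindi", "i": "Italian", "p": "Portuguese (Brazilian)",
}
GENDERS = {"f": "female", "m": "male"}


def parse_kokoro_voice(voice: str) -> tuple[str, str, str]:
    """Split a Kokoro voice like 'af_heart' into (language, gender, name)."""
    prefix, sep, name = voice.partition("_")
    if (not sep or not name or len(prefix) != 2
            or prefix[0] not in LANGUAGES or prefix[1] not in GENDERS):
        raise ValueError(f"not a shipped Kokoro voice name: {voice!r}")
    return LANGUAGES[prefix[0]], GENDERS[prefix[1]], name


print(parse_kokoro_voice("af_heart"))
print(parse_kokoro_voice("if_sara"))
```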

Qwen3-TTS voices come from two on-disk dirs merged into one catalog:

  • /opt/talkies/qwen3-voices/ — baked into the CUDA image. Ships three curated samples (alloy, echo, fable) so voice cloning works out-of-the-box. origin=builtin.
  • /data/custom-voices/ — host-mounted via the data volume. Drop foo/bar/me.wav and voice foo/bar/me immediately appears in GET /v1/audio/voices (catalog is rescanned per request — no restart). origin=custom.

Voice names are the wav's path relative to its parent dir with .wav stripped — nested subdirs are preserved. custom-voices/team-a/jane.wav → voice team-a/jane. Custom voices shadow builtin voices with the same name; dropping a custom-voices/alloy.wav overrides the builtin alloy sample (its origin flips to custom).
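The naming and shadowing rules can be illustrated with two small functions. This is a sketch of the described behavior using pure path arithmetic (no filesystem access); the function names are hypothetical and the real scanner may differ in details.

```python
from pathlib import PurePosixPath


def voice_name(root: str, wav_path: str) -> str:
    """Voice name = wav path relative to its voices dir, .wav stripped."""
    rel = PurePosixPath(wav_path).relative_to(root)
    return str(rel.with_suffix(""))


def merge_catalog(builtin: list[str], custom: list[str]) -> dict[str, str]:
    """Map voice name -> origin; custom entries shadow builtin ones."""
    catalog = {name: "builtin" for name in builtin}
    catalog.update({name: "custom" for name in custom})
    return catalog


name = voice_name("/data/custom-voices", "/data/custom-voices/team-a/jane.wav")
print(name)
print(merge_catalog(["alloy", "echo", "fable"], [name, "alloy"]))
```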

Optional sibling metadata next to each <name>.wav:

  • <name>.txt — reference transcript for the clip (ICL voice cloning works without it, but clone fidelity is noticeably better with a faithful transcript).
  • <name>.lang — language label string (defaults to English).

Path-traversal guard: hostile symlinks whose resolve() escapes the voices dir are skipped (the wav can't be used to read arbitrary host files as a voice prompt).

# Add a custom clone voice (server picks it up on next request — no restart).
mkdir -p ~/talkies-data/custom-voices/team-a
cp jane-reading.wav ~/talkies-data/custom-voices/team-a/jane.wav
echo "And the silken sad uncertain rustling of each purple curtain." \
  > ~/talkies-data/custom-voices/team-a/jane.txt

# Use it.
curl -s $TALKIES_URL/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-tts-0.6b",
        "input": "Hello from a cloned voice.",
        "voice": "team-a/jane",
        "response_format": "wav"
      }' \
  --output cloned.wav

First synth is slow on Qwen3-TTS — the predictor + talker CUDA graphs are captured on first call (~30-60 s on a mid-range GPU). Subsequent generations are sub-second. The model and graphs stay resident until evicted by sibling load or the idle sweeper.

Error Contract (TTS)

| Status | When |
| --- | --- |
| 200 | success (audio bytes in body) |
| 400 | empty input, unknown voice, unsupported response_format, model isn't TTS (e.g. POSTing whisper-large-v3 here) |
| 401 | TALKIES_AUTH_TOKEN set, missing / wrong bearer |
| 404 | unknown model slug |
| 422 | Pydantic validation (missing required fields, wrong types) |
| 500 | unhandled ffmpeg or kokoro internal failure |
| 503 | TTS snapshot files missing under ${TALKIES_DATA_DIR}/models/<slug>/ (slug excluded from TALKIES_ENABLED_MODELS but still being called); or qwen3-tts-0.6b requested on a non-CUDA device (the backend hard-fails at load time) |

Resource-Management Endpoints (Ollama-Style)

talkies mirrors a subset of speaches / Ollama, so a LiteLLM proxy can drive both.

| Endpoint | Behavior |
| --- | --- |
| GET /healthz | Unauthenticated liveness. Returns {ok, device, models}. |
| GET /v1/models | OpenAI-style list of configured slugs. Each entry includes a modality field (asr or tts) so clients can filter. |
| GET /api/ps | Currently-loaded models with per-model idle_seconds. |
| DELETE /api/ps/{model_id} | Evict one model. Slug can be URL-encoded (/ → %2F). 404 if not loaded. |
| POST /unload | Evict every loaded model. Returns the list actually unloaded. |

Behind these: an idle sweeper runs every TALKIES_SWEEPER_INTERVAL s (default 60) and unloads anything not used in TALKIES_MODEL_TTL s (default 600). Set TALKIES_MODEL_TTL=0 to disable.

There's also sibling eviction at request time — every transcribe or speech request evicts other loaded models so VRAM doesn't get split. ASR and TTS share the same pool; loading Kokoro evicts a resident Whisper and vice versa. One model resident at a time, per container. If you need two models simultaneously, run two containers.

# Which models are loaded right now.
curl -s $TALKIES_URL/api/ps | jq

# Free VRAM after a job — evict one model.
curl -s -X DELETE "$TALKIES_URL/api/ps/whisper-large-v3-turbo"

# Or evict everything.
curl -s -X POST $TALKIES_URL/unload | jq
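To reason about when a cold load will happen, it can help to model the two eviction paths together: sibling eviction on every request and the TTL-based idle sweep. The class below is a toy simulation under the documented defaults, not the server's implementation; `ModelPool` and its methods are hypothetical, and whether the sweep uses a strict or inclusive comparison is an assumption.

```python
class ModelPool:
    """Toy model of the one-model-resident pool with an idle TTL."""

    def __init__(self, ttl: float = 600.0):
        self.ttl = ttl
        self.loaded: dict[str, float] = {}  # slug -> last-used timestamp

    def use(self, slug: str, now: float) -> list[str]:
        """Serve a request: evict siblings, (re)load slug, return evicted slugs."""
        evicted = [s for s in self.loaded if s != slug]
        self.loaded = {slug: now}
        return evicted

    def sweep(self, now: float) -> list[str]:
        """Idle sweeper pass: unload models unused for ttl seconds (ttl=0 disables)."""
        if self.ttl <= 0:
            return []
        stale = [s for s, used in self.loaded.items() if now - used >= self.ttl]
        for s in stale:
            del self.loaded[s]
        return stale


pool = ModelPool()
pool.use("whisper-large-v3-turbo", now=0)
print(pool.use("kokoro-82m", now=10))  # the resident ASR model is evicted
print(pool.sweep(now=700))             # Kokoro has been idle longer than the TTL
```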

Server-Side File Staging (/v1/files)

For repeated transcribes of the same file (different response_format, different model, iterating on params), stage the file once and reference it by path. Files land under ${TALKIES_DATA_DIR}/files/<path>.

| Endpoint | Behavior |
| --- | --- |
| GET /v1/files | List every staged file. Returns {"files": [{"path", "size", "modified"}]}. |
| PUT /v1/files/{path} | Upload raw bytes (--data-binary @local-file). Capped at TALKIES_MAX_UPLOAD_BYTES. Atomic write (.part → rename). |
| GET /v1/files/{path} | Streams file back. Content-Type guessed by extension. 404 if missing. |
| DELETE /v1/files/{path} | Removes file and prunes empty parent dirs. 404 if missing. |

# Stage once.
curl -X PUT --data-binary @lecture.mp3 \
  -H "Content-Type: audio/mpeg" \
  $TALKIES_URL/v1/files/lectures/2026-03-15/lecture.mp3

# Reuse across multiple transcribe calls.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file_path=lectures/2026-03-15/lecture.mp3" \
  -F "model=whisper-large-v3-turbo" \
  -F "response_format=verbose_json" | jq

# Cleanup.
curl -X DELETE $TALKIES_URL/v1/files/lectures/2026-03-15/lecture.mp3

Path safety: null bytes, backslashes, . / .. segments and double slashes are rejected (400). Symlinks pointing outside the root are refused. Leading / is stripped — /foo/bar.mp3 and foo/bar.mp3 resolve identically.
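The lexical part of these rules is simple enough to pre-validate on the client. The sketch below covers only the string checks; the symlink check needs filesystem access on the server and is out of scope. `normalize_staging_path` is a hypothetical name, and stripping exactly one leading slash (so that "//foo" still counts as a double slash) is an assumption about how the two rules interact.

```python
def normalize_staging_path(path: str) -> str:
    """Apply the documented lexical path rules; raise ValueError on rejection."""
    if "\x00" in path or "\\" in path:
        raise ValueError("null bytes and backslashes are not allowed")
    if path.startswith("/"):
        path = path[1:]  # a leading / is stripped
    segments = path.split("/")
    # An empty segment means a double (or trailing) slash, or an empty path.
    if any(seg in ("", ".", "..") for seg in segments):
        raise ValueError("empty, '.' and '..' segments are not allowed")
    return "/".join(segments)


print(normalize_staging_path("/lectures/2026-03-15/lecture.mp3"))
```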

URL file_path (Download + Cache)

file_path also accepts http:// / https:// URLs. First request downloads to ${TALKIES_DATA_DIR}/files/downloads/<sha256(url)[:16]>-<basename>, subsequent requests with the same URL hit the cache.

# First call: downloads, transcribes off the cached copy.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file_path=https://example.com/podcasts/ep-042.mp3" \
  -F "model=whisper-large-v3-turbo" | jq

# Second call: same URL → cache hit, no re-download.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file_path=https://example.com/podcasts/ep-042.mp3" \
  -F "model=canary-1b-flash" \
  -F "response_format=srt" > ep-042.srt

Downloads appear in GET /v1/files listings under downloads/. Invalidate a single cached URL with DELETE /v1/files/downloads/<key>.
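To find the <key> for a URL without listing the directory, the cache path can be recomputed from the naming scheme above. This sketch assumes the key is the first 16 hex characters of the SHA-256 of the UTF-8 URL string and that the basename is the last path component of the URL; both are plausible readings of the scheme rather than confirmed details, and the "download" fallback is invented for the example.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def download_cache_path(url: str) -> str:
    """Guess the staging path for a cached URL: downloads/<sha256[:16]>-<basename>."""
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    basename = PurePosixPath(urlsplit(url).path).name or "download"  # fallback is hypothetical
    return f"downloads/{key}-{basename}"


print(download_cache_path("https://example.com/podcasts/ep-042.mp3"))
```

Comparing the result against GET /v1/files before issuing a DELETE is a cheap way to confirm the assumptions hold for your deployment.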

Constraints applied during download:

  • Size capped by TALKIES_MAX_DOWNLOAD_BYTES (default 1 GiB).
  • 5 redirect hops max; SSRF guard re-applied at every hop.
  • 10 s connect, 300 s per-chunk read timeout.
  • SSRF guard is off by default. Set TALKIES_BLOCK_PRIVATE_DOWNLOADS=true to reject URLs whose hostname resolves to private/loopback/link-local/multicast/reserved IPs.
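The address classes named in the SSRF guard map directly onto properties of Python's standard ipaddress module. The sketch below classifies an already-resolved IP; DNS resolution and the per-redirect re-check are left out, and `is_blocked_address` is a hypothetical name rather than the server's function.

```python
import ipaddress


def is_blocked_address(ip: str) -> bool:
    """True if the IP falls in a class the SSRF guard is documented to reject."""
    addr = ipaddress.ip_address(ip)
    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_multicast
        or addr.is_reserved
    )


for ip in ["127.0.0.1", "10.1.2.3", "169.254.169.254", "8.8.8.8"]:
    print(ip, is_blocked_address(ip))
```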

MCP Endpoint (/v1/mcp)

talkies exposes a Model Context Protocol server over Streamable HTTP at /v1/mcp. Same FastAPI process, same BACKENDS / REGISTRY, same auth middleware — a model loaded by the MCP transcribe tool is the same instance the HTTP endpoint sees.

MCP exposes the ASR surface only. TTS (/v1/audio/speech) is HTTP-only — generated audio bytes don't round-trip through JSON-RPC cleanly. list_models filters out TTS slugs so transcribe only ever sees ASR backends.

| Tool | What it does |
| --- | --- |
| list_models | Discover ASR slugs (TTS slugs are filtered out). Returns [{slug, executor, default_source_lang, default_target_lang, default_task, loaded}]. |
| transcribe | Run ASR on a file_path (URL or staged path). Args: model, language?, response_format? (json/verbose_json/text/srt/vtt), diarization?. JSON formats return a JSON-encoded string; text/srt/vtt return raw. |
| list_files | Same payload as GET /v1/files. |
| put_file | Upload to staging. Body is base64 (content_base64). Decoded size capped at TALKIES_MAX_UPLOAD_BYTES. For big files, prefer PUT /v1/files/{path} over HTTP, since JSON-RPC + base64 chews token budget. |
| get_file | Read a staged file as base64. Same size cap. Same advice: for big bytes, hit GET /v1/files/{path} over HTTP. |
| delete_file | Remove a staged file, prune empty parents. |

The transport requires Accept: application/json, text/event-stream. Wire it into Claude Code:

claude mcp add --transport http talkies $TALKIES_URL/v1/mcp

With auth:

claude mcp add --transport http talkies $TALKIES_URL/v1/mcp \
  --header "Authorization: Bearer $TALKIES_AUTH_TOKEN"

Note: the canonical mount path is /v1/mcp/ (trailing slash). Bare /v1/mcp is rewritten internally to /v1/mcp/ so clients that don't follow Starlette's 307 redirect work too.

Raw JSON-RPC

For debugging or non-MCP-aware callers, hit it as JSON-RPC over HTTP POST:

# tools/list
curl -s $TALKIES_URL/v1/mcp/ \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}'

# tools/call
curl -s $TALKIES_URL/v1/mcp/ \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{
    "jsonrpc": "2.0", "id": 2, "method": "tools/call",
    "params": {
      "name": "transcribe",
      "arguments": {
        "file_path": "https://example.com/clip.mp3",
        "model": "whisper-large-v3-turbo",
        "response_format": "json"
      }
    }
  }'
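When scripting against the endpoint, building the JSON-RPC envelope in code avoids hand-escaping nested JSON. The helpers below produce the same payloads as the curl calls above; they are hypothetical conveniences and do not perform any HTTP themselves.

```python
import json
from typing import Optional

# Headers the Streamable HTTP transport requires, per this document.
MCP_HEADERS = {
    "Content-Type": "application/json",
    "Accept": "application/json, text/event-stream",
}


def mcp_request(request_id: int, method: str, params: Optional[dict] = None) -> str:
    """Serialize a JSON-RPC 2.0 request body."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


def transcribe_call(request_id: int, file_path: str, model: str,
                    response_format: str = "json") -> str:
    """Body for tools/call on the transcribe tool."""
    arguments = {"file_path": file_path, "model": model, "response_format": response_format}
    return mcp_request(request_id, "tools/call", {"name": "transcribe", "arguments": arguments})


print(mcp_request(1, "tools/list"))
print(transcribe_call(2, "https://example.com/clip.mp3", "whisper-large-v3-turbo"))
```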

Bearer-Token Auth

If TALKIES_AUTH_TOKEN is set on the server, every route except /healthz and CORS preflight (OPTIONS) requires Authorization: Bearer <token>. Wrong/missing token returns 401 with WWW-Authenticate: Bearer. Compared with hmac.compare_digest (constant-time).

curl -H "Authorization: Bearer $TALKIES_AUTH_TOKEN" $TALKIES_URL/v1/models

Empty / unset token = wide open. For untrusted networks, combine the token with a reverse proxy doing TLS + rate limiting.
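For anyone putting a similar check in a proxy or test double, the comparison described above looks roughly like this. It is a sketch of the documented behavior (open when no token is configured, constant-time comparison via hmac.compare_digest otherwise); `is_authorized` is a hypothetical name and the case-insensitive match on the "Bearer" scheme is an assumption.

```python
import hmac
from typing import Optional


def is_authorized(authorization: Optional[str], token: str) -> bool:
    """Check an Authorization header against the configured bearer token."""
    if not token:  # empty / unset token: wide open
        return True
    if not authorization:
        return False
    scheme, _, presented = authorization.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison; bytes avoid the ASCII-only limit for str inputs.
    return hmac.compare_digest(presented.strip().encode(), token.encode())


print(is_authorized("Bearer s3cret", "s3cret"))
print(is_authorized("Bearer nope", "s3cret"))
```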

Typical Workflows

Quick one-off transcribe

curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-large-v3-turbo" | jq -r .text

Generate subtitles for a video

ffmpeg -i video.mp4 -vn -acodec libmp3lame audio.mp3
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-large-v3" \
  -F "response_format=srt" > video.srt
# burn in:  ffmpeg -i video.mp4 -vf subtitles=video.srt -c:a copy video-subbed.mp4

Iterate on the same file with different settings

# Stage once.
curl -X PUT --data-binary @lecture.mp3 \
  -H "Content-Type: audio/mpeg" \
  $TALKIES_URL/v1/files/work/lecture.mp3

# Try different models / formats without re-uploading.
for fmt in json verbose_json srt; do
  curl -s $TALKIES_URL/v1/audio/transcriptions \
    -F "file_path=work/lecture.mp3" \
    -F "model=whisper-large-v3-turbo" \
    -F "response_format=$fmt" > "lecture.$fmt"
done

# Cleanup.
curl -X DELETE $TALKIES_URL/v1/files/work/lecture.mp3

Diarized interview transcript

curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@interview-stereo.wav" \
  -F "model=whisper-large-v3-turbo" \
  -F "diarization=true" \
  -F "response_format=text"
# stdout:
#   L: hi how's it going
#   R: not bad you
#   L: cool man

Synthesize speech from text

# Default voice, MP3 output.
curl -s $TALKIES_URL/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro-82m","input":"Greetings, human."}' \
  --output greetings.mp3

# Pick a voice from GET /v1/audio/voices, choose a format.
curl -s $TALKIES_URL/v1/audio/voices | jq -r '.voices[].voice'
curl -s $TALKIES_URL/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
        "model": "kokoro-82m",
        "input": "Buongiorno, mondo.",
        "voice": "if_sara",
        "response_format": "opus"
      }' \
  --output ciao.opus

Free VRAM after a job

curl -s -X POST $TALKIES_URL/unload | jq

Bulk transcribe from URLs

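# One URL per line; read -r plus quoting keeps URLs containing ?, & or * intact.
while IFS= read -r url; do
  [ -n "$url" ] || continue   # skip blank lines
  curl -s "$TALKIES_URL/v1/audio/transcriptions" \
    -F "file_path=$url" \
    -F "model=whisper-large-v3-turbo" \
    -F "response_format=text"
  echo "---"
done < urls.txt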

The first hit on each URL downloads + caches; re-running the loop reuses the cached copies instead of downloading again.

For a fuller bulk-transcribe driver (mix of local paths + URLs, per-input output files, error reporting, optional diarization) see scripts/bulk_transcribe.sh:

TALKIES_URL=http://localhost:8000 \
TALKIES_MODEL=whisper-large-v3-turbo \
TALKIES_FORMAT=srt \
TALKIES_OUTDIR=./subs \
  bash scripts/bulk_transcribe.sh inputs.txt

Tips

  1. Use whisper-large-v3-turbo as your default — it's the speed/quality sweet spot for general-purpose ASR. Switch to whisper-large-v3 only when you need the last few % of accuracy on hard audio.
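  2. URL file_path over multipart upload: if the audio is already at a URL, send the URL. Saves bandwidth (the file isn't downloaded to the client and then uploaded again), gets cached server-side, and is exempt from the 100 MB upload cap (URL downloads have their own TALKIES_MAX_DOWNLOAD_BYTES cap, default 1 GiB).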
  3. Stage repeated files via PUT /v1/files/{path} and call with file_path= to avoid re-uploading on every retry/iteration.
  4. response_format=text for the "just give me the string" case — no jq -r .text needed, content-type is text/plain.
  5. One model at a time: every transcribe or speech request evicts other loaded models. Don't try to fan out two calls against two different models on the same container; the second one evicts the first and reloads. Use two containers if you actually need concurrency on different models.
  6. POST /unload after a job — explicit eviction frees VRAM/RAM faster than waiting for the 10-min idle sweeper. Useful in CI / batch scripts.
  7. canary-qwen-2.5b has no timestamps: verbose_json.segments / .words come back empty, srt/vtt collapse to one cue. Use a Whisper or Canary multitask slug if you need timing data.
  8. Diarization requires true stereo — if your "stereo" file is the same mono signal copied to both channels, diarization won't separate speakers. The technique is exact for two-mic setups, useless otherwise.
  9. Long files just work — VAD chunking happens transparently. Don't pre-split. Send the whole file.
  10. prompt / temperature / instructions are ignored even though the request schemas accept them. Don't expect them to do anything.
  11. Watch /api/ps to see what's resident. A request that hangs at "loading model" is doing the first cold load — subsequent calls are fast.
  12. Customizing the model registry for translation slugs or to restrict the served set — see references/setup.md.
  13. Kokoro uses native voice names — no OpenAI aliases. Hit GET /v1/audio/voices once to discover what's shipped; pass the voice field accordingly. The 41 voices cover en (US + UK), es, fr, hi, it, pt; ja/zh are filtered out.
  14. Voice cloning is qwen3-tts-0.6b — drop a .wav (10-30 s of clean speech is plenty) into /data/custom-voices/<anywhere>.wav. Optionally drop a sibling .txt with a faithful transcript for higher clone fidelity. The voice appears in GET /v1/audio/voices on the next request — no restart. CUDA required.
  15. Qwen3-TTS first synth is slow — CUDA graph capture runs once after model load (~30-60 s). Subsequent synths are sub-second. If you're benchmarking, throw away the first call.
  16. Qwen3-TTS ignores speed — the model has no playback-rate control. Pass it for OpenAI compat; nothing happens. Only Kokoro honors speed.
  17. Different TTS sample rates — Kokoro emits 24 kHz mono PCM; Qwen3-TTS emits 12 kHz mono PCM. ffmpeg re-encodes both into your chosen response_format, but if you select pcm (raw, no container), you must know the source rate per model to play it back correctly.
  18. TTS response_format=pcm is for chaining — raw int16 mono PCM, no container, no header. Use it when piping into another encoder or a real-time playback path. Otherwise stick with mp3 (default) or opus for size.
  19. TTS evicts loaded ASR and vice versa — they share the same one-model-resident pool. Synthesizing with Kokoro after a transcribe burst incurs Kokoro's cold load. Same applies to Qwen3-TTS (plus the CUDA-graph capture re-runs on cold reload).