talkies

Self-hosted OpenAI-compatible speech service. /v1/audio/transcriptions fronts six open ASR models (Whisper, Parakeet, Canary); /v1/audio/speech fronts two TTS engines: Kokoro-82M (41 baked voices) and Qwen3-TTS-0.6B (CUDA-only voice cloning from user-mounted .wav reference clips). Same wire format as OpenAI: change the base URL and the model slug. Also includes stereo diarization, URL fetching, an MCP endpoint, and bearer auth.

Install

openclaw skills install talkies

talkies

Self-hosted speech service — ASR and TTS, one container. OpenAI-compatible wire shape on both endpoints; point an OpenAI client at it, change the model slug, done.

ASR (POST /v1/audio/transcriptions): six backends — whisper-large-v3, whisper-large-v3-turbo, parakeet-tdt-0.6b-v3, canary-180m-flash, canary-1b-flash, canary-qwen-2.5b.

TTS (POST /v1/audio/speech): two engines — kokoro-82m with 41 baked voices across en/es/fr/hi/it/pt, and qwen3-tts-0.6b for CUDA-only voice cloning from reference clips (three builtin samples plus any .wav you drop into /data/custom-voices/, including nested subdirs). Both discovered via GET /v1/audio/voices.

Extras: stereo diarization on transcription, URL file_path fetching, server-side file staging, MCP endpoint with 6 ASR-side tools, optional bearer-token auth.

For installation, configuration, and container setup, see references/setup.md.

When To Use

Transcribe audio files (any format ffmpeg decodes — WAV, MP3, M4A, FLAC, OGG, WebM, Opus, MP4 audio).
Generate SRT/VTT subtitles for video.
Transcribe podcasts, lectures, interviews, voicemails, calls.
Stereo two-mic recordings → per-speaker diarized output (L: / R: channel tagging).
German/French/Spanish ↔ English speech-to-text translation via Canary-1B-Flash.
Synthesize speech from text via Kokoro-82M — English (American + British), Spanish, French, Hindi, Italian, Portuguese.
Voice-clone speech via Qwen3-TTS-0.6B from a reference .wav you provide — drop into /data/custom-voices/, immediately appears under GET /v1/audio/voices with origin=custom.
Drop-in replacement for api.openai.com/v1/audio/transcriptions and api.openai.com/v1/audio/speech in existing client code.

When NOT To Use

Real-time / streaming output — both endpoints are request/response only.
Speaker identification from voice (only stereo-channel diarization is supported, not voice clustering).
Per-request prompt / temperature (transcribe) or instructions (speech) injection — fields accepted for compat, ignored.
Japanese / Chinese TTS — Kokoro upstream supports them but talkies filters those voices out (they need the misaki[ja] / misaki[zh] extras).
Kokoro on OpenAI aliases (alloy, echo, fable, onyx, nova, shimmer) — Kokoro exposes its native voice names only (af_*, bm_*, etc.). Map client-side. (Qwen3-TTS does ship alloy / echo / fable as builtin voice slugs, but they're voice-cloned samples, not OpenAI's voices — there's no audio compatibility.)
qwen3-tts-0.6b on CPU — voice cloning hard-fails without CUDA at load time. The faster_qwen3_tts upstream raises ValueError on non-CUDA devices; talkies surfaces this as a load failure on the first request.
qwen3-tts-0.6b speed parameter — Qwen3-TTS has no playback-rate control. Field is accepted for OpenAI compat but ignored (only Kokoro honors speed).
arm64 hosts — linux/amd64 only.
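
The client-side alias mapping mentioned above for Kokoro could look like the sketch below. `kokoro_voice` is a hypothetical helper, and the target names on the right are assumptions based on upstream Kokoro voice naming; confirm each one appears in GET /v1/audio/voices before relying on it. The `shimmer` entry is an arbitrary stand-in, since no same-named Kokoro voice is assumed.

```shell
# Hypothetical helper: translate an OpenAI voice alias into a Kokoro voice name.
# The right-hand names are assumptions; verify them via GET /v1/audio/voices.
kokoro_voice() {
  case "$1" in
    alloy)   printf '%s\n' af_alloy ;;
    echo)    printf '%s\n' am_echo ;;
    fable)   printf '%s\n' bm_fable ;;
    onyx)    printf '%s\n' am_onyx ;;
    nova)    printf '%s\n' af_nova ;;
    shimmer) printf '%s\n' af_heart ;;   # arbitrary stand-in, not an equivalent voice
    *)       printf '%s\n' "$1" ;;       # already a native name: pass through
  esac
}
```

A caller would then send `"voice": "$(kokoro_voice alloy)"` in the speech request body. The mapped voices will not sound like OpenAI's voices; the mapping only keeps existing client code from failing with a 400 on an unknown voice.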

Setup

The container should already be running. Set the base URL:

export TALKIES_URL=http://localhost:8000

If the server has TALKIES_AUTH_TOKEN set, export it too:

export TALKIES_AUTH_TOKEN=<your-token>
# every request below needs: -H "Authorization: Bearer $TALKIES_AUTH_TOKEN"

Verify: curl $TALKIES_URL/healthz returns {"ok": true, "device": "...", "models": [...]}.
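
To avoid repeating the Authorization header on every call, a small wrapper can add it only when a token is exported. This is a minimal sketch; `talkies_curl` is a hypothetical helper name and is not shipped with talkies.

```shell
# Hypothetical helper: forwards all arguments to curl and adds the bearer
# header only when TALKIES_AUTH_TOKEN is set and non-empty.
talkies_curl() {
  if [ -n "${TALKIES_AUTH_TOKEN:-}" ]; then
    curl -s -H "Authorization: Bearer $TALKIES_AUTH_TOKEN" "$@"
  else
    curl -s "$@"
  fi
}
```

For example, `talkies_curl "$TALKIES_URL/v1/models"` works the same against an open server and a token-protected one.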

For install / configuration / env vars / CPU vs CUDA images / custom model registry, see references/setup.md.

Quick Start

# Discover what's available.
curl -s $TALKIES_URL/v1/models | jq

# Simplest transcribe — file upload, JSON response.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-large-v3-turbo" | jq

# Same call, but the audio lives at a URL — talkies downloads + caches it.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file_path=https://example.com/podcasts/ep-042.mp3" \
  -F "model=whisper-large-v3-turbo" | jq

# Full Whisper-shape JSON with per-segment + per-word timestamps.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-large-v3-turbo" \
  -F "response_format=verbose_json" | jq

# SRT subtitles.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@lecture.mp3" \
  -F "model=whisper-large-v3" \
  -F "response_format=srt" > lecture.srt

# Discover TTS voices, then synthesize an MP3.
curl -s $TALKIES_URL/v1/audio/voices | jq
curl -s $TALKIES_URL/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
        "model": "kokoro-82m",
        "input": "Hello from talkies.",
        "voice": "af_heart",
        "response_format": "mp3"
      }' \
  --output hello.mp3

Supported Models

ASR

Slug	Family	CPU	CUDA	Languages	Strength
`whisper-large-v3`	faster-whisper	yes	yes	99 auto-detect	best accuracy, slowest
`whisper-large-v3-turbo`	faster-whisper	yes	yes	99 auto-detect	sweet spot — fast, accurate
`parakeet-tdt-0.6b-v3`	NeMo TDT	no	yes	English only	very fast on GPU
`canary-180m-flash`	NeMo Canary	yes	yes	English only (small)	smallest, runs anywhere
`canary-1b-flash`	NeMo Canary	no	yes	en/de/fr/es + translation	multilingual, translation
`canary-qwen-2.5b`	NeMo SALM	no	yes	English only	best English accuracy (no timestamps)

Pick by use case:

General-purpose: whisper-large-v3-turbo.
English-only, max accuracy on GPU: canary-qwen-2.5b (but no per-segment timestamps).
Translation EN↔DE/FR/ES: canary-1b-flash (requires custom model registry — see Translation).

TTS

Slug	Family	CPU	CUDA	Languages	Voices
`kokoro-82m`	Kokoro (in-process, 24 kHz)	yes	yes	en (US + UK), es, fr, hi, it, pt	41 baked (discover via `GET /v1/audio/voices`)
`qwen3-tts-0.6b`	Qwen3-TTS (voice clone, 12 kHz)	no	yes	en, zh, ko, ja, fr, de, ru, es, it, pt, pl, nl, ar, vi, th, id, ms (17)	3 builtin samples + any `.wav` under `/data/custom-voices/`

Pick by use case:

General-purpose multi-voice TTS: kokoro-82m — fast, 41 baked voices, runs on CPU.
Voice cloning from a reference clip: qwen3-tts-0.6b — drop a .wav into /data/custom-voices/, immediately usable. CUDA required.

ASR timestamp note: canary-qwen-2.5b produces no segment/word timestamps. verbose_json.segments and .words come back empty, and srt/vtt collapse to a single full-duration cue. Transcription itself is whole-file. Use a Whisper or Canary multitask slug if you need timing.

API — `POST /v1/audio/transcriptions`

Multipart form. Same field names as OpenAI's transcription endpoint where they overlap.

Request Fields

Field	Required	Default	Notes
`file`	one of `file`/`file_path`	—	Audio file. Capped at `TALKIES_MAX_UPLOAD_BYTES` (default 100 MB).
`file_path`	one of `file`/`file_path`	—	Either a path under the staging area (`/v1/files`) or an `http(s)://` URL (downloaded + cached server-side). Not subject to the 100 MB upload cap; URL downloads capped by `TALKIES_MAX_DOWNLOAD_BYTES` (default 1 GiB).
`model`	yes	—	One of the configured slugs (see `GET /v1/models`). Unknown → 404.
`language`	no	model default	ISO-639-1 code. Whisper auto-detects when omitted; Canary uses its `default_source_lang`.
`response_format`	no	`json`	`json` / `text` / `verbose_json` / `srt` / `vtt`.
`timestamp_granularities[]`	no	—	Accepted for OpenAI compat; ignored — `verbose_json` always emits both segment + word.
`prompt`	no	—	Accepted, ignored.
`temperature`	no	—	Accepted, ignored.
`diarization`	no	`false`	Stereo-channel diarization. Requires 2-channel input — mono returns 400.

Exactly one of file or file_path must be set — passing both or neither returns 400.
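
Because exactly one of the two fields may be sent, a script that handles mixed inputs has to pick the field per input. The sketch below shows one way to do that; `transcribe_source_field` is a hypothetical helper, and treating any non-URL, non-local argument as a staged path is an assumption of this example.

```shell
# Hypothetical helper: print the single multipart field to pass to curl -F.
#   http(s) URL          -> file_path=<url>   (server downloads and caches)
#   existing local file  -> file=@<path>      (multipart upload)
#   anything else        -> file_path=<path>  (assumed to be a staged path)
transcribe_source_field() {
  case "$1" in
    http://*|https://*) printf 'file_path=%s\n' "$1" ;;
    *)
      if [ -f "$1" ]; then
        printf 'file=@%s\n' "$1"
      else
        printf 'file_path=%s\n' "$1"
      fi ;;
  esac
}
```

Usage: `curl -s "$TALKIES_URL/v1/audio/transcriptions" -F "$(transcribe_source_field "$input")" -F "model=whisper-large-v3-turbo"`.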

Response Formats

`response_format`	Content-Type	Shape
`json` (default)	`application/json`	`{"text": "..."}` — just the transcript.
`text`	`text/plain`	The transcript as plain text.
`verbose_json`	`application/json`	Full Whisper shape — `task`, `language`, `duration`, `text`, `segments[]`, `words[]`.
`srt`	`application/x-subrip`	SubRip subtitle file, one cue per VAD-segmented chunk.
`vtt`	`text/vtt`	WebVTT subtitle file, one cue per VAD-segmented chunk.

json shape:

{ "text": " full transcript as a single string" }

verbose_json shape — segments and words are always present (empty arrays for backends with no alignment output):

{
  "task": "transcribe",
  "language": "en",
  "duration": 6.42,
  "text": " full transcript",
  "segments": [{ "id": 0, "start": 0.0, "end": 2.31, "text": " ...", "tokens": [], "temperature": 0.0, "avg_logprob": null, "compression_ratio": null, "no_speech_prob": null }],
  "words": [{ "word": " the", "start": 0.0, "end": 0.12 }]
}

Whisper-only confidence fields (avg_logprob, compression_ratio, no_speech_prob) are emitted as null regardless of backend so clients reading them don't crash. tokens is always [].

Stereo Diarization

Pass diarization=true and upload a 2-channel file. Left channel = speaker L, right channel = speaker R. Each channel is transcribed independently, then the two timelines are merged chronologically by segment start time.

curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@interview-stereo.wav" \
  -F "model=whisper-large-v3-turbo" \
  -F "diarization=true" \
  -F "response_format=verbose_json" | jq

What changes:

verbose_json — every segment/word gets "channel": "L" or "R". Segments re-numbered after merge.
text / response_format=text — rebuilt as alternating turn lines: L: ...\nR: ...\n.... Consecutive same-channel segments collapsed into one line per turn.
srt / vtt — each cue prefixed with L: / R:.

Caveats:

Exactly 2 channels required. Mono → 400. >2 channels → 400.
Latency ~2× the mono case (model runs sequentially on each channel).
The technique is exact for true two-mic setups (interview rigs, podcast splits). It does NOT magically separate speakers from a single-mic recording that's been rendered to stereo.
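
The merge-then-collapse behavior described above can be illustrated offline. This is a sketch of the idea only, not talkies' actual implementation; `merge_turns` and its tab-separated input format (start time, channel, text) are assumptions made for the example.

```shell
# Illustration only: merge per-channel segments by start time, then collapse
# consecutive segments from the same channel into one turn line.
# Input on stdin: "<start>\t<channel>\t<text>", one segment per line.
merge_turns() {
  LC_ALL=C sort -t "$(printf '\t')" -k1,1n | LC_ALL=C awk -F '\t' '
    $2 != prev { if (NR > 1) print line; line = $2 ": " $3; prev = $2; next }
    { line = line " " $3 }
    END { if (NR > 0) print line }'
}
```

Feeding it the L-channel segments followed by the R-channel segments yields alternating `L: ...` / `R: ...` lines in chronological order, which is the shape response_format=text returns when diarization is on.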

Translation

Canary multitask models can translate speech → text in a non-source language. canary-1b-flash covers en↔de, en↔fr, en↔es. The task is baked into the model slug, not passed per-request — you add a translation-specific slug via custom models.json (see Customizing the model registry):

{
  "models": {
    "canary-1b-flash-de2en": {
      "repo": "nvidia/canary-1b-flash",
      "executor": "canary_multitask",
      "default_source_lang": "de",
      "default_target_lang": "en",
      "default_task": "s2t_translation",
      "languages": ["de"]
    }
  }
}

Then call it normally — text carries the English translation:

curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@german-clip.wav" \
  -F "model=canary-1b-flash-de2en" | jq

canary-180m-flash is English-ASR-only — don't point a translation slug at it. canary-qwen-2.5b is English ASR only too.

Long Files + VAD Chunking

Audio longer than 30 s (TALKIES_VAD_CHUNK_THRESHOLD) gets sliced through Silero VAD into ≤28 s speech regions before being handed to the backend. Timestamps are re-assembled by offsetting each chunk's segment/word timings — you get one continuous segments list spanning the whole file.

No client-side change. Long files just work. Verify by checking duration in verbose_json.
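
The timestamp re-assembly amounts to adding each chunk's start offset to its chunk-relative timings. The sketch below shows that arithmetic on plain text; `shift_segments` and the "start end text" line format are assumptions for illustration, not part of talkies.

```shell
# Illustration only: shift chunk-relative segment times by the chunk offset.
# Input on stdin: "<start> <end> <text...>"; argument: chunk start in seconds.
shift_segments() {
  LC_ALL=C awk -v off="$1" '{
    s = $1 + off; e = $2 + off          # absolute times in the full file
    $1 = ""; $2 = ""; sub(/^ +/, "")    # keep only the text in $0
    printf "%.2f\t%.2f\t%s\n", s, e, $0
  }'
}
```

For a chunk that begins 28 s into the file, a segment spanning 0.00 to 2.31 within the chunk lands at 28.00 to 30.31 in the merged list.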

Error Contract

Status	Shape	When
200	per `response_format`	success
400	`{"detail": "..."}`	bad audio, mono+diarization, >2 ch+diarization, both/neither of `file`/`file_path`, invalid file_path, URL download failure (DNS, HTTP error, size exceeded, SSRF blocked)
401	`{"detail": "..."}`	only when `TALKIES_AUTH_TOKEN` is set: missing/wrong bearer. Includes `WWW-Authenticate: Bearer`.
404	`{"detail": "..."}`	unknown model slug, `file_path` references missing file, `DELETE /api/ps/{slug}` on unloaded model, `/v1/files/{path}` GET/DELETE on missing
413	`{"detail": "..."}`	upload exceeded `TALKIES_MAX_UPLOAD_BYTES` (multipart `file` and `PUT /v1/files/{path}` only — not `file_path` URL)
422	`{"detail": [...]}`	Pydantic validation (missing fields, wrong types)
500	`{"detail": "..."}`	unhandled backend failure

API — `POST /v1/audio/speech` (TTS)

JSON body (not multipart). Returns the encoded audio bytes in the body with the matching Content-Type — no JSON envelope.

curl -s $TALKIES_URL/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
        "model": "kokoro-82m",
        "input": "The quick brown fox jumps over the lazy dog.",
        "voice": "af_heart",
        "response_format": "mp3",
        "speed": 1.0
      }' \
  --output fox.mp3

Request Body

Field	Required	Default	Notes
`model`	yes	—	TTS model slug. `kokoro-82m` or `qwen3-tts-0.6b`. Unknown → 404. ASR slug → 400.
`input`	yes	—	Text to synthesize. Empty / whitespace-only → 400. No fixed length cap; for very long inputs split client-side.
`voice`	no	model `default_voice` (`af_heart` for `kokoro-82m`; `alloy` for `qwen3-tts-0.6b`)	Voice catalog per model — call `GET /v1/audio/voices` and filter by `.model`. Unknown → 400 with catalog listed.
`response_format`	no	`mp3`	`mp3` / `opus` / `aac` / `flac` / `wav` / `pcm`.
`speed`	no	`1.0`	Playback rate, Kokoro only. Clamped to `[0.25, 4.0]`. Ignored by `qwen3-tts-0.6b` (no speed control in Qwen3-TTS).
`instructions`	no	—	Accepted, ignored (neither engine has an instruction-conditioning input).

Output Formats

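response_format picks the encoder applied to the engine's raw mono PCM. ffmpeg does the conversion in-process; no temp files. The 24 kHz figures in the table are Kokoro's rate; Qwen3-TTS emits 12 kHz, which matters most for headerless pcm.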

`response_format`	Content-Type	Codec / container	Notes
`mp3` (default)	`audio/mpeg`	libmp3lame, 128 kbps CBR	Most universal.
`opus`	`audio/ogg`	libopus, 64 kbps VBR, Ogg container	Best quality-per-byte for speech.
`aac`	`audio/aac`	AAC-LC, 128 kbps, ADTS	iOS-friendly.
`flac`	`audio/flac`	FLAC	Lossless.
`wav`	`audio/wav`	PCM s16le, 24 kHz mono, RIFF header	Lossless, largest.
`pcm`	`application/octet-stream`	Raw PCM s16le, 24 kHz mono — no container, no header	Real-time chaining. Caller must know sample rate / format.

Voices

curl -s $TALKIES_URL/v1/audio/voices | jq

Returns {"voices": [{"voice", "model", "default", "origin"}]}. The origin field is only present for engines that distinguish baked-in vs user-supplied voices (currently qwen3-tts-0.6b — "builtin" for image-baked samples, "custom" for /data/custom-voices/ mounts). Kokoro entries omit origin.

Kokoro voices encode <lang_code><gender>_<name>:

Prefix	Language
`af_` / `am_`	American English (female / male)
`bf_` / `bm_`	British English (female / male)
`ef_` / `em_`	Spanish
`ff_`	French
`hf_` / `hm_`	Hindi
`if_` / `im_`	Italian
`pf_` / `pm_`	Portuguese (Brazilian)

41 voices ship in the image. Japanese (jf_* / jm_*) and Chinese (zf_* / zm_*) are filtered out because they need the optional misaki[ja] / misaki[zh] extras (MeCab + pypinyin chains).
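
The prefix scheme in the table can be decoded mechanically, which is handy when filtering the voice catalog by language. A minimal sketch follows; `kokoro_voice_info` is a hypothetical helper and it only knows the prefixes listed above.

```shell
# Hypothetical helper: decode <lang_code><gender>_<name> into a readable label.
# Returns non-zero for prefixes outside the table (for example jf_* or zf_*).
kokoro_voice_info() {
  case "$1" in
    a?_*) _lang="American English" ;;
    b?_*) _lang="British English" ;;
    e?_*) _lang="Spanish" ;;
    f?_*) _lang="French" ;;
    h?_*) _lang="Hindi" ;;
    i?_*) _lang="Italian" ;;
    p?_*) _lang="Portuguese (Brazilian)" ;;
    *) return 1 ;;
  esac
  case "$1" in
    ?f_*) _gender=female ;;
    ?m_*) _gender=male ;;
    *) return 1 ;;
  esac
  printf '%s, %s\n' "$_lang" "$_gender"
}
```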

Qwen3-TTS voices come from two on-disk dirs merged into one catalog:

/opt/talkies/qwen3-voices/ — baked into the CUDA image. Ships three curated samples (alloy, echo, fable) so voice cloning works out-of-the-box. origin=builtin.
/data/custom-voices/ — host-mounted via the data volume. Drop foo/bar/me.wav and voice foo/bar/me immediately appears in GET /v1/audio/voices (catalog is rescanned per request — no restart). origin=custom.

Voice names are the wav's path relative to its parent dir with .wav stripped — nested subdirs are preserved. custom-voices/team-a/jane.wav → voice team-a/jane. Custom voices shadow builtin voices with the same name; dropping a custom-voices/alloy.wav overrides the builtin alloy sample (its origin flips to custom).
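
The naming rule can be reproduced locally to predict the voice slug for a given file before calling the API. This is a sketch; `voice_name` is a hypothetical helper, and it does not model the shadowing of builtin voices.

```shell
# Hypothetical helper: derive the voice name from a wav path.
#   $1 = path to the .wav file, $2 = voices directory it lives under
voice_name() {
  _rel=${1#"$2"/}                  # path relative to the voices dir
  printf '%s\n' "${_rel%.wav}"     # strip the extension, keep nested subdirs
}
```

For example, `voice_name /data/custom-voices/team-a/jane.wav /data/custom-voices` prints `team-a/jane`, the value to pass in the `voice` field.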

Optional sibling metadata next to each <name>.wav:

<name>.txt — reference transcript for the clip (ICL voice cloning works without it, but clone fidelity is noticeably better with a faithful transcript).
<name>.lang — language label string (defaults to English).

Path-traversal guard: hostile symlinks whose resolve() escapes the voices dir are skipped (the wav can't be used to read arbitrary host files as a voice prompt).

# Add a custom clone voice (server picks it up on next request — no restart).
mkdir -p ~/talkies-data/custom-voices/team-a
cp jane-reading.wav ~/talkies-data/custom-voices/team-a/jane.wav
echo "And the silken sad uncertain rustling of each purple curtain." \
  > ~/talkies-data/custom-voices/team-a/jane.txt

# Use it.
curl -s $TALKIES_URL/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-tts-0.6b",
        "input": "Hello from a cloned voice.",
        "voice": "team-a/jane",
        "response_format": "wav"
      }' \
  --output cloned.wav

First synth is slow on Qwen3-TTS — the predictor + talker CUDA graphs are captured on first call (~30-60 s on a mid-range GPU). Subsequent generations are sub-second. The model and graphs stay resident until evicted by sibling load or the idle sweeper.

Error Contract (TTS)

Status	When
200	success (audio bytes in body)
400	empty `input`, unknown `voice`, unsupported `response_format`, model isn't TTS (e.g. POSTing `whisper-large-v3` here)
401	`TALKIES_AUTH_TOKEN` set, missing / wrong bearer
404	unknown `model` slug
422	Pydantic validation (missing required fields, wrong types)
500	unhandled ffmpeg or kokoro internal failure
503	TTS snapshot files missing under `${TALKIES_DATA_DIR}/models/<slug>/` (slug excluded from `TALKIES_ENABLED_MODELS` but still being called); or `qwen3-tts-0.6b` requested on a non-CUDA device (the backend hard-fails at load time)

Resource-Management Endpoints (Ollama-Style)

talkies mirrors a subset of the speaches / Ollama resource-management API, so a LiteLLM proxy can drive both.

Endpoint	Behavior
`GET /healthz`	Unauthenticated liveness. Returns `{ok, device, models}`.
`GET /v1/models`	OpenAI-style list of configured slugs. Each entry includes a `modality` field (`asr` or `tts`) so clients can filter.
`GET /api/ps`	Currently-loaded models with per-model `idle_seconds`.
`DELETE /api/ps/{model_id}`	Evict one model. Slug can be URL-encoded (`/` → `%2F`). 404 if not loaded.
`POST /unload`	Evict every loaded model. Returns the list actually unloaded.

Behind these: an idle sweeper runs every TALKIES_SWEEPER_INTERVAL s (default 60) and unloads anything not used in TALKIES_MODEL_TTL s (default 600). Set TALKIES_MODEL_TTL=0 to disable.

There's also sibling eviction at request time — every transcribe or speech request evicts other loaded models so VRAM doesn't get split. ASR and TTS share the same pool; loading Kokoro evicts a resident Whisper and vice versa. One model resident at a time, per container. If you need two models simultaneously, run two containers.

# Which models are loaded right now.
curl -s $TALKIES_URL/api/ps | jq

# Free VRAM after a job — evict one model.
curl -s -X DELETE "$TALKIES_URL/api/ps/whisper-large-v3-turbo"

# Or evict everything.
curl -s -X POST $TALKIES_URL/unload | jq

Server-Side File Staging (`/v1/files`)

For repeated transcribes of the same file (different response_format, different model, iterating on params), stage the file once and reference it by path. Files land under ${TALKIES_DATA_DIR}/files/<path>.

Endpoint	Behavior
`GET /v1/files`	List every staged file. Returns `{"files": [{"path", "size", "modified"}]}`.
`PUT /v1/files/{path}`	Upload raw bytes (`--data-binary @local-file`). Capped at `TALKIES_MAX_UPLOAD_BYTES`. Atomic write (`.part` → rename).
`GET /v1/files/{path}`	Streams file back. Content-Type guessed by extension. 404 if missing.
`DELETE /v1/files/{path}`	Removes file and prunes empty parent dirs. 404 if missing.

# Stage once.
curl -X PUT --data-binary @lecture.mp3 \
  -H "Content-Type: audio/mpeg" \
  $TALKIES_URL/v1/files/lectures/2026-03-15/lecture.mp3

# Reuse across multiple transcribe calls.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file_path=lectures/2026-03-15/lecture.mp3" \
  -F "model=whisper-large-v3-turbo" \
  -F "response_format=verbose_json" | jq

# Cleanup.
curl -X DELETE $TALKIES_URL/v1/files/lectures/2026-03-15/lecture.mp3

Path safety: null bytes, backslashes, . / .. segments and double slashes are rejected (400). Symlinks pointing outside the root are refused. Leading / is stripped — /foo/bar.mp3 and foo/bar.mp3 resolve identically.
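
A script can apply the same rules before uploading, so that bad paths fail fast instead of costing a round trip. The check below is a sketch of the documented rules only; `staging_path_ok` is a hypothetical helper, rejecting an empty path is an assumption, and symlink and null-byte handling stay on the server (shell variables cannot hold null bytes).

```shell
# Hypothetical client-side pre-check. The server remains the authority and
# still answers 400 for anything it rejects.
staging_path_ok() {
  _p=${1#/}                          # one leading slash is stripped server-side
  case "$_p" in
    ''|/*|*//*|*\\*) return 1 ;;     # empty, double slash, or backslash
  esac
  case "/$_p/" in
    */./*|*/../*) return 1 ;;        # "." or ".." path segments
  esac
  return 0
}
```

Usage: `staging_path_ok "$dest" && curl -X PUT --data-binary @lecture.mp3 "$TALKIES_URL/v1/files/$dest"`.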

URL `file_path` (Download + Cache)

file_path also accepts http:// / https:// URLs. The first request downloads to ${TALKIES_DATA_DIR}/files/downloads/<sha256(url)[:16]>-<basename>; subsequent requests with the same URL hit the cache.

# First call: downloads, transcribes off the cached copy.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file_path=https://example.com/podcasts/ep-042.mp3" \
  -F "model=whisper-large-v3-turbo" | jq

# Second call: same URL → cache hit, no re-download.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file_path=https://example.com/podcasts/ep-042.mp3" \
  -F "model=canary-1b-flash" \
  -F "response_format=srt" > ep-042.srt

Downloads appear in GET /v1/files listings under downloads/. Invalidate a single cached URL with DELETE /v1/files/downloads/<key>.
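
To invalidate a cached URL without first listing files, the key can be predicted from the naming pattern above. The sketch assumes the key is the first 16 hex characters of the SHA-256 of the URL string exactly as sent, and that the basename is the last path segment; `cache_path` is a hypothetical helper, and how talkies treats query strings is not covered here. When in doubt, confirm against GET /v1/files.

```shell
# Hypothetical helper: predict the staged path of a cached URL download.
# Assumes hex SHA-256 over the raw URL string; falls back to shasum when
# sha256sum is not installed.
cache_path() {
  if command -v sha256sum >/dev/null 2>&1; then
    _digest=$(printf '%s' "$1" | sha256sum)
  else
    _digest=$(printf '%s' "$1" | shasum -a 256)
  fi
  printf 'downloads/%s-%s\n' "$(printf '%s' "$_digest" | cut -c1-16)" "${1##*/}"
}
```

Usage: `curl -X DELETE "$TALKIES_URL/v1/files/$(cache_path "$url")"`; a 404 means either the URL was never cached or the assumed key format does not match.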

Constraints applied during download:

Size capped by TALKIES_MAX_DOWNLOAD_BYTES (default 1 GiB).
5 redirect hops max; SSRF guard re-applied at every hop.
10 s connect, 300 s per-chunk read timeout.
The SSRF guard is off by default. Set TALKIES_BLOCK_PRIVATE_DOWNLOADS=true to reject URLs whose hostname resolves to private/loopback/link-local/multicast/reserved IPs.

MCP Endpoint (`/v1/mcp`)

talkies exposes a Model Context Protocol server over Streamable HTTP at /v1/mcp. Same FastAPI process, same BACKENDS / REGISTRY, same auth middleware — a model loaded by the MCP transcribe tool is the same instance the HTTP endpoint sees.

MCP exposes the ASR surface only. TTS (/v1/audio/speech) is HTTP-only — generated audio bytes don't round-trip through JSON-RPC cleanly. list_models filters out TTS slugs so transcribe only ever sees ASR backends.

Tool	What it does
`list_models`	Discover ASR slugs (TTS slugs are filtered out). Returns `[{slug, executor, default_source_lang, default_target_lang, default_task, loaded}]`.
`transcribe`	Run ASR on a `file_path` (URL or staged path). Args: `model`, `language?`, `response_format?` (`json`/`verbose_json`/`text`/`srt`/`vtt`), `diarization?`. JSON formats return a JSON-encoded string; text/srt/vtt return raw.
`list_files`	Same payload as `GET /v1/files`.
`put_file`	Upload to staging. Body is base64 (`content_base64`). Decoded size capped at `TALKIES_MAX_UPLOAD_BYTES`. For big files, prefer `PUT /v1/files/{path}` over HTTP — JSON-RPC + base64 chews token budget.
`get_file`	Read a staged file as base64. Same size cap. Same advice — for big bytes, hit `GET /v1/files/{path}` over HTTP.
`delete_file`	Remove a staged file, prune empty parents.

The transport requires Accept: application/json, text/event-stream. Wire it into Claude Code:

claude mcp add --transport http talkies $TALKIES_URL/v1/mcp

With auth:

claude mcp add --transport http talkies $TALKIES_URL/v1/mcp \
  --header "Authorization: Bearer $TALKIES_AUTH_TOKEN"

Note: the canonical mount path is /v1/mcp/ (trailing slash). Bare /v1/mcp is rewritten internally to /v1/mcp/ so clients that don't follow Starlette's 307 redirect work too.

Raw JSON-RPC

For debugging or non-MCP-aware callers, hit it as JSON-RPC over HTTP POST:

# tools/list
curl -s $TALKIES_URL/v1/mcp/ \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}'

# tools/call
curl -s $TALKIES_URL/v1/mcp/ \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{
    "jsonrpc": "2.0", "id": 2, "method": "tools/call",
    "params": {
      "name": "transcribe",
      "arguments": {
        "file_path": "https://example.com/clip.mp3",
        "model": "whisper-large-v3-turbo",
        "response_format": "json"
      }
    }
  }'
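
Because the request advertises both application/json and text/event-stream, a Streamable HTTP server may answer with either. If the reply arrives as an event stream, the JSON-RPC message sits on the `data:` lines. The filter below is a small sketch for that case; `sse_data` is a hypothetical helper, and whether talkies replies with SSE or plain JSON for a given call is not stated here.

```shell
# Hypothetical helper: extract JSON payloads from a server-sent-events body.
# Strips carriage returns, then prints only the content of "data: " lines.
sse_data() {
  tr -d '\r' | sed -n 's/^data: //p'
}
```

Piping one of the curl calls above through `| sse_data | jq` gives readable output when the body is an event stream; a plain JSON body contains no `data:` lines, so the filter prints nothing and the response should be piped straight to jq instead.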

Bearer-Token Auth

If TALKIES_AUTH_TOKEN is set on the server, every route except /healthz and CORS preflight (OPTIONS) requires Authorization: Bearer <token>. A wrong or missing token returns 401 with WWW-Authenticate: Bearer. The token is compared with hmac.compare_digest (constant-time).

curl -H "Authorization: Bearer $TALKIES_AUTH_TOKEN" $TALKIES_URL/v1/models

Empty / unset token = wide open. For untrusted networks, combine the token with a reverse proxy doing TLS + rate limiting.

Typical Workflows

Quick one-off transcribe

curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-large-v3-turbo" | jq -r .text

Generate subtitles for a video

ffmpeg -i video.mp4 -vn -acodec libmp3lame audio.mp3
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-large-v3" \
  -F "response_format=srt" > video.srt
# burn in:  ffmpeg -i video.mp4 -vf subtitles=video.srt -c:a copy video-subbed.mp4

Iterate on the same file with different settings

# Stage once.
curl -X PUT --data-binary @lecture.mp3 \
  -H "Content-Type: audio/mpeg" \
  $TALKIES_URL/v1/files/work/lecture.mp3

# Try different models / formats without re-uploading.
for fmt in json verbose_json srt; do
  curl -s $TALKIES_URL/v1/audio/transcriptions \
    -F "file_path=work/lecture.mp3" \
    -F "model=whisper-large-v3-turbo" \
    -F "response_format=$fmt" > "lecture.$fmt"
done

# Cleanup.
curl -X DELETE $TALKIES_URL/v1/files/work/lecture.mp3

Diarized interview transcript

curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@interview-stereo.wav" \
  -F "model=whisper-large-v3-turbo" \
  -F "diarization=true" \
  -F "response_format=text"
# stdout:
#   L: hi how's it going
#   R: not bad you
#   L: cool man

Synthesize speech from text

# Default voice, MP3 output.
curl -s $TALKIES_URL/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro-82m","input":"Greetings, human."}' \
  --output greetings.mp3

# Pick a voice from GET /v1/audio/voices, choose a format.
curl -s $TALKIES_URL/v1/audio/voices | jq -r '.voices[].voice'
curl -s $TALKIES_URL/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
        "model": "kokoro-82m",
        "input": "Buongiorno, mondo.",
        "voice": "if_sara",
        "response_format": "opus"
      }' \
  --output ciao.opus

Free VRAM after a job

curl -s -X POST $TALKIES_URL/unload | jq

Bulk transcribe from URLs

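# Read one URL per line; unlike $(cat urls.txt), this does not word-split or glob.
while IFS= read -r url; do
  [ -n "$url" ] || continue   # skip blank lines
  curl -s $TALKIES_URL/v1/audio/transcriptions \
    -F "file_path=$url" \
    -F "model=whisper-large-v3-turbo" \
    -F "response_format=text"
  echo "---"
done < urls.txt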

The first hit on each URL downloads and caches the audio; re-running the loop reuses the cached copies, so only the transcription runs again.

For a fuller bulk-transcribe driver (mix of local paths + URLs, per-input output files, error reporting, optional diarization) see scripts/bulk_transcribe.sh:

TALKIES_URL=http://localhost:8000 \
TALKIES_MODEL=whisper-large-v3-turbo \
TALKIES_FORMAT=srt \
TALKIES_OUTDIR=./subs \
  bash scripts/bulk_transcribe.sh inputs.txt

Tips

Use whisper-large-v3-turbo as your default — it's the speed/quality sweet spot for general-purpose ASR. Switch to whisper-large-v3 only when you need the last few % of accuracy on hard audio.
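URL file_path over multipart upload: if the audio is already at a URL, send the URL. It saves bandwidth (the file isn't going up and then back down), gets cached server-side, and is not subject to the upload size cap (the separate TALKIES_MAX_DOWNLOAD_BYTES limit, default 1 GiB, still applies).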
Stage repeated files via PUT /v1/files/{path} and call with file_path= to avoid re-uploading on every retry/iteration.
response_format=text for the "just give me the string" case — no jq -r .text needed, content-type is text/plain.
One model at a time: every transcribe or speech request evicts other loaded models. Don't try to fan out two calls against two different models on the same container; the second one evicts the first and reloads. Use two containers if you actually need concurrency on different models.
POST /unload after a job — explicit eviction frees VRAM/RAM faster than waiting for the 10-min idle sweeper. Useful in CI / batch scripts.
canary-qwen-2.5b has no timestamps — verbose_json.segments / .words come back empty, srt/vtt collapse to one cue. Use a Whisper or Canary multitask slug if you need timing data.
Diarization requires true stereo — if your "stereo" file is the same mono signal copied to both channels, diarization won't separate speakers. The technique is exact for two-mic setups, useless otherwise.
Long files just work — VAD chunking happens transparently. Don't pre-split. Send the whole file.
prompt / temperature / instructions are ignored even though the request schemas accept them. Don't expect them to do anything.
Watch /api/ps to see what's resident. A request that hangs at "loading model" is doing the first cold load — subsequent calls are fast.
Customizing the model registry for translation slugs or to restrict the served set — see references/setup.md.
Kokoro uses native voice names — no OpenAI aliases. Hit GET /v1/audio/voices once to discover what's shipped; pass the voice field accordingly. The 41 voices cover en (US + UK), es, fr, hi, it, pt; ja/zh are filtered out.
Voice cloning is qwen3-tts-0.6b — drop a .wav (10-30 s of clean speech is plenty) into /data/custom-voices/<anywhere>.wav. Optionally drop a sibling .txt with a faithful transcript for higher clone fidelity. The voice appears in GET /v1/audio/voices on the next request — no restart. CUDA required.
Qwen3-TTS first synth is slow — CUDA graph capture runs once after model load (~30-60 s). Subsequent synths are sub-second. If you're benchmarking, throw away the first call.
Qwen3-TTS ignores speed — the model has no playback-rate control. Pass it for OpenAI compat; nothing happens. Only Kokoro honors speed.
Different TTS sample rates — Kokoro emits 24 kHz mono PCM; Qwen3-TTS emits 12 kHz mono PCM. ffmpeg re-encodes both into your chosen response_format, but if you select pcm (raw, no container), you must know the source rate per model to play it back correctly.
TTS response_format=pcm is for chaining — raw int16 mono PCM, no container, no header. Use it when piping into another encoder or a real-time playback path. Otherwise stick with mp3 (default) or opus for size.
TTS evicts loaded ASR and vice versa — they share the same one-model-resident pool. Synthesizing with Kokoro after a transcribe burst incurs Kokoro's cold load. Same applies to Qwen3-TTS (plus the CUDA-graph capture re-runs on cold reload).
