Documentation Auto

API key required
Automation

Comprehensive Gladia speech-to-text reference, auto-synced from docs.gladia.io. Use it as a general-purpose fallback when no specialized skill matches, or when the user needs a broad overview of Gladia capabilities, endpoints, decision guidance, or workflows. Always prefer the official SDK; fall back to raw REST/WebSocket only when the SDK cannot satisfy the requirement.

Install

openclaw skills install documentation-auto

SDK-first: always use the official SDK — see sdk-integration for policy, setup, and fallback criteria.

References

Consult these sibling skills as needed:

  • ../sdk-integration/SKILL.md -- SDK setup, client initialization, error handling, and SDK vs raw API decision guide
  • ../sdk-integration/references/sdk-versions.md -- Current SDK versions (auto-synced by CI)
  • ../troubleshooting/SKILL.md -- Common errors, gotchas, and verification checklist
  • ../live-transcription/SKILL.md -- Live streaming transcription
  • ../pre-recorded-transcription/SKILL.md -- Pre-recorded file transcription

name: Gladia
description: Use when building speech-to-text transcription features, processing audio or video files, implementing real-time transcription, extracting insights from audio (translation, summarization, speaker identification), or integrating audio intelligence into applications.
metadata:
  mintlify-proj: gladia
  version: "1.0"

Gladia Skill

Product summary

Gladia is a speech-to-text API that transcribes audio and video files in two modes: pre-recorded (asynchronous, batch) and live (real-time, WebSocket-based). The API returns structured transcripts with word-level timing, confidence scores, and optional audio intelligence features (translation, diarization, summarization, entity recognition, sentiment analysis, PII redaction, subtitles). Use the JavaScript/TypeScript SDK (@gladiaio/sdk) or the Python SDK (gladiaio-sdk) for simplified integration, or call the REST/WebSocket endpoints directly. Authenticate with the x-gladia-key header. Primary docs: https://docs.gladia.io

When to use

  • Pre-recorded transcription: Transcribe uploaded audio/video files (MP3, WAV, MP4, YouTube links, etc.) asynchronously. Typical latency: seconds to minutes depending on file length.
  • Live transcription: Stream audio in real-time via WebSocket for immediate transcripts (e.g., call centers, live events, voice assistants).
  • Audio intelligence: Extract metadata from transcripts — translate to multiple languages, identify speakers, detect sentiment, redact PII, generate summaries, create subtitles, recognize named entities.
  • Custom vocabulary: Improve accuracy for domain-specific terms, brand names, and proper nouns by providing phonetic hints.
  • Multi-speaker scenarios: Use diarization to attribute speech to individual speakers, or send multi-channel audio to preserve speaker identity.

Quick reference

Authentication

# All requests require x-gladia-key header
curl -H "x-gladia-key: YOUR_API_KEY" https://api.gladia.io/v2/pre-recorded
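
For raw REST calls from TypeScript, the same header can be built once and reused. The helper below is a minimal sketch: buildGladiaHeaders is a hypothetical name, and the Content-Type value is an assumption for JSON request bodies. Only the x-gladia-key header name comes from the text above.

```typescript
// Hypothetical helper (not part of the Gladia SDK): builds headers for raw REST calls.
// Only the "x-gladia-key" header name is taken from the documentation above.
function buildGladiaHeaders(apiKey: string): Record<string, string> {
  const key = apiKey.trim();
  if (key.length === 0) {
    // Fail early instead of sending an unauthenticated request.
    throw new Error("Missing Gladia API key");
  }
  return {
    "x-gladia-key": key,
    // Assumption: request bodies are sent as JSON.
    "Content-Type": "application/json",
  };
}
```

In practice the key would usually come from an environment variable rather than a string literal, so it never lands in source control.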

Pre-recorded workflow (SDK)

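import { GladiaClient } from "@gladiaio/sdk";
// Create the client once and reuse it for every request.
const client = new GladiaClient({ apiKey: "YOUR_API_KEY" });
// The argument is either a remote audio URL or a local file path.
const result = await client.preRecorded().transcribe("audio_url_or_local_path");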

Live workflow (SDK)

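// Reuses the client created in the pre-recorded example.
const session = client.liveV2().startSession({
  encoding: "wav/pcm",
  sample_rate: 16000, // Hz
  bit_depth: 16,
  channels: 1,
  language_config: { languages: ["en"] }
});
// Log every message the session emits.
session.on("message", (msg) => console.log(msg));
// audioChunk is a buffer of audio matching the config above.
session.sendAudio(audioChunk);
// Signal the end of audio so post-processing can run.
session.stopRecording();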

Audio formats

| Type | Examples |
| --- | --- |
| Audio | MP3, WAV, FLAC, AAC, OGG, Opus |
| Video | MP4, MOV, AVI, WebM, Matroska |
| Online | YouTube, TikTok, Instagram, Facebook, Vimeo, LinkedIn |

Limits

| Limit | Value |
| --- | --- |
| Pre-recorded max duration | 135 minutes (free/paid); 4 h 15 min (enterprise) |
| Pre-recorded max file size | 1000 MB |
| Live session max duration | 3 hours |
| Free tier monthly usage | 10 hours |
| Concurrent pre-recorded jobs (free) | 3 |
| Concurrent pre-recorded jobs (paid) | 25 |
| Concurrent live sessions (free) | 1 |
| Concurrent live sessions (paid) | 30 |
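
These limits can be checked locally before a job is submitted. The sketch below hard-codes the file size and duration values from the table above; the function and type names are hypothetical, and it assumes a file exactly at a limit is still accepted.

```typescript
type Plan = "free" | "paid" | "enterprise";

interface PreRecordedInput {
  sizeMB: number;
  durationMinutes: number;
}

// Values copied from the limits table above.
const MAX_FILE_SIZE_MB = 1000;
const MAX_DURATION_MINUTES: Record<Plan, number> = {
  free: 135,
  paid: 135,
  enterprise: 255, // 4 h 15 min
};

// Returns a list of human-readable problems; an empty list means the file fits.
// Assumption: a file exactly at the limit is accepted.
function checkPreRecordedLimits(input: PreRecordedInput, plan: Plan): string[] {
  const problems: string[] = [];
  if (input.sizeMB > MAX_FILE_SIZE_MB) {
    problems.push(`file is ${input.sizeMB} MB; limit is ${MAX_FILE_SIZE_MB} MB`);
  }
  const maxMinutes = MAX_DURATION_MINUTES[plan];
  if (input.durationMinutes > maxMinutes) {
    problems.push(`duration is ${input.durationMinutes} min; limit is ${maxMinutes} min`);
  }
  return problems;
}
```

Returning a list rather than throwing on the first violation lets the caller report every problem at once.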

Audio intelligence features

| Feature | Pre-recorded | Live | Purpose |
| --- | --- | --- | --- |
| Diarization | | | Identify speakers |
| Translation | | | Multi-language output |
| Summarization | | | Generate summaries/bullet points |
| Sentiment analysis | | | Detect emotions and tone |
| Named entity recognition | | | Extract people, orgs, dates |
| PII redaction | | | Anonymize sensitive data |
| Subtitles | | | Generate SRT/VTT files |
| Custom vocabulary | | | Improve domain-specific terms |
| Custom spelling | | | Normalize misspellings |
| Chapterization | | | Segment long audio into chapters |
| Audio-to-LLM | | | Run custom prompts on transcript |

Decision guidance

When to use pre-recorded vs. live

| Scenario | Pre-recorded | Live |
| --- | --- | --- |
| Batch processing uploaded files | | |
| Real-time streaming (calls, events) | | |
| Need diarization | | |
| Need immediate partial results | | ✓ (with receive_partial_transcripts: true) |
| Need summarization | | |
| Multi-hour content | ✓ (up to 135 min) | ✓ (up to 3 hours per session) |

When to use SDK vs. raw API

| Approach | Best for |
| --- | --- |
| SDK | Rapid development, automatic error handling, built-in polling/retry logic |
| Raw API | Custom workflows, specific language/framework, fine-grained control |

When to use diarization vs. multi-channel audio

| Approach | Use when |
| --- | --- |
| Diarization | Single audio file with multiple speakers; you want the API to separate them |
| Multi-channel | Multiple audio sources (e.g., separate participant feeds); you can merge them into one multi-channel stream |

When to use custom vocabulary vs. custom spelling

| Feature | Use when |
| --- | --- |
| Custom vocabulary | Word is mispronounced/garbled; you provide phonetic hints (e.g., "Nietzsche" → ["Niche", "Neechee"]) |
| Custom spelling | Word is recognized but misspelled (e.g., transcribed as "Sales Force" when it should read "Salesforce"); literal text matching |

Workflow

Pre-recorded transcription (typical task)

  1. Prepare audio: Ensure file is under 1000 MB and 135 minutes. Supported formats: MP3, WAV, MP4, YouTube URL, etc.
  2. Choose delivery method: Use SDK for simplicity, or raw API for control.
  3. Configure transcription:
    • Set language_config.languages explicitly if known (avoids detection overhead).
    • Enable diarization: true if multiple speakers.
    • Add custom_vocabulary for domain terms.
    • Enable audio intelligence features (translation, summarization, etc.) as needed.
  4. Submit job: Call transcribe() (SDK) or POST /v2/pre-recorded (API).
  5. Retrieve results: Poll GET /v2/pre-recorded/:id or configure webhooks/callbacks.
  6. Parse response: Extract transcription.utterances[] for text and timing, plus any audio intelligence results.
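
Step 5 can be sketched as a polling loop whose wait time grows between attempts. This is an illustration, not the SDK's own polling logic: backoffDelays and pollUntilDone are hypothetical helpers, and the status values "done" and "error" and the result field are assumptions about the job payload. The caller supplies fetchStatus, for example a function that issues GET /v2/pre-recorded/:id.

```typescript
// Delays (ms) for successive polls: initial, initial*factor, ... capped at maxMs.
function backoffDelays(initialMs: number, factor: number, maxMs: number, attempts: number): number[] {
  const delays: number[] = [];
  let delay = initialMs;
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(delay, maxMs));
    delay *= factor;
  }
  return delays;
}

// Assumed job shape: a status string plus an optional result payload.
interface JobStatus<T> {
  status: string;
  result?: T;
}

// Polls until the job reports the assumed terminal status "done" or "error",
// sleeping for the next backoff delay after each non-terminal response.
async function pollUntilDone<T>(
  fetchStatus: () => Promise<JobStatus<T>>,
  delaysMs: number[] = backoffDelays(1000, 2, 30000, 12),
): Promise<T> {
  for (const delayMs of delaysMs) {
    const job = await fetchStatus();
    if (job.status === "done" && job.result !== undefined) return job.result;
    if (job.status === "error") throw new Error("Transcription job failed");
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error("Gave up waiting for the transcription job");
}
```

Capping the delay keeps long jobs from being checked too rarely, while the growth factor keeps short jobs from being polled in a tight loop.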

Live transcription (typical task)

  1. Initialize session: Call POST /v2/live with audio config (encoding, sample_rate, bit_depth, channels).
  2. Connect WebSocket: Use returned URL to open WebSocket connection.
  3. Configure messages: Set messages_config to specify which message types to receive (transcripts, partial transcripts, post-processing events).
  4. Stream audio: Send audio chunks via sendAudio() (SDK) or binary/base64 JSON (raw API).
  5. Handle messages: Listen for transcript messages; check is_final to distinguish partials from finals.
  6. Stop recording: Call stopRecording() to trigger post-processing (diarization, translation, etc.).
  7. Retrieve final result: Poll GET /v2/live/:id or wait for callback with complete result.
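
Step 5 amounts to discarding partial results before using the text. The sketch below assumes a simplified message shape with only is_final and text; the real payload may nest these fields differently, and joinFinalTranscripts is a hypothetical helper.

```typescript
// Assumed minimal message shape; the real payload may nest these fields differently.
interface TranscriptMessage {
  is_final: boolean;
  text: string;
}

// Keeps only final transcripts and joins their text into one string.
function joinFinalTranscripts(messages: TranscriptMessage[]): string {
  return messages
    .filter((m) => m.is_final)
    .map((m) => m.text.trim())
    .filter((t) => t.length > 0)
    .join(" ");
}
```

Partials can still be shown in a UI for responsiveness; this filter is for text that is stored or acted upon.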

Adding custom vocabulary

  1. Identify problem terms: Transcribe without custom vocabulary; note mis-transcribed words.
  2. Categorize: Garbled/phonetically wrong → custom vocabulary; recognizable but misspelled → custom spelling.
  3. Build vocabulary list:
    {
      "custom_vocabulary": true,
      "custom_vocabulary_config": {
        "vocabulary": [
          "Gladia",
          { "value": "Salesforce", "pronunciations": ["sell force"], "intensity": 0.5 }
        ],
        "default_intensity": 0.4
      }
    }
    
  4. Test: Transcribe again; confirm targets appear and check for false positives.
  5. Refine: Lower intensity, add pronunciations, or move stubborn terms to custom spelling.
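
The payload from step 3 can also be assembled in code, which makes the refine loop in step 5 easier to repeat. This is a sketch: buildCustomVocabulary is a hypothetical helper, the field names mirror the JSON above, and the 0 to 1 range check on intensity is an assumption rather than a documented constraint.

```typescript
type VocabularyEntry =
  | string
  | { value: string; pronunciations?: string[]; intensity?: number };

interface CustomVocabularyPayload {
  custom_vocabulary: true;
  custom_vocabulary_config: {
    vocabulary: VocabularyEntry[];
    default_intensity: number;
  };
}

// Assumption: intensities are only meaningful within [0, 1].
function assertIntensity(value: number, label: string): void {
  if (!(value >= 0 && value <= 1)) {
    throw new RangeError(`${label} intensity must be between 0 and 1, got ${value}`);
  }
}

// Validates intensities and returns a payload shaped like the JSON in step 3.
function buildCustomVocabulary(
  vocabulary: VocabularyEntry[],
  defaultIntensity: number = 0.4,
): CustomVocabularyPayload {
  assertIntensity(defaultIntensity, "default");
  for (const entry of vocabulary) {
    if (typeof entry !== "string" && entry.intensity !== undefined) {
      assertIntensity(entry.intensity, entry.value);
    }
  }
  return {
    custom_vocabulary: true,
    custom_vocabulary_config: { vocabulary, default_intensity: defaultIntensity },
  };
}
```

Calling it with "Gladia" and the Salesforce entry from step 3 reproduces that JSON.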

Common gotchas

  • Language detection overhead: Always set language_config.languages explicitly if you know the language. Auto-detection adds latency and can fail if audio starts with silence or music.
  • Code switching without language list: Never enable code_switching: true with an empty languages array — the model will evaluate against 100+ languages, causing frequent misdetections. Always provide a constrained list (3–5 languages).
  • Diarization hints are not hard constraints: number_of_speakers, min_speakers, max_speakers are hints, not guarantees. The model may detect a different count.
  • Custom vocabulary intensity tuning: Start at default_intensity: 0.4 and adjust per-entry only. Raising intensity globally increases false positives. Add pronunciation variants before raising intensity.
  • Live session 3-hour limit: A single WebSocket session cannot exceed 3 hours. For longer events, close the session and start a new one before hitting the limit.
  • Pre-recorded 135-minute limit: On free and paid plans, files longer than 135 minutes will fail (enterprise allows up to 4 h 15 min). Split longer files into ~60-minute chunks using ffmpeg or similar tools.
  • Audio format conversion overhead: Large video files (e.g., AVI, MOV) take ~1 minute to convert to WAV/PCM. Plan for this latency.
  • Polling without webhooks: If you poll GET /v2/pre-recorded/:id in a tight loop, you'll hit rate limits. Use webhooks or callbacks instead, or poll with exponential backoff.
  • Multi-channel billing: Transcribing multi-channel audio is billed as duration × number_of_channels. A 10-minute 3-channel stream costs 30 minutes of usage.
  • Partial transcripts in live mode: Partial transcripts are low-latency but less accurate. Always check is_final: true before using a transcript for critical decisions.
  • Missing audio_url on upload: After uploading a file, the response includes audio_url — use this URL in the transcription request, not the local file path.
  • WebSocket reconnection: If the WebSocket disconnects, reconnect to the same URL (returned from init) to resume the session without losing context.
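
The multi-channel billing rule above is plain multiplication, which makes it easy to estimate usage before streaming. A minimal sketch, with billedMinutes as a hypothetical helper name:

```typescript
// Billed usage = audio duration × number of channels (per the billing note above).
function billedMinutes(durationMinutes: number, channels: number): number {
  if (durationMinutes < 0) {
    throw new RangeError("durationMinutes must not be negative");
  }
  if (channels < 1 || Math.floor(channels) !== channels) {
    throw new RangeError("channels must be a positive integer");
  }
  return durationMinutes * channels;
}

// The example from the text: a 10-minute, 3-channel stream.
const exampleUsage = billedMinutes(10, 3); // 30
```

Comparing this figure with the same recording mixed down to one channel plus diarization can help when deciding between the two approaches.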

Verification checklist

Before submitting transcription work:

  • API key is valid and passed in x-gladia-key header.
  • Pre-recorded audio is under 1000 MB and 135 minutes; live sessions stay under 3 hours.
  • Audio format is supported (MP3, WAV, MP4, etc.).
  • Language is set explicitly in language_config.languages if known.
  • If using code switching, languages list is constrained to 3–5 expected languages.
  • Diarization is enabled if multiple speakers need attribution.
  • Custom vocabulary entries have realistic intensity (0.4–0.6) and pronunciations.
  • Webhooks or callbacks are configured if polling is not feasible.
  • Live sessions are closed before 3 hours; pre-recorded jobs are split if over 135 minutes.
  • Response includes expected fields: transcription.utterances[], metadata, and any requested audio intelligence results.
  • Confidence scores and timing (start, end) are present for quality validation.
  • Multi-channel audio is correctly interleaved if merging multiple sources.
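
For the checklist item on splitting long pre-recorded jobs, the chunk boundaries can be computed ahead of time. The sketch below uses a hypothetical planChunks helper with a 60-minute default. Each entry maps onto ffmpeg's -ss (start) and -t (duration) options; the ffmpeg invocation itself is left to the caller.

```typescript
interface Chunk {
  startMinutes: number;
  durationMinutes: number;
}

// Splits a recording into consecutive chunks of at most chunkMinutes each.
function planChunks(totalMinutes: number, chunkMinutes: number = 60): Chunk[] {
  if (totalMinutes <= 0 || chunkMinutes <= 0) {
    throw new RangeError("durations must be positive");
  }
  const chunks: Chunk[] = [];
  for (let start = 0; start < totalMinutes; start += chunkMinutes) {
    chunks.push({
      startMinutes: start,
      // The last chunk may be shorter than chunkMinutes.
      durationMinutes: Math.min(chunkMinutes, totalMinutes - start),
    });
  }
  return chunks;
}
```

A 150-minute recording yields three chunks of 60, 60, and 30 minutes. Cutting at fixed offsets can split a word in half, so a small overlap or a cut at a silence may be worth adding in real use.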

Resources


For additional documentation and navigation, see: https://docs.gladia.io/llms.txt


This file is auto-synced from https://docs.gladia.io/.well-known/agent-skills/gladia/skill.md. Do not edit manually; changes will be overwritten by CI.