Install
openclaw skills install documentation-auto
Comprehensive Gladia speech-to-text reference auto-synced from docs.gladia.io. Use as a general-purpose fallback when other specialized skills don't match, or when the user needs a broad overview of Gladia capabilities, endpoints, decision guidance, or workflows. Always prefer the official SDK; fall back to raw REST/WebSocket only when the SDK cannot satisfy the requirement.
SDK-first: always use the official SDK; see sdk-integration for policy, setup, and fallback criteria.
Consult these sibling skills as needed:
Gladia is a speech-to-text API that transcribes audio and video files in two modes: pre-recorded (asynchronous, batch) and live (real-time, WebSocket-based). The API returns structured transcripts with word-level timing, confidence scores, and optional audio intelligence features (translation, diarization, summarization, entity recognition, sentiment analysis, PII redaction, subtitles). Use the JavaScript/TypeScript SDK (@gladiaio/sdk) or the Python SDK (gladiaio-sdk) for simplified integration, or call the REST/WebSocket endpoints directly. Authenticate with the x-gladia-key header. Primary docs: https://docs.gladia.io
# All requests require x-gladia-key header
curl -H "x-gladia-key: YOUR_API_KEY" https://api.gladia.io/v2/pre-recorded
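import { GladiaClient } from "@gladiaio/sdk";

const client = new GladiaClient({ apiKey: "YOUR_KEY" });

// Pre-recorded: pass an audio URL or a local file path.
const result = await client.preRecorded().transcribe("audio_url_or_local_path");

// Live: open a session whose audio config matches the stream you will send.
const session = client.liveV2().startSession({
  encoding: "wav/pcm",
  sample_rate: 16000,
  bit_depth: 16,
  channels: 1,
  language_config: { languages: ["en"] }
});
session.on("message", (msg) => console.log(msg)); // transcripts and other events
session.sendAudio(audioChunk); // call repeatedly as audio arrives
session.stopRecording(); // ends the stream and triggers post-processing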
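
When streaming raw PCM to a live session, the size of each chunk follows from the audio config (sample_rate, bit_depth, channels). The sketch below is one way to compute it and to slice a buffer before calling sendAudio(). The `bytesPerChunk` and `splitPcm` helpers and the 100 ms chunk duration are illustrative assumptions, not part of the SDK.

```typescript
// Hypothetical helpers for illustration; not part of @gladiaio/sdk.
interface PcmConfig {
  sample_rate: number; // samples per second, e.g. 16000
  bit_depth: number;   // bits per sample, e.g. 16
  channels: number;    // interleaved channel count, e.g. 1
}

// Bytes of raw PCM covering `chunkMs` milliseconds, aligned to whole frames.
function bytesPerChunk(cfg: PcmConfig, chunkMs: number): number {
  const bytesPerFrame = (cfg.bit_depth / 8) * cfg.channels;
  const frames = Math.floor((cfg.sample_rate * chunkMs) / 1000);
  return frames * bytesPerFrame;
}

// Split a PCM buffer into fixed-size chunks; the last chunk may be shorter.
function splitPcm(pcm: Uint8Array, chunkBytes: number): Uint8Array[] {
  if (chunkBytes <= 0) throw new RangeError("chunkBytes must be positive");
  const chunks: Uint8Array[] = [];
  for (let offset = 0; offset < pcm.length; offset += chunkBytes) {
    chunks.push(pcm.subarray(offset, offset + chunkBytes));
  }
  return chunks;
}

const cfg: PcmConfig = { sample_rate: 16000, bit_depth: 16, channels: 1 };
const chunkBytes = bytesPerChunk(cfg, 100); // 1600 frames * 2 bytes = 3200
const chunks = splitPcm(new Uint8Array(8000), chunkBytes);
console.log(chunkBytes, chunks.map((c) => c.length));
```

Each element of `chunks` could then be passed to session.sendAudio() in order.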

Supported input formats:

| Type | Examples |
|---|---|
| Audio | MP3, WAV, FLAC, AAC, OGG, Opus |
| Video | MP4, MOV, AVI, WebM, Matroska |
| Online | YouTube, TikTok, Instagram, Facebook, Vimeo, LinkedIn |

Limits and quotas:

| Limit | Value |
|---|---|
| Pre-recorded max duration | 135 minutes (free/paid); 4h15 (enterprise) |
| Pre-recorded max file size | 1000 MB |
| Live session max duration | 3 hours |
| Free tier monthly usage | 10 hours |
| Concurrent pre-recorded jobs (free) | 3 |
| Concurrent pre-recorded jobs (paid) | 25 |
| Concurrent live sessions (free) | 1 |
| Concurrent live sessions (paid) | 30 |
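
The pre-recorded limits above can be checked client-side before uploading. This is a minimal sketch with the values copied from the table; the `Plan` type and the `checkPreRecordedLimits` helper are hypothetical names, and the numbers should be re-checked against the current docs.

```typescript
type Plan = "free" | "paid" | "enterprise";

// Values copied from the limits table; 4h15 is expressed in minutes.
const MAX_DURATION_MIN: Record<Plan, number> = {
  free: 135,
  paid: 135,
  enterprise: 4 * 60 + 15,
};
const MAX_FILE_SIZE_MB = 1000;

// Returns a list of human-readable problems; an empty list means the file fits.
function checkPreRecordedLimits(plan: Plan, durationMin: number, sizeMb: number): string[] {
  const problems: string[] = [];
  const maxMin = MAX_DURATION_MIN[plan];
  if (durationMin > maxMin) {
    problems.push(`duration ${durationMin} min exceeds ${maxMin} min on the ${plan} plan`);
  }
  if (sizeMb > MAX_FILE_SIZE_MB) {
    problems.push(`file size ${sizeMb} MB exceeds ${MAX_FILE_SIZE_MB} MB`);
  }
  return problems;
}

console.log(checkPreRecordedLimits("paid", 150, 200));
console.log(checkPreRecordedLimits("enterprise", 150, 200));
```

A file that fails the check would need to be split or compressed before submission.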

Audio intelligence features by mode:

| Feature | Pre-recorded | Live | Purpose |
|---|---|---|---|
| Diarization | ✓ | ✗ | Identify speakers |
| Translation | ✓ | ✓ | Multi-language output |
| Summarization | ✓ | ✗ | Generate summaries/bullet points |
| Sentiment analysis | ✓ | ✓ | Detect emotions and tone |
| Named entity recognition | ✓ | ✓ | Extract people, orgs, dates |
| PII redaction | ✓ | ✗ | Anonymize sensitive data |
| Subtitles | ✓ | ✗ | Generate SRT/VTT files |
| Custom vocabulary | ✓ | ✓ | Improve domain-specific terms |
| Custom spelling | ✓ | ✓ | Normalize misspellings |
| Chapterization | ✓ | ✗ | Segment long audio into chapters |
| Audio-to-LLM | ✓ | ✗ | Run custom prompts on transcript |
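
The availability table above can be encoded as a lookup so that a request is rejected early when it asks for a feature the chosen mode does not offer. This is a sketch only: the feature keys are illustrative labels, not necessarily the API's parameter names, and `unsupportedFeatures` is a hypothetical helper.

```typescript
type Mode = "pre-recorded" | "live";

type Feature =
  | "diarization" | "translation" | "summarization" | "sentiment_analysis"
  | "named_entity_recognition" | "pii_redaction" | "subtitles"
  | "custom_vocabulary" | "custom_spelling" | "chapterization" | "audio_to_llm";

// Availability copied row by row from the feature table.
const FEATURE_MODES: Record<Feature, readonly Mode[]> = {
  diarization: ["pre-recorded"],
  translation: ["pre-recorded", "live"],
  summarization: ["pre-recorded"],
  sentiment_analysis: ["pre-recorded", "live"],
  named_entity_recognition: ["pre-recorded", "live"],
  pii_redaction: ["pre-recorded"],
  subtitles: ["pre-recorded"],
  custom_vocabulary: ["pre-recorded", "live"],
  custom_spelling: ["pre-recorded", "live"],
  chapterization: ["pre-recorded"],
  audio_to_llm: ["pre-recorded"],
};

// Returns the requested features that the given mode does not support.
function unsupportedFeatures(mode: Mode, requested: readonly Feature[]): Feature[] {
  return requested.filter((feature) => !FEATURE_MODES[feature].includes(mode));
}

console.log(unsupportedFeatures("live", ["diarization", "translation", "subtitles"]));
```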

Choosing between pre-recorded and live:

| Scenario | Pre-recorded | Live |
|---|---|---|
| Batch processing uploaded files | ✓ | ✗ |
| Real-time streaming (calls, events) | ✗ | ✓ |
| Need diarization | ✓ | ✗ |
| Need immediate partial results | ✗ | ✓ (with receive_partial_transcripts: true) |
| Need summarization | ✓ | ✗ |
| Multi-hour content | ✓ (up to 135 min; 4h15 on enterprise) | ✓ (up to 3 hours per session) |

SDK vs. raw API:

| Approach | Best for |
|---|---|
| SDK | Rapid development, automatic error handling, built-in polling/retry logic |
| Raw API | Custom workflows, specific language/framework, fine-grained control |

Diarization vs. multi-channel:

| Approach | Use when |
|---|---|
| Diarization | Single audio file with multiple speakers; you want the API to separate them |
| Multi-channel | Multiple audio sources (e.g., separate participant feeds); you can merge them into one multi-channel stream |

Custom vocabulary vs. custom spelling:

| Feature | Use when |
|---|---|
| Custom vocabulary | Word is mispronounced/garbled; you provide phonetic hints (e.g., "Nietzsche" → ["Niche", "Neechee"]) |
| Custom spelling | Word is recognized but misspelled (e.g., "Salesforce" → "Sales Force"); literal text matching |

Pre-recorded workflow:

1. Set language_config.languages explicitly if known (avoids detection overhead).
2. Enable diarization: true if there are multiple speakers.
3. Add custom_vocabulary for domain terms.
4. Submit the job with transcribe() (SDK) or POST /v2/pre-recorded (API).
5. Poll GET /v2/pre-recorded/:id or configure webhooks/callbacks.
6. Read transcription.utterances[] for text and timing, plus any audio intelligence results.

Live workflow:

1. Start a session with POST /v2/live and the audio config (encoding, sample_rate, bit_depth, channels).
2. Set messages_config to specify which message types to receive (transcripts, partial transcripts, post-processing events).
3. Send audio with sendAudio() (SDK) or as binary/base64 JSON (raw API).
4. Listen for transcript messages; check is_final to distinguish partials from finals.
5. Call stopRecording() to trigger post-processing (diarization, translation, etc.).
6. Fetch the result with GET /v2/live/:id or wait for the callback with the complete result.

Custom vocabulary request example:

{
"custom_vocabulary": true,
"custom_vocabulary_config": {
"vocabulary": [
"Gladia",
{ "value": "Salesforce", "pronunciations": ["sell force"], "intensity": 0.5 }
],
"default_intensity": 0.4
}
}
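
The same payload can be assembled with types so that malformed entries are caught before the request is sent. This is a sketch under assumptions: the field names come from the JSON above, but the `buildCustomVocabulary` helper and the 0 to 1 intensity range it enforces are not taken from the source and should be verified against the API reference.

```typescript
interface VocabularyEntry {
  value: string;
  pronunciations?: string[];
  intensity?: number;
}
type VocabularyItem = string | VocabularyEntry;

interface CustomVocabularyRequest {
  custom_vocabulary: true;
  custom_vocabulary_config: {
    vocabulary: VocabularyItem[];
    default_intensity: number;
  };
}

// Assumption: intensity values are expected to lie in [0, 1].
function assertIntensity(value: number, label: string): void {
  if (!(value >= 0 && value <= 1)) {
    throw new RangeError(`${label} must be between 0 and 1, got ${value}`);
  }
}

// Hypothetical helper: validates entries and wraps them in the request shape.
function buildCustomVocabulary(items: VocabularyItem[], defaultIntensity = 0.4): CustomVocabularyRequest {
  assertIntensity(defaultIntensity, "default_intensity");
  for (const item of items) {
    if (typeof item !== "string" && item.intensity !== undefined) {
      assertIntensity(item.intensity, `intensity for "${item.value}"`);
    }
  }
  return {
    custom_vocabulary: true,
    custom_vocabulary_config: { vocabulary: items, default_intensity: defaultIntensity },
  };
}

const request = buildCustomVocabulary([
  "Gladia",
  { value: "Salesforce", pronunciations: ["sell force"], intensity: 0.5 },
]);
console.log(JSON.stringify(request, null, 2));
```

The resulting object has the same shape as the JSON example and could be merged into the transcription request options.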

Common pitfalls:

- Set language_config.languages explicitly if you know the language. Auto-detection adds latency and can fail if audio starts with silence or music.
- Do not enable code_switching: true with an empty languages array: the model will evaluate against 100+ languages, causing frequent misdetections. Always provide a constrained list (3–5 languages).
- number_of_speakers, min_speakers, max_speakers are hints, not guarantees. The model may detect a different count.
- Start with default_intensity: 0.4 and adjust per-entry only. Raising intensity globally increases false positives. Add pronunciations variants before raising intensity.
- If you poll GET /v2/pre-recorded/:id in a tight loop, you'll hit rate limits. Use webhooks or callbacks instead, or poll with exponential backoff.
- Multi-channel audio is billed as duration × number_of_channels. A 10-minute 3-channel stream costs 30 minutes of usage.
- Wait for is_final: true before using a transcript for critical decisions.
- Uploading a local file returns an audio_url; use this URL in the transcription request, not the local file path.

Before submitting transcription work:

- The request includes the x-gladia-key header.
- language_config.languages is set if known.
- With code switching, the languages list is constrained to 3–5 expected languages.
- Custom vocabulary entries use a moderate intensity (0.4–0.6) and pronunciations.
- The response contains transcription.utterances[], metadata, and any requested audio intelligence results.
- Timestamps (start, end) are present for quality validation.

For additional documentation and navigation, see: https://docs.gladia.io/llms.txt
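
For the polling pitfall, exponential backoff might look like the sketch below. It makes no network calls itself: `getJob` is an injected function that would wrap GET /v2/pre-recorded/:id, and the status values ("queued", "processing", "done", "error") are an assumption to verify against the API reference. The 1 s base delay and 30 s cap are arbitrary illustrative choices.

```typescript
// Assumed status values; check the API reference before relying on them.
type JobStatus = "queued" | "processing" | "done" | "error";

// Delay before the next poll: base * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Polls until the job is done, sleeping longer after each unfinished poll.
async function pollUntilDone<T extends { status: JobStatus }>(
  getJob: () => Promise<T>,
  maxAttempts = 10,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await getJob();
    if (job.status === "done") return job;
    if (job.status === "error") throw new Error("transcription job failed");
    await sleep(backoffDelayMs(attempt));
  }
  throw new Error(`job not finished after ${maxAttempts} polls`);
}

// Delays for the first five polls: 1000, 2000, 4000, 8000, 16000 ms.
console.log([0, 1, 2, 3, 4].map((attempt) => backoffDelayMs(attempt)));
```

Injecting `sleep` keeps the loop testable without real timers; webhooks or callbacks remain the preferred option where they are available.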
This file is auto-synced from https://docs.gladia.io/.well-known/agent-skills/gladia/skill.md. Do not edit manually; changes will be overwritten by CI.