Install
openclaw skills install speech-to-text-transcription
Transcribe audio and video files to text with speaker detection, timestamps, and format conversion.
On first use, read setup.md and start helping with transcription needs.
User has audio or video files that need transcription. Agent handles local files, URLs, voice memos, podcasts, interviews, meetings, and lectures.
Memory lives in ~/speech-to-text-transcription/. See memory-template.md for structure.
~/speech-to-text-transcription/
├── memory.md # Provider preferences, defaults
├── transcripts/ # Saved transcriptions
└── temp/ # Processing workspace
| Topic | File |
|---|---|
| Setup process | setup.md |
| Memory template | memory-template.md |
Before transcription, identify the input:
| Scenario | Best Provider | Why |
|---|---|---|
| Quick local transcription | Whisper (local) | No API key, free, private |
| High accuracy needed | OpenAI Whisper API | Best quality |
| Speaker identification | AssemblyAI | Native diarization |
| Real-time/streaming | Deepgram | Low latency |
| Long content (>2 hours) | Split + batch | Avoid timeouts |
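The table above reads naturally as a lookup. The sketch below encodes it as a shell function, assuming a hypothetical helper name `choose_provider` and illustrative scenario keywords (`quick`, `accuracy`, `speakers`, `realtime`, `long`); neither is defined by the skill itself.

```shell
# Hypothetical helper: map a scenario keyword to the provider suggested in the table.
# The keywords and the returned labels are illustrative assumptions.
choose_provider() {
  case "$1" in
    quick)    echo "whisper-local" ;;  # no API key, free, private
    accuracy) echo "openai" ;;         # OpenAI Whisper API
    speakers) echo "assemblyai" ;;     # native diarization
    realtime) echo "deepgram" ;;       # low latency
    long)     echo "split-batch" ;;    # split first, then transcribe the chunks
    *)        echo "unknown scenario: $1" >&2; return 1 ;;
  esac
}
```

For example, `choose_provider speakers` prints `assemblyai`, and an unrecognized keyword returns a non-zero status so a caller can fall back to asking the user.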
Files over 25MB or 2 hours:
After transcription:
Default to plain text. Offer alternatives:
- .txt: clean text, no timestamps
- .srt / .vtt: subtitles with timing
- .json: structured with word-level timing
- .md: formatted with speaker labels

Required: ffmpeg (for audio processing)
Optional API keys (only if using cloud providers):
- OPENAI_API_KEY: for OpenAI Whisper API
- ASSEMBLYAI_API_KEY: for AssemblyAI (speaker diarization)
- DEEPGRAM_API_KEY: for Deepgram (real-time)

Local Whisper works without any API keys.
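Before choosing a provider, it can help to check which ones are usable in the current environment. This is a minimal sketch, assuming hypothetical helper names `available_providers` and `check_ffmpeg`; it only inspects the environment variables listed above and looks for `ffmpeg` on the PATH.

```shell
# Hypothetical helper: list providers usable with the current environment.
# Local Whisper needs no key; each cloud provider needs its variable to be non-empty.
available_providers() {
  local providers="whisper-local"
  [ -n "${OPENAI_API_KEY:-}" ]     && providers="$providers openai"
  [ -n "${ASSEMBLYAI_API_KEY:-}" ] && providers="$providers assemblyai"
  [ -n "${DEEPGRAM_API_KEY:-}" ]   && providers="$providers deepgram"
  echo "$providers"
}

# ffmpeg is required for audio processing, so fail clearly if it is missing.
check_ffmpeg() {
  command -v ffmpeg >/dev/null 2>&1 || {
    echo "ffmpeg not found on PATH" >&2
    return 1
  }
}
```

With only `ASSEMBLYAI_API_KEY` set, `available_providers` prints `whisper-local assemblyai`. The function does not verify that a key is valid or that local Whisper is installed.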
# Install
pip install openai-whisper
# Basic transcription
whisper audio.mp3 --model base --output_format txt
# With timestamps
whisper audio.mp3 --model medium --output_format srt
Models: tiny (fast) → base → small → medium → large (accurate)
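The model and output format can be combined in a small wrapper. The sketch below uses a hypothetical function `whisper_command` that only prints the command it would run, defaulting to the `base` model and plain text. It accepts the formats from the list above that the openai-whisper CLI can write directly (txt, srt, vtt, json); the .md format is assumed to need a separate post-processing step and is rejected here.

```shell
# Hypothetical dry-run helper: print the whisper command for a file, model, and format.
# Defaults follow the document: base model, plain-text output.
whisper_command() {
  local file="$1" model="${2:-base}" format="${3:-txt}"
  case "$format" in
    txt|srt|vtt|json) ;;  # formats passed straight to --output_format
    *) echo "unsupported format: $format" >&2; return 1 ;;
  esac
  printf 'whisper "%s" --model %s --output_format %s\n' "$file" "$model" "$format"
}
```

Calling `whisper_command audio.mp3 medium srt` prints the same command shown in the "With timestamps" example above.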
curl -X POST https://api.openai.com/v1/audio/transcriptions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: multipart/form-data" \
-F file="@audio.mp3" \
-F model="whisper-1"
# Upload
curl -X POST https://api.assemblyai.com/v2/upload \
-H "Authorization: $ASSEMBLYAI_API_KEY" \
--data-binary @audio.mp3
# Transcribe with speakers
curl -X POST https://api.assemblyai.com/v2/transcript \
-H "Authorization: $ASSEMBLYAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{"audio_url": "URL", "speaker_labels": true}'
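The two AssemblyAI calls above are linked by the URL that the upload returns, and the transcript is produced asynchronously. The sketch below chains the steps and polls until the job finishes. It assumes `jq` is installed and that the responses carry the fields `upload_url`, `id`, `status`, and `utterances`, which should be checked against the current AssemblyAI documentation. The function names and the 5-second polling interval are illustrative choices.

```shell
API="https://api.assemblyai.com/v2"

# Build the JSON body for a transcript request with speaker labels enabled.
# Assumes the URL contains no double quotes or backslashes.
transcript_request_body() {
  printf '{"audio_url": "%s", "speaker_labels": true}' "$1"
}

# Upload a local file, submit it, poll until done, then print one line per utterance.
transcribe_with_speakers() {
  local file="$1" upload_url id status response

  upload_url=$(curl -sS -X POST "$API/upload" \
    -H "Authorization: $ASSEMBLYAI_API_KEY" \
    --data-binary @"$file" | jq -r '.upload_url')

  id=$(curl -sS -X POST "$API/transcript" \
    -H "Authorization: $ASSEMBLYAI_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(transcript_request_body "$upload_url")" | jq -r '.id')

  while :; do
    response=$(curl -sS "$API/transcript/$id" \
      -H "Authorization: $ASSEMBLYAI_API_KEY")
    status=$(printf '%s' "$response" | jq -r '.status')
    case "$status" in
      completed)         break ;;
      queued|processing) sleep 5 ;;  # arbitrary polling interval
      *) echo "transcription failed: $response" >&2; return 1 ;;
    esac
  done

  printf '%s' "$response" | jq -r '.utterances[] | "Speaker \(.speaker): \(.text)"'
}
```

Any status other than queued, processing, or completed ends the loop with an error, so a bad API key does not poll forever.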
# Extract 16 kHz mono WAV audio from a video
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav
# Reduce background noise with the FFT denoiser
ffmpeg -i noisy.wav -af "afftdn=nf=-25" clean.wav
# Split into 10-minute chunks
ffmpeg -i long.mp3 -f segment -segment_time 600 -c copy chunk_%03d.mp3
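Splitting is only needed for files past the 25MB or 2 hour limits mentioned earlier. This is a minimal sketch of that check, assuming 1 MB means 1024 × 1024 bytes (the source does not say) and that `ffprobe` is available alongside `ffmpeg`. All function names are hypothetical.

```shell
MAX_BYTES=$((25 * 1024 * 1024))   # 25 MB, assuming binary megabytes
MAX_SECONDS=$((2 * 60 * 60))      # 2 hours

# Size of a file in bytes; tr strips the padding some wc implementations add.
file_bytes() {
  wc -c < "$1" | tr -d ' '
}

# Duration in whole seconds, read from the container metadata with ffprobe.
file_seconds() {
  ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$1" | cut -d. -f1
}

# Succeeds when the file is larger than the limit (second argument, default 25 MB).
exceeds_size() {
  [ "$(file_bytes "$1")" -gt "${2:-$MAX_BYTES}" ]
}

# Split into 10-minute chunks only when a limit is exceeded.
split_if_large() {
  if exceeds_size "$1" || [ "$(file_seconds "$1")" -gt "$MAX_SECONDS" ]; then
    ffmpeg -i "$1" -f segment -segment_time 600 -c copy "chunk_%03d.${1##*.}"
  fi
}
```

Because `-c copy` avoids re-encoding, chunk boundaries depend on the source stream and may not land exactly on the 10-minute mark.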
Data that stays local:
Data that leaves your machine (if using APIs):
This skill does NOT:
| Endpoint | Data Sent | Purpose |
|---|---|---|
| api.openai.com/v1/audio | Audio file | Whisper API transcription |
| api.assemblyai.com/v2 | Audio file | AssemblyAI transcription |
| api.deepgram.com/v1 | Audio stream | Deepgram transcription |
These endpoints are called only when the user explicitly chooses a cloud provider. Local Whisper sends nothing.
By using cloud transcription providers, audio data is sent to OpenAI, AssemblyAI, or Deepgram. Only install if you trust these services with your audio. For sensitive content, use local Whisper.
Install with clawhub install <slug> if user confirms:
- audio: General audio processing
- ffmpeg: Video and audio conversion
- podcast: Podcast creation and editing

clawhub star speech-to-text-transcription
clawhub sync