Install
openclaw skills install audio-to-text-and-video-to-text
Transcribe audio and video files into text using OpenAI's Whisper API. Use this skill whenever a user wants to convert any audio or video file to text, including MP3, MP4, WAV, M4A, OGG, WEBM, MOV, AVI, FLAC, and more. Trigger this skill for any request involving: "transcribe", "convert audio to text", "speech to text", "get transcript of", "extract audio from video", "meeting notes from recording", "subtitles", "captions", or similar. Also trigger when the user uploads or references a media file and asks what was said, discussed, or mentioned in it. If unsure whether audio/video transcription is involved, use this skill.
Converts audio and video files into clean, readable text using OpenAI's Whisper API and ffmpeg for media handling.
This skill handles the full pipeline:
- ffmpeg (run which ffmpeg to verify; it's usually pre-installed in claude.ai's environment)
- OPENAI_API_KEY: the user must provide this
- openai, pydub (install via pip if needed)
When a user provides a media file, run the transcription script:
# Install dependencies if missing
pip install openai pydub --break-system-packages -q
# Run transcription
python /home/claude/transcription/scripts/transcribe.py \
--input "/path/to/media/file" \
--output "/mnt/user-data/outputs/transcript.txt" \
--api-key "$OPENAI_API_KEY"
See scripts/transcribe.py for the full implementation.
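The script itself is not reproduced here, so the following is only a minimal sketch of the core API call, assuming the v1 openai Python SDK. The function name transcribe_file and its parameters are hypothetical; the injectable client argument exists only so the function can be exercised without network access.

```python
from pathlib import Path


def transcribe_file(path, client=None, model="whisper-1", language=None, prompt=None):
    """Send one audio file to the transcription endpoint and return plain text.

    Hypothetical helper: it assumes the file is already within the API's
    upload size limit and does no chunking of its own.
    """
    if client is None:
        # Imported lazily so this module still loads without the SDK installed.
        from openai import OpenAI

        client = OpenAI()  # picks up OPENAI_API_KEY from the environment

    # Forward only the optional parameters the caller actually set.
    extra = {}
    if language:
        extra["language"] = language
    if prompt:
        extra["prompt"] = prompt

    with Path(path).open("rb") as audio:
        return client.audio.transcriptions.create(
            model=model,
            file=audio,
            response_format="text",  # ask for a plain string, not a JSON object
            **extra,
        )
```

Calling transcribe_file("meeting.mp3", language="en") would then return the transcript as a string, provided the key is set and the file is small enough to upload in one request.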
| Category | Formats |
|---|---|
| Audio | mp3, wav, m4a, ogg, flac, aac, opus, wma |
| Video | mp4, mov, avi, mkv, webm, wmv, m4v |
ffmpeg handles extraction from any of these.
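As a rough illustration of the extraction step, the sketch below classifies a file by extension using the table above and builds an ffmpeg command that drops the video stream. The helper names and the output settings (mono, 16 kHz, 64 kbit/s MP3) are assumptions made for this example, not values taken from transcribe.py.

```python
from pathlib import Path

# Extension sets copied from the supported-formats table.
AUDIO_EXTS = {"mp3", "wav", "m4a", "ogg", "flac", "aac", "opus", "wma"}
VIDEO_EXTS = {"mp4", "mov", "avi", "mkv", "webm", "wmv", "m4v"}


def media_kind(path):
    """Return "audio", "video", or None for an unsupported extension."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext in AUDIO_EXTS:
        return "audio"
    if ext in VIDEO_EXTS:
        return "video"
    return None


def ffmpeg_extract_cmd(src, dst):
    """Build (but do not run) an ffmpeg command that writes an audio-only file."""
    return [
        "ffmpeg",
        "-y",            # overwrite dst if it already exists
        "-i", str(src),  # input media file
        "-vn",           # discard any video stream
        "-ac", "1",      # downmix to mono (assumed setting)
        "-ar", "16000",  # resample to 16 kHz (assumed setting)
        "-b:a", "64k",   # audio bitrate (assumed setting)
        str(dst),
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True); keeping command construction separate from execution makes it easy to log or inspect the command first.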
| Flag | Default | Description |
|---|---|---|
| --model | whisper-1 | Whisper model to use (whisper-1, gpt-4o-transcribe) |
| --language | auto-detect | ISO 639-1 language code (e.g. en, ar, fr) |
| --format | txt | Output format: txt, srt, vtt, json |
| --timestamps | off | Include timestamps in output |
| --chunk-size | 20 | Max chunk size in MB (must be ≤ 25) |
| --prompt | none | Context hint to improve accuracy (e.g. domain vocab) |
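One way the flags above could be declared is with argparse. This is a sketch that mirrors the table and the commands shown earlier, not the actual parser in transcribe.py; build_parser and chunk_size_mb are hypothetical names.

```python
import argparse


def chunk_size_mb(value):
    """argparse type: an integer chunk size in MB, capped at 25 as in the table."""
    size = int(value)
    if not 0 < size <= 25:
        raise argparse.ArgumentTypeError("chunk size must be between 1 and 25 MB")
    return size


def build_parser():
    """Declare the CLI flags with the defaults listed in the options table."""
    p = argparse.ArgumentParser(description="Transcribe audio or video to text")
    p.add_argument("--input", required=True, help="path to the media file")
    p.add_argument("--output", required=True, help="where to write the transcript")
    p.add_argument("--api-key", default=None, help="falls back to the environment if omitted")
    p.add_argument("--model", default="whisper-1")
    p.add_argument("--language", default=None, help="ISO 639-1 code; auto-detect if omitted")
    p.add_argument("--format", default="txt", choices=["txt", "srt", "vtt", "json"])
    p.add_argument("--timestamps", action="store_true")
    p.add_argument("--chunk-size", type=chunk_size_mb, default=20)
    p.add_argument("--prompt", default=None, help="context hint, e.g. domain vocabulary")
    return p
```

Validating --chunk-size at parse time means an out-of-range value fails immediately with a usage error instead of surfacing later as a rejected upload.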
Ask the user to upload the file or provide a local path. Check:
ls /mnt/user-data/uploads/
which ffmpeg && ffmpeg -version 2>&1 | head -1
pip install openai pydub --break-system-packages -q 2>&1 | tail -3
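The same checks can be made from Python before any work starts. The sketch below is an assumption about how such a preflight might look (the preflight function is hypothetical); it only reports what is missing and installs nothing.

```python
import importlib.util
import os
import shutil


def preflight(env=None):
    """Report which prerequisites are available, as a dict of booleans."""
    env = os.environ if env is None else env
    return {
        # ffmpeg must be on PATH for media extraction.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # find_spec returns None when a top-level package is not installed.
        "openai": importlib.util.find_spec("openai") is not None,
        "pydub": importlib.util.find_spec("pydub") is not None,
        # An empty or unset key counts as missing.
        "api_key": bool(env.get("OPENAI_API_KEY")),
    }
```

Any False entry in the result indicates which step to address first, for example asking the user for a key when api_key is False.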
If OPENAI_API_KEY is not set in the environment, ask the user:
"Please provide your OpenAI API key. It starts with sk-. You can get one at https://platform.openai.com/api-keys"
python /home/claude/transcription/scripts/transcribe.py \
--input "<file_path>" \
--output "/mnt/user-data/outputs/transcript.txt"
After transcription, offer follow-up steps such as summarizing the transcript or extracting action items.
Use the transcript text directly in the conversation for these steps.
The script automatically splits files larger than the chunk size (20 MB by default) into overlapping chunks, with a 1-second overlap for continuity. Each chunk is transcribed separately and the results are merged.
For very long recordings (> 1 hour), warn the user it may take a few minutes and show progress.
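The chunking arithmetic can be sketched as follows. This is not the script's actual algorithm: plan_chunks is a hypothetical helper that assumes a roughly constant bitrate, so that splitting time evenly also splits bytes evenly, and it returns millisecond windows in which each chunk after the first starts one second early.

```python
import math


def plan_chunks(duration_ms, size_bytes, max_mb=20, overlap_ms=1000):
    """Return (start_ms, end_ms) windows that cover the whole recording.

    Assumes constant bitrate. The overlap makes each later chunk slightly
    longer than an even share, which the gap between the 20 MB default and
    the 25 MB cap is assumed to absorb.
    """
    max_bytes = max_mb * 1024 * 1024
    if size_bytes <= max_bytes:
        return [(0, duration_ms)]  # small enough to send in one request

    n = math.ceil(size_bytes / max_bytes)  # number of chunks needed
    step = math.ceil(duration_ms / n)      # nominal chunk length in ms
    windows = []
    for i in range(n):
        start = max(0, i * step - overlap_ms)  # reach back into the previous chunk
        end = min(duration_ms, (i + 1) * step)
        windows.append((start, end))
    return windows


# A one-hour, 50 MB file yields three 20-minute windows with 1 s overlaps.
print(plan_chunks(3_600_000, 50 * 1024 * 1024))
# [(0, 1200000), (1199000, 2400000), (2399000, 3600000)]
```

With pydub, each window maps onto a slice such as audio[start:end], since AudioSegment slicing is expressed in milliseconds. Merging the results also requires dropping text duplicated inside the overlap, which this sketch does not attempt.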
| Error | Fix |
|---|---|
| AuthenticationError | Invalid API key; ask the user to verify it |
| RateLimitError | Wait 60s and retry, or use --chunk-size 10 |
| InvalidRequestError: file too large | Reduce --chunk-size below 25 |
| ffmpeg not found | sudo apt install ffmpeg or brew install ffmpeg |
| No audio stream found | File may be corrupt or wrong format |
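The "wait and retry" advice for rate limits could be wrapped in a small helper. The sketch below is generic and hypothetical (with_retries is not part of the script). The exception types to retry on are passed in by the caller; with the v1 openai SDK that would be openai.RateLimitError.

```python
import time


def with_retries(call, retry_on=(Exception,), attempts=3, base_delay=60.0, sleep=time.sleep):
    """Run call(), retrying on the given exceptions with a growing delay.

    The delay is base_delay * attempt_number (60 s, then 120 s, ...). The
    sleep function is injectable so the helper can be tested without waiting.
    """
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            sleep(base_delay * attempt)
```

A call such as with_retries(lambda: transcribe(chunk), retry_on=(openai.RateLimitError,)) would retry only rate-limit failures, where transcribe stands for whatever function sends a chunk to the API. Authentication errors propagate immediately because retrying them cannot succeed.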
User: "Can you transcribe this meeting recording?"
[uploads meeting.mp4]
→ Check file exists in /mnt/user-data/uploads/
→ Run transcribe.py on it
→ Save transcript to /mnt/user-data/outputs/
→ present_files() to the user
→ Offer to summarize or extract action items
- Save output to /mnt/user-data/outputs/ so users can download it
- Use present_files() to share the transcript file with the user after saving
- Suggest srt or vtt format if they're adding captions to video
- The --prompt flag is useful for technical/domain-specific content: pass a few domain keywords to improve accuracy
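For caption output, the sketch below shows one way timed segments might be rendered as SRT. It assumes each segment is a dict with start and end in seconds plus text, similar to the segments in Whisper's verbose JSON response; srt_timestamp and to_srt are hypothetical helpers, and the script may instead request srt directly from the API.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm (SRT uses a comma before ms)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)
```

When a recording has been split into chunks, each chunk's start offset would need to be added to its segment times before formatting, so that the cues line up with the original media.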