whisperx

WhisperX provides local, offline speech-to-text transcription built on OpenAI Whisper: high-quality recognition with no API key required, word-level timestamps, and optional speaker diarization.

Audits

Pending

Install

openclaw skills install whisperx

WhisperX Speech Recognition Skill

Local, offline speech-to-text: a WhisperX-powered speech recognition skill for OpenClaw. Up to 30x faster than standard OpenAI Whisper, and runs fully offline with no API key required.

Features

  • Pure ASR: Converts voice messages to text only — no voice replies generated
  • Fully offline: Model runs locally, no internet or API key needed
  • Word-level timestamps: Precise per-word time alignment
  • 90+ languages: Includes auto language detection
  • Speaker diarization: Optional, requires a HuggingFace token
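The transcribe-then-align flow behind the word-level timestamps can also be driven from Python. Below is a minimal sketch of WhisperX's documented two-stage API (ASR first, then forced alignment); the audio path is a placeholder, and the guard lets the script degrade gracefully when whisperx is not installed:

```python
import importlib.util

# Guard: only run the pipeline when whisperx is actually installed.
whisperx_available = importlib.util.find_spec("whisperx") is not None

if not whisperx_available:
    print("whisperx is not installed; run `pip install whisperx` first")
else:
    import whisperx

    device = "cpu"  # or "cuda" on a GPU machine
    audio = whisperx.load_audio("path/to/audio.wav")  # placeholder path

    # Stage 1: plain ASR (segment-level text, no voice reply generated).
    model = whisperx.load_model("small", device, compute_type="int8")
    result = model.transcribe(audio)

    # Stage 2: forced alignment adds per-word start/end times.
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    result = whisperx.align(result["segments"], align_model, metadata, audio, device)

    for segment in result["segments"]:
        for word in segment.get("words", []):
            print(f'{word["start"]:.2f}-{word["end"]:.2f}: {word["word"]}')
```

Speaker diarization is a separate, optional stage on top of this and needs a HuggingFace token, as noted above.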

Installation

# Install ffmpeg (macOS)
brew install ffmpeg

# Install ffmpeg (Ubuntu/Debian)
sudo apt-get install ffmpeg

# Install WhisperX
pip install whisperx
# or run it without installing, using uvx:
uvx whisperx

GPU users: ensure CUDA 12.8 is installed for faster inference.

Usage

# Basic transcription (auto-detect language)
whisperx path/to/audio.wav

# Specify model and language
whisperx --model small --language zh path/to/audio.wav

# CPU mode (low memory)
whisperx --model small --device cpu --compute_type int8 path/to/audio.wav
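The timestamped output is easy to post-process, for example into subtitles. The sketch below formats segments into SRT; the `segments` list is hand-written to mirror the general shape of WhisperX output (`start`, `end`, `text` per segment), not real transcription results:

```python
def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Turn a list of {start, end, text} segments into an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Hypothetical sample mirroring the transcriber's segment shape.
segments = [
    {"start": 0.0, "end": 2.4, "text": "Hello there."},
    {"start": 2.4, "end": 5.1, "text": "Welcome to WhisperX."},
]
print(segments_to_srt(segments))
```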

Notes

  • Dependencies: whisperx, ffmpeg
  • Supported formats: MP3, WAV, OGG, FLAC, M4A, OPUS, AAC, and all other ffmpeg-supported formats
  • Model cache: Downloaded automatically to ~/.cache/whisper/ on first run
  • Recommended models: base or small for CPU; large-v3 for GPU