NEXUS Voice Transcriber

Voice note transcription and archival for OpenClaw agents. Powered by Deepgram Nova-3 or local Whisper. It transcribes audio messages and saves both the original audio files and the text transcripts.

Audits: Pass

Install

openclaw skills install nexus-voice-transcriber

Setup

On first use, read references/whisper-models.md and references/troubleshooting.md.
Ensure the dependencies are installed: ffmpeg, python3, and the required Python packages (openai-whisper; deepgram-sdk is optional, only needed for the Deepgram provider).
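
A minimal pre-flight check, sketched in Python; it only mirrors the dependency list above, so adjust the names if the skill's actual requirements differ:

# Illustrative pre-flight check; package names mirror the list above
import importlib.util
import shutil

def missing_dependencies() -> list[str]:
    """Return a list of dependencies that are not installed."""
    missing = []
    if shutil.which("ffmpeg") is None:                 # needed for conversion/splitting
        missing.append("ffmpeg")
    if importlib.util.find_spec("whisper") is None:    # openai-whisper, local provider
        missing.append("openai-whisper")
    if importlib.util.find_spec("deepgram") is None:   # deepgram-sdk, optional
        missing.append("deepgram-sdk (optional, Deepgram provider only)")
    return missing

if __name__ == "__main__":
    for dep in missing_dependencies():
        print(f"missing: {dep}")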

When to Use

  • User sends a voice note / audio file / video file that needs transcription.
  • Need to archive both the original audio and the text transcript.
  • Want speaker detection (if using Deepgram with diarization).
  • Quick local transcription without external APIs (Whisper).

Architecture

Memory lives in ~/voice-transcriber/. See below for structure.

~/voice-transcriber/
├── memory.md          # Provider preferences, defaults, history
├── transcripts/       # Saved transcripts (txt, json, srt)
├── audio/             # Saved original audio files
└── temp/              # Processing workspace (auto-cleaned)
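
A small bootstrap sketch for that layout (directory names come straight from the tree above):

from pathlib import Path

BASE = Path.home() / "voice-transcriber"

def ensure_layout() -> None:
    """Create the working directories and memory file if they are missing."""
    for sub in ("transcripts", "audio", "temp"):
        (BASE / sub).mkdir(parents=True, exist_ok=True)
    memory = BASE / "memory.md"
    if not memory.exists():
        memory.write_text("# Voice Transcriber Memory\n")   # see Memory Template below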

Quick Reference

Topic                  File
Whisper model guide    references/whisper-models.md
Troubleshooting        references/troubleshooting.md
Main script            scripts/transcribe.py

Core Rules

1. Detect Input Type

Before transcribing, classify the input (a sketch follows this list):

  • Local file path → verify exists, check format (mp3, wav, m4a, mp4, etc.)
  • URL → download to temp/, then process
  • Voice memo → usually single speaker, short
  • Meeting / interview → likely multiple speakers, consider diarization
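
A rough classification helper, sketched under the assumptions above (the extension list is illustrative, not exhaustive; URLs are downloaded to temp/ before processing):

from pathlib import Path
from urllib.parse import urlparse

MEDIA_EXTS = {".mp3", ".wav", ".m4a", ".mp4", ".ogg", ".flac", ".webm"}

def classify_input(source: str) -> str:
    """Classify the input as 'url', 'file', or 'unknown'."""
    if urlparse(source).scheme in ("http", "https"):
        return "url"                       # download to temp/, then process
    path = Path(source).expanduser()
    if path.is_file() and path.suffix.lower() in MEDIA_EXTS:
        return "file"
    return "unknown"                       # ask the user for a valid path or URL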

2. Choose Provider Based on Context

Scenario                  Best Provider                  Why
Privacy, no API keys      Local Whisper                  Runs on-device, free
High accuracy, speed      Deepgram Nova‑3                Low latency, good accuracy
Speaker identification    Deepgram (with diarization)    Native speaker labels
No internet               Local Whisper                  Offline capable
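
The same decision expressed as a small helper (a sketch; the flag names are made up for illustration):

def choose_provider(offline_only: bool, prefer_privacy: bool,
                    has_deepgram_key: bool) -> str:
    """Pick a provider following the table above."""
    if offline_only or prefer_privacy or not has_deepgram_key:
        return "whisper"    # on-device, free, works offline; no diarization
    return "deepgram"       # low latency, good accuracy, diarization available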

3. Handle Long Audio

For files larger than 25 MB or longer than 2 hours (a splitting sketch follows this list):

  1. Split into chunks with ffmpeg (see scripts/transcribe.py --split)
  2. Process each chunk
  3. Merge transcripts with proper timestamps
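
One way to implement this flow, sketched with subprocess and Whisper-style segment dicts (start/end in seconds); the real scripts/transcribe.py --split may work differently:

import subprocess
from pathlib import Path

CHUNK_SECONDS = 600   # 10-minute chunks, matching the ffmpeg example further down

def split_audio(src: Path, temp_dir: Path) -> list[Path]:
    """Split audio into fixed-length chunks with ffmpeg."""
    pattern = temp_dir / "chunk_%03d.mp3"
    subprocess.run(["ffmpeg", "-i", str(src), "-f", "segment",
                    "-segment_time", str(CHUNK_SECONDS), "-c", "copy",
                    str(pattern)], check=True)
    return sorted(temp_dir.glob("chunk_*.mp3"))

def merge_segments(per_chunk: list[list[dict]]) -> list[dict]:
    """Merge per-chunk segments, shifting timestamps by each chunk's offset."""
    merged = []
    for index, segments in enumerate(per_chunk):
        offset = index * CHUNK_SECONDS
        for seg in segments:
            merged.append({**seg, "start": seg["start"] + offset,
                           "end": seg["end"] + offset})
    return merged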

4. Save Artifacts

After a successful transcription (a save-and-log sketch follows this list):

  • Save transcript to ~/voice-transcriber/transcripts/ with a meaningful name
  • Save the original audio to ~/voice-transcriber/audio/ if the user wants it archived
  • Update memory.md with date, file, provider, duration
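
A sketch of that save-and-log step, assuming the directory layout from Architecture (the memory line format here is just an example):

import shutil
from datetime import date
from pathlib import Path

BASE = Path.home() / "voice-transcriber"

def save_artifacts(audio: Path, transcript: str, provider: str,
                   duration_min: float, archive_audio: bool = True) -> Path:
    """Save the transcript, optionally archive the audio, and log to memory.md."""
    out = BASE / "transcripts" / f"{audio.stem}.txt"
    out.write_text(transcript)
    if archive_audio:
        shutil.copy2(audio, BASE / "audio" / audio.name)
    with (BASE / "memory.md").open("a") as memo:
        memo.write(f"- {date.today()}: {audio.name} ({provider}, "
                   f"{duration_min:.1f} min)\n")
    return out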

5. Output Formats

Default to plain text (.txt). Offer these alternatives (an SRT conversion sketch follows the list):

  • .txt — clean text, no timestamps
  • .srt / .vtt — subtitles with timing
  • .json — structured with word‑level timing (Deepgram) or segment timing (Whisper)
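
For the subtitle formats, the mapping from segment timing to .srt looks roughly like this (a sketch assuming Whisper-style segments with start, end, and text):

def to_srt(segments: list[dict]) -> str:
    """Render segments ({'start', 'end', 'text'}, seconds) as an SRT string."""
    def ts(seconds: float) -> str:
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n"
                      f"{seg['text'].strip()}\n")
    return "\n".join(blocks)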

Common Traps

  • Assuming one provider fits all → Whisper lacks diarization; Deepgram needs an API key.
  • Uploading huge files directly → Timeouts. Split first.
  • Ignoring audio quality → Noisy audio may need preprocessing (ffmpeg noise reduction).
  • Not checking language → Whisper auto‑detects but can fail on mixed‑language content.
  • Forgetting to save audio → User may want the original file archived.

Requirements

Required:

  • ffmpeg (audio conversion, splitting)
  • python3 + pip
  • Python packages: openai-whisper (local), requests (for Deepgram if used)

Optional API keys (only if using Deepgram):

  • DEEPGRAM_API_KEY — for Deepgram Nova‑3 (speaker diarization available)

Local Whisper works without any API keys.

Provider Quick Reference

Local Whisper (No API Key)

# Install
pip install openai-whisper

# Basic transcription (via script)
python3 scripts/transcribe.py --file audio.wav --provider whisper --model base

# Output formats: txt (default), srt, vtt, json
python3 scripts/transcribe.py --file audio.wav --provider whisper --model medium --format srt

Models: tiny (fastest) → base → small → medium → large (most accurate).
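
Calling Whisper directly from Python (rather than through the script) looks roughly like this with the openai-whisper package:

# Illustrative: using openai-whisper from Python rather than via the script
import whisper

model = whisper.load_model("base")            # tiny / base / small / medium / large
result = model.transcribe("audio.wav")        # language is auto-detected by default
print(result["text"])                         # full transcript
for seg in result["segments"]:                # segment-level timing in seconds
    print(f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['text'].strip()}")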

Deepgram Nova‑3 (API Key Required)

# Set environment variable
export DEEPGRAM_API_KEY="your_key_here"

# Transcribe with speaker diarization
python3 scripts/transcribe.py --file audio.wav --provider deepgram --diarize

# Output JSON with speaker labels
python3 scripts/transcribe.py --file audio.wav --provider deepgram --format json
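
Under the hood, the pre-recorded request is a single POST to the endpoint listed under External Endpoints. A minimal sketch with requests (the bundled script may use deepgram-sdk instead; the query parameters follow Deepgram's documented pre-recorded API):

# Illustrative raw REST call to Deepgram's pre-recorded endpoint
import os
import requests

with open("audio.wav", "rb") as audio:
    resp = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"model": "nova-3", "diarize": "true", "smart_format": "true"},
        headers={"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
                 "Content-Type": "audio/wav"},
        data=audio,
    )
resp.raise_for_status()
alternative = resp.json()["results"]["channels"][0]["alternatives"][0]
print(alternative["transcript"])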

Audio Preprocessing

Extract Audio from Video

ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav

Reduce Noise

ffmpeg -i noisy.wav -af "afftdn=nf=-25" clean.wav

Split Long Audio (10‑minute chunks)

ffmpeg -i long.mp3 -f segment -segment_time 600 -c copy temp/chunk_%03d.mp3

Security & Privacy

Data that stays local:

  • Transcripts in ~/voice-transcriber/transcripts/
  • Original audio in ~/voice-transcriber/audio/
  • Local Whisper processes entirely on‑device

Data that leaves your machine (if using Deepgram):

  • Audio file sent to Deepgram API (api.deepgram.com)
  • Transcript returned and stored locally

This skill does NOT:

  • Store API keys in plain text (use environment variables)
  • Auto‑upload without confirmation
  • Retain files on external servers after processing

External Endpoints

Endpoint                      Data Sent     Purpose
api.deepgram.com/v1/listen    Audio file    Deepgram transcription

Only called when user explicitly chooses Deepgram provider. Local Whisper sends nothing.

Memory Template

Create ~/voice-transcriber/memory.md with this structure:

# Voice Transcriber Memory

## Status
status: ongoing
version: 1.0.0
last: YYYY‑MM‑DD
integration: pending

## Context
<!-- Observations about transcription needs, preferred providers, languages, etc. -->

## Notes
<!-- Provider preferences, format preferences, diarization needs -->

---
*Updated: YYYY‑MM‑DD*

Related Skills

Install with clawhub install <slug> if the user confirms:

  • speech-to-text-transcription — broader audio/video transcription with more providers
  • ffmpeg — advanced audio/video processing
  • audio — general audio manipulation

Feedback

  • If useful: clawhub star voice-transcriber
  • Stay updated: clawhub sync