Local STT (Nvidia Parakeet + Whisper Support)

v1.0.0

Local STT with selectable backends - Parakeet (best accuracy) or Whisper (fastest, multilingual).

Security Scan
VirusTotal: Benign
OpenClaw: Suspicious (high confidence)
Purpose & Capability
The code and SKILL.md align with a local STT tool (ffmpeg conversion, ONNX-based Parakeet/Whisper backends). The ability to post transcriptions to a Matrix room matches the documented --room-id option. However, the registry metadata listed no required environment variables while the script clearly expects MATRIX_HOMESERVER and MATRIX_ACCESS_TOKEN when the Matrix feature is used; that mismatch is noteworthy.
Instruction Scope
SKILL.md documents the --room-id option but does not mention that the runtime will: (1) attempt to load environment files from ~/.openclaw/.env and ~/.env, (2) read MATRIX_HOMESERVER and MATRIX_ACCESS_TOKEN from the environment, (3) write logs to /tmp/stt_matrix.log, and (4) load models via onnx_asr which typically pulls model files from network sources (e.g., huggingface). Reading a user's ~/.env is scope-creep because it can surface unrelated secrets; automatic model downloads are network activity not called out in metadata.
Install Mechanism
There is no install spec (instruction-only), which minimizes installer risk. The script includes a commented dependency list and a nonstandard shebang ('uv run --script') indicating runtime packages will be required; this implies runtime package installation/network activity but no explicit installer URL or archive is used.
Credentials
The skill requests no environment variables in registry metadata, yet the script loads ~/.openclaw/.env and ~/.env and reads MATRIX_HOMESERVER and MATRIX_ACCESS_TOKEN if present. Automatically loading a user's .env and using tokens is disproportionate unless clearly documented; it increases the chance of accidental use of unrelated secrets. The Matrix access token, if present, will be used to transmit transcriptions to the specified homeserver.
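One way to narrow the exposure described above is to read an env file through an allowlist instead of loading every variable in it. The following is a minimal sketch of that idea, not the skill's actual loader: the function names `parse_allowed_env` and `read_allowed_env` are hypothetical, and the parser assumes simple KEY=VALUE lines.

```python
from pathlib import Path

# The only two variables the scan says the Matrix feature reads.
ALLOWED_KEYS = frozenset({"MATRIX_HOMESERVER", "MATRIX_ACCESS_TOKEN"})


def parse_allowed_env(text, allowed=ALLOWED_KEYS):
    """Parse KEY=VALUE lines and keep only allowlisted keys (hypothetical helper)."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        key = key.strip()
        if key in allowed:
            # Drop surrounding whitespace and simple quoting.
            values[key] = value.strip().strip("'\"")
    return values


def read_allowed_env(env_path, allowed=ALLOWED_KEYS):
    """Read an env file from disk; a missing file yields an empty dict."""
    path = Path(env_path).expanduser()
    if not path.is_file():
        return {}
    return parse_allowed_env(path.read_text(encoding="utf-8"), allowed)
```

With this shape, unrelated secrets in the same file never enter the process environment, because only the two Matrix keys are returned.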
Persistence & Privilege
The skill is not always-enabled and does not request elevated platform privileges. It writes a local log file (/tmp/stt_matrix.log) and temporarily writes a converted WAV file before deleting it, which is reasonable for this CLI. It does not modify other skills or agent-wide configuration.
What to consider before installing
This skill appears to be a legitimate local STT tool, but you should be cautious before installing or using it as-is:

  • The script will automatically load ~/.openclaw/.env and ~/.env and may pick up sensitive environment variables. Review the contents of those files first or move secrets elsewhere.
  • If you use --room-id (Matrix integration), the script will look for MATRIX_HOMESERVER and MATRIX_ACCESS_TOKEN and will send the transcript to the specified homeserver; provide a minimally-privileged token or avoid the feature if you don't trust the destination.
  • The tool uses onnx_asr/huggingface components to load models at runtime; expect network downloads of model weights (possibly large) from external hosts. If you require offline-only operation, ensure required models are pre-provisioned and verify the code's model-loading behavior.
  • The script writes a local log (/tmp/stt_matrix.log) containing attempt metadata (URLs and HTTP status codes). Inspect this file for unexpected behavior.

Recommended actions: ask the skill author to update registry metadata to declare required env vars (MATRIX_HOMESERVER, MATRIX_ACCESS_TOKEN) and to explicitly document network/model downloads; or run the skill in an isolated environment (container or VM) with only the minimal credentials you are willing to expose.
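Short of a full container or VM, a caller can at least limit which inherited variables the script sees by launching it with a scrubbed environment. The sketch below assumes a hypothetical baseline of PATH, HOME, and LANG; it does not stop the script from reading ~/.env itself, so it complements rather than replaces the isolation recommended above.

```python
import os

# Hypothetical baseline of variables a CLI typically needs in order to start.
BASELINE_KEYS = ("PATH", "HOME", "LANG")


def scrubbed_env(base=None, extra_allowed=()):
    """Return a copy of the environment containing only allowlisted keys."""
    base = os.environ if base is None else base
    keep = set(BASELINE_KEYS) | set(extra_allowed)
    return {key: value for key, value in base.items() if key in keep}
```

The result can be passed as the `env` argument of `subprocess.run`, with `extra_allowed=("MATRIX_HOMESERVER", "MATRIX_ACCESS_TOKEN")` only when the Matrix feature is actually wanted.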


Runtime requirements

🎙️ Clawdis
Bins: ffmpeg
Latest: vk97b0npyj2b2yehzabprm6102h80c199
Downloads: 2.7k
Stars: 1
Versions: 1
Updated: 1mo ago
Version: v1.0.0
License: MIT-0

Local STT (Parakeet / Whisper)

Unified local speech-to-text using ONNX Runtime with int8 quantization. Choose your backend:

  • Parakeet (default): Best accuracy for English, correctly captures names and filler words
  • Whisper: Fastest inference, supports 99 languages

Usage

# Default: Parakeet v2 (best English accuracy)
~/.openclaw/skills/local-stt/scripts/local-stt.py audio.ogg

# Explicit backend selection
~/.openclaw/skills/local-stt/scripts/local-stt.py audio.ogg -b whisper
~/.openclaw/skills/local-stt/scripts/local-stt.py audio.ogg -b parakeet -m v3

# Quiet mode (suppress progress)
~/.openclaw/skills/local-stt/scripts/local-stt.py audio.ogg --quiet
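The same invocations can be driven from Python. This is a minimal sketch that only uses the flags shown above; the helper names `build_command` and `transcribe` are hypothetical, and it assumes the transcript is printed on stdout, which the page does not state explicitly.

```python
import os
import subprocess

# Script path as given in the usage examples.
SCRIPT = "~/.openclaw/skills/local-stt/scripts/local-stt.py"


def build_command(audio_path, backend=None, model=None, quiet=True):
    """Assemble the argument list for one transcription run."""
    cmd = [os.path.expanduser(SCRIPT), audio_path]
    if backend:
        cmd += ["-b", backend]
    if model:
        cmd += ["-m", model]
    if quiet:
        cmd.append("--quiet")
    return cmd


def transcribe(audio_path, timeout=30, **options):
    """Run the CLI and return its stdout (assumed to be the transcript)."""
    result = subprocess.run(
        build_command(audio_path, **options),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()
```

For example, `transcribe("audio.ogg", backend="parakeet", model="v3")` corresponds to the third command above with `--quiet` added.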

Options

  • -b/--backend: parakeet (default), whisper
  • -m/--model: Model variant (see below)
  • --no-int8: Disable int8 quantization
  • -q/--quiet: Suppress progress
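  • --room-id: Matrix room ID; when set, the transcript is sent to that room, which requires MATRIX_HOMESERVER and MATRIX_ACCESS_TOKEN in the environment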
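To make concrete what sending a transcript to a Matrix room involves, the sketch below builds (but does not send) a message request using the Matrix client-server API's room send endpoint. Whether the script uses this exact endpoint is an assumption; the function name `build_matrix_request` is hypothetical.

```python
import json
import urllib.parse
import urllib.request
import uuid


def build_matrix_request(homeserver, access_token, room_id, text, txn_id=None):
    """Build a PUT request that would post a plain-text message to a room."""
    # A transaction ID lets the homeserver de-duplicate retried sends.
    txn_id = txn_id or uuid.uuid4().hex
    url = "{}/_matrix/client/v3/rooms/{}/send/m.room.message/{}".format(
        homeserver.rstrip("/"),
        urllib.parse.quote(room_id, safe=""),  # room IDs contain "!" and ":"
        urllib.parse.quote(txn_id, safe=""),
    )
    body = json.dumps({"msgtype": "m.text", "body": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Authorization": "Bearer " + access_token,
            "Content-Type": "application/json",
        },
    )
```

The request carries the access token in the Authorization header, so anyone who controls the configured homeserver URL receives both the token and the transcript.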

Models

Parakeet (default backend)

Model         Description
v2 (default)  English only, best accuracy
v3            Multilingual

Whisper

Model           Description
tiny            Fastest, lower accuracy
base (default)  Good balance
small           Better accuracy
large-v3-turbo  Best quality, slower

Benchmark (24s audio)

Backend/Model      Time   RTF     Notes
Whisper Base int8  0.43s  0.018x  Fastest
Parakeet v2 int8   0.60s  0.025x  Best accuracy
Parakeet v3 int8   0.63s  0.026x  Multilingual
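RTF in the benchmark is the real-time factor: processing time divided by audio duration, so lower is faster. A short sketch reproducing the column from the reported times and the 24s clip length (the function name is illustrative):

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Times as reported in the benchmark for a 24-second clip.
BENCHMARK_SECONDS = {
    "Whisper Base int8": 0.43,
    "Parakeet v2 int8": 0.60,
    "Parakeet v3 int8": 0.63,
}

for name, seconds in BENCHMARK_SECONDS.items():
    print(f"{name}: {real_time_factor(seconds, 24.0):.3f}x")
```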

openclaw.json

{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/skills/local-stt/scripts/local-stt.py",
            "args": ["--quiet", "{{MediaPath}}"],
            "timeoutSeconds": 30
          }
        ]
      }
    }
  }
}
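To check what this configuration would run for a given file, the sketch below parses the JSON and expands the {{MediaPath}} placeholder. It assumes OpenClaw performs a plain string substitution of the placeholder, which this page does not confirm; the function name `cli_invocations` is hypothetical.

```python
import json


def cli_invocations(config_text, media_path):
    """Return (argv, timeoutSeconds) pairs for each CLI audio model in the config."""
    config = json.loads(config_text)
    models = (
        config.get("tools", {})
        .get("media", {})
        .get("audio", {})
        .get("models", [])
    )
    commands = []
    for entry in models:
        if entry.get("type") != "cli":
            continue
        # Assumed behavior: the placeholder is replaced verbatim in each argument.
        args = [arg.replace("{{MediaPath}}", media_path) for arg in entry.get("args", [])]
        commands.append(([entry["command"], *args], entry.get("timeoutSeconds")))
    return commands
```

For the configuration above and a media path of /tmp/voice.ogg, this yields one invocation: the script path followed by --quiet and /tmp/voice.ogg, with a 30-second timeout.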
