Parakeet CLI SSML

v1.3.2

Offline voice toolkit for speech-to-text, text-to-speech, and language detection supporting 25 languages with no API keys or cloud usage.

by Anton Yakutovich (@drakulavich)
Security Scan

  • VirusTotal: Benign
  • OpenClaw: Benign (medium confidence)
Purpose & Capability
The name/description (local, offline STT/TTS/language-detection) aligns with the instructions: they call a local CLI (kesha) and document installing models and a Rust engine. Requiring global npm/bun installs and model downloads is reasonable for this purpose. Minor inconsistency: the skill registry summary said "Required binaries: none" while the SKILL.md frontmatter declares requires.bins: [kesha] — the skill expects a kesha CLI to be present or installed via the documented steps.
Instruction Scope
SKILL.md tells the agent to run kesha CLI commands (transcribe, say, --json, --ssml) and to run the documented installer (bun/npm + kesha install). The instructions do not ask the agent to read unrelated system files, environment variables, or send data to unknown endpoints. Note: 'kesha install' will download engine and models (large artifacts) from upstream (GitHub/HuggingFace) and requires network and disk access.
Install Mechanism
Installation is instruction-only (no packaged install spec in the registry), but SKILL.md recommends bun/npm global install of @drakulavich/kesha-voice-kit and running 'kesha install' which downloads engine/models. The sources referenced (GitHub releases, npm, HuggingFace, bun.sh) are public and expected for this project, but using 'curl | bash' (bun installer) and global npm installs carry standard supply-chain risks. Model downloads are large (~GB) and will be written to ~/.cache/kesha; they may be extracted to disk.
Credentials
The skill declares no required environment variables or credentials and its instructions do not request secrets. This is proportionate to an on-device voice toolkit.
Persistence & Privilege
The skill is not always-enabled and does not request elevated privileges. Install steps place artifacts in user cache and suggest system package manager installs for espeak-ng (brew/apt/choco) which are standard one-time system deps. The skill does not attempt to modify other skills or global agent settings beyond optional OpenClaw plugin instructions.
Assessment
This skill looks coherent for an offline voice toolkit, but before installing:

  • Verify the upstream repository and npm package (github.com/drakulavich/kesha-voice-kit and the npm package page) and check release checksums or signatures if available.
  • Be aware that 'kesha install' downloads large models (GBs) from public hosts (GitHub/HuggingFace) and will consume disk and network.
  • The README suggests installing bun via an install script (curl | bash); prefer distro-packaged installers, or review the script before running it.
  • Consider installing and running the tool in a sandbox or isolated environment if you want to limit risk.
  • Note the small metadata inconsistency (the registry said no required binary while SKILL.md expects the kesha CLI); ensure the platform will install the CLI or that you will run the documented install steps.

Like a lobster shell, security has layers — review code before you run it.

License: MIT-0

kesha-voice-kit

Local voice toolkit: transcribe voice messages to text, synthesize speech, detect language of audio or text. Fully offline after kesha install. No API keys, no per-minute billing.

Trigger keywords for when to use this skill: voice message, voice memo, .ogg, .wav, .mp3, audio file, transcribe, transcription, speech-to-text, STT, text-to-speech, TTS, synthesize speech, say, multilingual voice, multilingual ASR, language detection, offline voice, privacy, Apple Silicon, CoreML.

When to use

  • Voice memo arrived (Telegram, WhatsApp, Slack, Signal .ogg/.opus/.m4a): transcribe with kesha --json <path> and branch on the detected language.
  • Need to reply with audio: synthesize with kesha say "<text>" > reply.wav. Auto-routes by detected language (Kokoro-82M for English, Piper for Russian). For other languages and ~180 more voices use --voice macos-* on macOS (zero model download).
  • Need to detect what language a file is in before choosing a pipeline: kesha --json audio.ogg returns both audio-based and text-based language detection with confidence scores.
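The flow these bullets describe (transcribe, branch on detected language, reply with audio) can be sketched as a small wrapper. This is a hedged sketch, not part of the toolkit: the helper names and reply strings are illustrative, only the flags documented on this page (`--json`, `say`) are used, and the kesha calls only run if the CLI is actually on PATH.

```python
import json
import shutil
import subprocess

# Canned replies per detected language code; purely illustrative.
REPLIES = {"ru": "Понял, отвечаю голосом.", "en": "Got it, replying with audio."}

def pick_reply(lang: str) -> str:
    """Branch on the `lang` code returned by `kesha --json`."""
    return REPLIES.get(lang, REPLIES["en"])

def handle_memo(path: str) -> None:
    """Transcribe a voice memo, then synthesize a reply in the same language."""
    out = subprocess.run(["kesha", "--json", path],
                         capture_output=True, text=True, check=True)
    result = json.loads(out.stdout)[0]   # documented output is a JSON array
    reply = pick_reply(result["lang"])
    with open("reply.wav", "wb") as f:
        subprocess.run(["kesha", "say", reply], stdout=f, check=True)

if __name__ == "__main__" and shutil.which("kesha"):
    handle_memo("voice.ogg")  # only runs when the kesha CLI is installed
```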

STT: transcribe audio

# JSON output with language detection (recommended for automation)
kesha --json voice.ogg
[{
  "file": "voice.ogg",
  "text": "Привет, как дела?",
  "lang": "ru",
  "audioLanguage": { "code": "ru", "confidence": 0.98 },
  "textLanguage": { "code": "ru", "confidence": 0.99 }
}]

Use lang (or the more detailed audioLanguage/textLanguage) to decide how to respond.
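One way to combine the two detections is to prefer whichever passes a confidence threshold and fall back to `lang` otherwise. A minimal sketch against the sample output above; the 0.8 threshold and the preference order (audio first) are illustrative choices, not toolkit behavior:

```python
import json

def detected_language(entry: dict, min_conf: float = 0.8) -> str:
    """Prefer audio-based detection when confident, then text-based, else `lang`."""
    for key in ("audioLanguage", "textLanguage"):
        det = entry.get(key)
        if det and det.get("confidence", 0.0) >= min_conf:
            return det["code"]
    return entry["lang"]

# The sample `kesha --json` output shown above:
sample = json.loads('[{"file":"voice.ogg","text":"Привет, как дела?","lang":"ru",'
                    '"audioLanguage":{"code":"ru","confidence":0.98},'
                    '"textLanguage":{"code":"ru","confidence":0.99}}]')
print(detected_language(sample[0]))  # ru
```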

Formats: .ogg, .opus, .mp3, .m4a, .wav, .flac, .webm — decoded via symphonia, no ffmpeg required.

Other output modes:

  • kesha audio.ogg — plain transcript on stdout
  • kesha --format transcript audio.ogg — transcript + [lang: ru, confidence: 0.99] footer
  • kesha --verbose audio.ogg — human-readable with language info
  • kesha --lang en audio.ogg — warn if detected language differs (useful sanity check)

TTS: synthesize speech

kesha say "Hello, world" > hello.wav               # auto-routes en → Kokoro-82M
kesha say "Привет, мир" > privet.wav              # auto-routes ru → Piper
kesha say --voice macos-de-DE "Guten Tag" > de.wav # any macOS system voice — German, French, Italian, ...
kesha say --list-voices                            # Kokoro + Piper + ~180 macos-* voices

Output: WAV mono float32. --out <path> writes to a file instead of stdout.
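The float32 claim can be verified from the WAV header: IEEE-float WAV uses format tag 3, which Python's stdlib `wave` module rejects, so a sketch reads the fmt chunk directly with `struct`. The header bytes below are hand-built for illustration (the 24 kHz sample rate is an arbitrary example value, not a documented property of kesha's output), and the parser assumes the canonical chunk layout:

```python
import struct

def wav_info(data: bytes):
    """Parse the fmt chunk of a canonical WAV: (format_tag, channels, sample_rate, bits)."""
    assert data[:4] == b"RIFF" and data[8:12] == b"WAVE", "not a WAV file"
    assert data[12:16] == b"fmt ", "non-canonical chunk layout"
    fmt_tag, channels, rate, _byte_rate, _align, bits = struct.unpack("<HHIIHH", data[20:36])
    return fmt_tag, channels, rate, bits

# Hand-built minimal header: IEEE float (tag 3), mono, 24 kHz, 32-bit, no samples.
hdr = (b"RIFF" + struct.pack("<I", 36) + b"WAVE"
       + b"fmt " + struct.pack("<I", 16)
       + struct.pack("<HHIIHH", 3, 1, 24000, 24000 * 4, 4, 32)
       + b"data" + struct.pack("<I", 0))
print(wav_info(hdr))  # (3, 1, 24000, 32) → float32 mono
```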

Language detection standalone

kesha --json audio.ogg includes both audio-based (audioLanguage) and text-based (textLanguage) detection. Use audio detection to identify the language before running language-specific logic.

Install

bun add --global @drakulavich/kesha-voice-kit    # or: npm i -g @drakulavich/kesha-voice-kit
kesha install                                    # downloads engine (~350 MB)
kesha install --tts                              # adds Kokoro + Piper RU (~390 MB more, for TTS)

One-time runtime prereq for TTS on each platform:

  • macOS: brew install espeak-ng
  • Linux: sudo apt install espeak-ng
  • Windows: choco install espeak-ng

macos-* voices need no install — they use voices already on the Mac.
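Before scripting against kesha, it can help to check that the one-time prerequisites above are in place. A stdlib-only sketch; the binary names come from the install steps on this page, and the function itself is a hypothetical helper, not part of the toolkit:

```python
import shutil

def missing_prereqs(tts: bool = False) -> list:
    """Return the names of required binaries not found on PATH."""
    required = ["kesha"] + (["espeak-ng"] if tts else [])
    return [name for name in required if shutil.which(name) is None]

missing = missing_prereqs(tts=True)
if missing:
    print("install first:", ", ".join(missing))
```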

Supported languages

Speech-to-text (25): Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Ukrainian.

Text-to-speech: English (Kokoro-82M, ~70 voices), Russian (Piper ru-denis), plus any macOS system voice via --voice macos-*.

Performance

  • ASR: ~19× faster than OpenAI Whisper on Apple Silicon (CoreML via FluidAudio), ~2.5× on CPU (ONNX via ort).
  • TTS: sub-second latency for short utterances on Apple Silicon.

Why local

No API keys to manage. No per-minute billing. Voice data never leaves the machine — important for regulated industries, personal messaging, and anything that shouldn't be in a third-party log.
