Speech-to-text, 3x faster than Whisper, remote FREE GPU

v1.3.1

3x Faster than Whisper, Speech-to-text transcription with sentence-level timestamps on remote (FREE) L4 GPU. Trigger when user says: transcribe, speech to te...


by @speech2srt

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for speech2srt/speech-transcribe.


Install the skill "Speech-to-text, 3x faster than Whisper, remote FREE GPU" (speech2srt/speech-transcribe) from ClawHub.
Skill page: https://clawhub.ai/speech2srt/speech-transcribe
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install speech-transcribe

ClawHub CLI


npx clawhub@latest install speech-transcribe

Security Scan

VirusTotal: Benign

OpenClaw: Suspicious (medium confidence)

ℹ

Purpose & Capability

The name/description (remote L4 GPU speech-to-text) aligns with the code and SKILL.md: it uses Modal, a CUDA image, faster-whisper/stable-whisper, and Modal volumes. Minor mismatches: SKILL.md advertises a “FREE L4 GPU” which is a marketing claim (Modal provisioning may be free or billable depending on account), and config.PYTHON_VERSION = '3.11' whereas the image requests add_python='3.12' (inconsequential but inconsistent). Overall the requested resources (Modal, GPU, volumes) are reasonable for the stated purpose.

Instruction Scope

SKILL.md instructs users to use the Modal CLI and modal run, and to upload files to Modal volumes — that is coherent. However, the runtime code (_load_model) forcibly replaces ~/.cache with a symlink to the models volume: if run outside the intended Modal container (e.g., if someone runs transcribe.py locally), this could delete the user's ~/.cache directory (shutil.rmtree) and then symlink it. The instructions do not explicitly warn about this destructive behavior or require running only inside the Modal container. There is also a partial/truncated bug in the provided transcribe.py (a bare 'jso' token in the truncated portion) indicating the code may not be robust as-is.

ℹ

Install Mechanism

There is no external install spec; the code builds a Modal image that apt_installs ffmpeg and pip_installs 'faster-whisper' and 'stable-ts' — typical for this use case and traceable to PyPI. No arbitrary URL downloads or shorteners are used. Building a custom container image is expected for GPU inference, but pip-installed dependencies mean you will execute third-party packages from PyPI inside the image (normal but requires trust in those packages).

✓

Credentials

The skill declares no required environment variables and does not request unrelated credentials. The error-handling doc notes HF_TOKEN as optional for higher Hugging Face rate limits; that is reasonable and optional. The skill will, however, operate against the user's Modal account (Modal token) and create volumes under that account — expected for remote GPU runs.

ℹ

Persistence & Privilege

always is false (good). The skill creates Modal volumes (create_if_missing=True) and mounts them into the job image — this is expected for caching models and storing outputs but does grant persistent remote storage of uploaded audio and downloaded models in the user's Modal account. The code's symlink attempts to make ~/.cache point to the persistent models volume inside the container; the destructive cache replacement behavior is the main persistence/privilege risk if the code is run outside the container context.

What to consider before installing

What to consider before installing/running this skill:

- The skill uses Modal (your Modal CLI/token) and will create and use Modal volumes to upload audio and store models/results. Uploaded audio and generated transcripts live in those volumes under your Modal account; treat that as remote storage.
- The code contains a step that removes ~/.cache and symlinks it to the models volume when the model loads. If you accidentally run transcribe.py locally rather than via 'modal run', this could delete your local ~/.cache (which may contain other cached credentials or valuable caches). Do NOT run the Python file directly on your machine unless you inspect and modify that behavior first.
- Dependencies are installed from PyPI (faster-whisper, stable-ts). That is expected for model inference, but you must trust those packages. The image also apt-installs ffmpeg.
- The README mentions HF_TOKEN as optional for higher rate limits; you should only set that if you trust the skill to access Hugging Face on your behalf.
- There is at least one apparent bug/typo in transcribe.py (truncated 'jso' usage), so expect that the code may need fixing before reliable use.

Recommendations:

1. Inspect transcribe.py and remove/modify the code that deletes ~/.cache (or ensure it runs only in an isolated container).
2. Run the skill in an isolated Modal account or project where persistent volumes and billing are acceptable.
3. Back up your local ~/.cache before trying local experimentation.
4. If you need stronger assurance, run the container build in an isolated environment and review all third-party dependencies (PyPI packages) before executing on real data.
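As an illustration of the first recommendation, here is a minimal sketch of a non-destructive cache redirect. It is not the code in transcribe.py: the function name link_cache_to_volume is hypothetical, and the MODAL_TASK_ID environment variable is an assumption about what marks a Modal container, so verify it against Modal's documentation before relying on it. The sketch refuses to act outside the container and renames an existing directory instead of deleting it.

```python
import os
from pathlib import Path


def link_cache_to_volume(cache_dir, volume_dir, env=os.environ, marker="MODAL_TASK_ID"):
    """Point cache_dir at volume_dir without deleting any existing data.

    `marker` is an ASSUMED environment variable that signals "inside the
    Modal container"; confirm the real name before using this.
    """
    if not env.get(marker):
        raise RuntimeError("refusing to touch the cache outside the Modal container")

    cache, target = Path(cache_dir), Path(volume_dir)
    if cache.is_symlink():
        if cache.resolve() == target.resolve():
            return cache  # already linked to the volume, nothing to do
        cache.unlink()  # removes only the stale link, never its target
    elif cache.exists():
        # Keep the old contents by renaming; this raises OSError if a
        # non-empty backup directory already exists.
        cache.rename(cache.with_name(cache.name + ".bak"))

    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.symlink_to(target, target_is_directory=True)
    return cache
```

Run locally without the marker set, the function raises before it touches anything, which is the behavior the review asks for.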

Like a lobster shell, security has layers — review code before you run it.


93 downloads · 1 star · 2 versions · Updated 1w ago


MIT-0

Speech Transcribe

Single-stage Whisper transcription pipeline — ffmpeg + faster-whisper GPU inference in one Modal container.

Pipeline code is bundled at ./transcribe.py and ./src/. After npx skills add, the skill runs from any directory.

Workflow

1. Prepare slug and identify files

Slug = task identifier (volume directory name). Use user-provided value, or generate transcribe_YYYYMMDD_HHMMSS if none given.
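The slug rule above could be sketched as follows. The helper name make_slug is hypothetical, and using local time for the timestamp is an assumption, since the skill does not state a timezone.

```python
from datetime import datetime
from typing import Optional


def make_slug(user_value: Optional[str] = None, now: Optional[datetime] = None) -> str:
    """Return the user-provided slug, or transcribe_YYYYMMDD_HHMMSS if none given."""
    if user_value and user_value.strip():
        return user_value.strip()
    # Assumption: local time is acceptable for the generated identifier.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"transcribe_{stamp}"
```

Passing now explicitly keeps the function deterministic in tests; in normal use it can be omitted.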

Directory input? Scan for audio/video (.m4a, .mp3, .mp4, .wav, .flac, .ogg, .aac, .mov, .avi), list with index, ask user to confirm selection.

Specific files? Use directly, no listing needed.
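For directory input, the scan-and-list step might look like the sketch below. The function names are hypothetical, and scanning recursively is an assumption; the skill only says to scan the directory.

```python
from pathlib import Path
from typing import List

# Extensions listed in step 1 of the workflow.
MEDIA_EXTENSIONS = {".m4a", ".mp3", ".mp4", ".wav", ".flac", ".ogg", ".aac", ".mov", ".avi"}


def find_media_files(directory: str) -> List[Path]:
    """Return audio/video files under `directory`, sorted by path.

    Assumption: subdirectories are included and the extension match is
    case-insensitive.
    """
    root = Path(directory)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in MEDIA_EXTENSIONS
    )


def format_listing(files: List[Path]) -> str:
    """Render a 1-based indexed list for the user to confirm."""
    return "\n".join(f"[{i}] {path}" for i, path in enumerate(files, start=1))
```

The indexed listing is what gets shown to the user before they confirm which files to upload.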

2. Upload to volume

Ensure volume exists (idempotent):

modal volume create speech2srt-data 2>/dev/null || true

Upload each file:

modal volume put speech2srt-data <local_file> <slug>/upload/

Modal put auto-creates remote directories — no need to create <slug>/upload/ manually.
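The per-file upload can be scripted by building the same modal volume put command shown above for each file. This is a sketch with hypothetical helper names; it defaults to a dry run that only returns the commands, so nothing is uploaded unless dry_run=False and the Modal CLI is installed and authenticated.

```python
import subprocess
from typing import Iterable, List

VOLUME = "speech2srt-data"


def upload_command(local_file: str, slug: str, volume: str = VOLUME) -> List[str]:
    """Build the argv for: modal volume put <volume> <local_file> <slug>/upload/"""
    return ["modal", "volume", "put", volume, local_file, f"{slug}/upload/"]


def upload_all(files: Iterable[str], slug: str, dry_run: bool = True) -> List[List[str]]:
    """Return the upload commands; execute them only when dry_run is False."""
    commands = [upload_command(f, slug) for f in files]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return commands
```

Passing an argument list rather than a shell string avoids quoting problems with file names that contain spaces.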

3. Run pipeline

Model options: tiny, base, small, medium, large-v3 (default: large-v3).

modal run ./transcribe.py --slug <slug> --model large-v3

Stream output in real time.

Ctrl+C? Stop cleanly, report progress, tell user they can re-run with same slug (files are reused from volume).
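One way to wrap this step is sketched below: validate the model name against the listed options, build the modal run command, stream its output line by line, and handle Ctrl+C. The helper names and the exit code 130 returned on interrupt are illustrative choices, not part of the skill.

```python
import subprocess
from typing import List

MODELS = ("tiny", "base", "small", "medium", "large-v3")


def run_command(slug: str, model: str = "large-v3", script: str = "./transcribe.py") -> List[str]:
    """Build the argv for: modal run ./transcribe.py --slug <slug> --model <model>"""
    if not slug:
        raise ValueError("slug must not be empty")
    if model not in MODELS:
        raise ValueError(f"unknown model {model!r}; choose one of {', '.join(MODELS)}")
    return ["modal", "run", script, "--slug", slug, "--model", model]


def stream(cmd: List[str]) -> int:
    """Run cmd, echo its output as it arrives, and return the exit code."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    try:
        for line in proc.stdout:
            print(line, end="", flush=True)
        return proc.wait()
    except KeyboardInterrupt:
        proc.terminate()
        proc.wait()
        print("Interrupted. Re-run with the same slug to reuse the uploaded files.")
        return 130  # illustrative: conventional shell exit code for SIGINT
```

A call such as stream(run_command("demo", "tiny")) would need the Modal CLI installed and authenticated.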

4. Download results

For each original file, outputs are:

<stem>_transcription.txt — plain text transcript
<stem>_transcription.srt — subtitle file with sentence-level timestamps

modal volume get speech2srt-data <slug>/output/<stem>_transcription.txt <original_directory>/
modal volume get speech2srt-data <slug>/output/<stem>_transcription.srt <original_directory>/

Preserve original directory tree — do not flatten into ./results/.
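To keep each result next to its source file, the download commands can be derived from the original path. This sketch uses a hypothetical helper name and assumes the output names follow the <stem>_transcription pattern listed above.

```python
from pathlib import Path
from typing import List


def download_commands(original_file: str, slug: str, volume: str = "speech2srt-data") -> List[List[str]]:
    """Build one `modal volume get` argv per output (.txt and .srt).

    The destination is the directory of the original file, so the local
    directory tree is preserved rather than flattened.
    """
    src = Path(original_file)
    dest = f"{src.parent.as_posix()}/"
    return [
        ["modal", "volume", "get", volume,
         f"{slug}/output/{src.stem}_transcription.{ext}", dest]
        for ext in ("txt", "srt")
    ]
```

For example, talks/day1/intro.m4a maps to intro_transcription.txt and intro_transcription.srt, both downloaded into talks/day1/.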

5. Clean up

modal volume rm speech2srt-data <slug> --recursive

6. Report

Output:

Done. Processed N file(s), RTF: X.XXx

Results:
  - <transcript_path>.txt  (X.X KB)
  - <transcript_path>.srt  (X.X KB)
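The report template above can be filled in with a small formatter. This is a sketch: format_report is a hypothetical name, sizes are read from the downloaded files, and the RTF value is taken as an input because the skill does not say how it is computed.

```python
from pathlib import Path
from typing import List


def format_report(n_files: int, rtf: float, outputs: List[str]) -> str:
    """Render the final report; `outputs` are local paths to downloaded results."""
    lines = [f"Done. Processed {n_files} file(s), RTF: {rtf:.2f}x", "", "Results:"]
    for out in outputs:
        size_kb = Path(out).stat().st_size / 1024  # bytes to KB
        lines.append(f"  - {out}  ({size_kb:.1f} KB)")
    return "\n".join(lines)
```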

If you need to remove background noise first, try speech-denoise. Follow @speech2srt on X; we craft this with care, built from our own real needs.

Setup

Before first run, verify:

Python 3.9+ — python -V. Below 3.9 → tell user to install from python.org
Modal CLI — modal config show:
- token_id null → modal setup to authenticate
- command not found → pip install modal then modal setup
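The first two checks can be automated with a sketch like the one below. The function name check_setup is hypothetical. It does not inspect token_id, because parsing the output of modal config show would require assumptions about its format.

```python
import shutil
import sys
from typing import Callable, List, Optional, Sequence


def check_setup(
    version_info: Sequence[int] = sys.version_info,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return a list of setup problems; an empty list means both checks passed."""
    problems = []
    if tuple(version_info[:2]) < (3, 9):
        problems.append("Python 3.9+ is required; install it from python.org")
    if which("modal") is None:
        problems.append("Modal CLI not found; run 'pip install modal', then 'modal setup'")
    return problems
```

The version_info and which parameters exist only so the checks can be exercised without changing the real environment; call check_setup() with no arguments in normal use.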

Model Options

Model options: tiny, base, small, medium, large-v3. Default: large-v3 (best accuracy). Use tiny for fast drafts.

Error Handling

See references/error-handling.md for detailed error recovery.
