Transcribe Video To Text

v1.0.0

Skip the learning curve of professional editing software. Describe what you want — transcribe the spoken dialogue into a text document — and get text transcr...

⭐ 0· 44·0 current·0 all-time

by @vcarolxhberger

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for vcarolxhberger/transcribe-video-to-text.

Install the skill "Transcribe Video To Text" (vcarolxhberger/transcribe-video-to-text) from ClawHub.
Skill page: https://clawhub.ai/vcarolxhberger/transcribe-video-to-text
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Required env vars: NEMO_TOKEN
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install transcribe-video-to-text

ClawHub CLI

npx clawhub@latest install transcribe-video-to-text

Security Scan

VirusTotal

Benign

OpenClaw

Suspicious

medium confidence

ℹ

Purpose & Capability

The name/description (video->text transcription) match the runtime instructions to call a remote Nemovideo API and upload media. Requesting a single NEMO_TOKEN credential is appropriate. However, the metadata also declares a config path (~/.config/nemovideo/) and platform-detection via install paths — these suggest filesystem access beyond what's needed strictly to upload a file, and there's no homepage or publisher provenance to justify trusting the remote backend.

Instruction Scope

The SKILL.md instructs uploading user video files to https://mega-api-prod.nemovideo.ai and using either an environment NEMO_TOKEN or an acquired anonymous token. It also instructs reading the skill frontmatter and detecting install paths to set attribution headers — which requires local file/path access. The metadata's configPaths hint at reading ~/.config/nemovideo/, but the instructions don't explain why; reading arbitrary local config could expose other tokens or secrets.

✓

Install Mechanism

No install spec and no code files (instruction-only). That minimizes filesystem/write risk because nothing is downloaded or executed by the skill itself.

ℹ

Credentials

Only NEMO_TOKEN is declared as required (primaryEnv). That is proportional for a service that needs an API token. Concerns: metadata lists a config path that could allow the skill to locate stored tokens or configs; the instructions ask to use NEMO_TOKEN if present, otherwise to call an anonymous-token endpoint — both behaviors are plausible but the implicit ability to read local config or detect install paths increases scope of access to secrets.

✓

Persistence & Privilege

always is false and there is no install or service that persists on-disk. The skill can be invoked autonomously (default), which is normal, and it does not request permanent presence or modify other skills.

What to consider before installing

This skill will upload whatever video you provide to a third-party endpoint (mega-api-prod.nemovideo.ai) and will use an API token (NEMO_TOKEN) if present; otherwise it will request an anonymous token from the service. Things to consider before installing or using:

1) The publisher and homepage are unknown; verify the service and owner before sending sensitive content.
2) The skill metadata references a local config path (~/.config/nemovideo/) and asks to detect install paths; ask the author why it needs to read local files and ensure it won't search for or exfiltrate other credentials.
3) Prefer using an anonymous/starter token rather than providing long-lived tokens; do not supply personal or sensitive videos until you trust the backend.
4) If you must use it with private data, inspect network endpoints, TLS certs, and consider running uploads through a vetted proxy or sandboxed agent.

If possible, ask the skill author for a privacy/data-retention policy and a verified homepage/maintainer identity; that would increase confidence.

Like a lobster shell, security has layers — review code before you run it.

Runtime requirements

📝 Clawdis

Env: NEMO_TOKEN

Primary env: NEMO_TOKEN

latest: vk97821dd423ndd3nqn07d0f90985njhk

44 downloads

0 stars

1 version

Updated 1d ago

v1.0.0

MIT-0

Getting Started

Got video files to work with? Send them over and tell me what you need. I'll take care of the AI speech transcription.

Try saying:

"convert a 10-minute interview recording into a 1080p MP4"
"transcribe the spoken dialogue into a text document"
"convert spoken video content into readable text for journalists, students, and content creators"

Getting Connected

Before handling any user request, establish a connection to the backend API. Show a brief status like "Connecting...".

If NEMO_TOKEN is in the environment, use it directly and create a session. Otherwise, acquire a free starter token:

1. Generate a UUID to use as the client identifier.
2. POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with the X-Client-Id header.
3. The response includes a token with 100 free credits valid for 7 days; use it as NEMO_TOKEN.

Then create a session by POSTing to https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent with Bearer authorization and body {"task_name":"project","language":"en"}. The session_id in the response is needed for all following requests.
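The connection steps above can be sketched as two request builders. This is a minimal sketch, not the skill's actual client: it only constructs the requests and sends nothing. The empty body on the anonymous-token call, the Content-Type header, and the function names are assumptions; the attribution headers required elsewhere in this document are left out for brevity.

```python
import json
import os
import uuid
import urllib.request

BASE_URL = "https://mega-api-prod.nemovideo.ai"


def existing_token():
    """Return NEMO_TOKEN from the environment, or None if it is not set."""
    return os.environ.get("NEMO_TOKEN") or None


def build_anonymous_token_request(client_id=None):
    """Build (but do not send) the POST that asks for a free starter token."""
    client_id = client_id or str(uuid.uuid4())
    return urllib.request.Request(
        BASE_URL + "/api/auth/anonymous-token",
        data=b"",  # assumption: the endpoint needs no request body
        headers={"X-Client-Id": client_id},
        method="POST",
    )


def build_session_request(token, language="en"):
    """Build (but do not send) the POST that creates a session."""
    body = json.dumps({"task_name": "project", "language": language})
    return urllib.request.Request(
        BASE_URL + "/api/tasks/me/with-session/nemo_agent",
        data=body.encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",  # assumption
        },
        method="POST",
    )
```

Passing either request to `urllib.request.urlopen` would perform the call. The field names in the responses are not documented here, so parsing is deliberately left out.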

Tell the user you're ready. Keep the technical details out of the chat.

Transcribe Video to Text — Convert Video Speech to Text

Drop your video files in the chat and tell me what you need. I'll handle the AI speech transcription on cloud GPUs — you don't need anything installed locally.

Here's a typical use: you send a 10-minute interview recording, ask to transcribe the spoken dialogue into a text document, and about 1-2 minutes later you've got an MP4 file ready to download. The whole thing runs at 1080p by default.

One thing worth knowing: clips under 5 minutes are transcribed faster and more accurately.

Matching Input to Actions

Prompts that mention video-to-text transcription, aspect ratio, text overlays, or audio tracks are routed to the corresponding action via keyword and intent classification.

User says...	Action	Skip SSE?
"export" / "导出" / "download" / "send me the video"	→ §3.5 Export	✅
"credits" / "积分" / "balance" / "余额"	→ §3.3 Credits	✅
"status" / "状态" / "show tracks"	→ §3.4 State	✅
"upload" / "上传" / user sends file	→ §3.2 Upload	✅
Everything else (generate, edit, add BGM…)	→ §3.1 SSE	❌
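The routing table above could be approximated with a plain keyword lookup. This is a sketch under assumptions: it uses substring matching only (no intent classification), and the action names and the `has_file` flag are hypothetical labels, not part of the backend API.

```python
# Keyword groups copied from the routing table; checked in table order.
ROUTES = [
    ("export", ("export", "导出", "download", "send me the video")),
    ("credits", ("credits", "积分", "balance", "余额")),
    ("state", ("status", "状态", "show tracks")),
    ("upload", ("upload", "上传")),
]


def route(message, has_file=False):
    """Return the action name for a user message; fall back to the SSE chat."""
    if has_file:
        # A message that carries a file is always treated as an upload.
        return "upload"
    text = message.lower()
    for action, keywords in ROUTES:
        if any(keyword in text for keyword in keywords):
            return action
    return "sse"
```

For example, `route("Please export it")` returns `"export"`, while `route("add BGM")` falls through to `"sse"`.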

Cloud Render Pipeline Details

Each export job queues on a cloud GPU node that composites video layers, applies platform-spec compression (H.264, up to 1080x1920), and returns a download URL within 30-90 seconds. The session token carries render job IDs, so closing the tab before completion orphans the job.

All calls go to https://mega-api-prod.nemovideo.ai. The main endpoints:

Session — POST /api/tasks/me/with-session/nemo_agent with {"task_name":"project","language":"<lang>"}. Gives you a session_id.
Chat (SSE) — POST /run_sse with session_id and your message in new_message.parts[0].text. Set Accept: text/event-stream. Up to 15 min.
Upload — POST /api/upload-video/nemo_agent/me/<sid> — multipart file or JSON with URLs.
Credits — GET /api/credits/balance/simple — returns available, frozen, total.
State — GET /api/state/nemo_agent/me/<sid>/latest — current draft and media info.
Export — POST /api/render/proxy/lambda with render ID and draft JSON. Poll GET /api/render/proxy/lambda/<id> every 30s for completed status and download URL.
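As an illustration of the Chat (SSE) entry above, here is a hedged sketch that builds the request without sending it. The message location `new_message.parts[0].text` and the Accept header come from the list; placing `session_id` at the top level of the JSON body and the Content-Type header are assumptions, and the attribution headers are omitted for brevity.

```python
import json
import urllib.request

BASE_URL = "https://mega-api-prod.nemovideo.ai"
CHAT_TIMEOUT_SECONDS = 15 * 60  # the stream may stay open for up to 15 minutes


def build_chat_request(token, session_id, message):
    """Build (but do not send) the POST /run_sse request for one user message."""
    payload = {
        "session_id": session_id,  # assumption: top-level field in the body
        "new_message": {"parts": [{"text": message}]},
    }
    return urllib.request.Request(
        BASE_URL + "/run_sse",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,
            "Accept": "text/event-stream",
            "Content-Type": "application/json",  # assumption
        },
        method="POST",
    )
```

A caller would open the request with `urllib.request.urlopen(req, timeout=CHAT_TIMEOUT_SECONDS)` and read the response line by line.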

Formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.

Skill attribution — read from this file's YAML frontmatter at runtime:

X-Skill-Source: transcribe-video-to-text
X-Skill-Version: from frontmatter version
X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)

All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
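The header rules above can be sketched as two small helpers. The function names are hypothetical, and the version is passed in as a parameter rather than parsed from the YAML frontmatter, which this sketch does not attempt.

```python
def detect_platform(install_path):
    """Map the skill's install path to the X-Skill-Platform value."""
    normalized = install_path.replace("\\", "/")
    if "/.clawhub/" in normalized:
        return "clawhub"
    if "/.cursor/skills/" in normalized:
        return "cursor"
    return "unknown"


def build_headers(token, version, install_path):
    """Return the headers that every request is expected to carry."""
    return {
        "Authorization": "Bearer " + token,
        "X-Skill-Source": "transcribe-video-to-text",
        "X-Skill-Version": version,
        "X-Skill-Platform": detect_platform(install_path),
    }
```

For example, an install path under `~/.cursor/skills/` yields `X-Skill-Platform: cursor`, and any path matching neither pattern yields `unknown`.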

Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.

Timeline (3 tracks): 1. Video: city timelapse (0-10s) 2. BGM: Lo-fi (0-10s, 35%) 3. Title: "Urban Dreams" (0-3s)
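To make the abbreviated draft fields easier to read, a recursive key expansion along these lines may help. The long names (`track_type`, `duration_ms`, and so on) are illustrative choices, and the nesting of the sample draft in the usage note is an assumption; only the short-key meanings and the track type codes come from the mapping above.

```python
FIELD_NAMES = {
    "t": "tracks",
    "tt": "track_type",
    "sg": "segments",
    "d": "duration_ms",
    "m": "metadata",
}
TRACK_TYPES = {0: "video", 1: "audio", 7: "text"}


def expand(value):
    """Recursively replace short draft keys with readable names."""
    if isinstance(value, dict):
        result = {}
        for key, item in value.items():
            name = FIELD_NAMES.get(key, key)
            if key == "tt":
                # Translate the numeric track type; keep unknown codes as-is.
                result[name] = TRACK_TYPES.get(item, item)
            else:
                result[name] = expand(item)
        return result
    if isinstance(value, list):
        return [expand(item) for item in value]
    return value
```

Given a hypothetical draft `{"t": [{"tt": 0, "sg": [{"d": 10000}]}]}`, `expand` returns `{"tracks": [{"track_type": "video", "segments": [{"duration_ms": 10000}]}]}`.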

Backend Response Translation

The backend assumes a GUI exists. Translate these into API actions:

Backend says	You do
"click [button]" / "点击"	Execute via API
"open [panel]" / "打开"	Query session state
"drag/drop" / "拖拽"	Send edit via SSE
"preview in timeline"	Show track summary
"Export button" / "导出"	Execute export workflow

SSE Event Handling

Event	Action
Text response	Apply GUI translation (§4), present to user
Tool call/result	Process internally, don't forward
`heartbeat` / empty `data:`	Keep waiting. Every 2 min: "⏳ Still working..."
Stream closes	Process final response

~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.
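The event handling above can be sketched as a small filter over raw SSE lines. This follows the standard `event:` / `data:` line format; how the backend actually marks heartbeats (as an event name or as a data payload) is not specified here, so the sketch treats both as heartbeats, which is an assumption.

```python
def iter_sse_data(lines):
    """Yield meaningful data payloads, skipping heartbeats and empty data lines."""
    event = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            event = None  # a blank line ends the current SSE event
            continue
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
            continue
        if not line.startswith("data:"):
            continue  # comments and other fields are ignored
        payload = line[len("data:"):].strip()
        if event == "heartbeat" or payload in ("", "heartbeat"):
            continue  # keep waiting
        yield payload
```

If the stream closes and this generator produced nothing, that corresponds to the no-text case described above, where the session state should be polled instead.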

Error Handling

Code	Meaning	Action
0	Success	Continue
1001	Bad/expired token	Re-auth via anonymous-token (tokens expire after 7 days)
1002	Session not found	New session §3.0
2001	No credits	Anonymous: show registration URL with `?bind=<id>` (get `<id>` from create-session or state response when needed). Registered: "Top up credits in your account"
4001	Unsupported file	Show supported formats
4002	File too large	Suggest compress/trim
400	Missing X-Client-Id	Generate Client-Id and retry (see §1)
402	Free plan export blocked	Subscription tier issue, NOT credits. "Register or upgrade your plan to unlock export."
429	Rate limit (1 token/client/7 days)	Retry in 30s once
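One way to keep the error table actionable in code is a lookup from code to a handler name. The action labels below are hypothetical identifiers chosen for this sketch; only the codes and their meanings come from the table.

```python
ERROR_ACTIONS = {
    0: "continue",
    1001: "reauthenticate_with_anonymous_token",
    1002: "create_new_session",
    2001: "show_credit_options",
    4001: "show_supported_formats",
    4002: "suggest_compress_or_trim",
    400: "generate_client_id_and_retry",
    402: "prompt_register_or_upgrade",
    429: "retry_once_after_30s",
}


def action_for(code):
    """Return the handler name for a backend code, or a generic fallback."""
    return ERROR_ACTIONS.get(code, "report_unknown_error")
```

Keeping 402 and 2001 as separate entries mirrors the table's point that a blocked export is a plan issue, not a credit shortage.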

Tips and Tricks

The backend processes faster when you're specific. Instead of "make it look better", try "transcribe the spoken dialogue into a text document" — concrete instructions get better results.

Max file size is 500MB. Stick to MP4, MOV, AVI, WebM for the smoothest experience.

MP4 with clear audio gives the most accurate transcription results.
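A pre-upload check along these lines can catch the 4001 and 4002 errors before any bytes are sent. It is a sketch: the extension list is the Formats line from this document, and treating 500MB as 500 × 1024 × 1024 bytes is an assumption, since the exact limit in bytes is not stated.

```python
import os

SUPPORTED_EXTENSIONS = {
    "mp4", "mov", "avi", "webm", "mkv",
    "jpg", "png", "gif", "webp",
    "mp3", "wav", "m4a", "aac",
}
MAX_BYTES = 500 * 1024 * 1024  # assumption: "500MB" read as 500 MiB


def check_upload(filename, size_bytes):
    """Return "ok", "unsupported_format", or "too_large" for a candidate file."""
    extension = os.path.splitext(filename)[1].lower().lstrip(".")
    if extension not in SUPPORTED_EXTENSIONS:
        return "unsupported_format"
    if size_bytes > MAX_BYTES:
        return "too_large"
    return "ok"
```

For a real file, the size can be taken from `os.path.getsize(path)` before calling `check_upload`.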

Common Workflows

Quick edit: Upload → "transcribe the spoken dialogue into a text document" → Download MP4. Takes 1-2 minutes for a 30-second clip.

Batch style: Upload multiple files in one session. Process them one by one with different instructions. Each gets its own render.

Iterative: Start with a rough cut, preview the result, then refine. The session keeps your timeline state so you can keep tweaking.
