Audio Video To Text

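An audio/video-to-text skill that uses Whisper for speech recognition. It supports a range of audio and video formats and can output plain text, SRT/VTT subtitles, or JSON. Suitable for meeting notes, video subtitle generation, interview write-ups, and podcast transcription.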

MIT-0 · Free to use, modify, and redistribute. No attribution required.

by @ivan830826

Security Scan

VirusTotal: Benign

OpenClaw: Benign (high confidence)

✓ Purpose & Capability

Name/description (audio/video → text using Whisper) align with the included script and SKILL.md. Required tools (whisper package, ffmpeg) are explainable and necessary for transcription.

✓ Instruction Scope

SKILL.md and the script limit actions to installing dependencies, extracting audio, loading a Whisper model, transcribing, formatting output, and deleting temporary audio. There are no instructions to read unrelated files, access environment secrets, or send data to external endpoints.

ℹ Install Mechanism

This is an instruction-only skill (no install spec). The script depends on the openai-whisper and ffmpeg-python packages and a system ffmpeg binary. Note: loading Whisper models will typically download large model weight files from the network the first time they are used, consuming disk and bandwidth.

✓ Credentials

The skill requires no environment variables, credentials, or config paths. It does not access unrelated secrets or other services.

✓ Persistence & Privilege

The skill sets always: false and uses the default invocation settings. It does not attempt to persist or to modify other skills or system-wide agent configuration.

Assessment

This skill appears to do only local transcription with Whisper and ffmpeg. Before installing/running: (1) verify you trust the skill source and the PyPI package name (openai-whisper) you will install, (2) be aware that Whisper will likely download large model files (especially medium/large) which use network bandwidth and disk space and may require substantial RAM/GPU, (3) install ffmpeg from official sources, (4) run the script in a virtual environment or sandbox and inspect the code if you have concerns, and (5) only run it on files you trust (the script spawns ffmpeg as a subprocess and writes a temp audio file under /tmp by default).

Current version: v1.0.0

License

MIT-0. Terms: https://spdx.org/licenses/MIT-0.html

SKILL.md

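Audio/Video to Text

Overview

This skill uses the OpenAI Whisper model to convert audio and video files to text. It supports automatic language detection and several output formats.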

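When to use

Turning meeting recordings into written notes
Generating subtitles (SRT/VTT) for video content
Organizing interview and podcast content
Converting voice memos to text
Preparing multilingual video for translation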

Quick start

1. Install dependencies

pip install openai-whisper ffmpeg-python

Make sure ffmpeg is installed on the system:

# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg

# Windows
# Download from https://ffmpeg.org/download.html

2. Basic usage

python scripts/transcribe.py <input_file> [output_file] [options]

3. Examples

# Transcribe an MP4 video and output plain text
python scripts/transcribe.py meeting.mp4

# Transcribe audio and output SRT subtitles
python scripts/transcribe.py podcast.mp3 podcast.srt --output-format srt

# Specify Chinese and a smaller model (faster)
python scripts/transcribe.py interview.wav --model tiny --language zh

# Output JSON with timestamps
python scripts/transcribe.py video.mp4 result.json --output-format json

Command-line options

Option	Description	Default
`--model`	Model size: tiny, base, small, medium, large	base
`--language`	Language code: zh, en, ja, etc.	auto-detect
`--output-format`	Output format: txt, srt, vtt, json	txt
`--device`	Device to run on: cpu, cuda	cpu
`--keep-audio`	Keep the temporary audio file	false
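
The option table above maps naturally onto an argparse parser. The following is a minimal sketch only: it mirrors the documented options and defaults, but `build_parser` is a hypothetical name and the parser inside scripts/transcribe.py may be written differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options (not the script's own code)."""
    parser = argparse.ArgumentParser(description="Transcribe audio/video with Whisper")
    parser.add_argument("input", help="input audio or video file")
    parser.add_argument("output", nargs="?", default=None, help="optional output file")
    parser.add_argument("--model", default="base",
                        choices=["tiny", "base", "small", "medium", "large"])
    # None stands for "auto-detect", as in the Default column of the table.
    parser.add_argument("--language", default=None, help="language code such as zh, en, ja")
    parser.add_argument("--output-format", default="txt",
                        choices=["txt", "srt", "vtt", "json"])
    parser.add_argument("--device", default="cpu", choices=["cpu", "cuda"])
    parser.add_argument("--keep-audio", action="store_true",
                        help="keep the temporary audio file")
    return parser


# Same arguments as the "Chinese and a smaller model" example.
args = build_parser().parse_args(["interview.wav", "--model", "tiny", "--language", "zh"])
print(args.model, args.language, args.output_format, args.device, args.keep_audio)
```

Restricting `--model`, `--output-format`, and `--device` with `choices` makes a typo fail at parse time rather than after a model has started loading.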

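Model selection guide

Model	Parameters	Speed	Accuracy	Best for
tiny	39M	fastest	fair	quick tests, short audio
base	74M	fast	good	everyday use
small	244M	moderate	better	formal use
medium	769M	slow	very good	high-accuracy needs
large	1550M	slowest	best	professional transcription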

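Output formats

TXT (plain text)

This is the full text of the transcription, suitable for reading and editing.

SRT (subtitle format)

1
00:00:01,000 --> 00:00:04,000
This is the first subtitle line.

2
00:00:04,500 --> 00:00:07,000
This is the second subtitle line.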

VTT (web subtitles)

WEBVTT

00:00:01.000 --> 00:00:04.000
This is the first subtitle line.

00:00:04.500 --> 00:00:07.000
This is the second subtitle line.
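
The SRT and VTT samples differ mainly in the decimal marker of the timestamps (comma versus period), the cue numbering in SRT, and the WEBVTT header. The sketch below shows one way to render both from a list of segments. It assumes each segment is a dict with `start` and `end` in seconds plus `text`; the function names are hypothetical and are not taken from scripts/transcribe.py.

```python
def format_timestamp(seconds: float, decimal_marker: str = ",") -> str:
    """Render seconds as HH:MM:SS<marker>mmm (',' for SRT, '.' for VTT)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{ms:03d}"


def to_srt(segments) -> str:
    """Numbered cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"], ",")
        end = format_timestamp(seg["end"], ",")
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def to_vtt(segments) -> str:
    """WEBVTT header followed by unnumbered cues."""
    blocks = ["WEBVTT\n"]
    for seg in segments:
        start = format_timestamp(seg["start"], ".")
        end = format_timestamp(seg["end"], ".")
        blocks.append(f"{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


# Assumed sample data matching the two cues shown above.
segments = [
    {"start": 1.0, "end": 4.0, "text": "This is the first subtitle line."},
    {"start": 4.5, "end": 7.0, "text": "This is the second subtitle line."},
]
print(to_srt(segments))
print(to_vtt(segments))
```

Working in whole milliseconds before splitting into hours, minutes, and seconds avoids rounding a value such as 59.9996 s up to an invalid "60" seconds field.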

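JSON (full data)

Contains the complete information, including segments, timestamps, and confidence values, and is suited to programmatic processing.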
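
As a rough illustration of programmatic processing, the sketch below reads a JSON transcript and prints one line per segment. The exact schema written by scripts/transcribe.py is not documented here, so the keys (`text`, and `segments` with `start`, `end`, `text`) are an assumption modeled on the result dictionary returned by Whisper's `transcribe()`. Check a real output file before relying on them.

```python
import json
from typing import List


def summarize_segments(json_text: str) -> List[str]:
    """Return one '[start - end] text' line per segment (assumed schema)."""
    data = json.loads(json_text)
    lines = []
    # A missing "segments" key yields an empty list instead of an error.
    for seg in data.get("segments", []):
        lines.append(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text'].strip()}")
    return lines


# Assumed sample standing in for the contents of result.json.
sample = json.dumps({
    "text": "This is the first subtitle line. This is the second subtitle line.",
    "segments": [
        {"start": 1.0, "end": 4.0, "text": " This is the first subtitle line."},
        {"start": 4.5, "end": 7.0, "text": " This is the second subtitle line."},
    ],
})
for line in summarize_segments(sample):
    print(line)
```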

Supported file formats

Audio: MP3, WAV, FLAC, OGG, M4A, AAC

Video: MP4, AVI, MOV, MKV, WEBM, FLV
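
When files are collected automatically, it can help to check the extension against the two lists above before calling the script. This is a small sketch with hypothetical names. Whether scripts/transcribe.py performs a similar check itself is not stated here.

```python
from pathlib import Path

# Extensions taken from the supported-format lists above.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".ogg", ".m4a", ".aac"}
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv", ".webm", ".flv"}


def classify_media(path: str) -> str:
    """Return 'audio', 'video', or 'unsupported' based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    return "unsupported"


for name in ["meeting.mp4", "podcast.mp3", "notes.txt"]:
    print(name, classify_media(name))
```

The check looks only at the file name. A file with a listed extension can still fail to decode if its contents are damaged or use an unusual codec.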

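Performance tips

Prefer the tiny/base models for short audio - fast, and accurate enough
Use the CPU for long content - avoids running out of GPU memory
Specify the language - can improve accuracy and speed
Batch processing - the script can be called in a loop to handle multiple files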
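
The batch-processing tip can be sketched as a loop that builds one command per file, using the same argument order as the documented examples. The helper names and the `dry_run` switch are hypothetical. With `dry_run=True` nothing is executed and the commands are only returned for inspection.

```python
import subprocess
import sys
from pathlib import Path


def build_command(media_file: Path, output_format: str = "txt", model: str = "base") -> list:
    """Build one transcribe.py invocation; the output file sits next to the input."""
    output_file = media_file.with_suffix(f".{output_format}")
    return [
        sys.executable, "scripts/transcribe.py",
        str(media_file), str(output_file),
        "--output-format", output_format,
        "--model", model,
    ]


def transcribe_folder(folder: str, pattern: str = "*.mp3", dry_run: bool = True) -> list:
    """Collect commands for every matching file and optionally run them in order."""
    commands = [build_command(path) for path in sorted(Path(folder).glob(pattern))]
    if not dry_run:
        for command in commands:
            # check=True stops the batch at the first failing file.
            subprocess.run(command, check=True)
    return commands


print(build_command(Path("podcast.mp3"), output_format="srt")[1:])
```

Running the files one after another keeps only a single model in memory at a time, which fits the memory advice elsewhere in this document.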

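Common issues

Poor transcription quality

Try a larger model (small/medium/large)
Specify the correct language code
Make sure the audio is clear

Slow processing

Use a smaller model (tiny/base)
If a GPU is available, use --device cuda
Shorten the audio or process it in segments

Out of memory

Use a smaller model
Split long files and process the parts separately
Close other memory-hungry programs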
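
One way to split a long file before transcription is ffmpeg's segment muxer. The sketch below only builds the command. The function name, the 600-second default, and the `_partNNN` naming are illustrative assumptions, not part of this skill.

```python
from pathlib import Path


def build_split_command(input_file: str, chunk_seconds: int = 600) -> list:
    """Build an ffmpeg command that cuts the input into fixed-length parts."""
    source = Path(input_file)
    # %03d is expanded by ffmpeg to 000, 001, 002, ...
    pattern = source.with_name(f"{source.stem}_part%03d{source.suffix}")
    return [
        "ffmpeg", "-i", str(source),
        "-f", "segment",
        "-segment_time", str(chunk_seconds),
        "-reset_timestamps", "1",
        "-c", "copy",  # stream copy: no re-encoding
        str(pattern),
    ]


print(" ".join(build_split_command("lecture.wav")))
```

Because the streams are copied rather than re-encoded, video files are cut at keyframes, so parts may not be exactly `chunk_seconds` long. Each part is also transcribed with timestamps starting from zero, so subtitle times need an offset if the parts are merged afterwards.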

Scripts

scripts/transcribe.py - main transcription script
