VoScript API

v1.1.2

VoScript self-hosted speech transcription API skill. Covers the full workflow: submit audio, poll job status, fetch results, export subtitles (SRT/TXT/JSON),...


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for mapleeve/voscript-api.

Prompt Preview: Install & Setup
Install the skill "VoScript API" (mapleeve/voscript-api) from ClawHub.
Skill page: https://clawhub.ai/mapleeve/voscript-api
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install voscript-api

ClawHub CLI

Package manager switcher

npx clawhub@latest install voscript-api
Security Scan
Capability signals
Requires sensitive credentials
These labels describe what authority the skill may exercise. They are separate from suspicious or malicious moderation verdicts.
VirusTotal
Benign
View report →
OpenClaw
Benign
high confidence
Purpose & Capability
Name/description match the included scripts and docs: all files implement REST calls to a user-supplied VoScript server (submit, poll, fetch, export, voiceprint enrollment/management). The only minor metadata mismatch is that the registry 'Requirements' lists no required env vars while SKILL.md and scripts clearly expect VOSCRIPT_URL and VOSCRIPT_API_KEY (SKILL.md instructs the agent to prompt the user when they are absent). This is a documentation/metadata gap, not a functional mismatch.
Instruction Scope
SKILL.md and the scripts confine actions to communicating with the configured VoScript HTTP API and local file I/O for exports or prompts. The scripts read common environment variables for configuration and language detection (LANG/LANGUAGE) and provide diagnostics, but they do not attempt to read unrelated system secrets or files. They do provide operations that modify server state (enroll/delete voiceprints) which is appropriate for the stated capability.
Install Mechanism
There is no install spec (instruction-only), so nothing is downloaded or executed automatically — lower risk. However, the packaged scripts require Python and the 'requests' library (common.py notes 'stdlib + requests'), which is not declared in registry metadata; the user/agent will need to ensure a Python runtime and requests are available before running scripts.
Credentials
The only sensitive items the skill uses are VOSCRIPT_URL and VOSCRIPT_API_KEY, which are exactly the credentials needed to talk to a self‑hosted VoScript server. The scripts also inspect LANG/LC_* for UI language selection, which is non‑sensitive. There are no requests for unrelated cloud credentials or system tokens.
Persistence & Privilege
The skill is not marked always:true and does not attempt to modify other skills or system-wide agent settings. It performs only its own client operations against the configured server. Autonomous invocation is allowed by default (normal); this combined with the skill's limited scope does not raise additional concerns.
Assessment
This skill is coherent with its description, but before installing consider: (1) You must provide a trusted VOSCRIPT_URL and VOSCRIPT_API_KEY — do not give keys for production services you don't control. (2) The skill will send audio (and can enroll/delete voiceprints) to the configured server — voiceprints are biometric data; beware privacy implications. (3) Ensure a Python runtime and the 'requests' package are available (the package has no install spec). (4) The registry metadata omits the environment variables the scripts expect — SKILL.md is authoritative. If you proceed, point the skill at a VoScript instance you control or trust, and rotate the API key if you suspect misuse.

Like a lobster shell, security has layers — review code before you run it.

Tags: audio · latest · self-hosted · speech-to-text · transcription · voiceprint
85 downloads · 0 stars · 1 version · Updated 5d ago · v1.1.2 · MIT-0

VoScript API Skill Package

VoScript is a self-hosted speech transcription service supporting multi-speaker diarization, voiceprint identification, denoising, and multi-format export. This skill package wraps all of the major workflows of its REST API.

Important: this skill is agent-agnostic. It works equally with Claude, Codex, Trae, Hermes, OpenClaw, or any other AI agent, and relies on no vendor-specific features.

1. Configuration

VoScript access is configured with two parameters:

  • VOSCRIPT_URL: the service address, e.g. http://localhost:7880
  • VOSCRIPT_API_KEY: the authentication key required to call the API

Setting them as environment variables is recommended; every script also accepts --url / --api-key command-line overrides.

When VOSCRIPT_URL or VOSCRIPT_API_KEY is not configured, the agent must:

  1. First ask the user for the service address and API key;
  2. Tell the user how to configure them:
    • Environment variables: export VOSCRIPT_URL=... / export VOSCRIPT_API_KEY=...
    • Or the scripts' --url <URL> / --api-key <KEY> arguments.

See ${SKILL_PATH}/references/configuration.md for details.

2. Submit Audio for Transcription

Upload an audio file and create a transcription job.

Endpoint: POST /api/transcribe (multipart/form-data)

Request parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| file | file | yes | - | Audio file to transcribe |
| language | string | no | auto | Language code, e.g. zh / en |
| min_speakers | int | no | 1 | Minimum number of speakers |
| max_speakers | int | no | 10 | Maximum number of speakers |
| denoise_model | string | no | none | One of none / deepfilternet / noisereduce |
| snr_threshold | float | no | 10.0 | Signal-to-noise ratio threshold |
| no_repeat_ngram_size | int | no | 0 | Suppress n-gram repetition during decoding |

curl -X POST "$VOSCRIPT_URL/api/transcribe" \
  -H "X-API-Key: $VOSCRIPT_API_KEY" \
  -F "file=@/path/to/audio.wav" \
  -F "language=zh" \
  -F "min_speakers=1" \
  -F "max_speakers=10"

Response fields:

| Field | Type | Description |
|---|---|---|
| id | string | Job / transcription ID (of the form tr_xxx); all subsequent endpoints use it as the primary key |
| status | string | Initial status, normally queued; completed if the upload hit deduplication |
| deduplicated | bool | Optional field; when present and true, the audio matched an existing SHA-256 hash and the existing result is returned directly |

deduplicated: true is not an error; it is a normal response. VoScript computes a SHA-256 hash of the audio content, and if the same file has already been processed it returns the existing result directly. In that case status is already "completed", so there is no need to poll again; fetch the result with the returned id.
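The deduplication branch described above can be sketched in a few lines; `next_action` is a hypothetical helper name for this example, and the field names follow the response table above.

```python
def next_action(submit_response: dict) -> str:
    """Decide the next step from a parsed /api/transcribe response."""
    if submit_response.get("deduplicated"):
        # Dedup hit: the result already exists, so skip polling entirely.
        return "fetch:" + submit_response["id"]
    # Fresh job: poll /api/jobs/{id} until it reaches a terminal state.
    return "poll:" + submit_response["id"]

# A dedup hit goes straight to fetching the existing result.
print(next_action({"id": "tr_abc", "status": "completed", "deduplicated": True}))
# → fetch:tr_abc
# A fresh job must be polled first.
print(next_action({"id": "tr_def", "status": "queued"}))
# → poll:tr_def
```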

Error responses:

| HTTP | Meaning | Troubleshooting |
|---|---|---|
| 401 | Invalid API key | Check that VOSCRIPT_API_KEY is correct and has no stray whitespace |
| 413 | File too large | Exceeds the server's MAX_UPLOAD_BYTES limit (default 2 GiB) |
| 422 | Parameter validation failed | Check that the min_speakers/max_speakers/denoise_model values are valid |
| 500 | Server error | Check the container logs: docker logs voscript |

Run the script:

python ${SKILL_PATH}/scripts/submit_audio.py \
  --file <PATH> \
  [--language zh] \
  [--min-speakers 1] \
  [--max-speakers 10]

3. Poll Job Status

Endpoint: GET /api/jobs/{job_id}

curl -X GET "$VOSCRIPT_URL/api/jobs/tr_xxx" \
  -H "X-API-Key: $VOSCRIPT_API_KEY"

State machine: queued → converting → denoising → transcribing → identifying → completed | failed

Status meanings and typical durations:

| Status | Meaning | Typical duration |
|---|---|---|
| queued | Waiting for GPU resources | Instant to a few seconds |
| converting | ffmpeg format conversion | A few seconds |
| denoising | DeepFilterNet denoising | 10-30 seconds (optional step) |
| transcribing | Whisper + pyannote transcription | 20-50% of the audio duration |
| identifying | Voiceprint matching | A few seconds |
| completed | Done | - |
| failed | Failed | Check the error field |

⚠️ The recommended polling interval is 5 seconds. The first model load takes 2-5 minutes (first run only); a polling timeout does not mean failure, so keep waiting or check /healthz.
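The polling advice above can be sketched as a loop; `wait_for_job` and the injected `get_status` callable are assumptions of this sketch (a real client would wrap GET /api/jobs/{job_id}), chosen so the loop is testable without a live server.

```python
import time

def wait_for_job(get_status, interval=5.0, timeout=600.0):
    """Poll until the job reaches a terminal state (completed/failed)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()  # e.g. a wrapper around GET /api/jobs/{job_id}
        if status in ("completed", "failed"):
            return status
        time.sleep(interval)
    # A timeout is not proof of failure: first-time model loading can take
    # 2-5 minutes, so callers may retry with a longer timeout or check /healthz.
    raise TimeoutError("job still running; retry or check /healthz")

# Simulated run: a fake server reports each pipeline stage in turn.
stages = iter(["queued", "converting", "transcribing", "completed"])
print(wait_for_job(lambda: next(stages), interval=0.0))  # → completed
```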

Common errors:

| HTTP | Meaning | Troubleshooting |
|---|---|---|
| 401 | Invalid API key | Check VOSCRIPT_API_KEY |
| 404 | job_id not found | Verify the ID spelling; the job may also have been cleaned up |

Run the script:

python ${SKILL_PATH}/scripts/poll_job.py --job-id tr_xxx

Detailed state machine and per-stage timings: ${SKILL_PATH}/references/job-lifecycle.md

4. Fetch Transcription Results

Endpoint: GET /api/transcriptions/{tr_id}

curl -X GET "$VOSCRIPT_URL/api/transcriptions/tr_xxx" \
  -H "X-API-Key: $VOSCRIPT_API_KEY"

The response includes the complete result: segments, speaker_map, params, and more.

Segment fields:

| Field | Type | Description |
|---|---|---|
| id | int | Segment index |
| start / end | float | Start/end time in seconds |
| text | string | Transcribed text |
| speaker_label | string | Raw pyannote label (e.g. SPEAKER_00); use this value when enrolling voiceprints |
| speaker_id | string\|null | Bound voiceprint ID; null means not enrolled |
| speaker_name | string | Display name (the person's name if enrolled, otherwise the same as speaker_label) |
| similarity | float\|int | AS-norm z-score, not a probability; typical range -1 to 2, match threshold ~0.5 |
| words | array\|null | Word-level alignment (present when forced alignment succeeded) |

similarity is an AS-norm-normalized z-score, not a probability in [0,1]. Values can exceed 1.0 (the highest observed is about 1.79). Using similarity > 0.5 to decide a match is a reasonable rule of thumb, but it must not be read as "50% confidence".
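As a worked example of the threshold above, segments can be split into matched and unmatched sets; `match_segments` is a hypothetical helper, and the sample scores are illustrative only.

```python
def match_segments(segments, threshold=0.5):
    """Split segments by AS-norm z-score; > threshold counts as a match."""
    matched, unmatched = [], []
    for seg in segments:
        sim = seg.get("similarity")
        # Scores above 1.0 are normal for a z-score; None means no voiceprint match.
        (matched if sim is not None and sim > threshold else unmatched).append(seg)
    return matched, unmatched

segs = [
    {"id": 0, "speaker_label": "SPEAKER_00", "similarity": 1.32},  # strong match
    {"id": 1, "speaker_label": "SPEAKER_01", "similarity": 0.18},  # below threshold
    {"id": 2, "speaker_label": "SPEAKER_02", "similarity": None},  # no voiceprint yet
]
ok, todo = match_segments(segs)
print([s["id"] for s in ok], [s["id"] for s in todo])  # → [0] [1, 2]
```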

Common errors:

| HTTP | Meaning | Troubleshooting |
|---|---|---|
| 404 | tr_id not found | Verify the ID; confirm the job has completed |
| 409 | Job not yet finished | Poll /api/jobs/{id} until completed first |

Run the script:

python ${SKILL_PATH}/scripts/fetch_result.py --tr-id tr_xxx

5. Export a Transcript

Endpoint: GET /api/export/{tr_id}?format=srt|txt|json

curl -X GET "$VOSCRIPT_URL/api/export/tr_xxx?format=srt" \
  -H "X-API-Key: $VOSCRIPT_API_KEY" \
  -o transcript.srt

Supported formats:

| format | Purpose | MIME |
|---|---|---|
| srt | Standard subtitle file with timestamps | application/x-subrip |
| txt | Plain text with speaker prefixes | text/plain |
| json | Full structured data (segments + speaker_map) | application/json |

Common errors:

| HTTP | Meaning | Troubleshooting |
|---|---|---|
| 404 | tr_id not found | Verify the ID |
| 422 | Invalid format parameter | Must be one of srt / txt / json |

Format details: ${SKILL_PATH}/references/export-formats.md

Run the script:

python ${SKILL_PATH}/scripts/export_transcript.py --tr-id tr_xxx --format srt

6. List Transcriptions

Endpoint: GET /api/transcriptions

curl -X GET "$VOSCRIPT_URL/api/transcriptions" \
  -H "X-API-Key: $VOSCRIPT_API_KEY"

Response fields:

| Field | Type | Description |
|---|---|---|
| id | string | Transcription ID |
| filename | string | Original file name |
| created_at | string | ISO 8601 creation time |
| segment_count | int | Number of segments |
| speaker_count | int | Number of speakers |

Run the script:

python ${SKILL_PATH}/scripts/list_transcriptions.py

7. Enroll a Voiceprint

Extract the segments matching a given speaker_label from an existing transcription as samples, and enroll or update a voiceprint.

Endpoint: POST /api/voiceprints/enroll

Request parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| tr_id | string | yes | Source transcription ID |
| speaker_label | string | yes | Raw pyannote label, e.g. SPEAKER_00 (not the display name!) |
| speaker_name | string | yes | Speaker's name (for display) |
| speaker_id | string | no | Pass an existing voiceprint ID to update that voiceprint |

curl -X POST "$VOSCRIPT_URL/api/voiceprints/enroll" \
  -H "X-API-Key: $VOSCRIPT_API_KEY" \
  -F "tr_id=tr_xxx" \
  -F "speaker_label=SPEAKER_00" \
  -F "speaker_name=张三"

Response fields:

| Field | Type | Description |
|---|---|---|
| action | string | created (new) or updated (an existing voiceprint was updated) |
| speaker_id | string | Voiceprint ID, usable for later binding |

Most common mistake: passing the display name instead of the raw label as speaker_label

  • ✗ Wrong: --speaker-label "张三"
  • ✓ Correct: --speaker-label "SPEAKER_00"

speaker_label must be the raw pyannote label (SPEAKER_00, SPEAKER_01, etc.) taken from the segment.speaker_label field of the transcription result.

After successful enrollment, the same speaker identified in later transcriptions is automatically matched to speaker_name.
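Before enrolling, it can help to list which raw labels still lack a voiceprint; `unenrolled_labels` is a hypothetical helper operating on the segment fields documented in section 4.

```python
def unenrolled_labels(segments):
    """Collect raw pyannote labels (SPEAKER_XX) with no bound voiceprint."""
    # speaker_id is null (None) for a segment whose label is not enrolled yet.
    enrolled = {s["speaker_label"] for s in segments if s.get("speaker_id")}
    all_labels = {s["speaker_label"] for s in segments}
    return sorted(all_labels - enrolled)

segs = [
    {"speaker_label": "SPEAKER_00", "speaker_id": "vp_1"},  # already enrolled
    {"speaker_label": "SPEAKER_01", "speaker_id": None},    # needs enrollment
    {"speaker_label": "SPEAKER_01", "speaker_id": None},
]
print(unenrolled_labels(segs))  # → ['SPEAKER_01']
```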

Error responses:

| HTTP | Meaning | Troubleshooting |
|---|---|---|
| 404 | Embedding not found for this speaker label | The speaker_label does not exist in that transcription. Check capitalization and confirm you are using the SPEAKER_XX form |
| 422 | Missing parameters | Confirm tr_id, speaker_label, and speaker_name are all provided |
| 401 | Invalid API key | Check VOSCRIPT_API_KEY |

Run the script:

python ${SKILL_PATH}/scripts/enroll_voiceprint.py \
  --tr-id tr_xxx \
  --speaker-label SPEAKER_00 \
  --speaker-name "张三"

8. List Voiceprints

Endpoint: GET /api/voiceprints

curl -X GET "$VOSCRIPT_URL/api/voiceprints" \
  -H "X-API-Key: $VOSCRIPT_API_KEY"

Response fields:

| Field | Type | Description |
|---|---|---|
| id | string | Voiceprint ID |
| name | string | Display name |
| sample_count | int | Number of accumulated samples |
| sample_spread | float\|null | Standard deviation of the pairwise cosine similarities among samples; null with a single sample; smaller values mean more consistent samples |
| created_at | string | ISO 8601 creation time |
| updated_at | string | ISO 8601 last-update time |

⚠️ A large sample_spread (e.g. > 0.3) means the samples differ widely and may include wrongly attributed segments; inspect the details with manage_voiceprint.py --action get and consider cleaning up.
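The definition above (standard deviation of pairwise cosine similarities) can be reproduced locally to sanity-check a voiceprint's samples. This is a sketch under assumptions: the server's exact estimator (population vs. sample standard deviation) is not documented, and `cosine`/`sample_spread` are names local to this example.

```python
import math
from itertools import combinations
from statistics import pstdev

def cosine(a, b):
    """Cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def sample_spread(embeddings):
    """Std dev of pairwise cosine similarities; None for a single sample."""
    if len(embeddings) < 2:
        return None
    sims = [cosine(a, b) for a, b in combinations(embeddings, 2)]
    return pstdev(sims)

consistent = [[1.0, 0.0], [0.99, 0.1], [1.0, 0.05]]  # near-identical samples
mixed = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]         # a wrong segment mixed in
print(sample_spread(consistent) < sample_spread(mixed))  # → True
print(sample_spread([[1.0, 0.0]]))  # → None
```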

Run the script:

python ${SKILL_PATH}/scripts/list_voiceprints.py

9. Assign a Speaker

Manually assign a speaker to a segment (to correct diarization errors or fill in unidentified segments).

Endpoint: PUT /api/transcriptions/{tr_id}/segments/{seg_id}/speaker

Request parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| speaker_name | string | yes | New display name for the speaker |
| speaker_id | string | no | Pass a voiceprint ID to bind an enrolled voiceprint |

curl -X PUT "$VOSCRIPT_URL/api/transcriptions/tr_xxx/segments/5/speaker" \
  -H "X-API-Key: $VOSCRIPT_API_KEY" \
  -F "speaker_name=李四"

💡 Use this when a segment's speaker was identified incorrectly or you want to override the automatic result by hand. Manual assignment does not affect the voiceprint library; it only changes that segment's display name.

Common errors:

| HTTP | Meaning | Troubleshooting |
|---|---|---|
| 404 | tr_id or seg_id not found | Verify the IDs; seg_id is the segment's index within that transcription |
| 422 | Missing parameters | Provide at least speaker_name |

Run the script:

python ${SKILL_PATH}/scripts/assign_speaker.py \
  --tr-id tr_xxx \
  --seg-id 5 \
  --speaker-name "李四"

10. Manage Voiceprints

| Operation | Endpoint | Parameters |
|---|---|---|
| Get details | GET /api/voiceprints/{speaker_id} | - |
| Rename | PUT /api/voiceprints/{speaker_id}/name | Form field name |
| Delete | DELETE /api/voiceprints/{speaker_id} | - |

curl -X GET "$VOSCRIPT_URL/api/voiceprints/<SPEAKER_ID>" \
  -H "X-API-Key: $VOSCRIPT_API_KEY"

Common errors:

| HTTP | Meaning | Troubleshooting |
|---|---|---|
| 404 | speaker_id not found | Confirm the ID via /api/voiceprints |
| 422 | Missing name field when renaming | Provide the form field name |

Run the script:

python ${SKILL_PATH}/scripts/manage_voiceprint.py \
  --action [get|rename|delete] \
  --speaker-id xxx \
  [--name "New Name"]

11. Rebuild the Voiceprint Cohort

AS-norm scoring depends on a cohort (a set of comparison samples).

Endpoint: POST /api/voiceprints/rebuild-cohort

curl -X POST "$VOSCRIPT_URL/api/voiceprints/rebuild-cohort" \
  -H "X-API-Key: $VOSCRIPT_API_KEY"

Response fields:

| Field | Type | Description |
|---|---|---|
| cohort_size | int | Number of samples in the rebuilt cohort |
| skipped | int | Number of samples skipped (low quality or duplicates) |
| saved_to | string | Path where the cohort file was saved |

💡 When to rebuild:

  • The first time after 10 speakers have been enrolled
  • Again after every 10-20 newly enrolled speakers
  • AS-norm scoring is most stable once the cohort size is ≥ 50

The service does not need to be stopped before rebuilding; the rebuild runs as a non-blocking background task.

Run the script:

python ${SKILL_PATH}/scripts/rebuild_cohort.py

The full voiceprint workflow and threshold guidance: ${SKILL_PATH}/references/voiceprint-guide.md

Error Response Conventions

VoScript returns standard HTTP status codes; an agent handling responses should branch according to this table:

| Status code | Meaning | Suggested handling |
|---|---|---|
| 200 | Success | Parse the response normally |
| 401 | Invalid API key | Ask the user to check VOSCRIPT_API_KEY |
| 404 | Resource not found | Verify tr_id / speaker_id / job_id |
| 409 | Resource state conflict | e.g. requesting results before the job is completed |
| 413 | File too large | Check the server's MAX_UPLOAD_BYTES (default 2 GiB) |
| 422 | Request parameter validation failed | Check parameters against the returned detail field; commonly a missing file |
| 500 | Server error | Collect the error field; check server logs if needed |
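The branching table above can be mirrored in client code; `handle_response` is a hypothetical dispatcher, and the action strings are only illustrative.

```python
def handle_response(status_code: int, body: dict) -> str:
    """Map a VoScript HTTP status code to a suggested handling string."""
    actions = {
        200: "parse the response normally",
        401: "ask the user to check VOSCRIPT_API_KEY",
        404: "verify tr_id / speaker_id / job_id",
        409: "poll /api/jobs/{id} until the job is completed",
        413: "check the server's MAX_UPLOAD_BYTES limit",
        # 422 responses carry a `detail` field describing the bad parameter.
        422: "check parameters: " + str(body.get("detail", "no detail given")),
        # 500 responses carry an `error` field; fall back to container logs.
        500: "server error: " + str(body.get("error", "see container logs")),
    }
    return actions.get(status_code, "unexpected status %d" % status_code)

print(handle_response(422, {"detail": "file is required"}))
# → check parameters: file is required
```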

Diagnostic Checklist

When something goes wrong, troubleshoot in this order:

  1. Service reachability: does curl $VOSCRIPT_URL/healthz return 200?
  2. Authentication: does the X-API-Key header match the container's VOSCRIPT_API_KEY environment variable, with no stray whitespace?
  3. Job status: confirm completed via /api/jobs/{id} before fetching results
  4. Voiceprint labels: enroll with the raw SPEAKER_XX label, not the display name
  5. similarity semantics: an AS-norm z-score, not a probability; ~0.5 is an empirical threshold
  6. Deduplication: deduplicated: true is a normal response, not an error
  7. First cold start: model loading takes 2-5 minutes; a polling timeout does not mean failure
  8. Container logs: run docker logs voscript for the full stack trace

Typical Usage Sequence

  1. Configure VOSCRIPT_URL / VOSCRIPT_API_KEY
  2. Upload audio with submit_audio.py to get a tr_id
  3. Poll with poll_job.py until completed
  4. Fetch segments with fetch_result.py and review the speaker diarization
  5. Enroll a real name for each SPEAKER_xx with enroll_voiceprint.py
  6. After accumulating 10+ voiceprints, run rebuild_cohort.py to refresh the AS-norm baseline
  7. Later transcriptions will automatically identify enrolled speakers
  8. When subtitle files are needed, export SRT/TXT/JSON with export_transcript.py

Reference Documents

  • ${SKILL_PATH}/references/configuration.md: configuration and authentication
  • ${SKILL_PATH}/references/job-lifecycle.md: the job state machine
  • ${SKILL_PATH}/references/voiceprint-guide.md: voiceprints and AS-norm
  • ${SKILL_PATH}/references/export-formats.md: export formats
