VoScript API

v1.1.2

VoScript self-hosted speech transcription API skill. Covers the full workflow: submit audio, poll job status, fetch results, export subtitles (SRT/TXT/JSON),...

Security Scan
Capability signals
Requires sensitive credentials
These labels describe what authority the skill may exercise. They are separate from suspicious or malicious moderation verdicts.
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
Name/description match the included scripts and docs: all files implement REST calls to a user-supplied VoScript server (submit, poll, fetch, export, voiceprint enrollment/management). The only minor metadata mismatch is that the registry 'Requirements' lists no required env vars while SKILL.md and scripts clearly expect VOSCRIPT_URL and VOSCRIPT_API_KEY (SKILL.md instructs the agent to prompt the user when they are absent). This is a documentation/metadata gap, not a functional mismatch.
Instruction Scope
SKILL.md and the scripts confine actions to communicating with the configured VoScript HTTP API and local file I/O for exports or prompts. The scripts read common environment variables for configuration and language detection (LANG/LANGUAGE) and provide diagnostics, but they do not attempt to read unrelated system secrets or files. They do provide operations that modify server state (enroll/delete voiceprints) which is appropriate for the stated capability.
Install Mechanism
There is no install spec (instruction-only), so nothing is downloaded or executed automatically — lower risk. However, the packaged scripts require Python and the 'requests' library (common.py notes 'stdlib + requests'), which is not declared in registry metadata; the user/agent will need to ensure a Python runtime and requests are available before running scripts.
Credentials
The only sensitive items the skill uses are VOSCRIPT_URL and VOSCRIPT_API_KEY, which are exactly the credentials needed to talk to a self‑hosted VoScript server. The scripts also inspect LANG/LC_* for UI language selection, which is non‑sensitive. There are no requests for unrelated cloud credentials or system tokens.
Persistence & Privilege
The skill is not marked always:true and does not attempt to modify other skills or system-wide agent settings. It performs only its own client operations against the configured server. Autonomous invocation is allowed by default (normal); this combined with the skill's limited scope does not raise additional concerns.
Assessment
This skill is coherent with its description, but before installing consider: (1) You must provide a trusted VOSCRIPT_URL and VOSCRIPT_API_KEY — do not give keys for production services you don't control. (2) The skill will send audio (and can enroll/delete voiceprints) to the configured server — voiceprints are biometric data; beware privacy implications. (3) Ensure a Python runtime and the 'requests' package are available (the package has no install spec). (4) The registry metadata omits the environment variables the scripts expect — SKILL.md is authoritative. If you proceed, point the skill at a VoScript instance you control or trust, and rotate the API key if you suspect misuse.


Tags: audio, latest, self-hosted, speech-to-text, transcription, voiceprint
41 downloads · 0 stars · 1 version · Updated 21h ago · License: MIT-0

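VoScript API Skill Pack

VoScript is a self-hosted speech transcription service that supports multi-speaker diarization, voiceprint recognition, denoising, and multi-format export. This skill pack wraps all the main workflows of its REST API.

Important: this skill is agent-agnostic. It applies equally to Claude, Codex, Trae, Hermes, OpenClaw, and any other AI agent, and does not depend on any vendor-specific feature.

1. Configuration

Access to VoScript is configured through two parameters:

  • VOSCRIPT_URL: the service address, e.g. http://localhost:7880
  • VOSCRIPT_API_KEY: the authentication key required to call the API

Setting them as environment variables is recommended; every script also accepts --url / --api-key command-line arguments, which override the environment.

When VOSCRIPT_URL / VOSCRIPT_API_KEY is not configured, the agent must:

  1. First ask the user for the service address and API key;
  2. Tell the user how to configure them:
    • Environment variables: export VOSCRIPT_URL=... / export VOSCRIPT_API_KEY=...
    • Or the scripts' --url <URL> / --api-key <KEY> arguments.

See ${SKILL_PATH}/references/configuration.md for details.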
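
The precedence described above (command-line value first, then environment variable, otherwise ask the user) can be sketched as follows. This is only an illustration, not the code in the packaged scripts: `resolve_config` and `MissingConfig` are hypothetical names, and stripping whitespace and a trailing slash is an assumption made here for convenience.

```python
import os
from typing import Mapping, Optional, Tuple


class MissingConfig(Exception):
    """Raised when the URL or API key cannot be resolved (hypothetical helper)."""


def resolve_config(
    cli_url: Optional[str] = None,
    cli_api_key: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, str]:
    """Return (url, api_key); command-line values win over environment variables."""
    # Assumption: surrounding whitespace and a trailing slash are safe to drop.
    url = (cli_url or env.get("VOSCRIPT_URL", "")).strip().rstrip("/")
    api_key = (cli_api_key or env.get("VOSCRIPT_API_KEY", "")).strip()
    missing = [
        name
        for name, value in (("VOSCRIPT_URL", url), ("VOSCRIPT_API_KEY", api_key))
        if not value
    ]
    if missing:
        # At this point the agent should ask the user for the missing values.
        raise MissingConfig("not configured: " + ", ".join(missing))
    return url, api_key
```

An agent would catch `MissingConfig`, prompt the user for the listed names, and then retry.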

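2. Submit Audio for Transcription

Upload an audio file and create a transcription job.

Endpoint: POST /api/transcribe (multipart/form-data)

Request parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| file | file | yes | | Audio file to transcribe |
| language | string | no | auto | Language code, e.g. zh / en |
| min_speakers | int | no | 1 | Minimum number of speakers |
| max_speakers | int | no | 10 | Maximum number of speakers |
| denoise_model | string | no | none | One of none / deepfilternet / noisereduce |
| snr_threshold | float | no | 10.0 | Signal-to-noise ratio threshold |
| no_repeat_ngram_size | int | no | 0 | Suppresses repeated n-grams during decoding |

curl -X POST "$VOSCRIPT_URL/api/transcribe" \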
  -H "X-API-Key: $VOSCRIPT_API_KEY" \
  -F "file=@/path/to/audio.wav" \
  -F "language=zh" \
  -F "min_speakers=1" \
  -F "max_speakers=10"

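Response fields:

| Field | Type | Description |
|---|---|---|
| id | string | Job / transcription ID (of the form tr_xxx); the primary key for all subsequent endpoints |
| status | string | Initial status, usually queued; completed when deduplication is hit |
| deduplicated | bool | Optional field; when present and true, SHA-256 deduplication was hit and the existing result is returned directly |

deduplicated: true is not an error; it is a normal response. VoScript computes a SHA-256 hash of the audio content and, if the same file has already been processed, returns the existing result directly. In that case status is "completed", so there is no need to poll: fetch the result with the returned id.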
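
A minimal sketch of how a client might branch on the submit response, using only the three fields documented above. The function name `next_step` and its return convention are hypothetical.

```python
from typing import Any, Mapping, Tuple


def next_step(submit_response: Mapping[str, Any]) -> Tuple[str, str]:
    """Decide what to do after POST /api/transcribe.

    Returns ("fetch", id) when a result already exists, else ("poll", id).
    """
    job_id = submit_response["id"]
    # deduplicated is optional, so a missing key simply means "not deduplicated".
    deduplicated = submit_response.get("deduplicated") is True
    if deduplicated or submit_response.get("status") == "completed":
        return ("fetch", job_id)
    return ("poll", job_id)
```

Treating a deduplicated response as a shortcut rather than an error keeps the caller from polling a job that has already finished.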

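Error responses:

| HTTP | Meaning | Troubleshooting |
|---|---|---|
| 401 | Invalid API key | Check that VOSCRIPT_API_KEY is correct and has no stray whitespace |
| 413 | File too large | Exceeds the server-side MAX_UPLOAD_BYTES limit (default 2 GiB) |
| 422 | Parameter validation failed | Check that the min_speakers / max_speakers / denoise_model values are valid |
| 500 | Server error | Inspect the container logs: docker logs voscript |

Run the script: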

python ${SKILL_PATH}/scripts/submit_audio.py \
  --file <PATH> \
  [--language zh] \
  [--min-speakers 1] \
  [--max-speakers 10]

3. Poll Job Status

Endpoint: GET /api/jobs/{job_id}

curl -X GET "$VOSCRIPT_URL/api/jobs/tr_xxx" \
  -H "X-API-Key: $VOSCRIPT_API_KEY"

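State machine: queued → converting → denoising → transcribing → identifying → completed | failed

Status meanings and typical durations:

| Status | Meaning | Typical duration |
|---|---|---|
| queued | Waiting for GPU resources | Instant to a few seconds |
| converting | ffmpeg format conversion | A few seconds |
| denoising | DeepFilterNet denoising | 10-30 seconds (optional step) |
| transcribing | Whisper + pyannote transcription | 20-50% of the audio duration |
| identifying | Voiceprint matching | A few seconds |
| completed | Done | |
| failed | Failed | Check the error field |

⚠️ A polling interval of 5 seconds is recommended. The first run needs 2-5 minutes to load the models (first run only). A polling timeout does not mean failure: keep waiting, or check /healthz.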
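
The polling advice can be sketched as a small loop. This is an illustration under stated assumptions, not the logic of poll_job.py: `fetch_job` is a hypothetical callable that performs GET /api/jobs/{job_id} and returns the parsed JSON, the job response is assumed to carry a `status` field, and the 600-second default timeout is an arbitrary choice meant to outlast a cold start.

```python
import time
from typing import Any, Callable, Mapping

TERMINAL = {"completed", "failed"}


def poll_until_done(
    fetch_job: Callable[[], Mapping[str, Any]],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Mapping[str, Any]:
    """Call fetch_job() every interval_s seconds until a terminal status.

    Raises TimeoutError when timeout_s elapses. A timeout only means the job
    is still running, so the caller may simply call this function again.
    """
    deadline = clock() + timeout_s
    while True:
        job = fetch_job()
        if job.get("status") in TERMINAL:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job still in status {job.get('status')!r}")
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting.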

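Common errors:

| HTTP | Meaning | Troubleshooting |
|---|---|---|
| 401 | Invalid API key | Check VOSCRIPT_API_KEY |
| 404 | job_id does not exist | Confirm the ID spelling; the job may have been cleaned up |

Run the script: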

python ${SKILL_PATH}/scripts/poll_job.py --job-id tr_xxx

Detailed state machine and per-stage durations: ${SKILL_PATH}/references/job-lifecycle.md

4. Fetch the Transcription Result

Endpoint: GET /api/transcriptions/{tr_id}

curl -X GET "$VOSCRIPT_URL/api/transcriptions/tr_xxx" \
  -H "X-API-Key: $VOSCRIPT_API_KEY"

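The response contains the full result, including segments, speaker_map, and params.

Segment fields:

| Field | Type | Description |
|---|---|---|
| id | int | Segment index |
| start / end | float | Start and end times (seconds) |
| text | string | Transcribed text |
| speaker_label | string | Raw pyannote label (e.g. SPEAKER_00); use this value when enrolling a voiceprint |
| speaker_id | string or null | ID of the bound voiceprint; null means not enrolled |
| speaker_name | string | Display name (the person's name if enrolled, otherwise the same as speaker_label) |
| similarity | float or int | AS-norm z-score, not a probability; typical range -1 to 2, match threshold ~0.5 |
| words | array or null | Word-level alignment (present when forced alignment succeeds) |

similarity is an AS-norm normalized z-score, not a probability in [0, 1]. Values can exceed 1.0 (the highest observed value is about 1.79). Using similarity > 0.5 as the match test is a reasonable rule of thumb, but it must not be read as "50% confidence".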
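
As an illustration of reading these fields, the sketch below applies the empirical 0.5 threshold and collects the raw labels that still lack a voiceprint. The helper names are hypothetical, and requiring both a non-null speaker_id and a score above the threshold is an assumption about how a client might define a "match".

```python
from typing import Any, Dict, List, Mapping

MATCH_THRESHOLD = 0.5  # empirical z-score threshold from the text, not a probability


def is_match(segment: Mapping[str, Any], threshold: float = MATCH_THRESHOLD) -> bool:
    """True when the segment is bound to a voiceprint and its z-score clears the threshold."""
    score = segment.get("similarity")
    return segment.get("speaker_id") is not None and score is not None and score > threshold


def unenrolled_labels(segments: List[Mapping[str, Any]]) -> List[str]:
    """Raw speaker_label values with no bound voiceprint, in first-seen order."""
    seen: Dict[str, None] = {}
    for seg in segments:
        if seg.get("speaker_id") is None:
            seen.setdefault(seg["speaker_label"], None)
    return list(seen)
```

The labels returned by `unenrolled_labels` are the values that would be passed as speaker_label when enrolling.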

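Common errors:

| HTTP | Meaning | Troubleshooting |
|---|---|---|
| 404 | tr_id does not exist | Verify the ID; confirm the job has completed |
| 409 | Job not finished yet | Poll /api/jobs/{id} until completed first |

Run the script: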

python ${SKILL_PATH}/scripts/fetch_result.py --tr-id tr_xxx

5. Export the Transcription

Endpoint: GET /api/export/{tr_id}?format=srt|txt|json

curl -X GET "$VOSCRIPT_URL/api/export/tr_xxx?format=srt" \
  -H "X-API-Key: $VOSCRIPT_API_KEY" \
  -o transcript.srt

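Supported formats:

| format | Purpose | MIME |
|---|---|---|
| srt | Standard subtitle file with timestamps | application/x-subrip |
| txt | Plain text with speaker prefixes | text/plain |
| json | Full structured data (segments + speaker_map) | application/json |

Common errors:

| HTTP | Meaning | Troubleshooting |
|---|---|---|
| 404 | tr_id does not exist | Verify the ID |
| 422 | Invalid format parameter | Must be one of srt / txt / json |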
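
A client can reject an unsupported format before the server answers with 422. The sketch below builds the export URL from the documented path and query parameter; `export_url` is a hypothetical helper, not part of export_transcript.py.

```python
from urllib.parse import quote, urlencode

# format -> MIME type, as listed in the table above
EXPORT_FORMATS = {
    "srt": "application/x-subrip",
    "txt": "text/plain",
    "json": "application/json",
}


def export_url(base_url: str, tr_id: str, fmt: str) -> str:
    """Build the export URL, rejecting formats the server would refuse."""
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"format must be one of {sorted(EXPORT_FORMATS)}, got {fmt!r}")
    path = "/api/export/" + quote(tr_id, safe="")
    query = urlencode({"format": fmt})
    return base_url.rstrip("/") + path + "?" + query
```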

Format details: ${SKILL_PATH}/references/export-formats.md

Run the script:

python ${SKILL_PATH}/scripts/export_transcript.py --tr-id tr_xxx --format srt

6. List Transcriptions

Endpoint: GET /api/transcriptions

curl -X GET "$VOSCRIPT_URL/api/transcriptions" \
  -H "X-API-Key: $VOSCRIPT_API_KEY"

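Response fields:

| Field | Type | Description |
|---|---|---|
| id | string | Transcription ID |
| filename | string | Original file name |
| created_at | string | Creation time (ISO 8601) |
| segment_count | int | Number of segments |
| speaker_count | int | Number of speakers |

Run the script: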

python ${SKILL_PATH}/scripts/list_transcriptions.py

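7. Enroll a Voiceprint

Take the segments belonging to a given speaker_label in an existing transcription as samples, and enroll or update a voiceprint.

Endpoint: POST /api/voiceprints/enroll

Request parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| tr_id | string | yes | Source transcription ID |
| speaker_label | string | yes | Raw pyannote label, e.g. SPEAKER_00 (not the display name!) |
| speaker_name | string | yes | Speaker's name (for display) |
| speaker_id | string | no | Pass an existing voiceprint ID to update that voiceprint |

curl -X POST "$VOSCRIPT_URL/api/voiceprints/enroll" \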
  -H "X-API-Key: $VOSCRIPT_API_KEY" \
  -F "tr_id=tr_xxx" \
  -F "speaker_label=SPEAKER_00" \
  -F "speaker_name=张三"

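Response fields:

| Field | Type | Description |
|---|---|---|
| action | string | created (new voiceprint) or updated (existing voiceprint updated) |
| speaker_id | string | Voiceprint ID; can be used for binding later |

Most common mistake: passing the display name as speaker_label instead of the raw label.

  • ✗ Wrong: --speaker-label "张三"
  • ✓ Right: --speaker-label "SPEAKER_00"

speaker_label must be the raw pyannote label (SPEAKER_00, SPEAKER_01, etc.), taken from the segment.speaker_label field of the transcription result.

After a successful enrollment, the same speaker is automatically matched to speaker_name in subsequent transcriptions.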
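
Since a display name in speaker_label is the most common mistake, a client-side check before calling the enroll endpoint may help. The sketch below is hypothetical: `check_speaker_label` is not part of the packaged scripts, and the `SPEAKER_<digits>` pattern is an assumption inferred from the SPEAKER_00 / SPEAKER_01 examples.

```python
import re
from typing import Any, Iterable, Mapping

# Assumed label shape, inferred from the SPEAKER_00 / SPEAKER_01 examples.
RAW_LABEL = re.compile(r"SPEAKER_\d+")


def check_speaker_label(label: str, segments: Iterable[Mapping[str, Any]]) -> None:
    """Raise ValueError unless label is a raw label present in the transcription."""
    if not RAW_LABEL.fullmatch(label):
        raise ValueError(f"{label!r} looks like a display name; expected e.g. 'SPEAKER_00'")
    known = {seg["speaker_label"] for seg in segments}
    if label not in known:
        raise ValueError(f"{label!r} not in this transcription; known labels: {sorted(known)}")
```

The second check mirrors the server's 404 "Embedding not found for this speaker label" response, but fails earlier and lists the labels that do exist.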

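Error responses:

| HTTP | Meaning | Troubleshooting |
|---|---|---|
| 404 | Embedding not found for this speaker label | speaker_label does not exist in this transcription. Check the letter case and confirm that the SPEAKER_XX format is used |
| 422 | Missing parameter | Confirm that tr_id, speaker_label, and speaker_name are all provided |
| 401 | Invalid API key | Check VOSCRIPT_API_KEY |

Run the script: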

python ${SKILL_PATH}/scripts/enroll_voiceprint.py \
  --tr-id tr_xxx \
  --speaker-label SPEAKER_00 \
  --speaker-name "张三"

8. List Voiceprints

Endpoint: GET /api/voiceprints

curl -X GET "$VOSCRIPT_URL/api/voiceprints" \
  -H "X-API-Key: $VOSCRIPT_API_KEY"

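Response fields:

| Field | Type | Description |
|---|---|---|
| id | string | Voiceprint ID |
| name | string | Display name |
| sample_count | int | Number of samples accumulated |
| sample_spread | float or null | Standard deviation of the cosine similarity between samples; null with a single sample; smaller values mean more consistent samples |
| created_at | string | Creation time (ISO 8601) |
| updated_at | string | Last update time (ISO 8601) |

⚠️ A large sample_spread (e.g. > 0.3) means the samples differ widely and wrong segments may have been mixed in. Inspect the details with manage_voiceprint.py --action get and consider cleaning up.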
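
A short sketch of how the list response could be screened for voiceprints worth reviewing, using the 0.3 example threshold from the note above. `suspicious_voiceprints` is a hypothetical helper name.

```python
from typing import Any, Iterable, List, Mapping


def suspicious_voiceprints(
    voiceprints: Iterable[Mapping[str, Any]],
    max_spread: float = 0.3,
) -> List[str]:
    """Return IDs of voiceprints whose sample_spread exceeds max_spread.

    Single-sample voiceprints report sample_spread as null and are skipped.
    """
    flagged = []
    for vp in voiceprints:
        spread = vp.get("sample_spread")
        if spread is not None and spread > max_spread:
            flagged.append(vp["id"])
    return flagged
```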

Run the script:

python ${SKILL_PATH}/scripts/list_voiceprints.py

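9. Assign a Speaker

Manually set the speaker of a segment (to correct diarization errors or fill in unrecognized segments).

Endpoint: PUT /api/transcriptions/{tr_id}/segments/{seg_id}/speaker

Request parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| speaker_name | string | yes | New speaker display name |
| speaker_id | string | no | Voiceprint ID, when binding to an enrolled voiceprint |

curl -X PUT "$VOSCRIPT_URL/api/transcriptions/tr_xxx/segments/5/speaker" \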
  -H "X-API-Key: $VOSCRIPT_API_KEY" \
  -F "speaker_name=李四"

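💡 Use this when a segment's speaker was identified incorrectly, or when you want to override the automatic result by hand. Manual assignment does not affect the voiceprint library; it only changes the display name of that segment.

Common errors:

| HTTP | Meaning | Troubleshooting |
|---|---|---|
| 404 | tr_id or seg_id does not exist | Verify the IDs; seg_id is the segment's index within the transcription |
| 422 | Missing parameter | Provide at least speaker_name |

Run the script: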

python ${SKILL_PATH}/scripts/assign_speaker.py \
  --tr-id tr_xxx \
  --seg-id 5 \
  --speaker-name "李四"

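10. Manage Voiceprints

| Operation | Endpoint | Parameters |
|---|---|---|
| Get details | GET /api/voiceprints/{speaker_id} | |
| Rename | PUT /api/voiceprints/{speaker_id}/name | Form field name |
| Delete | DELETE /api/voiceprints/{speaker_id} | |

curl -X GET "$VOSCRIPT_URL/api/voiceprints/<SPEAKER_ID>" \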
  -H "X-API-Key: $VOSCRIPT_API_KEY"

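Common errors:

| HTTP | Meaning | Troubleshooting |
|---|---|---|
| 404 | speaker_id does not exist | Confirm the ID via /api/voiceprints |
| 422 | name field missing when renaming | Provide the form field name |

Run the script: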

python ${SKILL_PATH}/scripts/manage_voiceprint.py \
  --action [get|rename|delete] \
  --speaker-id xxx \
  [--name "New Name"]

11. Rebuild the Voiceprint Cohort

AS-norm scoring relies on a cohort (a set of comparison samples).

Endpoint: POST /api/voiceprints/rebuild-cohort

curl -X POST "$VOSCRIPT_URL/api/voiceprints/rebuild-cohort" \
  -H "X-API-Key: $VOSCRIPT_API_KEY"

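Response fields:

| Field | Type | Description |
|---|---|---|
| cohort_size | int | Number of samples in the cohort after the rebuild |
| skipped | int | Number of samples skipped (below the quality bar, or duplicates) |
| saved_to | string | Path where the cohort file was saved |

💡 When to rebuild:

  • Once, after the first 10 speakers have been enrolled
  • Again after every 10-20 newly added speakers
  • AS-norm scores are most stable with a cohort size ≥ 50

The service does not need to be stopped before a rebuild; the operation runs as a non-blocking background task.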
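
The rebuild schedule above can be expressed as a simple predicate. This is only one reading of the guidance: `should_rebuild_cohort` is a hypothetical helper, and the default step of 10 is the lower bound of the suggested 10-20 range.

```python
from typing import Optional


def should_rebuild_cohort(
    enrolled: int,
    enrolled_at_last_rebuild: Optional[int] = None,
    first_at: int = 10,
    step: int = 10,
) -> bool:
    """Return True when the cohort is due for a rebuild.

    enrolled_at_last_rebuild is None when the cohort has never been built.
    """
    if enrolled_at_last_rebuild is None:
        # First rebuild: wait until first_at speakers have been enrolled.
        return enrolled >= first_at
    # Later rebuilds: wait until `step` new speakers have been added.
    return enrolled - enrolled_at_last_rebuild >= step
```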

Run the script:

python ${SKILL_PATH}/scripts/rebuild_cohort.py

The complete voiceprint workflow and threshold notes are in ${SKILL_PATH}/references/voiceprint-guide.md

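Error Response Conventions

VoScript returns standard HTTP status codes. When handling a response, the agent should branch according to the table below:

| Status code | Meaning | Suggested handling |
|---|---|---|
| 200 | Success | Parse the response normally |
| 401 | Invalid API key | Ask the user to check VOSCRIPT_API_KEY |
| 404 | Resource not found | Verify tr_id / speaker_id / job_id |
| 409 | Resource state conflict | For example, the result was requested before the job completed |
| 413 | File too large | Check the server-side MAX_UPLOAD_BYTES (default 2 GiB) |
| 422 | Request validation failed | Check the parameters against the returned detail field; commonly a missing file |
| 500 | Server error | Collect the error field and check the server logs if needed |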
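
One way an agent might encode this branching is a lookup from status code to a named action. The `Action` enum and `classify` function are hypothetical, and treating unlisted codes like a 500 is an assumption made here, not documented behavior.

```python
from enum import Enum


class Action(Enum):
    PARSE = "parse the response"
    CHECK_API_KEY = "ask the user to check VOSCRIPT_API_KEY"
    CHECK_IDS = "verify tr_id / speaker_id / job_id"
    WAIT_FOR_JOB = "poll until the job is completed, then retry"
    CHECK_UPLOAD_SIZE = "compare the file size with MAX_UPLOAD_BYTES"
    FIX_PARAMS = "inspect the detail field and fix the parameters"
    CHECK_SERVER = "collect the error field and check the server logs"


_BY_STATUS = {
    200: Action.PARSE,
    401: Action.CHECK_API_KEY,
    404: Action.CHECK_IDS,
    409: Action.WAIT_FOR_JOB,
    413: Action.CHECK_UPLOAD_SIZE,
    422: Action.FIX_PARAMS,
    500: Action.CHECK_SERVER,
}


def classify(status_code: int) -> Action:
    """Map a status code to a handling branch; unlisted codes fall back to CHECK_SERVER."""
    return _BY_STATUS.get(status_code, Action.CHECK_SERVER)
```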

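Diagnostic Checklist

When something goes wrong, check in this order:

  1. Service reachability: does curl $VOSCRIPT_URL/healthz return 200?
  2. Authentication: does X-API-Key match the container's VOSCRIPT_API_KEY environment variable, with no stray whitespace?
  3. Job status: confirm completed via /api/jobs/{id} before fetching the result
  4. Voiceprint label: enroll voiceprints with the raw SPEAKER_XX label, not the display name
  5. similarity semantics: it is an AS-norm z-score, not a probability; the ~0.5 threshold is empirical
  6. Dedup response: deduplicated: true is a normal response, not an error
  7. First cold start: model loading takes 2-5 minutes; a polling timeout is not a failure
  8. Container logs: run docker logs voscript to see the full stack trace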

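Typical Usage Sequence

  1. Configure VOSCRIPT_URL / VOSCRIPT_API_KEY.
  2. Upload the audio with submit_audio.py and get the tr_id.
  3. Poll with poll_job.py until completed.
  4. Fetch the segments with fetch_result.py and review the diarization result.
  5. Call enroll_voiceprint.py for each SPEAKER_xx to enroll the real name.
  6. After accumulating 10+ voiceprints, run rebuild_cohort.py to refresh the AS-norm baseline.
  7. Later transcriptions will automatically recognize enrolled speakers.
  8. When subtitle files are needed, export SRT/TXT/JSON with export_transcript.py.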
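
Steps 2 to 4 of the sequence can be sketched as one function. This is an outline under assumptions rather than the packaged scripts: `call` is a hypothetical transport that sends an authenticated request and returns parsed JSON, the job response is assumed to expose `status` and, on failure, `error`, and `max_polls` is an arbitrary safety limit.

```python
import time
from typing import Any, Callable, Mapping

# Hypothetical transport signature: call(method, path, **kwargs) -> parsed JSON
Call = Callable[..., Mapping[str, Any]]


def transcribe(
    call: Call,
    audio_path: str,
    interval_s: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> Mapping[str, Any]:
    """Submit audio, poll until a terminal status, then fetch the result."""
    job = call("POST", "/api/transcribe", file_path=audio_path)
    tr_id = job["id"]
    status = job.get("status")  # already "completed" on a deduplicated upload
    polls = 0
    while status not in ("completed", "failed"):
        if polls >= max_polls:
            raise TimeoutError(f"{tr_id} still {status!r}; keep waiting or check /healthz")
        sleep(interval_s)
        job = call("GET", f"/api/jobs/{tr_id}")
        status = job.get("status")
        polls += 1
    if status == "failed":
        raise RuntimeError(f"{tr_id} failed: {job.get('error')}")
    return call("GET", f"/api/transcriptions/{tr_id}")
```

Enrollment, cohort rebuilds, and export are left out because they depend on a human reviewing the diarization result first.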

Reference Documents

  • ${SKILL_PATH}/references/configuration.md: configuration and authentication
  • ${SKILL_PATH}/references/job-lifecycle.md: job state machine
  • ${SKILL_PATH}/references/voiceprint-guide.md: voiceprints and AS-norm
  • ${SKILL_PATH}/references/export-formats.md: export formats
