{"skill":{"slug":"autoglmasr","displayName":"autoglmasr","summary":"AutoGLM ASR MCP server: concurrent transcription of long audio, context passing, and timestamped segments. Built on Zhipu GLM-ASR-2512. Trigger words: 语音识别, ASR, 转录, 转录音频, 长音频","description":"---\nname: autoglm-asr-mcp\ndescription: \"AutoGLM ASR MCP server: concurrent transcription of long audio, context passing, and timestamped segments. Built on Zhipu GLM-ASR-2512. Trigger words: 语音识别, ASR, 转录, 转录音频, 长音频\"\n---\n\n# AutoGLM ASR MCP Server\n\nGitHub: https://github.com/Starrylyn/autoglm-asr-mcp\n\nA speech-to-text MCP server for agents. Core features:\n- Automatic chunking of long audio\n- Concurrent API calls (configurable concurrency)\n- Context-passing modes\n- Timestamped segment output\n\n## Installation\n\n```bash\n# Prerequisite: ffmpeg\nbrew install ffmpeg  # macOS\n\n# Run the MCP server\nnpx autoglm-asr-mcp\n```\n\n## MCP configuration\n\n```json\n{\n  \"mcpServers\": {\n    \"autoglm-asr\": {\n      \"command\": \"npx\",\n      \"args\": [\"-y\", \"autoglm-asr-mcp\"],\n      \"env\": {\n        \"AUTOGLM_ASR_API_KEY\": \"your-api-key\"\n      }\n    }\n  }\n}\n```\n\n## Core tools\n\n### transcribe_audio\n\n| Parameter | Type | Required | Default | Description |\n|------|------|------|--------|------|\n| `audio_path` | string | ✅ | - | Absolute path to the audio file |\n| `context_mode` | string | ❌ | `sliding` | Context mode |\n| `max_concurrency` | int | ❌ | 5 | Concurrency (1-20) |\n\nReturns:\n- The full transcript text\n- A list of timestamped segments\n- Run statistics (chunk count, elapsed time, mode)\n\n### get_audio_info\n\nReturns information about an audio file (duration, format, estimated chunk count).\n\n---\n\n## Core implementation walkthrough\n\n### 1. Concurrent calls\n\n```python\n# Use a Semaphore to cap the number of concurrent requests\nsemaphore = asyncio.Semaphore(concurrency)\n\nasync def transcribe_with_semaphore(chunk: AudioChunk) -> None:\n    async with semaphore:\n        result = await self._transcribe_chunk(chunk, audio_format=audio_format)\n        text_results[chunk.index] = result[\"text\"]\n        # ...\n\n# Run all chunks concurrently\ntasks = [transcribe_with_semaphore(chunk) for chunk in non_silent_chunks]\nawait asyncio.gather(*tasks)\n```\n\n**Key points:**\n- A `Semaphore` caps the maximum concurrency\n- `asyncio.gather()` runs all tasks concurrently\n- Results go into a dict, `text_results: dict[int, str]`, keyed by chunk index so they can be put back in order\n\n### 2. 
Context modes\n\n| Mode | Speed | Quality | Description |\n|------|------|------|------|\n| `sliding` | Fast | High | The first chunk initializes the context; the rest run in parallel |\n| `none` | Fastest | Medium | Chunks run independently in parallel with no context passing |\n| `full_serial` | Slow | Best | Runs sequentially with a full context chain |\n\n**Note:** The newer `/audio/transcriptions` API does not need context passing; all chunks run in parallel by default.\n\n### 3. Automatic chunking\n\n```python\nchunks = split_audio_on_silence(\n    audio,\n    max_chunk_duration_ms=self.config.max_chunk_duration * 1000,  # default: 25 s\n)\n```\n\n- Splits the audio at silence points\n- Each chunk is at most 25 seconds (configurable)\n- Silent chunks are skipped automatically\n\n### 4. Silence detection (VAD)\n\n```python\nnon_silent_chunks = [c for c in chunks if not c.is_silent]\nskipped_silent = len(chunks) - len(non_silent_chunks)\n```\n\n- Uses VAD to detect silent segments\n- Silent chunks are never sent to the API, which saves cost\n\n### 5. Merging results\n\n```python\n# Merge text in chunk order\nfull_text = \"\".join(text_results.get(chunk.index, \"\") for chunk in chunks)\n\n# Merge timestamped segments (with offset adjustment)\nfor seg in result[\"segments\"]:\n    offset_segments.append(TranscriptionSegment(\n        start=seg.start + chunk.start_ms / 1000.0,  # add the chunk's start offset\n        end=seg.end + chunk.start_ms / 1000.0,\n        text=seg.text,\n    ))\n```\n\n---\n\n## Environment variables\n\n| Variable | Default | Description |\n|------|--------|------|\n| `AUTOGLM_ASR_API_KEY` | (required) | Zhipu API key |\n| `AUTOGLM_ASR_API_BASE` | `https://open.bigmodel.cn/api/paas/v4/audio/transcriptions` | API endpoint |\n| `AUTOGLM_ASR_MODEL` | `glm-asr-2512` | ASR model |\n| `AUTOGLM_ASR_MAX_CHUNK_DURATION` | 25 | Maximum chunk duration (seconds) |\n| `AUTOGLM_ASR_MAX_CONCURRENCY` | 5 | Default concurrency |\n| `AUTOGLM_ASR_CONTEXT_MAX_CHARS` | 2000 | Maximum context length (characters) |\n| `AUTOGLM_ASR_REQUEST_TIMEOUT` | 60 | Request timeout (seconds) |\n| `AUTOGLM_ASR_MAX_RETRIES` | 2 | Number of retries |\n\n---\n\n## Supported audio formats\n\n`mp3`, `wav`, `m4a`, `flac`, `ogg`, `webm`\n\n---\n\n## Calling the API directly (without MCP)\n\n```bash\n# Short audio\ncurl --request POST \\\n  --url https://open.bigmodel.cn/api/paas/v4/audio/transcriptions \\\n  --header 'Authorization: Bearer YOUR_API_KEY' \\\n  --form model=glm-asr-2512 \\\n  --form stream=false \\\n  --form file=@audio.wav\n\n# Long audio: you must implement chunking, concurrency, and result merging yourself\n```\n\n---\n\n## Best practices\n\n1. **Short audio (<30s)**: call the API directly\n2. **Long audio**: use the MCP server for automatic chunking and concurrency\n3. 
**High-quality needs**: use `full_serial` mode\n4. **Fast processing**: use `none` mode with high concurrency (10-20)\n5. **Balanced choice**: `sliding` mode with concurrency 5 (the default)\n\n---\n\n## Common errors\n\n| Error | Cause | Fix |\n|------|------|------|\n| `ffmpeg not found` | ffmpeg is not installed | `brew install ffmpeg` |\n| `File not found` | Wrong path | Use an absolute path |\n| `AUTOGLM_ASR_API_KEY environment variable is required` | API key not set | Set it in the MCP configuration |\n| `transcriptions文件只支持单声道` | Audio is stereo (the API accepts only mono files) | Converted to mono automatically |\n\n---\n\n## Key code snippet (reference implementation)\n\n### Python async concurrency example\n\n```python\nimport asyncio\nimport httpx\n\nAPI_URL = \"https://open.bigmodel.cn/api/paas/v4/audio/transcriptions\"\n\nasync def transcribe_chunk(client, chunk_data, api_key):\n    \"\"\"Transcribe a single audio chunk and return its text.\"\"\"\n    headers = {\"Authorization\": f\"Bearer {api_key}\"}\n    files = {\"file\": (\"audio.wav\", chunk_data, \"audio/wav\")}\n    data = {\"model\": \"glm-asr-2512\"}\n\n    response = await client.post(API_URL, headers=headers, files=files, data=data)\n    response.raise_for_status()  # fail loudly on HTTP errors instead of returning an empty string\n    return response.json().get(\"text\", \"\")\n\nasync def transcribe_parallel(chunks, api_key, max_concurrency=5):\n    \"\"\"Transcribe multiple audio chunks concurrently.\"\"\"\n    semaphore = asyncio.Semaphore(max_concurrency)\n    results = {}\n\n    # async with closes the client even if a request raises\n    async with httpx.AsyncClient(timeout=60) as client:\n\n        async def limited_transcribe(chunk, index):\n            async with semaphore:  # at most max_concurrency requests in flight\n                results[index] = await transcribe_chunk(client, chunk, api_key)\n\n        tasks = [limited_transcribe(chunk, i) for i, chunk in enumerate(chunks)]\n        await asyncio.gather(*tasks)\n\n    # Merge in chunk order\n    return \"\".join(results.get(i, \"\") for i in range(len(chunks)))\n```\n\n---\n\n## Further reading\n\n- [Zhipu ASR API documentation](https://docs.bigmodel.cn/cn/guide/models/sound-and-video/glm-asr-2512)\n- [MCP 
protocol specification](https://modelcontextprotocol.io)\n","topics":["上下文","音频"],"tags":{"latest":"0.0.1"},"stats":{"comments":0,"downloads":777,"installsAllTime":29,"installsCurrent":0,"stars":0,"versions":1},"createdAt":1772508954196,"updatedAt":1778994682571},"latestVersion":{"version":"0.0.1","createdAt":1772508954196,"changelog":"AutoGLM ASR MCP: High-concurrency, context-aware, long audio transcription server based on GLM-ASR-2512.\n\n- Supports automatic chunking of long audio files and concurrent transcription.\n- Offers selectable context modes: sliding (default), none, or full serial for quality/speed tradeoffs.\n- Returns full transcript, timestamped segments, and detailed statistics.\n- Skips silent chunks with VAD to save on API calls and costs.\n- Configurable via environment variables and designed for Agent/MCP integration.","license":null},"metadata":null,"owner":{"handle":"isabellazhangym","userId":"s17dr2pyre7j44j5wm8xpd8gch885wa2","displayName":"IsabellaZhangYM","image":"https://avatars.githubusercontent.com/u/170412788?v=4"},"moderation":null}