Install
openclaw skills install voice-memo-sync

Sync, transcribe, and intelligently organize voice memos, audio/video files, and URLs. / 同步、转录、智能整理语音备忘录、音视频文件和视频链接。
Intelligent voice/video transcription and organization system. / 智能语音/视频转录与整理系统。
# Run installation script / 运行安装脚本
cd ~/.openclaw/workspace/skills/voice-memo-sync
./scripts/install.sh
What it does / 安装内容:
memory/voice-memos/ (data directory) / 创建数据目录
config/voice-memo-sync.yaml (configuration file) / 创建配置文件

✅ USE this skill when user:
❌ DO NOT use when:
On Apple Silicon, whisper-cpp provides 15-20x faster transcription:
| Audio | CPU (openai-whisper) | Metal GPU (whisper-cpp) |
|---|---|---|
| 5 min | ~5 min | ~20 sec |
| 30 min | ~30 min | ~2 min |
| 60 min | ~60 min | ~4 min |
# Install for Metal acceleration (recommended)
brew install whisper-cpp
The skill auto-detects and uses Metal when available.
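Exactly how the detection happens is handled by the skill's own scripts; the sketch below is only a minimal illustration, and it assumes the whisper-cpp formula exposes a whisper-cli binary on PATH while openai-whisper exposes whisper (both names are assumptions, adjust to your install).

# Illustrative backend selection; the skill's scripts are authoritative.
# Assumes whisper-cpp installs `whisper-cli` and openai-whisper installs `whisper`.
import platform
import shutil

def pick_whisper_backend() -> str:
    """Prefer Metal-accelerated whisper-cpp on Apple Silicon, else CPU Whisper."""
    apple_silicon = platform.system() == "Darwin" and platform.machine() == "arm64"
    if apple_silicon and shutil.which("whisper-cli"):
        return "whisper-cpp"        # Metal GPU path
    if shutil.which("whisper"):
        return "openai-whisper"     # CPU fallback
    return "none"                   # neither installed; see Troubleshooting

if __name__ == "__main__":
    print(pick_whisper_backend())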
| Type / 类型 | Formats / 格式 | Processing / 处理方式 |
|---|---|---|
| Voice Memos | .qta, .m4a | Apple native (QTA metadata) → Whisper fallback |
| Audio | .mp3, .wav, .aac, .flac | Whisper local transcription |
| Video | .mp4, .mov, .mkv, .webm | ffmpeg extract → Whisper |
| YouTube | URL | summarize CLI → yt-dlp fallback |
| Bilibili | URL | yt-dlp download → Whisper |
| Text | .txt, .md | Direct read, skip transcription |
| Documents | .doc, .docx | textutil convert → process |
| Structured | .json, .csv | Parse and extract text |
| iCloud | Configured paths | Scheduled sync |
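The routing in this table can be pictured as a small dispatcher. The sketch below is illustrative only; the returned strings mirror the table, and nothing here is the skill's actual API.

# Illustrative routing sketch for the table above; the real work is done by
# the skill's scripts (process.sh and friends).
from pathlib import Path

AUDIO = {".mp3", ".wav", ".aac", ".flac"}
VIDEO = {".mp4", ".mov", ".mkv", ".webm"}
TEXT  = {".txt", ".md"}
DOCS  = {".doc", ".docx"}
DATA  = {".json", ".csv"}

def classify(source: str) -> str:
    """Map an input (URL or file path) to the processing strategy in the table."""
    if source.startswith(("http://", "https://")):
        if "youtube.com" in source or "youtu.be" in source:
            return "summarize CLI, yt-dlp fallback"
        if "bilibili.com" in source:
            return "yt-dlp download, then Whisper"
        return "unsupported URL"
    suffix = Path(source).suffix.lower()
    if suffix in {".qta", ".m4a"}:
        return "Apple native transcript, Whisper fallback"
    if suffix in AUDIO:
        return "Whisper local transcription"
    if suffix in VIDEO:
        return "ffmpeg audio extraction, then Whisper"
    if suffix in TEXT:
        return "direct read, skip transcription"
    if suffix in DOCS:
        return "textutil convert, then process"
    if suffix in DATA:
        return "parse and extract text"
    return "unsupported"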
Input (File/URL/Text)
│
▼
┌─────────────────────────────────────┐
│ 1. Source Detection │
│ 来源识别 │
│ Voice Memo / URL / File / Text │
└─────────────────┬───────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 2. Save Source Metadata │
│ 保存源信息 │
│ → memory/voice-memos/sources/ │
└─────────────────┬───────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 3. Transcription │
│ 转录提取 │
│ Priority: Apple > Text > summarize│
│ > Whisper-local > API │
└─────────────────┬───────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 4. Save Raw Transcript │
│ 保存原始转录 │
│ → memory/voice-memos/transcripts/ │
└─────────────────┬───────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 5. LLM Deep Processing │
│ LLM深度整理 │
│ • Read USER.md & MEMORY.md │
│ • Clean up spoken language │
│ • Extract key points & insights │
│ • Identify TODOs & connections │
└─────────────────┬───────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 6. Save Processed Result │
│ 保存处理结果 │
│ → memory/voice-memos/processed/ │
└─────────────────┬───────────────────┘
│
┌───────┴───────┐
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ 7a. Apple Notes │ │ 7b. Reminders │
│ Structured note │ │ Create TODOs │
│ with #hashtags │ │ 创建提醒 │
└────────┬────────┘ └────────┬───────┘
│ │
└─────────┬─────────┘
▼
┌─────────────────────────────────────┐
│ 8. Update Index │
│ 更新索引 │
│ → memory/voice-memos/INDEX.md │
└─────────────────────────────────────┘
memory/voice-memos/ # All data, searchable via memory_search
├── INDEX.md # Processing records index / 处理记录索引
├── sources/ # Original file metadata / 原始文件元数据
│ └── YYYY-MM-DD_xxx.json
├── transcripts/ # Raw transcripts / 原始转录文本
│ └── YYYY-MM-DD_source_title.md
├── processed/ # LLM processed content / LLM处理后内容
│ └── YYYY-MM-DD_source_title.md
└── synced/ # Sync records / 同步记录
└── YYYY-MM-DD_source_title.json
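Because everything is stored as plain Markdown and JSON under memory/voice-memos/, it can also be searched with ordinary file tools, independently of memory_search. A trivial sketch (path and behaviour assumed, not part of the skill):

# Minimal keyword search over processed notes, following the layout above.
from pathlib import Path

def find_notes(keyword: str, root: str = "memory/voice-memos/processed") -> list[Path]:
    hits = []
    for md in Path(root).expanduser().glob("*.md"):
        if keyword.lower() in md.read_text(encoding="utf-8").lower():
            hits.append(md)
    return hits

if __name__ == "__main__":
    for path in find_notes("whisper"):
        print(path)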
The skill reads USER.md, SOUL.md, and MEMORY.md to provide personalized analysis:
处理时会读取 USER.md、SOUL.md 和 MEMORY.md 提供个性化分析:
🎙️ [Auto-generated Title / 智能生成的标题]
📅 Date | ⏱️ Duration | 👤 Source
🏷️ #tag1 #tag2 #tag3
━━━━━━━━━━━━━━━━━━━━━━
📌 Summary / 核心摘要
[One paragraph summarizing the content]
🎯 Key Points / 关键要点
• Point 1
• Point 2
• Point 3
💡 Deep Analysis & Reflection (For User) / 深度分析与反思
[Personalized analysis connecting to user's:
- Current research directions (from MEMORY.md)
- Active projects and interests (from USER.md)
- Decision-making style and preferences
- Critical counter-arguments and blind spots]
📋 Action Items / 行动建议
☐ Research: [specific to user's academic work]
☐ Business: [relevant to startup/investment focus]
☐ Content: [ideas for courses/articles]
🔗 Related Connections / 相关联系
• Connection to [project/memory]
• Recommended reading/research
💬 Notable Quotes / 金句摘录
• "Quote 1"
• "Quote 2"
━━━━━━━━━━━━━━━━━━━━━━
📝 Original Transcript (Cleaned) / 原始转录(已整理)
[Full transcript text, cleaned up from spoken language / 完整转录,已整理口语表达]
Apple Voice Memos on iOS/macOS 14+ uses .qta (QuickTime Audio) files that embed native transcription directly in the file metadata.
QTA File
├── ftyp (file type marker: "qt ")
├── wide (extended marker)
├── mdat (audio data, typically 90%+ of file size)
└── moov (metadata container)
├── mvhd (movie header)
└── trak (one or more tracks)
├── tkhd (track header)
├── mdia (media data)
└── meta (metadata - TRANSCRIPTION HERE!)
├── hdlr (handler: "mdta")
├── keys (key list: "com.apple.VoiceMemos.tsrp")
└── ilst (data list)
└── data (JSON transcription payload)
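A parser reaches the transcription by walking this atom chain. The sketch below only lists the top-level atoms (each atom starts with an 8-byte header: a big-endian 32-bit size plus a 4-character type); it is meant to make the layout concrete, not to replace scripts/extract-apple-transcript.py, which performs the full moov → trak → mdia → meta → ilst → data walk.

# Sketch: list the top-level atoms of a .qta file (expect ftyp / wide / mdat / moov).
# Container atoms are walked the same way by recursing into their payload.
import struct
import sys

def top_level_atoms(path: str):
    with open(path, "rb") as f:
        while (header := f.read(8)) and len(header) == 8:
            size, kind = struct.unpack(">I4s", header)
            if size == 1:                      # 64-bit extended size follows the header
                size = struct.unpack(">Q", f.read(8))[0]
                body = size - 16
            elif size == 0:                    # atom extends to end of file
                yield kind.decode("latin-1"), None
                return
            else:
                body = size - 8
            yield kind.decode("latin-1"), size
            f.seek(body, 1)                    # skip over the atom body

if __name__ == "__main__":
    for kind, size in top_level_atoms(sys.argv[1]):
        print(kind, size)

Inside the data atom, the payload is JSON along these lines: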
{
"locale": {"identifier": "zh-Hans_GB", "current": 1},
"attributedString": {
"runs": ["字",0,"符",1,"转",2,"录",3,...],
"attributeTable": [
{"timeRange": [0.0, 0.5]},
{"timeRange": [0.5, 0.8]},
...
]
}
}
Key Points:
- The runs array alternates [text, index, text, index, ...]
- attributeTable provides a timeRange for each character
- The payload lives in the ilst/data atom
- Use scripts/extract-apple-transcript.py to extract it reliably

# Extract plain text
python3 scripts/extract-apple-transcript.py recording.qta
# Extract with metadata (JSON output)
python3 scripts/extract-apple-transcript.py recording.qta --json
# Extract with timestamps
python3 scripts/extract-apple-transcript.py recording.qta --json --with-timestamps
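Given a payload shaped like the JSON above, recovering plain text mostly means taking every other element of runs; pairing characters with timestamps assumes the per-character index points into attributeTable. A minimal sketch (the shipped script is the robust implementation and also handles partial data):

# Sketch: rebuild text and rough timing from a payload shaped like the example above.
import json

def transcript_text(payload: dict) -> str:
    runs = payload["attributedString"]["runs"]
    return "".join(runs[0::2])                 # even positions are text, odd are indexes

def transcript_with_timestamps(payload: dict):
    s = payload["attributedString"]
    runs, table = s["runs"], s["attributeTable"]
    for char, idx in zip(runs[0::2], runs[1::2]):
        start, end = table[idx]["timeRange"]   # assumed: the index refers to attributeTable
        yield char, start, end

if __name__ == "__main__":
    with open("payload.json", encoding="utf-8") as f:   # hypothetical dump of the data atom
        print(transcript_text(json.load(f)))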
| Issue | Cause | Solution |
|---|---|---|
| "未找到转录数据" | Recording still processing | Wait 1-2 min, or use Whisper |
| "转录标记存在但数据不完整" | Partial transcription | Use Whisper fallback |
| JSON parse error | Corrupted file | Try Whisper transcription |
Location / 位置: ~/.openclaw/workspace/config/voice-memo-sync.yaml
sources:
voice_memos:
enabled: true
path: "~/Library/Group Containers/group.com.apple.VoiceMemos.shared/Recordings/"
icloud:
enabled: true
paths:
- "~/Library/Mobile Documents/com~apple~CloudDocs/Recordings"
- "~/Library/Mobile Documents/com~apple~CloudDocs/Meeting Recordings"
watch_patterns: ["*.m4a", "*.mp3", "*.mp4", "*.wav", "*.mov"]
transcription:
# Priority order / 优先级顺序
priority: ["apple", "text", "summarize", "whisper-local"]
whisper_model: "small" # tiny/small/medium/large
language: "auto" # auto/zh/en/ja/ko/...
notes:
folder: "Voice Memos" # Apple Notes folder name
include_quotes: true
include_original: true
reminders:
enabled: true
list: "Reminders"
auto_create: true
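A processor reads this file and then walks transcription.priority, falling through to the next engine whenever one fails (the same fallback path the troubleshooting table relies on). A minimal sketch of just the config side, assuming PyYAML is available:

# Illustrative: read voice-memo-sync.yaml and report the engine fallback order.
from pathlib import Path
import yaml   # PyYAML, assumed installed

CONFIG = Path("~/.openclaw/workspace/config/voice-memo-sync.yaml").expanduser()

def engine_order() -> list[str]:
    cfg = yaml.safe_load(CONFIG.read_text())
    # Default used here only if the key is missing; the shipped config always sets it.
    return cfg.get("transcription", {}).get("priority", ["whisper-local"])

if __name__ == "__main__":
    for rank, engine in enumerate(engine_order(), start=1):
        print(f"{rank}. {engine}")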
| Script | Purpose / 用途 | Usage / 用法 |
|---|---|---|
| install.sh | Initialize setup | ./install.sh |
| process.sh | Unified processing | ./process.sh <input> |
| extract-apple-transcript.py | Extract Apple native transcription | python3 extract-apple-transcript.py <file> |
| create-apple-note.sh | Create Apple Notes | ./create-apple-note.sh <title> <content> |
| sync-icloud-recordings.sh | Sync iCloud directory | ./sync-icloud-recordings.sh |
When user sends audio/video or URL, follow these steps:
当用户发送音视频或URL时,按以下步骤处理:
YouTube URL → summarize extract
Bilibili URL → yt-dlp download + whisper
.qta/.m4a → Apple transcript extraction
Other audio/video → whisper transcription
.txt/.md file → direct read
.doc/.docx → textutil convert
# Record to memory/voice-memos/sources/
echo '{"input":"...", "type":"...", "date":"YYYY-MM-DD"}' > sources/xxx.json
# Save to memory/voice-memos/transcripts/YYYY-MM-DD_source_title.md
# Include: source info + full raw transcript
Read USER.md and MEMORY.md to bring in the user's context.
**MODE SELECTION (Auto-detect or Manual Override) / 模式选择:**
┌─────────────────────────────────────────────────────────────────┐
│ Mode A: Solo Memo (Default) / 短语音 │
│ Trigger: < 5 min, single speaker, casual │
│ Output: Clean text + Key points + TODOs + Connections │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Mode B: Deep Meeting / 深度会议 │
│ Trigger: 15-60 min, multi-speaker with labels │
│ Output: │
│ 1. Executive Summary (1 paragraph) │
│ 2. Chronological Detail by time blocks │
│ 3. Debate Flow (who said what, conflicts) │
│ 4. Decision Matrix (Issue → Decision → Rationale) │
│ 5. Action Items with owners │
│ 6. Vital Quotes (preserve Voice) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Mode C: Lecture / Talk / 讲座模式 (NEW) │
│ Trigger: Single speaker, 30min-3hr, structured presentation │
│ Output: │
│ 1. Executive Summary (1 paragraph) │
│ 2. **Argument Structure (论点层级)**: │
│ - Core Thesis (核心论点) │
│ - Supporting Arguments (分论点 1, 2, 3...) │
│ - Key Evidence/Examples for each argument │
│ - Counter-arguments addressed (if any) │
│ 3. Key Definitions (关键定义/概念) │
│ 4. Notable Quotes (金句, with timestamps if available) │
│ 5. Connections to User's Work (个人关联) │
│ 6. Questions Raised / Gaps (讲座未解决的问题) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Mode D: Lecture + Q&A / 讲座+问答 (NEW) │
│ Trigger: First part monologue, second part Q&A │
│ Output: │
│ **Part I: Lecture Section** (use Mode C structure) │
│ **Part II: Q&A Section** │
│ - Group questions by theme/topic (not chronological) │
│ - Format: Q1 → A1 (summary), Q2 → A2... │
│ - Highlight: Best Questions, Surprising Answers │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Mode E: Long-form No-Speaker-Label / 超长无标注会议 (NEW) │
│ Trigger: > 90 min, NO speaker diarization (text is a blob) │
│ Strategy: │
│ 1. **Chunking**: Split into ~30min segments for processing │
│ 2. **Topic Detection**: Identify topic shift points │
│ (Don't force time blocks; use semantic breaks) │
│ 3. **Abandon Attribution**: Don't guess who said what │
│ Output: │
│ 1. Executive Summary │
│ 2. **Topic Blocks** (not time blocks): │
│ - Topic 1: [Summary] + [Key points] + [Quotes] │
│ - Topic 2: ... │
│ 3. Unresolved Issues / Open Questions │
│ 4. Action Items (may lack owners) │
│ 5. Full Cleaned Transcript (appended or linked) │
└─────────────────────────────────────────────────────────────────┘
**TWO-PASS PROCESSING for Long Content (> 60 min):**
- Pass 1 (Quick Scan): Identify structure type, speaker presence, topic shifts
- Pass 2 (Deep Process): Apply appropriate mode to each segment
**OUTPUT DENSITY LEVELS (User can request):**
- Level 1: Executive Only (1 page, for busy stakeholders)
- Level 2: Structured Summary (5-10 pages, default)
- Level 3: Full Annotated Transcript (everything, with margin notes)
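For pass 1 on very long, unlabelled recordings, the ~30-minute segments can be approximated crudely by word count when no timestamps survive transcription. The sketch below assumes roughly 150 spoken words per minute, which is not a figure defined by the skill; topic-shift detection and everything in pass 2 is left to the LLM.

# Sketch of pass-1 chunking (Mode E): split a transcript into ~30-minute pieces
# by word count. For Chinese transcripts (no spaces) chunk by characters instead.
WORDS_PER_MINUTE = 150          # assumption for estimating duration from text
CHUNK_MINUTES = 30

def chunk_transcript(text: str, minutes: int = CHUNK_MINUTES) -> list[str]:
    words = text.split()
    size = minutes * WORDS_PER_MINUTE
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]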
# Save to memory/voice-memos/processed/YYYY-MM-DD_source_title.md
⚠️ CRITICAL: This step is MANDATORY. Never skip it.
⚠️ 关键:此步骤必须执行,不可跳过。
⚠️ Apple Notes requires HTML format, NOT Markdown!
⚠️ Apple Notes 需要 HTML 格式,不能直接用 Markdown!
Correct workflow / 正确流程:
# 1. Convert Markdown to HTML using pandoc (REQUIRED)
pandoc /path/to/processed.md -f markdown -t html -o /tmp/note-content.html
# 2. Create note with HTML content via AppleScript
osascript <<'EOF'
set htmlContent to do shell script "cat /tmp/note-content.html"
set noteTitle to "🎙️ Note Title"

tell application "Notes"
    set folderName to "Voice Memos"
    set targetFolder to missing value
    repeat with f in folders
        if name of f is folderName then
            set targetFolder to f
            exit repeat
        end if
    end repeat
    if targetFolder is missing value then
        make new folder with properties {name:folderName}
        delay 1
        set targetFolder to folder folderName
    end if
    tell targetFolder
        make new note with properties {name:noteTitle, body:htmlContent}
    end tell
end tell
EOF
Common mistakes to avoid / 常见错误:
memo notes -a (interactive) → cannot be automated / 无法自动化

# Create reminders for extracted TODOs
remindctl add --title "TODO" --list "Reminders" --due "YYYY-MM-DD"
# Append record to memory/voice-memos/INDEX.md
⚠️ Privacy-First Design:
# Install local Whisper
brew install openai-whisper
# Update yt-dlp
brew upgrade yt-dlp
# Or use proxy
export ALL_PROXY=http://127.0.0.1:7890
# Manually create via AppleScript
osascript -e 'tell application "Notes" to tell account "iCloud" to make new folder with properties {name:"Voice Memos"}'
# Use larger model for better accuracy
# Edit config: whisper_model: "medium" or "large"
Use the --with-timestamps option for detailed time-aligned output.

brew install ffmpeg
brew install whisper-cpp
brew install openai-whisper
brew install yt-dlp