Automatic Video Note Creation

Use this skill when the user provides a video URL and wants a complete Markdown learning note. It downloads the original video, transcribes the audio with qwen-audio/STT, extracts timestamped frames with ffmpeg, reviews the candidate screenshots one by one against the subtitles to select key frames, and finally generates an illustrated learning note.

Install

openclaw skills install video-learning-notes

Video Learning Notes

Overview

Convert a video URL or local video file into a complete Markdown learning note. The note should be structured from the STT subtitle content and include selected key screenshots. Use this skill for requests such as “turn this video into learning notes”, “download this video and transcribe/analyze it”, or similar video-to-learning-note tasks.

Required Output

Create a self-contained output directory containing:

  • The downloaded original video file.
  • transcript.srt generated by qwen-audio/STT.
  • Timestamped frames extracted by ffmpeg under frames/.
  • Manually selected key screenshots under selected_frames/.
  • The final Markdown file, usually named video_learning_notes.md, using relative paths for the source video and images, and citing screenshot timestamps.

Workflow

1. Create a workspace

Create a dedicated output directory for each video note. Prefer the current task directory or a stable path such as ./<note-title>/. Keep all generated files inside this directory; do not scatter outputs into shared default folders.
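
As an illustration of this step, here is a minimal Python sketch that derives a folder name from the note title and creates the workspace. The function name `make_workspace`, the character-replacement rule, and the `video_note` fallback are assumptions for illustration; they are not part of the skill's bundled script.

```python
import re
from pathlib import Path


def make_workspace(title: str, base: Path = Path(".")) -> Path:
    """Create <base>/<note-title>/ plus the selected_frames/ folder used later."""
    # Collapse whitespace and characters that are unsafe in directory names.
    # Letters in any script (including CJK) are kept as-is.
    slug = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_") or "video_note"
    workspace = base / slug
    # exist_ok=True makes the call safe to repeat for the same note.
    (workspace / "selected_frames").mkdir(parents=True, exist_ok=True)
    return workspace
```

Calling `make_workspace("Intro to ffmpeg: part 1/2")` would create `./Intro_to_ffmpeg_part_1_2/selected_frames/`, so every later output has a single home.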

2. Download the original video

If the source is an online video, use the yt-dlp-downloader skill/workflow to download the user-provided video URL. Preserve the original or best available quality when possible, and write the video into the current workspace.

Check dependencies when needed before downloading:

which yt-dlp || echo "yt-dlp not installed. Install with: pip install yt-dlp"
which ffmpeg || echo "ffmpeg not installed. Install with: brew install ffmpeg"

Recommended commands:

# Generic: download best quality into the workspace
yt-dlp -P "/path/to/workspace" -o "%(title)s.%(ext)s" "VIDEO_URL"

# YouTube: use browser cookies by default to reduce 403 errors
yt-dlp -P "/path/to/workspace" --cookies-from-browser chrome -o "%(title)s.%(ext)s" "YOUTUBE_URL"

# Download subtitles when available; still run qwen-audio/STT unless the user only wants official subtitles
yt-dlp -P "/path/to/workspace" --write-subs --sub-langs all -o "%(title)s.%(ext)s" "VIDEO_URL"

Platform handling principles:

  • YouTube / YouTube Music: use --cookies-from-browser chrome by default. Supported browser cookie sources include chrome, firefox, safari, edge, brave, and opera.
  • Bilibili, Twitter/X, TikTok, Douyin, Vimeo, Twitch, and most other platforms: try direct download first.
  • Playlist URLs: ask the user whether to process the entire playlist, one specific video, or a specific range.
  • Quality selection: default to the best available quality. If the user specifies a quality, use format selectors such as bestvideo[height<=1080]+bestaudio/best[height<=1080].
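
The platform rules above can be sketched as a small command builder. This is only an illustrative helper: `build_ytdlp_command` and its parameters are hypothetical names, and it uses only the yt-dlp options already shown in this section (`-P`, `-o`, `-f`, `--cookies-from-browser`).

```python
from pathlib import Path
from typing import List, Optional


def build_ytdlp_command(
    url: str,
    workspace: Path,
    max_height: Optional[int] = None,
    cookies_browser: Optional[str] = None,
) -> List[str]:
    """Return an argv list for yt-dlp; pass it to subprocess.run to execute."""
    cmd = ["yt-dlp", "-P", str(workspace), "-o", "%(title)s.%(ext)s"]
    if max_height is not None:
        # Cap the resolution only when the user asked for a specific quality.
        selector = f"bestvideo[height<={max_height}]+bestaudio/best[height<={max_height}]"
        cmd += ["-f", selector]
    if cookies_browser is not None:
        # For example "chrome" for YouTube, to reduce 403 errors.
        cmd += ["--cookies-from-browser", cookies_browser]
    cmd.append(url)
    return cmd
```

Building the command as a list rather than a shell string avoids quoting problems with titles and URLs that contain special characters.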

After downloading, identify the actual video file path, such as .mp4, .mkv, .mov, .webm, etc. If multiple files are produced, choose the main video as the source for the learning note, while keeping subtitles, thumbnails, and other files as supporting assets.
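
One possible way to pick the main video is sketched below. Treating the largest video file as the main one is an assumed heuristic, not a rule stated by the skill, and `find_main_video` is a hypothetical helper name; the extension set mirrors the examples above.

```python
from pathlib import Path
from typing import Optional

VIDEO_EXTENSIONS = {".mp4", ".mkv", ".mov", ".webm"}


def find_main_video(workspace: Path) -> Optional[Path]:
    """Return the largest video file in the workspace, or None if there is none."""
    candidates = [
        p for p in workspace.iterdir()
        if p.is_file() and p.suffix.lower() in VIDEO_EXTENSIONS
    ]
    # Subtitles and thumbnails are ignored because of the extension filter.
    return max(candidates, key=lambda p: p.stat().st_size, default=None)
```

If the heuristic picks the wrong file (for example a long trailer), confirm the path manually before moving on.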

Troubleshooting:

  • HTTP 403 Forbidden: retry with --cookies-from-browser chrome or another browser where the user is logged in.
  • Video unavailable, private videos, or geo-restricted videos: ask the user for login access, cookies, or an accessible environment; do not bypass access restrictions.
  • Format not available: run yt-dlp -F "VIDEO_URL" to list available formats, then choose one.
  • Interrupted downloads: retry; yt-dlp can usually resume partial downloads.
  • yt-dlp: command not found: install yt-dlp or ask the user to install it.

If yt-dlp-downloader / yt-dlp is unavailable, or if the video requires login/authentication, stop and ask the user to provide the missing access requirement instead of silently switching to unreliable tools.

3. Transcribe with qwen-audio

Run qwen-audio/STT on the downloaded video or extracted audio, and save the result as transcript.srt.

For large videos, first use ffmpeg to extract compressed mono audio, then transcribe the smaller audio file:

ffmpeg -y -i input.mp4 -vn -ac 1 -ar 16000 -b:a 32k audio_for_stt.mp3

Preserve timestamp information as much as possible. Prefer SRT format. If STT only produces plain text, create transcript.txt and clearly note in the final output that exact subtitle timing is unavailable.

4. Extract timestamped candidate frames with ffmpeg

After confirming the video path, use scripts/prepare_video_learning_assets.py. The script generates timestamped candidate screenshots and a manifest file:

python3 "$SKILL_DIR/scripts/prepare_video_learning_assets.py" \
  --video /path/to/video.mp4 \
  --out /path/to/workspace \
  --scene-threshold 0.3

By default, the script extracts frames only from ffmpeg scene changes; it does not take one screenshot every 30 seconds. Use --interval <seconds> only when regular interval screenshots are explicitly needed.

For most learning videos, the recommended --scene-threshold range is 0.1–0.3:

  • Lower thresholds produce more frames and capture smaller visual changes.
  • Higher thresholds produce fewer frames and keep only more obvious scene changes.
  • After running the script, check the frame count in frames_manifest.json and adjust the threshold so the number of candidate frames is suitable for manual review.
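
The tuning loop described above might look like the following sketch. Because the layout of frames_manifest.json is not documented here, the example counts the JPEG files under frames/ instead. The target range of 30 to 150 candidate frames and the 0.05 step are assumptions chosen for illustration, and both function names are hypothetical.

```python
from pathlib import Path


def count_frames(workspace: Path) -> int:
    """Count candidate screenshots written under frames/."""
    return len(list((workspace / "frames").glob("*.jpg")))


def suggest_threshold(
    current: float,
    frame_count: int,
    low: int = 30,      # assumed lower bound for a reviewable frame count
    high: int = 150,    # assumed upper bound for a reviewable frame count
    step: float = 0.05,
) -> float:
    """Nudge --scene-threshold, staying inside the recommended 0.1 to 0.3 range."""
    if frame_count > high:
        # Too many frames: a higher threshold keeps only stronger scene changes.
        return min(0.3, round(current + step, 2))
    if frame_count < low:
        # Too few frames: a lower threshold captures smaller visual changes.
        return max(0.1, round(current - step, 2))
    return current
```

For example, `suggest_threshold(0.3, 10)` returns `0.25`; rerun the script with the new value and check the count again.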

The script writes:

  • frames/frame_000001__HH-MM-SS.jpg
  • frames_manifest.json
  • video_learning_notes.skeleton.md
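
Since the frame filenames encode the timestamp, later steps can recover it without opening the manifest. A minimal sketch, assuming the `frame_<index>__HH-MM-SS.jpg` pattern shown above holds for every frame (`frame_seconds` is a hypothetical helper name):

```python
import re

FRAME_NAME = re.compile(r"frame_(\d+)__(\d{2})-(\d{2})-(\d{2})\.jpg$")


def frame_seconds(filename: str) -> int:
    """Convert a name like frame_000003__00-01-30.jpg to seconds from the start."""
    match = FRAME_NAME.search(filename)
    if match is None:
        raise ValueError(f"unexpected frame filename: {filename}")
    _, hours, minutes, seconds = match.groups()
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds)
```

For instance, `frame_seconds("frames/frame_000003__00-01-30.jpg")` evaluates to `90`.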

If scene detection misses important content, add --interval <seconds> as a supplement. Use 10–15 seconds for slide-heavy or fast-changing instructional videos, and 45–60 seconds for talking-head videos.

5. Read and filter key frames one by one with STT context

Use the Read tool's visual analysis capability to inspect extracted screenshots. You must check candidate frames one by one in chronological order, and decide whether each frame is a key frame by combining the image content with nearby STT/SRT text.

For each candidate image:

  1. Read the image file content with vision; do not judge only from the filename or timestamp.
  2. Locate the subtitles around the same timestamp in transcript.srt, usually the preceding and following 15–30 seconds.
  3. Decide whether the image provides learning value beyond the transcript.
  4. Keep the frame as a key screenshot only when it helps explain, supplement, or preserve important information.

Prioritize frames containing:

  • Slides, diagrams, charts, code, formulas, tables, whiteboard content, UI operation screens, or definitions.
  • Scene changes that introduce a new topic.
  • Visual examples explicitly referenced by the subtitles.
  • Important on-screen text not fully captured by STT.

Skip frames that are:

  • Near-duplicates.
  • Blurry.
  • Pure talking-head shots without useful learning information.
  • Loading screens, ads, intros/outros, or irrelevant overlays.

Copy selected key images into selected_frames/, preserving the original timestamped filenames. Keep only enough screenshots to support learning; do not keep every candidate frame. For most videos, 8–30 screenshots are enough. Use more only when the video is highly visual.
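
The copy step could be done as in the sketch below. It assumes the frames/ and selected_frames/ layout described in this document; `copy_selected_frames` is a hypothetical helper, and the list of names is whatever the one-by-one review produced.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def copy_selected_frames(workspace: Path, names: Iterable[str]) -> List[Path]:
    """Copy chosen frames into selected_frames/, keeping the timestamped names."""
    destination = workspace / "selected_frames"
    destination.mkdir(exist_ok=True)
    copied: List[Path] = []
    for name in names:
        source = workspace / "frames" / name
        # copy2 keeps file metadata; the original stays in frames/ untouched.
        copied.append(Path(shutil.copy2(source, destination / name)))
    return copied
```

Copying rather than moving keeps frames/ intact, so the selection can be revised later without rerunning ffmpeg.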

6. Generate the learning note from screenshots and SRT

Read transcript.srt, the selected screenshot filenames/timestamps, and the visual notes produced while reading images one by one. Create video_learning_notes.md only after the key-frame selection step is complete.

When writing, embed key screenshots into the corresponding time-based sections, and explain why each screenshot matters based on both the STT context and the image content. Use the following structure:

# <Video title or topic>

<video src="relative/path/to/video.mp4" controls ></video>

- Original video: <URL>
- Video file: [<video-title>](relative/path/to/video.mp4)
- Generated date: YYYY-MM-DD

## Core Summary

<Summarize the most important ideas in 5–10 bullet points.>

## Learning Objectives

<List the concepts or operations the learner should understand after watching.>

## Section-by-Section Notes

### 00:00:00–00:03:20 <Section title>

<Convert the subtitles into readable learning notes. Do not dump the raw transcript.>

![00:01:30 key screenshot](selected_frames/frame_000003__00-01-30.jpg)

## Key Concepts / Terms

| Term | Explanation | Timestamp |
|---|---|---|

## Steps or Methodology

<If the video is a tutorial, organize the steps. If it is a course, organize the conceptual framework.>

## Review Checklist

- [ ] <Question or checkpoint>

Writing standards

  • Write in Chinese unless the user asks otherwise.
  • Convert the transcript into structured learning notes; do not dump a raw transcript.
  • Include screenshot timestamps in captions or nearby text.
  • At the beginning of the Markdown, embed the source video inline with an HTML video tag, the same way screenshots are embedded inline as images: <video src="relative/path/to/video.mp4" controls ></video>.
  • Use relative paths for both the source video and screenshots so the Markdown still displays correctly when the folder is moved.
  • Ground all claims in the subtitle content or image content. Use wording such as “possibly” or “appears to” for uncertain visual interpretations.
  • If the video contains code, formulas, financial charts, UI operations, or domain-specific terminology, preserve readable on-screen text as accurately as possible.
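
To keep captions and relative paths consistent, the image lines can be generated from the filenames. A minimal sketch, assuming the `__HH-MM-SS.jpg` suffix described earlier and the caption wording used in the template; `screenshot_markdown` is a hypothetical helper name.

```python
import re
from pathlib import PurePosixPath


def screenshot_markdown(filename: str, folder: str = "selected_frames") -> str:
    """Build a Markdown image line with a timestamp caption and a relative path."""
    match = re.search(r"__(\d{2})-(\d{2})-(\d{2})\.jpg$", filename)
    if match is None:
        raise ValueError(f"unexpected frame filename: {filename}")
    timestamp = ":".join(match.groups())
    # Forward slashes keep the link portable when the folder is moved.
    relative = PurePosixPath(folder) / PurePosixPath(filename).name
    return f"![{timestamp} key screenshot]({relative})"
```

For example, `screenshot_markdown("frame_000003__00-01-30.jpg")` yields `![00:01:30 key screenshot](selected_frames/frame_000003__00-01-30.jpg)`, matching the line in the template above.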

Tool notes

  • Use Bash for yt-dlp-downloader, ffmpeg, ffprobe, qwen-audio/STT commands, and necessary file operations.
  • Use Read with vision to review images; do not rely only on filenames or timestamps.
  • Use the bundled script for deterministic candidate frame extraction and skeleton generation; the final Markdown should be manually organized after subtitle and image analysis.
  • At the end of the task, use Message files_preview when appropriate to show the final Markdown and output directory.