YouTube Video Analyzer

Analyze YouTube videos by synchronizing transcript text with visual frames to produce detailed summaries, step-by-step guides, and content understanding.

MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
VirusTotal: Benign
OpenClaw: Suspicious (medium confidence)
Purpose & Capability
The name/description align with the requested binaries and the declared install of yt-dlp. ffmpeg, python3, curl and yt-dlp are exactly what you'd expect for downloading videos, extracting transcripts, and producing frames.
Instruction Scope
The SKILL.md provides concrete, low-level steps for metadata retrieval, subtitle extraction, video download, and frame extraction, all scoped to the stated purpose. However, the critical 'Multimodal analysis' step is vague: it says 'Read each frame image, combine with transcript, Generate structured output' but does not specify how or where images are analyzed. That vagueness gives an agent broad discretion, which could lead to unexpected actions such as calling external image-analysis APIs or uploading frames to third-party endpoints.
Install Mechanism
The install spec installs yt-dlp via an installer kind labeled 'uv'. yt-dlp itself is an expected package for this use case, but the installer kind 'uv' is nonstandard/ambiguous in the provided metadata. If 'uv' maps to a known, audited package source in your environment this is low risk; if it downloads code from an untrusted host, it would be higher risk. No direct remote-download-from-arbitrary-URL pattern was found.
Credentials
No environment variables, credentials, or config paths are requested. The skill does not ask for unrelated secrets or system config access, which is proportionate to its stated purpose.
Persistence & Privilege
always:false and default invocation settings are used. The skill does not request persistent system-wide privileges or modifications to other skills. It writes temporary files only to a per-run temp directory.
What to consider before installing
This skill appears to legitimately implement video download, transcript extraction, and frame capture — those parts are coherent. The main risk is the unspecified image-analysis step: before installing, confirm where and how the extracted frames will be processed. Ask the publisher or inspect the full SKILL.md for any commands or API calls that would upload images or send data to external endpoints. Also verify what 'uv' install means in your environment and ensure yt-dlp will be fetched from a trusted package source. If you plan to analyze private or sensitive videos, run the skill in a restricted environment (sandbox), limit how many frames are saved, and ensure that your agent or any invoked tooling is not configured to forward files to third parties. If you need higher assurance, request the skill author provide the missing analysis code or a clear statement of which image-analysis services (if any) will be used.

Like a lobster shell, security has layers — review code before you run it.

Current version: v1.0.0
latest: vk972ytr0x2rznk0zg7nj04ajcd817de6

Runtime requirements

🎬 Clawdis
OS: Linux · macOS
Bins: ffmpeg, python3, curl

Install

uv
Bins: yt-dlp
uv tool install yt-dlp

SKILL.md

YouTube Video Analyzer — Multimodal

This skill performs deep analysis of YouTube videos through both information channels:

  • Audio channel: Transcript with timestamps (what is SAID)
  • Visual channel: Frame extraction + image analysis (what is SHOWN)

Most YouTube skills only extract transcripts. This skill closes the gap by synchronizing visual frames with spoken content, enabling accurate step-by-step guides where "click the blue button" is matched with the actual screenshot showing which button.

Workflow Overview

YouTube URL
    |
    +---> 1. Get metadata (title, duration, video ID)
    |
    +---> 2. Extract transcript (yt-dlp --dump-json + curl)
    |         -> Timestamped segments
    |
    +---> 3. Extract frames (yt-dlp + ffmpeg)
    |         -> Keyframes at strategic intervals
    |
    +---> 4. Synchronize frames <-> transcript
    |         -> Match frames to spoken content by timestamp
    |
    +---> 5. Multimodal analysis
              -> Read each frame image, combine with transcript
              -> Generate structured output

Step 1: Setup Working Directory

VIDEO_URL="<YOUTUBE_URL>"
WORK_DIR=$(mktemp -d /tmp/yt-analysis-XXXXXX)
mkdir -p "$WORK_DIR/frames"

Step 2: Get Video Metadata

yt-dlp --print title --print duration --print id "$VIDEO_URL" 2>/dev/null

This returns three lines: title, duration in seconds, video ID. Store these for later use.
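
If the three lines need to be stored programmatically, a minimal sketch could look like the following. The names `VideoMetadata` and `parse_metadata` are hypothetical helpers, not part of yt-dlp, and the sketch assumes the title itself contains no line breaks.

```python
from typing import NamedTuple


class VideoMetadata(NamedTuple):
    title: str
    duration_s: int
    video_id: str


def parse_metadata(output: str) -> VideoMetadata:
    """Parse the title / duration / id lines printed by the command above."""
    lines = [line.strip() for line in output.strip().splitlines()]
    if len(lines) != 3:
        raise ValueError(f"expected 3 lines (title, duration, id), got {len(lines)}")
    title, duration, video_id = lines
    # Duration is printed in seconds; tolerate a fractional value.
    # A non-numeric value (assumed possible for live streams) raises ValueError.
    return VideoMetadata(title, int(float(duration)), video_id)
```

Usage: pass the captured stdout of the yt-dlp call, for example `parse_metadata(result.stdout)` after a `subprocess.run(..., capture_output=True, text=True)`.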

Step 3: Extract Transcript

IMPORTANT: Direct subtitle download via --write-sub frequently hits YouTube rate limits (HTTP 429). Use the two-step method below instead, which is usually more reliable.

Step 3a: Get subtitle URL from video JSON

yt-dlp --dump-json "$VIDEO_URL" 2>/dev/null | python3 -c "
import json, sys
data = json.load(sys.stdin)
auto = data.get('automatic_captions', {})
subs = data.get('subtitles', {})

# Priority: manual subs > auto subs. Languages are tried in the order listed
# below; put the user's preferred language first.
for source in [subs, auto]:
    for lang in ['en', 'de', 'en-orig', 'fr', 'es']:
        if lang in source:
            for fmt in source[lang]:
                if fmt.get('ext') == 'json3':
                    print(fmt['url'])
                    sys.exit(0)

# Fallback: take first available auto-caption, get json3 URL
for lang in sorted(auto.keys()):
    for fmt in auto[lang]:
        if fmt.get('ext') == 'json3':
            url = fmt['url']
            # Remove translation param to get original language
            import re
            url = re.sub(r'&tlang=[^&]+', '', url)
            print(url)
            sys.exit(0)

print('NO_SUBS', file=sys.stderr)
sys.exit(1)
" > "$WORK_DIR/sub_url.txt"

Step 3b: Download and parse transcript

curl -s "$(cat "$WORK_DIR/sub_url.txt")" -o "$WORK_DIR/transcript.json3"

Verify it is valid JSON (not an HTML error page):

head -c 20 "$WORK_DIR/transcript.json3"
# Should start with { — if it starts with <html, retry after 10s sleep
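
The same check can be made stricter by actually parsing the file rather than inspecting its first bytes. This is a sketch only; `is_json_transcript` is a hypothetical helper name.

```python
import json


def is_json_transcript(raw: bytes) -> bool:
    """Return True if raw parses as a JSON object, False for HTML or truncated data."""
    text = raw.decode("utf-8", errors="replace").lstrip()
    if not text.startswith("{"):
        return False  # e.g. an HTML error page starting with "<html"
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False
```

Usage: `is_json_transcript(open(path, "rb").read())`, where `path` points at the downloaded transcript.json3 file.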

Step 3c: Parse json3 into readable timestamped segments

python3 -c "
import json

with open('$WORK_DIR/transcript.json3') as f:
    data = json.load(f)

for event in data.get('events', []):
    segs = event.get('segs', [])
    if not segs:
        continue
    start_ms = event.get('tStartMs', 0)
    duration_ms = event.get('dDurationMs', 0)
    text = ''.join(s.get('utf8', '') for s in segs).strip()
    if not text or text == '\n':
        continue
    s = start_ms / 1000
    e = (start_ms + duration_ms) / 1000
    print(f'[{int(s//60):02d}:{int(s%60):02d} - {int(e//60):02d}:{int(e%60):02d}] {text}')
" > "$WORK_DIR/transcript.txt"

Read $WORK_DIR/transcript.txt to get the full transcript with timestamps.
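
For later lookups it can help to load the text file back into structured segments. The sketch below assumes the `[MM:SS - MM:SS] text` line format written in Step 3c; `load_segments` is a hypothetical helper. Because that format truncates to whole seconds, the recovered times are approximate.

```python
import re
from typing import Iterable, List, Tuple

# Matches lines such as "[00:15 - 00:18] some text"
LINE_RE = re.compile(r"^\[(\d+):(\d{2}) - (\d+):(\d{2})\] (.*)$")


def load_segments(lines: Iterable[str]) -> List[Tuple[int, int, str]]:
    """Return (start_s, end_s, text) tuples parsed from transcript lines."""
    segments: List[Tuple[int, int, str]] = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match:
            s_min, s_sec, e_min, e_sec, text = match.groups()
            start = int(s_min) * 60 + int(s_sec)
            end = int(e_min) * 60 + int(e_sec)
            segments.append((start, end, text))
        elif segments and line.strip():
            # Continuation line: the caption text itself contained a newline.
            start, end, text = segments[-1]
            segments[-1] = (start, end, text + " " + line.strip())
    return segments
```

Usage: `load_segments(open(transcript_path, encoding="utf-8"))`.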

Fallback: No transcript available

If no subtitles exist at all, inform the user and proceed with visual-only analysis.

Step 4: Download Video and Extract Frames

Step 4a: Download video (720p is sufficient for frame analysis)

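# --merge-output-format mp4 keeps the merged file in an MP4 container,
# so the later steps can rely on the video.mp4 path.
yt-dlp -f "bestvideo[height<=720]+bestaudio/best[height<=720]" \
       --merge-output-format mp4 \
       -o "$WORK_DIR/video.mp4" "$VIDEO_URL"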

Step 4b: Get exact duration

DURATION=$(ffprobe -v quiet -show_entries format=duration -of csv=p=0 "$WORK_DIR/video.mp4")

Step 4c: Extract frames using adaptive interval strategy

Choose interval based on video length:

Duration  | Interval | Approx. Frames | Rationale
< 5 min   | 10s      | 20-30          | Dense enough for detailed analysis
5-20 min  | 20s      | 15-60          | Good balance of coverage vs. volume
20-60 min | 30-45s   | 30-120         | Focus on key moments
> 60 min  | 60s      | 60-120+        | Ask user if they want to focus on specific sections

# Example for a 5-20 minute video (interval=20):
ffmpeg -i "$WORK_DIR/video.mp4" -vf "fps=1/20" -q:v 3 "$WORK_DIR/frames/frame_%04d.jpg" 2>&1
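
The interval table above can be expressed as a small lookup. This is a sketch: `choose_interval` and `fps_filter` are hypothetical names, the 20-minute boundary is assigned to the 20s row, and the lower bound (30s) is picked for the 30-45s row; both choices are assumptions.

```python
def choose_interval(duration_s: float) -> int:
    """Map a video duration in seconds to a frame interval in seconds."""
    minutes = duration_s / 60
    if minutes < 5:
        return 10
    if minutes <= 20:
        return 20
    if minutes <= 60:
        return 30  # table allows 30-45s; the lower bound is an assumption
    return 60


def fps_filter(interval_s: int) -> str:
    """Build the ffmpeg -vf argument for one frame every interval_s seconds."""
    return f"fps=1/{interval_s}"
```

Usage: `fps_filter(choose_interval(float(duration)))`, where `duration` is the ffprobe value from Step 4b.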

For scene-change detection (software HowTos, UI demos):

ffmpeg -i "$WORK_DIR/video.mp4" \
       -vf "select='gt(scene,0.3)',showinfo" \
       -vsync vfr -q:v 3 "$WORK_DIR/frames/scene_%04d.jpg" 2>&1
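
Scene-detected frames are not evenly spaced, so their timestamps have to be recovered from the `showinfo` log lines, which report a `pts_time:` value per frame. A minimal sketch, assuming the ffmpeg output was captured to a string and that the k-th logged frame corresponds to the k-th `scene_%04d.jpg` file; `scene_timestamps` is a hypothetical helper.

```python
import re
from typing import List

# showinfo prints one line per frame containing "pts_time:<seconds>"
PTS_TIME_RE = re.compile(r"pts_time:\s*([0-9]+(?:\.[0-9]+)?)")


def scene_timestamps(ffmpeg_log: str) -> List[float]:
    """Return the pts_time of each frame logged by the showinfo filter, in order."""
    times: List[float] = []
    for line in ffmpeg_log.splitlines():
        if "showinfo" not in line:
            continue  # skip progress lines and other filter output
        match = PTS_TIME_RE.search(line)
        if match:
            times.append(float(match.group(1)))
    return times
```

With this, `scene_timestamps(log)[0]` would be the timestamp of scene_0001.jpg, and so on.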

Step 4d: Calculate timestamps for each frame

For fixed-interval extraction: frame N has timestamp (N-1) * interval seconds.

frame_0001.jpg -> 0:00
frame_0002.jpg -> 0:20
frame_0003.jpg -> 0:40
...
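
The fixed-interval rule above, `(N-1) * interval`, can be sketched as follows. `frame_timestamp` and `format_mmss` are hypothetical helper names, and the sketch assumes the `frame_%04d.jpg` naming used in Step 4c.

```python
import re


def frame_timestamp(filename: str, interval_s: int) -> int:
    """Return the timestamp in seconds for a frame file such as frame_0003.jpg."""
    match = re.search(r"(\d+)\.jpg$", filename)
    if not match:
        raise ValueError(f"no frame number found in {filename!r}")
    frame_number = int(match.group(1))  # ffmpeg numbering starts at 1
    return (frame_number - 1) * interval_s


def format_mmss(seconds: int) -> str:
    """Format seconds as M:SS, matching the listing above."""
    return f"{seconds // 60}:{seconds % 60:02d}"
```

For example, `format_mmss(frame_timestamp("frame_0003.jpg", 20))` gives `0:40`, the same value as in the listing.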

Step 5: Synchronize Frames with Transcript

For each extracted frame:

  1. Calculate the frame's timestamp in seconds
  2. Find the transcript segment(s) covering that timestamp
  3. Create a synchronized pair: {timestamp, transcript_text, frame_path}

This is done mentally or via a simple lookup — no external script needed.
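
If a scripted lookup is preferred, the three steps above could be sketched like this. `sync_frames` is a hypothetical helper; falling back to the nearest segment when no segment covers a frame is an assumption, not something the skill prescribes.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float, str]  # (start_s, end_s, text)


def sync_frames(frames: List[Tuple[float, str]], segments: List[Segment]) -> List[Dict]:
    """Pair each (timestamp_s, frame_path) with the transcript text covering it."""
    pairs: List[Dict] = []
    for timestamp, path in frames:
        covering = [text for start, end, text in segments if start <= timestamp <= end]
        if not covering and segments:
            # No segment spans this moment: use the segment whose start is closest.
            nearest = min(segments, key=lambda seg: abs(seg[0] - timestamp))
            covering = [nearest[2]]
        pairs.append({
            "timestamp": timestamp,
            "transcript_text": " ".join(covering),
            "frame_path": path,
        })
    return pairs
```

Each returned dictionary has the `{timestamp, transcript_text, frame_path}` shape described in step 3 of the list.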

Step 6: Multimodal Analysis

Step 6a: Read and analyze each frame

Use the Read tool (or view tool) to look at each frame image. For each frame, consider:

  • UI elements: Buttons, menus, dialogs, settings panels visible
  • Text on screen: Code, labels, error messages, URLs, terminal output
  • Diagrams/graphics: Charts, flow diagrams, architecture drawings
  • Physical actions: Hand positions, tool usage (for physical HowTos)
  • Changes: What changed compared to the previous frame?

Step 6b: Synthesize both channels

For each key moment, combine audio and visual:

Segment [TIMESTAMP]:
  SAID: "Click the blue button in the top right"
  SHOWN: Settings page screenshot, blue "Save" button highlighted
         in top-right corner, cursor pointing at it
  SYNTHESIS: -> On the Settings page, click the blue "Save" button
               in the top-right corner

Step 6c: Identify visual-only information

Flag moments where the visual channel provides information NOT present in audio:

  • Specific button names, menu paths, exact UI locations
  • Code that is shown but not read aloud
  • Error messages visible on screen
  • Before/after comparisons

Output Formats

Generate the appropriate format based on the user's request:

Format A: Step-by-Step Guide (most common)

# [Video Title] — Guide

## Step 1: [Action] (00:15)
[Description based on transcript + frame analysis]
> Visual: [What the screen/image shows at this point]

## Step 2: [Action] (00:42)
[...]

Format B: Comprehensive Summary with Visual Anchors

# [Video Title] — Summary

## Overview
[2-3 sentence summary of the entire video]

## Key Sections

### [Section Name] (00:00 - 02:30)
[Summary of this section]
- Key visual: [Description of what's shown]
- Key quote: "[Important spoken content]"

### [Section Name] (02:30 - 05:00)
[...]

## Key Takeaways
- [Takeaway 1]
- [Takeaway 2]

Format C: Technical Detail Analysis

Separate analysis of both channels plus discrepancy detection:

# [Video Title] — Technical Analysis

## Audio Channel Analysis
[What was said, key points, structure]

## Visual Channel Analysis
[What was shown, UI flows, code, diagrams]

## Channel Synchronization
[Where audio and visual complement each other]

## Visual-Only Information
[Important details only visible in frames, not mentioned in speech]

Error Handling & Edge Cases

Problem | Solution
HTTP 429 on subtitle download | Use --dump-json method (Step 3a). If curl also gets blocked, wait 10-15 seconds and retry with a different User-Agent
No subtitles available at all | Proceed with visual-only analysis, inform user
Original audio language not in auto-captions list | The original language is the source; auto-captions are translations. Remove &tlang=XX from any auto-caption URL to get the original
transcript.json3 contains HTML instead of JSON | YouTube returned an error page. Wait 10s, retry with: curl -s --user-agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64)" "$URL"
Video > 60 min | Ask user if they want to focus on specific time ranges or chapters
Poor video quality / blurry frames | Extract more frames at tighter intervals to compensate
Video is age-restricted or private | Inform user that the video cannot be accessed. Suggest using --cookies-from-browser if they have access
yt-dlp download fails | Try alternative format: -f "best[height<=720]" without separate audio+video streams
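
The wait-and-retry advice for blocked transcript downloads could be wrapped in a small loop. This is a sketch under assumptions: `fetch_with_retry` is a hypothetical helper, the `fetch` callable (which takes a User-Agent string and returns the response body) must be supplied by the caller, and the second User-Agent string is only an illustrative alternative.

```python
import time
from typing import Callable, Sequence

USER_AGENTS = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # from the table above
    "Mozilla/5.0 (X11; Linux x86_64)",            # illustrative alternative (assumption)
)


def fetch_with_retry(
    fetch: Callable[[str], bytes],
    attempts: int = 3,
    delay_s: float = 10.0,
    user_agents: Sequence[str] = USER_AGENTS,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(user_agent) until the body looks like JSON, rotating User-Agents."""
    for attempt in range(attempts):
        agent = user_agents[attempt % len(user_agents)]
        body = fetch(agent)
        if body.lstrip().startswith(b"{"):
            return body
        if attempt < attempts - 1:
            sleep(delay_s)  # back off before the next try
    raise RuntimeError(f"no JSON response after {attempts} attempts")
```

Passing `sleep` and `fetch` as parameters keeps the loop testable without network access; in practice `fetch` might wrap the curl call from Step 3b.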

Cleanup

After analysis is complete, remove temporary files:

rm -rf "$WORK_DIR"
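
Because `rm -rf` on an unset or wrong variable can be destructive, a guarded variant may be worth considering. The sketch below mirrors Step 1 and the cleanup in Python; `make_work_dir` and `safe_cleanup` are hypothetical helpers, and the directory is created under the system temp directory rather than a hard-coded /tmp.

```python
import os
import shutil
import tempfile

PREFIX = "yt-analysis-"


def make_work_dir() -> str:
    """Create a per-run working directory with a frames/ subdirectory."""
    work_dir = tempfile.mkdtemp(prefix=PREFIX)
    os.makedirs(os.path.join(work_dir, "frames"))
    return work_dir


def safe_cleanup(work_dir: str, prefix: str = PREFIX) -> bool:
    """Remove work_dir only if it is a directory whose name starts with prefix."""
    path = os.path.realpath(work_dir)
    if not os.path.isdir(path) or not os.path.basename(path).startswith(prefix):
        return False  # refuse to delete anything that does not look like a work dir
    shutil.rmtree(path)
    return True
```

The prefix check is a deliberately simple guard: it does not prove the directory was created by this skill, only that its name matches.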

Tips for Best Results

  • Software HowTos: Use scene-change detection — UI transitions create clear visual breaks
  • Physical HowTos: Use tighter frame intervals (10-15s) — movements are subtler
  • Read the transcript first: Identify "interesting timestamps" before extracting frames. Look for phrases like "as you can see here", "let me show you", "on the screen" — these signal important visual moments
  • Context-aware frame analysis: When analyzing a frame, always provide the transcript context. The speaker often explains what's about to be shown
  • Batch frame reading: Read frames in batches of 8-10 to maintain context across sequential frames and detect visual changes
  • Always extract both channels in parallel: Start the video download while processing the transcript to save time
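
The "read the transcript first" tip can be approximated with a phrase scan over the transcript segments. A minimal sketch: `visual_cue_timestamps` is a hypothetical helper, the phrase list contains only the three examples from the tip, and segments are assumed to be `(start_s, end_s, text)` tuples.

```python
from typing import Iterable, List, Sequence, Tuple

CUE_PHRASES = ("as you can see here", "let me show you", "on the screen")


def visual_cue_timestamps(
    segments: Iterable[Tuple[int, int, str]],
    cues: Sequence[str] = CUE_PHRASES,
) -> List[int]:
    """Return start times of segments whose text contains a visual cue phrase."""
    hits: List[int] = []
    for start, _end, text in segments:
        lowered = text.lower()  # case-insensitive substring match
        if any(cue in lowered for cue in cues):
            hits.append(start)
    return hits
```

The returned timestamps could then be used to extract extra frames around those moments, in addition to the fixed-interval frames.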
