gemini-video-understanding

Analyze videos with Google Gemini API (summaries, Q&A, transcription with timestamps + visual context, scene/timeline detection, video clipping, FPS control, multi-video comparison, and YouTube URL analysis).

Audits

Pass

ClawScanPass

Agentic behavior and permission review.

Static analysisPass

Pattern checks against bundled files.

VirusTotalPass

Multi-engine malware detections and file reputation.

Install

openclaw skills install pedestrian-traffic-counting-gemini-video-understanding

Gemini Video Understanding Skill

Purpose

This skill enables video understanding workflows using the Google Gemini API, including video summarization, question answering, transcription with optional visual descriptions, timestamp-based queries (MM:SS), scene/timeline detection, video clipping, custom FPS sampling, multi-video comparison, and YouTube URL analysis.

When to Use

Summarizing a video into key points or chapters
Answering questions about what happens at specific timestamps (MM:SS)
Producing a transcript (optionally with visual context) and speaker labels
Detecting scene changes or building a timeline of events
Analyzing long videos by clipping to relevant segments or reducing FPS
Comparing multiple videos (up to 10 videos on Gemini 2.5+)
Analyzing public YouTube videos directly via URL

Required Libraries

The following Python libraries are required:

from google import genai
from google.genai import types
import os
import time

Input Requirements

File formats: MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
Size constraints:
- Use inline bytes for small files (rule of thumb: <20MB).
- Use the File API upload flow for larger videos (most real videos).
YouTube:
- Video must be public (not private/unlisted) and not age-restricted.
- Provide a valid YouTube URL.
Duration / context window (model-dependent):
- 2M-token models: ~2 hours (default resolution) or ~6 hours (low-res).
- 1M-token models: ~1 hour (default) or ~3 hours (low-res).
Timestamps: Use MM:SS (e.g., 01:15) when requesting time-based answers.

Output Schema

All extracted/derived content should be returned as valid JSON conforming to this schema:

{
  "success": true,
  "source": {
    "type": "file|youtube",
    "id": "video.mp4|VIDEO_ID_OR_URL",
    "model": "gemini-2.5-flash"
  },
  "summary": "Concise summary of the video...",
  "transcript": {
    "available": true,
    "text": "Full transcript text (may include speaker labels)...",
    "includes_visual_descriptions": true
  },
  "events": [
    {
      "timestamp": "MM:SS",
      "description": "What happens at this time",
      "category": "scene_change|key_point|action|other"
    }
  ],
  "warnings": [
    "Optional warnings about limitations, missing timestamps, or low confidence areas"
  ]
}

Field Descriptions

success: Whether the analysis completed successfully
source.type: file for uploaded/local content, youtube for YouTube analysis
source.id: Filename for local uploads, or URL/ID for YouTube
source.model: Gemini model used for the request
summary: High-level video summary
transcript.*: Transcript payload (may be omitted or available=false if not requested)
events: Timeline items with MM:SS timestamps (chapters, scene changes, key actions)
warnings: Any issues that could affect correctness (e.g., “timestamp not found”, “long video clipped”)

Code Examples

Basic Video Analysis (Local Video + File API)

from google import genai
import os
import time

client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

# Upload video (File API for >20MB)
myfile = client.files.upload(file="video.mp4")

# Wait for processing
while myfile.state.name == "PROCESSING":
    time.sleep(1)
    myfile = client.files.get(name=myfile.name)

if myfile.state.name == "FAILED":
    raise ValueError("Video processing failed")

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=["Summarize this video in 3 key points", myfile],
)

print(response.text)

YouTube Video Analysis (Public Videos Only)

from google import genai
from google.genai import types
import os

client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "Summarize the main topics discussed",
        types.Part.from_uri(
            uri="https://www.youtube.com/watch?v=VIDEO_ID",
            mime_type="video/mp4",
        ),
    ],
)

print(response.text)

Inline Video (<20MB)

from google import genai
from google.genai import types
import os

client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

with open("short-clip.mp4", "rb") as f:
    video_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "What happens in this video?",
        types.Part.from_bytes(data=video_bytes, mime_type="video/mp4"),
    ],
)

print(response.text)

Video Clipping (Analyze a Segment Only)

from google.genai import types

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "Summarize this segment",
        types.Part.from_video_metadata(
            file_uri=myfile.uri,
            start_offset="40s",
            end_offset="80s",
        ),
    ],
)

Custom Frame Rate (Token/Cost Control)

from google.genai import types

# Lower FPS for static content (saves tokens)
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "Analyze this presentation",
        types.Part.from_video_metadata(file_uri=myfile.uri, fps=0.5),
    ],
)

# Higher FPS for fast-moving content
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "Analyze rapid movements in this sports video",
        types.Part.from_video_metadata(file_uri=myfile.uri, fps=5),
    ],
)

Timeline / Scene Detection (MM:SS)

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        """Create a timeline with timestamps:
        - Key events
        - Scene changes
        - Important moments
        Format: MM:SS - Description
        """,
        myfile,
    ],
)

Transcription (Optional Visual Descriptions)

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        """Transcribe with visual context:
        - Audio transcription
        - Visual descriptions of important moments
        - Timestamps for salient events
        """,
        myfile,
    ],
)

Structured JSON Output (Schema-Guided)

from pydantic import BaseModel
from typing import List
from google.genai import types as genai_types

class VideoEvent(BaseModel):
    timestamp: str  # MM:SS
    description: str
    category: str

class VideoAnalysis(BaseModel):
    summary: str
    events: List[VideoEvent]
    duration: str

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=["Analyze this video", myfile],
    config=genai_types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=VideoAnalysis,
    ),
)

Best Practices

Use the File API for most videos (>20MB) and wait for processing to complete before analysis.
Reduce token usage by clipping to the relevant segment and/or lowering FPS for static content.
Improve accuracy by being explicit about the desired output (timestamps format, number of events, whether you want scene changes vs actions vs chapters).
Use gemini-2.5-pro when you need highest-quality reasoning over complex, long, or visually dense videos.

Error Handling

import time

def upload_and_wait(client, file_path: str, max_wait_s: int = 300):
    myfile = client.files.upload(file=file_path)
    waited = 0

    while myfile.state.name == "PROCESSING" and waited < max_wait_s:
        time.sleep(5)
        waited += 5
        myfile = client.files.get(name=myfile.name)

    if myfile.state.name == "FAILED":
        raise ValueError(f"Video processing failed: {myfile.state.name}")
    if myfile.state.name == "PROCESSING":
        raise TimeoutError(f"Processing timeout after {max_wait_s}s")

    return myfile

Common issues:

Upload processing stuck: wait and poll; fail after a max timeout.
YouTube errors: verify the video is public and not age-restricted.
Rate limits: retry with exponential backoff.
Incorrect timestamps: re-prompt with strict “MM:SS” formatting and request fewer events.

Limitations

Long-video support is limited by model context and token budget (default vs low-res modes).
YouTube analysis requires public videos; live streaming analysis is not supported.
Very long videos may require chunking (clip by time range and process in segments).
Multi-video comparison is limited (up to 10 videos per request on Gemini 2.5+).

Version History

1.0.0 (2026-01-15): Initial release focused on Gemini video understanding (summaries, Q&A, timestamps, clipping, FPS control, YouTube, and structured outputs).