gemini-video-understanding
v0.1.0
Analyze videos with the Google Gemini API (summaries, Q&A, transcription with timestamps and visual context, scene/timeline detection, video clipping, FPS control, multi-video comparison, and YouTube URL analysis).
Security Scan
Capability signals
These labels describe what authority the skill may exercise. They are separate from suspicious or malicious moderation verdicts.
OpenClaw verdict: Suspicious (medium confidence)
Purpose & Capability
The SKILL.md clearly implements Google Gemini video workflows (uploading files, YouTube URIs, model calls), which fits the skill name and description. However, the registry slug ('pedestrian-traffic-counting-...') and the published 'source: unknown / homepage: none' indicate inconsistent provenance and a possible mismatch in intended purpose or origin. The code examples also require a GEMINI_API_KEY, yet the skill metadata declares no required environment variables or primary credential.
Instruction Scope
The runtime instructions are concrete and scoped to video analysis: uploading local files, reading video bytes, using the File API, and pointing to public YouTube URLs. They do not instruct reading unrelated system files or secrets beyond an API key. However, the examples explicitly call os.getenv("GEMINI_API_KEY") and read local video files; the former is not declared in the skill metadata, which is an inconsistency to resolve before use.
Install Mechanism
No install specification and no code files — this is instruction-only, which is lower risk from an installation perspective. Nothing is downloaded or written to disk by an install step in the skill bundle itself.
Credentials
The SKILL.md requires an API key (os.getenv("GEMINI_API_KEY")) for Google Gemini calls, but the skill metadata lists no required environment variables or primary credential. That omission is a material inconsistency: a networked API key is necessary for the stated functionality and should be declared and justified. Also consider privacy implications: videos (local or from YouTube) may contain PII and will be transmitted to Google's API.
Persistence & Privilege
The skill is not marked always:true and is user-invocable with normal autonomous invocation allowed. It does not request persistent presence or modify other skills. This is the expected configuration for an instruction-only integration.
What to consider before installing
This skill appears to implement Google Gemini video analysis, and its instructions are internally consistent with that purpose, but metadata gaps and provenance issues make it suspicious. Before installing:
- Confirm the author/source, and why the registry slug references 'pedestrian-traffic-counting' if your use case is general video analysis.
- Require the skill to declare GEMINI_API_KEY (or other credentials) in its metadata and document the required IAM/billing scope.
- Understand that local videos and public YouTube URLs will be sent to Google's API; do not send private or sensitive videos unless you accept that.
- Verify billing and token lifetime for the Gemini API key, and that you trust the endpoint.
- Prefer a skill with a verifiable homepage/source or maintainer contact.
If these items are resolved and provenance is clear, the skill is likely fine to use; if not, treat it as untrusted.
Gemini Video Understanding Skill
Purpose
This skill enables video understanding workflows using the Google Gemini API, including video summarization, question answering, transcription with optional visual descriptions, timestamp-based queries (MM:SS), scene/timeline detection, video clipping, custom FPS sampling, multi-video comparison, and YouTube URL analysis.
When to Use
- Summarizing a video into key points or chapters
- Answering questions about what happens at specific timestamps (MM:SS)
- Producing a transcript (optionally with visual context) and speaker labels
- Detecting scene changes or building a timeline of events
- Analyzing long videos by clipping to relevant segments or reducing FPS
- Comparing multiple videos (up to 10 videos on Gemini 2.5+)
- Analyzing public YouTube videos directly via URL
Required Libraries
The examples use the google-genai Python SDK (install with pip install google-genai) plus the standard library:
from google import genai
from google.genai import types
import os
import time
Input Requirements
- File formats: MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
- Size constraints:
- Use inline bytes for small files (rule of thumb: <20MB).
- Use the File API upload flow for larger videos (most real videos).
- YouTube:
- Video must be public (not private/unlisted) and not age-restricted.
- Provide a valid YouTube URL.
- Duration / context window (model-dependent):
- 2M-token models: ~2 hours (default resolution) or ~6 hours (low-res).
- 1M-token models: ~1 hour (default) or ~3 hours (low-res).
- Timestamps: Use MM:SS (e.g., 01:15) when requesting time-based answers.
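The duration limits above follow from per-second token costs. As a rough back-of-envelope check (the ~300 and ~100 tokens/sec rates below are approximations chosen to match the limits listed above, not official constants), you can estimate whether a video fits a model's context window:

```python
def approx_video_tokens(duration_s: float, low_res: bool = False) -> int:
    """Rough token estimate: ~300 tokens/sec at default resolution,
    ~100 tokens/sec in low-resolution mode (approximate figures)."""
    rate = 100 if low_res else 300
    return int(duration_s * rate)

# A 2-hour video at default resolution lands near a 2M-token budget:
print(approx_video_tokens(2 * 3600))                 # 2160000
# The same video in low-res mode uses roughly a third of that:
print(approx_video_tokens(2 * 3600, low_res=True))   # 720000
```

Compare the estimate against the model's context window before uploading; if it is close to the limit, clip the video or lower the FPS.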
Output Schema
All extracted/derived content should be returned as valid JSON conforming to this schema:
{
  "success": true,
  "source": {
    "type": "file|youtube",
    "id": "video.mp4|VIDEO_ID_OR_URL",
    "model": "gemini-2.5-flash"
  },
  "summary": "Concise summary of the video...",
  "transcript": {
    "available": true,
    "text": "Full transcript text (may include speaker labels)...",
    "includes_visual_descriptions": true
  },
  "events": [
    {
      "timestamp": "MM:SS",
      "description": "What happens at this time",
      "category": "scene_change|key_point|action|other"
    }
  ],
  "warnings": [
    "Optional warnings about limitations, missing timestamps, or low confidence areas"
  ]
}
Field Descriptions
- success: Whether the analysis completed successfully
- source.type: file for uploaded/local content, youtube for YouTube analysis
- source.id: Filename for local uploads, or URL/ID for YouTube
- source.model: Gemini model used for the request
- summary: High-level video summary
- transcript.*: Transcript payload (may be omitted or available=false if not requested)
- events: Timeline items with MM:SS timestamps (chapters, scene changes, key actions)
- warnings: Any issues that could affect correctness (e.g., “timestamp not found”, “long video clipped”)
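A minimal structural check for the schema above can catch malformed model output before it reaches downstream code. The validator below is a hypothetical helper (not part of the skill), using only the standard library:

```python
import re

TIMESTAMP_RE = re.compile(r"^\d{2}:\d{2}$")  # MM:SS

def validate_analysis(payload: dict) -> list:
    """Return a list of problems found in a parsed analysis dict (empty = OK)."""
    problems = []
    # Required top-level keys from the output schema
    for key in ("success", "source", "summary", "events", "warnings"):
        if key not in payload:
            problems.append(f"missing key: {key}")
    # Every event timestamp must be MM:SS
    for i, event in enumerate(payload.get("events", [])):
        ts = event.get("timestamp", "")
        if not TIMESTAMP_RE.match(ts):
            problems.append(f"events[{i}]: bad timestamp {ts!r} (expected MM:SS)")
    return problems
```

Run it on the parsed JSON and re-prompt (or surface a warning) when the returned list is non-empty.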
Code Examples
Basic Video Analysis (Local Video + File API)
from google import genai
import os
import time

client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

# Upload video (File API for >20MB)
myfile = client.files.upload(file="video.mp4")

# Wait for processing
while myfile.state.name == "PROCESSING":
    time.sleep(1)
    myfile = client.files.get(name=myfile.name)

if myfile.state.name == "FAILED":
    raise ValueError("Video processing failed")

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=["Summarize this video in 3 key points", myfile],
)
print(response.text)
YouTube Video Analysis (Public Videos Only)
from google import genai
from google.genai import types
import os

client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "Summarize the main topics discussed",
        # Part.from_uri takes file_uri (not uri) in the google-genai SDK
        types.Part.from_uri(
            file_uri="https://www.youtube.com/watch?v=VIDEO_ID",
            mime_type="video/mp4",
        ),
    ],
)
print(response.text)
Inline Video (<20MB)
from google import genai
from google.genai import types
import os

client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

with open("short-clip.mp4", "rb") as f:
    video_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "What happens in this video?",
        types.Part.from_bytes(data=video_bytes, mime_type="video/mp4"),
    ],
)
print(response.text)
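The <20MB rule of thumb for choosing between the inline and File API paths can be applied mechanically. A small sketch (the helper name and the exact threshold are illustrative, not part of the SDK):

```python
INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # ~20MB rule of thumb

def pick_upload_mode(size_bytes: int) -> str:
    """Return "inline" for small payloads and "file_api" for everything else."""
    return "inline" if size_bytes < INLINE_LIMIT_BYTES else "file_api"

# e.g. pick_upload_mode(os.path.getsize("video.mp4"))
```

Note that the effective inline limit applies to the whole request (prompt plus media), so staying comfortably under 20MB is safer than sitting at the boundary.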
Video Clipping (Analyze a Segment Only)
from google.genai import types

# Restrict analysis to a time window via VideoMetadata
# (there is no Part.from_video_metadata helper in the google-genai SDK)
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "Summarize this segment",
        types.Part(
            file_data=types.FileData(file_uri=myfile.uri),
            video_metadata=types.VideoMetadata(
                start_offset="40s",
                end_offset="80s",
            ),
        ),
    ],
)
Custom Frame Rate (Token/Cost Control)
from google.genai import types

# Lower FPS for static content (saves tokens)
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "Analyze this presentation",
        types.Part(
            file_data=types.FileData(file_uri=myfile.uri),
            video_metadata=types.VideoMetadata(fps=0.5),
        ),
    ],
)

# Higher FPS for fast-moving content
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "Analyze rapid movements in this sports video",
        types.Part(
            file_data=types.FileData(file_uri=myfile.uri),
            video_metadata=types.VideoMetadata(fps=5),
        ),
    ],
)
Timeline / Scene Detection (MM:SS)
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        """Create a timeline with timestamps:
- Key events
- Scene changes
- Important moments
Format: MM:SS - Description
""",
        myfile,
    ],
)
Transcription (Optional Visual Descriptions)
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        """Transcribe with visual context:
- Audio transcription
- Visual descriptions of important moments
- Timestamps for salient events
""",
        myfile,
    ],
)
Structured JSON Output (Schema-Guided)
from pydantic import BaseModel
from typing import List
from google.genai import types as genai_types

class VideoEvent(BaseModel):
    timestamp: str  # MM:SS
    description: str
    category: str

class VideoAnalysis(BaseModel):
    summary: str
    events: List[VideoEvent]
    duration: str

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=["Analyze this video", myfile],
    config=genai_types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=VideoAnalysis,
    ),
)
# response.text holds the JSON string; response.parsed holds the typed object
Best Practices
- Use the File API for most videos (>20MB) and wait for processing to complete before analysis.
- Reduce token usage by clipping to the relevant segment and/or lowering FPS for static content.
- Improve accuracy by being explicit about the desired output (timestamps format, number of events, whether you want scene changes vs actions vs chapters).
- Use gemini-2.5-pro when you need the highest-quality reasoning over complex, long, or visually dense videos.
Error Handling
import time

def upload_and_wait(client, file_path: str, max_wait_s: int = 300):
    myfile = client.files.upload(file=file_path)
    waited = 0
    while myfile.state.name == "PROCESSING" and waited < max_wait_s:
        time.sleep(5)
        waited += 5
        myfile = client.files.get(name=myfile.name)
    if myfile.state.name == "FAILED":
        raise ValueError(f"Video processing failed: {myfile.state.name}")
    if myfile.state.name == "PROCESSING":
        raise TimeoutError(f"Processing timeout after {max_wait_s}s")
    return myfile
Common issues:
- Upload processing stuck: wait and poll; fail after a max timeout.
- YouTube errors: verify the video is public and not age-restricted.
- Rate limits: retry with exponential backoff.
- Incorrect timestamps: re-prompt with strict “MM:SS” formatting and request fewer events.
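For the rate-limit case, a generic exponential-backoff wrapper is usually enough. This sketch retries on any exception for simplicity; a production version would catch only the SDK's rate-limit errors (the helper name is illustrative, not part of the skill or SDK):

```python
import time

def with_backoff(fn, retries: int = 5, base_delay_s: float = 1.0):
    """Call fn(), retrying with exponentially growing delays (1s, 2s, 4s, ...)."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; surface the last error
            time.sleep(base_delay_s * (2 ** attempt))

# e.g. response = with_backoff(lambda: client.models.generate_content(
#     model="gemini-2.5-flash", contents=["Summarize this video", myfile]))
```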
Limitations
- Long-video support is limited by model context and token budget (default vs low-res modes).
- YouTube analysis requires public videos; live streaming analysis is not supported.
- Very long videos may require chunking (clip by time range and process in segments).
- Multi-video comparison is limited (up to 10 videos per request on Gemini 2.5+).
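The chunking approach in the last point can reuse the start_offset/end_offset clipping shown earlier: split the duration into fixed windows and analyze each segment separately. A sketch (the helper name and the 10-minute default window are illustrative choices):

```python
def chunk_offsets(duration_s: int, chunk_s: int = 600):
    """Split a duration into (start_offset, end_offset) string pairs for clipping."""
    return [
        (f"{start}s", f"{min(start + chunk_s, duration_s)}s")
        for start in range(0, duration_s, chunk_s)
    ]

print(chunk_offsets(1500))  # [('0s', '600s'), ('600s', '1200s'), ('1200s', '1500s')]
```

Each pair can be passed as the start_offset/end_offset of a clipped request; merging the per-segment results back into one timeline is left to the caller.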
Version History
- 1.0.0 (2026-01-15): Initial release focused on Gemini video understanding (summaries, Q&A, timestamps, clipping, FPS control, YouTube, and structured outputs).
