Video Metadata Intelligence System

Video content analysis pipeline — extract frames, transcribe audio, run LLM visual+audio analysis, synthesize structured Bilibili publish metadata (title, intro, tags, category, cover suggestion). Use when user says '生成投稿元数据', '视频元数据分析', or explicitly requests this skill. ⚠️ API modes send video frames and audio to external LLM providers — see Privacy section.

CyberKurry@cyberkurry

Install

openclaw skills install @cyberkurry/video-metadata-analyzer

Video Analyzer

Three-stage video analysis pipeline: parallel visual + audio observation, then metadata synthesis for Bilibili publishing.

When to Use

User says "生成投稿元数据", "视频元数据分析", "analyze video metadata"
User explicitly requests this skill for structured content analysis from a video file
User wants to prepare a video for Bilibili publishing (title, intro, tags, category, cover)
Upstream of bilibili-publish-playwright: this skill generates the metadata that feeds into Bilibili publishing

⚠️ Privacy Notice: API modes (audio-llm, cloud, vision-llm, api) transmit video frames, audio, transcripts, and derived metadata to user-configured external LLM endpoints. Videos often contain faces, voices, on-screen text, or other sensitive data. Before using API modes, verify the endpoint trust, provider data retention policy, and your approval to process the media. Use agent-direct or local modes for confidential or regulated content.

Architecture

text

Input Video ──────────────────────────────────────────────────────
    │                                                            │
    ├── Stage 1a: visual.py ──→ observations_visual.json          │
    │      (ffmpeg extract frames → encode → vision LLM observe)  │ PARALLEL
    │                                                            │
    ├── Stage 1b: transcribe.py ──→ observations_audio.json       │
    │      (ffmpeg extract audio → transcribe + structure)         │
    │                                                            │
    └── Stage 2: analyze.py ──→ metadata.json ←───────────────────┘
           (merge V+A observations → publishable metadata via LLM)

run.sh orchestrates: launches visual.py and transcribe.py as background processes (&), wait for both, then optionally runs analyze.py.

Output Directory

text

$OUTPUT/
├── observations_visual.json    # JSON array: one object per frame
├── observations_audio.json     # JSON object: transcript + structured info
├── metadata.json               # (optional) Synthesized Bilibili metadata
└── frames/                     # (only with --keep-frames, auto-cleaned otherwise)

Procedure

1. Full pipeline (recommended — all external LLM API)

bash

bash scripts/run.sh \
  --video VIDEO_PATH --output /tmp/va-out \
  --transcribe audio-llm \
  --audio-llm-key KEY --audio-llm-base URL --audio-llm-model MODEL \
  --vision-llm-key KEY --vision-llm-base URL --vision-llm-model MODEL \
  --max-frames 15 \
  --synthesize-method api \
  --analyze-llm-key KEY --analyze-llm-base URL --analyze-llm-model MODEL

2. Agent-direct mode (no external API — agent reads frames/audio directly)

bash

bash scripts/run.sh \
  --video VIDEO_PATH --output /tmp/va-out --keep-frames

Agent then reads observations_visual.json (placeholder frames), observations_audio.json (audio file path), and optionally the frame images + audio file directly to generate metadata.

3. Mixed / observe-only

Omit --synthesize-method to observe only, then run analyze.py separately later. Each stage (visual, audio, synthesize) can use different keys and models.

Key Parameters

Parameter	Default	Purpose
`--video PATH`	—	Required. Input video file
`--output DIR`	—	Required. Output directory
`--transcribe MODE`	`agent-direct`	`local` / `cloud` / `agent-direct` / `audio-llm`
`--max-frames N`	`15`	Max frames per 4-min segment
`--keep-frames`	`false`	Keep extracted frame images
`--synthesize-method METHOD`	—	`api` / `agent` / `manual`. Omit = observe only

All *-key, *-base, *-model parameters follow the pattern: --vision-llm-key, --audio-llm-key, --analyze-llm-key etc. See references/REFERENCE.md for the complete parameter table.

Scripts

File	Role
`scripts/common.py`	Shared utilities: HTTP retry with backoff, media duration via ffprobe, JSON parse from LLM output
`scripts/visual.py`	Frame extraction (auto-segment, auto-compress >200KB) + vision LLM observation. Long videos: segments processed in parallel (max 4 concurrent)
`scripts/transcribe.py`	Audio extraction + transcription (4 modes). Auto-chunks large audio with 2s overlap for dedup
`scripts/analyze.py`	Observations → publish metadata (3 methods: api/agent/manual). Heuristic fallback on API failure
`scripts/run.sh`	Orchestrator: parallel visual+audio, then optional synthesis

Output Summary

observations_visual.json — JSON array, one object per frame with frame, objects, desc, texts, actions, style, cover_candidate, segment, segment_start.

observations_audio.json — transcript, speakers, key_points, tone. Agent-direct mode includes audio_file path.

metadata.json — title (≤80 chars), intro (≤2000 chars), tags (≤10), category (B站 type2 平铺分区，30 个一级分区), cover_suggestion (primary + reason + secondary), declaration (6 选 1), copyright_claim, watermark, author_marks.

Pitfalls

API keys in chat: Some platforms truncate keys with …. Always pass keys via command-line arguments, not through messages.
Model capability: Vision requires image_url support. Audio-LLM requires input_audio support. Check your provider.
Game recordings: Frames are large (~300-380KB vs ~30-80KB for phone). Auto-compression handles this, but plan rate limits for long videos.
Long videos = parallel API calls: 30-min video = 8 segments × 15 frames = 8 vision API calls (capped at 4 concurrent). Consider rate limits.
Missing credentials auto-degrade: Omitting LLM keys → preprocess-only or agent-direct mode. Scripts never crash on missing keys.
--interval deprecated: Ignored. Interval auto-calculated per segment based on --max-frames.
Timeout protection: run.sh auto-wraps with timeout $VA_TIMEOUT (default 3600s = 1h). Override via VA_TIMEOUT env var.

Error Handling

Three-layer defense:

HTTP retry — 3 retries with exponential backoff on 5xx / connection errors
JSON parse retry — 3 attempts with error feedback sent back to LLM
Graceful degradation — placeholder observations on visual failure, raw text on audio failure, heuristic fallback on synthesis failure

Verification

Check exit code: run.sh returns 0 on success
Verify observations_visual.json has entries for expected frame count
Verify observations_audio.json has transcript field (non-empty for speech videos)
If --synthesize-method used, verify metadata.json has all required fields (title, intro, tags, category, cover_suggestion)

For complete parameter reference, output schemas, standalone usage per script, and detailed error handling, see references/REFERENCE.md.