Install
openclaw skills install video-metadata-analyzer
Video content analysis pipeline: extract frames, transcribe audio, run LLM visual+audio analysis, and synthesize structured Bilibili publish metadata (title, intro, tags, category, cover suggestion). Use when the user says 'analyze video', '视频分析' ("video analysis"), '生成投稿元数据' ("generate publish metadata"), or wants structured content analysis from a video file.
Three-stage video analysis pipeline: parallel visual + audio observation, then metadata synthesis for Bilibili publishing.
bilibili-publish-playwright: this skill generates the metadata that feeds into Bilibili publishing.
Input Video ──────────────────────────────────────────────────────
│ │
├── Stage 1a: visual.py ──→ observations_visual.json │
│ (ffmpeg extract frames → encode → vision LLM observe) │ PARALLEL
│ │
├── Stage 1b: transcribe.py ──→ observations_audio.json │
│ (ffmpeg extract audio → transcribe + structure) │
│ │
└── Stage 2: analyze.py ──→ metadata.json ←───────────────────┘
(merge V+A observations → publishable metadata via LLM)
run.sh orchestrates the pipeline: it launches visual.py and transcribe.py as background processes (&), waits for both, then optionally runs analyze.py.
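
The launch-both-then-wait pattern described above can be sketched in Python for readers who want to drive the stages without run.sh. This is a minimal illustration, not the contents of run.sh: the function names are hypothetical, the commands are passed in by the caller, and skipping synthesis when an observation stage fails is an assumption rather than documented behavior.

```python
import subprocess
import sys


def run_stage(argv):
    """Start one stage as a child process without waiting for it."""
    return subprocess.Popen(argv)


def orchestrate(visual_cmd, audio_cmd, analyze_cmd=None):
    """Run both observation stages concurrently, then optional synthesis.

    Returns the first non-zero exit code, or 0 if every stage succeeded.
    """
    # Both children are started before either is waited on, so they overlap.
    visual = run_stage(visual_cmd)
    audio = run_stage(audio_cmd)

    for code in (visual.wait(), audio.wait()):
        if code != 0:
            # Assumption: do not synthesize from incomplete observations.
            return code

    if analyze_cmd is not None:
        return subprocess.run(analyze_cmd).returncode
    return 0


if __name__ == "__main__":
    # Stand-in commands so the sketch runs anywhere; replace them with the
    # real visual.py / transcribe.py / analyze.py invocations.
    noop = [sys.executable, "-c", "pass"]
    print(orchestrate(noop, noop, noop))
```

Starting both processes before the first `wait()` is what makes the two observation stages run in parallel; waiting on them one after the other would serialize the work.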
$OUTPUT/
├── observations_visual.json # JSON array: one object per frame
├── observations_audio.json # JSON object: transcript + structured info
├── metadata.json # (optional) Synthesized Bilibili metadata
└── frames/ # (only with --keep-frames, auto-cleaned otherwise)
bash scripts/run.sh \
--video VIDEO_PATH --output /tmp/va-out \
--transcribe audio-llm \
--audio-llm-key KEY --audio-llm-base URL --audio-llm-model MODEL \
--vision-llm-key KEY --vision-llm-base URL --vision-llm-model MODEL \
--max-frames 15 \
--synthesize-method api \
--analyze-llm-key KEY --analyze-llm-base URL --analyze-llm-model MODEL
bash scripts/run.sh \
--video VIDEO_PATH --output /tmp/va-out --keep-frames
The agent then reads observations_visual.json (placeholder frames), observations_audio.json (audio file path), and optionally the frame images and audio file directly, and generates the metadata from them.
Omit --synthesize-method to observe only, then run analyze.py separately later. Each stage (visual, audio, synthesize) can use different keys and models.
| Parameter | Default | Purpose |
|---|---|---|
| --video PATH | none | Required. Input video file |
| --output DIR | none | Required. Output directory |
| --transcribe MODE | agent-direct | local / cloud / agent-direct / audio-llm |
| --max-frames N | 15 | Max frames per 4-min segment |
| --keep-frames | false | Keep extracted frame images |
| --synthesize-method METHOD | none | api / agent / manual. Omit = observe only |
All *-key, *-base, *-model parameters follow the pattern: --vision-llm-key, --audio-llm-key, --analyze-llm-key etc. See references/REFERENCE.md for the complete parameter table.
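
The naming pattern above (stage prefix, then `-llm-`, then `key`, `base`, or `model`) can be illustrated with a small argument builder. This is only a sketch: `llm_flags` and the shape of its `settings` dictionary are hypothetical helpers for illustration, while the flag names themselves are the ones shown in the run.sh example.

```python
STAGES = ("vision", "audio", "analyze")
FIELDS = ("key", "base", "model")


def llm_flags(settings):
    """Build run.sh arguments from per-stage settings.

    `settings` maps a stage name to a dict with optional "key", "base",
    and "model" entries; missing entries are simply skipped.
    """
    argv = []
    for stage in STAGES:
        for field in FIELDS:
            value = settings.get(stage, {}).get(field)
            if value is not None:
                argv += [f"--{stage}-llm-{field}", value]
    return argv


if __name__ == "__main__":
    # Each stage can point at a different provider and model.
    print(llm_flags({
        "vision": {"key": "KEY", "base": "URL", "model": "MODEL"},
        "analyze": {"key": "KEY", "model": "MODEL"},
    }))
```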
| File | Role |
|---|---|
| scripts/common.py | Shared utilities: HTTP retry with backoff, media duration via ffprobe, JSON parse from LLM output |
| scripts/visual.py | Frame extraction (auto-segment, auto-compress >200KB) + vision LLM observation. Long videos: segments processed in parallel (max 4 concurrent) |
| scripts/transcribe.py | Audio extraction + transcription (4 modes). Auto-chunks large audio with 2s overlap for dedup |
| scripts/analyze.py | Observations → publish metadata (3 methods: api/agent/manual). Heuristic fallback on API failure |
| scripts/run.sh | Orchestrator: parallel visual+audio, then optional synthesis |
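
To make the auto-segmentation in visual.py more concrete, the following sketch shows one plausible way to split a video into 4-minute segments and derive a sampling interval from `--max-frames`. The exact formula used by visual.py is not given here, so the even-spacing rule (`segment length / max frames`) and the function name `plan_segments` are assumptions for illustration only.

```python
SEGMENT_SECONDS = 240  # the "4-min segment" from the parameter table


def plan_segments(duration, max_frames=15, segment_seconds=SEGMENT_SECONDS):
    """Return (segment_start, segment_length, interval) tuples for a video.

    Assumption: frames are spaced evenly within each segment, so the
    interval is the segment length divided by max_frames.
    """
    plan = []
    start = 0.0
    while start < duration:
        # The last segment may be shorter than segment_seconds.
        length = min(segment_seconds, duration - start)
        plan.append((start, length, length / max_frames))
        start += segment_seconds
    return plan


if __name__ == "__main__":
    # A 10-minute video: two full segments and one 2-minute remainder.
    for segment_start, length, interval in plan_segments(600):
        print(segment_start, length, interval)
```

Under this assumption a shorter final segment gets a proportionally shorter interval, so every segment contributes up to the same number of frames.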
observations_visual.json: JSON array, one object per frame with frame, objects, desc, texts, actions, style, cover_candidate, segment, segment_start.
observations_audio.json: transcript, speakers, key_points, tone. In agent-direct mode it also includes the audio_file path.
metadata.json: title (≤80 chars), intro (≤2000 chars), tags (≤10), category (Bilibili type2 flat category list, 30 top-level categories), cover_suggestion (primary + reason + secondary), declaration (one of 6 options), copyright_claim, watermark, author_marks.
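
The size limits listed for metadata.json can be checked before publishing with a few lines of Python. This is a sketch under stated assumptions: `check_metadata_limits` is a hypothetical helper, "chars" is taken to mean Python string length, and `tags` is assumed to be a JSON array.

```python
def check_metadata_limits(meta):
    """Return human-readable problems; an empty list means within limits."""
    problems = []
    if len(meta.get("title", "")) > 80:
        problems.append("title longer than 80 chars")
    if len(meta.get("intro", "")) > 2000:
        problems.append("intro longer than 2000 chars")
    if len(meta.get("tags", [])) > 10:
        problems.append("more than 10 tags")
    return problems


if __name__ == "__main__":
    sample = {"title": "Demo", "intro": "Short intro", "tags": ["a", "b"]}
    print(check_metadata_limits(sample))
```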
…. Always pass keys via command-line arguments, not through messages.
- Vision LLM requires image_url support. Audio-LLM requires input_audio support. Check your provider.
- --interval deprecated: Ignored. The interval is auto-calculated per segment based on --max-frames.
- run.sh auto-wraps with timeout $VA_TIMEOUT (default 3600s = 1h). Override via the VA_TIMEOUT env var.
Three-layer defense:
- run.sh returns 0 on success
- observations_visual.json has entries for the expected frame count
- observations_audio.json has a transcript field (non-empty for speech videos)
- If --synthesize-method was used, verify that metadata.json has all required fields (title, intro, tags, category, cover_suggestion)
For the complete parameter reference, output schemas, standalone usage per script, and detailed error handling, see references/REFERENCE.md.
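
The file-level checks above can be automated after a run. The sketch below is a hypothetical helper, not part of the skill: it checks only that observations_visual.json is a non-empty array (it does not compare against an expected frame count), that observations_audio.json has a transcript key, and, optionally, that metadata.json contains the required fields. Checking the run.sh exit code is left to the caller.

```python
import json
from pathlib import Path

REQUIRED_METADATA_FIELDS = ("title", "intro", "tags", "category", "cover_suggestion")


def _load(path):
    """Parse a JSON file, returning None if it is missing or invalid."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None


def verify_outputs(output_dir, expect_metadata=False):
    """Return a list of failed checks; an empty list means all passed."""
    out = Path(output_dir)
    failures = []

    visual = _load(out / "observations_visual.json")
    if not isinstance(visual, list) or not visual:
        failures.append("observations_visual.json has no frame entries")

    audio = _load(out / "observations_audio.json")
    if not isinstance(audio, dict) or "transcript" not in audio:
        failures.append("observations_audio.json has no transcript field")

    if expect_metadata:
        meta = _load(out / "metadata.json")
        missing = [
            field for field in REQUIRED_METADATA_FIELDS
            if not isinstance(meta, dict) or field not in meta
        ]
        if missing:
            failures.append("metadata.json missing: " + ", ".join(missing))

    return failures


if __name__ == "__main__":
    # /tmp/va-out is the output directory used in the run.sh examples.
    for failure in verify_outputs("/tmp/va-out", expect_metadata=True):
        print("FAILED:", failure)
```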