Install
openclaw skills install senseaudio-video-genUse when the user asks to create, inspect, render, or repair an HTML-authored video from a brief, website, Markdown/text file, or GitHub repository; needs captions, voiceover, background music, generated images/video clips, or a HyperFrames-like local video pipeline using SenseAudio media APIs and AudioClaw LLM planning.
openclaw skills install senseaudio-video-genAuthor videos as HTML compositions, preview them in a browser, render them locally through Chrome screenshots plus FFmpeg, plan scripts/storyboards with AudioClaw by default, and generate supporting media through SenseAudio APIs. Treat HTML as the editable source of truth, SenseAudio as the media engine, and AudioClaw as the default LLM route.
On a new machine, configure the media API and LLM API separately:
export SENSEAUDIO_API_KEY="..."
SENSEAUDIO_API_KEY powers only SenseAudio media APIs: TTS, ASR, image, video, and music.
AudioClaw LLM planning uses a separate OpenAI-compatible route. If running inside AudioClaw, no extra LLM env is needed when the local AudioClaw config file exists. Otherwise set the LLM env explicitly:
export AUDIOCLAW_CONFIG_PATH="config/audioclaw.json"
export AUDIOCLAW_LLM_MODEL="doubao-seed-2-0-pro-260215"
export AUDIOCLAW_LLM_BASE_URL="https://platform.senseaudio.cn/v1"
export AUDIOCLAW_LLM_API_KEY="..."
LLM config precedence is CLI flags, then AUDIOCLAW_LLM_*, then AUDIOCLAW_CONFIG_PATH or the local AudioClaw config file. The CLI deliberately does not reuse SENSEAUDIO_API_KEY as an AudioClaw LLM key. Use --llm none for deterministic heuristic planning, or --offline to skip live media calls.
Start from a brief when the user wants a complete project shell:
python3 scripts/senseaudio_video_gen.py compose \
--project my-video \
--brief "Make a premium launch film for a new AI research assistant." \
--duration 12 \
--style-preset executive-film
compose is the general video path for product launch films, feature explainers, report summaries, technical walkthroughs, title cards, social cuts, and branded motion pieces. It defaults to executive-film, a restrained cinematic style with large typography, letterbox framing, low ornament, and non-web chapter IDs. Use --offline when drafting without live API calls, and add --render when the project should be rendered immediately.
compose now defaults to --llm audioclaw. If the default LLM route is unavailable, --llm-fallback keeps the project moving with heuristic planning and records a warning in senseframe.json. Pass --no-llm-fallback when an LLM failure should stop the run.
For the closest HyperFrames-style website workflow, prefer the one-pass site-video pipeline:
python3 scripts/senseaudio_video_gen.py site-video \
--url https://www.anthropic.com/ \
--project anthropic-site-video \
--brief "用中文介绍 Anthropic 官网的 Claude、安全 AI、研究与企业能力。" \
--duration 14 \
--fps 30 \
--llm audioclaw
site-video defaults to editorial-pro, layered beats, cinematic motion, GSAP-compatible timing, real website screenshots, AudioClaw LLM planning, SenseAudio narration/ASR when live media is enabled, audio-reactive data, local rendering, inspect frames, local frame-quality audit, and motion audits. If LLM planning fails, --llm-fallback retries with heuristic planning and records the warning. Use --offline --no-render for a safe draft that writes the same editable project structure and pipeline-report.json.
Add --music --music-poll when the site video should request a SenseAudio music bed, download it, mix it under narration as assets/final-audio.m4a, and render with that mixed track. If SenseAudio accepts the task but does not return audio_url in time, --music-fallback creates a local ambient bed so the video still ships with background music while preserving the task manifest. Use --music-dry-run or --offline to inspect the /music/song/create payload without spending credits. Add --auto-repair when the project should run a second pass after motion/vision audits, tighten real screenshot crops, damp busy overlays, and rerender the repaired composition.
Website capture follows the useful parts of the HyperFrames loop: warm the live page, dismiss common cookie/modals, scroll to trigger lazy assets, record assets/site-capture-quality.json, capture renders/inspect frames, and write renders/inspect/contact-sheet.html for review. Use --vision-audit when a live VL model should judge the rendered frames; the local frame-quality-audit still runs by default when rendering.
For gated, cookie-sensitive, or region-personalized sites, keep browser state explicit:
python3 scripts/senseaudio_video_gen.py site-video \
--url https://example.com/ \
--project example-site-video \
--browser-profile profiles/example-capture \
--cookie-file cookies/example.json
Cookies often make screenshots closer to what a real user sees, but the clean temporary browser remains the default to avoid leaking private account pages into generated videos.
For URL-to-video work, site-ingest classifies real page material into semantic roles such as hero, product, research, safety, developer, enterprise, customer, pricing, and CTA. These roles drive story_evidence, shot choice, composition mode, camera path, and data-material-role markers in the rendered HTML.
Use source-ingest for first-stage non-web inputs. It converts local Markdown/text files or a GitHub repository README into the same site-profile.json shape used by website projects, so compose --site-file <profile.json> can reuse storyboard, narration, semantic role, and production-spec logic without a separate document pipeline:
python3 scripts/senseaudio_video_gen.py source-ingest \
--file product-notes.md \
--output product-notes.site.json \
--json
python3 scripts/senseaudio_video_gen.py source-ingest \
--github-url heygen-com/hyperframes \
--output hyperframes-readme.site.json
Use site-vision-plan when screenshot crops need to be planned before rendering. The default heuristic provider derives crop center, zoom, pan, and focus from DOM highlights and semantic roles. --provider openrouter builds an OpenRouter-compatible vision request so a VL model can inspect screenshots first; keep --fallback enabled so rendering degrades to deterministic crops if the model route is unavailable.
Use the music and repair commands directly when tuning an existing project:
python3 scripts/senseaudio_video_gen.py music-create \
--prompt "Instrumental premium website explainer bed, subtle pulse, no vocals" \
--duration 16 \
--poll \
--download my-video/assets/background-music.mp3 \
--project my-video
python3 scripts/senseaudio_video_gen.py mix-audio \
--project my-video \
--voice my-video/assets/narration.mp3 \
--music my-video/assets/background-music.mp3 \
--output my-video/assets/final-audio.m4a \
--duration 16
python3 scripts/senseaudio_video_gen.py repair --project my-video --json
Use the default AudioClaw route for creative plans when the brief needs LLM-written copy and storyboard:
python3 scripts/senseaudio_video_gen.py llm-plan \
--brief "Make a concise webpage intro for SenseAudio's sound library." \
--duration 9 \
--output my-plan.json
python3 scripts/senseaudio_video_gen.py compose \
--project my-video \
--brief "Make a concise webpage intro for SenseAudio's sound library." \
--generate-images \
--generate-broll \
--asset-dry-run \
--offline
llm-plan defaults to --provider audioclaw. The skill strips LiteLLM-style provider prefixes such as volcengine/ for platform.senseaudio.cn and retries without response_format for models that do not support JSON-mode requests.
DeepSeek remains available with --provider deepseek or --llm deepseek; set DEEPSEEK_API_KEY, DEEPSEEK_MODEL, or DEEPSEEK_BASE_URL when using it.
If the AudioClaw configured model is not strong enough for dense product research, switch planning to OpenRouter with --provider openrouter or --llm openrouter, and choose a capable model via --model, --llm-model, OPENROUTER_LLM_MODEL, or OPENROUTER_MODEL.
Build an existing project as a local pipeline:
python3 scripts/senseaudio_video_gen.py build --project my-video --dry-run
python3 scripts/senseaudio_video_gen.py build --project my-video --output my-video/renders/final.mp4
Or scaffold a blank composition:
python3 scripts/senseaudio_video_gen.py init my-video --duration 6 --fps 24
cd my-video
python3 ../scripts/senseaudio_video_gen.py preview .
python3 ../scripts/senseaudio_video_gen.py inspect . --samples 5
python3 ../scripts/senseaudio_video_gen.py render . --output renders/final.mp4
Use SenseAudio assets inside the same project:
python3 scripts/senseaudio_video_gen.py tts \
--text "让声音、字幕和画面在一个视频项目里完成。" \
--voice-id male_0028_a \
--output my-video/assets/narration.mp3
python3 scripts/senseaudio_video_gen.py asr \
--file my-video/assets/narration.mp3 \
--timestamps word \
--output my-video/assets/transcript.json
python3 scripts/senseaudio_video_gen.py captions \
--project my-video \
--transcript my-video/assets/transcript.json \
--output my-video/assets/captions.json
python3 scripts/senseaudio_video_gen.py captions-export \
--captions my-video/assets/captions.json \
--format srt \
--output my-video/renders/final.srt
python3 scripts/senseaudio_video_gen.py render my-video \
--audio my-video/assets/narration.mp3 \
--parallel 4 \
--resume \
--output my-video/renders/final-with-voice.mp4
python3 scripts/senseaudio_video_gen.py lint --project my-video --json
python3 scripts/senseaudio_video_gen.py asset-report --project my-video --json
python3 scripts/senseaudio_video_gen.py generate-assets \
--project my-video \
--image-prompt "clean product UI hero image for a sound library" \
--video-prompt "short b-roll of creators choosing voices" \
--dry-run
python3 scripts/senseaudio_video_gen.py timeline \
--project my-video \
--preset cinematic
data-composition-id, data-width, data-height, data-duration.data-start, data-duration, optional data-media-start, and optional data-scene.assets/timeline.json, data-timeline-source, and optional data-effect presets such as fade-up, slide-left, zoom-in, spotlight, and parallax.styles and compose --style-preset; tokens are embedded as CSS variables, written to assets/style-preset.json, and recorded in senseframe.json.gsap-compat timeline engine writes labels and tracks and uses a local createGsapCompatTimeline adapter; never load external GSAP/CDN code for deterministic renders.transition_preset plus transitions[] in assets/timeline.json; supported presets include editorial, glass, ribbon, iris, and luma.compose maps each storyboard item to matching data-scene and data-timeline-id elements rather than fixed template beats.assets/beats.json, .beat-layer, and data-beat markers. compose --beat-mode layered splits each storyboard scene into hook/proof/detail/cta overlays so a single scene can carry multiple timed visual arguments.motion-map reports flashiness risk when beat/transition rates are too high.hero-overview, nav-scan, feature-zoom, trust-message, cta-summary) so adjacent scenes do not reuse the same visual structure.data-composition-mode values such as full-bleed, split-scan, zoom-callout, evidence-board, and cta-lockup; every website scene should also set a data-camera-path so the renderer can apply distinct camera motion instead of a repeated left/right card.brand-extract or compose --brand-url to create assets/brand.json; brand name, description, nav labels, colors, logos/icons, social images, typography, keywords, and inferred voice should influence website explainer shots.site-ingest or compose --site-url to create assets/site-profile.json; headings, sections, CTA labels, and evidence snippets should drive storyboard scenes and visual cards before generic brief fallback.source-ingest --file <notes.md|notes.txt> or source-ingest --github-url <owner/repo> to create the same site-profile shape from Markdown, plain text, or GitHub README content.site-capture or compose --site-screenshots to capture scroll positions into assets/site-screenshots/; these screenshots should appear in website explainer shots as visual evidence with deterministic pan/zoom and evidence highlight boxes, not decorative stock media.zh-CN) for narration, captions, beat text, and storyboard intent unless the user explicitly requests another language.window.__timelines["main"]; the runtime seeks registered timelines during frame capture so entrances, breathing motion, exits, chips, waveforms, and focus highlights render deterministically.assets/audio-data.json from audio-data; the runtime loads data-audio-source and maps RMS/bands to local mesh intensity, card glow, waveform motion, and transition light. Do not drive the global camera directly from raw audio.data-caption-source="./assets/captions.json" or inline window.__sfCaptions; captions --include-words enables active word highlighting with .sf-word[data-sf-active="true"].sf-word-emphasis, and styling must remain deterministic with no CSS animation loops.render --audio <file> is provided or build finds a registered audio asset such as narration; lint warns when narration text exists but no audio asset is registered.window.renderFrame(time).motion-audit --project <dir> --strict after composing to catch storyboard/DOM/timeline mismatches and legacy fixed-template markers.motion-map --project <dir> --strict before expensive renders to score motion density, scene coverage, low-motion zones, transitions, and audio-reactive binding.window.__senseframes.time, CSS variable --sf-time, data-sf-active, data-scene-active, and dispatches sf-seek.renderFrame(time); avoid wall-clock animation for rendered output.compose for brief-to-project or init for a blank source; use --beat-mode layered when the video needs dense HyperFrames-style scene internals.voices, tts, asr, captions, image-sync, or video-create to produce assets.asset-add or command manifests so assets/asset-manifest.json and senseframe.json stay current.lint, motion-audit, and motion-map to catch missing assets, mismatched scenes, and flat motion.inspect before rendering to catch layout, legibility, and timing issues.render or build; mux narration with --audio when needed.senseframe.json, transcripts, prompts, and asset manifests.| Task | Command | Purpose |
|---|---|---|
| One-pass website video | site-video --url <site> | Ingest, plan, capture, narrate, bind audio data, render, and audit in one pipeline |
| LLM plan | llm-plan / llm-plan --provider openrouter | Generate title, narration, visual style, and storyboard JSON; defaults to AudioClaw |
| Brief to project | compose --project <dir> --brief ... | Create storyboard, narration script, caption scaffold, HTML, and manifests; defaults to AudioClaw with heuristic fallback |
| Brand extraction | brand-extract --url <site> | Extract brand identity, colors, nav, logos/icons, typography, keywords, and voice |
| Site ingestion | site-ingest --url <site> | Extract real headings, sections, CTA labels, and evidence snippets for URL-to-video |
| Source ingestion | source-ingest --file <md/txt> / --github-url <repo> | Convert Markdown, text, or GitHub README content into a reusable site-profile.json |
| Site screenshots | site-capture --url <site> | Capture real scroll screenshots with Chrome, warm lazy content, clean overlays, and register visual evidence |
| Frame quality | frame-quality-audit --project <dir> | Check inspect/site frames for blank captures and leaked planning copy |
| Visual crop plan | site-vision-plan --project <dir> | Plan screenshot crop, zoom, pan, and focus before rendering |
| Beat layers | beats --project <dir> | Split storyboard scenes into hook/proof/detail/cta timed overlays |
| Local pipeline | build --project <dir> | Run lint, create captions when a transcript exists, and render |
| Generated assets | generate-assets --project <dir> | Plan or call SenseAudio image/video generation and register results |
| Project validation | lint --project <dir> | Check entry HTML, runtime, caption sources, timing, and asset existence |
| Style registry | styles --json | List built-in visual presets and recommended motion defaults |
| Motion audit | motion-audit --project <dir> | Check storyboard scene binding, beat layers, transition layer, audio-reactive hooks, timeline registry, and legacy markers |
| Motion map | motion-map --project <dir> | Score motion density, scene/beat coverage, flashiness risk, transition coverage, dead zones, and audio-reactive binding |
| Audio data | audio-data --audio <file> --output assets/audio-data.json | Extract frame-level RMS/band data and bind it with data-audio-source |
| Scaffold | init <dir> | Create index.html, runtime, manifest, assets, renders |
| Preview | preview <dir> | Serve project for browser review |
| Inspect | inspect <dir> | Capture timestamped sample frames |
| Timeline | timeline --project <dir> --timeline-engine gsap-compat | Generate animation tracks, labels, transitions, and bind them to the runtime |
| Render | render <dir> | Convert HTML frames to MP4 locally with optional --parallel, --resume, and --frame-dir |
| Voiceover | tts | Generate narration from SenseAudio TTS |
| Transcript | asr --timestamps word | Produce transcript timing for captions |
| Captions | captions --transcript ... | Convert ASR JSON into assets/captions.json |
| Subtitle files | captions-export | Export captions JSON to .srt or .vtt |
| Asset registry | asset-add | Register local/generated assets in the project manifest |
| Asset inventory | asset-report | List registered assets and missing files |
| Still assets | image-sync | Generate first frames, backdrops, thumbnails |
| Model clips | video-create / video-status | Generate AI video clips through SenseAudio |
| Voices | voices --voice-type all | Discover usable voice_id values |
animation, setInterval, or wall-clock playback for final renders; tie motion to time.voices or use a user-provided voice.senseframe.json updated with generated assets, task IDs, transcripts, and final output paths.captions to convert ASR words into grouped caption cues before authoring caption elements.references/authoring.md — HTML composition patterns and timing rules.references/renderer.md — local renderer requirements and troubleshooting.references/media-pipeline.md — SenseAudio asset pipeline.references/api.md — endpoint and model parameter summary.examples/starter-html-video — minimal editable composition project.