Install
openclaw skills install alibabacloud-avatar-video

Use Alibaba Cloud DashScope API and LingMou to generate AI video and speech. Seven capabilities: (1) LivePortrait talking-head (image + audio → video, two-step), (2) EMO talking-head, (3) AA/AnimateAnyone full-body animation (three-step), (4) T2I text-to-image (Wan 2.x, default wan2.2-t2i-flash), (5) I2V image-to-video (Wan 2.x, default wan2.7-i2v-flash, supports T2I→I2V pipeline), (6) Qwen TTS (auto model/voice by scene, default qwen3-tts-vd-realtime-2026-01-15), (7) LingMou digital-human template video with random template, public-template copy, and script confirmation. Trigger when the user needs talking-head, portrait, full-body animation, text-to-image, text-to-video, or speech synthesis.

| Capability | Script | Model / API | Region | Summary |
|---|---|---|---|---|
| LivePortrait | live_portrait.py | liveportrait | cn-beijing | Portrait + audio/video → talking video, two steps |
| EMO | portrait_animate.py | emo-v1 | cn-beijing | Portrait + audio → talking head, detect + generate |
| AA (AnimateAnyone) | animate_anyone.py | animate-anyone-gen2 | cn-beijing | Full-body animation: detect → motion template → video |
| T2I | text_to_image.py | wan2.x-t2i | Multi-region | Text → image, default wan2.2-t2i-flash |
| I2V | image_to_video.py | wan2.x-i2v | Multi-region | Image → video; T2I→I2V pipeline supported; default wan2.7-i2v-flash |
| Qwen TTS | qwen_tts.py | qwen3-tts-* | cn-beijing / Singapore | Text → speech; auto model/voice by scene |
| LingMou | avatar_video.py | LingMou SDK | cn-beijing | Template-based digital-human broadcast video |
Talking head (have audio/video already) → LivePortrait
Talking head (no audio; synthesize first) → Qwen TTS → LivePortrait
Full-body dance / motion → AA (AnimateAnyone)
Text → image → T2I (text_to_image)
Image → video → I2V (image_to_video)
Text → video end-to-end → T2I → I2V (image_to_video --t2i-prompt)
Enterprise digital human / template news → LingMou (avatar_video)
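The routing rules above can be sketched as a small lookup. This is only an illustration: the need keys (`talking_head_with_audio` and so on) are hypothetical names invented for this example, and only the script names come from the capability table.

```python
# Hypothetical need keys; script names are taken from the capability table.
ROUTES = {
    "talking_head_with_audio": ["live_portrait.py"],
    "talking_head_text_only": ["qwen_tts.py", "live_portrait.py"],
    "full_body_motion": ["animate_anyone.py"],
    "text_to_image": ["text_to_image.py"],
    "image_to_video": ["image_to_video.py"],
    "text_to_video": ["image_to_video.py"],  # run with --t2i-prompt
    "template_broadcast": ["avatar_video.py"],
}


def scripts_for(need: str) -> list[str]:
    """Return the scripts to run, in order, for a given need key."""
    try:
        return ROUTES[need]
    except KeyError:
        raise ValueError(f"unknown need: {need!r}") from None


print(scripts_for("talking_head_text_only"))
```

Keeping the chain as an ordered list makes the two-stage case (synthesize speech first, then animate) explicit.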
pip install requests==2.33.1 dashscope==1.25.15 oss2==2.19.1 numpy==1.26.4
# LingMou additionally:
pip install alibabacloud-lingmou20250527==1.7.0 alibabacloud-tea-openapi==0.4.4
export DASHSCOPE_API_KEY=sk-xxxx # Beijing-region API key
export ALIBABA_CLOUD_ACCESS_KEY_ID=xxx # OSS upload
export ALIBABA_CLOUD_ACCESS_KEY_SECRET=xxx
export OSS_BUCKET=your-bucket
export OSS_ENDPOINT=oss-cn-beijing.aliyuncs.com
⚠️ API keys for cn-beijing and Singapore are not interchangeable; use the key that matches the region you call.
OSS_ENDPOINT may include or omit the https:// prefix; the scripts normalize it.
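A pre-flight check for the variables above might look like the following sketch. The helper names `missing_vars` and `normalize_endpoint` are hypothetical, and the normalization shown (strip any scheme, then add `https://`) is an assumption about what "normalize" means here, not the scripts' actual code.

```python
import os

REQUIRED_VARS = (
    "DASHSCOPE_API_KEY",
    "ALIBABA_CLOUD_ACCESS_KEY_ID",
    "ALIBABA_CLOUD_ACCESS_KEY_SECRET",
    "OSS_BUCKET",
    "OSS_ENDPOINT",
)


def missing_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def normalize_endpoint(endpoint: str) -> str:
    """Return the endpoint with a single https:// prefix and no trailing slash.

    Assumption: an http:// prefix is upgraded to https://.
    """
    host = endpoint.strip()
    for prefix in ("https://", "http://"):
        if host.lower().startswith(prefix):
            host = host[len(prefix):]
            break
    return "https://" + host.rstrip("/")


if __name__ == "__main__":
    print("missing:", missing_vars())
    print(normalize_endpoint("oss-cn-beijing.aliyuncs.com"))
```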
When to use: You have a portrait photo + speech and want a talking-head video quickly.
Flow:
Step 1: liveportrait-detect (sync) → pass=true
↓
Step 2: liveportrait (async) → video_url
Image: Single person, front-facing portrait, clear face, no occlusion
Audio: wav/mp3, < 15MB, 1s–3min
Video input: Audio extracted automatically (ffmpeg)
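The audio limits above can be checked locally before uploading. The sketch below is an assumption-laden illustration: `audio_problems` is a hypothetical helper, "15MB" is read as 15 MiB, and the caller is expected to supply the duration (for example from ffprobe), since measuring it is outside the scope of this example.

```python
from pathlib import Path

MAX_AUDIO_BYTES = 15 * 1024 * 1024  # "< 15MB" read as 15 MiB (assumption)
MIN_SECONDS, MAX_SECONDS = 1.0, 180.0  # 1s to 3min
ALLOWED_SUFFIXES = {".wav", ".mp3"}


def audio_problems(filename: str, size_bytes: int, duration_s: float) -> list[str]:
    """Return a list of human-readable violations; empty means the audio is acceptable."""
    problems = []
    if Path(filename).suffix.lower() not in ALLOWED_SUFFIXES:
        problems.append("format must be wav or mp3")
    if size_bytes >= MAX_AUDIO_BYTES:
        problems.append("file must be smaller than 15MB")
    if not MIN_SECONDS <= duration_s <= MAX_SECONDS:
        problems.append("duration must be between 1s and 3min")
    return problems


print(audio_problems("speech.mp3", 2_000_000, 42.0))
```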
# Image + audio file
python scripts/live_portrait.py \
--image ./portrait.jpg \
--audio ./speech.mp3 \
--template normal --download
# Image + video (extract audio)
python scripts/live_portrait.py \
--image ./portrait.jpg \
--video ./speech_video.mp4 \
--template active --download
# Public URLs
python scripts/live_portrait.py \
--image-url "https://..." \
--audio-url "https://..." \
--mouth-strength 1.2 --download
Motion templates:
normal (default, moderate motion)
calm (calm; news / storytelling)
active (lively; singing / hosting)

When to use: Generate speech files from text (for LivePortrait, EMO, etc.).
Default model: qwen3-tts-vd-realtime-2026-01-15
| Scene (--scene) | Suggested model | Suggested voice |
|---|---|---|
| default / brand | qwen3-tts-vd-realtime-2026-01-15 | Cherry |
| news / documentary / advertising | qwen3-tts-instruct-flash-realtime | Serena / Ethan |
| audiobook / drama | qwen3-tts-instruct-flash-realtime | Cherry / Dylan |
| customer_service / chatbot / education | qwen3-tts-flash-realtime | Anna / Ethan |
| ecommerce / short_video | qwen3-tts-flash-realtime | Cherry / Chelsie |

| Voice | Character |
|---|---|
| Cherry | Bright, sweet female; ads / audiobooks / dubbing |
| Serena | Mature, intellectual female; news / explainers / corporate |
| Ethan | Steady, warm male; education / documentary / training |
| Dylan | Expressive male; radio drama / game VO |
| Anna | Gentle, friendly female; support / assistant / daily |
| Chelsie | Young, fresh female; short video / e-commerce |
| Thomas | Deep, magnetic male; brand / ads |
| Luna | Warm, soft female; meditation / storytelling |
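The scene table above maps naturally onto a lookup. The sketch below copies the scene, model, and voice names from the table; the helper `pick_preset`, the choice of the first listed voice, and the fallback to the default scene for unknown input are all assumptions, and `qwen_tts.py` may resolve these differently.

```python
DEFAULT_SCENE = "default"

# (scenes, suggested model, suggested voices), copied from the scene table.
_GROUPS = [
    (("default", "brand"), "qwen3-tts-vd-realtime-2026-01-15", ("Cherry",)),
    (("news", "documentary", "advertising"), "qwen3-tts-instruct-flash-realtime", ("Serena", "Ethan")),
    (("audiobook", "drama"), "qwen3-tts-instruct-flash-realtime", ("Cherry", "Dylan")),
    (("customer_service", "chatbot", "education"), "qwen3-tts-flash-realtime", ("Anna", "Ethan")),
    (("ecommerce", "short_video"), "qwen3-tts-flash-realtime", ("Cherry", "Chelsie")),
]

SCENE_PRESETS = {
    scene: (model, voices)
    for scenes, model, voices in _GROUPS
    for scene in scenes
}


def pick_preset(scene=None):
    """Return (model, voice) for a scene.

    Assumptions: the first suggested voice wins, and unknown scenes
    fall back to the default preset.
    """
    model, voices = SCENE_PRESETS.get(scene or DEFAULT_SCENE, SCENE_PRESETS[DEFAULT_SCENE])
    return model, voices[0]


print(pick_preset("news"))
```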
# Default (qwen3-tts-vd-realtime + Cherry)
python scripts/qwen_tts.py --text "Hello, welcome to Qwen TTS." --download
# Match by scene
python scripts/qwen_tts.py --text "Today's market..." --scene news --download
python scripts/qwen_tts.py --text "Once upon a time..." --scene audiobook --download
# Style via instructions
python scripts/qwen_tts.py \
--text "Dear students..." \
--model qwen3-tts-instruct-flash-realtime \
--instructions "Warm tone, steady pace, suitable for teaching" \
--download
# List options
python scripts/qwen_tts.py --list-voices
python scripts/qwen_tts.py --list-models
When to use: Generate images from text (optionally feed into I2V).
# Default model (wan2.2-t2i-flash, fast)
python scripts/text_to_image.py \
--prompt "A woman in Hanfu in a peach blossom forest, cinematic, 4K, soft light" \
--size 960*1696 --download
# Higher quality
python scripts/text_to_image.py \
--prompt "..." --model wan2.2-t2i-plus --size 1280*1280 --download
# Latest (Wan 2.6)
python scripts/text_to_image.py \
--prompt "..." --model wan2.6-t2i --size 1280*1280 --n 1 --download
Models:
wan2.2-t2i-flash (default, fast, good for tests)
wan2.2-t2i-plus (higher quality)
wan2.6-t2i (latest; more aspect ratios; sync call)

Common sizes: 1280*1280 (1:1) / 960*1696 (9:16) / 1696*960 (16:9)
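The `--size` values use a `width*height` string. A minimal sketch for parsing one and labelling it with the nearest of the three listed aspect ratios follows; `parse_size` and `nearest_ratio` are hypothetical helpers, not part of `text_to_image.py`.

```python
COMMON_RATIOS = {"1:1": 1.0, "9:16": 9 / 16, "16:9": 16 / 9}


def parse_size(size: str) -> tuple[int, int]:
    """Parse a 'width*height' string into positive integers."""
    width, height = (int(part) for part in size.split("*"))
    if width <= 0 or height <= 0:
        raise ValueError(f"size must be positive: {size!r}")
    return width, height


def nearest_ratio(size: str) -> str:
    """Return the listed aspect-ratio label closest to the given size."""
    width, height = parse_size(size)
    actual = width / height
    return min(COMMON_RATIOS, key=lambda name: abs(COMMON_RATIOS[name] - actual))


for s in ("1280*1280", "960*1696", "1696*960"):
    print(s, nearest_ratio(s))
```

A nearest match is used instead of reducing the fraction because 960*1696 reduces to 30:53, which is close to 9:16 but not exactly equal to it.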
When to use: Turn an image into motion video; supports text-to-video via T2I first.
# Local image → video
python scripts/image_to_video.py \
--image ./portrait.jpg \
--prompt "She turns slowly and smiles; dress and petals drift gently" \
--model wan2.7-i2v \
--resolution 720P --duration 5 --download
# Pipeline: text → image → video
python scripts/image_to_video.py \
--t2i-prompt "A woman in Hanfu in a peach blossom forest" \
--prompt "She turns slowly; petals fall; poetic mood" \
--download --output result.mp4
# With background music
python scripts/image_to_video.py \
--image ./portrait.jpg \
--audio-url "https://..." \
--prompt "..." --download
Models:
wan2.7-i2v (default; includes sound; 5s/10s)
wan2.5-i2v-preview (high-quality preview)
wan2.2-i2v-plus (no built-in audio; faster)

When to use: Full-body photo + reference motion video → dance / motion video.
Requirements:
Three steps:
Step 1: animate-anyone-detect-gen2 (sync) → check_pass=true
↓
Step 2: animate-anyone-template-gen2 (async) → template_id (~3–5 min)
↓
Step 3: animate-anyone-gen2 (async) → video_url (~3–5 min)
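The three-step flow above (a synchronous check, then two asynchronous tasks that each need polling) can be sketched generically. Everything callable here is injected and hypothetical: `detect`, `submit_template`, `submit_video`, and `fetch_status` stand in for the real API calls, and the `SUCCEEDED` / `FAILED` state names are an assumption about the task status values rather than something this document states.

```python
import time


def wait_for_task(fetch_status, task_id, interval_s=10.0, timeout_s=600.0, sleep=time.sleep):
    """Poll fetch_status(task_id) -> (state, result) until the task finishes."""
    waited = 0.0
    while True:
        state, result = fetch_status(task_id)
        if state == "SUCCEEDED":  # assumed terminal state name
            return result
        if state == "FAILED":  # assumed terminal state name
            raise RuntimeError(f"task {task_id} failed: {result}")
        if waited >= timeout_s:
            raise TimeoutError(f"task {task_id} still {state} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s


def run_animate_anyone(detect, submit_template, submit_video, fetch_status,
                       image_url, video_url, template_id=None, sleep=time.sleep):
    """Run detect -> motion template -> video, skipping Step 2 if template_id is given."""
    if not detect(image_url):  # Step 1 (sync): image must pass the check
        raise ValueError("image did not pass detection")
    if template_id is None:  # Step 2 (async): build the motion template
        template_id = wait_for_task(fetch_status, submit_template(video_url), sleep=sleep)
    # Step 3 (async): render the final video
    return wait_for_task(fetch_status, submit_video(image_url, template_id), sleep=sleep)
```

Passing `template_id` mirrors the `--template-id` option: the slow template step is skipped and only the final generation is polled.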
# Local files (auto convert + OSS upload)
python scripts/animate_anyone.py \
--image ./portrait_fullbody.jpg \
--video ./dance.mp4 \
--download --output result.mp4
# Use image as background
python scripts/animate_anyone.py \
--image ./portrait.jpg --video ./dance.mp4 \
--use-ref-img-bg --video-ratio 9:16 --download
# Skip Step 2 (existing template_id)
python scripts/animate_anyone.py \
--image ./portrait.jpg \
--template-id "AACT.xxx.xxx" --download
Auto conversion: video webm/mkv/flv → mp4; image webp/heic → jpg; if fps is under 24, normalize to 24 fps
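The auto-conversion rules above amount to a small decision table. The sketch below only plans the conversion and does not invoke ffmpeg; `plan_conversion` and the shape of its return value are hypothetical, and the frame rate is assumed to be supplied by the caller.

```python
from pathlib import Path

VIDEO_TO_MP4 = {".webm", ".mkv", ".flv"}
IMAGE_TO_JPG = {".webp", ".heic"}
TARGET_FPS = 24


def plan_conversion(filename: str, fps=None) -> dict:
    """Return the target suffix and, if needed, the frame rate to normalize to."""
    suffix = Path(filename).suffix.lower()
    plan = {"target_suffix": suffix, "target_fps": None}
    if suffix in VIDEO_TO_MP4:
        plan["target_suffix"] = ".mp4"
    elif suffix in IMAGE_TO_JPG:
        plan["target_suffix"] = ".jpg"
    if fps is not None and fps < TARGET_FPS:
        plan["target_fps"] = TARGET_FPS  # only low frame rates are raised
    return plan


print(plan_conversion("dance.webm", fps=15))
```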
Note: Prefer LivePortrait; EMO suits cases that need stricter lip-sync.
python scripts/portrait_animate.py \
--image ./portrait.jpg \
--audio ./speech.mp3 \
--download
When to use: Corporate digital-human news, template-based broadcasts, scripted reads with optional character images.
With template_id: use that template to generate.
Without template_id:
scripts/avatar_video.py supports:
--list-templates: list account templates
--list-public-templates: list public templates (SDK 1.7.0+)
--copy-public-templates: copy up to 3 public templates (SDK 1.7.0+)
--template-id omitted: random existing template
--show-template-detail: template detail and replaceable variables (text_content / test_text)
# List templates
python scripts/avatar_video.py --list-templates
# Public templates (SDK 1.7.0+)
python scripts/avatar_video.py --list-public-templates
# Copy up to 3 public templates (SDK 1.7.0+)
python scripts/avatar_video.py --copy-public-templates
# No template_id — random existing template
python scripts/avatar_video.py \
--text "Hello, welcome to today's tech news." \
--download
# Specific template_id
python scripts/avatar_video.py \
--template-id "BS1b2WNnRMu4ouRzT4clY9Jhg" \
--text "Hello, welcome to today's tech news." \
--download
# Detail for randomly chosen template
python scripts/avatar_video.py \
--show-template-detail \
--text "This is a test script for broadcast."
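The template selection shown in the commands above (use the given `template_id`, otherwise pick a random existing template) could be sketched as follows. `choose_template` is a hypothetical helper, and `available_ids` stands in for whatever `--list-templates` returns; this is not the code in `avatar_video.py`.

```python
import random


def choose_template(template_id, available_ids, rng=random):
    """Return the explicit template_id, or a random one from available_ids."""
    if template_id:
        return template_id
    if not available_ids:
        raise LookupError("no templates available to choose from")
    return rng.choice(list(available_ids))


print(choose_template("BS1b2WNnRMu4ouRzT4clY9Jhg", []))
```

Accepting an `rng` argument keeps the random branch reproducible in tests, for example by passing `random.Random(0)`.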
When the user says things like:
Do this: