Alibabacloud Avatar Video

v0.0.1

Use the Alibaba Cloud DashScope API and LingMou to generate AI video and speech. Seven capabilities — (1) LivePortrait talking-head (image + audio → video, two-s...

by alibabacloud-skills-team (@sdk-team)

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for sdk-team/alibabacloud-avatar-video.

Prompt preview: Install & Setup
Install the skill "Alibabacloud Avatar Video" (sdk-team/alibabacloud-avatar-video) from ClawHub.
Skill page: https://clawhub.ai/sdk-team/alibabacloud-avatar-video
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Required env vars: DASHSCOPE_API_KEY, ALIBABA_CLOUD_ACCESS_KEY_ID, ALIBABA_CLOUD_ACCESS_KEY_SECRET, OSS_BUCKET, OSS_ENDPOINT
Required binaries: ffmpeg, ffprobe
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install alibabacloud-avatar-video

ClawHub CLI


npx clawhub@latest install alibabacloud-avatar-video

Security Scan
Capability signals
Requires sensitive credentials
These labels describe what authority the skill may exercise. They are separate from suspicious or malicious moderation verdicts.
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
The name and description (AI avatar/video/speech) match the actual requirements: a DashScope API key for model calls, Alibaba AK/SK plus OSS details for uploads, and ffmpeg/ffprobe for media conversion. The required binaries and env vars are consistent with the listed capabilities.
Instruction Scope
Runtime instructions and scripts stay within the declared purpose (media conversion, OSS upload, DashScope/LingMou API calls). Two items to note: (1) some scripts automatically upload user media to your OSS bucket and produce signed GET URLs, which is expected for the service; (2) the LingMou pipeline can auto-copy public templates into your LingMou account as a fallback when no account templates exist. The latter is a write operation against your LingMou account and may create templates without explicit user approval if the script is run with those flags or the automatic-copy code path is taken.
Install Mechanism
This is an instruction-only skill (no packaged installer). SKILL.md recommends pip installing dashscope, oss2, and alibabacloud-lingmou packages — which is normal but means code from PyPI will be installed into the environment. No arbitrary downloads or extract-from-URL installs are present in the manifest.
Credentials
The skill requests DASHSCOPE_API_KEY and Alibaba Cloud AK/SK + OSS_BUCKET/OSS_ENDPOINT — these are necessary for DashScope calls, OSS uploads, and LingMou operations. The referenced IAM policies in docs recommend broad permissions (e.g., AliyunOSSFullAccess, LingMou full actions); while functionally convenient, these permissions are broader than strictly necessary unless scoped carefully. The scripts also reference optional envs (e.g., LINGMOU_ENDPOINT, LINGMOU_REGION, DASHSCOPE_BASE_URL, LINGMOU_VENV_PYTHON) that are not listed in requires.env but are harmless defaults.
Persistence & Privilege
The skill does not request 'always: true' or system-wide privileges. It will create objects in your OSS bucket and can copy templates into your LingMou account (account-side changes). Those behaviors are consistent with its purpose but are persistent side effects in your cloud account and should be considered when granting credentials.
Assessment
This skill appears coherent with its stated purpose, but it requires sensitive Alibaba credentials and will write into your cloud account. Before installing or running it:

  • Use least-privilege credentials: create a RAM user/role scoped to only the required OSS object prefix and the DashScope/LingMou actions you actually need; avoid root keys.
  • Scope OSS permissions to a single bucket/prefix (e.g., human-avatar/*) rather than granting global OSSFullAccess.
  • Be aware the scripts upload local files to your OSS bucket and generate signed URLs; these URLs are time-limited but fetchable by anyone who holds them, and DashScope uses them to pull media. Set short expiry and enforce lifecycle rules to delete temporary objects.
  • The LingMou code can copy public templates into your account, which creates resources there. If you do not want that, avoid the copy/list-public options or ensure the account already has at least one template.
  • Review the provided Python scripts locally before running them, and use an isolated virtualenv; SKILL.md suggests pip installing packages, so inspect those packages and pin versions if desired.
  • Rotate keys after testing and prefer short-lived credentials where possible (RAM role, STS tokens).
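
To make the least-privilege point concrete, here is a sketch of a scoped RAM policy, expressed as a Python dict so it can be printed as JSON; the bucket name and the human-avatar/ prefix are placeholders, not values the skill requires:

import json

# Illustrative least-privilege RAM policy: object access limited to one
# bucket prefix instead of AliyunOSSFullAccess. Bucket name and prefix
# are placeholders; adjust to your own setup.
policy = {
    "Version": "1",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["oss:PutObject", "oss:GetObject"],
            "Resource": "acs:oss:*:*:your-bucket/human-avatar/*",
        },
        {
            "Effect": "Allow",
            "Action": "oss:ListObjects",
            "Resource": "acs:oss:*:*:your-bucket",
        },
    ],
}

print(json.dumps(policy, indent=2))  # paste the output into the RAM console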

Like a lobster shell, security has layers — review code before you run it.

Runtime requirements

🎭 Clawdis
Bins: ffmpeg, ffprobe
Env: DASHSCOPE_API_KEY, ALIBABA_CLOUD_ACCESS_KEY_ID, ALIBABA_CLOUD_ACCESS_KEY_SECRET, OSS_BUCKET, OSS_ENDPOINT
latest: vk97b9k7gv0kdrpt3ypvg1edrg1850389
66 downloads
0 stars
1 version
Updated 1w ago
v0.0.1
MIT-0

Human Avatar — Alibaba Cloud AI Video & Speech

Capabilities overview

Capability | Script | Model / API | Region | Summary
LivePortrait | live_portrait.py | liveportrait | cn-beijing | Portrait + audio/video → talking video, two steps
EMO | portrait_animate.py | emo-v1 | cn-beijing | Portrait + audio → talking head, detect + generate
AA (AnimateAnyone) | animate_anyone.py | animate-anyone-gen2 | cn-beijing | Full-body animation: detect → motion template → video
T2I | text_to_image.py | wan2.x-t2i | Multi-region | Text → image; default wan2.2-t2i-flash
I2V | image_to_video.py | wan2.x-i2v | Multi-region | Image → video; T2I→I2V pipeline supported; default wan2.7-i2v
Qwen TTS | qwen_tts.py | qwen3-tts-* | cn-beijing / Singapore | Text → speech; auto model/voice by scene
LingMou | avatar_video.py | LingMou SDK | cn-beijing | Template-based digital-human broadcast video

Quick selection guide

Talking head (have audio/video already)     → LivePortrait
Talking head (no audio; synthesize first)   → Qwen TTS → LivePortrait
Full-body dance / motion                    → AA (AnimateAnyone)
Text → image                                → T2I (text_to_image)
Image → video                               → I2V (image_to_video)
Text → video end-to-end                     → T2I → I2V (image_to_video --t2i-prompt)
Enterprise digital human / template news    → LingMou (avatar_video)

Environment setup

pip install requests==2.33.1 dashscope==1.25.15 oss2==2.19.1 numpy==1.26.4
# LingMou additionally:
pip install alibabacloud-lingmou20250527==1.7.0 alibabacloud-tea-openapi==0.4.4
export DASHSCOPE_API_KEY=sk-xxxx               # Beijing-region API key
export ALIBABA_CLOUD_ACCESS_KEY_ID=xxx         # OSS upload
export ALIBABA_CLOUD_ACCESS_KEY_SECRET=xxx
export OSS_BUCKET=your-bucket
export OSS_ENDPOINT=oss-cn-beijing.aliyuncs.com

⚠️ API keys for cn-beijing and Singapore are not interchangeable; use the key for the correct region.
OSS_ENDPOINT may include or omit the https:// prefix; scripts normalize it.
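
The scripts upload local media to your OSS bucket and hand DashScope a signed, time-limited GET URL. A minimal sketch of that pattern with oss2; the object key and 15-minute expiry are illustrative:

import os
import oss2

# Credentials and bucket come from the env vars exported above.
auth = oss2.Auth(
    os.environ["ALIBABA_CLOUD_ACCESS_KEY_ID"],
    os.environ["ALIBABA_CLOUD_ACCESS_KEY_SECRET"],
)
endpoint = os.environ["OSS_ENDPOINT"]
if not endpoint.startswith("http"):
    endpoint = "https://" + endpoint  # scripts normalize this the same way
bucket = oss2.Bucket(auth, endpoint, os.environ["OSS_BUCKET"])

# Upload under a prefix, then sign a short-lived GET URL for DashScope to fetch.
key = "human-avatar/portrait.jpg"           # illustrative object key
bucket.put_object_from_file(key, "./portrait.jpg")
url = bucket.sign_url("GET", key, 15 * 60)  # 15-minute expiry
print(url)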


1. LivePortrait — talking-head video

When to use: You have a portrait photo + speech and want a talking-head video quickly.

Flow:

Step 1: liveportrait-detect (sync)  → pass=true
  ↓
Step 2: liveportrait        (async)  → video_url
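
Both steps hit DashScope's HTTP API; Step 2 uses the generic async-task pattern (submit with the X-DashScope-Async: enable header, then poll /api/v1/tasks/{task_id}). A minimal sketch of that pattern; the service path and payload fields are illustrative assumptions, since scripts/live_portrait.py builds the real request:

import os
import time
import requests

API_KEY = os.environ["DASHSCOPE_API_KEY"]
BASE = "https://dashscope.aliyuncs.com/api/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Submit the async job.
resp = requests.post(
    f"{BASE}/services/aigc/image2video/video-synthesis",  # assumed path
    headers={**HEADERS, "X-DashScope-Async": "enable"},
    json={
        "model": "liveportrait",
        "input": {"image_url": "https://...", "audio_url": "https://..."},
    },
)
task_id = resp.json()["output"]["task_id"]

# Poll the standard task endpoint until the job settles.
while True:
    task = requests.get(f"{BASE}/tasks/{task_id}", headers=HEADERS).json()
    if task["output"]["task_status"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(5)
print(task["output"])  # on success this carries the video_url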

Image: Single person, front-facing portrait, clear face, no occlusion
Audio: wav/mp3, < 15MB, 1s–3min
Video input: Audio extracted automatically (ffmpeg)

# Image + audio file
python scripts/live_portrait.py \
  --image ./portrait.jpg \
  --audio ./speech.mp3 \
  --template normal --download

# Image + video (extract audio)
python scripts/live_portrait.py \
  --image ./portrait.jpg \
  --video ./speech_video.mp4 \
  --template active --download

# Public URLs
python scripts/live_portrait.py \
  --image-url "https://..." \
  --audio-url "https://..." \
  --mouth-strength 1.2 --download

Motion templates:

  • normal (default, moderate motion)
  • calm (calm; news / storytelling)
  • active (lively; singing / hosting)

2. Qwen TTS — text to speech

When to use: Generate speech files from text (for LivePortrait, EMO, etc.).

Default model: qwen3-tts-vd-realtime-2026-01-15

Auto model selection by scene

Scene (--scene) | Suggested model | Suggested voice
default / brand | qwen3-tts-vd-realtime-2026-01-15 | Cherry
news / documentary / advertising | qwen3-tts-instruct-flash-realtime | Serena / Ethan
audiobook / drama | qwen3-tts-instruct-flash-realtime | Cherry / Dylan
customer_service / chatbot / education | qwen3-tts-flash-realtime | Anna / Ethan
ecommerce / short_video | qwen3-tts-flash-realtime | Cherry / Chelsie
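
Presumably --scene resolves to a (model, voice) pair via a lookup like the table above; a hypothetical sketch (the real mapping lives in qwen_tts.py):

# Hypothetical scene -> (model, default voice) table mirroring the docs.
SCENE_PRESETS = {
    "default":          ("qwen3-tts-vd-realtime-2026-01-15", "Cherry"),
    "brand":            ("qwen3-tts-vd-realtime-2026-01-15", "Cherry"),
    "news":             ("qwen3-tts-instruct-flash-realtime", "Serena"),
    "documentary":      ("qwen3-tts-instruct-flash-realtime", "Serena"),
    "advertising":      ("qwen3-tts-instruct-flash-realtime", "Ethan"),
    "audiobook":        ("qwen3-tts-instruct-flash-realtime", "Cherry"),
    "drama":            ("qwen3-tts-instruct-flash-realtime", "Dylan"),
    "customer_service": ("qwen3-tts-flash-realtime", "Anna"),
    "chatbot":          ("qwen3-tts-flash-realtime", "Anna"),
    "education":        ("qwen3-tts-flash-realtime", "Ethan"),
    "ecommerce":        ("qwen3-tts-flash-realtime", "Cherry"),
    "short_video":      ("qwen3-tts-flash-realtime", "Chelsie"),
}

def resolve_scene(scene):
    """Unknown scenes fall back to the default preset."""
    return SCENE_PRESETS.get(scene, SCENE_PRESETS["default"])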

Available voices

Voice | Character
Cherry | Bright, sweet female; ads / audiobooks / dubbing
Serena | Mature, intellectual female; news / explainers / corporate
Ethan | Steady, warm male; education / documentary / training
Dylan | Expressive male; radio drama / game VO
Anna | Gentle, friendly female; support / assistant / daily
Chelsie | Young, fresh female; short video / e-commerce
Thomas | Deep, magnetic male; brand / ads
Luna | Warm, soft female; meditation / storytelling

# Default (qwen3-tts-vd-realtime + Cherry)
python scripts/qwen_tts.py --text "Hello, welcome to Qwen TTS." --download

# Match by scene
python scripts/qwen_tts.py --text "Today's market..." --scene news --download
python scripts/qwen_tts.py --text "Once upon a time..." --scene audiobook --download

# Style via instructions
python scripts/qwen_tts.py \
  --text "Dear students..." \
  --model qwen3-tts-instruct-flash-realtime \
  --instructions "Warm tone, steady pace, suitable for teaching" \
  --download

# List options
python scripts/qwen_tts.py --list-voices
python scripts/qwen_tts.py --list-models

3. T2I — Wan 2.x text-to-image

When to use: Generate images from text (optionally feed into I2V).

# Default model (wan2.2-t2i-flash, fast)
python scripts/text_to_image.py \
  --prompt "A woman in Hanfu in a peach blossom forest, cinematic, 4K, soft light" \
  --size 960*1696 --download

# Higher quality
python scripts/text_to_image.py \
  --prompt "..." --model wan2.2-t2i-plus --size 1280*1280 --download

# Latest (Wan 2.6)
python scripts/text_to_image.py \
  --prompt "..." --model wan2.6-t2i --size 1280*1280 --n 1 --download

Models:

  • wan2.2-t2i-flash (default, fast, good for tests)
  • wan2.2-t2i-plus (higher quality)
  • wan2.6-t2i (latest; more aspect ratios; sync call)

Common sizes: 1280*1280 (1:1) / 960*1696 (9:16) / 1696*960 (16:9)
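
Under the hood this is a one-shot DashScope image-synthesis call. A minimal sketch with the dashscope SDK's ImageSynthesis helper, assuming the wan2.x models accept the same parameters as the CLI flags above:

import os
import dashscope
from dashscope import ImageSynthesis

dashscope.api_key = os.environ["DASHSCOPE_API_KEY"]

# Synchronous call; assumes wan2.2-t2i-flash takes the standard
# ImageSynthesis parameters (model, prompt, n, size).
rsp = ImageSynthesis.call(
    model="wan2.2-t2i-flash",
    prompt="A woman in Hanfu in a peach blossom forest, cinematic, 4K, soft light",
    n=1,
    size="960*1696",
)
if rsp.status_code == 200:
    for result in rsp.output.results:
        print(result.url)  # time-limited download URL for the image
else:
    print("failed:", rsp.code, rsp.message)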


4. I2V — Wan 2.x image-to-video

When to use: Turn an image into motion video; supports text-to-video via T2I first.

# Local image → video
python scripts/image_to_video.py \
  --image ./portrait.jpg \
  --prompt "She turns slowly and smiles; dress and petals drift gently" \
  --model wan2.7-i2v \
  --resolution 720P --duration 5 --download

# Pipeline: text → image → video
python scripts/image_to_video.py \
  --t2i-prompt "A woman in Hanfu in a peach blossom forest" \
  --prompt "She turns slowly; petals fall; poetic mood" \
  --download --output result.mp4

# With background music
python scripts/image_to_video.py \
  --image ./portrait.jpg \
  --audio-url "https://..." \
  --prompt "..." --download

Models:

  • wan2.7-i2v (default; includes sound; 5s/10s)
  • wan2.5-i2v-preview (high-quality preview)
  • wan2.2-i2v-plus (no built-in audio; faster)
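
A sketch of the corresponding SDK call, assuming dashscope's VideoSynthesis helper (async submit, then wait) covers the wan2.x i2v models the same way it covers earlier wanx releases; image_to_video.py wraps the real call:

import os
import dashscope
from dashscope import VideoSynthesis

dashscope.api_key = os.environ["DASHSCOPE_API_KEY"]

# Async submit, then block until done. The VideoSynthesis surface is an
# assumption modeled on older wanx i2v models.
task = VideoSynthesis.async_call(
    model="wan2.7-i2v",
    prompt="She turns slowly and smiles; dress and petals drift gently",
    img_url="https://...",  # signed OSS URL for the input image
)
rsp = VideoSynthesis.wait(task)
if rsp.status_code == 200:
    print(rsp.output.video_url)  # time-limited URL for the result video
else:
    print("failed:", rsp.code, rsp.message)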

5. AA AnimateAnyone — full-body animation

When to use: Full-body photo + reference motion video → dance / motion video.

Requirements:

  • Image: Single person, full body front, head to toe, aspect ratio 0.5–2.0
  • Video: Full body in frame from first frame; mp4/avi/mov; fps ≥ 24; 2–60s

Three steps:

Step 1: animate-anyone-detect-gen2   (sync)  → check_pass=true
  ↓
Step 2: animate-anyone-template-gen2 (async)  → template_id (~3–5 min)
  ↓
Step 3: animate-anyone-gen2          (async)  → video_url (~3–5 min)

# Local files (auto convert + OSS upload)
python scripts/animate_anyone.py \
  --image ./portrait_fullbody.jpg \
  --video ./dance.mp4 \
  --download --output result.mp4

# Use image as background
python scripts/animate_anyone.py \
  --image ./portrait.jpg --video ./dance.mp4 \
  --use-ref-img-bg --video-ratio 9:16 --download

# Skip Step 2 (existing template_id)
python scripts/animate_anyone.py \
  --image ./portrait.jpg \
  --template-id "AACT.xxx.xxx" --download

Auto conversion: webm/mkv/flv videos → mp4; webp/heic images → jpg; videos below 24 fps are normalized to 24 fps
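
A sketch of that normalization with ffprobe/ffmpeg through subprocess; the flags are standard, though the script's exact invocation may differ:

import subprocess

def video_fps(path):
    """Read the first video stream's average frame rate with ffprobe."""
    out = subprocess.check_output([
        "ffprobe", "-v", "error", "-select_streams", "v:0",
        "-show_entries", "stream=avg_frame_rate",
        "-of", "default=noprint_wrappers=1:nokey=1", path,
    ]).decode().strip()
    num, den = (float(x) for x in out.split("/"))
    return num / den if den else 0.0

def normalize_video(path, out_path="normalized.mp4"):
    """Re-encode to mp4, raising the frame rate to 24 fps when lower."""
    cmd = ["ffmpeg", "-y", "-i", path]
    if video_fps(path) < 24:
        cmd += ["-r", "24"]
    cmd.append(out_path)
    subprocess.check_call(cmd)
    return out_path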


6. EMO — talking head (legacy)

Note: Prefer LivePortrait; EMO suits cases that need stricter lip-sync.

python scripts/portrait_animate.py \
  --image ./portrait.jpg \
  --audio ./speech.mp3 \
  --download

7. LingMou — enterprise template video

When to use: Corporate digital-human news, template-based broadcasts, scripted reads with optional character images.

New workflow (template_id optional)

  • If the user provides template_id: use that template to generate.
  • If no template_id:
    1. List existing broadcast templates for the account.
    2. If any exist, pick one at random for creation.
    3. If none, fetch public templates and copy up to 3 into the account.
    4. Pick one at random from the copy results and continue (see the sketch after this list).
  • Caveat: After a public template is copied, the copy may not yet be a fully “ready-to-render” template; some copies are still drafts and may lack clips, assets, or variable bindings—complete them in LingMou.
  • If the user only gives an image and “make a talking video” without a script: confirm the spoken copy before generating.
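
The fallback selection above, sketched in Python; the client helpers here are hypothetical stand-ins for the LingMou SDK calls that avatar_video.py actually makes:

import random

def pick_template(client, template_id=None):
    """Resolve a broadcast template following the workflow above.

    `client` and its methods are hypothetical stand-ins for the LingMou
    SDK calls that avatar_video.py actually makes.
    """
    if template_id:
        return template_id                   # user-supplied: use as-is
    templates = client.list_templates()      # account-owned templates
    if templates:
        return random.choice(templates)["template_id"]
    # No account templates: copy up to 3 public ones as a fallback.
    # Copies may still be drafts that need completion in LingMou.
    copied = client.copy_public_templates(limit=3)
    if not copied:
        raise RuntimeError("no account or public templates available")
    return random.choice(copied)["template_id"]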

What scripts/avatar_video.py supports

  • --list-templates: list account templates
  • --list-public-templates: list public templates (SDK 1.7.0+)
  • --copy-public-templates: copy up to 3 public templates (SDK 1.7.0+)
  • Omit --template-id: random existing template
  • When account templates are empty: automatically falls back to copying public templates
  • --show-template-detail: template detail and replaceable variables
  • Fills input text into template text variables (prefers text_content / test_text)
  • If generation fails right after copying a public template, surfaces a clear error that the template may still need completion (no silent failure)

# List templates
python scripts/avatar_video.py --list-templates

# Public templates (SDK 1.7.0+)
python scripts/avatar_video.py --list-public-templates

# Copy up to 3 public templates (SDK 1.7.0+)
python scripts/avatar_video.py --copy-public-templates

# No template_id — random existing template
python scripts/avatar_video.py \
  --text "Hello, welcome to today's tech news." \
  --download

# Specific template_id
python scripts/avatar_video.py \
  --template-id "BS1b2WNnRMu4ouRzT4clY9Jhg" \
  --text "Hello, welcome to today's tech news." \
  --download

# Detail for randomly chosen template
python scripts/avatar_video.py \
  --show-template-detail \
  --text "This is a test script for broadcast."

Conversational usage

When the user says things like:

  • “Make a talking video from this image”
  • “Digital-human broadcast for me”
  • “Upload image and make a news read”

Do this:

  1. Check whether they already gave copy/script ready to read.
  2. If not, ask: “What is the exact script to read? You can give bullet points and I can turn them into broadcast-ready copy.”
  3. With script in hand, run LingMou: prefer random existing template; if none locally, try public copy.
  4. If they uploaded a portrait but the template API does not use it, explain: this path is template-driven; for image-driven talking head, use LivePortrait or EMO.

