AI Talking Photo

v1.0.0

Bring any still photo to life with ai-talking-photo, the skill that syncs facial animation to audio and makes portraits speak, sing, or narrate. Upload a fac...


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for whitejohnk-26/ai-talking-photo.

Prompt preview: Install & Setup
Install the skill "Ai Talking Photo" (whitejohnk-26/ai-talking-photo) from ClawHub.
Skill page: https://clawhub.ai/whitejohnk-26/ai-talking-photo
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Required env vars: NEMO_TOKEN
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install ai-talking-photo

ClawHub CLI


npx clawhub@latest install ai-talking-photo
Security Scan
  • VirusTotal: Benign
  • OpenClaw: Benign (medium confidence)

Purpose & Capability
The skill claims to animate still photos via a remote GPU-backed API (nemovideo) and only asks for a single API token (NEMO_TOKEN) and optional config path for nemovideo—both of which are reasonable for that purpose.
Instruction Scope
SKILL.md instructs the agent to create a session, upload images/audio, call SSE endpoints, and poll render status on the nemovideo API — all expected for this feature. It also instructs the agent to detect install path and read the skill's YAML frontmatter for attribution headers; this requires some filesystem inspection but is narrowly scoped. The runtime will upload user media to an external service, so user data leaves the device (expected, but privacy-relevant).
Install Mechanism
There is no install spec and no code files — instruction-only skills carry lower installation risk because nothing is downloaded or written by an installer.
Credentials
The skill declares a single credential (NEMO_TOKEN), which fits the cloud API usage. However, SKILL.md also describes an anonymous-token flow if NEMO_TOKEN is absent (POST to /api/auth/anonymous-token) and the frontmatter includes a configPath (~/.config/nemovideo/) while registry metadata reported none — this inconsistency should be clarified. The anonymous-token flow means the skill can operate without a pre-provided secret, and it will request or create a token at runtime.
Persistence & Privilege
The skill is not always-enabled, does not request elevated platform privileges, and does not claim to modify other skills or system-wide settings.
Assessment
This skill appears to do what it says: it uploads photos and audio to a nemovideo.ai backend to produce animated videos. Before installing, consider:

1. Privacy: any media you upload is sent to an external service; do not upload private or sensitive images without checking the service's terms.
2. Token behavior: the skill accepts NEMO_TOKEN but can also obtain an anonymous token automatically; decide whether you prefer your own token or an anonymous session.
3. Metadata mismatch: the skill's frontmatter lists a config path (~/.config/nemovideo/) while the registry metadata did not; confirm whether the agent will read that directory and what it may contain.
4. Attribution headers: the skill sets X-Skill-* headers based on the install path, so the agent may inspect certain home directories; if that concerns you, ask the skill author for clarification.

If you rely on sensitive data or need stricter privacy, avoid installing, or require a review of where tokens and media are stored and transmitted.

Like a lobster shell, security has layers — review code before you run it.

Runtime requirements

🗣️ Clawdis

  • Env: NEMO_TOKEN (primary)
  • Latest: vk977h1ndz82kp0pba7fkcbhgjh8491nb
  • 105 downloads · 0 stars · 1 version
  • Updated 3w ago
  • v1.0.0 · MIT-0

Getting Started

Welcome! With AI Talking Photo, you can turn any portrait into a speaking, animated video in moments. Upload your photo and audio clip (or tell me what you'd like the subject to say) and let's bring it to life!

Try saying:

  • "Animate this portrait with my audio"
  • "Make my headshot say this script"
  • "Create talking photo for social reel"

Quick Start Setup

This skill connects to a cloud processing backend. On first use, set up the connection automatically and let the user know ("Connecting...").

Token check: Look for NEMO_TOKEN in the environment. If found, skip to session creation. Otherwise:

  • Generate a UUID as client identifier
  • POST https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with X-Client-Id header
  • Extract data.token from the response — this is your NEMO_TOKEN (100 free credits, 7-day expiry)
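
A sketch of that handshake with curl, uuidgen, and jq (the data.token path comes from the step above; anything else about the response shape is assumed):

NEMO_TOKEN=$(curl -s -X POST https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token \
  -H "X-Client-Id: $(uuidgen)" \
  | jq -r '.data.token')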

Session: POST https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent with Bearer auth and body {"task_name":"project"}. Keep the returned session_id for all operations.
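
The same pattern works for session creation; note that where session_id sits in the response JSON is an assumption, since the spec only says it is returned:

SESSION_ID=$(curl -s -X POST https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent \
  -H "Authorization: Bearer $NEMO_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"task_name":"project"}' \
  | jq -r '.session_id')   # field location assumed; it may be nested under data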

Let the user know with a brief "Ready!" when setup is complete. Don't expose tokens or raw API output.

Make Any Photo Speak With One Upload

Static photos hold stories that never get told. AI Talking Photo changes that by turning a single still image into an animated, speaking portrait — no camera, no studio, no video shoot required. Whether it's a historical figure, a product mascot, a family member, or your own headshot, this skill breathes voice and movement into the image in seconds.

The process is straightforward: provide a clear face photo and either an audio file or a text script you want spoken. The skill analyzes the facial geometry, maps lip movements to the audio waveform, and generates a short video where the subject appears to genuinely speak. Subtle head motion, eye blinks, and micro-expressions are layered in to avoid the uncanny stiffness of early deepfake tools.

Creators use this for memorial tribute videos, branded spokesperson content, educational history lessons, social media reels, and interactive storytelling. If you can photograph a face, you can give it a voice — that's the core promise of AI Talking Photo.

Routing Animate Portrait Requests

When a user submits a still image with a voice or script input, the skill parses the facial detection parameters and animation style preferences before dispatching the job to the appropriate talking photo pipeline endpoint.

  • "export" / "导出" / "download" / "send me the video" → §3.5 Export (skips SSE)
  • "credits" / "积分" / "balance" / "余额" → §3.3 Credits (skips SSE)
  • "status" / "状态" / "show tracks" → §3.4 State (skips SSE)
  • "upload" / "上传" / user sends a file → §3.2 Upload (skips SSE)
  • Everything else (generate, edit, add BGM…) → §3.1 SSE

Talking Photo API Reference

The cloud processing backend handles facial landmark mapping, lip-sync synthesis, and expression blending on remote GPU clusters, meaning heavy rendering never touches the local device. Completed animated portrait outputs are returned as video streams or downloadable clips once the synthesis job finalizes.

Skill attribution — read from this file's YAML frontmatter at runtime:

  • X-Skill-Source: ai-talking-photo
  • X-Skill-Version: from frontmatter version
  • X-Skill-Platform: detect from the install path (~/.clawhub → clawhub, ~/.cursor/skills → cursor, otherwise unknown)

All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
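
For example, a credits check carrying the full header set might look like this (the version and platform values shown are illustrative; read them from the frontmatter and install path as described above):

curl -s https://mega-api-prod.nemovideo.ai/api/credits/balance/simple \
  -H "Authorization: Bearer $NEMO_TOKEN" \
  -H "X-Skill-Source: ai-talking-photo" \
  -H "X-Skill-Version: 1.0.0" \
  -H "X-Skill-Platform: clawhub"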

API base: https://mega-api-prod.nemovideo.ai

Create session: POST /api/tasks/me/with-session/nemo_agent — body {"task_name":"project","language":"<lang>"} — returns task_id, session_id.

Send message (SSE): POST /run_sse — body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} with Accept: text/event-stream. Max timeout: 15 minutes.
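
A sketch of that call; curl's -N disables buffering so SSE events stream through as they arrive, and --max-time matches the 15-minute ceiling (the message text is a placeholder):

# X-Skill-* attribution headers omitted for brevity
curl -N -X POST https://mega-api-prod.nemovideo.ai/run_sse \
  -H "Authorization: Bearer $NEMO_TOKEN" \
  -H "Accept: text/event-stream" \
  -H "Content-Type: application/json" \
  --max-time 900 \
  -d "{\"app_name\":\"nemo_agent\",\"user_id\":\"me\",\"session_id\":\"$SESSION_ID\",\"new_message\":{\"parts\":[{\"text\":\"Animate this portrait with my audio\"}]}}"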

Upload: POST /api/upload-video/nemo_agent/me/<sid> — file: multipart -F "files=@/path", or URL: {"urls":["<url>"],"source_type":"url"}
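
Both upload variants, sketched with placeholder file and URL values:

# multipart file upload (attribution headers omitted for brevity)
curl -s -X POST "https://mega-api-prod.nemovideo.ai/api/upload-video/nemo_agent/me/$SESSION_ID" \
  -H "Authorization: Bearer $NEMO_TOKEN" \
  -F "files=@portrait.jpg"

# URL-based upload
curl -s -X POST "https://mega-api-prod.nemovideo.ai/api/upload-video/nemo_agent/me/$SESSION_ID" \
  -H "Authorization: Bearer $NEMO_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls":["https://example.com/portrait.jpg"],"source_type":"url"}'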

Credits: GET /api/credits/balance/simple — returns available, frozen, total

Session state: GET /api/state/nemo_agent/me/<sid>/latest — key fields: data.state.draft, data.state.video_infos, data.state.generated_media

Export (free, no credits): POST /api/render/proxy/lambda — body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
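
A polling sketch for export. The draft is pulled from session state at data.state.draft (per the entry above); the .status and .output.url paths in the poll response are assumptions based on the field names the spec mentions:

DRAFT=$(curl -s "https://mega-api-prod.nemovideo.ai/api/state/nemo_agent/me/$SESSION_ID/latest" \
  -H "Authorization: Bearer $NEMO_TOKEN" | jq -c '.data.state.draft')
RENDER_ID="render_$(date +%s)"
curl -s -X POST https://mega-api-prod.nemovideo.ai/api/render/proxy/lambda \
  -H "Authorization: Bearer $NEMO_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"id\":\"$RENDER_ID\",\"sessionId\":\"$SESSION_ID\",\"draft\":$DRAFT,\"output\":{\"format\":\"mp4\",\"quality\":\"high\"}}"
until [ "$(curl -s "https://mega-api-prod.nemovideo.ai/api/render/proxy/lambda/$RENDER_ID" \
    -H "Authorization: Bearer $NEMO_TOKEN" | jq -r '.status')" = "completed" ]; do
  sleep 30   # poll every 30s per the spec above
done
curl -s "https://mega-api-prod.nemovideo.ai/api/render/proxy/lambda/$RENDER_ID" \
  -H "Authorization: Bearer $NEMO_TOKEN" | jq -r '.output.url'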

Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.

SSE Event Handling

  • Text response → apply GUI translation (§4) and present to the user
  • Tool call/result → process internally; don't forward
  • Heartbeat / empty data: → keep waiting; every 2 min, post "⏳ Still working..."
  • Stream closes → process the final response

~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.
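
A state-check sketch for those silent edits, using the field paths from the Session state entry above:

curl -s "https://mega-api-prod.nemovideo.ai/api/state/nemo_agent/me/$SESSION_ID/latest" \
  -H "Authorization: Bearer $NEMO_TOKEN" \
  | jq '{draft: .data.state.draft, tracks: .data.state.video_infos, media: .data.state.generated_media}'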

Backend Response Translation

The backend assumes a GUI exists. Translate these into API actions:

  • "click [button]" / "点击" → execute via API
  • "open [panel]" / "打开" → query session state
  • "drag/drop" / "拖拽" → send the edit via SSE
  • "preview in timeline" → show a track summary
  • "Export button" / "导出" → run the export workflow

Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.

Example timeline summary (3 tracks):

  1. Video: city timelapse (0-10s)
  2. BGM: Lo-fi (0-10s, 35%)
  3. Title: "Urban Dreams" (0-3s)
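
As a purely illustrative sketch of what a draft for that timeline might look like: only the abbreviated key names come from the mapping above, and everything else about the structure is assumed, since the full draft schema is not documented here.

# hypothetical draft fragment; structure beyond the t/tt/sg/d/m key names is assumed
DRAFT='{"t":[
  {"tt":0,"sg":[{"d":10000,"m":{"label":"city timelapse"}}]},
  {"tt":1,"sg":[{"d":10000,"m":{"label":"Lo-fi","volume":0.35}}]},
  {"tt":7,"sg":[{"d":3000,"m":{"text":"Urban Dreams"}}]}
]}'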

Error Handling

  • 0 (Success): continue
  • 1001 (Bad/expired token): re-auth via anonymous-token; tokens expire after 7 days
  • 1002 (Session not found): create a new session (§3.0)
  • 2001 (No credits): for anonymous users, show the registration URL with ?bind=<id> (take <id> from the create-session or state response); for registered users, say "Top up credits in your account"
  • 4001 (Unsupported file): show the supported formats
  • 4002 (File too large): suggest compressing or trimming
  • 400 (Missing X-Client-Id): generate a Client-Id and retry (see §1)
  • 402 (Free plan export blocked): a subscription tier issue, NOT credits; say "Register or upgrade your plan to unlock export."
  • 429 (Rate limit, 1 token/client/7 days): retry once after 30s
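
A minimal dispatch sketch over that table, assuming the API reports its status in a top-level code field (the envelope shape and the helper functions are hypothetical):

handle_code() {
  case "$1" in
    0)    : ;;                       # success: continue
    1001) refresh_anonymous_token ;; # hypothetical helper re-running the §1 token flow
    1002) create_session ;;          # hypothetical helper re-running session creation
    429)  sleep 30 ;;                # then retry the request once
    *)    echo "API error $1" >&2 ;;
  esac
}

CODE=$(echo "$RESPONSE" | jq -r '.code // 0')   # top-level .code field is an assumption
handle_code "$CODE"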

Best Practices

The quality of your output is almost entirely determined by the quality of your input photo. Avoid images with heavy filters, strong side-lighting, or partial face occlusion — these confuse the facial landmark detection and produce jittery or misaligned lip movements. A neutral expression in the source photo gives the animation engine the most flexibility to map a wide range of speech sounds accurately.

Keep audio clips clean and free of background music during the lip-sync generation phase. If your final video needs music, add it as a separate layer after the talking photo is rendered. This prevents the model from misreading musical frequencies as speech phonemes.

For emotional impact — especially in memorial or tribute videos — choose audio that is paced naturally and not too fast. Rapid speech compresses lip movements and reduces the realism of the animation. A speaking rate of 120–150 words per minute tends to yield the most convincing results. Finally, always review the generated video before publishing; small manual trims at the start and end of the clip can remove any initialization frames where the face hasn't yet settled into the animation.

Integration Guide

Getting started with AI Talking Photo requires just two inputs: a face image and an audio source. For best results, use a front-facing photo where the subject's mouth and eyes are clearly visible, unobstructed by hands, masks, or extreme angles. JPEG and PNG formats work well; aim for at least 512×512 pixels to preserve animation quality.

For audio, you can supply an MP3, WAV, or M4A file up to 60 seconds, or simply paste a text script and select a voice style; the skill will synthesize the speech internally before animating the photo. If you're embedding the output in a website or presentation, request MP4 for general playback, or an alpha-capable format such as WebM or MOV when you need a transparent background for overlay use (standard MP4/H.264 does not carry transparency).

When building workflows — such as auto-generating spokesperson videos from a CMS or producing personalized video messages at scale — pass the image URL and script text as variables. The skill returns a video URL or file you can route directly into your delivery pipeline, email platform, or social scheduler.
