🫧 Image-to-Video — Pro Pack on RunComfy

v0.1.2


3 versions · Updated 5h ago · MIT-0
by Kalvin (@kalvinrv)

runcomfy.com · docs · Image-to-video models

Image-to-video generation on RunComfy. This skill is the canonical image-to-video entry point for the RunComfy Model API: give it a still image and a motion description, and it returns a short video clip. Any image — portrait, product photo, environment, illustration — can be turned into a video, with the motion driven by your prompt.

What "image-to-video" means here

Image-to-video (often abbreviated i2v or image2video) is the task of generating a short video starting from a single still image. The image fixes the look — face, wardrobe, product, scene geometry — and the prompt drives the motion. Image-to-video is distinct from text-to-video (no input image) and from video-to-video (which transforms an existing clip).

Image-to-video on RunComfy supports three patterns:

  • General image-to-video: animate any still — portrait drift, product reveal, environment motion, illustration coming alive. The default pipeline.
  • Lip-sync image-to-video: a custom voiceover drives mouth movement on a generated talking-head clip. Input: image + audio. Output: lip-synced video.
  • Multi-modal image-to-video: combine a subject image + reference scene video + reference voice audio into one output.

This skill picks the right image-to-video endpoint for the user's intent and calls runcomfy run <endpoint> with the matching schema.

When to use image-to-video on RunComfy

Pick image-to-video on RunComfy whenever:

  • You have a still image and want it to move — image-to-video is the right task.
  • You want identity-stable output — the face / product / brand from your input image must survive into the video.
  • You want fast iteration — RunComfy hosts the GPU; you don't deploy or rent.
  • You're building at scale — multi-language dubs, multi-shot sequences, batch jobs.

If the user said "image to video", "i2v", "animate this image", "image2video", "make a video from this", or showed an image and asked for video — route here.

Image-to-video routes

| User intent | Model | Why |
| --- | --- | --- |
| Default image-to-video — portraits, products, environments | `happyhorse/happyhorse-1-0/image-to-video` | #1 on Arena (Elo 1392 i2v); strong identity preservation; native synchronized audio |
| Image-to-video with custom voiceover lip-sync | `wan-ai/wan-2-7/text-to-video` + `audio_url` | Drives lip-sync from your audio file |
| Multi-modal image-to-video (image + ref video + ref audio) | `bytedance/seedance-v2/pro` | Multi-input: up to 9 image refs and 3 audio refs |

The agent reads this table, classifies the user's image-to-video intent, and picks the matching endpoint.
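As a rough illustration, the routing step can be sketched as keyword matching over the table. This is a naive sketch only: the keyword patterns below are invented for illustration, and a real agent would classify intent far more carefully than substring matching.

```shell
# Naive keyword router over the routes table. Patterns are illustrative,
# not exhaustive; a real agent uses richer intent classification.
route() {
  case "$1" in
    *voiceover*|*lip*sync*)    echo "wan-ai/wan-2-7/text-to-video" ;;
    *multi-modal*|*ref*video*) echo "bytedance/seedance-v2/pro" ;;
    *)                         echo "happyhorse/happyhorse-1-0/image-to-video" ;;
  esac
}
route "animate this portrait"            # default route
route "add a custom voiceover, lip-sync" # Wan 2.7 route
```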

Prerequisites

  1. RunComfy CLI — npm i -g @runcomfy/cli
  2. RunComfy account — runcomfy login opens a browser device-code flow.
  3. CI / containers — set RUNCOMFY_TOKEN=<token>.
  4. A source image URL — JPEG/PNG/WebP, min 300px, ≤10MB; aspect 1:2.5 to 2.5:1 for the default image-to-video model.
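The steps above, condensed into a CI-oriented setup sketch. The token value is a placeholder, not a real credential, and the interactive steps are shown as comments because they need npm and a browser:

```shell
# One-time interactive setup (commented out; requires npm and a browser):
#   npm i -g @runcomfy/cli
#   runcomfy login
# In CI / containers, export a token instead of logging in.
# The value below is a placeholder, not a real token:
export RUNCOMFY_TOKEN="rc_placeholder_token"
# Sanity check: either the env token or the login token file must exist.
if [ -n "${RUNCOMFY_TOKEN:-}" ] || [ -f "$HOME/.config/runcomfy/token.json" ]; then
  echo "credential available"
else
  echo "no RunComfy credential found" >&2
fi
```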

Default image-to-video — HappyHorse 1.0 i2v

The default image-to-video endpoint. Use for any general image-to-video task: portrait drift, product reveal, environment motion, character animation. Image-to-video output includes synchronized audio in the same generation pass.

Schema

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| `image_url` | string | yes | — | Source still. JPEG/PNG/WebP, min 300px, aspect 1:2.5–2.5:1, ≤10MB. |
| `prompt` | string | yes | — | Motion / camera / lighting description. ≤5000 chars. |
| `resolution` | enum | no | `1080P` | `720P` or `1080P`. |
| `duration` | int | no | `5` | 3–15 seconds per clip. |
| `seed` | int | no | `0` | Reuse for variant comparisons. |
| `watermark` | bool | no | `true` | Provider watermark on output. |

Output aspect of the image-to-video clip equals input image aspect.

Invoke

runcomfy run happyhorse/happyhorse-1-0/image-to-video \
  --input '{
    "image_url": "https://.../portrait.jpg",
    "prompt": "Gentle camera drift around the subject'\''s face, subtle breathing motion, identity-stable features, soft natural light."
  }' \
  --output-dir <absolute/path>

Lip-sync image-to-video — custom voiceover

When the image-to-video output needs to lip-sync to a custom audio track, use Wan 2.7 with audio_url. The image-to-video clip is generated around your voiceover so mouth movement matches.

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| `prompt` | string | yes | Describe the talking-head shot. |
| `audio_url` | string | yes | WAV/MP3, 3–30s, ≤15MB. Drives the lip-sync. |
| `aspect_ratio` | enum | no | `16:9`, `9:16`, `1:1`, `4:3`, `3:4`. |
| `resolution` | enum | no | `720p` or `1080p`. |
| `duration` | enum | no | 2–15 seconds. Match the audio length for clean lip-sync. |

runcomfy run wan-ai/wan-2-7/text-to-video \
  --input '{
    "prompt": "Medium close-up, soft key light, locked tripod, shallow DOF.",
    "audio_url": "https://.../voiceover-en.mp3",
    "duration": 12,
    "aspect_ratio": "9:16"
  }' \
  --output-dir <absolute/path>

For multi-language image-to-video dubs: same prompt, swap audio_url per call, lock seed for visual consistency across all image-to-video outputs.
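That loop can be sketched as below. It prints the commands for review rather than executing them; the URLs are placeholders, and note that `seed` follows the advice above but is not listed in the schema table, so verify the endpoint accepts it before relying on it.

```shell
# Multi-language dub sketch: same prompt, one call per audio track.
# Prints the commands instead of running them. URLs are placeholders;
# `seed` is assumed to be accepted (it is not in the schema table above).
PROMPT="Medium close-up, soft key light, locked tripod, shallow DOF."
cmds=$(for lang in en es de; do
  echo runcomfy run wan-ai/wan-2-7/text-to-video \
    --input "{\"prompt\":\"$PROMPT\",\"audio_url\":\"https://example.com/voiceover-$lang.mp3\",\"duration\":12,\"seed\":42}" \
    --output-dir "/tmp/dubs/$lang"
done)
printf '%s\n' "$cmds"
```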

Multi-modal image-to-video — image + ref video + ref audio

When the image-to-video output should fuse a subject image with a scene reference and voice reference, use Seedance 2.0 Pro. Multi-modal image-to-video accepts up to 9 image refs.

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| `prompt` | string | yes | Description of the output. EN ≤1000 words. |
| `image_url` | array | yes | 0–9 source images. First is the primary subject. |
| `video_url` | array | no | 0–3 reference clips (2–15s each) for scene cues. |
| `audio_url` | array | no | 0–3 reference audio files (2–15s, <15MB each) for voice cues. |
| `duration` | int | no | 4–15 seconds. |
| `resolution` | enum | no | `480p` or `720p`. |

runcomfy run bytedance/seedance-v2/pro \
  --input '{
    "prompt": "Subject from image 1 walks through the scene from video 1, voice from audio 1.",
    "image_url": ["https://.../subject.jpg"],
    "video_url": ["https://.../scene.mp4"],
    "audio_url": ["https://.../voice.mp3"],
    "duration": 8
  }' \
  --output-dir <absolute/path>

Prompting image-to-video — what works

Image-to-video prompts behave differently from text-to-video prompts. The image already fixes the look — your prompt should drive motion, not redescribe the image.

  • Lead with motion verbs. "drift", "dolly in", "orbit", "tilt up", "blink", "breathe" — front-load what's MOVING.
  • Don't restate the image. The model sees the input. Spend tokens on what changes, not what already exists.
  • Make preservation goals explicit. "identity-stable features", "packaging unchanged", "background geometry stable" — tell the model what NOT to change.
  • One beat per clip. Single primary motion (orbit OR dolly OR tilt OR character action). Compound motion drifts.
  • Evolve the lighting. "rim light intensifying", "shadows shortening as camera rises" — the models read lighting cues well.
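Putting those rules together, a motion-led input for the default route might look like the sketch below. The image URL is a placeholder and the command is printed for review rather than executed:

```shell
# A motion-led prompt: one camera beat, explicit preservation goals, a
# lighting cue, and no restatement of what the image already shows.
# Placeholder URL; the command is printed, not run.
cmd=$(echo runcomfy run happyhorse/happyhorse-1-0/image-to-video \
  --input '{
    "image_url": "https://example.com/bottle.jpg",
    "prompt": "Slow 90-degree orbit around the bottle, rim light intensifying, packaging unchanged, background geometry stable."
  }' \
  --output-dir /tmp/out)
printf '%s\n' "$cmd"
```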

Image-to-video FAQ

What's the max duration of an image-to-video clip? 15 seconds across all image-to-video routes here. For longer image-to-video sequences, generate multiple clips and stitch.
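For the stitch step, one common approach (outside the RunComfy CLI) is ffmpeg's concat demuxer. The clip names below are placeholders, and the final ffmpeg command is printed rather than run since the files don't exist yet:

```shell
# Stitch sketch using ffmpeg's concat demuxer. Stream copy (-c copy) is
# lossless but requires all clips to share codec, resolution, and fps.
# Clip names are placeholders; the ffmpeg command is printed, not run.
cat > clips.txt <<'EOF'
file 'clip-01.mp4'
file 'clip-02.mp4'
file 'clip-03.mp4'
EOF
echo ffmpeg -f concat -safe 0 -i clips.txt -c copy stitched.mp4
```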

What image formats does image-to-video accept? JPEG, PNG, WebP. Min 300px, ≤10MB, aspect 1:2.5 to 2.5:1.

Does image-to-video preserve face identity? Yes — the default image-to-video model has strong identity preservation. For best identity hold, the face should fill at least 5% of the frame in the input image.

Can image-to-video include audio? Yes. The default image-to-video model generates synchronized audio in the same pass. The lip-sync image-to-video route accepts your custom audio. The multi-modal image-to-video route accepts reference audio.

Image-to-video vs text-to-video on RunComfy? Image-to-video starts from your image (look fixed). Text-to-video starts from your prompt only (look generated). Use image-to-video when you have an exact reference; use text-to-video for novel content.

Image-to-video output resolution? 720p or 1080p depending on the route.

Limitations

  • Image-to-video clip length is 15s per call. Longer image-to-video output requires stitching multiple calls.
  • Image-to-video output aspect = input image aspect on the default route. For independent reframing, crop the input first.
  • Image-to-video doesn't blend across routes in one call. If you need multi-modal image-to-video + custom voiceover lip-sync in one clip, that's two image-to-video calls plus a stitch.

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | image-to-video succeeded |
| 64 | bad CLI args |
| 65 | bad input JSON / schema mismatch |
| 69 | upstream 5xx |
| 75 | retryable: timeout / 429 |
| 77 | not signed in or token rejected |

Full reference: docs.runcomfy.com/cli/troubleshooting.
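Of these, only exit code 75 is worth retrying automatically. A small wrapper sketch, with a stub command that simulates one 75 failure followed by success (in real use you would wrap the actual `runcomfy run` invocation):

```shell
# Retry only on exit code 75 (timeout / 429); any other failure is final.
run_with_retry() {
  attempt=1
  max=3
  while true; do
    "$@" && return 0
    code=$?
    if [ "$code" -ne 75 ] || [ "$attempt" -ge "$max" ]; then
      return "$code"
    fi
    sleep "$attempt"          # crude linear backoff
    attempt=$((attempt + 1))
  done
}

# Demo with a stub that fails once with 75, then succeeds.
# Real use: run_with_retry runcomfy run <endpoint> --input "$JSON" --output-dir "$DIR"
FAILS=0
flaky() {
  FAILS=$((FAILS + 1))
  if [ "$FAILS" -ge 2 ]; then return 0; else return 75; fi
}
run_with_retry flaky && echo "succeeded after $FAILS attempts"
# prints "succeeded after 2 attempts"
```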

How it works

The skill picks one of three image-to-video endpoints based on user intent (general image-to-video, lip-sync image-to-video, or multi-modal image-to-video) and invokes runcomfy run <endpoint> with the matching JSON body. The CLI POSTs to the RunComfy Model API, polls the image-to-video request status every 2 seconds, and downloads the resulting image-to-video file from the *.runcomfy.net / *.runcomfy.com URL into --output-dir. Ctrl-C cancels the in-flight image-to-video request.

Security & Privacy

  • Token storage: runcomfy login writes the API token to ~/.config/runcomfy/token.json with mode 0600. Set RUNCOMFY_TOKEN env var to bypass the file in CI.
  • Input boundary: the image-to-video prompt is passed as JSON via --input. The CLI does NOT shell-expand. No shell-injection surface.
  • Third-party content: image / video / audio URLs are fetched by the RunComfy server. Treat external URLs as untrusted; image-based prompt injection is a known risk for any image-to-video model.
  • Outbound endpoints: only model-api.runcomfy.net and *.runcomfy.net / *.runcomfy.com. No telemetry.
  • Generated-file size cap: the CLI aborts any image-to-video download > 2 GiB.

Version tags

latest: vk972663m17hsnqswg484k4r74d85ryr9

Runtime requirements

  • Bins: runcomfy
  • Env: RUNCOMFY_TOKEN
  • Config: ~/.config/runcomfy