# Workflow Recommendations

When user requirements exceed a single mode's limits, **suggest combining modes or switching to a more capable model** before starting. Ask the user to confirm the approach.

## Duration > 5s

kf2v (first+last frame) and vace are capped at **5s**. If the user wants a longer video:

1. **Preferred**: Use **wan2.6-t2v** (up to 15s) or **wan2.6-i2v** (up to 15s) which natively support longer durations. If the user has a first frame image, i2v is a direct replacement for kf2v with longer duration support.
2. **If kf2v is essential** (user needs both first AND last frame control): Generate the 5s kf2v video first, then extend it using **vace `video_extension`** — trim the kf2v output to a ≤3s tail clip and use it as `first_clip_url` to generate the next 5s segment. Repeat for additional segments, then concatenate all segments (see [merge-media.md](merge-media.md)).
3. **For any mode**: To produce videos >15s, chain multiple generations — use the last frame or tail clip of each segment as input for the next. Concatenate final segments (see [merge-media.md](merge-media.md)). Inform the user this is a multi-step workflow.

## Improving kf2v Image Fidelity

kf2v generates a 5s transition between first and last frames. By design the model **interpolates motion** between the two frames — it respects the visual content of both frames but heavily relies on the prompt to determine *how* the transition happens. If generated video doesn't match the input images well:

1. **Turn off `prompt_extend`** — Smart rewriting can drift from original intent. Set `"prompt_extend": false` to keep the model closer to your literal description.
2. **Write transition-focused prompts** — Instead of describing static scenes, describe the *motion/change*: "camera gradually rises from eye level", "person slowly turns around". The prompt should bridge the visual gap between the two frames.
3. **Minimize visual difference** between frames — If first and last frame are radically different (different scene, different character), the model may produce artifacts. Keep frames within the same scene/subject.
4. **Use higher resolution** — `"resolution": "1080P"` (wan2.2-kf2v-flash) produces more detail-faithful output than 480P/720P.
5. **Consider alternatives when image fidelity matters most**:
   - **i2v (wan2.6-i2v)** — Uses only the first frame but respects it more faithfully, supports up to 15s + audio. Best when the starting image must be preserved exactly.
   - **vace `image_reference`** — Lets you mark images as subject (`obj`) or background (`bg`), giving explicit control over how each image is referenced.
   - **r2v** — For character-consistent multi-reference. Preserves character identity across scenes.

## Audio Needed but Mode is Silent-Only

kf2v and vace produce **silent video only**. If the user also wants audio:

1. **Switch mode**: If the user can relax the first+last frame constraint, recommend **wan2.6-i2v** (first-frame + audio, up to 15s) or **wan2.6-t2v** (text + audio, up to 15s). These support `audio_url` for custom audio or auto-dubbing.
2. **Post-process**: Generate the silent video first, then add an audio track (see [merge-media.md](merge-media.md) for ffmpeg/moviepy recipes), or use a separate TTS step (see **qwencloud-audio-tts** for speech synthesis) to generate the audio file first.
3. **Always inform the user** about the silent limitation before generating, and propose the alternative.

## Multi-Shot Narrative (Cinematic Storytelling)

wan2.6 models (t2v, i2v, r2v) natively support **multi-shot** video — multiple camera angles and scenes in a single generation, up to 15s. This is the recommended approach for cinematic storytelling.

**How to use:**

1. Set `shot_type: "multi"` AND `prompt_extend: true` in the request.
2. Structure the prompt using `第N个镜头[Xs-Ys]` format (Chinese) or describe sequential scenes clearly:

```
第1个镜头[0-3秒] 雨夜街头，侦探快步前行，镜头从远处跟随。
第2个镜头[3-5秒] 侦探推开老旧建筑的大门，镜头推近特写。
第3个镜头[5-8秒] 室内昏暗灯光下，侦探发现一封信，低头阅读。
```

3. For multi-character stories, combine with **r2v** mode: provide reference videos/images for each character, use `character1`/`character2` in prompt.

**Limits:** Single generation max 15s. For longer narratives, generate multiple multi-shot segments and chain them (use the last frame of segment N as the first frame of segment N+1 via i2v).

## VACE Post-Processing Pipeline

VACE functions can be chained as **post-processing steps** after initial video generation. Common pipelines:

| Pipeline | Steps | Use case |
|----------|-------|----------|
| Generate + Extend | t2v/i2v → vace `video_extension` (×N) | Make a 5s video into 10s, 15s, ... |
| Generate + Outpaint | t2v/i2v → vace `video_outpainting` | Expand 16:9 to 9:16 (horizontal to vertical) |
| Generate + Edit | t2v/i2v → vace `video_edit` | Replace a character or remove an object |
| Generate + Repaint | t2v/i2v → vace `video_repainting` | Change art style while keeping motion |
| kf2v + Extend + Audio | kf2v → vace `video_extension` → concatenate segments → add audio | Controlled transition, longer, with sound |

**Chaining rules:**
- Each VACE step outputs ≤5s, 720P, silent video.
- For `video_extension`: input clip must be ≤3s. If extending a 5s video, trim the last 2-3s to prepare `first_clip_url` (see [merge-media.md](merge-media.md)).
- Download each intermediate video (URL expires in 24h) before feeding it to the next step.
- After all segments are generated, concatenate them and optionally overlay audio (see [merge-media.md](merge-media.md)).
- Inform the user of the full plan and get confirmation before starting a multi-step pipeline.

## Complex Multi-Step Video Projects

For requests like "generate a kf2v transition, extend it to 15s, and add narration":

1. Plan the full pipeline with estimated step count and total time
2. Present the plan to the user and get confirmation
3. Execute one step at a time, saving intermediate outputs to the session output directory
4. After all video steps complete, concatenate segments and add audio if needed (see [merge-media.md](merge-media.md))