Install
openclaw skills install controlfoley-audio-generatorA multi-functional audio generation tool for SFX generation, video-to-audio and text-to-audio. 多功能音频生成工具,集成可控视频生成音频、文本生成音频等功能.
openclaw skills install controlfoley-audio-generatorA multi-functional audio generation tool powered by the ControlFoley model, integrating video sound effect (SFX) generation, video background music composition, text-to-audio and other functions to realize diversified creative audio generation.
This tool supports four modes: Video-to-Audio (V2A), Text-Controlled Video-to-Audio (TC-V2A), Audio-Controlled Video-to-Audio (AC-V2A), and Text-to-Audio (T2A).
| Field | Value |
|---|---|
| Service Operator | Xiaomi LLM Plus Team |
| API Endpoint | https://controlfoley.ai.xiaomi.com |
| Open Source Repo | https://github.com/xiaomi-research/controlfoley |
| Project Page | https://yjx-research.github.io/ControlFoley_web_page/ |
| Online Demo | https://yjx-research.github.io/ControlFoley_web_page/#try-gen |
| Model Weights | https://huggingface.co/YJX-Xiaomi/ControlFoley/ |
| API Key | Not required |
| Script Path | scripts/foley.py |
python3 --version # Python 3.x
curl --version # curl for API submission
ffmpeg -version # optional, for audio format conversion
| Mode | Command | Input | Output | Description |
|---|---|---|---|---|
| V2A | v2a video.mp4 | Video file | .mp4 + .flac | Generate audio matching the video content |
| TC-V2A | v2a video.mp4 --prompt "text" | Video + text | .mp4 + .flac | Generate audio aligned with text prompts while staying synchronized with the video |
| AC-V2A | v2a video.mp4 --ref-audio ref.wav | Video + reference audio | .mp4 + .flac | Generate audio with timbre matching reference audio while staying synchronized with the video |
| T2A | t2a "prompt" | Text description | .flac | Generate audio from text descriptions |
python3 scripts/foley.py t2a "dog barking loudly in a park"
python3 scripts/foley.py v2a input.mp4
python3 scripts/foley.py v2a input.mp4 --prompt "footsteps on gravel with birds chirping"
python3 scripts/foley.py v2a input.mp4 --ref-audio reference.wav
python3 scripts/foley.py t2a "A mountain stream murmurs, its gentle current lapping against the pebbles." --duration 15
python3 scripts/foley.py t2a "cat purring softly" --count 3
python3 scripts/foley.py t2a "rain on a tin roof" --seed 42
python3 scripts/foley.py models
curl -X POST "https://controlfoley.ai.xiaomi.com/api/v1/v2a/submit" -F "file=@video_path" -F "prompt=footsteps on gravel with birds chirping"
return
{"taskId": "xxx", "message": "Task submitted successfully"}
curl -X GET "https://controlfoley.ai.xiaomi.com/api/v1/v2a/models"
return
{"models":[{"name":"ControlFoley","enabled":true}]}
curl -X GET "https://controlfoley.ai.xiaomi.com/api/v1/v2a/status/{taskId}"
return
{"urls":["{Domain name}/ControlFoley_output/{taskId}/{filename}"],"status":"success","done":true}
{"status":"processing","done":false}
{"status":"pending","queue_pos":1,"queue_position":1,"total_queue":2,"done":false}
curl -X GET "https://controlfoley.ai.xiaomi.com/ControlFoley_output/{taskId}/{filename}" --output ./output.flac
curl -X GET "https://controlfoley.ai.xiaomi.com/api/v1/v2a/status_download/{taskId}" --output-dir ./output --output audio.zip
| Parameter | Description | Default | Example |
|---|---|---|---|
prompt | Audio description text (required) | — | "dog barking in park" |
--model | Model ID | ControlFoley | --model ControlFoley |
--duration | Audio length in seconds (max 30) | 8 | --duration 15 |
--negative | Negative prompt to exclude unwanted sounds | — | --negative "noise, human voice" |
--cfg | CFG strength — higher = stricter prompt adherence | 4.5 | --cfg 6.0 |
--count | Number of variants to generate (1–5) | 1 | --count 3 |
--seed | Fixed random seed for reproducibility | — | --seed 42 |
-o/--outdir | Output directory | ./output | -o ./my_audio |
| Parameter | Description | Default | Example |
|---|---|---|---|
video | Input video path (required) | — | input.mp4 |
--model | Model ID | ControlFoley | --model ControlFoley |
--prompt | Text prompt to guide audio generation (TC-V2A) | — | --prompt "keyboard tapping" |
--negative | Negative prompt to exclude unwanted sounds | — | --negative "music, noise" |
--ref-audio | Reference audio file for timbre control (AC-V2A) | — | --ref-audio reference.wav |
--cfg | CFG strength | 4.5 | --cfg 7.0 |
--count | Number of variants to generate (1–5) | 1 | --count 2 |
--seed | Fixed random seed (not forwarded to API currently) | — | --seed 42 |
-o/--outdir | Output directory | ./output | -o ./results |
"cat footsteps on wooden floor" beats "cat sound"--negative "human voice, music, noise" to filter unwanted audio.flac (44100 Hz, lossless).mp4 (original video + generated audio track)--outdir, paths printed to stdoutConvert to MP3 for sharing:
ffmpeg -i output.flac -codec:a libmp3lame -qscale:a 2 output.mp3
| Issue | Cause | Fix |
|---|---|---|
| Internal URL inaccessible | Result URL uses .xiaomi.srv internal domain | Script auto-falls back to /api/v1/v2a/ControlFoley_output/{task_id}/{filename} |
| Queue busy | Task is waiting | Script auto-polls up to ~5 min; check load via curl $API_BASE/health |
| Model unavailable | Model not enabled | Run foley.py models to see available models |
| Task timeout | Service overloaded | Resubmit the task |
See ./references/api-reference.md for full endpoint documentation.
https://controlfoley.ai.xiaomi.com