Install
openclaw skills install music-craft
Generate music through a disciplined OpenClaw-native workflow. Use when producing songs, instrumentals, or lyrics-driven tracks with structure, anti-sparse prompt engineering, and quality verification. Provider-agnostic: works with any music backend the OpenClaw runtime exposes.
Treat music generation as a small, controlled iteration loop, not a single "press button, get song" call.
The normal loop is:
For deep prompt engineering, lyrics structure, and the full user-preference decision table, see the linked references at the end.
Use this skill when the task involves:
- audio_duration as a parameter, so you can ask for exactly 30s/60s/180s/210s/600s. The music-craft-minimax skill (mmx backend) has no native duration control; see that skill's "Song length" section.
Do not use this skill when:
- music-craft-minimax
Use this skill unless the request explicitly needs a MiniMax-only path:
- --avoid, --bpm, --key, or --structure, switch to music-craft-minimax.
- standard unless user requests otherwise.
- xl-mixed with the caveat that the 50-step sft model output quality is currently poor on 24 GB M3 (high-frequency noise, unclear vocals). Recommend standard tier for known-good output unless the user wants to experiment with the fix list in the next section.
- thinking=true is set. The LM will use these as anchors. If the user hasn't provided them, infer sensible defaults (e.g. 96 BPM for dream pop, "D major" if the prompt mentions a key) before submitting. For xl-sft (xl-mixed tier), detailed metas are essential; for the standard v15-turbo they're optional but improve consistency.
This skill is provider-agnostic by design. It works with whatever music backend is available: a native music_generate tool exposed by the runtime, or a CLI like mmx invoked via bash. It does not assume any specific provider, model, or API.
Three rules drive every generation:
- [Verse], [Chorus], [Break], and similar tags to give the generator a clear shape.
This skill is agent-neutral. It uses whatever music backend is available (a native tool or a CLI) in the active runtime.
It does not require:
- mmx or other)
If a more capable backend is installed, the music-craft-minimax skill unlocks cover workflow, separate parameter flags, and emotion-driven mashups. This skill is the entry point; that one is the power-user upgrade.
The OpenClaw runtime exposes several free tools that enrich the music generation workflow. None of these require user-side installation — they are part of the runtime, and the skill can call them directly to gather more context about the user's request before building the prompt.
| Tool | Purpose | When to use |
|---|---|---|
| web_fetch | Fetch readable content from any URL | Lyrics pages, YouTube watch pages, Wikipedia, artist bios, music blogs |
| web_search | Search the web with a query | Find lyrics when only the title is known, find artist info, find genre descriptions |
| image (and MiniMax__understand_image) | Analyze an image | Album artwork style cues, concert photo mood, music video screenshots |
| memory_search / memory_get | Recall from the user's durable memory | Previous music preferences, prior generation issues, typical genres |
| browser | Drive a real browser | JS-heavy lyrics sites (genius.com dynamic loading); fallback when web_fetch returns only chrome |
- web_fetch
- web_search, then web_fetch the top result
- image analysis
- memory_search first
- web_fetch returns only chrome (no content) → browser as fallback
"Make a song like 'Bohemian Rhapsody'":
- web_search "Bohemian Rhapsody structure analysis" → pick a music theory blog.
- web_fetch the blog → extract: multi-section, operatic, dramatic dynamics, ~6 min.
"Make a song like this YouTube video: [URL]":
- web_fetch the YouTube watch page.
- web_search for "[channel] genre style" to confirm.
User attaches album art: "Make something with this vibe":
- MiniMax__understand_image on the image.
- references/style-categories.md).
"I want a song in the style of 80s Italo disco":
- web_search "Italo disco characteristics".
- web_fetch Wikipedia or a music blog.
- music-craft-minimax for the audio path).
For the deep dive on each free tool, parameters, and edge cases, see references/free-tool-inputs.md.
Before starting the workflow loop, verify the runtime can do the work. This skill has zero external dependencies — the only requirement is a music generation backend (native tool or CLI). See the Required check in the Pre-Flight section.
Some setups (local models, audio analysis) need installs, large downloads, or — on corporate machines — changes to the certificate trust store. Apply this protocol every time:
- OUTPUT_DIR) and use a per-song subfolder. Do not invent an output path silently.
- references/windows-wsl-setup.md, which encodes these gates for the WSL2 + corporate-proxy path.
Identify the host OS so the install commands later use the right package manager.
# POSIX (Linux, macOS, WSL, Git Bash)
uname -s # "Linux" or "Darwin"
# Windows PowerShell
$env:OS # "Windows_NT"
# Windows cmd
ver
Then identify the available package manager (in priority order):
| OS | Package managers (in priority order) |
|---|---|
| Ubuntu / Debian / Mint | apt |
| Fedora / RHEL / Rocky | dnf (legacy: yum) |
| Arch / Manjaro | pacman |
| Alpine | apk |
| macOS | brew (install from brew.sh if missing) |
| Windows | winget (built into Windows 10 2004+), then choco (Chocolatey), then scoop |
A useful one-liner to detect the active manager:
# POSIX
command -v apt dnf pacman apk brew 2>/dev/null | head -1
# Windows PowerShell
Get-Command winget, choco, scoop -ErrorAction SilentlyContinue | Select-Object -First 1
If the agent is running inside WSL, treat it as Linux (use apt). If it is running inside a non-standard environment (container, Codespace, dev container), ask the user which base image they are on before proposing install commands.
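The same priority-order lookup can be sketched in Python. This is a minimal illustration, not part of the skill: the function name `first_available` and the injectable `which` parameter are assumptions made here so the logic can be exercised without touching the host.

```python
import shutil
from typing import Callable, Optional, Sequence

# Priority order copied from the package-manager table above.
POSIX_MANAGERS = ("apt", "dnf", "pacman", "apk", "brew")
WINDOWS_MANAGERS = ("winget", "choco", "scoop")


def first_available(
    candidates: Sequence[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first candidate found on PATH, or None if none is installed."""
    for name in candidates:
        if which(name):
            return name
    return None
```

Calling `first_available(POSIX_MANAGERS) or first_available(WINDOWS_MANAGERS)` yields one manager name or `None`; a `None` result is the cue to ask the user about their environment instead of guessing an install command.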
Before installing any backend, gather the user's preferences and confirm hardware. The skill works on any modern machine — not just Apple Silicon MacBooks. Auto-detect first, then confirm critical values with the user.
Run this on the user's machine to detect platform, RAM, disk, and existing installs:
echo "=== Platform ==="
uname -srm
echo ""
echo "=== RAM ==="
case "$(uname -s)" in
Darwin) sysctl -n hw.memsize | awk '{printf "%.0f GB\n", $1/1024/1024/1024}' ;;
Linux) awk '/MemTotal/{printf "%.0f GB\n", $2/1024/1024}' /proc/meminfo ;;
*) echo "unknown (Windows: run systeminfo | findstr Memory)" ;;
esac
echo ""
echo "=== CPU chip ==="
case "$(uname -s)" in
Darwin) sysctl -n machdep.cpu.brand_string 2>/dev/null ;;
Linux) grep -m1 'model name' /proc/cpuinfo | cut -d: -f2 | xargs ;;
*) echo "unknown" ;;
esac
echo ""
echo "=== GPU ==="
if [ "$(uname -m)" = "arm64" ] && [ "$(uname -s)" = "Darwin" ]; then
echo "Apple Silicon (MPS available for MLX)"
elif command -v nvidia-smi >/dev/null 2>&1; then
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null
else
echo "No GPU detected (CPU only)"
fi
echo ""
echo "=== Disk free in \$HOME ==="
df -h "$HOME" | tail -1
echo ""
echo "=== Python managers ==="
command -v uv >/dev/null 2>&1 && echo "uv: $(which uv)" || echo "uv: not found"
command -v conda >/dev/null 2>&1 && echo "conda: $(which conda)" || echo "conda: not found"
command -v python3 >/dev/null 2>&1 && echo "python3: $(python3 --version 2>&1)" || echo "python3: not found"
echo ""
echo "=== Existing ACE-Step install? ==="
ls ~/ACE-Step-1.5 2>/dev/null >/dev/null && echo "✓ Found at ~/ACE-Step-1.5" || echo "✗ Not found in ~/ACE-Step-1.5"
Windows note: the probe above is bash (POSIX) and cannot run on native Windows
before WSL exists. Run this PowerShell probe first (full walkthrough in
references/windows-wsl-setup.md):
(Get-CimInstance Win32_OperatingSystem).Caption
[math]::Round((Get-CimInstance Win32_ComputerSystem).TotalPhysicalMemory/1GB,1) # RAM GB
(Get-CimInstance Win32_Processor).Name
(Get-CimInstance Win32_VideoController).Name # GPUs
if (Get-Command nvidia-smi -ErrorAction SilentlyContinue) { 'nvidia-smi present' }
wsl --list --verbose; wsl --version # WSL state
Get-PSDrive C | Select-Object @{n='FreeGB';e={[math]::Round($_.Free/1GB,1)}}
Once inside a WSL distro, the bash probe applies (WSL is treated as Linux).
After running the probe, present the detected values to the user and ask 4 questions. These answers stay in the session context and are reused for every backend install in this session.
Question 1 — Platform confirmation:
I see:
{platform}, {ram} GB RAM, {cpu}, {gpu}. Is that right?
- ✅ Yes, continue
- ❌ No, let me correct it (RAM/GPU/etc.)
If the user says no, ask which value is wrong and re-detect with corrections.
Question 2 — Clone location for ACE-Step (and any other large model repo):
Where should I clone ACE-Step? Common choices:
- ~/ACE-Step-1.5 (default, simple, no path collision)
- ~/projects/ace-step (if you keep projects in a subfolder)
- ~/ml/ace-step (if you have a dedicated ML directory)
- A custom path: ___________
If ACE-Step is already cloned somewhere, I detected it at: {detected_path}. Use that? Or pick a different path?
Use the user's answer as ACE_STEP_PATH for the rest of the session. Never hardcode /Users/luis/Repos/....
Question 3 — Output directory for generated songs:
Where should I save generated songs and project files?
- ~/Music mix/ (default)
- A custom path: ___________
I'll create the directory if it doesn't exist.
Use the user's answer as OUTPUT_DIR for the rest of the session.
Question 4 — Cloud backends (optional, for backup or speed):
Do you have any of these set up? (Just for backup if ACE-Step fails)
- MiniMax API key (set MINIMAX_API_KEY in your shell)
- Stability AI API key (set STABILITY_API_KEY)
- No, just use local backends
You can always set these later.
After the user answers Question 1, the skill knows the hardware. Save these flags for the rest of the session:
- IS_APPLE_SILICON = true if uname -m = arm64 AND uname -s = Darwin
- IS_INTEL_MAC = true if uname -m = x86_64 AND uname -s = Darwin
- IS_LINUX_NVIDIA = true if nvidia-smi works
- IS_LINUX_AMD = true if rocm-smi works (not yet handled)
- IS_CPU_ONLY = true if no GPU
- RAM_GB = total system RAM (24, 16, 32, 64, etc.)
- MEMORY_ARCH = "unified" (Apple Silicon, integrated graphics, AMD APU) OR "dedicated" (NVIDIA/AMD discrete GPU)
- ML_BUDGET_GB = how much memory is actually available for ML models (free RAM minus a 2 GB safety margin, or, for dedicated GPUs, the smaller of free system RAM and free VRAM)
- GPU_VRAM_GB = total VRAM if dedicated GPU detected
- GPU_FREE_GB = free VRAM right now if dedicated GPU detected
These flags affect:
Apple Silicon note: M1/M2/M3/M4 all work with ACE-Step MLX backend. M3 Pro/Max/Ultra and M4 are faster but otherwise identical. No code changes needed across chips. Memory architecture: unified — your 24 GB is shared between the OS, your apps, and the ML model. A 24 GB MacBook with 4 GB of open apps has ~20 GB ML budget, not 24 GB. This is why xl-mixed (which needs ~25-30 GB peak) hits swap-thrashing on 24 GB Macs. On a 32 GB Mac Mini or 64 GB Mac Studio, the same model fits comfortably.
Intel Mac note: ACE-Step MLX backend requires Apple Silicon. On Intel Macs, ACE-Step runs on CPU only (very slow, ~10x slower than MLX) or PyTorch with MPS (which is also limited on Intel). Recommend cloud backends (MiniMax, Stable Audio) for Intel Mac users who want reasonable speed. MusicGen still works fine on Intel Mac. Memory architecture: unified (Intel iGPU shares system RAM, same caveat as Apple Silicon).
Linux + NVIDIA note: ACE-Step on Linux uses CUDA. Needs NVIDIA driver + CUDA toolkit. Generation is faster than Apple Silicon MLX for high-end GPUs (RTX 4090, A100), slower for low-end (RTX 3050, GTX 1660). Memory architecture: dedicated — VRAM is separate from system RAM, so a 12 GB GPU + 16 GB system can still run a 10 GB model (VRAM is the bottleneck, not system RAM). The probe script detects this and uses the smaller pool as the ML budget.
Windows note: Run local ACE-Step via WSL2 (treated as Linux, with CUDA passthrough to your NVIDIA GPU), not native Windows. On corporate machines behind a TLS-inspecting proxy, you must install the corporate root CA into the distro or model downloads fail with certificate errors — see references/windows-wsl-setup.md. Cloud backends are the no-WSL alternative (but may be firewall-blocked).
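As a rough sketch, the boolean session flags can be derived from three probe results. The function name `classify_hardware` is hypothetical, and two choices below are assumptions rather than something the skill specifies: a machine with no detected GPU is labeled "unified" (it only has system RAM), and AMD/ROCm detection is left out because it is not yet handled.

```python
from typing import Dict, Union


def classify_hardware(
    system: str, machine: str, has_nvidia_smi: bool
) -> Dict[str, Union[bool, str]]:
    """Map `uname -s`, `uname -m`, and an nvidia-smi check to session flags."""
    is_mac = system == "Darwin"
    apple_silicon = is_mac and machine == "arm64"
    intel_mac = is_mac and machine == "x86_64"
    return {
        "IS_APPLE_SILICON": apple_silicon,
        "IS_INTEL_MAC": intel_mac,
        "IS_LINUX_NVIDIA": system == "Linux" and has_nvidia_smi,
        # Mirrors the bash probe: anything that is neither Apple Silicon
        # nor NVIDIA is reported as CPU only.
        "IS_CPU_ONLY": not (apple_silicon or has_nvidia_smi),
        # Assumption: without a discrete NVIDIA GPU, memory is one shared pool.
        "MEMORY_ARCH": "dedicated" if has_nvidia_smi else "unified",
    }
```

A live call could pass `platform.system()`, `platform.machine()`, and `shutil.which("nvidia-smi") is not None`. The last check only confirms the binary exists, so running `nvidia-smi` itself remains the stricter test.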
Before starting, detect any available music backend. Check in this priority order — use the first one that succeeds:
| Priority | Backend | Detection command | What it needs |
|---|---|---|---|
| 1 | Native tool | Inspect runtime's tool list for music_generate or similar | None — built into runtime |
| 2 | ACE-Step local (free, best quality) | curl -s http://127.0.0.1:8001/health 2>/dev/null returns {"status":"ok"} | git clone + uv sync + uv run acestep-api (REST API on port 8001) |
| 3 | MusicGen local (free, instrumental only) | python3 -c "import audiocraft" 2>/dev/null && echo OK | Conda or pip env with audiocraft + torch + xformers |
| 4 | mmx CLI (MiniMax) | which mmx 2>/dev/null | MiniMax API key in environment |
| 5 | Stable Audio REST | [ -n "$STABILITY_API_KEY" ] && echo OK | STABILITY_API_KEY env var |
| 6 | Any other CLI | which <cli-name> 2>/dev/null | Provider-specific setup |
Run all detection checks. You only need one working backend. The skill adapts to whatever it finds.
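The priority-ordered detection can be sketched as a list of probes that stops at the first success. The names `detect_backend`, `ace_step_healthy`, and `DEFAULT_PROBES` are illustrative assumptions. The native-tool check (priority 1) is omitted because it depends on the runtime's tool list rather than on the host.

```python
import importlib.util
import json
import os
import shutil
import urllib.request
from typing import Callable, Optional, Sequence, Tuple

Probe = Tuple[str, Callable[[], bool]]


def ace_step_healthy(url: str = "http://127.0.0.1:8001/health") -> bool:
    """True if the local server answers with {"status": "ok"}."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return json.load(resp).get("status") == "ok"
    except Exception:  # connection refused, timeout, or malformed JSON
        return False


# Host-side probes in the priority order of the table (rows 2-5).
DEFAULT_PROBES: Sequence[Probe] = (
    ("ace-step", ace_step_healthy),
    ("musicgen", lambda: importlib.util.find_spec("audiocraft") is not None),
    ("mmx", lambda: shutil.which("mmx") is not None),
    ("stable-audio", lambda: bool(os.environ.get("STABILITY_API_KEY"))),
)


def detect_backend(probes: Sequence[Probe] = DEFAULT_PROBES) -> Optional[str]:
    """Return the name of the first backend whose probe succeeds."""
    for name, probe in probes:
        if probe():
            return name
    return None
```

Returning `None` corresponds to the "no backend found" branch, where the skill presents install options instead of starting the workflow loop.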
If no backend is found after checking all paths, branch on detected hardware and present the right install path. Use the ACE_STEP_PATH and OUTPUT_DIR from the User & Hardware Setup answers.
IS_APPLE_SILICON = true)
No music generation backend detected. The quickest free path is ACE-Step local: it supports vocals, lyrics, and up to 10-minute songs with no API key and no quota limits.
Install ACE-Step (Apple Silicon, MLX native):
git clone https://github.com/ace-step/ACE-Step-1.5.git "${ACE_STEP_PATH}"
cd "${ACE_STEP_PATH}" && uv sync
uv run acestep-api --port 8001
On first launch, the API server starts in "no models loaded" state. The skill will ask before downloading any models (see the Model download consent flow in the ACE-Step Quality Tiers section). REST API runs on http://127.0.0.1:8001.
Alternative (MusicGen, instrumental only):
brew install miniforge
conda create -n musicgen -c conda-forge python=3.11 audiocraft torch torchaudio xformers
conda activate musicgen
IS_INTEL_MAC = true)
No music generation backend detected. ACE-Step MLX is not supported on Intel Macs. Your options:
Option A: Cloud backends (recommended for speed):
- MiniMax: set MINIMAX_API_KEY in your shell, install mmx CLI
- Stable Audio: set STABILITY_API_KEY in your shell
Option B: ACE-Step on CPU (slow, ~10x slower than MLX):
git clone https://github.com/ace-step/ACE-Step-1.5.git "${ACE_STEP_PATH}"
cd "${ACE_STEP_PATH}" && uv sync
uv run acestep-api --port 8001
# No MLX backend on Intel: generation will use CPU, expect ~60 min/track
Option C: MusicGen (instrumental only, fine on Intel):
brew install miniconda
conda create -n musicgen -c conda-forge python=3.11 audiocraft torch torchaudio xformers
conda activate musicgen
IS_LINUX_NVIDIA = true)
No music generation backend detected. The quickest free path is ACE-Step with CUDA.
Install ACE-Step (CUDA):
git clone https://github.com/ace-step/ACE-Step-1.5.git "${ACE_STEP_PATH}"
cd "${ACE_STEP_PATH}" && uv sync
uv run acestep-api --port 8001
Requires NVIDIA driver + CUDA toolkit. Generation speed depends on GPU tier (see Linux + NVIDIA note above).
Alternative (MusicGen, instrumental only):
# Conda (preferred):
conda create -n musicgen -c conda-forge python=3.11 audiocraft torch torchaudio xformers
conda activate musicgen
# Or venv + pip (CUDA required):
python3 -m venv ~/musicgen-env
source ~/musicgen-env/bin/activate
# --extra-index-url keeps PyPI available for audiocraft while adding the CUDA wheels
pip install audiocraft torch torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
No NVIDIA GPU detected. ACE-Step will run on CPU (very slow, expect ~60-90 min/track). For better performance, consider:
- Cloud backends (MiniMax, Stable Audio) — fast, paid
- MusicGen — CPU-capable for shorter instrumentals
- Buy a GPU 😄 (or borrow a cloud instance with NVIDIA)
Local ACE-Step on native Windows is not recommended (bash setup scripts; the LM backend path is built for Linux). The supported local path is WSL2, which gives a real Linux with CUDA passthrough to your NVIDIA GPU. The full, verified walkthrough (including corporate machines behind a TLS-inspecting proxy, where the model download fails until the corporate root CA is installed in the distro) is in references/windows-wsl-setup.md.
Option A: WSL2 (recommended for local generation):
- Probe with PowerShell (see the Windows note under Hardware Probe) and read wsl --list --verbose.
- Create a dedicated, isolated distro. Never reuse or modify an existing (especially corporate) distro, and never edit the global %USERPROFILE%\.wslconfig:
wsl --install Ubuntu-24.04 --name acestep --no-launch
- Verify GPU passthrough: wsl -d acestep -u root -- nvidia-smi. Then follow the Linux + NVIDIA install steps inside the distro.
- On a corporate/proxied machine, do the CA-install and proxy-bypass steps in references/windows-wsl-setup.md before downloading models.
Option B: Cloud backends (fast, paid; may be blocked by a corporate firewall):
- MiniMax: set MINIMAX_API_KEY and install mmx
- Stable Audio: set STABILITY_API_KEY
Option C: MusicGen (instrumental only):
winget install Anaconda.Miniconda3
conda create -n musicgen -c conda-forge python=3.11 audiocraft torch torchaudio xformers
conda activate musicgen
After any install method, verify: python3 -c "import audiocraft; print('MusicGen ready')" (for MusicGen) or curl -s http://127.0.0.1:8001/health (for ACE-Step).
Other options: install mmx CLI, or set STABILITY_API_KEY for Stable Audio API.
Do not start the workflow loop without a backend.
MusicGen's dependency chain has a known blocker: xformers cannot build from source on macOS without conda (it requires CUDA build tools that don't exist outside Linux/nvidia). This is why conda/miniforge is the recommended path on macOS.
Why not plain pip install audiocraft torch?
audiocraft → requires xformers
xformers → requires CUDA build tools
macOS → no CUDA → build fails
conda → ships pre-built xformers wheels for all platforms ✓
If conda is NOT available and the user refuses to install it, fall back to any cloud backend (mmx, Stable Audio). Do not attempt a broken pip install chain.
Verification (run after any install method):
python3 -c "
import audiocraft
import torch
print(f'MusicGen {audiocraft.__version__} OK')
print(f'torch {torch.__version__}, CUDA: {torch.cuda.is_available()}')
print(f'GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"CPU only\"}')
"
If this prints without error, MusicGen is ready. The agent should cache the activation command (e.g., conda activate musicgen) and use it for every MusicGen generation call in the session.
This skill does not require any optional tool, but the user may benefit from any of these. Ask once at the start of the workflow, before generating anything. Propose the install command for the detected platform using the table below.
| Tool | What it unlocks | Linux (apt/dnf/pacman/apk) | macOS (brew) | Windows (winget / choco) | Pip fallback (any OS) |
|---|---|---|---|---|---|
| ffmpeg | Audio format conversion, trimming, re-encoding | apt install ffmpeg (Debian/Ubuntu) · dnf install ffmpeg (Fedora) · pacman -S ffmpeg (Arch) · apk add ffmpeg (Alpine) | brew install ffmpeg | winget install Gyan.FFmpeg (or choco install ffmpeg) | n/a |
| yt-dlp | YouTube download for cover or mashup inputs | apt install yt-dlp (or pip) | brew install yt-dlp (or pip) | winget install yt-dlp (or choco install yt-dlp or pip) | pip install -U yt-dlp |
| audiocraft | MusicGen local generation (free, no API key, no quota) | conda (preferred) or pip | conda (preferred): brew install miniforge then conda create -n musicgen -c conda-forge python=3.11 audiocraft torch torchaudio xformers | Same as Linux | n/a |
| librosa | Audio analysis (BPM, key, energy, structure) | pip | pip | pip | pip install librosa numpy scipy |
| parselmouth | Better pitch tracking (optional, Praat under the hood) | pip | pip | pip | pip install praat-parselmouth |
| mmx CLI | Per-flag control (--avoid, --bpm, --key, --structure) with MiniMax | follow the MiniMax install guide for Linux | follow the MiniMax install guide for macOS | follow the MiniMax install guide for Windows (PowerShell) | n/a |
| python3 | Required for audiocraft, librosa, and parselmouth | apt install python3 python3-pip · dnf install python3 python3-pip · pacman -S python python-pip · apk add python3 py3-pip | brew install python (ships pip) | winget install Python.Python.3.12 | n/a |
On Linux and macOS, the interpreter is usually python3. On Windows, it is usually python (no 3). When verifying, check both names so Windows users are not falsely reported as missing Python:
# POSIX
command -v python3 || command -v python
# Windows PowerShell
Get-Command python, python3 -ErrorAction SilentlyContinue | Select-Object -First 1
For each missing optional tool, present three options:
Do not auto-install. Do not silently fall through to a degraded path without confirmation. The user is in control of their machine.
If the active platform is not recognized (unknown base image, restricted shell, no package manager available), say so explicitly and ask the user to either name their environment or install the tools manually before continuing.
music-craft-minimax
If the user's request implies any of:
- --avoid, --bpm, --key, or --structure flags
Stop the pre-flight and tell the user: "That needs <feature>, which is in music-craft-minimax. Switch to that skill and I will run the same pre-flight with the extended check list." Do not try to fake these features with the tools this skill has.
Once the Pre-Flight Check detects a backend, translate the production-sheet prompt and lyrics into that backend's format. Each backend has different capabilities and limitations — adapt accordingly.
Best for: local generation with real vocals, separate lyrics, song structure, up to 10 minutes (600s). No API key, no quota. Runs natively on Apple Silicon via MLX.
Prerequisites: REST API must be running on http://127.0.0.1:8001. Install with:
git clone https://github.com/ace-step/ACE-Step-1.5.git "${ACE_STEP_PATH}"
cd "${ACE_STEP_PATH}" && uv sync
uv run acestep-api --port 8001 # or: ./start_api_server_macos.sh
(See User & Hardware Setup above for how ACE_STEP_PATH is determined.)
Generation (3-step async):
# 1. Submit task
TASK_ID=$(curl -s -X POST http://127.0.0.1:8001/release_task \
-H "Content-Type: application/json" \
-d '{
"prompt": "<detailed caption, e.g.: dreamy 80s synthwave, warm analog synths, gated-reverb drums, arpeggiated bass, neon night-drive mood>",
"lyrics": "[Verse]\n<lyrics here>\n\n[Chorus]\n<lyrics here>",
"audio_duration": 210,
"bpm": 96,
"key_scale": "D major",
"time_signature": "4/4",
"vocal_language": "en",
"thinking": true,
"inference_steps": 8,
"guidance_scale": 7.0
}' | python3 -c "import json,sys; print(json.load(sys.stdin).get('data',{}).get('task_id',''))")
# 2. Poll for completion (status: 0=pending, 1=processing, 2=done)
# Wait, then check:
curl -s -X POST http://127.0.0.1:8001/query_result \
-H "Content-Type: application/json" \
-d "{\"task_ids\": [\"$TASK_ID\"]}"
# 3. Copy audio from cache dir when done
# Files saved to: ${ACE_STEP_PATH}/.cache/acestep/tmp/api_audio/
Polling caveat: The /query_result endpoint may return {"data": [], "code": 200} even while the task is actively running. This is a known server-side quirk. Don't treat empty data as "task failed" — instead, check for new MP3 files in the cache directory, or look at the server log (/tmp/acestep-api.log) for actual progress markers (e.g. MLX DiT diffusion: 24/50).
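One way to apply the cache-directory fallback is to snapshot the directory before submitting the task and then wait for a new MP3 to appear. The sketch below assumes finished audio lands as `.mp3` files directly in the cache directory; the helper names `snapshot` and `wait_for_new_audio` are hypothetical.

```python
import time
from pathlib import Path
from typing import Optional, Set


def snapshot(cache_dir: Path) -> Set[str]:
    """Names of the MP3 files currently in the cache directory."""
    return {p.name for p in cache_dir.glob("*.mp3")}


def wait_for_new_audio(
    cache_dir: Path,
    before: Set[str],
    timeout_s: float = 3600,
    interval_s: float = 15,
) -> Optional[Path]:
    """Return the first MP3 that was not in `before`, or None on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        new_files = sorted(snapshot(cache_dir) - before)
        if new_files:
            return cache_dir / new_files[0]
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval_s)
```

Take the snapshot immediately before the `/release_task` call so that files from earlier generations are not mistaken for the new output. A file can appear before the server has finished writing it, so checking that its size has stopped growing before copying is a sensible extra step.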
Prompt format: Prefer a detailed, multi-dimensional caption — ACE-Step's own docs call the caption "the most important factor affecting generated music", and the project's example prompts are rich 1–3 sentence descriptions, not bare tags. Cover, in order: genre, key instruments, vocal character, mood, and production/texture words. A short 2–6 word tag still works (and the LM expands it when thinking=true), but specificity measurably improves results. The earlier "keep it to short tags" advice was wrong for ACE-Step 1.5.
Two rules that matter:
- [Guitar Solo - distorted] tag degrades output).
The default audio_duration is 210s (3:30), the typical user expectation for a "song". Use this default unless you have a specific reason to use a shorter or longer length.
Inputs to prepare (set up once, reuse for every song):
- [Verse 1], [Pre-Chorus], [Chorus], [Verse 2], [Bridge], [Outro] tags. ~150-200 words fits a 3:30 song comfortably.
- bpm, key_scale, time_signature, vocal_language, audio_duration: 210, plus the prompt and lyrics.
Request body template for a full song (copy/paste and fill in):
TASK_ID=$(curl -s -X POST http://127.0.0.1:8001/release_task \
-H "Content-Type: application/json" \
-d '{
"prompt": "<your detailed multi-dimensional caption here>",
"lyrics": "[Verse 1]\n<line>\n<line>\n\n[Pre-Chorus]\n<line>\n<line>\n\n[Chorus]\n<line>\n<line>\n\n[Verse 2]\n<line>\n<line>\n\n[Bridge]\n<line>\n<line>\n\n[Outro]\n<line>\n<line>\n",
"audio_duration": 210,
"bpm": 96,
"key_scale": "D major",
"time_signature": "4/4",
"vocal_language": "en",
"thinking": true,
"inference_steps": 8,
"guidance_scale": 7.0
}' | python3 -c "import json,sys; print(json.load(sys.stdin).get('data',{}).get('task_id',''))")
Expected wall-clock on M3 24GB (standard tier, 2B turbo, 8 steps):
The existing M1_idkw_dreampop_ACE_210s.mp3 in ~/Music mix/hello_cleveland/i_dont_know_why/ is the reference output for this workflow — same song with audio_duration: 210, the detailed prompt format, and standard tier. If your output sounds comparable (or better), the workflow is working.
If you only have a 60-second subset of lyrics (e.g. a hook for a jingle, or a single chorus to test the prompt), set audio_duration: 60 — that's perfectly fine, just not a full song. Use the full-lyrics version for the final generation.
# Good (detailed, multi-dimensional — matches ACE-Step's own examples)
"A groovy funk track with slap bass, tight horn stabs, rhythmic guitar scratching, a charismatic male lead with call-and-response backing vocals, and an irresistible pocket groove"
"Dreamy 80s synthwave: warm analog synths, gated-reverb drums, arpeggiated bassline, shimmering pads, nostalgic neon night-drive mood"
# Also fine (short tag; LM expands it with thinking=true)
"dreamy synthwave, 80s retro, atmospheric pads"
# Avoid: contradictory styles stacked in one static caption (express as evolution instead)
"classical chamber strings AND crushing hardcore metal AND lo-fi hip-hop, all at once"
Parameters:
| Parameter | Type | Default | Notes |
|---|---|---|---|
| prompt | string | required | Detailed caption preferred (genre + instruments + vocal character + mood + production), per ACE-Step's docs and example prompts. A short tag also works and is expanded by the LM when thinking=true. Resolve style conflicts temporally rather than stacking them. |
| lyrics | string | optional | [Verse]/[Chorus] tagged lyrics |
| audio_duration | int | 210 | 10–600s. Default: 210s (3:30) for full songs; this is the typical user expectation and matches the existing M1_idkw_dreampop_ACE_210s.mp3 reference. Set to fit ALL lyrics (see Duration Guide below). Use shorter values (30–60s) only for jingles, hooks, or test drafts. |
| thinking | bool | false | LM rewrites tags → richer caption. Always use true for best results |
| use_format | bool | false | When true, the LM also enhances your caption/lyrics (similar to thinking but for prompt enrichment). Try true if the LM seems to be missing context from your prompt. |
| inference_steps | int | 8 | Diffusion steps. For acestep-v15-turbo (standard): 8 is the documented setting, do not exceed 20. For acestep-v15-xl-sft (xl-mixed): 32-64 recommended, default 50. Using 8 with xl-sft produces "soup" output (all elements at same level, no dynamics). |
| guidance_scale | float | 7.0 | Higher = stricter prompt adherence. Only effective for base/sft models, not turbo. For xl-sft, try 4.0-7.0 range. |
| shift | float | 3.0 | Timestep shift factor (1.0-5.0). Officially documented as "only effective for base models, not turbo models", but xl-sft is an SFT model, not turbo. Experiment with 1.0 or 5.0 if the default sounds off. |
| infer_method | string | "ode" | Diffusion inference method. "ode" (Euler, faster) or "sde" (stochastic, sometimes more stable for SFT models). |
| seed | int | -1 | -1 = random. Set for reproducibility |
| vocal_language | string | "en" | BCP-47 language code for vocals. Important for non-English songs: the model picks the right phoneme set. |
| bpm | int | none | Optional. When thinking=true and missing, the LM infers it. Set explicitly for tighter control. |
| key_scale | string | "" | Optional. E.g. "D major", "A minor". Same as bpm. |
| time_signature | string | "" | Optional. E.g. "4/4", "3/4". Same as bpm. |
| cfg_interval_start | float | 0.0 | CFG application start ratio (0.0-1.0). Default applies CFG throughout the diffusion. |
| cfg_interval_end | float | 1.0 | CFG application end ratio (0.0-1.0). |
| use_adg | bool | false | Adaptive Dual Guidance. Base model only. Not applicable to xl-sft. |
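A small pre-submit check can catch the parameter combinations the table warns about, such as 8 steps with xl-sft. The sketch below is an assumption-laden illustration: `build_request` and `STEP_RULES` are hypothetical names, the lower bound of 1 step for turbo is a guess, and the recommended 32-64 range for xl-sft is enforced as a hard limit purely for demonstration.

```python
from typing import Any, Dict, Optional

# model label -> (default steps, minimum, maximum), following the table.
# Assumption: the "recommended" xl-sft range is treated as a hard limit.
STEP_RULES = {
    "turbo": (8, 1, 20),
    "xl-sft": (50, 32, 64),
}


def build_request(
    prompt: str,
    lyrics: str = "",
    model: str = "turbo",
    audio_duration: int = 210,
    inference_steps: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a /release_task body, rejecting out-of-range values early."""
    if not prompt.strip():
        raise ValueError("prompt is required")
    if not 10 <= audio_duration <= 600:
        raise ValueError("audio_duration must be within 10-600 seconds")
    default, low, high = STEP_RULES[model]
    steps = default if inference_steps is None else inference_steps
    if not low <= steps <= high:
        raise ValueError(f"{model}: inference_steps must be {low}-{high}, got {steps}")
    body: Dict[str, Any] = {
        "prompt": prompt,
        "audio_duration": audio_duration,
        "thinking": True,
        "inference_steps": steps,
    }
    if lyrics:
        body["lyrics"] = lyrics
    return body
```

Failing before the request is sent is cheaper than discovering a "soup" render after a long generation.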
Environment variables (set when starting the server, not in the request body):
| Env var | Default | Notes |
|---|---|---|
| ACESTEP_CONFIG_PATH | acestep-v15-turbo | DiT model path. Set to acestep-v15-xl-sft for xl-mixed. |
| ACESTEP_LM_MODEL_PATH | acestep-5Hz-lm-0.6B | LM model path. Use acestep-5Hz-lm-1.7B for higher quality. |
| ACESTEP_LM_BACKEND | vllm | Backend for the LM. On Apple Silicon (macOS), set to mlx for native acceleration. vLLM is meant for Linux+CUDA. |
| ACESTEP_GENERATION_TIMEOUT | 600 | Per-generation timeout in seconds. Set to 3600 (1 hour) when using xl-mixed on 24GB M3; the default 600s fires mid-generation. |
| ACESTEP_OFFLOAD_TO_CPU | false | Set to true for low-VRAM environments to support longer audio generation. |
| PYTORCH_MPS_HIGH_WATERMARK_RATIO | ~0.4 | On macOS, set to 0.0 to allow XL model to load (MPS otherwise enforces a tight memory cap that fails to load the 4B DiT). |
| ACESTEP_CONFIG_PATH2, ACESTEP_CONFIG_PATH3 | empty | Optional secondary DiT models selectable via the model parameter in requests. |
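The table can be turned into a per-tier set of overrides for the server process. This is only a sketch: `server_env` is a hypothetical helper, the tier names come from the quality-tier table in this document, and applying the 3600 s timeout to every xl-mixed launch (not only on 24 GB M3) is a simplifying assumption.

```python
from typing import Dict


def server_env(tier: str, apple_silicon: bool) -> Dict[str, str]:
    """Environment overrides for starting acestep-api at a given tier."""
    if tier not in ("fast", "standard", "xl-mixed"):
        raise ValueError(f"unknown tier: {tier}")
    env: Dict[str, str] = {}
    if apple_silicon:
        env["ACESTEP_LM_BACKEND"] = "mlx"  # the default vllm targets Linux+CUDA
    if tier in ("standard", "xl-mixed"):
        env["ACESTEP_LM_MODEL_PATH"] = "acestep-5Hz-lm-1.7B"
    if tier == "xl-mixed":
        env["ACESTEP_CONFIG_PATH"] = "acestep-v15-xl-sft"
        env["ACESTEP_GENERATION_TIMEOUT"] = "3600"  # assumption: always extend
        if apple_silicon:
            env["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"
    return env
```

The result can be merged over `os.environ` and passed as the `env` argument of `subprocess.Popen` when launching `uv run acestep-api --port 8001`, so the overrides apply only to the server process.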
Duration Guide (audio_duration):
The audio_duration parameter controls how much audio ACE-Step generates. If it's too short, lyrics get cut off. Estimate based on lyrics word count:
| Lyrics length | Words | Recommended audio_duration | Real-world length |
|---|---|---|---|
| Short (jingle, hook) | <50 | 30–60 | 0:30–1:00 |
| Single verse + chorus | 50–100 | 60–120 | 1:00–2:00 |
| Full song (2 verses, chorus, bridge) | 100–200 | 180–240 | 3:00–4:00 |
| Extended (3+ verses, long outro) | 200–350 | 240–360 | 4:00–6:00 |
| Epic (ballad, progressive) | 350+ | 360–600 | 6:00–10:00 |
Rule of thumb: Count lyrics words × 0.8–1.2 seconds per word, then add 20% for instrumental breaks between sections. Always round UP — ACE-Step will fade out naturally if lyrics end before duration.
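The rule of thumb translates into a short calculation. The function name `estimate_duration` and the 1.0 s/word default (the midpoint of the 0.8-1.2 range) are choices made for this sketch; the 10-600 s clamp follows the documented `audio_duration` range.

```python
import math


def estimate_duration(
    word_count: int,
    seconds_per_word: float = 1.0,  # rule of thumb: 0.8-1.2 s per word
    break_share: float = 0.20,      # +20% for instrumental breaks
) -> int:
    """Estimate audio_duration in seconds, rounded up and clamped to 10-600."""
    raw = word_count * seconds_per_word * (1.0 + break_share)
    # round() first so floating-point noise cannot push ceil() up by a second
    seconds = math.ceil(round(raw, 6))
    return max(10, min(600, seconds))
```

For 175 words at the default pace this gives 175 × 1.0 × 1.2 = 210 s, which lines up with the note that roughly 150-200 words fit a 3:30 song.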
Lyrics format: Use [Verse 1], [Chorus], [Bridge], [Outro] tags. ACE-Step follows these to create song structure. Add [Intro] and [Instrumental Break] tags for non-vocal sections.
M3 performance (tested, real-world verified June 2026):
For 60s audio (standard tier, 2B turbo, 1.7B LM, 8 steps):
For 210s audio (3:30 song, same tier):
First run adds ~90s for model loading. Subsequent runs are faster because the model stays in MPS memory.
Key advantages over MusicGen:
- [Verse]/[Chorus] tags
ACE-Step supports multiple model sizes with different quality/speed/RAM trade-offs. The skill must check available RAM before offering any tier and NEVER auto-download models.
Tier table:
| Tier | DiT Model | LM Model | Peak RAM | Disk cost | Speed (210s) | Quality | When to use |
|---|---|---|---|---|---|---|---|
| fast | v15-turbo (2B) | 5Hz-lm-0.6B (0.6B) | ~8 GB | Included in base ~10 GB | ~5 min | Good | Quick drafts, low-RAM machines |
| standard (default) | v15-turbo (2B) | 5Hz-lm-1.7B (1.7B) | ~11 GB | Included in base ~10 GB | ~10 min | Very Good | Daily driver, most users |
| xl-mixed (24GB M3: viable with extended timeout) | v15-xl-sft (4B) | 5Hz-lm-1.7B (1.7B) | ~25-30 GB | +~20 GB XL DiT download | ~52 min for 60s audio (verified); ~3-4 hours for 210s | Very High+ | Final production on any RAM tier — just slow on 24GB |
Real-world hardware limits (verified June 2026 on M3 24GB unified memory):
- The best tier (4B XL + 4B LM, ~22 GB peak) requires ≥32 GB RAM. NOT offered on 24 GB systems.
- The xl-mixed tier (4B DiT + 1.7B LM) IS viable on 24 GB M3 if you extend the server timeout:
  - Model loads successfully (~10 GB DiT, but MPS pool pressure ~20 GB with cached state)
  - 50-step diffusion runs at ~50-100s/step (varies with audio length and memory pressure)
  - The default 600s server timeout fires mid-generation. Set ACESTEP_GENERATION_TIMEOUT=3600 to allow up to 1 hour per generation.
  - Free RAM goes to 0 GB during generation, but it works
  - Verified June 2026: 60s audio at 50 steps = ~52 min wall-clock on 24GB M3
- Recommendation for 24 GB M3 (unified memory):
  - For fast iterations: use standard tier (10 min for 210s, fast feedback)
  - For final production: use xl-mixed tier with extended timeout (52 min for 60s, or ~3-4 hours for 210s)
- Recommendation for 32 GB+ M3/M4 (unified memory, more headroom): xl-mixed runs in ~15 min as documented. best tier becomes viable.
- For dedicated GPU (NVIDIA/AMD, system RAM separate from VRAM): A 12 GB GPU + 16 GB system can run xl-mixed in ~15 min. The probe script auto-detects this and uses the smaller pool.
Memory safety check (run BEFORE any generation):
The memory probe must distinguish unified memory (Apple Silicon, integrated graphics) from dedicated memory (discrete NVIDIA/AMD GPU). On unified memory, the OS, your apps, and the ML model all share the same pool — so 24 GB total might mean only ~18 GB is actually available for ML after macOS and your open apps. On dedicated GPUs, the VRAM is separate from system RAM, so a 12 GB GPU can run a 10 GB model even on a 16 GB system.
# 1. Total system RAM and current free
case "$(uname -s)" in
Darwin)
TOTAL_RAM_GB=$(sysctl -n hw.memsize | awk '{printf "%.0f", $1/1024/1024/1024}')
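# vm_stat prints "Pages free: N." in pages; the page size is 16 KB on Apple Silicon and 4 KB on Intel
PAGE_KB=$(( $(sysctl -n hw.pagesize) / 1024 ))
FREE_RAM_KB=$(vm_stat | awk -v p="$PAGE_KB" '/Pages free/{gsub(/\./, "", $3); print $3 * p}')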
FREE_RAM_GB=$(awk "BEGIN {printf \"%.1f\", $FREE_RAM_KB/1024/1024}")
;;
Linux)
TOTAL_RAM_GB=$(awk '/MemTotal/{printf "%.0f", $2/1024/1024}' /proc/meminfo)
FREE_RAM_GB=$(awk '/MemAvailable/{printf "%.1f", $2/1024/1024}' /proc/meminfo)
;;
*) echo "unknown" ;;
esac
echo "Total RAM: ${TOTAL_RAM_GB} GB"
echo "Free RAM now: ${FREE_RAM_GB} GB"
# 2. Memory architecture detection
if [ "$(uname -m)" = "arm64" ] && [ "$(uname -s)" = "Darwin" ]; then
MEM_ARCH="unified"
echo "Architecture: unified (Apple Silicon — GPU shares with system)"
elif command -v nvidia-smi >/dev/null 2>&1; then
GPU_VRAM_GB=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -1 | awk '{printf "%.0f", $1/1024}')
GPU_FREE_GB=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -1 | awk '{printf "%.0f", $1/1024}')
echo "GPU: NVIDIA (${GPU_VRAM_GB} GB total VRAM, ${GPU_FREE_GB} GB free)"
echo "Architecture: dedicated (system RAM and VRAM are separate pools)"
MEM_ARCH="dedicated"
elif command -v rocm-smi >/dev/null 2>&1; then
echo "Architecture: dedicated (AMD ROCm)"
MEM_ARCH="dedicated"
else
echo "Architecture: integrated/CPU-only (system RAM used for everything)"
MEM_ARCH="unified"
fi
# 3. ML budget calculation
# Unified: free RAM minus safety margin (OS can reclaim 1-2 GB more on demand)
# Dedicated: use the SMALLER of free RAM or free VRAM (the bottleneck)
# Reserve 2 GB safety margin for OS/other apps
if [ "$MEM_ARCH" = "unified" ]; then
ML_BUDGET_GB=$(awk "BEGIN {printf \"%.0f\", $FREE_RAM_GB - 2}")
echo "ML budget: ~${ML_BUDGET_GB} GB (free RAM minus OS safety margin)"
else
# Dedicated GPU: bottleneck is the smaller pool
BOTTLENECK_GB=$FREE_RAM_GB
# FREE_RAM_GB has a decimal part, so compare with awk ([ -lt ] only accepts integers)
if [ -n "$GPU_FREE_GB" ] && awk "BEGIN {exit !($GPU_FREE_GB < $BOTTLENECK_GB)}"; then
BOTTLENECK_GB=$GPU_FREE_GB
fi
ML_BUDGET_GB=$(awk "BEGIN {printf \"%.0f\", $BOTTLENECK_GB - 2}")
echo "ML budget: ~${ML_BUDGET_GB} GB (smaller of free system RAM or free VRAM, minus safety margin)"
fi
Based on ML budget (NOT total RAM):
| ML budget | Available tiers | Use case |
|---|---|---|
| < 8 GB | fast only (warn user about tight fit, expect OOM) | Quick drafts |
| 8–11 GB | fast + standard | Daily driver |
| 12–20 GB | fast + standard + xl-mixed (with extended timeout) | Final production |
| ≥ 25 GB | ALL tiers including best (4B LM) | No constraints |
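As a rough sketch, the table can be read as a lookup from ML budget to offered tiers. The function names are hypothetical, and budgets that fall between rows (11–12 GB, 20–25 GB) are assigned to the lower row here, which is an assumption the table does not state.

```python
def available_tiers(ml_budget_gb):
    """Map an ML budget in GB to the tiers the table allows."""
    if ml_budget_gb >= 25:
        return ["fast", "standard", "xl-mixed", "best"]
    if ml_budget_gb >= 12:
        # xl-mixed in this band still needs the extended generation timeout.
        return ["fast", "standard", "xl-mixed"]
    if ml_budget_gb >= 8:
        return ["fast", "standard"]
    return ["fast"]


def needs_oom_warning(ml_budget_gb):
    """Below 8 GB the table says to warn about a tight fit and likely OOM."""
    return ml_budget_gb < 8
```

Feed it the ML_BUDGET_GB value printed by the probe, never total RAM.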
Why this differs from "total RAM" tables:
A 24 GB Apple Silicon Mac with macOS and 4 GB of open apps has only ~18 GB ML budget. The probe should report 18 GB, and the table places it in the 12–20 GB row: fast and standard run comfortably, and xl-mixed is offered only with the extended timeout, NOT as the freely eligible tier the old table implied from 24 GB total.
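Conversely, on a Windows desktop with 16 GB system RAM and a 24 GB NVIDIA RTX 4090, the model lives in VRAM, which leaves ~22 GB after the safety margin. Note that the probe as written takes the smaller of free system RAM and free VRAM, so it reports at most ~14 GB for this machine; that still lands in the xl-mixed row. That system CAN run xl-mixed in ~15 min as documented.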
The probe and the table together handle both cases correctly. The previous version of this skill used total RAM as the gating value, which was wrong for unified memory — leading to the "24GB Mac but xl-mixed sounds bad" surprise we hit in June 2026. The fix is to use ML budget (free memory minus OS safety margin, with unified-vs-dedicated awareness).
Model download consent flow (NEVER auto-download):
When ACE-Step is running but no models are loaded (fresh install), OR when the user requests a higher tier whose models aren't downloaded yet:
You: "ACE-Step is ready but needs audio models before generating. This will
download ~10 GB to your disk — I will NOT do this without your explicit
approval.
Your options:
① Download standard (~10 GB) → good quality, ~10 min/track, fits your {N}GB RAM ✓
② Download xl-mixed (+20 GB extra = ~30 GB total) → best quality your machine can run,
~15 min/track, needs you to close heavy apps during generation
③ Skip local → use a cloud backend instead
- MiniMax (if API key set) — fast, paid
- Stable Audio (if STABILITY_API_KEY set) — paid
- MusicGen (local fallback) — free, instrumental only
You currently have {X} GB free disk space.
Which option?"
Wait for the user to choose before doing anything. Do not download. Do not auto-load. Do not start generating.
Rules:
/v1/init)
Switching tiers mid-session:
# Switch to xl-mixed (requires XL model already downloaded)
curl -s -X POST http://127.0.0.1:8001/v1/init \
-H "Content-Type: application/json" \
-d '{"dit_model": "acestep-v15-xl-sft", "lm_model": "acestep-5Hz-lm-1.7B"}'
# Switch back to standard
curl -s -X POST http://127.0.0.1:8001/v1/init \
-H "Content-Type: application/json" \
-d '{"dit_model": "acestep-v15-turbo", "lm_model": "acestep-5Hz-lm-1.7B"}'
M3 performance by tier (real-world verified, June 2026):
| Tier | LM time | DiT time (8 steps) | DiT time (50 steps) | VAE decode | First run | Subsequent | Status on 24GB M3 |
|---|---|---|---|---|---|---|---|
| fast (2B+0.6B) | ~6s | ~45s | n/a (model is turbo) | ~28s | ~3 min | ~5 min | ✅ Works great |
| standard (2B+1.7B) | ~12s | ~50s | n/a (model is turbo) | ~28s | ~3 min | ~10 min | ✅ Works great (default) |
| xl-mixed (4B+1.7B) | ~12s | ~720s (90s/step × 8) | ~1500s for 60s audio (~50s/step × 30+ steps in 30 min) — verified ~52 min wall-clock | ~65s | ~5 min | ~52 min for 60s; estimated 3-4 hours for 210s | ✅ Works with ACESTEP_GENERATION_TIMEOUT=3600 (1 hour). 8-step version produces "soup" output — use 50 steps. |
| best (4B+4B) | n/a | n/a | n/a | n/a | n/a | n/a | ❌ Excluded — needs 32GB+ |
Critical findings from real-world testing (June 2026):
xl-mixed CAN RUN on 24 GB M3 (verified 60s audio at 50 steps completes in ~52 min) with the right env vars:
ACESTEP_CONFIG_PATH=acestep-v15-xl-sft \
ACESTEP_LM_MODEL_PATH=acestep-5Hz-lm-1.7B \
ACESTEP_GENERATION_TIMEOUT=3600 \
PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 \
uv run acestep-api --port 8001
- Without ACESTEP_GENERATION_TIMEOUT=3600, the default 600s (10 min) timeout fires mid-generation
- Without PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0, the model fails to load (MPS OOM)
XL + 8 steps produces "soup" output: all elements at the same level, no dynamics. Always use 50 steps for XL.
Audio length affects time per step:
Memory pressure is the bottleneck, not model loading. Models load fine; generation just runs slow due to swap-thrashing.
XL 50-step fixes to try (in order of likelihood, each test takes ~52 min for 60s audio):
If you want to experiment with xl-mixed anyway, the most likely fixes are:
1. use_format: true (lets the LM enhance your prompt)
2. shift: 1.0 or shift: 5.0 (default is 3.0, officially documented for "base models, not turbo"; xl-sft is SFT, not turbo, so worth experimenting)
3. guidance_scale: 4.0 (default 7.0 may be too aggressive for sft CFG)
4. infer_method: "sde" (stochastic, sometimes more stable than Euler for SFT models)
5. thinking: false (DiT-only mode, skips LM). If this works, the LM is the problem; if it still sounds bad, the DiT is the problem
6. xl-turbo instead of xl-sft (counterintuitive but turbo is designed for fewer steps)
Start with #1 (free, just data change) and work down. If none of these produce acceptable audio, fall back to the standard tier.
Conclusion: standard is the practical best quality tier for 24 GB M3. Reserve xl-mixed for 32GB+ hardware (M3 Max/Ultra, M4 Max).
Beyond text2music, ACE-Step 1.5 conditions on an input audio file. This is how you do a
melody-aware local cover — no cloud needed. Select the mode with task_type in the
/release_task body:
| task_type | What it does | Audio field |
|---|---|---|
| text2music (default) | Generate from caption + lyrics | none |
| cover | Re-style a song while following its melody/structure | src_audio |
| repaint | Regenerate only a time window, keep the rest | src_audio + repainting_start/end |
| extract | Stem separation | src_audio |
Uploading the source audio (important): the API rejects absolute file paths
({"detail":"absolute audio file paths are not allowed"}). Upload the file as multipart
form-data, not JSON. Fields: src_audio (source for cover/repaint) or
reference_audio/ref_audio (style-transfer reference). Send other params as form fields:
curl -s -X POST http://127.0.0.1:8001/release_task \
-F "task_type=cover" \
-F "src_audio=@/path/to/song.wav" \
-F "audio_cover_strength=0.35" \
-F "prompt=dreamy 80s synthwave, warm analog synths, gated-reverb drums, arpeggiated bass, neon night-drive mood" \
-F "bpm=129" -F "key_scale=D major" -F "audio_format=wav"
Cover behavior (verified):
- audio_cover_strength (0.0–1.0): lower = bigger restyle (~0.2–0.4 for a strong genre jump like rock to synthwave; 0.7–0.9 for a subtle restyle; 1.0 = closest to source).
- thinking has no effect; the caption and lyrics you send are used directly, so write a good caption.
- audio_duration is ignored for cover.
- reference_audio (style transfer) conditions global timbre/feel, NOT melody; src_audio (cover) follows melody/structure. Melody capture is best on sparse, mid/slow-tempo songs; expect melodic variation, not an exact copy.
VRAM / time caveat (verified on a 12 GB laptop GPU): a full ~5-minute cover is impractical on this class of hardware. Encoding the source alone took ~13 minutes, and the job then hit the server's default 600 s generation timeout and failed. Mitigations:
- ffmpeg -ss <start> -t 60 -i in.wav out.wav.
- ACESTEP_GENERATION_TIMEOUT=3600.
Repaint (fix one bad section instead of regenerating the whole track): task_type=repaint, upload src_audio, set repainting_start/repainting_end (seconds) and repaint_mode (conservative/balanced/aggressive) with repaint_strength (0–1). Use your structural analysis to choose the window.
Local audio understanding (no cloud): ACE-Step can extract BPM, key, time-signature, and a
caption directly from an input file (the analysis_only / full_analysis_only request flags;
the "Audio Understanding" feature). This is a fully-local way to derive metas/caption from a
source song — an alternative to the librosa pipeline when ACE-Step is already running.
Operational note (single-worker API): the REST server processes one job at a time and may
not answer /query_result within a short timeout while mid-generation (60 s poll timeouts are
normal under load). Use a generous client timeout, tolerate poll timeouts, and detect
completion by watching .cache/acestep/tmp/api_audio/ for new files.
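One way to detect completion from the output directory, as suggested above, is to snapshot it before submitting and poll for new files. This is a sketch only: the helper names are made up, and it does not check that a new file has finished being written.

```python
import time
from pathlib import Path


def snapshot(out_dir):
    """Record the files currently present in the output directory."""
    return {p for p in Path(out_dir).glob("*") if p.is_file()}


def wait_for_new_audio(out_dir, known, timeout_s=3600.0, poll_s=5.0):
    """Return files that appeared after the `known` snapshot, or [] on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        new_files = sorted(snapshot(out_dir) - known)
        if new_files:
            return new_files
        if time.monotonic() >= deadline:
            return []
        time.sleep(poll_s)
```

Take the snapshot of .cache/acestep/tmp/api_audio/ right before posting the job, then call wait_for_new_audio with a timeout at least as long as ACESTEP_GENERATION_TIMEOUT.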
Quality loop: generate a small batch (batch_size 2–4) and keep the best; request
audio_format: "wav" to avoid a lossy MP3 round-trip; set seed with use_random_seed=false
for reproducibility.
Best for: fully offline generation, no API dependency, unlimited use. Trade-off: quality is lower than MiniMax/ACE-Step, especially for vocals. Use as a fallback or for instrumentals.
MusicGen takes a single text description. It does NOT have a separate lyrics parameter — the text description IS the entire input. MusicGen was trained on short natural-language descriptions, NOT structured production sheets like MiniMax. The skill must reformat the prompt before passing it.
Device selection (MPS / CUDA / CPU):
import torch
if torch.backends.mps.is_available():  # Apple Silicon (M1/M2/M3)
    device = "mps"
elif torch.cuda.is_available():  # NVIDIA GPU
    device = "cuda"
else:
    device = "cpu"
Pass device=device to MusicGen.get_pretrained(...) when loading; the MusicGen wrapper is not a torch module and has no .to() method. MPS gives 2-5x speedup over CPU on Apple Silicon.
Generation command (run via bash heredoc):
python3 << 'MUSICGEN_EOF'
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
import torch
# Device selection
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
# MusicGen is not an nn.Module and has no .to(); pass the device at load time
model = MusicGen.get_pretrained("<model_name>", device=device)
# Generation parameters tuned for quality
model.set_generation_params(
duration=30, # MusicGen's effective max per call
top_k=250, # Token sampling diversity
top_p=0.0, # Nucleus sampling (0 = disabled)
temperature=1.0, # Creativity (0.5 = conservative, 1.0 = default, 1.5 = wild)
cfg_coef=3.0, # How strictly to follow the prompt (higher = more faithful)
)
# MUSICGEN-SPECIFIC PROMPT FORMAT (see below)
desc = """<short genre/mood/instrument description, 1-2 sentences>
<short lyric snippet, if any>"""
wav = model.generate([desc])
audio_write("<output_path_no_ext>", wav[0].cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
print(f"Done: <output_path_no_ext>.wav")
MUSICGEN_EOF
Model selection:
| Model | Parameters | Vocal quality | Lyrics following | Best for |
|---|---|---|---|---|
| small | 300M | ❌ Instrumental only | ❌ Ignores lyrics | Quick tests, instrumentals |
| medium | 1.5B | ⚠️ Vague vocal-like sounds | ⚠️ Occasionally | Best CPU-vs-quality trade-off |
| large | 3.3B | ✅ Best vocals MusicGen offers | ✅ Better | When you have GPU or patience |
| melody | 1.5B | ✅ Melody-conditioned | ⚠️ Humming, not lyrics | Vocal-melody tracks |
Selection logic:
- medium
- large
- large (10-20x faster than CPU)
- melody
MusicGen-specific prompt format:
MusicGen was trained on short descriptions like "upbeat indie rock with jangly guitars and energetic drums". Long structured production sheets like MiniMax prompts are NOT what it expects and produce worse results.
Translation rule: when adapting a MiniMax-style production-sheet prompt for MusicGen:
| MiniMax prompt element | MusicGen equivalent |
|---|---|
| 13-line production sheet with anti-sparse guards | Condense to 1-2 sentences with the core genre + mood + 2-3 key instruments |
| [Verse], [Chorus] section tags in lyrics | Replace with a short lyric snippet (4-8 lines) |
| BPM/key/structure flags | Fold into natural language ("slow 80 BPM", "minor key") |
| Anti-sparse instructions ("always playing") | Drop entirely; MusicGen doesn't have that failure mode |
| AVOID lists | Drop entirely; MusicGen doesn't interpret them well |
Example translation:
# MiniMax-style (13 lines, structured)
sludge doom metal, Melvins meets Eyehategod, crushingly heavy slow-motion,
oppressive dark cathartic, weight of a system collapsing,
male lead vocal, deep guttural growls, raw throat-shredding delivery,
FULL ARRANGEMENT: massively downtuned sludge guitar, sub-bass shakes floor,
tempo 82 BPM in E minor, doom reimagining, "the system will stand" becomes mantra,
extreme dynamic range whispered verses to screaming chaos,
sludge metal production thick muddy analog saturation,
vocal character: whispered verses, growling pre-chorus, full scream,
emotional arc: oppressive whisper opening, gradual crushing weight buildup,
dramatic pauses at 12s 55s 95s 130s, repeated "oh my god" lines,
avoid fast upbeat avoid clean singing avoid polished production
# MusicGen-style (2-3 sentences, natural language)
Slow crushing sludge doom metal at 82 BPM in E minor.
Downtuned detuned guitars, sub-bass, slow half-time drums,
oppressive dark mood. Melodic humming: "I don't know why,
I don't know why, what I know is how to get along."
The MusicGen version is shorter, uses natural phrasing, and inlines a short lyric fragment at the end.
MusicGen limitations (documented):
ffmpeg -i in.wav -codec:a libmp3lame -qscale:a 2 out.mp3 if ffmpeg is available.
Generation parameters (tuning guide):
| Parameter | Default | Range | Effect |
|---|---|---|---|
| top_k | 250 | 50-500 | Lower = more focused, higher = more diverse |
| top_p | 0.0 | 0.0-1.0 | 0 = disabled. 0.9 = nucleus sampling, often better quality |
| temperature | 1.0 | 0.5-1.5 | Lower = more predictable, higher = more creative |
| cfg_coef | 3.0 | 1.0-10.0 | Higher = follows prompt more strictly but can sound artificial |
Recommended starting values per intent:
| Intent | top_k | top_p | temperature | cfg_coef |
|---|---|---|---|---|
| Faithful to prompt (style match) | 150 | 0.0 | 0.8 | 5.0 |
| Creative/experimental | 350 | 0.9 | 1.2 | 2.0 |
| Best vocals (singing attempt) | 200 | 0.0 | 1.0 | 4.0 |
| Instrumental only | 250 | 0.0 | 1.0 | 3.0 |
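The starting values in the table can be kept as presets and expanded into keyword arguments. The preset keys and the helper name below are illustrative; only the numbers come from the table.

```python
# Starting values per intent, copied from the table above.
MUSICGEN_PRESETS = {
    "faithful":     {"top_k": 150, "top_p": 0.0, "temperature": 0.8, "cfg_coef": 5.0},
    "creative":     {"top_k": 350, "top_p": 0.9, "temperature": 1.2, "cfg_coef": 2.0},
    "vocals":       {"top_k": 200, "top_p": 0.0, "temperature": 1.0, "cfg_coef": 4.0},
    "instrumental": {"top_k": 250, "top_p": 0.0, "temperature": 1.0, "cfg_coef": 3.0},
}


def generation_params(intent, duration=30):
    """Return kwargs for set_generation_params; unknown intents use the defaults."""
    preset = MUSICGEN_PRESETS.get(intent, MUSICGEN_PRESETS["instrumental"])
    return {"duration": duration, **preset}
```

With a loaded model, the result would be applied as model.set_generation_params(**generation_params("faithful")).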
Chunked generation for longer tracks:
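# Generate 30s segments and concatenate
import torch
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
desc = "<short genre/mood/instrument description>"
total_segments = 6  # 6 x 30s = 3 minutes
# MusicGen has no .to(); pass the device at load time ("mps", "cuda" or "cpu")
model = MusicGen.get_pretrained("large", device="mps")
model.set_generation_params(duration=30)
all_audio = []
for i in range(total_segments):
    segment = model.generate([desc], progress=True)  # shape: [batch, channels, samples]
    all_audio.append(segment.cpu())
# Concatenate along the time axis and save
combined = torch.cat(all_audio, dim=-1)
audio_write("output_long", combined[0], model.sample_rate, strategy="loudness", loudness_compressor=True)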
Each segment takes 2-5 min on MPS, and a 3-minute song needs six 30s segments, so expect 12-30 min total. Acceptable trade-off for free + local.
Quality expectation (honest):
| Output aspect | MusicGen best case | MiniMax baseline |
|---|---|---|
| Instrumental fidelity | ✅ Good | ✅ Excellent |
| Vocal presence | ⚠️ Vague humming | ✅ Clear singing |
| Lyrics accuracy | ❌ Ignores most | ✅ Word-level match |
| Song structure (verse/chorus) | ❌ Single texture | ✅ Follows tags |
| Audio polish | ⚠️ Lo-fi by default | ✅ Production-quality |
| Speed (CPU, 30s) | ~7 min | ~30s (cloud) |
Use MusicGen for instrumentals, sketches, or when cloud is unavailable. Use MiniMax/ACE-Step when you need actual sung lyrics.
Best for: highest quality with per-flag control, cloud-based.
mmx music generate \
--prompt "<production-sheet prompt>" \
--lyrics-file <lyrics.txt> \
--out <output.mp3> \
--model music-2.6-free
Supports separate lyrics file, explicit model selection, and per-flag control (--bpm, --key, --avoid, --structure). Subject to MiniMax API quota.
Best for: short instrumentals, sound design, text-to-audio.
curl -s -X POST https://api.stability.ai/v2alpha/audio/generate \
-H "Authorization: Bearer $STABILITY_API_KEY" \
-F "prompt=<prompt>" \
-F "duration=180" \
-F "output_format=mp3" \
-o <output.mp3>
Stable Audio may not support separate lyrics input. For vocal tracks, combine lyrics into the prompt text. Check current Stability AI docs for the latest endpoint and parameters.
If a music generation CLI is detected that is not listed above, use it with the production-sheet prompt as the text input and the lyrics file if the CLI supports it. Adapt the command to the CLI's expected arguments.
Before asking anything, infer what you can from the user's message:
Use these deterministic first responses before asking follow-up questions:
After auto-detect, ask 1–3 questions max. Use these exact patterns when needed, and prefer the shortest set that closes the gap:
Auto-detect what you can first. Do not ask about language, genre, mood, or duration if the request already makes them obvious. Do not ask more than three questions total.
The prompt you pass to music_generate is not a restatement of the user's words. It is a structured brief that follows the formula in references/prompt-formula.md. The short version:
[Genre/subgenre], [mood], [voice type and language],
[instruments — list EVERY instrument explicitly],
[anti-sparse instruction],
[BPM] BPM in [key],
[structure description with tags],
[dynamic/arrangement instructions],
[production quality],
[things to avoid]
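To make the formula concrete, here is a minimal sketch that assembles the brief from named fields. The class and field names are hypothetical, and the default anti-sparse wording is borrowed from this skill's anti-sparse rule; the real formula lives in references/prompt-formula.md.

```python
from dataclasses import dataclass, field


@dataclass
class ProductionSheet:
    genre: str
    mood: str
    voice: str
    instruments: list
    bpm: int
    key: str
    structure: str
    dynamics: str
    production: str
    avoid: list = field(default_factory=list)
    # Default guard wording taken from the skill's anti-sparse rule.
    anti_sparse: str = "ALL instruments ALWAYS playing throughout, NEVER a cappella or silent"

    def render(self):
        """Render the fields in formula order, one comma-terminated line each."""
        if not self.instruments:
            raise ValueError("list every instrument explicitly")
        lines = [
            f"{self.genre}, {self.mood}, {self.voice}",
            ", ".join(self.instruments),
            self.anti_sparse,
            f"{self.bpm} BPM in {self.key}",
            self.structure,
            self.dynamics,
            self.production,
        ]
        if self.avoid:
            lines.append(", ".join(f"AVOID {item}" for item in self.avoid))
        return ",\n".join(lines)
```

Keeping the fields separate makes it easy to change one element (for example the instrument list) on a retry without touching the rest of the brief.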
If the user provides lyrics, add section tags ([Verse], [Chorus], and so on) without altering the words. If the skill writes the lyrics, structure them from the start.
Rules:
- [Break] for dramatic pauses (1–2 seconds)
- [Build Up] before the first chorus
- toooooou, rieeeeen
For the full tag reference, see references/structure-tags.md.
Call the detected backend with the production-sheet prompt and structured lyrics. Use the backend-specific generation command from the Backend Generation section. Adapt the prompt format to the backend (e.g., MusicGen needs prompt + lyrics combined into one text block; mmx accepts them separately).
After the tool returns, verify:
If the output fails verification:
Never retry the same prompt plus lyrics combination twice in a row.
Common adjustment rules:
- [Bridge], [Break], or [Build Up] so the shape is unambiguous.
- [Outro], request a full ending with the final line held, and avoid abrupt cut-off language.
Before asking any question or writing any prompt, run a two-pass intake on the user's request: extract the required fields, then label each one's confidence.
Every request, after auto-detect, should land on this list. Mark each field as one of: clear, inferred, missing, conflicting.
| # | Field | What to look for |
|---|---|---|
| 1 | Language | The language of the lyrics and the vocals |
| 2 | Genre / subgenre | Pop, rock, lofi, reggaeton, synthwave, etc. — be specific |
| 3 | Mood | Emotional tone (sad, joyful, dark, hopeful, ...) |
| 4 | Theme | Topic or story (love, summer, road trip, heartbreak) |
| 5 | Vocal mode | Solo vocal, choir, instrumental, spoken word |
| 6 | Lyric source | User-provided, auto-generated, or instrumental-only |
| 7 | Duration | Seconds or minutes; jingle (~30s), standard (~3min), epic (~6min) |
| 8 | Structure | Number and order of sections (intro/verse/chorus/bridge/outro) |
| 9 | References | Named artists, songs, eras, or visual references |
| 10 | Output location | Where the audio file (and analysis files) should be saved |
The output path is part of the intake, not an afterthought. Confirm it before calling music_generate and let the user pick a per-song subfolder so the project does not end up as a flat folder of 30 MP3s called final_v3_take2.mp3.
Default question (ask only if the request is missing it):
Where should I save this and any analysis files? Two common shapes:
- Per-song subfolders (recommended when you are producing multiple versions or songs): ~/Music mix/<project>/<song-slug>/
  - Inside the subfolder: the MP3, the analysis JSON (if you used music-craft-minimax), the prompt.txt, the lyrics.txt
  - Each version of the same song lives in its own subfolder, or stacked under one subfolder with a version suffix on the MP3
- Single folder, single file: a flat path like ~/Music mix/<project>/<song-slug>.mp3
If you do not have a strong preference, the default is ~/Music mix/<project>/<song-slug>/<song-slug>.mp3 (per-song subfolder).
Conventions the LLM should follow when picking paths itself:
- two-paths, family-acoustic, when-you-bleed)
- openclaw- (protected namespace on ClawHub)
- ~/Music mix/dbc/two-paths/M1_synthwave.mp3 and M2_indiefolk.mp3
If any field is missing, that is a question to ask. If any field is conflicting, pause and resolve before prompting. If everything is clear or inferred, the request is ready to translate.
Request: "Canción francesa melancólica, 80 BPM, con voz masculina teatral."
clear: language=fr, genre=chanson, mood=melancholic, vocal_mode=solo_male, bpm=80
inferred: theme=romantic, duration=~3min, structure=standard
missing: lyrics_source
Request: "Make a sad Spanish pop song but with upbeat energy."
conflicting: mood (sad vs upbeat)
→ pause, ask: "Sad lyrics with an upbeat tempo, or sad throughout?"
Request: "Use these lyrics" (followed by user text)
clear: lyrics_source=user_provided
inferred: language (from text), structure (from text length)
missing: genre, mood, vocal_mode, duration
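The labeling in these examples can be sketched as a tiny decision helper. Field names and return values are illustrative, not part of any runtime API; the three-question cap comes from the skill's own rule.

```python
from enum import Enum


class Confidence(Enum):
    CLEAR = "clear"
    INFERRED = "inferred"
    MISSING = "missing"
    CONFLICTING = "conflicting"


# Illustrative names mirroring the ten rows of the intake table.
REQUIRED_FIELDS = (
    "language", "genre", "mood", "theme", "vocal_mode",
    "lyrics_source", "duration", "structure", "references", "output_location",
)
MAX_QUESTIONS = 3  # the skill asks at most three follow-up questions


def next_step(intake):
    """Return (action, fields) for a dict mapping field name -> Confidence."""
    # A field absent from the dict counts as missing.
    labels = {f: intake.get(f, Confidence.MISSING) for f in REQUIRED_FIELDS}
    conflicting = [f for f, c in labels.items() if c is Confidence.CONFLICTING]
    if conflicting:
        # Conflicts are resolved before anything else is asked.
        return ("resolve_conflict", conflicting)
    missing = [f for f, c in labels.items() if c is Confidence.MISSING]
    if missing:
        return ("ask", missing[:MAX_QUESTIONS])
    return ("ready", [])
```

Conflicts take priority over missing fields because a prompt built on a contradiction would need regeneration anyway.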
The four "languages" of a music request must not contradict each other:
- [Verse], [Chorus] (always English by convention)
Conflict examples that mean a regeneration is needed:
Quick check before music_generate:
- [Verse], not [Verso])
Some common phrases hide the real intent. Match the phrase to a route before asking follow-ups.
| User says... | Route | First question to ask (if any) |
|---|---|---|
| "Make a song like X" | Text-only style reference | "Anything from X you want me to lean on — vocals, instruments, era, all of it?" |
| "Use these lyrics" | User-provided lyrics | "What style and voice should it have?" |
| "Instrumental only" / "no vocals" / "background music" | Instrumental / jingle | "What duration and use case?" |
| "Turn this image into music" / "vibe like this" | URL/image enrichment | (analyze the image first, ask only if mood/genre still unclear) |
| "Cover this song" / "in the style of this track" with audio | Audio cover — redirect | "That needs cover / style transfer from audio. Switching to music-craft-minimax." |
| "Make a song" / "something for my project" with no other info | Vague request | "Genre, mood, language, or theme you have in mind? Or want me to surprise you?" |
Always enrich before asking when the input is an image or URL. Fetch the page or analyze the image, then route based on what was extracted.
For the full nine input shapes (description, user-lyrics, audio file, YouTube audio, song name, lyrics URL, YouTube metadata, image, genre/cultural) and their routing rules, see references/input-workflows.md.
The single most common failure mode of music generators: interpreting "sparse", "quiet", or "minimal" as "remove all instruments and vocals".
- accordion, upright bass, orchestral strings, piano, light percussion.
- ALL instruments ALWAYS playing throughout, NEVER a cappella or silent.
- AVOID sparse minimal arrangements, AVOID a cappella sections.
- quiet sections: reduced to accordion and bass only, still fully played.
- sparse arrangement
- minimal instrumentation
- stripped back
- a cappella section
- quiet with no instruments
If the user asks for any of these, translate them into the explicit-instrument form.
Every mood, energy, or emotion word in the prompt must be tied to at least one concrete production detail. A mood word with no grounding will be ignored — the model defaults to a "neutral pleasant" register.
| Mood word | Required grounding (pick at least one) |
|---|---|
| sad | minor key, slow BPM, breathy vocal, sparse chord pattern, low strings |
| energetic | fast BPM, driving drums, sharp synth hits, strong rhythm guitar |
| romantic | warm strings, soft vocal register, sustained pads, slow harmonic rhythm |
| dark | minor key, low register, distorted bass, low-pass mix, breathy vocal |
| dreamy | reverb-heavy mix, soft attack, layered pads, sustained vocal |
| aggressive | distorted guitars, fast BPM, shouted vocal, heavy drums |
| triumphant | major key, building dynamic, brass hits, declarative vocal |
| intimate | close-mic vocal, low dynamic range, soft attack, single voice |
If a mood word cannot be grounded, drop it. A grounded prompt with five moods beats an ungrounded prompt with fifteen. For the full emotion quick reference (21 emotions with prompt + lyrics + arrangement templates), see references/prompt-formula.md.
Different providers have different limits. The exact values depend on the active OpenClaw runtime configuration, but the common ranges are:
| Tier | Typical RPM | Notes |
|---|---|---|
| Free / trial | 5–20 | Lower concurrency, watermarking may apply |
| Standard paid | 60–120 | Generous for personal use |
| Heavy / batch | 1000+ | Dedicated plans |
Defaults to assume (for the most common providers):
Before submitting a batch, check the active provider's plan. Generating 10 variations in quick succession on a free tier will rate-limit the 3rd or 4th call.
If a call fails with 429 (rate limit):
Before delivering a generated song to the user, walk this list mentally. If 3 or more items fail, the prompt needs adjustment and a regeneration. If 1–2 fail, you can either accept the result and warn the user, or make a targeted fix and regenerate.
The eight items above are about audio quality. In addition, confirm that the output matches the user's original request on these specific points. If any item fails, the prompt needs adjustment before delivery.
| Check | What to confirm | Failure action |
|---|---|---|
| Language | Vocals are in the requested language | Restate language in prompt; ensure lyric body is in the same language |
| Vocal / instrumental mode | Instrumental request has no vocals; vocal request has vocals | Add or remove Instrumental only, no vocals; check instrumental flag |
| Section structure | Number and order of sections match the plan | Add or correct [Verse], [Chorus], [Bridge], [Outro] tags in the lyrics |
| Lyrics source | If user provided lyrics, the words appear recognizably | Stop and tell the user the generator paraphrased their text; revert to a tighter prompt |
| Duration | Output length is in the requested range | Add or trim sections in the lyrics body; longer output rarely comes from asking for more length, only from adding more tagged sections |
| Stylistic references | Named artists or eras come through in genre / instruments / vocals | Translate the reference into concrete descriptors; add them to the prompt |
If 3+ request-fit items fail, regenerate with a revised prompt. If 1–2 fail, warn the user and offer a targeted fix. For the specific fix patterns, see references/error-handling.md.
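The threshold rule above fits in a few lines. The function name and the returned labels are hypothetical; the 3+ and 1–2 cut-offs are the ones stated in this section.

```python
def delivery_action(failed_checks):
    """Decide what to do from the list of failed checklist items."""
    count = len(failed_checks)
    if count == 0:
        return "deliver"
    if count >= 3:
        # Three or more failures: revise the prompt and regenerate.
        return "regenerate"
    # One or two failures: warn the user and offer a targeted fix.
    return "warn_and_offer_fix"
```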
When the output is close but not right, do not regenerate from scratch. Build a revision prompt that only changes the failing element and keeps everything else intact.
Keep 80% of the original prompt. Add a single REVISION: block at the end that targets the specific failure.
[Original prompt, unchanged]
REVISION:
- [Specific change 1]
- [Specific change 2]
- Keep: [elements that already work and should NOT change]
Always pass the same lyrics body for revision requests unless the lyrics themselves are the failure.
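A sketch of how the REVISION: block could be assembled so the original prompt is never edited in place. The helper name is an assumption; the layout follows the template above.

```python
def build_revision_prompt(original_prompt, changes, keep):
    """Append a REVISION: block to an unchanged original prompt."""
    if not changes:
        raise ValueError("a revision needs at least one specific change")
    lines = [original_prompt.rstrip(), "REVISION:"]
    lines.extend(f"- {change}" for change in changes)
    if keep:
        # Name what already works so the generator does not drift on it.
        lines.append("- Keep: " + ", ".join(keep))
    return "\n".join(lines)
```

For example, the wrong-language case would pass changes=["Vocals in French only, no English"] and keep=["genre", "mood", "instruments"].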
Output is too sparse:
[Original prompt]
REVISION:
- Strengthen the anti-sparse guard: "ALL instruments ALWAYS playing, NEVER a cappella"
- Re-list every instrument: accordion, upright bass, strings, piano, light percussion
- Quiet sections: reduce to accordion and bass only, still fully played
- Keep: language, vocal type, structure
Output is in the wrong language:
[Original prompt]
REVISION:
- Vocals in [target language] only, no English
- Lyric body is already in [target language] — keep as is
- Keep: genre, mood, instruments
Chorus is weak:
[Original prompt]
REVISION:
- Chorus: melody-driven, hook-forward, with sustained vowels
- Add a [Pre Chorus] section before each [Chorus] for build
- Keep: verse style, language, instruments
Vocal mode is wrong (vocals in instrumental, or silent in vocal request):
[Original prompt]
REVISION:
- [Add or remove] "Instrumental only, no vocals" line
- If vocals were missing: add "lead vocal in [language], [register], [delivery]"
- Keep: genre, mood, structure
For the full retry recipe library (wrong language, weak chorus, sparse arrangement, vocals in instrumental, missing genre identity, too generic output), see references/error-handling.md.
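The revision template shown earlier can also be assembled programmatically. The sketch below assumes a hypothetical helper named `build_revision_prompt`; it keeps the original prompt unchanged and appends a single REVISION: block, which is the pattern the recipes above follow.

```python
from typing import List


def build_revision_prompt(original_prompt: str, changes: List[str], keep: List[str]) -> str:
    """Append one REVISION: block to an unchanged original prompt.

    Hypothetical helper for illustration; not part of the skill itself.
    """
    if not changes:
        raise ValueError("a revision needs at least one targeted change")
    lines = [original_prompt.rstrip(), "REVISION:"]
    # One bullet per targeted change.
    lines.extend(f"- {change}" for change in changes)
    if keep:
        # List the elements that already work and must not change.
        lines.append("- Keep: " + ", ".join(keep))
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_revision_prompt(
        "Dream pop, 96 BPM, female lead vocal",
        ["Vocals in Spanish only, no English"],
        ["genre", "mood", "instruments"],
    ))
```

The lyrics body is deliberately not a parameter here: per the guidance above, it is passed unchanged unless the lyrics themselves are the failure.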
When music_generate is called without explicit lyrics and the request implies a vocal track (not instrumental), the runtime may auto-generate lyrics from the prompt. The exact behavior depends on the provider:
| Provider | Behavior when no lyrics |
|---|---|
| MiniMax | Sets lyrics_optimizer: true automatically; generates structured lyrics matching the prompt's theme and language |
| ACE-Step | Uses 5Hz LM (when thinking=true) to rewrite tags and generate audio codes; can fill in missing metadata |
| ElevenLabs | Requires explicit lyrics; without them, returns instrumental or errors |
| Other | Varies — check provider docs |
This means: if the user did not provide lyrics and did not say "instrumental", the output will have AI-written lyrics in the language and theme implied by the prompt. If the user wants specific words, they must provide them. If they want no vocals, set instrumental: true.
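As a rough sketch of that decision, the helper below picks the lyrics-related request fields. The `instrumental` field comes from the text above; the `lyrics` field name, the function name, and the rule that an instrumental request wins over supplied lyrics are assumptions, so check the actual tool schema before relying on them.

```python
from typing import Dict, Optional


def lyrics_request_fields(user_lyrics: Optional[str], wants_instrumental: bool) -> Dict[str, object]:
    """Choose lyrics-related fields for a generation request.

    Hypothetical helper: the `lyrics` key is an assumed field name.
    """
    if wants_instrumental:
        # The user wants no vocals: set instrumental explicitly.
        return {"instrumental": True}
    if user_lyrics and user_lyrics.strip():
        # The user wants specific words: pass them through unchanged.
        return {"lyrics": user_lyrics}
    # Neither given: the provider may auto-generate lyrics from the prompt.
    return {}
```

An empty result is the case where the user should be told in advance that the lyrics will be AI-written.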
The openclaw-music-workflow-minimax skill extends this with an optional fetch_lyrics_web.py script that looks up lyrics from LRCLib for known mainstream songs. This is a quality boost for popular tracks but not a general solution: LRCLib has poor coverage for instrumentals, obscure bands, and friend-produced music. When the lookup returns nothing, the workflow silently falls back to Whisper transcription (or the lyrics optimizer). Most use cases can ignore this; it matters only when the user needs better lyrics for a well-known song.
If the user is surprised by the AI-written lyrics, that is a workflow issue (the Pre-Flight should have asked), not a generation issue. Adjust by asking the user next time whether they want auto-lyrics or want to provide their own.
The skill does not start with a questionnaire. It starts by reading and inferring.
| User says... | Skill does... |
|---|---|
| "Make a sad love song in Spanish" | Auto-detect: ES, romantic, ~3 min. Ask: lyrics source and vocal register. |
| "Instrumental lofi for studying" | Auto-detect: lofi, no vocals, ~3 min. Ask: nothing. Generate. |
| "Here are the lyrics, make it pop" | Auto-detect: pop, user-lyrics. Ask: tempo and energy preference. |
| "Something that sounds like Rosalía" | Auto-detect: modern Latin pop, female vocal. Ask: lyrics source and theme. |
| "I don't know, surprise me" | Pick a coherent default (for example upbeat indie pop, EN, ~3 min, auto-lyrics) and confirm with the user before generating. |
For the full decision table and edge cases, see references/user-preference-flow.md.
A generation should never land as a stray MP3 at the project root. Always save inside a per-song subfolder so the analysis, the prompt, the lyrics, and every version stay grouped and reviewable.
<project-root>/ e.g. ~/Music mix/dbc/
└── <song-slug>/ e.g. two-paths/
    ├── <version-prefix>_<style>.mp3 e.g. M1_synthwave.mp3, M2_indiefolk.mp3
    ├── <song-slug>_analysis.json (only when using music-craft-minimax)
    ├── <song-slug>_lyrics.txt
    ├── <song-slug>_synthwave.txt (the prompt that generated M1)
    └── <song-slug>_indiefolk.txt (the prompt that generated M2)
openclaw- (ClawHub protected namespace)
When generating multiple versions of the same song (cover, mashup, style transfer, prompt revision), prefix the MP3 with a short label so the user can scan the folder:
| Prefix | When to use |
|---|---|
| A_ | First attempt, base skill only |
| B_ | First revision with a different style |
| C_ | Cover version of an existing track |
| M1_, M2_ | Mashup / mixed source (Song A + Song B) |
| N1_, N2_ | Style transfer to a target style |
| v2_, v3_ | Polish / second take of a previous version |
~/Music mix/dbc/
├── when_you_bleed/ (7 versions: A, B1, C, M1, M2, N1, N2)
├── family_acoustic/ (2 versions: cinematic + base_skill)
├── two_paths/ (2 versions: M1_synthwave, M2_indiefolk)
└── give_us_a_reason/ (2 versions: M3_industrial, M4_postpunk)
This is the structure the LLM should aim for by default. Ask the user before deviating.
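The folder layout and version prefixes above can be expressed as path helpers. This is a minimal sketch: `version_path` and `prompt_path` are hypothetical names, and the helpers only compute paths without creating directories or writing files.

```python
from pathlib import Path


def version_path(project_root: str, song_slug: str, prefix: str, style: str) -> Path:
    """Return <project-root>/<song-slug>/<version-prefix>_<style>.mp3.

    Hypothetical helper; accepts the prefix with or without its trailing underscore.
    """
    label = prefix.rstrip("_")
    return Path(project_root) / song_slug / f"{label}_{style}.mp3"


def prompt_path(project_root: str, song_slug: str, style: str) -> Path:
    """Return <project-root>/<song-slug>/<song-slug>_<style>.txt for the saved prompt."""
    return Path(project_root) / song_slug / f"{song_slug}_{style}.txt"


if __name__ == "__main__":
    print(version_path("~/Music mix/dbc", "two-paths", "M1_", "synthwave"))
    print(prompt_path("~/Music mix/dbc", "two-paths", "synthwave"))
```

Keeping the MP3 and its prompt file under the same song folder is what lets a later revision find the exact prompt that produced a given version. A caller would still need to expand `~` (for example with `Path.expanduser`) and create the folder before saving.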
- references/prompt-formula.md: full production-sheet formula, worked examples across genres, prompt lint, and the emotion quick reference
- references/structure-tags.md: all section tags with rules, effects, and timing hints
- references/user-preference-flow.md: the auto-detect plus ask decision table and edge cases
- references/examples.md: five worked examples (Spanish pop, English instrumental jingle, user-provided lyrics, image-inspired track, text-only style reference) with intake → prompt → verification for each
- references/style-categories.md: 10 style categories with default instruments, BPM range, and mood
- references/input-workflows.md: 9 input types (description, user-lyrics, audio file, YouTube audio, song name, lyrics URL, YouTube metadata, image, genre/cultural), plus the signal-extraction rubric and confidence levels
- references/error-handling.md: error table, retry recipes (wrong language, weak chorus, sparse, vocals in instrumental, missing genre, too generic), and recovery patterns
- references/free-tool-inputs.md: web_fetch, web_search, image, and memory tools for enriching inputs without scripts
- music-craft-minimax/scripts/lint_music_request.py: optional standard-library helper for routing, blockers, missing fields, prompt, and mmx flag linting. Run it before generating to catch missing required slots, conflicting language signals, and vague mood words without grounding.
- references/prompt-formula.md under "Mood" and the full shared emotion recipes in music-craft-minimax/references/emotion-delivery.md