Install
openclaw skills install music-craft-minimax
Advanced music generation for OpenClaw, using the MiniMax Music 2.6 token plan. Use for cover and style transfer, two-song mashup, lyrics generation API, emotion-driven prompt engineering, and fine control via the `mmx` CLI. Extends `music-craft` with MiniMax-specific features.
This is the power-user upgrade of `music-craft`. It does everything that skill does, plus the features that require the MiniMax Music 2.6 token plan (cover and style transfer, two-song mashup, the lyrics generation API, emotion-driven prompt engineering, and fine control via `mmx`).
For everything else (standard song generation, instrumentation, anti-sparse prompt engineering, structure tags, user preference flow), this skill uses the same workflow as `music-craft`. Read that skill first to understand the base, then come back here for the MiniMax-specific extensions.
Classify the request before analysis or generation:
The scripts/lint_music_request.py helper emits one of these routes:
| Route | When |
|---|---|
| `base_prompt` | Standard generation, no MiniMax-specific feature needed. |
| `minimax_cover` | Melody-preserving cover from audio or YouTube. |
| `minimax_mashup` | Two-song mashup (A + B, both identified). |
| `minimax_style_transfer` | Style transfer that does not preserve the source melody. |
| `minimax_emotion_prompt` | Emotion analysis, or precision `mmx` flag usage. |
| `needs_clarification` | At least one blocker is unresolved; ask the user first. |
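The route table can be read as a small decision function. The sketch below is only an illustration of that reading, not the logic inside `scripts/lint_music_request.py`; the `Request` fields and the `pick_route` name are assumptions introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    # Hypothetical intake summary; these field names are illustrative assumptions.
    has_source_audio: bool = False        # local file or playable YouTube URL
    preserve_melody: bool = False         # cover keeps the source melody
    second_song_identified: bool = False  # song B is known (mashup)
    wants_style_transfer: bool = False
    wants_emotion_analysis: bool = False
    blockers: list = field(default_factory=list)


def pick_route(req: Request) -> str:
    """Map a request summary to one of the six route names from the table."""
    if req.blockers:
        return "needs_clarification"  # unresolved blockers always win
    if req.has_source_audio and req.second_song_identified:
        return "minimax_mashup"
    if req.has_source_audio and req.preserve_melody:
        return "minimax_cover"
    if req.has_source_audio and req.wants_style_transfer:
        return "minimax_style_transfer"
    if req.wants_emotion_analysis:
        return "minimax_emotion_prompt"
    return "base_prompt"


print(pick_route(Request(has_source_audio=True, preserve_melody=True)))
```

Checking blockers first mirrors the rule that an unresolved blocker means asking the user before anything else.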
Surface blockers before analysis:
After you have prompt text and mmx flags, lint them together before generation:
- `--bpm`
- `--key`
- `--structure`
- `--duration` (or implicit length expectation)
- `--vocals`
- `--language`
- `--avoid`

If the user only has a text reference, route to the free-tool path in `references/free-tool-inputs.md` first. If the user has audio, analyze first and only then build the prompt. The linter returns a `retry_guidance` array with one hint per conflict so the operator can re-align prompt and flags on the next attempt.
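A prompt/flag consistency check of this kind could look like the sketch below. It is a simplified stand-in, not the real linter: the function name, the regular expressions, and the hint wording are assumptions, and only the BPM and key checks are shown.

```python
import re


def lint_prompt_vs_flags(prompt: str, mmx_flags: dict) -> list:
    """Return one hint per conflict between the prompt text and explicit flags."""
    hints = []

    # BPM: compare every "<number> BPM" mention in the prompt with --bpm.
    bpm_flag = mmx_flags.get("--bpm")
    for match in re.finditer(r"(\d{2,3})\s*BPM", prompt, flags=re.IGNORECASE):
        if bpm_flag is not None and int(match.group(1)) != int(bpm_flag):
            hints.append(f"prompt says {match.group(1)} BPM but --bpm is {bpm_flag}")

    # Key: compare "<note> major|minor" mentions with --key (case-insensitive).
    key_flag = mmx_flags.get("--key")
    for match in re.finditer(r"\b([A-G][#b]?)\s+(major|minor)\b", prompt):
        found = f"{match.group(1)} {match.group(2)}"
        if key_flag is not None and found.lower() != str(key_flag).lower():
            hints.append(f"prompt says {found} but --key is {key_flag}")

    return hints


retry_guidance = lint_prompt_vs_flags(
    "Tempo 96 BPM in D major, dreamlike half-time feel.",
    {"--bpm": 90, "--key": "D minor"},
)
print(retry_guidance)
```

An empty list means the prompt and flags agree on the checked fields; each entry otherwise corresponds to one conflict to fix before the next attempt.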
Use this skill when the task involves:
- The `lyrics_generation` API
- Modifying existing lyrics (`lyrics_generation` edit mode)
- Using the `mmx` CLI directly for fine control over `--avoid`, `--bpm`, `--key`, `--structure`, `--vocals`, `--instruments` as separate flags
- The `music-cover` or `music-cover-free` models for melody preservation

After the Routing and Blocker Checks classify the request, run this 2-pass intake to extract the full set of fields the user cares about. Label each field's confidence: `clear` (user said it), `inferred` (sensible default), `missing` (need to ask), or `conflicting` (user said two incompatible things; pause to resolve).
| # | Field | What to look for | MiniMax-specific notes |
|---|---|---|---|
| 1 | Route | Cover / style transfer / mashup / standard / emotion prompt | From the Routing and Blocker Checks section. Determines which MiniMax features to use. |
| 2 | Source audio or URL | File path or playable YouTube URL | Required for cover, mashup, style transfer. For standard, optional (text-only style reference is also fine). |
| 3 | Song A identity | Name, artist, audio | For mashup: needed. For cover: this is the source. |
| 4 | Song B identity | Name, artist, audio | For mashup only. |
| 5 | Target style | Genre / mood / reference | The destination of the cover or style transfer. If user says "like Rosalía", that's clear. If user says "something good", that's missing. |
| 6 | Lyrics decision | Original / translated / new / instrumental | For cover, default to original (translated if user requests it). For standard, default to new (or user-provided). |
| 7 | Vocal mode | Solo / duet / choir / instrumental | Drives --vocals and --language flags. |
| 8 | Language | BCP-47 code (en, fr, es, etc.) | For lyrics language AND vocal language. |
| 9 | Duration | Approximate length (jingle ~30s, standard ~3min, epic ~6min) | mmx has no native duration control (see "Song length" section). Length is driven by lyrics + structure, so the intake needs the lyrics to control length. |
| 10 | BPM, key, structure | Exact values if user wants --bpm/--key/--structure | Optional. If provided, the prompt AND flags must agree (lint them). |
| 11 | Emotion arc | For emotion-prompt workflows: which emotions to emphasize | Drives the analysis-to-prompt translation. |
| 12 | Output location | Where the audio and analysis files go | Same as the base skill — per-song subfolder in ~/Music mix/<project>/<song-slug>/. |
```text
Request: "Hazme un cover del 'Bizcochito' de Rosalía pero en reggaetón"
  clear:    source_audio=path, song_a=Bizcochito, target_style=reggaeton
  inferred: language=es, vocal_mode=solo_female, lyrics_decision=original
  missing:  output_location (which project folder? per-song subfolder?)
            vocal_register (full chest, head voice, whisper? affects --vocals flag)
```

```text
Request: "I have a YouTube link of an old rock song and want it as a dreamy shoegaze ballad, with English lyrics because the original is in French"
  clear:    source_url=URL, song_a=old_rock_song, target_style=shoegaze
            lyrics_decision=translated, target_language=en
  inferred: vocal_mode=duet or solo (depends on original), ~3min
  missing:  audio source for source audio analysis (YouTube needs to be downloaded first)
            BPM/key from analysis output (will be filled in after analysis)
            output_location
```
If any field is missing or conflicting, that's a question to ask. The Ambiguity Questions section below has specific patterns for each route. If everything is clear or inferred, the request is ready to translate.
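The confidence labels lend themselves to a simple data structure. The following is a minimal sketch, assuming each intake field is stored as a `(value, confidence)` pair; the `questions_needed` helper is a hypothetical name, not part of the skill's scripts.

```python
CONFIDENCE_LEVELS = ("clear", "inferred", "missing", "conflicting")


def questions_needed(intake: dict) -> list:
    """Return the field names that still need a question to the user."""
    for name, (_, confidence) in intake.items():
        if confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence for {name}: {confidence}")
    # Only missing or conflicting fields block translation to a prompt.
    return [name for name, (_, c) in intake.items() if c in ("missing", "conflicting")]


# The first worked example above, expressed as (value, confidence) pairs.
intake = {
    "source_audio": ("path", "clear"),
    "target_style": ("reggaeton", "clear"),
    "language": ("es", "inferred"),
    "output_location": (None, "missing"),
    "vocal_register": (None, "missing"),
}
print(questions_needed(intake))  # ['output_location', 'vocal_register']
```

An empty result corresponds to "everything is clear or inferred", so the request is ready to translate.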
The skill does not start with a questionnaire. It starts by reading and inferring from the user's natural-language request.
| User says... | Skill does... |
|---|---|
| "Haz un cover de X en Y" | Route: minimax_cover. Ask: source audio file (or download from YouTube), target language for lyrics, vocal register. |
| "Make this song sound like Rosalía" | Route: minimax_style_transfer. Ask: source audio, which album/era of Rosalía. |
| "I have audio of A, mash with B, keep A's melody" | Route: minimax_mashup. Ask: A vs B confirmation, source audio for A, B can be name or audio. |
| "Analyze the emotion curve of this track" | Route: minimax_emotion_prompt (analysis-only). Run analysis_orchestrator.py --audio first, then read the JSON. |
| "I want the lyrics to be about X, in French, melancholic" | Route: base_prompt (standard). Use the lyrics API to generate, then pass to mmx music generate --lyrics-file. Ask: target BPM/key/structure or derive from analysis. |
| "Recreate the song but in 90 BPM D minor" | Route: base_prompt with mmx flags. Lint prompt vs flags before generation. Verify BPM/key consistency. |
| "I don't know, surprise me" | Pick a coherent default (e.g. upbeat indie pop, EN, ~3min, auto-lyrics, standard generation) and confirm with the user before generating. |
| "Same song again but as a reggaeton version" | Route: minimax_cover with the existing song as source. Use the same project/song subfolder, suffix the MP3 (M1_original.mp3 + M2_reggaeton.mp3). |
This table summarizes `references/user-preference-flow.md` (which lives in the base skill). For a more detailed case, defer to the base skill's table and combine it with this skill's route mapping.
MiniMax-specific additions (drop these into the per-song subfolder alongside the base items):
| File | Source | Notes |
|---|---|---|
| `<song-slug>_analysis.json` | `analysis_orchestrator.py --output` | MiniMax-specific analysis results (emotion, BPM, key, segments) |
| `<song-slug>_lyrics.txt` | `mmx music generate --lyrics-file` | Optional if user provided lyrics inline |
| `<song-slug>_<style>_prompt.txt` | The exact text passed to `--prompt` | For reproducibility |
The LLM should aim for the base skill's layout by default. The MiniMax-specific files are added on top when MiniMax features are used (cover workflow, mashup, analysis, etc.).
For any input combination, the analysis_orchestrator.py script is the single entry point:
```bash
# Audio file
python3 scripts/analysis_orchestrator.py --audio /tmp/song.wav

# Two songs (mashup) - gets BPM + key compatibility scoring for free
python3 scripts/analysis_orchestrator.py --audio /tmp/song_a.wav --audio /tmp/song_b.wav

# Video - extracts audio + visual features (scenes, color, motion)
python3 scripts/analysis_orchestrator.py --video /tmp/clip.mp4

# Image (album art) - color palette + style hints
python3 scripts/analysis_orchestrator.py --image /tmp/album_art.jpg

# YouTube URL - downloads then analyzes
python3 scripts/analysis_orchestrator.py --youtube "https://youtube.com/watch?v=..."

# Combination: audio + image
python3 scripts/analysis_orchestrator.py --audio /tmp/song.wav --image /tmp/art.jpg

# Demucs source separation - for TIMBRE/PITCH analysis of an isolated vocal, NOT for lyrics
python3 scripts/analysis_orchestrator.py --audio /tmp/song.wav --use-demucs

# Whisper lyrics extraction - run on the FULL mix (do NOT pre-separate with Demucs)
python3 scripts/analysis_orchestrator.py --audio /tmp/song.wav --lyrics

# VLM captioning for images (calls mmx vision describe / MiniMax 3.0; cloud, skip if MiniMax is blocked)
python3 scripts/analysis_orchestrator.py --image /tmp/album_art.jpg --vlm
```
The orchestrator dispatches to the right analysis scripts and produces a unified JSON. Optional packages (CLAP, autochord, allin1, pyloudnorm, pylette, scenedetect, demucs, beat_this, basic-pitch, transformers/MERT, open_clip) are detected at runtime and used when available.
These are the rules that make the extracted data useful to the downstream generator. They are tool-agnostic — they apply whether the backend is MiniMax cloud or a local model.
- Use the Whisper `large-v2` model for sung lyrics; `large-v3` is reliably worse on singing. Use `medium`/`base` only as a speed compromise.
- Hedge `low`/`medium` detections ("around 128 BPM", "likely D minor") and never inject missing values as facts (see Analysis Quality below).
- `[Verse]`/`[Chorus]`/`[Bridge]` tag roadmap, and (for backends that support it) the repaint windows for fixing one bad section instead of regenerating the whole track.

Every generation should be saved into a per-song subfolder that bundles the audio with its analysis, prompt, and lyrics. The LLM should ask the user for the project root and song slug up front (default: `~/Music mix/<project>/<song-slug>/`), then run the full chain of commands below.
```bash
# Example: DBC - Two Paths, two versions

# 1. Make the subfolder
mkdir -p ~/Music\ mix/dbc/two-paths

# 2. Run the analysis and save JSON into the subfolder
python3 scripts/analysis_orchestrator.py \
  --audio /tmp/two_paths.wav \
  --use-demucs --lyrics --lyrics-source auto \
  --output ~/Music\ mix/dbc/two-paths/two_paths_analysis.json

# 3. Build the prompt from the analysis, save it next to the JSON
python3 scripts/emotion_to_prompt.py \
  --emotion ~/Music\ mix/dbc/two-paths/two_paths_analysis.json \
  --output ~/Music\ mix/dbc/two-paths/two_paths_synthwave_prompt.txt

# 4. Generate each version, save the MP3 into the subfolder with a
#    versioned filename so multiple takes stack cleanly
mmx music generate \
  --prompt "$(cat ~/Music\ mix/dbc/two-paths/two_paths_synthwave_prompt.txt)" \
  --lyrics-file ~/Music\ mix/dbc/two-paths/two_paths_lyrics.txt \
  --out ~/Music\ mix/dbc/two-paths/M1_two_paths_synthwave.mp3
```
The result is a self-contained song folder that the user can review, archive, share, or re-generate from without losing any context.
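The folder and versioned-filename convention can be sketched in a few lines. This is an illustration under assumptions: `song_folder` and `next_version_name` are hypothetical helpers, and the `M<n>_<name>_<style>.mp3` pattern is inferred from the example filenames above rather than taken from a documented rule.

```python
from pathlib import Path


def song_folder(root: str, project: str, song_slug: str) -> Path:
    """Per-song subfolder: <root>/<project>/<song-slug>/."""
    return Path(root).expanduser() / project / song_slug


def next_version_name(existing: list, name: str, style: str) -> str:
    """Pick the next M<n>_ prefix so multiple takes stack without overwriting."""
    used = []
    for filename in existing:
        prefix = filename.split("_", 1)[0]
        if prefix.startswith("M") and prefix[1:].isdigit():
            used.append(int(prefix[1:]))
    return f"M{max(used, default=0) + 1}_{name}_{style}.mp3"


folder = song_folder("~/Music mix", "dbc", "two-paths")
print(folder.name)
print(next_version_name(["M1_two_paths_synthwave.mp3"], "two_paths", "reggaeton"))
```

Passing the existing filenames in as a list keeps the numbering logic separate from the filesystem, which makes it easy to check before anything is written.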
v1.0.0 is the first stable release. It builds on the v0.x series (v0.3.0 / v0.4.0 dev line) with stronger preflight routing, wider prompt/flag consistency, and explicit post-generation verification:
Preflight routing:
- `lint_music_request.py` now emits one of six routes: `base_prompt`, `minimax_cover`, `minimax_mashup`, `minimax_style_transfer`, `minimax_emotion_prompt`, or `needs_clarification`
- `retry_guidance` array on every conflict so the operator can re-align prompt and flags

Prompt and flag consistency:
- `mmx` prompt schema is documented in `references/examples.md`

Analysis quality:
- `summary` (tempo, key, sections, instrumentation, vocal traits, energy curve, hook points, mix notes)

Output verification:
Tests and portability:
v0.3.0 builds on v0.2.0 with a substantially richer analysis pipeline:
New analysis scripts (8):
- `extract_stems.py`: Demucs source separation (vocal/drums/bass/other)
- `track_beats.py`: beat_this beat + downbeat tracking (ISMIR 2024 SOTA)
- `extract_melody.py`: Spotify Basic Pitch polyphonic AMT → MIDI + key/scale
- `compute_audio_embedding.py`: MERT v1-330M music embeddings (vibe similarity)
- `classify_instruments.py`: MIT AST 527-class AudioSet tagging
- `extract_video_features.py`: extended with camera motion + VLM captioning
- `analyze_image.py`: extended with OpenCLIP, OCR, face detection, VLM caption
- `analysis_orchestrator.py`: single entry point, `--use-demucs`, `--vlm`, `--ocr` flags

New prompt slots (consumed in `emotion_to_prompt.py`):
- `beat grid: 4/4 at 150 BPM (confidence 0.80)` from beat_this
- `melodic key from MIDI: E minor; interval motion: mostly leaps; modal character: pentatonic, blues` from Basic Pitch
- `AST-detected sound palette: rock music (0.16), punk rock (0.14), grunge (0.20)` from MIT AST
- `emotion signature from analysis: intense, passionate, dramatic, triumphant` (expanded to 25-emotion classifier)
- `vocal texture in verse: breathier / more intimate than average` (per-section aggregation)
- `tempo: tight, on-beat delivery` (from `tempo_consistency`)
- `tonal character: dark warm tone, rolled-off highs` (from `brightness`)
- `instruments detected: electronic / synthetic textures` (from `instrument_hints`)
- `natural dramatic pauses detected at: 2s (11.7s pause), 20s (3.3s pause)` (from Demucs vocal-stem)
- `style direction: ...` (from `analyze_two_songs` `mashup_plan`)

Bug fixes:
Prompting wins (verified end-to-end with DBC Woodstock 2013):
Do not use this skill when:
- use `music-craft` instead (lighter, no MiniMax dependency)
- The runtime has no `music_generate` tool and there is no `MINIMAX_API_KEY` configured; both skills need the runtime

Use the base skill unless one of these MiniMax-specific needs is present:
- `mmx` control for BPM, key, structure, or avoid lists

If the user wants a new song that only borrows a style, stay in `music-craft` unless they also need exact flag control or lyrics API iteration.
If the source is a YouTube URL and download is blocked, ask for a local file before changing the workflow.
Use these defaults on the first pass:
- Use `mmx` flags if exact BPM/key/structure matter.
- Use `write_full_song` for blank-page generation and `edit` for revisions.
- Make `mmx` flags the source of truth and keep the prompt descriptive but non-conflicting.

Ask at most 1-3 questions. Separate blockers from quality tweaks:
Use these exact patterns when clarification is needed:
`music-craft`
This skill extends the base skill; it does not replace it. The shared concepts are:
| Concept | Where it lives |
|---|---|
| Pre-Flight Check (platform detection) | This skill (extended required list) |
| Anti-sparse rules (canonical text) | Base skill, referenced from here |
| Prompt formula (production sheet) | Base skill, referenced from here |
| Structure tags (14 tags) | Base skill, referenced from here |
| User preference flow (auto-detect + ask) | Base skill, referenced from here |
| Output file layout (per-song subfolders, slug rules, version prefix) | Base skill, referenced from here; MiniMax adds analysis.json and lyrics.txt |
| Rate limits (generic) | Base skill |
| Quality verification checklist | Base skill, extended here for MiniMax |
| Operating rules (6-step loop) | Base skill, summarized here with MiniMax-specific extensions |
The MiniMax-specific additions are:
| MiniMax concept | Where it lives |
|---|---|
| `mmx` CLI quick reference | This skill |
| `mmx` full flag reference | This skill, `references/mmx-flags-reference.md` |
| Cover workflow (one-step, two-step) | This skill, references/cover-workflow.md |
| Lyrics generation API | This skill, references/lyrics-generation.md |
| Mashup workflow (A + B) | This skill, references/mashup-workflow.md |
| Emotion analysis (vocal speed, intensity, pitch) | This skill, references/emotion-analysis.md |
| MiniMax-specific error handling | This skill, references/error-handling.md |
| Audio analysis scripts | This skill, scripts/ |
| Free tool inputs (web, image, memory) | Both skills — base layer in music-craft, MiniMax layer here in references/free-tool-inputs.md |
The platform detection block is the same as music-craft (run it first). The required and optional lists are extended for MiniMax.
- `python3`, `command -v`, and normal shell `export`/`PATH` checks.
- `python` or `py -3`, and verify env vars with `Get-ChildItem Env:MINIMAX_API_KEY` or `Test-Path Env:MINIMAX_API_KEY`.
- `ffmpeg`, `yt-dlp`, and `mmx` are PATH-sensitive; if `Get-Command`/`where.exe` cannot find them, restart the shell or add the install directory to `PATH`.
- `references/windows-wsl-setup.md`.
- `pip`/HuggingFace/model downloads inside WSL fail with `CERTIFICATE_VERIFY_FAILED` until the corporate root CA is installed in the distro (and `REQUESTS_CA_BUNDLE`/`SSL_CERT_FILE` point at the system bundle). A proxy env var (`HTTP_PROXY`) can also hijack `127.0.0.1` calls to a local API; unset it or set `no_proxy=127.0.0.1,localhost`. Both are covered in the base reference above.
- `mmx music cover`, the lyrics API, and any analysis script that calls MiniMax (e.g. `emotion_to_prompt.py`) will fail. In that case use only the local-capable tools (`yt-dlp`, `ffmpeg`, `librosa`, Whisper) for analysis and the local ACE-Step backend in `music-craft` for generation.

| Check | What it is | How to verify | If missing |
|---|---|---|---|
| `music_generate` tool | The runtime's built-in music generation tool | Inspect the active runtime's tool list | Tell the user: "This skill needs a `music_generate` tool, but the active runtime does not expose one. Configure a music provider in OpenClaw and try again." Stop. |
| `MINIMAX_API_KEY` env var | API key for the MiniMax Music 2.6 plan | `test -n "$MINIMAX_API_KEY" && echo "OK"` | Tell the user: "This skill needs the `MINIMAX_API_KEY` environment variable. Get one from your MiniMax account and export it. If you do not have a MiniMax Token Plan, use `music-craft` instead; it works with any provider." Stop. |
| `mmx` CLI | The MiniMax CLI for fine-flag control | `command -v mmx && mmx --version` (macOS/Linux) or `Get-Command mmx; mmx --version` (PowerShell) | Ask the user: install via the MiniMax install guide, or skip `mmx`-specific features and use the `music_generate` tool with prompts. Do not block: `mmx` is optional if the runtime has MiniMax configured, but Windows support is only partial and depends on PATH visibility. |
| `python3` | Required for the analysis scripts | `command -v python3` (macOS/Linux) or `python` / `py -3` (Windows PowerShell) | Tell the user: "The analysis pipeline (emotion analysis, mashup) needs Python 3.9+." Propose an install command for the active shell. Block emotion analysis if missing. |
| Tool | What it unlocks | Install per platform |
|---|---|---|
| `ffmpeg` | Audio conversion (WAV for analysis, MP3 export, trimming) | `apt install ffmpeg` · `brew install ffmpeg` · `winget install Gyan.FFmpeg` (restart PowerShell after install so PATH updates apply) |
| `yt-dlp` | YouTube audio download for cover and mashup | `pip install -U yt-dlp` or `py -3 -m pip install -U yt-dlp` on Windows; ensure the CLI is on PATH |
| `librosa` | Audio analysis (BPM, key, energy, structure) | `pip install librosa numpy scipy` |
| `parselmouth` | Better pitch tracking (Praat under the hood) | `pip install praat-parselmouth` |
| `scikit-learn` | Audio clustering (segment detection) | `pip install scikit-learn` |
The full per-platform install table is in the base skill's music-craft Pre-Flight Check.
Same as the base skill: for each missing optional tool, present three options — install (propose exact command, let user approve), skip (use the simple path), or cancel. Never auto-install.
If MINIMAX_API_KEY is missing, the redirect is to the base skill, not "install MiniMax" — the user may not have a Token Plan at all.
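The required and optional checks can also be run from Python instead of the shell. The sketch below assumes a hypothetical `preflight` helper; it uses only `shutil.which` and the environment, and it does not replace the platform detection block from the base skill.

```python
import os
import shutil


def preflight(env=None, which=shutil.which) -> dict:
    """Report which prerequisites are visible to this process."""
    env = os.environ if env is None else env
    report = {
        "MINIMAX_API_KEY": bool(env.get("MINIMAX_API_KEY")),
        "mmx": which("mmx") is not None,
        "ffmpeg": which("ffmpeg") is not None,
        "yt-dlp": which("yt-dlp") is not None,
    }
    # Without the key, redirect to the base skill rather than proposing an install.
    report["action"] = "continue" if report["MINIMAX_API_KEY"] else "use music-craft"
    return report


print(preflight())
```

The `env` and `which` parameters exist only so the check can be exercised without touching the real environment; a missing optional tool is reported, never installed.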
Generation runs on MiniMax's cloud — your laptop just sends the prompt and downloads the MP3, so generation itself uses negligible local memory.
However, this skill's local analysis scripts run on your machine and can use real memory. Before running the full analysis pipeline, check available memory:
| Script | Models loaded | Approx peak RAM |
|---|---|---|
| `analyze_vocal_emotion.py` | parselmouth (Praat) + scipy | ~500 MB |
| `analyze_audio.py` | librosa + transformers (MERT or MIT AST) | 2–4 GB |
| `extract_lyrics_whisper.py` | whisper model (tiny/base/medium) | 1–5 GB depending on model size |
| `extract_stems.py` | Demucs (htdemucs) | 2–4 GB |
| `emotion_to_prompt.py` | calls MiniMax API (negligible local) | <100 MB |
| `compute_audio_embedding.py` | MERT model | 1–2 GB |
| `classify_instruments.py` | MIT AST | 1–2 GB |
Combined (full analysis pipeline on a 4-min song): ~6–10 GB peak on top of OS and other apps. On unified-memory systems (Apple Silicon, integrated graphics), this competes directly with macOS/Windows and your other applications. On dedicated-GPU systems (NVIDIA, AMD), model memory is taken from system RAM unless you have CUDA acceleration.
Recommendations:
- For `extract_lyrics_whisper.py`, use the `tiny` model by default; `base`/`medium` are 2-5x heavier with marginal quality gain for most songs
- For `extract_stems.py`, the `--quality` flag controls Demucs model size; the default `htdemucs` is the lighter choice, while `htdemucs_ft` is a fine-tuned variant that takes roughly four times longer to run
- `analysis_orchestrator.py` (which loads everything)

The `scripts/smoke_test.py` script verifies the environment is set up; it does not test memory headroom. Run your own memory check before running a full analysis.
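One way to do that memory check is to compare the table's upper-bound figures against the memory you know is free. This is a rough sketch: `fits_in_memory`, the 2 GB headroom default, and the choice of upper bounds are assumptions, and the available-memory number must come from your own system tools.

```python
# Upper-bound peak RAM per script in GB, taken from the table above.
PEAK_RAM_GB = {
    "analyze_vocal_emotion.py": 0.5,
    "analyze_audio.py": 4.0,
    "extract_lyrics_whisper.py": 5.0,
    "extract_stems.py": 4.0,
    "emotion_to_prompt.py": 0.1,
    "compute_audio_embedding.py": 2.0,
    "classify_instruments.py": 2.0,
}


def fits_in_memory(scripts, available_gb, concurrent=False, headroom_gb=2.0):
    """Check whether the chosen scripts fit, leaving headroom for the OS.

    Sequential runs only need the largest single peak; concurrent loading
    is treated pessimistically as the sum of all peaks.
    """
    peaks = [PEAK_RAM_GB[name] for name in scripts]
    needed = sum(peaks) if concurrent else max(peaks, default=0.0)
    return needed + headroom_gb <= available_gb


chosen = ["extract_stems.py", "extract_lyrics_whisper.py"]
print(fits_in_memory(chosen, available_gb=8.0))                   # sequential
print(fits_in_memory(chosen, available_gb=8.0, concurrent=True))  # all at once
```

Summing upper bounds overestimates compared with the ~6–10 GB combined figure above, which is deliberate: the sketch errs toward refusing a run rather than swapping.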
The OpenClaw runtime exposes several free tools (web_fetch, web_search, image analysis, memory, browser) that enrich the music generation workflow. The base layer is documented in music-craft → Free Tool Augmentation and references/free-tool-inputs.md. This section shows how they compose with MiniMax-specific features.
| Tool | Purpose |
|---|---|
| `web_fetch` | Fetch URL content (lyrics pages, YouTube metadata, Wikipedia) |
| `web_search` | Find lyrics, artist info, genre descriptions |
| `image` / `MiniMax__understand_image` | Analyze album art, concert photos, music video screenshots |
| `memory_search` / `memory_get` | Recall user's prior music preferences |
| `browser` | JS-heavy site fallback (last resort) |
- `web_fetch` + `lyrics_generation`: fetch the user's draft from a URL, run it through `edit` mode for cleanup, generate.
- `web_search` + cover workflow: find covers in the target style, extract their characteristics, apply to the user's track.
- `image` + `mmx` per-flag control: analyze album art, translate to `--instruments`, `--bpm`, `--key`, `--structure` for fine-grained style matching.
- `memory` + emotion analysis: combine the user's prior preferences with deep audio analysis of a reference track.

For the full worked examples, parameter recommendations, and MiniMax-specific edge cases, see `references/free-tool-inputs.md`.
Same 6-step loop as music-craft, with MiniMax-specific extensions:
- `mmx` flags (see `references/mmx-flags-reference.md`) instead of packing everything into the prompt
- `references/lyrics-generation.md`)
- `music-cover` model for melody preservation

For the full 6-step detail, see `music-craft` → Operating Rules.
Unlike music-craft's ACE-Step backend (which takes audio_duration as a parameter), MiniMax Music 2.6 has no explicit duration flag. Output length is determined by:
- Each `[Verse]`/`[Chorus]` section takes ~15-30 seconds depending on word count and singing pace. A typical 3:30 song has ~150-200 lyrics words across 2 verses + 2 choruses + bridge.
- `[Intro]`, `[Instrumental Break]`, `[Outro]` add silent/sparse sections that extend total length without lyrics.

Practical recipe for a full 3:30 song:
- `[Verse 1]`, `[Pre-Chorus]`, `[Chorus]`, `[Verse 2]`, `[Bridge]`, `[Outro]` tags (full song structure, not just one chorus)
- `"full 3-minute song with intro, 2 verses, 2 choruses, bridge, and outro"` or use `--structure "intro-verse-pre_chorus-chorus-verse-chorus-bridge-chorus-outro"`
- `[Instrumental Break]` tags to control pacing

Don't expect `mmx` to hit 3:30 exactly. Output length varies by ±20-30s depending on the model. If you need precise length, ACE-Step is the right tool (it has `audio_duration`). If you want MiniMax's vocal quality and the song length is flexible, `mmx` is fine.
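Because length follows lyrics and structure, a quick estimate from the lyrics file can flag a draft that is far too short before generation. The sketch below applies the ~15-30 seconds-per-section figure to sung sections only; treating pre-chorus and bridge like verse and chorus is an assumption, and instrumental tags are deliberately not counted.

```python
import re

SECTION_SECONDS = (15, 30)  # rough per-section range quoted above


def estimate_sung_length(lyrics: str) -> tuple:
    """Return (min_seconds, max_seconds) for the sung sections in tagged lyrics."""
    tags = re.findall(r"^\[([^\]]+)\]", lyrics, flags=re.MULTILINE)
    sung = [t for t in tags if re.match(r"(verse|pre-chorus|chorus|bridge)", t, re.IGNORECASE)]
    return (len(sung) * SECTION_SECONDS[0], len(sung) * SECTION_SECONDS[1])


lyrics = """[Intro]
[Verse 1]
first verse lines
[Chorus]
chorus lines
[Verse 2]
second verse lines
[Chorus]
chorus lines
[Bridge]
bridge lines
[Outro]
"""
print(estimate_sung_length(lyrics))  # (75, 150)
```

The gap between this range and a 3:30 target is what the intro, instrumental breaks, and outro have to fill.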
The mmx CLI exposes MiniMax Music 2.6 parameters as separate flags. This gives finer control than packing everything into a single prompt string.
The most useful flags:
| Flag | Effect | Example |
|---|---|---|
--avoid | Elements to avoid (comma-separated) | --avoid "sparse, a cappella, electronic sounds" |
--bpm | Exact BPM | --bpm 80 |
--key | Musical key | --key "E minor" |
| --structure | Song structure | --structure "intro-verse-pre_chorus-chorus-verse-chorus-bridge-chorus-outro" |
--vocals | Vocal style | --vocals "passionate French male vocal" |
--instruments | Featured instruments | --instruments "accordion, upright bass, strings, piano" |
--genre | Genre | --genre "french chanson" |
--mood | Mood | --mood "melancholic romantic dramatic" |
--lyrics-optimizer | Auto-generate lyrics from prompt | (flag only) |
--model | Model name | --model music-2.6 (paid, highest RPM) or --model music-2.6-free (default, free tier) |
--cover-feature-id | Use a preprocessed cover (two-step workflow) | (from preprocess call) |
Full reference with all flags and examples: references/mmx-flags-reference.md.
When to use mmx vs music_generate:
- `mmx`: when you need fine control over specific parameters (BPM, key, structure as separate flags)
- `music_generate`: when the prompt-only path is enough, and you want to keep the workflow provider-agnostic

Both produce equivalent results if the prompt and flags are aligned.
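When flags are assembled programmatically, building an argument list avoids shell-quoting mistakes in long prompts. This is a minimal sketch: `build_mmx_command` is a hypothetical helper, and it only maps keyword names onto flags that appear in the table above.

```python
import shlex


def build_mmx_command(prompt: str, out: str, **flags) -> list:
    """Build an argv list for `mmx music generate` from keyword flags."""
    argv = ["mmx", "music", "generate", "--prompt", prompt]
    for name, value in flags.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)           # bare switch such as --lyrics-optimizer
        elif value not in (None, False):
            argv += [flag, str(value)]  # valued flag such as --bpm 80
    argv += ["--out", out]
    return argv


cmd = build_mmx_command(
    "french chanson", "song.mp3",
    bpm=80, key="E minor", avoid="sparse, a cappella", lyrics_optimizer=True,
)
print(shlex.join(cmd))  # shell-safe rendering for logging or copy-paste
```

The list form can be passed directly to `subprocess.run(cmd)`, so values containing spaces or commas never need manual escaping.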
End-to-end verified invocations from this session (M5_idkw_dreampop_shoegaze + M5_idkw_opera_metal in ~/Music mix/hello_cleveland/i_dont_know_why/):
```bash
mmx music generate \
  --prompt "dream pop reimagining, shoegaze-influenced indie rock turned ethereal and cinematic.
My Bloody Valentine meets Slowdive meets Radiohead.
Male lead vocal, breathy and vulnerable, double-tracked with slight detuning and tape warmth.
Wall of clean electric guitars with heavy chorus pedal and tremolo picking.
Shimmering washes of reverb, sub-bass synth pad foundation, soft brushed electronic drums.
Glockenspiel and celesta melody line high above the mix.
Organ pads swelling at choruses, reversed guitar samples between sections.
Heavy reverb and analog warmth throughout, lo-fi texture.
Emotional arc: hazy drifting opening building wave confusion overwhelming beautiful climax fading dreamlike denouement outro.
Avoid: sharp percussive agresivo distortion clear upfront vocals minimal sparse.
Tempo 96 BPM in D major, dreamlike half-time feel.
Suitable as a slow-burn alt-pop anthem, melodic and textural, intimate verses and soaring choruses.
Modern production, polished mix, atmospheric vocal production where vocals sit among the instruments rather than above them." \
  --lyrics-file gen1_lyrics.txt \
  --model music-2.6 \
  --vocals "breathy vulnerable male lead, double-tracked with slight detuning" \
  --genre "dream pop, shoegaze-influenced indie" \
  --mood "hazy confusion building to overwhelming beautiful release, then dreamlike fade" \
  --instruments "wall of clean electric guitars with heavy chorus pedal, sub-bass synth pad, soft brushed electronic drums, glockenspiel, celesta, organ pads, reversed guitar samples" \
  --bpm 96 \
  --key "D major" \
  --structure "intro-verse-pre_chorus-chorus-post_chorus-verse2-chorus-repeat-outro" \
  --use-case "slow-burn alt-pop anthem, suitable for late-night listening" \
  --avoid "sharp percussive agresivo distortion, clear upfront vocals, minimal sparse arrangement" \
  --references "My Bloody Valentine, Slowdive, Radiohead" \
  --out M5_idkw_dreampop_shoegaze.mp3
```
Output: 167.9s MP3, 5.4 MB, -8.8 LUFS, 5.7 LRA (good dynamics).
```bash
mmx music generate \
  --prompt "extreme dramatic contrast: powerful operatic tenor vocals over heavy metal instrumentation.
Like Freddie Mercury fronting Metallica. Epic, theatrical, over the top.
Thunderous double bass drums, distorted electric guitars with palm-muted chugging,
guttural rhythm section, blast beats, tremolo picking, minor key riffing.
Operatic vocals soaring above the metal wall of sound, belting high notes with vibrato.
Gothic theatrical atmosphere, dramatic dynamic shifts from whisper-quiet verses
to explosive metal choruses. Anthem-like, stadium-ready." \
  --lyrics-file gen2_lyrics.txt \
  --model music-2.6 \
  --vocals "operatic tenor, powerful Freddie Mercury style, vibrato, theatrical belting" \
  --genre "symphonic metal" \
  --mood "dramatic, theatrical, anthemic, intense" \
  --instruments "distorted electric guitars, double bass drums, blast beats, orchestral strings" \
  --tempo "fast" \
  --bpm 160 \
  --key "D minor" \
  --structure "verse-pre_chorus-chorus-verse-pre_chorus-chorus-outro" \
  --use-case "epic music experiment" \
  --avoid "pop, soft, gentle, acoustic, slow" \
  --out M5_idkw_opera_metal.mp3
```
Output: 155.8s MP3, 5.0 MB, -9.6 LUFS, 4.3 LRA (compressed but still has dynamics).
| Model | When to use | Cost |
|---|---|---|
| `music-2.6` | Production work, full quality | Token Plan / paid |
| `music-2.6-free` (default) | Free tier, lower RPM, "unlimited" quota for some plans | Free |
| `music-2.5+` | Older model, still good quality | Token Plan / paid |
| `music-2.5` | Legacy | Token Plan / paid |
| `music-cover` | Cover/re-interpretation of source audio (one-step) | Token Plan / paid |
| `music-cover-free` | Free cover variant | Free |
music-2.6-free is the default for most users — same model, free tier. The mmx CLI uses it as the default when no --model is specified.
`is_instrumental` and `lyrics_optimizer` flags (MiniMax-specific paths)
The `mmx` CLI exposes two important flags that bypass the `--lyrics` requirement:
| Flag | What it does | When to use |
|---|---|---|
--instrumental | Generate music without vocals (no lyrics needed) | When user wants BGM, intro, soundtrack, loop |
--lyrics-optimizer | Auto-generate lyrics from the prompt (no --lyrics needed) | When user says "make me a song about X" but doesn't have lyrics |
Examples:
```bash
# Pure instrumental (no vocals)
mmx music generate \
  --prompt "Instrumental only, no vocals, no lyrics. Loopable coffee shop background, soft piano, brushed drums, 90 BPM, C major" \
  --instrumental \
  --length 180000 \
  --out coffee_bgm.mp3

# Auto-generated lyrics from prompt
mmx music generate \
  --prompt "Upbeat indie folk, melancholic but hopeful, male vocal, acoustic guitar, 100 BPM" \
  --lyrics-optimizer \
  --out indie_folk.mp3
```
Note: `--length` on `mmx music generate` takes milliseconds (`--length 180000` is 3 minutes). This flag is mmx-specific; the underlying MiniMax API has no official duration parameter.
mmx music generate returns a saved: path. If you ever use --output-format url (the official API default), the URL expires after 24 hours. Download immediately. The mmx CLI auto-downloads to --out so this is not a problem when using --out directly.
Two cover backends exist — pick by what's available:
- MiniMax cloud cover (this skill): `mmx music cover`, melody-preserving via MiniMax's `music-cover` model. Needs `MINIMAX_API_KEY` and network access to MiniMax.
- Local ACE-Step cover (in `music-craft`): `task_type=cover` with the source audio uploaded (multipart) and `audio_cover_strength` controlling how far to restyle. Fully local, no cloud, follows the source melody/structure. Caveat: a full-length cover is slow and VRAM-heavy on a ~12 GB GPU and can hit the server's 600 s generation timeout; cover a shorter segment or raise `ACESTEP_GENERATION_TIMEOUT`. See music-craft's "ACE-Step Audio-Conditioned Generation" section.

So if MiniMax is unavailable (no key, or blocked on your network), you can still do a melody-aware cover locally with ACE-Step; it is not cloud-only. Only pure text-prompt generation (no source audio) is a "reimagining" rather than a cover.
Cover workflow preserves the original song's melody while applying a different style. Two paths:
One-step (quick):
```bash
mmx music cover \
  --prompt "French chanson, accordion, strings, passionate French vocal, 80 BPM" \
  --audio-file /tmp/original.ogg \
  --out /tmp/cover.mp3
```
MiniMax extracts lyrics via ASR and applies the new style.
Two-step (more control):
The two-step path gives better results when the original lyrics need correction or when the user wants different lyrics in the new style.
Full detail with payload examples, error handling, and use cases: references/cover-workflow.md.
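When scripting the one-step path, it can help to assemble the command as an argument list rather than a shell string, so prompts with commas and quotes need no escaping. The sketch below only builds the arguments and does not run anything. It uses the `--prompt`, `--audio-file`, and `--out` flags from the example above; `build_cover_command` is a hypothetical helper name.

```python
import shlex
from typing import List


def build_cover_command(prompt: str, audio_file: str, out: str) -> List[str]:
    """Assemble the argv for a one-step `mmx music cover` call (not executed here)."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    return [
        "mmx", "music", "cover",
        "--prompt", prompt,
        "--audio-file", audio_file,
        "--out", out,
    ]


argv = build_cover_command(
    "French chanson, accordion, strings, passionate French vocal, 80 BPM",
    "/tmp/original.ogg",
    "/tmp/cover.mp3",
)
# Print a copy-pasteable shell form for logging or review.
print(shlex.join(argv))
```

The resulting list can be handed to `subprocess.run(argv, check=True)` once the prompt and flags have been linted.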
MiniMax has a dedicated lyrics_generation endpoint that produces structured lyrics (with [Verse], [Chorus], etc. tags) from a theme prompt. Two modes:
- `write_full_song`: create new lyrics from a theme
- `edit`: modify existing lyrics (e.g., make the chorus stronger, shift to a hopeful ending)

The output is structured lyrics that can be passed directly to `music_generate` or `mmx music generate`.
Full detail with API examples, parameters, and use cases: references/lyrics-generation.md.
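Before passing structured lyrics on to generation, it can be useful to check which sections are present. The following is a minimal sketch that assumes each tag such as [Verse] or [Chorus] sits on its own line; the exact output format of the endpoint is not confirmed here, and `split_sections` is a hypothetical helper.

```python
import re
from typing import List, Optional, Tuple

# Assumption: a section tag occupies a whole line, e.g. "[Verse]" or "[Chorus]".
TAG_RE = re.compile(r"^\[(?P<tag>[^\]]+)\]\s*$")


def split_sections(lyrics: str) -> List[Tuple[str, List[str]]]:
    """Split tagged lyrics into (tag, lines) pairs, preserving order."""
    sections: List[Tuple[str, List[str]]] = []
    current_tag: Optional[str] = None
    current_lines: List[str] = []
    for raw in lyrics.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines only separate sections visually
        match = TAG_RE.match(line)
        if match:
            # Close the previous section (or any untagged lead-in lines).
            if current_tag is not None or current_lines:
                sections.append((current_tag or "untagged", current_lines))
            current_tag, current_lines = match.group("tag"), []
        else:
            current_lines.append(line)
    if current_tag is not None or current_lines:
        sections.append((current_tag or "untagged", current_lines))
    return sections


sample = "[Verse]\nline one\nline two\n\n[Chorus]\nhook line"
print([tag for tag, _ in split_sections(sample)])
```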
As an optional complement to Whisper transcription, the orchestrator can look up song lyrics from LRCLib (open, no auth, JSON API at https://lrclib.net/api) when the song is a known mainstream track. This is a graceful fallback — Whisper is the primary source, LRCLib is a quality boost for the right song.
Coverage reality check: LRCLib has good coverage for mainstream vocal music (pop, rock, hip-hop, R&B, country) and is poor or empty for:
When LRCLib is empty (the expected case for instrumentals), the script returns no_web_lyrics and the caller silently uses Whisper. This is the designed path, not a failure.
CLI usage:
# Standalone lookup
python3 scripts/fetch_lyrics_web.py \
--artist "Coldplay" --title "Yellow" \
--whisper-transcript "look at the stars..." \
--min-match 0.6 --json
Orchestrator integration via the --lyrics-source flag:
| Value | Behavior |
|---|---|
| `whisper` (default) | Always use Whisper, never touch the web |
| `web` | Always try LRCLib, never run Whisper |
| `auto` | Whisper first; if the song is recognized AND LRCLib returns a confident match (>60% word overlap), use LRCLib; otherwise fall back to Whisper |
| `off` | Skip lyrics extraction entirely |
The orchestrator auto-detects artist and title from the audio path stem (e.g. Coldplay - Yellow.wav → artist="Coldplay", title="Yellow"). Pass --name-a "Artist - Title" to override.
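The stem convention can be illustrated with a short sketch. It assumes the separator is the first " - " in the file stem; the orchestrator's actual parsing may differ, and `artist_title_from_path` is a hypothetical name.

```python
from pathlib import Path
from typing import Optional, Tuple


def artist_title_from_path(path: str) -> Tuple[Optional[str], Optional[str]]:
    """Guess (artist, title) from a file stem shaped like 'Artist - Title'."""
    stem = Path(path).stem  # drops the directory and the final extension
    artist, sep, title = stem.partition(" - ")
    if not sep or not artist.strip() or not title.strip():
        return None, None  # stem does not follow the convention
    return artist.strip(), title.strip()


print(artist_title_from_path("Coldplay - Yellow.wav"))
```

When the guess comes back empty, passing --name-a explicitly is the safer route.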
The result includes a web_lookup sub-dict with status, match_score, and the plain lyrics (when matched), so you can inspect what was used and why.
Full detail with scoring heuristic and exit codes: see scripts/fetch_lyrics_web.py docstring.
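The `auto` decision can be sketched roughly as follows. This is not the scoring heuristic from the script; it assumes a simple distinct-word overlap, and the `matched` and `low_match` status strings are hypothetical (only `no_web_lyrics` comes from the description above).

```python
import re
from typing import Dict, Optional, Set


def _words(text: str) -> Set[str]:
    """Lowercased distinct word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def word_overlap(whisper_text: str, web_text: str) -> float:
    """Share of distinct Whisper words that also appear in the web lyrics."""
    whisper_words = _words(whisper_text)
    if not whisper_words:
        return 0.0
    return len(whisper_words & _words(web_text)) / len(whisper_words)


def choose_lyrics(whisper_text: str, web_text: Optional[str],
                  min_match: float = 0.6) -> Dict[str, object]:
    """Prefer web lyrics only when they agree well enough with Whisper."""
    if not web_text:
        return {"source": "whisper", "status": "no_web_lyrics",
                "match_score": None, "lyrics": whisper_text}
    score = word_overlap(whisper_text, web_text)
    if score > min_match:
        return {"source": "web", "status": "matched",
                "match_score": score, "lyrics": web_text}
    return {"source": "whisper", "status": "low_match",
            "match_score": score, "lyrics": whisper_text}


print(choose_lyrics("look at the stars", None)["status"])
```

Keeping the Whisper transcript as the fallback in every branch means a missing or mismatched web result never leaves the caller without lyrics.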
The signature MiniMax-specific feature: combine Song A (content + emotion) with Song B (style).
Workflow:
This is the most powerful feature in this skill. The output preserves what makes Song A recognizable (lyrics, melody, emotion) while applying Song B's production style.
Full detail with the emotion-to-prompt conversion and the two-song analysis script: references/mashup-workflow.md and references/emotion-analysis.md.
Emotion analysis extracts per-section features from input audio:
The analysis outputs JSON that the emotion_to_prompt.py script converts into a ready-to-use production-sheet prompt.
Local-only path (when MiniMax is unavailable):
`emotion_to_prompt.py` calls the MiniMax cloud, so it fails when MiniMax is blocked or no key is set. In that case build the prompt locally from the analysis JSON without that script: take the extracted BPM and key/scale as explicit metadata fields; turn the energy curve and spectral brightness into texture words; turn the emotion classification and intensity curve into mood words and dynamic section tags; and feed transcribed lyrics (full-mix Whisper) as the lyric body. This is the same data, assembled by the agent instead of the cloud helper, and it feeds any backend (including a local model).
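A rough sketch of that local assembly is shown below. The input keys (`bpm`, `key`, `mean_energy`, `brightness`, `emotion`, `lyrics`) and the thresholds are illustrative assumptions, not the actual analysis JSON schema; map them to whatever the analysis scripts emit.

```python
from typing import Any, Dict, List


def texture_words(mean_energy: float, brightness: float) -> str:
    """Map normalized (0 to 1) energy and brightness to rough texture words."""
    if mean_energy >= 0.66:
        energy_word = "driving"
    elif mean_energy >= 0.33:
        energy_word = "steady"
    else:
        energy_word = "gentle"
    tone_word = "bright" if brightness >= 0.5 else "warm"
    return f"{energy_word}, {tone_word}"


def build_local_prompt(analysis: Dict[str, Any]) -> Dict[str, str]:
    """Assemble prompt text and lyric body from whichever fields are present."""
    parts: List[str] = []
    if analysis.get("emotion"):
        parts.append(f"Mood: {analysis['emotion']}")
    energy = analysis.get("mean_energy")
    brightness = analysis.get("brightness")
    if energy is not None and brightness is not None:
        parts.append(f"Texture: {texture_words(energy, brightness)}")
    if analysis.get("bpm"):
        parts.append(f"Tempo: {analysis['bpm']} BPM")
    if analysis.get("key"):
        parts.append(f"Key: {analysis['key']}")
    # Missing fields are skipped rather than filled with guesses.
    return {"prompt": ". ".join(parts), "lyrics": analysis.get("lyrics", "")}


print(build_local_prompt({"bpm": 100, "key": "E minor", "emotion": "melancholic"}))
```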
Scripts: scripts/analyze_vocal_emotion.py, scripts/analyze_audio.py, scripts/emotion_to_prompt.py.
Full detail: references/emotion-analysis.md.
For the generation side — how to use the analysis to evoke emotion in the OUTPUT, the 21 emotion recipes (joy, desperation, melancholy, triumph, yearning, anger, vulnerability, confidence, nostalgia, anxious, hopeful, tragic, heroic, tender, sensual, lonely, playful, haunting, serene, celebratory, bittersweet), the iteration loop, and common mistakes — see references/emotion-delivery.md.
Analysis scripts in scripts/ produce different views (emotion, beats, melody, structure, instrumentation). The skill expects them to converge on a single compact summary so downstream code and humans can read the same shape regardless of which scripts ran.
Every analysis result should include a summary object with these keys:
| Key | Type | Meaning |
|---|---|---|
| `tempo` | string | BPM value with confidence, e.g. `120 BPM (confidence 0.92)` |
| `key` | string | Detected key, e.g. `E minor (confidence 0.71)` |
| `sections` | list | Section labels with timing, e.g. `[{"label": "verse", "start": 0.0, "end": 28.5}, ...]` |
| `instrumentation` | list | Detected instrument palette, e.g. `["electric guitar", "drums", "bass"]` |
| `vocal_traits` | dict | Breathiness, intensity, pitch range, e.g. `{"breathiness": "high", "intensity": "medium"}` |
| `energy_curve` | list | Per-section energy values, e.g. `[{"t": 0, "energy": 0.6}, ...]` |
| `hook_points` | list | Timestamps of detected hooks, e.g. `[12.4, 48.0]` |
| `mix_notes` | list | Short strings, e.g. `["vocal upfront", "wide stereo drums", "rolled-off highs"]` |
Scripts may add their own fields, but every script must return at least the keys above (use empty list / unknown string when a key has no data).
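One way a script might guarantee the minimum key set is to merge its partial result over a table of defaults. The sketch below follows the keys in the table above; using an empty dict for `vocal_traits` is an assumption (the text only names empty list and unknown string), and `normalize_summary` is a hypothetical helper.

```python
import copy
from typing import Any, Dict

SUMMARY_DEFAULTS: Dict[str, Any] = {
    "tempo": "unknown",
    "key": "unknown",
    "sections": [],
    "instrumentation": [],
    "vocal_traits": {},  # assumption: empty dict for the one dict-typed key
    "energy_curve": [],
    "hook_points": [],
    "mix_notes": [],
}


def normalize_summary(partial: Dict[str, Any]) -> Dict[str, Any]:
    """Return a summary with every required key; extra script-specific keys are kept."""
    summary = copy.deepcopy(SUMMARY_DEFAULTS)  # avoid sharing mutable defaults
    for key, value in partial.items():
        if value is not None:
            summary[key] = value
    return summary


print(sorted(normalize_summary({"tempo": "120 BPM (confidence 0.92)"})))
```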
Every numeric or categorical detection in the analysis must carry a confidence value so weak detections do not get treated as facts.
| Confidence | Numeric range | Interpretation |
|---|---|---|
| `clear` | n/a | The detection is unambiguous (e.g. user-supplied text, MIDI-confirmed key). |
| `high` | >= 0.75 | Strong evidence from multiple sources or models. |
| `medium` | 0.5 - 0.74 | Reasonable evidence but alternative interpretations exist. |
| `low` | < 0.5 | Weak signal; treat as a hint, not a fact. |
| `inferred` | n/a | Not measured directly; derived from context (e.g. lyrics from a YouTube URL). |
| `missing` | n/a | Not available; the analysis did not run or did not find evidence. |
When feeding analysis into a prompt, prefix any low or medium detection with a hedge like "around" or "approximately", and never include missing values as if they were facts.
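The hedging rule can be sketched as a small formatter. Values between 0.74 and 0.75 are treated as medium here, which is an assumption to close the gap in the table; both function names are hypothetical.

```python
from typing import Optional, Union


def confidence_label(conf: Union[str, float]) -> str:
    """Map a numeric confidence to its label; pass label strings through unchanged."""
    if isinstance(conf, str):
        return conf
    if conf >= 0.75:
        return "high"
    if conf >= 0.5:
        return "medium"
    return "low"


def hedged(value: str, conf: Union[str, float]) -> Optional[str]:
    """Return prompt-ready text, or None when the value must be left out."""
    label = confidence_label(conf)
    if label == "missing":
        return None  # never state a missing value as fact
    if label in ("low", "medium"):
        return f"around {value}"
    return value


fragments = [hedged("120 BPM", 0.92), hedged("E minor", 0.71), hedged("waltz feel", "missing")]
print(", ".join(f for f in fragments if f is not None))
```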
The advanced analysis scripts depend on optional packages (librosa, parselmouth, transformers, demucs, beat_this, basic_pitch, etc.). Each script must:
- On `ImportError`, return a JSON object that includes `{"error": "install with pip install X", "summary": {}}` instead of raising.

The orchestrator at scripts/analysis_orchestrator.py collects per-script results and continues even if some scripts failed. The combined summary simply omits keys whose underlying analysis could not run. The linter, prompt builder, and generation step all read the summary and skip missing keys without erroring.
This means a user without demucs installed can still get tempo, key, and structure analysis from the base pipeline. The only loss is the per-stem vocal analysis, which is opt-in via --use-demucs.
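The degrade-gracefully pattern might look like the sketch below. It is illustrative only: `run_optional` and `combine_summaries` are hypothetical names, and the pip package name is assumed to equal the import name, which is not true for every package.

```python
import importlib
from typing import Any, Callable, Dict, Iterable


def run_optional(package: str, analyze: Callable[[Any], Dict[str, Any]]) -> Dict[str, Any]:
    """Run `analyze(module)` if the package imports; otherwise return an error object."""
    try:
        module = importlib.import_module(package)
    except ImportError:
        return {"error": f"install with pip install {package}", "summary": {}}
    return analyze(module)


def combine_summaries(results: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge per-script summaries, skipping scripts that reported an error."""
    combined: Dict[str, Any] = {}
    for result in results:
        if "error" in result:
            continue  # keys from failed analyses are simply omitted
        combined.update(result.get("summary", {}))
    return combined


results = [
    run_optional("json", lambda m: {"summary": {"tempo": "120 BPM (confidence 0.92)"}}),
    run_optional("surely_not_installed_pkg", lambda m: {"summary": {"key": "E minor"}}),
]
print(combine_summaries(results))
```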
The MiniMax Music 2.6 documented limits are:
Under the Token Plan 3.0 (June 2026+), the actual quota is credit-based rather than RPM-based:
- The general credit pool covers M3, M2.7, and M2.7-highspeed

Practical implication: the documented 120 RPM is the API limit, but the Token Plan 3.0 quota is what determines your real ceiling. If you generate 4500 requests in 5 hours on Plus, you will be rate-limited regardless of RPM.
Before submitting a batch, check the active plan:
# Check current Token Plan usage
curl -s -H "Authorization: Bearer $MINIMAX_API_KEY" \
https://www.minimax.io/v1/token_plan/remains | jq .
If a call fails with 429 (rate limit):
The base skill's anti-sparse rules apply. The MiniMax-specific failure mode is more severe than other providers:
MiniMax interprets "sparse" or "minimal" as "remove all instruments", even more aggressively than other providers. The model has been observed to:
Mitigation:
- "ALL instruments ALWAYS playing throughout, NEVER go a cappella or silent at any point".
- "quiet sections: reduced to accordion and bass only, still fully played, NOT silent".

If a generation comes back sparse despite these rules, retry once with an even more explicit instrument list. If it fails again, the prompt has a structural issue; try a different style.
For the canonical anti-sparse text and worked examples, see the base skill's Anti-Sparse Rules section.
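For scripted retries, the guard text can be appended programmatically so it is never duplicated. The sketch below reuses the first mitigation phrase quoted above; `with_anti_sparse` is a hypothetical helper and the "Instruments:" label is an assumption about prompt layout.

```python
from typing import Optional, Sequence

ANTI_SPARSE_GUARD = (
    "ALL instruments ALWAYS playing throughout, "
    "NEVER go a cappella or silent at any point"
)


def with_anti_sparse(prompt: str, instruments: Optional[Sequence[str]] = None) -> str:
    """Append an explicit instrument list and the guard phrase (once) to a prompt."""
    parts = [prompt.strip().rstrip(".")]
    if instruments:
        parts.append("Instruments: " + ", ".join(instruments))
    if ANTI_SPARSE_GUARD.lower() not in prompt.lower():
        parts.append(ANTI_SPARSE_GUARD)
    return ". ".join(parts) + "."


first_try = with_anti_sparse("French chanson, 80 BPM")
# Single retry with a more explicit instrument list.
retry = with_anti_sparse(first_try, ["accordion", "upright bass", "strings"])
print(retry)
```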
Same 8-point checklist as the base skill, plus 4 MiniMax-specific items:
- `--avoid` flags are respected. If the user said "no electronic sounds", the output should not have synths.

After generation, run a post-generation check that is specific to the route. Use the analysis orchestrator's output on the generated file when possible.
Cover (minimax_cover)
- `--avoid` flags respected

Mashup (minimax_mashup)
- `--avoid` flags respected for Song B's style

Style Transfer (minimax_style_transfer)
Emotion Prompt / Precision (minimax_emotion_prompt)
When the generated track does not match the request, identify the failure signature and apply the matching fix. The most common signatures:
| Failure signature | Likely cause | Fix |
|---|---|---|
| Copied too closely (cover sounds like a remaster, not a new style) | Prompt did not specify the new style firmly enough, or --avoid list left the original instrumentation unguarded. | Add explicit target style language, list new instruments, expand --avoid with the source's dominant sounds. Re-run. |
| Lost source melody (cover no longer recognisable) | --prompt overrode the cover model, or the source audio was too noisy / clipped. | Switch to the two-step cover workflow (preprocess + generate with cover-feature-id); reduce style strength in the prompt. |
| Wrong tempo (BPM noticeably off) | Prompt and --bpm disagreed, or vocal delivery speed misled the detector. | Lint prompt + flags first. Re-run with the linter-clean pair. If still off, set --bpm explicitly and drop the BPM number from the prompt. |
| Wrong key (key shifted up/down) | Prompt mentioned a key but flags used another. | Lint the pair. Use the same key in both. If MIDI confirms a different source key, trust MIDI over prompt. |
| Muddy mix (low clarity, washed out) | Overly dense instrumentation, lack of anti-sparse guard, or too many --avoid exclusions. | Reduce instrument count, raise --bpm for tightness, add explicit "all instruments clearly audible". |
| Vocals too neutral (no emotion) | Emotion analysis not run, or intensity curve not transferred. | Run analyze_vocal_emotion.py on the source and feed intensity_curve into the prompt. Add explicit "vocal intensity: ..." clause. |
| Weak chorus (chorus does not lift) | Structure line lacks a build cue, or the prompt was a single energy. | Add structure with explicit build cues: "verse: intimate, chorus: soaring, all instruments louder in chorus". |
| Style mismatch (output does not match the requested genre) | Prompt used vague genre words or the wrong dominant instrument. | Replace vague words with concrete genre + instrument list. Use the canonical mmx prompt schema in references/examples.md. |
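The "lint prompt + flags" step for tempo can be illustrated with a small check. This is a sketch, not the behavior of scripts/lint_music_request.py; `lint_bpm` and its hint wording are hypothetical.

```python
import re
from typing import List, Optional

BPM_RE = re.compile(r"\b(\d{2,3})\s*BPM\b", re.IGNORECASE)


def lint_bpm(prompt: str, bpm_flag: Optional[int]) -> List[str]:
    """Return one hint per tempo conflict between the prompt text and --bpm."""
    hints: List[str] = []
    prompt_bpms = sorted({int(value) for value in BPM_RE.findall(prompt)})
    if len(prompt_bpms) > 1:
        hints.append(f"prompt mentions several tempos: {prompt_bpms}")
    if bpm_flag is not None and prompt_bpms and bpm_flag not in prompt_bpms:
        hints.append(
            f"--bpm {bpm_flag} disagrees with prompt tempo {prompt_bpms}; "
            "keep one and drop the other"
        )
    return hints


print(lint_bpm("Upbeat indie folk, acoustic guitar, 100 BPM", 120))
```

An empty list means the pair is consistent for tempo; the same shape extends to key or language checks.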
When a generation comes back with one of the failure signatures above, build a revision prompt that preserves the source identity while changing the failing dimension.
Template: stronger style change (cover too close)
Same melody and lyrics as before. Re-imagine the production as [TARGET_STYLE] with [INSTRUMENT_LIST].
ALL instruments always playing throughout, never go a cappella.
Avoid: [STYLE_CONTRADICTING_WORDS from previous run].
Template: keep the melody (cover lost it)
Re-apply the original melody from the source audio. Keep the recognizable hook at [HOOK_TIME].
Use a softer production in [TARGET_STYLE] but DO NOT change the melodic contour.
Avoid: [WORDS_THAT_PUSHED_TOO_FAR].
Template: fix tempo drift
Keep the source BPM (use --bpm [SOURCE_BPM]). Do not slow down or speed up the vocal delivery.
Avoid: rubato, half-time, double-time, slowing down, speeding up.
Template: fix key shift
Stay in [SOURCE_KEY]. Do not transpose. Use the same chord progression as the source.
Avoid: key change, modulation, transpose.
Template: fix muddy mix
Make every instrument clearly audible. Reduce instrument count to [N].
Add contrast: quieter verses, louder choruses. Keep vocals upfront in the mix.
Avoid: dense layering, atmospheric washes, sustained pads throughout.
Template: lift the chorus
Chorus: soaring, all instruments louder than the verse, fuller chords, more reverb on the lead vocal.
Verse: intimate, single voice, soft drums, breathy delivery.
Bridge: build tension, add a melodic lift before the final chorus.
These templates pair with the failure-signature table. After the revision, re-run the verification checklist above.
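The pairing between signatures and templates can be kept in a lookup table so a revision prompt is built the same way each time. The sketch below covers two of the templates above; the signature keys (`wrong_tempo`, `wrong_key`) and placeholder names are hypothetical.

```python
from typing import Dict

REVISION_TEMPLATES: Dict[str, str] = {
    "wrong_tempo": (
        "Keep the source BPM (use --bpm {source_bpm}). "
        "Do not slow down or speed up the vocal delivery.\n"
        "Avoid: rubato, half-time, double-time, slowing down, speeding up."
    ),
    "wrong_key": (
        "Stay in {source_key}. Do not transpose. "
        "Use the same chord progression as the source.\n"
        "Avoid: key change, modulation, transpose."
    ),
}


def revision_prompt(signature: str, **fields: object) -> str:
    """Fill the template for a failure signature with the given placeholder values."""
    if signature not in REVISION_TEMPLATES:
        raise ValueError(f"no template for signature: {signature}")
    try:
        return REVISION_TEMPLATES[signature].format(**fields)
    except KeyError as missing:
        raise ValueError(f"missing placeholder value: {missing}") from None


print(revision_prompt("wrong_tempo", source_bpm=96))
```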
Same as the base skill — when music_generate is called without explicit lyrics, MiniMax auto-generates. With this skill, you can also call the lyrics_generation API directly to preview the lyrics before generation, or to iterate via the edit mode.
If the user wants specific words, the lyrics_generation API's edit mode lets you modify auto-generated lyrics to match the user's intent without regenerating the whole song.
- references/mmx-flags-reference.md: full mmx CLI flag reference with worked examples
- references/examples.md: practical MiniMax examples with routing, first questions, workflow shapes, and prompt/flag lint catches
- references/cover-workflow.md: one-step and two-step cover workflow with payloads, error handling, use cases
- references/lyrics-generation.md: the lyrics_generation API endpoint, both modes, examples
- references/mashup-workflow.md: two-song mashup workflow, emotion-to-prompt conversion, decision tree
- references/emotion-analysis.md: 25+ emotion classifications, per-emotion detection cookbook, emotion combinations, and the analysis pipeline
- references/emotion-delivery.md: 21 emotion recipes for the OUTPUT, iteration loop, and common mistakes
- references/advanced-audio-analysis.md: advanced free tools (Essentia, Demucs, Basic Pitch, Music21, CREPE) for deeper analysis when basic librosa/parselmouth is not enough
- references/error-handling.md: MiniMax-specific error table, recovery patterns, anti-sparse failure recovery
- scripts/check_environment.py: lightweight preflight diagnostic for Python, env vars, CLI tools, and optional packages
- scripts/lint_music_request.py: standard-library helper for routing, blocker, missing-field, prompt, and mmx flag conflict checks
- scripts/smoke_test.py: standard-library smoke tests for pure helper behavior
- scripts/: Python helpers for audio analysis (download, segment, analyze, convert emotion to prompt)
- music-craft: base skill with shared concepts (Pre-Flight, anti-sparse, prompt formula, structure tags, Request Intake, User Preference Flow)
- music-craft → references/free-tool-inputs.md: base layer for free tool inputs (web_fetch, web_search, image, memory)
- references/free-tool-inputs.md: MiniMax layer (free-tool routing, blocker checks, and prompt/flag conflict lint before analysis)