Music Craft — MiniMax

API key required
Data & APIs

Advanced music generation for OpenClaw, using the MiniMax Music 2.6 token plan. Use it for covers and style transfer, two-song mashups, the lyrics generation API, emotion-driven prompt engineering, and fine control via the `mmx` CLI. Extends `music-craft` with MiniMax-specific features.

Install

openclaw skills install music-craft-minimax

This is the power-user upgrade of music-craft. It does everything that skill does, plus the features that require the MiniMax Music 2.6 token plan:

  • Cover and style transfer from a reference audio file or YouTube URL (a cover preserves the source melody; style transfer does not)
  • Two-song mashup (Song A's content and emotion + Song B's style)
  • Lyrics generation via the MiniMax API endpoint (with edit mode for iteration)
  • Emotion analysis on input audio to drive prompt construction (vocal speed, intensity curve, pitch bends)
  • Fine control over generation parameters (BPM, key, structure, avoid list as separate flags via mmx)

For everything else (standard song generation, instrumentation, anti-sparse prompt engineering, structure tags, user preference flow), this skill uses the same workflow as music-craft. Read that skill first to understand the base, then come back here for the MiniMax-specific extensions.

Routing and Blocker Checks

Classify the request before analysis or generation:

  • Text-only style reference means the user gave a song name, artist, era, or genre cue without source audio. Treat it as style inference, not cover analysis.
  • Reference audio or YouTube means the user provided a file or playable source that should be analyzed.
  • Cover preserves melody and usually needs a source file plus a target style decision.
  • Style transfer uses a reference track or analyzed audio as style input, then changes the production direction.
  • Mashup needs Song A and Song B, plus a decision about which one contributes content and which one contributes style.
  • Emotion prompt means the user wants analysis turned into descriptive prompt language, not a full cover.

The scripts/lint_music_request.py helper emits one of these routes:

| Route | When |
| --- | --- |
| base_prompt | Standard generation, no MiniMax-specific feature needed. |
| minimax_cover | Melody-preserving cover from audio or YouTube. |
| minimax_mashup | Two-song mashup (A + B, both identified). |
| minimax_style_transfer | Style transfer that does not preserve the source melody. |
| minimax_emotion_prompt | Emotion analysis, or precision mmx flag usage. |
| needs_clarification | At least one blocker is unresolved; ask the user first. |

Surface blockers before analysis:

  • no source file or usable URL
  • unclear which track is Song A versus Song B
  • missing target style
  • missing lyrics decision, such as original, translated, rewritten, or instrumental
  • conflicting cover/style-transfer intent: the user asked for both "cover" (preserve melody) and "style transfer" (reproduce style) at once. These are mutually exclusive. Ask the user to pick one.
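
The classification and blocker rules above can be sketched as one small function. This is only an illustration, not the actual scripts/lint_music_request.py: the `Request` fields and the `classify_route` name are hypothetical, and the choice of which blockers apply to which route is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Request:
    """Hypothetical intake record; field names are illustrative only."""
    wants_cover: bool = False            # user asked to preserve the melody
    wants_style_transfer: bool = False   # user asked to reproduce a style
    wants_mashup: bool = False
    wants_emotion_prompt: bool = False
    sources: List[str] = field(default_factory=list)  # files or usable URLs
    target_style: Optional[str] = None
    lyrics_decision: Optional[str] = None  # original/translated/rewritten/instrumental
    song_roles_clear: bool = True          # is it known which track is A vs B?


def classify_route(req: Request) -> Tuple[str, List[str]]:
    """Return (route, blockers); any blocker forces needs_clarification."""
    blockers: List[str] = []
    needs_source = req.wants_cover or req.wants_style_transfer or req.wants_mashup
    needs_style = req.wants_cover or req.wants_style_transfer  # assumption

    if req.wants_cover and req.wants_style_transfer:
        blockers.append("conflicting cover/style-transfer intent")
    if needs_source and not req.sources:
        blockers.append("no source file or usable URL")
    if req.wants_mashup and not req.song_roles_clear:
        blockers.append("unclear which track is Song A versus Song B")
    if needs_style and not req.target_style:
        blockers.append("missing target style")
    if needs_source and not req.lyrics_decision:
        blockers.append("missing lyrics decision")

    if blockers:
        return "needs_clarification", blockers
    if req.wants_mashup:
        return "minimax_mashup", []
    if req.wants_cover:
        return "minimax_cover", []
    if req.wants_style_transfer:
        return "minimax_style_transfer", []
    if req.wants_emotion_prompt:
        return "minimax_emotion_prompt", []
    return "base_prompt", []
```

Returning the blocker list together with the route keeps the "ask the user first" step explicit: the caller can turn each blocker into one question.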

After you have prompt text and mmx flags, lint them together before generation:

  • compare prompt BPM with --bpm
  • compare prompt key with --key
  • compare prompt structure line with --structure
  • compare prompt duration with --duration (or implicit length expectation)
  • compare prompt vocal mode with --vocals
  • compare prompt language with --language
  • compare prompt avoid language with --avoid
  • stop when the prompt says one thing and the flags say another

If the user only has a text reference, route to the free-tool path in references/free-tool-inputs.md first. If the user has audio, analyze first and only then build the prompt. The linter returns a retry_guidance array with one hint per conflict so the operator can re-align prompt and flags on the next attempt.
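
As a rough illustration of the prompt-versus-flags lint described above, the sketch below checks only BPM and key and returns one hint per conflict. It is not the real linter: the function name, the regular expressions, and the assumed `--key` value format ("D minor") are all hypothetical.

```python
import re
from typing import Dict, List


def lint_prompt_against_flags(prompt: str, cli_flags: Dict[str, str]) -> List[str]:
    """Return one retry hint per conflict between prompt text and flag values.

    Only BPM and key are checked here; the other pairs (--structure,
    --vocals, --language, --avoid) would follow the same pattern.
    """
    guidance: List[str] = []

    # Look for something like "128 BPM" in the prompt text.
    bpm_match = re.search(r"(\d{2,3})\s*BPM", prompt, flags=re.IGNORECASE)
    if bpm_match and "--bpm" in cli_flags:
        if int(bpm_match.group(1)) != int(cli_flags["--bpm"]):
            guidance.append(
                f"Prompt says {bpm_match.group(1)} BPM but --bpm is "
                f"{cli_flags['--bpm']}; align them."
            )

    # Look for something like "D minor" or "G# major" in the prompt text.
    key_match = re.search(r"\b([A-G][#b]?)\s+(major|minor)\b", prompt)
    if key_match and "--key" in cli_flags:
        prompt_key = f"{key_match.group(1)} {key_match.group(2)}".lower()
        if prompt_key != cli_flags["--key"].strip().lower():
            guidance.append(
                f"Prompt says {key_match.group(1)} {key_match.group(2)} but "
                f"--key is {cli_flags['--key']}; align them."
            )

    return guidance
```

An empty list means the two checked pairs agree; a non-empty list is the signal to stop before generation.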

When To Use

Use this skill when the task involves:

  • generating a cover of an existing song with a different style (chanson version of a rock track, reggaeton version of a pop hit, and so on)
  • style transfer from a YouTube URL or audio file to a target genre
  • two-song mashup where Song A's lyrics and emotional arc are kept, but Song B's style is applied
  • emotion analysis on input audio to extract intensity curves, vocal speed, pitch bends, and emotion classifications
  • generating lyrics in a specific language and theme via the MiniMax lyrics_generation API
  • editing existing lyrics to match a target style or emotional arc (MiniMax lyrics_generation edit mode)
  • using mmx CLI directly for fine control over --avoid, --bpm, --key, --structure, --vocals, --instruments as separate flags
  • accessing MiniMax's music-cover or music-cover-free models for melody preservation

Request Intake (adapted for MiniMax features)

After the Routing and Blocker Checks classify the request, run this 2-pass intake to extract the full set of fields the user cares about. Label each field's confidence: clear (user said it), inferred (sensible default), missing (need to ask), or conflicting (user said two incompatible things — pause to resolve).

Fields checklist (MiniMax-specific)

| # | Field | What to look for | MiniMax-specific notes |
| --- | --- | --- | --- |
| 1 | Route | Cover / style transfer / mashup / standard / emotion prompt | From the Routing and Blocker Checks section. Determines which MiniMax features to use. |
| 2 | Source audio or URL | File path or playable YouTube URL | Required for cover, mashup, style transfer. For standard, optional (text-only style reference is also fine). |
| 3 | Song A identity | Name, artist, audio | For mashup: needed. For cover: this is the source. |
| 4 | Song B identity | Name, artist, audio | For mashup only. |
| 5 | Target style | Genre / mood / reference | The destination of the cover or style transfer. If user says "like Rosalía", that's clear. If user says "something good", that's missing. |
| 6 | Lyrics decision | Original / translated / new / instrumental | For cover, default to original (translated if user requests it). For standard, default to new (or user-provided). |
| 7 | Vocal mode | Solo / duet / choir / instrumental | Drives --vocals and --language flags. |
| 8 | Language | BCP-47 code (en, fr, es, etc.) | For lyrics language AND vocal language. |
| 9 | Duration | Approximate length (jingle ~30s, standard ~3min, epic ~6min) | mmx has no native duration control (see "Song length" section). Length is driven by lyrics + structure, so the intake needs the lyrics to control length. |
| 10 | BPM, key, structure | Exact values if user wants --bpm/--key/--structure | Optional. If provided, the prompt AND flags must agree (lint them). |
| 11 | Emotion arc | For emotion-prompt workflows: which emotions to emphasize | Drives the analysis-to-prompt translation. |
| 12 | Output location | Where the audio and analysis files go | Same as the base skill: per-song subfolder in ~/Music mix/<project>/<song-slug>/. |

Confidence map example (MiniMax-specific)

Request: "Hazme un cover del 'Bizcochito' de Rosalía pero en reggaetón"

clear:     source_audio=path, song_a=Bizcochito, target_style=reggaeton
inferred:  language=es, vocal_mode=solo_female, lyrics_decision=original
missing:   output_location (which project folder? per-song subfolder?)
            vocal_register (full chest, head voice, whisper? — affects --vocals flag)

Request: "I have a YouTube link of an old rock song and want it as a dreamy shoegaze ballad, with English lyrics because the original is in French"

clear:     source_url=URL, song_a=old_rock_song, target_style=shoegaze
            lyrics_decision=translated, target_language=en
inferred:  vocal_mode=duet or solo (depends on original), ~3min
missing:   local audio file for analysis (the YouTube audio needs to be downloaded first)
            BPM/key from analysis output (will be filled in after analysis)
            output_location

If any field is missing or conflicting, that's a question to ask. The Ambiguity Questions section below has specific patterns for each route. If everything is clear or inferred, the request is ready to translate.
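
One way to hold a confidence map in code is a plain dictionary keyed by the four labels, with a helper that lists what still needs a question. This is a minimal sketch: the `questions_needed` name and the dictionary shape are assumptions, and the field names are copied from the first example above.

```python
from typing import Dict, List

CONFIDENCE_LABELS = ("clear", "inferred", "missing", "conflicting")


def questions_needed(confidence_map: Dict[str, List[str]]) -> List[str]:
    """Return the fields that still need a question (missing or conflicting)."""
    unknown = set(confidence_map) - set(CONFIDENCE_LABELS)
    if unknown:
        raise ValueError(f"unknown confidence labels: {sorted(unknown)}")
    return confidence_map.get("missing", []) + confidence_map.get("conflicting", [])


bizcochito = {
    "clear": ["source_audio", "song_a", "target_style"],
    "inferred": ["language", "vocal_mode", "lyrics_decision"],
    "missing": ["output_location", "vocal_register"],
}
print(questions_needed(bizcochito))  # ['output_location', 'vocal_register']
```

An empty result corresponds to "the request is ready to translate".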

User Preference Flow (message patterns → action)

The skill does not start with a questionnaire. It starts by reading and inferring from the user's natural-language request.

| User says... | Skill does... |
| --- | --- |
| "Haz un cover de X en Y" | Route: minimax_cover. Ask: source audio file (or download from YouTube), target language for lyrics, vocal register. |
| "Make this song sound like Rosalía" | Route: minimax_style_transfer. Ask: source audio, which album/era of Rosalía. |
| "I have audio of A, mash with B, keep A's melody" | Route: minimax_mashup. Ask: A vs B confirmation, source audio for A, B can be name or audio. |
| "Analyze the emotion curve of this track" | Route: minimax_emotion_prompt (analysis-only). Run analysis_orchestrator.py --audio first, then read the JSON. |
| "I want the lyrics to be about X, in French, melancholic" | Route: base_prompt (standard). Use the lyrics API to generate, then pass to mmx music generate --lyrics-file. Ask: target BPM/key/structure or derive from analysis. |
| "Recreate the song but in 90 BPM D minor" | Route: base_prompt with mmx flags. Lint prompt vs flags before generation. Verify BPM/key consistency. |
| "I don't know, surprise me" | Pick a coherent default (e.g. upbeat indie pop, EN, ~3min, auto-lyrics, standard generation) and confirm with the user before generating. |
| "Same song again but as a reggaeton version" | Route: minimax_cover with the existing song as source. Use the same project/song subfolder, suffix the MP3 (M1_original.mp3 + M2_reggaeton.mp3). |

This table summarizes references/user-preference-flow.md (which lives in the base skill). For more detailed cases, defer to the base skill's table and combine it with this skill's route mapping.

Output File Layout (Per-Song Subfolders)

MiniMax-specific additions (drop these into the per-song subfolder alongside the base items):

| File | Source | Notes |
| --- | --- | --- |
| <song-slug>_analysis.json | analysis_orchestrator.py --output | MiniMax-specific analysis results (emotion, BPM, key, segments) |
| <song-slug>_lyrics.txt | mmx music generate --lyrics-file | Optional if user provided lyrics inline |
| <song-slug>_<style>_prompt.txt | The exact text passed to --prompt | For reproducibility |

The LLM should aim for the base skill's layout by default. The MiniMax-specific files are added on top when MiniMax features are used (cover workflow, mashup, analysis, etc.).

Quick Start with the Orchestrator

For any input combination, the analysis_orchestrator.py script is the single entry point:

# Audio file
python3 scripts/analysis_orchestrator.py --audio /tmp/song.wav

# Two songs (mashup) - gets BPM + key compatibility scoring for free
python3 scripts/analysis_orchestrator.py --audio /tmp/song_a.wav --audio /tmp/song_b.wav

# Video - extracts audio + visual features (scenes, color, motion)
python3 scripts/analysis_orchestrator.py --video /tmp/clip.mp4

# Image (album art) - color palette + style hints
python3 scripts/analysis_orchestrator.py --image /tmp/album_art.jpg

# YouTube URL - downloads then analyzes
python3 scripts/analysis_orchestrator.py --youtube "https://youtube.com/watch?v=..."

# Combination: audio + image
python3 scripts/analysis_orchestrator.py --audio /tmp/song.wav --image /tmp/art.jpg

# Demucs source separation — for TIMBRE/PITCH analysis of an isolated vocal, NOT for lyrics
python3 scripts/analysis_orchestrator.py --audio /tmp/song.wav --use-demucs

# Whisper lyrics extraction — run on the FULL mix (do NOT pre-separate with Demucs)
python3 scripts/analysis_orchestrator.py --audio /tmp/song.wav --lyrics

# VLM captioning for images (calls mmx vision describe / MiniMax 3.0 — cloud, skip if MiniMax is blocked)
python3 scripts/analysis_orchestrator.py --image /tmp/album_art.jpg --vlm

The orchestrator dispatches to the right analysis scripts and produces a unified JSON. Optional packages (CLAP, autochord, allin1, pyloudnorm, pylette, scenedetect, demucs, beat_this, basic-pitch, transformers/MERT, open_clip) are detected at runtime and used when available.
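
Runtime detection of optional packages can be done with the standard library alone. The sketch below is not the orchestrator's own code: the helper names and the shape of the JSON error block are assumptions, and the import names must be top-level module names, which can differ from the pip package name (for example, praat-parselmouth is imported as parselmouth).

```python
import importlib.util
import json
from typing import Dict, Iterable


def detect_optional(modules: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


def missing_dependency_block(step: str, module: str) -> str:
    """Build a JSON error block for a step whose optional module is absent.

    The keys used here are hypothetical, not the orchestrator's schema.
    """
    return json.dumps({
        "step": step,
        "status": "skipped",
        "error": f"optional dependency '{module}' is not installed",
    })


available = detect_optional(["librosa", "demucs", "pyloudnorm"])
for module, present in available.items():
    if not present:
        print(missing_dependency_block("analysis", module))
```

Checking with `find_spec` avoids importing heavy packages just to learn whether they exist, so the check itself stays cheap.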

Extraction guidance (what actually improves the output)

These are the rules that make the extracted data useful to the downstream generator. They are tool-agnostic — they apply whether the backend is MiniMax cloud or a local model.

  • Lyrics: transcribe the FULL mix, do not Demucs-first. Feeding Demucs-isolated vocals into Whisper measurably worsens transcription word-error-rate in most configurations. Run the transcriber on the original mix. Use faster-whisper over vanilla whisper (same accuracy, much lower latency/VRAM), and prefer the large-v2 model for sung lyrics — large-v3 is reliably worse on singing. Use medium/base only as a speed compromise.
  • Use Demucs only for timbre/pitch. Source separation helps when you want clean vocal-stem features (breathiness, pitch range, vocal brightness) or per-instrument detection — never as a lyrics pre-step.
  • Prioritise the high-value features. For driving a generation prompt, the features that matter most (in order) are tempo/BPM, key/scale, beats/downbeats, chords, then structure (section boundaries). Energy/RMS and spectral centroid map to texture words (punchy, airy, sparse, dense) and to dynamic tags. Spend analysis budget there first.
  • Give key detection a long window. Estimate key/chroma over ~120s of audio (not a short clip) for a stable result; BPM is stable from ~60s.
  • Carry confidence through to the prompt. Hedge low/medium detections ("around 128 BPM", "likely D minor") and never inject missing values as facts — see Analysis Quality below.
  • Map structure boundaries to actions. Detected section boundaries become the [Verse]/[Chorus]/[Bridge] tag roadmap, and (for backends that support it) the repaint windows for fixing one bad section instead of regenerating the whole track.

Output file layout (per-song subfolders)

Every generation should be saved into a per-song subfolder that bundles the audio with its analysis, prompt, and lyrics. The LLM should ask the user for the project root and song slug up front (default: ~/Music mix/<project>/<song-slug>/), then run the full chain of commands below.

# Example: DBC - Two Paths, two versions

# 1. Make the subfolder
mkdir -p ~/Music\ mix/dbc/two-paths

# 2. Run the analysis and save JSON into the subfolder
python3 scripts/analysis_orchestrator.py \
  --audio /tmp/two_paths.wav \
  --use-demucs --lyrics --lyrics-source auto \
  --output ~/Music\ mix/dbc/two-paths/two_paths_analysis.json

# 3. Build the prompt from the analysis, save it next to the JSON
python3 scripts/emotion_to_prompt.py \
  --emotion ~/Music\ mix/dbc/two-paths/two_paths_analysis.json \
  --output ~/Music\ mix/dbc/two-paths/two_paths_synthwave_prompt.txt

# 4. Generate each version, save the MP3 into the subfolder with a
#    versioned filename so multiple takes stack cleanly
mmx music generate \
  --prompt "$(cat ~/Music\ mix/dbc/two-paths/two_paths_synthwave_prompt.txt)" \
  --lyrics-file ~/Music\ mix/dbc/two-paths/two_paths_lyrics.txt \
  --out ~/Music\ mix/dbc/two-paths/M1_two_paths_synthwave.mp3

The result is a self-contained song folder that the user can review, archive, share, or re-generate from without losing any context.
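
The naming used in the commands above can be generated by a small helper. This is a sketch under assumptions: the authoritative slug rules live in the base skill, so the `slugify` rule here (lowercase, dashes for the folder, underscores for file stems) is inferred from the example paths only, and both function names are hypothetical.

```python
import re
from pathlib import Path
from typing import Dict


def slugify(name: str) -> str:
    """Lowercase, replace runs of non-alphanumerics with '-', trim dashes."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def song_paths(root: Path, project: str, song: str, style: str, version: int) -> Dict[str, Path]:
    """Build the per-song folder and the file paths stored inside it."""
    folder = root / slugify(project) / slugify(song)
    stem = slugify(song).replace("-", "_")  # e.g. "two-paths" -> "two_paths"
    return {
        "folder": folder,
        "analysis": folder / f"{stem}_analysis.json",
        "lyrics": folder / f"{stem}_lyrics.txt",
        "prompt": folder / f"{stem}_{style}_prompt.txt",
        "audio": folder / f"M{version}_{stem}_{style}.mp3",
    }


paths = song_paths(Path.home() / "Music mix", "dbc", "Two Paths", "synthwave", 1)
print(paths["audio"].name)  # M1_two_paths_synthwave.mp3
```

Bumping `version` for each new take yields M1_, M2_, and so on, so multiple versions stack in the same folder.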

What's New in v1.0.0

v1.0.0 is the first stable release. It builds on the v0.x series (v0.3.0 / v0.4.0 dev line) with stronger preflight routing, wider prompt/flag consistency, and explicit post-generation verification:

Preflight routing:

  • lint_music_request.py now emits one of six routes: base_prompt, minimax_cover, minimax_mashup, minimax_style_transfer, minimax_emotion_prompt, or needs_clarification
  • New blockers: missing Song B, missing lyrics decision, and conflicting cover/style-transfer intent
  • A retry_guidance array on every conflict so the operator can re-align prompt and flags

Prompt and flag consistency:

  • Linter now detects conflicts in BPM, key, structure, duration, vocal mode, language, and avoid list
  • The canonical mmx prompt schema is documented in references/examples.md

Analysis quality:

  • All analysis scripts converge on a compact summary (tempo, key, sections, instrumentation, vocal traits, energy curve, hook points, mix notes)
  • Confidence levels (clear, high, medium, low, inferred, missing) attached to every detection
  • Missing optional dependencies fall back to a JSON error block instead of failing the whole workflow

Output verification:

  • Post-generation verification checklists for covers, mashups, style transfer, and emotion prompts
  • Eight failure signatures (copied too closely, lost melody, wrong tempo, wrong key, muddy mix, weak chorus, style mismatch, neutral vocals) with matching fixes
  • Revision prompt templates that preserve source identity while fixing one specific dimension

Tests and portability:

  • Smoke tests now cover all new linter routes, the new conflict types, and the stdlib-only import guarantee
  • Windows is documented as partial support; scripts stay POSIX-safe, audio tools may need platform install

What's New in v0.3.0

v0.3.0 builds on v0.2.0 with a substantially richer analysis pipeline:

New analysis scripts (8):

  • extract_stems.py — Demucs source separation (vocal/drums/bass/other)
  • track_beats.py — beat_this beat + downbeat tracking (ISMIR 2024 SOTA)
  • extract_melody.py — Spotify Basic Pitch polyphonic AMT → MIDI + key/scale
  • compute_audio_embedding.py — MERT v1-330M music embeddings (vibe similarity)
  • classify_instruments.py — MIT AST 527-class AudioSet tagging
  • extract_video_features.py — extended with camera motion + VLM captioning
  • analyze_image.py — extended with OpenCLIP, OCR, face detection, VLM caption
  • analysis_orchestrator.py — single entry point, --use-demucs, --vlm, --ocr flags

New prompt slots (consumed in emotion_to_prompt.py):

  • beat grid: 4/4 at 150 BPM (confidence 0.80) from beat_this
  • melodic key from MIDI: E minor; interval motion: mostly leaps; modal character: pentatonic, blues from Basic Pitch
  • AST-detected sound palette: rock music (0.16), punk rock (0.14), grunge (0.20) from MIT AST
  • emotion signature from analysis: intense, passionate, dramatic, triumphant (expanded to 25-emotion classifier)
  • vocal texture in verse: breathier / more intimate than average (per-section aggregation)
  • tempo: tight, on-beat delivery (from tempo_consistency)
  • tonal character: dark warm tone, rolled-off highs (from brightness)
  • instruments detected: electronic / synthetic textures (from instrument_hints)
  • natural dramatic pauses detected at: 2s (11.7s pause), 20s (3.3s pause) (from Demucs vocal-stem)
  • style direction: ... (from analyze_two_songs mashup_plan)

Bug fixes:

  • parselmouth 0.4.x API (get_value_at_time / get_value_at_xy)
  • ffmpeg 8.x image2 muxer workaround (per-frame extraction)
  • pylette 5.1+ capital-P import + Pylette fallback
  • open_clip 3.3 3-tuple return + get_tokenizer() for tokenizer
  • demucs 4.x apply_model API

Prompting wins (verified end-to-end with DBC Woodstock 2013):

  • Mix: 0 silence gaps, 35 pitch bends
  • Vocal stem: 19 silence gaps, 49 pitch bends, 2.32 syll/sec
  • BPM 150 (4/4) from beat_this, E minor (MIDI-confirmed G# minor)
  • AST: "Rock music", "Punk rock", "Heavy metal", "Grunge" — matches the actual band

When NOT To Use

Do not use this skill when:

  • the user only needs standard song generation without cover, mashup, or analysis — use music-craft instead (lighter, no MiniMax dependency)
  • the runtime does not expose a music_generate tool and there is no MINIMAX_API_KEY configured — both skills need the runtime
  • the user wants deterministic, single-shot generation with no iteration (this skill is overkill for that)
  • the user wants to mutate a specific existing audio file (pitch shift, time stretch, stem split) — that is post-production, not generation
  • the user is not on a MiniMax Token Plan — the advanced features (cover, mmx per-flag control, lyrics API, emotion-driven prompts) require the plan

Decision Tree

Use the base skill unless one of these MiniMax-specific needs is present:

  • melody-preserving cover or style transfer from audio or YouTube
  • two-song mashup
  • lyrics API preview/edit flow
  • emotion analysis that feeds the prompt
  • exact mmx control for BPM, key, structure, or avoid lists

If the user wants a new song that only borrows a style, stay in music-craft unless they also need exact flag control or lyrics API iteration.

If the source is a YouTube URL and download is blocked, ask for a local file before changing the workflow.

First Response Defaults

Use these defaults on the first pass:

  • Cover from audio or YouTube: start with the one-step cover path. Switch to two-step only if the user wants translated lyrics, edited ASR lyrics, or custom lyrics.
  • Style transfer only: do not use cover unless melody preservation matters. Use standard generation plus mmx flags if exact BPM/key/structure matter.
  • Two-song mashup: anchor on Song A. If Song A has audio, default to the cover two-step workflow; if Song B is only named, ask for a short style description or fetch more context if free tools are available.
  • Lyrics API generation or edit: use write_full_song for blank-page generation and edit for revisions.
  • Emotion-analysis-to-prompt: run analysis first, then convert to a prompt; only ask whether the output should be cover, mashup, or standard generation, plus the target language if missing.
  • Exact BPM/key/structure control: make mmx flags the source of truth and keep the prompt descriptive but non-conflicting.

Ambiguity Questions

Ask at most 1-3 questions. Separate blockers from quality tweaks:

  • Required blockers first: source file or URL, which song is A vs B, whether lyrics already exist, whether the output must preserve melody.
  • Optional quality after blockers: target language, target style, BPM, key, structure, instruments, vocal color, avoid list.

Use these exact patterns when clarification is needed:

  • Cover: "Which source should I use?" "Do you want the original lyrics, translated lyrics, or new lyrics?" "Any target style, or should I derive it from the source?"
  • Mashup: "Which song is A and which is B?" "Do you have audio for Song B, or only the name?" "Should the lyrics stay the same or be rewritten?"
  • Lyrics API: "Write from scratch or edit existing lyrics?" "What language should I target?" "Any hard structure requirements?"
  • Emotion prompt: "Do you want cover, mashup, or standard generation?" "What language should the output use?" "Should I prioritize tenderness, energy, or structure?"
  • mmx precision: "Which values are mandatory: BPM, key, structure, or avoid list?" "Any instruments or vocals that must stay in or stay out?"

Relationship to music-craft

This skill extends the base skill; it does not replace it. The shared concepts are:

| Concept | Where it lives |
| --- | --- |
| Pre-Flight Check (platform detection) | This skill (extended required list) |
| Anti-sparse rules (canonical text) | Base skill, referenced from here |
| Prompt formula (production sheet) | Base skill, referenced from here |
| Structure tags (14 tags) | Base skill, referenced from here |
| User preference flow (auto-detect + ask) | Base skill, referenced from here |
| Output file layout (per-song subfolders, slug rules, version prefix) | Base skill, referenced from here; MiniMax adds analysis.json and lyrics.txt |
| Rate limits (generic) | Base skill |
| Quality verification checklist | Base skill, extended here for MiniMax |
| Operating rules (6-step loop) | Base skill, summarized here with MiniMax-specific extensions |

The MiniMax-specific additions are:

| MiniMax concept | Where it lives |
| --- | --- |
| mmx CLI quick reference | This skill |
| mmx full flag reference | This skill, references/mmx-flags-reference.md |
| Cover workflow (one-step, two-step) | This skill, references/cover-workflow.md |
| Lyrics generation API | This skill, references/lyrics-generation.md |
| Mashup workflow (A + B) | This skill, references/mashup-workflow.md |
| Emotion analysis (vocal speed, intensity, pitch) | This skill, references/emotion-analysis.md |
| MiniMax-specific error handling | This skill, references/error-handling.md |
| Audio analysis scripts | This skill, scripts/ |
| Free tool inputs (web, image, memory) | Both skills: base layer in music-craft, MiniMax layer here in references/free-tool-inputs.md |

Pre-Flight Check (extended)

The platform detection block is the same as music-craft (run it first). The required and optional lists are extended for MiniMax.

Platform Notes

  • macOS/Linux are the primary targets: use python3, command -v, and normal shell export/PATH checks.
  • Windows is partial support only: prefer PowerShell, use python or py -3, and verify env vars with Get-ChildItem Env:MINIMAX_API_KEY or Test-Path Env:MINIMAX_API_KEY.
  • On Windows, ffmpeg, yt-dlp, and mmx are PATH-sensitive; if Get-Command/where.exe cannot find them, restart the shell or add the install directory to PATH.
  • If Windows path/dependency issues keep blocking analysis, use WSL for the script-heavy parts instead of claiming full native support. For the full WSL2 setup (including the corporate items below), follow the base skill's references/windows-wsl-setup.md.
  • Corporate machines (TLS-inspecting proxy): pip/HuggingFace/model downloads inside WSL fail with CERTIFICATE_VERIFY_FAILED until the corporate root CA is installed in the distro (and REQUESTS_CA_BUNDLE/SSL_CERT_FILE point at the system bundle). A proxy env var (HTTP_PROXY) can also hijack 127.0.0.1 calls to a local API — unset it / set no_proxy=127.0.0.1,localhost. Both are covered in the base reference above.
  • If MiniMax itself is blocked (corporate firewall) the cloud features here — mmx music cover, the lyrics API, and any analysis script that calls MiniMax (e.g. emotion_to_prompt.py) — will fail. In that case use only the local-capable tools (yt-dlp, ffmpeg, librosa, Whisper) for analysis and the local ACE-Step backend in music-craft for generation.

Required (skill will not work without these)

| Check | What it is | How to verify | If missing |
| --- | --- | --- | --- |
| music_generate tool | The runtime's built-in music generation tool | Inspect the active runtime's tool list | Tell the user: "This skill needs a music_generate tool, but the active runtime does not expose one. Configure a music provider in OpenClaw and try again." Stop. |
| MINIMAX_API_KEY env var | API key for the MiniMax Music 2.6 plan | test -n "$MINIMAX_API_KEY" && echo "OK" | Tell the user: "This skill needs the MINIMAX_API_KEY environment variable. Get one from your MiniMax account and export it. If you do not have a MiniMax Token Plan, use music-craft instead; it works with any provider." Stop. |
| mmx CLI | The MiniMax CLI for fine-flag control | command -v mmx && mmx --version (macOS/Linux) or Get-Command mmx; mmx --version (PowerShell) | Ask the user: install via the MiniMax install guide, or skip mmx-specific features and use the music_generate tool with prompts. Do not block: mmx is optional if the runtime has MiniMax configured, but Windows support is only partial and depends on PATH visibility. |
| python3 | Required for the analysis scripts | command -v python3 (macOS/Linux) or python / py -3 (Windows PowerShell) | Tell the user: "The analysis pipeline (emotion analysis, mashup) needs Python 3.9+." Propose an install command for the active shell. Block emotion analysis if missing. |
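
For callers that prefer to run the required checks from Python rather than the shell, a minimal sketch could look like this. The `preflight` name and return shape are assumptions; the music_generate tool cannot be detected from a script and still has to be checked in the runtime's tool list.

```python
import os
import shutil
from typing import Dict, Mapping, Optional


def preflight(env: Optional[Mapping[str, str]] = None) -> Dict[str, bool]:
    """Report which script-checkable requirements are satisfied.

    Pass a mapping for `env` to test against something other than the
    real process environment.
    """
    env = os.environ if env is None else env
    return {
        # Non-empty value only; the key itself is not validated here.
        "MINIMAX_API_KEY": bool(env.get("MINIMAX_API_KEY")),
        "mmx": shutil.which("mmx") is not None,
        # Windows installs often expose "python" rather than "python3".
        "python": shutil.which("python3") is not None
                  or shutil.which("python") is not None,
    }


status = preflight()
for name, ok in status.items():
    print(f"{name}: {'OK' if ok else 'missing'}")
```

A missing mmx entry should lead to the "ask the user" path rather than a hard stop, in line with the table.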

Optional (skill works without these, but quality improves with them)

| Tool | What it unlocks | Install per platform |
| --- | --- | --- |
| ffmpeg | Audio conversion (WAV for analysis, MP3 export, trimming) | apt install ffmpeg · brew install ffmpeg · winget install Gyan.FFmpeg (restart PowerShell after install so PATH updates apply) |
| yt-dlp | YouTube audio download for cover and mashup | pip install -U yt-dlp or py -3 -m pip install -U yt-dlp on Windows; ensure the CLI is on PATH |
| librosa | Audio analysis (BPM, key, energy, structure) | pip install librosa numpy scipy |
| parselmouth | Better pitch tracking (Praat under the hood) | pip install praat-parselmouth |
| scikit-learn | Audio clustering (segment detection) | pip install scikit-learn |

The full per-platform install table is in the Pre-Flight Check section of the base skill (music-craft).

The "ask the user" pattern

Same as the base skill: for each missing optional tool, present three options — install (propose exact command, let user approve), skip (use the simple path), or cancel. Never auto-install.

If MINIMAX_API_KEY is missing, the redirect is to the base skill, not "install MiniMax" — the user may not have a Token Plan at all.

Local analysis memory (separate from generation)

Generation runs on MiniMax's cloud — your laptop just sends the prompt and downloads the MP3, so generation itself uses negligible local memory.

However, this skill's local analysis scripts run on your machine and can use real memory. Before running the full analysis pipeline, check available memory:

| Script | Models loaded | Approx peak RAM |
| --- | --- | --- |
| analyze_vocal_emotion.py | parselmouth (Praat) + scipy | ~500 MB |
| analyze_audio.py | librosa + transformers (MERT or MIT AST) | 2–4 GB |
| extract_lyrics_whisper.py | whisper model (tiny/base/medium) | 1–5 GB depending on model size |
| extract_stems.py | Demucs (htdemucs) | 2–4 GB |
| emotion_to_prompt.py | calls MiniMax API (negligible local load) | <100 MB |
| compute_audio_embedding.py | MERT model | 1–2 GB |
| classify_instruments.py | MIT AST | 1–2 GB |

Combined (full analysis pipeline on a 4-min song): ~6–10 GB peak on top of OS and other apps. On unified-memory systems (Apple Silicon, integrated graphics), this competes directly with macOS/Windows and your other applications. On dedicated-GPU systems (NVIDIA, AMD), model memory is taken from system RAM unless you have CUDA acceleration.

Recommendations:

  • Close heavy apps (browser with many tabs, IDE, Docker) before running the full pipeline
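  • For extract_lyrics_whisper.py, use the tiny model by default when memory is tight; base/medium are 2-5x heavier. When transcription accuracy on sung lyrics matters more than memory, follow the model guidance under Extraction guidance instead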
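  • For extract_stems.py, the --quality flag controls which Demucs model is used. In upstream Demucs, htdemucs is the default and htdemucs_ft is its fine-tuned variant, which takes roughly four times longer to separate, so prefer the default htdemucs when resources are tight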
  • If you run out of memory, run analysis steps individually rather than via analysis_orchestrator.py (which loads everything)

The scripts/smoke_test.py script verifies the environment is set up; it does not test memory headroom. Run your own memory check before running a full analysis.
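
To decide which steps to run individually on a constrained machine, the table above can be turned into a simple lookup. This is a sketch, not part of the skill's scripts: the figures are the approximate ranges from the table, the helper name is hypothetical, and the amount of free memory has to be supplied by the caller.

```python
from typing import Dict, List, Tuple

# Approximate peak RAM in GB as (low, high), copied from the table above.
PEAK_RAM_GB: Dict[str, Tuple[float, float]] = {
    "analyze_vocal_emotion.py": (0.5, 0.5),
    "analyze_audio.py": (2.0, 4.0),
    "extract_lyrics_whisper.py": (1.0, 5.0),
    "extract_stems.py": (2.0, 4.0),
    "emotion_to_prompt.py": (0.1, 0.1),
    "compute_audio_embedding.py": (1.0, 2.0),
    "classify_instruments.py": (1.0, 2.0),
}


def steps_that_fit(available_gb: float) -> List[str]:
    """Scripts whose worst-case peak fits in the free memory when run alone."""
    return [name for name, (_, high) in PEAK_RAM_GB.items() if high <= available_gb]


print(steps_that_fit(3.0))
# ['analyze_vocal_emotion.py', 'emotion_to_prompt.py',
#  'compute_audio_embedding.py', 'classify_instruments.py']
```

Using the upper bound of each range is deliberately conservative; a step that is excluded may still fit with a smaller model.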

Free Tool Augmentation (Input Enrichment)

The OpenClaw runtime exposes several free tools (web_fetch, web_search, image analysis, memory, browser) that enrich the music generation workflow. The base layer is documented in music-craft → Free Tool Augmentation and references/free-tool-inputs.md. This section shows how they compose with MiniMax-specific features.

Quick recap of free tools

| Tool | Purpose |
| --- | --- |
| web_fetch | Fetch URL content (lyrics pages, YouTube metadata, Wikipedia) |
| web_search | Find lyrics, artist info, genre descriptions |
| image / MiniMax__understand_image | Analyze album art, concert photos, music video screenshots |
| memory_search / memory_get | Recall user's prior music preferences |
| browser | JS-heavy site fallback (last resort) |

MiniMax compositions (high-value combos)

  • web_fetch + lyrics_generation: fetch the user's draft from a URL, run it through edit mode for cleanup, generate.
  • web_search + cover workflow: find covers in the target style, extract their characteristics, apply to the user's track.
  • image + mmx per-flag control: analyze album art, translate to --instruments, --bpm, --key, --structure for fine-grained style matching.
  • memory + emotion analysis: combine the user's prior preferences with deep audio analysis of a reference track.

For the full worked examples, parameter recommendations, and MiniMax-specific edge cases, see references/free-tool-inputs.md.

Operating Rules

Same 6-step loop as music-craft, with MiniMax-specific extensions:

  1. Read and auto-detect — same
  2. Ask only the ambiguous parts — same, plus ask if the user wants cover / mashup / standard
  3. Translate to a production-sheet prompt — same, but consider whether to use mmx flags (see references/mmx-flags-reference.md) instead of packing everything into the prompt
  4. Structure the lyrics — same, plus consider lyrics API for generation or edit (see references/lyrics-generation.md)
  5. Generate and verify — same, plus the music-cover model for melody preservation
  6. Iterate — same, plus emotion analysis to inform the next prompt adjustment

For the full 6-step detail, see music-craft → Operating Rules.

Song length (mmx has no native duration control)

Unlike music-craft's ACE-Step backend (which takes audio_duration as a parameter), the MiniMax Music 2.6 API has no official duration parameter. Output length is mainly determined by:

  • Lyrics length (primary): each [Verse]/[Chorus] section takes ~15-30 seconds depending on word count and singing pace. A typical 3:30 song has ~150-200 words of lyrics across 2 verses + 2 choruses + bridge.
  • Structure tags: [Intro], [Instrumental Break], [Outro] add lyric-free (instrumental or sparse) sections that extend total length.
  • Prompt hints (secondary): phrases like "3 minute song" or "4 minute track" nudge the model toward that length.
  • BPM and section count (minor effect): faster BPMs and more sections tend to produce slightly longer outputs.

Practical recipe for a full 3:30 song:

  1. Lyrics: ~150-200 words with [Verse 1], [Pre-Chorus], [Chorus], [Verse 2], [Bridge], [Outro] tags (full song structure, not just one chorus)
  2. Prompt: include structure hints like "full 3-minute song with intro, 2 verses, 2 choruses, bridge, and outro" or use --structure "intro-verse-pre_chorus-chorus-verse-chorus-bridge-chorus-outro"
  3. Check output length — if it's a 1-minute hook, the lyrics are probably too short
  4. If output is too short: regenerate with longer lyrics (the model can't add sections that aren't in the lyrics)
  5. If output is too long: trim lyrics to ~120 words or remove [Instrumental Break] tags, since lyric-free sections add length

Don't expect mmx to hit 3:30 exactly. Output length varies by ±20-30s depending on the model. If you need precise length, ACE-Step is the right tool (it has audio_duration). If you want MiniMax's vocal quality and the song length is flexible, mmx is fine.
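
Because lyric length is the main lever, a quick word count before generating can catch the "1-minute hook" case early. The following is a minimal sketch, assuming the ~150-200 word target from the recipe above; the function names are hypothetical and the thresholds are guidance, not guarantees from MiniMax.

```python
import re

# Rough word-count target for a ~3:30 song, taken from the recipe above.
# Treat it as guidance only; actual output length still varies.
TARGET_WORDS = (150, 200)

def count_lyric_words(lyrics: str) -> int:
    """Count sung words, ignoring structure tags such as [Verse 1]."""
    without_tags = re.sub(r"\[[^\]]*\]", " ", lyrics)
    return len(without_tags.split())

def length_hint(lyrics: str) -> str:
    """Return 'too short', 'ok', or 'too long' relative to the target."""
    words = count_lyric_words(lyrics)
    low, high = TARGET_WORDS
    if words < low:
        return "too short"
    if words > high:
        return "too long"
    return "ok"

sample = "[Verse 1]\nlook at the stars\n[Chorus]\nthey were all yellow"
print(count_lyric_words(sample), length_hint(sample))  # prints: 8 too short
```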

mmx CLI Quick Reference

The mmx CLI exposes MiniMax Music 2.6 parameters as separate flags. This gives finer control than packing everything into a single prompt string.

The most useful flags:

| Flag | Effect | Example |
| --- | --- | --- |
| --avoid | Elements to avoid (comma-separated) | --avoid "sparse, a cappella, electronic sounds" |
| --bpm | Exact BPM | --bpm 80 |
| --key | Musical key | --key "E minor" |
| --structure | Song structure | --structure "intro-verse-pre_chorus-chorus-verse-chorus-bridge-chorus-outro" |
| --vocals | Vocal style | --vocals "passionate French male vocal" |
| --instruments | Featured instruments | --instruments "accordion, upright bass, strings, piano" |
| --genre | Genre | --genre "french chanson" |
| --mood | Mood | --mood "melancholic romantic dramatic" |
| --lyrics-optimizer | Auto-generate lyrics from prompt | (flag only) |
| --model | Model name | --model music-2.6 (paid, highest RPM) or --model music-2.6-free (default, free tier) |
| --cover-feature-id | Use a preprocessed cover (two-step workflow) | (from preprocess call) |

Full reference with all flags and examples: references/mmx-flags-reference.md.

When to use mmx vs music_generate:

  • mmx: when you need fine control over specific parameters (BPM, key, structure as separate flags)
  • music_generate: when the prompt-only path is enough, and you want to keep the workflow provider-agnostic

Both produce equivalent results if the prompt and flags are aligned.
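
When an agent assembles the command programmatically, building an argument list avoids shell-quoting mistakes in long prompts. The sketch below uses only the value-taking flags listed in the quick-reference table; the helper name build_mmx_args is hypothetical, and it only constructs the command, it does not run mmx.

```python
import shlex
from typing import List

# Value-taking flags from the quick-reference table above.
VALUE_FLAGS = ("avoid", "bpm", "key", "structure", "vocals",
               "instruments", "genre", "mood", "model")

def build_mmx_args(prompt: str, out: str, **params) -> List[str]:
    """Build an argv list for `mmx music generate` (hypothetical helper)."""
    unknown = set(params) - set(VALUE_FLAGS)
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    args = ["mmx", "music", "generate", "--prompt", prompt]
    for name in VALUE_FLAGS:
        value = params.get(name)
        if value is not None:
            args += [f"--{name}", str(value)]
    args += ["--out", out]
    return args

args = build_mmx_args("French chanson ballad", "song.mp3",
                      bpm=80, key="E minor", genre="french chanson")
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(args))
```

Passing the list to subprocess.run(args) instead of a joined string keeps multi-line prompts intact.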

mmx Music Generation — verified patterns (June 9, 2026)

End-to-end verified invocations from this session (M5_idkw_dreampop_shoegaze + M5_idkw_opera_metal in ~/Music mix/hello_cleveland/i_dont_know_why/):

Pattern A — full song with detailed prompt + 6 metas (production-grade output)

mmx music generate \
  --prompt "dream pop reimagining, shoegaze-influenced indie rock turned ethereal and cinematic.
My Bloody Valentine meets Slowdive meets Radiohead.
Male lead vocal, breathy and vulnerable, double-tracked with slight detuning and tape warmth.
Wall of clean electric guitars with heavy chorus pedal and tremolo picking.
Shimmering washes of reverb, sub-bass synth pad foundation, soft brushed electronic drums.
Glockenspiel and celesta melody line high above the mix.
Organ pads swelling at choruses, reversed guitar samples between sections.
Heavy reverb and analog warmth throughout, lo-fi texture.
Emotional arc: hazy drifting opening building wave confusion overwhelming beautiful climax fading dreamlike denouement outro.
Avoid: sharp percussive aggressive distortion clear upfront vocals minimal sparse.
Tempo 96 BPM in D major, dreamlike half-time feel.
Suitable as a slow-burn alt-pop anthem, melodic and textural, intimate verses and soaring choruses.
Modern production, polished mix, atmospheric vocal production where vocals sit among the instruments rather than above them." \
  --lyrics-file gen1_lyrics.txt \
  --model music-2.6 \
  --vocals "breathy vulnerable male lead, double-tracked with slight detuning" \
  --genre "dream pop, shoegaze-influenced indie" \
  --mood "hazy confusion building to overwhelming beautiful release, then dreamlike fade" \
  --instruments "wall of clean electric guitars with heavy chorus pedal, sub-bass synth pad, soft brushed electronic drums, glockenspiel, celesta, organ pads, reversed guitar samples" \
  --bpm 96 \
  --key "D major" \
  --structure "intro-verse-pre_chorus-chorus-post_chorus-verse2-chorus-repeat-outro" \
  --use-case "slow-burn alt-pop anthem, suitable for late-night listening" \
  --avoid "sharp percussive aggressive distortion, clear upfront vocals, minimal sparse arrangement" \
  --references "My Bloody Valentine, Slowdive, Radiohead" \
  --out M5_idkw_dreampop_shoegaze.mp3

Output: 167.9s MP3, 5.4 MB, -8.8 LUFS, 5.7 LRA (good dynamics).

Pattern B — crazy combo: opera vocals + heavy metal music (for fun experiments)

mmx music generate \
  --prompt "extreme dramatic contrast: powerful operatic tenor vocals over heavy metal instrumentation.
Like Freddie Mercury fronting Metallica. Epic, theatrical, over the top.
Thunderous double bass drums, distorted electric guitars with palm-muted chugging,
guttural rhythm section, blast beats, tremolo picking, minor key riffing.
Operatic vocals soaring above the metal wall of sound, belting high notes with vibrato.
Gothic theatrical atmosphere, dramatic dynamic shifts from whisper-quiet verses
to explosive metal choruses. Anthem-like, stadium-ready." \
  --lyrics-file gen2_lyrics.txt \
  --model music-2.6 \
  --vocals "operatic tenor, powerful Freddie Mercury style, vibrato, theatrical belting" \
  --genre "symphonic metal" \
  --mood "dramatic, theatrical, anthemic, intense" \
  --instruments "distorted electric guitars, double bass drums, blast beats, orchestral strings" \
  --tempo "fast" \
  --bpm 160 \
  --key "D minor" \
  --structure "verse-pre_chorus-chorus-verse-pre_chorus-chorus-outro" \
  --use-case "epic music experiment" \
  --avoid "pop, soft, gentle, acoustic, slow" \
  --out M5_idkw_opera_metal.mp3

Output: 155.8s MP3, 5.0 MB, -9.6 LUFS, 4.3 LRA (compressed but still has dynamics).

Model selection

| Model | When to use | Cost |
| --- | --- | --- |
| music-2.6 | Production work, full quality | Token Plan / paid |
| music-2.6-free (default) | Free tier, lower RPM, "unlimited" quota for some plans | Free |
| music-2.5+ | Older model, still good quality | Token Plan / paid |
| music-2.5 | Legacy | Token Plan / paid |
| music-cover | Cover/re-interpretation of source audio (one-step) | Token Plan / paid |
| music-cover-free | Free cover variant | Free |

music-2.6-free is the default for most users — same model, free tier. The mmx CLI uses it as the default when no --model is specified.

is_instrumental and lyrics_optimizer flags (MiniMax-specific paths)

The mmx CLI exposes two important flags that bypass the --lyrics requirement:

| Flag | What it does | When to use |
| --- | --- | --- |
| --instrumental | Generate music without vocals (no lyrics needed) | When user wants BGM, intro, soundtrack, loop |
| --lyrics-optimizer | Auto-generate lyrics from the prompt (no --lyrics needed) | When user says "make me a song about X" but doesn't have lyrics |

Examples:

# Pure instrumental (no vocals)
mmx music generate \
  --prompt "Instrumental only, no vocals, no lyrics. Loopable coffee shop background, soft piano, brushed drums, 90 BPM, C major" \
  --instrumental \
  --length 180000 \
  --out coffee_bgm.mp3

# Auto-generated lyrics from prompt
mmx music generate \
  --prompt "Upbeat indie folk, melancholic but hopeful, male vocal, acoustic guitar, 100 BPM" \
  --lyrics-optimizer \
  --out indie_folk.mp3

Note: the --length flag of mmx music generate takes milliseconds (--length 180000 in the example above is 3 minutes). This flag is mmx-specific; the underlying MiniMax API has no official duration parameter.

URL expiration warning

mmx music generate returns a saved: path. If you ever use --output-format url (the official API default), the URL expires after 24 hours. Download immediately. The mmx CLI auto-downloads to --out so this is not a problem when using --out directly.

Cover Workflow

Two cover backends exist — pick by what's available:

  • MiniMax cloud cover (this skill): mmx music cover, melody-preserving via MiniMax's music-cover model. Needs MINIMAX_API_KEY and network access to MiniMax.
  • Local ACE-Step cover (in music-craft): task_type=cover with the source audio uploaded (multipart) and audio_cover_strength controlling how far to restyle. Fully local, no cloud, follows the source melody/structure. Caveat: a full-length cover is slow and VRAM-heavy on a ~12 GB GPU and can hit the server's 600 s generation timeout — cover a shorter segment or raise ACESTEP_GENERATION_TIMEOUT. See music-craft's "ACE-Step Audio-Conditioned Generation" section.

So if MiniMax is unavailable (no key, or blocked on your network), you can still do a melody-aware cover locally with ACE-Step — it is not cloud-only. Only pure text-prompt generation (no source audio) is a "reimagining" rather than a cover.

Cover workflow preserves the original song's melody while applying a different style. Two paths:

One-step (quick):

mmx music cover \
  --prompt "French chanson, accordion, strings, passionate French vocal, 80 BPM" \
  --audio-file /tmp/original.ogg \
  --out /tmp/cover.mp3

MiniMax extracts lyrics via ASR and applies the new style.

Two-step (more control):

  1. Preprocess the audio to extract features and structure
  2. Edit the lyrics (correct ASR errors, add section tags)
  3. Generate with the edited lyrics

The two-step path gives better results when the original lyrics need correction or when the user wants different lyrics in the new style.

Full detail with payload examples, error handling, and use cases: references/cover-workflow.md.

Lyrics Generation

MiniMax has a dedicated lyrics_generation endpoint that produces structured lyrics (with [Verse], [Chorus], etc. tags) from a theme prompt. Two modes:

  • write_full_song — create new lyrics from a theme
  • edit — modify existing lyrics (e.g., make the chorus stronger, shift to a hopeful ending)

The output is structured lyrics that can be passed directly to music_generate or mmx music generate.

Full detail with API examples, parameters, and use cases: references/lyrics-generation.md.

Web Lyrics Lookup (LRCLib)

As an optional complement to Whisper transcription, the orchestrator can look up song lyrics from LRCLib (open, no auth, JSON API at https://lrclib.net/api) when the song is a known mainstream track. The lookup degrades gracefully: Whisper remains the primary source, and LRCLib is a quality boost when it has the song.

Coverage reality check: LRCLib has good coverage for mainstream vocal music (pop, rock, hip-hop, R&B, country) and is poor or empty for:

  • Instrumental tracks (Joe Satriani, King Crimson, much jazz, classical)
  • Obscure bands / friend bands
  • Live / bootleg / unofficial releases
  • Non-English lyrics for English titles (and vice versa)

When LRCLib is empty (the expected case for instrumentals), the script returns no_web_lyrics and the caller silently uses Whisper. This is the designed path, not a failure.

CLI usage:

# Standalone lookup
python3 scripts/fetch_lyrics_web.py \
  --artist "Coldplay" --title "Yellow" \
  --whisper-transcript "look at the stars..." \
  --min-match 0.6 --json

Orchestrator integration via the --lyrics-source flag:

| Value | Behavior |
| --- | --- |
| whisper (default) | Always use Whisper, never touch the web |
| web | Always try LRCLib, never run Whisper |
| auto | Whisper first; if the song is recognized AND LRCLib returns a confident match (>60% word overlap), use LRCLib; otherwise fall back to Whisper |
| off | Skip lyrics extraction entirely |

The orchestrator auto-detects artist and title from the audio path stem (e.g. Coldplay - Yellow.wav → artist="Coldplay", title="Yellow"). Pass --name-a "Artist - Title" to override.
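
The stem-based detection described above can be sketched as follows. This is an illustration, not the orchestrator's actual code: the function name is hypothetical, and it assumes the separator is exactly " - " as in the Coldplay example.

```python
from pathlib import Path
from typing import Optional, Tuple

def guess_artist_title(audio_path: str) -> Optional[Tuple[str, str]]:
    """Split 'Artist - Title.ext' into (artist, title), or return None."""
    stem = Path(audio_path).stem  # file name without directory or extension
    artist, sep, title = stem.partition(" - ")
    if not sep or not artist.strip() or not title.strip():
        return None  # no usable separator; caller should ask for --name-a
    return artist.strip(), title.strip()

print(guess_artist_title("Coldplay - Yellow.wav"))  # ('Coldplay', 'Yellow')
print(guess_artist_title("/tmp/original.ogg"))      # None
```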

The result includes a web_lookup sub-dict with status, match_score, and the plain lyrics (when matched), so you can inspect what was used and why.
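
To make the "confident match" idea concrete, here is one simple way a word-overlap score could be computed and used for the auto decision. This is an assumption for illustration: the real scoring heuristic lives in the fetch_lyrics_web.py docstring and may differ, and both function names are hypothetical.

```python
import re
from typing import Optional, Set

def _words(text: str) -> Set[str]:
    """Lowercase word set; punctuation is ignored."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def word_overlap(whisper_text: str, web_text: str) -> float:
    """Fraction of distinct Whisper words that also occur in the web lyrics."""
    heard, found = _words(whisper_text), _words(web_text)
    if not heard:
        return 0.0
    return len(heard & found) / len(heard)

def choose_source(whisper_text: str, web_text: Optional[str],
                  min_match: float = 0.6) -> str:
    """Mimic the 'auto' rule: use web lyrics only on a confident match."""
    if web_text and word_overlap(whisper_text, web_text) > min_match:
        return "web"
    return "whisper"

print(choose_source("look at the stars",
                    "Look at the stars, look how they shine for you"))  # web
print(choose_source("look at the stars", None))                         # whisper
```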

Full detail with scoring heuristic and exit codes: see scripts/fetch_lyrics_web.py docstring.

Mashup Workflow

The signature MiniMax-specific feature: combine Song A (content + emotion) with Song B (style).

Workflow:

  1. Get Song A (audio file, YouTube URL, or song name)
  2. Get Song B (audio file, YouTube URL, or song name)
  3. Run emotion analysis on Song A (if audio available) to extract the emotional arc
  4. Build a prompt that applies Song B's style to Song A's content and emotion
  5. Generate using the cover workflow (preserves melody) or standard generation (creative reimagining)

This is the most powerful feature in this skill. The output preserves what makes Song A recognizable (lyrics, melody, emotion) while applying Song B's production style.

Full detail with the emotion-to-prompt conversion and the two-song analysis script: references/mashup-workflow.md and references/emotion-analysis.md.

Emotion Analysis

Emotion analysis extracts per-section features from input audio:

  • Intensity (loudness) — drives dynamic range
  • Pitch (Hz range, trend) — drives vocal intensity
  • Vocal effort (low / medium / high) — drives delivery style
  • Breathiness — drives intimacy vs full voice
  • Spectral centroid (brightness) — drives timbre matching
  • Emotion classification (list of 30+ emotions) — drives mood keywords for the prompt
  • Repetitive intensification — drives chorus build
  • Emotional shifts (sudden vs gradual) — drives transitions
  • Vocal speed (syllables per second) — drives elongation cues
  • Pitch bends at phrase endings — drives emotional emphasis

The analysis outputs JSON that the emotion_to_prompt.py script converts into a ready-to-use production-sheet prompt.

Local-only path (when MiniMax is unavailable): emotion_to_prompt.py calls the MiniMax cloud, so it fails when MiniMax is blocked or no key is set. In that case build the prompt locally from the analysis JSON without that script: take the extracted BPM and key/scale as explicit metadata fields; turn the energy curve and spectral brightness into texture words; turn the emotion classification and intensity curve into mood words and dynamic section tags; and feed transcribed lyrics (full-mix Whisper) as the lyric body. This is the same data, assembled by the agent instead of the cloud helper, and it feeds any backend (including a local model).
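
A minimal sketch of that local assembly is shown below. The JSON field names (bpm, key, emotions, spectral_centroid_hz) and the 2500 Hz brightness cutoff are assumptions made for illustration; check the actual output of the analysis scripts before relying on them.

```python
def build_local_prompt(analysis: dict, lyrics: str,
                       bright_cutoff_hz: float = 2500.0) -> dict:
    """Assemble generation inputs from analysis JSON without the cloud helper.

    Field names and the brightness cutoff are hypothetical placeholders.
    """
    parts = []
    emotions = analysis.get("emotions") or []
    if emotions:
        parts.append("mood: " + ", ".join(emotions))
    centroid = analysis.get("spectral_centroid_hz")
    if centroid is not None:
        parts.append("bright, airy texture" if centroid >= bright_cutoff_hz
                     else "warm, dark texture")
    return {
        "bpm": analysis.get("bpm"),   # explicit metadata field
        "key": analysis.get("key"),   # explicit metadata field
        "prompt": ". ".join(parts),   # mood and texture words
        "lyrics": lyrics,             # transcribed lyric body
    }

result = build_local_prompt(
    {"bpm": 96, "key": "D major",
     "emotions": ["melancholy", "yearning"], "spectral_centroid_hz": 1800.0},
    "[Verse]\nhello",
)
print(result["prompt"])  # mood: melancholy, yearning. warm, dark texture
```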

Scripts: scripts/analyze_vocal_emotion.py, scripts/analyze_audio.py, scripts/emotion_to_prompt.py.

Full detail: references/emotion-analysis.md.

For the generation side, see references/emotion-delivery.md. It covers how to use the analysis to evoke emotion in the OUTPUT, the 21 emotion recipes (joy, desperation, melancholy, triumph, yearning, anger, vulnerability, confidence, nostalgia, anxious, hopeful, tragic, heroic, tender, sensual, lonely, playful, haunting, serene, celebratory, bittersweet), the iteration loop, and common mistakes.

Analysis Quality (Summary Format, Confidence, Fallbacks)

Analysis scripts in scripts/ produce different views (emotion, beats, melody, structure, instrumentation). The skill expects them to converge on a single compact summary so downstream code and humans can read the same shape regardless of which scripts ran.

Compact Analysis Summary

Every analysis result should include a summary object with these keys:

| Key | Type | Meaning |
| --- | --- | --- |
| tempo | string | BPM value with confidence, e.g. 120 BPM (confidence 0.92) |
| key | string | Detected key, e.g. E minor (confidence 0.71) |
| sections | list | Section labels with timing, e.g. [{"label": "verse", "start": 0.0, "end": 28.5}, ...] |
| instrumentation | list | Detected instrument palette, e.g. ["electric guitar", "drums", "bass"] |
| vocal_traits | dict | Breathiness, intensity, pitch range, e.g. {"breathiness": "high", "intensity": "medium"} |
| energy_curve | list | Per-section energy values, e.g. [{"t": 0, "energy": 0.6}, ...] |
| hook_points | list | Timestamps of detected hooks, e.g. [12.4, 48.0] |
| mix_notes | list | Short strings, e.g. ["vocal upfront", "wide stereo drums", "rolled-off highs"] |

Scripts may add their own fields, but every script must return at least the keys above (use empty list / unknown string when a key has no data).
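
One way a script could guarantee the minimum shape is to merge its partial result over a set of defaults. This is a sketch, not the skill's actual code: normalize_summary is a hypothetical name, and using an empty dict for vocal_traits is an assumption (the text only specifies empty list and unknown string).

```python
import copy

# Default values for the required keys. The empty dict for vocal_traits
# is an assumption; the rule above names only "empty list / unknown string".
DEFAULTS = {
    "tempo": "unknown",
    "key": "unknown",
    "sections": [],
    "instrumentation": [],
    "vocal_traits": {},
    "energy_curve": [],
    "hook_points": [],
    "mix_notes": [],
}

def normalize_summary(partial: dict) -> dict:
    """Return a summary with every required key; extra keys are kept."""
    summary = copy.deepcopy(DEFAULTS)  # deep copy so defaults are never shared
    summary.update(partial)
    return summary

summary = normalize_summary({"tempo": "120 BPM (confidence 0.92)"})
print(summary["tempo"], "|", summary["key"], "|", summary["sections"])
```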

Confidence Levels

Every numeric or categorical detection in the analysis must carry a confidence value so weak detections do not get treated as facts.

| Confidence | Numeric range | Interpretation |
| --- | --- | --- |
| clear | n/a | The detection is unambiguous (e.g. user-supplied text, MIDI-confirmed key). |
| high | >= 0.75 | Strong evidence from multiple sources or models. |
| medium | 0.5 to < 0.75 | Reasonable evidence but alternative interpretations exist. |
| low | < 0.5 | Weak signal; treat as a hint, not a fact. |
| inferred | n/a | Not measured directly; derived from context (e.g. lyrics from a YouTube URL). |
| missing | n/a | Not available; the analysis did not run or did not find evidence. |

When feeding analysis into a prompt, prefix any low or medium detection with a hedge like "around" or "approximately", and never include missing values as if they were facts.
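
The hedging rule can be sketched as a small mapping from score to label to prompt fragment. The function names are hypothetical, the sketch covers only the numeric levels (clear and inferred carry no score), and it assumes any score from 0.5 up to but not including 0.75 counts as medium.

```python
from typing import Optional

def confidence_label(score: Optional[float]) -> str:
    """Map a numeric confidence to the labels in the table above."""
    if score is None:
        return "missing"
    if score >= 0.75:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"

def prompt_fragment(name: str, value: Optional[str],
                    score: Optional[float]) -> str:
    """Hedge low/medium detections; emit nothing for missing values."""
    label = confidence_label(score)
    if label == "missing" or value is None:
        return ""
    if label in ("low", "medium"):
        return f"{name}: around {value}"
    return f"{name}: {value}"

print(prompt_fragment("tempo", "120 BPM", 0.92))  # tempo: 120 BPM
print(prompt_fragment("key", "E minor", 0.71))    # key: around E minor
```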

Fallback Behavior for Missing Optional Dependencies

The advanced analysis scripts depend on optional packages (librosa, parselmouth, transformers, demucs, beat_this, basic_pitch, etc.). Each script must:

  1. Try to import the optional dependency at the top of the function.
  2. On ImportError, return a JSON object that includes {"error": "install with pip install X", "summary": {}} instead of raising.
  3. Never let a missing optional dependency crash the whole workflow.

The orchestrator at scripts/analysis_orchestrator.py collects per-script results and continues even if some scripts failed. The combined summary simply omits keys whose underlying analysis could not run. The linter, prompt builder, and generation step all read the summary and skip missing keys without erroring.

This means a user without demucs installed can still get tempo, key, and structure analysis from the base pipeline. The only loss is the per-stem vocal analysis, which is opt-in via --use-demucs.
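
The three rules above can be sketched with a guarded import inside the function. This is an illustration rather than the scripts' actual code; run_with_optional is a hypothetical helper, and it assumes the pip package name matches the import name, which is not always true.

```python
import importlib
import json
from typing import Callable

def run_with_optional(package: str, analyze: Callable) -> dict:
    """Run `analyze(module)` if `package` imports; otherwise return an error object."""
    try:
        module = importlib.import_module(package)
    except ImportError:
        # Report instead of raising so the orchestrator can continue.
        return {"error": f"install with pip install {package}", "summary": {}}
    return {"summary": analyze(module)}

# A package that is certainly absent yields the error object, not a crash.
result = run_with_optional("definitely_not_installed_pkg", lambda m: {})
print(json.dumps(result))
```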

Rate Limits (MiniMax-specific)

The MiniMax Music 2.6 documented limits are:

  • RPM: 120 requests per minute
  • Concurrent connections: 20
  • Output URL expiry: 24 hours (download the audio promptly)
  • Cover feature ID validity: 24 hours (use the preprocess output within a day)

Under the Token Plan 3.0 (June 2026+), the actual quota is credit-based rather than RPM-based:

  • A unified general credit pool covers M3, M2.7, and M2.7-highspeed
  • A 5-hour rolling window resets continuously
  • A weekly window runs Monday 02:00 CEST → next Monday 02:00 CEST
  • Weekly status may be inactive on Plus plan (no weekly cap enforced, but the schema is there)

Practical implication: the documented 120 RPM is the API limit, but the Token Plan 3.0 quota is what determines your real ceiling. If you send 4500 requests in 5 hours on Plus (an average of only 15 per minute), you will be rate-limited regardless of RPM.

Before submitting a batch, check the active plan:

# Check current Token Plan usage
curl -s -H "Authorization: Bearer $MINIMAX_API_KEY" \
  https://www.minimax.io/v1/token_plan/remains | jq .

If a call fails with 429 (rate limit):

  1. Wait at least 60 seconds.
  2. Check the Token Plan usage endpoint.
  3. If 5h window is exhausted, wait for the reset.
  4. Reduce concurrency if running a batch.
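
The wait-and-retry part of these steps can be sketched as a small wrapper. Everything here is hypothetical scaffolding: RateLimited stands in for however your client surfaces an HTTP 429, and the sleep function is injectable so the demo runs instantly. Checking the Token Plan endpoint between retries is left to the caller.

```python
import time
from typing import Callable

class RateLimited(Exception):
    """Raised by the caller's request function on an HTTP 429 (assumed)."""

def call_with_backoff(request: Callable, retries: int = 3,
                      wait_seconds: int = 60,
                      sleep: Callable = time.sleep):
    """Retry `request` after a 429, waiting at least `wait_seconds` each time."""
    for attempt in range(retries + 1):
        try:
            return request()
        except RateLimited:
            if attempt == retries:
                raise  # give up; the 5h window may be exhausted
            sleep(wait_seconds)

# Demo with a fake request that fails twice, and a fake sleep that records waits.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited()
    return "ok"

waits = []
print(call_with_backoff(flaky, sleep=waits.append), waits)  # ok [60, 60]
```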

Anti-Sparse (MiniMax-Specific Deep Dive)

The base skill's anti-sparse rules apply, but the failure mode is more severe on MiniMax than on other providers.

MiniMax interprets "sparse" or "minimal" as "remove all instruments". The model has been observed to:

  • Remove all instruments in quiet sections when the prompt uses the word "quiet"
  • Drop percussion entirely when the prompt uses "intimate"
  • Go a cappella on build-up sections when the prompt uses "build"

Mitigation:

  • Never use the words "sparse", "minimal", "stripped back", "quiet" in a MiniMax prompt without pairing them with explicit instruments.
  • Always add: "ALL instruments ALWAYS playing throughout, NEVER go a cappella or silent at any point".
  • Always list every instrument you want to hear.
  • For quiet sections, use the explicit form: "quiet sections: reduced to accordion and bass only, still fully played, NOT silent".

If a generation comes back sparse despite these rules, retry once with an even more explicit instrument list. If it fails again, the prompt has a structural issue — try a different style.

For the canonical anti-sparse text and worked examples, see the base skill's Anti-Sparse Rules section.
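
A simple pre-flight check can flag prompts that are likely to trigger the sparse failure mode. The sketch below is an assumption about how such a check might look, not the skill's linter: it only does substring matching, so it cannot tell whether a risky word is already paired with explicit instruments and reports it for manual review.

```python
from typing import List

# Words and guard sentence taken from the mitigation list above.
RISKY_WORDS = ("sparse", "minimal", "stripped back", "quiet")
GUARD = ("ALL instruments ALWAYS playing throughout, "
         "NEVER go a cappella or silent at any point")

def lint_anti_sparse(prompt: str) -> List[str]:
    """Return warnings for a MiniMax prompt (hypothetical helper)."""
    text = prompt.lower()
    warnings = []
    for word in RISKY_WORDS:
        if word in text:
            warnings.append(f'risky word "{word}": pair it with explicit instruments')
    if GUARD.lower() not in text:
        warnings.append("missing anti-sparse guard sentence")
    return warnings

for warning in lint_anti_sparse("quiet piano ballad"):
    print(warning)
```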

Quality Verification Checklist

Same 8-point checklist as the base skill, plus 4 MiniMax-specific items:

  1. Cover preserves melody recognisably. If the user said "make Song X sound like Song Y", the new version should be recognisable as Song X's melody in Song Y's style.
  2. Emotion curve matches Song A (for mashups). The dynamic arc of the output should follow the original's intensity, not flatten to a single energy.
  3. --avoid flags are respected. If the user said "no electronic sounds", the output should not have synths.
  4. Per-flag control worked (BPM, key, structure). If the user asked for 80 BPM in E minor, the output should match those values, not "close enough".

Output Verification (Covers, Mashups, Style Transfer)

After generation, run a post-generation check that is specific to the route. Use the analysis orchestrator's output on the generated file when possible.

Verification Checklist per Route

Cover (minimax_cover)

  • Melody recognisable as the source (basic-pitch MIDI compare or ear-test)
  • Target style is clearly audible (genre/mood keywords present)
  • Source BPM is within ±10 BPM
  • Source key is preserved (or user agreed to shift)
  • Lyrics decision respected (original / translated / new / instrumental)
  • --avoid flags respected

Mashup (minimax_mashup)

  • Song A's lyrics and emotional arc recognisable
  • Song B's style is dominant in the production
  • Vocal intensity matches Song A's emotion curve
  • Section structure feels coherent (not random)
  • --avoid flags respected for Song B's style

Style Transfer (minimax_style_transfer)

  • Source style (the reference track) is reproduced in timbre, instrumentation, and feel
  • Output melody is NOT recognisable as the source (it is a new composition in the source style)
  • Target genre/mood keywords audible
  • BPM and key reasonable for the new style (not forced from source)

Emotion Prompt / Precision (minimax_emotion_prompt)

  • Output BPM, key, structure, and avoided elements match the values passed as flags
  • Prompt language and flags do not contradict each other (linter clean)
  • Lyrics reflect the requested theme and language

Failure Signatures and Fixes

When the generated track does not match the request, identify the failure signature and apply the matching fix. The most common signatures:

| Failure signature | Likely cause | Fix |
| --- | --- | --- |
| Copied too closely (cover sounds like a remaster, not a new style) | Prompt did not specify the new style firmly enough, or --avoid list left the original instrumentation unguarded. | Add explicit target style language, list new instruments, expand --avoid with the source's dominant sounds. Re-run. |
| Lost source melody (cover no longer recognisable) | --prompt overrode the cover model, or the source audio was too noisy / clipped. | Switch to the two-step cover workflow (preprocess + generate with cover-feature-id); reduce style strength in the prompt. |
| Wrong tempo (BPM noticeably off) | Prompt and --bpm disagreed, or vocal delivery speed misled the detector. | Lint prompt + flags first. Re-run with the linter-clean pair. If still off, set --bpm explicitly and drop the BPM number from the prompt. |
| Wrong key (key shifted up/down) | Prompt mentioned a key but flags used another. | Lint the pair. Use the same key in both. If MIDI confirms a different source key, trust MIDI over prompt. |
| Muddy mix (low clarity, washed out) | Overly dense instrumentation, lack of anti-sparse guard, or too many --avoid exclusions. | Reduce instrument count, raise --bpm for tightness, add explicit "all instruments clearly audible". |
| Vocals too neutral (no emotion) | Emotion analysis not run, or intensity curve not transferred. | Run analyze_vocal_emotion.py on the source and feed intensity_curve into the prompt. Add explicit "vocal intensity: ..." clause. |
| Weak chorus (chorus does not lift) | Structure line lacks a build cue, or the prompt was a single energy. | Add structure with explicit build cues: "verse: intimate, chorus: soaring, all instruments louder in chorus". |
| Style mismatch (output does not match the requested genre) | Prompt used vague genre words or the wrong dominant instrument. | Replace vague words with concrete genre + instrument list. Use the canonical mmx prompt schema in references/examples.md. |

Revision Prompt Templates

When a generation comes back with one of the failure signatures above, build a revision prompt that preserves the source identity while changing the failing dimension.

Template: stronger style change (cover too close)

Same melody and lyrics as before. Re-imagine the production as [TARGET_STYLE] with [INSTRUMENT_LIST].
ALL instruments always playing throughout, never go a cappella.
Avoid: [STYLE_CONTRADICTING_WORDS from previous run].

Template: keep the melody (cover lost it)

Re-apply the original melody from the source audio. Keep the recognizable hook at [HOOK_TIME].
Use a softer production in [TARGET_STYLE] but DO NOT change the melodic contour.
Avoid: [WORDS_THAT_PUSHED_TOO_FAR].

Template: fix tempo drift

Keep the source BPM (use --bpm [SOURCE_BPM]). Do not slow down or speed up the vocal delivery.
Avoid: rubato, half-time, double-time, slowing down, speeding up.

Template: fix key shift

Stay in [SOURCE_KEY]. Do not transpose. Use the same chord progression as the source.
Avoid: key change, modulation, transpose.

Template: fix muddy mix

Make every instrument clearly audible. Reduce instrument count to [N].
Add contrast: quieter verses, louder choruses. Keep vocals upfront in the mix.
Avoid: dense layering, atmospheric washes, sustained pads throughout.

Template: lift the chorus

Chorus: soaring, all instruments louder than the verse, fuller chords, more reverb on the lead vocal.
Verse: intimate, single voice, soft drums, breathy delivery.
Bridge: build tension, add a melodic lift before the final chorus.

These templates pair with the failure-signature table. After the revision, re-run the verification checklist above.
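
Filling the bracketed placeholders can be automated so that no [PLACEHOLDER] leaks into a real prompt. The sketch below stores two of the templates above under hypothetical signature keys (wrong_tempo, wrong_key) and raises if a value is missing; the helper name is an assumption.

```python
import re

# Two of the templates above, keyed by hypothetical signature names.
TEMPLATES = {
    "wrong_tempo": (
        "Keep the source BPM (use --bpm [SOURCE_BPM]). "
        "Do not slow down or speed up the vocal delivery.\n"
        "Avoid: rubato, half-time, double-time, slowing down, speeding up."
    ),
    "wrong_key": (
        "Stay in [SOURCE_KEY]. Do not transpose. "
        "Use the same chord progression as the source.\n"
        "Avoid: key change, modulation, transpose."
    ),
}

def fill_template(signature: str, **values) -> str:
    """Replace every [PLACEHOLDER]; raise KeyError if one has no value."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"missing value for [{name}]")
        return str(values[name])
    return re.sub(r"\[([A-Z_]+)\]", substitute, TEMPLATES[signature])

print(fill_template("wrong_tempo", SOURCE_BPM=96))
```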

Lyrics Optimizer Behavior

Same as the base skill — when music_generate is called without explicit lyrics, MiniMax auto-generates. With this skill, you can also call the lyrics_generation API directly to preview the lyrics before generation, or to iterate via the edit mode.

If the user wants specific words, the lyrics_generation API's edit mode lets you modify auto-generated lyrics to match the user's intent without regenerating the whole song.

Reference Map