Genor-Comfy-Gate

Comprehensive multi-modal gateway for ComfyUI enabling audio generation with ACE-Step 1.5 and photorealistic image creation via SDXL workflows.

Install

openclaw skills install genor-comfy-gate

Genor-Comfy-Gate — Comprehensive Skill

THE authoritative reference for ALL ComfyUI operations through our gateway. Multi-modal: audio, images, video (future). Read this before any generation. Updated as we learn.

Modalities

| Type | Status | Workflow | Model |
|---|---|---|---|
| 🎵 Audio | ✅ Active | acestep-rapcore | ACE-Step 1.5 SFT merge |
| 🎬 Video | 🔜 Planned | | |

The gateway is modality-agnostic — it submits any workflow JSON to ComfyUI, polls, waits, downloads, and saves. Adding a new modality means adding a workflow file + WORKFLOW_INFO entry. The type field determines output dir (audio/ or images/).

Gateway

| Property | Value |
|---|---|
| Endpoint | http://127.0.0.1:8188 |
| Auth | x-api-key: gcg-4d... header (localhost exempt) |
| Managed by | pm2 (genor-comfy-gate) |
| Location | ./ (installed dir) |
| Config | env / COMFY_SERVERS var |

Backend Servers

Configure your ComfyUI backends via the COMFY_SERVERS environment variable:

[
  {"url": "http://127.0.0.1:8188", "id": "local", "priority": true, "weight": 1}
]

Default: single local server at http://127.0.0.1:8188

| ID | URL | Priority |
|---|---|---|
| local | http://127.0.0.1:8188 | ★ (default) |
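
A helper script that needs the same server list could parse the variable roughly as follows. This is a minimal sketch that assumes the JSON array shape shown above; `load_servers` and the fallback values for `priority` and `weight` are illustrative assumptions, not the gateway's own code.

```python
import json
import os

# Mirrors the documented default: a single local server.
DEFAULT_SERVERS = [
    {"url": "http://127.0.0.1:8188", "id": "local", "priority": True, "weight": 1}
]

def load_servers(env=None):
    """Parse COMFY_SERVERS (a JSON array) or fall back to the local default."""
    env = os.environ if env is None else env
    raw = env.get("COMFY_SERVERS", "").strip()
    if not raw:
        return [dict(s) for s in DEFAULT_SERVERS]
    servers = json.loads(raw)
    if not isinstance(servers, list) or not servers:
        raise ValueError("COMFY_SERVERS must be a non-empty JSON array")
    for s in servers:
        if "url" not in s or "id" not in s:
            raise ValueError(f"server entry needs 'url' and 'id': {s!r}")
        s.setdefault("priority", False)  # assumed fallback, not documented
        s.setdefault("weight", 1)        # assumed fallback, not documented
        s["url"] = s["url"].rstrip("/")  # avoid double slashes when joining paths
    return servers
```

Passing a plain dict instead of `os.environ` keeps the function easy to try out without touching the real environment.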

Load Balancing Logic (in pickServer())

  1. PRIMARY always preferred when IDLE (0 running tasks)
  2. If PRIMARY has ANY running task → ALL new requests → SECONDARY
  3. If SECONDARY offline → fallback to PRIMARY regardless
  4. Download ALWAYS from the server that generated the file (server.url)
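
The selection rules above can be sketched in a few lines. This is an illustration in Python, not the gateway's actual pickServer() (which lives in server.js); `ServerState`, `pick_server`, and the handling of an offline PRIMARY are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class ServerState:
    id: str
    url: str
    online: bool
    running: int  # tasks currently executing on this backend

def pick_server(primary, secondary=None):
    """Return the backend that should receive the next request."""
    if primary.online and primary.running == 0:
        return primary    # rule 1: PRIMARY preferred when idle
    if secondary is not None and secondary.online:
        return secondary  # rule 2: PRIMARY busy, so use SECONDARY
    if primary.online:
        return primary    # rule 3: SECONDARY offline, fall back to PRIMARY
    raise RuntimeError("no ComfyUI backend is online")

# Rule 4: remember the chosen server and download from its own url later.
primary = ServerState("local", "http://127.0.0.1:8188", online=True, running=1)
secondary = ServerState("sec", "http://10.0.0.2:8188", online=True, running=0)
chosen = pick_server(primary, secondary)
download_base = chosen.url
```

Note that this is deliberately not round-robin: an idle PRIMARY always wins, and SECONDARY only absorbs overflow.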

Workflows

acestep-aio — ACE-Step 1.5 Audio Generation

Model: aceStep15Music_sft17BAIO.safetensors (ACE-Step 1.5 SFT merge)

Workflow Pipeline:
  CheckpointLoader(160) → AnySwitch(model/clip/vae) → TextEncode(94) → KSampler(35 steps, dpmpp_3m_sde, beta, cfg=1) → VAEDecodeTiled → SaveAudioMP3(104)
  Lyrics: String(252) → TextEncode.lyrics
  Duration: mxSlider(274) → TextEncode + EmptyLatent
  Negative: ConditioningZeroOut(47) → zeroes the positive conditioning

Node Map

| Node | Class | Role | Injections |
|---|---|---|---|
| 94 | TextEncodeAceStepAudio1.5 | Main text encoder | prompt → tags, lyrics ← 252, bpm, keyscale, duration ← 274, language |
| 252 | String | Lyrics feed into node 94 | lyrics → String |
| 3 | KSampler | Denoising (35 steps, dpmpp_3m_sde, beta, cfg=1) | seed ← 307 |
| 98 | EmptyAceStep1.5LatentAudio | Creates latent audio space | seconds ← 274 |
| 104 | SaveAudioMP3 | Output V0 MP3 | |
| 128 | VAEDecodeAudioTiled | VAE decode (tile=512, overlap=64) | |
| 160 | CheckpointLoaderSimple | Loads model | |
| 274 | mxSlider | Song duration (seconds) | duration → Xi and Xf |
| 307 | Seed (rgthree) | Global seed | seed → seed |
| 257 | Text Concatenate | Builds output filename | artist+title+path |
| 47 | ConditioningZeroOut | Negative prompt (zeroed) | |
| 78 | ModelSamplingAuraFlow | Shift=13 | Bypassed by default; use model_sampling: true to enable |

Reference Nodes (informational, in workflow but not connected)

| Node | Content |
|---|---|
| 317 | Genre description table (38 genres with tags) |
| 318 | Keyscale/BPM reference table (38 genres × scale + key + BPM) |
| 320 | Structure example (metalcore duet with timeline) |
| 321 | Preset example (detailed scene-by-scene prompt) |
| 319 | LLM input example (NSFW lyrics prompt format) |
| 400 | Disconnected tags node (original rapcore tags, kept for reference) |

Generation Parameters

{
  "workflow": "acestep-rapcore",
  "prompt": "comma-separated tags (under 512 chars)",
  "lyrics": "structured lyrics with [section] tags",
  "duration": 180,
  "bpm": 150,
  "keyscale": "E minor",
  "language": "en",
  "seed": -1
}

All parameters EXCEPT prompt and lyrics are optional. Omitted parameters keep their workflow defaults.

model_sampling (optional, boolean): Enables ModelSamplingAuraFlow (shift=13) for acestep-aio. It is bypassed by default because results are mixed: it improves quality about as often as it hurts, so leaving it off is the safer choice. Set model_sampling: true to experiment with it.
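
One way to assemble a request body that respects these rules is sketched below. It assumes the parameter names listed above; `build_audio_payload` is a hypothetical client-side helper, and dropping `None` values is just one way to leave workflow defaults untouched.

```python
def build_audio_payload(prompt, lyrics, workflow="acestep-rapcore", **optional):
    """Assemble a request body, leaving out anything not explicitly set."""
    allowed = {"duration", "bpm", "keyscale", "language", "seed", "model_sampling"}
    unknown = set(optional) - allowed
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    if not prompt or not lyrics:
        raise ValueError("prompt and lyrics are required")
    if len(prompt) >= 512:
        raise ValueError("prompt must stay under 512 characters")
    payload = {"workflow": workflow, "prompt": prompt, "lyrics": lyrics}
    # Omitted (None) parameters are not sent, so the workflow keeps its defaults.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload

payload = build_audio_payload(
    "metalcore, warm distorted guitar, crisp drums, melodic chorus, polished production",
    "[Verse]\nCold wind on my face",
    bpm=150,
    keyscale="E minor",
)
```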


The 8 Dimensions

Every caption should cover as many as possible, in 5-8 comma-separated tags:

  1. Style/Genre — metalcore, synthwave, drum and bass, pop, folk
  2. Emotion/Atmosphere — melancholic, euphoric, aggressive, dreamy, dark
  3. Instruments — distorted guitar, 808 bass, strings, piano, synths
  4. Timbre/Texture — warm, crisp, punchy, lush, airy, bright
  5. Vocal — male/female, raspy, clean, powerful, breathy, belting
  6. Production — polished, lo-fi, live, studio, dry, glossy
  7. Era — 80s, 90s, modern, retro, vintage
  8. Speed/Rhythm — driving, groovy, frantic, mid-tempo, laid-back

Rules

  • 5-8 tags max — more degrades quality
  • BPM/key in parameters, NOT caption — they're separate fields
  • No conflicting pairs — e.g. "classical strings" + "death metal growls"
  • Texture words matter heavily — they control mix/production quality
  • Specific > vague — "melancholic piano ballad, female breathy vocal" > "sad song"
  • Repeat what you want more of — repetition reinforces

Known Good Captions

pop, piano+strings+guitar, female warm vocal, melancholic intimate, bedroom pop
rock, metal, heavy distorted guitar, powerful drums, melodic vocals, aggressive, epic, dramatic, guitar solo
heavy distorted guitar, fast thrash drums, pounding bass, aggressive, dark
rapcore metal fusion, nu-metal, punchy bass, warm distorted guitar, crisp drums, melodic chorus, heavy grooves, atmospheric, polished production, angsty female vocal, emotional

Tags That Cause Problems

  • raw, gritty, distorted (without balancing warmth) → metallic scraping, flat bass
  • heavy bass → boomy/muddy; prefer punchy bass, deep sub-bass, defined bass
  • aggressive on instruments → harsh overtones; use on emotion/vocal instead
  • Too many instrument tags → cluttered, muddy mix
  • "classical" + any heavy genre → contradictory, degrades both

Texture Word Guide

| Word | Effect |
|---|---|
| warm | Analog-style saturation, smooth high end |
| crisp | Clean transients, defined attacks |
| punchy | Tight, compressed low-mids, good for bass/kick |
| bright | Boosted highs, airy presence |
| lush | Wide stereo, rich harmonics, reverb-heavy |
| dry | Close-mic sound, minimal reverb |
| airy | Spacious high end, breathy |
| polished | Studio-quality, balanced EQ |
| raw | USE WITH CAUTION: unprocessed, potentially harsh |
| gritty | USE WITH CAUTION: distortion artifacts |

Lyrics Engineering (ACE-Step)

Required Structure Tags

ACE-Step REQUIRES section markers to align music with lyrics:

[Intro], [Verse], [Pre-Chorus], [Chorus], [Bridge], [Build], [Drop],
[Breakdown], [Guitar Solo], [Piano Interlude], [Outro]

Vocal Control Tags (on own line inside sections)

[whispered], [raspy vocal], [powerful belting], [spoken word],
[falsetto], [harmonies], [clean vocal]

Energy Tags (on own line inside sections)

[high energy], [low energy], [building energy], [euphoric],
[melancholic], [dreamy], [aggressive]

Lyric Writing Rules

  • 6-10 syllables per line — fits the 5Hz LM planner
  • Natural phrasing — write like human speech, not poetry
  • Avoid AI clichés: "neon skies", "electric hearts/dreams", "breaking chains", "rising up", "fire inside"
  • Section description hints on intro/outro lines: (bass rumbles in), (drums fade to silence)
  • UPPERCASE = shouted/emphasized
  • (parentheses) = background vocals/harmonies
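
A quick structural check of a lyrics draft might look like the sketch below. It is only a heuristic: `check_lyrics` and `estimate_syllables` are hypothetical helpers, and counting vowel groups is a crude English-only approximation of syllables (it will miscount many words and is not suitable for Polish lyrics).

```python
import re

SECTION_TAGS = {"intro", "verse", "pre-chorus", "chorus", "bridge", "build", "drop",
                "breakdown", "guitar solo", "piano interlude", "outro"}

def estimate_syllables(line):
    """Very rough English estimate: count groups of consecutive vowels."""
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]", " ", line)  # ignore (backing) and [tags]
    return len(re.findall(r"[aeiouy]+", text.lower()))

def check_lyrics(lyrics, lo=6, hi=10):
    """Return (section tags found, lines outside the syllable range)."""
    sections, off_lines = [], []
    for line in lyrics.splitlines():
        line = line.strip()
        if not line:
            continue
        m = re.fullmatch(r"\[([^\]]+)\]", line)
        if m:
            # "[Verse 1]" counts as "verse"; vocal/energy tags are skipped.
            name = re.sub(r"\s*\d+$", "", m.group(1).strip().lower())
            if name in SECTION_TAGS:
                sections.append(name)
            continue
        n = estimate_syllables(line)
        if n and not lo <= n <= hi:
            off_lines.append((line, n))
    return sections, off_lines
```

If `sections` comes back empty, the draft is missing the section markers ACE-Step needs and should not be submitted.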

🔴 MANDATORY CHECKLIST BEFORE SENDING LYRICS TO GENERATION

Before you send any lyrics to ACE-Step, answer each of these questions, and do not send until every answer is "YES":

  1. "Do the lyrics make sense?" Do they tell a coherent story? Do they flow from intro to outro? Do the sections connect logically?
  2. "Are they grammatically correct?" No spelling, punctuation, or syntax errors. For Polish lyrics, check diacritics, inflection, and commas in particular.
  3. "Do they fit the author/project?" Do the tone, style, profanity, and energy fit the artist (KOSTI/Bonnie Bones)? Does it sound like that persona?
  4. "Do the music and its order make sense?" Is the structure (Intro→Verse→Chorus→Verse→Bridge→Chorus→Outro) logical? Does the energy rise and fall naturally? Does the overall length make sense (~120-180s)?
  5. "Is the duration appropriate?" 120-180 seconds is standard. NEVER send duration=150 unless you have verified that this is the intended length.
  6. "Does the author's age sound credible?" Do not write "I'm 15", "young girl", or "teen" in lyrics for adult artists. KOSTI/Bonnie Bones are adult performers.

Only when the answer to every question is YES may you send the lyrics to generation.

Energy Flow Pattern

Intro       → [low energy]       — sparse, building
Verse 1     → [low energy]       — verse, storytelling, restrained
Pre-Chorus  → [building energy]  — tension rising
Chorus      → [high energy]      — maximum impact, full instrumentation
Verse 2     → [low energy]       — second verse, slightly more energy
Pre-Chorus  → [building energy]
Chorus      → [high energy]      — second chorus often bigger (harmonies)
Bridge      → [low energy]       — stripped back, different perspective
Breakdown   → [high energy]      — instrumental intensity (optional)
Final Chorus→ [high energy]      — biggest version
Outro       → [low energy]       — fade out
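
The pattern above can be turned into an empty lyric template to fill in. This is a sketch under assumptions: `lyric_skeleton` is a hypothetical helper, and the final chorus is emitted as a plain [Chorus] tag because [Final Chorus] is not in the list of structure tags given earlier.

```python
ENERGY_FLOW = [
    ("Intro", "low energy"), ("Verse 1", "low energy"),
    ("Pre-Chorus", "building energy"), ("Chorus", "high energy"),
    ("Verse 2", "low energy"), ("Pre-Chorus", "building energy"),
    ("Chorus", "high energy"), ("Bridge", "low energy"),
    ("Breakdown", "high energy"),   # optional section
    ("Chorus", "high energy"),      # final chorus
    ("Outro", "low energy"),
]

def lyric_skeleton(flow=ENERGY_FLOW, include_breakdown=True):
    """Build a template: each section tag followed by its energy tag on its own line."""
    blocks = []
    for section, energy in flow:
        if section == "Breakdown" and not include_breakdown:
            continue
        blocks.append(f"[{section}]\n[{energy}]\n")
    return "\n".join(blocks)

print(lyric_skeleton(include_breakdown=False))
```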

Genre Reference (from workflow node 317)

Key Genres & Their Tags

Electronic

  • EDM/House: four-on-the-floor, bright synths, uplifting, dance-driven, glossy production, rhythmic, energetic
  • Techno: mechanical, hypnotic rhythms, minimalistic, pulsing bass, industrial textures, dark, repetitive
  • Trance: euphoric, soaring leads, emotional pads, rolling basslines, uplifting, spacious, melodic, anthemic
  • Drum & Bass: rapid breakbeats, deep sub-bass, high-energy, sharp percussion, rolling rhythms, crisp, driving
  • Dubstep: heavy bass drops, wobbling synths, aggressive textures, syncopated rhythms, dark, cinematic, gritty
  • Future Bass: shimmering chords, side-chained synths, emotional, bright leads, bouncy rhythms, glossy, melodic
  • Trap: booming 808s, sharp hi-hats, atmospheric pads, swaggering, dark, punchy, spacious

Rock/Metal

  • Classic Rock: crunchy guitars, steady drums, warm analog tone, energetic, melodic, vintage, riff-driven
  • Hard Rock: heavy riffs, powerful drums, gritty vocals, aggressive, energetic, distorted, bold, driving
  • Metal: distorted guitars, fast drums, dark atmosphere, aggressive, heavy, intense, powerful, tight
  • Progressive Metal: complex structures, technical riffs, atmospheric layers, dramatic, epic, polished, dynamic

Urban

  • Boom Bap: dusty drums, soulful samples, rhythmic, warm textures, punchy kicks, nostalgic, organic
  • Lo-Fi Hip-Hop: mellow beats, vinyl crackle, soft keys, relaxed, dreamy, warm, minimal, hazy
  • Drill: sliding 808s, haunting melodies, gritty textures, cold atmosphere, syncopated, tense, urban

Pop

  • Pop: catchy hooks, bright synths, polished production, upbeat, melodic, modern, radio-ready, clean
  • Synth-Pop: retro synths, bright pads, melodic, nostalgic, electronic, polished, dreamy, airy
  • K-Pop: glossy production, bright synths, genre-blending, catchy hooks, polished, theatrical, vibrant

Soft/Ambient

  • Ambient: soft pads, atmospheric textures, spacious, minimal, calm, evolving, dreamy, subtle, meditative
  • Cinematic: sweeping strings, dramatic percussion, epic, emotional, grand, polished, powerful

Keyscale & BPM Reference (from workflow node 318)

| Genre | Scale | Key Range | BPM Range |
|---|---|---|---|
| EDM/House | Minor, Dorian | D#m–Am | 120–128 |
| Techno | Phrygian, Minor | Fm–A#m | 125–135 |
| Trance | Major, Mixolydian | A–D | 130–142 |
| Drum & Bass | Minor, Dorian | Em–Gm | 170–178 |
| Dubstep | Minor, Phrygian | Fm–G#m | 138–150 |
| Future Bass | Major, Minor | C–F | 140–160 |
| Trap | Harmonic Minor | Fm–Am | 130–150 |
| Hip-Hop | Minor, Dorian | Dm–Gm | 85–95 |
| Lo-Fi | Dorian, Lydian | Cm–Fm | 60–85 |
| Pop | Major, Mixolydian | C–G | 90–130 |
| Classic Rock | Minor Pentatonic | Em–Am | 100–140 |
| Hard Rock | Minor, Phrygian | Em–Gm | 120–160 |
| Metal | Phrygian, Harmonic Minor | Dm–F#m | 140–200 |
| Prog Metal | Dorian, Melodic Minor | C#m–F#m | 120–180 |
| Blues | Blues Scale, Minor Pentatonic | Em–Am | 70–120 |
| Funk | Mixolydian, Dorian | E–A | 100–120 |
| Disco | Mixolydian, Major | F–Bb | 110–130 |
| R&B | Dorian, Minor | Dm–Gm | 60–100 |
| Ambient | Lydian, Dorian | C–F | 60–90 |
| Cinematic | Minor, Harmonic Minor | Cm–Fm | 60–120 |
| Reggae | Major, Mixolydian | A–D | 70–90 |
| K-Pop | Major, Minor | C–F# | 100–140 |
| Anime OST | Lydian, Major | C–E | 80–160 |
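
The reference values can be used as a sanity check before submitting a bpm parameter. The sketch below copies only a few rows for brevity; `BPM_RANGES` and `bpm_in_range` are illustrative names, and treating the ranges as hard limits is an assumption (they are guidance, not something the gateway enforces).

```python
# A few rows from the reference table above: genre -> (min BPM, max BPM).
BPM_RANGES = {
    "Drum & Bass": (170, 178),
    "Metal": (140, 200),
    "Pop": (90, 130),
    "Lo-Fi": (60, 85),
}

def bpm_in_range(genre, bpm, table=BPM_RANGES):
    """True if bpm falls inside the reference range for the genre."""
    low, high = table[genre]
    return low <= bpm <= high

print(bpm_in_range("Metal", 150))   # True
print(bpm_in_range("Lo-Fi", 150))   # False
```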

Structure Planning (from workflow node 320)

The workflow includes an example of how to structure a caption WITH a song structure plan:

metalcore, symphonic elements, theatrical, duet, heavy distorted guitar,
bright piano, studio-polished, dramatic, melodic, epic, intense.

Structure:
- Intro: brief intro dramatically builds to first verse
- Verse 1: atmospheric piano, sets scene, raspy male vocal only
- Verse 2: guitar power chords, groovy, young female vocal only
- Chorus: anthemic, layered, male+female duet harmonies
- Bridge: atmospheric, dreamy, calm, female vocal only
- Build-up: builds to epic instrumental solo
- Instrumental: fast guitar solo, lead licks, virtuoso shred
- End: powerful ending

This can go in the caption to give the model a temporal roadmap.


Scene-by-Scene Prompting (from workflow node 321)

For maximum control, describe each section's instrumentation and mood in prose:

Intro: A metalcore-tinged, symphonic swell opens the track, with bright piano glimmering
over theatrical strings. Tension rises—studio-polished, dramatic—until it snaps into verse.

Verse 1: Drops to atmospheric piano, soft but charged. Raspy male vocal, intimate, whispered.
No guitars—just piano, subtle pads, suspended breath.

Verse 2: Guitar power chords crash in, groovy pulse. Young female vocal, bright and soaring.
Symphonic elements widen the space, cinematic lift.

Chorus: Erupts into anthemic, epic chorus. Male+female duet harmonies. Distorted guitars,
sweeping strings, pounding drums—polished, intense.

Bridge: Everything falls away. Dreamy, atmospheric, weightless. Soft pads, distant piano,
female vocal airy and ethereal. Suspended.

Build-up: Rhythmic pulses return. Low strings, tom rolls, rising synths. Guitars re-enter
in bursts. Energy coils toward instrumental break.

Instrumental: Fast guitar solo, virtuoso shred, rapid licks, melodic flourishes.
Symphonic backing, metalcore precision drums. Flashy, intense, climactic.

Full API Reference

Core Endpoints

| Method | Path | Description |
|---|---|---|
| GET | / | Health check + server statuses |
| GET | /workflows | List available workflows with types |
| POST | /generate-and-wait | PRIMARY: submit, wait, download, save. Use this for all generation. |
| POST | /prompt | Submit workflow, return prompt_id |
| GET | /history/:prompt_id | Get single prompt result |
| GET | /history | Aggregated history from all servers |
| GET | /queue | Aggregated queue (running + pending) |
| GET | /view | Proxy media file download |
| GET | /system_stats | First alive server system info |
| GET | /object_info | Proxy to ComfyUI object_info |
| GET | /extensions | Proxy to ComfyUI extensions |

Image Generation (legacy, use generate-and-wait instead)

| Method | Path | Description |
|---|---|---|
| GET | /generate | Get generation options form |
| POST | /generate | Submit image generation |
| POST | /upload/image | Upload image to ComfyUI input dir |

Media Management

| Method | Path | Description |
|---|---|---|
| GET | /media-list | List generated files (name, size, date, preview URLs) |
| POST | /media-link-once | Create one-time access token for a file |
| GET | /media-once/:token | Access file via one-time token (no API key needed) |

Workflow Injection

| Method | Path | Description |
|---|---|---|
| POST | /workflow/:name/prompt | Quick prompt submit for named workflow (auto-injects) |

POST /generate-and-wait — Full Reference

curl -s -X POST http://127.0.0.1:8188/generate-and-wait \
  -H "Content-Type: application/json" \
  -d '{
    "workflow": "acestep-rapcore",
    "prompt": "...",
    "lyrics": "...",
    "duration": 200,
    "bpm": 150,
    "keyscale": "E minor",
    "language": "en",
    "seed": -1
  }'

Audio params: prompt (required), lyrics, duration, bpm, keyscale, language, seed
Image params: prompt (required), aspect_ratio, seed, steps, cfg
Common: workflow (default: acestep-rapcore), client_id

Success response:

{
  "status": "ok",
  "file": "/var/data/comfy-media/audio/example-output.mp3",
  "filename": "example-output.mp3",
  "type": "audio",
  "server": "sec",
  "workflow": "acestep-rapcore",
  "file_size": 5882890
}

Output saved with metadata sidecar (.json) in ~/media/comfy/<audio|images>/.
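
The same call can be made from Python with only the standard library. This is a sketch, assuming the endpoint, header name, and success response shown above; `build_generate_request` and `generate_and_wait` are hypothetical helper names, the shape of an error response is not documented here, and the 660 s client timeout is an assumption chosen to sit just above the gateway's own 600 s limit.

```python
import json
import urllib.request

GATEWAY = "http://127.0.0.1:8188"

def build_generate_request(payload, api_key=None, base_url=GATEWAY):
    """Prepare (but do not send) a POST to /generate-and-wait."""
    headers = {"Content-Type": "application/json"}
    if api_key:  # localhost callers are exempt from the API key
        headers["x-api-key"] = api_key
    return urllib.request.Request(
        f"{base_url}/generate-and-wait",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )

def generate_and_wait(payload, api_key=None, timeout=660):
    """Send the request and return the saved file path from the response."""
    req = build_generate_request(payload, api_key)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        result = json.load(resp)
    if result.get("status") != "ok":
        raise RuntimeError(f"generation failed: {result}")
    return result["file"]
```

Splitting request construction from sending keeps the first half testable without a running gateway.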


Operational Notes

Restart

pm2 restart genor-comfy-gate
pm2 logs genor-comfy-gate --lines 20

Status Check

curl -s http://127.0.0.1:8188/ | python3 -m json.tool
curl -s http://127.0.0.1:8188/queue | python3 -m json.tool

Media Location

~/media/comfy/audio/    — generated MP3 files + .json sidecars
~/media/comfy/images/   — generated PNG files + .json sidecars

Gateway Behavior

  • Submits workflow JSON with injected parameters
  • Polls /history/:prompt_id every 2s until complete/fail/timeout
  • Timeout: 600s (10 min) per generation
  • After completion: waits 3s for file write, then downloads
  • Saves to media dir with timestamped name + incrementing sequence number
  • Metadata sidecar written alongside media file
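
The polling behavior described above follows a common pattern, sketched here with the documented 2 s interval and 600 s timeout. `poll_until_done` is a hypothetical function, and the status strings "complete" and "fail" are placeholders: how the gateway actually interprets a /history/:prompt_id response is not specified in this document.

```python
import time

def poll_until_done(fetch_status, interval=2.0, timeout=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() every `interval` seconds until a final state or timeout."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()  # e.g. derived from GET /history/:prompt_id
        if status in ("complete", "fail"):
            return status
        if clock() + interval > deadline:
            return "timeout"     # the next poll would land past the deadline
        sleep(interval)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised with fakes instead of real waiting.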

Growing Our Knowledge

When we discover new caption patterns, texture word effects, or workflow tricks:

  1. Update this SKILL.md
  2. Note the date and what we learned in CHANGELOG.md (next to this skill)

Image Workflow (LUSTIFY)

Full Pipeline

CheckpointLoader(43) → LoRA stack(47,80) → Resolution(17) → KSampler(7, 12 steps LCM) →
  UltimateSDUpscale(88, 2x, 4x-UltraSharp) →
  FaceDetailer NIP(97) → FaceDetailer V(98) → FaceDetailer P(101) →
  FaceDetailer face(104, 1024px, 6 steps) → FaceDetailer hands(105, 2048px, 6 steps) →
  SeedVR2VideoUpscaler(114, 2048px final) → CRT Post-Process(115) → SaveImage(200)

Active LoRAs (node 80)

| LoRA | Strength | Purpose |
|---|---|---|
| AddMicroDetails v6 | 0.2 | Skin texture, fine details |
| PersonEnhanceV2 ILL | 0.1 | Better anatomy/face |
| TrendCraft Style Detailer v2.4I | 0.1 | Overall polish/detail |

Active LoRAs (node 47)

| LoRA | Strength | Purpose |
|---|---|---|
| DTLVVTT DMD2 V5-LITE | 1.0 | DMD2 distillation (faster/better LCM) |

FaceDetailer Pipeline

Sequential detailers with YOLO detectors:

  1. NIP (nipples_yolov8s-seg.pt) — nipple detection, 1024px, denoise 0.4
  2. V (nsfw-seg-vagina-x.pt) — vagina detection, 1024px, denoise 0.4
  3. P (nsfw-seg-penis-x.pt) — penis detection, 1024px, denoise 0.4
  4. Face (Anzhc Face seg 768MS v2 y8n.pt) — face detection, 1024px, 6 steps, denoise 0.4
  5. Hands (PitHandDetailer-v2-Test-v9c.pt) — hand detection, 2048px, 6 steps, denoise 0.5

SeedVR2 Upscaler (node 114)

  • Model: seedvr2_ema_7b_sharp-Q4_K_M.gguf (quantized 7B)
  • VAE: ema_vae_fp16.safetensors
  • Final resolution: 2048
  • Color correction: lab

CRT Post-Process (node 115)

  • Vibrance: +0.015 (subtle saturation boost)
  • Vignette: 0.5 strength, 0.7 radius, 2.0 softness

Danbooru Tag Prompting (LUSTIFY)

CRITICAL: LUSTIFY is Illustrious-based — use Danbooru-format tags, NOT natural language descriptions.

Quality/Priority Tags (always include)

masterpiece, best quality, amazing quality, very aesthetic, absurdres

Subject Tags

1girl, solo, cute, petite, pale skin, medium breasts

Clothing/Accessories

gym uniform, white shirt, sports shorts, sneakers, ponytail

Action/Pose (keep it SIMPLE — complex actions confuse the model)

jumping, dynamic pose, looking at viewer

Setting/Light

gym background, afternoon light, dutch angle, from below

Negative Prompt (always)

blurry, worst quality, bad quality, error, melted body, bad anatomy, bad hands, disfigured

What Works

  • Character portraits work best — this is a hentai/character model
  • Simple dynamic poses (jumping, running, leaning) — YES
  • Quality tags first: masterpiece, best quality are weighted
  • POV/camera tags: dutch angle, from below, from above, close-up
  • Lighting tags: sunlight, god rays, afternoon light, backlight
  • Keep tags under ~25 — more dilutes quality

What Fails

  • Natural language descriptions — "mid-jump over a vaulting horse" → model doesn't understand
  • Complex multi-object composition — "vaulting horse + girl midair" = garbled anatomy
  • "photorealistic" tag — fights the anime/illustrious base, produces uncanny results
  • Overloaded action tags — "jumping + spread legs + leaning forward + vaulting horse" = nightmare
  • Multiple characters — this workflow is tuned for 1girl, solo

Image Generation Parameters

{
  "workflow": "acestep-aio",
  "prompt": "masterpiece, best quality, 1girl, cute, ...",
  "aspect_ratio": "7:9 (Portrait)",
  "seed": -1
}

Valid aspect ratios:

  • 1:1 (Square)
  • 4:5 (Portrait)
  • 7:9 (Portrait) ← default, best for single character
  • 3:2 (Landscape)
  • 16:9 (Landscape)
  • 9:16 (Portrait)

Additional optional params: megapixels (default 1.5), steps, cfg, denoise, sampler_name, scheduler
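
Because aspect_ratio must match one of the labels above exactly, a small resolver can catch typos before a request is sent. This is a sketch; `resolve_aspect_ratio` is a hypothetical client-side helper, and accepting the bare ratio (for example "16:9") as shorthand is a convenience assumed here, not a gateway feature.

```python
ASPECT_RATIOS = [
    "1:1 (Square)", "4:5 (Portrait)", "7:9 (Portrait)",
    "3:2 (Landscape)", "16:9 (Landscape)", "9:16 (Portrait)",
]
DEFAULT_ASPECT_RATIO = "7:9 (Portrait)"

def resolve_aspect_ratio(value=None):
    """Accept '16:9' or the full label and return the full label."""
    if value is None:
        return DEFAULT_ASPECT_RATIO
    for label in ASPECT_RATIOS:
        if value == label or value == label.split(" ")[0]:
            return label
    raise ValueError(f"unsupported aspect ratio: {value!r}")
```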

Adding a New Workflow (any modality)

  1. Export workflow JSON from ComfyUI → save to workflows/<name>.json
  2. Add entry to WORKFLOW_INFO in server.js:
    '<name>': {
      file: '<name>.json',
      type: 'audio', ext: 'mp3',   // or type 'image' with ext 'png', or 'video' with 'mp4'
      promptNode: '94', promptField: 'tags',
      lyricsNode: '252', lyricsField: 'String',
      outputNode: '104' },
    
  3. Restart: pm2 restart genor-comfy-gate
  4. Test, then document in this SKILL.md

The gateway auto-handles: prompt injection, duration, BPM/keyscale (audio), aspect_ratio (image), seed, polling, download from correct server, save to media dir, metadata sidecar.

Lessons Learned

2026-05-19 — Image Generation

  • LUSTIFY is Illustrious-based, uses Danbooru tags — natural language prompts produce garbled results
  • Quality tags (masterpiece, best quality) must come FIRST — they're weighted
  • Complex action scenes fail — model is trained for character portraits, keep poses simple
  • "photorealistic" tag on anime model = uncanny valley, avoid
  • Keep prompts under 25 tags — overloading dilutes quality
  • Pipeline has SeedVR2 upscaler (7B GGUF) + 5-stage FaceDetailer → 2048px final output
  • Face/hand detailers produce excellent close-up quality

2026-05-19 — Audio Generation

  • Download 400 bug: getOutputInfo() function returned undefined filenames despite reading them from history correctly. Fixed by inlining output scanning in the handler.
  • Load balancer: PRIMARY-first when idle, ALL→SECONDARY when PRIMARY busy (not round-robin).
  • Workflow cleanup: Removed duplicate nodes 401, 402. Lyrics now go through node 252 (String) → node 94.
  • Caption quality: raw, gritty, heavy drops cause metallic scraping and flat bass. Use warm, crisp, punchy, polished for clean instruments.
  • 5-8 tags sweet spot for SFT merge model. More degrades quality.
  • 8 dimensions matter: Missing emotion/timbre = flat results. Cover: genre, emotion, instruments, timbre, vocal, production, era, rhythm.