Install
openclaw skills install gstack-openclaw-diagnoseStructured diagnosis for hard bugs and performance regressions. Builds a deterministic feedback loop FIRST, then reproduces, hypothesises (3-5 ranked), instruments, fixes with regression test, cleans up. Use when: bug resists a quick fix, flaky failure, performance regression, user has tried 2+ things, or user says diagnose.
openclaw skills install gstack-openclaw-diagnoseA discipline for hard bugs. Skip phases only when explicitly justified.
Core insight: If you have a fast, deterministic, agent-runnable pass/fail signal for the bug, you will find the cause. Everything else is mechanical. If you don't have one, no amount of staring at code will save you.
This is the skill. Spend disproportionate effort here.
git bisect run it.Treat the loop as a product:
A 30-second flaky loop is barely better than no loop. A 2-second deterministic loop is a debugging superpower.
Goal: raise reproduction rate. Loop 100x, parallelise, add stress, narrow timing windows, inject sleeps. A 50%-flake is debuggable; 1% is not.
Stop and say so. List what you tried. Ask the user for: (a) access to the reproducing environment, (b) a captured artifact (HAR file, log dump, core dump, screen recording), or (c) permission to add temporary production instrumentation. Do NOT proceed to hypothesise without a loop.
Run the loop. Watch the bug appear. Confirm:
Generate 3-5 ranked hypotheses before testing any. Single-hypothesis generation anchors on the first plausible idea.
Each hypothesis must be falsifiable:
"If <X> is the cause, then <changing Y> will make it disappear / <changing Z> will make it worse."
If you can't state the prediction, the hypothesis is a vibe — discard or sharpen it.
Show the ranked list to the user before testing. They often re-rank instantly with domain knowledge. Don't block — proceed with your ranking if AFK.
Each probe maps to a specific prediction from Phase 3. One variable at a time.
Tool preference:
Tag every debug log with a unique prefix: [DEBUG-a4f2]. Cleanup = single grep. Untagged logs survive; tagged logs die.
Performance bugs: logs are usually wrong. Establish a baseline measurement (timing harness, profiler, query plan), then bisect. Measure first, fix second.
Write the regression test before the fix — but only if there's a correct seam.
A correct seam exercises the real bug pattern as it occurs at the call site. If the only seam is too shallow, a regression test there gives false confidence.
If no correct seam exists, that itself is the finding — note it.
Before declaring done:
[DEBUG-...] instrumentation removed (grep the prefix)Then ask: what would have prevented this bug? If the answer involves architectural change, note it for the user — don't bundle it into this fix.