complex-bug-debugging-with-ai-en
v1.0.0: Enables a strict 7-phase collaborative workflow for diagnosing and fixing complex, intermittent, or multi-layer bugs with verified user input and data-backed conclusions.
Complex Bug Debugging with AI (Engineering Harness for Human × AI Collaboration)
What this is
Not a case library — the collaboration workflow itself. The case library bug-pattern-diagnosis answers "what is this bug"; this SKILL answers "how to debug a complex bug together with AI".
Core belief: complex bugs cannot be cracked by AI alone, nor by humans alone. AI lacks: domain intuition / business context / counter-signals / decision authority. Humans lack: bandwidth to run 100 commands. Only human × AI collaboration with strict workflow discipline reliably cracks them.
When to activate
Activate proactively when the user describes:
- "Stuck / been debugging this for a long time"
- "Weird / not reproducible / intermittent"
- "Heals after restart, but comes back"
- "Looks like X, but fixing X didn't help"
- "Multiple services / nodes / clusters involved"
- "Looks contradictory on the surface"
For plain NPE / compile errors / "how do I write this function" → do NOT activate, just handle directly.
Prerequisite: model and capability pre-check (MUST do)
1. Model must be Opus 4.7 (or equivalent)
- Weak models ride the first hypothesis forever (internally consistent but wrong) and drive the user into a ditch
- Opus 4.7 counter-doubts itself (e.g. doubts "the workspace code may not match deployed code", proactively pulls jar to decompile and compare)
- If current model is not Opus 4.7 → tell user to switch first, do not push forward "while sick"
2. Capability completeness
The floor of debugging capability is set by the weakest tool:
| Capability | Impact if missing |
|---|---|
| Code access (Read / Grep) | Cannot verify business logic |
| Infrastructure (K8S MCP / SSH) | Cannot inspect pods / nodes |
| Data access (DB MCP) | Forced to trust verbal reports |
| Log access (real logs) | Stuck "guessing the stack" |
| Network / HTTP | Cannot run experiments |
| Specialized SKILLs (e.g. server-log-analysis) | Efficiency drops |
Plug whatever is missing. Do not start work while sick.
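Where tooling allows, this pre-check can be mechanized. A minimal sketch in Python; the capability names and probe commands are illustrative assumptions, not part of this SKILL, so substitute whatever your MCP servers and shell actually expose:

```python
# Hypothetical pre-flight check: probe each capability before starting.
# Probe commands are placeholders for whatever your setup actually exposes.
import shutil
import subprocess

CAPABILITY_PROBES = {
    "code access":    ["git", "rev-parse", "--show-toplevel"],
    "infrastructure": ["kubectl", "version", "--client"],
    "log access":     ["ls", "/var/log"],
}

def missing_capabilities() -> list[str]:
    missing = []
    for name, cmd in CAPABILITY_PROBES.items():
        # Present only if the probe binary exists and the probe exits cleanly.
        if shutil.which(cmd[0]) is None or \
           subprocess.run(cmd, capture_output=True).returncode != 0:
            missing.append(name)
    return missing

if gaps := missing_capabilities():
    # Script ②: name the gap and its impact before doing any work.
    print(f"⚠️ Missing capabilities: {gaps} -- configure these first.")
```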
Four hard rules the AI must follow throughout
① No subjective claims
Every conclusion must be backed by data we just ran or code we just read. Forbidden: "should be / probably is / usually is" as a conclusion. Allowed: "based on the metrics I just pulled, the cause is ...".
② No riding on assumptions
User's stated direction ≠ truth. Your previous round's hypothesis ≠ confirmed fact. On a counter-signal ("I tried that too" / data does not match prediction), stop the current path immediately and re-gather evidence.
③ Announce failed plans
Fix did not work → immediately say "Plan X failed, evidence is ...", auto-escalate to next plan. Forbidden: "should be fixed, you try" / "partially worked..." / silently switching plans.
④ Stop when user is uncooperative
No strong model / capability gap / no answer / no boundary info → do not start while sick. Use the scripts below to call it out. If user insists on not cooperating → may continue, but first label "the following runs without info X, conclusions may be biased".
Dual Discipline: proactive inquiry + mandatory user-cooperation checks
Collaboration is not one-sided. AI must not push forward when user is uncooperative, nor silently decide on user's behalf.
Proactive inquiry principle
Entering each phase, AI must proactively ask for that phase's required info. Forbidden: user gives a vague description, AI dives in head-first.
9 user-uncooperative signals + AI scripts (use directly, do not improvise)
① Not using a strong model
⚠️ Current model is not Opus 4.7. Weak models ride assumptions (consistent but wrong).
Recommend switching first. If you insist, challenge every "should be ..." with "what data backs this?".
② Capability gap
⚠️ This investigation needs [capability]; not configured.
Impact: [impact]. Please configure first.
If you can't, I'll work from your text logs but confidence drops significantly.
③ Phase A: symptom too vague
⚠️ Symptom too vague — cannot draw flow diagram. Please provide at least 2 of 3:
1. One-line symptom ("API X returns 500 / device sends register but no reply")
2. Real log / API response / screenshot
3. Services involved ("frontend → gateway → access-service → broker")
Without these I'm stuck guessing possibilities.
④ Phase A: not verifying the flow diagram
⚠️ You haven't confirmed the diagram. If it's wrong, every later discussion is on a wrong premise.
Reply "right" or "wrong, the key is XXX" before we continue.
⑤ Phase B: skipping structured questions
⚠️ You skipped the structured questions (I can't get this myself):
□ Reproduction rate? □ Environment? □ Recent changes? □ Did YOU reproduce it (gold question)?
Without these I can only guess. Please answer each.
⑥ Phase C: not answering / vague answer
⚠️ I hit a fact that **must be confirmed by you**:
Question: [specific binary question]
Why it matters: decides path (A → branch X; B → branch Y)
Please: 1) tell me how to find out, I'll check; or 2) say "don't know and can't find out", I'll branch on both.
Don't change the topic — I can't shrink the diagnostic space.
⑦ Phase C: counter-signal too vague
⚠️ "Also broken / no problem" is a critical counter-signal but too vague. Please add:
- How exactly did you try? (command / tool / steps)
- What did you see? (output / error code)
- Was the environment identical?
Don't say "MQTTX also fails" — say "MQTTX QoS 1 publish XXX, broker XXX, no error but no reply received".
⑧ Phase D: asking AI to decide
⚠️ Solution choice MUST be yours:
- You know production tolerance / what cannot break / rollback capability better
- Consequences fall on your team, not me
I've laid out fix strength / production impact / rollback cost. Decide based on "how much impact today is acceptable".
If you have no basis, tell me "window is X / can't impact Y" — I'll filter, but you still pick.
⑨ Phase G: not documenting after fix
⚠️ Details are fading from short-term memory. Strongly recommend documenting now (5 min):
- BUGxx.md from bug-pattern-diagnosis template
- Focus: symptom quick-match / negative features / 5-min self-check / wrong turns
Cost of skipping: next time you / team / AI all start from zero. Reply "document" or "skip" — be explicit.
Compliance Gates (self-check before each transition)
| Transition | Gate |
|---|---|
| A → B | Did user verify the flow diagram? |
| B → C | Structured questions answered? Gold question ("did YOU reproduce it?") answered? |
| Each loop in C | Last round's question answered? Counter-signal specific? |
| C → D | Decisive evidence sufficient? AI not self-persuading? |
| D → E | User picked a plan? Or making AI decide? |
| E → F/G | Verification complete? Before/after side-by-side? |
| G done | Agreed to document? BUGxx.md complete? |
Any failed gate → stop and use the script. Do not push past it.
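Carrying the gate table as data makes the self-check mechanical instead of remembered. A minimal sketch, with gate wording abbreviated from the table above:

```python
# Sketch: the transition gates as data, checked before every phase change.
GATES: dict[str, str] = {
    "A->B":   "Did the user verify the flow diagram?",
    "B->C":   "Structured questions + gold question answered?",
    "C loop": "Last round's question answered? Counter-signal specific?",
    "C->D":   "Decisive evidence sufficient? Not self-persuading?",
    "D->E":   "Did the USER pick a plan?",
    "E->F/G": "Verification complete? Before/after side-by-side?",
    "G":      "Agreed to document? BUGxx.md complete?",
}

def check_gate(transition: str, satisfied: bool) -> bool:
    """Announce a failed gate; the caller must stop, never push past it."""
    if not satisfied:
        print(f"⚠️ Gate failed at {transition}: {GATES[transition]} "
              "-- stop and use the matching script.")
    return satisfied
```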
The 7-phase workflow
Phase A: Business-Flow Alignment [draw the map first, do not fix yet]
Different mental models of the "flow" → every later discussion is two ships passing in the night.
AI proactive opening (mandatory):
Running this through the SKILL workflow (interrupt me if not needed). For Phase A I need:
1. One-line symptom (don't guess causes yet)
2. Real log / API response / screenshot
3. Which services / flow it touches
I'll draw the diagram for you to confirm.
AI does:
- Asks / Reads code, draws end-to-end flow diagram
- Re-states symptom: "what I understood = what you said"
- Lists "I know X" / "I do not know Y"
Human verification (mandatory Gate):
- "Right" → enter B
- "Wrong, key is XXX" → redraw
- No verification → use script ④
Anti-patterns: diving into code first / moving past without verification.
Phase B: Symptom structuring + domain info gathering
AI proactive opening (mandatory):
Entering B. Answer each (any miss skews the investigation):
□ Reproduction rate: 100% / intermittent / specific conditions?
□ Environment: reproduces locally?
□ Multi-instance: single / multi-replica?
□ Recent changes: deploy / scale-out / config / dependency upgrade?
□ Log signature: concentrated / spread? time window?
□ Did YOU reproduce it? (gold question) Method? What did you see?
□ What directions have you suspected / ruled out?
I'll re-prompt anything vague or skipped.
Human supplies domain info: "this is broker cluster" / "we scaled out last week" / "I tried with MQTTX, also fails" ← this 'I reproduced it' is gold.
Gate: 5+ items answered → C; under 3 → script ⑤; vague counter-signal → script ⑦.
Anti-patterns: investigating without structuring / filling skipped items by imagination.
Phase C: AI-driven boundary probing loop [core engine]
Complex bugs almost never get pinpointed by a single experiment. Must converge by looping.
AI proactive opening (mandatory):
Entering C. Loop: symptom → boundary experiment → side-by-side data → if doubt, ask you.
Each round I'll: state hypothesis explicitly, show data side-by-side, stop and ask on facts that need you.
Interrupt me anytime with "wait, why does this say XXX?" — encouraged, helps me avoid self-persuasion.
This round expects [commands], needs [capability]. Capabilities ready?
AI per loop:
- Design an experiment that bisects the diagnostic space (not exhaustive command-spam)
- Auto-execute: MCP / shell / code reads / cross-node compare
- Display side-by-side:

| Experiment | Predicted | Actual | Match? |
|---|---|---|---|
| Entry A | should pass | passed ✅ | ✓ |
| Entry B | should pass | failed ❌ | ✗ anomaly |

- Self-check: "does actual fully match the hypothesis?"
  - Full match + sufficient → tentative conclusion → D
  - Any "doesn't fit" data → do not force a conclusion; list doubts, ask
  - Insufficient → next round
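A minimal sketch of the side-by-side discipline. The Experiment record and function names are illustrative, not a prescribed API; the one rule encoded is that any mismatch becomes an open doubt, never a forced conclusion:

```python
# Sketch of the side-by-side display discipline. Names are illustrative.
from dataclasses import dataclass

@dataclass
class Experiment:
    name: str
    predicted: str   # stated BEFORE running: this is the hypothesis
    actual: str      # observed output, pasted verbatim, never from memory

    @property
    def matches(self) -> bool:
        return self.predicted == self.actual

def side_by_side(experiments: list[Experiment]) -> list[Experiment]:
    """Print the comparison table; return mismatches as open doubts."""
    doubts = []
    print(f"{'Experiment':<12}{'Predicted':<12}{'Actual':<12}Match?")
    for e in experiments:
        mark = "✓" if e.matches else "✗ anomaly"
        print(f"{e.name:<12}{e.predicted:<12}{e.actual:<12}{mark}")
        if not e.matches:
            doubts.append(e)   # red line: never force a conclusion past these
    return doubts

# Any non-empty return value means: stop, list the doubts, ask the human.
doubts = side_by_side([
    Experiment("Entry A", "pass", "pass"),
    Experiment("Entry B", "pass", "fail"),   # counter-signal -> ask
])
```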
Human: read AI's listed doubts / interrupt AI's self-persuasion: "wait, why does that number say XXX?"
Gate: AI's questions must be answered or explicitly marked "don't know". Counter-signals must be specific.
Anti-patterns: 10 commands without side-by-side / "exhaustive" not "bisecting" / partial match → conclude / not exposing doubts (worst!) / continuing after unanswered question.
Phase D: Solution design + risk laydown [AI lays out, human decides]
AI proactive opening (mandatory):
Entering D. I list every viable plan, **final pick is yours**.
Tell me: maintenance window today? what cannot break? rollback capability?
If you say "you choose" → look at "production impact" column first. I won't decide for you (you bear consequences).
AI lists all plans, never decides:
| Plan | Steps | Fix strength | Production impact | Rollback cost | Recommendation | Reasoning |
|---|---|---|---|---|---|---|
Gate: explicit pick → E; "you choose" → script ⑧; rushing without picking → "I will not act before you pick".
Anti-patterns: "I recommend X" + acts / hiding plans / no production-impact assessment.
Phase E: Execute + verify in real time [prove while you act]
"I think it's fixed" is the biggest trap.
AI does:
- Execute the fix
- Immediately re-run Phase C's decisive experiment (same command, same input)
- Display before/after side-by-side:

| Metric | Before | After | Matches expectation? |
|---|---|---|---|
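A sketch of the before/after rule, assuming the decisive experiment can be re-run as a shell command. The curl check, endpoint, and apply_fix stub are hypothetical placeholders; in practice use Phase C's exact command and input:

```python
# Sketch: re-run the SAME decisive experiment before and after the fix.
# DECISIVE_CHECK and the endpoint are hypothetical placeholders -- use
# Phase C's exact command and input, never a substitute.
import subprocess

DECISIVE_CHECK = ["curl", "-s", "-o", "/dev/null", "-w", "%{http_code}",
                  "http://example.internal/api/x"]

def run_check() -> str:
    return subprocess.run(DECISIVE_CHECK, capture_output=True,
                          text=True).stdout.strip()

def apply_fix() -> None:
    ...  # placeholder: execute whichever plan the human picked in Phase D

before = run_check()   # capture the failing baseline first
apply_fix()
after = run_check()    # same command, same input

print(f"{'Metric':<12}{'Before':<10}{'After':<10}Matches expectation?")
verdict = "✓" if after == "200" else "✗ -> Phase F"
print(f"{'HTTP code':<12}{before:<10}{after:<10}{verdict}")
```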
Anti-patterns: "should be fixed" without verifying / partial improvement → "fixed" / delegating verification.
Phase F: Proactive escalation on plan failure [most failure-prone]
AI does:
- Data does not match → immediately say "Plan X failed, evidence is ..."
- Analyze failure cause
- Auto-escalate to next plan (unless next plan's risk goes up — then ask human)
- Re-execute + re-verify
Real-case example:
Plan 1 (restart pod) failed. Evidence: routing table predicted ≈41, actually 3 ❌; cross-node publish still failing ❌.
Cause: hostPath persistence makes the node skip mria's full bootstrap on restart.
Escalating to Plan 3 (cluster leave + join): routing table 3 → 46 ✅; cross-node publish all pass ✅. Fix successful.
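The escalation rule itself is mechanical and can be sketched. Plan names, risk ranks, and the execute-and-verify hooks below are placeholders; the one judgment call, rising risk, is routed back to the human as this phase requires:

```python
# Sketch of the Phase F escalation rule. Plans and hooks are placeholders.
from typing import Callable

Plan = tuple[str, int, Callable[[], bool]]  # (name, relative risk, hook)

def escalate(plans: list[Plan]) -> None:
    prev_risk = None
    for name, risk, execute_and_verify in plans:
        if prev_risk is not None and risk > prev_risk:
            # Red line: rising risk needs explicit human sign-off first.
            if input(f"Plan '{name}' is riskier -- proceed? [y/N] ") != "y":
                return
        if execute_and_verify():  # re-runs Phase C's decisive experiment
            print(f"Plan '{name}' verified ✅ -- fixed, on to Phase G.")
            return
        # Rule ③: announce failure + evidence, never "should be fixed, you try".
        print(f"Plan '{name}' failed, evidence: decisive re-check mismatched ❌")
        prev_risk = risk
    print("All plans exhausted -- back to Phase C with the new evidence.")

# e.g. escalate([("restart pod", 1, try_restart),
#                ("cluster leave + join", 3, try_rejoin)])
```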
Anti-patterns: "should be fixed, you try" / "partially worked..." / silently switching / asking user to decide next step.
Phase G: Closed-loop documentation [mandatory closing]
Document immediately, do not push it to tomorrow — hard-won details fade fast.
AI proactive opening (mandatory, do not wait for user):
✅ Fix verified. **Entering G now (mandatory)** — details fading fast.
Writing BUGxx.md from bug-pattern-diagnosis template (5 min).
Confirm: □ Document (default) → start writing □ Skip → say "skip", and understand: next time everyone starts from zero
AI writes BUGxx.md using bug-pattern-diagnosis template, 4 mandatory sections:
- Symptom quick-match (verifiable, greppable)
- Negative features (when this case does NOT apply ← prevents misdiagnosis)
- 5-minute self-check commands (next person can copy-paste)
- Wrong turns this time (why Plan 1 failed / why we thought it was X)
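A sketch that scaffolds those four sections into a new file. The BUGxx.md file-name pattern and headings are taken from the list above; the authoritative structure remains the bug-pattern-diagnosis template:

```python
# Scaffold the 4 mandatory sections of a BUGxx.md. Headings are assumed
# from this SKILL's list; the bug-pattern-diagnosis template governs.
from pathlib import Path

TEMPLATE = """\
# BUG{num:02d}: {title}

## Symptom quick-match
<!-- verifiable, greppable: exact log lines / error codes -->

## Negative features
<!-- when this case does NOT apply: prevents misdiagnosis -->

## 5-minute self-check commands
<!-- copy-paste commands the next person can run -->

## Wrong turns this time
<!-- why Plan 1 failed / why we first thought it was X -->
"""

def scaffold(num: int, title: str, directory: str = ".") -> Path:
    path = Path(directory) / f"BUG{num:02d}.md"
    path.write_text(TEMPLATE.format(num=num, title=title), encoding="utf-8")
    return path

# e.g. scaffold(7, "cross-node publish lost after scale-out")
```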
Gate: no response → script ⑨ + default to documenting. "Skip" → say "OK, I won't learn from this either".
Anti-patterns: not documenting / waiting for user to bring it up / case missing negative features and wrong turns.
One-page diagram (compact)
[Pre-check] model = Opus 4.7 + capability complete ← any miss → script ① / ②
↓
[A] Flow alignment ─ open: "symptom/log/services" ─ Gate: user verifies diagram ─ red: don't dive into code
↓
[B] Symptom structuring ─ open: 7-item checklist ─ Gate: ≥5 answered + "did you reproduce" ─ red: must collect counter-signals
↓
[C] Boundary probing loop ─ open: "bisect/side-by-side/ask doubts" ─ Gate: user must answer ─ red: no subjective / no riding / expose doubts
↓
[D] Solution layout ─ open: "window/what cannot break/rollback" ─ Gate: user picks ─ red: AI doesn't decide / doesn't hide plans
↓
[E] Execute + verify ─ red: not verified = not fixed
├──fixed──→ [G]
└──not fixed──→ [F] AI declares failure + evidence + auto-escalates → back to E
↓
[G] Documentation ─ open: default to document ─ Gate: no response → default to document
Anti-pattern quick reference (human / AI ↔ scripts)
AI anti-patterns (self-watch): skip A and dive into code / no side-by-side display / "should" as conclusion / continuing past counter-signal / self-persuading fast conclusion / D acts directly / no verification after execute / vague language hiding failure / delegating verification / not documenting.
Human anti-patterns → AI script:
| Human anti-pattern | Script |
|---|---|
| Weak model on complex bug | ① |
| Missing key capability | ② |
| Symptom too vague | ③ |
| Pushing past flow diagram | ④ |
| Skipping structured questions | ⑤ |
| Not answering / vague | ⑥ |
| Counter-signal too coarse | ⑦ |
| Asking AI to decide | ⑧ |
| Not documenting | ⑨ |
| Throwing bug to AI and walking away | ⑥+⑦ AI proactive ping |
| "Just fix per BUGxx" | "Cases inspire direction, not the answer. Start at A to align flow" |
AI must not enable non-cooperation. Using a script ≠ refusing collaboration — it makes the cost of non-cooperation visible so the user can decide.
Relationship with bug-pattern-diagnosis
bug-pattern-diagnosis = case library (illnesses already seen); this SKILL = treatment manual (how to see a patient).
Typical chain: user reports complex bug → this SKILL runs 7 phases → at Phase C use bug-pattern-diagnosis for inspiration → return to C and continue → success → at Phase G write new BUGxx.md via bug-pattern-diagnosis template. They feed each other.
Self-evolution
After every investigation: any new red line? anti-pattern not covered? phase to split? Yes → proactively suggest update. This SKILL was itself evolved using its own methodology — that is its self-consistency property.
One-line summary
Complex-bug debugging = Opus 4.7 × complete capabilities × 7 phases × 4 AI red lines × dual cooperation gating × closed-loop documentation. This SKILL constrains AI AND user. On non-cooperation, AI must call it out via the scripts and let the user choose to fix or skip — not push forward "while sick".
