Install
openclaw skills install deep-debugging
Evidence-first debugging and incident triage for unclear, recurring, production-like, or high-risk software bugs. Use when the user asks for root cause analysis, says a fix did not work, reports 401/403/500 with context, deploy/runtime failures, broken integrations, or needs investigation before code changes. Do not use for obvious typos, missing installs, or trivial one-line fixes.
No guessing. No random fixes. Stabilize incidents first, then prove the root cause.
Use this skill for:
500, broken login, failed deploy, red healthcheck
401, 403, cookie/JWT weirdness
Do not use it for obvious compiler errors, typos, missing install/setup steps, or cosmetic UI tweaks.
0. Incident Gate → user/prod impact? stabilize first
1. Quick Triage → obvious setup/runtime misses
2. Evidence → exact error, repro, affected path, last change
3. Hypothesis → one testable cause + one test
4. Narrow → binary-search the failure chain
5. Fix → smallest reversible change
6. Verify → exact repro/test/build/log evidence
7. Prevent → regression/monitoring/learning when recurring or prod-like
If users, production, money flows, auth, data integrity, or external integrations are affected, switch to incident mode before debugging.
Output first:
INCIDENT SNAPSHOT
Impact: [who/what is affected]
Severity: [low/medium/high/critical + why]
Started: [time/commit/deploy if known]
Evidence: [logs/status/metrics; redacted]
Stabilize: [rollback, feature flag, pause job, monitor, or no-op]
Next step: [one concrete diagnostic action]
Rules:
For detailed incident checklists read references/incident-first.md.
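The snapshot asks for redacted evidence. As a minimal sketch of what that redaction could look like before a log line is pasted into the Evidence field, the following Python helper masks bearer tokens and sensitive-looking key/value pairs. The function name `redact` and the pattern list are illustrative assumptions, not part of this skill, and the patterns will not catch every secret format.

```python
import re

# Hypothetical patterns for illustration; extend them for the secrets your stack logs.
_PATTERNS = [
    # "Authorization: Bearer <token>" style values
    (re.compile(r"(bearer\s+)[A-Za-z0-9._~+/=-]+", re.IGNORECASE), r"\1[REDACTED]"),
    # key=value or key: value pairs whose key looks sensitive
    (
        re.compile(
            r"\b(password|secret|token|api[_-]?key)(\s*[=:]\s*)[^\s,;&]+",
            re.IGNORECASE,
        ),
        r"\1\2[REDACTED]",
    ),
]


def redact(line: str) -> str:
    """Return the log line with secret-looking values replaced by [REDACTED]."""
    for pattern, replacement in _PATTERNS:
        line = pattern.sub(replacement, line)
    return line
```

Running every evidence line through a helper like this keeps the status code and path visible while dropping the values that should never appear in a report.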
Check these before deeper analysis:
□ Server/process restarted after config/code change?
□ Correct env file/keys present? Key names only, never values.
□ Dependencies installed/generated after package/schema changes?
□ Migration/schema state matches runtime?
□ Browser/client cache or stale build ruled out?
□ Repro uses test data, not live credentials/customer data?
If a quick triage item explains the issue, fix that minimally and still verify.
Collect real proof:
Error: exact message/status/stack excerpt
Path: endpoint/function/job/component
Repro: minimal steps or request shape
Scope: all users vs specific role/input/tenant/environment
Expected: what should happen
Actual: what happens
Last change: commit/deploy/config/schema/provider change
Optional helper: run scripts/incident_snapshot.sh locally to collect safe environment metadata. It prints env key names only, not values.
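To illustrate the "key names only, never values" rule, here is a small Python sketch that lists variable names from `.env`-style text. It is not the contents of scripts/incident_snapshot.sh, which are not shown here; the function name `env_key_names` and the parsing rules are assumptions made for the example.

```python
def env_key_names(env_text: str) -> list[str]:
    """Return variable names from .env-style text; values are never kept."""
    names = []
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines that are not assignments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        # Tolerate shell-style "export KEY=value" lines.
        if key.startswith("export "):
            key = key[len("export "):].strip()
        if key:
            names.append(key)
    return names


if __name__ == "__main__":
    sample = "# test data only\nDATABASE_URL=postgres://user:pass@host/db\nexport API_KEY = abc\n"
    print(env_key_names(sample))
```

Comparing the returned names against the keys the application expects is usually enough to spot a missing or misnamed variable without exposing any secret.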
State exactly one hypothesis before touching code:
HYPOTHESIS: The failure happens because [specific cause],
which I will prove/disprove by [specific test].
Bad: “Something is wrong with auth.”
Good: “The 401 happens because the login token is set but not sent on /me, which I will prove by comparing the login response headers with the follow-up request headers.”
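The test named in the good hypothesis can be made mechanical. The sketch below assumes the login token travels as a cookie and uses Python's standard `http.cookies` module to report which cookie names the login response set but the follow-up request did not send. The function name `cookies_not_sent` and the sample header values are hypothetical test data.

```python
from http.cookies import SimpleCookie


def cookies_not_sent(set_cookie_headers: list[str], request_cookie_header: str) -> set[str]:
    """Names of cookies issued at login that are absent from the follow-up request.

    Only names are compared, so no cookie values end up in the output.
    """
    issued = set()
    for header in set_cookie_headers:
        jar = SimpleCookie()
        jar.load(header)  # parses one Set-Cookie header value
        issued.update(jar.keys())
    sent = SimpleCookie()
    sent.load(request_cookie_header)  # parses the request's Cookie header value
    return issued - set(sent.keys())


if __name__ == "__main__":
    login_response = ["session=abc123; Path=/; HttpOnly"]
    follow_up_request = "theme=dark"
    print(cookies_not_sent(login_response, follow_up_request))
```

A non-empty result supports the hypothesis; an empty result rules it out and points the search further down the chain.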
Pick the chain and split it:
Frontend → request creation → network → API gateway/middleware → controller → service → DB/external API → response → UI
After each test report:
✅ Ruled out: [component] because [evidence]
❌ Found: [component] fails because [evidence]
For stack-specific checklists read references/stack-checklists.md.
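Splitting the chain is a binary search over ordered stages. The following Python sketch shows the idea under one stated assumption: once a stage breaks the request, every later check also fails. The names `CHAIN`, `first_failing_stage`, and `healthy_through` are hypothetical; in practice each probe is a manual check such as reading a log line or replaying a request at that boundary.

```python
from typing import Callable, Optional, Sequence

# The failure chain from above, in request order.
CHAIN = [
    "Frontend", "request creation", "network", "API gateway/middleware",
    "controller", "service", "DB/external API", "response", "UI",
]


def first_failing_stage(
    stages: Sequence[str],
    healthy_through: Callable[[int], bool],
) -> Optional[str]:
    """Binary-search the ordered chain for the first stage where the check fails.

    healthy_through(i) returns True when the request still looks correct
    after stages[i]. Returns None when every stage passes.
    """
    lo, hi = 0, len(stages)  # the first failure, if any, lies in [lo, hi)
    while lo < hi:
        mid = (lo + hi) // 2
        if healthy_through(mid):
            lo = mid + 1  # everything up to mid is ruled out
        else:
            hi = mid  # the failure is at mid or earlier
    return stages[lo] if lo < len(stages) else None


if __name__ == "__main__":
    # Simulated evidence: the request is correct up to and including "controller".
    print(first_failing_stage(CHAIN, lambda i: i < 5))
```

With nine stages this needs at most four probes instead of nine, and each probe yields one "Ruled out" or "Found" line for the report.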
Only after evidence supports the hypothesis, make the smallest reversible change.
Before saying done, provide evidence:
DEBUG REPORT
Failure: [exact issue]
Root cause: [specific cause]
Proof: [test/log/code evidence]
Fix: [minimal change]
Verified: [command/test/repro result]
Prevention: [test/monitoring/doc/learning, or "not needed" + why]
Remaining: [risk/blocker, or "none known"]
For report variants read references/output-templates.md.
Required when the bug is production-like, recurring, security-adjacent, or took more than one hypothesis:
□ Regression test or smoke test added/identified
□ Monitoring/logging improved or gap named
□ Runbook/rollback note captured for future incidents
□ Durable learning written if likely to recur