OpenClaw Problem Solver自动修复小龙虾

Automation

A comprehensive, self-evolving skill designed to diagnose and solve OpenClaw issues by following a structured, multi-stage resolution cycle. It incorporates Proactive Prediction (L2), Robustness Checks (L1), Knowledge Creation (L3), Diagnosis Report Visualization (v5.5), and Gateway Watchdog (v4.1).

Install

openclaw skills install autofix

🧠 OpenClaw Problem Solver (v5.5) — Master Document

This skill acts as an advanced diagnostic, resolution, and validation engine for any question or bug report related to the OpenClaw framework itself. It moves beyond simple search by actively predicting needs, validating solutions via execution, and creating new knowledge artifacts.

🔒 Privacy-First Constraint

All knowledge storage (memory, logs) and final reports must follow these rules:

  1. API Keys: Never store API keys in plain text. Use placeholders or hashes (e.g., sk-********************).
  2. Private Details: Redact sensitive project names, internal credentials, and user identity info unless explicitly authorized.

🎯 When to Use This Skill

Use this skill when the user:

  • Asks "Why is [feature] not working in OpenClaw?"
  • Reports a specific bug (e.g., "The gateway tool fails with error X").
  • Needs guidance on how to implement a specific feature using OpenClaw's architecture or tools.
  • Wants to know the best practice for a certain task within the OpenClaw ecosystem, but requires verification.

🌟 Recommended Quick Fix Flow (The Golden Path)

For the vast majority of OpenClaw issues, this sequence provides the fastest path to resolution. Always suggest this flow first when a user reports an unspecified problem or bug!

  1. Diagnosis: Instruct the user to run openclaw doctor in their terminal. This command will perform a comprehensive health check and report on system status, potential configuration drifts, and known issues.
  2. Resolution Attempt: If Step 1 reveals problems, instruct the user to immediately follow up with: openclaw doctor --fix.

🚀 The Evolved Workflow (6-Step Cycle + Proactive Layers)

The skill operates by strictly following these steps in sequence, enhanced by proactive layers:

🤖 Gateway Watchdog (v4.1) — Proactive Stability Layer

Overview: A background daemon that periodically polls the Gateway health status. Runs independently from user requests, providing real-time monitoring for anomalies such as Gateway downtime, RPC failures, and configuration drift.

v4.1 Feature Highlights:

FeatureDescription
🎯 Real Health CheckCalls openclaw gateway status --json, parses service.runtime.status + rpc.ok
🔇 Noise FilteringAlerts only after ≥3 consecutive failures; resets after ≥3 consecutive successes
📊 Severity LevelsFour-tier classification (🟢/🟡/🟠/🔴) with auto-escalation
📡 Dual-Channel AlertingWebChat (HTTP API) + Feishu (CLI) redundant delivery
🔄 Single InstanceWindows Mutex ensures only one daemon runs at a time
📦 Log RotationAuto-rotates at 5MB, keeps 3 backup files
⏰ Precise SchedulingFixed-minute schedule eliminates cumulative drift
🔐 Hot-Reload ConfigMonitors openclaw.json changes and reloads automatically
🖥️ Auto-StartRegisters in HKCU\Run for auto-launch on user login
👋 Startup ConfirmationSends a status message to both channels on startup

📊 Severity & Alert Rules:

Consecutive FailuresLevelBehavior
< 2🟢 Level 1 — NormalSilent, no notification
2🟡 Level 2 — NoticeSilent, continue monitoring
≥ 3🟠 Level 3 — WarningTrigger notification (first time)
≥ 5🔴 Level 4 — CriticalTrigger notification + repeat every 5 failures
Gateway stopped🔴 Level 4 — CriticalImmediate notification
Recovered for 3 cycles✅ RecoveredSend recovery notification

🚨 Notification Triggers:

  1. Severity escalation (e.g., 1→3): sends alert
  2. First time hitting alert threshold (≥3 consecutive failures): sends alert
  3. At critical level, every 5 failures: sends reminder
  4. System recovery (abnormal→normal for 3 consecutive checks): sends recovery notice

Prerequisites: Enable Gateway HTTP API

The Watchdog uses the Gateway HTTP API to deliver real-time notifications. Enable it first:

openclaw config set gateway.http.endpoints.chatCompletions.enabled true
openclaw gateway restart

Deployment (v4.1)

Run the Watchdog as a standalone background process:

# Start
python scripts\watchdog_monitor.py

# Install auto-start (launches on user login)
python scripts\watchdog_monitor.py --install

# Remove auto-start
python scripts\watchdog_monitor.py --uninstall

Or use Start-Process for a hidden window:

$py = (Get-Command python).Source
$script = "$env:USERPROFILE\.openclaw\workspace\skills\autofix\scripts\watchdog_monitor.py"
Start-Process -FilePath $py -ArgumentList $script `
    -WindowStyle Hidden `
    -WorkingDirectory "$env:USERPROFILE\.openclaw\workspace\skills\autofix\scripts"

Check process status:

Get-WmiObject Win32_Process -Filter "Name like 'python%'" |
    Where-Object { $_.CommandLine -match 'watchdog_monitor' } |
    Select-Object ProcessId, @{n="Start";e={$_.CreationDate}}

Stop the Watchdog:

# Find the PID first, then
Stop-Process -Id <PID> -Force

View live logs:

Get-Content "$env:USERPROFILE\.openclaw\workspace\skills\autofix\scripts\gateway_watchdog.log" -Tail 10 -Wait

View state file:

Get-Content "$env:USERPROFILE\.openclaw\workspace\skills\autofix\scripts\watchdog_state.json" -Raw | ConvertFrom-Json

🔗 Notification Architecture

Watchdog (background daemon, 60s interval)
    │
    ├─ [Channel A] Gateway HTTP API (/v1/chat/completions)
    │            → WebChat live session (agent:main:main)
    │            → token cost: minimal (max_tokens=10)
    │
    ├─ [Channel B] openclaw message send --channel feishu
    │            → Feishu direct message (ou_xxx)
    │            → plain text, zero token cost
    │
    └─ [Log]     watchdog_state.json (state file)

Dual-Channel Alerting + Autofix Triggers

  • When a WebChat alert arrives, reply with any of these commands to start diagnosis:
    • run autofix self-check
    • check what's wrong with Gateway
    • auto repair
  • Feishu messages serve as an offline backup
  • Each alert message includes a structured JSON diagnostic context at the end, so the agent can start fixing immediately

Workflow Integration + Autofix Linkage

The Watchdog forms a Proactive Stability Layer, independent of the standard diagnostic flow (Steps 0-5). When an anomaly is detected:

a. The daemon logs the event and generates a System Health Warning (SHW) report b. Sends a real-time alert (with diagnostic guidance + context JSON) c. Auto-repair low-risk known issues (e.g., CLI path problems) automatically, then verifies d. High-risk operations only provide repair suggestions, awaiting user confirmation

🛠️ Auto-Repair Module (v1.0)

Repair Script Library: scripts/auto_repair.py

Matches repair plans based on the diagnostic context from Watchdog alerts:

IssueMatch ConditionRepair ActionRisk
Gateway stoppedstatus: stoppedRestart Gateway🟡 Needs confirmation
RPC connection failedrpc_ok: falseRestart Gateway🟡 Needs confirmation
CLI unavailablestatus: cli_errorCheck installation path🟢 Auto-execute
HTTP unreachablestatus: unreachableCheck port + restart🟡 Needs confirmation

Repair Verification Loop:

  1. After auto-repair, wait 3 seconds then re-run health check
  2. Verification passed → send "Auto-repair succeeded" confirmation
  3. Verification failed → send "Still needs manual diagnosis" escalation

Health Trend Tracking:

  • watchdog_state.json retains the last 1440 check records (24 hours)
  • Each record includes: timestamp, health status, severity level, source
  • Trend data can be visualized via Canvas health dashboard

Current Status (v4.1)

  • ✅ Real health check — service.runtime.status = running + rpc.ok = true
  • ✅ Noise filtering — ≥3 failures to trigger, ≥3 successes to reset
  • ✅ Severity levels — Four-tier (🟢/🟡/🟠/🔴) with auto-escalation
  • ✅ WebChat channel — Gateway HTTP API /v1/chat/completions, low token cost
  • ✅ Feishu channel — openclaw message send --channel feishu, zero token cost
  • ✅ Single instance — Windows Mutex prevents duplicates
  • ✅ Log rotation — 5MB auto-rotate, 3 backups
  • ✅ Precise scheduling — Fixed-minute schedule, no drift
  • ✅ Hot-reload config — Watches openclaw.json for changes
  • ✅ Auto-start — HKCU\Run registry, launches on user login
  • ✅ Startup confirmation — Sends status to both channels on start
  • 📁 State file: scripts/watchdog_state.json
  • 📁 Log file: scripts/gateway_watchdog.log

Standard Workflow (6-Step Cycle + Proactive Layers)

This skill strictly follows these steps in sequence, enhanced by proactive layers:

Step 0: Resource Pre-check & Cost Management (New) — Starting Point

Before any resource-intensive external search or service call, proactively check API quotas, rate limits, and budget consumption for the current active session. If quota-low alerts or known rate-limit thresholds are hit, pause all execution steps and notify the user with a clear "resource warning," requesting they wait or switch to a low-cost / local alternative.

Step 1: Primary Search (See docs/MODULE_02_SearchChain.md — Step 1)

  • Search the official documentation (docs.openclaw.ai) for official solutions
  • Gather context information related to the problem
  • Extract key error messages and configuration status

Step 2: Backup Search (See docs/MODULE_02_SearchChain.md — Step 2)

  • If the official docs don't provide an answer, search GitHub Issues
  • Look for community-reported problems and solutions
  • Collect code verification requirements or pattern-matching information

Step 3: Analysis & Decision (See docs/MODULE_03_ValidationAction.md — Step 3)

  • Choose the best action path based on search results
  • Perform evidence chain analysis (L1) to evaluate solution reliability
  • Decide between direct answer, code verification, or contextual inquiry

Step 4: Validation & Action (v5.0 Enhanced) (See docs/MODULE_03_ValidationAction.md — Step 4 + docs/MODULE_03_Enhancement_Reports.md)

  • Execute validation (MRE) or propose a contextual inquiry
  • Generate an interactive diagnosis report (if MRE fails)
  • Three-step confirmation before fixes: Before running any command with system-modifying or wide-ranging effects (e.g., openclaw doctor --fix, exec/write), follow these safety steps:
    1. Problem location & explanation: Explain the diagnosis result and the core issue to be fixed
    2. Scope confirmation: Ask about and record the specific target or runtime environment (e.g., "This change will only affect the local development config. Do you agree?")
    3. Rollback plan: Provide an executable one-click rollback command. Only proceed after the user agrees via /approve

Step 5: Finalization & Memory Update (See docs/MODULE_04_Finalization.md)

  • Save facts, lessons learned, and update state
  • Trigger L2 Hot-Start Query and L3 Skill Creation Suggestions

💡 Golden Path (Recommended Flow): For most OpenClaw issues, the fastest resolution path is: openclaw doctoropenclaw doctor --fix

🖼️ Diagnosis Report Visualization (v5.0)

When MRE validation fails, generate an interactive diagnostic report using canvas.snapshot() with:

  • Visual risk flags (🔴/🟠/🟢)
  • Evidence chain diagram (Doc vs GH comparison)
  • Exec result status codes highlighted
  • Rollback command code block display

🧠 Error Log Intelligent Summary (ELIS — v5.0)

When MRE fails, use LLM-powered analysis to extract root causes from exec output:

  • Core Issue: One-sentence summary
  • Possible Causes: 2–3 bullet points
  • Recommended Fix: Specific command(s)
  • Risk Level + Confidence Score

📚 Modules & Deep Dives

Consult the following categorized sub-documents for detailed process explanations:

📁 docs/ — Core Module Documentation

📁 docs/enhancement/ — v5.0 Enhancement Features

📁 docs/tutorials/ — Usage Examples

📁 docs/reports/ — Summary Reports

📁 scripts/ — Python/JS Tools


This file is the master skill document. It defines the complete problem-solving blueprint and integrates all capability layers.