OpenClaw Problem Solver (Auto-Fix Crayfish)

Automation

A comprehensive, self-evolving skill that diagnoses and solves OpenClaw issues by following a structured, multi-stage resolution cycle. It incorporates Proactive Prediction (L2), Robustness Checks (L1), Knowledge Creation (L3), Diagnosis Report Visualization (v5.5), the v6.0 layers (Runtime Health, Key Validation, Unified Report, Health Dashboard), and Gateway Watchdog (v6.1).

Install

openclaw skills install autofix

🧠 OpenClaw Problem Solver (v6.0-M4) — Master Document

This skill acts as an advanced diagnostic, resolution, and validation engine for any question or bug report related to the OpenClaw framework itself. v6.0 adds four layers: Runtime Health Check (M1), API Key Validation + Resource Monitor (M2), Unified Diagnosis Report + Regression Check (M3), and Interactive Health Dashboard (M4).

🔒 Privacy-First Constraint

All knowledge storage (memory, logs) and final reports must follow these rules:

  1. API Keys: Never store API keys in plain text. Use placeholders or hashes (e.g., sk-********************).
  2. Private Details: Redact sensitive project names, internal credentials, and user identity info unless explicitly authorized.
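
The API key rule above can be illustrated with a small redaction helper. This is a minimal sketch, not the skill's actual implementation: the function name redact_api_keys and the key pattern (an "sk-" prefix followed by at least 16 URL-safe characters) are assumptions, and real keys from other providers may need different patterns.

```python
import re

# Assumption: keys look like "sk-" followed by at least 16 URL-safe characters.
# Other providers use other formats, so extend the pattern for your own keys.
API_KEY_PATTERN = re.compile(r"\bsk-[A-Za-z0-9_-]{16,}")
PLACEHOLDER = "sk-" + "*" * 20


def redact_api_keys(text: str) -> str:
    """Replace anything that looks like an API key with a fixed placeholder."""
    return API_KEY_PATTERN.sub(PLACEHOLDER, text)
```

Running every log line and report through a helper like this before it is written to memory keeps plain-text keys out of storage even when an error message echoes them back.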

🎯 When to Use This Skill

Use this skill when the user:

  • Asks "Why is [feature] not working in OpenClaw?"
  • Reports a specific bug (e.g., "The gateway tool fails with error X").
  • Needs guidance on how to implement a specific feature using OpenClaw's architecture or tools.
  • Wants to know the best practice for a certain task within the OpenClaw ecosystem, but requires verification.

🌟 Recommended Quick Fix Flow (The Golden Path)

For the vast majority of OpenClaw issues, this sequence provides the fastest path to resolution. Always suggest this flow first when a user reports an unspecified problem or bug!

  1. Unified Diagnosis (one-click diagnosis, v6.0-M3): Run python scripts/diagnosis_formatter.py, which auto-collects all three sources (openclaw doctor + runtime_health_check + api_key_validator) into one severity-sorted report.
  2. Visual Dashboard (v6.0-M4): Run python scripts/health_dashboard.py --canvas to render the report as an interactive HTML dashboard (embed with [embed ref="health_dashboard" height="740"]).
  3. Save Baseline (v6.0-M3): Run python scripts/diagnosis_formatter.py --save-baseline before making any fix.
  4. Resolution Attempt (fix): If the report reveals problems, run openclaw doctor --fix or apply the suggested fixes manually.
  5. Regression Check (v6.0-M3): After fixes, run python scripts/diagnosis_formatter.py --compare to see what was fixed, what is new, and what is unchanged.
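
The baseline-versus-current comparison in the regression check comes down to set arithmetic. The sketch below is illustrative only: how diagnosis_formatter.py compares reports internally is not shown in this document, and the function name compare_reports and the use of plain issue identifiers are assumptions.

```python
from typing import Dict, List, Set


def compare_reports(baseline: Set[str], current: Set[str]) -> Dict[str, List[str]]:
    """Classify issue identifiers as fixed, new, or unchanged."""
    return {
        "fixed": sorted(baseline - current),      # present before, gone now
        "new": sorted(current - baseline),        # absent before, present now
        "unchanged": sorted(baseline & current),  # present in both reports
    }
```

For example, compare_reports({"gateway_down", "bad_key"}, {"bad_key", "low_disk"}) reports gateway_down as fixed, low_disk as new, and bad_key as unchanged.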

🚀 The Evolved Workflow (6-Step Cycle + Proactive Layers)

The skill follows the standard steps (Steps 0-5) strictly in sequence, enhanced by proactive layers. The proactive layers are described first.

🤖 Gateway Watchdog (v6.1) — Proactive Stability Layer

Overview: A background daemon that periodically polls the Gateway health status. Runs independently from user requests, providing real-time monitoring for anomalies such as Gateway downtime, RPC failures, and configuration drift.
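
Periodic polling drifts if each cycle simply sleeps a fixed duration after the check finishes, because the check's own runtime adds up. One way to avoid that, assuming a 60-second interval, is to sleep until the next interval boundary instead. This is a minimal sketch; next_tick and seconds_until_next_tick are hypothetical names, not functions from watchdog_monitor.py.

```python
import math
import time


def next_tick(now: float, interval: float = 60.0) -> float:
    """Return the first multiple of `interval` strictly after `now` (epoch seconds)."""
    return (math.floor(now / interval) + 1) * interval


def seconds_until_next_tick(interval: float = 60.0) -> float:
    """How long to sleep so the next check starts on an interval boundary."""
    now = time.time()
    return next_tick(now, interval) - now
```

A monitoring loop would call time.sleep(seconds_until_next_tick()) before each check, so a slow check shortens the following sleep rather than shifting every later check.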

v6.1 Feature Highlights:

| Feature | Description |
|---|---|
| 🎯 Real Health Check | Calls openclaw gateway status --json, parses service.runtime.status + rpc.ok |
| 🔇 Noise Filtering | Alerts only after ≥3 consecutive failures; resets after ≥3 consecutive successes |
| 📊 Severity Levels | Four-tier classification (🟢/🟡/🟠/🔴) with auto-escalation |
| 📡 Dual-Channel Alerting | Feishu DM (instant, primary) + WebChat (async thread, secondary) |
| 🔄 Single Instance | Windows Mutex ensures only one daemon runs at a time |
| 📦 Log Rotation | Auto-rotates at 5MB, keeps 3 backup files |
| ⏰ Precise Scheduling | Fixed-minute schedule eliminates cumulative drift |
| 🔐 Hot-Reload Config | Monitors openclaw.json changes and reloads automatically |
| 🖥️ Auto-Start | Registers in HKCU\Run for auto-launch on user login |
| 👋 Startup Confirmation | Sends status to both channels on startup |
| 🐛 Config Cache Fix | Fixed load_gateway_config() returning token=None on cache hit (v6.1) |
| ⏱️ Async WebChat | Fixed background thread with 60s timeout for model loading (v6.1) |
| 📝 Detailed Error Logs | Fixed full stack traces in Feishu + WebChat notifications (v6.1) |
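
The real health check listed above reads two fields from the output of openclaw gateway status --json. The sketch below shows one way to evaluate them. Only the two field paths (service.runtime.status and rpc.ok) and the "running" value come from this document; the rest of the JSON shape and the names _dig and is_gateway_healthy are assumptions.

```python
import json


def _dig(data, *keys):
    """Walk nested dicts, returning None as soon as a level is missing."""
    for key in keys:
        if not isinstance(data, dict):
            return None
        data = data.get(key)
    return data


def is_gateway_healthy(status_json: str) -> bool:
    """True only when service.runtime.status is 'running' and rpc.ok is true."""
    try:
        data = json.loads(status_json)
    except json.JSONDecodeError:
        return False  # unparseable output counts as a failed check
    running = _dig(data, "service", "runtime", "status") == "running"
    rpc_ok = _dig(data, "rpc", "ok") is True
    return running and rpc_ok
```

Treating malformed or partial output as unhealthy keeps a crashed CLI from being mistaken for a passing check.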

📊 Severity & Alert Rules:

| Consecutive Failures | Level | Behavior |
|---|---|---|
| < 2 | 🟢 Level 1 — Normal | Silent, no notification |
| 2 | 🟡 Level 2 — Notice | Silent, continue monitoring |
| ≥ 3 | 🟠 Level 3 — Warning | Trigger notification (first time) |
| ≥ 5 | 🔴 Level 4 — Critical | Trigger notification + repeat every 5 failures |
| Gateway stopped | 🔴 Level 4 — Critical | Immediate notification |
| Recovered for 3 cycles | ✅ Recovered | Send recovery notification |

🚨 Notification Triggers:

  1. Severity escalation (e.g., 1→3): sends alert
  2. First time hitting alert threshold (≥3 consecutive failures): sends alert
  3. At critical level, every 5 failures: sends reminder
  4. System recovery (abnormal→normal for 3 consecutive checks): sends recovery notice
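
The severity levels and notification triggers above can be read as a small state machine. The sketch below is one possible interpretation, not the watchdog's actual code: the names severity and AlertTracker are hypothetical, and it assumes the failure counter resets only after three consecutive successes, as the noise-filtering rule states.

```python
from dataclasses import dataclass
from typing import Optional


def severity(failures: int, stopped: bool = False) -> int:
    """Map a failure count (or a stopped Gateway) to levels 1-4."""
    if stopped or failures >= 5:
        return 4
    if failures >= 3:
        return 3
    if failures == 2:
        return 2
    return 1


@dataclass
class AlertTracker:
    failures: int = 0
    successes: int = 0
    alerting: bool = False  # True once an alert was sent and not yet recovered
    last_level: int = 1

    def record(self, healthy: bool, stopped: bool = False) -> Optional[str]:
        """Feed one check result; return 'alert', 'reminder', 'recovered' or None."""
        if healthy:
            self.successes += 1
            if self.successes >= 3:  # assumption: reset needs 3 successes in a row
                was_alerting = self.alerting
                self.failures, self.alerting, self.last_level = 0, False, 1
                return "recovered" if was_alerting else None
            return None
        self.successes = 0
        self.failures += 1
        level = severity(self.failures, stopped)
        escalated = level > self.last_level
        self.last_level = level
        if level >= 3 and (not self.alerting or escalated):
            self.alerting = True
            return "alert"      # first threshold hit, or escalation to critical
        if level == 4 and self.failures % 5 == 0:
            return "reminder"   # repeat every 5 failures at critical level
        return None
```

Feeding ten failures in a row yields an alert at the 3rd failure, another at the 5th (escalation to critical), and a reminder at the 10th; three successes afterwards yield a single recovery notice.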

Prerequisites

  1. Feishu channel configured in OpenClaw: openclaw channels add feishu
  2. Environment variable — Set WATCHDOG_FEISHU_USER_ID to your Feishu open_id:
    $env:WATCHDOG_FEISHU_USER_ID = "ou_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    
  3. Gateway HTTP API (optional, for WebChat channel) —
    openclaw config set gateway.http.endpoints.chatCompletions.enabled true
    openclaw gateway restart
    

⚠️ WebChat timeout: The model inference takes ~40s on first load. The watchdog uses a background thread with 60s timeout so it doesn't block the main monitoring loop.
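
The background-thread approach described in the note above can be sketched as follows. This is an illustration under assumptions: send_async is a hypothetical name, and the send callable stands in for whatever function posts to the Gateway HTTP API with the given timeout.

```python
import threading
from typing import Callable


def send_async(send: Callable[[float], None], timeout: float = 60.0) -> threading.Thread:
    """Run send(timeout) on a daemon thread and return immediately."""

    def worker() -> None:
        try:
            send(timeout)
        except Exception as exc:  # a failed alert must never crash the monitor
            print(f"WebChat alert failed: {exc}")

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

Because the thread is a daemon and the caller never joins it, a 40-second model load delays only the WebChat message, not the next health check.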

Deployment (v6.1)

Run the Watchdog as a standalone background process:

# Start
python scripts\watchdog_monitor.py

# Install auto-start (launches on user login)
python scripts\watchdog_monitor.py --install

# Remove auto-start
python scripts\watchdog_monitor.py --uninstall

Or use Start-Process for a hidden window:

$py = (Get-Command python).Source
$script = "$env:USERPROFILE\.openclaw\workspace\skills\autofix\scripts\watchdog_monitor.py"
Start-Process -FilePath $py -ArgumentList $script `
    -WindowStyle Hidden `
    -WorkingDirectory "$env:USERPROFILE\.openclaw\workspace\skills\autofix\scripts"

Check process status:

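# Get-CimInstance works in both Windows PowerShell 5.1 and PowerShell 7 (Get-WmiObject is not available in 7)
Get-CimInstance Win32_Process -Filter "Name like 'python%'" |
    Where-Object { $_.CommandLine -match 'watchdog_monitor' } |
    Select-Object ProcessId, @{n="Start";e={$_.CreationDate}}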

Stop the Watchdog:

# Find the PID first, then
Stop-Process -Id <PID> -Force

View live logs:

Get-Content "$env:USERPROFILE\.openclaw\workspace\skills\autofix\scripts\gateway_watchdog.log" -Tail 10 -Wait

View state file:

Get-Content "$env:USERPROFILE\.openclaw\workspace\skills\autofix\scripts\watchdog_state.json" -Raw | ConvertFrom-Json

🔗 Notification Architecture (v6.1)

Watchdog (background daemon, 60s interval)
    │
    ├─ [Channel A — PRIMARY] openclaw message send --channel feishu
    │            → Feishu direct message (ou_xxx)
    │            → **Instant delivery, zero token cost**
    │            → Includes full error stack traces
    │
    ├─ [Channel B — SECONDARY] Gateway HTTP API (/v1/chat/completions)
    │            → WebChat live session (agent:main:main)
    │            → **Async background thread** (doesn't block monitoring)
    │            → 60s timeout for model loading (~40s typical)
    │            → token cost: minimal (max_tokens=10)
    │
    └─ [Log]     watchdog_state.json (local check history, last 1440)
                 gateway_watchdog.log (rotating, 5MB)

Channel priority: Feishu is now the primary channel (instant, reliable via CLI). WebChat is secondary (async thread, requires model inference).

Dual-Channel Alerting + Autofix Triggers (v6.1)

Channel priority changed in v6.1:

  • Feishu (instant) is now the primary notification channel, not an offline backup.
  • WebChat (async thread) is the secondary channel.
  • When a WebChat alert arrives (~40s after the error), reply with any of these commands to start diagnosis:
    • run autofix self-check
    • check what's wrong with Gateway
    • auto repair
  • Each alert message includes detailed error context and stack traces.

Workflow Integration + Autofix Linkage

The Watchdog forms a Proactive Stability Layer, independent of the standard diagnostic flow (Steps 0-5). When an anomaly is detected:

  a. The daemon logs the event and generates a System Health Warning (SHW) report.
  b. It sends a real-time alert (with diagnostic guidance + context JSON).
  c. It auto-repairs low-risk known issues (e.g., CLI path problems), then verifies the result.
  d. For high-risk operations, it only provides repair suggestions and awaits user confirmation.

🛠️ Auto-Repair Module (v1.0)

Repair Script Library: scripts/auto_repair.py

Matches repair plans based on the diagnostic context from Watchdog alerts:

| Issue | Match Condition | Repair Action | Risk |
|---|---|---|---|
| Gateway stopped | status: stopped | Restart Gateway | 🟡 Needs confirmation |
| RPC connection failed | rpc_ok: false | Restart Gateway | 🟡 Needs confirmation |
| CLI unavailable | status: cli_error | Check installation path | 🟢 Auto-execute |
| HTTP unreachable | status: unreachable | Check port + restart | 🟡 Needs confirmation |
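
The matching table above could be expressed as a simple lookup. The sketch below is not the contents of scripts/auto_repair.py: RepairPlan and match_repair are hypothetical names, the context is assumed to be a dict with status and rpc_ok keys, and the order in which conditions are checked is an assumption.

```python
from typing import NamedTuple, Optional


class RepairPlan(NamedTuple):
    issue: str
    action: str
    auto_execute: bool  # False means the user must confirm first


def match_repair(context: dict) -> Optional[RepairPlan]:
    """Pick a repair plan from the diagnostic context of a Watchdog alert."""
    status = context.get("status")
    if status == "stopped":
        return RepairPlan("Gateway stopped", "Restart Gateway", False)
    if status == "cli_error":
        return RepairPlan("CLI unavailable", "Check installation path", True)
    if status == "unreachable":
        return RepairPlan("HTTP unreachable", "Check port + restart", False)
    if context.get("rpc_ok") is False:
        return RepairPlan("RPC connection failed", "Restart Gateway", False)
    return None  # no known plan: fall back to manual diagnosis
```

Only the CLI path check is marked for automatic execution, matching the risk column; every plan that restarts the Gateway waits for confirmation.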

Repair Verification Loop:

  1. After auto-repair, wait 3 seconds then re-run health check
  2. Verification passed → send "Auto-repair succeeded" confirmation
  3. Verification failed → send "Still needs manual diagnosis" escalation
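
The verification loop above can be sketched with the repair, health check, and notification passed in as callables. All names here are hypothetical; the 3-second wait and the two messages come from the steps above.

```python
import time
from typing import Callable


def repair_and_verify(
    repair: Callable[[], None],
    health_check: Callable[[], bool],
    notify: Callable[[str], None],
    wait_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run a repair, wait, re-check health, and report the outcome."""
    repair()
    sleep(wait_seconds)  # give the Gateway a moment to come back up
    healthy = health_check()
    notify("Auto-repair succeeded" if healthy else "Still needs manual diagnosis")
    return healthy
```

Passing sleep as a parameter keeps the function testable without real delays.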

Health Trend Tracking:

  • watchdog_state.json retains the last 1440 check records (24 hours)
  • Each record includes: timestamp, health status, severity level, source
  • Trend data can be visualized via Canvas health dashboard
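
Keeping only the most recent 1440 records is straightforward with a bounded deque. This is a minimal sketch: the record field names are assumptions based on the list above, and how watchdog_state.json is actually written is not shown in this document.

```python
from collections import deque
from typing import Deque, Dict

MAX_RECORDS = 1440  # 24 hours of checks at one check per minute


def new_history() -> Deque[Dict]:
    """Create a history that silently drops the oldest record when full."""
    return deque(maxlen=MAX_RECORDS)


def add_record(history: Deque[Dict], timestamp: str, healthy: bool,
               level: int, source: str) -> None:
    """Append one check result (field names are hypothetical)."""
    history.append({
        "timestamp": timestamp,
        "healthy": healthy,
        "level": level,
        "source": source,
    })
```

Converting the deque with list(history) gives a value that json.dump can serialize directly.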

Current Status (v6.1)

  • ✅ Real health check — service.runtime.status = running + rpc.ok = true
  • ✅ Noise filtering — ≥3 failures to trigger, ≥3 successes to reset
  • ✅ Severity levels — Four-tier (🟢/🟡/🟠/🔴) with auto-escalation
  • ✅ Feishu channel (PRIMARY) — openclaw message send --channel feishu, zero token cost, instant delivery
  • ✅ WebChat channel (SECONDARY) — Gateway HTTP API /v1/chat/completions, async background thread, 60s timeout
  • ✅ Single instance — Windows Mutex prevents duplicates
  • ✅ Log rotation — 5MB auto-rotate, 3 backups
  • ✅ Precise scheduling — Fixed-minute schedule, no drift
  • ✅ Hot-reload config — Watches openclaw.json for changes
  • ✅ Auto-start — HKCU\Run registry, launches on user login
  • ✅ Startup confirmation — Sends status to both channels on start
  • --status command — Shows real-time state and exits cleanly (no longer starts daemon by accident)
  • Stale process cleanup — Auto-kills orphaned --status processes on daemon startup
  • Config cache fix: load_gateway_config() no longer returns token=None on subsequent calls
  • Detailed error logging — Full stack traces in all notification channels
  • Async WebChat delivery — Background thread prevents blocking main monitor loop
  • 📁 State file: scripts/watchdog_state.json
  • 📁 Log file: scripts/gateway_watchdog.log
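
The log rotation listed above (5MB, 3 backups) maps directly onto Python's standard RotatingFileHandler. The sketch below assumes 5MB means 5 * 1024 * 1024 bytes; build_logger and the logger name are hypothetical, not taken from watchdog_monitor.py.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_logger(path: str) -> logging.Logger:
    """Return a logger that rotates `path` at 5MB and keeps 3 backups."""
    logger = logging.getLogger("gateway_watchdog")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers if called twice
        handler = RotatingFileHandler(
            path, maxBytes=5 * 1024 * 1024, backupCount=3, encoding="utf-8"
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

With backupCount=3, the handler keeps gateway_watchdog.log plus .1, .2, and .3, deleting the oldest file on each rollover.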

Standard Workflow (6-Step Cycle + Proactive Layers)

This skill strictly follows these steps in sequence, enhanced by proactive layers:

Step 0: Resource Pre-check & Cost Management (New) — Starting Point

Before any resource-intensive external search or service call, proactively check API quotas, rate limits, and budget consumption for the current active session. If quota-low alerts or known rate-limit thresholds are hit, pause all execution steps and notify the user with a clear "resource warning," requesting they wait or switch to a low-cost / local alternative.
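
The pre-check described above amounts to a gate that runs before any expensive call. The sketch below is purely illustrative: ResourceStatus, precheck, and the 10% quota threshold are all hypothetical, since this document does not specify how quotas are read or what threshold applies.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ResourceStatus:
    quota_remaining: float  # hypothetical: fraction of quota left, 0.0 to 1.0
    rate_limited: bool      # hypothetical: True if a rate-limit threshold was hit


def precheck(status: ResourceStatus, min_quota: float = 0.10) -> Tuple[bool, str]:
    """Return (proceed, message); proceed is False when execution should pause."""
    if status.rate_limited:
        return False, "Resource warning: rate limit reached. Wait or switch to a local alternative."
    if status.quota_remaining < min_quota:
        return False, "Resource warning: API quota is low. Wait or switch to a low-cost alternative."
    return True, "Resources OK"
```

When proceed is False, the remaining steps are skipped and the message is shown to the user as the resource warning.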

Step 1: Primary Search (See docs/MODULE_02_SearchChain.md — Step 1)

  • Search the official documentation (docs.openclaw.ai) for official solutions
  • Gather context information related to the problem
  • Extract key error messages and configuration status

Step 2: Backup Search (See docs/MODULE_02_SearchChain.md — Step 2)

  • If the official docs don't provide an answer, search GitHub Issues
  • Look for community-reported problems and solutions
  • Collect code verification requirements or pattern-matching information

Step 3: Analysis & Decision (See docs/MODULE_03_ValidationAction.md — Step 3)

  • Choose the best action path based on search results
  • Perform evidence chain analysis (L1) to evaluate solution reliability
  • Decide between direct answer, code verification, or contextual inquiry

Step 4: Validation & Action (v5.0 Enhanced) (See docs/MODULE_03_ValidationAction.md — Step 4 + docs/MODULE_03_Enhancement_Reports.md)

  • Execute validation (MRE) or propose a contextual inquiry
  • Generate an interactive diagnosis report (if MRE fails)
  • Three-step confirmation before fixes: Before running any command with system-modifying or wide-ranging effects (e.g., openclaw doctor --fix, exec/write), follow these safety steps:
    1. Problem location & explanation: Explain the diagnosis result and the core issue to be fixed
    2. Scope confirmation: Ask about and record the specific target or runtime environment (e.g., "This change will only affect the local development config. Do you agree?")
    3. Rollback plan: Provide an executable one-click rollback command. Only proceed after the user agrees via /approve
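
The three-step confirmation can be modelled as a record that must be complete before a command may run. This is a minimal sketch with hypothetical names (FixProposal, missing_steps, ready); how the skill actually tracks /approve is not described in this document.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FixProposal:
    command: str                # e.g. "openclaw doctor --fix"
    explanation: str = ""       # step 1: problem location & explanation
    scope: str = ""             # step 2: confirmed target or environment
    rollback_command: str = ""  # step 3: one-click rollback
    approved: bool = False      # set once the user replies /approve

    def missing_steps(self) -> List[str]:
        """List the safety steps that are still incomplete."""
        missing = []
        if not self.explanation.strip():
            missing.append("explanation")
        if not self.scope.strip():
            missing.append("scope")
        if not self.rollback_command.strip():
            missing.append("rollback")
        if not self.approved:
            missing.append("approval")
        return missing

    def ready(self) -> bool:
        """True only when all three steps are done and the user approved."""
        return not self.missing_steps()
```

Checking ready() immediately before execution makes it hard to skip a step by accident, and missing_steps() tells the user exactly what is still outstanding.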

Step 5: Finalization & Memory Update (See docs/MODULE_04_Finalization.md)

  • Save facts, lessons learned, and update state
  • Trigger L2 Hot-Start Query and L3 Skill Creation Suggestions

💡 Golden Path (Recommended Flow): For most OpenClaw issues, the fastest resolution path is: openclaw doctor → openclaw doctor --fix

🖼️ Diagnosis Report Visualization (v5.0)

When MRE validation fails, generate an interactive diagnostic report using canvas.snapshot() with:

  • Visual risk flags (🔴/🟠/🟢)
  • Evidence chain diagram (Doc vs GH comparison)
  • Exec result status codes highlighted
  • Rollback command code block display

🧠 Error Log Intelligent Summary (ELIS — v5.0)

When MRE fails, use LLM-powered analysis to extract root causes from exec output:

  • Core Issue: One-sentence summary
  • Possible Causes: 2–3 bullet points
  • Recommended Fix: Specific command(s)
  • Risk Level + Confidence Score

📚 Modules & Deep Dives

Consult the following categorized sub-documents for detailed process explanations:

📁 docs/ — Core Module Documentation

📁 docs/enhancement/ — v5.0 Enhancement Features

📁 docs/tutorials/ — Usage Examples

📁 docs/reports/ — Summary Reports

📁 scripts/ — Python/JS Tools


This file is the master skill document. It defines the complete problem-solving blueprint and integrates all capability layers.