Self-Healing Agent

Automation

Self-recovery and auto-repair system for OpenClaw agents. Monitors agent health, detects failures (crashed cron jobs, broken skills, config corruption, memory bloat, session timeouts), automatically diagnoses root causes, and applies fixes. Inspired by Hermes agent's self-recovery capabilities. Use when asked to "make my agent self-healing", "auto-repair", "self-recovery", "agent health monitoring", "fix broken cron jobs automatically", "auto-diagnose failures", "resilient agent", "fault tolerance", "watchdog", "heartbeat monitor", or when agents repeatedly fail and need automated recovery. Also use to set up continuous health monitoring via cron.

Install

openclaw skills install self-healing-agent

Self-Healing Agent

Automated failure detection, diagnosis, and recovery for OpenClaw agents. The watchdog that keeps your agent running.

Quick Start

# Full health check — scan all systems, diagnose issues, suggest fixes
python3 scripts/self-healing-agent.py check

# Auto-heal — detect and fix what it can automatically
python3 scripts/self-healing-agent.py heal

# Monitor mode — run continuously, fix issues as they appear
python3 scripts/self-healing-agent.py monitor --interval 300

# Check specific subsystem
python3 scripts/self-healing-agent.py check --target cron
python3 scripts/self-healing-agent.py check --target memory
python3 scripts/self-healing-agent.py check --target config
python3 scripts/self-healing-agent.py check --target sessions

Commands

check — Health Check

Runs diagnostic suite:

  • Cron health — Failed jobs, consecutive errors, stuck jobs, timeout patterns
  • Memory health — File sizes, bloated sessions, orphaned files, growth rate
  • Config health — Valid JSON, required fields present, deprecated settings
  • Session health — Active sessions, zombie processes, memory usage
  • Skill health — Broken scripts, missing dependencies, syntax errors
  • Network health — API endpoint reachability, DNS resolution, SSL cert validity

Options: --target <subsystem> to check one area, --json for machine output.

heal — Auto-Repair

For each detected issue, applies the safest fix:

  • Restarts failed cron jobs (after diagnosing root cause)
  • Clears bloated session files (with backup)
  • Fixes JSON syntax errors in config (common trailing comma, etc.)
  • Removes orphaned process files
  • Restores corrupted memory files from git history
  • Rotates oversized log files

Options: --dry-run to preview, --aggressive for riskier fixes.

monitor — Continuous Watchdog

Runs in a loop, checking health every N seconds:

  • Logs findings to memory/self-healing-log.json
  • Auto-heals fixable issues
  • Escalates unfixable issues to the agent's main session
  • Tracks MTTR (mean time to recovery) and failure patterns

Options: --interval <seconds> (default: 300), --max-heals <n> per cycle.

report — Health Report

Generates a markdown health report covering:

  • Last 24h failure count and types
  • MTTR statistics
  • Most common failure modes
  • Recommendations for prevention

What It Monitors

SubsystemChecksAuto-Heals
CronFailed runs, timeouts, consecutive errorsRestart jobs, clear error state
MemoryFile sizes >1MB, growth rate, duplicatesArchive old files, compact
ConfigJSON validity, required fields, deprecated keysFix syntax, add defaults
SessionsZombie processes, bloated contextsKill zombies, archive contexts
SkillsSyntax errors, missing deps, broken importsLog issue, skip broken skill
NetworkAPI endpoints, DNS, SSL certsRetry with backoff, switch endpoints

Healing Log

All actions are logged to memory/self-healing-log.json:

{
  "timestamp": "2026-04-05T06:00:00Z",
  "issue": "cron job 'daily-intel' failed 3 consecutive times",
  "diagnosis": "Script timeout — API rate limit hit",
  "action": "Reset error count, added 30s backoff, restarted",
  "result": "success",
  "mttr_seconds": 12
}