Install
openclaw skills install itil-ops
ITIL-aligned incident, problem, and change management for AI agents. Use when: detecting service crashes, analyzing recurring failures, tracking incidents to resolution, performing root cause analysis, managing change requests, running health audits, or building operational review pipelines. Implements ITIL 4 practices adapted for autonomous agent operations: Incident Management, Problem Management, Change Management, Event Management, and Continual Improvement. Works with systemd, cron, journalctl, and coordination task boards.
Structured incident, problem, and change management adapted from ITIL 4 for autonomous agent operations.
| Level | Meaning | Response | Example |
|---|---|---|---|
| P1 | Critical — service down, data at risk | Immediate alert + auto-remediate | Crash loop, disk full, OOM |
| P2 | High — degraded service | Alert within 1h | Service restarts, auth failures |
| P3 | Medium — non-critical issue | Next review cycle | Cron timeouts, broken files |
| P4 | Low — cosmetic/minor | Track, fix when convenient | Log warnings, config drift |
Scan these in order of criticality:
- journalctl --user -u SERVICE --since "12 hours ago" for watchdog timeouts, SIGABRT, SIGSEGV, core dumps
Run scripts/itil-review.sh to scan all sources. It outputs:
- ITIL_CLEAR if nothing found (reply HEARTBEAT_OK)
DETECTED → CLASSIFIED (P1-P4) → DIAGNOSED → RESOLVED → CLOSED
↓
(3+ occurrences)
↓
ESCALATE TO PROBLEM
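The lifecycle above escalates a recurring incident to a problem after three or more occurrences. A minimal sketch of that check follows. It assumes a hypothetical plain-text log with one incident key per line; `should_escalate` and the log format are illustrative and not part of `itil-review.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when an incident key has recurred often
# enough to be escalated to a problem (default threshold: 3 occurrences).
# Assumes a log file with exactly one incident key per line.
should_escalate() {
  local key="$1" log_file="$2" threshold="${3:-3}"
  local count
  # -x matches whole lines, -F treats the key as a literal string.
  # grep -c exits non-zero on zero matches, so the status is ignored here.
  count=$(grep -cxF -- "$key" "$log_file" 2>/dev/null || true)
  count=${count:-0}   # a missing log file counts as zero occurrences
  [ "$count" -ge "$threshold" ]
}
```

A caller might use it as `should_escalate svc-crash incidents.log && echo "open ITIL-PRB"`, where `incidents.log` is a placeholder file name.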
# P1 — Critical
- Service crash count >= 3 in 12h (crash loop)
- Disk usage >= 90%
- RAM usage >= 90%
- Data loss detected
# P2 — High
- Service crashed 1-2 times
- 3+ services down simultaneously
- Auth/token failures affecting operations
- Cron job with 5+ consecutive failures
# P3 — Medium
- Broken data files (schema violations)
- Memory load errors > 10 in 12h
- Cron job with 3-4 consecutive failures
- Disk usage 80-89%
# P4 — Low
- 1 service down (non-critical)
- Config warnings
- Log noise
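Several of the rules above are simple numeric thresholds, so they can be expressed as small functions. The sketch below covers three of them; the function names and the `OK` result for below-threshold values are assumptions made for illustration, while the thresholds are copied from the rules.

```shell
#!/usr/bin/env bash
# Sketch: map raw measurements to the severity levels listed above.
# Function names and the "OK" label are illustrative.

# Crash count in the 12h window: 3+ is a crash loop (P1), 1-2 is P2.
classify_crashes() {
  local count="$1"
  if   [ "$count" -ge 3 ]; then echo "P1"
  elif [ "$count" -ge 1 ]; then echo "P2"
  else echo "OK"
  fi
}

# Disk usage percentage: 90 or more is P1, 80-89 is P3.
classify_disk() {
  local pct="$1"
  if   [ "$pct" -ge 90 ]; then echo "P1"
  elif [ "$pct" -ge 80 ]; then echo "P3"
  else echo "OK"
  fi
}

# Consecutive cron failures: 5+ is P2, 3-4 is P3.
classify_cron_failures() {
  local failures="$1"
  if   [ "$failures" -ge 5 ]; then echo "P2"
  elif [ "$failures" -ge 3 ]; then echo "P3"
  else echo "OK"
  fi
}
```

On a system with GNU coreutils, the disk check could be fed with something like `classify_disk "$(df --output=pcent / | tail -1 | tr -dc '0-9')"`.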
When incidents are found, create coordination tasks:
Title: [ITIL-INC] <brief description>
Body:
- Severity: P1/P2/P3/P4
- Category: service|cron|memory|disk|security
- Detected: <timestamp>
- Detail: <what happened>
- Impact: <what's affected>
- Action: <what to do>
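To keep tickets consistent, the template can be filled in by a small function. This is a sketch only: `render_incident` is a hypothetical helper, and how its output is passed to the coordination board is left out because that depends on your tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an incident ticket in the template's format.
# Arguments: severity, category, description, detail, impact, action.
render_incident() {
  local severity="$1" category="$2" description="$3"
  local detail="$4" impact="$5" action="$6"
  local detected
  detected=$(date -u +%Y-%m-%dT%H:%M:%SZ)   # UTC timestamp, ISO 8601
  printf 'Title: [ITIL-INC] %s\n' "$description"
  printf 'Body:\n'
  # "--" stops option parsing, since these format strings begin with "-".
  printf -- '- Severity: %s\n' "$severity"
  printf -- '- Category: %s\n' "$category"
  printf -- '- Detected: %s\n' "$detected"
  printf -- '- Detail: %s\n'   "$detail"
  printf -- '- Impact: %s\n'   "$impact"
  printf -- '- Action: %s\n'   "$action"
}
```

For example, `render_incident P2 service "gateway restarted twice" "2 restarts in 12h" "API latency" "inspect logs"` prints a complete ticket to stdout.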
An incident becomes a problem when:
When a problem is identified:
Title: [ITIL-PRB] <root cause description>
Body:
- Related incidents: <list>
- Root cause: <what's actually broken>
- Evidence: <logs, patterns, data>
- Fix applied: <immediate remediation>
- Fix needed: <permanent solution>
- Prevention: <how to prevent recurrence>
Track resolved problems in state file (itil-state.json):
{
"last_review": "2026-03-22T04:19:50Z",
"last_incident_count": 2,
"last_problem_count": 1,
"known_errors": {
"memory-content-dict": {
"description": "Scripts writing content as dict instead of string",
"root_cause": "Missing json.dumps() in memory file writers",
"fix": "Wrap content in json.dumps() before saving",
"fixed_date": "2026-03-22"
}
}
}
Before modifying services, configs, or infrastructure:
| Type | Approval | Example |
|---|---|---|
| Standard | Pre-approved, just do it | Restart service, bump timeout |
| Normal | Inform human, wait for OK | New cron job, config change |
| Emergency | Fix now, inform after | Service down, data at risk |
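An agent can encode the table above as a gate that runs before any change. The following is a minimal sketch; `approval_action` and the action keywords it prints are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: map a change type to the approval behaviour in the table above.
# The printed action keywords are illustrative.
approval_action() {
  local kind
  # Lower-case the input so "Standard" and "standard" both match.
  kind=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$kind" in
    standard)  echo "proceed" ;;               # pre-approved
    normal)    echo "wait-for-approval" ;;     # inform human, wait for OK
    emergency) echo "proceed-then-notify" ;;   # fix now, inform after
    *)         echo "unknown change type: $1" >&2; return 1 ;;
  esac
}
```

Rejecting unknown types with a non-zero status means a typo in the change type stops the change instead of silently defaulting to "proceed".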
After any change:
- systemctl --user status SERVICE
- journalctl --user -u SERVICE -f --since "now"
- scripts/itil-review.sh
# Service crashes
journalctl --user -u SERVICE --since "12h ago" | grep -ciE "watchdog timeout|killed|SIGABRT|SIGSEGV|failed with"
# Memory/resource issues
journalctl --user -u SERVICE --since "12h ago" | grep -c "Failed to load"
# Auth failures
journalctl --user -u SERVICE --since "12h ago" | grep -ciE "unauthorized|403|token expired|auth fail"
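The grep patterns above can be wrapped in a function that reads from stdin, which makes them easy to try on sample text before pointing them at a live journal. This is a sketch; `count_crash_signals` is a hypothetical name, and the pattern is the one from the service-crash command.

```shell
#!/usr/bin/env bash
# Sketch: count crash-related lines on stdin, using the same
# case-insensitive pattern as the service-crash check above.
count_crash_signals() {
  # grep -c prints 0 but exits 1 when nothing matches; ignore that status
  # so the function is safe under "set -e".
  grep -ciE "watchdog timeout|killed|SIGABRT|SIGSEGV|failed with" || true
}
```

Against a real service it would be used as `journalctl --user -u SERVICE --since "12h ago" | count_crash_signals`.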
Check services with curl:
curl -sf --max-time 5 "$URL" >/dev/null 2>&1 || echo "DOWN"
Configure endpoints in the review script for your environment.
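For more than one endpoint, the same curl check can run in a loop. This is a sketch under the assumption that endpoints are passed as arguments; `check_endpoints` is a hypothetical name, and the actual review script may be organized differently.

```shell
#!/usr/bin/env bash
# Sketch: probe each URL with the curl check above and report failures.
# Prints "DOWN <url>" per failing endpoint; returns non-zero if any failed.
check_endpoints() {
  local url down=0
  for url in "$@"; do
    # -s silences progress output, -f fails on HTTP errors (status >= 400),
    # --max-time 5 caps the whole request at five seconds.
    if ! curl -sf --max-time 5 "$url" >/dev/null 2>&1; then
      echo "DOWN $url"
      down=$((down + 1))
    fi
  done
  [ "$down" -eq 0 ]
}
```

A call such as `check_endpoints "http://localhost:8080/health" || echo "at least one endpoint is down"` uses a placeholder URL; substitute the endpoints for your environment.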
| Review | Frequency | Purpose |
|---|---|---|
| Incident review | Every 12h | Detect and classify new issues |
| Problem review | Weekly | Identify patterns, track RCA progress |
| Capacity review | Weekly | Disk, RAM, memory count trends |
| Process review | Monthly | Are our detection rules catching real issues? |
The review script maintains itil-state.json with:
# Incident review — every 12 hours
openclaw cron add --name "itil-review" --every "12h" \
--model "anthropic/claude-sonnet-4-6" --timeout-seconds 180 \
--session isolated \
--message "Run ITIL review: bash ~/.skcapstone/agents/lumina/scripts/itil-review.sh"
# Weekly problem review (Sunday 9 AM)
# Analyze the week's incidents, identify patterns, suggest improvements
itil-ops/
├── SKILL.md # This file
├── scripts/
│ └── itil-review.sh # Main review script (scan + classify + report)
└── references/
└── itil4-agent-mapping.md # ITIL 4 → Agent operations reference
- skcapstone coord create for incident/problem tickets
- skmemory_snapshot to record resolutions for future reference