Incident Response Runbook

v1.0.0

Create, maintain, and execute detailed incident response runbooks to guide triage, communication, and post-incident reviews for production outages.

0· 51· 1 versions· 0 current· 0 all-time· Updated 2d ago· MIT-0

Incident Response Runbook

Generate, maintain, and execute incident response runbooks for production systems. Use when setting up incident workflows, responding to outages, or documenting post-incident learnings.

Usage

Generate Runbook

Create an incident response runbook for [service/system]. 
Infrastructure: [cloud provider, key services].
Common failure modes: [list known issues].

During Incident

Incident: [description]. Severity: [1-4]. 
Current symptoms: [what's happening].
Help me triage and respond.

Post-Incident

Generate a post-incident review for: [incident summary].
Timeline: [key events with timestamps].
Resolution: [what fixed it].

Runbook Structure

Generated runbooks follow this template:

# [Service] Incident Response Runbook

## Quick Reference
- **On-call:** [rotation link]
- **Dashboards:** [monitoring links]
- **Escalation:** [contact chain]

## Severity Levels
- **SEV1**: Complete outage, revenue impact → respond in 5 min
- **SEV2**: Degraded service, user-facing → respond in 15 min
- **SEV3**: Internal impact, no users affected → respond in 1 hour
- **SEV4**: Cosmetic or minor, no urgency → next business day

## Triage Steps
1. Confirm the issue (check dashboards, reproduce)
2. Assess blast radius (which users/services affected)
3. Assign severity level
4. Start incident channel/thread
5. Communicate to stakeholders

## Failure Modes

### [Failure Mode 1: e.g., Database Connection Pool Exhaustion]
**Symptoms:** [what you'll see]
**Diagnosis:** [commands to run, logs to check]
**Mitigation:** [immediate steps to restore service]
**Root Fix:** [permanent solution]

### [Failure Mode 2: e.g., Memory Leak in Worker Process]
...

## Rollback Procedures
[Service-specific rollback steps]

## Communication Templates
[Internal + external status page templates]

## Post-Incident Review Template
[Blameless review structure]

Scripts

scripts/generate_runbook.py

Generate a runbook skeleton from service metadata:

python3 scripts/generate_runbook.py --service api-gateway \
  --provider aws --region us-east-1 \
  --monitors datadog,pagerduty \
  --output runbook-api-gateway.md

AI Enhancement

When used as an agent skill, the incident responder:

  • Guides triage in real-time with diagnostic commands specific to the stack
  • Correlates symptoms with known failure modes from the runbook
  • Drafts status page updates and internal communications
  • Generates post-incident reviews with timeline, root cause analysis, and action items
  • Learns from past incidents to improve future runbooks

Version tags

latestvk97fwh1fvw2g4wsmp75szp755185n43m