Runbook Automator
v1.0.0Convert manual incident runbooks into automated, executable playbooks. Parse existing runbooks, generate scripts for each step, add health checks, rollback p...
Runbook Automator
Transform manual runbooks into automated, executable playbooks. Parse existing documentation, generate step-by-step scripts with health checks, decision points, rollback procedures, and notification hooks — so incidents get resolved faster with less human intervention.
Use when: "automate this runbook", "convert runbook to script", "make this playbook executable", "incident automation", "turn this wiki page into a script", or when building on-call automation.
Commands
1. convert — Parse Runbook and Generate Automation
Step 1: Identify Runbook Format
Read the input runbook (markdown, Confluence wiki, Google Doc, plain text) and extract:
- Title and scope — what incident does this address
- Prerequisites — access, tools, permissions needed
- Steps — ordered actions (distinguish manual vs automatable)
- Decision points — if/then branches
- Verification steps — how to confirm each step worked
- Rollback steps — how to undo if things go wrong
- Escalation criteria — when to page someone
Step 2: Classify Each Step
For each step in the runbook, classify as:
| Type | Example | Automation |
|---|---|---|
| Command | "Run kubectl rollout restart" | Direct script execution |
| Check | "Verify pods are running" | Script with assertion |
| Decision | "If error rate > 5%, proceed to step 4" | Conditional branch |
| Manual | "Call the database team" | Notification + pause |
| Observation | "Watch the dashboard for 10 minutes" | Timed wait + metric check |
Step 3: Generate Executable Playbook
#!/usr/bin/env bash
set -euo pipefail
# ============================================
# Automated Runbook: [Title]
# Generated from: [source document]
# Last updated: [date]
# ============================================
SLACK_WEBHOOK="${SLACK_WEBHOOK:-}"
PAGERDUTY_KEY="${PAGERDUTY_KEY:-}"
DRY_RUN="${DRY_RUN:-false}"
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }
notify() {
log "NOTIFY: $1"
if [[ -n "$SLACK_WEBHOOK" ]]; then
curl -s -X POST "$SLACK_WEBHOOK" -H 'Content-Type: application/json' \
-d "{\"text\": \"🔧 Runbook: $1\"}" > /dev/null
fi
}
fail() { notify "❌ FAILED at step $1: $2"; exit 1; }
# --- Step 1: [Name] ---
step_1() {
log "Step 1: [description]"
if [[ "$DRY_RUN" == "true" ]]; then
log "DRY RUN: would execute [command]"
return 0
fi
# [actual command]
[command] || fail 1 "[error description]"
# Verify
[verification command] || fail 1 "Verification failed"
log "Step 1: ✅ Complete"
}
# --- Step 2: [Decision Point] ---
step_2() {
log "Step 2: Checking [condition]"
local metric
metric=$([check command])
if (( $(echo "$metric > 5" | bc -l) )); then
log "Threshold exceeded ($metric > 5) — escalating"
step_2a # escalation path
else
log "Within bounds ($metric <= 5) — continuing"
step_3
fi
}
# --- Rollback ---
rollback() {
notify "🔄 Rolling back..."
log "Rollback: [undo commands]"
[rollback command 1]
[rollback command 2]
notify "Rollback complete"
}
trap 'rollback' ERR
# --- Execute ---
notify "Starting runbook: [Title]"
step_1
step_2
# ... remaining steps
notify "✅ Runbook complete"
2. analyze — Audit Existing Runbooks for Gaps
Read all runbooks in a directory and flag:
# Find runbook-like documents
find . -maxdepth 3 \( -name "*.md" -o -name "*.txt" -o -name "*.adoc" \) | \
xargs grep -li "runbook\|playbook\|incident\|on-call\|troubleshoot" 2>/dev/null
For each runbook, check:
- Missing rollback steps — what happens if step 3 fails?
- No verification — steps that say "do X" but never check if X worked
- Stale commands — references to deprecated tools, old hostnames, removed services
- Missing decision criteria — "if it's bad, escalate" (how bad? what metric?)
- No estimated time — SLA-critical runbooks need time bounds per step
- Missing prerequisites — assumed access or tools not listed
Output a coverage report:
# Runbook Audit Report
| Runbook | Steps | Automated | Rollback | Verified | Gaps |
|---------|-------|-----------|----------|----------|------|
| DB Failover | 8 | 3/8 (38%) | ✅ | 5/8 | Stale hostname in step 4 |
| API Scale-Up | 5 | 5/5 (100%) | ❌ Missing | 4/5 | No rollback procedure |
| Cache Flush | 3 | 2/3 (67%) | ✅ | 3/3 | Step 2 references removed tool |
3. test — Dry-Run a Generated Playbook
Execute the generated script with DRY_RUN=true:
- Validate all commands exist in PATH
- Check prerequisite access (can reach hosts, have credentials)
- Verify notification hooks work (send test message)
- Estimate execution time based on sleep/wait steps
- Flag any steps that would require manual intervention
4. template — Generate Runbook Template
Given an incident type (database, network, application, security), generate a structured template with:
- Standard sections (scope, impact, prerequisites, steps, rollback, escalation)
- Common steps for that incident type pre-filled
- Placeholder verification commands
- Notification hooks
- Post-incident review checklist
