Install
openclaw skills install @codekungfu/sre-practices
Deep SRE workflow: SLOs/SLIs, error budgets, alerting, toil reduction, incident readiness, capacity, and balancing reliability with delivery. Use when improving production culture, defining service reliability targets, or reducing on-call pain.
SRE is not “ops with a fancy title”; it is engineering reliability with explicit trade-offs between velocity and stability, measured with SLOs and managed through error budgets and toil budgets.
Trigger conditions:
Initial offer:
Walk through six stages: (1) user journeys & SLIs, (2) SLO targets & windows, (3) error budgets & policy, (4) alerting & on-call, (5) toil & automation, (6) continuous improvement. Confirm service tiering and business criticality.
Goal: Measure what users actually experience, not only server uptime.
Exit condition: SLI definitions documented with data sources (metrics, logs, probes).
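As a rough illustration of a user-facing SLI, the sketch below computes a request-based "good events over total events" ratio. The `RequestWindow` shape, the field names, and the sample counts are assumptions made for this example; they are not part of the skill itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestWindow:
    """Request counts for one measurement window (hypothetical shape)."""
    total: int
    good: int  # requests that succeeded and met the latency threshold


def availability_sli(window: RequestWindow) -> float:
    """Return the fraction of good requests, between 0.0 and 1.0."""
    if window.total == 0:
        # Assumption: with no traffic, report the SLI as fully met
        # instead of dividing by zero.
        return 1.0
    return window.good / window.total


# Made-up counts for a "checkout" user journey.
checkout = RequestWindow(total=200_000, good=199_700)
print(f"checkout SLI: {availability_sli(checkout):.4%}")
```

In practice the counts would come from the documented data source (metrics, logs, or probes) rather than hard-coded values.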
Goal: Set achievable targets with explicit consequences.
Exit condition: Published SLO document per service or journey with measurement method.
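To make a target and window concrete, it can help to translate them into allowed "bad time". The following is a minimal sketch for a time-based availability SLO; the function name and the 99.9% over 30 days figures are illustrative assumptions.

```python
from datetime import timedelta


def allowed_downtime(slo_target: float, window: timedelta) -> timedelta:
    """Return the error budget, expressed as time, for a time-based SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The budget is the share of the window that is allowed to be bad.
    return window * (1.0 - slo_target)


# Illustrative target: 99.9% over a rolling 30-day window.
budget = allowed_downtime(0.999, timedelta(days=30))
print(f"allowed downtime: {budget.total_seconds() / 60:.1f} minutes")
```

Running the same calculation for a stricter target (for example 99.99%) shows how quickly the budget shrinks, which is a useful check on whether a target is achievable.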
Goal: Decide how to spend budget—feature velocity vs reliability work.
Exit condition: Written policy stating what happens when 25%, 50%, and 100% of the error budget has been consumed.
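A written policy of this kind can be sketched as a lookup from budget consumption to an agreed action. The 25/50/100% thresholds come from the text above; the action strings and function names are hypothetical placeholders that each organization would replace with its own policy.

```python
# Thresholds follow the 25/50/100% levels; the actions are hypothetical.
POLICY = [
    (1.00, "freeze feature releases; reliability work only"),
    (0.50, "prioritize reliability items in the next planning cycle"),
    (0.25, "review recent incidents with the service owner"),
]


def budget_consumed(slo_target: float, good: int, total: int) -> float:
    """Return the fraction of the request-based error budget used so far."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0
    allowed_bad = (1.0 - slo_target) * total
    return (total - good) / allowed_bad


def policy_action(consumed: float) -> str:
    """Return the action for the highest threshold that has been crossed."""
    for threshold, action in POLICY:
        if consumed >= threshold:
            return action
    return "no action; ship as normal"


# Made-up counts: 600 bad requests out of 1,000,000 against a 99.9% SLO.
consumed = budget_consumed(0.999, good=999_400, total=1_000_000)
print(f"{consumed:.0%} of budget used -> {policy_action(consumed)}")
```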
Goal: Pages are symptom-based, actionable, low noise.
Exit condition: Alert inventory reviewed; tuning backlog for noisy alerts.
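One common way to keep pages symptom-based and low noise is to alert on error budget burn rate over two windows at once. The sketch below assumes the multi-window approach described in the Google SRE Workbook, with a 14.4x threshold over a long and a short window; the skill text does not prescribe these values, so treat them as an assumption to tune.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than 'exactly on budget' errors occur."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_errors: float, short_window_errors: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast.

    The long window (e.g. 1 hour) shows the problem is significant; the
    short window (e.g. 5 minutes) shows it is still happening.
    """
    return (burn_rate(long_window_errors, slo_target) >= threshold
            and burn_rate(short_window_errors, slo_target) >= threshold)


# Made-up error ratios against a 99.9% SLO.
print(should_page(0.02, 0.03, slo_target=0.999))    # sustained, ongoing burn
print(should_page(0.02, 0.0005, slo_target=0.999))  # already recovered
```

Requiring both windows is what suppresses pages for incidents that have already resolved, which tends to be a large source of alert noise.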
Goal: Reduce manual, repetitive, automatable work with measurable toil budgets.
Exit condition: Toil reduction roadmap with owners; as an aspiration, cap toil at 50% of each team's time (a Google SRE guideline; adapt to your organization).
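A toil budget only works if toil is measured. As a minimal sketch, the snippet below sums logged toil hours per category and compares the share against a cap; the category names, the hours, and the 50% default are illustrative assumptions.

```python
def toil_fraction(toil_hours: dict[str, float], total_hours: float) -> float:
    """Return the share of team time spent on toil."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return sum(toil_hours.values()) / total_hours


def over_toil_cap(toil_hours: dict[str, float], total_hours: float,
                  cap: float = 0.5) -> bool:
    """Return True when measured toil exceeds the cap (default 50%)."""
    return toil_fraction(toil_hours, total_hours) > cap


# Hypothetical weekly log for a five-person team (200 hours in total).
week = {"manual deploys": 40.0, "ticket triage": 50.0, "cert renewals": 20.0}
share = toil_fraction(week, total_hours=200.0)
print(f"toil share: {share:.0%}, over cap: {over_toil_cap(week, 200.0)}")

# The largest category is a reasonable first automation candidate.
print("top toil source:", max(week, key=week.get))
```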
Goal: Reliability work is prioritized like features.