Observability SLOs

Deep SLO/SLI workflow—user-centric SLIs, SLO targets and windows, error budgets, multi-window burn alerts, and policy when budget is exhausted. Use when defining reliability targets or aligning eng and product on trade-offs.

Install

openclaw skills install observability-slos

Observability & SLOs (Deep Workflow)

SLOs connect engineering work to user-perceived reliability. SLIs must be measurable from systems but grounded in user journeys.

When to Offer This Workflow

Trigger conditions:

A target such as 99.9% is quoted without saying what it measures
Too many pages, or none at all; the team needs error budget discipline
Product wants features while stability degrades

Initial offer:

Use six stages: (1) pick user journeys, (2) define SLIs, (3) set SLO targets & windows, (4) error budget policy, (5) alerting on budget burn, (6) review & iterate. Confirm the metric stack and any dependency SLOs from vendors.

Stage 1: User Journeys

Goal: Identify the critical paths whose failure users actually notice (checkout, login, API sync), not resource symptoms such as “CPU low”.

Output

3–10 journeys ranked by business impact and frequency.

Exit condition: One paragraph per journey: user intent + failure symptom.

Stage 2: Define SLIs

Goal: Express each SLI as the ratio of good events to total events over a window, with the implementation spelled out.

Examples

Availability: successful requests / valid requests (define “valid”)
Latency: proportion of requests faster than T ms
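
The two example SLIs above can be sketched as good/total ratios. This is a minimal illustration, not a reference implementation: the `Request` record, the rule that 4xx responses are not "valid" traffic, and counting any non-5xx response as good are all assumptions to replace with your own definitions.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; field names are illustrative."""
    status: int        # HTTP status code returned to the caller
    latency_ms: float  # latency measured where the user would feel it


def is_valid(req: Request) -> bool:
    # Assumption: client errors (4xx) are not the service's fault,
    # so they are left out of the denominator.
    return not (400 <= req.status < 500)


def availability_sli(requests: Iterable[Request]) -> Optional[float]:
    """Successful requests / valid requests."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return None  # no valid traffic: the SLI is undefined, not 100%
    good = sum(1 for r in valid if r.status < 500)
    return good / len(valid)


def latency_sli(requests: Iterable[Request], threshold_ms: float) -> Optional[float]:
    """Proportion of valid requests faster than threshold_ms."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return None
    fast = sum(1 for r in valid if r.latency_ms < threshold_ms)
    return fast / len(valid)
```

Returning `None` for an empty window is a deliberate choice here: reporting 100% when there was no traffic tends to hide outages that block all requests.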

Good SLIs

Objective, and low-cardinality enough to measure reliably

Exit condition: SLI formula + data source (metrics, logs, probes).

Stage 3: SLO Targets & Windows

Goal: Target (e.g., 99.9% monthly) implies allowed bad minutes—make it explicit.
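
A small sketch of that translation, assuming a time-based SLO where every minute of the window counts equally. The function name is hypothetical.

```python
def allowed_bad_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of 'badness' an SLO target permits over the window."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - target) * window_minutes
```

For example, `allowed_bad_minutes(0.999, 30)` comes to roughly 43.2 minutes, and each extra nine divides the allowance by ten.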

Practices

A rolling 30-day window is common; align it with release cadence
Tier services: not everything needs same SLO

Exit condition: Published table: journey → SLI → target → window.

Stage 4: Error Budget Policy

Goal: What we do when budget is healthy vs exhausted.

Policy ideas

Budget healthy → ship features; low → freeze risky changes, focus on reliability
Escalation when budget burns fast (multi-window alerts)
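
One way to make such a policy mechanical is to compute the remaining budget and map it to an action. The sketch below is illustrative only: the 25% "low" threshold and the action names are assumptions, and the real values belong in the policy agreed with product.

```python
def budget_remaining(target: float, good: int, total: int) -> float:
    """Fraction of the error budget left: 1.0 = untouched, <= 0 = exhausted."""
    if total == 0:
        return 1.0  # no events yet, so nothing has been spent
    allowed_bad = (1.0 - target) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def policy_action(remaining: float, low_threshold: float = 0.25) -> str:
    # Assumption: below low_threshold (including exhausted) we freeze
    # risky changes; otherwise feature work continues.
    if remaining < low_threshold:
        return "freeze_risky_changes"
    return "ship_features"
```

With a 99.9% target and 1,000,000 events, 500 bad events leave about half the budget, so `policy_action` returns `"ship_features"`; 2,000 bad events overspend it and the result is `"freeze_risky_changes"`.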

Exit condition: Written policy with product sign-off.

Stage 5: Alerting on Burn

Goal: Page on error budget burn rate, not on every blip. With Google-style SLO alerting, this is the multi-window, multi-burn-rate pattern.

Practices

Fast burn = page soon; slow burn = ticket/track
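
A minimal sketch of the multi-window check, assuming error ratios per window are already available from the metric stack. Burn rate is the observed error ratio divided by the budgeted ratio (1 - target). The window pairs and the 14.4 and 1.0 thresholds follow values commonly cited from Google's SRE Workbook for a 30-day SLO; treat them as starting assumptions, not fixed rules.

```python
from typing import Mapping


def burn_rate(error_ratio: float, target: float) -> float:
    """How many times faster than 'exactly on budget' we are burning."""
    return error_ratio / (1.0 - target)


def classify_burn(error_ratios: Mapping[str, float], target: float) -> str:
    """error_ratios maps window names ("5m", "1h", "6h", "3d") to error ratios."""
    rate = {w: burn_rate(e, target) for w, e in error_ratios.items()}
    # Fast burn: the long window shows real budget loss, the short
    # window confirms it is still happening now.
    if rate["1h"] >= 14.4 and rate["5m"] >= 14.4:
        return "page"
    # Slow burn: sustained overspend that can wait for working hours.
    if rate["3d"] >= 1.0 and rate["6h"] >= 1.0:
        return "ticket"
    return "none"
```

Requiring both windows to exceed the threshold is what keeps a brief spike from paging and lets the alert clear soon after the problem stops.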

Exit condition: Alert rules linked to runbooks.

Stage 6: Review & Iterate

Goal: SLOs drift as the architecture changes. Review them quarterly and adjust targets using data.

Final Review Checklist

Journeys and SLIs tied to real user pain
Targets realistic vs dependencies and cost
Error budget policy agreed with product
Alerts on burn, not noisy symptom spam
Review cadence scheduled

Tips for Effective Guidance

Translate 99.9% to minutes/month of allowed badness.
An SLA is a contract; an SLO is an internal target. Don’t confuse them.
Dependency SLO caps what you can promise—surface that early.
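
To illustrate the dependency cap, here is a rough sketch under strong assumptions: every dependency is a hard, serial dependency (no cache or fallback), and failures are independent. Under those assumptions the best achievable availability is the product of the dependency targets. Function names are hypothetical.

```python
import math
from typing import Sequence


def availability_ceiling(dependency_targets: Sequence[float]) -> float:
    """Upper bound on availability given hard, independent dependencies."""
    return math.prod(dependency_targets)


def is_promisable(own_target: float, dependency_targets: Sequence[float]) -> bool:
    """True if the target does not exceed what the dependencies allow."""
    return own_target <= availability_ceiling(dependency_targets)
```

For instance, dependencies at 99.95% and 99.9% give a ceiling just under 99.85%, so a 99.9% target for the journey is out of reach before your own code contributes a single failure.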

Handling Deviations

No metrics yet: start with a proxy SLI (synthetic probes) and improve instrumentation over time.
Batch systems: use event processing lag as the SLI instead of HTTP request metrics.
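
For the batch case, a lag-based SLI can keep the same good/total shape. The sketch below assumes each event carries a creation timestamp and an optional processing timestamp in seconds; the tuple layout, `max_lag_s`, and the handling of still-pending events are illustrative choices.

```python
from typing import Iterable, Optional, Tuple

# (created_at_s, processed_at_s or None if not processed yet)
Event = Tuple[float, Optional[float]]


def freshness_sli(events: Iterable[Event], max_lag_s: float, now_s: float) -> Optional[float]:
    """Proportion of events processed within max_lag_s of creation."""
    good = 0
    total = 0
    for created, processed in events:
        if processed is None and now_s - created <= max_lag_s:
            continue  # still inside its deadline: neither good nor bad yet
        total += 1
        if processed is not None and processed - created <= max_lag_s:
            good += 1
    return good / total if total else None
```

Unprocessed events that have already missed the deadline count as bad, so a stalled pipeline lowers the SLI instead of silently producing no data.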
