{"skill":{"slug":"prometheus-alert-designer","displayName":"Prometheus Alert Designer","summary":"Design Prometheus alerting rules and recording rules — analyze PromQL queries, set meaningful thresholds, reduce alert fatigue, and build multi-window multi-...","description":"---\nname: prometheus-alert-designer\ndescription: Design Prometheus alerting rules and recording rules — analyze PromQL queries, set meaningful thresholds, reduce alert fatigue, and build multi-window multi-burn-rate SLO alerts. Use when asked to create alerts, fix noisy alerting, or set up Prometheus-based monitoring.\nmetadata:\n  tags: [\"prometheus\", \"alerting\", \"promql\", \"grafana\", \"sre\", \"observability\"]\n---\n\n# Prometheus Alert Designer\n\nDesign Prometheus alerting rules that wake people up only when it matters. Analyze PromQL queries for correctness, set thresholds based on real traffic patterns, create recording rules for performance, and implement multi-window burn-rate SLO alerting — the gold standard for production alerts.\n\nUse when: \"create Prometheus alerts\", \"our alerts are too noisy\", \"design alerting rules\", \"write PromQL for monitoring\", \"set up SLO-based alerting\", \"review our alerting rules\", or when configuring Alertmanager routing.\n\n## Core Philosophy\n\n**The Three Laws of Alerting:**\n1. Every alert must be actionable — if nobody needs to do anything, delete it.\n2. Every alert must be urgent — if it can wait until Monday, it's not an alert (it's a ticket).\n3. Every alert must be real — if it fires and the service is fine, the alert is broken.\n\n## Analysis Steps\n\n### 1. Inventory Existing Alerts\n\nQuery Prometheus API to list all rules, currently firing alerts, and alert history. For each alert, evaluate:\n- Fires often (>3x/week)? Probably too sensitive.\n- Nobody acts when it fires? Delete or downgrade.\n- Fires and auto-resolves in <5min? Flapping.\n- Threshold based on data or a guess? Most are guesses.\n- Has a runbook link? 
Without one, useless at 3 AM.\n\n### 2. Design Alert Rules by Category\n\n#### Service Availability Alerts\n\n**High Error Rate:**\n```yaml\ngroups:\n  - name: service_availability\n    interval: 30s\n    rules:\n      - alert: HighErrorRate\n        expr: |\n          (\n            sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (job, service)\n            /\n            sum(rate(http_requests_total[5m])) by (job, service)\n          ) * 100 > 5\n        for: 5m\n        labels:\n          severity: critical\n          team: \"{{ $labels.service }}\"\n        annotations:\n          summary: \"High error rate on {{ $labels.service }}\"\n          description: |\n            Error rate is {{ printf \"%.2f\" $value }}% (threshold: 5%).\n            Service: {{ $labels.service }}\n            Job: {{ $labels.job }}\n          runbook_url: \"https://wiki.internal/runbooks/high-error-rate\"\n          dashboard_url: \"https://grafana.internal/d/svc-overview?var-service={{ $labels.service }}\"\n```\n\n**Key design decisions:**\n- `for: 5m` prevents alerting on transient spikes (a single retry storm)\n- `rate()[5m]` smooths over 5 minutes — shorter windows are noisier\n- Group by `service` so each service gets its own alert instance\n- Include both `summary` (for pager) and `description` (for context)\n- Always include `runbook_url` and `dashboard_url`\n\n**Service Down (no traffic at all):**\n```yaml\n      - alert: ServiceDown\n        expr: |\n          up{job=\"my-service\"} == 0\n          or\n          absent(up{job=\"my-service\"})\n        for: 2m\n        labels:\n          severity: page\n        annotations:\n          summary: \"{{ $labels.job }} is down on {{ $labels.instance }}\"\n          description: \"Target has been unreachable for 2 minutes.\"\n```\n\n**Important:** Use `absent()` to catch the case where the target disappears entirely (Prometheus stops scraping it, so `up` returns no data instead of 0).\n\n#### Latency Alerts\n\n**High Latency 
(histogram-based):**\n```yaml\n      - alert: HighLatencyP99\n        expr: |\n          histogram_quantile(0.99,\n            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)\n          ) > 2.0\n        for: 10m\n        labels:\n          severity: warning\n        annotations:\n          summary: \"p99 latency above 2s for {{ $labels.service }}\"\n          description: \"p99 latency is {{ printf \\\"%.2f\\\" $value }}s\"\n```\n\n**Latency rules:** Always use `histogram_quantile`, alert on p99 not p50, use `for: 10m`, set threshold from SLO.\n\n#### Saturation Alerts\n\n**Disk Space** — use `predict_linear` instead of static thresholds:\n```yaml\n      - alert: DiskSpaceRunningOut\n        expr: predict_linear(node_filesystem_avail_bytes{fstype!~\"tmpfs|overlay\"}[6h], 24*3600) < 0\n        for: 30m\n        labels:\n          severity: warning\n        annotations:\n          summary: \"Disk will fill within 24h on {{ $labels.instance }}\"\n```\n\n`predict_linear` catches a disk at 60% growing 5%/hour (problem in 8h) while ignoring a disk at 85% with stable usage.\n\n**CPU/Memory** — same pattern: `(1 - avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[10m]))) * 100 > 85` for CPU, `(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90` for memory. The outer parentheses matter: `*` binds tighter than `-`, so without them the expression computes `1 - (ratio * 100)` and never crosses the threshold. Use `for: 10m` to avoid transient spikes.\n\n### 3. 
Build Recording Rules\n\nRecording rules pre-compute expensive queries so dashboards load fast and alerts evaluate reliably.\n\n**When to create a recording rule:**\n- Query uses `rate()` + aggregation across many series (>1000 time series)\n- Same query appears in multiple alerts or dashboards\n- Query takes >2 seconds to evaluate\n- Query is used for SLO calculations\n\n**Naming convention:** `level:metric:operations`\n\n```yaml\ngroups:\n  - name: service_recording_rules\n    interval: 30s\n    rules:\n      - record: service:http_requests_total:rate5m\n        expr: sum(rate(http_requests_total[5m])) by (service)\n      - record: service:http_requests_errors:rate5m\n        expr: sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service)\n      - record: service:http_requests:error_ratio_5m\n        expr: service:http_requests_errors:rate5m / service:http_requests_total:rate5m\n      - record: service:http_request_duration_seconds:p99_5m\n        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))\n```\n\n### 4. Implement Multi-Window Multi-Burn-Rate SLO Alerts\n\nThis is the Google SRE recommended approach. Instead of alerting on a raw error rate, alert when you're burning through your error budget too fast.\n\n**Concept:**\n```\nSLO: 99.9% availability (error budget = 0.1% of requests can fail)\n\nBurn rate 1x  = consuming budget at exactly the right pace (will exhaust in 30 days)\nBurn rate 14x = will exhaust the entire 30-day budget in ~2 days — PAGE NOW\nBurn rate 6x  = will exhaust in ~5 days — create a ticket\nBurn rate <1x = budget outlasts the 30-day window, everything is fine\n```\n\n**Implementation — two alert tiers:**\n\n1. **Page (14x burn):** Create recording rules for 5m and 1h error ratios. Alert when BOTH windows exceed `14 * (1 - SLO)`. For 99.9% SLO: `> 14 * 0.001 = 0.014`. Use `for: 2m`. This catches severe outages — budget exhausted in ~2 days.\n\n2. 
**Ticket (6x burn):** Same pattern with 30m and 6h windows exceeding `6 * (1 - SLO)`. Use `for: 5m`. Catches slow degradation — budget exhausted in ~5 days.\n\n```yaml\n      - alert: SLOErrorBudgetBurnHigh\n        expr: |\n          (\n            service:slo_errors:ratio_rate5m > (14 * 0.001)\n            and\n            service:slo_errors:ratio_rate1h > (14 * 0.001)\n          )\n        for: 2m\n        labels:\n          severity: page\n        annotations:\n          summary: \"SLO burn rate critical for {{ $labels.service }}\"\n```\n\nRecording rules needed: `service:slo_errors:ratio_rate{5m,30m,1h,6h}` — each is `sum by (service) (rate(http_requests_total{status=~\"5..\"}[window])) / sum by (service) (rate(http_requests_total[window]))`. Both sides must aggregate by `service`; otherwise the label sets differ and the division returns no series.\n\n### 5. Configure Alertmanager Routing\n\nRoute alerts by severity label to appropriate channels:\n\n| Severity | Receiver | group_wait | repeat_interval |\n|----------|----------|-----------|----------------|\n| `page` | PagerDuty | 10s | 1h |\n| `critical` | Slack #incidents | 30s | 2h |\n| `warning` | Slack #monitoring | 30s | 8h |\n| `ticket` | Jira webhook | 30s | 24h |\n\n**Key settings:** `group_by: ['alertname', 'service']` to batch related alerts. Set `group_interval: 5m` to avoid notification spam.\n\n**Inhibition rules** (critical for reducing noise):\n- ServiceDown firing suppresses HighErrorRate for the same service (redundant)\n- `page` severity suppresses `warning` severity for the same service\n\n### 6. Alert Fatigue Audit\n\nReview existing alerts for these anti-patterns:\n\n- **Flapping alerts** — fires and resolves within 5 minutes repeatedly. Fix: increase `for` duration or add hysteresis.\n- **Always-firing alerts** — has been in FIRING state for days. Fix: raise threshold or reclassify as ticket.\n- **Never-firing alerts** — hasn't fired in 6 months. Fix: verify query still returns data, adjust threshold, or remove.\n- **Duplicate alerts** — multiple alerts that fire for the same incident. 
Fix: use inhibition rules.\n- **Missing `for` clause** — fires on every transient spike. Fix: add `for: 5m` minimum.\n- **Alert without runbook** — useless at 3 AM. Fix: write a runbook or link to the dashboard.\n- **Percentage alerts on low traffic** — \"5% error rate\" when there are 2 requests/min = 1 error fires the alert. Fix: add a minimum traffic floor: `and sum(rate(http_requests_total[5m])) by (service) > 1`\n\n## Output Format\n\n```markdown\n# Prometheus Alert Design: {Service/System Name}\n\n## Recording Rules\n{YAML recording rules with explanations}\n\n## Alert Rules\n{YAML alert rules organized by category}\n\n## Alertmanager Routing\n{Routing configuration with severity-based escalation}\n\n## SLO Burn Rate Alerts\n{Multi-window burn rate rules if applicable}\n\n## Audit Findings (if reviewing existing rules)\n- {Anti-pattern found and recommended fix}\n\n## Testing Plan\n- {How to verify each alert fires correctly}\n- {Recommended Prometheus unit test cases}\n```\n\n## Tips\n\n- Validate rule files with `promtool check rules` and unit-test alert logic with `promtool test rules` before deploying\n- Use `for: 5m` as the default minimum for threshold alerts; reserve shorter values such as `for: 2m` for page-level alerts like `ServiceDown` and the fast burn-rate alert\n- Always add a traffic floor to percentage-based alerts: `and rate(total[5m]) > 1`\n- Set `group_by: ['alertname', 'service']` in Alertmanager to batch related alerts\n- Use `inhibit_rules` to suppress redundant alerts (e.g., don't alert on high latency if the service is down)\n- Name alerts with the pattern `{What}{Condition}` — `HighErrorRate`, `DiskSpaceLow`, `ServiceDown`\n- Every alert annotation should include: what's wrong, how bad it is (current value), and where to look (dashboard URL)\n- Review alert firing history monthly — if nobody acted on it, delete 
it\n","tags":{"latest":"1.0.0"},"stats":{"comments":0,"downloads":361,"installsAllTime":13,"installsCurrent":0,"stars":0,"versions":1},"createdAt":1777632323668,"updatedAt":1778492821950},"latestVersion":{"version":"1.0.0","createdAt":1777632323668,"changelog":"Initial release empowers teams to design actionable, reliable Prometheus alerting rules and reduce alert fatigue.\n\n- Provides step-by-step guidance on auditing and improving existing alerts.\n- Includes templates and rationale for effective availability, latency, and saturation alerts using modern PromQL patterns.\n- Details the setup of recording rules for performance and reliability.\n- Introduces multi-window multi-burn-rate SLO alerting for robust error budget monitoring.\n- Shares best practices for Alertmanager routing to ensure alerts reach the right people.","license":"MIT-0"},"metadata":null,"owner":{"handle":"charlie-morrison","userId":"s17cttbdxry5kkyafjw983mq8s83p4y3","displayName":"charlie-morrison","image":"https://avatars.githubusercontent.com/u/271589886?v=4"},"moderation":null}