## Alerting Rules (Prometheus)

This document defines baseline alerting rules for StripeMeter. Each alert includes a brief next action.

Copy the rules block into your Prometheus `rule_files` and run `promtool check rules` before applying.

```yaml
groups:
  - name: stripemeter.rules
    rules:
      # Drift absolute exceeds epsilon for 15 minutes
      - alert: ReconciliationDriftAbsHigh
        expr: max_over_time(reconciliation_diff_abs[15m]) > 0.0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Reconciliation absolute drift above epsilon"
          description: "Absolute drift between local and Stripe exceeds epsilon for 15m. value={{ $value }}"
          runbook: "Inspect reconciliation reports; follow RECONCILIATION.md steps 1–4."

      # Drift percentage exceeds threshold for 15 minutes
      - alert: ReconciliationDriftPctHigh
        expr: max_over_time(reconciliation_diff_pct[15m]) > 0.005
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Reconciliation percentage drift above threshold"
          description: "Percentage drift (|local - stripe| / stripe) > 0.5% for 15m. value={{ $value }}"
          runbook: "Backfill or adjust per RECONCILIATION.md for affected item(s)."

      # Queue lag above threshold (requires exporter of BullMQ/Redis lag)
      - alert: IngestionQueueLagHigh
        expr: max_over_time(queue_lag_seconds{queue=~"aggregation|writer"}[15m]) > 60
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Queue lag over 60s"
          description: "Background queue lag above 60s for 15m. value={{ $value }}"
          runbook: "Check Redis, worker replicas, backlog; scale workers or unstick jobs."

      # Ingest latency p95 above SLO
      - alert: IngestLatencyHighP95
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{route="/v1/events/ingest"}[5m]))) > 0.2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Ingest p95 latency over 200ms"
          description: "POST /v1/events/ingest p95 > 200ms for 15m. value={{ $value }}"
          runbook: "Check API/DB/Redis; scale API or tune batch sizes."

      # HTTP 5xx rate above threshold
      - alert: Http5xxRateHigh
        expr: sum by (route) (rate(http_requests_total{status_code=~"5.."}[5m])) > 0.1
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "HTTP 5xx rate high"
          description: "5xx error rate > 0.1 req/s over 5m for 15m. route={{ $labels.route }} value={{ $value }}"
          runbook: "Check deploys/logs/upstreams; roll back or hotfix."
```

Notes
- Metrics used are based on built-in exporters:
  - `http_request_duration_seconds_bucket`, `http_requests_total` from API
  - `reconciliation_diff_abs`, `reconciliation_diff_pct` are placeholders; wire to reconciler metrics or Redis-exported gauges.
  - `queue_lag_seconds` is a placeholder for BullMQ/Redis consumer lag. Add an exporter or custom gauge.
- Coordinate thresholds with `RECONCILIATION.md` and production SLOs in `project-principles.md`.

### Alertmanager Example

See `ops/alertmanager/alertmanager.yml` for a minimal route/receiver.


### What to do next (Runbook)

When any reconciliation alert fires, follow the operator playbook:

- Read `RECONCILIATION.md` → Steps 1–6 (verify health, inspect drift, replay window, confirm resolution)
- Use copy-paste commands for `/metrics`, Replay API, and reconciliation summary
- Validate drift is back within epsilon (≤ 0.5%) before closing the alert