# Reconciliation Runbook (Operator Playbook)

Concise, copy-pasteable steps to investigate drift and repair discrepancies between StripeMeter and Stripe. Runnable against local docker-compose.

## Policy

- Drift epsilon: 0.5% relative difference per subscription item per period
- Within epsilon (≤ 0.5%): record a small adjustment to align and proceed
- Beyond epsilon (> 0.5%): backfill late/missing events, then re-run aggregation

## Core mechanics (how the system behaves)

- Watermarks: each counter tracks a last-processed timestamp; late events within the window trigger re-aggregation
- Delta push: writer sends only the delta from `pushed_total` per Stripe item (idempotent)
- Reconciliation loop: periodically compares local totals vs Stripe reported usage and flags items beyond epsilon

---

## Triage Runbook

### 0) Prerequisites (local)

```bash
# Bring up infra
docker compose -f docker-compose.yml up -d

# Start app stack (in another terminal)
pnpm -r build && pnpm dev
```

Optional monitoring stack:

```bash
docker compose -f docker-compose.prod.yml --profile monitoring up -d
# Prometheus: http://localhost:${PROMETHEUS_PORT:-9090}
# Grafana: http://localhost:${GRAFANA_PORT:-3001} (admin/admin by default)
```

### 1) Verify service health and scrape status

```bash
# API health
curl -fsS http://localhost:3000/health/ready | jq .

# API metrics are exposed for Prometheus
curl -fsS http://localhost:3000/metrics | head -n 20

# If monitoring profile is enabled, check Prometheus targets
open "http://localhost:9090/targets" || echo "Open http://localhost:9090/targets"
```

Expected: `/health/ready` returns `healthy` or `degraded`; `/metrics` includes `http_requests_total` and `http_request_duration_seconds_bucket`.
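
The `/metrics` expectation above can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the endpoint serves standard Prometheus text exposition; `check_metrics` is a hypothetical helper name, not something StripeMeter ships. It reads the metrics text on stdin and fails if either expected series has no sample line.

```bash
# check_metrics: hypothetical helper (not part of StripeMeter).
# Reads Prometheus text exposition on stdin and returns non-zero if an
# expected metric name has no sample line.
check_metrics() {
  local text name missing
  text="$(cat)"
  missing=0
  for name in http_requests_total http_request_duration_seconds_bucket; do
    # Sample lines start with the metric name followed by "{" or a space;
    # "# HELP" and "# TYPE" comment lines do not match.
    if ! printf '%s\n' "$text" | grep -Eq "^${name}(\{| )"; then
      echo "missing metric: ${name}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage against the local API (scans the full output, not only the first 20 lines):
# curl -fsS http://localhost:3000/metrics | check_metrics && echo "metrics OK"
```

Reading the whole response avoids a false alarm when the expected series appear after the first 20 lines.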
### 2) Inspect drift metrics and logs

Prometheus queries (paste into Prometheus/Grafana):

```
# Percentage drift by item and period
max_over_time(reconciliation_diff_pct[15m])

# Absolute drift by item and period
max_over_time(reconciliation_diff_abs[15m])

# Reconciliation run cadence and latency
increase(recon_runs_total[1h])
histogram_quantile(0.95, sum(rate(recon_duration_seconds_bucket[5m])) by (le))

# Ingest p95 latency for /v1/events/ingest
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{route="/v1/events/ingest"}[5m])))

# HTTP 5xx rate by route
sum by (route) (rate(http_requests_total{status_code=~"5.."}[5m]))
```

Logs (newest last):

```bash
# Most recent 120 lines per container (stdout and stderr).
# To follow live output instead, run `docker logs -n 200 -f <container>` without the pipe.
docker logs -n 200 stripemeter-api 2>&1 | tail -n 120
docker logs -n 200 stripemeter-workers 2>&1 | tail -n 120
```

### 3) Identify the affected item

Gather: `tenantId`, `customerRef`, `metric`, and billing period.

```bash
# Quick period summary (demo endpoint)
curl -s "http://localhost:3000/v1/reconciliation/summary?tenantId=demo&metric=requests" | jq .
```

### 4) Re-aggregate a window (safe retry/backoff)

Use the Replay API to recompute counters over a time window. Always run a dry-run first.

```bash
# Dry-run last 24h for one metric
curl -s -X POST http://localhost:3000/v1/replay \
  -H 'Content-Type: application/json' \
  -d '{
    "tenantId": "demo",
    "metrics": ["requests"],
    "since": "-PT24H",
    "until": "now",
    "mode": "dry-run"
  }' | jq .

# Apply if the dry-run looks correct (idempotent)
curl -s -X POST http://localhost:3000/v1/replay \
  -H 'Content-Type: application/json' \
  -d '{
    "tenantId": "demo",
    "metrics": ["requests"],
    "since": "-PT24H",
    "until": "now",
    "mode": "apply"
  }' | jq .
```

Guidance:

- Safe to retry with exponential backoff (e.g., 5s, 15s, 30s) if queues are busy; the writer is delta/idempotent
- Expected time: re-aggregating ~10k late events should take ≤ 2 s on a laptop

### 5) Targeted replays

Narrow the blast radius by period or customer where needed.
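
When targeting a billing period, the `since`/`until` pair is easy to get wrong at timezone boundaries. The sketch below assumes billing periods are calendar months in UTC with an exclusive upper bound; this is an assumption for illustration, so check it against your Stripe subscription anchors. `month_window` is a hypothetical helper, not part of StripeMeter.

```bash
# month_window: hypothetical helper (not part of StripeMeter).
# Prints "<since> <until>" for a calendar month given as YYYY-MM,
# as UTC timestamps with an exclusive upper bound.
month_window() {
  local ym year month m next_year next_month
  ym="$1"
  year="${ym%-*}"     # text before the dash, e.g. 2025
  month="${ym#*-}"    # text after the dash, e.g. 01
  m="${month#0}"      # drop a leading zero so 08 and 09 are plain decimals
  if [ "$m" -eq 12 ]; then
    next_year=$(( year + 1 ))
    next_month=1
  else
    next_year="$year"
    next_month=$(( m + 1 ))
  fi
  printf '%s-%s-01T00:00:00Z %04d-%02d-01T00:00:00Z\n' \
    "$year" "$month" "$next_year" "$next_month"
}

month_window 2025-01   # prints: 2025-01-01T00:00:00Z 2025-02-01T00:00:00Z
```

The two values can be pasted into the `since` and `until` fields of a replay request.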
```bash
# Replay a specific period
curl -s -X POST http://localhost:3000/v1/replay \
  -H 'Content-Type: application/json' \
  -d '{
    "tenantId": "demo",
    "metrics": ["requests"],
    "since": "2025-01-01T00:00:00Z",
    "until": "2025-02-01T00:00:00Z",
    "mode": "dry-run"
  }' | jq .
```

### 6) Confirm resolution

```bash
# Reconciliation summary should show drift back within epsilon
curl -s "http://localhost:3000/v1/reconciliation/summary?tenantId=demo&metric=requests" | jq .

# Optionally trigger a reconciliation cycle
curl -s -X POST http://localhost:3000/v1/reconciliation/run -H 'Content-Type: application/json' \
  -d '{"tenantId":"demo"}' | jq .
```

If still beyond epsilon: inspect raw events and watermarks, then repeat step 4 with a wider window.

---

## Common pitfalls

- Tenant or metric mismatch (e.g., wrong `tenantId` or unconfigured metric mapping)
- Timezone boundaries causing the wrong billing period window
- Events arriving after the lateness window → require adjustments instead of automatic re-aggregation
- Missing Stripe secrets or permissions block writer parity
- Placeholder drift gauges (`reconciliation_diff_*`) not yet wired in your environment

## Examples

- Late event (< 48h): falls within lateness window, re-aggregate; if residual diff ≤ 0.5%, adjust.
- Duplicate event: idempotency key should dedupe; if not, create a negative adjustment.

## References & dashboards

- Health and metrics: `GET /health/ready`, `GET /metrics`
- Replay API: `POST /v1/replay` (dry-run → apply)
- Reconciliation: `GET /v1/reconciliation/summary`, `POST /v1/reconciliation/run`
- Alerts and “what to do next”: see `ops/ALERTS.md`
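
For quick triage, the epsilon policy reduces to one comparison per subscription item. The sketch below assumes the relative difference is measured against the Stripe-reported total and that both totals are non-negative; the runbook does not state the denominator, so treat that as an assumption. `classify_drift` is a hypothetical helper, not part of StripeMeter.

```bash
# classify_drift: hypothetical helper (not part of StripeMeter).
# Usage: classify_drift <local_total> <stripe_total>
# Prints "adjust" when drift is within the 0.5% epsilon, else "backfill".
classify_drift() {
  local local_total stripe_total
  local_total="$1"
  stripe_total="$2"
  awk -v l="$local_total" -v s="$stripe_total" 'BEGIN {
    eps = 0.005                 # 0.5% drift epsilon
    d = l - s
    if (d < 0) d = -d           # absolute difference
    # d <= eps * s is the same test as d / s <= eps, without dividing by zero
    if (d <= eps * s) { print "adjust" } else { print "backfill" }
  }'
}

classify_drift 1004 1000   # prints: adjust   (0.4% drift)
classify_drift 1010 1000   # prints: backfill (1.0% drift)
```

A Stripe total of zero with a non-zero local total always classifies as `backfill`, since no relative tolerance applies.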