# Canary Deployment Analyzer (v1.0.0)
Analyze canary deployments to decide whether to promote or rollback. Compare error rates, latency distributions, business metrics, and log patterns between canary and baseline populations — then give a data-driven recommendation.
Use when: "analyze canary", "should we promote this canary", "compare canary metrics", "canary vs baseline", "is this deploy safe to promote", "canary health check", or during progressive delivery decisions.
## Commands
### 1. `analyze` — Full Canary Analysis
#### Step 1: Collect Metrics
Identify the metrics source (Prometheus, Datadog, CloudWatch, custom):
```bash
# Prometheus query examples

# Error rate — canary vs stable
curl -s "$PROMETHEUS_URL/api/v1/query" --data-urlencode \
  'query=sum(rate(http_requests_total{status=~"5..",deployment="canary"}[5m])) / sum(rate(http_requests_total{deployment="canary"}[5m]))' | \
  python3 -c "import json,sys;r=json.load(sys.stdin);print(f'Canary error rate: {r[\"data\"][\"result\"][0][\"value\"][1] if r[\"data\"][\"result\"] else \"no data\"}')"

# Same for baseline
curl -s "$PROMETHEUS_URL/api/v1/query" --data-urlencode \
  'query=sum(rate(http_requests_total{status=~"5..",deployment="stable"}[5m])) / sum(rate(http_requests_total{deployment="stable"}[5m]))' | \
  python3 -c "import json,sys;r=json.load(sys.stdin);print(f'Baseline error rate: {r[\"data\"][\"result\"][0][\"value\"][1] if r[\"data\"][\"result\"] else \"no data\"}')"

# Latency p50/p95/p99
for q in 50 95 99; do
  curl -s "$PROMETHEUS_URL/api/v1/query" --data-urlencode \
    "query=histogram_quantile(0.${q}, sum(rate(http_request_duration_seconds_bucket{deployment=\"canary\"}[5m])) by (le))"
done
```
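If you are running several of these queries, a small Python helper can replace the chained one-liners. A minimal sketch, assuming `PROMETHEUS_URL` is exported and the third-party `requests` package is installed (`prom_query` is a hypothetical name, not part of any canary tooling):

```python
# Minimal sketch of a Prometheus instant-query helper (assumes `requests`).
import os
import requests

PROM = os.environ["PROMETHEUS_URL"]

def prom_query(query: str) -> float | None:
    """Run an instant query; return the first sample value, or None if empty."""
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else None

# Same PromQL as the curl examples above, parameterized by deployment label.
ERROR_RATE = (
    'sum(rate(http_requests_total{{status=~"5..",deployment="{d}"}}[5m]))'
    ' / sum(rate(http_requests_total{{deployment="{d}"}}[5m]))'
)

for deployment in ("canary", "stable"):
    rate = prom_query(ERROR_RATE.format(d=deployment))
    print(f"{deployment} error rate: {rate if rate is not None else 'no data'}")
```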
If Prometheus isn't available, check for:
- Datadog (the v1 query API also needs an application key and a time range):
  ```bash
  curl -s -G "https://api.datadoghq.com/api/v1/query" \
    -H "DD-API-KEY: $DD_API_KEY" -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
    --data-urlencode "query=avg:http.request.duration{deployment:canary}" \
    --data-urlencode "from=$(date -d '30 minutes ago' +%s)" \
    --data-urlencode "to=$(date +%s)"
  ```
- CloudWatch (`get-metric-statistics` requires a time range, period, and statistic):
  ```bash
  aws cloudwatch get-metric-statistics --namespace MyApp --metric-name ErrorRate \
    --dimensions Name=Deployment,Value=canary \
    --start-time "$(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 --statistics Average
  ```
- Application logs: parse error counts from structured logs
#### Step 2: Statistical Comparison
For each metric, calculate:
- Absolute difference: canary_value - baseline_value
- Relative change: (canary - baseline) / baseline × 100%
- Statistical significance: for rates, use a two-proportion z-test (sketched just after this list); for latencies, use Welch's t-test, or Mann-Whitney U if distributions are skewed
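To make the significance step concrete, here is a stdlib-only two-proportion z-test sketch; the request and error counts below are illustrative, not from a real system. For the latency tests, `scipy.stats` provides `ttest_ind(..., equal_var=False)` (Welch) and `mannwhitneyu`.

```python
# Sketch: two-proportion z-test for error rates (normal approximation).
# Counts are illustrative; feed in real request/error totals per population.
from math import sqrt, erf

def two_proportion_z(err_a: int, n_a: int, err_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: equal error rates."""
    p_a, p_b = err_a / n_a, err_b / n_b
    pooled = (err_a + err_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal CDF tail
    return z, p_value

# Canary: 42 errors / 30,000 requests; baseline: 310 errors / 570,000 requests.
z, p = two_proportion_z(42, 30_000, 310, 570_000)
print(f"z={z:.2f}, p={p:.4f} -> {'significant' if p < 0.05 else 'not significant'} at α=0.05")
```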
Decision thresholds (configurable; a sketch of applying them follows this list):
- Error rate increase > 0.1% absolute OR > 10% relative → FAIL
- p95 latency increase > 50ms OR > 15% relative → WARNING
- p99 latency increase > 200ms OR > 25% relative → FAIL
- Business metric (conversion, throughput) decrease > 5% → WARNING
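A sketch of how these thresholds might be applied mechanically; the `judge` helper is hypothetical, and the example values mirror the defaults above:

```python
# Sketch: map one metric pair to PASS/WARNING/FAIL using absolute OR
# relative increase thresholds (for "higher is worse" metrics).
from math import inf

def judge(baseline: float, canary: float,
          warn_abs: float = inf, warn_rel: float = inf,
          fail_abs: float = inf, fail_rel: float = inf) -> str:
    delta = canary - baseline
    rel = delta / baseline if baseline else inf
    if delta > fail_abs or rel > fail_rel:
        return "FAIL"
    if delta > warn_abs or rel > warn_rel:
        return "WARNING"
    return "PASS"

# Error rate: FAIL above +0.1% absolute or +10% relative
print(judge(0.0012, 0.0013, fail_abs=0.001, fail_rel=0.10))  # PASS
# p95 latency (ms): WARNING above +50ms or +15% relative
print(judge(180, 240, warn_abs=50, warn_rel=0.15))           # WARNING
# p99 latency (ms): FAIL above +200ms or +25% relative
print(judge(450, 550, fail_abs=200, fail_rel=0.25))          # PASS
```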
#### Step 3: Log Analysis
```bash
# Compare error log patterns

# Canary errors
kubectl logs -l deployment=canary --since=1h 2>/dev/null | \
  grep -i "error\|exception\|panic\|fatal" | sort | uniq -c | sort -rn | head -20

# Baseline errors
kubectl logs -l deployment=stable --since=1h 2>/dev/null | \
  grep -i "error\|exception\|panic\|fatal" | sort | uniq -c | sort -rn | head -20
```
Look for:
- New error types in canary that don't appear in baseline (strongest signal; a diff sketch follows this list)
- Error rate spike in existing error types
- Timeout patterns or connection refused (infrastructure issues vs code issues)
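A rough sketch of that diff, assuming the raw matched log lines (before the `sort | uniq -c` stages) were saved to the placeholder files named below; the digit-stripping normalization is deliberately naive:

```python
# Sketch: surface error signatures that are new or spiking in canary.
# Input files hold raw matched log lines, one per line, e.g. the output
# of the kubectl|grep pipelines above without the sort/uniq stages.
import re
from collections import Counter

def signatures(path: str) -> Counter:
    """Count error lines, collapsing digits/hex so variants group together."""
    counts: Counter = Counter()
    with open(path) as f:
        for line in f:
            counts[re.sub(r"0x[0-9a-fA-F]+|\d+", "N", line.strip())] += 1
    return counts

canary = signatures("canary_errors.log")      # placeholder path
baseline = signatures("baseline_errors.log")  # placeholder path

for sig, n in (canary - baseline).most_common(10):
    label = "NEW" if sig not in baseline else "SPIKE"
    print(f"[{label}] x{n}  {sig[:120]}")
```

For the spike check you would normally normalize counts by traffic share first, since the canary sees far fewer requests than baseline.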
#### Step 4: Generate Verdict
```markdown
# Canary Analysis Report

## Verdict: PROMOTE / ROLLBACK / HOLD

## Metrics Comparison (last 30 min)

| Metric | Baseline | Canary | Delta | Status |
|--------|----------|--------|-------|--------|
| Error rate | 0.12% | 0.13% | +0.01% (+8%) | ✅ Pass |
| p50 latency | 45ms | 48ms | +3ms (+7%) | ✅ Pass |
| p95 latency | 180ms | 205ms | +25ms (+14%) | ✅ Pass |
| p99 latency | 450ms | 550ms | +100ms (+22%) | ⚠️ Warning |
| Throughput (traffic-normalized) | 1200 rps | 1180 rps | -1.7% | ✅ Pass |

## New Errors in Canary

- `NullPointerException in UserService.getProfile` (23 occurrences)
  → Not present in baseline — likely regression

## Traffic Split

- Canary: 5% (60 rps)
- Baseline: 95% (1140 rps)
- Observation window: 30 min (sufficient for 5% traffic)

## Recommendation

[PROMOTE] Metrics within acceptable thresholds; p99 latency elevated but within warning range.
Monitor p99 closely after full promotion. Investigate the NullPointerException — new error types
are the strongest rollback signal, so treat it as non-blocking only once it is confirmed benign.
```
### 2. `thresholds` — Configure Promotion Criteria
Help define canary promotion thresholds based on SLOs:
- If the team has SLOs → derive thresholds from the remaining error budget (a sketch follows this list)
- If no SLOs → suggest industry defaults (99.9% availability = 0.1% error budget)
- Generate a config file for Argo Rollouts, Flagger, or custom canary controller
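For the SLO-derived case, the arithmetic is small enough to sketch. The 2x burn-rate multiple is a common fast-burn alerting convention, not something this tool mandates, and the function name is hypothetical:

```python
# Sketch: turn an availability SLO into a canary error-rate FAIL threshold.
# A 99.9% SLO leaves a 0.1% error budget; failing the canary when it burns
# budget at more than `burn_rate` times the sustainable rate is a common
# fast-burn convention.
def error_rate_fail_threshold(slo: float, burn_rate: float = 2.0) -> float:
    return (1.0 - slo) * burn_rate

print(f"{error_rate_fail_threshold(0.999):.2%}")   # 0.20% -> fail the canary above this
print(f"{error_rate_fail_threshold(0.9995):.2%}")  # 0.10%
```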
### 3. `progressive` — Design Progressive Delivery Strategy
Given a service profile (traffic volume, criticality, deployment frequency), recommend:
- Traffic split stages (1% → 5% → 25% → 50% → 100%)
- Observation window per stage (sized as in the sketch after this list)
- Automated vs manual promotion gates
- Rollback trigger conditions
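As a sketch of how the per-stage observation window might be sized (standard two-proportion sample-size approximation; the 1200 rps traffic figure and the detectable effect are illustrative):

```python
# Sketch: minimum observation window per traffic stage, sized so the canary
# arm sees enough requests to detect base_rate -> min_detect with roughly
# alpha=0.05 / power=0.8 (normal approximation, per-arm sample size).
def min_window_minutes(total_rps: float, split: float,
                       base_rate: float = 0.001, min_detect: float = 0.002,
                       z_alpha: float = 1.96, z_beta: float = 0.84) -> float:
    p1, p2 = base_rate, min_detect
    n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
    return n / (total_rps * split * 60)

for split in (0.01, 0.05, 0.25, 0.50):
    print(f"{split:.0%} traffic -> observe at least {min_window_minutes(1200, split):.0f} min")
```

Small splits dominate total rollout time, which is why low-traffic services often skip the 1% stage or hold it much longer.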
