Install
openclaw skills install canary-deployment-analyzerAnalyze canary deployments by comparing metrics between canary and baseline. Provide data-driven promotion/rollback recommendations based on error rates, latency percentiles, and custom business metrics.
openclaw skills install canary-deployment-analyzerAnalyze canary deployments to decide whether to promote or rollback. Compare error rates, latency distributions, business metrics, and log patterns between canary and baseline populations — then give a data-driven recommendation.
Use when: "analyze canary", "should we promote this canary", "compare canary metrics", "canary vs baseline", "is this deploy safe to promote", "canary health check", or during progressive delivery decisions.
analyze — Full Canary AnalysisIdentify the metrics source (Prometheus, Datadog, CloudWatch, custom):
# Prometheus query examples
# Error rate — canary vs stable
curl -s "$PROMETHEUS_URL/api/v1/query" --data-urlencode \
'query=sum(rate(http_requests_total{status=~"5..",deployment="canary"}[5m])) / sum(rate(http_requests_total{deployment="canary"}[5m]))' | \
python3 -c "import json,sys;r=json.load(sys.stdin);print(f'Canary error rate: {r[\"data\"][\"result\"][0][\"value\"][1] if r[\"data\"][\"result\"] else \"no data\"}')"
# Same for baseline
curl -s "$PROMETHEUS_URL/api/v1/query" --data-urlencode \
'query=sum(rate(http_requests_total{status=~"5..",deployment="stable"}[5m])) / sum(rate(http_requests_total{deployment="stable"}[5m]))' | \
python3 -c "import json,sys;r=json.load(sys.stdin);print(f'Baseline error rate: {r[\"data\"][\"result\"][0][\"value\"][1] if r[\"data\"][\"result\"] else \"no data\"}')"
# Latency p50/p95/p99
for q in 50 95 99; do
curl -s "$PROMETHEUS_URL/api/v1/query" --data-urlencode \
"query=histogram_quantile(0.${q}, sum(rate(http_request_duration_seconds_bucket{deployment=\"canary\"}[5m])) by (le))"
done
If no Prometheus, check for:
curl -s "https://api.datadoghq.com/api/v1/query" -H "DD-API-KEY: $DD_API_KEY" --data-urlencode "query=avg:http.request.duration{deployment:canary}"aws cloudwatch get-metric-statistics --namespace MyApp --metric-name ErrorRate --dimensions Name=Deployment,Value=canaryFor each metric, calculate:
Decision thresholds (configurable):
# Compare error log patterns
# Canary errors
kubectl logs -l deployment=canary --since=1h 2>/dev/null | grep -i "error\|exception\|panic\|fatal" | \
sort | uniq -c | sort -rn | head -20
# Baseline errors
kubectl logs -l deployment=stable --since=1h 2>/dev/null | grep -i "error\|exception\|panic\|fatal" | \
sort | uniq -c | sort -rn | head -20
Look for:
# Canary Analysis Report
## Verdict: PROMOTE / ROLLBACK / HOLD
## Metrics Comparison (last 30 min)
| Metric | Baseline | Canary | Delta | Status |
|--------|----------|--------|-------|--------|
| Error rate | 0.12% | 0.14% | +0.02% | ✅ Pass |
| p50 latency | 45ms | 48ms | +3ms | ✅ Pass |
| p95 latency | 180ms | 210ms | +30ms | ✅ Pass |
| p99 latency | 450ms | 620ms | +170ms | ⚠️ Warning |
| Throughput | 1200 rps | 1180 rps | -1.7% | ✅ Pass |
## New Errors in Canary
- `NullPointerException in UserService.getProfile` (23 occurrences)
→ Not present in baseline — likely regression
## Traffic Split
- Canary: 5% (60 rps)
- Baseline: 95% (1140 rps)
- Observation window: 30 min (sufficient for 5% traffic)
## Recommendation
[PROMOTE] Metrics within acceptable thresholds. p99 latency elevated but within warning range.
Monitor p99 closely after full promotion. Investigate NullPointerException — non-blocking but should be tracked.
thresholds — Configure Promotion CriteriaHelp define canary promotion thresholds based on SLOs:
progressive — Design Progressive Delivery StrategyGiven a service profile (traffic volume, criticality, deployment frequency), recommend: