Install
openclaw skills install chaos-test-designer
Design chaos engineering experiments that safely test your system's resilience across services, networks, and infrastructure. Define steady-state hypotheses, inject controlled failures (service crashes, network partitions, resource exhaustion, dependency outages), set blast radius controls and rollback procedures, measure impact, and produce runnable experiment definitions for Chaos Monkey, Litmus, Gremlin, or plain scripts.
Use when: "design chaos test", "test system resilience", "what happens if this service dies", "failure injection", "game day planning", "chaos engineering", "test our fallbacks", or before declaring a service production-ready.
design — Create Chaos Experiment
# Discover services and dependencies
kubectl get deployments -A 2>/dev/null | grep -v kube-system
docker compose config --services 2>/dev/null
# Or read architecture docs
find . -maxdepth 3 -name "*.md" | xargs grep -li "architecture\|topology\|dependency" 2>/dev/null
Map the dependency graph:
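One way to hold the graph is an adjacency map from each service to the services it calls. Reversing the edges then answers "who is affected if this service dies". The sketch below is a minimal illustration; the service names and edges are hypothetical and are not taken from any real topology.

```python
from collections import deque

# Hypothetical services: each key calls the services in its list.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["payment-service", "orders"],
    "orders": ["payment-service", "postgres"],
    "payment-service": ["postgres"],
    "postgres": [],
}

def impacted_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Reverse the edges so we can walk from a callee to its callers.
    callers = {name: [] for name in depends_on}
    for caller, callees in depends_on.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)

    # Breadth-first walk over the reversed graph.
    seen, queue = set(), deque([failed])
    while queue:
        for caller in callers.get(queue.popleft(), []):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen

print(sorted(impacted_by("payment-service", DEPENDS_ON)))  # ['api', 'orders', 'web']
```

The size of the returned set is a rough first estimate of an experiment's blast radius.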
Before breaking anything, define what "normal" looks like:
## Steady-State Hypothesis
- Homepage loads in < 500ms (p95)
- API error rate < 0.1%
- Orders processed within 30 seconds of submission
- Background jobs backlog < 100 items
- All health check endpoints return 200
This is the baseline you'll compare against during the experiment.
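The hypothesis above can also be written as data so the same check runs before and after the experiment. This is a sketch under assumptions: the metric names are hypothetical, and how each value is collected (Prometheus, load test, health probe) is left out.

```python
import operator

# Thresholds come from the hypothesis above; the metric names are hypothetical.
HYPOTHESIS = {
    "homepage_p95_ms": ("<", 500),
    "api_error_rate": ("<", 0.001),        # 0.1% expressed as a ratio
    "order_processing_s": ("<=", 30),
    "job_backlog": ("<", 100),
    "failing_health_checks": ("==", 0),
}
OPS = {"<": operator.lt, "<=": operator.le, "==": operator.eq}

def violations(measured, hypothesis=HYPOTHESIS):
    """Return the names of metrics that are missing or break their threshold."""
    broken = []
    for name, (op, limit) in hypothesis.items():
        value = measured.get(name)
        if value is None or not OPS[op](value, limit):
            broken.append(name)
    return broken

sample = {
    "homepage_p95_ms": 420,
    "api_error_rate": 0.0004,
    "order_processing_s": 12,
    "job_backlog": 37,
    "failing_health_checks": 0,
}
print(violations(sample))  # []
```

A missing metric counts as a violation here, on the assumption that an experiment should not start when the baseline cannot be observed.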
Common failure modes ranked by severity:
Level 1 — Service Failures (start here)
Level 2 — Network Failures
Level 3 — Resource Exhaustion
Level 4 — Dependency Failures
Level 5 — Infrastructure Failures (advanced)
# Chaos Experiment: [Service] [Failure Type]
# Generated by chaos-test-designer
experiment:
  name: "payment-service-pod-kill"
  description: "Kill payment service pod to verify retry logic and circuit breaker"
  steady_state:
    - probe: http
      url: "http://payment-service:8080/health"
      expect_status: 200
    - probe: prometheus
      query: "rate(http_requests_total{service='payment',status='5xx'}[1m])"
      expect: "< 0.01"
  method:
    - action: kill-pod
      target:
        namespace: production
        label_selector: "app=payment-service"
      count: 1
  rollback:
    - action: scale
      target:
        namespace: production
        deployment: payment-service
      replicas: 3
  controls:
    blast_radius: "single pod in production"
    duration: "5 minutes"
    abort_conditions:
      - "error_rate > 5%"
      - "p99_latency > 10s"
    business_hours_only: true
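The `abort_conditions` strings in the definition above need to be evaluated against live metrics while the experiment runs. The following is a minimal sketch of one way to do that. The parsing rules are an assumption (only `>`, `>=`, `<`, `<=` and the units `%`, `s`, `ms` are handled) and are not the skill's actual evaluator.

```python
import re

# name, comparison, number, optional unit: e.g. "error_rate > 5%"
_PATTERN = re.compile(r"^\s*(\w+)\s*(>=|<=|>|<)\s*(\d+(?:\.\d+)?)\s*(%|ms|s)?\s*$")
# Normalise percentages to ratios and durations to seconds.
_DIVISOR = {"%": 100.0, "ms": 1000.0, "s": 1.0, None: 1.0}
_COMPARE = {
    ">": lambda a, b: a > b,
    ">=": lambda a, b: a >= b,
    "<": lambda a, b: a < b,
    "<=": lambda a, b: a <= b,
}

def parse_condition(text):
    """Split a condition string into (metric name, operator, normalised limit)."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unsupported abort condition: {text!r}")
    name, op, number, unit = match.groups()
    return name, op, float(number) / _DIVISOR[unit]

def should_abort(conditions, metrics):
    """True if any condition holds; `metrics` uses ratios and seconds."""
    for text in conditions:
        name, op, limit = parse_condition(text)
        if name in metrics and _COMPARE[op](metrics[name], limit):
            return True
    return False

conditions = ["error_rate > 5%", "p99_latency > 10s"]
print(should_abort(conditions, {"error_rate": 0.02, "p99_latency": 12.5}))  # True
```

A metric that is absent from `metrics` is skipped in this sketch; a stricter runner might treat missing data as a reason to abort.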
For Kubernetes (Litmus):
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-chaos
spec:
  appinfo:
    appns: production
    applabel: app=payment-service
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "300"
            - name: CHAOS_INTERVAL
              value: "60"
            - name: FORCE
              value: "false"
For plain bash:
#!/usr/bin/env bash
# Chaos: Kill payment-service pod
set -euo pipefail

# Fail early with a clear message if the Prometheus base URL is not set
: "${PROMETHEUS:?set PROMETHEUS to the Prometheus base URL}"
# sum() adds up all 5xx series (500, 503, ...) into a single value
QUERY='sum(rate(http_requests_total{service="payment",status=~"5.."}[1m]))'

# Print the current 5xx rate, or 0 when the query returns no series
error_rate() {
  curl -sf "$PROMETHEUS/api/v1/query" --data-urlencode "query=$QUERY" | \
    python3 -c "import json,sys;r=json.load(sys.stdin)['data']['result'];print(r[0]['value'][1] if r else 0)"
}

echo "📊 Capturing steady state..."
BASELINE_ERROR_RATE=$(error_rate)
echo "Baseline error rate: $BASELINE_ERROR_RATE"

echo "💥 Injecting failure: killing one payment-service pod..."
POD=$(kubectl get pods -l app=payment-service -o jsonpath='{.items[0].metadata.name}')
kubectl delete pod "$POD" --grace-period=0

echo "⏱️ Observing for 5 minutes..."
sleep 300

echo "📊 Measuring impact..."
POST_ERROR_RATE=$(error_rate)
echo "Post-chaos error rate: $POST_ERROR_RATE"

echo "✅ Verifying recovery..."
kubectl get pods -l app=payment-service
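The script above prints the two error rates but leaves the comparison to the reader. The sketch below shows one way to turn the Prometheus responses into a pass/fail verdict. It relies on the standard instant-query response shape (`data.result[].value[1]`); the `tolerance` parameter and the function names are hypothetical.

```python
import json

def total_rate(response_text):
    """Sum the sample values of a Prometheus instant-query vector response."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    # An empty result list means no matching series, which sums to 0.
    return sum(float(series["value"][1]) for series in body["data"]["result"])

def hypothesis_held(baseline, after, tolerance=0.001):
    """True if the post-chaos rate rose by at most `tolerance` over the baseline."""
    return after - baseline <= tolerance

# Hand-written sample responses, not real measurements.
before = '{"status":"success","data":{"resultType":"vector","result":[]}}'
after = (
    '{"status":"success","data":{"resultType":"vector","result":['
    '{"metric":{"status":"500"},"value":[1700000000,"0.02"]},'
    '{"metric":{"status":"503"},"value":[1700000000,"0.01"]}]}}'
)
print(hypothesis_held(total_rate(before), total_rate(after)))  # False
```

Summing every returned series avoids reading only the first one when several 5xx status codes are present.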
gameday — Plan a Game Day
Generate a full game day schedule:
audit — Assess Chaos Readiness
Before running chaos experiments, verify the system has:
Score readiness 0-100 and recommend prerequisites before first chaos experiment.
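A 0-100 score can be computed as a weighted checklist. The sketch below only illustrates the arithmetic: the prerequisite names and weights are placeholders invented for the example and are not the skill's actual audit criteria.

```python
# Hypothetical prerequisites and weights (the weights sum to 100).
WEIGHTS = {
    "monitoring_and_alerting": 30,
    "health_checks": 20,
    "tested_rollback": 20,
    "runbooks": 15,
    "on_call_coverage": 15,
}

def readiness(checks):
    """Return (score 0-100, missing prerequisites with the heaviest first)."""
    score = sum(weight for name, weight in WEIGHTS.items() if checks.get(name))
    missing = sorted(
        (name for name in WEIGHTS if not checks.get(name)),
        key=lambda name: -WEIGHTS[name],
    )
    return score, missing

score, missing = readiness({"monitoring_and_alerting": True, "health_checks": True})
print(score, missing)  # 50 ['tested_rollback', 'runbooks', 'on_call_coverage']
```

Listing the missing items by weight gives the recommended order for closing gaps before the first experiment.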
report — Analyze Experiment Results
After running an experiment, produce a findings report: