Chaos Test Designer
v1.0.0
Design chaos engineering experiments that safely test your system's resilience. Define steady-state hypotheses, inject controlled failures (service crashes, network partitions, resource exhaustion, dependency outages), measure impact, and produce runnable experiment definitions for Chaos Monkey, Litmus, Gremlin, or plain scripts.
Use when: "design chaos test", "test system resilience", "what happens if this service dies", "failure injection", "game day planning", "chaos engineering", "test our fallbacks", or before declaring a service production-ready.
Commands
1. design — Create Chaos Experiment
Step 1: Understand the System
# Discover services and dependencies
kubectl get deployments -A 2>/dev/null | grep -v kube-system
docker compose config --services 2>/dev/null
# Or read architecture docs
find . -maxdepth 3 -name "*.md" | xargs grep -li "architecture\|topology\|dependency" 2>/dev/null
Map the dependency graph:
- Which services call which?
- What are the single points of failure?
- Where are the circuit breakers, retries, fallbacks?
- What external dependencies exist (databases, caches, queues, third-party APIs)?
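One rough way to approximate the call graph from cluster state, assuming dependencies are wired through *_HOST / *_URL-style environment variables (a common convention, not a guarantee):
#!/usr/bin/env bash
# Heuristic dependency map: list env vars that look like service endpoints.
for deploy in $(kubectl get deployments -o jsonpath='{.items[*].metadata.name}'); do
  echo "== $deploy =="
  kubectl get deployment "$deploy" -o \
    jsonpath='{range .spec.template.spec.containers[*].env[*]}{.name}={.value}{"\n"}{end}' \
    | grep -E '_(HOST|URL|ADDR)=' || echo "  (no obvious endpoint env vars)"
done
Anything this misses (hardcoded URLs, service mesh routing, third-party SaaS) still has to come from architecture docs or the team's heads.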
Step 2: Define Steady-State Hypothesis
Before breaking anything, define what "normal" looks like:
## Steady-State Hypothesis
- Homepage loads in < 500ms (p95)
- API error rate < 0.1%
- Orders processed within 30 seconds of submission
- Background jobs backlog < 100 items
- All health check endpoints return 200
This is the baseline you'll compare against during the experiment.
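A minimal probe script for this baseline, assuming Prometheus is reachable at $PROMETHEUS; the service names (web, api) and the route label are placeholders to swap for your own endpoints and queries:
#!/usr/bin/env bash
# Steady-state check: exit non-zero if any baseline condition fails.
set -euo pipefail
PROMETHEUS="${PROMETHEUS:?Set PROMETHEUS to your Prometheus base URL}"

# All health endpoints must return 200 (service names are placeholders).
for url in "http://web:8080/health" "http://api:8080/health"; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)
  [ "$code" = "200" ] || { echo "FAIL: $url returned $code"; exit 1; }
done

# Homepage p95 latency must stay under 500ms (0.5s).
P95=$(curl -s "$PROMETHEUS/api/v1/query" --data-urlencode \
  'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{route="/"}[5m]))' | \
  python3 -c "import json,sys;print(json.load(sys.stdin)['data']['result'][0]['value'][1])")
python3 -c "import sys; sys.exit(0 if float('$P95') < 0.5 else 1)" || \
  { echo "FAIL: homepage p95 ${P95}s >= 0.5s"; exit 1; }

echo "Steady state holds."
Run this before injection to confirm the baseline, and again during and after the experiment to detect drift.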
Step 3: Select Failure Mode
Common failure modes ranked by severity:
Level 1 — Service Failures (start here)
- Kill a single pod/container instance
- Restart a service with delay
- Reduce replica count to 1
Level 2 — Network Failures
- Add latency (100ms, 500ms, 2000ms) to inter-service calls
- Drop 10% of packets to a specific service
- DNS resolution failures
- Block traffic to a specific dependency
Level 3 — Resource Exhaustion
- Fill disk to 95%
- Consume all available memory (OOM scenarios)
- Saturate CPU
- Exhaust database connection pool
- Fill message queue to capacity
Level 4 — Dependency Failures
- External API returns 500 for all requests
- Database becomes read-only
- Cache becomes unavailable
- Message broker stops accepting messages
Level 5 — Infrastructure Failures (advanced)
- Availability zone failure (kill all resources in one AZ)
- Region failover
- Complete network partition between services
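As a concrete example, a Level 2 latency injection can be as small as a tc netem rule. This sketch requires NET_ADMIN and must run in the target's network namespace; the interface name eth0 is an assumption (check with `ip link`):
#!/usr/bin/env bash
# Add 500ms latency to all egress traffic on eth0, observe, then remove it.
tc qdisc add dev eth0 root netem delay 500ms
sleep 300   # run steady-state probes during this window
tc qdisc del dev eth0 root netem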
Step 4: Generate Experiment Definition
# Chaos Experiment: [Service] [Failure Type]
# Generated by chaos-test-designer
experiment:
  name: "payment-service-pod-kill"
  description: "Kill a payment-service pod to verify retry logic and circuit breaker"
  steady_state:
    - probe: http
      url: "http://payment-service:8080/health"
      expect_status: 200
    - probe: prometheus
      query: "rate(http_requests_total{service='payment',status=~'5..'}[1m])"
      expect: "< 0.01"
  method:
    - action: kill-pod
      target:
        namespace: production
        label_selector: "app=payment-service"
      count: 1
  rollback:
    - action: scale
      target:
        namespace: production
        deployment: payment-service
      replicas: 3
  controls:
    blast_radius: "single pod in production"
    duration: "5 minutes"
    abort_conditions:
      - "error_rate > 5%"
      - "p99_latency > 10s"
    business_hours_only: true
For Kubernetes (Litmus):
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-chaos
spec:
  appinfo:
    appns: production
    applabel: app=payment-service
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "300"
            - name: CHAOS_INTERVAL
              value: "60"
            - name: FORCE
              value: "false"
For plain bash:
#!/usr/bin/env bash
# Chaos: Kill payment-service pod
set -euo pipefail
PROMETHEUS="${PROMETHEUS:?Set PROMETHEUS to your Prometheus base URL}"

echo "📊 Capturing steady state..."
BASELINE_ERROR_RATE=$(curl -s "$PROMETHEUS/api/v1/query" --data-urlencode \
  'query=rate(http_requests_total{service="payment",status=~"5.."}[1m])' | \
  python3 -c "import json,sys;print(json.load(sys.stdin)['data']['result'][0]['value'][1])")
echo "Baseline error rate: $BASELINE_ERROR_RATE"

echo "💥 Injecting failure: killing one payment-service pod..."
POD=$(kubectl get pods -l app=payment-service -o jsonpath='{.items[0].metadata.name}')
kubectl delete pod "$POD" --grace-period=0 --force

echo "⏱️ Observing for 5 minutes..."
sleep 300

echo "📊 Measuring impact..."
POST_ERROR_RATE=$(curl -s "$PROMETHEUS/api/v1/query" --data-urlencode \
  'query=rate(http_requests_total{service="payment",status=~"5.."}[1m])' | \
  python3 -c "import json,sys;print(json.load(sys.stdin)['data']['result'][0]['value'][1])")
echo "Post-chaos error rate: $POST_ERROR_RATE"

echo "✅ Verifying recovery..."
kubectl get pods -l app=payment-service
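The fixed sleep 300 ignores the abort conditions defined in the controls block. A sketch of an observation loop that rolls back early instead, under the same $PROMETHEUS assumption as above:
# Drop-in replacement for `sleep 300`: poll every 10s, abort past 5% errors.
for i in $(seq 1 30); do
  RATE=$(curl -s "$PROMETHEUS/api/v1/query" --data-urlencode \
    'query=rate(http_requests_total{service="payment",status=~"5.."}[1m])' | \
    python3 -c "import json,sys;print(json.load(sys.stdin)['data']['result'][0]['value'][1])")
  python3 -c "import sys; sys.exit(1 if float('$RATE') > 0.05 else 0)" || {
    echo "🛑 ABORT: error rate $RATE exceeds 5%, rolling back"
    kubectl scale deployment payment-service --replicas=3
    exit 1
  }
  sleep 10
done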
2. gameday — Plan a Game Day
Generate a full game day schedule:
- Pre-game briefing (objectives, safety controls, escalation contacts)
- Experiment sequence (ordered by risk, with breaks between)
- Observation assignments (who monitors what dashboard)
- Go/no-go criteria between experiments
- Post-game debrief template
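A minimal schedule skeleton; times, failure levels, and roles are all placeholders:
## Game Day: [date]
- 09:30 Briefing: objectives, abort criteria, escalation contacts
- 10:00 Experiment 1 (Level 1: pod kill). Observers: [names/dashboards]
- 10:30 Go/no-go checkpoint: steady state restored?
- 11:00 Experiment 2 (Level 2: 500ms latency). Observers: [names/dashboards]
- 11:30 Go/no-go checkpoint
- 13:00 Debrief: findings, action items, scope for next game day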
3. audit — Assess Chaos Readiness
Before running chaos experiments, verify the system has:
- Health check endpoints on every service
- Monitoring and alerting in place
- Circuit breakers or retry logic
- Graceful degradation modes
- Runbooks for common failures
- Rollback procedures tested recently
Score readiness from 0-100 and recommend prerequisites to close before the first chaos experiment.
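The first two items can be partially automated. A rough sketch, assuming services answer /health on port 80 and Prometheus is reachable at $PROMETHEUS; the remaining checklist items need human review:
#!/usr/bin/env bash
# Rough readiness probe: health endpoints plus presence of alerting rules.
PROMETHEUS="${PROMETHEUS:?Set PROMETHEUS to your Prometheus base URL}"
for svc in $(kubectl get services -o jsonpath='{.items[*].metadata.name}'); do
  kubectl run "probe-$svc" --rm -i --restart=Never --image=curlimages/curl -- \
    -sf "http://$svc/health" >/dev/null 2>&1 \
    && echo "OK   $svc /health" || echo "MISS $svc /health"
done
# Count alerting/recording rules; zero means no alerting is wired up.
curl -s "$PROMETHEUS/api/v1/rules" | python3 -c \
  "import json,sys;print('rules defined:', sum(len(g['rules']) for g in json.load(sys.stdin)['data']['groups']))"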
4. report — Analyze Experiment Results
After running an experiment, produce a findings report:
- Did the steady-state hypothesis hold?
- What broke? Was it expected?
- How long until the system self-healed?
- How did the actual blast radius compare to the expected one?
- Remediation recommendations (add circuit breaker, fix retry logic, add redundancy)
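A report skeleton; every field is a placeholder to fill from your own run:
## Chaos Experiment Report: [experiment name]
- Date / operators: [who ran it, who observed]
- Hypothesis held: [yes/no; which steady-state probe failed, and by how much]
- Failure injected: [what, where, duration]
- Time to self-heal: [duration, or "manual intervention required"]
- Blast radius: expected [scope], actual [scope]
- Remediation: [fix, owner, due date]
- Re-test date: [when the experiment will be re-run after the fix]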
