Chaos Test Designer
v1.0.0
Design chaos engineering experiments that safely test your system's resilience. Define steady-state hypotheses, inject controlled failures (service crashes, network partitions, resource exhaustion, dependency outages), measure impact, and produce runnable experiment definitions for Chaos Monkey, Litmus, Gremlin, or plain scripts.
Use when: "design chaos test", "test system resilience", "what happens if this service dies", "failure injection", "game day planning", "chaos engineering", "test our fallbacks", or before declaring a service production-ready.
Commands
1. design — Create Chaos Experiment
Step 1: Understand the System
# Discover services and dependencies
kubectl get deployments -A 2>/dev/null | grep -v kube-system
docker compose config --services 2>/dev/null
# Or read architecture docs
find . -maxdepth 3 -name "*.md" | xargs grep -li "architecture\|topology\|dependency" 2>/dev/null
Map the dependency graph:
- Which services call which?
- What are the single points of failure?
- Where are the circuit breakers, retries, fallbacks?
- What external dependencies exist (databases, caches, queues, third-party APIs)?
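One rough way to approximate the call graph from cluster state, assuming dependencies are wired through *_HOST / *_URL-style environment variables (a common convention, not a guarantee):
#!/usr/bin/env bash
# Heuristic dependency map: list env vars that look like service endpoints.
for deploy in $(kubectl get deployments -o jsonpath='{.items[*].metadata.name}'); do
  echo "== $deploy =="
  kubectl get deployment "$deploy" -o \
    jsonpath='{range .spec.template.spec.containers[*].env[*]}{.name}={.value}{"\n"}{end}' \
    | grep -E '_(HOST|URL|ADDR)=' || echo "  (no obvious endpoint env vars)"
done
Anything this misses (hardcoded URLs, service mesh routing, third-party SaaS) still has to come from architecture docs or the team's heads.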
Step 2: Define Steady-State Hypothesis
Before breaking anything, define what "normal" looks like:
## Steady-State Hypothesis
- Homepage loads in < 500ms (p95)
- API error rate < 0.1%
- Orders processed within 30 seconds of submission
- Background jobs backlog < 100 items
- All health check endpoints return 200
This is the baseline you'll compare against during the experiment.
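A minimal probe script for this baseline, assuming Prometheus is reachable at $PROMETHEUS; the service names (web, api) and the route label are placeholders to swap for your own endpoints and queries:
#!/usr/bin/env bash
# Steady-state check: exit non-zero if any baseline condition fails.
set -euo pipefail
PROMETHEUS="${PROMETHEUS:?Set PROMETHEUS to your Prometheus base URL}"

# All health endpoints must return 200 (service names are placeholders).
for url in "http://web:8080/health" "http://api:8080/health"; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)
  [ "$code" = "200" ] || { echo "FAIL: $url returned $code"; exit 1; }
done

# Homepage p95 latency must stay under 500ms (0.5s).
P95=$(curl -s "$PROMETHEUS/api/v1/query" --data-urlencode \
  'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{route="/"}[5m]))' | \
  python3 -c "import json,sys;print(json.load(sys.stdin)['data']['result'][0]['value'][1])")
python3 -c "import sys; sys.exit(0 if float('$P95') < 0.5 else 1)" || \
  { echo "FAIL: homepage p95 ${P95}s >= 0.5s"; exit 1; }

echo "Steady state holds."
Run this before injection to confirm the baseline, and again during and after the experiment to detect drift.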
Step 3: Select Failure Mode
Common failure modes ranked by severity:
Level 1 — Service Failures (start here)
- Kill a single pod/container instance
- Restart a service with delay
- Reduce replica count to 1
Level 2 — Network Failures
- Add latency (100ms, 500ms, 2000ms) to inter-service calls
- Drop 10% of packets to a specific service
- DNS resolution failures
- Block traffic to a specific dependency
Level 3 — Resource Exhaustion
- Fill disk to 95%
- Consume all available memory (OOM scenarios)
- Saturate CPU
- Exhaust database connection pool
- Fill message queue to capacity
Level 4 — Dependency Failures
- External API returns 500 for all requests
- Database becomes read-only
- Cache becomes unavailable
- Message broker stops accepting messages
Level 5 — Infrastructure Failures (advanced)
- Availability zone failure (kill all resources in one AZ)
- Region failover
- Complete network partition between services
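As a concrete example, a Level 2 latency injection can be as small as a tc netem rule. This sketch requires NET_ADMIN and must run in the target's network namespace; the interface name eth0 is an assumption (check with `ip link`):
#!/usr/bin/env bash
# Add 500ms latency to all egress traffic on eth0, observe, then remove it.
tc qdisc add dev eth0 root netem delay 500ms
sleep 300   # run steady-state probes during this window
tc qdisc del dev eth0 root netem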
Step 4: Generate Experiment Definition
# Chaos Experiment: [Service] [Failure Type]
# Generated by chaos-test-designer
experiment:
  name: "payment-service-pod-kill"
  description: "Kill a payment-service pod to verify retry logic and circuit breaker"
  steady_state:
    - probe: http
      url: "http://payment-service:8080/health"
      expect_status: 200
    - probe: prometheus
      query: "rate(http_requests_total{service='payment',status=~'5..'}[1m])"
      expect: "< 0.01"
  method:
    - action: kill-pod
      target:
        namespace: production
        label_selector: "app=payment-service"
      count: 1
  rollback:
    - action: scale
      target:
        namespace: production
        deployment: payment-service
      replicas: 3
  controls:
    blast_radius: "single pod in production"
    duration: "5 minutes"
    abort_conditions:
      - "error_rate > 5%"
      - "p99_latency > 10s"
    business_hours_only: true
For Kubernetes (Litmus):
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-chaos
spec:
  appinfo:
    appns: production
    applabel: app=payment-service
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "300"
            - name: CHAOS_INTERVAL
              value: "60"
            - name: FORCE
              value: "false"
For plain bash:
#!/usr/bin/env bash
# Chaos: Kill payment-service pod
set -euo pipefail
PROMETHEUS="${PROMETHEUS:?Set PROMETHEUS to your Prometheus base URL}"

echo "📊 Capturing steady state..."
BASELINE_ERROR_RATE=$(curl -s "$PROMETHEUS/api/v1/query" --data-urlencode \
  'query=rate(http_requests_total{service="payment",status=~"5.."}[1m])' | \
  python3 -c "import json,sys;print(json.load(sys.stdin)['data']['result'][0]['value'][1])")
echo "Baseline error rate: $BASELINE_ERROR_RATE"

echo "💥 Injecting failure: killing one payment-service pod..."
POD=$(kubectl get pods -l app=payment-service -o jsonpath='{.items[0].metadata.name}')
kubectl delete pod "$POD" --grace-period=0 --force

echo "⏱️ Observing for 5 minutes..."
sleep 300

echo "📊 Measuring impact..."
POST_ERROR_RATE=$(curl -s "$PROMETHEUS/api/v1/query" --data-urlencode \
  'query=rate(http_requests_total{service="payment",status=~"5.."}[1m])' | \
  python3 -c "import json,sys;print(json.load(sys.stdin)['data']['result'][0]['value'][1])")
echo "Post-chaos error rate: $POST_ERROR_RATE"

echo "✅ Verifying recovery..."
kubectl get pods -l app=payment-service
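The fixed sleep 300 ignores the abort conditions defined in the controls block. A sketch of an observation loop that rolls back early instead, under the same $PROMETHEUS assumption as above:
# Drop-in replacement for `sleep 300`: poll every 10s, abort past 5% errors.
for i in $(seq 1 30); do
  RATE=$(curl -s "$PROMETHEUS/api/v1/query" --data-urlencode \
    'query=rate(http_requests_total{service="payment",status=~"5.."}[1m])' | \
    python3 -c "import json,sys;print(json.load(sys.stdin)['data']['result'][0]['value'][1])")
  python3 -c "import sys; sys.exit(1 if float('$RATE') > 0.05 else 0)" || {
    echo "🛑 ABORT: error rate $RATE exceeds 5%, rolling back"
    kubectl scale deployment payment-service --replicas=3
    exit 1
  }
  sleep 10
done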
2. gameday — Plan a Game Day
Generate a full game day schedule:
- Pre-game briefing (objectives, safety controls, escalation contacts)
- Experiment sequence (ordered by risk, with breaks between)
- Observation assignments (who monitors what dashboard)
- Go/no-go criteria between experiments
- Post-game debrief template
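A minimal schedule skeleton; times, failure levels, and roles are all placeholders:
## Game Day: [date]
- 09:30 Briefing: objectives, abort criteria, escalation contacts
- 10:00 Experiment 1 (Level 1: pod kill). Observers: [names/dashboards]
- 10:30 Go/no-go checkpoint: steady state restored?
- 11:00 Experiment 2 (Level 2: 500ms latency). Observers: [names/dashboards]
- 11:30 Go/no-go checkpoint
- 13:00 Debrief: findings, action items, scope for next game day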
3. audit — Assess Chaos Readiness
Before running chaos experiments, verify the system has:
- Health check endpoints on every service
- Monitoring and alerting in place
- Circuit breakers or retry logic
- Graceful degradation modes
- Runbooks for common failures
- Rollback procedures tested recently
Score readiness from 0-100 and recommend prerequisites to close before the first chaos experiment.
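The first two items can be partially automated. A rough sketch, assuming services answer /health on port 80 and Prometheus is reachable at $PROMETHEUS; the remaining checklist items need human review:
#!/usr/bin/env bash
# Rough readiness probe: health endpoints plus presence of alerting rules.
PROMETHEUS="${PROMETHEUS:?Set PROMETHEUS to your Prometheus base URL}"
for svc in $(kubectl get services -o jsonpath='{.items[*].metadata.name}'); do
  kubectl run "probe-$svc" --rm -i --restart=Never --image=curlimages/curl -- \
    -sf "http://$svc/health" >/dev/null 2>&1 \
    && echo "OK   $svc /health" || echo "MISS $svc /health"
done
# Count alerting/recording rules; zero means no alerting is wired up.
curl -s "$PROMETHEUS/api/v1/rules" | python3 -c \
  "import json,sys;print('rules defined:', sum(len(g['rules']) for g in json.load(sys.stdin)['data']['groups']))"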
4. report — Analyze Experiment Results
After running an experiment, produce a findings report:
- Did the steady-state hypothesis hold?
- What broke? Was it expected?
- How long until the system self-healed?
- How did the actual blast radius compare to the expected one?
- Remediation recommendations (add circuit breaker, fix retry logic, add redundancy)
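A report skeleton; every field is a placeholder to fill from your own run:
## Chaos Experiment Report: [experiment name]
- Date / operators: [who ran it, who observed]
- Hypothesis held: [yes/no; which steady-state probe failed, and by how much]
- Failure injected: [what, where, duration]
- Time to self-heal: [duration, or "manual intervention required"]
- Blast radius: expected [scope], actual [scope]
- Remediation: [fix, owner, due date]
- Re-test date: [when the experiment will be re-run after the fix]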
