Chaos Test Designer

Design chaos engineering experiments to test system resilience. Generate failure injection scenarios and define steady-state hypotheses, blast radius controls, and rollback procedures for services, networks, and infrastructure.

Install

openclaw skills install chaos-test-designer

Design chaos engineering experiments that safely test your system's resilience. Define steady-state hypotheses, inject controlled failures (service crashes, network partitions, resource exhaustion, dependency outages), measure impact, and produce runnable experiment definitions for Chaos Monkey, Litmus, Gremlin, or plain scripts.

Use when: "design chaos test", "test system resilience", "what happens if this service dies", "failure injection", "game day planning", "chaos engineering", "test our fallbacks", or before declaring a service production-ready.

Commands

1. design — Create Chaos Experiment

Step 1: Understand the System

# Discover services and dependencies
kubectl get deployments -A 2>/dev/null | grep -v kube-system
docker compose config --services 2>/dev/null
# Or read architecture docs
find . -maxdepth 3 -name "*.md" -exec grep -li "architecture\|topology\|dependency" {} + 2>/dev/null

Map the dependency graph:

  • Which services call which?
  • What are the single points of failure?
  • Where are the circuit breakers, retries, fallbacks?
  • What external dependencies exist (databases, caches, queues, third-party APIs)?
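One way to answer these questions is to write the dependency map down as data and compute which failures would reach the most services. The sketch below is a minimal illustration: the service names and the `DEPENDENCIES` map are hypothetical and would be replaced by whatever the discovery step turned up.

```python
from collections import defaultdict

# Hypothetical dependency map: service -> services it calls directly.
DEPENDENCIES = {
    "web": ["api"],
    "api": ["payment", "cache", "orders-db"],
    "payment": ["payment-gateway", "orders-db"],
    "worker": ["queue", "orders-db"],
}


def impacted_by(graph, failed):
    """Return every service that depends on `failed`, directly or transitively."""
    callers = defaultdict(set)
    for service, deps in graph.items():
        for dep in deps:
            callers[dep].add(service)
    impacted, stack = set(), [failed]
    while stack:
        for caller in callers[stack.pop()]:
            if caller not in impacted:
                impacted.add(caller)
                stack.append(caller)
    return impacted


def rank_by_blast_radius(graph):
    """List (node, impacted_count) pairs, largest potential blast radius first."""
    nodes = set(graph) | {dep for deps in graph.values() for dep in deps}
    ranked = [(node, len(impacted_by(graph, node))) for node in nodes]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))
```

With this map, `rank_by_blast_radius(DEPENDENCIES)` lists `orders-db` first because four services depend on it. Nodes near the top of the ranking are candidate single points of failure and deserve tight blast radius controls.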

Step 2: Define Steady-State Hypothesis

Before breaking anything, define what "normal" looks like:

## Steady-State Hypothesis
- Homepage loads in < 500ms (p95)
- API error rate < 0.1%
- Orders processed within 30 seconds of submission
- Background jobs backlog < 100 items
- All health check endpoints return 200

This is the baseline you'll compare against during the experiment.
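To make the comparison mechanical, the hypothesis can be encoded as thresholds and checked against measured values. This is a minimal sketch: the thresholds mirror the example hypothesis above, while the metric names and the shape of the `metrics` dictionary are assumptions made for illustration.

```python
import operator

# Thresholds mirror the example hypothesis; metric names are hypothetical.
STEADY_STATE = [
    ("homepage_p95_ms", "<", 500),
    ("api_error_rate", "<", 0.001),  # 0.1% expressed as a fraction
    ("order_processing_s", "<=", 30),
    ("job_backlog", "<", 100),
    ("unhealthy_endpoints", "==", 0),
]

OPS = {
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
}


def check_steady_state(metrics, hypothesis=STEADY_STATE):
    """Return the violated checks; an empty list means the hypothesis holds."""
    violations = []
    for name, op, threshold in hypothesis:
        value = metrics.get(name)
        # A missing metric counts as a violation: no data is not proof of health.
        if value is None or not OPS[op](value, threshold):
            violations.append((name, value, op, threshold))
    return violations
```

Running the same check before and after the failure injection gives a like-for-like comparison against the baseline.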

Step 3: Select Failure Mode

Common failure modes, ordered from least to most severe:

Level 1 — Service Failures (start here)

  • Kill a single pod/container instance
  • Restart a service with a delay
  • Reduce replica count to 1

Level 2 — Network Failures

  • Add latency (100ms, 500ms, 2000ms) to inter-service calls
  • Drop 10% of packets to a specific service
  • DNS resolution failures
  • Block traffic to a specific dependency

Level 3 — Resource Exhaustion

  • Fill disk to 95%
  • Consume all available memory (OOM scenarios)
  • Saturate CPU
  • Exhaust database connection pool
  • Fill message queue to capacity

Level 4 — Dependency Failures

  • External API returns 500 for all requests
  • Database becomes read-only
  • Cache becomes unavailable
  • Message broker stops accepting messages

Level 5 — Infrastructure Failures (advanced)

  • Availability zone failure (kill all resources in one AZ)
  • Region failover
  • Complete network partition between services

Step 4: Generate Experiment Definition

# Chaos Experiment: [Service] [Failure Type]
# Generated by chaos-test-designer

experiment:
  name: "payment-service-pod-kill"
  description: "Kill payment service pod to verify retry logic and circuit breaker"
  
  steady_state:
    - probe: http
      url: "http://payment-service:8080/health"
      expect_status: 200
    - probe: prometheus
      query: "sum(rate(http_requests_total{service='payment',status=~'5..'}[1m]))"
      expect: "< 0.01"
  
  method:
    - action: kill-pod
      target:
        namespace: production
        label_selector: "app=payment-service"
      count: 1
      
  rollback:
    - action: scale
      target:
        namespace: production
        deployment: payment-service
      replicas: 3
      
  controls:
    blast_radius: "single pod in production"
    duration: "5 minutes"
    abort_conditions:
      - "error_rate > 5%"
      - "p99_latency > 10s"
    business_hours_only: true
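The `abort_conditions` strings in this definition need an evaluator at run time. The sketch below shows one possible way to parse and check them. It is not part of any chaos tool: the grammar (metric, comparison, number, optional `%`, `ms`, or `s` unit) is an assumption based on the two examples above, and metrics are assumed to be reported as fractions and seconds.

```python
import re

# Divisors convert a threshold to base units (fractions and seconds).
UNIT_DIVISORS = {"%": 100, "ms": 1000, "s": 1, "": 1}
CONDITION = re.compile(r"^\s*(\w+)\s*(>=|<=|>|<)\s*([\d.]+)\s*(%|ms|s)?\s*$")

COMPARE = {
    ">": lambda a, b: a > b,
    ">=": lambda a, b: a >= b,
    "<": lambda a, b: a < b,
    "<=": lambda a, b: a <= b,
}


def parse_condition(text):
    """Parse 'error_rate > 5%' into (metric, operator, threshold in base units)."""
    match = CONDITION.match(text)
    if not match:
        raise ValueError(f"unsupported abort condition: {text!r}")
    metric, op, number, unit = match.groups()
    return metric, op, float(number) / UNIT_DIVISORS[unit or ""]


def should_abort(conditions, metrics):
    """Return the first condition triggered by the current metrics, else None."""
    for text in conditions:
        metric, op, threshold = parse_condition(text)
        if metric in metrics and COMPARE[op](metrics[metric], threshold):
            return text
    return None
```

Returning the triggering condition, rather than a bare boolean, lets the runner log why an experiment was stopped before it starts the rollback.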

For Kubernetes (Litmus):

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-chaos
spec:
  engineState: active
  appinfo:
    appns: production
    applabel: app=payment-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "300"
            - name: CHAOS_INTERVAL
              value: "60"
            - name: FORCE
              value: "false"

For plain bash:

#!/usr/bin/env bash
# Chaos: Kill payment-service pod
set -euo pipefail

# Assumed settings: PROMETHEUS is the Prometheus base URL; NAMESPACE defaults to production.
: "${PROMETHEUS:?Set PROMETHEUS to the Prometheus base URL, e.g. http://prometheus:9090}"
NAMESPACE="${NAMESPACE:-production}"
QUERY='sum(rate(http_requests_total{service="payment",status=~"5.."}[1m]))'

# Print the current 5xx request rate; print 0 when Prometheus returns no series.
error_rate() {
  curl -sf "$PROMETHEUS/api/v1/query" --data-urlencode "query=$QUERY" | \
    python3 -c "import json,sys;r=json.load(sys.stdin)['data']['result'];print(r[0]['value'][1] if r else 0)"
}

echo "📊 Capturing steady state..."
BASELINE_ERROR_RATE=$(error_rate)
echo "Baseline error rate: $BASELINE_ERROR_RATE"

echo "💥 Injecting failure: killing one payment-service pod..."
POD=$(kubectl get pods -n "$NAMESPACE" -l app=payment-service -o jsonpath='{.items[0].metadata.name}')
# Without --force, kubectl converts --grace-period=0 into a 1-second grace period.
kubectl delete pod "$POD" -n "$NAMESPACE" --grace-period=0 --force

echo "⏱️  Observing for 5 minutes..."
sleep 300

echo "📊 Measuring impact..."
POST_ERROR_RATE=$(error_rate)
echo "Post-chaos error rate: $POST_ERROR_RATE"

echo "✅ Verifying recovery..."
kubectl get pods -n "$NAMESPACE" -l app=payment-service
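A fixed five-minute sleep cannot react if the system degrades badly while the experiment is running. As a hedged alternative, the sketch below polls a metric during the observation window and stops early when a safety check trips. The `sample` and `is_unsafe` callables are placeholders supplied by the caller (for example, a function that queries Prometheus), and the `sleep` and `clock` parameters exist only so the loop can be exercised without waiting.

```python
import time


def observe(sample, is_unsafe, duration_s=300, interval_s=15,
            sleep=time.sleep, clock=time.monotonic):
    """Poll `sample()` until `duration_s` elapses or `is_unsafe(reading)` is true.

    Returns (aborted, readings) so the caller can trigger rollback and
    keep the collected data for the findings report.
    """
    readings = []
    deadline = clock() + duration_s
    while clock() < deadline:
        reading = sample()
        readings.append(reading)
        if is_unsafe(reading):
            return True, readings
        sleep(interval_s)
    return False, readings
```

When `observe` returns `True` as its first value, the runner would execute the rollback steps immediately instead of waiting out the remaining window.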

2. gameday — Plan a Game Day

Generate a full game day schedule:

  • Pre-game briefing (objectives, safety controls, escalation contacts)
  • Experiment sequence (ordered by risk, with breaks between)
  • Observation assignments (who monitors what dashboard)
  • Go/no-go criteria between experiments
  • Post-game debrief template
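The experiment sequence can be sketched as a small scheduling function. This is an illustration only: the 20-minute run length, the 10-minute break, and the experiment names in the usage note are assumptions, and the `level` value refers to the Level 1 to Level 5 failure modes from the design command.

```python
from datetime import datetime, timedelta


def build_schedule(experiments, start, run_minutes=20, break_minutes=10):
    """Order experiments by risk level and space them with breaks.

    `experiments` is a list of (name, level) pairs; lower levels run first.
    Returns (start_time, end_time, name, level) tuples with HH:MM times.
    """
    schedule = []
    cursor = start
    for name, level in sorted(experiments, key=lambda item: item[1]):
        end = cursor + timedelta(minutes=run_minutes)
        schedule.append((cursor.strftime("%H:%M"), end.strftime("%H:%M"), name, level))
        # The break doubles as the go/no-go checkpoint before the next experiment.
        cursor = end + timedelta(minutes=break_minutes)
    return schedule
```

For example, `build_schedule([("az-failover", 5), ("pod-kill", 1)], datetime(2024, 1, 15, 10, 0))` places `pod-kill` at 10:00 and `az-failover` at 10:30, so the riskiest experiment runs only after the lower-risk one has passed its go/no-go check.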

3. audit — Assess Chaos Readiness

Before running chaos experiments, verify the system has:

  • Health check endpoints on every service
  • Monitoring and alerting in place
  • Circuit breakers or retry logic
  • Graceful degradation modes
  • Runbooks for common failures
  • Rollback procedures tested recently

Score readiness from 0 to 100 and recommend prerequisites to complete before the first chaos experiment.
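One simple way to produce such a score is a weighted checklist. The sketch below assumes a weight for each of the six items above; the weights and the key names are illustrative choices, not a standard, and should be tuned to what matters most for the system under test.

```python
# Weights are illustrative assumptions and sum to 100.
CHECKLIST = {
    "health_checks": 20,
    "monitoring_alerting": 20,
    "circuit_breakers_or_retries": 20,
    "graceful_degradation": 15,
    "runbooks": 10,
    "rollback_tested_recently": 15,
}


def readiness(answers, checklist=CHECKLIST):
    """Return (score from 0 to 100, list of unmet prerequisites)."""
    missing = [item for item in checklist if not answers.get(item, False)]
    score = sum(weight for item, weight in checklist.items() if item not in missing)
    return score, missing
```

The list of missing items doubles as the prerequisite recommendations: anything still on it is work to finish before the first experiment.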

4. report — Analyze Experiment Results

After running an experiment, produce a findings report:

  • Did the steady-state hypothesis hold?
  • What broke? Was it expected?
  • How long until the system self-healed?
  • How did the actual blast radius compare with the expected one?
  • Remediation recommendations (add circuit breaker, fix retry logic, add redundancy)
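The self-healing question can be answered from the samples collected during the observation window. The following sketch assumes time-ordered `(seconds_since_injection, error_rate)` pairs and treats any rate at or above the steady-state threshold as unhealthy. It measures only the first recovery and ignores later relapses.

```python
def time_to_recovery(samples, threshold):
    """Return seconds from the first unhealthy sample to the next healthy one.

    `samples` is a time-ordered list of (seconds_since_injection, error_rate).
    Returns 0 if the threshold was never breached, and None if the system
    had not recovered by the last sample.
    """
    breach_at = None
    for elapsed, rate in samples:
        if breach_at is None and rate >= threshold:
            breach_at = elapsed
        elif breach_at is not None and rate < threshold:
            return elapsed - breach_at
    return 0 if breach_at is None else None
```

A result of `None` is itself a finding: the system did not self-heal within the window, which usually points to a missing fallback or a manual recovery step.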