Chaos Test Designer

Design chaos engineering experiments that safely test your system's resilience. Define steady-state hypotheses, inject controlled failures (service crashes, network partitions, resource exhaustion, dependency outages), measure impact, and produce runnable experiment definitions for Chaos Monkey, Litmus, Gremlin, or plain scripts.

Use when: "design chaos test", "test system resilience", "what happens if this service dies", "failure injection", "game day planning", "chaos engineering", "test our fallbacks", or before declaring a service production-ready.

Commands

1. design — Create Chaos Experiment

Step 1: Understand the System

# Discover services and dependencies
kubectl get deployments -A 2>/dev/null | grep -v kube-system
docker compose config --services 2>/dev/null
# Or read architecture docs
find . -maxdepth 3 -name "*.md" | xargs grep -li "architecture\|topology\|dependency" 2>/dev/null

Map the dependency graph:

  • Which services call which?
  • What are the single points of failure?
  • Where are the circuit breakers, retries, fallbacks?
  • What external dependencies exist (databases, caches, queues, third-party APIs)?
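
For Compose-based stacks, the declared edges can be dumped straight from the resolved config. A minimal sketch, assuming Compose v2 (which supports docker compose config --format json) and services that declare depends_on:

# Print "service -> [dependencies]" from the resolved Compose config
docker compose config --format json 2>/dev/null | python3 -c "
import json, sys
services = json.load(sys.stdin).get('services', {})
for name, cfg in services.items():
    deps = (cfg or {}).get('depends_on') or {}
    print(name, '->', sorted(deps))  # handles both list and map forms of depends_on
"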

Step 2: Define Steady-State Hypothesis

Before breaking anything, define what "normal" looks like:

## Steady-State Hypothesis
- Homepage loads in < 500ms (p95)
- API error rate < 0.1%
- Orders processed within 30 seconds of submission
- Background jobs backlog < 100 items
- All health check endpoints return 200

This is the baseline you'll compare against during the experiment.
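
These probes can be scripted and run before any fault is injected, so the comparison is mechanical rather than anecdotal. A sketch with placeholder endpoints:

# Capture the baseline (URLs are illustrative, not part of this skill)
curl -fsS -o /dev/null -w "homepage latency: %{time_total}s\n" "https://app.example.com/"
curl -fsS -o /dev/null -w "health status: %{http_code}\n" "http://api.internal:8080/health"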

Step 3: Select Failure Mode

Common failure modes, ordered from lowest to highest risk:

Level 1 — Service Failures (start here)

  • Kill a single pod/container instance
  • Restart a service with delay
  • Reduce replica count to 1

Level 2 — Network Failures

  • Add latency (100ms, 500ms, 2000ms) to inter-service calls
  • Drop 10% of packets to a specific service
  • DNS resolution failures
  • Block traffic to a specific dependency
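
Most Level 2 faults can be reproduced with tc/netem alone. A minimal sketch, assuming the target container ships tc, runs with CAP_NET_ADMIN, and uses interface eth0:

# Add 500ms of latency to all egress traffic from the target workload
kubectl exec deploy/payment-service -- tc qdisc add dev eth0 root netem delay 500ms
# ...observe impact...
# Always pair an injection with its cleanup
kubectl exec deploy/payment-service -- tc qdisc del dev eth0 root netem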

Level 3 — Resource Exhaustion

  • Fill disk to 95%
  • Consume all available memory (OOM scenarios)
  • Saturate CPU
  • Exhaust database connection pool
  • Fill message queue to capacity
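
Resource pressure is easiest to generate with stress-ng and plain filesystem tools. An illustrative sketch, assuming stress-ng is installed on the target host:

stress-ng --cpu 0 --timeout 300s         # saturate every CPU core for 5 minutes
stress-ng --vm 2 --vm-bytes 90% -t 300s  # hold roughly 90% of available memory
fallocate -l 10G /tmp/chaos-fill         # consume 10 GB of disk; rm it afterwards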

Level 4 — Dependency Failures

  • External API returns 500 for all requests
  • Database becomes read-only
  • Cache becomes unavailable
  • Message broker stops accepting messages
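
Many Level 4 faults reduce to "make the dependency unreachable", which can be simulated on the caller without touching the dependency itself. A sketch using iptables (requires root; port 6379 here stands in for a Redis cache):

# Drop all outbound traffic to the cache, then restore it when done
iptables -A OUTPUT -p tcp --dport 6379 -j DROP
# ...observe fallback behavior...
iptables -D OUTPUT -p tcp --dport 6379 -j DROP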

Level 5 — Infrastructure Failures (advanced)

  • Availability zone failure (kill all resources in one AZ)
  • Region failover
  • Complete network partition between services
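
On Kubernetes, an AZ loss can be approximated by cordoning and draining every node in one zone. A sketch using the well-known topology label (the zone name is an example):

kubectl cordon -l topology.kubernetes.io/zone=us-east-1a
kubectl drain -l topology.kubernetes.io/zone=us-east-1a --ignore-daemonsets --delete-emptydir-data
# Restore scheduling when the experiment ends
kubectl uncordon -l topology.kubernetes.io/zone=us-east-1a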

Step 4: Generate Experiment Definition

# Chaos Experiment: [Service] [Failure Type]
# Generated by chaos-test-designer

experiment:
  name: "payment-service-pod-kill"
  description: "Kill payment service pod to verify retry logic and circuit breaker"
  
  steady_state:
    - probe: http
      url: "http://payment-service:8080/health"
      expect_status: 200
    - probe: prometheus
      query: "rate(http_requests_total{service='payment',status=~'5..'}[1m])"
      expect: "< 0.01"
  
  method:
    - action: kill-pod
      target:
        namespace: production
        label_selector: "app=payment-service"
      count: 1
      
  rollback:
    - action: scale
      target:
        namespace: production
        deployment: payment-service
      replicas: 3
      
  controls:
    blast_radius: "single pod in production"
    duration: "5 minutes"
    abort_conditions:
      - "error_rate > 5%"
      - "p99_latency > 10s"
    business_hours_only: true

For Kubernetes (Litmus):

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-chaos
spec:
  engineState: 'active'
  appinfo:
    appns: production
    applabel: app=payment-service
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "300"
            - name: CHAOS_INTERVAL
              value: "60"
            - name: FORCE
              value: "false"

For plain bash:

#!/usr/bin/env bash
# Chaos: Kill payment-service pod
# Requires: kubectl access to the target cluster; PROMETHEUS set to the
# Prometheus base URL (e.g. http://prometheus:9090)
set -euo pipefail

echo "📊 Capturing steady state..."
# An empty result vector means no 5xx samples were scraped, i.e. a zero baseline
BASELINE_ERROR_RATE=$(curl -s "$PROMETHEUS/api/v1/query" --data-urlencode \
  'query=rate(http_requests_total{service="payment",status=~"5.."}[1m])' | \
  python3 -c "import json,sys;r=json.load(sys.stdin)['data']['result'];print(r[0]['value'][1] if r else 0)")
echo "Baseline error rate: $BASELINE_ERROR_RATE"

echo "💥 Injecting failure: killing one payment-service pod..."
POD=$(kubectl get pods -l app=payment-service -o jsonpath='{.items[0].metadata.name}')
kubectl delete pod "$POD" --grace-period=0 --force  # a zero grace period requires --force

echo "⏱️  Observing for 5 minutes..."
sleep 300

echo "📊 Measuring impact..."
POST_ERROR_RATE=$(curl -s "$PROMETHEUS/api/v1/query" --data-urlencode \
  'query=rate(http_requests_total{service="payment",status=~"5.."}[1m])' | \
  python3 -c "import json,sys;r=json.load(sys.stdin)['data']['result'];print(r[0]['value'][1] if r else 0)")
echo "Post-chaos error rate: $POST_ERROR_RATE"

echo "✅ Verifying recovery..."
kubectl get pods -l app=payment-service
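
As written, the script measures but never judges. A follow-up gate can fail the run automatically; the "no worse than double the baseline" threshold below is arbitrary and should be tuned per service:

# Fail if the post-chaos error rate regressed beyond the threshold
awk -v base="$BASELINE_ERROR_RATE" -v post="$POST_ERROR_RATE" \
  'BEGIN { exit (post <= 2 * base + 0.001) ? 0 : 1 }' \
  && echo "✅ Steady-state hypothesis held" \
  || { echo "❌ Error rate regressed beyond threshold"; exit 1; }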

2. gameday — Plan a Game Day

Generate a full game day schedule:

  • Pre-game briefing (objectives, safety controls, escalation contacts)
  • Experiment sequence (ordered by risk, with breaks between)
  • Observation assignments (who monitors what dashboard)
  • Go/no-go criteria between experiments
  • Post-game debrief template

3. audit — Assess Chaos Readiness

Before running chaos experiments, verify the system has:

  • Health check endpoints on every service
  • Monitoring and alerting in place
  • Circuit breakers or retry logic
  • Graceful degradation modes
  • Runbooks for common failures
  • Rollback procedures tested recently

Score readiness 0-100 and recommend prerequisites before first chaos experiment.
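
A quick spot check for the first prerequisite might look like this (service names and port are hypothetical):

for svc in payment orders inventory; do
  curl -fsS -m 2 "http://$svc:8080/health" >/dev/null \
    && echo "$svc: health check OK" \
    || echo "$svc: health check MISSING or failing"
done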

4. report — Analyze Experiment Results

After running an experiment, produce a findings report:

  • Did the steady-state hypothesis hold?
  • What broke? Was it expected?
  • How long until the system self-healed?
  • How did the actual blast radius compare with what was expected?
  • Remediation recommendations (add circuit breaker, fix retry logic, add redundancy)
