{"skill":{"slug":"aws-fis-experiment-prepare","displayName":"Aws Fis Experiment Prepare","summary":"Use when the user wants to prepare, create, or generate an AWS FIS (Fault Injection Service) experiment configuration. Triggers on \"prepare FIS experiment\",...","description":"---\nname: aws-fis-experiment-prepare\ndescription: >\n  Use when the user wants to prepare, create, or generate an AWS FIS (Fault\n  Injection Service) experiment configuration. Triggers on \"prepare FIS\n  experiment\", \"create FIS experiment for [scenario]\", \"generate chaos\n  experiment config\", \"准备 FIS 实验\", \"生成 [scenario] 混沌实验配置\",\n  \"create experiment template for AZ power interruption\", \"set up fault\n  injection test\". Covers Scenario Library pre-built scenarios (AZ Power\n  Interruption, AZ Application Slowdown, Cross-AZ Traffic Slowdown,\n  Cross-Region Connectivity), custom single FIS actions\n  (aws:rds:failover-db-cluster, aws:ec2:stop-instances, etc.), and SSM\n  Automation-based fault injection for Amazon MSK (broker reboot) and\n  ElastiCache Redis/Valkey (primary node reboot, replication group failover).\n---\n\n# AWS FIS Experiment Prepare\n\nGenerate all configuration files needed to run an AWS FIS experiment, then\ndeploy via CloudFormation with self-healing iteration until the stack\nsucceeds. Outputs a self-contained directory with a validated, deployed\nexperiment template ready for execution.\n\n**Core principle:** Validate resource-action compatibility before generating\nfiles. 
Never deliver untested configuration — deploy and self-heal first.\n\n## References\n\n**Always load for every experiment:**\n- `references/output-format.md` — directory layout, slug naming, README\n  template\n- `references/cfn-base-template.md` — CFN skeleton (Parameters, IAM Role,\n  Dashboard, FIS Template, Outputs)\n- `references/slug-conventions.md` — scenario/context slug abbreviations,\n  resource naming, name length budget\n\n**Load conditionally by scenario:**\n- `references/az-power-interruption-guide.md` — AZ Power Interruption\n  (sub-action pruning, tagging strategy, permissions)\n- `references/eks-pod-action-guide.md` — any `aws:eks:pod-*` action\n  (RBAC Lambda, EKS Access Entry, Pod memory stress calculation)\n- `references/elasticache-redis-guide.md` — ElastiCache Redis/Valkey\n  (native AZ power interruption, primary node reboot via SSM\n  Automation, or replication group failover via SSM Automation)\n- `references/msk-guide.md` — Amazon MSK (broker reboot via SSM\n  Automation — no native FIS action exists)\n\n**Utility scripts (execute, do not read as reference):**\n- `scripts/precheck-cfn-permissions.sh` — detects required CFN service role\n- `scripts/deploy-with-retry.sh` — validate + deploy + delete-on-fail\n- `scripts/rename-output-dir.sh` — appends FIS template ID to directory name\n\n**Script invocation:** `${SKILL_DIR}` refers to the absolute path of this\nskill's directory (where SKILL.md lives). 
Resolve it from the skill's\nfilesystem location before running any scripts.\n\n## Output Language Rule\n\nDetect the user's conversation language and use the **same language** for all\noutput files (README.md, comments in JSON/YAML).\n- Chinese input → Chinese output\n- English input → English output\n- Mixed → follow the dominant language\n\n## Prerequisites\n\nRequired tools:\n- **AWS CLI** — `aws fis list-actions`, resource discovery, CloudFormation\n- **aws___search_documentation** / **aws___read_documentation** — FIS docs\n  research\n- **jq** — required by `scripts/deploy-with-retry.sh` and\n  `scripts/precheck-cfn-permissions.sh`\n\n**EKS Pod fault injection:** Cluster auth mode must be\n`API_AND_CONFIG_MAP` or `API`. Check:\n```bash\naws eks describe-cluster --name {CLUSTER} \\\n  --query 'cluster.accessConfig.authenticationMode'\n```\nIf the mode is `CONFIG_MAP`, the user must update the cluster first.\n**MANDATORY:** For any `aws:eks:pod-*` action, follow\n`references/eks-pod-action-guide.md`.\n\n## Workflow\n\n### Step 1: Identify Scenario and Region\n\n**Classify user intent into one of these branches:**\n\n| Branch | Trigger | Additional Reference |\n|---|---|---|\n| Scenario Library | AZ Power Interruption, AZ App Slowdown, Cross-AZ/Region scenarios | Read AWS doc URL (table below) |\n| Custom FIS action | User specifies an action ID or describes a single fault | — |\n| Custom FIS action (ElastiCache) | ElastiCache AZ power interruption (native `aws:elasticache:replicationgroup-interrupt-az-power` action) | `references/elasticache-redis-guide.md` |\n| SSM Automation | No native FIS action exists for the requested fault (MSK broker reboot, ElastiCache primary reboot, ElastiCache failover) | `references/msk-guide.md` or `references/elasticache-redis-guide.md` |\n\nIf ambiguous, ask the user.\n\n**Scenario Library documentation URLs** (JSON templates are NOT available via\nCLI/API — read the doc to extract them):\n\n| Scenario | Documentation URL |\n|---|---|\n| AZ Power Interruption | 
`https://docs.aws.amazon.com/en_us/fis/latest/userguide/az-availability-scenario.html` |\n| AZ Application Slowdown | `https://docs.aws.amazon.com/en_us/fis/latest/userguide/az-application-slowdown-scenario.html` |\n| Cross-AZ Traffic Slowdown | `https://docs.aws.amazon.com/en_us/fis/latest/userguide/cross-az-traffic-slowdown-scenario.html` |\n| Cross-Region Connectivity | `https://docs.aws.amazon.com/en_us/fis/latest/userguide/cross-region-scenario.html` |\n\n**Region detection order:**\n1. Region the user explicitly specifies\n2. Infer from context (ARNs, previous conversation)\n3. `aws configure get region`\n4. Ask the user\n\nStore as `TARGET_REGION`.\n\n**Default experiment duration: `PT10M` (10 minutes)** for all scenarios and\nsub-actions unless the user specifies otherwise. For AZ Power Interruption,\nscale ARC Zonal Autoshift timing proportionally using\n`startAfter = duration × (5/30)`. At PT10M this gives about 1.7 minutes, so\nARC starts at minute 2 and runs for the remaining 8 minutes.\n\n### Step 2: Discover Target Resources\n\n#### For Scenario Library Scenarios\n\n**CRITICAL: Scenario Library experiment templates CANNOT be generated via\nthe FIS API.** You MUST call `aws___read_documentation` with the scenario URL\n(Step 1 table) to extract the JSON experiment template before generating\nany files. The documentation is the only authoritative source.\n\n**Target identification — prefer `resourceArns` over `resourceTags`:**\n- Use `resourceArns` (exact ARNs) for most resource types — more precise,\n  no pre-tagging needed\n- Exception — these types do NOT support `resourceArns`, use\n  `resourceTags` instead:\n  - `aws:elasticache:replicationgroup`\n  - `aws:ec2:autoscaling-group`\n- EKS pod actions use Kubernetes namespace + pod labels (neither\n  `resourceArns` nor `resourceTags`)\n\n**`resourceArns` and `filters` are mutually exclusive.** FIS rejects targets\nthat specify both. 
For AZ-scoped targeting, either use `resourceArns` with\nonly the target AZ's ARNs, or use `resourceTags` + `filters` together.\n\n**If the scenario is AZ Power Interruption:** follow\n`references/az-power-interruption-guide.md` for sub-action pruning, tagging\nstrategy, permissions, and one-Stack-per-AZ design.\n\n**Ask the user:**\n1. Which AZ to target (for AZ-level scenarios)\n2. Which services to include (for AZ Power Interruption) — if the user mentions\n   specific services, include ONLY those + mandatory infrastructure sub-actions\n3. Target resource identifiers (cluster IDs, instance IDs, etc.)\n\n#### For Custom FIS Actions\n\n```bash\naws fis get-action --id \"ACTION_ID\" --region TARGET_REGION\n```\n\nExtract required `targets` and `parameters`. Resolve user-provided\nidentifiers to ARNs via the AWS CLI.\n\n#### For Services Without Native FIS Actions (SSM Automation)\n\n1. Confirm no native action exists:\n   ```bash\n   aws fis list-actions \\\n     --query \"actions[?starts_with(id, 'aws:{SERVICE}:')]\" \\\n     --region TARGET_REGION\n   ```\n\n2. If empty, follow the service-specific guide:\n   - Amazon MSK → `references/msk-guide.md`\n   - ElastiCache primary node reboot → `references/elasticache-redis-guide.md`\n     (Scenario 2)\n   - ElastiCache replication group failover →\n     `references/elasticache-redis-guide.md` (Scenario 3)\n   - Other services → not yet documented. Stop and inform the user.\n\n**Special case — ElastiCache:** ElastiCache has a native FIS action for\nAZ-level impact (`aws:elasticache:replicationgroup-interrupt-az-power`), so\nthe query in step 1 is not empty for it, but there is **no native\naction for single-node reboot or replication group failover**. For primary\nnode reboot, use SSM Automation per\n`references/elasticache-redis-guide.md` → Scenario 2. For replication group\nfailover (TestFailover), use SSM Automation per\n`references/elasticache-redis-guide.md` → Scenario 3.\n\n3. 
Discover resources via the target service's CLI (`aws kafka list-clusters`,\n   etc.).\n\n### Step 2.5: EKS Pod Action Setup Gate\n\n**If the experiment includes ANY `aws:eks:pod-*` action, complete this gate\nBEFORE Step 3.**\n\nApplicable actions: `aws:eks:pod-cpu-stress`, `aws:eks:pod-delete`,\n`aws:eks:pod-io-stress`, `aws:eks:pod-memory-stress`,\n`aws:eks:pod-network-blackhole-port`, `aws:eks:pod-network-latency`,\n`aws:eks:pod-network-packet-loss`.\n\n1. Read the official documentation:\n   ```\n   aws___read_documentation:\n     url: https://docs.aws.amazon.com/fis/latest/userguide/eks-pod-actions.html\n   ```\n\n2. Follow ALL requirements in `references/eks-pod-action-guide.md`:\n   - Lambda-backed CFN Custom Resource for K8s RBAC (fixed names: `fis-sa`,\n     `fis-experiment-role`, `fis-experiment-role-binding`)\n   - EKS Access Entry for FIS Experiment Role (`Username: fis-experiment`)\n   - Cluster auth mode check (`API_AND_CONFIG_MAP` or `API`)\n   - Pod `readOnlyRootFilesystem: false` check\n   - Network action limitations (no Fargate, no bridge mode)\n   - **Pod memory stress threshold calculation** (if action is\n     `aws:eks:pod-memory-stress`) — user's percent is total target, not\n     injection value\n\nDo NOT skip. EKS pod actions have complex setup requirements that differ\nsignificantly from other FIS actions.\n\n### Step 3: Validate Resource-Action Compatibility\n\n**CRITICAL GATE.** Before generating any files, verify that the user's\nactual resources are compatible with the chosen FIS action(s).\n\n#### 3a. 
Inspect the Actual Resource\n\n| User Says | CLI Command | Key Fields |\n|---|---|---|\n| RDS database | `aws rds describe-db-instances --db-instance-identifier {ID}` | `Engine`, `DBClusterIdentifier` |\n| RDS/Aurora cluster | `aws rds describe-db-clusters --db-cluster-identifier {ID}` | `Engine`, `EngineMode`, `MultiAZ` |\n| EC2 instance | `aws ec2 describe-instances --instance-ids {ID}` | `InstanceType`, `Placement.AvailabilityZone` |\n| EKS cluster | `aws eks describe-cluster --name {NAME}` | `accessConfig.authenticationMode`, `version` |\n| ElastiCache | `aws elasticache describe-replication-groups --replication-group-id {ID}` | `NodeGroupConfiguration`, `MultiAZ` |\n| ASG | `aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names {NAME}` | `AvailabilityZones`, `Instances` |\n\n#### 3b. Cross-Check Against FIS Action Requirements\n\n```bash\naws fis get-action --id \"ACTION_ID\" --region TARGET_REGION \\\n  --query 'action.targets' --output json\n```\n\n**Common incompatibility traps:**\n\n| FIS Action | Required resourceType | Incompatible With | Detection |\n|---|---|---|---|\n| `aws:rds:failover-db-cluster` | `aws:rds:cluster` | Standalone RDS (non-Aurora) | `DBClusterIdentifier` is null |\n| `aws:rds:reboot-db-instances` | `aws:rds:db` | Aurora clusters | `Engine` starts with `aurora` |\n| `aws:elasticache:replicationgroup-interrupt-az-power` | `aws:elasticache:replicationgroup` | Standalone ElastiCache nodes | No replication group |\n| `aws:ec2:stop-instances` | `aws:ec2:instance` | Spot instances | `InstanceLifecycle` = `spot` |\n\n#### 3c. 
Decision Gate\n\n- **Compatible** → proceed to Step 4.\n- **Incompatible** → explain the mismatch, suggest alternatives based on\n  the actual resource type, ask the user to confirm or abort.\n\nExample alternatives:\n- Standalone RDS Multi-AZ → `aws:rds:reboot-db-instances` with the\n  `forceFailover` parameter set to `true`\n- Aurora cluster → `aws:rds:failover-db-cluster`\n- ElastiCache standalone → explain that a replication group is required\n\n#### 3d. For Scenario Library Scenarios\n\nValidate EACH included sub-action against its target resources. Only\nvalidate sub-actions that remain after service-scoped pruning (Step 2).\n\n### Step 4: Determine Monitoring Configuration\n\n**Stop Conditions — default: `source: \"none\"` (no alarm).** Only add a\nCloudWatch Alarm stop condition if the user explicitly provides an alarm.\n\n**Dashboard Metrics — comprehensive, per-service.** Group widgets by\nservice, 3 widgets per service (availability, performance, errors/latency).\nInclude only services actually affected by the experiment.\n\n| Service | Metrics |\n|---|---|\n| EC2 | `StatusCheckFailed`, `CPUUtilization`, `NetworkIn/Out`, `NetworkPacketsIn/Out` |\n| RDS/Aurora | `DatabaseConnections`, `ReadLatency`, `WriteLatency`, `AuroraReplicaLag`, `FreeableMemory` |\n| EKS | `pod_number_of_running_pods`, `pod_number_of_container_restarts`, `node_cpu_utilization`, `node_memory_utilization` |\n| ElastiCache | `ReplicationLag`, `EngineCPUUtilization`, `CurrConnections`, `CacheHitRate`, `Evictions`, `IsMaster` |\n| ALB | `HealthyHostCount`, `UnHealthyHostCount`, `HTTPCode_ELB_5XX_Count`, `TargetResponseTime` |\n| NLB | `ActiveFlowCount`, `TCP_Client_Reset_Count`, `TCP_Target_Reset_Count` |\n\n### Step 5: Generate Configuration Files\n\n**Create output directory:**\n\n```bash\n# ─── Fill in from user's request + references/slug-conventions.md ───\nSCENARIO_SLUG=\"...\"         # e.g., pod-delete, az-power-int, rds-failover\nTARGET_RESOURCE_ID=\"...\"    # e.g., my-aurora-cluster, i-0abc123def\nCONTEXT_NAME=\"\"           
  # optional (e.g., redis, msk); leave empty if N/A\n# ────────────────────────────────────────────────────────────────────\n\n# Derived values (do not edit):\nTARGET_SLUG=$(echo \"${TARGET_RESOURCE_ID}\" | tr '[:upper:]' '[:lower:]' | tr ' :/' '-' | cut -c1-20)\nCONTEXT_SLUG=$(echo \"${CONTEXT_NAME}\" | tr '[:upper:]' '[:lower:]' | tr ' :/' '-' | cut -c1-10)\nTIMESTAMP=$(TZ=Asia/Shanghai date +%Y-%m-%d-%H-%M-%S)\n\nif [ -n \"${CONTEXT_SLUG}\" ]; then\n    OUTPUT_DIR=\"./${TIMESTAMP}-${SCENARIO_SLUG}-${TARGET_SLUG}-${CONTEXT_SLUG}\"\nelse\n    OUTPUT_DIR=\"./${TIMESTAMP}-${SCENARIO_SLUG}-${TARGET_SLUG}\"\nfi\nmkdir -p \"${OUTPUT_DIR}\"\n```\n\n**REQUIRED:** Before generating `cfn-template.yaml`, read the\n`AWS::FIS::ExperimentTemplate` CloudFormation resource documentation:\n\n```\naws___read_documentation:\n  url: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-fis-experimenttemplate.html\n```\n\n**ALSO REQUIRED:** Search for CloudFormation examples for the resources used:\n\n```\naws___search_documentation:\n  search_phrase: \"<CFN resource types in this experiment>\"\n  topics: [\"cloudformation\"]\n```\n\n**Generate files:**\n\n1. **cfn-template.yaml** — use `references/cfn-base-template.md` as the\n   skeleton. Extend with scenario-specific resources per:\n   - `references/az-power-interruption-guide.md` (if AZ Power Interruption)\n   - `references/eks-pod-action-guide.md` (if EKS pod actions)\n   - `references/msk-guide.md` (if MSK)\n   - `references/elasticache-redis-guide.md` (if ElastiCache)\n\n2. **README.md** — use the template in `references/output-format.md`.\n\n### Step 5.5: CFN Permission Pre-Check\n\nRun the precheck script to detect whether a CFN service role is required:\n\n```bash\nCFN_ROLE_ARN=$(\"${SKILL_DIR}/scripts/precheck-cfn-permissions.sh\")\n```\n\nIf the caller lacks CloudFormation permissions, the script exits 1 with\nguidance — **stop and inform the user**. 
Otherwise, `CFN_ROLE_ARN` is either\nempty (no service role needed) or contains the required role ARN.\n\n### Step 6: Deploy CFN Template (Self-Healing Loop)\n\n**Generate deployment parameters:**\n\n```bash\n# See references/slug-conventions.md for the ExperimentName composition rule\nRANDOM_SUFFIX=$(LC_ALL=C tr -dc 'a-z0-9' < /dev/urandom | head -c6)\n\nif [ -n \"${CONTEXT_SLUG}\" ]; then\n    EXPERIMENT_NAME=\"${SCENARIO_SLUG}-${TARGET_SLUG}-${CONTEXT_SLUG}-${RANDOM_SUFFIX}\"\nelse\n    EXPERIMENT_NAME=\"${SCENARIO_SLUG}-${TARGET_SLUG}-${RANDOM_SUFFIX}\"\nfi\nSTACK_NAME=\"fis-${EXPERIMENT_NAME}\"\n```\n\n**Deploy with self-healing retry loop** (maximum 5 attempts driven by the\nagent). The `deploy-with-retry.sh` script performs **one attempt** — the\nagent drives the loop externally. On each attempt:\n\n1. Run `scripts/deploy-with-retry.sh`:\n   ```bash\n   \"${SKILL_DIR}/scripts/deploy-with-retry.sh\" \\\n     \"${OUTPUT_DIR}/cfn-template.yaml\" \\\n     \"${STACK_NAME}\" \\\n     \"${TARGET_REGION}\" \\\n     \"${CFN_ROLE_ARN}\" \\\n     \"ExperimentName=${EXPERIMENT_NAME}\" \\\n     \"RandomSuffix=${RANDOM_SUFFIX}\"\n   ```\n2. Exit 0 → deployment succeeded, proceed to \"On Successful Deployment\".\n3. Exit 1 (validation failed) or 2 (deployment failed, stack deleted) →\n   analyze stderr output, fix `cfn-template.yaml`, increment attempt\n   counter, re-invoke the script.\n4. 
After 5 failed attempts → stop and report to the user with the last\n   error, all fixes attempted, and the current `cfn-template.yaml`.\n\n**Common CFN errors and fixes:**\n\n| Error Pattern | Root Cause | Fix |\n|---|---|---|\n| `Property validation failure` | Invalid CFN property name/value | Fix the resource property |\n| `Template format error` | YAML syntax issue | Fix indentation/structure |\n| `Resource type not supported` | Resource unavailable in region | Check regional availability |\n| `Circular dependency` | Resources reference each other | Use `DependsOn` or restructure |\n| `RoleArn ... is invalid` | IAM role not yet propagated | Add `DependsOn` for IAM role |\n| Empty `logConfiguration` | AZ Power Interruption doc artifact | Remove the `logConfiguration` block |\n\n#### On Successful Deployment\n\n1. Extract stack outputs:\n   ```bash\n   aws cloudformation describe-stacks \\\n     --stack-name \"${STACK_NAME}\" \\\n     --query 'Stacks[0].Outputs' \\\n     --region \"${TARGET_REGION}\" --output table\n   ```\n\n2. Update `README.md` with actual stack name, template ID, dashboard URL,\n   and cleanup command. Replace ALL `{STACK_NAME}` placeholders — do NOT\n   leave placeholders in the final output.\n\n### Step 7: Rename Output Directory with Template ID\n\nRun the rename script:\n\n```bash\nNEW_OUTPUT_DIR=$(\"${SKILL_DIR}/scripts/rename-output-dir.sh\" \\\n    \"${OUTPUT_DIR}\" \\\n    \"${STACK_NAME}\" \\\n    \"${TARGET_REGION}\")\nOUTPUT_DIR=\"${NEW_OUTPUT_DIR}\"\n```\n\nUpdate `README.md`'s `**Directory:**` field with the full absolute path of\nthe renamed directory. 
If CFN deployment failed (Step 6 exceeded max\nretries), skip this step.\n\nPrint a brief summary to the terminal:\n- Experiment output directory (with template ID)\n- CFN stack name and deployment status\n- Experiment template ID\n- Next step instruction\n\n## Important Guidelines\n\n- **Scenario Library templates come from documentation.** Call\n  `aws___read_documentation` on the scenario's doc URL (Step 1 table) before\n  generating any files. The documentation is the only authoritative source.\n- **Never start the FIS experiment in this skill.** Starting the experiment\n  is handled by `aws-fis-experiment-execute` or manually by the user.\n- **Validate resource-action compatibility BEFORE generating files** (Step 3).\n  The most common source of wasted effort is deploying a template that\n  targets an incompatible resource.\n- **Always deploy and validate.** Do not just generate files — deploy the CFN\n  template and iterate until it succeeds (Step 6). The user should receive a\n  working, deployed experiment template ready to start.\n- **Self-heal on CFN errors.** Read stack events, diagnose, fix the template,\n  delete the failed stack, retry. Do not ask the user to fix CFN errors.\n- **Verify FIS action availability** (`aws fis list-actions` /\n  `aws fis get-action`) before generating templates. Don't fabricate action\n  IDs.\n- **Prefer `resourceArns` over `resourceTags` for targets.** Exceptions:\n  `aws:elasticache:replicationgroup`, `aws:ec2:autoscaling-group`. 
Never\n  combine `resourceArns` with `filters`.\n- **IAM policy must be least-privilege.** Only include permissions for the\n  specific actions in the experiment.\n- **CFN template must be self-contained.** Deploying the CFN template alone\n  must yield a working experiment template, with no additional steps.\n- **Sequential MCP calls.** All `aws___read_documentation` and\n  `aws___search_documentation` calls must be sequential, never parallel.\n  Retry up to 10 times on rate limit errors.\n- **Keep local files in sync.** After successful deployment, update README.md\n  with real ARNs and stack outputs.\n","tags":{"latest":"1.0.0"},"stats":{"comments":0,"downloads":310,"installsAllTime":12,"installsCurrent":0,"stars":0,"versions":1},"createdAt":1778412858005,"updatedAt":1779076297810},"latestVersion":{"version":"1.0.0","createdAt":1778412858005,"changelog":"Initial release of aws-fis-experiment-prepare\n\n- Generate, validate, and deploy AWS FIS experiment templates (including Scenario Library scenarios and custom FIS/SSM faults).\n- Outputs a self-contained directory with all configuration and documentation, using the user's preferred language (English or Chinese).\n- Ensures resource-action compatibility before generating files; deploys via CloudFormation with self-healing retries.\n- Supports pre-built scenarios (AZ Power Interruption, App Slowdown, Cross-AZ/Region Connectivity) and custom single-action faults (including ElastiCache/Redis and MSK SSM Automation).\n- Loads and processes detailed scenario-specific references and scripts for accurate, reliable deployments.","license":"MIT-0"},"metadata":null,"owner":{"handle":"panlm","userId":"s170tnbkqyrrzzgez8ybwx5vcs86edhj","displayName":"panlm","image":"https://avatars.githubusercontent.com/u/1658398?v=4"},"moderation":{"isSuspicious":false,"isMalwareBlocked":false,"verdict":"clean","reasonCodes":["review.llm_review"],"summary":"Review: review.llm_review","engineVersion":"v2.4.24","updatedAt":1780090776929}}