Install
openclaw skills install aws-fis-experiment-prepare
Use when the user wants to prepare, create, or generate an AWS FIS (Fault Injection Service) experiment configuration. Triggers on "prepare FIS experiment", "create FIS experiment for [scenario]", "generate chaos experiment config", "准备 FIS 实验", "生成 [scenario] 混沌实验配置", "create experiment template for AZ power interruption", "set up fault injection test". Covers Scenario Library pre-built scenarios (AZ Power Interruption, AZ Application Slowdown, Cross-AZ Traffic Slowdown, Cross-Region Connectivity), custom single FIS actions (aws:rds:failover-db-cluster, aws:ec2:stop-instances, etc.), and SSM Automation-based fault injection for Amazon MSK (broker reboot) and ElastiCache Redis/Valkey (primary node reboot, replication group failover).
Generate all configuration files needed to run an AWS FIS experiment, then deploy via CloudFormation with self-healing iteration until the stack succeeds. Outputs a self-contained directory with a validated, deployed experiment template ready for execution.
Core principle: Validate resource-action compatibility before generating files. Never deliver untested configuration — deploy and self-heal first.
Always load for every experiment:
- references/output-format.md: directory layout, slug naming, README template
- references/cfn-base-template.md: CFN skeleton (Parameters, IAM Role, Dashboard, FIS Template, Outputs)
- references/slug-conventions.md: scenario/context slug abbreviations, resource naming, name length budget
Load conditionally by scenario:
- references/az-power-interruption-guide.md: AZ Power Interruption (sub-action pruning, tagging strategy, permissions)
- references/eks-pod-action-guide.md: any aws:eks:pod-* action (RBAC Lambda, EKS Access Entry, Pod memory stress calculation)
- references/elasticache-redis-guide.md: ElastiCache Redis/Valkey (native AZ power interruption, primary node reboot via SSM Automation, or replication group failover via SSM Automation)
- references/msk-guide.md: Amazon MSK (broker reboot via SSM Automation; no native FIS action exists)
Utility scripts (execute, do not read as reference):
- scripts/precheck-cfn-permissions.sh: detects required CFN service role
- scripts/deploy-with-retry.sh: validate + deploy + delete-on-fail
- scripts/rename-output-dir.sh: appends FIS template ID to directory name
Script invocation: ${SKILL_DIR} refers to the absolute path of this
skill's directory (where SKILL.md lives). Resolve it from the skill's
filesystem location before running any scripts.
Detect the user's conversation language and use the same language for all output files (README.md, comments in JSON/YAML).
Required tools:
- aws fis list-actions, resource discovery, CloudFormation
- scripts/deploy-with-retry.sh and scripts/precheck-cfn-permissions.sh
EKS Pod fault injection: Cluster auth mode must be
API_AND_CONFIG_MAP or API. Check:
aws eks describe-cluster --name {CLUSTER} \
--query 'cluster.accessConfig.authenticationMode'
If CONFIG_MAP only, the user must update the cluster first.
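The authentication-mode rule above can be wrapped in a small helper. This is a minimal sketch: the function name `eks_auth_mode_ok` is hypothetical and not part of the skill's scripts, and it only classifies the string returned by the `describe-cluster` query.

```shell
# Hypothetical helper (not one of the skill's scripts): succeeds only when
# the EKS authentication mode is one this skill accepts.
eks_auth_mode_ok() {
  case "$1" in
    API|API_AND_CONFIG_MAP) return 0 ;;
    *) return 1 ;;
  esac
}

# Possible usage, assuming CLUSTER holds the cluster name:
#   MODE=$(aws eks describe-cluster --name "${CLUSTER}" \
#     --query 'cluster.accessConfig.authenticationMode' --output text)
#   eks_auth_mode_ok "${MODE}" || echo "Update the cluster auth mode first." >&2
```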
MANDATORY: For any aws:eks:pod-* action, follow
references/eks-pod-action-guide.md.
Classify user intent into one of these branches:
| Branch | Trigger | Additional Reference |
|---|---|---|
| Scenario Library | AZ Power Interruption, AZ App Slowdown, Cross-AZ/Region scenarios | Read AWS doc URL (table below) |
| Custom FIS action | User specifies an action ID or describes a single fault | — |
| Custom FIS action (ElastiCache) | ElastiCache AZ power interruption or Redis/Valkey failover | references/elasticache-redis-guide.md |
| SSM Automation | Target service has no native FIS action (MSK, ElastiCache primary reboot, ElastiCache failover) | references/msk-guide.md or references/elasticache-redis-guide.md |
If ambiguous, ask the user.
Scenario Library documentation URLs (JSON templates are NOT available via CLI/API — read the doc to extract):
| Scenario | Documentation URL |
|---|---|
| AZ Power Interruption | https://docs.aws.amazon.com/en_us/fis/latest/userguide/az-availability-scenario.html |
| AZ Application Slowdown | https://docs.aws.amazon.com/en_us/fis/latest/userguide/az-application-slowdown-scenario.html |
| Cross-AZ Traffic Slowdown | https://docs.aws.amazon.com/en_us/fis/latest/userguide/cross-az-traffic-slowdown-scenario.html |
| Cross-Region Connectivity | https://docs.aws.amazon.com/en_us/fis/latest/userguide/cross-region-scenario.html |
Region detection order:
- aws configure get region
Store as TARGET_REGION.
Default experiment duration: PT10M (10 minutes) for all scenarios and
sub-actions unless the user specifies otherwise. For AZ Power Interruption,
scale ARC Zonal Autoshift timing proportionally (ARC starts at minute 2,
runs for 8 minutes at PT10M; formula: startAfter = duration × (5/30)).
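The proportional scaling can be worked through in a few lines of shell. This is a sketch only: the function name `arc_timing` and the output labels are hypothetical, and rounding to the nearest whole minute is an assumption chosen so that a 10-minute experiment reproduces the "starts at minute 2, runs for 8 minutes" split stated above.

```shell
# Sketch of the rule startAfter = duration x (5/30), in whole minutes.
# Assumption: round to the nearest minute (integer arithmetic, +15 before
# dividing by 30), then give the autoshift the remainder of the experiment.
arc_timing() {
  local duration_min="$1"
  local start_min=$(( (duration_min * 5 + 15) / 30 ))
  local run_min=$(( duration_min - start_min ))
  echo "arc_start=PT${start_min}M arc_duration=PT${run_min}M"
}
```

Under these assumptions, `arc_timing 10` prints `arc_start=PT2M arc_duration=PT8M`, and `arc_timing 30` prints `arc_start=PT5M arc_duration=PT25M`.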
CRITICAL: Scenario Library experiment templates CANNOT be generated via
FIS API. You MUST call aws___read_documentation with the scenario URL
(Step 1 table) to extract the JSON experiment template before generating
any files. The documentation is the only authoritative source.
Target identification: prefer resourceArns over resourceTags:
- Use resourceArns (exact ARNs) for most resource types: more precise, no pre-tagging needed.
- For resource types that do not support resourceArns, use resourceTags instead:
  - aws:elasticache:replicationgroup
  - aws:ec2:autoscaling-group
  - … resourceArns nor resourceTags)
resourceArns and filters are mutually exclusive. FIS rejects targets
that specify both. For AZ-scoped targeting, either use resourceArns with
only the target AZ's ARNs, or use resourceTags + filters together.
If scenario is AZ Power Interruption: follow
references/az-power-interruption-guide.md for sub-action pruning, tagging
strategy, permissions, and one-Stack-per-AZ design.
Ask the user:
aws fis get-action --id "ACTION_ID" --region TARGET_REGION
Extract required targets and parameters. Resolve user-provided
identifiers to ARNs via AWS CLI.
Confirm no native action exists:
aws fis list-actions \
--query "actions[?starts_with(id, 'aws:{SERVICE}:')]" \
--region TARGET_REGION
If empty, follow the service-specific guide:
- references/msk-guide.md
- references/elasticache-redis-guide.md (Scenario 2)
Special case (ElastiCache): Has a native FIS action for AZ-level impact
(aws:elasticache:replicationgroup-interrupt-az-power) but no native
action for single-node reboot or replication group failover. For primary
node reboot, use SSM Automation per
references/elasticache-redis-guide.md → Scenario 2. For replication group
failover (TestFailover), use SSM Automation per
references/elasticache-redis-guide.md → Scenario 3.
aws kafka list-clusters,
etc.).
If the experiment includes ANY aws:eks:pod-* action, complete this gate
BEFORE Step 3.
BEFORE Step 3.
Applicable actions: aws:eks:pod-cpu-stress, aws:eks:pod-delete,
aws:eks:pod-io-stress, aws:eks:pod-memory-stress,
aws:eks:pod-network-blackhole-port, aws:eks:pod-network-latency,
aws:eks:pod-network-packet-loss.
Read the official documentation:
aws___read_documentation:
url: https://docs.aws.amazon.com/fis/latest/userguide/eks-pod-actions.html
Follow ALL requirements in references/eks-pod-action-guide.md:
- fis-sa, fis-experiment-role, fis-experiment-role-binding)
- Username: fis-experiment)
- API_AND_CONFIG_MAP or API)
- readOnlyRootFilesystem: false check
- aws:eks:pod-memory-stress): user's percent is total target, not injection value
Do NOT skip. EKS pod actions have complex setup requirements that differ significantly from other FIS actions.
CRITICAL GATE. Before generating any files, verify that the user's actual resources are compatible with the chosen FIS action(s).
| User Says | CLI Command | Key Fields |
|---|---|---|
| RDS database | aws rds describe-db-instances --db-instance-identifier {ID} | Engine, DBClusterIdentifier |
| RDS/Aurora cluster | aws rds describe-db-clusters --db-cluster-identifier {ID} | Engine, EngineMode, MultiAZ |
| EC2 instance | aws ec2 describe-instances --instance-ids {ID} | InstanceType, Placement.AvailabilityZone |
| EKS cluster | aws eks describe-cluster --name {NAME} | accessConfig.authenticationMode, version |
| ElastiCache | aws elasticache describe-replication-groups --replication-group-id {ID} | NodeGroupConfiguration, MultiAZ |
| ASG | aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names {NAME} | AvailabilityZones, Instances |
aws fis get-action --id "ACTION_ID" --region TARGET_REGION \
--query 'action.targets' --output json
Common incompatibility traps:
| FIS Action | Required resourceType | Incompatible With | Detection |
|---|---|---|---|
| aws:rds:failover-db-cluster | aws:rds:cluster | Standalone RDS (non-Aurora) | DBClusterIdentifier is null |
| aws:rds:reboot-db-instances | aws:rds:db | Aurora clusters | Engine starts with aurora |
| aws:elasticache:replicationgroup-interrupt-az-power | aws:elasticache:replicationgroup | Standalone ElastiCache nodes | No replication group |
| aws:ec2:stop-instances | aws:ec2:instance | Spot instances | InstanceLifecycle = spot |
Example alternatives:
- aws:rds:reboot-db-instances with --force-failover
- aws:rds:failover-db-cluster
Validate EACH included sub-action against its target resources. Only validate sub-actions that remain after service-scoped pruning (Step 2).
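The two RDS rows of the incompatibility table can be encoded as a quick pre-generation check. This is an illustrative sketch, not part of the skill's scripts: the function name `rds_action_compatible` is hypothetical, and it assumes the Engine and DBClusterIdentifier values were already read from `aws rds describe-db-instances` (where a missing cluster ID may appear as an empty string, `None`, or `null` depending on the output format).

```shell
# Hypothetical helper encoding the two RDS rows of the table.
# Args: FIS action ID, Engine, DBClusterIdentifier.
# Returns 0 = compatible, 1 = incompatible, 2 = action not covered here.
rds_action_compatible() {
  local action="$1" engine="$2" cluster_id="$3"
  case "$action" in
    aws:rds:failover-db-cluster)
      # Needs a cluster: standalone instances have no DBClusterIdentifier.
      [ -n "$cluster_id" ] && [ "$cluster_id" != "None" ] && [ "$cluster_id" != "null" ]
      ;;
    aws:rds:reboot-db-instances)
      # The table marks engines starting with "aurora" as incompatible.
      case "$engine" in
        aurora*) return 1 ;;
        *) return 0 ;;
      esac
      ;;
    *)
      return 2
      ;;
  esac
}
```

For example, `rds_action_compatible aws:rds:failover-db-cluster mysql ""` fails, which is the cue to propose an alternative action to the user rather than generate files.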
Stop Conditions — default: source: "none" (no alarm). Only create a
CloudWatch Alarm if the user explicitly provides one.
Dashboard Metrics — comprehensive, per-service. Group widgets by service, 3 widgets per service (availability, performance, errors/latency). Include only services actually affected by the experiment.
| Service | Metrics |
|---|---|
| EC2 | StatusCheckFailed, CPUUtilization, NetworkIn/Out, NetworkPacketsIn/Out |
| RDS/Aurora | DatabaseConnections, ReadLatency, WriteLatency, AuroraReplicaLag, FreeableMemory |
| EKS | pod_number_of_running_pods, pod_number_of_container_restarts, node_cpu_utilization, node_memory_utilization |
| ElastiCache | ReplicationLag, EngineCPUUtilization, CurrConnections, CacheHitRate, Evictions, IsMaster |
| ALB | HealthyHostCount, UnHealthyHostCount, HTTPCode_ELB_5XX_Count, TargetResponseTime |
| NLB | ActiveFlowCount, TCP_Client_Reset_Count, TCP_Target_Reset_Count |
Create output directory:
# ─── Fill in from user's request + references/slug-conventions.md ───
SCENARIO_SLUG="..." # e.g., pod-delete, az-power-int, rds-failover
TARGET_RESOURCE_ID="..." # e.g., my-aurora-cluster, i-0abc123def
CONTEXT_NAME="" # optional (e.g., redis, msk); leave empty if N/A
# ────────────────────────────────────────────────────────────────────
# Derived values (do not edit):
TARGET_SLUG=$(echo "${TARGET_RESOURCE_ID}" | tr '[:upper:]' '[:lower:]' | tr ' :/' '-' | cut -c1-20)
CONTEXT_SLUG=$(echo "${CONTEXT_NAME}" | tr '[:upper:]' '[:lower:]' | tr ' :/' '-' | cut -c1-10)
TIMESTAMP=$(TZ=Asia/Shanghai date +%Y-%m-%d-%H-%M-%S)
if [ -n "${CONTEXT_SLUG}" ]; then
  OUTPUT_DIR="./${TIMESTAMP}-${SCENARIO_SLUG}-${TARGET_SLUG}-${CONTEXT_SLUG}"
else
  OUTPUT_DIR="./${TIMESTAMP}-${SCENARIO_SLUG}-${TARGET_SLUG}"
fi
mkdir -p "${OUTPUT_DIR}"
REQUIRED: Before generating cfn-template.yaml, read the
AWS::FIS::ExperimentTemplate CloudFormation resource documentation:
aws___read_documentation:
url: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-fis-experimenttemplate.html
ALSO REQUIRED: Search for CloudFormation examples for the resources used:
aws___search_documentation:
search_phrase: "<CFN resource types in this experiment>"
topics: ["cloudformation"]
Generate files:
cfn-template.yaml: use references/cfn-base-template.md as the
skeleton. Extend with scenario-specific resources per:
- references/az-power-interruption-guide.md (if AZ Power Interruption)
- references/eks-pod-action-guide.md (if EKS pod actions)
- references/msk-guide.md (if MSK)
- references/elasticache-redis-guide.md (if ElastiCache)
README.md: use the template in references/output-format.md.
Run the precheck script to detect whether a CFN service role is required:
CFN_ROLE_ARN=$("${SKILL_DIR}/scripts/precheck-cfn-permissions.sh")
If the caller lacks CloudFormation permissions, the script exits 1 with
guidance — stop and inform the user. Otherwise, CFN_ROLE_ARN is either
empty (no service role needed) or contains the required role ARN.
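The exit-code handling described above might look like the following sketch. The wrapper name `run_precheck` and its messages are hypothetical; the only behavior taken from the text is that a non-zero exit means "stop" and that an empty result means no service role is needed.

```shell
# Sketch: capture the role ARN and stop if the precheck exits non-zero.
# run_precheck is a hypothetical wrapper; SKILL_DIR must already be resolved.
run_precheck() {
  if ! CFN_ROLE_ARN=$("${SKILL_DIR}/scripts/precheck-cfn-permissions.sh"); then
    echo "CloudFormation permission precheck failed; stopping." >&2
    return 1
  fi
  if [ -z "${CFN_ROLE_ARN}" ]; then
    echo "No CFN service role needed."
  else
    echo "Using CFN service role: ${CFN_ROLE_ARN}"
  fi
}
```

Calling `run_precheck || exit 1` before generating deployment parameters keeps a permissions problem from surfacing later as a failed stack.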
Generate deployment parameters:
# See references/slug-conventions.md for the ExperimentName composition rule
RANDOM_SUFFIX=$(LC_ALL=C tr -dc 'a-z0-9' < /dev/urandom | head -c6)
if [ -n "${CONTEXT_SLUG}" ]; then
  EXPERIMENT_NAME="${SCENARIO_SLUG}-${TARGET_SLUG}-${CONTEXT_SLUG}-${RANDOM_SUFFIX}"
else
  EXPERIMENT_NAME="${SCENARIO_SLUG}-${TARGET_SLUG}-${RANDOM_SUFFIX}"
fi
STACK_NAME="fis-${EXPERIMENT_NAME}"
Deploy with self-healing retry loop (maximum 5 attempts driven by the
agent). The deploy-with-retry.sh script performs one attempt — the
agent drives the loop externally. On each attempt:
scripts/deploy-with-retry.sh:
"${SKILL_DIR}/scripts/deploy-with-retry.sh" \
"${OUTPUT_DIR}/cfn-template.yaml" \
"${STACK_NAME}" \
"${TARGET_REGION}" \
"${CFN_ROLE_ARN}" \
"ExperimentName=${EXPERIMENT_NAME}" \
"RandomSuffix=${RANDOM_SUFFIX}"
cfn-template.yaml, increment attempt
counter, re-invoke the script.
cfn-template.yaml.
Common CFN errors and fixes:
| Error Pattern | Root Cause | Fix |
|---|---|---|
| Property validation failure | Invalid CFN property name/value | Fix the resource property |
| Template format error | YAML syntax issue | Fix indentation/structure |
| Resource type not supported | Resource unavailable in region | Check regional availability |
| Circular dependency | Resources reference each other | Use DependsOn or restructure |
| RoleArn ... is invalid | IAM role not yet propagated | Add DependsOn for IAM role |
| Empty logConfiguration | AZ Power Interruption doc artifact | Remove the logConfiguration block |
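The externally driven retry loop (one script invocation per attempt, at most 5 attempts) can be sketched as below. This is an outline under stated assumptions: `deploy_once`, `fix_template`, and `deploy_with_retries` are hypothetical names, and `fix_template` is a no-op placeholder for the step where the agent reads the error and edits cfn-template.yaml.

```shell
MAX_ATTEMPTS=5

# One deployment attempt, using the variables set in the earlier steps.
deploy_once() {
  "${SKILL_DIR}/scripts/deploy-with-retry.sh" \
    "${OUTPUT_DIR}/cfn-template.yaml" "${STACK_NAME}" "${TARGET_REGION}" \
    "${CFN_ROLE_ARN}" "ExperimentName=${EXPERIMENT_NAME}" \
    "RandomSuffix=${RANDOM_SUFFIX}"
}

# Hypothetical placeholder: in practice the agent diagnoses the failure
# and edits cfn-template.yaml here before the next attempt.
fix_template() { :; }

# Drive up to MAX_ATTEMPTS attempts; return 0 on the first success.
deploy_with_retries() {
  local attempt=1
  while [ "${attempt}" -le "${MAX_ATTEMPTS}" ]; do
    echo "Deploy attempt ${attempt}/${MAX_ATTEMPTS}" >&2
    if deploy_once; then
      return 0
    fi
    fix_template
    attempt=$((attempt + 1))
  done
  return 1
}
```

Keeping the single attempt and the loop in separate functions mirrors the split described above: the script stays simple, and the caller decides how to repair the template between attempts.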
Extract stack outputs:
aws cloudformation describe-stacks \
--stack-name "${STACK_NAME}" \
--query 'Stacks[0].Outputs' \
--region "${TARGET_REGION}" --output table
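When a single value is needed for the README rather than the whole table, a JMESPath filter on OutputKey can pull it out. This is a sketch: the helper name `stack_output` is hypothetical, and the example key `ExperimentTemplateId` is an assumption, since the real output keys are defined in references/cfn-base-template.md.

```shell
# Hypothetical helper: print one stack output value by its OutputKey.
# Assumes STACK_NAME and TARGET_REGION are set as in the earlier steps.
stack_output() {
  local key="$1"
  aws cloudformation describe-stacks \
    --stack-name "${STACK_NAME}" \
    --query "Stacks[0].Outputs[?OutputKey=='${key}'].OutputValue | [0]" \
    --region "${TARGET_REGION}" --output text
}

# Possible usage (the key name is an assumption, check the template's Outputs):
#   TEMPLATE_ID=$(stack_output ExperimentTemplateId)
```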
Update README.md with actual stack name, template ID, dashboard URL,
and cleanup command. Replace ALL {STACK_NAME} placeholders — do NOT
leave placeholders in the final output.
Run the rename script:
NEW_OUTPUT_DIR=$("${SKILL_DIR}/scripts/rename-output-dir.sh" \
"${OUTPUT_DIR}" \
"${STACK_NAME}" \
"${TARGET_REGION}")
OUTPUT_DIR="${NEW_OUTPUT_DIR}"
Update README.md's **Directory:** field with the full absolute path of
the renamed directory. If CFN deployment failed (Step 6 exceeded max
retries), skip this step.
Print a brief summary to the terminal:
- aws___read_documentation on the scenario's doc URL (Step 1 table) before generating any files. The documentation is the only authoritative source.
- aws-fis-experiment-execute or manually by the user.
- aws fis list-actions / aws fis get-action) before generating templates. Don't fabricate action IDs.
- resourceArns over resourceTags for targets. Exceptions: aws:elasticache:replicationgroup, aws:ec2:autoscaling-group. Never combine resourceArns with filters.
- aws___read_documentation and aws___search_documentation calls must be sequential, never parallel. Retry up to 10 times on rate limit errors.