Install
openclaw skills install aws-fis-experiment-execute

ClawHub Security found sensitive or high-impact capabilities. Review the scan results before using.
Use when the user wants to run a prepared AWS FIS experiment where the CloudFormation stack has already been deployed. Triggers on "execute FIS experiment", "run FIS experiment", "start chaos experiment", "执行 FIS 实验", "运行混沌实验", "执行故障注入实验", "run the experiment in [directory]", or when the user provides an FIS experiment template ID (e.g. EXT1a2b3c4d5e6f7). Does NOT deploy infrastructure — only checks that it is already deployed.
Verify that infrastructure is already deployed, run an AWS FIS experiment, monitor its progress, and generate a results report. Reads configuration from a prepared experiment directory whose CloudFormation stack has already been deployed.
Detect the language of the user's conversation and use the same language for all output.
Required tools:
aws fis, aws cloudwatch, aws cloudformation, aws logs

REQUIRED SUB-SKILL: app-service-log-analysis must be installed. It is loaded at runtime for application discovery, log collection, and analysis. Without it, the experiment can still run but log analysis will be skipped.
digraph execute_flow {
"User input:\npath or template ID?" [shape=diamond];
"Search CWD for\nmatching directory" [shape=box];
"Directory found?" [shape=diamond];
"Ask user for full path" [shape=box, style=bold];
"Validate files" [shape=box];
"Read README for stack name" [shape=box];
"Check CFN stack status" [shape=diamond];
"Extract template ID from outputs" [shape=box];
"Display actionIds" [shape=box];
"Pre-experiment health check" [shape=box, color=blue];
"All resources healthy?" [shape=diamond];
"Wait / prompt user" [shape=box];
"Discover apps + start logs\n(app-service-log-analysis)" [shape=box];
"User confirms experiment start" [shape=diamond, style=bold, color=red];
"Start experiment" [shape=box];
"Monitor experiment\n+ log insights" [shape=box];
"Experiment complete?" [shape=diamond];
"Wait 3 min post-baseline" [shape=box];
"Stop logs + analyze\n(app-service-log-analysis)" [shape=box];
"Generate results report" [shape=box];
"User input:\npath or template ID?" -> "Validate files" [label="Full path"];
"User input:\npath or template ID?" -> "Search CWD for\nmatching directory" [label="Template ID"];
"Search CWD for\nmatching directory" -> "Directory found?";
"Directory found?" -> "Validate files" [label="Yes (1 match)"];
"Directory found?" -> "Ask user for full path" [label="No match"];
"Ask user for full path" -> "Validate files" [label="User provides path"];
"Validate files" -> "Read README for stack name";
"Read README for stack name" -> "Check CFN stack status";
"Check CFN stack status" -> "Extract template ID from outputs" [label="CREATE_COMPLETE"];
"Check CFN stack status" -> "Generate results report" [label="Not deployed / failed, abort"];
"Extract template ID from outputs" -> "Display actionIds";
"Display actionIds" -> "Pre-experiment health check";
"Pre-experiment health check" -> "All resources healthy?";
"All resources healthy?" -> "Discover apps + start logs\n(app-service-log-analysis)" [label="Yes"];
"All resources healthy?" -> "Wait / prompt user" [label="No"];
"Wait / prompt user" -> "Pre-experiment health check" [label="Retry (poll 60s,\nmax 10 min non-interactive)"];
"Wait / prompt user" -> "Discover apps + start logs\n(app-service-log-analysis)" [label="User override"];
"Wait / prompt user" -> "Generate results report" [label="Abort"];
"Discover apps + start logs\n(app-service-log-analysis)" -> "User confirms experiment start";
"User confirms experiment start" -> "Start experiment" [label="Yes, I confirm"];
"User confirms experiment start" -> "Stop logs + analyze\n(app-service-log-analysis)" [label="No, abort"];
"Start experiment" -> "Monitor experiment\n+ log insights";
"Monitor experiment\n+ log insights" -> "Experiment complete?";
"Experiment complete?" -> "Monitor experiment\n+ log insights" [label="No, poll again"];
"Experiment complete?" -> "Wait 3 min post-baseline" [label="Yes"];
"Wait 3 min post-baseline" -> "Stop logs + analyze\n(app-service-log-analysis)";
"Stop logs + analyze\n(app-service-log-analysis)" -> "Generate results report";
}
The user provides either:

- the full path to a prepared experiment directory, or
- an FIS experiment template ID (e.g., EXT1a2b3c4d5e6f7)

If the user provides a template ID, search CWD for directories ending with that ID:
find . -maxdepth 1 -type d -name "*${TEMPLATE_ID_INPUT}" 2>/dev/null
The experiment directory name ends with the template ID (e.g.,
2026-04-11-az-power-int-my-cluster-EXT1a2b3c4d5e6f7). Extract it:
TEMPLATE_ID=$(basename "${EXPERIMENT_DIR}" | grep -oE 'EXT[a-zA-Z0-9]+$')
Store as TEMPLATE_ID. This is used in all subsequent steps.
Verify EXPERIMENT_DIR contains: cfn-template.yaml, README.md.
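A minimal validation sketch (it assumes EXPERIMENT_DIR was set in Step 1):

```bash
# Abort early if the prepared experiment directory is incomplete.
for f in cfn-template.yaml README.md; do
  if [ ! -f "${EXPERIMENT_DIR}/${f}" ]; then
    echo "Missing required file: ${EXPERIMENT_DIR}/${f}" >&2
    exit 1
  fi
done
```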
Read README.md from the experiment directory to extract:
- Scenario name (from the title line, e.g. `# FIS Experiment: AZ Power Interruption`)
- **Region:** {REGION}
- **Target AZ:** {AZ_ID} (if applicable)
- **Estimated Duration:** {DURATION}
- **CFN Stack:** {STACK_NAME} (for cleanup reference only)

Present a summary to the user with all extracted information.
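One way to pull these fields and run the stack status check from the flow diagram, as a sketch (it assumes the README uses the `**Field:** value` line format shown above):

```bash
# Extract metadata fields from the prepared README (assumes "**Field:** value" lines).
README="${EXPERIMENT_DIR}/README.md"
REGION=$(sed -n 's/^\*\*Region:\*\* //p' "${README}")
STACK_NAME=$(sed -n 's/^\*\*CFN Stack:\*\* //p' "${README}")

# Per the flow diagram, verify the stack is deployed before continuing.
aws cloudformation describe-stacks \
  --stack-name "${STACK_NAME}" \
  --region "${REGION}" \
  --query 'Stacks[0].StackStatus' --output text
```

Per the flow diagram, abort and skip straight to the results report if the status is anything other than CREATE_COMPLETE.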
Use TEMPLATE_ID (extracted from directory name in Step 1b) to query the experiment
template via AWS CLI and display all action IDs:
aws fis get-experiment-template \
--id "{TEMPLATE_ID}" \
--region {REGION} \
--query 'experimentTemplate.actions' --output json
Extract all actionId values from the actions map and display them to the user:
Actions found:
- {actionId_1}
- {actionId_2}
...
Proceed directly to Step 3.5 (resource health check).
Before starting log collection or the experiment itself, verify that every target resource referenced by the FIS experiment template is in a healthy baseline state. Starting an experiment against already-degraded resources makes results unattributable and may amplify impact on fragile infrastructure.
Scope: All resources listed in the FIS experiment template's targets map
(from the Step 3 query). This covers any managed service — RDS, Aurora, MSK,
ElastiCache (Redis/Memcached), EKS clusters and nodegroups, EC2, OpenSearch,
DocumentDB, etc. — whatever the template targets.
Procedure:
For each target in the experiment template, extract its resourceType (e.g.
aws:rds:db, aws:msk:cluster, aws:elasticache:replicationgroup) and the
actual resource identifiers (from resourceArns or resolved from resourceTags).
For each resource, call the appropriate AWS describe API for its service and
read the canonical status field. Use your knowledge of AWS services to pick the
right API and the right "healthy" value (e.g. RDS available, MSK ACTIVE,
ElastiCache available, EKS ACTIVE, EC2 running with status check ok).
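For example, a few of the describe calls this maps to (a sketch only; the actual services and identifiers depend on the template's targets):

```bash
# Canonical status checks per resource type (expected healthy value in comments).
aws rds describe-db-instances --db-instance-identifier "${DB_ID}" \
  --query 'DBInstances[0].DBInstanceStatus' --output text           # available
aws kafka describe-cluster-v2 --cluster-arn "${MSK_ARN}" \
  --query 'ClusterInfo.State' --output text                         # ACTIVE
aws elasticache describe-replication-groups --replication-group-id "${REPL_ID}" \
  --query 'ReplicationGroups[0].Status' --output text               # available
aws eks describe-cluster --name "${CLUSTER}" \
  --query 'cluster.status' --output text                            # ACTIVE
aws ec2 describe-instance-status --instance-ids "${INSTANCE_ID}" \
  --query 'InstanceStatuses[0].InstanceStatus.Status' --output text # ok
```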
Present a table to the user:

| Resource | Type | Status | Healthy? |
|---|---|---|---|
| {id_1} | {resourceType_1} | {status_1} | ✓ or ✗ |
| {id_2} | {resourceType_2} | {status_2} | ✓ or ✗ |
| ... | | | |
If the resource type is unfamiliar or the API call fails, mark the resource as unchecked and treat it as unhealthy for decision purposes.
Decision rules:

- All resources healthy: proceed to Step 4 (log collection).
- Any resource unhealthy, interactive: prompt the user to choose **proceed** (override and continue), **abort** (stop the workflow), or **retry** (re-run the health check now).
- Any resource unhealthy, non-interactive: re-poll every 60 seconds for up to 10 minutes, then abort if still unhealthy.

How to determine interactive vs non-interactive: use your own judgment based on the runtime context (e.g. whether a TTY is attached, whether you can invoke interactive prompts, or environment signals suggesting a CI/automated run). When uncertain, default to interactive behavior.
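A minimal TTY heuristic, as a sketch (one possible signal among those listed above, not an authoritative test):

```bash
# Treat the run as interactive only when stdin and stdout are attached to a TTY
# and no CI environment variable is set (heuristic only).
if [ -t 0 ] && [ -t 1 ] && [ -z "${CI:-}" ]; then
  INTERACTIVE=1
else
  INTERACTIVE=0
fi
```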
REQUIRED: You MUST use the skill tool to load the app-service-log-analysis
skill NOW, before proceeding. Call: skill(name="app-service-log-analysis").
This injects the skill's instructions into your context so you can execute its
steps. If the skill is not installed or cannot be loaded, inform the user and
skip log collection (the experiment can still run without it).
This step runs BEFORE the experiment starts — discovering applications after the experiment begins risks missing early log entries that get rotated or overwritten.
Execute from app-service-log-analysis skill:

- Discovers applications and reads expected-behavior.md
- Loads the kubeconfig (~/.kube/config) and verifies access to each cluster
- Starts kubectl logs -f for all confirmed applications across all clusters

This is the most dangerous step. The experiment WILL affect real resources.
Before starting, present a clear warning:
WARNING: Starting this FIS experiment will cause REAL impact:
Scenario: {SCENARIO_NAME}
Region: {REGION}
Target AZ: {AZ_ID}
Duration: {DURATION}
Stack: {STACK_NAME} (verified: CREATE_COMPLETE)
Template ID: {TEMPLATE_ID}
Resources that WILL be affected:
- {list each affected resource type and count from README}
Stop Conditions:
- {list each alarm that will stop the experiment}
Applications being monitored:
- {list each namespace/deployment from SERVICE_APP_MAP}
Managed service log collection:
- {list each service with logging status from MANAGED_LOG_GROUPS}
Log directory: {LOG_DIR}
Post-experiment baseline: 3 minutes (automatic)
Type "Yes, start experiment" to proceed, or "No" to abort.
Only proceed if the user explicitly confirms. If user aborts, proceed to Step 7 to stop log collection and clean up first.
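Starting the experiment, as a minimal sketch (the authoritative commands live in references/cli-commands.md):

```bash
# Start the experiment from the verified template and capture experiment.id.
EXPERIMENT_ID=$(aws fis start-experiment \
  --experiment-template-id "${TEMPLATE_ID}" \
  --region "${REGION}" \
  --query 'experiment.id' --output text)
echo "Experiment started: ${EXPERIMENT_ID}"
```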
Save the returned experiment.id.
Poll the experiment status and display progress. See references/cli-commands.md for
polling commands and experiment status reference.
Polling strategy: poll the experiment status at a regular interval until it reaches a terminal state (see the sketch below and references/cli-commands.md).
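A minimal polling sketch (the 30-second interval is illustrative, not prescribed by this skill):

```bash
# Poll until the experiment reaches a terminal state.
while true; do
  STATUS=$(aws fis get-experiment --id "${EXPERIMENT_ID}" \
    --region "${REGION}" \
    --query 'experiment.state.status' --output text)
  echo "Experiment status: ${STATUS}"
  case "${STATUS}" in
    completed|stopped|failed) break ;;
  esac
  sleep 30
done
```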
Log insights during each poll cycle: Execute app-service-log-analysis Step 5
(Real-time Monitoring Display) — read recent logs, count errors/warnings, display
per-app summary, detect recovery signals. The skill must already be loaded from Step 4.
During monitoring, remind the user that the experiment can be stopped manually at any time (see references/cli-commands.md for the stop command).

After the experiment completes (any terminal state):
Continue collecting logs for 3 minutes after the experiment ends to capture recovery behavior. This applies to both application logs and managed service logs. Display a countdown to the user:
Experiment completed. Collecting post-experiment baseline logs...
Remaining: {countdown} (3 minutes total)
After the 3-minute baseline window ends, proceed to analysis.
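One way to implement the countdown display, as a sketch:

```bash
# 3-minute post-experiment baseline, with a countdown update every 30 seconds.
for remaining in 180 150 120 90 60 30; do
  echo "Remaining: ${remaining}s (3 minutes total)"
  sleep 30
done
echo "Post-experiment baseline window complete."
```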
Execute app-service-log-analysis Steps 7-8:

- Stop all kubectl logs processes
- Analyze the collected logs

The application log analysis output is embedded into the experiment results report (see Step 10 below), NOT saved as a separate file.
After the experiment completes (any terminal state), generate a results report and write it directly to a local markdown file in the experiment directory.
See references/report-template.md for the complete report structure, file naming
convention, and timestamp format rules.
Per-service analysis: Identify all services affected by the experiment from the README's "Affected Resources" table. For each service, create a sub-section with: (1) timeline events, (2) observed behavior, (3) key findings. Include indirectly affected services.
After saving, print a brief terminal summary.
After the experiment, offer cleanup. See references/cli-commands.md for commands.
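A minimal cleanup sketch (the authoritative commands are in references/cli-commands.md; run this only after the user opts in):

```bash
# Tear down the experiment's CloudFormation stack after user confirmation.
aws cloudformation delete-stack --stack-name "${STACK_NAME}" --region "${REGION}"
aws cloudformation wait stack-delete-complete --stack-name "${STACK_NAME}" --region "${REGION}"
```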
| Error | Cause | Resolution |
|---|---|---|
| Stack name not found in README | README missing **CFN Stack:** field | Check if the experiment was prepared with a recent version of aws-fis-experiment-prepare |
| Stack not found (ValidationError) | Stack does not exist or was deleted | Deploy the stack first using aws-fis-experiment-prepare |
| Stack in CREATE_FAILED / ROLLBACK_COMPLETE | Stack deployment failed | Check stack events for failure reason, fix and redeploy |
| ExperimentTemplateId not in outputs | Stack template missing output | Check cfn-template.yaml for the output definition |
| AccessDeniedException | Insufficient permissions | Check IAM permissions for FIS, CloudWatch, CloudFormation |
| ResourceNotFoundException on targets | Tagged resources not found | Verify resource tags match experiment template |
| Experiment stuck in initiating | IAM role propagation delay | Wait 30 seconds and check again |
| kubectl: command not found | kubectl not installed | Install kubectl and configure kubeconfig |
| error: You must be logged in | kubeconfig not configured | Run aws eks update-kubeconfig --name {cluster} |
| /.pids: Permission denied | LOG_DIR variable empty due to && chain | Use multi-line script with export LOG_DIR=..., NOT && chains |
| No EKS apps discovered | No pods reference affected service endpoints | Ask user to manually specify namespace/deployment pairs |