{"skill":{"slug":"aws-service-chaos-research","displayName":"Aws Service Chaos Research","summary":"Use when the user asks about chaos engineering, fault injection, resilience testing, or HA verification for a SPECIFIC AWS service (e.g., RDS, EKS, MSK, Elas...","description":"---\nname: aws-service-chaos-research\ndescription: >\n  Use when the user asks about chaos engineering, fault injection, resilience testing,\n  or HA verification for a SPECIFIC AWS service (e.g., RDS, EKS, MSK, ElastiCache,\n  DynamoDB, S3, Lambda, OpenSearch, etc.). Triggers on \"chaos testing on [service]\",\n  \"fault injection for [service]\", \"how to test HA of [service]\",\n  \"FIS scenarios/actions for [service]\", \"[service] failover testing\",\n  \"[service] resilience testing\", \"[service] 混沌测试\", \"[service] 故障注入\",\n  \"[service] 高可用验证\", \"对 [service] 做混沌实验\", \"test my [service]\",\n  \"verify my [service] is resilient\". Use this skill even when the user phrases\n  it casually like \"test my RDS\" or \"how resilient is my MSK cluster\".\n---\n\n# AWS Service-Specific Chaos & HA Testing Research\n\nGenerate comprehensive chaos engineering and high availability testing scenarios for a\nspecific AWS service. 
Uses a **Scenario-Library-first** approach: read the latest FIS\nScenario Library documentation for pre-built composite scenarios first, then query\nindividual FIS actions via `list-actions`, and finally supplement with deep documentation\nresearch.\n\n## Output Language Rule\n\nDetect the language of the user's conversation and use the **same language** for all output.\n- Chinese input -> Chinese output\n- English input -> English output\n- Mixed -> follow the dominant language\n\n## Prerequisites\n\nRequired tools (at least one of each group):\n\n**FIS Scenario Library (Group A — documentation-based, always available):**\n- `aws___read_documentation` — read FIS Scenario Library pages directly (scenarios are\n  console-only and cannot be queried via CLI, so reading the latest docs is the only way\n  to discover them)\n\n**FIS Actions Discovery (Group B — use in order of preference):**\n1. **AWS CLI** `aws fis list-actions` — definitive, real-time list of FIS actions from user's region\n2. **aws___search_documentation** — FIS actions reference page as fallback when CLI is unavailable\n\n**Documentation Research (Group C):**\n- `aws___search_documentation` — search AWS official docs\n- `aws___read_documentation` — read full doc pages\n- `aws___recommend` — discover related pages\n\nAll documentation research uses **only** the AWS Knowledge MCP tools above.\nDo NOT use SearXNG or other web search tools for documentation research.\n\n## Workflow\n\n**CRITICAL — Sequential execution of all AWS Knowledge MCP calls:**\nAll calls to `aws___search_documentation`, `aws___read_documentation`, and\n`aws___recommend` MUST be executed **one at a time, sequentially**. 
NEVER send\nmultiple MCP requests in parallel — the aws-knowledge-mcp-server has strict rate\nlimits and will reject concurrent requests with \"Too many requests\" errors.\nWait for each request to return a complete response before sending the next one.\nThis applies to ALL steps below (Step 2, 4b, 4c, 5a, 5b).\n\n**Retry on failure:** If any MCP call (especially `aws___read_documentation`) returns\na rate limit error (\"Too many requests\") or any other transient error, **retry up to\n10 times** with a 5-second wait between retries. Only skip the request after all 10\nretries have failed.\n\n**Multi-service requests:** When the user asks about multiple services (e.g.,\n\"EKS, RDS, MSK, and ElastiCache\"), process them **one service at a time**. Complete\nall research steps (Steps 2-5) for one service before starting the next. Do NOT\nlaunch parallel research for multiple services — this will trigger rate limiting.\nThe Scenario Library fetch (Step 2) only needs to run once since it covers all\nservices; the per-service steps (3-5) must be repeated sequentially for each service.\n\n### Step 1: Identify Target Service\n\nExtract the target AWS service from the user's message and determine the target region.\n\n#### Region Detection\n\nFIS actions can differ across AWS regions — some actions may be available in\n`us-east-1` but not yet in `ap-southeast-1`. Always determine the target region first,\nbecause service keyword resolution depends on it.\n\n**Detection order (use the first one that applies):**\n\n1. **User explicitly specifies** — e.g., \"us-west-2\", \"东京区域\", \"ap-northeast-1\"\n2. **Infer from context** — resource ARNs, previous conversation mentioning a region\n3. **Check AWS CLI default** — run `aws configure get region` to get the configured default\n4. **Ask the user** — if none of the above yields a region, ask:\n   \"Which AWS region are you targeting? 
FIS actions and scenarios may vary by region.\"\n\nStore the resolved region as `TARGET_REGION` for use in subsequent steps.\n\n#### Service Keyword Resolution\n\nFIS action IDs follow the pattern `aws:<service>:<action>`. To map the user's input\nto the correct FIS service keyword, use dynamic discovery from the live FIS action list:\n\n```bash\naws fis list-actions --region TARGET_REGION | jq '.actions[].id' | awk -F':' '{print $2}' | sort -u\n```\n\nThis returns the definitive list of FIS-supported service keywords in that region\n(e.g., `ebs`, `ec2`, `ecs`, `eks`, `elasticache`, `fis`, `network`, `rds`, `s3`, `ssm`...).\nMatch the user's service name against this list. For example, if the user says\n\"Aurora\", match it to `rds`; if \"Kubernetes\", match to `eks`.\n\nIf the AWS CLI is not available, derive the keyword by lowercasing the AWS service name\nand removing spaces/hyphens (e.g., \"ElastiCache\" -> `elasticache`).\n\nIf the service is ambiguous, ask the user to clarify (e.g., \"RDS MySQL or Aurora MySQL?\").\n\nAlso determine the deployment architecture if the user mentions it:\n- Multi-AZ, Multi-Region, Single-AZ\n- Read replicas, Global Tables, Cross-region replication\n- This affects which scenarios are relevant\n\n### Step 2: Fetch FIS Scenario Library (Scenario-Library-First)\n\n**This step has the highest priority.** The FIS Scenario Library provides AWS-curated\ncomposite scenarios that orchestrate multiple fault injection actions into realistic\nfailure simulations. These are the most valuable starting point because they represent\nAWS's own recommendations for how to test resilience.\n\nScenario Library scenarios are **console-only** — they cannot be listed or queried via\nAWS CLI or API. The only way to discover them is by reading the latest documentation.\n\nFetch the Scenario Library pages listed in `references/search-queries.md` under\n\"FIS Scenario Library Pages (Always Fetch)\". 
Read both the overview and detailed scenario\npages relevant to the target service. **Read pages one at a time, sequentially** —\nwait for each `aws___read_documentation` call to complete before starting the next one.\n\n#### From the scenario documentation, extract for each relevant scenario:\n\n- **Scenario name and description**\n- Which **sub-actions** the scenario orchestrates\n- Which sub-actions are **relevant to the target service**\n- What **resource tags** are required to target specific resources\n- The **default durations** (interruption + recovery phases)\n- Any **prerequisites or limitations**\n- **Stop condition** recommendations\n\n#### Decision: Which scenarios apply?\n\nAfter reading the documentation, classify each scenario's relevance:\n\n| Relevance | Criteria |\n|---|---|\n| **Directly relevant** | Scenario includes sub-actions that explicitly target the service (e.g., \"Failover RDS\" in AZ Power Interruption) |\n| **Indirectly relevant** | Scenario affects infrastructure the service depends on (e.g., network disruption affects any VPC-based service) |\n| **Not relevant** | Scenario has no meaningful impact on the target service |\n\nInclude both directly and indirectly relevant scenarios in the output.\n\n### Step 3: Query FIS Actions\n\nAfter the Scenario Library research, query individual FIS actions to discover\nservice-specific fault injection capabilities that may not be covered by composite\nscenarios.\n\n#### Path A: AWS CLI Available (Preferred)\n\n**Step 3a: Fetch ALL FIS actions in the target region:**\n\n```bash\naws fis list-actions --region TARGET_REGION --query 'actions[].{id:id, description:description}' --output json\n```\n\nReplace `TARGET_REGION` with the region resolved in Step 1 (e.g., `us-east-1`).\nIf no region was determined, omit `--region` to use the CLI default, but **warn\nthe user** that results reflect their default region and may differ in other regions.\n\n**Step 3b: Filter for target service** — from the full 
list, find actions whose `id`\ncontains the search keyword(s) from Step 1:\n\n```bash\naws fis list-actions --region TARGET_REGION --query 'actions[?starts_with(id, `aws:KEYWORD:`)].{id:id, description:description}' --output json\n```\n\nAlso scan the description field for the service name, because some actions may\nreference a service in their description even if the action prefix is different.\n\n**Step 3c (Optional): Collect cross-cutting actions** — these affect services\nindirectly. Include them if the user's service would benefit from network, API, or\ninfrastructure-level fault injection testing:\n\n```bash\naws fis list-actions --region TARGET_REGION --query 'actions[?starts_with(id, `aws:network:`) || starts_with(id, `aws:fis:inject`) || starts_with(id, `aws:ssm:`) || starts_with(id, `aws:ec2:stop`) || starts_with(id, `aws:ec2:terminate`)].{id:id, description:description}' --output json\n```\n\nCross-cutting actions and when they're useful:\n- `aws:network:disrupt-connectivity` — useful for any VPC-based service\n- `aws:network:disrupt-vpc-endpoint` — useful for services accessed via PrivateLink\n- `aws:fis:inject-api-internal-error` — useful to test app handling of AWS API failures\n- `aws:fis:inject-api-throttle-error` — useful to test backoff/retry logic\n- `aws:fis:inject-api-unavailable-error` — useful to test graceful degradation\n- `aws:ec2:stop-instances` / `terminate-instances` — useful for services running on EC2\n- `aws:ssm:send-command` / `start-automation-execution` — useful for custom fault scripts\n\nWhether to include cross-cutting actions depends on context:\n- **Include** when the service runs on EC2, uses VPC networking, or the user is\n  interested in infrastructure-level failure testing\n- **Skip** when the user is focused only on service-native failures, or the service\n  is fully managed with no user-accessible infrastructure layer\n\n#### Path B: AWS CLI Not Available\n\nSearch the FIS actions reference 
documentation:\n```\naws___search_documentation(\n  search_phrase=\"AWS FIS actions [SERVICE_NAME] fault injection\",\n  topics=[\"reference_documentation\"],\n  limit=10\n)\n```\n\nThen read the FIS actions reference page:\n```\naws___read_documentation(\n  url=\"https://docs.aws.amazon.com/fis/latest/userguide/fis-actions-reference.html\",\n  max_length=10000\n)\n```\n\n#### Decision Point: FIS Actions Found?\n\nCount the number of **service-specific** actions found (exclude cross-cutting actions).\n\n- **YES (1+ service-specific actions found)** -> Continue to Step 4 (FIS-Enriched Path)\n- **NO (zero service-specific actions)** -> Jump to Step 5 (Documentation-Only Path)\n\n### Step 4: FIS-Enriched Path\n\nWhen FIS has native actions for the target service, combine Scenario Library findings\nwith FIS-action-specific details.\n\n#### 4a: Organize FIS Actions into Testing Scenarios\n\nMap each FIS action to a testing scenario. Use the \"FIS Native Fault Injection\nScenarios\" table format from `references/output-template.md`.\n\n**IMPORTANT — Scenario Library deduplication (must apply before building the table):**\nBefore listing any FIS action in the per-service table, check whether that exact\naction ID appeared as a sub-action in any Scenario Library composite scenario\ndiscovered in Step 2. Common examples of overlap:\n- `aws:rds:failover-db-cluster` — sub-action of AZ Power Interruption\n- `aws:elasticache:replicationgroup-interrupt-az-power` — sub-action of AZ Power Interruption\n- `aws:eks:pod-network-latency` — sub-action of AZ Application Slowdown\n- `aws:eks:pod-network-packet-loss` — sub-action of Cross-AZ Traffic Slowdown\n- `aws:ec2:stop-instances` — sub-action of AZ Power Interruption\n\nRules:\n1. If an action **is** a Scenario Library sub-action, **still list it** in the\n   per-service table but append to the \"HA Verification Purpose\" column:\n   \"(Also sub-action of {Scenario Name} — see Scenario Library section)\".\n2. 
If **all** service-specific FIS actions are Scenario Library sub-actions (e.g.,\n   ElastiCache has only `replicationgroup-interrupt-az-power`, which is covered by\n   AZ Power Interruption), **omit** the \"FIS Native Fault Injection Scenarios\"\n   sub-section entirely and replace it with:\n   > All FIS native actions for {SERVICE} are covered by Scenario Library composite\n   > scenarios. See the Scenario Library and Cross-Cutting section for details.\n\nGroup scenarios by failure domain:\n1. **Instance/Task Level** — individual resource failure\n2. **Storage Level** — disk/volume failure or degradation\n3. **Network Level** — connectivity disruption\n4. **AZ Level** — availability zone failure simulation\n5. **Region Level** — cross-region failover\n6. **API/Control Plane** — AWS API errors\n\n#### 4b: Enrich with Service-Specific Capabilities\n\nSome services have **built-in fault injection** beyond FIS. 
Search for these\n(**sequentially** — wait for the search to complete before reading any result pages):\n\n```\naws___search_documentation(\n  search_phrase=\"[SERVICE_NAME] fault injection testing failover simulation\",\n  topics=[\"general\", \"reference_documentation\"],\n  limit=10\n)\n```\n\nIf found, add a \"Service Built-in Fault Injection\" section using the table format from\n`references/output-template.md`.\n\n#### 4c: Deep Documentation Research\n\nUse the search queries from `references/search-queries.md` under \"FIS-Enriched Path\".\nRun all 5 queries **sequentially** (one at a time). After searches, read the top 3-5\nmost relevant pages **one at a time** and use `aws___recommend` on the most relevant\npage for discovery. Never send multiple read or recommend requests in parallel.\n\n### Step 5: Documentation-Only Path (No FIS Actions)\n\nWhen FIS has no native actions for the target service, fall back to comprehensive\ndocumentation research. Note that Scenario Library findings from Step 2 still apply.\n\n#### 5a: Deep Documentation Search\n\nUse the search queries from `references/search-queries.md` under \"Documentation-Only Path\".\nRun all 6 queries **sequentially** (one at a time, wait for each to complete).\n\n#### 5b: Read Key Pages and Discover Related Content\n\nFrom the combined search results, read the **top 5 most relevant pages** following the\npriority order in `references/search-queries.md`. Read pages **one at a time** — wait\nfor each `aws___read_documentation` call to complete before the next. 
Then use\n`aws___recommend` on the service's main documentation page to discover related content.\n\nExtract from all pages:\n- **Failure modes** the service can experience\n- **Built-in HA mechanisms** (automatic failover, replication, etc.)\n- **Testing approaches** documented in official guides\n- **Monitoring/metrics** to watch during tests\n\n#### 5c: Compile Alternative Testing Approaches\n\nUse the \"Testing Methods (No Native FIS Actions)\" section format from `references/output-template.md`,\nincluding both indirect FIS actions and AWS API/Console methods.\n\n### Step 6: Compile Output and Save to Local File\n\n**Write the report directly to a local markdown file** instead of outputting the full\ncontent to the terminal. Use the following file naming convention:\n\n```bash\nTIMESTAMP=$(TZ=Asia/Shanghai date +%Y-%m-%d-%H-%M-%S)\nSERVICE_SLUG=$(echo \"{SERVICE_NAME}\" | tr '[:upper:]' '[:lower:]' | tr ' :/' '-')\n# File name: ${TIMESTAMP}-${SERVICE_SLUG}-chaos-research.md\n```\n\nFor multi-service requests, generate **one file per service**:\n- `${TIMESTAMP}-rds-chaos-research.md`\n- `${TIMESTAMP}-eks-chaos-research.md`\n- etc.\n\nCompile the report content using the exact format defined in `references/output-template.md`\nand save it to the file. The report must include all sections in this order:\n\n1. **Executive Summary** — overview with region, FIS support status, key recommendation\n2. **Scenario Library and Cross-Cutting** — Scenario Library composite scenarios (highest priority), cross-cutting actions as optional supplement. **This section comes BEFORE per-service sections.**\n3. **Per-service sections** — each with: FIS scenarios (using `{SVC}-#` test IDs, e.g., `EKS-1`, `Redis-1`), built-in methods, recommended testing scenario matrix, environment observations, and stop conditions\n4. 
**Recommended Test Priority (Consolidated)** — references test IDs from per-service sections; do NOT duplicate full descriptions; do NOT list a FIS action separately if already covered by a Scenario Library scenario in the same table\n5. **Implementation Best Practices** — steady state, DNS/connection, blast radius\n6. **Reference Materials** — only URLs from actual search results or pages read\n7. **Next Steps** — 3-4 actionable next steps\n\nAfter saving, print a brief summary to the terminal listing only:\n- The file path(s) of the generated report(s)\n- Target service(s) and region\n- Number of FIS actions found (service-specific + cross-cutting)\n- Number of Scenario Library scenarios identified\n- Top 3 recommended test priorities\n\n## Important Guidelines\n\n- **Scenario Library first, always.** The FIS Scenario Library represents AWS's own\n  curated resilience testing scenarios. Always read the latest Scenario Library\n  documentation before anything else. These are documentation-based (console-only),\n  not CLI-queryable.\n- **Region matters.** Always resolve the target region before querying FIS actions.\n  FIS action availability varies by region. Always pass `--region` to the AWS CLI and\n  clearly state the region in the output.\n- **Don't fabricate FIS actions.** If an action doesn't exist, say so clearly. 
The\n  fallback path exists precisely for services FIS doesn't cover.\n- **Don't fabricate links.** Only include URLs from actual search results or known\n  documentation pages you've read.\n- **Be specific about the service.** Every recommendation should reference the specific\n  service, its HA mechanisms, and its specific metrics.\n- **Cross-cutting actions are optional context.** Include them when they add value,\n  but focus on service-specific actions and Scenario Library scenarios first.\n- **AWS Knowledge MCP only for docs research.** Do NOT use SearXNG or other web search.\n  Use `aws___search_documentation`, `aws___read_documentation`, and `aws___recommend`.\n- **Search across multiple topics.** Use different `topics` values (`general`,\n  `reference_documentation`, `troubleshooting`) sequentially.\n- **Use aws___recommend for discovery.** After reading a key page, call `aws___recommend`\n  to find related content that keyword search may miss.\n- **NEVER send MCP requests in parallel.** All calls to `aws___search_documentation`,\n  `aws___read_documentation`, and `aws___recommend` MUST be executed one at a time.\n  Wait for each response before sending the next request. Parallel calls will trigger\n  \"Too many requests\" errors from the aws-knowledge-mcp-server. This is the single\n  most common cause of failures — enforce strictly in every step.\n- **Retry on failure — up to 10 times.** If any MCP call fails with a rate limit or\n  transient error, wait 5 seconds and retry. 
Repeat up to 10 times before skipping.\n- **Respect language.** Output in the same language as the user's conversation.\n","tags":{"latest":"1.0.0"},"stats":{"comments":0,"downloads":326,"installsAllTime":12,"installsCurrent":0,"stars":0,"versions":1},"createdAt":1778412848261,"updatedAt":1778492892498},"latestVersion":{"version":"1.0.0","createdAt":1778412848261,"changelog":"Initial release with AWS service-specific chaos engineering research workflow:\n\n- Supports user queries for chaos testing, fault injection, resilience, or HA of specific AWS services (e.g., RDS, EKS, MSK, S3, Lambda).\n- Scenario-Library-first approach: fetches and summarizes AWS FIS Scenario Library documentation for relevant built-in scenarios.\n- Dynamically discovers available FIS actions per region, supporting region detection through explicit mention, context, or CLI default.\n- Handles both directly and indirectly relevant scenarios, with clear extraction of prerequisites, resource tags, and stop conditions.\n- All AWS documentation and action discovery workflows are strictly sequential, with robust retry logic to handle rate limits and transient errors.\n- Multilingual output: responds in the user's input language (English, Chinese, or mixed), following user preference.","license":"MIT-0"},"metadata":null,"owner":{"handle":"panlm","userId":"s170tnbkqyrrzzgez8ybwx5vcs86edhj","displayName":"panlm","image":"https://avatars.githubusercontent.com/u/1658398?v=4"},"moderation":null}