Alibaba Cloud Kafka Capacity Assessment

alibabacloud-skills-team@sdk-team

Performs capacity assessment on Alibaba Cloud Kafka instances to determine whether throttling is occurring and recommends instance upgrades when capacity is running high. Triggers when users describe issues such as consumer lag buildup, producer send failures, throughput throttling, or connection anomalies on a Kafka instance. Trigger phrases: "kafka capacity assessment", "kafka throttling", "kafka consumer lag", "kafka performance bottleneck", "kafka throughput insufficient", "kafka connection limit", "kafka disk full", "kafka anomaly". Do NOT use for: Kafka instance creation, Topic/partition management and configuration, message send/receive code debugging, or fault diagnosis unrelated to capacity.

Install

openclaw skills install @sdk-team/alibabacloud-kafka-capacity-assessment

Kafka Instance Capacity Assessment

Based on the incident information provided by the user, this skill queries CloudMonitor metrics and instance metadata to determine whether a Kafka instance has reached a capacity bottleneck, and provides upgrade recommendations accordingly.

Architecture: Kafka Instance (v2/v3 Series) + CloudMonitor (CMS) + Aliyun CLI

Pre-check: Aliyun CLI >= 3.3.3 required

Run aliyun version and confirm that the reported version is 3.3.3 or later. If the CLI is not installed or the version is older than 3.3.3, run curl -fsSL https://aliyuncli.alicdn.com/setup.sh | bash to install or update it, or follow the instructions in references/cli-installation-guide.md.
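The version pre-check can be scripted. The sketch below is one possible approach, not part of the skill itself: version_ge and check_aliyun_cli are hypothetical helper names, the comparison relies on sort -V from GNU coreutils, and it assumes aliyun version prints a bare version string such as 3.3.3.

```shell
# Return 0 when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort), which GNU coreutils provides.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: compare the installed CLI against the 3.3.3 minimum.
# Assumes `aliyun version` prints a bare version string (an assumption).
check_aliyun_cli() {
  local installed
  installed="$(aliyun version 2>/dev/null)" || return 1
  version_ge "$installed" "3.3.3"
}
```

A caller might run check_aliyun_cli || echo "Aliyun CLI missing or older than 3.3.3" before continuing. Version sort is used rather than plain string comparison so that 3.10.0 is treated as newer than 3.3.3.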

Pre-check: Aliyun CLI plugin update required

[MUST] run aliyun configure set --auto-plugin-install true to enable automatic plugin installation. [MUST] run aliyun plugin update to ensure that any existing plugins are always up-to-date.

Authentication

Pre-check: Alibaba Cloud Credentials Required

Security Rules:

NEVER read, echo, or print AK/SK values (e.g., echo $ALIBABA_CLOUD_ACCESS_KEY_ID is FORBIDDEN)

NEVER ask the user to input AK/SK directly in the conversation or command line

NEVER use aliyun configure set with literal credential values

ONLY use aliyun configure list to check credential status

bash

aliyun configure list

Check the output for a valid profile (AK, STS, or OAuth identity).

If no valid profile exists, STOP here.

Obtain credentials from Alibaba Cloud Console

Configure credentials outside of this session (via aliyun configure in terminal or environment variables in shell profile)

Return and re-run after aliyun configure list shows a valid profile

Observability (MUST follow for every aliyun command)

Upon loading this skill, generate a random session ID (32-char lowercase hex string) once for the entire session. Use it as {session-id} below.

Rule: Every aliyun CLI command that calls a cloud API MUST include the --user-agent flag. Local utility commands (e.g. configure, plugin, version) do not support this flag and should be excluded.

text

--user-agent AlibabaCloud-Agent-Skills/alibabacloud-kafka-capacity-assessment/{session-id}

Example (assuming session-id is a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6):

bash

aliyun ecs describe-instances --biz-region-id cn-hangzhou --user-agent AlibabaCloud-Agent-Skills/alibabacloud-kafka-capacity-assessment/a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6

Do not skip, alter the format, or omit --user-agent on any aliyun API command invocation.
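A minimal sketch of how the session ID and the --user-agent value could be produced in a shell. The function names new_session_id and skill_user_agent are hypothetical. Reading 16 bytes from /dev/urandom is one way among several to obtain a 32-character lowercase hex string.

```shell
# Generate a 32-character lowercase hex string from 16 random bytes.
new_session_id() {
  od -An -N16 -tx1 /dev/urandom | tr -d ' \n'
}

# Build the --user-agent value for the session ID passed as $1.
skill_user_agent() {
  printf 'AlibabaCloud-Agent-Skills/alibabacloud-kafka-capacity-assessment/%s' "$1"
}
```

Call new_session_id once when the skill loads, store the result (for example SESSION_ID="$(new_session_id)"), and pass --user-agent "$(skill_user_agent "$SESSION_ID")" on every API command so that the whole session shares one ID.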

Script / Terraform execution: When running Python SDK scripts, Terraform commands, or bash scripts, inject the session-id via an inline environment variable so that the code can read it at runtime:

bash

# Python SDK script
SKILL_SESSION_ID={session-id} python3 scripts/deploy.py

# Terraform
SKILL_SESSION_ID={session-id} terraform apply

Scripts and Terraform configs should read SKILL_SESSION_ID from the environment (default to empty string if absent). See references/how-to-implement-by-common-sdk.md for SDK patterns.
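On the receiving side, a bash script might read the variable as sketched below. The function name user_agent_from_env is hypothetical, and keeping the trailing slash when the variable is absent is an assumption, since the text above only requires defaulting to an empty string.

```shell
# Read SKILL_SESSION_ID from the environment, defaulting to an empty string,
# and build the user-agent value from it.
user_agent_from_env() {
  local sid="${SKILL_SESSION_ID:-}"
  printf 'AlibabaCloud-Agent-Skills/alibabacloud-kafka-capacity-assessment/%s' "$sid"
}
```

The ${VAR:-} expansion keeps the script safe under set -u, where referencing an unset variable would otherwise abort execution.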

RAM Policy

This skill requires the following RAM permissions:

API	Permission Action	Description
`GetInstanceList`	`alikafka:GetInstanceList`	Retrieve the Kafka instance list and instance metadata
`DescribeMetricList`	`cms:DescribeMetricList`	Query CloudMonitor metric data

For the full permission list, see references/ram-policies.md.

[MUST] Permission Failure Handling: When any command or API call fails due to permission errors at any point during execution, follow this process:

Read references/ram-policies.md to get the full list of permissions required by this SKILL

Use ram-permission-diagnose skill to guide the user through requesting the necessary permissions

Pause and wait until the user confirms that the required permissions have been granted

Parameter Confirmation

IMPORTANT: Parameter Confirmation — Before executing any command or API call, ALL user-customizable parameters (e.g., RegionId, instance names, CIDR blocks, passwords, domain names, resource specifications, etc.) MUST be confirmed with the user. Do NOT assume or use default values without explicit user approval.

Parameter	Required/Optional	Description	Default
Instance Region	Required	Region ID where the instance resides, e.g. `cn-hangzhou`	None
Instance ID	Required	Kafka instance ID, e.g. `alikafka_post-cn-xxxxx`	None
Incident Time / Time Range	Optional	Minute-level precision is preferred	Last 1 hour from current time
Symptoms	Optional	Consumer lag / producer failure / connection anomaly, etc.	None (full assessment if omitted)

Core Workflow (SOP)

Hard Rules

This skill performs read-only operations only. If a bottleneck is identified, inform the user which metrics need to be scaled up via the console. Calling any write OpenAPI or executing CLI commands to upgrade the instance on behalf of the user is strictly prohibited.
v2 and v3 instances use different CloudMonitor metrics. Always query according to the specifications below. Using v2 MetricNames on a v3 instance, or vice versa, is strictly prohibited.

Step 1: Collect Incident Context

Gather parameters from the user (see the Parameter Confirmation section above). For any required parameter not yet provided, prompt for it individually.

Key symptoms to watch for:

Message backlog / increasing consumer lag
Producer send failures / reduced send throughput
Reduced consumer consumption rate
Clients unable to connect to the broker
Messages deleted before the committed retention period expires
Instance throughput throttled

Step 2: Identify Instance Series, Edition, and Specification Family

bash

aliyun alikafka get-instance-list --biz-region-id <RegionId> --region <RegionId> \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-kafka-capacity-assessment/{session-id}

In Aliyun CLI Plugin Mode, the region must be passed via both --biz-region-id and --region.

Extract the following key fields from the response:

Series: Identifies the instance series (v2 or v3)
v2 series: Refer to Kafka v2 Instance Specification and Capacity Policy Section 1 to determine the edition and specification family
v3 series: Refer to Kafka v3 Instance Specification, Elastic Strategy, and Capacity Policy Section 1 to determine the edition (v3 has no specification family distinction)

If the user has provided an instance ID, filter by it:

bash

aliyun alikafka get-instance-list --biz-region-id <RegionId> --region <RegionId> --instance-id <InstanceId> \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-kafka-capacity-assessment/{session-id}

Filter by series:

bash

aliyun alikafka get-instance-list --biz-region-id <RegionId> --region <RegionId> --series v2 \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-kafka-capacity-assessment/{session-id}

Step 3: Infer Metrics Likely Approaching Their Limits

Based on the symptoms described by the user and the capacity bottleneck guidance in the knowledge base, infer the likely bottleneck:

Symptom	Probable Bottleneck Metric
Messages deleted before retention period expires	Disk space at capacity
Message backlog	Network throughput throttled, or produce/consume request rate throttled
Producer send throughput degraded	Produce traffic throttled or produce request rate throttled
Consumer consumption rate degraded	Consume traffic throttled or consume request rate throttled
Clients unable to connect to broker	Connection count at limit (necessary but not sufficient — verify with metrics)

Step 4: Query CloudMonitor Metric Data

4.1 Determine Query Parameters

Based on the Series value confirmed in Step 2, determine the Namespace and MetricName:

Namespace: Fixed value — acs_kafka
v2 instance MetricNames: See Kafka v2 Instance Specification and Capacity Policy Section 5
v3 instance MetricNames: See Kafka v3 Instance Specification, Elastic Strategy, and Capacity Policy Section 6

4.2 Construct the Query Command

bash

aliyun cms describe-metric-list \
  --namespace acs_kafka \
  --metric-name <MetricName> \
  --period 60 \
  --start-time "<start time YYYY-MM-DD HH:mm:ss>" \
  --end-time "<end time YYYY-MM-DD HH:mm:ss>" \
  --dimensions '[{"instanceId":"<InstanceId>"}]' \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-kafka-capacity-assessment/{session-id}

Parameter notes:

--period: Aggregation interval in seconds; valid values: 15 / 60 / 900 / 3600. Recommended: 60
--start-time / --end-time: Time range. The interval must not exceed 31 days, and the range is left-open, right-closed (the start time is excluded, the end time is included). If the user does not specify a time range, default to the last 1 hour (--end-time = current time, --start-time = current time - 1 hour) and note this in the diagnostic report
--dimensions: JSON string specifying the instance ID
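The default one-hour window can be computed in the shell as in the sketch below. The function name last_hour_range is hypothetical and the code requires GNU date for the -d "@epoch" syntax. The timestamps are formatted in the shell's local time zone; whether CloudMonitor interprets them in that same zone is an assumption to verify against the official documentation.

```shell
# Print "<start>|<end>" covering the hour that ends at the given time,
# formatted as YYYY-MM-DD HH:MM:SS.
# Optional $1: end time in epoch seconds (defaults to the current time).
# Requires GNU date for the -d "@epoch" syntax.
last_hour_range() {
  local end_epoch="${1:-$(date +%s)}"
  local start_epoch=$(( end_epoch - 3600 ))
  local fmt='+%Y-%m-%d %H:%M:%S'
  printf '%s|%s\n' \
    "$(date -d "@${start_epoch}" "$fmt")" \
    "$(date -d "@${end_epoch}" "$fmt")"
}
```

The two halves can then be split on the | character and passed to --start-time and --end-time. Accepting the end time as an argument also makes it straightforward to query a historical incident window.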

4.3 Query the Applicable Metrics by Instance Series

Metrics for v2 instances (select based on the inferred bottleneck from Step 3):

MetricName	Description	Bottleneck Threshold
`InstanceMessageInputRatioV2`	Produce traffic as a percentage of instance specification limit	Approaching 100% indicates a bottleneck
`InstanceMessageOutputRatioV2`	Consume traffic as a percentage of instance specification limit	Approaching 100% indicates a bottleneck
`PartitionInstanceRatioV2`	Partition count as a percentage of instance specification limit	Approaching 100% indicates a bottleneck
`instance_disk_capacity`	Instance disk utilization	> 80% indicates near-full; Kafka dynamic cleanup begins
`InstanceMaxConnection`	Maximum connections per node (public + private network)	Compare against formula-calculated limit
`InstanceMaxInternetConnection`	Maximum connections per node (public network only)	Compare against formula-calculated limit
`instance_reqs_input`	Produce request rate (requests/sec)	Compare against formula-calculated limit
`instance_reqs_output`	Consume request rate (requests/sec)	Compare against formula-calculated limit

Metrics for v3 instances (select based on the inferred bottleneck from Step 3):

MetricName	Description	Bottleneck Threshold
`InstanceMessageInputRatioV3`	Produce traffic as a percentage of the elastic ceiling	> 100% means the elastic ceiling is exceeded and throttling is occurring; (0%, 100%] means reserved capacity is exceeded, incurring elastic overage charges
`InstanceMessageOutputRatioV3`	Consume traffic as a percentage of the elastic ceiling	Same as above
`InstanceMaxConnectionRatioV3`	Connection utilization per node (public + private network)	Approaching 100% indicates a bottleneck
`InstanceMaxInternetConnectionRatioV3`	Connection utilization per node (public network only)	Approaching 100% indicates a bottleneck
`InstanceThrottleTimeP99InputV3`	Produce throttle duration	> 0 indicates active throttling
`InstanceThrottleTimeP99OutputV3`	Consume throttle duration	> 0 indicates active throttling
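One way to combine the symptom table from Step 3 with the two metric tables above is a lookup keyed on series and symptom. This is an illustrative sketch: the function name candidate_metrics and the symptom keywords (retention, backlog, producer, consumer, connection) are hypothetical labels, and the pairing of symptoms with MetricNames is an interpretation of the tables rather than an official mapping.

```shell
# Print candidate MetricNames, one per line, for a series (v2|v3) and a
# symptom keyword. Keywords are hypothetical labels for the Step 3 rows.
# Keying on the series keeps v2 and v3 MetricNames from being mixed.
candidate_metrics() {
  case "$1:$2" in
    v2:retention)  printf '%s\n' instance_disk_capacity ;;
    v2:backlog)    printf '%s\n' InstanceMessageInputRatioV2 InstanceMessageOutputRatioV2 \
                                 instance_reqs_input instance_reqs_output ;;
    v2:producer)   printf '%s\n' InstanceMessageInputRatioV2 instance_reqs_input ;;
    v2:consumer)   printf '%s\n' InstanceMessageOutputRatioV2 instance_reqs_output ;;
    v2:connection) printf '%s\n' InstanceMaxConnection InstanceMaxInternetConnection ;;
    v3:backlog)    printf '%s\n' InstanceMessageInputRatioV3 InstanceMessageOutputRatioV3 \
                                 InstanceThrottleTimeP99InputV3 InstanceThrottleTimeP99OutputV3 ;;
    v3:producer)   printf '%s\n' InstanceMessageInputRatioV3 InstanceThrottleTimeP99InputV3 ;;
    v3:consumer)   printf '%s\n' InstanceMessageOutputRatioV3 InstanceThrottleTimeP99OutputV3 ;;
    v3:connection) printf '%s\n' InstanceMaxConnectionRatioV3 InstanceMaxInternetConnectionRatioV3 ;;
    *) return 1 ;;  # unknown input, or no metric listed (for example v3 retention)
  esac
}
```

For example, candidate_metrics v3 producer prints the two v3 produce-side metrics, each of which can then be passed to --metric-name in the query command from 4.2.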

Step 5: Analyze Findings and Generate Diagnostic Report

5.1 Analysis

Combine the following inputs for a comprehensive assessment:

Instance metadata (Step 2) → obtain the specification limits for each metric
CloudMonitor data (Step 4) → obtain the actual usage of each metric during the incident window
When investigating a historical incident, prioritize ratio-based metrics (e.g., InstanceMessageInputRatioV2, instance_disk_capacity, InstanceMessageInputRatioV3) or anomaly-based metrics (e.g., InstanceThrottleTimeP99InputV3). Fall back to absolute-value capacity metrics only if the preferred metrics return no data.

v2 instance capacity limit calculation:

See Kafka v2 Instance Specification and Capacity Policy Section 3

v3 instance capacity limit calculation:

See Kafka v3 Instance Specification, Elastic Strategy, and Capacity Policy Sections 2–4
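The threshold rules stated in the metric tables can be applied to observed values with small helpers such as the ones sketched below. The function names and result labels are hypothetical. Treating a v3 ratio of exactly 0 as "within-reserved" is an inference from the (0%, 100%] rule, and metrics whose threshold is "approaching 100%" or formula-based are deliberately left out because no exact cutoff is given.

```shell
# Classify a v3 traffic ratio (percent of the elastic ceiling).
classify_v3_traffic_ratio() {
  awk -v r="$1" 'BEGIN {
    if (r + 0 > 100)    print "throttled"         # elastic ceiling exceeded
    else if (r + 0 > 0) print "elastic-overage"   # reserved capacity exceeded
    else                print "within-reserved"   # inferred for a ratio of 0
  }'
}

# Classify a v3 throttle-time value: anything above zero is active throttling.
classify_v3_throttle_time() {
  awk -v t="$1" 'BEGIN { print ((t + 0 > 0) ? "throttling" : "none") }'
}

# Classify v2 disk utilization (percent): above 80 means dynamic cleanup begins.
classify_v2_disk() {
  awk -v d="$1" 'BEGIN { print ((d + 0 > 80) ? "near-full" : "ok") }'
}
```

awk is used for the comparisons because shell arithmetic is integer-only, while CloudMonitor datapoints are typically fractional.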

5.2 Generate Diagnostic Report

Output the following structured report to the user:

markdown

## Kafka Instance Capacity Assessment Report

### Basic Information
- Instance ID:
- Region:
- Instance Series: v2 / v3
- Edition:
- Specification:
- Incident Time:

### Incident Summary
- User-Reported Symptoms:
- Anomalous Metrics:

### Monitoring Data Analysis
| Metric Name | Observed Value | Specification Limit | Utilization / Threshold Exceeded | Conclusion |
|:---|:---|:---|:---|:---|
| ... | ... | ... | ... | ... |

### Diagnostic Conclusion
- Root Cause:
- Metrics at or Exceeding Limit:

### Upgrade Recommendations
- Metrics to scale up and recommended target values:
- How to proceed: Navigate to Alibaba Cloud Console > Cloud Message Queue for Apache Kafka > Instance Details > Upgrade

Success Verification

See references/verification-method.md for verification steps.

Cleanup

This skill performs read-only query operations only and does not create or modify any resources. No cleanup is required.

Command Tables

For a complete list of CLI commands used by this skill, see references/related-commands.md.

Best Practices

Prefer ratio/percentage metrics for bottleneck detection to avoid inconsistencies caused by historical data recorded under a different specification tier
For v3 instances, distinguish between "reserved capacity" and "elastic ceiling": exceeding reserved capacity only incurs elastic overage charges; throttling occurs only when the elastic ceiling is exceeded
CloudMonitor data is delayed by approximately 1 minute; account for this offset when choosing the query time range
For v2 instances, when disk utilization exceeds 80%, Kafka begins dynamic log cleanup, which may cause messages to be deleted before the configured retention period
Applicable only to standard instances listed in official Alibaba Cloud documentation; not applicable to non-standard instances or self-managed Kafka clusters
Calling any write-operation API is prohibited; only read-only queries are permitted
v2 and v3 instances use different CloudMonitor MetricNames; mixing them is strictly prohibited

Reference Links

Document	Description
Knowledge Base Notes and Freshness Advisory	Last-updated timestamp, data staleness risks, and links to official documentation
Kafka v2 Instance Specification and Capacity Policy	v2 instance specification tables, partition limits, connection count / request rate formulas, disk cleanup policy, CloudMonitor metrics, and bottleneck troubleshooting guide
Kafka v3 Instance Specification, Elastic Strategy, and Capacity Policy	v3 instance elastic strategy, elastic ceiling calculation, connection count / request rate formulas, CloudMonitor metrics, and bottleneck troubleshooting guide
CLI Installation Guide	Complete guide for installing and configuring Aliyun CLI
RAM Policies	Full list of RAM permissions required by this skill
Related Commands	All CLI commands used by this skill
Verification Method	Steps for verifying capacity assessment results
Acceptance Criteria	Acceptance criteria and test scenarios for this skill