Performs capacity assessment on Alibaba Cloud Kafka instances to determine whether throttling is occurring and recommends instance upgrades when capacity is running high.
Triggers when users describe issues such as consumer lag buildup, producer send failures, throughput throttling, or connection anomalies on a Kafka instance.
Trigger phrases: "kafka capacity assessment", "kafka throttling", "kafka consumer lag", "kafka performance bottleneck", "kafka throughput insufficient", "kafka connection limit", "kafka disk full", "kafka anomaly".
Do NOT use for: Kafka instance creation, Topic/partition management and configuration, message send/receive code debugging, or fault diagnosis unrelated to capacity.
Based on the incident information provided by the user, this skill queries CloudMonitor metrics and instance metadata to determine whether a Kafka instance has reached a capacity bottleneck, and provides upgrade recommendations accordingly.
Run aliyun version and verify that the version is 3.3.3 or later. If the CLI is not installed or the version is too low,
run curl -fsSL https://aliyuncli.alicdn.com/setup.sh | bash to install or update it,
or see references/cli-installation-guide.md for installation instructions.
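The version gate above is easy to get wrong with plain string comparison, because "3.10.0" sorts before "3.3.3" as text. A minimal sketch of a numeric comparison follows. It assumes the output of aliyun version contains a dotted major.minor.patch number, and the helper names are hypothetical.

```python
import re

MIN_CLI_VERSION = (3, 3, 3)  # minimum version required by this skill


def parse_version(text: str) -> tuple:
    """Extract the first major.minor.patch triple from text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def cli_is_recent_enough(version_output: str) -> bool:
    """Return True when the reported version is at least MIN_CLI_VERSION."""
    return parse_version(version_output) >= MIN_CLI_VERSION
```

Comparing integer tuples keeps 3.10.0 above 3.3.3, which a lexical comparison would not.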
Pre-check: Aliyun CLI plugin update required
[MUST] run aliyun configure set --auto-plugin-install true to enable automatic plugin installation.
[MUST] run aliyun plugin update to ensure that any existing plugins are always up-to-date.
Authentication
Pre-check: Alibaba Cloud Credentials Required
Security Rules:
NEVER read, echo, or print AK/SK values (e.g., echo $ALIBABA_CLOUD_ACCESS_KEY_ID is FORBIDDEN)
NEVER ask the user to input AK/SK directly in the conversation or command line
NEVER use aliyun configure set with literal credential values
ONLY use aliyun configure list to check credential status
```bash
aliyun configure list
```
Check the output for a valid profile (AK, STS, or OAuth identity).
If no valid profile is listed, stop and ask the user to:
Configure credentials outside of this session (via aliyun configure in a terminal, or environment variables in the shell profile)
Return and re-run after aliyun configure list shows a valid profile
Observability (MUST follow for every aliyun command)
Upon loading this skill, generate a random session ID (32-char lowercase hex string) once for the entire session. Use it as {session-id} below.
Rule: Every aliyun CLI command that calls a cloud API MUST include the --user-agent flag.
Local utility commands (e.g. configure, plugin, version) do not support this flag and should be excluded.
Do not skip, alter the format, or omit --user-agent on any aliyun API command invocation.
Script / Terraform execution: When running Python SDK scripts, Terraform commands, or bash scripts, inject the session ID through an inline environment variable named SKILL_SESSION_ID so the code can read it at runtime.
Scripts and Terraform configs should read SKILL_SESSION_ID from the environment (default to empty string if absent). See references/how-to-implement-by-common-sdk.md for SDK patterns.
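The session ID rules above can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and the only details taken from this document are the 32-character lowercase hex format and the SKILL_SESSION_ID variable with its empty-string default.

```python
import os
import secrets


def new_session_id() -> str:
    """Generate a random session ID: 16 random bytes as 32 lowercase hex chars."""
    return secrets.token_hex(16)


def read_session_id(environ=os.environ) -> str:
    """Read SKILL_SESSION_ID from the environment, defaulting to an empty string."""
    return environ.get("SKILL_SESSION_ID", "")
```

A script launched with the variable set inline would call read_session_id() once at startup and reuse the value for every API request it makes.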
RAM Policy
This skill requires the following RAM permissions:
| API | Permission Action | Description |
| --- | --- | --- |
| GetInstanceList | alikafka:GetInstanceList | Retrieve the Kafka instance list and instance metadata |
[MUST] Permission Failure Handling: When any command or API call fails due to permission errors at any point during execution, follow this process:
Read references/ram-policies.md to get the full list of permissions required by this SKILL
Use ram-permission-diagnose skill to guide the user through requesting the necessary permissions
Pause and wait until the user confirms that the required permissions have been granted
Parameter Confirmation
IMPORTANT: Parameter Confirmation — Before executing any command or API call,
ALL user-customizable parameters (e.g., RegionId, instance names, CIDR blocks,
passwords, domain names, resource specifications, etc.) MUST be confirmed with the
user. Do NOT assume or use default values without explicit user approval.
| Parameter | Required/Optional | Description | Default |
| --- | --- | --- | --- |
| Instance Region | Required | Region ID where the instance resides, e.g. cn-hangzhou | None |
| Instance ID | Required | Kafka instance ID, e.g. alikafka_post-cn-xxxxx | None |
| Incident Time / Time Range | Optional | Minute-level precision is preferred | Last 1 hour from current time |
| Symptoms | Optional | Consumer lag / producer failure / connection anomaly, etc. | None (full assessment if omitted) |
Core Workflow (SOP)
Hard Rules
This skill performs read-only operations only. If a bottleneck is identified, tell the user which capacity limits are being reached and that the instance should be upgraded via the console. Calling any write OpenAPI or executing CLI commands to upgrade the instance on behalf of the user is strictly prohibited.
v2 and v3 instances use different CloudMonitor metrics. Always query according to the specifications below. Using v2 MetricNames on a v3 instance, or vice versa, is strictly prohibited.
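One way to enforce the second hard rule is to validate every MetricName against the instance series before querying. The sketch below is hypothetical: the function name and the "v2"/"v3" series keys are assumptions, and the metric sets contain only the names listed in the metric tables of section 4.3, so they may be incomplete.

```python
# Metric names copied from the v2 and v3 tables in section 4.3 (possibly partial).
V2_METRICS = frozenset({
    "InstanceMessageInputRatioV2",
    "InstanceMessageOutputRatioV2",
    "PartitionInstanceRatioV2",
    "InstanceMaxInternetConnection",
    "instance_reqs_input",
    "instance_reqs_output",
})
V3_METRICS = frozenset({
    "InstanceMessageInputRatioV3",
    "InstanceMessageOutputRatioV3",
    "InstanceMaxConnectionRatioV3",
    "InstanceMaxInternetConnectionRatioV3",
    "InstanceThrottleTimeP99InputV3",
    "InstanceThrottleTimeP99OutputV3",
})
METRICS_BY_SERIES = {"v2": V2_METRICS, "v3": V3_METRICS}


def check_metric(series: str, metric_name: str) -> str:
    """Return metric_name if it belongs to the given series, else raise ValueError."""
    allowed = METRICS_BY_SERIES.get(series.lower())
    if allowed is None:
        raise ValueError(f"unknown instance series: {series!r}")
    if metric_name not in allowed:
        raise ValueError(f"{metric_name} is not a {series} metric")
    return metric_name
```

Failing loudly before the query is issued avoids silently mixing v2 and v3 data in one report.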
Step 1: Collect Incident Context
Gather parameters from the user (see the Parameter Confirmation section above). For any required parameter not yet provided, prompt for it individually.
Key symptoms to watch for:
Message backlog / increasing consumer lag
Producer send failures / reduced send throughput
Reduced consumer consumption rate
Clients unable to connect to the broker
Messages deleted before the committed retention period expires
Instance throughput throttled
Step 2: Identify Instance Series, Edition, and Specification Family
--start-time / --end-time: Time range; the interval must not exceed 31 days; the range is left-open, right-closed. If the user does not specify a time range, default to the last 1 hour (--end-time = current time, --start-time = current time - 1 hour), and note this in the diagnostic report
--dimensions: JSON string specifying the instance ID
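The time-range and dimension rules above can be sketched as two small helpers. This is a hedged illustration: the function names are hypothetical, and the dimension key instanceId and the list-of-objects JSON shape are assumptions that should be checked against the CloudMonitor documentation for the Kafka namespace.

```python
import json
from datetime import datetime, timedelta, timezone

MAX_RANGE = timedelta(days=31)       # the interval must not exceed 31 days
DEFAULT_RANGE = timedelta(hours=1)   # default: last 1 hour


def resolve_time_range(start=None, end=None, now=None):
    """Return (start, end), defaulting to the last hour ending at the current time.

    The query range is left-open, right-closed: (start, end].
    """
    now = now or datetime.now(timezone.utc)
    end = end or now
    start = start or end - DEFAULT_RANGE
    if start >= end:
        raise ValueError("start time must be earlier than end time")
    if end - start > MAX_RANGE:
        raise ValueError("time range must not exceed 31 days")
    return start, end


def build_dimensions(instance_id: str) -> str:
    """Build the JSON string for --dimensions (key name is an assumption)."""
    return json.dumps([{"instanceId": instance_id}])
```

When the default window is applied, the diagnostic report should say so, as required above.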
4.3 Query the Applicable Metrics by Instance Series
Metrics for v2 instances (select based on the inferred bottleneck from Step 3):
| MetricName | Description | Bottleneck Threshold |
| --- | --- | --- |
| InstanceMessageInputRatioV2 | Produce traffic as a percentage of instance specification limit | Approaching 100% indicates a bottleneck |
| InstanceMessageOutputRatioV2 | Consume traffic as a percentage of instance specification limit | Approaching 100% indicates a bottleneck |
| PartitionInstanceRatioV2 | Partition count as a percentage of instance specification limit | |
| | Maximum connections per node (public + private network) | Compare against formula-calculated limit |
| InstanceMaxInternetConnection | Maximum connections per node (public network only) | Compare against formula-calculated limit |
| instance_reqs_input | Produce request rate (requests/sec) | Compare against formula-calculated limit |
| instance_reqs_output | Consume request rate (requests/sec) | Compare against formula-calculated limit |
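For the v2 ratio metrics, "approaching 100%" has to be turned into a concrete cut-off before it can be checked. The sketch below assumes a cut-off of 90%, which is an illustrative value and not one stated in this document. The function name is hypothetical.

```python
def v2_ratio_finding(samples, near_limit=90.0):
    """Summarize a v2 ratio metric (in percent) over the incident window.

    near_limit is an assumed cut-off for "approaching 100%"; tune it as needed.
    """
    if not samples:
        raise ValueError("no datapoints returned for this metric")
    peak = max(samples)
    return {"peak": peak, "bottleneck": peak >= near_limit}
```

Using the peak rather than the average keeps short bursts visible, which matters when throttling happens for only a few minutes of the window.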
Metrics for v3 instances (select based on the inferred bottleneck from Step 3):
| MetricName | Description | Bottleneck Threshold |
| --- | --- | --- |
| InstanceMessageInputRatioV3 | Produce traffic as a percentage of the elastic ceiling | > 100% means the elastic ceiling is exceeded and throttling is occurring; (0%, 100%] means reserved capacity is exceeded, incurring elastic overage charges |
| InstanceMessageOutputRatioV3 | Consume traffic as a percentage of the elastic ceiling | Same as above |
| InstanceMaxConnectionRatioV3 | Connection utilization per node (public + private network) | Approaching 100% indicates a bottleneck |
| InstanceMaxInternetConnectionRatioV3 | Connection utilization per node (public network only) | Approaching 100% indicates a bottleneck |
| InstanceThrottleTimeP99InputV3 | Produce throttle duration | > 0 indicates active throttling |
| InstanceThrottleTimeP99OutputV3 | Consume throttle duration | > 0 indicates active throttling |
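The v3 thresholds above map onto a simple classification. The sketch below uses hypothetical function names and labels. Treating a ratio of 0 or below as "within reserved capacity" is an inference from the (0%, 100%] interval, not something the table states.

```python
def classify_v3_traffic_ratio(ratio_percent: float) -> str:
    """Classify InstanceMessageInputRatioV3 / InstanceMessageOutputRatioV3 values."""
    if ratio_percent > 100:
        return "throttled"          # elastic ceiling exceeded
    if ratio_percent > 0:
        return "elastic_overage"    # reserved capacity exceeded, overage charges apply
    return "within_reserved"        # assumption: no overage reported


def is_throttling(throttle_time_p99: float) -> bool:
    """InstanceThrottleTimeP99InputV3 / OutputV3: any value above 0 means throttling."""
    return throttle_time_p99 > 0
```

Only the "throttled" label and a positive throttle time indicate that the instance is actually limiting traffic.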
Step 5: Analyze Findings and Generate Diagnostic Report
5.1 Analysis
Combine the following inputs for a comprehensive assessment:
Instance metadata (Step 2) → obtain the specification limits for each metric
CloudMonitor data (Step 4) → obtain the actual usage of each metric during the incident window
When investigating a historical incident, prioritize ratio-based metrics (e.g., InstanceMessageInputRatioV2, instance_disk_capacity, InstanceMessageInputRatioV3) or anomaly-based metrics (e.g., InstanceThrottleTimeP99InputV3). Fall back to absolute-value capacity metrics only if the preferred metrics return no data.
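The fallback rule for historical incidents can be sketched as follows. It assumes each query result has already been collected into a dict that maps a MetricName to its list of datapoints. The function name and the absolute-value metric name used in the usage note are hypothetical.

```python
def choose_metric_data(preferred: dict, fallback: dict):
    """Use ratio/anomaly metrics when any returned data, else absolute-value metrics.

    Both arguments map MetricName -> list of datapoints (empty list = no data).
    Returns a (source, data) pair where source is "preferred" or "fallback".
    """
    usable = {name: points for name, points in preferred.items() if points}
    if usable:
        return "preferred", usable
    return "fallback", {name: points for name, points in fallback.items() if points}
```

For example, choose_metric_data({"InstanceMessageInputRatioV3": []}, {"hypothetical_absolute_metric": [1.0]}) selects the fallback data, because the preferred metric returned nothing.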
Prefer ratio/percentage metrics for bottleneck detection to avoid inconsistencies caused by historical data recorded under a different specification tier
For v3 instances, distinguish between "reserved capacity" and "elastic ceiling": exceeding reserved capacity only incurs elastic overage charges; throttling occurs only when the elastic ceiling is exceeded
CloudMonitor data has a propagation delay of approximately 1 minute; account for this offset when querying
For v2 instances, when disk utilization exceeds 80%, Kafka begins dynamic log cleanup, which may cause messages to be deleted before the configured retention period
Applicable only to standard instances listed in official Alibaba Cloud documentation; not applicable to non-standard instances or self-managed Kafka clusters
Calling any write-operation API is prohibited; only read-only queries are permitted
v2 and v3 instances use different CloudMonitor MetricNames; mixing them is strictly prohibited
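Two of the notes above, the CloudMonitor propagation delay and the v2 disk cleanup threshold, lend themselves to small checks. The sketch below is one possible reading: clamping the query end time to one minute before the current time is an interpretation of "account for this offset", and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

PROPAGATION_DELAY = timedelta(minutes=1)  # approximate CloudMonitor delay
DISK_CLEANUP_THRESHOLD = 80.0             # percent, applies to v2 instances


def clamp_end_time(requested_end: datetime, now: datetime) -> datetime:
    """Avoid querying the most recent minute, where data may not have arrived yet."""
    return min(requested_end, now - PROPAGATION_DELAY)


def early_deletion_possible(series: str, disk_utilization_percent: float) -> bool:
    """True when a v2 instance is above the disk level that starts dynamic log cleanup."""
    return series.lower() == "v2" and disk_utilization_percent > DISK_CLEANUP_THRESHOLD
```

A True result from early_deletion_possible is consistent with the symptom of messages being deleted before the retention period expires.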