Install
`openclaw skills install huawei-cloud-cce-autoscaling-diagnoser`

Huawei Cloud CCE autoscaling failure diagnosis skill using Python SDK dispatcher. Use this skill when the user wants to: (1) diagnose CCE autoscaling failures across HPA not increasing Pod replicas, CCE elastic engine or Cluster Autoscaler not adding/removing nodes, missing metrics, missing CPU/memory requests, maxReplicas or max_nodes limits, Pending Pods, scheduling constraints, subnet IP exhaustion, ECS quota, or IAM agency permission issues, (2) perform HPA-to-CA cascade diagnosis linking workload-level and node-level scaling failures, (3) analyze CA Pod logs for Cluster Autoscaler signals (NoExpansionOptions, MaxNodeGroupSizeReached, QuotaExceeded, SubnetIPExhausted, IAM denied), (4) generate a complete Markdown diagnosis report with process, evidence, conclusion, confidence, and recommendations. Trigger: user mentions "autoscaling diagnosis", "弹性伸缩诊断", "HPA diagnosis", "HPA 诊断", "scaling failure", "伸缩失败", "HPA not scaling", "HPA 不伸缩", "replica scaling", "副本伸缩", "autoscaling issue", "伸缩问题"

⚠️ Execution Method (Must Read): This skill executes queries via the local Python dispatcher script. Using hcloud, openstack, or other CLI tools or direct API calls is prohibited.
- The dispatcher script is located at `scripts/huawei-cloud.py` within the skill directory.
- All scripts and environment check scripts are inside the skill package. You must use `skill action=exec` to execute them. Do not run them directly in a shell.
- Do not attempt hcloud, openstack, curl IAM, or any other CLI/API methods. This skill does not depend on those tools.
- All paths are relative to the skill directory, which is the directory where this SKILL.md is located.
This skill diagnoses CCE autoscaling link failures across two closed-loop layers: (1) whether HPA increases workload replica count from N to N+1, and (2) whether CCE elastic engine / Cluster Autoscaler increases node count from M to M+1 after resource-insufficient Pending Pods appear. It outputs a complete Markdown diagnosis report with process, evidence, root cause conclusion, confidence, data gaps, and recommendations.
Architecture: Python dispatcher (scripts/huawei-cloud.py) → Huawei Cloud Python SDK + Kubernetes client → HPA/CA/Addon/NodePool/Pod/Events/Metrics → Gateway intent routing → Path A/B/C diagnosis → Structured evidence + Markdown report
Related Skills:
| Skill | Purpose |
|---|---|
| `huawei-cloud-cce-pod-failure-diagnoser` | Pod runtime failure diagnosis (CrashLoopBackOff, OOMKilled, Pending) |
| `huawei-cloud-cce-node-failure-diagnoser` | Node-level failure diagnosis (NotReady, disk/memory pressure) |
| `huawei-cloud-cce-workload-failure-diagnoser` | Workload rollout failure diagnosis |
| `huawei-cloud-cce-auto-remediation-runner` | Execute remediation actions (HPA config, nodepool resize) |
| `huawei-cloud-cce-root-cause-analyzer` | Cross-resource root cause correlation |
| `huawei-cloud-cce-alarm-correlation-engine` | Alarm correlation and diagnosis triggering |
| `huawei-cloud-cce-capacity-trend-forecaster` | Capacity trend and HPA coverage analysis |
| `huawei-cloud-cce-cost-optimization-advisor` | Resource governance and cost optimization |
Capabilities:
- `huawei_autoscaling_diagnose`
- `huawei_list_cce_hpas`
- `huawei_list_cce_addons`, `huawei_list_cce_nodepools`
- `huawei_get_cce_pods`, `huawei_get_cce_events`
- `huawei_get_aom_metrics`

Typical Use Cases:
- `huaweicloudsdkcore`, `huaweicloudsdkcce`, `huaweicloudsdkaom`, `kubernetes`
- `python3 --version`
- `pip3 install huaweicloudsdkcore huaweicloudsdkcce huaweicloudsdkaom kubernetes`
- `echo $HUAWEI_AK` or `echo $HUAWEI_SK` to check credentials
- `HUAWEI_AK`, `HUAWEI_SK`, `HUAWEI_REGION`

Configuration Method (Environment Variables Only):
export HUAWEI_AK=<your-ak>
export HUAWEI_SK=<your-sk>
export HUAWEI_REGION=cn-north-4
Additional Variables:
| Variable | Required | Description |
|---|---|---|
| `HUAWEI_AK` | Yes | Huawei Cloud Access Key |
| `HUAWEI_SK` | Yes | Huawei Cloud Secret Key |
| `HUAWEI_REGION` | No | Default region (overrides region param if set) |
| `HUAWEI_PROJECT_ID` | No | Project ID (auto-obtained via IAM API when not set) |
| `HUAWEI_SECURITY_TOKEN` | No | Required when using temporary AK/SK |
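
A pre-flight check of the variables in the table above might look like the following minimal sketch. The helper names `missing_credentials` and `describe_environment` are hypothetical and not part of the skill package; the sketch reports only whether each variable is set and never prints its value.

```python
import os

# Variable names are taken from the table above.
REQUIRED = ("HUAWEI_AK", "HUAWEI_SK")
OPTIONAL = ("HUAWEI_REGION", "HUAWEI_PROJECT_ID", "HUAWEI_SECURITY_TOKEN")


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


def describe_environment(env=None):
    """Map each known variable to True/False (set or not), without exposing values."""
    env = os.environ if env is None else env
    return {name: bool(env.get(name)) for name in REQUIRED + OPTIONAL}
```

Calling `missing_credentials()` with no argument inspects the real process environment; passing a dict makes the check easy to exercise in isolation.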
| API Action | Service | Purpose |
|---|---|---|
| CCE cluster read | CCE | huawei_list_cce_clusters, huawei_list_cce_nodepools |
| CCE addon read | CCE | huawei_list_cce_addons, huawei_get_cce_addon_detail |
| CCE HPA read | CCE (kubeconfig) | huawei_list_cce_hpas |
| CCE workload read | CCE (kubeconfig) | huawei_get_cce_deployments, huawei_list_cce_statefulsets |
| CCE Pod read | CCE (kubeconfig) | huawei_get_cce_pods |
| CCE Pod logs | CCE (kubeconfig) | huawei_get_pod_logs |
| CCE Events read | CCE (kubeconfig) | huawei_get_cce_events |
| AOM metrics read | AOM | huawei_get_aom_metrics, huawei_get_cce_pod_metrics_topN, huawei_get_cce_node_metrics_topN |
Permission Failure Handling:
All actions are invoked via the dispatcher script:
python3 scripts/huawei-cloud.py <action> region=<region> cluster_id=<cluster_id> [key=value ...]
python3 scripts/huawei-cloud.py huawei_autoscaling_diagnose \
region=cn-north-4 cluster_id=<cluster_id> \
namespace=default workload_name=my-app workload_type=Deployment \
question="Why isn't HPA scaling my workload?"
Returns structured evidence + report_markdown (complete Markdown diagnosis report). When report_markdown is present, use it as the final report body. You may add clarifications the user requests, but do not discard evidence tables.
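
The invocation pattern and the `report_markdown` rule above can be sketched as two small helpers. This is an illustration only: `build_dispatcher_command` and `final_report` are hypothetical names, and the sketch assumes the dispatcher prints a single JSON object, which the source does not state explicitly. Actual execution should still go through `skill action=exec` as required earlier.

```python
import json


def build_dispatcher_command(action, **params):
    """Build the argument list for scripts/huawei-cloud.py from key=value params.

    Parameters whose value is None are omitted.
    """
    argv = ["python3", "scripts/huawei-cloud.py", action]
    argv += [f"{key}={value}" for key, value in params.items() if value is not None]
    return argv


def final_report(raw_output):
    """Return report_markdown when present and non-empty, otherwise None.

    ASSUMPTION: raw_output is a JSON object serialized as text.
    """
    result = json.loads(raw_output)
    return result.get("report_markdown") or None
```

For example, `build_dispatcher_command("huawei_autoscaling_diagnose", region="cn-north-4", cluster_id="<cluster_id>")` yields the same argument order as the command shown above.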
| Action | Required Params | Description |
|---|---|---|
| `huawei_list_cce_hpas` | region, cluster_id | List HPA specs, current/desired replicas, conditions, metrics |
| `huawei_list_cce_addons` | region, cluster_id | Identify CCE elastic engine, metrics/AOM/Prometheus addons |
| `huawei_get_cce_addon_detail` | region, cluster_id, addon_id | Get addon detail (version, status) |
| `huawei_list_cce_nodepools` | region, cluster_id | List nodepools: autoscaling enable, min/max, current node count |
| `huawei_get_cce_pods` | region, cluster_id | List Pod phase, owner, container state, resources.requests/limits, annotations |
| `huawei_get_cce_deployments` | region, cluster_id | Read Deployment desired/current/ready replicas |
| `huawei_list_cce_statefulsets` | region, cluster_id | Read StatefulSet desired/current/ready replicas |
| `huawei_get_cce_events` | region, cluster_id | Read HPA, Pod, Scheduler events (FailedScheduling, FailedGetResourceMetric) |
| `huawei_get_cce_pod_metrics_topN` | region, cluster_id | Pod resource metric ranking |
| `huawei_get_cce_node_metrics_topN` | region, cluster_id | Node resource metric ranking |
| `huawei_get_aom_metrics` | region, cluster_id | AOM/Prometheus custom metric queries |
# Step 1: Locate CA component Pods in kube-system
python3 scripts/huawei-cloud.py huawei_get_cce_pods \
region=cn-north-4 cluster_id=<cluster_id> namespace=kube-system
# Step 2: Retrieve CA Pod logs (find pods with names containing autoscaler/cce-elastic/elastic-engine)
python3 scripts/huawei-cloud.py huawei_get_pod_logs \
region=cn-north-4 cluster_id=<cluster_id> namespace=kube-system \
pod_name=cce-cluster-autoscaler-abc123 container=autoscaler tail_lines=200
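
Between Step 1 and Step 2, the CA Pods have to be picked out of the `kube-system` listing by name. A minimal sketch of that filter follows, assuming each Pod is represented as a dict with a `name` key; that shape is an assumption for illustration, not the documented output of `huawei_get_cce_pods`.

```python
# Name fragments from the Step 2 comment above.
CA_NAME_HINTS = ("autoscaler", "cce-elastic", "elastic-engine")


def find_ca_pods(pods):
    """Return the Pods whose name contains any CA-related fragment (case-insensitive)."""
    matches = []
    for pod in pods:
        name = str(pod.get("name", "")).lower()
        if any(hint in name for hint in CA_NAME_HINTS):
            matches.append(pod)
    return matches
```

Each returned Pod name can then be passed as `pod_name` to `huawei_get_pod_logs`.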
CA Log Signal Quick Reference:
| Signal | Meaning | Severity |
|---|---|---|
| `No expansion options` | No available expansion options for node pool specs/AZ/subnet | critical |
| `max node group size reached` | Node group reached max_nodes limit | critical |
| `Scale-up: final scale-up plan is empty` | All node groups skipped in expansion plan | critical |
| `Quota exceeded` / `quota limit` | Cloud resource (ECS/EVS/EIP) quota insufficient | critical |
| `subnet ip exhausted` / `no available ip` | VPC subnet available IP exhausted | critical |
| `iam` / `permission denied` / `agency` / `forbidden` | IAM agency or permission abnormality | critical |
| `Failed to refresh` / `cannot connect` | CA cannot connect to cloud API or control plane | high |
| `skipping node group` | CA skipped a node group (reason in log) | high |
| `pod ... is unschedulable` | CA identified an unschedulable Pod | info |
| `ScaleDown: no candidates` | No candidate nodes for scale-down | info |
| `node ... is not suitable for removal` | Node does not meet scale-down conditions | high |
| `not safe to evict` / `safe-to-evict=false` | PDB or annotation protection blocking eviction | high |
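
The quick reference above can be applied mechanically to retrieved log lines. The following is a rough sketch rather than the skill's own matcher: the patterns are case-insensitive regular expressions derived from the table, `iam` is matched as a whole word to limit false positives, and the function names are hypothetical.

```python
import re

# (label, regex, severity) in the same order as the table above.
SIGNALS = [
    ("No expansion options", r"no expansion options", "critical"),
    ("max node group size reached", r"max node group size reached", "critical"),
    ("Scale-up: final scale-up plan is empty", r"final scale-up plan is empty", "critical"),
    ("Quota exceeded / quota limit", r"quota exceeded|quota limit", "critical"),
    ("subnet ip exhausted / no available ip", r"subnet ip exhausted|no available ip", "critical"),
    ("iam / permission denied / agency / forbidden",
     r"\biam\b|permission denied|agency|forbidden", "critical"),
    ("Failed to refresh / cannot connect", r"failed to refresh|cannot connect", "high"),
    ("skipping node group", r"skipping node group", "high"),
    ("pod ... is unschedulable", r"pod .* is unschedulable", "info"),
    ("ScaleDown: no candidates", r"scaledown: no candidates", "info"),
    ("node ... is not suitable for removal", r"node .* is not suitable for removal", "high"),
    ("not safe to evict / safe-to-evict=false", r"not safe to evict|safe-to-evict=false", "high"),
]
_COMPILED = [(label, re.compile(rx, re.IGNORECASE), sev) for label, rx, sev in SIGNALS]


def classify_line(line):
    """Return (label, severity) for the first matching signal, or None."""
    for label, pattern, severity in _COMPILED:
        if pattern.search(line):
            return label, severity
    return None


def summarize(lines):
    """Count how often each (label, severity) signal appears in the log lines."""
    counts = {}
    for line in lines:
        hit = classify_line(line)
        if hit is not None:
            counts[hit] = counts.get(hit, 0) + 1
    return counts
```

Because only the first match per line is reported, a line carrying several signals is attributed to the one listed earliest in the table.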
`huawei_autoscaling_diagnose` (Primary Action)

| Parameter | Required | Default | Description |
|---|---|---|---|
| `region` | Yes | - | Huawei Cloud region (e.g., cn-north-4) |
| `cluster_id` | Yes | - | CCE cluster ID |
| `namespace` | No | - | Target namespace (narrows scope) |
| `workload_name` | No | - | Target workload name (Deployment/StatefulSet) |
| `workload_type` | No | - | Workload type (Deployment or StatefulSet) |
| `question` | No | - | User's original question (improves intent routing) |
| Parameter | Required | Description | Default |
|---|---|---|---|
| `region` | Yes | Huawei Cloud region | - |
| `cluster_id` | Yes (most actions) | CCE cluster ID | - |
| `namespace` | Action-dependent | Kubernetes namespace | - |
| `workload_name` | Action-dependent | Deployment/StatefulSet name | - |
| `pod_name` | Required for logs | Pod name | - |
| `container` | Required for logs | Container name | - |
| `tail_lines` | No | Log tail lines count | 200 |
| `top_n` | No | Number of top results for metrics | 10 |
| Region Name | Region ID |
|---|---|
| North China - Beijing 4 | cn-north-4 |
| North China - Beijing 1 | cn-north-1 |
| East China - Shanghai 1 | cn-east-3 |
| East China - Shanghai 2 | cn-east-2 |
| South China - Guangzhou | cn-south-1 |
| South China - Shenzhen | cn-south-2 |
| Southwest China - Guiyang 1 | cn-southwest-2 |
| Asia Pacific - Bangkok | ap-southeast-2 |
| Asia Pacific - Singapore | ap-southeast-3 |
| Asia Pacific - Hong Kong | ap-southeast-1 |
| Europe - Paris | eu-west-0 |
The primary action huawei_autoscaling_diagnose returns structured evidence and a Markdown report. See Output Schema for the full JSON response schema.
Key output fields:
| Field | Description |
|---|---|
| `success` | Whether the diagnosis completed successfully |
| `intent.target` | Routing target: WORKLOAD, NODE, or UNKNOWN |
| `intent.scale_direction` | Scale direction: scale_up, scale_down, or unknown |
| `route` | Diagnosis path: A, B, C, or BLOCKED |
| `discovery` | Has_HPA, Has_CA, metric addon detected, nodepool autoscaling enabled |
| `issues` | List of diagnosed issues with code, severity, layer, evidence, recommendation |
| `evidence` | List of evidence items with layer, source, summary |
| `data_gaps` | Data collection failures or unconfirmed items |
| `conclusion` | Root cause conclusion summary |
| `confidence` | Confidence level (High, Medium, Low) |
| `report_markdown` | Complete Markdown diagnosis report (use as final output) |
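
To show how these fields fit together, here is a minimal sketch that produces a one-line headline and orders issues by severity. The field names come from the table above; the severity labels (`critical`, `high`, `info`) are assumed to match those used in the CA log table, and `headline` and `rank_issues` are hypothetical helper names.

```python
# Lower rank sorts first; unknown labels sort last.
SEVERITY_RANK = {"critical": 0, "high": 1, "info": 2}


def rank_issues(result):
    """Return the issues list sorted from most to least severe (stable order)."""
    issues = result.get("issues") or []
    return sorted(
        issues,
        key=lambda issue: SEVERITY_RANK.get(
            str(issue.get("severity", "")).lower(), len(SEVERITY_RANK)
        ),
    )


def headline(result):
    """Summarize route, conclusion, and confidence in one line."""
    if not result.get("success"):
        return "diagnosis did not complete"
    return (
        f"route {result.get('route')}: {result.get('conclusion')} "
        f"(confidence: {result.get('confidence')})"
    )
```

When `report_markdown` is present it remains the final report body; a summary like this is only a convenience for quick triage.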
Required Markdown report sections:
- `# CCE Autoscaling Automated Diagnosis Report`
- `## 1. Diagnosis Overview`: region, cluster, intent, scale direction, route, conclusion, confidence
- `## 2. Capability Discovery & Routing`: Has_HPA, Has_CA, metric link, routing basis
- `## 3. Investigation Process`: Gateway, Path A/B/C actual execution steps
- `## 4. Key Evidence`: HPA status, nodepool/addon, Pending Pod, FailedScheduling evidence
- `## 5. Issues & Root Cause Convergence`: issues ranked by severity with evidence and recommendations
- `## 6. Next-Step Recommendations`: read-only verification and remediation suggestions only
- `## 7. Data Gaps`: collection failures and items that could not be confirmed

See Verification Method for step-by-step verification.
- Run `huawei_autoscaling_diagnose` first; use manual fallback only if the primary action fails
- Check `resources.requests` on Pod containers; missing requests are a common critical root cause
- Hand remediation off to `huawei-cloud-cce-auto-remediation-runner` and require user confirmation

| Document | Description |
|---|---|
| Workflow | Gateway routing, Path A/B/C diagnosis trees, manual fallback tool sequence |
| Output Schema | JSON response schema and required Markdown report sections |
| Capability Map | Reusable tool capabilities, current gaps, and recommended atomic tool additions |
| Risk Rules | Allowed read actions, prohibited write actions, mutation boundary rules |
- `huawei_autoscaling_diagnose` is the primary tool; raw queries are for targeted evidence when the user requests specific information or when the primary tool fails

| Pitfall | Symptom | Quick Fix |
|---|---|---|
| Skipping Gateway routing | Diagnosis enters wrong path or misses capability | Always run intent + capability discovery before Path A/B/C |
| Missing CPU/Memory request | HPA cannot calculate utilization; FailedGetResourceMetric | Check resources.requests on all target Pod containers |
| Ignoring CA Pod logs | CA root cause remains unknown | Prioritize CA Pod log retrieval (kube-system autoscaler Pods) |
| Treating tolerance as failure | HPA not scaling when metrics within ~10% tolerance | Verify current metric ratio vs target threshold and tolerance window |
| Isolated HPA/CA analysis | Missing HPA→CA cascade linkage | Use Path C when both HPA and CA are present and intent is UNKNOWN |
| Wrong cluster_id | API returns 404 or empty results | Verify cluster ID via huawei_list_cce_clusters |
| Credential permission denied | API returns 403 | Check IAM permissions for CCE HPA/Pod/Event/Addon access |
| Not checking maxReplicas | HPA stuck at max replicas with ScalingLimited condition | Compare currentReplicas vs maxReplicas in HPA status |
| Not checking nodepool max_nodes | CA not expanding despite Pending Pods | Check max_nodes vs current node count in nodepool |
| Metrics API unavailable | HPA shows FailedGetResourceMetric | Ensure metrics-server or AOM addon is installed in cluster |
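
The tolerance and maxReplicas pitfalls above are easier to reason about with the standard Kubernetes HPA calculation, `desired = ceil(currentReplicas * currentMetric / targetMetric)`, which is skipped when the ratio is within the tolerance (0.1 by default). The sketch below is a simplified illustration with a hypothetical function name; it ignores stabilization windows, scaling policies, and not-ready Pods, and is not the skill's own logic.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Return (replicas, reason) following the basic HPA formula.

    reason is one of: within_tolerance, scaled, limited_by_max, limited_by_min.
    """
    ratio = current_metric / target_metric
    # Inside the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas, "within_tolerance"
    raw = math.ceil(current_replicas * ratio)
    if raw > max_replicas:
        return max_replicas, "limited_by_max"
    if raw < min_replicas:
        return min_replicas, "limited_by_min"
    return raw, "scaled"
```

With 4 replicas at 54% CPU against a 50% target, the ratio is 1.08, so no scaling is expected and this should not be reported as a failure. At 200% against the same target with `max_replicas=10`, the raw result of 16 is capped at 10, which matches the ScalingLimited symptom in the table.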