Install
openclaw skills install huawei-cloud-cce-node-failure-diagnoser
Huawei Cloud CCE node failure diagnosis skill using the Python SDK dispatcher. Use this skill when the user wants to: (1) diagnose CCE node NotReady, node resource pressure, or node failure events, (2) analyze node disk/memory/CPU pressure, (3) check node status and conditions, (4) view node metrics and events. Trigger: user mentions "node failure", "节点故障", "NodeNotReady", "节点 NotReady", "node pressure", "节点压力", "node disk pressure", "磁盘压力", "node eviction", "节点驱逐", "节点异常", "节点诊断", "CCE node", "CCE 节点", "节点状态"
⚠️ Execution Method (Must Read): This skill executes queries via the local Python dispatcher script. Using hcloud, openstack, or other CLI tools or direct API calls is prohibited.
- The dispatcher script is located at `scripts/huawei-cloud.py` within the skill directory.
- All scripts and environment check scripts are inside the skill package. You must use `skill action=exec` to execute them. Do not run them directly in a shell.
- Do not attempt hcloud, openstack, curl IAM, or any other CLI/API methods. This skill does not depend on those tools.
- All paths are relative to the skill directory, which is the directory where this SKILL.md is located.
This skill diagnoses CCE/Kubernetes node failures and produces a complete Markdown diagnosis report. It uses the local Python dispatcher (scripts/huawei-cloud.py) to call the Huawei Cloud Python SDK and Kubernetes client APIs, collecting node status, kube-node-lease evidence, events, pod symptoms, and AOM metrics.
The skill covers: node NotReady, disk/memory/CPU pressure, network abnormalities, kubelet/CRI failures, NPD events, and workload impact on the affected node.
| Skill | Purpose |
|---|---|
| `huawei-cloud-cce-pod-failure-diagnoser` | Pod-level failure diagnosis |
| `huawei-cloud-cce-network-failure-diagnoser` | Network failure diagnosis |
| `huawei-cloud-cce-storage-failure-diagnoser` | Storage failure diagnosis |
| `huawei-cloud-cce-auto-remediation-runner` | Execute remediation actions (cordon, drain, reboot) |
| `huawei-cloud-cce-metric-analyzer` | Metric trend analysis |
| `huawei-cloud-cce-observability-context-builder` | Observability context enrichment |
- `huawei_node_failure_diagnose`
- `huawei_get_kubernetes_nodes`
- `huawei_get_cce_events`
- `huawei_get_cce_pods`
- `huawei_get_cce_node_metrics`, `huawei_get_cce_pod_metrics_topN`
- `huawei_node_status_inspection`, `huawei_node_resource_inspection`, `huawei_node_vul_inspection`
- `huawei_list_security_groups`, `huawei_hss_list_hosts`, `huawei_hss_list_host_vuls_all`

The dispatcher script requires Python >= 3.6 and the following packages:
- `huaweicloudsdkcore`
- `huaweicloudsdkcce`
- `huaweicloudsdkaom`
- `huaweicloudsdkhss`
- `huaweicloudsdkvpc`
- `huaweicloudsdkecs`
- `huaweicloudsdkces`
- `huaweicloudsdkevs`
- `huaweicloudsdkeip`
- `huaweicloudsdkelb`
- `huaweicloudsdkiam`
- `kubernetes`

| Variable | Required | Description |
|---|---|---|
| HUAWEI_AK | Yes | Huawei Cloud Access Key |
| HUAWEI_SK | Yes | Huawei Cloud Secret Key |
| HUAWEI_REGION | No | Default region (overrides region param if set) |
| HUAWEI_PROJECT_ID | No | Project ID (auto-obtained via IAM API when not set) |
| HUAWEI_SECURITY_TOKEN | No | Required when using temporary AK/SK |
🚫 Never expose or log AK/SK values. Credentials exist only in the current request call stack and are released after each invocation. Do not write credentials to files, logs, or responses.
✅ Use environment variables HUAWEI_AK / HUAWEI_SK for authentication. The dispatcher reads them automatically.
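The prerequisites above (Python version, packages, credential variables) can be checked before any action is invoked. The following is a minimal sketch, not the skill's own environment check script: the `preflight` function is hypothetical, and it assumes each listed package is importable under the same top-level name. It reports variable names only and never reads out their values.

```python
import importlib.util
import os
import sys

# Package names copied from the requirements list; the assumption is that
# each one is importable under the same top-level module name.
REQUIRED_PACKAGES = [
    "huaweicloudsdkcore", "huaweicloudsdkcce", "huaweicloudsdkaom",
    "huaweicloudsdkhss", "huaweicloudsdkvpc", "huaweicloudsdkecs",
    "huaweicloudsdkces", "huaweicloudsdkevs", "huaweicloudsdkeip",
    "huaweicloudsdkelb", "huaweicloudsdkiam", "kubernetes",
]
REQUIRED_ENV = ["HUAWEI_AK", "HUAWEI_SK"]


def preflight(env=None):
    """Return a list of problems; an empty list means the checks passed.

    Only variable names are reported, never credential values.
    """
    env = os.environ if env is None else env
    problems = []
    if sys.version_info < (3, 6):
        problems.append("Python >= 3.6 is required")
    for name in REQUIRED_PACKAGES:
        if importlib.util.find_spec(name) is None:
            problems.append("missing package: {}".format(name))
    for var in REQUIRED_ENV:
        if not env.get(var):
            problems.append("missing environment variable: {}".format(var))
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(problem)
```

Passing `env` explicitly keeps the check testable without touching the real process environment.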
| Permission | Service | Required For |
|---|---|---|
| CCE cluster/node read | CCE | huawei_list_cce_nodes, huawei_get_cce_nodes, huawei_node_diagnose |
| Kubernetes API read | CCE (kubeconfig) | huawei_get_kubernetes_nodes, huawei_node_failure_diagnose |
| AOM metrics read | AOM | huawei_get_cce_node_metrics, huawei_get_cce_pod_metrics_topN |
| CES alarm read | CES | huawei_get_cce_events |
| HSS host/vul read | HSS | huawei_hss_list_hosts, huawei_hss_list_host_vuls_all |
| VPC/SG read | VPC | huawei_list_security_groups, huawei_list_vpc_acls |
All actions are invoked via the dispatcher script:
python3 scripts/huawei-cloud.py <action> region=<region> cluster_id=<cluster_id> [key=value ...]
python3 scripts/huawei-cloud.py huawei_node_failure_diagnose \
region=cn-north-4 cluster_id=<cluster_id> \
node_name=<node_name> lease_timeout_seconds=40 \
event_limit=500 hours=1 include_metrics=true
Returns structured evidence + report_markdown (complete Markdown diagnosis report).
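The `key=value` calling convention shown above can be assembled programmatically. The sketch below only builds the argument list and prints it; it does not execute anything, since execution is expected to go through `skill action=exec`. The helper name `build_dispatcher_args` is hypothetical, and rendering booleans as `true`/`false` is an assumption based on the `include_metrics=true` example.

```python
def build_dispatcher_args(action, **params):
    """Build the argument list for one dispatcher call (nothing is executed)."""
    if not action:
        raise ValueError("action is required")
    args = ["python3", "scripts/huawei-cloud.py", action]
    for key, value in params.items():
        if value is None:
            continue  # leave out optional parameters that were not set
        if isinstance(value, bool):
            # Assumption: booleans are written as true/false, as in the example.
            value = "true" if value else "false"
        args.append("{}={}".format(key, value))
    return args


argv = build_dispatcher_args(
    "huawei_node_failure_diagnose",
    region="cn-north-4",
    cluster_id="<cluster_id>",
    node_name="<node_name>",
    lease_timeout_seconds=40,
    event_limit=500,
    hours=1,
    include_metrics=True,
)
print(" ".join(argv))
```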
| Action | Required Params | Description |
|---|---|---|
| `huawei_get_kubernetes_nodes` | region, cluster_id | Query v1.Node Ready/conditions status |
| `huawei_get_cce_events` | region, cluster_id | Retrieve Kubernetes events |
| `huawei_get_cce_pods` | region, cluster_id | List Pod phase/reason/lastState |
| `huawei_get_cce_node_metrics` | region, cluster_id, node_ip | Query node CPU/memory/disk metrics |
| `huawei_get_cce_node_metrics_topN` | region, cluster_id | Top-N node metrics |
| `huawei_get_cce_pod_metrics_topN` | region, cluster_id | Top-N pod metrics (supports node_ip filter) |
| Action | Required Params | Description |
|---|---|---|
| `huawei_node_status_inspection` | region, cluster_id | Node status health inspection |
| `huawei_node_resource_inspection` | region, cluster_id | Node resource utilization inspection |
| `huawei_node_vul_inspection` | region, cluster_id | Node vulnerability inspection |
| Action | Required Params | Description |
|---|---|---|
| `huawei_list_security_groups` | region | List VPC security groups |
| `huawei_list_vpc_acls` | region | List VPC network ACLs |
| `huawei_hss_list_hosts` | region | List HSS host security status |
| `huawei_hss_list_host_vuls_all` | region, host_id | List all vulnerabilities for a host |
huawei_node_failure_diagnose

| Parameter | Required | Default | Description |
|---|---|---|---|
| `region` | Yes | - | Huawei Cloud region (e.g., cn-north-4) |
| `cluster_id` | Yes | - | CCE cluster ID |
| `node_name` | No* | - | Target node name (one of node_name or node_ip required) |
| `node_ip` | No* | - | Target node internal IP (one of node_name or node_ip required) |
| `lease_timeout_seconds` | No | 40 | kube-node-lease stale threshold in seconds |
| `event_limit` | No | 500 | Maximum events to retrieve |
| `hours` | No | 1 | Metric lookback window in hours |
| `include_metrics` | No | true | Whether to include AOM metrics |
*At least one of node_name or node_ip must be provided. If both are omitted, the action returns an error.
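The parameter rules above (two required values, one of two node identifiers, documented defaults for the rest) can be expressed as a small validation step. This is a sketch under stated assumptions: `resolve_diagnose_params` is a hypothetical helper, not part of the dispatcher, and rejecting unrecognized keys is a design choice of the sketch rather than documented dispatcher behavior.

```python
# Defaults taken from the parameter table for huawei_node_failure_diagnose.
DEFAULTS = {
    "lease_timeout_seconds": 40,
    "event_limit": 500,
    "hours": 1,
    "include_metrics": True,
}


def resolve_diagnose_params(region, cluster_id, node_name=None, node_ip=None,
                            **overrides):
    """Validate inputs and fill in the documented defaults."""
    if not region or not cluster_id:
        raise ValueError("region and cluster_id are required")
    if not node_name and not node_ip:
        raise ValueError("one of node_name or node_ip is required")
    unknown = sorted(set(overrides) - set(DEFAULTS))
    if unknown:
        # Assumption of this sketch: fail early on misspelled parameter names.
        raise ValueError("unknown parameters: {}".format(unknown))

    params = {"region": region, "cluster_id": cluster_id}
    if node_name:
        params["node_name"] = node_name
    if node_ip:
        params["node_ip"] = node_ip
    params.update(DEFAULTS)
    params.update(overrides)  # caller-supplied values win over defaults
    return params


print(resolve_diagnose_params("cn-north-4", "<cluster_id>",
                              node_name="<node_name>", hours=6))
```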
| Parameter | Required | Description |
|---|---|---|
| `region` | Yes | Huawei Cloud region |
| `cluster_id` | Yes (most actions) | CCE cluster ID |
| `node_ip` | Required for `huawei_get_cce_node_metrics` | Node internal IP |
| `top_n` | No | Number of top results (default 10) |
| `hours` | No | Metric lookback hours (default 1) |
The primary action huawei_node_failure_diagnose returns structured evidence and a Markdown report. See references/output-schema.md for the full JSON response schema.
Key output fields:
| Field | Description |
|---|---|
| `success` | Whether the diagnosis completed successfully |
| `node` | Node name, IP, Ready status, conditions |
| `lease` | kube-node-lease renew time, stale status, delay seconds |
| `liveness` | Control plane liveness case (A/B/C/D) and conclusion |
| `root_category` | Root cause category (ControlPlaneDisconnected, MemoryPressure, DiskPressure, Network, Kubelet, NotReady, Healthy) |
| `confidence` | Confidence level (High/Medium/Low) |
| `evidence` | List of evidence items with category, severity, signal, source, detail |
| `pod_summary` | Pod phase counts and symptomatic pod list |
| `health_items` | Node health check items with status |
| `report_markdown` | Complete Markdown diagnosis report (use as final output) |
When report_markdown is present, use it as the final report body. You may add clarifications the user requests, but do not discard evidence tables.
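The rule for choosing the final report body can be sketched as follows. The sample payload is invented for illustration and uses only top-level field names from the table above; the full response structure is defined in references/output-schema.md and is not reproduced here. The `final_report` helper and its fallback wording are hypothetical.

```python
import json


def final_report(raw_response):
    """Pick the text to show the user from one diagnosis response."""
    result = json.loads(raw_response)
    report = result.get("report_markdown")
    if report:
        # Use the report unchanged so its evidence tables are preserved.
        return report
    # No report available: surface the key fields instead of guessing a cause.
    return ("Diagnosis did not produce a report "
            "(success={}, root_category={}, confidence={})").format(
        result.get("success"),
        result.get("root_category"),
        result.get("confidence"),
    )


# Hypothetical payload for illustration only.
sample = json.dumps({
    "success": True,
    "root_category": "Healthy",
    "confidence": "High",
    "report_markdown": "# Node diagnosis\n\nNo abnormal signals found.",
})
print(final_report(sample))
```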
python3 scripts/huawei-cloud.py huawei_get_kubernetes_nodes region=cn-north-4 cluster_id=<cluster_id>
- Run `huawei_node_failure_diagnose` on a healthy node; expect `root_category=Healthy` and `confidence=High`.
- Confirm that `report_markdown` contains all required sections (see references/output-schema.md).
- Run `huawei_node_failure_diagnose` first; use manual fallback actions only if the primary action fails.
- When `Ready=Unknown` and the lease is stale, conclude "control plane disconnected from node" rather than prematurely attributing the failure to kubelet or network alone.
- When pressure conditions are `Unknown` with the `NodeStatusUnknown` reason, label them "indeterminate"; do not mark them as "normal".

| Document | Description |
|---|---|
| references/workflow.md | Diagnosis triage flow, evidence rules, and fallback workflow |
| references/output-schema.md | Output JSON schema and required Markdown report sections |
| references/risk-rules.md | Risk boundary rules: allowed read actions, prohibited write actions |
| Huawei Cloud Python SDK Documentation | SDK reference |
| Huawei Cloud API Explorer | API interactive explorer |
- Hand off write actions to `huawei-cloud-cce-auto-remediation-runner` and require user confirmation.
- Invoke actions via `python3 scripts/huawei-cloud.py <action>`; do not use hcloud CLI or direct API calls.
- If `huawei_node_failure_diagnose` fails, follow the manual fallback workflow in references/workflow.md.

| Pitfall | Correct Approach |
|---|---|
| Concluding "kubelet failure" when Ready=Unknown + lease stale | Conclude "control plane disconnected from node (network or kubelet/CRI heartbeat interrupted, requires node-side verification)" |
| Marking Unknown pressure conditions as "normal" | Label as "indeterminate: no independent evidence available" |
| Using a single metric spike as root cause | Validate with trend data over time; use hours parameter for lookback |
| Skipping event/pod correlation | Always cross-reference Event signals with Pod symptoms before forming conclusions |
| Executing cordon/drain/reboot directly | This skill does not perform write actions; hand off to huawei-cloud-cce-auto-remediation-runner |
| Ignoring CNI sandbox failures in Pod events | FailedCreatePodSandBox + CNI error patterns are strong network abnormality evidence |
| Not checking kube-node-lease when node is NotReady | Lease staleness is critical evidence for control plane connectivity |
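
Two of the rules in this table can be sketched in code: the lease staleness check and the labeling of pressure conditions. This is an illustration only, not the dispatcher's logic. The function names are hypothetical, the strict `>` comparison against `lease_timeout_seconds` is an assumption, and the liveness cases A/B/C/D are deliberately not modeled because they are defined in references/workflow.md.

```python
from datetime import datetime, timedelta, timezone


def lease_is_stale(renew_time, now, lease_timeout_seconds=40):
    """True when the last lease renewal is older than the threshold."""
    return (now - renew_time).total_seconds() > lease_timeout_seconds


def ready_rule(ready_status, stale):
    """Apply the single rule stated above; None means the rule does not apply."""
    if ready_status == "Unknown" and stale:
        # Do not narrow this down to kubelet or network without node-side checks.
        return "ControlPlaneDisconnected"
    return None


def label_pressure(status):
    """Map a pressure condition status (True/False/Unknown) to a report label."""
    if status == "True":
        return "pressure"
    if status == "False":
        return "normal"
    return "indeterminate"  # Unknown is never reported as normal


now = datetime(2024, 1, 1, 0, 1, 0, tzinfo=timezone.utc)
stale = lease_is_stale(now - timedelta(seconds=90), now)
print(ready_rule("Unknown", stale), label_pressure("Unknown"))
```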