Huawei Cloud CCE Network Failure Diagnoser

Huawei Cloud CCE network failure diagnosis skill built on a Python SDK dispatcher. Use this skill when the user wants to: (1) diagnose CCE network connectivity issues or Service/Ingress failures, (2) analyze ELB configuration or VPC/subnet issues, (3) diagnose DNS resolution failures, or (4) check network policies and security group rules. Trigger: the user mentions "network failure", "网络故障", "Service unreachable", "Service 不通", "Ingress 502", "Ingress 504", "ELB error", "ELB 异常", "DNS failure", "DNS 解析失败", "network diagnosis", "网络诊断", "VPC", "subnet", "子网", "安全组", "网络策略"

Install

openclaw skills install huawei-cloud-cce-network-failure-diagnoser

⚠️ Execution Method (Must Read): This skill executes diagnosis via local Python scripts using a dispatcher pattern. Using hcloud, openstack, or other CLI tools or direct API calls is prohibited.

The dispatcher script is scripts/huawei-cloud.py, invoked as python3 scripts/huawei-cloud.py <action> <key=value params>

All scripts and environment check scripts are inside the skill package. You must use skill action=exec to execute them. Do not run them directly in a shell.

For action details and parameters, refer to references/workflow.md, references/risk-rules.md, and references/output-schema.md

Do not attempt hcloud, openstack, curl IAM, or any other CLI/API methods. This skill does not depend on those tools.

All paths are relative to the skill directory, which is the directory where this SKILL.md is located.

Overview

This skill diagnoses CCE (Cloud Container Engine) network failures by running a layered, read-only diagnosis across the full network stack: node infrastructure, DNS, Service/EndpointSlice, NetworkPolicy, Ingress, and the cloud-side ELB/EIP/NAT/VPC security policies. It produces a complete Markdown diagnosis report that must include the investigation process, evidence, conclusions, confidence levels, and verification criteria.

Use this skill when:

CCE Service connectivity is broken (Service unreachable, intermittent, or flapping)
DNS/CoreDNS resolution failures (NXDOMAIN, timeout)
Ingress 502/504 errors or ELB backend health issues
NetworkPolicy blocking traffic between Pods
VPC/Subnet/Security Group/ACL configuration affecting cluster networking
EIP/NAT gateway affecting external access from the cluster

This skill does NOT handle:

Creating, modifying, or deleting any resources
Binding/unbinding EIP or modifying security groups/ACLs/ELB listeners
Scaling workloads or restarting components
Pod-level or Node-level root causes (cross-reference to huawei-cloud-cce-pod-failure-diagnoser, huawei-cloud-cce-node-failure-diagnoser, huawei-cloud-cce-workload-failure-diagnoser)

Prerequisites

You must run the environment check script first to complete environment validation and dependency installation in one step:

Linux / macOS: skill action=exec: bash skill://scripts/check_env.sh
Windows: skill action=exec: powershell -ExecutionPolicy Bypass -File skill://scripts/check_env.ps1

Windows note: Do not use && to chain commands (PowerShell 5.x does not support it); use semicolons if you need to change directories first.

The script checks, in order: Python >= 3.6 → install dependencies → validate SDK → validate credentials → validate service availability. If the environment check fails, fix the reported issues before proceeding.

Environment Variables:

| Variable | Required | Description |
| --- | --- | --- |
| HW_ACCESS_KEY | Yes | Huawei Cloud AK (Access Key) |
| HW_SECRET_KEY | Yes | Huawei Cloud SK (Secret Key) |
| HW_REGION_NAME | No | Region name; defaults to cn-north-4 |
| HW_PROJECT_ID | No | Project ID (obtained automatically via the IAM API when not set) |
| HW_SECURITY_TOKEN | No | Security token; required when using temporary AK/SK |
| HW_CCE_CLUSTER_ID | Yes | ID of the CCE cluster to diagnose |
| KUBECONFIG | No | Kubernetes config; obtained automatically from the CCE API when not set |

Security Constraints:

Never persist AK/SK/Token/Certificate to filesystem
AK/SK exist only in the current call stack and are released when the call ends
Only non-sensitive project IDs may be cached in process memory (never written to disk)
All temporary certificate files must be deleted immediately after use
Never leak AK/SK in logs, responses, or error messages
Never send credentials to any third-party server

Do not output the values of environment variables.
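
As an illustration of the rule above, the following minimal Python sketch reports which required variables are missing by name only, so no credential value ever reaches the output. It is not part of the skill package (the bundled check_env scripts perform the real validation); the function name `missing_required_env` is hypothetical, and the variable names are taken from the table above.

```python
from typing import List, Mapping

# Names marked "Yes" in the Environment Variables table.
REQUIRED_ENV_VARS = ("HW_ACCESS_KEY", "HW_SECRET_KEY", "HW_CCE_CLUSTER_ID")


def missing_required_env(env: Mapping[str, str]) -> List[str]:
    """Return the names (never the values) of required variables that are unset or blank."""
    return [name for name in REQUIRED_ENV_VARS if not env.get(name, "").strip()]
```

Passing `os.environ` to `missing_required_env` returns a list that is safe to print, because it contains variable names only.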

IAM Permission Requirements

| API Action | Permission | Purpose |
| --- | --- | --- |
| cce:cluster:get | Get cluster | View cluster details |
| cce:cluster:list | List clusters | List CCE clusters |
| cce:node:list | List nodes | List cluster nodes |
| vpc:vpc:list | List VPCs | Query VPC details |
| vpc:subnet:list | List subnets | Query subnet details |
| elb:loadbalancer:list | List ELBs | Query ELB details |
| elb:listener:list | List listeners | Query ELB listeners |
| aom:*:get | Read AOM | Query AOM metrics and alarms |

Permission Failure Handling:

When any command fails due to a permission error, display the list of required permissions
Guide the user to create a custom policy in the IAM console
Pause execution and wait for user confirmation
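
The first two steps could be rendered as a single message. The sketch below is one possible way to do that, assuming the permission list from the table above; the name `permission_help` and the exact wording of the message are hypothetical and not taken from the skill's scripts.

```python
# Action names and purposes copied from the IAM Permission Requirements table.
REQUIRED_PERMISSIONS = {
    "cce:cluster:get": "View cluster details",
    "cce:cluster:list": "List CCE clusters",
    "cce:node:list": "List cluster nodes",
    "vpc:vpc:list": "Query VPC details",
    "vpc:subnet:list": "Query subnet details",
    "elb:loadbalancer:list": "Query ELB details",
    "elb:listener:list": "Query ELB listeners",
    "aom:*:get": "Query AOM metrics and alarms",
}


def permission_help() -> str:
    """Build the message shown to the user after a permission failure."""
    lines = ["The following IAM permissions are required:"]
    lines += [f"  {action}: {purpose}" for action, purpose in REQUIRED_PERMISSIONS.items()]
    lines.append("Create a custom policy in the IAM console, then confirm to continue.")
    return "\n".join(lines)
```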

Core Tools

All actions are invoked via the Python dispatcher script:

python3 scripts/huawei-cloud.py <action> region=<region> cluster_id=<cluster_id> namespace=<namespace> [other_params...]

Execution via skill:

Linux / macOS: skill action=exec: skill://.venv/bin/python3 skill://scripts/huawei-cloud.py <action> <params>
Windows: skill action=exec: skill://.venv/Scripts/python3.exe skill://scripts/huawei-cloud.py <action> <params>
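
To make the `<action> <key=value params>` convention concrete, here is a small Python sketch that assembles the argument list following the interpreter path. The helper name `build_dispatcher_args` is hypothetical, and how the dispatcher parses or quotes values is not documented here, so treat this as an illustration of the calling convention only.

```python
from typing import Dict, List, Optional


def build_dispatcher_args(action: str, params: Dict[str, Optional[str]]) -> List[str]:
    """Build the arguments for scripts/huawei-cloud.py as: <action> key=value ..."""
    if not action:
        raise ValueError("action must be non-empty")
    args = ["scripts/huawei-cloud.py", action]
    for key, value in params.items():
        if value is None or value == "":
            continue  # leave out optional parameters that were not provided
        args.append(f"{key}={value}")
    return args
```

For example, `build_dispatcher_args("huawei_network_failure_diagnose", {"region": "cn-north-4", "cluster_id": "<cluster_id>", "namespace": "default"})` yields the script path, the action name, and three `key=value` entries in the order given.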

Primary Diagnosis Action

| Action | Description |
| --- | --- |
| `huawei_network_failure_diagnose` | One-shot diagnosis: collects K8s and cloud-side read-only snapshots, returns structured findings + `report_markdown` |

Kubernetes Evidence Actions

| Action | Description |
| --- | --- |
| `huawei_get_cce_services` | List Services in a namespace |
| `huawei_get_cce_ingresses` | List Ingresses in a namespace |
| `huawei_get_cce_pods` | List Pods in a namespace |
| `huawei_get_kubernetes_nodes` | List cluster Nodes |
| `huawei_get_cce_events` | List cluster Events |
| `huawei_get_pod_logs` | Retrieve Pod container logs |

Cloud Network Evidence Actions

| Action | Description |
| --- | --- |
| `huawei_get_elb_backend_status` | Read ELB pool/member/health monitor/load balancer status |
| `huawei_get_elb_metrics` | Retrieve ELB monitoring metrics |
| `huawei_list_elb` | List ELB load balancers |
| `huawei_list_elb_listeners` | List ELB listeners |
| `huawei_list_eip` | List EIP addresses |
| `huawei_get_eip_metrics` | Retrieve EIP monitoring metrics |
| `huawei_list_nat` | List NAT gateways |
| `huawei_get_nat_gateway_metrics` | Retrieve NAT gateway metrics |
| `huawei_list_security_groups` | List VPC security groups |
| `huawei_list_vpc_acls` | List VPC ACLs |

Legacy Compatibility Actions

| Action | Description |
| --- | --- |
| `huawei_network_diagnose` | Legacy comprehensive network diagnosis |
| `huawei_network_diagnose_by_alarm` | Diagnosis triggered by alarm correlation |
| `huawei_network_verify_pod_scheduling` | Verify Pod scheduling constraints (read-only) |

Parameter Reference

Required Parameters

| Parameter | Description |
| --- | --- |
| `region` | Huawei Cloud region, e.g., `cn-north-4` |
| `cluster_id` | CCE cluster ID |
| `namespace` | Kubernetes namespace |

Optional Parameters (provide as many as possible for accurate diagnosis)

| Parameter | Description |
| --- | --- |
| `failure_symptom` | Symptom description: `domain_unresolvable`, `in_cluster_service_unreachable`, `service_intermittent`, `external_access_failed`, `ingress_502_504` |
| `target_kind` | Resource type: Pod, Service, Ingress, etc. |
| `target_name` | Resource name |
| `service_name` | Target Service name |
| `ingress_name` | Target Ingress name |
| `source_pod` | Source Pod name or label |
| `destination_pod` | Destination Pod name or label |
| `domain` | Domain name for DNS diagnosis |
| `elb_id` | ELB load balancer ID |
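
A caller may want to check parameters before invoking the dispatcher. The following Python sketch does so using only the names listed in the two tables above. The helper `validate_params` is hypothetical, and whether the dispatcher itself rejects other `failure_symptom` values is an assumption, not something this document confirms.

```python
from typing import Dict, List

REQUIRED_PARAMS = ("region", "cluster_id", "namespace")

# Symptom values listed for failure_symptom in the optional-parameter table.
FAILURE_SYMPTOMS = {
    "domain_unresolvable",
    "in_cluster_service_unreachable",
    "service_intermittent",
    "external_access_failed",
    "ingress_502_504",
}


def validate_params(params: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the parameters look usable."""
    problems = [
        f"missing required parameter: {name}"
        for name in REQUIRED_PARAMS
        if not params.get(name)
    ]
    symptom = params.get("failure_symptom")
    if symptom and symptom not in FAILURE_SYMPTOMS:
        problems.append(f"unknown failure_symptom: {symptom}")
    return problems
```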

Output Format

huawei_network_failure_diagnose returns structured JSON with an embedded report_markdown:

{
  "success": true,
  "action": "huawei_network_failure_diagnose",
  "region": "cn-north-4",
  "cluster_id": "cluster-id",
  "namespace": "default",
  "conclusion": "high signal conclusion",
  "confidence": "High",
  "pipeline_pruned": false,
  "findings": [
    {
      "stage": "Stage 3: East-West Routing and Policy Layer",
      "type": "NetworkPolicyBlocked",
      "title": "NetworkPolicy selects target Pod but does not allow source Pod labels or target port",
      "confidence": 1.0,
      "severity": "critical",
      "evidence": [],
      "recommendation": [],
      "prune": false
    }
  ],
  "top_causes": [],
  "snapshot": {
    "inputs": {},
    "nodes": [],
    "pods": [],
    "services": [],
    "ingresses": [],
    "endpoint_slices": [],
    "network_policies": [],
    "events": [],
    "logs": {},
    "cloud": {
      "elb_ids": [],
      "elbs": {},
      "eips": {},
      "nat": {},
      "security_groups": {},
      "vpc_acls": {}
    }
  },
  "report_markdown": "# CCE Network Failure Automated Diagnosis Report\n..."
}
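
For follow-up processing, the JSON above can be read with the standard library alone. The sketch below, with the hypothetical helper `summarize_findings`, keeps the highest-confidence findings first. It assumes `findings[].confidence` is numeric, as in the sample, and does not replace the `top_causes` field that the action already returns.

```python
import json
from typing import Any, Dict, List


def summarize_findings(raw_json: str, limit: int = 3) -> List[Dict[str, Any]]:
    """Return up to `limit` findings, highest confidence first, skipping malformed entries."""
    result = json.loads(raw_json)
    if not result.get("success"):
        return []
    findings = [
        f for f in result.get("findings", [])
        if isinstance(f, dict) and isinstance(f.get("confidence"), (int, float))
    ]
    # sorted() is stable, so findings with equal confidence keep their original order.
    ranked = sorted(findings, key=lambda f: f["confidence"], reverse=True)
    return [
        {
            "stage": f.get("stage"),
            "type": f.get("type"),
            "confidence": f["confidence"],
            "severity": f.get("severity"),
        }
        for f in ranked[:limit]
    ]
```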

Markdown Report Sections

The report_markdown must contain the following headings:

Diagnosis Overview — target, symptom, conclusion, confidence, collection time, pruned stages
Investigation Process — per-stage status (checked, abnormal, pruned/skipped)
Link Topology — DNS path, east-west path, or north-south path based on failure type
Key Object Snapshot — Service, EndpointSlice, Backend Pods, Ingress, NetworkPolicy, Cloud ELB
Evidence Matrix — stage, type, confidence, evidence summary
Diagnosis Conclusion — top root causes (max 3), each backed by evidence
Recommended Actions and Verification Criteria — read-only verification steps or change suggestions to hand off to huawei-cloud-cce-auto-remediation-runner
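
A simple check for the required headings might look like the following Python sketch. It assumes the report uses `#`-style Markdown headings, as the sample `report_markdown` value suggests, and that section titles appear in English exactly as listed above. The function name `missing_report_sections` is hypothetical.

```python
from typing import List

REQUIRED_SECTIONS = (
    "Diagnosis Overview",
    "Investigation Process",
    "Link Topology",
    "Key Object Snapshot",
    "Evidence Matrix",
    "Diagnosis Conclusion",
    "Recommended Actions and Verification Criteria",
)


def missing_report_sections(report_markdown: str) -> List[str]:
    """Return required section titles that appear on no Markdown heading line."""
    headings = [
        line.lstrip("#").strip()
        for line in report_markdown.splitlines()
        if line.startswith("#")
    ]
    return [
        title for title in REQUIRED_SECTIONS
        if not any(title in heading for heading in headings)
    ]
```

Substring matching is used so that numbered headings such as `## 1. Diagnosis Overview` would still count.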

Finding Types

Common type values in findings:

| Type | Description |
| --- | --- |
| `NodeUnhealthy` | Node Ready=False or Ready=Unknown |
| `NodePressure` | Memory/Disk/PID/Network pressure on node |
| `PodDNSConfigMissing` | Pod dnsPolicy=None with no dnsConfig |
| `KubeDnsNoEndpoint` | kube-dns EndpointSlice has 0 ready endpoints |
| `CoreDNSRestarting` | CoreDNS pods showing OOMKilled/LivenessProbe failures |
| `CoreDNSNxDomain` | CoreDNS logs showing NXDOMAIN responses |
| `CoreDNSUpstreamTimeout` | CoreDNS logs showing upstream i/o timeout |
| `NetworkPolicyBlocked` | NetworkPolicy blocks source Pod traffic (confidence 100%) |
| `ServiceNoReadyEndpoint` | Service has 0 ready endpoints in EndpointSlice |
| `ServiceSelectorMismatch` | Service selector matches no Pods |
| `ReadinessFlapping` | Backend Pod readiness probe flapping |
| `BackendOverloaded` | Application logs show OOM/connection pool exhausted |
| `LoadBalancerProvisioningFailed` | LoadBalancer Ingress status empty with CCM errors |
| `ELBBackendUnhealthy` | ELB member unhealthy while K8s backend Pod is Ready |
| `IngressUpstreamError` | Ingress controller logs show 502/504 |

Verification

Run the environment check script to confirm that dependencies and credentials are available
Execute huawei_network_failure_diagnose against a known-healthy cluster and verify the report structure
Cross-reference findings with Huawei Cloud console data (ELB health, security groups, VPC ACLs)
Verify that the pipeline_pruned flag is set correctly when node-level issues prune upper layers
Confirm that confidence and severity values are present in all findings
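
The structural checks in this list can be partly automated. The following Python sketch, assuming the result has already been parsed into a dict shaped like the Output Format example, reports missing fields. The helper name `check_result_shape` and its messages are hypothetical.

```python
from typing import Any, Dict, List


def check_result_shape(result: Dict[str, Any]) -> List[str]:
    """Return a list of structural problems found in a parsed diagnosis result."""
    problems = []
    if not isinstance(result.get("pipeline_pruned"), bool):
        problems.append("pipeline_pruned is missing or not a boolean")
    if not result.get("report_markdown"):
        problems.append("report_markdown is missing or empty")
    for index, finding in enumerate(result.get("findings", [])):
        for field in ("confidence", "severity"):
            # A confidence of 0 is still a value; only None or "" counts as absent.
            if finding.get(field) in (None, ""):
                problems.append(f"findings[{index}] lacks {field}")
    return problems
```

An empty return value only means the fields are present. Whether `pipeline_pruned` has the right value for a given cluster state still needs the manual comparison described above.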

Best Practices

Always provide failure_symptom to direct the diagnosis pipeline to the relevant stage (DNS, east-west, or north-south)
Provide as many optional parameters as possible (service_name, ingress_name, source_pod, destination_pod, domain) for more precise diagnosis
Start with huawei_network_failure_diagnose for one-shot comprehensive diagnosis; use individual actions only for targeted follow-up queries
When evidence is insufficient, state "evidence insufficient" explicitly — never present guesses as conclusions
For north-south (external access) issues, always supplement with huawei_get_elb_backend_status and huawei_list_security_groups to check cloud-side configuration
When node-level issues are found, note that upper-layer diagnosis may be pruned; cross-reference with huawei-cloud-cce-node-failure-diagnoser

Reference Documents

Diagnosis workflow, reuse priorities, and layered pipeline: references/workflow.md
Risk rules and action boundaries: references/risk-rules.md
Output schema and finding type reference: references/output-schema.md

Notes

This skill is strictly read-only; it never modifies Service, Ingress, NetworkPolicy, CoreDNS ConfigMap, security groups, ACLs, ELB listeners/backends, EIP bindings, or NAT rules
Never execute kubectl exec, packet capture, stress testing, or active traffic injection unless the user explicitly requests it and acknowledges the risk
huawei_network_verify_pod_scheduling is for verification only; it does not replace scaling actions
Any network change suggestion must describe impact scope, rollback method, and verification criteria, and be handed off to huawei-cloud-cce-auto-remediation-runner for preview
Do not output the values of environment variables such as HW_ACCESS_KEY, HW_SECRET_KEY, HW_SECURITY_TOKEN
All scripts must be executed via skill action=exec; do not run them directly in a shell

Common Pitfalls

Missing cluster_id: The cluster_id parameter is required for all CCE actions. If the user only provides a cluster name, query huawei_list_cce_clusters first to resolve the ID
Wrong failure_symptom: Using a wrong symptom category (e.g., ingress_502_504 for an in-cluster issue) may misdirect the pipeline. Always confirm the symptom type with the user
Ignoring node-level root cause: If nodes are NotReady, upper-layer diagnosis may be pruned. Do not skip the node-layer check even when the symptom appears to be Service/DNS-level
Confusing K8s-side and cloud-side: ELB backend unhealthy does not always mean the K8s Pod is unhealthy — check both huawei_get_elb_backend_status and huawei_get_cce_pods together
Over-interpreting insufficient evidence: When an EndpointSlice has 0 ready endpoints, the cause could be a selector mismatch, readiness flapping, or a Pod crash. Do not jump to conclusions without checking Pod events and logs
Not checking NetworkPolicy for east-west issues: NetworkPolicy blocking has 100% confidence when confirmed, but is easily overlooked. Always check NetworkPolicy in the target namespace