Huawei Cloud CCE Network Failure Diagnoser

Huawei Cloud CCE Network failure diagnosis skill using Python SDK dispatcher. Use this skill when the user wants to: (1) diagnose CCE network connectivity issues, Service/Ingress failures, (2) analyze ELB configuration, VPC/Subnet issues, (3) diagnose DNS resolution failures, (4) check network policies and security group rules. Trigger: user mentions "network failure", "网络故障", "Service unreachable", "Service 不通", "Ingress 502", "Ingress 504", "ELB error", "ELB 异常", "DNS failure", "DNS 解析失败", "network diagnosis", "网络诊断", "VPC", "subnet", "子网", "安全组", "网络策略"

Install

openclaw skills install huawei-cloud-cce-network-failure-diagnoser

⚠️ Execution Method (Must Read): This skill executes diagnosis via local Python scripts using a dispatcher pattern. Using hcloud, openstack, or other CLI tools or direct API calls is prohibited.

  • The dispatcher script is scripts/huawei-cloud.py, invoked as python3 scripts/huawei-cloud.py <action> <key=value params>
  • All scripts, including the environment check scripts, are inside the skill package. You must use skill action=exec to execute them. Do not run them directly in a shell.
  • For action details and parameters, refer to references/workflow.md, references/risk-rules.md, and references/output-schema.md
  • Do not attempt hcloud, openstack, curl IAM, or any other CLI/API methods. This skill does not depend on those tools.
  • All paths are relative to the skill directory, which is the directory where this SKILL.md is located.

Overview

This skill diagnoses CCE (Cloud Container Engine) network failures by performing a layered, read-only diagnosis across the full network stack — from node infrastructure, DNS, Service/EndpointSlice, NetworkPolicy, Ingress to cloud-side ELB/EIP/NAT/VPC security policies. It produces a complete Markdown diagnosis report that must include the investigation process, evidence, conclusions, confidence levels, and verification criteria.

Use this skill when:

  1. CCE Service connectivity is broken (Service unreachable, intermittent, or flapping)
  2. DNS/CoreDNS resolution failures (NXDOMAIN, timeout)
  3. Ingress 502/504 errors or ELB backend health issues
  4. NetworkPolicy blocking traffic between Pods
  5. VPC/Subnet/Security Group/ACL configuration affecting cluster networking
  6. EIP/NAT gateway affecting external access from the cluster

This skill does NOT handle:

  1. Creating, modifying, or deleting any resources
  2. Binding/unbinding EIP or modifying security groups/ACLs/ELB listeners
  3. Scaling workloads or restarting components
  4. Pod-level or Node-level root causes (cross-reference to huawei-cloud-cce-pod-failure-diagnoser, huawei-cloud-cce-node-failure-diagnoser, huawei-cloud-cce-workload-failure-diagnoser)

Prerequisites

You must run the environment check script first to complete environment validation and dependency installation in one step:

  • Linux / macOS: skill action=exec: bash skill://scripts/check_env.sh
  • Windows: skill action=exec: powershell -ExecutionPolicy Bypass -File skill://scripts/check_env.ps1

Windows note: Do not use && to chain commands (PowerShell 5.x does not support it); use semicolons if you need to change directories first.

The script will check in order: Python >= 3.6 → install dependencies → validate SDK → validate credentials → validate service availability. If the environment check fails, fix the issues before proceeding.

Environment Variables:

| Variable | Required | Description |
|---|---|---|
| HW_ACCESS_KEY | Yes | Huawei Cloud AK (Access Key) |
| HW_SECRET_KEY | Yes | Huawei Cloud SK (Secret Key) |
| HW_REGION_NAME | No | Default cn-north-4 |
| HW_PROJECT_ID | No | Project ID (automatically obtained via IAM API when not set) |
| HW_SECURITY_TOKEN | No | Required when using temporary AK/SK |
| HW_CCE_CLUSTER_ID | Yes | CCE cluster ID for diagnosis target |
| KUBECONFIG | No | Kubernetes config; auto-obtained from CCE API if not set |
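
The pre-flight check on these variables can be sketched in Python as follows. This is a minimal illustration, not the logic of check_env.sh: the helper names missing_required and effective_region are hypothetical, and only the variable names and the default region come from the list above. The sketch reports variable names only, never their values.

```python
import os
from typing import List, Mapping

# Variable names come from the list above; the helpers are hypothetical.
REQUIRED_VARS = ("HW_ACCESS_KEY", "HW_SECRET_KEY", "HW_CCE_CLUSTER_ID")
DEFAULT_REGION = "cn-north-4"


def missing_required(env: Mapping[str, str]) -> List[str]:
    """Return the names (never the values) of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def effective_region(env: Mapping[str, str]) -> str:
    """Fall back to the documented default region when HW_REGION_NAME is not set."""
    return env.get("HW_REGION_NAME", "").strip() or DEFAULT_REGION


if __name__ == "__main__":
    missing = missing_required(os.environ)
    if missing:
        # Only names are printed, so no credential value can leak here.
        print("Missing required variables: " + ", ".join(missing))
    else:
        print("Required variables are set; region = " + effective_region(os.environ))
```

Returning names rather than values keeps the check compatible with the rule that environment variable values must never be output.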

Security Constraints:

  1. Never persist AK/SK/Token/Certificate to the filesystem
  2. AK/SK exist only in the current call stack and are released when the call ends
  3. Only non-sensitive project IDs may be cached in process memory (never written to disk)
  4. All temporary certificate files must be deleted immediately after use
  5. Never leak AK/SK in logs, responses, or error messages
  6. Never send credentials to any third-party server
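
Constraint 4 (delete temporary certificate files immediately after use) maps naturally onto a context manager. The sketch below is one possible way to do it with the Python standard library; temporary_certificate is a hypothetical helper, not a function from this skill's scripts.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_certificate(pem_bytes: bytes) -> Iterator[str]:
    """Write certificate material to a private temp file and always delete it afterwards."""
    # mkstemp creates the file readable and writable only by the current user.
    fd, path = tempfile.mkstemp(suffix=".pem")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(pem_bytes)
        yield path
    finally:
        # Runs on success and on exceptions, so the file never outlives the call.
        if os.path.exists(path):
            os.remove(path)
```

A caller would wrap the API call in `with temporary_certificate(data) as cert_path:` so that cleanup does not depend on remembering a separate delete step.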

Do not output the values of environment variables.


IAM Permission Requirements

| API Action | Permission | Purpose |
|---|---|---|
| cce:cluster:get | Get cluster | View cluster details |
| cce:cluster:list | List clusters | List CCE clusters |
| cce:node:list | List nodes | List cluster nodes |
| vpc:vpc:list | List VPCs | Query VPC details |
| vpc:subnet:list | List subnets | Query subnet details |
| elb:loadbalancer:list | List ELBs | Query ELB details |
| elb:listener:list | List listeners | Query ELB listeners |
| aom:*:get | Read AOM | Query AOM metrics and alarms |

Permission Failure Handling:

  1. When any command fails due to a permission error, display the list of required permissions
  2. Guide the user to create a custom policy in the IAM console
  3. Pause execution and wait for user confirmation

Core Tools

All actions are invoked via the Python dispatcher script:

python3 scripts/huawei-cloud.py <action> region=<region> cluster_id=<cluster_id> namespace=<namespace> [other_params...]

Execution via skill:

  • Linux / macOS: skill action=exec: skill://.venv/bin/python3 skill://scripts/huawei-cloud.py <action> <params>
  • Windows: skill action=exec: skill://.venv/Scripts/python3.exe skill://scripts/huawei-cloud.py <action> <params>
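
As an illustration of the two execution forms above, the following sketch assembles the exec line for the current platform. The exec_line helper is hypothetical; the interpreter and dispatcher paths are copied from the bullets above, and the sample action and parameter values are placeholders.

```python
import platform
from typing import Dict

DISPATCHER = "skill://scripts/huawei-cloud.py"


def exec_line(action: str, params: Dict[str, str], system: str = "") -> str:
    """Build the 'skill action=exec' line for the given (or detected) platform."""
    system = system or platform.system()
    if system == "Windows":
        interpreter = "skill://.venv/Scripts/python3.exe"
    else:  # Linux and macOS share the same virtualenv layout
        interpreter = "skill://.venv/bin/python3"
    args = " ".join("{}={}".format(key, value) for key, value in params.items())
    line = "skill action=exec: {} {} {} {}".format(interpreter, DISPATCHER, action, args)
    return line.rstrip()  # drop the trailing space when there are no parameters


if __name__ == "__main__":
    print(exec_line(
        "huawei_get_cce_services",
        {"region": "cn-north-4", "cluster_id": "my-cluster-id", "namespace": "default"},
    ))
```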

Primary Diagnosis Action

| Action | Description |
|---|---|
| huawei_network_failure_diagnose | One-shot diagnosis: collects K8s and cloud-side read-only snapshots, returns structured findings + report_markdown |

Kubernetes Evidence Actions

| Action | Description |
|---|---|
| huawei_get_cce_services | List Services in a namespace |
| huawei_get_cce_ingresses | List Ingresses in a namespace |
| huawei_get_cce_pods | List Pods in a namespace |
| huawei_get_kubernetes_nodes | List cluster Nodes |
| huawei_get_cce_events | List cluster Events |
| huawei_get_pod_logs | Retrieve Pod container logs |

Cloud Network Evidence Actions

| Action | Description |
|---|---|
| huawei_get_elb_backend_status | Read ELB pool/member/health monitor/load balancer status |
| huawei_get_elb_metrics | Retrieve ELB monitoring metrics |
| huawei_list_elb | List ELB load balancers |
| huawei_list_elb_listeners | List ELB listeners |
| huawei_list_eip | List EIP addresses |
| huawei_get_eip_metrics | Retrieve EIP monitoring metrics |
| huawei_list_nat | List NAT gateways |
| huawei_get_nat_gateway_metrics | Retrieve NAT gateway metrics |
| huawei_list_security_groups | List VPC security groups |
| huawei_list_vpc_acls | List VPC ACLs |

Legacy Compatibility Actions

| Action | Description |
|---|---|
| huawei_network_diagnose | Legacy comprehensive network diagnosis |
| huawei_network_diagnose_by_alarm | Diagnosis triggered by alarm correlation |
| huawei_network_verify_pod_scheduling | Verify Pod scheduling constraints (read-only) |

Parameter Reference

Required Parameters

| Parameter | Description |
|---|---|
| region | Huawei Cloud region, e.g., cn-north-4 |
| cluster_id | CCE cluster ID |
| namespace | Kubernetes namespace |

Optional Parameters (provide as many as possible for accurate diagnosis)

| Parameter | Description |
|---|---|
| failure_symptom | Symptom description: domain_unresolvable, in_cluster_service_unreachable, service_intermittent, external_access_failed, ingress_502_504 |
| target_kind | Resource type: Pod, Service, Ingress, etc. |
| target_name | Resource name |
| service_name | Target Service name |
| ingress_name | Target Ingress name |
| source_pod | Source Pod name or label |
| destination_pod | Destination Pod name or label |
| domain | Domain name for DNS diagnosis |
| elb_id | ELB load balancer ID |
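
Before invoking the dispatcher it can help to validate parameters locally. The sketch below checks the three required parameters and the failure_symptom value, then emits key=value tokens in the form the dispatcher expects. The to_dispatcher_args helper is hypothetical, and treating the five listed symptoms as an exhaustive set is an assumption.

```python
from typing import Dict, List

REQUIRED = ("region", "cluster_id", "namespace")

# Assumption: the five symptoms documented above are the only accepted values.
SYMPTOMS = {
    "domain_unresolvable",
    "in_cluster_service_unreachable",
    "service_intermittent",
    "external_access_failed",
    "ingress_502_504",
}


def to_dispatcher_args(params: Dict[str, str]) -> List[str]:
    """Validate parameters and return them as key=value tokens."""
    missing = [name for name in REQUIRED if not params.get(name)]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    symptom = params.get("failure_symptom")
    if symptom is not None and symptom not in SYMPTOMS:
        raise ValueError("unknown failure_symptom: " + symptom)
    # Required parameters first, then optional ones in sorted order for stable output.
    optional = sorted(key for key in params if key not in REQUIRED)
    return ["{}={}".format(key, params[key]) for key in list(REQUIRED) + optional]
```

Failing fast on a missing cluster_id or a mistyped symptom avoids spending a full diagnosis run on input that would misdirect the pipeline.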

Output Format

huawei_network_failure_diagnose returns structured JSON with an embedded report_markdown:

{
  "success": true,
  "action": "huawei_network_failure_diagnose",
  "region": "cn-north-4",
  "cluster_id": "cluster-id",
  "namespace": "default",
  "conclusion": "high signal conclusion",
  "confidence": "High",
  "pipeline_pruned": false,
  "findings": [
    {
      "stage": "Stage 3: East-West Routing and Policy Layer",
      "type": "NetworkPolicyBlocked",
      "title": "NetworkPolicy selects target Pod but does not allow source Pod labels or target port",
      "confidence": 1.0,
      "severity": "critical",
      "evidence": [],
      "recommendation": [],
      "prune": false
    }
  ],
  "top_causes": [],
  "snapshot": {
    "inputs": {},
    "nodes": [],
    "pods": [],
    "services": [],
    "ingresses": [],
    "endpoint_slices": [],
    "network_policies": [],
    "events": [],
    "logs": {},
    "cloud": {
      "elb_ids": [],
      "elbs": {},
      "eips": {},
      "nat": {},
      "security_groups": {},
      "vpc_acls": {}
    }
  },
  "report_markdown": "# CCE Network Failure Automated Diagnosis Report\n..."
}
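
A consumer of this JSON might rank findings by confidence and keep the strongest few. The following sketch assumes only the fields shown in the example above (success, findings, confidence); top_findings is a hypothetical helper, and the default limit of 3 mirrors the report's cap on root causes.

```python
import json
from typing import Any, Dict, List


def top_findings(result_json: str, limit: int = 3) -> List[Dict[str, Any]]:
    """Return the highest-confidence findings from a diagnosis result."""
    result = json.loads(result_json)
    if not result.get("success"):
        raise RuntimeError("diagnosis did not succeed")
    findings = result.get("findings", [])
    # Findings without a confidence value sort last.
    ranked = sorted(findings, key=lambda item: item.get("confidence", 0.0), reverse=True)
    return ranked[:limit]


if __name__ == "__main__":
    sample = json.dumps({
        "success": True,
        "findings": [
            {"type": "ReadinessFlapping", "confidence": 0.4},
            {"type": "NetworkPolicyBlocked", "confidence": 1.0},
        ],
    })
    for finding in top_findings(sample):
        print(finding["type"], finding["confidence"])
```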

Markdown Report Sections

The report_markdown must contain the following headings:

  1. Diagnosis Overview — target, symptom, conclusion, confidence, collection time, pruned stages
  2. Investigation Process — per-stage status (checked, abnormal, pruned/skipped)
  3. Link Topology — DNS path, east-west path, or north-south path based on failure type
  4. Key Object Snapshot — Service, EndpointSlice, Backend Pods, Ingress, NetworkPolicy, Cloud ELB
  5. Evidence Matrix — stage, type, confidence, evidence summary
  6. Diagnosis Conclusion — top root causes (max 3), each backed by evidence
  7. Recommended Actions and Verification Criteria — read-only verification steps or change suggestions to hand off to huawei-cloud-cce-auto-remediation-runner
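
The heading requirement can be checked mechanically. This sketch assumes the report renders each section as a Markdown heading line starting with #, possibly with a numeric prefix; that format is an assumption, and missing_sections is a hypothetical helper.

```python
from typing import List

REQUIRED_SECTIONS = (
    "Diagnosis Overview",
    "Investigation Process",
    "Link Topology",
    "Key Object Snapshot",
    "Evidence Matrix",
    "Diagnosis Conclusion",
    "Recommended Actions and Verification Criteria",
)


def missing_sections(report_markdown: str) -> List[str]:
    """Return the required section titles that appear in no heading line."""
    headings = [
        line.lstrip("#").strip()
        for line in report_markdown.splitlines()
        if line.startswith("#")
    ]
    # Substring match tolerates prefixes such as "1. Diagnosis Overview".
    return [
        section for section in REQUIRED_SECTIONS
        if not any(section in heading for heading in headings)
    ]
```

An empty return value means every required heading is present; anything else names the sections to investigate.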

Finding Types

Common type values in findings:

| Type | Description |
|---|---|
| NodeUnhealthy | Node Ready=False or Ready=Unknown |
| NodePressure | Memory/Disk/PID/Network pressure on node |
| PodDNSConfigMissing | Pod dnsPolicy=None with no dnsConfig |
| KubeDnsNoEndpoint | kube-dns EndpointSlice has 0 ready endpoints |
| CoreDNSRestarting | CoreDNS pods showing OOMKilled/LivenessProbe failures |
| CoreDNSNxDomain | CoreDNS logs showing NXDOMAIN responses |
| CoreDNSUpstreamTimeout | CoreDNS logs showing upstream i/o timeout |
| NetworkPolicyBlocked | NetworkPolicy blocks source Pod traffic (confidence 100%) |
| ServiceNoReadyEndpoint | Service has 0 ready endpoints in EndpointSlice |
| ServiceSelectorMismatch | Service selector matches no Pods |
| ReadinessFlapping | Backend Pod readiness probe flapping |
| BackendOverloaded | Application logs show OOM/connection pool exhausted |
| LoadBalancerProvisioningFailed | LoadBalancer Ingress status empty with CCM errors |
| ELBBackendUnhealthy | ELB member unhealthy while K8s backend Pod is Ready |
| IngressUpstreamError | Ingress controller logs show 502/504 |

Verification

  1. Run the environment check script to confirm dependencies and credentials are available
  2. Execute huawei_network_failure_diagnose with a known-healthy cluster and verify the report structure
  3. Cross-reference findings with Huawei Cloud console data (ELB health, security groups, VPC ACLs)
  4. Verify that the pipeline_pruned flag is set to true when node-level issues prune upper layers, and false otherwise
  5. Confirm that confidence and severity values are present in all findings
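
Step 5 can be automated with a small check over the parsed result. The sketch below relies only on the findings, type, confidence, and severity fields shown in the output example; findings_missing_fields is a hypothetical helper.

```python
from typing import Any, Dict, List, Sequence, Tuple


def findings_missing_fields(
    result: Dict[str, Any],
    required: Sequence[str] = ("confidence", "severity"),
) -> List[Tuple[int, str, List[str]]]:
    """List (index, finding type, absent fields) for findings lacking a required field."""
    problems = []
    for index, finding in enumerate(result.get("findings", [])):
        absent = [name for name in required if finding.get(name) is None]
        if absent:
            problems.append((index, finding.get("type", "<unknown>"), absent))
    return problems
```

An empty list means every finding carries both values.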

Best Practices

  1. Always provide failure_symptom to direct the diagnosis pipeline to the relevant stage (DNS, east-west, or north-south)
  2. Provide as many optional parameters as possible (service_name, ingress_name, source_pod, destination_pod, domain) for more precise diagnosis
  3. Start with huawei_network_failure_diagnose for one-shot comprehensive diagnosis; use individual actions only for targeted follow-up queries
  4. When evidence is insufficient, state "evidence insufficient" explicitly — never present guesses as conclusions
  5. For north-south (external access) issues, always supplement with huawei_get_elb_backend_status and huawei_list_security_groups to check cloud-side configuration
  6. When node-level issues are found, note that upper-layer diagnosis may be pruned; cross-reference with huawei-cloud-cce-node-failure-diagnoser
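
Practices 1, 3, and 5 can be combined into a small planning step. In the sketch below, the mapping from failure_symptom to diagnosis path is inferred from the symptom names and is an assumption, not something documented by the skill (service_intermittent in particular could belong to more than one path). The plan helper is hypothetical.

```python
from typing import List, Tuple

# Assumed mapping, inferred from the symptom names.
SYMPTOM_TO_PATH = {
    "domain_unresolvable": "DNS",
    "in_cluster_service_unreachable": "east-west",
    "service_intermittent": "east-west",
    "external_access_failed": "north-south",
    "ingress_502_504": "north-south",
}

# Follow-up actions recommended for north-south issues in practice 5.
FOLLOW_UP_ACTIONS = {
    "north-south": ["huawei_get_elb_backend_status", "huawei_list_security_groups"],
}


def plan(symptom: str) -> Tuple[str, List[str]]:
    """Return the assumed diagnosis path and the ordered list of actions to run."""
    path = SYMPTOM_TO_PATH.get(symptom)
    if path is None:
        raise ValueError("unknown failure_symptom: " + symptom)
    # Always start with the one-shot diagnosis, then add targeted follow-ups.
    return path, ["huawei_network_failure_diagnose"] + FOLLOW_UP_ACTIONS.get(path, [])
```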

Reference Documents

  • Diagnosis workflow, reuse priorities, and layered pipeline: references/workflow.md
  • Risk rules and action boundaries: references/risk-rules.md
  • Output schema and finding type reference: references/output-schema.md

Notes

  1. This skill is strictly read-only; it never modifies Service, Ingress, NetworkPolicy, CoreDNS ConfigMap, security groups, ACLs, ELB listeners/backends, EIP bindings, or NAT rules
  2. Never execute kubectl exec, packet capture, stress testing, or active traffic injection unless the user explicitly requests it and acknowledges the risk
  3. huawei_network_verify_pod_scheduling is for verification only; it does not replace scaling actions
  4. Any network change suggestion must describe impact scope, rollback method, and verification criteria, and be handed off to huawei-cloud-cce-auto-remediation-runner for preview
  5. Do not output the values of environment variables such as HW_ACCESS_KEY, HW_SECRET_KEY, HW_SECURITY_TOKEN
  6. All scripts must be executed via skill action=exec; do not run them directly in a shell

Common Pitfalls

  1. Missing cluster_id: The cluster_id parameter is required for all CCE actions. If the user only provides a cluster name, query huawei_list_cce_clusters first to resolve the ID
  2. Wrong failure_symptom: Using a wrong symptom category (e.g., ingress_502_504 for an in-cluster issue) may misdirect the pipeline. Always confirm the symptom type with the user
  3. Ignoring node-level root cause: If nodes are NotReady, upper-layer diagnosis may be pruned. Do not skip the node-layer check even when the symptom appears to be Service/DNS-level
  4. Confusing K8s-side and cloud-side: ELB backend unhealthy does not always mean the K8s Pod is unhealthy — check both huawei_get_elb_backend_status and huawei_get_cce_pods together
  5. Over-interpreting insufficient evidence: When EndpointSlice has 0 ready endpoints, it could be selector mismatch, readiness flapping, or Pod crash. Do not jump to conclusions without checking Pod events and logs
  6. Not checking NetworkPolicy for east-west issues: NetworkPolicy blocking has 100% confidence when confirmed, but is easily overlooked. Always check NetworkPolicy in the target namespace