Huawei Cloud CCE Availability Risk Scanner


Availability risk scanning skill for Huawei Cloud CCE. It uses a Python SDK dispatcher to run read-only cluster risk assessments. Use this skill when the user wants to: (1) scan CCE clusters for availability risks, including single replicas, missing PodDisruptionBudgets, unhealthy probes, and unreasonable affinity or nodepool pinning; (2) assess master HA and utilization, node and workload AZ balance, gateway workload distribution, and core addon anti-affinity; (3) detect resource request/limit overcommit and capacity illusions; (4) produce risk-rated reports with remediation plans and YAML suggestions; (5) check control-plane visibility, node AZ distribution, nodepool distribution, and Pod spread.

Trigger: the user mentions "availability risk", "可用性风险", "availability scanner", "可用性扫描", "cluster inspection", "集群巡检", "risk assessment", "风险评估", "single point of failure", "单点故障", "availability gap", "可用性缺口", "PDB missing", "单副本", "AZ imbalance", "AZ 不均衡", "gateway concentration", "网关集中", "resource overcommit", "资源超配", "health probe missing", "探针缺失".

Install

openclaw skills install huawei-cloud-cce-availability-risk-scanner

Huawei Cloud CCE Availability Risk Scanner

⚠️ Execution Method (Must Read): This skill executes queries via the local Python dispatcher script. Using hcloud, openstack, or other CLI tools or direct API calls is prohibited.

  • The dispatcher script is located at scripts/huawei-cloud.py within the skill directory
  • All scripts and environment check scripts are inside the skill package. You must use skill action=exec to execute them. Do not run them directly in a shell.
  • Do not attempt hcloud, openstack, curl IAM, or any other CLI/API methods. This skill does not depend on those tools.
  • All paths are relative to the skill directory, which is the directory where this SKILL.md is located.

Overview

This skill scans Huawei Cloud CCE clusters for availability risks. It performs read-only checks, produces risk-rated reports, and generates remediation plans with YAML suggestions. It does NOT directly modify workloads, PDBs, affinity rules, probes, node pools, or cluster configuration.

Architecture: Python dispatcher (scripts/huawei-cloud.py) → Huawei Cloud Python SDK + Kubernetes client → Nodes, Pods, Deployments, StatefulSets, DaemonSets, PDBs, Services, Ingresses, Events, Metrics → Risk classification → Remediation plan → Reports

Related Skills:

Skill | Purpose
huawei-cloud-cce-pod-failure-diagnoser | Pod-level failure diagnosis (CrashLoopBackOff, OOMKilled, Pending)
huawei-cloud-cce-node-failure-diagnoser | Node-level failure diagnosis (NotReady, pressure)
huawei-cloud-cce-network-failure-diagnoser | Network failure diagnosis (Service, DNS, Ingress, ELB)
huawei-cloud-cce-root-cause-analyzer | Cross-resource root cause correlation
huawei-cloud-cce-auto-remediation-runner | Execute remediation actions (scale, PDB, affinity, probes)
huawei-cloud-cce-cce-workload-manager | Workload lifecycle management (Deployment/StatefulSet operations)

Capabilities:

  1. One-shot availability risk scan with automated inventory collection and risk classification (huawei_scan_cce_availability_risk)
  2. Control-plane visibility and master HA assessment (node count, AZ distribution, CPU/memory metrics)
  3. Node AZ distribution and nodepool distribution analysis
  4. Workload risk detection: single replicas, missing PDBs, Pod AZ/node concentration, missing health probes, hard affinity, anti-affinity gaps, topology spread gaps
  5. Gateway workload identification and distribution assessment (nginx, gateway, ingress, proxy, kong, apisix, traefik)
  6. Core addon anti-affinity and distribution checks (CoreDNS, nginx-ingress, ingress-nginx)
  7. Resource request/limit overcommit detection and cluster capacity illusion identification
  8. Risk-rated reports with severity classification, remediation suggestions, and authorized execution plans

Typical Use Cases:

  • "Scan my CCE cluster for availability risks"
  • "Check if my cluster has single points of failure"
  • "Assess master HA and node AZ distribution"
  • "Find workloads missing PodDisruptionBudgets"
  • "Identify gateway workloads concentrated on a single node or AZ"
  • "Detect resource overcommit and capacity illusions"
  • "Check health probe coverage for my Deployments"
  • "Assess workload affinity and topology spread"
  • "Review core addon (CoreDNS, nginx-ingress) anti-affinity"
  • "Generate an availability risk report with remediation plan"

Prerequisites

1. Python Requirements (MANDATORY)

  • Python >= 3.6 installed
  • Required packages: huaweicloudsdkcore, huaweicloudsdkcce, huaweicloudsdkaom, huaweicloudsdkhss, huaweicloudsdkvpc, huaweicloudsdkecs, huaweicloudsdkces, huaweicloudsdkevs, huaweicloudsdkeip, huaweicloudsdkelb, huaweicloudsdkiam, kubernetes
  • Verify: python3 --version
  • Install packages: pip3 install huaweicloudsdkcore huaweicloudsdkcce huaweicloudsdkaom huaweicloudsdkhss huaweicloudsdkvpc huaweicloudsdkecs huaweicloudsdkces huaweicloudsdkevs huaweicloudsdkeip huaweicloudsdkelb huaweicloudsdkiam kubernetes

2. Credential Configuration

  • Valid Huawei Cloud credentials (AK/SK mode)
  • Security Rules:
    • 🚫 Never expose AK/SK values in code, conversation, or commands
    • 🚫 Never use echo $HUAWEI_AK or echo $HUAWEI_SK to check credentials
    • 🚫 Never write credentials to files, logs, or responses
    • ✅ Use environment variables: HUAWEI_AK, HUAWEI_SK, HUAWEI_REGION
    • ✅ Credentials exist only in the current request call stack and are released after each invocation
    • ✅ Prefer IAM users over root account for cloud operations

Configuration Method (Environment Variables Only):

export HUAWEI_AK=<your-ak>
export HUAWEI_SK=<your-sk>
export HUAWEI_REGION=cn-north-4

Additional Variables:

Variable | Required | Description
HUAWEI_AK | Yes | Huawei Cloud Access Key
HUAWEI_SK | Yes | Huawei Cloud Secret Key
HUAWEI_REGION | No | Default region (overrides region param if set)
HUAWEI_PROJECT_ID | No | Project ID (auto-obtained via IAM API when not set)
HUAWEI_SECURITY_TOKEN | No | Required when using temporary AK/SK
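
The security rules above forbid echoing credential values, so a presence check has to report only whether each variable is set. The following is a minimal sketch of such a check, not part of the skill package; the function names `credential_status` and `missing_required` are hypothetical.

```python
import os

# Variable names come from the table above; the helper names are illustrative only.
REQUIRED_VARS = ("HUAWEI_AK", "HUAWEI_SK")
OPTIONAL_VARS = ("HUAWEI_REGION", "HUAWEI_PROJECT_ID", "HUAWEI_SECURITY_TOKEN")


def credential_status(env=None):
    """Map each variable name to 'set' or 'missing' without returning any value."""
    env = os.environ if env is None else env
    return {
        name: ("set" if env.get(name) else "missing")
        for name in REQUIRED_VARS + OPTIONAL_VARS
    }


def missing_required(env=None):
    """List the required variables that are absent or empty."""
    status = credential_status(env)
    return [name for name in REQUIRED_VARS if status[name] == "missing"]
```

Calling `missing_required()` with no argument inspects the current process environment; an empty list means both AK and SK are present. Only variable names and the words "set" or "missing" ever leave these functions.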

3. IAM Permission Requirements

Permission | Service | Used by (dispatcher actions)
CCE cluster read | CCE | huawei_list_cce_clusters
CCE node read | CCE | huawei_get_kubernetes_nodes, huawei_get_cce_nodes
CCE workload read | CCE | huawei_get_cce_pods, huawei_get_cce_deployments
CCE nodepool read | CCE | huawei_list_cce_nodepools
CCE addon read | CCE | huawei_list_cce_addons
AOM metrics read | AOM | huawei_get_cce_node_metrics, huawei_get_cce_node_metrics_topN, huawei_get_aom_metrics
Kubernetes API read | CCE (kubeconfig) | huawei_get_cce_pods, huawei_get_cce_deployments, huawei_list_cce_statefulsets, huawei_list_cce_daemonsets

Permission Failure Handling:

  1. When any action fails due to permission errors, display the required permission list
  2. Guide the user to create a custom policy in the IAM console
  3. Pause execution and wait for user confirmation that permissions have been granted
  4. Retry the failed action

Core Commands

All actions are invoked via the dispatcher script:

python3 scripts/huawei-cloud.py <action> region=<region> cluster_id=<cluster_id> [key=value ...]

1. Primary Scan Action (One-Call)

The primary scan command that collects all availability risk data in a single call and outputs a risk-rated report.

python3 scripts/huawei-cloud.py huawei_scan_cce_availability_risk \
  region=cn-north-4 cluster_id=<cluster_id> \
  exclude_namespaces=kube-system \
  gateway_keywords=nginx,gateway,ingress,proxy,kong,apisix,traefik \
  metrics_hours=24 \
  output_dir=./output

Returns: risk-rated issues, severity classification, inventory summary, data gaps, remediation suggestions, and optionally availability-risk-summary.json and availability-risk-report.md files.
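
The `key=value` argument convention shown above can be assembled programmatically. The sketch below only builds the argument list and does not run anything, since execution is expected to go through the skill's exec action. The helper name `build_command` is hypothetical, and the handling of list values (joined with commas) is an assumption based on the `gateway_keywords` example.

```python
# Dispatcher path as stated in this document, relative to the skill directory.
DISPATCHER = "scripts/huawei-cloud.py"


def build_command(action, **params):
    """Return the argv list for one dispatcher call.

    Parameters whose value is None are skipped; lists and tuples are
    joined with commas (assumed from the gateway_keywords example).
    """
    argv = ["python3", DISPATCHER, action]
    for key, value in params.items():
        if value is None:
            continue
        if isinstance(value, (list, tuple)):
            value = ",".join(str(item) for item in value)
        argv.append(f"{key}={value}")
    return argv
```

For example, `build_command("huawei_scan_cce_availability_risk", region="cn-north-4", cluster_id="<cluster_id>", metrics_hours=24)` yields the same tokens as the command above, minus the optional parameters that were left out.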

2. Inventory Collection Actions

Action | Required Params | Description
huawei_get_kubernetes_nodes | region, cluster_id | Query v1.Node Ready/conditions/AZ distribution
huawei_get_cce_pods | region, cluster_id | List Pod phase/reason/state/node/AZ
huawei_get_cce_deployments | region, cluster_id | List Deployments with replicas/PDB/affinity
huawei_get_cce_services | region, cluster_id | List Services for workload correlation
huawei_get_cce_ingresses | region, cluster_id | List Ingresses for gateway identification
huawei_list_cce_nodepools | region, cluster_id | List node pools with AZ distribution
huawei_list_cce_daemonsets | region, cluster_id | List DaemonSets for probe/affinity check
huawei_list_cce_statefulsets | region, cluster_id | List StatefulSets for PDB/single-replica check
huawei_get_cce_node_metrics_topN | region, cluster_id | Top-N node CPU/memory metrics
huawei_get_aom_metrics | region | AOM metric data for master/node trends
huawei_list_cce_clusters | region | List CCE clusters (for cluster selection)
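
To illustrate what a node AZ distribution check involves, here is a minimal sketch that counts Ready nodes per zone. It assumes nodes are available as plain dictionaries in the Kubernetes v1.Node JSON shape and that zones are read from the standard `topology.kubernetes.io/zone` label (with the legacy `failure-domain.beta.kubernetes.io/zone` label as a fallback). The actual output shape of `huawei_get_kubernetes_nodes` is not documented here and may differ.

```python
from collections import Counter

# Well-known Kubernetes zone labels, newest first.
ZONE_LABELS = (
    "topology.kubernetes.io/zone",
    "failure-domain.beta.kubernetes.io/zone",
)


def node_zone(node):
    """Return the node's zone label value, or 'unknown' if none is present."""
    labels = node.get("metadata", {}).get("labels") or {}
    for key in ZONE_LABELS:
        if labels.get(key):
            return labels[key]
    return "unknown"


def is_ready(node):
    """True when the node reports a Ready condition with status 'True'."""
    for cond in node.get("status", {}).get("conditions") or []:
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    return False


def ready_nodes_per_zone(nodes):
    """Count Ready nodes in each zone."""
    return Counter(node_zone(n) for n in nodes if is_ready(n))


def all_ready_nodes_in_one_zone(nodes):
    """True when at least one node is Ready and every Ready node shares a zone."""
    return len(ready_nodes_per_zone(nodes)) == 1
```

A result of `True` from `all_ready_nodes_in_one_zone` corresponds to the single-AZ concentration case that this skill rates as critical.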

3. Supplementary Query Actions

For targeted evidence when the user requests specific information:

# Node AZ distribution detail
python3 scripts/huawei-cloud.py huawei_get_kubernetes_nodes \
  region=cn-north-4 cluster_id=<cluster_id>

# Pod distribution across AZs
python3 scripts/huawei-cloud.py huawei_get_cce_pods \
  region=cn-north-4 cluster_id=<cluster_id> namespace=default

# Deployment detail with PDB and affinity
python3 scripts/huawei-cloud.py huawei_get_cce_deployments \
  region=cn-north-4 cluster_id=<cluster_id> namespace=default

# Node pool AZ distribution
python3 scripts/huawei-cloud.py huawei_list_cce_nodepools \
  region=cn-north-4 cluster_id=<cluster_id>

# Node metrics trend
python3 scripts/huawei-cloud.py huawei_get_cce_node_metrics_topN \
  region=cn-north-4 cluster_id=<cluster_id> top_n=10

Parameter Reference

huawei_scan_cce_availability_risk (Primary Action)

Parameter | Required | Default | Description
region | Yes | - | Huawei Cloud region (e.g., cn-north-4)
cluster_id | Yes | - | CCE cluster ID
exclude_namespaces | No | kube-system | Namespaces excluded from business risk scanning; core addons still checked
gateway_keywords | No | nginx,gateway,ingress,proxy,kong,apisix,traefik | Keywords for identifying gateway-class workloads
metrics_hours | No | 24 | Lookback window for master/node CPU/memory trend metrics
output_dir | No | - | Directory for availability-risk-summary.json and availability-risk-report.md output
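
The `gateway_keywords` parameter suggests keyword-based identification of gateway workloads. The sketch below assumes a case-insensitive substring match on the workload name; the skill's actual matching rule is not documented here, so treat this as one plausible reading. The function names are hypothetical.

```python
# Default value taken from the parameter table above.
DEFAULT_GATEWAY_KEYWORDS = "nginx,gateway,ingress,proxy,kong,apisix,traefik"


def parse_keywords(raw=DEFAULT_GATEWAY_KEYWORDS):
    """Split a comma-separated keyword string, dropping blanks and lowercasing."""
    return [part.strip().lower() for part in raw.split(",") if part.strip()]


def is_gateway_workload(name, keywords):
    """Assumed rule: a workload is gateway-class if its name contains any keyword."""
    lowered = name.lower()
    return any(keyword in lowered for keyword in keywords)
```

Substring matching is deliberately broad: a name such as `kube-proxy` would also match `proxy`, so results are worth reviewing before they feed a report. Custom gateways can be covered by appending their names to the keyword string.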

Common Parameters

Parameter | Required | Description | Default
region | Yes | Huawei Cloud region | -
cluster_id | Yes (most actions) | CCE cluster ID | -
namespace | Context-dependent | Kubernetes namespace | -
top_n | No | Number of top results | 10
metrics_hours | No | Metric lookback hours | 24

Common Region IDs

Region Name | Region ID
North China - Beijing 4 | cn-north-4
North China - Beijing 1 | cn-north-1
East China - Shanghai 1 | cn-east-3
East China - Shanghai 2 | cn-east-2
South China - Guangzhou | cn-south-1
South China - Shenzhen | cn-south-2
Southwest China - Guiyang 1 | cn-southwest-2
Asia Pacific - Bangkok | ap-southeast-2
Asia Pacific - Singapore | ap-southeast-3
Asia Pacific - Hong Kong | ap-southeast-1
Europe - Paris | eu-west-0

Output Format

The primary action huawei_scan_cce_availability_risk returns structured risk data. See Output Schema for the full JSON response schema.

Key Output Fields:

Field | Description
success | Whether the scan completed successfully
scope | Scan scope (region, cluster_id, excluded namespaces, gateway keywords)
inventory | Collected resource counts (nodes, workloads, pods, PDBs, services, ingresses) and AZ distribution
cluster.control_plane | Master HA status, visible node count, zone distribution, metrics
cluster.resources | CPU/memory request/limit allocatable ratios, missing request containers count
issues[] | Risk issues with severity, category, resource, message, recommendation
summary.risk_level | Overall risk level: critical, high, medium, low
summary.issue_count | Total issues with severity breakdown
recommendations | Remediation recommendations list
remediation_plan | Authorized execution plan items
data_gaps | Data gaps when control-plane or metrics are unavailable
files | Optional output file paths (summary JSON, report Markdown, raw inventory)
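
As an illustration of how a caller might consume these fields, the following sketch tallies issues by severity from an already-parsed result. It relies only on the field names listed above (`success`, `issues`, `summary.risk_level`, `data_gaps`); the exact nesting and types are assumptions, and the Output Schema reference remains the authority. The function name `summarize_scan` is hypothetical.

```python
from collections import Counter

# Severity names as listed in this document, most severe first.
SEVERITY_ORDER = ("critical", "high", "medium", "low")


def summarize_scan(result):
    """Condense a parsed scan result (a dict) into a few headline numbers."""
    issues = result.get("issues") or []
    counts = Counter(issue.get("severity", "unknown") for issue in issues)
    by_severity = {sev: counts.get(sev, 0) for sev in SEVERITY_ORDER}
    # Highest severity that actually appears among the issues, if any.
    worst_seen = next((sev for sev in SEVERITY_ORDER if by_severity[sev]), None)
    return {
        "success": bool(result.get("success")),
        "risk_level": (result.get("summary") or {}).get("risk_level"),
        "by_severity": by_severity,
        "worst_seen": worst_seen,
        "data_gap_count": len(result.get("data_gaps") or []),
    }
```

Reporting `data_gap_count` next to the severity tally helps avoid reading a low issue count as a clean bill of health when control-plane or metrics data was unavailable.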

Issue Severity Levels:

Severity | Criteria
critical | Single replica gateway, no master HA, single-AZ concentration of all Ready nodes
high | Multi-replica workload missing PDB, Pod concentration on single node/AZ, missing health probes
medium | Missing resource requests, memory overcommit ratio > 2x, core addon single replica
low | CPU overcommit ratio > 4x (may be intentional burst), minor affinity gaps

Issue Categories:

Category | Description
single-replica | Workload or gateway running with < 2 replicas
pdb | Multi-replica workload missing PodDisruptionBudget
health-check | Workload missing readinessProbe or livenessProbe
affinity | Hard affinity pinning to single AZ/node/nodepool, missing anti-affinity
az-distribution | Nodes or Pods concentrated in a single AZ
gateway | Gateway workload risk (concentration, missing PDB, missing probes)
resources | Missing requests, overcommit, or capacity illusion
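
The `pdb` category can be illustrated with a small matching routine. This is a simplified sketch, not the skill's implementation: it assumes workloads and PDBs have been reduced to flat dictionaries (a hypothetical shape with `namespace`, `name`, `replicas`, `pod_labels`, and `match_labels` keys), it only considers `matchLabels`, and it treats an empty selector as matching nothing.

```python
def selector_matches(match_labels, pod_labels):
    """True when every selector label is present with the same value on the Pod."""
    if not match_labels:
        return False  # simplification: empty selectors are treated as no match
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


def workloads_missing_pdb(workloads, pdbs):
    """Return 'namespace/name' for multi-replica workloads with no matching PDB."""
    missing = []
    for workload in workloads:
        if workload.get("replicas", 0) < 2:
            continue  # fewer than 2 replicas falls under the single-replica category
        covered = any(
            pdb["namespace"] == workload["namespace"]
            and selector_matches(pdb.get("match_labels"), workload.get("pod_labels", {}))
            for pdb in pdbs
        )
        if not covered:
            missing.append(f'{workload["namespace"]}/{workload["name"]}')
    return missing
```

A fuller version would also evaluate `matchExpressions` and handle the empty-selector semantics of the PDB API version in use.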

Verification

See Verification Method for step-by-step verification.

Best Practices

  1. Primary action first: Always call huawei_scan_cce_availability_risk first; use manual inventory queries only if the primary scan fails or the user requests specific detail
  2. Control-plane data gap: When CCE managed control plane does not expose master nodes, mark it as a data gap in the report — do NOT assume master HA
  3. Core addon awareness: Even when kube-system is in exclude_namespaces, CoreDNS, nginx-ingress, and ingress-nginx are still individually identified and checked
  4. Gateway identification: Use gateway_keywords to identify gateway-class workloads; adjust keywords for custom gateway implementations
  5. Remediation authorization: All real remediation (scaling replicas, creating PDB, modifying probes, adjusting affinity, migrating nodes, resizing node pools) requires explicit user authorization before execution
  6. Remediation hand-off: When remediation is needed, hand off to huawei-cloud-cce-auto-remediation-runner with proper safeguards and user confirmation
  7. Read-only boundary: This skill does NOT scale replicas, create PDBs, modify probes, adjust affinity, migrate nodes, or resize node pools — it only generates remediation plans and YAML suggestions
  8. Resource overcommit interpretation: CPU overcommit ratio > 4x is marked as low risk (may be intentional burst); memory overcommit ratio > 2x is marked as medium risk (OOM and bin-packing risk)
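
The overcommit thresholds in item 8 can be written as a small classification rule. The sketch below takes the two ratios as inputs because this document does not pin down exactly how they are computed; one reading would be total limits divided by allocatable capacity, but that is an assumption. The function name `classify_overcommit` is hypothetical.

```python
# Thresholds as stated in this document.
CPU_OVERCOMMIT_LOW = 4.0      # above this, CPU overcommit is rated low
MEMORY_OVERCOMMIT_MEDIUM = 2.0  # above this, memory overcommit is rated medium


def classify_overcommit(cpu_ratio, memory_ratio):
    """Return (resource, severity) findings for ratios strictly above the thresholds."""
    findings = []
    if memory_ratio > MEMORY_OVERCOMMIT_MEDIUM:
        findings.append(("memory", "medium"))
    if cpu_ratio > CPU_OVERCOMMIT_LOW:
        findings.append(("cpu", "low"))
    return findings
```

Memory is rated more severely than CPU because memory pressure tends to end in OOM kills, whereas CPU contention usually results in throttling, which is consistent with the OOM and bin-packing rationale given above.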

Reference Documents

Document | Description
Workflow | Scan workflow, evidence collection steps, and risk classification rules
Risk Rules | Safety constraints, mutation boundaries, and authorization requirements
Output Schema | Complete JSON response format for scan results
Verification Method | Step-by-step verification for skill setup and scan execution
Common Pitfalls | Troubleshooting guides for scan pitfalls

Notes

  • Read-only by design — this skill does NOT modify workloads, PDBs, probes, affinity, node pools, or cluster configuration
  • Remediation hand-off — all mutation suggestions are handed off to huawei-cloud-cce-auto-remediation-runner with requires_confirmation: true
  • Never expose or log AK/SK or environment variable values
  • All actions are executed via python3 scripts/huawei-cloud.py <action>; do not use hcloud CLI or direct API calls
  • Data gaps — when CCE managed control plane does not expose master nodes, the scan marks this as a data gap and recommends verifying in the CCE console/API
  • Gateway keywords — default keywords cover common gateway implementations; custom gateways can be added via gateway_keywords parameter
  • kube-system exclusion — business risk scanning excludes kube-system by default, but core addons (CoreDNS, nginx-ingress, ingress-nginx) are still individually checked for anti-affinity and distribution risks

Common Pitfalls

See Common Pitfalls & Solutions for detailed troubleshooting guides.

Quick Reference:

Pitfall | Symptom | Quick Fix
Assuming master HA | Report concludes "master HA OK" with no visible master nodes | Mark as data gap; recommend CCE console/API verification
Skipping PDB check | Missing PDB for multi-replica gateway not flagged | Include gateway keywords and check PDB for all multi-replica workloads
Ignoring gateway concentration | All gateway Pods on one node/AZ | Use gateway_keywords and check Pod distribution across nodes/AZs
Treating CPU overcommit as critical | CPU limit/request ratio > 4x flagged as critical | Mark as low risk; confirm whether intentional burst design
Missing resource requests | Containers with no CPU/memory requests not flagged | Always check request/limit presence; mark missing requests as medium risk
Excluding core addons | kube-system excluded removes CoreDNS from checks | Core addons are individually identified regardless of namespace exclusion
Wrong cluster_id | API returns 404 or empty results | Verify cluster ID via huawei_list_cce_clusters
Credential permission denied | API returns 403 | Check IAM permissions for CCE node/workload/metrics access
Metrics API unavailable | Node/Pod metrics query fails | Ensure metrics-server addon is installed in cluster