Huawei Cloud CCE Metric Analyzer

Huawei Cloud CCE metric analysis skill that uses a Python SDK dispatcher. Use this skill when the user wants to: (1) query Pod/Node CPU/memory/disk metrics, (2) get resource usage TopN rankings, (3) query ECS/ELB/EIP/NAT cloud resource metrics, (4) aggregate cluster monitoring data with anomaly detection, (5) detect threshold-based resource anomalies. Trigger: the user mentions "metric analysis", "指标分析", "CCE metrics", "CCE 指标", "AOM metrics", "AOM 指标", "resource metrics", "资源指标", "CPU usage", "CPU 使用率", "memory usage", "内存使用率", "performance monitoring", "性能监控", "TopN", "resource ranking", or "资源排名".

Install

openclaw skills install huawei-cloud-cce-metric-analyzer

Overview

Query and analyze metrics for CCE clusters (Pod/Node CPU/memory/disk) and cloud resources (ECS, ELB, EIP, NAT). Supports threshold-based anomaly detection, status classification (critical/warning/normal), and full-cluster monitoring aggregation.

Architecture: python3 scripts/huawei-cloud.py dispatcher → Huawei Cloud Python SDK + AOM Prometheus → Pod/Node metrics, ECS/ELB/EIP/NAT metrics → Threshold classification → Anomaly detection

Related Skills:

huawei-cloud-cce-pod-failure-diagnoser - Pod CrashLoopBackOff, OOMKilled, restart storms
huawei-cloud-cce-node-failure-diagnoser - Node health, resource pressure diagnosis
huawei-cloud-cce-kubernetes-event-analyzer - Warning events, failure patterns
huawei-cloud-cce-capacity-trend-forecaster - Capacity planning and trend forecasting
huawei-cloud-cce-cost-optimization-advisor - Resource cost optimization
huawei-cloud-cce-auto-remediation-runner - Remediation actions (scale, resize, drain)

Capabilities:

Pod CPU/memory TopN ranking and single Pod time-series metrics
Node CPU/memory/disk TopN ranking and single Node time-series metrics
ECS instance CPU/memory/disk/network metrics
ELB connection, bandwidth, QPS metrics
EIP bandwidth, traffic, packet loss metrics
NAT Gateway SNAT connection metrics
Full-cluster monitoring aggregation with anomaly detection (80% threshold)
Threshold-based status classification (critical/warning/normal/unknown)

Typical Use Cases:

"Show Pods with the highest CPU usage in my cluster"
"Get Node memory usage ranking"
"Check ECS instance resource metrics"
"What is the ELB QPS for my load balancer?"
"Show EIP bandwidth usage"
"Aggregate all monitoring data for the cluster"
"Which resources have exceeded critical thresholds?"
"Detect resource anomalies in the last hour"

Prerequisites

1. Python Dependencies

Python 3.8+ with huaweicloudsdkcce, huaweicloudsdkcore, huaweicloudsdkaom, huaweicloudsdkces packages
Run environment check before first use (see Verification section)

2. Credential Configuration

Valid Huawei Cloud credentials (AK/SK mode)
Security Rules:
- 🚫 Never expose AK/SK values in code, conversation, or commands
- 🚫 Never use echo $HUAWEI_AK or echo $HUAWEI_SK to check credentials
- ✅ Use environment variables: HUAWEI_AK, HUAWEI_SK, HUAWEI_REGION
- ✅ Prefer IAM users over root account for cloud operations
- ✅ Enable MFA for sensitive operations

Configuration Method (Environment Variables Only):

export HUAWEI_AK=<your-ak>
export HUAWEI_SK=<your-sk>
export HUAWEI_REGION=cn-north-4

⚠️ Important Security Notes:

Never commit credentials to version control
Use IAM users with minimal required permissions
Enable MFA for sensitive operations
Rotate AK/SK regularly
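
Because echoing the variables is ruled out, one way to confirm the credentials are configured is to test only whether each variable is set. The following is a minimal sketch, not part of the skill; the helper name `missing_credentials` is hypothetical.

```python
import os

# Variable names come from the Credential Configuration section.
REQUIRED_ENV_VARS = ("HUAWEI_AK", "HUAWEI_SK", "HUAWEI_REGION")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty.

    Only variable names are returned; values are never printed or returned.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]
```

Calling `missing_credentials()` with no argument inspects the current process environment; an empty list means all three variables are present.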

3. IAM Permission Requirements

API Action	Permission	Purpose
`cce:cluster:get`	Get cluster	View CCE cluster details
`aom:instance:list`	List AOM instances	Discover AOM Prometheus instance for metrics
`aom:metricsData:get`	Get metrics data	Query Pod/Node CPU/memory/disk metrics
`ces:metricsData:get`	Get CES metrics	Query ECS/ELB/EIP/NAT cloud resource metrics
`ecs:cloudServers:list`	List ECS servers	Correlate ECS instance IDs
`elb:loadbalancers:list`	List ELB instances	Correlate ELB IDs
`vpc:eips:list`	List EIPs	Correlate EIP IDs
`nat:natGateways:list`	List NAT Gateways	Correlate NAT Gateway IDs

Permission Failure Handling:

When any command fails due to IAM permission errors, display the required permission list
Guide the user to create a custom policy in the IAM console and grant authorization
Pause execution and wait for user confirmation that permissions have been granted
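
The handling steps above could be sketched as follows. The permission names are taken from the IAM table, but the grouping of permissions per action and the use of HTTP 403 as the failure signal are assumptions made for illustration; the dispatcher's real error shape is not described here.

```python
# Permission names come from the IAM table above. Which action needs which
# permissions is an assumed grouping for illustration only.
AOM_PERMISSIONS = ["cce:cluster:get", "aom:instance:list", "aom:metricsData:get"]

REQUIRED_PERMISSIONS = {
    "huawei_get_cce_pod_metrics_topN": AOM_PERMISSIONS,
    "huawei_get_cce_pod_metrics": AOM_PERMISSIONS,
    "huawei_get_cce_node_metrics_topN": AOM_PERMISSIONS,
    "huawei_get_cce_node_metrics": AOM_PERMISSIONS,
    "huawei_get_ecs_metrics": ["ces:metricsData:get", "ecs:cloudServers:list"],
    "huawei_get_elb_metrics": ["ces:metricsData:get", "elb:loadbalancers:list"],
    "huawei_get_eip_metrics": ["ces:metricsData:get", "vpc:eips:list"],
    "huawei_get_nat_gateway_metrics": ["ces:metricsData:get", "nat:natGateways:list"],
}


def is_permission_error(status_code):
    """Assumption: a permission failure surfaces as HTTP 403."""
    return status_code == 403


def permission_guidance(action):
    """Build the message shown to the user before pausing execution."""
    permissions = REQUIRED_PERMISSIONS.get(action, [])
    lines = [f"'{action}' failed with a permission error. Required permissions:"]
    lines += [f"  - {name}" for name in permissions]
    lines.append("Create a custom policy in the IAM console, grant it, then confirm to continue.")
    return "\n".join(lines)
```

After showing the message, the caller would stop and wait for the user to confirm before retrying the action.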

Core Commands

All commands use the Python dispatcher script: python3 scripts/huawei-cloud.py <action> <key=value>...
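
When the dispatcher is driven from another Python program, the `<action> <key=value>...` convention can be assembled as an argument list. This is a minimal sketch under that assumption; `build_command` is a hypothetical helper, not part of the skill.

```python
DISPATCHER = "scripts/huawei-cloud.py"


def build_command(action, **params):
    """Build the argv list for one dispatcher call.

    Parameters whose value is None are skipped so optional arguments can be
    passed through unconditionally.
    """
    args = ["python3", DISPATCHER, action]
    for key, value in params.items():
        if value is not None:
            args.append(f"{key}={value}")
    return args


# Example: hand the list to subprocess.run(...) without shell=True, so values
# such as label_selector="app=nginx,version=v1" need no extra quoting.
command = build_command(
    "huawei_get_cce_pod_metrics_topN",
    region="cn-north-4",
    cluster_id="<cluster-id>",
    namespace="default",
    top_n=10,
    hours=1,
)
```

Keeping credentials out of `params` and relying on the environment variables stays consistent with the security rules above.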

1. CCE Pod Metrics

# Pod TopN — cluster-wide CPU/memory ranking
python3 scripts/huawei-cloud.py huawei_get_cce_pod_metrics_topN \
  region=cn-north-4 cluster_id=<cluster-id> \
  namespace=default top_n=10 hours=1

# Pod TopN with label selector
python3 scripts/huawei-cloud.py huawei_get_cce_pod_metrics_topN \
  region=cn-north-4 cluster_id=<cluster-id> \
  namespace=default label_selector="app=nginx,version=v1" top_n=10 hours=1

# Single Pod time-series
python3 scripts/huawei-cloud.py huawei_get_cce_pod_metrics \
  region=cn-north-4 cluster_id=<cluster-id> \
  pod_name=my-app-xxx namespace=default hours=1

2. CCE Node Metrics

# Node TopN — cluster-wide CPU/memory/disk ranking
python3 scripts/huawei-cloud.py huawei_get_cce_node_metrics_topN \
  region=cn-north-4 cluster_id=<cluster-id> \
  top_n=10 hours=1

# Single Node time-series
python3 scripts/huawei-cloud.py huawei_get_cce_node_metrics \
  region=cn-north-4 cluster_id=<cluster-id> \
  node_ip=10.0.0.1 hours=1

3. Cloud Resource Metrics

# ECS instance metrics
python3 scripts/huawei-cloud.py huawei_get_ecs_metrics \
  region=cn-north-4 instance_id=<instance-id>

# ELB metrics
python3 scripts/huawei-cloud.py huawei_get_elb_metrics \
  region=cn-north-4 elb_id=<loadbalancer-id> hours=1

# EIP metrics
python3 scripts/huawei-cloud.py huawei_get_eip_metrics \
  region=cn-north-4 eip_id=<eip-id> hours=1

# NAT Gateway metrics
python3 scripts/huawei-cloud.py huawei_get_nat_gateway_metrics \
  region=cn-north-4 nat_gateway_id=<nat-gateway-id> hours=1

4. Cluster Monitoring Aggregation

# Aggregate all monitoring data with anomaly detection
python3 scripts/huawei-cloud.py huawei_cce_cluster_monitoring_aggregation \
  region=cn-north-4 cluster_id=<cluster-id> \
  start_time="2026-05-30 00:00:00" end_time="2026-05-30 23:59:59" \
  namespace=default top_n=10

This tool aggregates: Pod TopN CPU/memory, Node TopN CPU/memory/disk, ELB metrics (with LoadBalancer service association), NAT Gateway metrics, EIP metrics (bandwidth, packet loss), and anomaly detection using 80% threshold.
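
The 80% anomaly check mentioned above can be pictured as a simple filter over usage percentages. The sketch below assumes a hypothetical item shape (`name` plus optional `cpu`, `memory`, `disk` percentages) and a strict greater-than comparison; the actual aggregation output is defined in the Output Schema document.

```python
ANOMALY_THRESHOLD = 80.0  # percent, per the aggregation description


def detect_anomalies(items, threshold=ANOMALY_THRESHOLD):
    """Return (name, metric, value) tuples for usage values above the threshold."""
    anomalies = []
    for item in items:
        for metric in ("cpu", "memory", "disk"):
            value = item.get(metric)
            # Missing metrics are skipped rather than treated as anomalies.
            if value is not None and value > threshold:
                anomalies.append((item["name"], metric, value))
    return anomalies


sample = [
    {"name": "pod-a", "cpu": 91.5, "memory": 40.0},
    {"name": "node-1", "cpu": 12.0, "memory": 80.0, "disk": 88.2},
]
flagged = detect_anomalies(sample)
```

With this sample, `pod-a` is flagged for CPU and `node-1` for disk; a memory value of exactly 80.0 is not flagged under the strict comparison.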

Parameter Reference

Common Parameters

Parameter	Required/Optional	Description	Default
`region`	Required	Huawei Cloud region	`HUAWEI_REGION`
`cluster_id`	Required	CCE cluster ID	N/A
`namespace`	Recommended	Kubernetes namespace	`default`
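`ak`	Optional	Override AK; avoid where possible, since passing it on the command line conflicts with the Security Rules	`HUAWEI_AK`
`sk`	Optional	Override SK; avoid where possible, since passing it on the command line conflicts with the Security Rules	`HUAWEI_SK`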
`project_id`	Optional	Project ID	Auto from IAM

`huawei_get_cce_pod_metrics_topN` Parameters

Parameter	Required	Description	Default
`namespace`	No	Namespace filter	all
`label_selector`	No	Label selector (e.g. app=web)	N/A
`top_n`	No	Number of top items	10
`hours`	No	Metrics lookback hours	1
`node_ip`	No	Filter Pods on specific node	N/A
`cpu_query`	No	Custom CPU PromQL	Auto
`memory_query`	No	Custom memory PromQL	Auto

`huawei_get_cce_pod_metrics` Parameters

Parameter	Required	Description	Default
`pod_name`	Yes	Target Pod name	N/A
`namespace`	No	Namespace	`default`
`hours`	No	Metrics lookback hours	1

`huawei_get_cce_node_metrics_topN` Parameters

Parameter	Required	Description	Default
`top_n`	No	Number of top items	10
`hours`	No	Metrics lookback hours	1

`huawei_get_cce_node_metrics` Parameters

Parameter	Required	Description	Default
`node_ip`	Yes	Target Node IP	N/A
`hours`	No	Metrics lookback hours	1

`huawei_get_ecs_metrics` Parameters

Parameter	Required	Description	Default
`instance_id`	Yes	ECS instance ID	N/A

`huawei_get_elb_metrics` Parameters

Parameter	Required	Description	Default
`elb_id`	Yes	ELB loadbalancer ID	N/A
`hours`	No	Metrics lookback hours	1

`huawei_get_eip_metrics` Parameters

Parameter	Required	Description	Default
`eip_id`	Yes	EIP ID	N/A
`hours`	No	Metrics lookback hours	1

`huawei_get_nat_gateway_metrics` Parameters

Parameter	Required	Description	Default
`nat_gateway_id`	Yes	NAT Gateway ID	N/A
`hours`	No	Metrics lookback hours	1

`huawei_cce_cluster_monitoring_aggregation` Parameters

Parameter	Required	Description	Default
`start_time`	Yes	Start time (YYYY-MM-DD HH:MM:SS)	N/A
`end_time`	Yes	End time (YYYY-MM-DD HH:MM:SS)	N/A
`namespace`	No	Namespace filter	`default`
`top_n`	No	Number of top items	10
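
Since `start_time` and `end_time` must both be supplied in `YYYY-MM-DD HH:MM:SS` form, a small helper can derive them from a lookback window. This is a sketch only; `time_window` is a hypothetical name, and the time zone the dispatcher expects is not stated here, so the caller has to pick the reference time accordingly.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # matches YYYY-MM-DD HH:MM:SS


def time_window(end, hours):
    """Return (start_time, end_time) strings covering `hours` before `end`."""
    start = end - timedelta(hours=hours)
    return start.strftime(TIME_FORMAT), end.strftime(TIME_FORMAT)


start_time, end_time = time_window(datetime(2026, 5, 30, 23, 59, 59), hours=24)
```

The two strings can then be passed as the `start_time` and `end_time` arguments of `huawei_cce_cluster_monitoring_aggregation`.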

Output Format

See Output Schema for the complete JSON response structure.

Key output fields:

success — boolean, true if query completed
region — Huawei Cloud region
cluster_id / cluster_name — CCE cluster identity
aom_instance_id — AOM Prometheus instance used for metric queries
metrics — Dict with cpu/memory/disk data per resource, including status classification
time_series — Historical data points with timestamp, time, average, min, max
status — Threshold classification: critical (>80% CPU, >85% memory/disk), warning (>50% CPU/memory, >70% disk), normal (below warning), unknown (no data)
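
The status thresholds listed above translate into a small lookup. The following is an illustrative sketch of that classification, not the dispatcher's own code; the function and table names are hypothetical.

```python
# Percent thresholds as listed for the status field.
THRESHOLDS = {
    "cpu": {"critical": 80.0, "warning": 50.0},
    "memory": {"critical": 85.0, "warning": 50.0},
    "disk": {"critical": 85.0, "warning": 70.0},
}


def classify(metric, value):
    """Map a usage percentage to critical/warning/normal, or unknown if no data."""
    if value is None:
        return "unknown"
    limits = THRESHOLDS[metric]
    if value > limits["critical"]:
        return "critical"
    if value > limits["warning"]:
        return "warning"
    return "normal"
```

Comparisons are strict, so a CPU value of exactly 80.0 is classified as warning rather than critical.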

Cloud resource metric fields (ECS/ELB/EIP/NAT):

ECS: cpu_util, mem_util, disk_util, network_incoming/outgoing_bytes_rate, disk_read/write_bytes_rate
ELB: m1_cps, m14_l7_rt, mb_l7_qps, mc-me-mf_l7_http_2xx-5xx
EIP: upstream/downstream_bandwidth, upstream/downstream_bandwidth_usage, upstream/downstream_traffic, packet_loss_rate
NAT: snat_connection, inbound/outbound_bandwidth, snat_connection_ratio

Verification

Run python3 scripts/huawei-cloud.py huawei_get_cce_pod_metrics_topN region=cn-north-4 cluster_id=<cluster-id> namespace=default top_n=5 to verify Pod metric queries
Run python3 scripts/huawei-cloud.py huawei_get_cce_node_metrics_topN region=cn-north-4 cluster_id=<cluster-id> top_n=5 to verify Node metric queries
Run python3 scripts/huawei-cloud.py huawei_get_ecs_metrics region=cn-north-4 instance_id=<instance-id> to verify CES metric connectivity

Best Practices

Start with TopN for cluster-wide overview — use Pod/Node TopN before drilling into individual resources
Time-bound queries — keep hours small (1-4) for recent analysis; cap at 24 hours for historical reviews
Use namespace filtering — always provide namespace to reduce noise in Pod TopN results
Check status classification — focus on critical and warning resources first; normal resources can be skipped
Use aggregation for full-cluster health checks — huawei_cce_cluster_monitoring_aggregation gives a one-shot overview of all resource metrics with anomaly detection
Correlate with events — if metrics show anomalies, check huawei-cloud-cce-kubernetes-event-analyzer for related warning events
Hand off, don't remediate — this skill is read-only; hand off to diagnosis skills for root cause analysis
Sanitize output — do not expose production pod names, node IPs, or cluster IDs in public summaries; use redacted examples
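
For the output sanitization practice, a summary can be passed through a simple redaction step before it is shared. This sketch masks IPv4 addresses with a regular expression and replaces any names the caller lists explicitly; the `redact` helper and the placeholder strings are assumptions, not part of the skill.

```python
import re

# Matches dotted-quad IPv4 addresses such as 10.0.0.1.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact(text, names=()):
    """Mask IPv4 addresses and explicitly listed names in a summary string."""
    redacted = IPV4_PATTERN.sub("<redacted-ip>", text)
    for name in names:
        if name:  # skip empty strings, which would match everywhere
            redacted = redacted.replace(name, "<redacted>")
    return redacted


summary = redact("Node 10.0.0.1 runs my-app-xxx at 91% CPU", names=["my-app-xxx"])
```

Pod names and cluster IDs have no fixed pattern, which is why they are supplied by the caller rather than detected.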

Reference Documents

Document	Description
Workflow	Metric query sequence, Pod/Node workflows, threshold detection, next-step handoff
Risk Rules	Read-only constraints, data redaction rules, time-bounding, threshold caveats
Output Schema	JSON response format for CCE metrics, cloud resource metrics, time-series, status values

Notes

This skill is strictly read-only — it only queries and analyzes metrics; no modifications are made to resources or configurations
Thresholds (CPU >80%, memory >85%, disk >85% for the critical level) are predefined baselines. Suitable thresholds vary by workload SLO, so recommend that users adjust them to their own requirements
AK/SK must never be hardcoded — use environment variables only
The Python dispatcher script (scripts/huawei-cloud.py) is the only execution method — do not use hcloud CLI or direct API calls for metric queries
AOM Prometheus instance is auto-discovered — no need to manually specify aom_instance_id
Cloud resource metrics (ECS/ELB/EIP/NAT) use CES (Cloud Eye Service), not AOM
Do not make automatic scaling or remediation decisions based solely on metric analysis — forward to huawei-cloud-cce-auto-remediation-runner only if explicitly requested and validated

Common Pitfalls

Pitfall	Symptom	Quick Fix
Missing `cluster_id`	Action fails immediately	Provide `cluster_id` from cluster listing
AOM Prometheus instance not found	Metric queries return empty results	Ensure an AOM Prometheus instance has been created for the cluster; check the `aom:instance:list` permission
Large time window without namespace filter	Slow response, too many results	Narrow `hours` to 1-4 and add `namespace` filter
Cloud resource ID not found	ECS/ELB/EIP/NAT query returns error	Verify resource ID exists; check CES IAM permission
Custom PromQL syntax error	`cpu_query` / `memory_query` returns empty	Use default auto-generated PromQL; only customize if familiar with AOM PromQL syntax
Permission denied on CES metrics	Cloud resource metrics fail	Verify `ces:metricsData:get` IAM permission
Aggregation missing time range	Action fails because `start_time` / `end_time` is not provided	Always specify both time boundaries for aggregation queries
Node IP format mismatch	Single Node metrics fail	Use the exact node IP as shown in cluster node listing (e.g. `10.0.0.1`)
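
To catch the node IP pitfall before a query is sent, the value can be checked with Python's standard `ipaddress` module. This is a sketch of a local pre-check only; `normalize_node_ip` is a hypothetical helper, and it cannot confirm that the address actually belongs to a node in the cluster.

```python
import ipaddress


def normalize_node_ip(value):
    """Return the canonical string form of an IP address.

    Raises ValueError for hostnames, truncated addresses, or other malformed
    input, so the problem surfaces before the dispatcher is called.
    """
    return str(ipaddress.ip_address(value.strip()))
```

A hostname such as `node-1` or a truncated value such as `10.0.0` raises `ValueError`, while surrounding whitespace is stripped.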