Huawei Cloud CCE Observability Context Builder

Collect and consolidate Huawei Cloud CCE alarms, metrics, logs, and events into a comprehensive observability context package for diagnosis handoff.

Install

openclaw skills install huawei-cloud-cce-observability-context-builder

⚠️ Execution Method (Must Read): This skill executes actions via local Python scripts using the scripts/huawei-cloud.py dispatcher. Using hcloud, kubectl, or other CLI tools or direct API calls is prohibited.

All actions are dispatched through scripts/huawei-cloud.py with --action <action_name> and --params <json_params>

All scripts, including the environment check scripts, ship inside the skill package. You must run them through skill action=exec; do not run them directly in a shell

For action names and parameters, see the Core Tools section below

Do not attempt hcloud, kubectl, curl IAM, or other CLI/API methods. This skill does not depend on these tools

All paths are relative to the skill directory, which is the directory where this SKILL.md resides
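
The dispatch convention above can be sketched as a small helper that assembles the argument list. This is a minimal illustration, not part of the skill package: `build_dispatch_command` is a hypothetical name, and the check that action names start with `huawei_` is an assumption drawn from the action names listed in this document. The sketch only builds the command; running it is still left to skill action=exec.

```python
import json

DISPATCHER = "scripts/huawei-cloud.py"  # path relative to the skill directory


def build_dispatch_command(action, params):
    """Return the argument list for one dispatcher call (hypothetical helper)."""
    # Assumption: every action name in this document starts with "huawei_".
    if not action.startswith("huawei_"):
        raise ValueError(f"unexpected action name: {action}")
    # --params carries a single JSON object.
    return [
        "python3",
        DISPATCHER,
        "--action",
        action,
        "--params",
        json.dumps(params, sort_keys=True),
    ]


cmd = build_dispatch_command(
    "huawei_list_aom_alarms",
    {"region": "cn-north-4", "cluster_id": "<cluster-id>"},
)
print(cmd)
```

Passing the parameters as one JSON string keeps quoting problems out of the individual values.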

Overview

This skill consolidates scattered fault signals into a structured, diagnosable context package. It first collects the time window, cluster, namespace, workload, Pod, node, and alarm scope, then gathers evidence by type (alarms, events, metrics, logs), merges signals along a timeline, and identifies gaps and the appropriate next diagnostic skill for hand-off. This skill is strictly read-only — it never executes remediation actions.

Architecture: python3 scripts/huawei-cloud.py dispatcher → Huawei Cloud Python SDK + AOM/LTS API → alarms, metrics, logs, events aggregation

Related Skills:

huawei-cloud-cce-pod-failure-diagnoser - Pod CrashLoopBackOff, ImagePullBackOff, OOMKilled diagnosis
huawei-cloud-cce-node-failure-diagnoser - Node health, resource pressure, NPD events diagnosis
huawei-cloud-cce-network-failure-diagnoser - Network connectivity, DNS, ELB diagnosis
huawei-cloud-cce-storage-failure-diagnoser - PVC/PV mount, storage provisioning diagnosis
huawei-cloud-cce-root-cause-analyzer - Cross-domain root cause analysis and reports
huawei-cloud-cce-auto-remediation-runner - Remediation actions (scale, drain, rollback, etc.)
huawei-cloud-cce-alarm-correlation-engine - Alarm deduplication and correlation
huawei-cloud-cce-metric-analyzer - Deep metric trend analysis
huawei-cloud-cce-log-analyzer - Deep log pattern analysis

Capabilities:

Collect active and historical AOM alarms, then deduplicate and group them by severity (huawei_list_aom_alarms, huawei_analyze_aom_alarms)
Retrieve Kubernetes Events grouped by object and reason (huawei_get_cce_events)
Query Pod and Node TopN metrics for resource peaks and anomalies (huawei_get_cce_pod_metrics_topN, huawei_get_cce_node_metrics_topN)
Query AOM and LTS logs for deep log evidence (huawei_query_aom_logs, huawei_get_recent_logs)
Fetch Pod-side container logs (huawei_get_pod_logs)
Get AOM metrics and instance list (huawei_get_aom_metrics, huawei_list_aom_instances)
Generate monitor dashboards from collected data (huawei_generate_monitor_dashboard)
Merge signals along a timeline, mark gaps, and recommend the next diagnostic skill

Typical Use Cases:

"Collect all observability data for cluster xyz in the last hour"
"Build a context package for a Pod crash incident"
"Gather alarms, events, and metrics before diagnosis"
"I see multiple alarms, consolidate them into a diagnosis context"
"Show me the full observability picture: alarms + metrics + logs + events"
"What's happening in namespace prod over the last 30 minutes?"

Prerequisites

1. Python Dependencies

Python 3.8+ with the huaweicloudsdkcce, huaweicloudsdkcore, huaweicloudsdkaom, and huaweicloudsdklts packages
Run environment check before first use (see Verification section)

2. Credential Configuration

Valid Huawei Cloud credentials (AK/SK mode)
Security Rules:
- Never expose AK/SK values in code, conversation, or commands
- Never use echo $HUAWEI_AK or echo $HUAWEI_SK to check credentials
- Use environment variables: HUAWEI_AK, HUAWEI_SK, HUAWEI_REGION
- Prefer IAM users over root account for cloud operations
- Enable MFA for sensitive operations

Configuration Method (Environment Variables Only):

export HUAWEI_AK=<your-ak>
export HUAWEI_SK=<your-sk>
export HUAWEI_REGION=cn-north-4

Important Security Notes:

Never commit credentials to version control
Use IAM users with minimal required permissions
Rotate AK/SK regularly
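
One way to confirm the credentials are configured without echoing them is to check only whether each variable is present. The following is a minimal sketch under that assumption; `missing_credentials` is a hypothetical helper, not something the skill package is known to provide.

```python
import os

REQUIRED_VARS = ("HUAWEI_AK", "HUAWEI_SK", "HUAWEI_REGION")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty.

    Only variable names are returned; values are never printed or logged.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required credential variables are set.")
```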

3. IAM Permission Requirements

API Action	Permission	Purpose
`cce:cluster:get`	Get cluster	View CCE cluster details
`cce:cluster:createCert`	Create certificate	Obtain kubeconfig credentials for Kubernetes API access
`aom:alarm:list`	List alarms	Query AOM active/history alarms
`aom:alarm:analyze`	Analyze alarms	Deduplicate and group alarms
`aom:metricsData:get`	Get metrics data	Query Pod/node CPU/memory metrics
`aom:instance:list`	List AOM instances	Discover the AOM Prometheus instance
`aom:logData:get`	Get log data	Query AOM/LTS log data
`lts:log:list`	List LTS logs	Query LTS log streams
`cce:event:list`	List events	Query Kubernetes Events

Permission Failure Handling:

When any command fails due to IAM permission errors, display the required permission list
Guide the user to create a custom policy in the IAM console and grant authorization
Pause execution and wait for user confirmation that permissions have been granted

Core Tools

All actions are dispatched through scripts/huawei-cloud.py using skill action=exec.

Alarm Collection

Action	Required Parameters	Description
`huawei_list_aom_alarms`	region, cluster_id	Collect active + history AOM alarms for the cluster
`huawei_analyze_aom_alarms`	region, cluster_id	Deduplicate alarms and group by severity level
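
To illustrate what deduplication and severity grouping mean here, the sketch below applies both to a small in-memory list. It does not reproduce the logic of huawei_analyze_aom_alarms, which this document does not describe. The field names `name`, `resource`, and `severity`, and the choice of (name, resource) as the alarm identity, are assumptions made for the example.

```python
from collections import defaultdict


def dedupe_and_group(alarms):
    """Drop repeated alarms and group the remainder by severity."""
    seen = set()
    grouped = defaultdict(list)
    for alarm in alarms:
        key = (alarm["name"], alarm["resource"])  # assumed identity of an alarm
        if key in seen:
            continue  # same alarm on the same resource was already recorded
        seen.add(key)
        grouped[alarm["severity"]].append(alarm)
    return dict(grouped)


# Invented sample data, for illustration only.
sample_alarms = [
    {"name": "PodRestart", "resource": "pod-a", "severity": "Critical"},
    {"name": "PodRestart", "resource": "pod-a", "severity": "Critical"},
    {"name": "HighCPU", "resource": "node-1", "severity": "Major"},
]
groups = dedupe_and_group(sample_alarms)
print({severity: len(items) for severity, items in groups.items()})
```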

Event and Metric Collection

Action	Required Parameters	Description
`huawei_get_cce_events`	region, cluster_id	Retrieve Kubernetes Events grouped by object and reason
`huawei_get_cce_pod_metrics_topN`	region, cluster_id, namespace	TopN Pod metrics (CPU/memory) for anomaly detection
`huawei_get_cce_node_metrics_topN`	region, cluster_id	TopN Node metrics for resource pressure detection
`huawei_get_aom_metrics`	region, cluster_id, namespace	Query AOM metrics for specific resources
`huawei_list_aom_instances`	region	Discover the AOM Prometheus instance for metrics queries

Log Collection

Action	Required Parameters	Description
`huawei_query_aom_logs`	region, cluster_id, namespace	Query AOM structured log data
`huawei_get_recent_logs`	region, cluster_id, namespace	Get recent log entries (LTS)
`huawei_get_pod_logs`	region, cluster_id, pod_name, namespace	Fetch Pod container logs (previous or current)

Visualization

Action	Required Parameters	Description
`huawei_generate_monitor_dashboard`	region, cluster_id	Generate monitoring dashboard from collected data

Parameter Reference

Common Parameters

Parameter	Required	Description	Default
`region`	Yes	Huawei Cloud region	`HUAWEI_REGION`
`cluster_id`	Yes	CCE cluster ID	N/A
`namespace`	Yes*	Kubernetes namespace (required by the Pod metrics, AOM metrics, and log actions)	N/A
`ak`	No	Override AK	`HUAWEI_AK`
`sk`	No	Override SK	`HUAWEI_SK`
`project_id`	No	Project ID	Auto from IAM

Alarm Collection Parameters

Parameter	Required	Description	Default
`alarm_id`	No	Specific alarm ID to query	N/A
`alarm_level`	No	Alarm severity filter	All
`hours`	No	History lookback window (hours)	1

Log Collection Parameters

Parameter	Required	Description	Default
`pod_name`	Yes*	Pod name (for `huawei_get_pod_logs`)	N/A
`container`	No	Container name	First container
`previous`	No	Fetch previous (crashed) logs	`false`
`tail_lines`	No	Number of log tail lines	100

Metric Collection Parameters

Parameter	Required	Description	Default
`top_n`	No	Number of top results	10
`hours`	No	Metrics lookback window (hours)	1

*Required for specific actions as noted.
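
The required parameters listed in the Core Tools tables can be checked before a call is dispatched. The sketch below copies those requirements into a lookup table; `missing_params` is a hypothetical helper. It treats `region` as required in the parameters, although the table above suggests it may default to `HUAWEI_REGION`.

```python
# Required parameters per action, copied from the Core Tools tables.
REQUIRED_PARAMS = {
    "huawei_list_aom_alarms": ("region", "cluster_id"),
    "huawei_analyze_aom_alarms": ("region", "cluster_id"),
    "huawei_get_cce_events": ("region", "cluster_id"),
    "huawei_get_cce_pod_metrics_topN": ("region", "cluster_id", "namespace"),
    "huawei_get_cce_node_metrics_topN": ("region", "cluster_id"),
    "huawei_get_aom_metrics": ("region", "cluster_id", "namespace"),
    "huawei_list_aom_instances": ("region",),
    "huawei_query_aom_logs": ("region", "cluster_id", "namespace"),
    "huawei_get_recent_logs": ("region", "cluster_id", "namespace"),
    "huawei_get_pod_logs": ("region", "cluster_id", "pod_name", "namespace"),
    "huawei_generate_monitor_dashboard": ("region", "cluster_id"),
}


def missing_params(action, params):
    """Return required parameter names that are absent or empty for an action."""
    required = REQUIRED_PARAMS[action]  # raises KeyError for unknown actions
    return [name for name in required if params.get(name) in (None, "")]


print(missing_params("huawei_get_pod_logs", {"region": "cn-north-4", "cluster_id": "c1"}))
```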

Workflow

Record scope: Capture fault time, region, cluster_id, namespace, workload, pod, node, and alarm_id provided by the user
Set time window: If time is unclear, default to the last 1 hour and note this assumption in the output
Collect alarms: Call huawei_list_aom_alarms to collect active and historical alarms, then call huawei_analyze_aom_alarms to deduplicate them and group them by severity
Collect events: Call huawei_get_cce_events to retrieve Kubernetes Events grouped by involved object and reason
Collect metrics: Call Pod/Node TopN metrics tools to find resource peaks, abnormal nodes, and abnormal Pods
Collect logs: When log evidence is needed, prefer huawei_query_aom_logs, then supplement with recent LTS entries from huawei_get_recent_logs or Pod-side container logs from huawei_get_pod_logs
Merge and output: Merge signals along a timeline, output the anomaly summary, missing information (gaps), and recommended diagnostic skill for hand-off

For the complete evidence-gathering workflow, see references/workflow.md.
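
Two of the steps above, defaulting the time window and merging signals along a timeline, can be sketched as follows. This is an illustration under stated assumptions, not the skill's implementation. It assumes each record carries a `timestamp` field holding an ISO 8601 string in a single consistent format and timezone, so that string order matches chronological order. The function names and sample records are invented.

```python
from datetime import datetime, timedelta, timezone


def resolve_time_window(end=None, hours=None):
    """Return (start, end, assumed); assumed is True when the 1-hour default applied."""
    end = end or datetime.now(timezone.utc)
    assumed = hours is None
    hours = 1 if assumed else hours
    return end - timedelta(hours=hours), end, assumed


def merge_timeline(signals):
    """Flatten {signal_type: [records]} into one list sorted by timestamp."""
    merged = []
    for signal_type, records in signals.items():
        for record in records:
            merged.append({**record, "signal_type": signal_type})
    return sorted(merged, key=lambda item: item["timestamp"])


# Invented sample records, for illustration only.
signals = {
    "alarms": [{"timestamp": "2024-01-01T10:05:00Z", "detail": "restart alarm"}],
    "events": [{"timestamp": "2024-01-01T10:01:00Z", "detail": "BackOff"}],
    "metrics": [],
    "logs": [{"timestamp": "2024-01-01T10:03:00Z", "detail": "out of memory"}],
}
timeline = merge_timeline(signals)
for item in timeline:
    print(item["timestamp"], item["signal_type"], item["detail"])
```

The `assumed` flag lets the caller note the default-window assumption in the output, as the workflow requires.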

Output Format

See references/output-schema.md for the complete JSON response structure.

Context Package Output:

{
  "summary": "one paragraph context summary",
  "scope": {
    "region": "cn-north-4",
    "cluster_id": "optional",
    "namespace": "optional",
    "workload": "optional",
    "time_window": "optional"
  },
  "signals": {
    "alarms": [],
    "events": [],
    "metrics": [],
    "logs": []
  },
  "timeline": [],
  "gaps": [],
  "next_skill": "huawei-cloud-cce-pod-failure-diagnoser | huawei-cloud-cce-node-failure-diagnoser | huawei-cloud-cce-network-failure-diagnoser | huawei-cloud-cce-root-cause-analyzer"
}

Key output fields:

summary — one paragraph summarizing the collected observability context
scope — region, cluster, namespace, workload, and time window
signals — collected evidence grouped by type (alarms, events, metrics, logs)
timeline — merged signal timeline showing event chronology
gaps — missing data that could improve diagnosis
next_skill — recommended diagnostic skill for hand-off based on signal analysis
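
The fields above can be assembled with a small helper that also records empty signal types as gaps. This is a minimal sketch: `build_context_package` and the gap wording are assumptions, and the authoritative structure remains references/output-schema.md.

```python
import json

SIGNAL_TYPES = ("alarms", "events", "metrics", "logs")


def build_context_package(summary, scope, signals, timeline, next_skill):
    """Assemble the context package and list empty signal types as gaps."""
    collected = {name: list(signals.get(name, [])) for name in SIGNAL_TYPES}
    gaps = [f"no {name} collected" for name in SIGNAL_TYPES if not collected[name]]
    return {
        "summary": summary,
        "scope": scope,
        "signals": collected,
        "timeline": timeline,
        "gaps": gaps,
        "next_skill": next_skill,
    }


# Invented sample input, for illustration only.
package = build_context_package(
    summary="Sample summary",
    scope={"region": "cn-north-4", "cluster_id": "<cluster-id>"},
    signals={"alarms": [{"id": "a1"}], "events": [{"reason": "BackOff"}]},
    timeline=[],
    next_skill="huawei-cloud-cce-pod-failure-diagnoser",
)
print(json.dumps(package, indent=2))
```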

Risk Rules

This skill is strictly read-only observability — no mutations allowed.

Allow automatic R1 read-only queries: alarms, metrics, logs, events, inventory, read-only report generation
Prohibit any action requiring confirm=true — no mutations allowed
Never persist AK/SK, tokens, certificates, or kubeconfig
Log output must be sanitized. When suspected secrets are found, describe the hit location only — never copy the original text
Charts and reports must only be generated from authorized query results

For complete risk classification, see references/risk-rules.md.
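
The sanitization rule above (report where a suspected secret was found, never its text) could look like the following sketch. The patterns are illustrative assumptions and far from exhaustive; `sanitize_log_lines` is a hypothetical helper. Whole lines are replaced so that no fragment of a secret survives.

```python
import re

# Illustrative patterns only; a real rule set would need to be broader and reviewed.
SECRET_PATTERNS = {
    "password": re.compile(r"password\s*[=:]\s*\S+", re.IGNORECASE),
    "token": re.compile(r"token\s*[=:]\s*\S+", re.IGNORECASE),
    "authorization_header": re.compile(r"authorization\s*:\s*\S+", re.IGNORECASE),
}


def sanitize_log_lines(lines):
    """Return (sanitized_lines, hits); hits hold line number and pattern name only."""
    sanitized, hits = [], []
    for number, line in enumerate(lines, start=1):
        matched = [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(line)]
        if matched:
            hits.extend({"line": number, "kind": name} for name in matched)
            sanitized.append("[REDACTED: suspected secret]")
        else:
            sanitized.append(line)
    return sanitized, hits


# Invented sample log lines, for illustration only.
raw_lines = ["starting server", "db password=hunter2", "Authorization: Bearer abc"]
clean_lines, secret_hits = sanitize_log_lines(raw_lines)
print(clean_lines)
print(secret_hits)
```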

Verification

Run python3 scripts/huawei-cloud.py --action huawei_list_aom_alarms --params '{"region": "cn-north-4", "cluster_id": "<cluster-id>"}' through skill action=exec to verify alarm query connectivity
Run python3 scripts/huawei-cloud.py --action huawei_get_cce_events --params '{"region": "cn-north-4", "cluster_id": "<cluster-id>", "limit": 10}' through skill action=exec to verify that the Event query works
Run python3 scripts/huawei-cloud.py --action huawei_get_cce_pod_metrics_topN --params '{"region": "cn-north-4", "cluster_id": "<cluster-id>", "namespace": "default", "top_n": 5}' through skill action=exec to verify metrics TopN
Run a full context build on a healthy namespace and confirm the output contains valid scope, signals, timeline, gaps, and next_skill fields
Verify that no mutation actions are suggested in the output — all actions should be read-only or hand-offs to diagnosis skills

Best Practices

Always start with alarms: Check active and historical alarms first using huawei_list_aom_alarms and huawei_analyze_aom_alarms, because alarms provide the most direct fault signals
Define scope early: Record region, cluster_id, namespace, workload, pod, node, and time window before collecting any data. If the time window is unclear, default to the last 1 hour and note the assumption
Use TopN for quick anomaly detection: huawei_get_cce_pod_metrics_topN and huawei_get_cce_node_metrics_topN efficiently highlight resource peaks without scanning all resources
Prefer AOM logs first, then supplement with LTS and Pod logs: huawei_query_aom_logs provides structured log data; use huawei_get_recent_logs for recent LTS entries and huawei_get_pod_logs for Pod-side container log details
Merge signals along a timeline: Chronological merging of alarms, events, metrics, and logs reveals causal chains that individual data types cannot
Mark gaps explicitly: Always identify missing data in the gaps field — this guides the next diagnostic skill on what additional evidence to collect
Never suggest mutation actions: This skill is read-only. For scaling, deletion, restart, drain, or vulnerability state changes, hand off to huawei-cloud-cce-auto-remediation-runner
Recommend the correct next skill: Based on signal analysis, recommend the most specific diagnoser — Pod failures → huawei-cloud-cce-pod-failure-diagnoser, node issues → huawei-cloud-cce-node-failure-diagnoser, network → huawei-cloud-cce-network-failure-diagnoser, cross-domain → huawei-cloud-cce-root-cause-analyzer
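
The hand-off mapping in the last practice can be expressed as a lookup. This is a sketch under assumptions: the domain labels `pod`, `node`, and `network` are hypothetical classifications produced by earlier signal analysis. Falling back to the root cause analyzer for mixed, unknown, or empty input is a design choice made for this example.

```python
NEXT_SKILL = {
    "pod": "huawei-cloud-cce-pod-failure-diagnoser",
    "node": "huawei-cloud-cce-node-failure-diagnoser",
    "network": "huawei-cloud-cce-network-failure-diagnoser",
}
CROSS_DOMAIN = "huawei-cloud-cce-root-cause-analyzer"


def recommend_next_skill(domains):
    """Pick the most specific diagnoser; use root cause analysis otherwise."""
    unique = set(domains)
    if len(unique) == 1:
        # Exactly one affected domain: hand off to its dedicated diagnoser if known.
        return NEXT_SKILL.get(unique.pop(), CROSS_DOMAIN)
    # Several domains (or none) point to a cross-domain investigation.
    return CROSS_DOMAIN


print(recommend_next_skill(["pod", "pod"]))
print(recommend_next_skill(["pod", "node"]))
```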

Reference Documents

Document	Description
Workflow (references/workflow.md)	Evidence-gathering workflow and step sequence
Risk Rules (references/risk-rules.md)	Safety constraints and risk classification
Output Schema (references/output-schema.md)	JSON response format for context package

Notes

This skill is strictly read-only — it never executes remediation actions. For mutation actions, hand off to huawei-cloud-cce-auto-remediation-runner
All actions are R1 (read-only) — no confirm=true is ever needed
Log excerpts are sanitized — suspected passwords, tokens, AK/SK, and Authorization headers are redacted in output
AK/SK must never be hardcoded — use environment variables only
The Python dispatcher script (scripts/huawei-cloud.py) is the only execution method — do not use hcloud CLI or direct API calls
The next_skill field in the output uses huawei-cloud-cce-* naming for cross-skill hand-off
When alarm correlation is needed before context building, consider using huawei-cloud-cce-alarm-correlation-engine

Common Pitfalls

Pitfall	Symptom	Quick Fix
Missing `cluster_id`	All actions fail immediately	Provide `cluster_id` from cluster list
No time window specified	Broad, noisy results	Default to last 1 hour; note assumption in output
Skipping alarm collection	Missing critical fault signals	Always start with `huawei_list_aom_alarms`
Not merging signals on timeline	Isolated data points, no causal chain	Chronologically merge alarms, events, metrics, and logs
Suggesting mutation actions	Unsafe recommendations	All mutations → `huawei-cloud-cce-auto-remediation-runner`
Not marking data gaps	Diagnosis skill lacks direction	Always populate the `gaps` field
Querying all namespaces	Slow response, too many results	Scope with `namespace` and `workload`
AOM Prometheus instance not found	Metrics queries return empty results	Verify with `huawei_list_aom_instances` first