Install
openclaw skills install huawei-cloud-cce-root-cause-analyzerHuawei Cloud CCE cross-domain root cause analysis skill using Python SDK dispatcher. Use this skill when a CCE incident spans alarms, workload rollout, Pod events/logs, recent changes, service topology, nodes, network, or metrics, and the user needs a complete Markdown root-cause report with investigation steps, evidence chain, impact scope, Top3 causes, confidence, and remediation handoff. Trigger: user mentions "root cause analysis", "根因分析", "multiple failures", "多类告警", "cross-resource diagnosis", "跨资源诊断", "comprehensive diagnosis", "综合诊断", "RCA", "故障定位", "impact scope", "影响面分析", "change correlation", "变更关联"
openclaw skills install huawei-cloud-cce-root-cause-analyzer⚠️ Execution Method (Must Read): This skill executes diagnosis via local Python scripts using the
scripts/huawei-cloud.pydispatcher. Using hcloud, kubectl, or other CLI tools or direct API calls is prohibited.
- All actions are dispatched through
scripts/huawei-cloud.pywith--action <action_name>and--params <json_params>- All scripts and environment check scripts are inside the skill package. You must use
skill action=execto execute them; do not run them directly in a shell- For action names and parameters, see the Core Tools section below
- Do not attempt hcloud, kubectl, curl IAM, or other CLI/API methods. This skill does not depend on these tools
- All paths are relative to the skill directory, which is the directory where this SKILL.md resides
This skill converges multi-domain evidence into root cause conclusions for CCE incidents. It orchestrates workload rollout diagnosis, dependency impact analysis, change impact analysis, AOM alarm analysis, and cross-domain drill-down (network, node) to produce a complete Markdown report with investigation steps, timeline, evidence chain, impact scope, Top3 root causes, confidence, counter-evidence, and remediation handoff.
This skill is applicable to the following scenarios:
This skill does NOT handle the following:
Before using, you must run the environment check script to complete environment validation and dependency installation in one step:
skill action=exec: bash skill://scripts/check_env.shskill action=exec: powershell -ExecutionPolicy Bypass -File skill://scripts/check_env.ps1Windows Note: Do not use
&&to chain commands (PowerShell 5.x does not support it). Use semicolons if you need to change directories first.
The script will check in sequence: Python >= 3.6 → install dependencies → validate SDK → validate credentials → validate service availability. If the environment check fails, fix the issues before continuing with other actions.
Environment Variables:
| Variable | Required | Description |
|---|---|---|
| HW_ACCESS_KEY | Yes | Huawei Cloud AK |
| HW_SECRET_KEY | Yes | Huawei Cloud SK |
| HW_REGION_NAME | No | Default cn-north-4 |
| HW_PROJECT_ID | No | Project ID (automatically obtained via IAM API when not set) |
| HW_SECURITY_TOKEN | No | Required when using temporary AK/SK |
| HW_CLUSTER_ID | No | Default CCE cluster ID (can also be passed per action) |
Security Constraints:
Do not output the values of the above environment variables.
| API Action | Permission | Purpose |
|---|---|---|
| cce:cluster:get | Get cluster | View cluster details |
| cce:cluster:list | List clusters | List CCE clusters |
| cce:node:list | List nodes | List cluster nodes |
| aom:*:get | Read AOM | Query AOM metrics and alarms |
| aom:event:list | List events | Query AOM alarm events |
| aom:alarmRule:list | List alarm rules | Query alarm rules |
Permission Failure Handling:
All actions are dispatched through scripts/huawei-cloud.py using skill action=exec:
Primary Comprehensive Diagnosis:
| Action | Required Parameters | Description |
|---|---|---|
huawei_root_cause_analyze | region, cluster_id | Primary comprehensive action: orchestrates workload rollout diagnosis, dependency impact, change impact, and AOM alarms into a unified root cause report with Top3 causes |
Workload Domain Actions:
| Action | Required Parameters | Description |
|---|---|---|
huawei_workload_rollout_diagnose | region, cluster_id, namespace, kind, name | Diagnose Deployment/StatefulSet/DaemonSet rollout failures with funnel and Top causes |
huawei_workload_diagnose | region, cluster_id | General workload status diagnosis |
huawei_workload_diagnose_by_alarm | region, cluster_id | Workload diagnosis triggered by AOM alarm correlation |
huawei_pod_failure_diagnose | region, cluster_id | Pod-level failure diagnosis (CrashLoop, ImagePull, OOM, Pending, etc.) |
Dependency and Impact Actions:
| Action | Required Parameters | Description |
|---|---|---|
huawei_dependency_impact_analyze | region, cluster_id | Analyze Service/Ingress/Pod/Node propagation paths and impact scope for service unavailability |
Change Impact Actions:
| Action | Required Parameters | Description |
|---|---|---|
huawei_change_impact_analyze | region, cluster_id | Correlate recent changes (deployment, config, network, security policy, node changes) with observed failures via audit log and AOM alarm timeline |
Network and Node Domain Actions:
| Action | Required Parameters | Description |
|---|---|---|
huawei_network_diagnose | region, cluster_id | General network connectivity diagnosis |
huawei_network_diagnose_by_alarm | region, cluster_id | Network diagnosis triggered by AOM alarm correlation |
huawei_network_failure_diagnose | region, cluster_id | Network failure diagnosis (Service, Ingress connectivity) |
huawei_node_diagnose | region, cluster_id | Node-level diagnosis (scheduling, pressure) |
huawei_node_failure_diagnose | region, cluster_id | Node failure diagnosis |
huawei_node_batch_diagnose | region, cluster_id | Batch node diagnosis for multi-node issues |
Alarm and Report Actions:
| Action | Required Parameters | Description |
|---|---|---|
huawei_analyze_aom_alarms | region, cluster_id | Analyze AOM alarm patterns and correlation across resources |
huawei_generate_diagnosis_report | region, cluster_id | Generate structured Markdown diagnosis report |
huawei_generate_monitor_dashboard | region, cluster_id | Generate monitoring dashboard for ongoing observation |
Supporting Evidence Actions:
| Action | Required Parameters | Description |
|---|---|---|
huawei_get_cce_events | region, cluster_id | List Kubernetes Events in the cluster |
Common Parameters:
| Parameter | Required | Description |
|---|---|---|
| region | Yes | Huawei Cloud region, e.g., cn-north-4 |
| cluster_id | Yes | CCE cluster ID |
| namespace | Yes* | Kubernetes namespace (required for workload-specific actions) |
| kind | Yes* | Workload type: Deployment, StatefulSet, or DaemonSet |
| name | Yes* | Workload name |
*Required only for huawei_workload_rollout_diagnose.
Optional Parameters (passed via --params JSON):
| Parameter | Description |
|---|---|
| ak | Override AK (uses HW_ACCESS_KEY by default) |
| sk | Override SK (uses HW_SECRET_KEY by default) |
| project_id | Override project ID (auto-obtained via IAM when not set) |
| target_name | Optional workload/app/service name for scope narrowing |
| hours | Metric/query time range in hours (default 1) |
| top_n | Number of top results for ranking (default 3) |
huawei_root_cause_analyze{
"success": true,
"analysis_trace_id": "RCA-...",
"scope": {
"region": "cn-north-4",
"cluster_id": "cluster-id",
"namespace": "optional",
"target_name": "optional workload/app/service"
},
"summary": {
"top_cause": {},
"cause_count": 3,
"data_sources": {
"rollout": true,
"dependency": true,
"change": true,
"alarms": true
}
},
"top_causes": [
{
"rank": 1,
"type": "ContainerCommandNotFound",
"title": "New version container startup command or entry file does not exist",
"domain": "workload",
"confidence": 0.94,
"evidence": [],
"counter_evidence": [],
"recommendation": [],
"remediation_hint": {
"skill": "huawei-cloud-cce-auto-remediation-runner",
"action": "huawei_auto_remediation_run",
"strategy": "rollback_previous_revision",
"requires_confirmation": true
}
}
],
"report_markdown": "# CCE Comprehensive Root Cause Analysis Report...",
"report_file": "optional"
}
Each domain action (huawei_workload_rollout_diagnose, huawei_dependency_impact_analyze, huawei_change_impact_analyze, huawei_analyze_aom_alarms) produces its own structured JSON output. See individual skill documentation for domain-specific schemas.
huawei_root_cause_analyze on a known healthy cluster to verify it returns success: true with zero or low-confidence causeshuawei_root_cause_analyze on a cluster with known multi-domain failures to verify Top3 causes are accurately identifiedhuawei_root_cause_analyze summary with individual domain action outputs for consistencyhuawei_root_cause_analyze for comprehensive diagnosis; drill down into individual domain actions only when specific evidence requires deeper analysishuawei_dependency_impact_analyze to trace Service selector → Ingress backend → Pod Ready → Node distribution propagation pathshuawei_change_impact_analyze to build "change occurred before failure" causal chain with audit logs, K8s historical events, and AOM alarmshuawei-cloud-cce-auto-remediation-runnerreferences/workflow.mdreferences/output-schema.mdreferences/risk-rules.mdskill action=exec; do not run them directly in a shellconfirm=true must be handed off to huawei-cloud-cce-auto-remediation-runner; this skill never executes remediationhuawei_root_cause_analyze and drilling into individual domains first — Always start with comprehensive analysis; individual domain drill-down is for supplementary evidence onlyhuawei-cloud-cce-auto-remediation-runner; this skill only outputs recommendationshuawei_change_impact_analyze to verify the "change before failure" causal chain