Huawei Cloud CCE Root Cause Analyzer

Huawei Cloud CCE cross-domain root cause analysis skill that uses a Python SDK dispatcher. Use this skill when a CCE incident spans alarms, workload rollout, Pod events/logs, recent changes, service topology, nodes, network, or metrics, and the user needs a complete Markdown root-cause report with investigation steps, evidence chain, impact scope, Top3 causes, confidence, and remediation handoff. Trigger: the user mentions "root cause analysis", "根因分析", "multiple failures", "多类告警", "cross-resource diagnosis", "跨资源诊断", "comprehensive diagnosis", "综合诊断", "RCA", "故障定位", "impact scope", "影响面分析", "change correlation", or "变更关联".

Install

openclaw skills install huawei-cloud-cce-root-cause-analyzer

CCE Root Cause Analysis

⚠️ Execution Method (Must Read): This skill executes diagnosis via local Python scripts using the scripts/huawei-cloud.py dispatcher. Using hcloud, kubectl, or other CLI tools or direct API calls is prohibited.

  • All actions are dispatched through scripts/huawei-cloud.py with --action <action_name> and --params <json_params>
  • All scripts and environment check scripts are inside the skill package. You must use skill action=exec to execute them; do not run them directly in a shell
  • For action names and parameters, see the Core Tools section below
  • Do not attempt hcloud, kubectl, curl IAM, or other CLI/API methods. This skill does not depend on these tools
  • All paths are relative to the skill directory, which is the directory where this SKILL.md resides
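As a minimal sketch of the dispatch convention above, the following builds the argument list for a single dispatcher call. Only the script path and the --action and --params flags come from this document; the helper name build_dispatch_argv, the python executable name, and the huawei_ prefix check (inferred from the action names listed under Core Tools) are assumptions. The resulting command still has to be run through skill action=exec, not directly in a shell.

```python
import json

DISPATCHER = "scripts/huawei-cloud.py"  # relative to the skill directory

def build_dispatch_argv(action, params):
    """Return the argv for one dispatcher call (hypothetical helper)."""
    # Assumption: every action documented for this skill starts with "huawei_".
    if not action.startswith("huawei_"):
        raise ValueError(f"unexpected action name: {action}")
    # --params carries one JSON object; keys are sorted for reproducible output.
    return ["python", DISPATCHER, "--action", action,
            "--params", json.dumps(params, sort_keys=True)]

argv = build_dispatch_argv(
    "huawei_root_cause_analyze",
    {"region": "cn-north-4", "cluster_id": "cluster-id"},
)
print(argv)
```

Keeping the parameters in a dict and serializing them once avoids hand-written JSON quoting mistakes, which matters on Windows where shell quoting rules differ.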

Overview

This skill converges multi-domain evidence into root cause conclusions for CCE incidents. It orchestrates workload rollout diagnosis, dependency impact analysis, change impact analysis, AOM alarm analysis, and cross-domain drill-down (network, node) to produce a complete Markdown report with investigation steps, timeline, evidence chain, impact scope, Top3 root causes, confidence, counter-evidence, and remediation handoff.

This skill is applicable to the following scenarios:

  1. Cross-resource incidents involving multiple failure domains (workload + dependency + change + alarm)
  2. Root cause analysis when alarms span multiple CCE resources and the user needs comprehensive diagnosis
  3. Correlating recent changes (deployments, config updates, network/security policy changes, node changes) with observed failures
  4. Dependency impact propagation analysis (Service → Ingress → Pod → Node chain)
  5. Workload rollout failures that require an evidence funnel (generation → ReplicaSet → Pod Ready → events → logs → command/args → probes → image)
  6. Producing structured Top3 root cause reports with evidence, counter-evidence, and confidence scores
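The rollout evidence funnel in scenario 5 can be read as an ordered checklist. The sketch below is illustrative only: the stage names follow the funnel in this document, but the first_failed_stage helper and the boolean check results are hypothetical and are not part of the skill's output schema.

```python
# Funnel stages in the order given in scenario 5 (labels are illustrative).
FUNNEL = ["generation", "ReplicaSet", "Pod Ready", "events",
          "logs", "command/args", "probes", "image"]

def first_failed_stage(checks):
    """Return the earliest funnel stage whose check failed, or None.

    `checks` maps a stage name to True (passed) or False (failed).
    Stages without a result are treated as not yet examined and skipped.
    """
    for stage in FUNNEL:
        if checks.get(stage) is False:
            return stage
    return None

# Made-up check results for demonstration.
observed = {"generation": True, "ReplicaSet": True,
            "Pod Ready": False, "logs": False}
print(first_failed_stage(observed))
```

Walking the stages in order points the investigation at the earliest broken stage, so later failures (here, logs) are read as consequences until shown otherwise.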

This skill does NOT handle the following:

  1. Executing any remediation actions (scale, delete, drain, reboot, vulnerability state modification, cluster sleep/wake)
  2. Making root cause conclusions from a single alarm without timeline or evidence chain
  3. Creating, modifying, or deleting CCE resources
  4. Guessing or fabricating diagnosis results without evidence

Prerequisites

Before using this skill, run the environment check script. It validates the environment and installs dependencies in one step:

  • Linux / macOS: skill action=exec: bash skill://scripts/check_env.sh
  • Windows: skill action=exec: powershell -ExecutionPolicy Bypass -File skill://scripts/check_env.ps1

Windows Note: Do not use && to chain commands (PowerShell 5.x does not support it). Use semicolons if you need to change directories first.

The script will check in sequence: Python >= 3.6 → install dependencies → validate SDK → validate credentials → validate service availability. If the environment check fails, fix the issues before continuing with other actions.

Environment Variables:

| Variable | Required | Description |
| --- | --- | --- |
| HW_ACCESS_KEY | Yes | Huawei Cloud AK |
| HW_SECRET_KEY | Yes | Huawei Cloud SK |
| HW_REGION_NAME | No | Default cn-north-4 |
| HW_PROJECT_ID | No | Project ID (automatically obtained via IAM API when not set) |
| HW_SECURITY_TOKEN | No | Required when using temporary AK/SK |
| HW_CLUSTER_ID | No | Default CCE cluster ID (can also be passed per action) |

Security Constraints:

  1. Never persist credentials (AK/SK/Token/Certificate) to the filesystem
  2. AK/SK exist only within the current request call stack; released after use
  3. Only non-sensitive project IDs are cached in process memory (never written to disk)
  4. All temporary certificate files must be deleted immediately after use
  5. Never expose AK/SK in logs, responses, or error messages

Do not output the values of the above environment variables.
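A presence check that respects the rule above might look like the following sketch. It is not the logic of check_env.sh or check_env.ps1, whose internals this document does not show; the describe_env helper and the keys of its result are hypothetical. It reports only whether each variable is set and never returns a value.

```python
import os

REQUIRED_VARS = ("HW_ACCESS_KEY", "HW_SECRET_KEY")

def describe_env(env=None):
    """Report which variables are set without returning any of their values."""
    env = os.environ if env is None else env
    return {
        "missing_required": [n for n in REQUIRED_VARS if not env.get(n)],
        # When HW_REGION_NAME is unset, the documented default cn-north-4 applies.
        "uses_default_region": not env.get("HW_REGION_NAME"),
        "project_id_set": bool(env.get("HW_PROJECT_ID")),
        "security_token_set": bool(env.get("HW_SECURITY_TOKEN")),
        "cluster_id_set": bool(env.get("HW_CLUSTER_ID")),
    }

# A plain dict stands in for os.environ so the example has no side effects.
report = describe_env({"HW_ACCESS_KEY": "placeholder"})
print(report)
```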


IAM Permission Requirements

| API Action | Permission | Purpose |
| --- | --- | --- |
| cce:cluster:get | Get cluster | View cluster details |
| cce:cluster:list | List clusters | List CCE clusters |
| cce:node:list | List nodes | List cluster nodes |
| aom:*:get | Read AOM | Query AOM metrics and alarms |
| aom:event:list | List events | Query AOM alarm events |
| aom:alarmRule:list | List alarm rules | Query alarm rules |

Permission Failure Handling:

  1. When any command fails because of a permission error, display the required permission list
  2. Guide the user to create a custom policy in the IAM console
  3. Pause execution and wait for user confirmation
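The first two handling steps could be rendered as a single message, as in this sketch. The permission names and purposes are copied from the table above; the permission_failure_message helper and its wording are hypothetical, and how a permission error is detected is deliberately left out because this document does not describe the error format.

```python
# Permissions and purposes as listed in the IAM Permission Requirements table.
REQUIRED_PERMISSIONS = [
    ("cce:cluster:get", "View cluster details"),
    ("cce:cluster:list", "List CCE clusters"),
    ("cce:node:list", "List cluster nodes"),
    ("aom:*:get", "Query AOM metrics and alarms"),
    ("aom:event:list", "Query AOM alarm events"),
    ("aom:alarmRule:list", "Query alarm rules"),
]

def permission_failure_message(failed_action):
    """Build the text shown to the user after a permission failure."""
    lines = [f"Action {failed_action} failed because of missing permissions.",
             "Required permissions:"]
    lines += [f"  - {perm}: {purpose}" for perm, purpose in REQUIRED_PERMISSIONS]
    lines.append("Create a custom policy in the IAM console, then confirm to continue.")
    return "\n".join(lines)

message = permission_failure_message("huawei_analyze_aom_alarms")
print(message)
```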

Core Tools

All actions are dispatched through scripts/huawei-cloud.py using skill action=exec:

Primary Comprehensive Diagnosis:

| Action | Required Parameters | Description |
| --- | --- | --- |
| huawei_root_cause_analyze | region, cluster_id | Primary comprehensive action: orchestrates workload rollout diagnosis, dependency impact, change impact, and AOM alarms into a unified root cause report with Top3 causes |

Workload Domain Actions:

| Action | Required Parameters | Description |
| --- | --- | --- |
| huawei_workload_rollout_diagnose | region, cluster_id, namespace, kind, name | Diagnose Deployment/StatefulSet/DaemonSet rollout failures with funnel and Top causes |
| huawei_workload_diagnose | region, cluster_id | General workload status diagnosis |
| huawei_workload_diagnose_by_alarm | region, cluster_id | Workload diagnosis triggered by AOM alarm correlation |
| huawei_pod_failure_diagnose | region, cluster_id | Pod-level failure diagnosis (CrashLoop, ImagePull, OOM, Pending, etc.) |

Dependency and Impact Actions:

| Action | Required Parameters | Description |
| --- | --- | --- |
| huawei_dependency_impact_analyze | region, cluster_id | Analyze Service/Ingress/Pod/Node propagation paths and impact scope for service unavailability |

Change Impact Actions:

| Action | Required Parameters | Description |
| --- | --- | --- |
| huawei_change_impact_analyze | region, cluster_id | Correlate recent changes (deployment, config, network, security policy, node changes) with observed failures via audit log and AOM alarm timeline |

Network and Node Domain Actions:

| Action | Required Parameters | Description |
| --- | --- | --- |
| huawei_network_diagnose | region, cluster_id | General network connectivity diagnosis |
| huawei_network_diagnose_by_alarm | region, cluster_id | Network diagnosis triggered by AOM alarm correlation |
| huawei_network_failure_diagnose | region, cluster_id | Network failure diagnosis (Service, Ingress connectivity) |
| huawei_node_diagnose | region, cluster_id | Node-level diagnosis (scheduling, pressure) |
| huawei_node_failure_diagnose | region, cluster_id | Node failure diagnosis |
| huawei_node_batch_diagnose | region, cluster_id | Batch node diagnosis for multi-node issues |

Alarm and Report Actions:

| Action | Required Parameters | Description |
| --- | --- | --- |
| huawei_analyze_aom_alarms | region, cluster_id | Analyze AOM alarm patterns and correlation across resources |
| huawei_generate_diagnosis_report | region, cluster_id | Generate structured Markdown diagnosis report |
| huawei_generate_monitor_dashboard | region, cluster_id | Generate monitoring dashboard for ongoing observation |

Supporting Evidence Actions:

| Action | Required Parameters | Description |
| --- | --- | --- |
| huawei_get_cce_events | region, cluster_id | List Kubernetes Events in the cluster |

Parameter Reference

Common Parameters:

| Parameter | Required | Description |
| --- | --- | --- |
| region | Yes | Huawei Cloud region, e.g., cn-north-4 |
| cluster_id | Yes | CCE cluster ID |
| namespace | Yes* | Kubernetes namespace (required for workload-specific actions) |
| kind | Yes* | Workload type: Deployment, StatefulSet, or DaemonSet |
| name | Yes* | Workload name |

*Required only for huawei_workload_rollout_diagnose.

Optional Parameters (passed via --params JSON):

| Parameter | Description |
| --- | --- |
| ak | Override AK (uses HW_ACCESS_KEY by default) |
| sk | Override SK (uses HW_SECRET_KEY by default) |
| project_id | Override project ID (auto-obtained via IAM when not set) |
| target_name | Optional workload/app/service name for scope narrowing |
| hours | Metric/query time range in hours (default 1) |
| top_n | Number of top results for ranking (default 3) |
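The required and optional parameters above can be checked before a call is dispatched. The following is a sketch under stated assumptions: the prepare_params helper is hypothetical, and filling in hours and top_n explicitly only mirrors the documented defaults (the dispatcher presumably applies them itself).

```python
ROLLOUT_ACTION = "huawei_workload_rollout_diagnose"
WORKLOAD_KINDS = {"Deployment", "StatefulSet", "DaemonSet"}
DEFAULTS = {"hours": 1, "top_n": 3}  # defaults stated in the table above

def prepare_params(action, params):
    """Validate required parameters and return params with defaults filled in."""
    required = ["region", "cluster_id"]
    if action == ROLLOUT_ACTION:
        # namespace, kind and name are required only for the rollout action.
        required += ["namespace", "kind", "name"]
    missing = [key for key in required if not params.get(key)]
    if missing:
        raise ValueError(f"{action}: missing required parameters {missing}")
    if action == ROLLOUT_ACTION and params["kind"] not in WORKLOAD_KINDS:
        raise ValueError(f"{action}: unsupported kind {params['kind']!r}")
    # Caller-supplied values take precedence over the defaults.
    return {**DEFAULTS, **params}

prepared = prepare_params(
    "huawei_root_cause_analyze",
    {"region": "cn-north-4", "cluster_id": "cluster-id", "hours": 6},
)
print(prepared)
```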

Output Format

Primary Comprehensive: huawei_root_cause_analyze

{
  "success": true,
  "analysis_trace_id": "RCA-...",
  "scope": {
    "region": "cn-north-4",
    "cluster_id": "cluster-id",
    "namespace": "optional",
    "target_name": "optional workload/app/service"
  },
  "summary": {
    "top_cause": {},
    "cause_count": 3,
    "data_sources": {
      "rollout": true,
      "dependency": true,
      "change": true,
      "alarms": true
    }
  },
  "top_causes": [
    {
      "rank": 1,
      "type": "ContainerCommandNotFound",
      "title": "New version container startup command or entry file does not exist",
      "domain": "workload",
      "confidence": 0.94,
      "evidence": [],
      "counter_evidence": [],
      "recommendation": [],
      "remediation_hint": {
        "skill": "huawei-cloud-cce-auto-remediation-runner",
        "action": "huawei_auto_remediation_run",
        "strategy": "rollback_previous_revision",
        "requires_confirmation": true
      }
    }
  ],
  "report_markdown": "# CCE Comprehensive Root Cause Analysis Report...",
  "report_file": "optional"
}
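A consumer of this output might reduce it to one line per cause, as sketched below. The field names come from the sample above; the summarize_causes helper and the line format are hypothetical, and the embedded sample is a trimmed copy of the JSON shown above rather than real analysis output.

```python
import json

def summarize_causes(raw):
    """Turn a huawei_root_cause_analyze result into one line per cause."""
    result = json.loads(raw)
    if not result.get("success"):
        raise RuntimeError("analysis did not succeed")
    lines = []
    for cause in sorted(result.get("top_causes", []), key=lambda c: c["rank"]):
        # remediation_hint names the skill that takes over; this skill never remediates.
        hint = cause.get("remediation_hint") or {}
        lines.append(
            f"#{cause['rank']} {cause['type']} [{cause['domain']}] "
            f"confidence={cause['confidence']:.2f} "
            f"handoff={hint.get('skill', 'none')}"
        )
    return lines

sample = json.dumps({
    "success": True,
    "top_causes": [{
        "rank": 1,
        "type": "ContainerCommandNotFound",
        "domain": "workload",
        "confidence": 0.94,
        "remediation_hint": {
            "skill": "huawei-cloud-cce-auto-remediation-runner",
            "requires_confirmation": True,
        },
    }],
})
summary = summarize_causes(sample)
print(summary)
```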

Supporting Domain Outputs

Each domain action (huawei_workload_rollout_diagnose, huawei_dependency_impact_analyze, huawei_change_impact_analyze, huawei_analyze_aom_alarms) produces its own structured JSON output. See individual skill documentation for domain-specific schemas.


Verification

  1. Run the environment check script to confirm dependencies and credentials are available
  2. Use huawei_root_cause_analyze on a known healthy cluster to verify it returns success: true with zero or low-confidence causes
  3. Use huawei_root_cause_analyze on a cluster with known multi-domain failures to verify Top3 causes are accurately identified
  4. Compare huawei_root_cause_analyze summary with individual domain action outputs for consistency
  5. Verify that evidence chains reference specific objects, events, and API fields (not generic statements)
  6. Verify that counter-evidence is present for each top cause candidate
  7. Confirm that low-confidence conclusions are clearly labeled with required supplementary data
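Checks 5 to 7 can be partly automated against the top_causes array. The sketch below assumes the field names from the Output Format section; the review_causes helper and the 0.5 low-confidence cut-off are assumptions, since this document does not define a threshold. It cannot judge whether evidence is specific enough, only whether it is present.

```python
LOW_CONFIDENCE = 0.5  # assumed cut-off; not defined by the skill

def review_causes(top_causes):
    """Return a list of findings for causes that fail the basic checks."""
    findings = []
    for cause in top_causes:
        tag = f"rank {cause.get('rank')}"
        if not cause.get("evidence"):
            findings.append(f"{tag}: no evidence listed")
        if not cause.get("counter_evidence"):
            findings.append(f"{tag}: no counter-evidence listed")
        if cause.get("confidence", 0.0) < LOW_CONFIDENCE:
            findings.append(f"{tag}: low confidence, supplementary data required")
    return findings

# Made-up causes for demonstration.
causes = [
    {"rank": 1, "confidence": 0.94,
     "evidence": ["Pod event: exec not found"], "counter_evidence": ["image pull ok"]},
    {"rank": 2, "confidence": 0.3, "evidence": ["node pressure alarm"], "counter_evidence": []},
]
findings = review_causes(causes)
print(findings)
```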

Best Practices

  1. Always start with huawei_root_cause_analyze for comprehensive diagnosis; drill down into individual domain actions only when specific evidence requires deeper analysis
  2. For workload rollout failures, prioritize the rollout funnel: generation/observedGeneration → ReplicaSet → Pod Ready → Events → Logs → command/args → probes → image
  3. For service unavailability, use huawei_dependency_impact_analyze to trace Service selector → Ingress backend → Pod Ready → Node distribution propagation paths
  4. For suspected change-induced failures, use huawei_change_impact_analyze to build a "change occurred before failure" causal chain from audit logs, K8s historical events, and AOM alarms
  5. Never conclude root cause from a single alarm alone; always provide timeline or evidence chain
  6. Record supporting evidence, counter-evidence, data gaps, and remediation handoff for each root cause candidate
  7. Sort root causes by impact scope, timeline alignment, evidence strength, and recoverability
  8. Clearly label low-confidence conclusions with required supplementary data
  9. All remediation actions must be output as recommendations only and handed off to huawei-cloud-cce-auto-remediation-runner
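Practice 7 names four sorting criteria but not how they are combined. One possible reading is a lexicographic sort in the listed order, sketched below. The Candidate fields, their scales, and the tie-breaking order are all assumptions made for illustration, not the skill's actual ranking logic.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    impact_scope: int          # assumed: number of affected workloads
    timeline_alignment: float  # assumed: 0..1, how well timing matches the failure
    evidence_strength: float   # assumed: 0..1
    recoverability: float      # assumed: 0..1, higher means easier to recover

def rank_candidates(candidates):
    """Sort by the four criteria in the listed order, highest first."""
    return sorted(
        candidates,
        key=lambda c: (c.impact_scope, c.timeline_alignment,
                       c.evidence_strength, c.recoverability),
        reverse=True,
    )

# Made-up candidates for demonstration.
ranked = rank_candidates([
    Candidate("node pressure", 2, 0.4, 0.6, 0.5),
    Candidate("bad startup command", 5, 0.9, 0.9, 0.8),
    Candidate("network policy change", 5, 0.7, 0.8, 0.9),
])
print([c.name for c in ranked])
```

A weighted score would be an equally valid reading; the lexicographic form is used here only because it needs no invented weights.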

Notes

  1. This skill performs read-only diagnosis and report generation. It never performs write, scale, delete, cordon, drain, reboot, vulnerability state modification, or cluster sleep/wake operations
  2. Do not output the values of HW_ACCESS_KEY, HW_SECRET_KEY, HW_SECURITY_TOKEN, or other environment variables
  3. All scripts must be executed via skill action=exec; do not run them directly in a shell
  4. Any action requiring confirm=true must be handed off to huawei-cloud-cce-auto-remediation-runner; this skill never executes remediation
  5. The environment check script must be run before any diagnosis action
  6. When using temporary AK/SK, HW_SECURITY_TOKEN must be set

Common Pitfalls

  1. Concluding root cause from a single alarm — Always require timeline or evidence chain; a single alarm without temporal correlation is insufficient evidence
  2. Skipping huawei_root_cause_analyze and drilling into individual domains first — Always start with comprehensive analysis; individual domain drill-down is for supplementary evidence only
  3. Ignoring counter-evidence — Each root cause candidate must include counter-evidence and data gaps; omitting these leads to false confidence
  4. Not building a fault timeline — Establish user-perceived time, alarm trigger time, Kubernetes event time, and change time before ranking causes
  5. Attempting remediation actions from this skill — All changes must be handed off to huawei-cloud-cce-auto-remediation-runner; this skill only outputs recommendations
  6. Failing to label low-confidence conclusions — When evidence is insufficient, write "insufficient evidence" explicitly; never present guesses as conclusions
  7. Not correlating changes with failures — When a recent deployment, config, network, or security policy change exists, use huawei_change_impact_analyze to verify the "change before failure" causal chain
  8. Treating dependency propagation as single-direction — Dependency impact can propagate bidirectionally (upstream failure affects downstream, and downstream back-pressure affects upstream); analyze both directions
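Pitfalls 4 and 7 both come down to ordering timestamps from different sources. The sketch below merges them into one timeline and checks whether the earliest change precedes the earliest failure signal. The source labels, the helper names, and the sample timestamps are made up for illustration; ordering alone shows that a change is a candidate cause and does not prove causation.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"  # assumed timestamp format

def parse(ts):
    return datetime.strptime(ts, TIME_FORMAT)

def build_timeline(events):
    """Sort (timestamp, source, description) tuples chronologically."""
    return sorted(events, key=lambda e: parse(e[0]))

def change_precedes_failure(timeline):
    """True when the earliest change is older than the earliest failure signal."""
    changes = [parse(t) for t, source, _ in timeline if source == "change"]
    failures = [parse(t) for t, source, _ in timeline if source != "change"]
    return bool(changes and failures) and min(changes) < min(failures)

# Made-up events covering the four times named in pitfall 4.
timeline = build_timeline([
    ("2024-01-01T10:07:00", "user", "users report errors"),
    ("2024-01-01T10:05:00", "alarm", "AOM alarm triggered"),
    ("2024-01-01T10:03:00", "k8s_event", "Pod BackOff"),
    ("2024-01-01T10:00:00", "change", "Deployment updated"),
])
print([source for _, source, _ in timeline])
print(change_precedes_failure(timeline))
```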