Huawei Cloud Cce Change Impact Analyzer

Other

Huawei Cloud CCE change impact analysis skill that converts "what changed before the incident" into provable causal attribution. Use this skill when a CCE incident may be caused by recent changes, including workload releases, ConfigMap/Secret updates, Service/Ingress/Gateway route changes, NetworkPolicy/RBAC/security policy changes, node taints or infrastructure changes, and the user needs a complete Markdown report with timeline, evidence matrix, blast radius, risk score, and conclusion. Trigger: "change impact analysis", "变更影响分析", "change risk", "变更风险", "deployment impact", "发布影响", "config change", "配置变更", "network policy change", "网络策略变更", "node taint change", "节点污点变更", "blast radius", "爆炸半径", "recent changes", "近期变更", "audit log correlation", "审计日志关联"

Install

openclaw skills install huawei-cloud-cce-change-impact-analyzer

CCE Change Impact Analyzer

⚠️ Execution Method (Must Read): This skill executes diagnosis via local Python scripts using the scripts/huawei-cloud.py dispatcher. Using hcloud, kubectl, or other CLI tools or direct API calls is prohibited.

All actions are dispatched through scripts/huawei-cloud.py with --action <action_name> and --params <json_params>

All scripts and environment check scripts are inside the skill package. You must use skill action=exec to execute them; do not run them directly in a shell

For action names and parameters, see the Core Tools section below

Do not attempt hcloud, kubectl, curl IAM, or other CLI/API methods. This skill does not depend on these tools

All paths are relative to the skill directory, which is the directory where this SKILL.md resides

Overview

This skill turns "what changed before the incident" into provable causal attribution. It ingests audit logs, K8s historical events, AOM active+history alarms, and current resource topology snapshots; filters noise; maps core changes to blast radius; scores risk by sensitivity, topology scope, security boundary span, temporal proximity to fault, and event/alarm correlation; then outputs a complete Markdown report with investigation steps, core change timeline, evidence matrix, blast radius, Top N risk alerts, conclusion, and data gaps.

This skill is applicable to the following scenarios:

Incidents where recent workload releases, config updates, or network/security policy changes may be the cause
CoreDNS, kube-proxy, or cluster plugin configuration changes causing business-wide failures
Node taint, cordon/drain, node pool resize, or cluster upgrade triggering Pod Pending, Evicted, or NotReady
NetworkPolicy/RBAC changes causing connection timeouts, 403 errors, DNS anomalies, or cross-namespace access failures
Service/Ingress/Gateway route changes causing traffic routing failures
Correlating audit trail changes with observed failures and alarm timelines

This skill does NOT handle the following:

Executing any remediation actions (rollback, scale, delete, drain, reboot, modify NetworkPolicy/RBAC/Security Group/VPC ACL)
Making causal conclusions from object updates alone without temporal or response-signal correlation
Creating, modifying, or deleting CCE resources
Guessing or fabricating diagnosis results without evidence

Prerequisites

Before using, you must run the environment check script to complete environment validation and dependency installation in one step:

Linux / macOS: skill action=exec: bash skill://scripts/check_env.sh
Windows: skill action=exec: powershell -ExecutionPolicy Bypass -File skill://scripts/check_env.ps1

Windows Note: Do not use && to chain commands (PowerShell 5.x does not support it). Use semicolons if you need to change directories first.

The script will check in sequence: Python >= 3.6 → install dependencies → validate SDK → validate credentials → validate service availability. If the environment check fails, fix the issues before continuing with other actions.

Environment Variables:

Variable	Required	Description
HW_ACCESS_KEY	Yes	Huawei Cloud AK
HW_SECRET_KEY	Yes	Huawei Cloud SK
HW_REGION_NAME	No	Default cn-north-4
HW_PROJECT_ID	No	Project ID (automatically obtained via IAM API when not set)
HW_SECURITY_TOKEN	No	Required when using temporary AK/SK
HW_CLUSTER_ID	No	Default CCE cluster ID (can also be passed per action)

Security Constraints:

Never persist credentials (AK/SK/Token/Certificate) to the filesystem
AK/SK exist only within the current request call stack; released after use
Only non-sensitive project IDs are cached in process memory (never written to disk)
All temporary certificate files must be deleted immediately after use
Never expose AK/SK in logs, responses, or error messages

Do not output the values of the above environment variables.

IAM Permission Requirements

API Action	Permission	Purpose
cce:cluster:get	Get cluster	View cluster details
cce:cluster:list	List clusters	List CCE clusters
cce:node:list	List nodes	List cluster nodes
cce:nodepool:list	List node pools	List node pools
aom:*:get	Read AOM	Query AOM metrics and alarms
aom:alarmRule:list	List alarm rules	Query alarm rules
aom:event:list	List events	Query AOM alarm events

Permission Failure Handling:

When any command fails due to permission errors, display required permission list
Guide the user to create a custom policy in the IAM console
Pause execution and wait for user confirmation

Core Tools

All actions are dispatched through scripts/huawei-cloud.py using skill action=exec.

Primary Change Impact Analysis

Action	Required Parameters	Description
`huawei_change_impact_analyze`	region, cluster_id	Primary comprehensive action: orchestrates audit log ingestion, K8s event correlation, AOM alarm correlation, resource snapshot collection, noise filtering, blast radius modeling, and risk scoring into a unified change impact report with Top N risk alerts

Audit and Event Collection

Action	Required Parameters	Description
`huawei_query_cce_audit_logs`	region, cluster_id	Query CCE Kubernetes audit logs for create/update/patch/delete operations with actor, verb, resource, namespace, name, requestURI, statusCode
`huawei_query_k8s_events_from_lts`	region, cluster_id	Query historical K8s Events from LTS (overcomes the K8s API short event window)
`huawei_get_cce_events`	region, cluster_id	List current Kubernetes Events when LTS is unavailable

Alarm Correlation

Action	Required Parameters	Description
`huawei_analyze_aom_alarms`	region, cluster_id	Analyze AOM active + history alarm patterns and correlation across resources

Domain Drill-Down (Read-Only)

Action	Required Parameters	Description
`huawei_workload_rollout_diagnose`	region, cluster_id, namespace, kind, name	Drill down when changes point to Deployment/StatefulSet/DaemonSet rollout failures (cross-skill: `huawei-cloud-cce-workload-failure-diagnoser`)
`huawei_network_failure_diagnose`	region, cluster_id	Drill down when changes point to Service/Ingress/NetworkPolicy/ELB connectivity failures (cross-skill: `huawei-cloud-cce-network-failure-diagnoser`)
`huawei_node_failure_diagnose`	region, cluster_id	Drill down when changes point to Node taint, NotReady, scheduling, or resource pressure (cross-skill: `huawei-cloud-cce-node-failure-diagnoser`)

Current Topology Snapshots

Action	Required Parameters	Description
`huawei_get_cce_pods`	region, cluster_id	List current Pod status for blast radius modeling
`huawei_get_cce_deployments`	region, cluster_id	List current Deployment status
`huawei_get_cce_services`	region, cluster_id	List current Service selector/ports for impact mapping
`huawei_get_cce_ingresses`	region, cluster_id	List current Ingress rules/backends for impact mapping
`huawei_get_kubernetes_nodes`	region, cluster_id	List current Node status for taint/impact mapping
`huawei_list_cce_configmaps`	region, cluster_id	List current ConfigMap objects (identify CoreDNS, kube-proxy, business configs)
`huawei_list_cce_secrets`	region, cluster_id	List current Secret objects
`huawei_list_cce_nodepools`	region, cluster_id	List current NodePool status for infrastructure change context

Cloud Network Snapshots

Action	Required Parameters	Description
`huawei_list_security_groups`	region	List current Security Group rules for cloud network context
`huawei_list_vpc_acls`	region	List current VPC ACL rules for cloud network context

Parameter Reference

Common Parameters:

Parameter	Required	Description
region	Yes	Huawei Cloud region, e.g., cn-north-4
cluster_id	Yes	CCE cluster ID

Optional Parameters (passed via --params JSON):

Parameter	Description
hours	Analysis window in hours (default 1)
start_time	Analysis window start (YYYY-MM-DD HH:MM:SS), alternative to hours
end_time	Analysis window end (YYYY-MM-DD HH:MM:SS), alternative to hours
namespace	Narrow scope to a namespace, but do not exclude kube-system/CoreDNS global changes
target_name	Target object name for scope narrowing
workload_name	Workload name for scope narrowing
app_name	Application name for scope narrowing
fault_time	Incident time point for temporal proximity scoring
incident_time	Alternative to fault_time
log_group_id	Audit log group ID (manual fallback when auto-discovery fails)
log_stream_id	Audit log stream ID (manual fallback when auto-discovery fails)
include_audit	Enable/disable audit log collection (default true)
include_k8s_events	Enable/disable K8s event collection (default true)
include_aom	Enable/disable AOM alarm collection (default true)
include_snapshots	Enable/disable resource snapshot collection (default true)
top_n	Number of top risk alerts in report (default 3)
output_file	Path to write the Markdown report file
ak	Override AK (uses HW_ACCESS_KEY by default)
sk	Override SK (uses HW_SECRET_KEY by default)
project_id	Override project ID (auto-obtained via IAM when not set)

Output Format

Primary: `huawei_change_impact_analyze`

Returns structured JSON with embedded report_markdown. See references/output-schema.md for full schema.

{
  "success": true,
  "analysis_trace_id": "CIA-yyyymmddHHMMSS-xxxxxxxx",
  "analysis_window": {
    "start_time": "YYYY-MM-DD HH:MM:SS",
    "end_time": "YYYY-MM-DD HH:MM:SS",
    "hours": 1
  },
  "scope": {
    "region": "cn-north-4",
    "cluster_id": "cluster-id",
    "namespace": "optional",
    "target_name": "optional"
  },
  "summary": {
    "core_change_count": 3,
    "top_risk_count": 3,
    "data_sources": {
      "CCE Audit Logs": "success",
      "K8s Historical Events": "success",
      "AOM Alarms": "success",
      "Current Resource Snapshots": "success"
    }
  },
  "top_changes": [
    {
      "time": "YYYY-MM-DD HH:MM:SS",
      "verb": "patch",
      "resource": "configmaps",
      "namespace": "kube-system",
      "name": "coredns",
      "object_key": "kube-system/coredns",
      "category": "global_config_change",
      "title": "Cluster core configuration change",
      "actor": "user or serviceAccount",
      "semantic_fields": ["data", "Corefile"],
      "blast_radius": "cluster-wide",
      "impacted_entities": {
        "pods": [],
        "services": ["kube-system/kube-dns"],
        "ingresses": [],
        "nodes": ["node-a"]
      },
      "risk_score": 96,
      "risk_level": "Critical",
      "confidence": "high",
      "risk_reasons": [],
      "evidence": []
    }
  ],
  "changes": [],
  "report_markdown": "# CCE Change Impact Analysis Report\n...",
  "report_file": "/optional/path/report.md",
  "capture_metadata": {}
}

Verification

Run the environment check script to confirm dependencies and credentials are available
Use huawei_change_impact_analyze on a known stable cluster to verify it returns success: true with zero or low-confidence core changes
Use huawei_change_impact_analyze on a cluster with known recent changes to verify Top N risk alerts are accurately identified
Verify that noise filtering correctly excludes HPA replica-only updates, controller status writes, Lease/Token/status subresource writes
Verify that CoreDNS/kube-proxy/kube-system changes are always included regardless of namespace scope
Verify that blast radius mapping correctly traces Service selector → Pod → Ingress → Node propagation
Confirm that low-confidence conclusions are clearly labeled with data gaps

Best Practices

Always start with huawei_change_impact_analyze for comprehensive change correlation; drill down into domain diagnoser actions only when specific evidence requires deeper analysis
Find changes first, then map impact, then align with alarms/events/fault time — do not conclude root cause from object updates alone
CoreDNS, kube-proxy, network plugin, and Ingress controller config changes in kube-system must always be included in business fault analysis regardless of target namespace scope
Deployment HPA-only replicas adjustments are noise; image, startup args, probe, resource spec, environment variable, and ConfigMap/Secret reference changes are core changes
NetworkPolicy/RBAC changes must be correlated with connection timeouts, 403, DNS anomalies, and cross-namespace access failures
Node taint, cordon/drain, node pool resize, and cluster upgrade changes must be correlated with Pod Pending, Evicted, NotReady, and node events
All remediation actions must be output as recommendations only and handed off to huawei-cloud-cce-auto-remediation-runner
Clearly label low-confidence conclusions with required supplementary data; never present speculation as fact

Reference Documents

Four-stage pipeline and risk scoring rules: references/workflow.md
Reusable capabilities, gaps, and suggested atomic actions: references/capability-map.md
Output field specification and Markdown template: references/output-schema.md
Read-only boundaries and remediation handoff rules: references/risk-rules.md
Huawei Cloud CCE Documentation
Huawei Cloud Python SDK Documentation

Notes

This skill is read-only analysis and report generation only; no modification of workloads, rollback, ConfigMap/Secret changes, Security Group/ACL/NetworkPolicy/RBAC adjustments, or node cordon/drain/reboot operations
Do not output the values of HW_ACCESS_KEY, HW_SECRET_KEY, HW_SECURITY_TOKEN, or other environment variables
All scripts must be executed via skill action=exec; do not run them directly in a shell
Any remediation action must be handed off to huawei-cloud-cce-auto-remediation-runner; this skill never executes remediation
The environment check script must be run before any analysis action
When using temporary AK/SK, HW_SECURITY_TOKEN must be set
Cross-skill references: remediation → huawei-cloud-cce-auto-remediation-runner; comprehensive root cause → huawei-cloud-cce-root-cause-analyzer; workload diagnosis → huawei-cloud-cce-workload-failure-diagnoser; network diagnosis → huawei-cloud-cce-network-failure-diagnoser; node diagnosis → huawei-cloud-cce-node-failure-diagnoser

Common Pitfalls

Concluding root cause from object updates alone — Always require temporal proximity, event/alarm response, and topology impact evidence; an object update without correlation is insufficient evidence
Excluding kube-system changes when scoped to a namespace — CoreDNS, kube-proxy, and cluster plugin changes are global even when the target namespace is different; always include them
Treating HPA replica updates as core changes — HPA-only replicas modifications are noise; only image, probe, resource, env, config reference changes are core
Not correlating NetworkPolicy/RBAC with connectivity symptoms — NetworkPolicy/RBAC changes must be cross-referenced with connection timeout, 403, DNS anomaly, and cross-namespace access failure events
Attempting remediation actions from this skill — All changes must be handed off to huawei-cloud-cce-auto-remediation-runner; this skill only outputs recommendations
Failing to label low-confidence conclusions — When evidence is insufficient, write "insufficient evidence" explicitly with data gaps; never present guesses as conclusions
Ignoring controller and platform noise — Lease, Token, status subresource, Node status patch, scheduler binding, and CCE platform-managed RBAC updates must all be filtered out; they are control-plane closed-loop operations, not user changes
Not building a fault timeline — Establish user-perceived fault time, alarm trigger time, Kubernetes event time, and change time before scoring risk