Huawei Cloud Cce Workload Failure Diagnoser

Other

Huawei Cloud CCE workload failure diagnosis skill using Python SDK dispatcher. Use this skill when the user wants to: (1) diagnose Deployment/StatefulSet/DaemonSet rollout failures, (2) analyze workload replica shortages and update rollback issues, (3) diagnose probe-related readiness failures (startup, liveness, readiness), (4) identify ReplicaSet creation blocked by quota, admission, or webhook rejection, (5) detect control-plane-not-observed issues (observedGeneration lag), (6) check workload status, events and metrics comprehensively. Trigger: user mentions "workload failure", "工作负载故障", "Deployment rollback", "Deployment 回滚", "rollout stuck", "发布失败", "replica unavailable", "副本不可用", "workload diagnosis", "工作负载诊断", "workload unavailable", "负载异常", "Service unreachable", "Service 不通", "probe failure", "探针失败", "ReplicaSet blocked", "ReplicaSet 阻塞", "observedGeneration lag", "控制面未观测"

Install

openclaw skills install huawei-cloud-cce-workload-failure-diagnoser

Huawei Cloud CCE Workload Failure Diagnoser

⚠️ Execution Method (Must Read): This skill executes diagnosis via local Python scripts using the scripts/huawei-cloud.py dispatcher. Using hcloud, kubectl, or other CLI tools or direct API calls is prohibited.

All actions are dispatched through scripts/huawei-cloud.py with --action <action_name> and --params <json_params>

All scripts and environment check scripts are inside the skill package. You must use skill action=exec to execute them; do not run them directly in a shell

For action names and parameters, see the Core Commands section below

Do not attempt hcloud, kubectl, curl IAM, or other CLI/API methods. This skill does not depend on these tools

All paths are relative to the skill directory, which is the directory where this SKILL.md resides

Overview

This skill diagnoses CCE workload rollout failures, replica availability issues, and probe-related readiness failures for Deployment, StatefulSet, and DaemonSet workloads. It builds evidence from controller state, version ownership, and event trees, then drills down into abnormal Pods using Pod diagnosis logic.

Architecture: Python SDK dispatcher (scripts/huawei-cloud.py) → Huawei Cloud CCE API / Kubernetes API → Workload + ReplicaSet + Pod + Event data → Rollout funnel analysis → Top causes ranking → Handoff recommendations

Related Skills:

huawei-cloud-cce-pod-failure-diagnoser - Pod-level failure diagnosis (CrashLoop, ImagePull, OOM, Pending, etc.) for drill-down from workload diagnosis
huawei-cloud-cce-node-failure-diagnoser - Node-level failure diagnosis (NotReady, DiskPressure, MemoryPressure, etc.) for scheduling/node pressure handoff
huawei-cloud-cce-root-cause-analyzer - Multi-domain root cause analysis converging workload, alarm, change, and dependency evidence
huawei-cloud-cce-auto-remediation-runner - Remediation execution (rollback, scale, resize, cordon, drain, reboot) with preview-confirm-verify workflow
huawei-cloud-cce-alarm-correlation-engine - AOM alarm correlation, deduplication, and severity grouping for alarm-related evidence
huawei-cloud-cce-cce-workload-manager - CCE workload lifecycle management (create, query, scale, update, delete)

Capabilities:

Diagnose Deployment/StatefulSet/DaemonSet rollout failures (rollout stuck, rollback issues)
Analyze workload replica shortages (unavailable replicas, updated replicas below expected)
Diagnose probe failures (startup, liveness, readiness probe failures causing Pod not Ready)
Identify ReplicaSet creation blocked by quota, admission, or webhook rejection
Detect control-plane-not-observed issues (observedGeneration lagging behind generation)
Collect workload rollout context (Workload, ReplicaSet, Pod, and UID-filtered Events)
Drill down into Pod-level failures for abnormal Pods
Cross-domain handoff to node, network, storage, and root-cause skills

Typical Use Cases:

"My Deployment rollout is stuck, diagnose the failure"
"Replicas are unavailable, find the root cause"
"Pods are Running but not Ready, check probe failures"
"New ReplicaSet has zero replicas, check quota or admission blocking"
"Deployment observedGeneration is lagging, check control plane pressure"
"Collect workload rollout context for manual analysis"
"Drill down into Pod-level failures from workload diagnosis"
"Check PVC/PV for storage-related workload issues"
"Hand off to node or network diagnosis for cross-domain failures"

Prerequisites

1. Python Requirements (MANDATORY)

Python >= 3.6 installed
Run the environment check script before any diagnosis action

Environment Check:

Linux / macOS: skill action=exec: bash skill://scripts/check_env.sh
Windows: skill action=exec: powershell -ExecutionPolicy Bypass -File skill://scripts/check_env.ps1

Windows Note: Do not use && to chain commands (PowerShell 5.x does not support it). Use semicolons if you need to change directories first.

The script will check in sequence: Python >= 3.6 → install dependencies → validate SDK → validate credentials → validate service availability. If the environment check fails, fix the issues before continuing with other actions.

2. Credential Configuration

Valid Huawei Cloud credentials (AK/SK mode)
Security Rules:
- 🚫 Never expose AK/SK values in code, conversation, or commands
- 🚫 Never use echo $HW_ACCESS_KEY or echo $HW_SECRET_KEY to check credentials
- ✅ Use environment variables: HW_ACCESS_KEY, HW_SECRET_KEY, HW_REGION_NAME
- ✅ Prefer IAM users over root account for cloud operations
- ✅ Enable MFA for sensitive operations
- ✅ Never persist credentials (AK/SK/Token/Certificate) to the filesystem
- ✅ All temporary certificate files must be deleted immediately after use

Configuration Method (Environment Variables Only):

export HW_ACCESS_KEY=<your-ak>
export HW_SECRET_KEY=<your-sk>
export HW_REGION_NAME=cn-north-4
export HW_PROJECT_ID=<your-project-id>

Optional for temporary AK/SK:

export HW_SECURITY_TOKEN=<your-security-token>

⚠️ Important Security Notes:

AK/SK exist only within the current request call stack; released after use
Only non-sensitive project IDs are cached in process memory (never written to disk)
Never commit credentials to version control
Use IAM users with minimal required permissions
Rotate AK/SK regularly
Do not output the values of environment variables

3. IAM Permission Requirements

API Action	Permission	Purpose
`cce:cluster:get`	Get cluster	Obtain CCE cluster details and kubeconfig
`cce:cluster:createCert`	Create certificate	Obtain CCE cluster kubeconfig for API access
`cce:node:list`	List nodes	Query cluster node information
`cce:workload:get`	Get workload	Read Deployment/StatefulSet/DaemonSet status
`cce:pod:list`	List pods	Query Pod status and container state
`cce:event:list`	List events	Query Kubernetes Events for diagnosis
`aom:metric:get`	Get metrics	Query Pod/Node CPU/memory metrics

Permission Failure Handling:

When any action fails due to IAM permission errors, verify the permissions listed above
Guide the user to create a custom policy in the IAM console for Huawei Cloud permissions
Pause execution and wait for user confirmation that permissions have been granted

Core Commands

All actions are dispatched through scripts/huawei-cloud.py using skill action=exec:

python3 scripts/huawei-cloud.py --action <action_name> --params '<json_params>'

1. Primary Diagnosis

See Workflow for detailed evidence collection and diagnosis flow.

# Diagnose Deployment rollout failure
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_workload_rollout_diagnose --params '{"region":"cn-north-4","cluster_id":"<cluster-id>","namespace":"default","kind":"Deployment","name":"api"}'

# Diagnose StatefulSet rollout failure
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_workload_rollout_diagnose --params '{"region":"cn-north-4","cluster_id":"<cluster-id>","namespace":"default","kind":"StatefulSet","name":"my-db"}'

# Diagnose DaemonSet rollout failure
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_workload_rollout_diagnose --params '{"region":"cn-north-4","cluster_id":"<cluster-id>","namespace":"default","kind":"DaemonSet","name":"log-agent"}'

2. Context Collection (Evidence Only)

# Collect raw workload rollout context without diagnosis ranking
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_get_workload_rollout_context --params '{"region":"cn-north-4","cluster_id":"<cluster-id>","namespace":"default","kind":"Deployment","name":"api"}'

3. Pod-Level Drill-Down

# Diagnose Pod-level failures (CrashLoop, ImagePull, OOM, Pending, etc.)
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_pod_failure_diagnose --params '{"region":"cn-north-4","cluster_id":"<cluster-id>"}'

# List Pods in the cluster
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_get_cce_pods --params '{"region":"cn-north-4","cluster_id":"<cluster-id>"}'

# Retrieve container logs for a specific Pod
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_get_pod_logs --params '{"region":"cn-north-4","cluster_id":"<cluster-id>","pod_name":"<pod-name>"}'

4. Supporting Evidence

# List Kubernetes Events in the cluster
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_get_cce_events --params '{"region":"cn-north-4","cluster_id":"<cluster-id>"}'

# Get CPU/memory metrics for a specific Pod
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_get_cce_pod_metrics --params '{"region":"cn-north-4","cluster_id":"<cluster-id>","pod_name":"<pod-name>"}'

# Get top-N Pod metrics by resource usage
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_get_cce_pod_metrics_topN --params '{"region":"cn-north-4","cluster_id":"<cluster-id>"}'

5. Cross-Domain Drill-Down

# List PVCs for storage-related issues
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_get_cce_pvcs --params '{"region":"cn-north-4","cluster_id":"<cluster-id>"}'

# List PVs for storage-related issues
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_get_cce_pvs --params '{"region":"cn-north-4","cluster_id":"<cluster-id>"}'

# Diagnose node-level failures (scheduling, pressure)
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_node_diagnose --params '{"region":"cn-north-4","cluster_id":"<cluster-id>"}'

# Diagnose network-level failures (Service, Ingress connectivity)
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_network_diagnose --params '{"region":"cn-north-4","cluster_id":"<cluster-id>"}'

Parameter Reference

Common Parameters

Parameter	Required	Description	Default
`region`	Yes	Huawei Cloud region, e.g., cn-north-4	`HW_REGION_NAME`
`cluster_id`	Yes	CCE cluster ID	`HW_CLUSTER_ID`
`namespace`	Yes*	Kubernetes namespace	N/A
`kind`	Yes*	Workload type: Deployment, StatefulSet, or DaemonSet	N/A
`name`	Yes*	Workload name	N/A

*Required only for huawei_workload_rollout_diagnose and huawei_get_workload_rollout_context.

Workload-Specific Parameters

Parameter	Required	Description	Constraints
`kind`	Yes	Workload type	`Deployment`, `StatefulSet`, or `DaemonSet` only
`name`	Yes	Workload name	Must reference existing workload in the namespace

Optional Parameters (passed via `--params` JSON)

Parameter	Description	Default
`ak`	Override AK (uses `HW_ACCESS_KEY` by default)	env var
`sk`	Override SK (uses `HW_SECRET_KEY` by default)	env var
`project_id`	Override project ID (auto-obtained via IAM when not set)	env var
`label_selector`	Pod label selector for filtering	None
`hours`	Metric query time range in hours	1
`top_n`	Number of top results for metrics	10

Supported Regions

Region Name	Region ID
North China - Beijing 4	`cn-north-4`
North China - Beijing 1	`cn-north-1`
East China - Shanghai 1	`cn-east-3`
East China - Shanghai 2	`cn-east-2`
South China - Guangzhou	`cn-south-1`
Southwest China - Guiyang 1	`cn-southwest-2`
Asia Pacific - Bangkok	`ap-southeast-2`
Asia Pacific - Singapore	`ap-southeast-1`
Asia Pacific - Hong Kong	`ap-southeast-3`
Europe - Paris	`eu-west-0`

Output Format

See Output Schema for detailed response format examples.

Primary Diagnosis: `huawei_workload_rollout_diagnose`

{
  "success": true,
  "action": "workload_rollout_diagnose",
  "target": {
    "namespace": "default",
    "kind": "Deployment",
    "name": "api"
  },
  "selector": {
    "value": "app=api",
    "source": "matchLabels"
  },
  "summary": {
    "status": "control_plane_not_observed | new_version_not_created | rollout_blocked | replicas_unavailable | probe_failure | healthy",
    "headline": "human-readable diagnosis; may note when old-version replicas remain available",
    "expected_replicas": 3,
    "ready_replicas": 1,
    "available_replicas": 1,
    "top_cause": "ProbeFailure | ContainerCommandNotFound | CrashLoopOrAppExit | ..."
  },
  "generation_check": {
    "generation": 5,
    "observed_generation": 5,
    "observed": true
  },
  "workload": {
    "kind": "Deployment",
    "uid": "workload-uid",
    "desired_replicas": 3,
    "updated_replicas": 3,
    "ready_replicas": 1,
    "available_replicas": 1,
    "conditions": []
  },
  "version": {
    "strategy": "DeploymentReplicaSet",
    "new_rs": {},
    "old_rs": []
  },
  "funnel": [
    {"layer": "workload_current", "expected": 3, "actual": 3, "status": "pass"},
    {"layer": "new_pods_ready", "expected": 3, "actual": 1, "status": "fail"}
  ],
  "events": {
    "filtered_count": 5,
    "timeline": [],
    "filter": {
      "uid_count": 6,
      "before_count": 40,
      "after_count": 5,
      "events_without_involved_uid": 0
    }
  },
  "pod_diagnosis": {
    "diagnosed_pods": 1,
    "pods": []
  },
  "top_causes": [
    {
      "rank": 1,
      "type": "ProbeFailure",
      "title": "New version Pods are Running but probe checks fail or Pods are not Ready",
      "confidence": 0.88,
      "evidence": [],
      "recommendation": []
    }
  ],
  "handoff": [
    {
      "skill": "huawei-cloud-cce-pod-failure-diagnoser",
      "reason": "Probe failure requires Pod logs and health check configuration analysis"
    }
  ],
  "warnings": []
}

Context-Only: `huawei_get_workload_rollout_context`

{
  "success": true,
  "action": "get_workload_rollout_context",
  "workload": {},
  "replicasets": [],
  "pods": [],
  "events": [],
  "event_filter": {},
  "warnings": []
}

Summary Status Values

Status	Description
`healthy`	All replicas ready and available; rollout complete
`control_plane_not_observed`	`observedGeneration < generation`; controller lagging
`new_version_not_created`	New ReplicaSet has zero current replicas or no owned Pods
`rollout_blocked`	Rollout funnel layer fails; replicas below expected
`replicas_unavailable`	Updated/ready/available replicas below desired count
`probe_failure`	Pod Running but not Ready; Unhealthy probe events present

Top Cause Types

Top Cause Type	Description
`ProbeFailure`	Startup/liveness/readiness probe failing
`ContainerCommandNotFound`	Executable not found in container `$PATH`
`CrashLoopOrAppExit`	Container crashing or exiting unexpectedly
`ImagePullBackOff`	Container image cannot be pulled
`OOMKilled`	Container killed by OOM
`QuotaOrAdmissionRejected`	ReplicaSet creation blocked by quota/admission/webhook
`SchedulingFailure`	Pod Pending due to FailedScheduling
`StorageMountFailure`	Pod FailedMount/FailedAttachVolume

Verification

See Verification Method for step-by-step verification.

Best Practices

Always start with primary diagnosis — Use huawei_workload_rollout_diagnose before collecting raw context; it provides the rollout funnel and ranked Top causes
Drill down from workload layer — Identify which funnel layer first fails before drilling into Pod-level diagnosis; do not skip the workload layer
Use Pod diagnosis for drill-down — When NewRS Pods exist but are not Ready, use huawei_pod_failure_diagnose and huawei_get_pod_logs for deeper analysis
Only use UID-filtered Events — Only accept Events whose involvedObject.uid belongs to Workload, ReplicaSet, or Pod objects; do not treat all namespace Warning events as relevant
Check generation before Pod analysis — If observedGeneration < generation, diagnose control-plane pressure before looking at Pods
Distinguish old vs new version replicas — A Deployment with old ReplicaSet replicas still available is not "healthy"; the funnel tracks new-version readiness specifically
Cite specific evidence — Always cite specific objects, events, log segments, or API fields in diagnosis output; never present guesses as conclusions
Write "insufficient evidence" when lacking — When evidence is insufficient, write "insufficient evidence" explicitly; never fabricate diagnosis results
Hand off cross-domain failures — When evidence points to scheduling/node pressure, hand off to huawei-cloud-cce-node-failure-diagnoser; for Service/Ingress/ELB issues, hand off to huawei-cloud-cce-network-failure-diagnoser
Remediation as recommendations only — All remediation actions (scale, resize, delete, cordon/drain/reboot) must be output as recommendations and handed off to huawei-cloud-cce-auto-remediation-runner

Reference Documents

Document	Description
Workflow	Evidence collection order, failure rules, and diagnosis flow
Output Schema	Response format specification for primary and context actions
Risk Rules	Allowed/not-allowed operations and handoff mapping
Verification Method	Step-by-step verification procedures
Huawei Cloud CCE Documentation	Official CCE documentation
Huawei Cloud Python SDK Documentation	CCE API reference

Notes

This skill is read-only diagnosis only — no write, scale, delete, cordon, drain, or reboot operations
Do not output credential values — never expose HW_ACCESS_KEY, HW_SECRET_KEY, HW_SECURITY_TOKEN values in logs, responses, or error messages
All scripts must be executed via skill action=exec — do not run them directly in a shell
Do not call remediation actions — huawei_scale_cce_workload, huawei_resize_cce_workload, huawei_delete_cce_workload are out of scope
Do not add confirm=true — this skill never confirms remediation actions; it only outputs recommendations
Run environment check first — the check script must be executed before any diagnosis action
Temporary AK/SK requires HW_SECURITY_TOKEN — set this environment variable when using temporary credentials
Cross-skill handoff uses prefixed names — handoff targets use huawei-cloud-cce- prefix convention (e.g., huawei-cloud-cce-pod-failure-diagnoser)

Common Pitfalls

Pitfall	Symptom	Quick Fix
Treating all namespace Warning events as evidence	False-positive diagnosis with irrelevant events	Only use UID-filtered events whose `involvedObject.uid` belongs to Workload/RS/Pod
Skipping generation check	Misdiagnosing Pod failures when controller is lagging	If `observedGeneration < generation`, diagnose control-plane pressure first
Confusing old-version replicas with health	Declaring workload "healthy" when old RS still active	The rollout funnel tracks new-version readiness specifically
Diagnosing Pod failures directly without funnel	Missing the workload-layer root cause	Start from workload layer; identify which funnel layer first fails
Attempting remediation from this skill	Unauthorized scale/delete/cordon/drain actions	All changes must be handed off to `huawei-cloud-cce-auto-remediation-runner`
Ignoring ReplicaSetCreateBlocked subtype	Missing quota/admission/webhook rejection evidence	Check FailedCreate events mentioning quota, LimitRange, admission, webhook, forbidden
Using unfiltered namespace events	Too many irrelevant Warning events in diagnosis	Apply UID-based event filtering per workflow step 5
Not running environment check script first	SDK import errors or credential failures	Run `check_env.sh` / `check_env.ps1` before any action
Hardcoding cluster_id instead of environment variable	Wrong cluster or repeated manual entry	Use `HW_CLUSTER_ID` env var or pass `cluster_id` per action
Confusing Deployment RS vs StatefulSet version model	Wrong version analysis for StatefulSet/DaemonSet	Deployment uses ReplicaSet revision; StatefulSet/DaemonSet use workload itself