Install
openclaw skills install huawei-cloud-cce-workload-failure-diagnoser
Huawei Cloud CCE workload failure diagnosis skill using Python SDK dispatcher. Use this skill when the user wants to: (1) diagnose Deployment/StatefulSet/DaemonSet rollout failures, (2) analyze workload replica shortages and update rollback issues, (3) diagnose probe-related readiness failures (startup, liveness, readiness), (4) identify ReplicaSet creation blocked by quota, admission, or webhook rejection, (5) detect control-plane-not-observed issues (observedGeneration lag), (6) check workload status, events and metrics comprehensively.
Trigger: user mentions "workload failure", "工作负载故障", "Deployment rollback", "Deployment 回滚", "rollout stuck", "发布失败", "replica unavailable", "副本不可用", "workload diagnosis", "工作负载诊断", "workload unavailable", "负载异常", "Service unreachable", "Service 不通", "probe failure", "探针失败", "ReplicaSet blocked", "ReplicaSet 阻塞", "observedGeneration lag", "控制面未观测"
⚠️ Execution Method (Must Read): This skill executes diagnosis via local Python scripts using the scripts/huawei-cloud.py dispatcher. Using hcloud, kubectl, or other CLI tools or direct API calls is prohibited.
- All actions are dispatched through scripts/huawei-cloud.py with --action <action_name> and --params <json_params>
- All scripts and environment check scripts are inside the skill package. You must use skill action=exec to execute them; do not run them directly in a shell
- For action names and parameters, see the Core Commands section below
- Do not attempt hcloud, kubectl, curl IAM, or other CLI/API methods. This skill does not depend on these tools
- All paths are relative to the skill directory, which is the directory where this SKILL.md resides
This skill diagnoses CCE workload rollout failures, replica availability issues, and probe-related readiness failures for Deployment, StatefulSet, and DaemonSet workloads. It builds evidence from controller state, version ownership, and event trees, then drills down into abnormal Pods using Pod diagnosis logic.
Architecture: Python SDK dispatcher (scripts/huawei-cloud.py) → Huawei Cloud CCE API / Kubernetes API → Workload + ReplicaSet + Pod + Event data → Rollout funnel analysis → Top causes ranking → Handoff recommendations
Related Skills:
- huawei-cloud-cce-pod-failure-diagnoser - Pod-level failure diagnosis (CrashLoop, ImagePull, OOM, Pending, etc.) for drill-down from workload diagnosis
- huawei-cloud-cce-node-failure-diagnoser - Node-level failure diagnosis (NotReady, DiskPressure, MemoryPressure, etc.) for scheduling/node pressure handoff
- huawei-cloud-cce-root-cause-analyzer - Multi-domain root cause analysis converging workload, alarm, change, and dependency evidence
- huawei-cloud-cce-auto-remediation-runner - Remediation execution (rollback, scale, resize, cordon, drain, reboot) with preview-confirm-verify workflow
- huawei-cloud-cce-alarm-correlation-engine - AOM alarm correlation, deduplication, and severity grouping for alarm-related evidence
- huawei-cloud-cce-cce-workload-manager - CCE workload lifecycle management (create, query, scale, update, delete)
Capabilities:
Typical Use Cases:
Environment Check:
skill action=exec: bash skill://scripts/check_env.sh
skill action=exec: powershell -ExecutionPolicy Bypass -File skill://scripts/check_env.ps1
Windows Note: Do not use && to chain commands (PowerShell 5.x does not support it). Use semicolons if you need to change directories first.
The script will check in sequence: Python >= 3.6 → install dependencies → validate SDK → validate credentials → validate service availability. If the environment check fails, fix the issues before continuing with other actions.
- Do not use echo $HW_ACCESS_KEY or echo $HW_SECRET_KEY to check credentials
- Required environment variables: HW_ACCESS_KEY, HW_SECRET_KEY, HW_REGION_NAME
Configuration Method (Environment Variables Only):
export HW_ACCESS_KEY=<your-ak>
export HW_SECRET_KEY=<your-sk>
export HW_REGION_NAME=cn-north-4
export HW_PROJECT_ID=<your-project-id>
Optional for temporary AK/SK:
export HW_SECURITY_TOKEN=<your-security-token>
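A minimal sketch of a credential presence check that never prints secret values. It assumes HW_ACCESS_KEY, HW_SECRET_KEY, and HW_REGION_NAME are the variables that must be set; the function name `missing_credentials` is hypothetical and is not part of the skill package.

```python
import os

# Assumption: these three names are the variables that must be set.
REQUIRED_VARS = ("HW_ACCESS_KEY", "HW_SECRET_KEY", "HW_REGION_NAME")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty.

    Only variable names are returned; values are never read back to the caller.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Reporting only the missing names keeps AK/SK values out of logs and terminal history, which is the point of avoiding echo for this check.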
⚠️ Important Security Notes:
| API Action | Permission | Purpose |
|---|---|---|
| cce:cluster:get | Get cluster | Obtain CCE cluster details and kubeconfig |
| cce:cluster:createCert | Create certificate | Obtain CCE cluster kubeconfig for API access |
| cce:node:list | List nodes | Query cluster node information |
| cce:workload:get | Get workload | Read Deployment/StatefulSet/DaemonSet status |
| cce:pod:list | List pods | Query Pod status and container state |
| cce:event:list | List events | Query Kubernetes Events for diagnosis |
| aom:metric:get | Get metrics | Query Pod/Node CPU/memory metrics |
Permission Failure Handling:
All actions are dispatched through scripts/huawei-cloud.py using skill action=exec:
python3 scripts/huawei-cloud.py --action <action_name> --params '<json_params>'
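The invocation pattern above can be assembled programmatically, which avoids hand-escaping the JSON. This is a sketch only: `build_dispatch_command` is a hypothetical helper, it produces the command string without executing anything, and the quoting it applies is POSIX shell quoting (PowerShell quoting rules differ).

```python
import json
import shlex


def build_dispatch_command(action, params):
    """Build the 'skill action=exec' line for one dispatcher action.

    The params dict is serialized to compact JSON and shell-quoted so it
    survives as a single --params argument.
    """
    payload = json.dumps(params, separators=(",", ":"))
    argv = ["python3", "scripts/huawei-cloud.py",
            "--action", action, "--params", payload]
    return "skill action=exec: " + " ".join(shlex.quote(arg) for arg in argv)
```

For example, `build_dispatch_command("huawei_get_cce_pods", {"region": "cn-north-4", "cluster_id": "<cluster-id>"})` yields a line in the same shape as the commands listed in this section.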
See Workflow for detailed evidence collection and diagnosis flow.
# Diagnose Deployment rollout failure
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_workload_rollout_diagnose --params '{"region":"cn-north-4","cluster_id":"<cluster-id>","namespace":"default","kind":"Deployment","name":"api"}'
# Diagnose StatefulSet rollout failure
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_workload_rollout_diagnose --params '{"region":"cn-north-4","cluster_id":"<cluster-id>","namespace":"default","kind":"StatefulSet","name":"my-db"}'
# Diagnose DaemonSet rollout failure
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_workload_rollout_diagnose --params '{"region":"cn-north-4","cluster_id":"<cluster-id>","namespace":"default","kind":"DaemonSet","name":"log-agent"}'
# Collect raw workload rollout context without diagnosis ranking
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_get_workload_rollout_context --params '{"region":"cn-north-4","cluster_id":"<cluster-id>","namespace":"default","kind":"Deployment","name":"api"}'
# Diagnose Pod-level failures (CrashLoop, ImagePull, OOM, Pending, etc.)
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_pod_failure_diagnose --params '{"region":"cn-north-4","cluster_id":"<cluster-id>"}'
# List Pods in the cluster
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_get_cce_pods --params '{"region":"cn-north-4","cluster_id":"<cluster-id>"}'
# Retrieve container logs for a specific Pod
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_get_pod_logs --params '{"region":"cn-north-4","cluster_id":"<cluster-id>","pod_name":"<pod-name>"}'
# List Kubernetes Events in the cluster
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_get_cce_events --params '{"region":"cn-north-4","cluster_id":"<cluster-id>"}'
# Get CPU/memory metrics for a specific Pod
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_get_cce_pod_metrics --params '{"region":"cn-north-4","cluster_id":"<cluster-id>","pod_name":"<pod-name>"}'
# Get top-N Pod metrics by resource usage
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_get_cce_pod_metrics_topN --params '{"region":"cn-north-4","cluster_id":"<cluster-id>"}'
# List PVCs for storage-related issues
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_get_cce_pvcs --params '{"region":"cn-north-4","cluster_id":"<cluster-id>"}'
# List PVs for storage-related issues
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_get_cce_pvs --params '{"region":"cn-north-4","cluster_id":"<cluster-id>"}'
# Diagnose node-level failures (scheduling, pressure)
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_node_diagnose --params '{"region":"cn-north-4","cluster_id":"<cluster-id>"}'
# Diagnose network-level failures (Service, Ingress connectivity)
skill action=exec: python3 scripts/huawei-cloud.py --action huawei_network_diagnose --params '{"region":"cn-north-4","cluster_id":"<cluster-id>"}'
| Parameter | Required | Description | Default |
|---|---|---|---|
| region | Yes | Huawei Cloud region, e.g., cn-north-4 | HW_REGION_NAME |
| cluster_id | Yes | CCE cluster ID | HW_CLUSTER_ID |
| namespace | Yes* | Kubernetes namespace | N/A |
| kind | Yes* | Workload type: Deployment, StatefulSet, or DaemonSet | N/A |
| name | Yes* | Workload name | N/A |
*Required only for huawei_workload_rollout_diagnose and huawei_get_workload_rollout_context.
| Parameter | Required | Description | Constraints |
|---|---|---|---|
| kind | Yes | Workload type | Deployment, StatefulSet, or DaemonSet only |
| name | Yes | Workload name | Must reference existing workload in the namespace |
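The parameter rules in the tables above can be checked before dispatching an action. The sketch below is illustrative, not the dispatcher's own validation: `validate_params` is a hypothetical name, and it assumes the Default column means that region and cluster_id fall back to the HW_REGION_NAME and HW_CLUSTER_ID environment variables when omitted.

```python
import os

# Actions for which namespace, kind, and name are required.
ROLLOUT_ACTIONS = {"huawei_workload_rollout_diagnose",
                   "huawei_get_workload_rollout_context"}
ALLOWED_KINDS = {"Deployment", "StatefulSet", "DaemonSet"}
# Assumption: the "Default" column names an environment-variable fallback.
ENV_DEFAULTS = {"region": "HW_REGION_NAME", "cluster_id": "HW_CLUSTER_ID"}


def validate_params(action, params, env=None):
    """Return a list of problems; an empty list means the params look complete."""
    env = os.environ if env is None else env
    required = ["region", "cluster_id"]
    if action in ROLLOUT_ACTIONS:
        required += ["namespace", "kind", "name"]

    problems = []
    for key in required:
        fallback = ENV_DEFAULTS.get(key)
        has_fallback = bool(fallback and env.get(fallback))
        if not params.get(key) and not has_fallback:
            problems.append(f"missing required parameter: {key}")

    kind = params.get("kind")
    if action in ROLLOUT_ACTIONS and kind and kind not in ALLOWED_KINDS:
        problems.append(f"unsupported kind: {kind}")
    return problems
```

This check cannot confirm that the named workload exists in the namespace; that constraint can only be verified against the cluster.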
Optional parameters (in --params JSON)
| Parameter | Description | Default |
|---|---|---|
| ak | Override AK (uses HW_ACCESS_KEY by default) | env var |
| sk | Override SK (uses HW_SECRET_KEY by default) | env var |
| project_id | Override project ID (auto-obtained via IAM when not set) | env var |
| label_selector | Pod label selector for filtering | None |
| hours | Metric query time range in hours | 1 |
| top_n | Number of top results for metrics | 10 |
| Region Name | Region ID |
|---|---|
| North China - Beijing 4 | cn-north-4 |
| North China - Beijing 1 | cn-north-1 |
| East China - Shanghai 1 | cn-east-3 |
| East China - Shanghai 2 | cn-east-2 |
| South China - Guangzhou | cn-south-1 |
| Southwest China - Guiyang 1 | cn-southwest-2 |
| Asia Pacific - Bangkok | ap-southeast-2 |
| Asia Pacific - Singapore | ap-southeast-3 |
| Asia Pacific - Hong Kong | ap-southeast-1 |
| Europe - Paris | eu-west-0 |
See Output Schema for detailed response format examples.
huawei_workload_rollout_diagnose
{
  "success": true,
  "action": "workload_rollout_diagnose",
  "target": {
    "namespace": "default",
    "kind": "Deployment",
    "name": "api"
  },
  "selector": {
    "value": "app=api",
    "source": "matchLabels"
  },
  "summary": {
    "status": "control_plane_not_observed | new_version_not_created | rollout_blocked | replicas_unavailable | probe_failure | healthy",
    "headline": "human-readable diagnosis; may note when old-version replicas remain available",
    "expected_replicas": 3,
    "ready_replicas": 1,
    "available_replicas": 1,
    "top_cause": "ProbeFailure | ContainerCommandNotFound | CrashLoopOrAppExit | ..."
  },
  "generation_check": {
    "generation": 5,
    "observed_generation": 5,
    "observed": true
  },
  "workload": {
    "kind": "Deployment",
    "uid": "workload-uid",
    "desired_replicas": 3,
    "updated_replicas": 3,
    "ready_replicas": 1,
    "available_replicas": 1,
    "conditions": []
  },
  "version": {
    "strategy": "DeploymentReplicaSet",
    "new_rs": {},
    "old_rs": []
  },
  "funnel": [
    {"layer": "workload_current", "expected": 3, "actual": 3, "status": "pass"},
    {"layer": "new_pods_ready", "expected": 3, "actual": 1, "status": "fail"}
  ],
  "events": {
    "filtered_count": 5,
    "timeline": [],
    "filter": {
      "uid_count": 6,
      "before_count": 40,
      "after_count": 5,
      "events_without_involved_uid": 0
    }
  },
  "pod_diagnosis": {
    "diagnosed_pods": 1,
    "pods": []
  },
  "top_causes": [
    {
      "rank": 1,
      "type": "ProbeFailure",
      "title": "New version Pods are Running but probe checks fail or Pods are not Ready",
      "confidence": 0.88,
      "evidence": [],
      "recommendation": []
    }
  ],
  "handoff": [
    {
      "skill": "huawei-cloud-cce-pod-failure-diagnoser",
      "reason": "Probe failure requires Pod logs and health check configuration analysis"
    }
  ],
  "warnings": []
}
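A sketch of how a caller might read the response above: find the first funnel layer that fails, take the top-ranked cause, and collect the handoff targets. The field names come from the example response; the helper names `first_failing_layer` and `summarize` are hypothetical, and the real response may carry fields this sketch ignores.

```python
def first_failing_layer(response):
    """Return the first funnel entry whose status is "fail", or None."""
    for layer in response.get("funnel", []):
        if layer.get("status") == "fail":
            return layer
    return None


def summarize(response):
    """Condense a rollout diagnosis response into a few decision fields."""
    summary = response.get("summary", {})
    failing = first_failing_layer(response)
    # Sort by rank so the result does not depend on list order.
    causes = sorted(response.get("top_causes", []),
                    key=lambda cause: cause.get("rank", 0))
    return {
        "status": summary.get("status"),
        "first_failing_layer": failing["layer"] if failing else None,
        "missing_replicas": failing["expected"] - failing["actual"] if failing else 0,
        "top_cause": causes[0]["type"] if causes else summary.get("top_cause"),
        "handoff_skills": [item["skill"] for item in response.get("handoff", [])],
    }
```

Applied to the example response, this reports new_pods_ready as the first failing layer with two replicas missing and ProbeFailure as the top cause.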
huawei_get_workload_rollout_context
{
  "success": true,
  "action": "get_workload_rollout_context",
  "workload": {},
  "replicasets": [],
  "pods": [],
  "events": [],
  "event_filter": {},
  "warnings": []
}
| Status | Description |
|---|---|
| healthy | All replicas ready and available; rollout complete |
| control_plane_not_observed | observedGeneration < generation; controller lagging |
| new_version_not_created | New ReplicaSet has zero current replicas or no owned Pods |
| rollout_blocked | Rollout funnel layer fails; replicas below expected |
| replicas_unavailable | Updated/ready/available replicas below desired count |
| probe_failure | Pod Running but not Ready; Unhealthy probe events present |
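Three of the statuses above can be derived from workload counters alone. The sketch below covers only those three and checks generation lag first, so a lagging controller is not misread as a replica problem. It is not the skill's actual classifier: `coarse_status` is a hypothetical name, the precedence is an assumption, and the remaining statuses need ReplicaSet, funnel, and event data that this function does not see.

```python
def coarse_status(generation, observed_generation,
                  desired, updated, ready, available):
    """Classify a workload using only its generation and replica counters."""
    # Controller has not yet observed the latest spec.
    if observed_generation < generation:
        return "control_plane_not_observed"
    # Any of updated/ready/available below the desired count.
    if min(updated, ready, available) < desired:
        return "replicas_unavailable"
    return "healthy"
```

With the values from the example response (generation 5 observed at 5, 3 desired, 3 updated, 1 ready, 1 available), this returns replicas_unavailable; the skill refines that into probe_failure only after inspecting Pods and events.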
| Top Cause Type | Description |
|---|---|
| ProbeFailure | Startup/liveness/readiness probe failing |
| ContainerCommandNotFound | Executable not found in container $PATH |
| CrashLoopOrAppExit | Container crashing or exiting unexpectedly |
| ImagePullBackOff | Container image cannot be pulled |
| OOMKilled | Container killed by OOM |
| QuotaOrAdmissionRejected | ReplicaSet creation blocked by quota/admission/webhook |
| SchedulingFailure | Pod Pending due to FailedScheduling |
| StorageMountFailure | Pod FailedMount/FailedAttachVolume |
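As an illustration of how one cause type might be recognized, the sketch below flags FailedCreate events whose message mentions quota, LimitRange, admission, webhook, or forbidden, which this document associates with QuotaOrAdmissionRejected. The function name is hypothetical, events are assumed to be dicts with the standard Kubernetes Event fields reason and message, and plain keyword matching is a simplification of whatever the skill actually does.

```python
# Keywords this document lists for blocked ReplicaSet creation.
BLOCK_KEYWORDS = ("quota", "limitrange", "admission", "webhook", "forbidden")


def is_replicaset_create_blocked(event):
    """Return True for a FailedCreate event that mentions a blocking keyword."""
    if event.get("reason") != "FailedCreate":
        return False
    message = (event.get("message") or "").lower()
    return any(keyword in message for keyword in BLOCK_KEYWORDS)
```

Matching is case-insensitive so that "LimitRange" and "limitrange" are treated alike.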
See Verification Method for step-by-step verification.
- Run huawei_workload_rollout_diagnose before collecting raw context; it provides the rollout funnel and ranked Top causes
- For abnormal Pods, use huawei_pod_failure_diagnose and huawei_get_pod_logs for deeper analysis
- Only use events whose involvedObject.uid belongs to Workload, ReplicaSet, or Pod objects; do not treat all namespace Warning events as relevant
- If observedGeneration < generation, diagnose control-plane pressure before looking at Pods
- For scheduling or node pressure issues, hand off to huawei-cloud-cce-node-failure-diagnoser; for Service/Ingress/ELB issues, hand off to huawei-cloud-cce-network-failure-diagnoser
- Hand off all remediation to huawei-cloud-cce-auto-remediation-runner
| Document | Description |
|---|---|
| Workflow | Evidence collection order, failure rules, and diagnosis flow |
| Output Schema | Response format specification for primary and context actions |
| Risk Rules | Allowed/not-allowed operations and handoff mapping |
| Verification Method | Step-by-step verification procedures |
| Huawei Cloud CCE Documentation | Official CCE documentation |
| Huawei Cloud Python SDK Documentation | CCE API reference |
- Never expose HW_ACCESS_KEY, HW_SECRET_KEY, or HW_SECURITY_TOKEN values in logs, responses, or error messages
- Execute all scripts through skill action=exec; do not run them directly in a shell
- Mutating actions such as huawei_scale_cce_workload, huawei_resize_cce_workload, and huawei_delete_cce_workload are out of scope
- Never pass confirm=true: this skill never confirms remediation actions; it only outputs recommendations
- HW_SECURITY_TOKEN: set this environment variable when using temporary credentials
- Related skill names follow the huawei-cloud-cce- prefix convention (e.g., huawei-cloud-cce-pod-failure-diagnoser)
| Pitfall | Symptom | Quick Fix |
|---|---|---|
| Treating all namespace Warning events as evidence | False-positive diagnosis with irrelevant events | Only use UID-filtered events whose involvedObject.uid belongs to Workload/RS/Pod |
| Skipping generation check | Misdiagnosing Pod failures when controller is lagging | If observedGeneration < generation, diagnose control-plane pressure first |
| Confusing old-version replicas with health | Declaring workload "healthy" when old RS still active | The rollout funnel tracks new-version readiness specifically |
| Diagnosing Pod failures directly without funnel | Missing the workload-layer root cause | Start from workload layer; identify which funnel layer first fails |
| Attempting remediation from this skill | Unauthorized scale/delete/cordon/drain actions | All changes must be handed off to huawei-cloud-cce-auto-remediation-runner |
| Ignoring ReplicaSetCreateBlocked subtype | Missing quota/admission/webhook rejection evidence | Check FailedCreate events mentioning quota, LimitRange, admission, webhook, forbidden |
| Using unfiltered namespace events | Too many irrelevant Warning events in diagnosis | Apply UID-based event filtering per workflow step 5 |
| Not running environment check script first | SDK import errors or credential failures | Run check_env.sh / check_env.ps1 before any action |
| Hardcoding cluster_id instead of environment variable | Wrong cluster or repeated manual entry | Use HW_CLUSTER_ID env var or pass cluster_id per action |
| Confusing Deployment RS vs StatefulSet version model | Wrong version analysis for StatefulSet/DaemonSet | Deployment uses ReplicaSet revision; StatefulSet/DaemonSet use workload itself |
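The UID-based event filtering mentioned in the table above can be sketched as follows. The function name is hypothetical; events are assumed to be dicts in the Kubernetes JSON shape with an involvedObject.uid field, and the statistic names mirror the events.filter object in the example response rather than a confirmed implementation.

```python
def filter_events_by_uid(events, owned_uids):
    """Keep only events whose involvedObject.uid is in owned_uids.

    owned_uids is the set of UIDs for the workload, its ReplicaSets,
    and its Pods. Returns the kept events plus filter statistics.
    """
    owned = set(owned_uids)
    kept = []
    without_uid = 0
    for event in events:
        uid = (event.get("involvedObject") or {}).get("uid")
        if not uid:
            # Cannot be attributed to any object; counted but not kept.
            without_uid += 1
        elif uid in owned:
            kept.append(event)
    stats = {
        "uid_count": len(owned),
        "before_count": len(events),
        "after_count": len(kept),
        "events_without_involved_uid": without_uid,
    }
    return kept, stats
```

Counting events that lack a UID separately makes it visible when evidence was dropped because it could not be attributed, rather than because it belonged to another object.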