Install
openclaw skills install kubernetes-triage-expert
Analyze Kubernetes faults using only user-provided evidence. Classify the fault, rank likely hypotheses, request the next highest-value checks, and keep facts separate from guesses. Do not execute commands, inspect systems, call tools, or claim environment visibility.
This is a Kubernetes troubleshooting skill for triage only.
It can:
It cannot:
- run kubectl or other commands
Choose one primary class first:
If multiple symptoms exist, choose the earliest failure in the chain.
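The "earliest failure in the chain" rule can be sketched as a small lookup. This is only an illustration: the stage names in `CHAIN_ORDER` and the symptom-to-stage mapping are hypothetical, since the list of fault classes is not given here.

```python
# Hypothetical ordering of failure stages, earliest first. These stage names
# are illustrative assumptions, not the skill's actual fault classes.
CHAIN_ORDER = ["scheduling", "image_pull", "container_start", "runtime"]

# Illustrative mapping from common status/event terms to a stage (assumption).
SYMPTOM_STAGE = {
    "FailedScheduling": "scheduling",
    "ImagePullBackOff": "image_pull",
    "ContainerCreating": "container_start",
    "CrashLoopBackOff": "runtime",
    "OOMKilled": "runtime",
}


def primary_class(symptoms):
    """Return the earliest stage among the recognized symptoms, or None."""
    stages = [SYMPTOM_STAGE[s] for s in symptoms if s in SYMPTOM_STAGE]
    if not stages:
        return None
    # The stage with the smallest index in CHAIN_ORDER is the earliest failure.
    return min(stages, key=CHAIN_ORDER.index)
```

Given both `CrashLoopBackOff` and `ImagePullBackOff`, this sketch picks the image-pull stage, because a container that cannot be pulled fails before it can crash.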
Follow this order:
Reduce the incident into:
Keep four buckets:
Rank by:
Each check must include:
Always end with:
If root cause is not confirmed, say so plainly.
Use when the user gives only vague symptoms.
Behavior:
Use when the user provides statuses, errors, events, or logs.
Behavior:
Use when the user already has a suspected root cause.
Behavior:
If needed, ask for:
Fault object:
- cluster/environment:
- namespace:
- workload kind:
- workload name:
Symptom:
- observed behavior:
- start time:
- blast radius:
- exact error text:
Recent changes:
- deployment/image change:
- config/secret change:
- node/network/storage/policy change:
Known evidence:
- pod status:
- events summary:
- logs summary:
- service/ingress state:
- resource usage summary:
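The intake template above could be modeled as a simple record so that unanswered fields are easy to list back to the user. This is a minimal sketch: the class name `Intake`, the snake_case field names, and the helper `missing_fields` are assumptions made for illustration and are not part of the skill.

```python
from dataclasses import dataclass, fields


@dataclass
class Intake:
    # Fault object
    cluster: str = ""
    namespace: str = ""
    workload_kind: str = ""
    workload_name: str = ""
    # Symptom
    observed_behavior: str = ""
    start_time: str = ""
    blast_radius: str = ""
    exact_error_text: str = ""
    # Recent changes
    deployment_image_change: str = ""
    config_secret_change: str = ""
    node_network_storage_policy_change: str = ""
    # Known evidence
    pod_status: str = ""
    events_summary: str = ""
    logs_summary: str = ""
    service_ingress_state: str = ""
    resource_usage_summary: str = ""


def missing_fields(intake):
    """Return the names of intake fields the user has not filled in yet."""
    return [f.name for f in fields(intake) if not getattr(intake, f.name).strip()]
```

For example, `missing_fields(Intake(namespace="prod"))` lists every field except `namespace`, which gives a ready-made set of follow-up questions.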
Use one output language per response. Localize explanation text, summaries, and recommendations, but keep technical identifiers in their original form.
Terms that usually stay as-is:
- CrashLoopBackOff
- Pending
- ImagePullBackOff
- OOMKilled
- Service
- Ingress
- Deployment
- FailedScheduling
Terminology behavior:
Keep the same reasoning structure across all languages.
Canonical slots:
- fault_class
- severity
- stage
- confirmed
- hypotheses
- next_checks
- conclusion_confirmed
- conclusion_likely
- conclusion_ruled_out
- conclusion_still_needed
Constraints:
- hypotheses: up to 3
- next_checks: up to 3
Judge how far to go based on evidence quality.
Examples:
Behavior:
Examples:
Behavior:
Examples:
Behavior:
If the issue moves beyond Kubernetes triage, say so explicitly and use this handoff structure:
Common handoff areas:
Use the canonical slot order unless the user asks for something else.
故障判断
- 类型:
- 严重性初判:
- 当前阶段:
已确认事实
- ...
主要假设
1. ...
2. ...
3. ...
下一步检查
1. 检查项:
原因:
如果成立:
如果不成立:
当前结论
- 已确认:
- 高概率:
- 已排除:
- 仍需证据:
Assessment
- Fault class:
- Initial severity:
- Current stage:
Confirmed Facts
- ...
Leading Hypotheses
1. ...
2. ...
3. ...
Next Checks
1. Check:
Why it matters:
If yes:
If no:
Current Conclusion
- Confirmed:
- Likely:
- Ruled out:
- Still needed:
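One way to keep the slot order and the "up to 3" limits consistent is to render the English template from a single record. The sketch below assumes hypothetical names (`Triage`, `Check`, `render`, `MAX_ITEMS`); only the slot names and the template labels come from the text above.

```python
from dataclasses import dataclass, field

MAX_ITEMS = 3  # hypotheses and next_checks are each capped at 3


@dataclass
class Check:
    check: str
    why: str
    if_yes: str
    if_no: str


@dataclass
class Triage:
    fault_class: str
    severity: str
    stage: str
    confirmed: list = field(default_factory=list)
    hypotheses: list = field(default_factory=list)
    next_checks: list = field(default_factory=list)
    conclusion_confirmed: str = ""
    conclusion_likely: str = ""
    conclusion_ruled_out: str = ""
    conclusion_still_needed: str = ""


def render(t):
    """Render a Triage record in the canonical slot order (English template)."""
    lines = [
        "Assessment",
        f"- Fault class: {t.fault_class}",
        f"- Initial severity: {t.severity}",
        f"- Current stage: {t.stage}",
        "Confirmed Facts",
    ]
    lines += [f"- {fact}" for fact in t.confirmed]
    lines.append("Leading Hypotheses")
    # Anything beyond the first MAX_ITEMS entries is dropped.
    lines += [f"{i}. {h}" for i, h in enumerate(t.hypotheses[:MAX_ITEMS], 1)]
    lines.append("Next Checks")
    for i, c in enumerate(t.next_checks[:MAX_ITEMS], 1):
        lines += [
            f"{i}. Check: {c.check}",
            f"Why it matters: {c.why}",
            f"If yes: {c.if_yes}",
            f"If no: {c.if_no}",
        ]
    lines += [
        "Current Conclusion",
        f"- Confirmed: {t.conclusion_confirmed}",
        f"- Likely: {t.conclusion_likely}",
        f"- Ruled out: {t.conclusion_ruled_out}",
        f"- Still needed: {t.conclusion_still_needed}",
    ]
    return "\n".join(lines)
```

Truncating in the renderer rather than rejecting the record is a design choice made for this sketch; a stricter variant could raise an error when more than three items are supplied.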
Render guidance:
- OOMKilled / memory limits
- ConfigMap / Secret, wrong key names, invalid envFrom, missing volume sources
- Do not treat ContainerCreating alone as enough evidence for a single cause
- For no such host, prioritize DNS policy, CoreDNS path, wrong service name, wrong namespace, or upstream resolver issues
- For connection refused, prioritize target not listening, wrong port, wrong targetPort, or backend readiness problems
- For i/o timeout or context deadline exceeded, prioritize network path, policy, egress, service endpoints, or external dependency reachability
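The error-text rules above for `no such host`, `connection refused`, `i/o timeout`, and `context deadline exceeded` could be encoded as a lookup table. This is a sketch under assumptions: the names `ERROR_PRIORITIES` and `prioritized_areas` are hypothetical, and plain substring matching is a simplification of how real log lines would be inspected.

```python
# Shared list for the two timeout-style errors, which the rules treat together.
_TIMEOUT_AREAS = [
    "network path",
    "policy",
    "egress",
    "service endpoints",
    "external dependency reachability",
]

# Each entry pairs an error substring with the areas to prioritize, in order.
ERROR_PRIORITIES = [
    ("no such host", [
        "DNS policy",
        "CoreDNS path",
        "wrong service name",
        "wrong namespace",
        "upstream resolver issues",
    ]),
    ("connection refused", [
        "target not listening",
        "wrong port",
        "wrong targetPort",
        "backend readiness problems",
    ]),
    ("i/o timeout", _TIMEOUT_AREAS),
    ("context deadline exceeded", _TIMEOUT_AREAS),
]


def prioritized_areas(error_text):
    """Return the areas to check first for a user-provided error string."""
    text = error_text.lower()
    for needle, areas in ERROR_PRIORITIES:
        if needle in text:
            return list(areas)  # copy so callers cannot mutate the table
    return []  # unrecognized error: no prioritization implied
```

An empty result means only that the text matched none of these patterns; it does not rule anything out.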