Install
openclaw skills install k8s-debug

Diagnose and fix Kubernetes pods, CrashLoopBackOff, Pending, DNS, networking, storage, and rollout failures with kubectl. A systematic toolkit for debugging Kubernetes clusters, workloads, networking, and storage with a deterministic, safety-first workflow.
Use this skill when requests resemble:
- "CrashLoopBackOff; help me find the root cause."
- "Pending and not scheduling."

Run from the skill directory (devops-skills-plugin/skills/k8s-debug) so relative script paths work as written.
- kubectl installed and configured.

Quick preflight:
kubectl config current-context
kubectl auth can-i get pods -A
kubectl auth can-i get events -A
kubectl get ns
- jq for more precise filtering in ./scripts/cluster_health.sh.
- A metrics API (for example, metrics-server) for kubectl top.
- In-pod network tooling (nslookup, getent, curl, wget, ip) for deep network tests.
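The jq dependency above enables more precise filtering than kubectl's built-in output. As an illustrative sketch of that kind of query (an assumption for illustration, not the script's exact filter):

# List pods that are neither Running nor Succeeded, with their phase
kubectl get pods -A -o json | jq -r '.items[] | select(.status.phase != "Running" and .status.phase != "Succeeded") | "\(.metadata.namespace)/\(.metadata.name): \(.status.phase)"'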
Fallback behavior:

- If kubectl top is unavailable, continue with kubectl describe and events.
Default mode is read-only diagnosis first. Only execute disruptive commands after confirming blast radius and rollback.
Commands requiring explicit confirmation:
- kubectl delete pod ... --force --grace-period=0
- kubectl drain ...
- kubectl rollout restart ...
- kubectl rollout undo ...
- kubectl debug ... --copy-to=...

Before disruptive actions:
# Snapshot current state for rollback and incident notes
kubectl get deploy,rs,pod,svc -n <namespace> -o wide
kubectl get pod <pod-name> -n <namespace> -o yaml > before-<pod-name>.yaml
kubectl get events -n <namespace> --sort-by='.lastTimestamp' > before-events.txt
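For rollbacks in particular, capturing revision history first makes a later kubectl rollout undo easier to verify and reverse; a minimal sketch:

# Record the revision history before any rollout undo
kubectl rollout history deployment/<name> -n <namespace>
# Inspect the pod template of a specific revision before rolling back to it
kubectl rollout history deployment/<name> -n <namespace> --revision=<revision>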
Load only the section needed for the observed symptom.
| Symptom / Need | Open | Start section |
|---|---|---|
| You need an end-to-end diagnosis path | ./references/troubleshooting_workflow.md | General Debugging Workflow |
| Pod state is Pending, CrashLoopBackOff, or ImagePullBackOff | ./references/troubleshooting_workflow.md | Pod Lifecycle Troubleshooting |
| Service reachability or DNS failure | ./references/troubleshooting_workflow.md | Network Troubleshooting Workflow |
| Node pressure or performance regression | ./references/troubleshooting_workflow.md | Resource and Performance Workflow |
| PVC / PV / storage class issues | ./references/troubleshooting_workflow.md | Storage Troubleshooting Workflow |
| Quick symptom-to-fix lookup | ./references/common_issues.md | matching issue heading |
| Post-mortem fix options for known issues | ./references/common_issues.md | Solutions sections |
| Script | Purpose | Required args | Optional args | Output | Fallback behavior |
|---|---|---|---|---|---|
| ./scripts/cluster_health.sh | Cluster-wide health snapshot (nodes, workloads, events, common failure states) | None | --strict, K8S_REQUEST_TIMEOUT env var | Sectioned report to stdout | Continues on check failures, tracks them in summary and exit code |
| ./scripts/network_debug.sh | Pod-centric network and DNS diagnostics | <pod-name> (<namespace> defaults to default) | --strict, --insecure, K8S_REQUEST_TIMEOUT env var | Sectioned report to stdout | Uses secure API probe by default; insecure TLS requires explicit --insecure |
| ./scripts/pod_diagnostics.py | Deep pod diagnostics (status, describe, YAML, events, per-container logs, node context) | <pod-name> | -n/--namespace, -o/--output | Sectioned report to stdout or file | Fails fast on missing access; skips optional metrics/log blocks with clear messages |
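A usage sketch combining the optional flags with the shared exit-code contract described below (assuming K8S_REQUEST_TIMEOUT accepts a kubectl-style duration such as 10s; check the script for the exact format):

# Strict run with a bounded per-request timeout (duration format is an assumption)
K8S_REQUEST_TIMEOUT=10s ./scripts/cluster_health.sh --strict > health.txt
case $? in
  0) echo "healthy: no check failures" ;;
  1) echo "check failures, or warnings under --strict" ;;
  2) echo "blocked preconditions: fix kubectl/context/access first" ;;
esac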
./scripts/cluster_health.sh and ./scripts/network_debug.sh share the same contract:
- 0: checks completed with no check failures (warnings allowed unless --strict is set).
- 1: one or more checks failed, or warnings occurred in --strict mode.
- 2: blocked preconditions (for example: missing kubectl, no active context, inaccessible namespace/pod).

Follow this systematic approach for any Kubernetes issue:
kubectl config current-context
kubectl get ns
kubectl auth can-i get pods -n <namespace>
If preflight fails, stop and fix access/context first.
Categorize the issue, then use the appropriate diagnostic script based on scope:
Use ./scripts/pod_diagnostics.py for comprehensive pod analysis:
python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace>
This script gathers pod status, describe output, the full pod YAML, recent events, per-container logs, and node context.
Output can be saved for analysis:
python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace> -o diagnostics.txt
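To pick targets before running the deep diagnostic, a quick field-selector filter lists pods that are not Running (a minimal sketch):

# List pods in the namespace that are not in the Running phase
kubectl get pods -n <namespace> --field-selector=status.phase!=Running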
Use ./scripts/cluster_health.sh for overall cluster diagnostics:
./scripts/cluster_health.sh > cluster-health-$(date +%Y%m%d-%H%M%S).txt
This script checks node status, workload health, recent events, and common failure states across the cluster.
Use ./scripts/network_debug.sh for connectivity issues:
./scripts/network_debug.sh <namespace> <pod-name>
# or force warning sensitivity / insecure TLS only when explicitly needed:
./scripts/network_debug.sh --strict <namespace> <pod-name>
./scripts/network_debug.sh --insecure <namespace> <pod-name>
This script analyzes the pod's network configuration, DNS resolution, and API server reachability.
Based on the identified issue, consult ./references/troubleshooting_workflow.md:
Refer to ./references/common_issues.md for symptom-specific fixes.
Run final verification:
kubectl get pods -n <namespace> -o wide
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
kubectl rollout status deployment/<name> -n <namespace>
The issue is resolved when user-visible behavior is healthy and no new critical warning events appear.
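To confirm nothing new is surfacing, Warning-type events can be watched directly after the fix; a minimal sketch:

# Watch only Warning events in the namespace after applying the fix
kubectl get events -n <namespace> --field-selector type=Warning -w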
Example: CrashLoopBackOff in the payments Namespace

python3 ./scripts/pod_diagnostics.py payments-api-7c97f95dfb-q9l7k -n payments -o payments-diagnostics.txt
kubectl logs payments-api-7c97f95dfb-q9l7k -n payments --previous --tail=100
kubectl get deploy payments-api -n payments -o yaml | grep -A 8 livenessProbe
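If the grep context window above proves brittle, a jsonpath query pulls the probe spec directly (a sketch; prints the livenessProbe of each container in the pod template):

kubectl get deploy payments-api -n payments -o jsonpath='{.spec.template.spec.containers[*].livenessProbe}'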
Then open ./references/common_issues.md and apply the CrashLoopBackOff solutions.
Example: Service Connectivity Failure in the checkout Namespace

./scripts/network_debug.sh checkout checkout-api-75f49c9d8f-z6qtm
kubectl get svc checkout-api -n checkout
kubectl get endpoints checkout-api -n checkout
kubectl get networkpolicies -n checkout
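If the target pod lacks DNS tooling, a throwaway pod can test resolution of the Service's cluster DNS name (a sketch; the busybox image tag is an assumption):

# One-shot DNS check from a disposable pod in the same namespace
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -n checkout -- nslookup checkout-api.checkout.svc.cluster.local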
Then follow Service Connectivity Workflow in ./references/troubleshooting_workflow.md.
# View pod status
kubectl get pods -n <namespace> -o wide
# Detailed pod information
kubectl describe pod <pod-name> -n <namespace>
# View logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous # Previous container
kubectl logs <pod-name> -n <namespace> -c <container> # Specific container
# Execute commands in pod
kubectl exec <pod-name> -n <namespace> -it -- /bin/sh
# Get pod YAML
kubectl get pod <pod-name> -n <namespace> -o yaml
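Restart counts are also quick to pull without reading the full YAML (a sketch):

# Per-container restart counts for a pod
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{range .status.containerStatuses[*]}{.name}: {.restartCount}{"\n"}{end}'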
# Check services
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>
# Check endpoints
kubectl get endpoints -n <namespace>
# Test DNS
kubectl exec <pod-name> -n <namespace> -- nslookup kubernetes.default
# View events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Node resources
kubectl top nodes
kubectl describe nodes
# Pod resources
kubectl top pods -n <namespace>
kubectl top pod <pod-name> -n <namespace> --containers
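When hunting a regression, sorting by consumption surfaces the heaviest pods first (a sketch):

# Heaviest pods first, across all namespaces
kubectl top pods -A --sort-by=memory
kubectl top pods -A --sort-by=cpu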
# Restart deployment
kubectl rollout restart deployment/<name> -n <namespace>
# Rollback deployment
kubectl rollout undo deployment/<name> -n <namespace>
# Force delete stuck pod
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0
# Drain node (maintenance)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Cordon node (prevent scheduling)
kubectl cordon <node-name>
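Before reaching for the force delete above, checking for finalizers often explains why a pod is stuck (a sketch; clearing finalizers should be a last resort):

# Inspect finalizers that may be holding a stuck pod
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.metadata.finalizers}'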
The troubleshooting session is complete when all of the following are true:

- Final verification shows healthy pods and no new critical warning events.
- The matching reference section (./references/troubleshooting_workflow.md or ./references/common_issues.md) is documented in notes.

Useful additional tools for Kubernetes debugging: