K8s Aiops


Use this skill whenever the user needs to operate a Kubernetes cluster — list/inspect pods, deployments, services, nodes, namespaces, and events, read pod logs, scale and rollout-restart deployments, delete pods/deployments, and cordon/uncordon nodes. Works with any kubeconfig-reachable cluster (standard Kubernetes, k3s, EKS, GKE, AKS). Always use this skill for "list k8s pods", "scale deployment", "kubernetes pod logs", "cordon node", "restart deployment", "k3s", or "kubectl"-style tasks when the context is explicitly Kubernetes / a cluster. Do NOT use when the target is not a Kubernetes cluster (hypervisor VM lifecycle, backup products, or cloud-provider consoles are out of scope). Preview — common Kubernetes operations with a built-in governance harness (audit, policy, token budget, undo, risk-tiers).

Install

openclaw skills install @zw008/k8s-aiops

k8s AIops (preview)

Disclaimer: This is a community-maintained open-source project and is not affiliated with, endorsed by, or sponsored by the Cloud Native Computing Foundation, the Kubernetes project, or k3s/Rancher. "Kubernetes" and "k3s" are trademarks of their respective owners. Source code is publicly auditable at github.com/AIops-tools/K8s-AIops under the MIT license.

Governed Kubernetes operations — 15 MCP tools, every one wrapped with the bundled @governed_tool harness: a local unified audit log under ~/.k8s-aiops/, policy engine, token/runaway budget guard, undo-token recording, and graduated-autonomy risk tiers. Works with any kubeconfig-reachable cluster (standard Kubernetes, k3s, EKS, GKE, AKS).

Standalone: the governance harness is bundled in the package (k8s_aiops.governance) — k8s-aiops has no external skill-family dependency. Preview: common operations, not yet exhaustive.

What This Skill Does

Category	Tools	Count	Read or Write
Pods	list, get, logs, delete	4	3 read / 1 write
Deployments	list, get, scale, rollout restart, delete	5	2 read / 3 write
Services	list	1	1 read
Nodes	list, cordon, uncordon	3	1 read / 2 write
Namespaces	list	1	1 read
Events	list	1	1 read

Quick Install

uv tool install k8s-aiops
k8s-aiops doctor          # uses your current kube-context out of the box

When to Use This Skill

List/inspect pods, deployments, services, nodes, namespaces and recent events
Read a pod's recent log lines to diagnose a crash loop
Scale a deployment up/down, or trigger a rolling restart
Delete a stuck pod (its controller, if it has one, recreates it) or a deployment
Cordon a node before maintenance, then uncordon it after

Do NOT use when the target is not a Kubernetes cluster (hypervisor VM lifecycle, backup products, or cloud-provider consoles are out of scope for this skill).

Related Skills — Skill Routing

If the user wants…	Use
Kubernetes pods / deployments / nodes	k8s-aiops (this skill)
Hypervisor VM lifecycle (power, snapshot, migrate)	a hypervisor ops skill
Backup & restore	a backup ops skill

Common Workflows

Diagnose a crash-looping pod and restart its deployment

k8s-aiops pod list -n prod → find the pod with high restarts / non-Running phase
k8s-aiops pod logs <pod> -n prod --tail 200 → read the recent logs for the crash cause
k8s-aiops events -n prod → check for FailedScheduling / image-pull events
k8s-aiops deployment restart <deploy> -n prod → roll the deployment after fixing the cause
Failure branch: if logs/events show an RBAC 403, the kube context lacks the verb — run kubectl auth can-i get pods -n prod and switch to a context with adequate RBAC; the skill never retries a denied auth.
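
The first step of this workflow, spotting pods with a high restart count or a non-Running phase, can be sketched in Python against the JSON that kubectl get pods -n prod -o json emits (a v1 PodList). This is an illustration only: it does not reflect the output format of k8s-aiops pod list, which is not documented here, and the function name suspicious_pods and the threshold of 5 restarts are assumptions.

```python
def suspicious_pods(pod_list: dict, restart_threshold: int = 5) -> list:
    """Flag pods that are not Running/Succeeded or whose containers restart often.

    pod_list is the parsed output of `kubectl get pods -o json`.
    restart_threshold is an arbitrary illustrative default.
    """
    flagged = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        # containerStatuses can be absent while a pod is still Pending,
        # so default to an empty list before summing restartCount.
        restarts = sum(
            cs.get("restartCount", 0)
            for cs in status.get("containerStatuses", [])
        )
        if phase not in ("Running", "Succeeded") or restarts >= restart_threshold:
            flagged.append(
                {"name": pod["metadata"]["name"], "phase": phase, "restarts": restarts}
            )
    # Highest restart count first, so the likeliest crash loop is at the top.
    return sorted(flagged, key=lambda p: p["restarts"], reverse=True)
```

A crash-looping pod often still reports phase Running between restarts, which is why the sketch checks the restart count as well as the phase.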

Cordon a node for maintenance, safely reversible

k8s-aiops node list → identify the node and confirm it is Ready/schedulable
k8s-aiops node cordon <node> --dry-run → preview, then k8s-aiops node cordon <node> (double confirm); this records an inverse uncordon_node undo descriptor. Cordoning only blocks new scheduling, so pods already on the node keep running.
After maintenance: k8s-aiops node uncordon <node> → re-enable scheduling
Failure branch: if doctor shows the cluster unreachable, fix the kubeconfig context (kubectl config get-contexts) before retrying — cordon is never issued against an unauthenticated session.

Usage Mode

Scenario	Recommended	Why
Local/small models (Ollama, Qwen)	CLI	fewer tokens than MCP
Cloud models (Claude, GPT)	Either	MCP gives structured JSON I/O
Automated pipelines	MCP	type-safe parameters, audited

MCP Tools (15 — 9 read, 6 write)

Category	Tools	R/W
Pods	`pod_list`, `pod_get`, `pod_logs`	Read
	`delete_pod`	Write
Deployments	`deployment_list`, `deployment_get`	Read
	`scale_deployment`, `rollout_restart_deployment`, `delete_deployment`	Write
Services	`service_list`	Read
Nodes	`node_list`	Read
	`cordon_node`, `uncordon_node`	Write
Namespaces	`namespace_list`	Read
Events	`event_list`	Read

Harness behavior per tool: write tools with a clean inverse pass an undo= lambda, so the harness records an inverse descriptor (with _undo_id) in the undo store. scale_deployment records a scale-back to its returned previous_replicas, and cordon_node and uncordon_node are mutual inverses. delete_pod, delete_deployment, and rollout_restart_deployment declare no undo; delete_deployment is tagged risk_level=high. All 15 tools are audit-logged under ~/.k8s-aiops/ and pass through the policy pre-check, the budget/runaway guard, and the graduated risk-tier gate. Avoid tight poll loops (such as re-listing pods every second); the runaway breaker is a backstop for them, not a substitute for sensible polling.
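
The undo recording described here can be pictured with a small decorator. This is a hypothetical sketch, not the code in k8s_aiops.governance: the names governed and UNDO_STORE, the shape of the descriptor, and the in-memory REPLICAS table are invented for illustration. Only the undo= lambda, the _undo_id key, previous_replicas, and risk_level=high come from the description.

```python
import functools
import uuid

UNDO_STORE = {}          # hypothetical in-memory stand-in for the undo store
REPLICAS = {"web": 3}    # fake cluster state so the sketch runs without a cluster


def governed(undo=None, risk_level="low"):
    """Hypothetical decorator: run the tool, then record an inverse if undo= is given."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(**kwargs):
            result = fn(**kwargs)
            if undo is not None:
                undo_id = uuid.uuid4().hex
                # undo() maps (call kwargs, result) to a description of the inverse call.
                UNDO_STORE[undo_id] = undo(kwargs, result)
                result = {**result, "_undo_id": undo_id}
            return result
        inner.risk_level = risk_level
        return inner
    return wrap


@governed(undo=lambda kw, res: {
    "tool": "scale_deployment",
    "args": {"name": kw["name"], "replicas": res["previous_replicas"]},
})
def scale_deployment(*, name, replicas):
    previous = REPLICAS[name]
    REPLICAS[name] = replicas
    return {"name": name, "replicas": replicas, "previous_replicas": previous}


@governed(risk_level="high")  # no undo=: a delete has no clean inverse
def delete_deployment(*, name):
    REPLICAS.pop(name)
    return {"name": name, "deleted": True}
```

In this sketch, scaling web from 3 to 5 stores a descriptor that scales it back to 3, while delete_deployment returns no _undo_id because nothing can restore the deleted object.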

CLI Quick Reference

k8s-aiops pod list [-n <ns>] [-t <target>]
k8s-aiops pod get <name> [-n <ns>]
k8s-aiops pod logs <name> [-n <ns>] [--tail 200] [-c <container>]
k8s-aiops pod delete <name> [-n <ns>] [--dry-run]        # double confirm
k8s-aiops deployment list [-n <ns>]
k8s-aiops deployment get <name> [-n <ns>]
k8s-aiops deployment scale <name> <replicas> [-n <ns>]
k8s-aiops deployment restart <name> [-n <ns>]
k8s-aiops deployment delete <name> [-n <ns>] [--dry-run]  # double confirm
k8s-aiops service list [-n <ns>]
k8s-aiops node list
k8s-aiops node cordon <name> [--dry-run]                  # double confirm
k8s-aiops node uncordon <name>
k8s-aiops namespace list
k8s-aiops events [-n <ns>]
k8s-aiops doctor
k8s-aiops mcp                                             # start MCP server (stdio)

See references/cli-reference.md for the full command list.

Troubleshooting

"Could not load kubeconfig … context not found"

The named context does not exist in your kubeconfig. Run kubectl config get-contexts and set the target's context: field to one of the listed names (or omit it to use current-context).

"Authentication/authorization failed (401/403)"

A 403 means the kube context lacks the RBAC verb for the resource. Check with kubectl auth can-i <verb> <resource> -n <ns> and switch to a context/ServiceAccount with adequate roles. A 401 means the credentials themselves were rejected; for EKS/GKE/AKS, confirm the exec-plugin (aws/gcloud/az CLI) is installed and logged in.
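
When several verbs are in question, the kubectl auth can-i check can be scripted. The sketch below is a minimal Python wrapper under stated assumptions: the helper names can_i_command and missing_permissions are hypothetical, kubectl is assumed to be on PATH, and the run parameter exists only so the function can be exercised without a cluster.

```python
import subprocess
from typing import Optional


def can_i_command(verb: str, resource: str, namespace: Optional[str] = None) -> list:
    """Build the argv for `kubectl auth can-i <verb> <resource> [-n <ns>]`."""
    cmd = ["kubectl", "auth", "can-i", verb, resource]
    if namespace:
        cmd += ["-n", namespace]
    return cmd


def missing_permissions(checks, namespace=None, run=subprocess.run) -> list:
    """Return the (verb, resource) pairs kubectl does not answer "yes" for."""
    denied = []
    for verb, resource in checks:
        proc = run(
            can_i_command(verb, resource, namespace),
            capture_output=True,
            text=True,
        )
        # kubectl prints "yes" or "no" on stdout; anything else counts as denied.
        if proc.stdout.strip() != "yes":
            denied.append((verb, resource))
    return denied
```

For example, missing_permissions([("list", "pods"), ("patch", "deployments")], "prod") would list whichever of those two verbs the current context lacks; the pairs shown are illustrative, not the exact verbs each k8s-aiops tool needs.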

"Resource not found (404)"

The pod/deployment/node name or namespace is wrong, or the object was deleted. List the parent collection first (pod list, deployment list, node list) to get a current name. Remember most commands default to the default namespace unless -n is given.

"Conflict (409)"

The object changed concurrently (or already exists). Re-read it and retry the write.
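
The re-read-and-retry advice can be sketched as a short loop. This is a generic illustration, not k8s-aiops code: ConflictError is a hypothetical stand-in for whatever error a client raises on HTTP 409, and read, mutate, and write are caller-supplied callables.

```python
import time


class ConflictError(Exception):
    """Hypothetical stand-in for the error a client raises on HTTP 409."""


def retry_on_conflict(read, mutate, write, attempts=3, delay=0.0):
    """Re-read the object, re-apply the change, and write again after a conflict."""
    for attempt in range(1, attempts + 1):
        obj = read()               # always start from the latest version
        updated = mutate(obj)      # re-apply the intended change to fresh state
        try:
            return write(updated)
        except ConflictError:
            if attempt == attempts:
                raise              # out of attempts: surface the conflict
            time.sleep(delay * attempt)  # simple linear backoff between tries
```

Retrying only helps when the 409 comes from a concurrent update. If the conflict is an "already exists" on create, the second attempt fails the same way, so check for the existing object instead.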

Logs are empty or truncated

pod logs returns only the last --tail lines (default 100); raise --tail to see more. For a multi-container pod, pass -c <container>; otherwise the API returns an error naming the available containers.

Audit & Safety

All operations are automatically audited via the bundled @governed_tool decorator (k8s_aiops.governance):

Every tool call logged to ~/.k8s-aiops/audit.db (local SQLite audit DB; relocate with K8S_AIOPS_HOME)
Policy rules enforced via ~/.k8s-aiops/rules.yaml (deny rules, maintenance windows, risk tiers)
Budget / runaway guard caps cumulative tool calls and wall-time, and trips on tight poll/retry loops
Undo store records inverse descriptors for reversible writes (scale → previous replicas; cordon ↔ uncordon)
Graduated-autonomy risk tiers gate write operations (require a recorded approver for the highest tiers)

The harness is bundled in the package — no external dependency, no manual setup. See references/setup-guide.md for security details.
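
To make the budget / runaway guard concrete, here is a minimal Python sketch of one way such a guard could work: a cap on total calls plus a sliding-window check for repeated identical calls. It is an assumption-laden illustration, not the bundled implementation. The class and error names, the default limits, and the injectable clock are all hypothetical, and the wall-time cap mentioned in the list is left out for brevity.

```python
import time
from collections import deque


class BudgetExceeded(RuntimeError):
    """Hypothetical error raised when a limit trips."""


class RunawayGuard:
    """Illustrative guard: total-call cap plus a sliding-window repeat check."""

    def __init__(self, max_calls=100, window_seconds=10.0,
                 max_calls_in_window=5, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.max_calls_in_window = max_calls_in_window
        self.clock = clock
        self.total = 0
        self.recent = deque()  # (timestamp, tool, args) of recent calls

    def check(self, tool, args=()):
        now = self.clock()
        self.total += 1
        if self.total > self.max_calls:
            raise BudgetExceeded(f"call budget of {self.max_calls} exhausted")
        # Drop entries that have aged out of the window.
        while self.recent and now - self.recent[0][0] > self.window_seconds:
            self.recent.popleft()
        self.recent.append((now, tool, args))
        # Count identical calls still inside the window (a tight poll loop).
        same = sum(1 for _, t, a in self.recent if (t, a) == (tool, args))
        if same > self.max_calls_in_window:
            raise BudgetExceeded(
                f"{tool} called {same} times within {self.window_seconds}s"
            )
```

With max_calls_in_window=3, a loop that calls check("pod_list", ("prod",)) once a second raises BudgetExceeded on the fourth call, while the same calls spaced several seconds apart pass.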

License

MIT — github.com/AIops-tools/K8s-AIops