cluster-upgrade-planner
v1.0.0
Plan Kubernetes cluster upgrades with API deprecation checks, addon compatibility verification, and rollback-safe runbooks
Cluster Upgrade Planner
Systematically plan Kubernetes cluster upgrades by analyzing the current cluster state, detecting deprecated API usage, verifying addon and workload compatibility with the target version, and producing a step-by-step upgrade runbook with rollback procedures. This skill prevents upgrade failures caused by unnoticed deprecations, incompatible controllers, or workload disruption.
Use when: "plan cluster upgrade", "upgrade kubernetes", "check k8s deprecations", "upgrade readiness", "pre-upgrade check", "upgrade runbook"
Commands
1. preflight --- Check compatibility before upgrading
Gather cluster facts and compare them against the target Kubernetes version to surface blockers.
Step 1 -- Identify current state
# Current server and client versions
kubectl version -o yaml 2>/dev/null || kubectl version --short
# Node versions and status
kubectl get nodes -o wide
# Control-plane component versions (kubeadm clusters)
kubectl -n kube-system get pods -l tier=control-plane -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
Record CURRENT_VERSION (e.g. 1.28) and ask the user for TARGET_VERSION if not provided.
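Before going further it is worth validating the version pair itself: Kubernetes does not support skipping minor versions, so a multi-minor jump must be decomposed into sequential hops. A minimal sketch of that check (the bare `major.minor` string format is an assumption about how CURRENT_VERSION and TARGET_VERSION were recorded):

```python
def check_upgrade_path(current: str, target: str) -> list:
    """Return the ordered list of minor-version hops needed to reach target.

    Kubernetes does not support skipping minor versions, so 1.27 -> 1.30
    means three sequential upgrades: 1.28, 1.29, 1.30.
    """
    cur_major, cur_minor = (int(x) for x in current.split("."))
    tgt_major, tgt_minor = (int(x) for x in target.split("."))
    if (tgt_major, tgt_minor) <= (cur_major, cur_minor):
        raise ValueError(f"target {target} is not newer than current {current}")
    if tgt_major != cur_major:
        raise ValueError("major-version jumps are out of scope here")
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]

print(check_upgrade_path("1.27", "1.30"))  # ['1.28', '1.29', '1.30']
```

If the list has more than one entry, run the full preflight and plan cycle once per hop rather than once for the whole jump.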
Step 2 -- Detect deprecated and removed APIs
# Scan all API objects in the cluster for deprecated apiVersions
# Caveat: the API server returns objects at the requested (preferred) version,
# so this only surfaces resources whose preferred version is still alpha/beta.
# For true usage data, check the apiserver_requested_deprecated_apis metric
# or a dedicated scanner such as kubent or pluto.
for api in $(kubectl api-resources --verbs=list -o name); do
kubectl get "$api" --all-namespaces -o jsonpath='{range .items[*]}{.apiVersion}{"\t"}{.kind}{"\t"}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' 2>/dev/null
done | sort -u > /tmp/cluster-api-usage.txt
# Cross-reference against known removals for the target version
# Key removals by version:
# 1.25: policy/v1beta1 (PodSecurityPolicy, PodDisruptionBudget), batch/v1beta1, autoscaling/v2beta1 removed
# 1.26: flowcontrol.apiserver.k8s.io/v1beta1, autoscaling/v2beta2 removed
# 1.27: storage.k8s.io/v1beta1 CSIStorageCapacity removed
# 1.29: flowcontrol.apiserver.k8s.io/v1beta2 removed
# 1.32: flowcontrol.apiserver.k8s.io/v1beta3 removed
# Check for problematic apiVersions in the usage dump
rg 'v1beta1|v1beta2|v1alpha1' /tmp/cluster-api-usage.txt || echo "No deprecated beta APIs found"
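The grep above only pattern-matches on beta/alpha suffixes; to turn the dump into actual blockers, cross-reference each line against the removal table for the target minor. A sketch with an abbreviated removal table and hypothetical sample lines in the tab-separated jsonpath format produced above:

```python
# Abbreviated removal table: apiVersion -> minor version in which it was removed
REMOVED_IN = {
    "policy/v1beta1": 25,
    "flowcontrol.apiserver.k8s.io/v1beta1": 26,
    "flowcontrol.apiserver.k8s.io/v1beta2": 29,
    "autoscaling/v2beta2": 26,
}

def blockers(dump_lines, target_minor):
    """Return (apiVersion, kind, name) tuples removed at or before target_minor."""
    hits = []
    for line in dump_lines:
        parts = line.strip().split("\t")
        if len(parts) != 3:
            continue  # skip malformed lines
        api, kind, name = parts
        if REMOVED_IN.get(api, 999) <= target_minor:
            hits.append((api, kind, name))
    return hits

# Hypothetical sample matching the jsonpath output format above
sample = [
    "apps/v1\tDeployment\tdefault/web",
    "policy/v1beta1\tPodSecurityPolicy\t/restricted",
    "flowcontrol.apiserver.k8s.io/v1beta2\tFlowSchema\t/exempt",
]
for api, kind, name in blockers(sample, target_minor=29):
    print(f"BLOCKER: {kind} {name} uses {api}")
```

Anything this flags goes straight into the Blockers section of the preflight report.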
Step 3 -- Check addon compatibility
# List all Helm releases and their chart versions
helm list -A -o json 2>/dev/null | python3 -c "
import json, sys
releases = json.load(sys.stdin)
for r in releases:
    print(f\"{r['namespace']}/{r['name']}\tChart: {r['chart']}\tApp: {r.get('app_version','?')}\tStatus: {r['status']}\")
"
# List non-Helm workloads in kube-system (operators, CNI, etc.)
kubectl -n kube-system get deployments,daemonsets -o custom-columns=NAME:.metadata.name,IMAGE:.spec.template.spec.containers[0].image
For each addon, verify the installed version supports TARGET_VERSION by checking the upstream compatibility matrix. Key addons to verify:
- CNI plugin (Calico, Cilium, Flannel)
- Ingress controller (nginx, Traefik)
- cert-manager
- metrics-server
- CoreDNS
- CSI drivers
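Most of the compatibility table can be filled from the image tags the commands above print. A small parser sketch (assumes the common `registry/repo:tag` convention; digest-pinned or untagged images fall back to `unknown` and need manual lookup):

```python
def image_version(image: str) -> tuple:
    """Split an image reference into (repo, tag).

    Digest-pinned ('@sha256:...') and untagged images yield tag 'unknown'.
    """
    if "@" in image:  # pinned by digest, no semantic tag to compare
        return image.split("@", 1)[0], "unknown"
    # rsplit on ':' so registry ports (registry:5000/...) are not mistaken for tags
    repo, sep, tag = image.rpartition(":")
    if not sep or "/" in tag:  # no tag present
        return image, "unknown"
    return repo, tag

for img in [
    "quay.io/calico/node:v3.27.0",
    "registry.k8s.io/coredns/coredns@sha256:deadbeef",
    "registry:5000/team/app",
]:
    print(image_version(img))
```

Feed the extracted tags into each addon's upstream compatibility matrix by hand; the matrices are not machine-readable in any standard format.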
Step 4 -- Assess workload disruption risk
# Pods without PodDisruptionBudgets
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.metadata.ownerReferences[0].kind}{"\n"}{end}' > /tmp/all-pods.txt
kubectl get pdb --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.spec.selector.matchLabels}{"\n"}{end}' > /tmp/all-pdbs.txt
echo "=== Namespaces with pods but no PDB ==="
awk '{print $1}' /tmp/all-pods.txt | sort -u > /tmp/ns-with-pods.txt
awk '{print $1}' /tmp/all-pdbs.txt | sort -u > /tmp/ns-with-pdbs.txt
comm -23 /tmp/ns-with-pods.txt /tmp/ns-with-pdbs.txt
# Single-replica deployments (high disruption risk)
kubectl get deployments --all-namespaces -o jsonpath='{range .items[?(@.spec.replicas==1)]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'
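jsonpath filter expressions behave inconsistently across kubectl versions, so the replica check is sometimes easier to run over `kubectl get deployments -A -o json`. A sketch over a hypothetical sample shaped like that output:

```python
def single_replica(deploy_list: dict) -> list:
    """Return 'namespace/name' for deployments with spec.replicas == 1."""
    out = []
    for item in deploy_list.get("items", []):
        if item.get("spec", {}).get("replicas") == 1:
            meta = item["metadata"]
            out.append(f"{meta['namespace']}/{meta['name']}")
    return out

# Hypothetical sample shaped like `kubectl get deployments -A -o json`
sample = {"items": [
    {"metadata": {"namespace": "default", "name": "web"}, "spec": {"replicas": 1}},
    {"metadata": {"namespace": "default", "name": "api"}, "spec": {"replicas": 3}},
]}
print(single_replica(sample))  # ['default/web']
```

Single-replica deployments will see downtime during node drains unless they can be temporarily scaled up, so list them in the Warnings section.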
Report template:
## Preflight Report: Upgrade from {CURRENT} to {TARGET}
### Blockers (must fix before upgrade)
- [ ] {list removed APIs still in use}
- [ ] {incompatible addons}
### Warnings (should fix, not blocking)
- [ ] {deprecated APIs that will be removed in next version}
- [ ] {single-replica deployments without PDB}
### Addon Compatibility
| Addon | Current Version | Target K8s Supported | Action |
|-------|----------------|----------------------|--------|
### Node Readiness
- Total nodes: {N}
- Nodes at current version: {N}
- Nodes with issues: {list}
2. plan --- Generate an upgrade runbook
Produce a step-by-step, copy-pasteable upgrade plan based on preflight findings.
Step 1 -- Determine upgrade strategy
Decide based on cluster type:
- kubeadm: sequential control-plane then worker upgrade
- EKS/GKE/AKS: managed control-plane upgrade, then node group rolling update
- k3s/RKE2: binary replacement strategy
# Detect cluster type
PROVIDER="unknown"
kubectl get nodes -o jsonpath='{.items[0].spec.providerID}' 2>/dev/null | grep -qi 'aws' && PROVIDER="eks"
kubectl get nodes -o jsonpath='{.items[0].spec.providerID}' 2>/dev/null | grep -qi 'gce' && PROVIDER="gke"
kubectl get nodes -o jsonpath='{.items[0].spec.providerID}' 2>/dev/null | grep -qi 'azure' && PROVIDER="aks"
# Fallback heuristic: no cloud instance-type label suggests self-managed (kubeadm);
# note this check runs last and overrides the cloud guesses above if the label is missing
kubectl get nodes -o jsonpath='{.items[0].metadata.labels}' 2>/dev/null | grep -q 'node.kubernetes.io/instance-type' || PROVIDER="kubeadm"
echo "Detected provider: $PROVIDER"
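The grep chain above is order-sensitive (the label fallback can overwrite an earlier cloud match), so the mapping is worth expressing as a single testable function. A sketch (the providerID prefixes are the commonly seen schemes and should be verified against your actual nodes):

```python
def detect_provider(provider_id: str) -> str:
    """Map a node's .spec.providerID to a managed-platform guess."""
    prefixes = {"aws://": "eks", "gce://": "gke", "azure://": "aks"}
    for prefix, name in prefixes.items():
        if provider_id.startswith(prefix):
            return name
    return "kubeadm"  # no recognized cloud providerID: assume self-managed

print(detect_provider("aws:///us-east-1a/i-0abc123"))  # eks
```

Whatever the heuristic says, confirm the cluster type with the user before generating the runbook; the three strategies below are not interchangeable.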
Step 2 -- Generate the runbook
For kubeadm clusters, the runbook follows this structure:
## Upgrade Runbook: {CURRENT} -> {TARGET}
Generated: $(date -u +%Y-%m-%dT%H:%M:%SZ)
### Pre-upgrade checklist
1. Confirm etcd backup exists:
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-pre-upgrade-$(date +%s).db
2. Back up all cluster manifests:
kubectl get all -A -o yaml > /backup/cluster-state-pre-upgrade.yaml
3. Verify all preflight blockers resolved
### Phase 1: Upgrade first control-plane node
# If using pkgs.k8s.io, point the apt repo at the target minor's stream first
sudo apt-mark unhold kubeadm
sudo apt-get update && sudo apt-get install -y kubeadm={TARGET_PATCH}
sudo apt-mark hold kubeadm
sudo kubeadm upgrade plan
sudo kubeadm upgrade apply v{TARGET}
sudo apt-mark unhold kubelet kubectl
sudo apt-get install -y kubelet={TARGET_PATCH} kubectl={TARGET_PATCH}
sudo apt-mark hold kubelet kubectl
sudo systemctl daemon-reload && sudo systemctl restart kubelet
### Phase 2: Upgrade remaining control-plane nodes
(repeat for each additional CP node, using `kubeadm upgrade node`)
### Phase 3: Upgrade worker nodes (one at a time)
kubectl drain {NODE} --ignore-daemonsets --delete-emptydir-data
# On the node:
sudo apt-get install -y kubeadm={TARGET_PATCH} kubelet={TARGET_PATCH}
sudo kubeadm upgrade node
sudo systemctl daemon-reload && sudo systemctl restart kubelet
# From control plane:
kubectl uncordon {NODE}
# Verify node healthy before proceeding to next
### Phase 4: Post-upgrade validation
kubectl get nodes # all nodes at new version
kubectl get pods -A | grep -v Running | grep -v Completed
kubectl run upgrade-test --image=busybox --rm -it -- echo "cluster healthy"
### Rollback procedure
If control-plane upgrade fails:
1. Restore etcd: etcdctl snapshot restore /backup/etcd-pre-upgrade-*.db (stop the API server first; restore writes a fresh data directory that etcd must then be pointed at)
2. Downgrade kubeadm/kubelet packages to {CURRENT} version
3. Restart kubelet
If worker node fails after drain:
1. kubectl uncordon {NODE}
2. Downgrade kubelet on the node
3. Restart kubelet
Step 3 -- Estimate timing and risk
# Count resources to estimate upgrade duration
NODES=$(kubectl get nodes --no-headers | wc -l)
CP_NODES=$(kubectl get nodes -l node-role.kubernetes.io/control-plane --no-headers 2>/dev/null | wc -l)
WORKER_NODES=$((NODES - CP_NODES))
echo "Estimated time: ~$((CP_NODES * 10 + WORKER_NODES * 8)) minutes"
echo "Control plane nodes: $CP_NODES (~10 min each)"
echo "Worker nodes: $WORKER_NODES (~8 min each with drain/uncordon)"
3. deprecations --- Find deprecated APIs in local manifests
Scan local YAML/Helm files for deprecated apiVersions, not just the live cluster.
Step 1 -- Scan manifest files
# Find all YAML manifests in the repository
TARGET_DIR="${1:-.}"
# Known deprecated/removed apiVersions mapped to removal version
# Pass TARGET_DIR into the script via argv; requires PyYAML (pip install pyyaml)
python3 - "$TARGET_DIR" << 'PYEOF'
import os, sys, yaml

DEPRECATIONS = {
    "extensions/v1beta1": {"removed": "1.22", "replacement": "apps/v1 or networking.k8s.io/v1"},
    "apps/v1beta1": {"removed": "1.16", "replacement": "apps/v1"},
    "apps/v1beta2": {"removed": "1.16", "replacement": "apps/v1"},
    "networking.k8s.io/v1beta1": {"removed": "1.22", "replacement": "networking.k8s.io/v1"},
    "policy/v1beta1": {"removed": "1.25", "replacement": "policy/v1 (PodSecurityPolicy has no direct replacement)"},
    "rbac.authorization.k8s.io/v1beta1": {"removed": "1.22", "replacement": "rbac.authorization.k8s.io/v1"},
    "admissionregistration.k8s.io/v1beta1": {"removed": "1.22", "replacement": "admissionregistration.k8s.io/v1"},
    "apiextensions.k8s.io/v1beta1": {"removed": "1.22", "replacement": "apiextensions.k8s.io/v1"},
    "storage.k8s.io/v1beta1": {"removed": "1.27", "replacement": "storage.k8s.io/v1"},
    "flowcontrol.apiserver.k8s.io/v1beta1": {"removed": "1.26", "replacement": "flowcontrol.apiserver.k8s.io/v1"},
    "flowcontrol.apiserver.k8s.io/v1beta2": {"removed": "1.29", "replacement": "flowcontrol.apiserver.k8s.io/v1"},
    "autoscaling/v2beta1": {"removed": "1.25", "replacement": "autoscaling/v2"},
    "autoscaling/v2beta2": {"removed": "1.26", "replacement": "autoscaling/v2"},
    "batch/v1beta1": {"removed": "1.25", "replacement": "batch/v1"},
}

target_dir = sys.argv[1] if len(sys.argv) > 1 else "."
findings = []
for root, dirs, files in os.walk(target_dir):
    dirs[:] = [d for d in dirs if d not in (".git", "node_modules", "vendor")]
    for fname in files:
        if not fname.endswith((".yaml", ".yml")):
            continue
        fpath = os.path.join(root, fname)
        try:
            with open(fpath) as f:
                content = f.read()
            for doc in yaml.safe_load_all(content):
                if not isinstance(doc, dict):
                    continue
                api = doc.get("apiVersion", "")
                kind = doc.get("kind", "")
                name = doc.get("metadata", {}).get("name", "unknown")
                if api in DEPRECATIONS:
                    d = DEPRECATIONS[api]
                    findings.append({
                        "file": fpath,
                        "apiVersion": api,
                        "kind": kind,
                        "name": name,
                        "removed_in": d["removed"],
                        "replacement": d["replacement"],
                    })
        except Exception:
            pass  # unparseable YAML (e.g. raw Helm templates) is skipped; see Step 2

if not findings:
    print("No deprecated APIs found in manifest files.")
else:
    print(f"Found {len(findings)} deprecated API usage(s):\n")
    for f in findings:
        print(f"  {f['file']}")
        print(f"    {f['kind']}/{f['name']}: {f['apiVersion']} -> removed in {f['removed_in']}")
        print(f"      Replace with: {f['replacement']}")
        print()
PYEOF
Step 2 -- Scan Helm templates
# Render Helm charts and scan the output
for chart in $(find "${TARGET_DIR}" -name Chart.yaml -exec dirname {} \;); do
  echo "=== Scanning Helm chart: $chart ==="
  # Capture helm output separately so a render failure is actually reported
  # (piping directly would mask helm's exit status behind python's)
  if ! rendered=$(helm template test-scan "$chart" 2>/dev/null); then
    echo "  (helm template failed -- check values)"
    continue
  fi
  printf '%s\n' "$rendered" | python3 -c "
import sys, yaml
for doc in yaml.safe_load_all(sys.stdin):
    if not isinstance(doc, dict):
        continue
    api = doc.get('apiVersion', '')
    if 'beta' in api:
        print(f\"  WARNING: {doc.get('kind','?')}/{doc.get('metadata',{}).get('name','?')} uses {api}\")
"
done
Step 3 -- Generate migration patches
# For each finding, suggest a sed command to fix the apiVersion
# Example output:
# sed -i 's|apiVersion: policy/v1beta1|apiVersion: policy/v1|' path/to/file.yaml
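This step can be generated mechanically from the scanner's findings. A sketch that emits one sed command per unambiguous finding (the findings shape mirrors the scanner above; a YAML-aware tool like yq is safer when apiVersion-like strings can appear inside comments or string values):

```python
def sed_commands(findings):
    """Emit sed commands rewriting deprecated apiVersions in place.

    Only handles 1:1 replacements; findings whose suggested replacement
    contains alternatives or caveats are skipped for human review.
    """
    cmds, seen = [], set()
    for f in findings:
        if " " in f["replacement"]:  # e.g. "apps/v1 or networking.k8s.io/v1"
            continue                 # ambiguous -- needs a human decision
        key = (f["file"], f["apiVersion"])
        if key in seen:              # one command per file/apiVersion pair
            continue
        seen.add(key)
        cmds.append(
            f"sed -i 's|apiVersion: {f['apiVersion']}|apiVersion: {f['replacement']}|' {f['file']}"
        )
    return cmds

# Hypothetical findings in the scanner's output shape
findings = [
    {"file": "deploy/pdb.yaml", "apiVersion": "policy/v1beta1", "replacement": "policy/v1"},
    {"file": "deploy/ing.yaml", "apiVersion": "extensions/v1beta1",
     "replacement": "apps/v1 or networking.k8s.io/v1"},
]
for c in sed_commands(findings):
    print(c)
```

Changing the apiVersion alone is not always sufficient (e.g. networking.k8s.io/v1 Ingress also changed its schema), so treat the emitted commands as a starting point, not an auto-fix.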
Report template:
## API Deprecation Scan Report
### Summary
- Files scanned: {N}
- Deprecated APIs found: {N}
- Already removed in current version: {N} (CRITICAL)
- Will be removed in target version: {N} (MUST FIX)
- Deprecated but not yet removed: {N} (SHOULD FIX)
### Findings
| File | Kind/Name | Current API | Removed In | Replacement |
|------|-----------|-------------|------------|-------------|
### Auto-fix commands
{list of sed/yq commands to apply fixes}
