# Safety Patterns for Agent Operations

This document defines the safety conventions that all KubeBlocks skills should follow. When writing or updating a skill, reference this document instead of duplicating the rules.

## 1. Dry-Run Before Apply

Every `kubectl apply` of a Cluster, OpsRequest, Backup, or any other KubeBlocks CR should be preceded by a server-side dry-run. This catches RBAC issues, webhook rejections, quota violations, and schema errors before any real change occurs.

```bash
kubectl apply -f <yaml-file> --dry-run=server
```

If the dry-run reports errors, fix the YAML before proceeding. Only after a clean dry-run should the agent run the actual `kubectl apply`.

For `kubectl delete`, dry-run is not applicable; use the pre-deletion checklist instead (see section 3).

## 2. Status Confirmation After Apply

After applying a change, confirm the operation reached its expected end-state. Use `kubectl get -w` to watch status transitions.

Each watch step should have three things clearly defined:

- **Success condition**: the specific `.status.phase` value that means "done"
- **Typical duration**: how long this normally takes
- **Timeout threshold**: how long to wait before suspecting a problem and running `kubectl describe`

### Status Condition Quick Reference

| Resource | Success Condition | Typical Duration | Timeout / Investigate After |
|---|---|---|---|
| Cluster (create/restore) | `.status.phase` = `Running` | 1-5 min | 10 min |
| Cluster (stop) | `.status.phase` = `Stopped` | 1-3 min | 5 min |
| Cluster (start/restart) | `.status.phase` = `Running` | 1-3 min | 5 min |
| OpsRequest (general) | `.status.phase` = `Succeed` | 1-5 min | 10 min |
| OpsRequest (reconfigure, static params) | `.status.phase` = `Succeed` (may pass through `Restarting`) | 2-8 min | 15 min |
| Backup (full) | `.status.phase` = `Completed` | Varies by data size | 30 min |
| Backup (continuous) | `.status.phase` = `Running` (stays running) | 1 min to start | 5 min |

### When Timeout Is Reached

If
the resource has not reached its success condition within the timeout threshold:

```bash
# For Cluster
kubectl describe cluster <cluster-name> -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep <cluster-name>

# For OpsRequest
kubectl describe ops <ops-name> -n <namespace>

# For Backup
kubectl describe backup <backup-name> -n <namespace>

# For Pods
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail=50
```

Report the findings to the user rather than continuing blindly.

## 3. Pre-Deletion Checklist

Before deleting any cluster, list what will be affected and get explicit confirmation from the user:

1. Show the cluster's `terminationPolicy`:
   ```bash
   kubectl get cluster <cluster-name> -n <namespace> -o jsonpath='{.spec.terminationPolicy}'
   ```
2. If `WipeOut`, warn that backups will also be deleted
3. List existing backups:
   ```bash
   kubectl get backup -n <namespace> -l app.kubernetes.io/instance=<cluster-name>
   ```
4. List running OpsRequests (must be cancelled first):
   ```bash
   kubectl get opsrequest -n <namespace> -l app.kubernetes.io/instance=<cluster-name>
   ```
5. Ask the user to confirm before proceeding

## 4. Production Cluster Protection

If a cluster has `terminationPolicy: DoNotTerminate`, treat it as a production cluster:

- **Any destructive operation** (delete, scale-in, stop) should trigger a warning and explicit confirmation
- **Changing terminationPolicy** requires the agent to explain the implications before patching
- Recommend creating a backup before any risky operation (switchover, upgrade, reconfigure)

## 5. Pre-Check Before Day-2 Operations

Before executing any Day-2 operation, verify the cluster is in a healthy state:

```bash
# Cluster must be Running (not Updating, Creating, or Stopped)
kubectl get cluster <cluster-name> -n <namespace> -o jsonpath='{.status.phase}'

# No pending OpsRequests (only one OpsRequest can run at a time)
kubectl get opsrequest -n <namespace> -l app.kubernetes.io/instance=<cluster-name> --field-selector=status.phase!=Succeed
```

If the cluster is not `Running` or has a pending OpsRequest, inform the user and wait rather than stacking operations.
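
The dry-run rule from section 1 and the pre-check from section 5 could be wrapped in small shell helpers, roughly as sketched below. The function names `safe_apply` and `precheck_cluster` are hypothetical and not part of KubeBlocks or kubectl. The sketch filters OpsRequest phases on the client side with `awk` instead of `--field-selector`, on the assumption that some API servers may not accept field selectors on custom-resource fields such as `status.phase`. Like the command in section 5, it treats every phase other than `Succeed` as pending.

```bash
#!/usr/bin/env bash
# Sketch only: safe_apply and precheck_cluster are hypothetical helper names.

# Run a server-side dry-run first; perform the real apply only if it is clean.
safe_apply() {
  local manifest="$1"
  if ! kubectl apply -f "$manifest" --dry-run=server; then
    echo "dry-run failed for ${manifest}; fix the YAML before applying" >&2
    return 1
  fi
  kubectl apply -f "$manifest"
}

# Print "ok" and return 0 only when the cluster is Running and every
# OpsRequest labelled with the cluster name has phase Succeed.
# Returns 1 when blocked, 2 when the cluster lookup itself fails.
precheck_cluster() {
  local cluster="$1" namespace="$2"
  local phase pending

  phase="$(kubectl get cluster "$cluster" -n "$namespace" \
    -o jsonpath='{.status.phase}')" || return 2

  if [ "$phase" != "Running" ]; then
    echo "blocked: cluster phase is '${phase}', expected 'Running'"
    return 1
  fi

  # Emit one "name phase" pair per line, then keep the names whose phase
  # is anything other than Succeed (an empty phase also counts as pending).
  pending="$(kubectl get opsrequest -n "$namespace" \
    -l "app.kubernetes.io/instance=${cluster}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' \
    | awk 'NF > 0 && $2 != "Succeed" { printf "%s%s", sep, $1; sep = " " }')"

  if [ -n "$pending" ]; then
    echo "blocked: pending OpsRequest(s): ${pending}"
    return 1
  fi

  echo "ok"
}
```

For example, `precheck_cluster mycluster demo && safe_apply ops.yaml` would submit the manifest only when both checks pass; `mycluster`, `demo`, and `ops.yaml` are placeholder values. When either helper reports a problem, the agent should relay the message to the user rather than retry automatically.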