{"skill":{"slug":"afrexai-devops-engine","displayName":"DevOps Engine","summary":"Complete DevOps & Platform Engineering system. CI/CD pipelines, infrastructure as code, container orchestration, observability, incident response, and SRE pr...","description":"---\nname: afrexai-devops-engine\ndescription: Complete DevOps & Platform Engineering system. CI/CD pipelines, infrastructure as code, container orchestration, observability, incident response, and SRE practices — all platforms, all clouds.\nmetadata: {\"clawdbot\":{\"emoji\":\"🔧\",\"os\":[\"linux\",\"darwin\",\"win32\"]}}\n---\n\n# DevOps & Platform Engineering Engine\n\nComplete system for building, deploying, operating, and observing production software. Covers the entire DevOps lifecycle — not just CI/CD, not just one cloud.\n\n## Phase 1: Repository & Branch Strategy\n\n### Git Flow Decision Matrix\n\n| Team Size | Release Cadence | Strategy | Branches |\n|-----------|----------------|----------|----------|\n| 1-3 | Continuous | Trunk-based | main + short-lived feature/ |\n| 4-15 | Weekly/biweekly | GitHub Flow | main + feature/ + PR |\n| 15+ | Scheduled releases | Git Flow | main + develop + feature/ + release/ + hotfix/ |\n| Regulated | Audited releases | Git Flow + tags | Above + signed tags + audit trail |\n\n### Branch Protection Rules (Apply These)\n\n```yaml\n# branch-protection.yml — document your rules\nmain:\n  required_reviews: 2\n  dismiss_stale_reviews: true\n  require_codeowners: true\n  require_status_checks:\n    - ci/test\n    - ci/lint\n    - ci/security\n  require_linear_history: true  # No merge commits\n  restrict_pushes: true         # Only via PR\n  require_signed_commits: false # Enable for regulated\n\ndevelop:\n  required_reviews: 1\n  require_status_checks:\n    - ci/test\n```\n\n### Commit Convention\n\nFormat: `<type>(<scope>): <description>`\n\nTypes: `feat`, `fix`, `docs`, `style`, `refactor`, `perf`, `test`, `build`, `ci`, `chore`\n\nBreaking changes: `feat!: remove legacy 
API` or footer `BREAKING CHANGE: description`\n\nEnforce with commitlint + husky (Node) or pre-commit hooks.\n\n## Phase 2: CI/CD Pipeline Architecture\n\n### Pipeline Design Principles\n\n1. **Build once, deploy everywhere** — same artifact through dev→staging→prod\n2. **Fail fast** — cheapest checks first (lint→unit→integration→e2e)\n3. **Hermetic builds** — no external state, reproducible from commit SHA\n4. **Immutable artifacts** — never modify after build; tag with git SHA\n5. **Parallelise independent stages** — test/lint/security scan simultaneously\n\n### Universal Pipeline Template\n\n```yaml\n# pipeline-stages.yml — adapt to your CI system\nstages:\n  # Stage 1: Quality Gate (parallel, <2 min)\n  lint:\n    run: lint\n    parallel: true\n    timeout: 2m\n  typecheck:\n    run: tsc --noEmit\n    parallel: true\n    timeout: 2m\n  security_scan:\n    run: trivy, snyk, or semgrep\n    parallel: true\n    timeout: 3m\n\n  # Stage 2: Test (parallel by type, <10 min)\n  unit_tests:\n    run: test --unit\n    parallel: true\n    coverage_threshold: 80%\n    timeout: 5m\n  integration_tests:\n    run: test --integration\n    parallel: true\n    needs: [database_service]\n    timeout: 10m\n\n  # Stage 3: Build (<5 min)\n  build:\n    needs: [lint, typecheck, unit_tests]\n    outputs: [docker_image, release_artifact]\n    tag: \"${GIT_SHA}\"\n    cache: [node_modules, .next/cache, target/]\n\n  # Stage 4: Deploy Staging (auto)\n  deploy_staging:\n    needs: [build]\n    environment: staging\n    strategy: rolling\n    smoke_test: true\n    auto: true\n\n  # Stage 5: E2E on Staging (<15 min)\n  e2e_tests:\n    needs: [deploy_staging]\n    timeout: 15m\n    retry: 1\n    artifacts: [screenshots, videos]\n\n  # Stage 6: Deploy Production (manual gate or auto)\n  deploy_prod:\n    needs: [e2e_tests]\n    environment: production\n    strategy: canary  # or blue-green\n    approval: required  # manual gate\n    rollback_on_failure: true\n    monitoring_window: 
15m\n```\n\n### CI Platform Cheat Sheet\n\n| Feature | GitHub Actions | GitLab CI | CircleCI | Jenkins |\n|---------|---------------|-----------|----------|---------|\n| Config file | `.github/workflows/*.yml` | `.gitlab-ci.yml` | `.circleci/config.yml` | `Jenkinsfile` |\n| Parallelism | `jobs.<id>` (automatic) | `stages` + `parallel` | `workflows` | `parallel` step |\n| Caching | `actions/cache` | `cache:` key | `save_cache/restore_cache` | Stash/unstash |\n| Secrets | Settings → Secrets | Settings → CI/CD → Variables | Project Settings → Env | Credentials plugin |\n| Matrix builds | `strategy.matrix` | `parallel:matrix` | `matrix` in workflows | `matrix` in pipeline |\n| Self-hosted | `runs-on: self-hosted` | GitLab Runner | `resource_class` | Default |\n| OIDC/Keyless | `permissions: id-token: write` | `id_tokens:` | OIDC context | Plugin |\n\n### Caching Strategy\n\n```yaml\n# Cache key patterns (ordered by specificity)\ncache_keys:\n  # Exact match first\n  - \"deps-{{ runner.os }}-{{ hashFiles('**/lockfile') }}\"\n  # Partial match fallback\n  - \"deps-{{ runner.os }}-\"\n\n# What to cache by stack\nnode: [node_modules, .next/cache, .turbo]\npython: [.venv, .mypy_cache, .pytest_cache]\nrust: [target/, ~/.cargo/registry]\ngo: [~/go/pkg/mod, ~/.cache/go-build]\ndocker: [/tmp/.buildx-cache]  # BuildKit layer cache\n```\n\n### GitHub Actions Specific Patterns\n\n```yaml\n# Reusable workflow (DRY across repos)\n# .github/workflows/reusable-deploy.yml\non:\n  workflow_call:\n    inputs:\n      environment:\n        required: true\n        type: string\n    secrets:\n      DEPLOY_KEY:\n        required: true\n\n# Caller workflow\njobs:\n  deploy:\n    uses: ./.github/workflows/reusable-deploy.yml\n    with:\n      environment: production\n    secrets: inherit\n```\n\n```yaml\n# Path-based triggers (monorepo)\non:\n  push:\n    paths:\n      - 'packages/api/**'\n      - 'shared/**'\n  # Skip CI for docs-only changes\n  pull_request:\n    paths-ignore:\n      - 
'**.md'\n      - 'docs/**'\n```\n\n```yaml\n# Concurrency (cancel in-progress on new push)\nconcurrency:\n  group: ${{ github.workflow }}-${{ github.ref }}\n  cancel-in-progress: true\n```\n\n## Phase 3: Container Strategy\n\n### Dockerfile Best Practices\n\n```dockerfile\n# Multi-stage build template\n# Stage 1: Build\nFROM node:20-alpine AS builder\nWORKDIR /app\nCOPY package.json package-lock.json ./\nRUN npm ci --production=false    # Install all deps for build\nCOPY . .\nRUN npm run build\n\n# Stage 2: Production\nFROM node:20-alpine AS production\nRUN addgroup -g 1001 app && adduser -u 1001 -G app -s /bin/sh -D app\nWORKDIR /app\n# Install production dependencies only, so dev deps stay out of the final image\nCOPY package.json package-lock.json ./\nRUN npm ci --omit=dev\nCOPY --from=builder --chown=app:app /app/dist ./dist\n\nUSER app\nEXPOSE 3000\nHEALTHCHECK --interval=30s --timeout=3s --retries=3 \\\n  CMD wget -qO- http://localhost:3000/health || exit 1\nCMD [\"node\", \"dist/index.js\"]\n```\n\n### Image Size Reduction Checklist\n\n- [ ] Use alpine or distroless base images\n- [ ] Multi-stage builds (build deps not in final image)\n- [ ] `.dockerignore` excludes: `.git`, `node_modules`, `*.md`, tests, docs\n- [ ] Combine RUN commands (fewer layers)\n- [ ] Clean package manager cache in same RUN (`rm -rf /var/cache/apk/*`)\n- [ ] No dev dependencies in production stage\n- [ ] Pin base image SHA: `FROM node:20-alpine@sha256:abc123...`\n\n### Container Security Scan\n\n```bash\n# Trivy (recommended — free, fast)\ntrivy image myapp:latest --severity HIGH,CRITICAL\ntrivy fs . 
--scanners vuln,secret,misconfig\n\n# Scan in CI before push\n# Fail pipeline if CRITICAL vulnerabilities found\ntrivy image --exit-code 1 --severity CRITICAL myapp:${GIT_SHA}\n```\n\n### Docker Compose for Local Dev\n\n```yaml\n# docker-compose.yml — local development stack\nservices:\n  app:\n    build:\n      context: .\n      target: builder  # Use build stage for hot reload\n    volumes:\n      - .:/app\n      - /app/node_modules  # Don't override node_modules\n    ports:\n      - \"3000:3000\"\n    environment:\n      - DATABASE_URL=postgres://user:pass@db:5432/app\n      - REDIS_URL=redis://cache:6379\n    depends_on:\n      db:\n        condition: service_healthy\n\n  db:\n    image: postgres:16-alpine\n    volumes:\n      - pgdata:/var/lib/postgresql/data\n    environment:\n      POSTGRES_USER: user\n      POSTGRES_PASSWORD: pass\n      POSTGRES_DB: app\n    healthcheck:\n      test: [\"CMD-SHELL\", \"pg_isready -U user\"]\n      interval: 5s\n      timeout: 3s\n      retries: 5\n\n  cache:\n    image: redis:7-alpine\n    ports:\n      - \"6379:6379\"\n\nvolumes:\n  pgdata:\n```\n\n## Phase 4: Infrastructure as Code\n\n### IaC Decision Matrix\n\n| Tool | Best For | State | Language | Learning Curve |\n|------|----------|-------|----------|----------------|\n| Terraform/OpenTofu | Multi-cloud, cloud-agnostic | Remote (S3, GCS) | HCL | Medium |\n| Pulumi | Devs who prefer real code | Remote | TS/Python/Go | Low (if you code) |\n| AWS CDK | AWS-only shops | CloudFormation | TS/Python | Medium |\n| Ansible | Config management, server setup | Stateless | YAML | Low |\n| Helm | Kubernetes deployments | In-cluster Secrets (Helm 3) | YAML+Go templates | Medium |\n\n### Terraform Project Structure\n\n```\ninfrastructure/\n├── modules/                    # Reusable components\n│   ├── vpc/\n│   │   ├── main.tf\n│   │   ├── variables.tf\n│   │   └── outputs.tf\n│   ├── ecs-service/\n│   └── rds/\n├── environments/\n│   ├── dev/\n│   │   ├── main.tf            # Calls modules with 
dev params\n│   │   ├── terraform.tfvars\n│   │   └── backend.tf         # Dev state bucket\n│   ├── staging/\n│   └── prod/\n├── .terraform-version          # Pin terraform version\n└── .tflint.hcl\n```\n\n### Terraform Safety Rules\n\n1. **Always `plan` before `apply`** — review every change\n2. **Remote state with locking** — S3 + DynamoDB or GCS (built-in locking)\n3. **State never in git** — contains secrets (DB passwords, keys)\n4. **Import existing resources** before managing them — don't recreate\n5. **Use `prevent_destroy`** on critical resources (databases, S3 buckets)\n6. **Tag everything** — `environment`, `team`, `cost-center`, `managed-by: terraform`\n7. **`terraform fmt -check`** in CI — fail the build on inconsistent formatting\n\n```hcl\n# backend.tf — remote state with locking\nterraform {\n  backend \"s3\" {\n    bucket         = \"mycompany-terraform-state\"\n    key            = \"prod/main.tfstate\"\n    region         = \"eu-west-1\"\n    encrypt        = true\n    dynamodb_table = \"terraform-locks\"\n  }\n}\n\n# Protect critical resources\nresource \"aws_db_instance\" \"main\" {\n  # ...\n  lifecycle {\n    prevent_destroy = true\n  }\n}\n```\n\n### Environment Promotion Pattern\n\n```\n                    ┌──────────────────┐\n  terraform plan ──►│  Review in PR    │\n                    └────────┬─────────┘\n                             │ merge\n                    ┌────────▼─────────┐\n  auto-apply ──────►│  Dev             │──► smoke tests\n                    └────────┬─────────┘\n                             │ promote\n                    ┌────────▼─────────┐\n  manual approve ──►│  Staging         │──► integration tests\n                    └────────┬─────────┘\n                             │ promote (manual gate)\n                    ┌────────▼─────────┐\n  manual approve ──►│  Production      │──► monitoring window\n                    └──────────────────┘\n```\n\n## Phase 5: Kubernetes Operations\n\n### K8s Resource Templates\n\n```yaml\n# deployment.yml — 
production-ready template\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: myapp\n  labels:\n    app: myapp\n    version: \"1.0.0\"\nspec:\n  replicas: 3\n  strategy:\n    type: RollingUpdate\n    rollingUpdate:\n      maxSurge: 1\n      maxUnavailable: 0    # Zero-downtime\n  selector:\n    matchLabels:\n      app: myapp\n  template:\n    metadata:\n      labels:\n        app: myapp\n    spec:\n      securityContext:\n        runAsNonRoot: true\n        runAsUser: 1001    # Matches the app user (UID 1001) in the Dockerfile\n      containers:\n        - name: myapp\n          image: myregistry/myapp:abc123  # Git SHA tag\n          ports:\n            - containerPort: 3000\n          resources:\n            requests:\n              cpu: 100m\n              memory: 128Mi\n            limits:\n              cpu: 500m\n              memory: 512Mi\n          livenessProbe:\n            httpGet:\n              path: /health\n              port: 3000\n            initialDelaySeconds: 10\n            periodSeconds: 10\n            failureThreshold: 3\n          readinessProbe:\n            httpGet:\n              path: /ready\n              port: 3000\n            initialDelaySeconds: 5\n            periodSeconds: 5\n          env:\n            - name: DATABASE_URL\n              valueFrom:\n                secretKeyRef:\n                  name: myapp-secrets\n                  key: database-url\n      topologySpreadConstraints:\n        - maxSkew: 1\n          topologyKey: topology.kubernetes.io/zone\n          whenUnsatisfiable: DoNotSchedule\n          labelSelector:    # Without a selector no pods are counted\n            matchLabels:\n              app: myapp\n```\n\n```yaml\n# hpa.yml — autoscaling\napiVersion: autoscaling/v2\nkind: HorizontalPodAutoscaler\nmetadata:\n  name: myapp\nspec:\n  scaleTargetRef:\n    apiVersion: apps/v1\n    kind: Deployment\n    name: myapp\n  minReplicas: 3\n  maxReplicas: 20\n  metrics:\n    - type: Resource\n      resource:\n        name: cpu\n        target:\n          type: Utilization\n          averageUtilization: 70\n    - type: Resource\n      resource:\n        name: memory\n        
target:\n          type: Utilization\n          averageUtilization: 80\n  behavior:\n    scaleDown:\n      stabilizationWindowSeconds: 300  # 5 min cooldown\n      policies:\n        - type: Pods\n          value: 1\n          periodSeconds: 60  # Scale down 1 pod per minute max\n```\n\n### Helm Chart Checklist\n\n- [ ] `values.yaml` with sensible defaults (works out of the box)\n- [ ] Resource requests AND limits set\n- [ ] Health/readiness probes defined\n- [ ] PodDisruptionBudget (minAvailable: 1 or maxUnavailable: 25%)\n- [ ] NetworkPolicy (deny all, allow specific)\n- [ ] ServiceAccount (not default)\n- [ ] Secrets via external-secrets-operator or sealed-secrets (not plain)\n- [ ] `helm lint` and `helm template` in CI\n- [ ] NOTES.txt with post-install instructions\n\n### kubectl Cheat Sheet\n\n```bash\n# Debugging\nkubectl get pods -l app=myapp -o wide          # Pod status + node\nkubectl describe pod <pod>                      # Events, conditions\nkubectl logs <pod> --tail=100 -f               # Stream logs\nkubectl logs <pod> --previous                   # Crashed container logs\nkubectl exec -it <pod> -- /bin/sh              # Shell into pod\nkubectl top pods -l app=myapp                  # Resource usage\n\n# Rollouts\nkubectl rollout status deployment/myapp        # Watch rollout\nkubectl rollout history deployment/myapp       # Revision history\nkubectl rollout undo deployment/myapp          # Rollback to previous\nkubectl rollout undo deployment/myapp --to-revision=3  # Specific\n\n# Scaling\nkubectl scale deployment/myapp --replicas=5    # Manual scale\nkubectl autoscale deployment/myapp --min=3 --max=10 --cpu-percent=70\n\n# Context management\nkubectl config get-contexts                     # List clusters\nkubectl config use-context prod-cluster         # Switch\nkubectl config set-context --current --namespace=myapp  # Set namespace\n```\n\n## Phase 6: Deployment Strategies\n\n### Strategy Decision Matrix\n\n| Strategy | Risk | Speed | Rollback 
| Cost | Best For |\n|----------|------|-------|----------|------|----------|\n| Rolling | Low-Med | Fast | Slow (re-roll) | None | Standard deployments |\n| Blue-Green | Low | Instant | Instant (switch) | 2x infra | Critical services, zero-downtime |\n| Canary | Very Low | Slow | Instant (route 0%) | Minimal | High-traffic, risky changes |\n| Feature Flag | Very Low | Instant | Instant (toggle) | None | Gradual rollout, A/B testing |\n| Recreate | High | Fast | Slow | None | Dev/staging, stateful apps |\n\n### Canary Deployment Workflow\n\n```\n1. Deploy canary (1 pod with new version)\n2. Route 5% traffic → canary\n3. Monitor for 5 minutes:\n   - Error rate < baseline + 0.1%?\n   - p99 latency < baseline + 50ms?\n   - No new error types?\n4. If healthy → 25% → monitor 10 min\n5. If healthy → 50% → monitor 10 min\n6. If healthy → 100% (full rollout)\n7. If ANY check fails → route 0% to canary → rollback → alert\n\nAutomation: Argo Rollouts, Flagger, or Istio + custom controller\n```\n\n### Rollback Checklist\n\nWhen a deployment goes wrong:\n1. **Immediate**: Route traffic away from new version (canary→0%, blue-green→switch)\n2. **If rolling**: `kubectl rollout undo` or redeploy previous SHA\n3. **Check**: Are database migrations backward-compatible? (If not, you have a bigger problem)\n4. **Verify**: Rollback successful? Check error rates, latency\n5. **Communicate**: Post in #incidents, update status page\n6. 
**Investigate**: Don't re-deploy until root cause found\n\n### Database Migration Safety\n\n```\nRULE: Migrations must be backward-compatible with the PREVIOUS version.\n      (Because during rolling deploy, both versions run simultaneously)\n\nSafe migration pattern:\n  v1: Add new column (nullable, with default)\n  v2: Backfill data, start writing to new column\n  v3: Make new column required, stop writing old column\n  v4: Drop old column (after v3 is fully deployed)\n\nNEVER in one deploy:\n  ❌ Rename column\n  ❌ Change column type\n  ❌ Drop column still read by current version\n  ❌ Add NOT NULL without default\n```\n\n## Phase 7: Observability Stack\n\n### Three Pillars + Bonus\n\n| Pillar | What | Tools | Priority |\n|--------|------|-------|----------|\n| **Metrics** | Numeric measurements over time | Prometheus, Datadog, CloudWatch | 1 (start here) |\n| **Logs** | Event records | ELK, Loki, CloudWatch Logs | 2 |\n| **Traces** | Request flow across services | Jaeger, Tempo, X-Ray, Honeycomb | 3 |\n| **Profiling** | CPU/memory hot paths | Pyroscope, Parca | 4 (when optimizing) |\n\n### Key Metrics to Track\n\n```yaml\n# RED Method (request-driven services)\nrate:     # Requests per second\nerrors:   # Failed requests per second\nduration: # Latency distribution (p50, p95, p99)\n\n# USE Method (infrastructure/resources)\nutilization:  # % of resource in use (CPU, memory, disk)\nsaturation:   # Queue depth, pending work\nerrors:       # Resource errors (OOM, disk full)\n\n# Business Metrics (most important!)\nsignups_per_hour:\ncheckout_completion_rate:\napi_calls_by_customer:\nrevenue_per_minute:\n```\n\n### Alerting Rules\n\n```yaml\n# alerting-rules.yml\nalerts:\n  # Symptom-based (good — tells you users are impacted)\n  - name: HighErrorRate\n    condition: \"error_rate_5xx > 1% for 5m\"\n    severity: critical\n    runbook: docs/runbooks/high-error-rate.md\n    notify: [pagerduty, slack-incidents]\n\n  - name: HighLatency\n    condition: \"p99_latency > 2s 
for 5m\"\n    severity: warning\n    runbook: docs/runbooks/high-latency.md\n    notify: [slack-incidents]\n\n  # Cause-based (supplementary — helps diagnose)\n  - name: PodCrashLooping\n    condition: \"pod_restart_count increase > 3 in 10m\"\n    severity: warning\n    notify: [slack-platform]\n\n  - name: DiskSpaceWarning\n    condition: \"disk_usage > 80%\"\n    severity: warning\n    notify: [slack-platform]\n\n  - name: CertificateExpiring\n    condition: \"cert_expiry_days < 14\"\n    severity: warning\n    notify: [slack-platform]\n\n# Alert rules:\n# 1. Every alert must have a runbook link\n# 2. Every alert must be actionable (if you can't do anything, remove it)\n# 3. Critical = wake someone up. Warning = check next business day.\n# 4. Review alerts monthly — archive unused, tune noisy ones\n```\n\n### Structured Logging Standard\n\n```json\n{\n  \"timestamp\": \"2026-02-16T05:00:00.000Z\",\n  \"level\": \"error\",\n  \"service\": \"api\",\n  \"trace_id\": \"abc123\",\n  \"span_id\": \"def456\",\n  \"method\": \"POST\",\n  \"path\": \"/api/orders\",\n  \"status\": 500,\n  \"duration_ms\": 342,\n  \"user_id\": \"usr_789\",\n  \"error\": {\n    \"type\": \"DatabaseError\",\n    \"message\": \"connection timeout\",\n    \"stack\": \"...\"\n  },\n  \"context\": {\n    \"order_id\": \"ord_123\",\n    \"payment_method\": \"card\"\n  }\n}\n```\n\n**Log level guide:**\n- `error`: Something failed, needs attention\n- `warn`: Unexpected but handled (retry succeeded, fallback used)\n- `info`: Business events (order placed, user signed up, deploy started)\n- `debug`: Technical detail (query executed, cache hit/miss) — OFF in prod\n\n### Dashboard Template\n\nEvery service dashboard should have:\n\n```\nRow 1: Traffic Overview\n  - Request rate (per endpoint)\n  - Error rate (4xx, 5xx separate)\n  - Active users / connections\n\nRow 2: Performance\n  - p50, p95, p99 latency\n  - Throughput\n  - Apdex score\n\nRow 3: Resources\n  - CPU utilization (per pod/instance)\n  
- Memory usage (vs limit)\n  - Disk I/O / Network I/O\n\nRow 4: Business\n  - Revenue per minute (if applicable)\n  - Conversion funnel\n  - Queue depth / processing lag\n\nRow 5: Dependencies\n  - Database query latency + connection pool\n  - External API latency + error rate\n  - Cache hit rate\n```\n\n## Phase 8: Incident Response\n\n### Severity Levels\n\n| Level | Definition | Response Time | Example |\n|-------|-----------|---------------|---------|\n| SEV-1 | Complete outage, revenue impact | 15 min | Site down, payments failing |\n| SEV-2 | Major feature broken, workaround exists | 30 min | Search broken, checkout slow |\n| SEV-3 | Minor feature broken, low impact | 4 hours | Admin panel bug, non-critical API |\n| SEV-4 | Cosmetic / no user impact | Next sprint | Typo, minor UI glitch |\n\n### Incident Workflow\n\n```\n1. DETECT (automated or reported)\n   → Alert fires / user reports issue\n   → Create incident channel: #inc-YYYY-MM-DD-description\n\n2. TRIAGE (first 5 minutes)\n   → Assign Incident Commander (IC)\n   → Determine severity level\n   → Post initial assessment in channel\n   → Update status page (if customer-facing)\n\n3. MITIGATE (focus on stopping the bleeding)\n   → Can we rollback? → Do it\n   → Can we scale up? → Do it\n   → Can we feature-flag disable? → Do it\n   → DON'T debug root cause yet — restore service first\n\n4. RESOLVE\n   → Confirm service restored (metrics, customer reports)\n   → Communicate resolution to stakeholders\n   → Update status page\n\n5. 
POST-MORTEM (within 48 hours)\n   → Blameless — focus on systems, not people\n   → Timeline of events\n   → Root cause analysis (5 Whys)\n   → Action items with owners and deadlines\n   → Share with team\n```\n\n### Post-Mortem Template\n\n```markdown\n# Incident Post-Mortem: [Title]\n\n**Date:** YYYY-MM-DD\n**Duration:** Xh Ym\n**Severity:** SEV-X\n**Incident Commander:** [name]\n**Author:** [name]\n\n## Summary\n[1-2 sentence summary of what happened and impact]\n\n## Impact\n- Users affected: [number/percentage]\n- Revenue impact: [if applicable]\n- Duration: [start to full resolution]\n\n## Timeline (all times UTC)\n| Time | Event |\n|------|-------|\n| 14:00 | Deploy v2.3.1 begins |\n| 14:05 | Error rate spikes to 15% |\n| 14:07 | Alert fires, IC paged |\n| 14:12 | Rollback initiated |\n| 14:15 | Service restored |\n\n## Root Cause\n[Technical explanation — what actually broke and why]\n\n## Contributing Factors\n- [Factor 1 — e.g., migration not tested with production data volume]\n- [Factor 2 — e.g., canary deployment not configured for this service]\n\n## What Went Well\n- [Fast detection — alert fired within 2 minutes]\n- [Clear runbook — IC knew rollback procedure]\n\n## What Went Wrong\n- [No canary — went straight to 100% rollout]\n- [Migration was not backward-compatible]\n\n## Action Items\n| Action | Owner | Due | Priority |\n|--------|-------|-----|----------|\n| Add canary to deployment | @engineer | YYYY-MM-DD | P1 |\n| Add migration backward-compat check | @engineer | YYYY-MM-DD | P1 |\n| Update runbook for this service | @sre | YYYY-MM-DD | P2 |\n\n## Lessons Learned\n[Key takeaways for the team]\n```\n\n### On-Call Best Practices\n\n```yaml\non_call:\n  rotation: weekly\n  handoff: Monday 10:00 (overlap 1h with previous)\n  escalation:\n    - primary: respond within 15 min\n    - secondary: auto-page if no ack in 15 min\n    - manager: auto-page if no ack in 30 min\n\n  expectations:\n    - Laptop + internet within reach\n    - Respond to page 
within 15 minutes\n    - Follow runbook first, improvise second\n    - Escalate early — \"I don't know\" is fine\n    - Update incident channel every 15 min during active incident\n\n  wellness:\n    - No more than 1 week in 4 on-call\n    - Comp time after major incidents\n    - Toil budget: <30% of on-call time should be toil\n    - Quarterly review: are we paging too much?\n```\n\n## Phase 9: Security Hardening\n\n### Security Checklist (CI Pipeline)\n\n```yaml\nsecurity_gates:\n  # Pre-commit\n  - tool: gitleaks / trufflehog\n    what: Secret detection in code\n    block: true\n\n  # Build\n  - tool: semgrep / CodeQL\n    what: Static analysis (SAST)\n    block: critical findings\n\n  - tool: npm audit / pip-audit / cargo audit\n    what: Dependency vulnerabilities (SCA)\n    block: critical/high\n\n  # Container\n  - tool: trivy / grype\n    what: Image vulnerability scan\n    block: critical\n\n  - tool: hadolint\n    what: Dockerfile best practices\n    block: error level\n\n  # Deploy\n  - tool: checkov / tfsec\n    what: IaC security scan\n    block: high findings\n\n  # Runtime\n  - tool: falco / sysdig\n    what: Runtime anomaly detection\n    alert: true\n```\n\n### Secrets Management Decision\n\n| Method | Security | Complexity | Best For |\n|--------|----------|------------|----------|\n| CI/CD env vars | Basic | Low | Small teams, non-critical |\n| AWS Secrets Manager / GCP Secret Manager | High | Medium | Cloud-native apps |\n| HashiCorp Vault | Very High | High | Multi-cloud, strict compliance |\n| SOPS + git | Good | Low | GitOps workflows |\n| External Secrets Operator | High | Medium | Kubernetes + cloud secrets |\n\n**Rules:**\n- Rotate secrets at least every 90 days\n- Different secrets per environment (dev ≠ staging ≠ prod)\n- Audit all secret access\n- Never log secrets — mask in CI output\n- Use OIDC/keyless auth where possible (no long-lived tokens)\n\n### Network Security Baseline\n\n```\n1. 
Default deny all — explicitly allow what's needed\n2. TLS everywhere — including internal service-to-service\n3. No public IPs on internal services — use load balancers / API gateways\n4. WAF on public endpoints — OWASP Top 10 rules minimum\n5. Rate limiting on all APIs — prevent abuse and DDoS\n6. DNS for service discovery — never hardcode IPs\n7. VPN or zero-trust for admin access — no SSH from internet\n8. Network policies in K8s — pods can't talk to everything\n9. Egress control — services should only reach what they need\n10. Certificate auto-renewal — cert-manager or ACM\n```\n\n## Phase 10: SRE Practices\n\n### SLO Framework\n\n```yaml\n# Define SLOs for every user-facing service\nservice: checkout-api\nslos:\n  availability:\n    target: 99.95%        # 4.38 hours downtime/year\n    window: 30d rolling\n    measurement: \"successful_requests / total_requests\"\n\n  latency:\n    target: 99%           # 99% of requests under threshold\n    threshold: 500ms      # p99 < 500ms\n    window: 30d rolling\n\n  freshness:\n    target: 99.9%         # Data updated within SLA\n    threshold: 5m\n    window: 30d rolling\n\nerror_budget:\n  monthly_budget: 0.05%   # ~21.6 minutes\n  burn_rate_alert:\n    fast: 14.4x           # ~2% of the 30d budget burned per hour (exhausted in ~50 hours) → page\n    slow: 3x              # ~10% of the 30d budget burned per day (exhausted in ~10 days) → ticket\n  policy:\n    budget_exhausted:\n      - freeze non-critical deploys\n      - redirect eng effort to reliability\n      - review in weekly SRE sync\n```\n\n### Toil Reduction\n\n```\nToil = manual, repetitive, automatable, reactive, no lasting value\n\nTrack toil:\n  - Log manual interventions for 2 weeks\n  - Categorize: deployment, scaling, cert renewal, data fixes, permissions\n  - Prioritize: frequency × time × frustration\n\nTarget: <30% of engineering time on toil\nIf toil > 50%: stop feature work, automate the top 3 toil items\n\nCommon toil automation:\n  Manual deploys         → CI/CD pipeline\n  Certificate renewal    → cert-manager / 
ACM\n  Scaling up/down        → HPA / auto-scaling groups\n  Permission requests    → Self-service IAM with approval\n  Data fixes             → Admin API / scripts\n  Dependency updates     → Renovate / Dependabot\n  Flaky test management  → Auto-quarantine + ticket\n```\n\n### Capacity Planning\n\n```yaml\ncapacity_review:\n  frequency: monthly\n  inputs:\n    - current_utilization: \"CPU, memory, disk, network per service\"\n    - growth_rate: \"request rate trend over 90 days\"\n    - planned_events: \"launches, marketing campaigns, seasonal peaks\"\n    - headroom_target: 30%  # Don't run above 70% sustained\n\n  formula:\n    needed_capacity: \"current_usage × (1 + growth_rate) / (1 - headroom)\"\n    lead_time: \"14 days for cloud, 60+ days for hardware\"\n\n  actions:\n    - \"If utilization > 70%: plan scaling within 2 weeks\"\n    - \"If utilization > 85%: emergency scaling NOW\"\n    - \"If utilization < 30%: rightsize down (save money)\"\n```\n\n## Phase 11: Cost Optimization\n\n### Cloud Cost Rules\n\n```\n1. Right-size first — most instances are overprovisioned\n   Check: actual CPU/memory usage vs provisioned (CloudWatch, Datadog)\n   Action: downsize to the next tier that keeps sustained utilization at or below 70% (30% headroom)\n\n2. Reserved capacity for baseline — spot/preemptible for burst\n   Pattern: 60% reserved + 30% on-demand + 10% spot\n   Savings: 40-70% on reserved vs on-demand\n\n3. Auto-scale to zero when possible\n   - Dev/staging environments: scale down nights + weekends\n   - Serverless for bursty workloads (Lambda, Cloud Functions)\n\n4. Delete zombie resources monthly\n   - Unattached EBS volumes\n   - Old snapshots (>90 days, not tagged for retention)\n   - Unused load balancers\n   - Orphaned Elastic IPs\n\n5. Storage tiering\n   - Hot: SSD (frequently accessed)\n   - Warm: HDD (monthly access)\n   - Cold: S3 Glacier / Archive (yearly access)\n   - Auto-lifecycle policies on S3 buckets\n\n6. 
Tag everything — untagged = untracked = wasted\n   Required tags: environment, team, service, cost-center\n   Weekly report: cost by tag, highlight untagged resources\n```\n\n### Monthly Cost Review Template\n\n```markdown\n## Cloud Cost Review — [Month YYYY]\n\n### Summary\n- Total spend: $X,XXX (vs budget: $X,XXX)\n- MoM change: +X% ($XXX)\n- Top 3 cost drivers: [service1, service2, service3]\n\n### By Service\n| Service | Cost | % of Total | MoM Change | Action |\n|---------|------|-----------|------------|--------|\n| EKS | $XXX | XX% | +X% | Right-size node group |\n| RDS | $XXX | XX% | 0% | Consider reserved |\n| S3 | $XXX | XX% | +X% | Add lifecycle rules |\n\n### Optimization Actions Taken\n- [Action 1]: Saved $XXX/mo\n- [Action 2]: Saved $XXX/mo\n\n### Next Month Actions\n- [ ] [Action with estimated savings]\n```\n\n## DevOps Maturity Assessment\n\nScore your team (1-5 per dimension):\n\n| Dimension | 1 (Ad-hoc) | 3 (Defined) | 5 (Optimized) |\n|-----------|-----------|-------------|----------------|\n| **CI/CD** | Manual deploy | Automated pipeline, manual gate | Full auto with canary, <15 min to prod |\n| **IaC** | Click-ops console | Some Terraform, manual tweaks | 100% IaC, GitOps, drift detection |\n| **Monitoring** | Check when broken | Dashboards + basic alerts | SLOs, error budgets, auto-remediation |\n| **Incident** | Panic + SSH | Runbooks, on-call rotation | Blameless postmortems, chaos engineering |\n| **Security** | Annual audit | CI scanning, secret manager | Shift-left, runtime detection, zero-trust |\n| **Cost** | Surprise bills | Monthly review, some reservations | Real-time tracking, auto-optimization |\n\n**Score interpretation:**\n- 6-12: Foundations needed — focus on CI/CD and basic monitoring\n- 13-20: Growing — add IaC and incident process\n- 21-26: Mature — optimize with SRE practices and cost management\n- 27-30: Elite — focus on chaos engineering and developer experience\n\n## Natural Language Commands\n\nSay things like:\n- 
\"Set up CI/CD for my Node.js project\"\n- \"Create a Dockerfile for my Python API\"\n- \"Write Terraform for an ECS service with RDS\"\n- \"Design a monitoring dashboard for my service\"\n- \"Help me write a post-mortem for yesterday's outage\"\n- \"Review my Kubernetes deployment for production readiness\"\n- \"What deployment strategy should I use?\"\n- \"Help me set up alerting rules\"\n- \"Create an incident response runbook for database failures\"\n- \"Audit my cloud costs and suggest optimizations\"\n- \"Assess our DevOps maturity\"\n- \"Set up secret management for our CI pipeline\"\n","tags":{"cicd":"1.0.0","devops":"1.0.0","docker":"1.0.0","infrastructure":"1.0.0","kubernetes":"1.0.0","latest":"1.0.0","sre":"1.0.0"},"stats":{"comments":0,"downloads":234,"installsAllTime":9,"installsCurrent":1,"stars":0,"versions":1},"createdAt":1771345517880,"updatedAt":1778491567625},"latestVersion":{"version":"1.0.0","createdAt":1771345517880,"changelog":"Initial release of afrexai-devops-engine — a complete DevOps & Platform Engineering system.\n\n- Covers the full DevOps lifecycle: repository strategy, CI/CD pipelines, containerization, and platform practices.\n- Provides practical templates for branch protection, commit conventions, and pipeline design for all major CI tools.\n- Includes best practices for Dockerfile creation, image optimization, container security, and local development setup.\n- Features adaptable guides on caching strategies, pipeline concurrency, and reusable workflow patterns.\n- Designed for any team size, release cadence, and major cloud/platform environments.","license":null},"metadata":{"setup":[],"os":["linux","darwin","win32"],"systems":null},"owner":{"handle":"1kalin","userId":"s17e1q0nx23qnh4n429zzqc05x83hvsw","displayName":"1kalin","image":"https://avatars.githubusercontent.com/u/15705344?v=4"},"moderation":null}