Install
openclaw skills install production-readiness
npx clawhub@latest install production-readiness
Meta-skill that orchestrates logging, monitoring, error handling, performance, security, deployment, and testing skills to ensure a service is fully production-ready before launch. Use before first deploy, major releases, quarterly reviews, or after incidents.
Coordinates all operational concerns into a single readiness review. Instead of duplicating domain expertise, this skill routes to specialized skills and agents for each area, then synthesizes results into a unified go/no-go assessment.
Ensure a service is production-ready by systematically checking every operational concern — logging, error handling, performance, security, deployment, testing, and documentation — before traffic hits it.
A production-ready service:
| Trigger | Context |
|---|---|
| Before first deploy | New service going to production for the first time |
| Before major release | Significant feature or architectural change shipping |
| Quarterly production review | Scheduled audit of existing services |
| After incident | Post-incident hardening to prevent recurrence |
| Dependency upgrade | Major framework, runtime, or infrastructure change |
| Team handoff | Transferring ownership of a service to another team |
Run each area sequentially or in parallel. Each step delegates to a specialized skill or agent — this skill does not re-implement their logic.
┌─────────────────────────────────────────────────┐
│ Production Readiness Review │
├─────────────────────────────────────────────────┤
│ │
│ 1. Logging & Observability ──► logging-observability skill
│ 2. Error Handling ───────────► error-handling-patterns skill
│ 3. Performance ──────────────► performance-agent
│ 4. Security ─────────────────► security-review meta-skill
│ 5. Deployment ───────────────► deployment-agent + docker-expert skill
│ 6. Testing ──────────────────► testing-workflow meta-skill
│ 7. Documentation ────────────► /generate-docs command
│ │
│ ──► Synthesize results into go/no-go report │
└─────────────────────────────────────────────────┘
| Concern | Skill / Agent | Path |
|---|---|---|
| Logging & Observability | logging-observability | ai/skills/tools/logging-observability/SKILL.md |
| Error Handling | error-handling-patterns | ai/skills/backend/error-handling-patterns/SKILL.md |
| Performance | performance-agent | ai/agents/performance/ |
| Security | security-review | ai/skills/meta/security-review/SKILL.md |
| Deployment (containers) | docker-expert | ai/skills/devops/docker/SKILL.md |
| Deployment (pipelines) | deployment-agent | ai/agents/deployment/ |
| Testing | testing-workflow | ai/skills/testing/testing-workflow/SKILL.md |
| Rate Limiting | rate-limiting-patterns | ai/skills/backend/rate-limiting-patterns/SKILL.md |
| Documentation | /generate-docs | ai/commands/documentation/ |
Routing rule: Read the target skill first, follow its instructions, then return results here for synthesis.
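The routing and synthesis steps above can be sketched as a small driver. This is only an illustration under stated assumptions: `AreaResult`, `run_review`, and the idea that each delegated skill or agent is wrapped in a plain callable are hypothetical names introduced here, not part of the skill itself. The rule that any failed or incomplete area yields "no-go" is also an assumption about how synthesis might work.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class AreaResult:
    """Outcome reported back by one delegated skill or agent (hypothetical shape)."""
    area: str
    passed: bool
    notes: str = ""


def run_review(checks: Dict[str, Callable[[], AreaResult]]) -> Tuple[str, List[AreaResult]]:
    """Run every area check and synthesize a go/no-go verdict.

    Assumption: one failing or crashing area blocks the launch, and a review
    with no checks at all is treated as "no-go" rather than a silent pass.
    """
    results: List[AreaResult] = []
    for area, check in checks.items():
        try:
            results.append(check())
        except Exception as exc:
            # A check that cannot run is recorded as a failure, not skipped.
            results.append(AreaResult(area, False, f"check did not complete: {exc}"))
    verdict = "go" if results and all(r.passed for r in results) else "no-go"
    return verdict, results
```

Each callable would wrap whatever the target skill produces after its instructions are followed, so this driver never re-implements the domain logic.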
Health check endpoint (/healthz or /health) returns dependency status
| Level | Name | Requirements |
|---|---|---|
| L1 | MVP | Health check, basic logging, error handling, manual deploy, unit tests, README |
| L2 | Stable | Structured logging, metrics, graceful shutdown, CI/CD pipeline, integration tests, runbooks |
| L3 | Resilient | Distributed tracing, circuit breakers, auto-scaling, chaos testing, SLOs, on-call rotation |
| L4 | Optimized | Adaptive rate limiting, predictive alerting, canary deploys, full observability, error budgets, postmortem culture |
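One way to apply the maturity table is to record which requirements a service satisfies and report the highest level reached. The sketch below assumes the levels are cumulative (L2 requires everything in L1, and so on), which the table implies but does not state; for instance, a team with a CI/CD pipeline would still need to mark "manual deploy" as satisfied under this reading. `LEVELS` and `maturity_level` are hypothetical names.

```python
from typing import List, Optional, Set, Tuple

# Requirement names copied from the table; ordering matters (lowest level first).
LEVELS: List[Tuple[str, Set[str]]] = [
    ("L1", {"health check", "basic logging", "error handling",
            "manual deploy", "unit tests", "README"}),
    ("L2", {"structured logging", "metrics", "graceful shutdown",
            "CI/CD pipeline", "integration tests", "runbooks"}),
    ("L3", {"distributed tracing", "circuit breakers", "auto-scaling",
            "chaos testing", "SLOs", "on-call rotation"}),
    ("L4", {"adaptive rate limiting", "predictive alerting", "canary deploys",
            "full observability", "error budgets", "postmortem culture"}),
]


def maturity_level(satisfied: Set[str]) -> Tuple[Optional[str], Set[str]]:
    """Return (highest level fully met, requirements missing for the next level).

    Assumption: levels are cumulative, so evaluation stops at the first
    level with any unmet requirement.
    """
    achieved: Optional[str] = None
    for level, required in LEVELS:
        missing = required - satisfied
        if missing:
            return achieved, missing
        achieved = level
    return achieved, set()
```

The second return value doubles as a to-do list for reaching the next level.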
| Severity | Response Time | Escalation After | Stakeholder Notification |
|---|---|---|---|
| SEV-1 (outage) | 15 min | 30 min | Immediate — exec + customers |
| SEV-2 (degraded) | 30 min | 1 hour | Within 1 hour — eng lead |
| SEV-3 (minor) | 4 hours | Next business day | Daily standup |
| SEV-4 (cosmetic) | Next sprint | N/A | Backlog |
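The clock-based entries in the severity table can be turned into concrete deadlines. The sketch below assumes both windows are measured from the moment the alert fired, which the table does not specify. Entries such as "Next business day" and "Next sprint" depend on a team calendar, so they are returned as `None` rather than guessed. `POLICY` and `deadlines` are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple

# (response window, escalation window); None marks calendar-based entries
# ("Next business day", "Next sprint", "N/A") that cannot be expressed as a fixed duration.
POLICY: Dict[str, Tuple[Optional[timedelta], Optional[timedelta]]] = {
    "SEV-1": (timedelta(minutes=15), timedelta(minutes=30)),
    "SEV-2": (timedelta(minutes=30), timedelta(hours=1)),
    "SEV-3": (timedelta(hours=4), None),
    "SEV-4": (None, None),
}


def deadlines(severity: str, alert_fired_at: datetime) -> Dict[str, Optional[datetime]]:
    """Compute respond-by and escalate-by times for an incident.

    Assumption: both windows start when the alert fired.
    """
    if severity not in POLICY:
        raise ValueError(f"unknown severity: {severity}")
    response, escalation = POLICY[severity]
    return {
        "respond_by": alert_fired_at + response if response else None,
        "escalate_by": alert_fired_at + escalation if escalation else None,
    }
```

For example, a SEV-1 alert firing at 12:00 would need a response by 12:15 and escalation by 12:30 under these assumptions.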
## Incident: [Title]
**Date:** YYYY-MM-DD | **Duration:** X hours | **Severity:** SEV-N
### Summary
One-paragraph description of what happened and impact.
### Timeline
- HH:MM — First alert fired
- HH:MM — Engineer paged, investigation started
- HH:MM — Root cause identified
- HH:MM — Mitigation applied
- HH:MM — Full resolution confirmed
### Root Cause
What broke and why. Link to code/config change if applicable.
### Impact
- Users affected: N
- Revenue impact: $X (if applicable)
- SLO budget consumed: X%
### Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| Fix X | @eng | YYYY-MM-DD | Open |
### Lessons Learned
- What went well
- What went poorly
- Where we got lucky
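The "SLO budget consumed" field in the template can be filled in with a short calculation. This sketch assumes a time-based availability SLO and counts the whole incident as full downtime; request-based SLOs or partial degradation would need a different formula. `error_budget_consumed` and its parameters are hypothetical names.

```python
def error_budget_consumed(slo_target: float, window_days: int, downtime_minutes: float) -> float:
    """Return the percentage of the error budget used by an incident.

    Assumption: the budget is the downtime allowed by an availability
    target (e.g. 0.999) over a rolling window of `window_days`.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    budget_minutes = (1.0 - slo_target) * window_days * 24 * 60
    return 100.0 * downtime_minutes / budget_minutes
```

With a 99.9% target over 30 days the budget is about 43.2 minutes, so a 21.6-minute outage would consume roughly half of it.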