Install
openclaw skills install branerail
CTO-level architectural advisor for AI-native code, focusing on state ownership, resilience, observability, scaling, dependencies, and system design best pra...
Core principle: AI generates code at lightspeed. Your job is to conduct the orchestra, not play a single instrument. In an AI-native world, architectural thinking, not syntactic fluency, separates valuable builders from those building houses of cards.
Use this skill for:
Trigger keywords (use liberally):
Before shipping any logic, answer these three questions with certainty. If you cannot, your system is fragile.
The Question: What is the single source of truth for each mutable piece of data?
Why It Matters: Multiple components claiming ownership creates race conditions, sync bugs, and silent data corruption. AI-generated code often scatters state without a coherent strategy.
Audit Process:
Architecture Patterns:
| Pattern | Use When | Trade-offs |
|---|---|---|
| Single Source of Truth (DB) | Correctness is critical (payments, inventory, auth) | Higher latency (must hit DB) |
| Write-Through Cache | High read volume, acceptable write latency | Must update cache after DB |
| Write-Back Cache | Low write latency needed | Risk of cache loss before sync |
| Event Sourcing | Need audit trail and point-in-time recovery | Complexity, eventual consistency |
| CQRS | Read/write patterns differ radically | Query model sync complexity |
| Distributed Consensus | Sync state across replicas (e.g., etcd, Raft) | Complex, higher latency |
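As one illustration of the patterns in the table, here is a minimal sketch of a write-through cache. The `InMemoryDB` and `WriteThroughCache` classes are hypothetical stand-ins, not part of any library; the point is only the ordering: the database (the single source of truth) is written first, and the cache is updated afterwards so it never gets ahead of its owner.

```python
from typing import Dict, Optional


class InMemoryDB:
    """Hypothetical stand-in for the database, the single source of truth."""

    def __init__(self) -> None:
        self.rows: Dict[str, int] = {}
        self.reads = 0  # counts how often the DB is actually hit

    def write(self, key: str, value: int) -> None:
        self.rows[key] = value

    def read(self, key: str) -> Optional[int]:
        self.reads += 1
        return self.rows.get(key)


class WriteThroughCache:
    """Writes go to the DB first, then the cache; reads prefer the cache."""

    def __init__(self, db: InMemoryDB) -> None:
        self.db = db
        self.cache: Dict[str, int] = {}

    def set(self, key: str, value: int) -> None:
        self.db.write(key, value)  # owner first: if this raises, the cache is untouched
        self.cache[key] = value    # then the cache, so it never leads the DB

    def get(self, key: str) -> Optional[int]:
        if key in self.cache:
            return self.cache[key]
        value = self.db.read(key)  # cache miss: fall back to the owner
        if value is not None:
            self.cache[key] = value
        return value


db = InMemoryDB()
store = WriteThroughCache(db)
store.set("sku_1", 5)
```

Losing the cache in this sketch costs latency, not correctness: every value can be re-read from the database.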
Red Flags:
Code Review Checklist:
The Question: How do you know if your system is working? What alerts you to failures?
Why It Matters: A system without visibility fails silently. By the time a user reports it, the damage may be irreversible.
Audit Process:
Logging Strategy:
✅ GOOD: Structured, contextual
{
  "timestamp": "2026-04-27T10:30:45Z",
  "service": "order-processor",
  "operation": "process_payment",
  "orderId": "order_12345",
  "customerId": "cust_67890",
  "status": "failed",
  "error": "payment_gateway_timeout",
  "retries_attempted": 3,
  "latency_ms": 5000,
  "trace_id": "tr_abc123def456"
}
❌ BAD: Unstructured, no context
[ERROR] Payment failed. Retrying...
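A structured log line like the good example can be produced with Python's standard `logging` module. The sketch below is one possible setup, not a prescribed one: `JsonFormatter`, `build_logger`, and the `context` key passed through `extra` are names chosen for illustration, and the `level` and `message` fields are additions that do not appear in the example above.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with its contextual fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        }
        # "context" is a hypothetical attribute supplied by callers via `extra=`.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    logger = logging.getLogger(service)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("order-processor")
logger.error(
    "payment failed",
    extra={"context": {
        "operation": "process_payment",
        "orderId": "order_12345",
        "status": "failed",
        "error": "payment_gateway_timeout",
        "retries_attempted": 3,
        "latency_ms": 5000,
        "trace_id": "tr_abc123def456",
    }},
)
```

Because every field is a key rather than free text, the output can be filtered by `orderId` or `trace_id` without regular expressions.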
Metrics to Track:
Alerting Strategy:
Red Flags:
Code Review Checklist:
The Question: Can you trace the blast radius of every component?
Why It Matters: If you cannot articulate what happens when a piece is removed, you do not truly understand the system.
Audit Process:
Blast Radius Analysis:
Scenario: Delete the cache layer
A: Web → Cache → DB
If cache is deleted:
- Reads go directly to DB (slower, but correct)
- Throughput drops 10x
- DB CPU spikes
- Users on slow connections time out
- Blast radius: ALL users
- Mitigation: Circuit breaker (fail fast instead of timing out)
Scenario: Delete the notification service
Orders → Notification Service → Email / SMS
If notification service is deleted:
- Orders still process (good)
- Users don't get confirmation emails (bad UX)
- Blast radius: Marketing, customer trust
- Mitigation: Queue notifications, retry asynchronously
Dependency Mapping:
| Component | Depends On | Depended On By | Fallback? | SPOF? |
|---|---|---|---|---|
| Auth Service | DB | All services | No | YES |
| Payment Gateway | External API | Orders | Retry + queue | Partial |
| Cache | In-memory store | API | Direct DB read | No |
| Notification | Message queue | Orders, Users | Queue message | No |
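Once a dependency map like the one above exists, the blast radius of a component can be estimated mechanically by walking the graph in reverse. The sketch below assumes a hypothetical `DEPENDS_ON` mapping loosely modeled on the table; the exact edges are illustrative. It ignores fallbacks, so the result is an upper bound on what could be affected, not a prediction of what will fail.

```python
from collections import deque
from typing import Dict, List, Set

# component -> components it depends on (illustrative edges, not a real system)
DEPENDS_ON: Dict[str, List[str]] = {
    "API": ["Auth Service", "Cache", "Orders"],
    "Orders": ["Auth Service", "Payment Gateway", "Notification"],
    "Auth Service": ["DB"],
    "Cache": ["In-memory store"],
    "Notification": ["Message queue"],
    "Payment Gateway": ["External API"],
}


def blast_radius(component: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every component that directly or transitively depends on `component`."""
    # Invert the edges: dependency -> set of direct dependents.
    dependents: Dict[str, Set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    affected: Set[str] = set()
    queue = deque([component])
    while queue:  # breadth-first walk over the reversed graph
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


print(sorted(blast_radius("DB", DEPENDS_ON)))
```

A component with an empty result has no dependents in the map; a component whose result covers most of the graph is a candidate single point of failure.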
Red Flags:
Code Review Checklist:
These practices slow you down. They save you from building on sand.
Workflow:
Example Diagram:
                Client
                  |
            [API Gateway]
            /     |     \
        Order    User   Payment
       Service  Service  Service
          |       |        |
     [Order DB] [User DB] [Payment Gateway]
          |       |
       [Cache]  [Cache]
          |
    [Message Queue]
          |
  [Notification Service]
          |
  [Email / SMS Provider]
Blast radius analysis:
- If Order Service ↓: Can't create orders (orders = core feature)
- If User Service ↓: Can't log in (cascade fail)
- If Cache ↓: Slower reads, but queries still work
- If Email Provider ↓: Orders process, confirmations queue, retried
Checkpoint: Can you sketch this in 5 minutes and explain it to someone else? If not, you don't understand it yet. Do not prompt the AI until you can.
Use design.md for visual design systems. Use architectural specs for system design.
DESIGN.md is a format specification that combines machine-readable design tokens (YAML front matter) with human-readable design rationale in markdown prose, allowing AI agents to generate on-brand interfaces without needing repeated explanations.
DESIGN.md Structure:
---
name: ProductName
colors:
  primary: "#1A1C1E"
  secondary: "#6C7278"
  accent: "#B8422E"
  success: "#2E7D32"
  error: "#C62828"
  neutral: "#F7F5F2"
typography:
  h1:
    fontFamily: "Public Sans"
    fontSize: "3rem"
    fontWeight: "700"
  body:
    fontFamily: "Public Sans"
    fontSize: "1rem"
    lineHeight: "1.5"
spacing:
  xs: "4px"
  sm: "8px"
  md: "16px"
  lg: "32px"
rounded:
  sm: "4px"
  md: "8px"
  lg: "16px"
---
## Visual Intent
Describe the aesthetic and emotional tone: minimalist, bold, approachable, professional.
## Color Usage
Explain the semantic meaning of each color and when to use it.
## Typography
Explain font choices and when to use each scale.
## Component Patterns
Define behavior for buttons, cards, forms, modals, etc.
## Accessibility
Document WCAG AA/AAA compliance, contrast ratios, keyboard navigation.
Validation: Use Google's design.md CLI tool to validate the file, check WCAG contrast ratios, and export tokens to Tailwind or W3C DTCG format.
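To make the contrast check concrete, the sketch below computes the WCAG 2.x contrast ratio directly from hex colors using the published relative-luminance formula. It is not the CLI tool mentioned above and makes no claim about how that tool works; the function names are chosen for illustration. The AA thresholds used are 4.5:1 for normal text and 3:1 for large text.

```python
def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a '#RRGGBB' color, from 0.0 (black) to 1.0 (white)."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio: (lighter + 0.05) / (darker + 0.05)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground: str, background: str, large_text: bool = False) -> bool:
    """True if the pair meets WCAG AA (4.5:1 normal text, 3:1 large text)."""
    return contrast_ratio(foreground, background) >= (3.0 if large_text else 4.5)


# Check the primary text color against the neutral background from the tokens above.
print(passes_aa("#1A1C1E", "#F7F5F2"))
```

Running a check like this over every foreground and background pair in the token file catches contrast regressions before any UI is generated.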
# [Component Name] Specification
## Purpose
One sentence. What does this do?
## Inputs
- Data structure(s), format, size limits, example payloads
## Outputs
- Data structure(s), format, example payloads
## State Ownership
- What state does this own?
- What state does it read (from where)?
- How are conflicts resolved?
## Critical Path
- Happy path: input → process → output
- Timeline and latency targets
## Failure Modes
| Failure | Probability | Impact | Detection | Recovery |
|---------|-------------|--------|-----------|----------|
| Network timeout | High | Partial | Timeout + log | Retry with exponential backoff |
| Disk full | Medium | Total | No space error | Alert, manual intervention |
| Invalid input | High | Partial | Schema validation | Reject + log |
| Cascade from dependency | High | Partial | Dependency error | Fallback or circuit break |
## Observability
- Logs: what events are logged?
- Metrics: what is measured?
- Alerts: what triggers escalation?
## Constraints
- Performance targets (latency p99, throughput)
- Scaling limits (max concurrent, max data size)
- Dependencies (what must be running first)
## Questions Answered
- Where does state live? [Describe single source of truth]
- Where does feedback live? [Describe observability]
- What breaks if I delete this? [Describe blast radius]
Checkpoint: If you cannot fill this out without guessing, the design is incomplete. Do not proceed.
For each component:
[ ] What calls this?
[ ] What does this output to?
[ ] What happens to those dependents if this is gone?
[ ] Are there fallbacks?
[ ] How many users are affected?
[ ] How long until they notice?
Workflow:
Frequency: Weekly for critical code, monthly for infrastructure.
Key distinction: Compilers are deterministic. LLMs are probabilistic.
A compiler follows provably correct rules. You trust it without auditing the machine code.
An LLM makes choices based on statistical likelihood. It can introduce:
Your role: Auditor and architect.
When Claude Code or another agent generates code, audit it against these criteria:
| Anti-Pattern | Failure Mode | Fix |
|---|---|---|
| No State Ownership | Race conditions, sync bugs, data corruption | Designate a single owner for each data type |
| Scattered State | Inconsistency, silent failures, hard to debug | Centralize or use consensus protocol |
| Silent Failures | User reports bug hours later; data is corrupted | Instrument everything; alert on anomalies |
| Circular Dependencies | Can't isolate changes; cascading failures | Restructure to acyclic dependency graph |
| Single Point of Failure (SPOF) | One component down = entire system down | Add redundancy, fallbacks, bulkheads |
| Implicit Dependencies | Hidden globals, env vars, side effects | Make dependencies explicit; inject them |
| Premature Optimization | Complex code, fragile systems, maintenance nightmare | Simplify first, optimize after measurement |
| Tight Coupling | Can't change one service without affecting others | Loosen via async, contracts, versioning |
| No Monitoring | System fails silently; rollbacks are expensive | Instrument every critical operation |
| Cache Invalidation | Stale or inconsistent reads ("There are only 2 hard things in CS...") | Explicit invalidation or TTL; measure hit ratio |
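Some of these anti-patterns can be checked automatically. As an example, the sketch below detects circular dependencies with a depth-first search over a dependency graph. The graph shape (a dict mapping each service to the services it depends on) and the service names are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a path (first node repeated at the end), or None."""
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []  # nodes on the current DFS path

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in graph.get(node, []):
            dep_state = state.get(dep, UNVISITED)
            if dep_state == IN_PROGRESS:  # back edge: dep is already on the path
                return path[path.index(dep):] + [dep]
            if dep_state == UNVISITED:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state.get(node, UNVISITED) == UNVISITED:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical service graphs: service -> services it depends on.
acyclic = {"orders": ["payments", "users"], "payments": ["users"], "users": []}
cyclic = {"orders": ["payments"], "payments": ["billing"], "billing": ["orders"]}

print(find_cycle(acyclic))
print(find_cycle(cyclic))
```

A check like this can run in CI against an import graph or a service manifest so that a new cycle fails the build instead of surfacing during an outage.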
These are the hardest problems. Think deeply.
Mutex / Lock:
Atomic Operations:
Immutable Data:
Channels / Queues:
Transactions:
Consensus (Raft, Paxos):
Eventual Consistency:
Event Sourcing:
CQRS (Command Query Responsibility Segregation):
Circuit Breaker:
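A minimal sketch of the idea, under simplifying assumptions: the breaker opens after a fixed number of consecutive failures (rather than a failure percentage), rejects calls while open, and allows a single trial call once a timeout has elapsed. `CircuitBreaker`, `CircuitOpenError`, and the parameter names are hypothetical, not taken from a library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock  # injectable so tests do not have to wait
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial call.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast keeps callers from piling up behind a dead dependency, which is what turns a local outage into a cascade.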
Create a DESIGN.md (for UI consistency):
# Ask Claude Code to generate DESIGN.md
"Create a DESIGN.md file that defines our brand colors, typography, and component patterns."
Create architectural specs (for system design):
# Ask Claude Code to scaffold spec documents
"Generate spec templates for each major component: auth, payment, notifications."
Link specs to prompts:
You are a CTO-level code generator.
When I ask you to build [feature], first:
1. Reference the spec at /specs/[feature].md
2. Verify your code satisfies all requirements.
3. Implement the failure modes listed.
4. Include structured logging for every operation.
If building UI:
1. Reference /DESIGN.md for colors, typography, components.
2. Ensure all generated UI respects those tokens.
3. Check WCAG AA contrast ratios.
Good Prompt:
Using the spec at /specs/order-processing.md:
1. Implement the order processing service.
2. All state mutations go through OrderStore (single source of truth).
3. Implement retry logic with exponential backoff for payment gateway failures.
4. Log every operation: orderId, status, latency, errors.
5. Emit metrics: order count, latency p50/p95/p99, error rate.
6. Add a circuit breaker: if payment fails >5% of the time, fail fast.
7. Handle the failure modes in the spec: timeout, invalid input, gateway down, database error.
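For reference when auditing what the agent produces for item 3, a retry with exponential backoff might look like the sketch below. The function name, defaults, and the choice of retryable exceptions are assumptions for illustration; a real payment client would raise its own error types, and payment calls should only be retried if they are idempotent.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying transient errors with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles each attempt (0.5s, 1s, 2s, ...) up to max_delay_s.
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Add jitter so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

Errors outside `retry_on`, such as invalid input, propagate immediately; retrying them would only add latency.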
Why It Works:
Move beyond simple client-server models into resilient, high-scale patterns.
Advanced techniques for self-healing systems.
Score yourself 0-3 on each:
| Criterion | 0 - Fragile | 1 - Risky | 2 - Solid | 3 - Resilient |
|---|---|---|---|---|
| State Ownership | Multiple owners or scattered | Some replicas without strategy | Single owner, clear replicas | Central authority + audit trail |
| Observability | No logging or metrics | Logs exist but unstructured | Structured logs, basic metrics | Full tracing, anomaly detection |
| Failure Handling | No fallbacks; failures cascade | Some fallbacks, partial coverage | All critical failures handled | Self-healing, circuit breakers |
| Blast Radius | Don't know what's coupled | Loosely mapped | Well documented | Tested via chaos engineering |
| Testing | No tests | Happy path only | Happy + failure cases | Concurrency, performance, chaos |
| Scaling | Doesn't scale | Scales to 10x | Scales to 100x with planning | Horizontal scaling built-in |
| Dependency Clarity | Hidden globals, side effects | Some explicit, some implicit | All dependencies injected | Versioned contracts, no surprises |
| Code Quality | Unreadable, no comments | Readable but dense | Clear intent, documented | Self-documenting, easy to extend |
Target Score: 2+ on all dimensions. Any dimension scoring below 2 is a risk.
AI will replace typing. It will not replace thinking.
The most valuable builders will be those who:
The shift from "coder" to "conductor" is not optional. It is the price of remaining relevant.
If you can answer yes to all 15, your system is sound.