Install

```
openclaw skills install skill-107
```

Design and apply replication, partitioning, consensus, failure recovery, and message ordering patterns for reliable, scalable distributed systems.

Quality Grade: 94-95/100
Author: OpenClaw Assistant
Last Updated: March 2026
Difficulty: Advanced (requires architectural thinking, trade-off analysis)
Distributed System Patterns are proven solutions to recurring problems in systems that span multiple machines, networks, and datacenters. As systems scale beyond single machines, coordination, fault tolerance, and consistency become non-negotiable.
This skill covers replication strategies, partitioning schemes, consensus protocols, failure-recovery patterns (idempotency, retries, circuit breakers), and message ordering guarantees.
Leader-follower (single-leader) replication

How it works: One node (the leader) accepts all writes and ships them to follower replicas, which serve read traffic.
Trade-offs: Simple consistency story and cheap read scaling, but the leader is a write bottleneck and failover adds complexity.
When to use: Read-heavy workloads, geographic distribution, backup resilience
Multi-leader replication

How it works: Several nodes accept writes concurrently and asynchronously exchange changes with each other.
Trade-offs: Writes stay available through node and network failures, but concurrent writes can conflict and must be resolved.
When to use: High availability needs, offline-first systems, global distribution
Quorum (leaderless) replication

How it works: Each write goes to W of N replicas and each read queries R replicas; choosing W + R > N guarantees every read quorum overlaps the latest write.
Trade-offs: No single point of failure and tunable consistency, but every operation pays multi-replica latency.
When to use: Consistent reads critical, moderate write frequency
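The overlap condition above can be checked directly. A minimal sketch (the function name `quorum_overlaps` is ours, and `N`, `W`, `R` are the standard quorum parameters, not a specific library's API):

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """A read quorum of R replicas must share at least one replica with
    any write quorum of W replicas whenever W + R > N, so reads are
    guaranteed to see the latest committed write."""
    return w + r > n

# Typical Dynamo-style configuration: N=3, W=2, R=2.
assert quorum_overlaps(3, 2, 2)      # quorums overlap: consistent reads
assert not quorum_overlaps(3, 1, 1)  # W=1, R=1: stale reads are possible
```

Tuning W down (and R up, or vice versa) shifts latency cost between the write and read paths while keeping the same guarantee.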
Range partitioning: split the key space into contiguous ranges.

Partition 0: UserIDs [0, 1000000)
Partition 1: UserIDs [1000000, 2000000)
Partition 2: UserIDs [2000000, ∞)

Pros: Simple, range queries efficient
Cons: Uneven distribution (hotspots), rebalancing expensive
Hash partitioning: partition = hash(key) % num_partitions

Pros: Even distribution, fast lookup
Cons: Range queries require full scan, rebalancing complex
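A small sketch shows why rebalancing under modulo hashing is expensive (the `partition` helper is illustrative, not a library API): changing `num_partitions` remaps almost every key.

```python
import hashlib

def partition(key: str, num_partitions: int) -> int:
    # Use a stable hash: Python's built-in hash() is salted per process.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest, "big") % num_partitions

keys = [f"user-{i}" for i in range(10_000)]
moved = sum(partition(k, 4) != partition(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys move when going from 4 to 5 partitions")
```

In expectation roughly 1 − 1/5 of keys relocate, which is what consistent hashing (next) is designed to avoid.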
Consistent hashing: nodes arranged in a ring; a key maps to the first node clockwise from its hash.

Adding or removing a node affects only adjacent partitions (~1/N of the data moves)

Pros: Minimal rebalancing, scalable additions
Cons: Uneven distribution without virtual nodes, algorithm complexity
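A minimal ring sketch, assuming MD5 as the hash and virtual nodes to smooth the distribution (the `HashRing` class and its method names are ours, not a particular library's):

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode()).digest(), "big")

class HashRing:
    """Consistent-hash ring; each node is placed at `vnodes` points."""

    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.lookup("user-42")  # deterministic across processes
```

Adding a fourth node to this ring moves only the keys that now hash between the new node's points and their predecessors, roughly 1/4 of the data rather than nearly all of it.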
Two-phase commit (2PC)

Flow: a coordinator asks every participant to prepare (vote yes/no and durably stage the pending change); only if all vote yes does it send commit, otherwise it sends abort to everyone.
Guarantees: Atomic across all participants
Problems: Blocking (participants hold locks while awaiting the coordinator's decision), not partition-tolerant, slow (two round trips plus forced disk writes)
Use case: Database transactions across shards
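An in-process sketch of the prepare/commit flow (the `Participant` and `two_phase_commit` names are illustrative; real participants persist their prepared state before voting yes):

```python
from enum import Enum

class Vote(Enum):
    YES = "yes"
    NO = "no"

class Participant:
    """Toy participant; a real one writes its prepared state to disk."""

    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.committed = False

    def prepare(self) -> Vote:
        return Vote.YES if self.will_commit else Vote.NO

    def commit(self):
        self.committed = True

    def abort(self):
        self.committed = False

def two_phase_commit(participants) -> bool:
    # Phase 1: collect votes. Any NO (or, in practice, a timeout) aborts.
    if all(p.prepare() is Vote.YES for p in participants):
        for p in participants:   # Phase 2: commit everywhere
            p.commit()
        return True
    for p in participants:       # Phase 2: abort everywhere
        p.abort()
    return False
```

The blocking problem is visible even in this sketch: between `prepare` and the phase-2 message, a participant cannot unilaterally decide, so a crashed coordinator leaves it stuck holding locks.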
Raft-style consensus

Leader election + log replication: nodes elect a leader by majority vote; the leader appends client commands to its log and replicates them to followers, and an entry is committed once a majority has stored it.
Guarantees: Safety (never lose committed data), liveness (will elect a leader whenever a majority of nodes can communicate)
Performance: Lower throughput than 2PC, but more resilient
Use case: Distributed consensus (etcd, Consul), metadata stores
CRDT-style conflict resolution

Approach: Assign unique IDs, track causal history
Guarantees: Automatic conflict resolution, commutative operations
Example: Vector clocks + last-write-wins for distributed counters
Use case: Collaborative editing, offline-first applications
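Tracking causal history with vector clocks can be sketched in a few lines (function names `vc_merge` and `vc_happens_before` are ours; a clock maps node IDs to event counters):

```python
def vc_merge(a: dict, b: dict) -> dict:
    """Pointwise max of two vector clocks: the combined causal history."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def vc_happens_before(a: dict, b: dict) -> bool:
    """True if a causally precedes b: a <= b everywhere, < somewhere."""
    nodes = a.keys() | b.keys()
    return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))

a = {"node1": 2, "node2": 1}
b = {"node1": 3, "node2": 1}
c = {"node1": 1, "node2": 5}

assert vc_happens_before(a, b)  # b has seen everything a saw: b wins
# a and c are concurrent: neither precedes the other, so a resolution
# rule (e.g. last-write-wins, or a CRDT merge) must break the tie.
assert not vc_happens_before(a, c) and not vc_happens_before(c, a)
```

Concurrent updates are exactly the case where a commutative merge function earns its keep: `vc_merge` gives the same result regardless of delivery order.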
Make operations repeatable so that a retried request returns the same result:

```python
idempotency_cache = {}  # in production: a shared store with a TTL, not a process-local dict

def transfer_funds(from_id, to_id, amount, idempotency_key):
    # Check: did we already process this key? If so, replay the stored result.
    # Membership test, not a truthiness check on .get(): a falsy cached
    # result (e.g. None or 0) must still be replayed, not recomputed.
    if idempotency_key in idempotency_cache:
        return idempotency_cache[idempotency_key]
    result = _do_transfer(from_id, to_id, amount)
    idempotency_cache[idempotency_key] = result
    return result
```
Key point: the idempotency key must be chosen by the client and stay the same across retries of the same logical request.
Attempt 1: immediate
Attempt 2: wait 1s
Attempt 3: wait 2s
Attempt 4: wait 4s
Attempt 5: wait 8s (give up if still failing)
Jitter: add random delay to avoid thundering herd
backoff_time = min(max_backoff, base * (2 ^ attempt)) + random(0, jitter)
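The schedule and formula above can be wrapped in a small retry helper. A sketch, assuming the operation raises on transient failure (the `retry_with_backoff` name and defaults are ours):

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base=1.0,
                       max_backoff=30.0, jitter=0.5):
    """Call `operation`; on failure sleep min(max_backoff, base * 2**attempt)
    plus a random jitter, then retry. Re-raises after the last attempt."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up: surface the last failure to the caller
            delay = (min(max_backoff, base * (2 ** attempt))
                     + random.uniform(0, jitter))
            time.sleep(delay)
```

With the defaults this reproduces the 1s/2s/4s/8s schedule; the jitter term spreads retries out so a fleet of clients doesn't hammer a recovering service in lockstep.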
State: CLOSED (normal) → OPEN (failing) → HALF_OPEN (testing)
CLOSED → OPEN: When error rate > threshold for duration
OPEN → HALF_OPEN: After cooldown period
HALF_OPEN → CLOSED: If test request succeeds
HALF_OPEN → OPEN: If test request fails
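The state machine above maps onto a small class. A sketch using a consecutive-failure count (production breakers track error *rate* over a sliding window; the `CircuitBreaker` name and parameters are ours):

```python
import time

CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

class CircuitBreaker:
    """Fail fast while OPEN; let one test request through after cooldown."""

    def __init__(self, failure_threshold=5, cooldown=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.state = CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == OPEN:
            if time.monotonic() - self.opened_at >= self.cooldown:
                self.state = HALF_OPEN  # cooldown over: allow a test request
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed test request, or too many consecutive failures,
            # (re)opens the circuit and restarts the cooldown timer.
            if self.state == HALF_OPEN or self.failures >= self.failure_threshold:
                self.state = OPEN
                self.opened_at = time.monotonic()
            raise
        else:
            self.state = CLOSED  # success closes the circuit
            self.failures = 0
            return result
```

The point of failing fast is to stop pouring retries into a struggling dependency, giving it headroom to recover instead of amplifying the outage.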
FIFO ordering: messages between two nodes arrive in send order.
Implementation: Sequence numbers, TCP guarantees
Causal ordering: if event A causally precedes B, A's message is delivered before B's.
Implementation: Vector clocks or version vectors
Total ordering: all nodes receive all messages in the same order.
Implementation: Consensus-based broadcast, sequencer node
Trade-offs: Ordering strength vs. latency cost
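The cheapest of these, FIFO via sequence numbers, is a per-sender reorder buffer. A sketch (the `FifoChannel` class is illustrative):

```python
class FifoChannel:
    """Deliver one sender's messages in sequence-number order,
    buffering anything that arrives early until the gap is filled."""

    def __init__(self):
        self.next_seq = 0
        self.pending = {}  # seq -> message, out-of-order arrivals

    def receive(self, seq: int, message):
        delivered = []
        self.pending[seq] = message
        while self.next_seq in self.pending:  # drain the contiguous run
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered

ch = FifoChannel()
assert ch.receive(1, "b") == []          # seq 1 arrived early: held back
assert ch.receive(0, "a") == ["a", "b"]  # gap filled: delivered in order
```

The latency cost in the trade-off above is visible here: message "b" sits in the buffer, undeliverable, until the missing earlier message arrives; causal and total ordering pay progressively more of this waiting.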
Distributed system patterns are essential vocabulary for building scalable, reliable systems. Understanding replication, partitioning, consensus, and failure recovery lets you design systems that survive failures, scale horizontally, and provide guarantees users can depend on.
Key Takeaway: Choose patterns based on your actual requirements, not ideals. The CAP theorem says that during a network partition a system must sacrifice either consistency or availability; since partitions in real networks are unavoidable, the practical choice is between C and A, not a free "pick two of three."