Install
openclaw skills install bookforge-distributed-failure-analyzer
Diagnose distributed system failures caused by network faults, unreliable clocks, or process pauses, and map each to its correct mitigation. Use when: a node is intermittently timing out with no clear network outage; a lock-holder or leader keeps acting after being declared dead (zombie leader / split brain via distributed locking, not replication topology; use replication-failure-analyzer for replica split brain); stale reads persist beyond expected replication lag; wall-clock-based lease checks or last-write-wins conflict resolution is producing data loss under clock skew; or cascading node-death declarations are occurring under load. Also use proactively to audit timing assumptions in new system designs (absence of fencing tokens, NTP drift exposure, GC pause risk). Distinct from replication-failure-analyzer (replication lag anomalies, failover pitfalls, quorum edge cases). Produces a structured failure report: symptom → fault category → mechanism → mitigation. Covers: asynchronous network behavior, timeout tuning and cascade risk, NTP drift and clock jump mechanics, process pause causes (GC, VM migration, paging, SIGSTOP), fencing tokens with ZooKeeper zxid/cversion, Byzantine fault scoping, and system model selection (crash-stop vs. crash-recovery vs. Byzantine; synchronous vs. partially synchronous vs. asynchronous).
You are diagnosing or preventing a failure in a distributed system and the root cause is not immediately obvious. The symptom could be:
A timeout that might mean the remote node is dead, or might mean the network is congested, or might mean the node is alive but paused.
A write that appeared to succeed but whose data is now missing.
A leader that is still writing after another leader was elected.
A lock that was held correctly but still allowed two writers simultaneously.
This skill imposes a diagnostic framework: every unexplained distributed system failure traces to one of three root fault categories — network faults, clock unreliability, or process pauses — and each category has a bounded set of mechanisms and well-understood mitigations. The skill maps symptoms to categories, categories to mechanisms, and mechanisms to concrete fixes.
Use it reactively (incident post-mortem, production debugging) or proactively (design review, codebase audit for timing anti-patterns).
Cross-references:
replication-failure-analyzer: for failures specific to replication lag, failover, and quorum behavior
consistency-model-selector: for selecting isolation and consistency guarantees that prevent a class of failures at the application layer
Before analysis, collect the following. Ask the user for any that are missing.
Required:
Useful:
4. Infrastructure configuration: timeout values, NTP setup, GC settings (heap size, collector type), VM migration policies.
5. Relevant code: any section that checks wall-clock time, holds a lease or lock, does conflict resolution (especially last-write-wins), or makes assumptions about execution timing.
6. Logs or metrics: round-trip time distributions, GC pause logs, NTP offset metrics, CPU steal time.
If no codebase is available: accept a verbal architecture description and produce an analysis based on stated behavior patterns. The output will note which findings are speculative vs. confirmed.
WHY: Distributed failures are non-deterministic and the same symptom can arise from multiple root causes. The three-category taxonomy forces explicit elimination: you cannot address "network issues" without first determining which of the six network failure modes is actually occurring, or whether it is a process pause mimicking a network failure.
For each reported symptom, classify it against this taxonomy:
Category A: Network faults
The network is an asynchronous packet network with unbounded delays. When a request is sent and no response arrives, it is impossible to distinguish which of these occurred:
Key diagnostic signal: timeouts tell you nothing about whether the remote node executed your request. A timeout means you gave up waiting; it does not mean the request failed.
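The diagnostic signal above can be illustrated with a small simulation. This is a sketch only: `FlakyCounterServer`, `ResponseLost`, and the `drop_response` flag are hypothetical names invented for the illustration, not part of any real library. It shows a request that executes on the server even though the client sees a timeout.

```python
class ResponseLost(Exception):
    """Stands in for a client-side timeout: no reply arrived in time."""


class FlakyCounterServer:
    """Hypothetical in-memory server that applies a request but may lose the reply."""

    def __init__(self):
        self.value = 0

    def increment(self, drop_response=False):
        # The request is executed before the reply is (possibly) lost.
        self.value += 1
        if drop_response:
            raise ResponseLost("client gave up waiting")
        return self.value


def naive_client(server):
    """Treats a timeout as 'request failed' and retries blindly."""
    try:
        return server.increment(drop_response=True)
    except ResponseLost:
        # Wrong assumption: the first request did execute on the server.
        return server.increment()


server = FlakyCounterServer()
naive_client(server)
# The client intended one increment, but the server applied two.
```

The client cannot tell this case apart from one where the request itself was lost, which is why a timeout alone is not evidence of failure.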
Category B: Clock unreliability
Each node has its own hardware clock (quartz oscillator). Clocks drift and can jump:
Category C: Process pauses
A thread can be preempted at any point in its execution and paused for an arbitrary duration without being notified. Pause causes:
After categorizing, state the most probable category per symptom and note any ambiguity.
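One possible way to record the outcome of this step is a small data structure per symptom. The class and field names below are assumptions made for illustration; the skill does not prescribe a format.

```python
from dataclasses import dataclass, field
from enum import Enum


class FaultCategory(Enum):
    NETWORK = "network fault"
    CLOCK = "clock unreliability"
    PROCESS_PAUSE = "process pause"


@dataclass
class SymptomClassification:
    symptom: str
    most_probable: FaultCategory
    # Other categories that cannot yet be ruled out.
    also_possible: list = field(default_factory=list)
    notes: str = ""

    def is_ambiguous(self):
        return bool(self.also_possible)


finding = SymptomClassification(
    symptom="node intermittently times out with no network outage",
    most_probable=FaultCategory.PROCESS_PAUSE,
    also_possible=[FaultCategory.NETWORK],
    notes="check GC pause logs and CPU steal time before tuning timeouts",
)
```

Keeping the ambiguity explicit makes it harder to skip the elimination work that the taxonomy is meant to force.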
WHY: The category narrows the space; the mechanism identifies the specific causal chain and determines which mitigation is effective. A "clock unreliability" failure caused by NTP being blocked at the firewall needs a different fix than one caused by VM time virtualization.
For network faults:
For clock unreliability:
For process pauses:
WHY: Many distributed system bugs are latent — they exist in the code but only manifest under specific timing conditions (high load, GC pressure, VM migration). Proactive scanning finds them before they trigger in production.
Search for:
Anti-pattern 1: Wall-clock time used for cross-node event ordering
# Look for: System.currentTimeMillis(), time.time(), Date.now(), new Date()
# used to timestamp events that are replicated or compared across nodes
Risk: If last-write-wins is the conflict resolution strategy and timestamps come from node clocks, any clock skew causes silent data loss.
Fix: Use logical clocks (version vectors, Lamport timestamps) for causal ordering. Use physical clocks only for duration measurement (monotonic clock).
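As a minimal sketch of the logical-clock alternative, here is a Lamport clock. The class and method names are illustrative, not taken from any particular library.

```python
class LamportClock:
    """Logical clock: a counter that only moves forward, independent of wall time."""

    def __init__(self):
        self.counter = 0

    def tick(self):
        """Advance for a local event or before sending a message."""
        self.counter += 1
        return self.counter

    def receive(self, remote_counter):
        """Merge the counter carried by an incoming message."""
        self.counter = max(self.counter, remote_counter) + 1
        return self.counter


node_a, node_b = LamportClock(), LamportClock()
sent = node_a.tick()              # A writes and sends its timestamp
received = node_b.receive(sent)   # B receives the message
later = node_b.tick()             # B's follow-up write
```

Because `receive` always moves past the sender's counter, an event that causally follows another gets a larger timestamp regardless of how the nodes' physical clocks are set. Lamport timestamps alone cannot tell whether two events were concurrent; version vectors are needed for that.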
Anti-pattern 2: Client-side lease/lock validity check before a protected operation
# Look for: if (lease.isValid()) { ... do protected operation ... }
# or: if (System.currentTimeMillis() < leaseExpiryTime) { ... }
Risk: A process pause between the check and the operation can cause the lease to expire. The node proceeds, believing it still holds the lock, while another node has already acquired it.
Fix: The resource itself must enforce the fencing token (see Step 4). Client-side checks are defense-in-depth only, not a correctness guarantee.
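The race can be reproduced deterministically with an injectable clock. Everything in this sketch (`FakeClock`, `Lease`, `write_if_lease_valid`, and the durations) is hypothetical and exists only to simulate a pause between the check and the write.

```python
class FakeClock:
    """Hypothetical injectable clock so the pause can be simulated deterministically."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class Lease:
    def __init__(self, clock, duration):
        self.clock = clock
        self.expires_at = clock.now + duration

    def is_valid(self):
        return self.clock.now < self.expires_at


def write_if_lease_valid(lease, clock, pause_seconds):
    """Check-then-act: returns (check_passed, lease_valid_at_write_time)."""
    check_passed = lease.is_valid()
    if not check_passed:
        return False, False
    # Simulated stop-the-world pause between the check and the write.
    clock.advance(pause_seconds)
    # The write would happen here; the holder never re-checks.
    return True, lease.is_valid()


clock = FakeClock()
lease = Lease(clock, duration=10.0)
check_passed, still_valid = write_if_lease_valid(lease, clock, pause_seconds=15.0)
```

The check passes, yet the lease has expired by the time the write would run. Re-checking immediately before the write only narrows the window; it does not close it.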
Anti-pattern 3: Timeout values hard-coded without accounting for load variability
# Look for: timeout = 5000 (fixed constant), no jitter, no adaptive adjustment
Risk: Timeouts calibrated for p50 latency will fire on p99 latency spikes, causing false node-death declarations.
Fix: Measure round-trip time distribution empirically. Set timeout at p99 + safety margin. Consider adaptive failure detectors (Phi Accrual, used in Akka and Cassandra) that adjust based on observed jitter.
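A rough sketch of deriving a timeout from measured round-trip times, using Python's standard `statistics` module. The function name, the sample values, and the 50 ms margin are assumptions chosen for illustration, not recommended settings.

```python
import statistics


def timeout_from_samples(rtt_ms, margin_ms):
    """Return a timeout in ms: the 99th-percentile RTT plus a safety margin."""
    # quantiles(n=100) returns 99 cut points; the last one is the 99th percentile.
    p99 = statistics.quantiles(rtt_ms, n=100)[-1]
    return p99 + margin_ms


# Hypothetical samples: mostly 10 ms, with two slow outliers.
samples = [10.0] * 98 + [200.0, 400.0]
median_ms = statistics.median(samples)
timeout_ms = timeout_from_samples(samples, margin_ms=50.0)
```

A timeout set near `median_ms` would fire on both slow samples even though the remote node answered. In practice the samples should come from production traffic under load, since the tail is what matters.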
Anti-pattern 4: Distributed locking without fencing tokens
# Look for: acquire lock → use resource → release lock
# without any monotonically-increasing token passed to the resource
Risk: A zombie lock-holder (paused during lock hold, revived after expiry) will corrupt shared state by writing concurrently with the legitimate new lock-holder.
Fix: See Step 4.
WHY: Any lease or lock system that relies on the lock-holder checking its own lease status is vulnerable to process pauses. The lock-holder cannot detect that it was paused; the check passes because, from the holder's perspective, no time has elapsed. The only correct solution puts enforcement in the resource, not the client.
The fencing token pattern:
Every time the lock service grants a lock or lease, it returns a fencing token — a monotonically increasing integer (each new grant increments the counter). The lock-holder includes its token in every request to the protected resource. The resource tracks the highest token it has seen and rejects any request with a lower token.
Concrete mechanics:
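The pattern described above can be sketched in a few lines. `LockService`, `FencedStorage`, and `StaleTokenError` are hypothetical in-memory stand-ins, not a real lock service or storage API; the point is only that the resource, not the client, performs the check.

```python
class StaleTokenError(Exception):
    """Raised by the resource when a request carries an outdated fencing token."""


class LockService:
    """Hypothetical lock service: each grant returns a larger token."""

    def __init__(self):
        self._counter = 0

    def grant(self):
        self._counter += 1
        return self._counter


class FencedStorage:
    """Hypothetical resource that enforces fencing tokens on every write."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        # Reject any request whose token is lower than one already seen.
        if token < self.highest_token:
            raise StaleTokenError(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value


locks = LockService()
storage = FencedStorage()
token_a = locks.grant()   # client A is granted the lock, then pauses
token_b = locks.grant()   # A's lease expires; client B is granted the lock
storage.write(token_b, "file", "from B")
try:
    storage.write(token_a, "file", "from A (zombie)")
    zombie_rejected = False
except StaleTokenError:
    zombie_rejected = True
```

Client A never learns that it was paused; the write is rejected anyway because the storage has already seen a higher token.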
ZooKeeper integration: If ZooKeeper is used as the lock service, the transaction ID zxid or the node version cversion can serve as the fencing token. Both are guaranteed to be monotonically increasing.
When the resource does not support fencing tokens natively: For a file storage service, encode the fencing token in the filename or as a conditional write (compare-and-swap on a version field). Some kind of server-side enforcement is required — client-side enforcement alone is not sufficient for correctness.
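A conditional write on a version field might look like the following sketch. `VersionedObjectStore` and its methods are hypothetical; a real storage service would have to perform the compare and the write atomically on the server side, which this single-process example does not demonstrate.

```python
class VersionConflict(Exception):
    """Raised when the stored version no longer matches what the writer read."""


class VersionedObjectStore:
    """Hypothetical single-process stand-in for a store with conditional writes."""

    def __init__(self):
        self._objects = {}  # key -> (version, value)

    def read(self, key):
        # Objects that do not exist yet are reported as version 0.
        return self._objects.get(key, (0, None))

    def write_if_version(self, key, expected_version, value):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(f"expected {expected_version}, found {current_version}")
        self._objects[key] = (current_version + 1, value)
        return current_version + 1


store = VersionedObjectStore()
version_seen_by_a, _ = store.read("report")   # A reads, then pauses
version_seen_by_b, _ = store.read("report")   # B reads the same version
store.write_if_version("report", version_seen_by_b, "from B")
try:
    store.write_if_version("report", version_seen_by_a, "from A (stale)")
    stale_write_rejected = False
except VersionConflict:
    stale_write_rejected = True
```

The stale writer is rejected because the version it read is no longer current, without the store needing to know anything about the lock service.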
Important: Fencing tokens protect against inadvertent zombie behavior (a node that does not know it has been declared dead). They do not protect against deliberate misbehavior — that requires Byzantine fault tolerance (see Step 5).
WHY: Byzantine fault tolerance is expensive and complex. For most datacenter systems it is unnecessary. The analysis must be explicit about whether Byzantine faults are in scope, so engineering effort is not misallocated.
Byzantine faults are relevant when:
Byzantine faults are NOT relevant for most server-side data systems when:
For typical datacenter systems: assume crash-recovery faults (nodes may fail and restart, stable storage survives crashes, in-memory state is lost on crash), not Byzantine faults. Standard authentication, checksums in the application protocol, and NTP with multiple servers cover the "weak lying" cases (corrupted packets, misconfigured servers) without the overhead of full Byzantine fault-tolerant protocols.
WHY: Distributed algorithms are designed for specific system model assumptions. Running an algorithm in a system whose actual behavior violates its model assumptions causes correctness failures. This step makes the model explicit so algorithm selection is grounded.
Timing models:
| Model | Assumption | Reality fit |
|---|---|---|
| Synchronous | Bounded network delay, bounded process pauses, bounded clock error | Not realistic for most packet networks and commodity hardware |
| Partially synchronous | Usually synchronous, occasionally exceeds bounds | Realistic for most production systems |
| Asynchronous | No timing assumptions, no timeouts | Very restrictive; few practical algorithms can operate without timeouts |
Node failure models:
| Model | Assumption | When to use |
|---|---|---|
| Crash-stop | Node fails by stopping; never recovers | Simplest; safe assumption for algorithm design even if nodes do recover |
| Crash-recovery | Nodes may crash and restart; stable storage survives; in-memory state is lost | Realistic for server-side systems with durable storage |
| Byzantine | Nodes may do anything, including sending false messages | Only for multi-party untrusted environments or high-radiation hardware |
Recommended default for most server-side data systems: partially synchronous model + crash-recovery faults.
State the chosen model explicitly in the analysis report. Any algorithm the team adopts (leader election, distributed locking, consensus) must be evaluated against this model.
Output a structured report with:
These are the mistakes most frequently made when reasoning about distributed failures. Treat each as an active risk to check.
"The network is reliable inside our datacenter — this must be a software bug." Network faults occur inside datacenters. One study found about 12 network faults per month in a medium-sized datacenter, half of which disconnected an entire rack. Redundant networking gear does not reduce faults proportionally because human error (misconfiguration) is a primary cause.
"The timeout fired, so the node must be dead." A timeout means you stopped waiting. The remote node may have received your request and processed it, with only the response being lost or delayed. Acting on a false timeout (e.g., promoting a new leader) can create two leaders simultaneously.
"We use NTP so our clocks are synchronized." NTP accuracy is limited by network round-trip time. On a congested network, NTP error can exceed 100ms. VMs are paused by the hypervisor and see their clock jump forward when resumed. Clocks behind a firewall that blocks NTP silently drift. "Synchronized" does not mean "accurate to the millisecond."
"The node checked its lease and it was valid, so it was safe to proceed." A check-then-act sequence is not atomic in a distributed system. A GC pause, VM migration, or OS context switch between the check and the act can make the lease expire. Only server-side enforcement (fencing tokens) provides correctness.
"Last-write-wins is fine because our timestamps are accurate enough." LWW silently discards writes: a node with a lagging clock cannot overwrite a value previously written by a node with a fast clock, because its later write carries the smaller timestamp and loses. Clock skew between nodes of under 3ms is enough to cause this. The application receives no error. The data is simply gone.
"The node is dead — it stopped responding." The node may be in a stop-the-world GC pause. It will resume, discover that it was declared dead, and attempt to continue its previous role. Without fencing tokens, this zombie behavior can corrupt state.
"We need Byzantine fault tolerance because we can't trust all nodes." In a datacenter where your organization controls all nodes, Byzantine fault tolerance is almost certainly not needed and its cost (algorithmic complexity, performance overhead) is not justified. Standard authentication and checksums handle the realistic "lying" cases.
Scenario: A team runs Cassandra with multi-datacenter replication. After a network partition heals, some writes from one datacenter appear to be silently overwritten or lost. No errors were logged.
Trigger: Post-incident investigation after user complaints about missing data. Developer wonders if there is a replication bug.
Process:
Output: Failure analysis report identifying LWW + clock skew as root cause. Migration plan to version-vector-based conflict resolution. Alert thresholds for inter-datacenter clock offset monitoring.
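The root cause in this scenario can be reproduced in a few lines. This is a simplified model, not Cassandra's implementation: `LwwRegister` is a hypothetical class, and the 3 ms skew and the timestamps are made-up values.

```python
class LwwRegister:
    """Hypothetical last-write-wins register keyed on writer-supplied timestamps."""

    def __init__(self):
        self.timestamp = None
        self.value = None

    def apply(self, timestamp, value):
        """Keep the write only if its timestamp is newer; report whether it was kept."""
        if self.timestamp is None or timestamp > self.timestamp:
            self.timestamp, self.value = timestamp, value
            return True
        return False


# Assumed setup: node A's clock runs 3 ms fast, node B's clock is accurate.
SKEW_A_MS = 3
register = LwwRegister()

# Real time 1000 ms: A writes, stamping the write with its fast clock.
kept_first = register.apply(timestamp=1000 + SKEW_A_MS, value="x=1 from A")
# Real time 1002 ms: B writes later, but its timestamp is smaller.
kept_second = register.apply(timestamp=1002, value="x=2 from B")
```

B's write happened later in real time and is dropped without any error, which matches the "silently overwritten or lost" symptom the team reported.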
Scenario: A service uses a ZooKeeper-based distributed lock before writing to shared object storage. Occasionally, two clients write to the same object simultaneously, corrupting it. The team has confirmed the locking code "looks correct."
Trigger: Data corruption incident. Developer audits the locking implementation.
Process:
lease.isValid(), finds the clock says only milliseconds have elapsed, concludes it still holds the lock, and continues writing. Two writers are now active simultaneously.
zxid is a suitable fencing token: it is monotonically increasing and returned with each lock grant. Pass the zxid with every write to the storage service. The storage service rejects writes with a zxid lower than the highest it has seen. Alternatively: use conditional writes (compare-and-swap on an object version field) at the storage layer to detect and reject stale writes.
Output: Failure analysis report. Code diff showing where to extract the zxid from ZooKeeper and pass it to storage writes. Storage service modification to track and enforce the fencing token.
Scenario: During a traffic spike, a distributed database cluster declares multiple nodes dead in rapid succession. The cluster degrades severely. When traffic drops, all nodes recover and show healthy, but the cluster has partially lost quorum and requires manual intervention.
Trigger: Post-incident review. SRE team wants to understand why nodes were declared dead when they were actually alive.
Process:
Output: Failure analysis report. Timeout recalibration recommendation with p99.9 measurement methodology. Configuration changes. Recommendation to evaluate Phi Accrual failure detector.
references/failure-taxonomy.md: complete taxonomy of network fault modes, clock error sources, and process pause causes, with detection signals and mitigation options
references/fencing-token-pattern.md: fencing token mechanics, ZooKeeper integration, implementation for storage services without native fencing support
references/system-models.md: timing models (synchronous, partially synchronous, asynchronous), node failure models (crash-stop, crash-recovery, Byzantine), safety vs. liveness properties
references/clock-pitfalls.md: NTP accuracy limits, LWW data loss mechanics, clock confidence intervals, Google Spanner TrueTime approach
This skill is licensed under CC-BY-SA-4.0. Source: BookForge (Designing Data-Intensive Applications by Martin Kleppmann).
Install related skills from ClawHub:
clawhub install bookforge-replication-strategy-selector
clawhub install bookforge-consistency-model-selector
Or install the full book set from GitHub: bookforge-skills