Install
openclaw skills install bookforge-data-integration-architectDesign the integration architecture for systems with multiple specialized data stores (Postgres, Elasticsearch, Redis, data warehouses) that must stay in sync. Use when deciding how data flows between components, avoiding dual writes, reasoning about correctness across system boundaries (idempotency, end-to-end operation identifiers), choosing between Lambda and Kappa architecture, or applying the "unbundling databases" pattern to compose specialized tools instead of relying on a single monolith. Trigger phrases: "how do I keep Postgres and Elasticsearch in sync?", "should I use CDC or event sourcing to propagate data?", "how do I avoid dual writes across microservices?", "my downstream systems are going out of sync — how do I fix the architecture?", "how do I design derived data pipelines?", "what is the system of record pattern?", "how do I integrate OLTP with a search index and an analytics warehouse?", "how do I design for end-to-end idempotency?". This is the capstone skill for data systems design — it synthesizes batch pipelines, stream integration, consistency, and replication into a single architecture recommendation. Produces a component map (systems of record vs derived views), data flow diagram, and correctness analysis. Does not replace batch-pipeline-designer or stream-processing-designer — delegates to them for pipeline internals.
openclaw skills install bookforge-data-integration-architectUse this skill when you are designing or evaluating an architecture where multiple specialized data stores need to work together and you must reason about how data flows between them, stays consistent, and remains correct in the face of faults.
Concrete preconditions:
Do not use this skill if you have a single-database workload with no integration requirements — use oltp-olap-workload-classifier and storage-engine-selector instead.
Before proceeding, gather the following. If any are unknown, ask the user.
Required:
Optional but valuable:
architecture.md / docker-compose.ymlIf dependencies are installed, invoke them to fill gaps:
consistency-model-selector → determine per-operation consistency requirementsreplication-strategy-selector → determine replication topology for the primary storebatch-pipeline-designer → design the batch processing layer if historical reprocessing is neededstream-processing-designer → design the stream processing layer if low-latency propagation is neededACTION: Identify every system that stores data and every transformation that moves data between systems.
Draw (or describe) the current state:
[Write source] → [Store A] → ??? → [Store B]
→ ??? → [Store C]
For each data path, classify the current integration mechanism:
WHY: You cannot improve what you cannot see. Dual writes are the most common source of drift between systems — if the write to Store B fails after the write to Store A succeeds, the stores diverge permanently. Making the dataflow explicit immediately surfaces where this risk exists.
ACTION: For each logical data entity (user, order, product, event), determine which single store is the system of record — the authoritative source of truth that other representations are derived from.
Apply this test: if Store A and Store B disagree about the value of an entity, which one is correct by definition?
Document the result as a table:
| Entity | System of Record | Derived Views |
|---|---|---|
| User profile | PostgreSQL users table | Elasticsearch user index, Redis session cache |
| Order | PostgreSQL orders table | Kafka order-events log, Snowflake orders_fact |
| Search index | (none — fully derived) | Elasticsearch ← PostgreSQL via CDC |
WHY: The system of record designation is not just documentation — it determines write authority. Only the system of record should accept writes for its entities. All other stores are read-optimized derived views that are populated by processing the source of truth. This is what prevents split-brain inconsistency between stores.
ACTION: For each system of record → derived view pair, select the integration mechanism from this decision framework:
Decision tree:
Is the derived view latency requirement < 1 minute?
├─ YES → Use stream-based propagation (CDC or event sourcing)
│ → invoke stream-processing-designer for details
└─ NO → Can you afford to reprocess all historical data?
├─ YES → Consider batch ETL (simpler, replayable)
│ → invoke batch-pipeline-designer for details
└─ NO → Use incremental batch or hybrid (CDC + periodic batch)
For the propagation mechanism, choose:
| Mechanism | Use when | Trade-off |
|---|---|---|
| Change data capture (CDC) | Existing database cannot be changed; need low-latency propagation | Requires access to replication log; schema changes need care |
| Event sourcing | New system or greenfield; want full audit log; need replayability | Application must be redesigned around immutable events |
| Batch ETL | High latency acceptable; historical reprocessing needed regularly | Simple but creates lag windows; schema evolution is manual |
| Log-based messaging (Kafka) | High throughput; multiple consumers; need replay; decoupled teams | Operational complexity; ordering only within partition |
Key principle — prefer the event log over distributed transactions: Distributed transactions (2PC) across heterogeneous stores are fragile: they have poor fault tolerance (coordinator failure leaves participants in-doubt), poor performance (blocking protocol), and require all participating systems to speak the same transaction protocol. An ordered log of events with idempotent consumers achieves the same correctness properties with better fault isolation and looser coupling. A fault in one consumer is contained locally; it does not abort the writes to all other systems.
WHY: The integration mechanism determines whether failures in one part of the system cascade to other parts. Synchronous coupling (dual writes, distributed transactions) amplifies failures across system boundaries. Asynchronous event logs contain failures: a slow or failed consumer falls behind, but the producer and other consumers continue unaffected.
ACTION: Evaluate whether the requirements can be satisfied by a single integrated system or require a composed pipeline of specialized tools.
Apply this test:
Use a single system if:
Use a composed pipeline if:
The unbundling principle: A database internally implements secondary indexes, materialized views, replication logs, and caching. When you compose specialized systems — a primary database plus a search index plus an analytics warehouse plus a feature store — you are "unbundling" those features into separate, independently deployable components. The event log plays the role of the replication log that connects them.
The goal of unbundling is breadth: achieving good performance across a wider range of workloads than any single system supports. It is not to maximize the number of moving parts.
WHY: Many teams reach for multi-system architectures prematurely and add operational complexity without proportionate benefit. The decision must be grounded in actual access pattern requirements that cannot be satisfied by a simpler alternative.
ACTION: If a composed pipeline is warranted, select the processing architecture:
Lambda Architecture:
Kappa Architecture:
Single unified system:
WHY: Lambda architecture was an important idea that made event reprocessing central to data system design. But the practical problems — dual codebases, merged outputs, incremental batch complexity — are significant. The Kappa architecture achieves the same benefits (replayability, correctness, low latency) without the dual-system burden, provided the stream processor supports replay and exactly-once semantics.
ACTION: Identify which data flows require a total order and evaluate whether total order is achievable given your deployment constraints.
Total ordering is required when:
Constraints on total ordering:
| Constraint | Effect |
|---|---|
| Single leader | Total order is feasible; the leader serializes all writes |
| Multiple partitions | Order within a partition is guaranteed; across partitions is not |
| Multiple datacenters | Total order requires synchronous cross-datacenter coordination (high cost) |
| Independent microservices | Events originating in different services have no defined order |
For causal ordering without total order, consider:
consistency-model-selectorWHY: Total order broadcast is equivalent to consensus — it requires a single node (or Raft/Paxos cluster) to serialize all events. This scales well on a single machine but becomes a bottleneck at very high throughput or across geographically distributed systems. Understanding exactly which parts of your data flow require total ordering — versus which can tolerate partial ordering — lets you apply the strong guarantee only where it is necessary.
ACTION: Apply the end-to-end argument to the integration architecture. Verify that correctness guarantees are not assumed from any single layer.
The end-to-end argument: TCP suppresses duplicate packets within a connection. Databases provide transactional atomicity. Stream processors provide exactly-once semantics within the processing framework. But none of these individually prevent a user from submitting a duplicate request after a network timeout. Solving the problem requires passing a unique operation identifier all the way from the end-user client to the final data store.
Correctness checklist — apply to every data flow:
Idempotency at the consumer: Can the consumer safely process the same event twice? If not, implement idempotency using a deduplication table keyed by operation ID.
-- Pattern: unique constraint on request_id suppresses duplicates
ALTER TABLE requests ADD UNIQUE (request_id);
INSERT INTO requests (request_id, ...) VALUES ('uuid-here', ...);
-- Second identical insert fails at the constraint level — safe to retry
Operation identifier propagation: Is a unique operation ID generated by the client (UUID or hash of request fields) and passed through every hop — HTTP request → message broker → stream processor → database write?
Single-message atomicity for multi-partition operations: When an operation must affect multiple partitions (e.g., debit account A and credit account B), do not use distributed atomic commit. Instead:
Timeliness vs. integrity separation: Distinguish what must be linearizable (users must see their own writes immediately) from what requires only integrity (data must not be lost or corrupted). Violations of timeliness are eventually consistent; violations of integrity are permanent corruption.
Design the architecture so integrity is maintained in all cases, even if timeliness guarantees are weak.
Loose constraint enforcement: Not every uniqueness constraint requires synchronous linearizable enforcement. Evaluate whether the business cost of a constraint violation is recoverable:
For recoverable constraints, asynchronous detection and compensation (apology workflow) may be sufficient, enabling coordination-avoiding architectures with better availability and performance.
WHY: This is the most critical step and the most commonly skipped. Engineers assume that because their database provides transactions and their message broker provides exactly-once delivery, their system is correct. The end-to-end argument shows this is false: each layer handles its own scope, but correctness across the entire request path — from client to database — requires an explicit end-to-end mechanism. The operation identifier is the minimal such mechanism.
ACTION: Write an integration-architecture.md with three sections:
Section 1: Component Map List every data store with its role (system of record vs. derived view), the entity types it owns or serves, and the access patterns it satisfies.
[System of Record]
PostgreSQL — users, orders, products
Access: OLTP reads/writes, foreign key enforcement
[Derived Views — propagated via Kafka CDC]
Elasticsearch — full-text search on products and orders
Snowflake — orders_fact, products_fact for OLAP analytics
Redis — user session cache, hot product cache
[Processing Layer]
Kafka — event log (ordered message delivery, 7-day retention)
Debezium — CDC from PostgreSQL to Kafka
Apache Flink — stream processor (CDC → Elasticsearch, Snowflake)
Section 2: Data Flow Diagram For each entity type, show the write path (how new data enters) and the read path (how consumers access the derived view).
Write Path:
User action → Application server → PostgreSQL (system of record)
→ Debezium reads WAL → Kafka topic (ordered)
→ Flink consumer → Elasticsearch (search index)
→ Flink consumer → Snowflake (analytics)
Read Path:
Full-text search → Elasticsearch
OLAP query → Snowflake
OLTP query → PostgreSQL
Session lookup → Redis (populated from PostgreSQL on login)
Section 3: Correctness Analysis Document the ordering guarantees, idempotency mechanisms, and constraint enforcement strategy for each critical data flow.
Scenario: Online retailer with PostgreSQL for orders/products, Elasticsearch for product search, Snowflake for business analytics. Current architecture uses dual writes from application code. Products sometimes appear in search before inventory is updated; analytics dashboards occasionally show orders that do not exist in PostgreSQL.
Trigger: "Our Elasticsearch and PostgreSQL are drifting. Products show in search that are out of stock. How do we fix the architecture?"
Process:
Output: Replace dual writes with Debezium CDC pipeline. Elasticsearch and Snowflake become read-only derived views, populated exclusively from the Kafka event log. Drift is eliminated because both stores process the same ordered event sequence from the same source.
Scenario: Payment processing system where transferring money requires debiting one account (partition A) and crediting another (partition B). Current implementation uses two-phase commit across partitions; this causes availability problems when the coordinator fails.
Trigger: "We're using 2PC for cross-account transfers and it's killing our availability. How do we redesign this?"
Process:
Output: Remove 2PC. Achieve equivalent correctness (every transfer applied exactly once to both accounts) without cross-partition coordination. Availability improves because no coordinator failure mode exists.
Scenario: Social network with friendship status stored in service A, notification delivery in service B. Users report receiving notifications from people they have unfriended — the unfriend event and the message-send event are processed in the wrong order.
Trigger: "Users are getting messages from people they unfriended. The unfriend event seems to arrive after the message sometimes."
Process:
Output: Causal dependency captured via event identifiers. Notification service becomes causally consistent without requiring total ordering across all users.
Cross-skill references:
batch-pipeline-designer — design the batch reprocessing layer (historical data, schema migrations)stream-processing-designer — design the stream propagation layer (CDC, event sourcing, window types, join types)consistency-model-selector — determine per-operation consistency requirements; distinguish linearizability from serializability; select consensus mechanisms where neededreplication-strategy-selector — determine the replication topology of the primary system of recordtransaction-isolation-selector — evaluate isolation requirements for the system of record's OLTP workloaddistributed-failure-analyzer — diagnose correctness failures in the existing integrationSource: Designing Data-Intensive Applications, Martin Kleppmann (O'Reilly, 2017), Chapter 12: The Future of Data Systems, pp. 489–544.
This skill is licensed under CC-BY-SA-4.0. Source: BookForge — Designing Data-Intensive Applications by Martin Kleppmann.
Install related skills from ClawhHub:
clawhub install bookforge-batch-pipeline-designerclawhub install bookforge-stream-processing-designerclawhub install bookforge-consistency-model-selectorclawhub install bookforge-replication-strategy-selectorOr install the full book set from GitHub: bookforge-skills