Install
openclaw skills install bookforge-encoding-format-advisor
Select a data encoding format (JSON, Protobuf, Thrift, or Avro) and design a schema evolution strategy that preserves backward and forward compatibility through rolling upgrades. Use when asked "should I use Protobuf or JSON?", "how do I evolve my schema without breaking old clients?", "how does Avro schema evolution work?", "what's the difference between Thrift and Protocol Buffers?", or "how do I add/remove fields without breaking compatibility?" Also use for: choosing text vs. binary encoding for internal services; checking whether a schema change breaks compatibility; diagnosing unknown field loss bugs during rolling upgrades; planning per-dataflow encoding strategy (database storage vs. REST/RPC vs. message broker). Covers five encoding families: language-specific, JSON/XML/CSV, binary JSON, Thrift/Protobuf, and Avro, with writer/reader schema reconciliation and per-dataflow-mode analysis. For data model selection (relational/document/graph), use data-model-selector instead. For message broker or stream pipeline design, use stream-processing-designer instead.
You are designing or evolving a system that passes data between processes (over the network, through a message broker, or persisted to disk) and need to choose how to encode that data and how to evolve the schema over time without breaking running services.
This skill applies when:
This skill addresses format selection and schema evolution. For data shape decisions (relational vs. document vs. graph), use data-model-selector first. For stream processing pipeline design using encoded data, see stream-processing-designer.
Before running the selection framework, collect:
.proto, .thrift, .avsc, JSON Schema) or representative payload samples for analysis
If the dataflow mode and rolling upgrade requirement are missing, ask before proceeding. A format recommendation without knowing these two factors is unreliable.
Action: Determine how data flows between the processes you are encoding for. The three dataflow modes impose different constraints on format selection — especially on schema version negotiation and how long compatibility must be maintained.
WHY: Encoding format compatibility is a property of the relationship between a writer process and a reader process, and that relationship differs fundamentally across dataflow modes. In databases, data written today may be read five years from now by code using a different schema version: data outlives code. In synchronous service calls (REST/RPC), you can assume servers upgrade before clients, which reduces the requirement to backward compatibility on requests and forward compatibility on responses. In asynchronous message passing, a consumer may be processing a message written weeks ago by a producer that has since been upgraded or decommissioned. Each mode changes which compatibility properties you need, and therefore which formats are viable.
Dataflow Mode A — Databases (data-at-rest)
Dataflow Mode B — Synchronous service calls (REST, RPC)
Dataflow Mode C — Asynchronous message passing (message brokers, event streams)
Action: Score each of the five encoding families against six criteria relevant to your system. Score 1–5 per criterion per family. Skip "language-specific" family unless evaluating whether to use it (the answer is almost always no).
WHY: Engineers frequently default to JSON because it is familiar or default to Protocol Buffers because they "heard it's faster," without evaluating the actual criteria relevant to their system. The scoring forces examination of the criteria that change the decision: whether the schema must be dynamically generated (favors Avro over Protobuf), whether human-readability is required for debugging (favors JSON), whether the data crosses organizational boundaries (favors JSON/REST), whether statically typed code generation is valued (favors Thrift/Protobuf), and whether schema evolution flexibility is the primary constraint (favors Avro). Running all families — even the obvious misfits — produces the rationale needed for a technical decision document.
Encoding families:
| Family | Examples | Key characteristic |
|---|---|---|
| Language-specific | Java Serializable, Python pickle, Ruby Marshal | Built into the language; no cross-language support |
| Text-based | JSON, XML, CSV | Human-readable; self-describing; no schema required |
| Binary JSON variants | MessagePack, BSON, CBOR | JSON data model; binary encoding; still no schema |
| Schema-driven binary (tag-based) | Apache Thrift (BinaryProtocol, CompactProtocol), Protocol Buffers | Schema required; field tags in encoded data; compact |
| Schema-driven binary (name-based) | Apache Avro | Schema required; no tags in encoded data; writer and reader schemas resolved at decode time |
Scoring criteria:
1. Cross-language support — Can both writer and reader sides use this format regardless of programming language?
2. Schema evolution safety — Does the format provide explicit mechanisms for backward and forward compatibility as the schema changes?
3. Payload compactness — How large are encoded payloads compared to the logical data?
4. Human-readability and debuggability — Can engineers read and debug encoded data without tooling?
protoc --decode_raw, avro-tools)
5. Code generation and type safety — Does the format support generating typed structs or classes from a schema, enabling compile-time type checking?
6. Dynamically generated schema support — Can schemas be generated programmatically (e.g., from a database table definition) without manual field tag assignment?
Score each family:
| Criterion | JSON/XML | Binary JSON | Thrift/Protobuf | Avro |
|---|---|---|---|---|
| Cross-language support | [1-5] | [1-5] | [1-5] | [1-5] |
| Schema evolution safety | [1-5] | [1-5] | [1-5] | [1-5] |
| Payload compactness | [1-5] | [1-5] | [1-5] | [1-5] |
| Human-readability | [1-5] | [1-5] | [1-5] | [1-5] |
| Code generation/type safety | [1-5] | [1-5] | [1-5] | [1-5] |
| Dynamic schema support | [1-5] | [1-5] | [1-5] | [1-5] |
| Total | [6-30] | [6-30] | [6-30] | [6-30] |
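The totals row can be computed mechanically once the six criteria are scored. The following is a minimal sketch only: the criterion keys, the `total_scores` helper, and especially the example numbers are placeholders invented for illustration, not the pre-filled scores from the reference file.

```python
# Criterion keys are hypothetical short names for the six criteria above.
CRITERIA = [
    "cross_language", "schema_evolution", "compactness",
    "readability", "codegen", "dynamic_schema",
]


def total_scores(scores):
    """Sum the six criterion scores per family, checking the 1-5 range."""
    totals = {}
    for family, per_criterion in scores.items():
        missing = set(CRITERIA) - set(per_criterion)
        if missing:
            raise ValueError(f"{family} is missing scores for {sorted(missing)}")
        for name in CRITERIA:
            if not 1 <= per_criterion[name] <= 5:
                raise ValueError(f"{family}.{name} must be between 1 and 5")
        totals[family] = sum(per_criterion[name] for name in CRITERIA)
    return totals


# Placeholder scores for illustration only; score your own system.
example_scores = {
    "JSON/XML": dict(zip(CRITERIA, [5, 2, 2, 5, 2, 3])),
    "Avro": dict(zip(CRITERIA, [4, 5, 5, 1, 3, 5])),
}
example_totals = total_scores(example_scores)
```

A total is only a summary; the decision rules still decide the recommendation when a single criterion is a hard constraint.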
See references/format-comparison-table.md for pre-filled scores with rationale for each criterion, plus byte counts for the same record encoded in all five formats.
Action: Apply explicit if/then rules to produce a primary format recommendation. These rules encode the structural logic of the format characteristics — they are not heuristics but direct consequences of how each format handles field identification, schema negotiation, and type encoding.
WHY: Scoring produces numbers; decision rules produce a recommendation. The rules encode the non-obvious consequences of format choice: JSON's number ambiguity (integers vs. floats, no precision specification) causes silent data corruption at scale; Thrift/Protobuf's tag-based schema evolution is robust for most cases but breaks when schemas are generated dynamically (because tags must be managed manually); Avro's writer/reader schema resolution is powerful but requires a schema distribution mechanism (file header, schema registry, version negotiation) that must be built or operated. These consequences are expensive to discover after deployment.
Rule 1 — Use JSON (or REST/JSON) if ANY of the following are true:
Rule 2 — Use Protocol Buffers (Protobuf) if ALL of the following are true:
Rule 3 — Use Apache Thrift if ALL of the following are true:
Rule 4 — Use Apache Avro if ANY of the following are true:
Rule 5 — Avoid language-specific encodings (Java Serializable, Python pickle, Ruby Marshal) unless:
Tie-breaker when rules 2 and 4 both apply (schema-driven binary required, but dynamic generation is also needed): Choose Avro if schema generation frequency is high (schemas change when the source schema changes, e.g., database column added). Choose Protobuf if schema changes are infrequent and controlled by your team (field tag management overhead is acceptable).
Action: For each planned or expected schema change, check it against the per-format compatibility rules. Classify each change as: safe (backward and forward compatible), backward-only (new code reads old data, but not vice versa), forward-only (old code reads new data, but not vice versa), or breaking (incompatible in at least one direction).
WHY: The core problem encoding formats solve is not just efficiency — it is allowing old and new versions of code to coexist while reading the same data. During a rolling upgrade, some nodes run new code and some run old code; they write data to the same database or send messages to the same topic. Forward compatibility (old code reads data written by new code) is the harder direction: it requires old code to safely ignore additions made by new code rather than crashing. Each format handles this differently, and the permitted changes differ significantly.
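The four classifications follow directly from two yes/no questions: can new code read old data, and can old code read new data? A small sketch, with the function name `classify_change` chosen here purely for illustration:

```python
def classify_change(new_reads_old: bool, old_reads_new: bool) -> str:
    """Map the two compatibility directions onto the four classes used above.

    new_reads_old -> backward compatibility
    old_reads_new -> forward compatibility
    """
    if new_reads_old and old_reads_new:
        return "safe"
    if new_reads_old:
        return "backward-only"
    if old_reads_new:
        return "forward-only"
    return "breaking"
```

Answering the two questions separately for each planned change tends to surface the forward-compatibility cases that are easy to overlook.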
Field tags (the numbers = 1, = 2, = 3 in Protobuf; 1:, 2:, 3: in Thrift) are the identity of a field in the encoded data — not the field name. The encoded data contains only tags and values; names are only in the schema. This is what enables forward compatibility: a reader that sees an unknown tag number can skip that field using the type annotation to determine how many bytes to skip.
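To make the skipping mechanism concrete, here is a simplified sketch of Protobuf-style framing in which each field is a key `(tag << 3) | wire_type` followed by its value. It handles only non-negative varints and length-delimited values, and it assumes every length-delimited value is a UTF-8 string, which a real parser would not know without the schema. The function names and the example field names are illustrative assumptions, not a library API.

```python
VARINT, LENGTH_DELIMITED = 0, 2


def encode_varint(n):
    """Encode a non-negative integer 7 bits at a time, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_field(tag, value):
    if isinstance(value, int):
        return encode_varint(tag << 3 | VARINT) + encode_varint(value)
    data = value.encode("utf-8")
    return encode_varint(tag << 3 | LENGTH_DELIMITED) + encode_varint(len(data)) + data


def decode(buf, known_tags):
    """known_tags maps tag number -> field name in this reader's schema."""
    record, pos = {}, 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0x07
        if wire_type == VARINT:
            value, pos = decode_varint(buf, pos)
        elif wire_type == LENGTH_DELIMITED:
            length, pos = decode_varint(buf, pos)
            value, pos = buf[pos:pos + length].decode("utf-8"), pos + length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if tag in known_tags:
            record[known_tags[tag]] = value
        # Unknown tag: the wire type told us how many bytes to consume, so we skip it.
    return record


# A newer writer adds tag 4; an older reader only knows tags 1 and 2.
payload = encode_field(1, "Martin") + encode_field(2, 1337) + encode_field(4, "hacking")
old_view = decode(payload, {1: "user_name", 2: "favorite_number"})
new_view = decode(payload, {1: "user_name", 2: "favorite_number", 4: "interest"})
```

The old reader never sees a field name for tag 4, yet it can still step over the bytes because the wire type carries enough information to find the next key.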
Safe changes (backward and forward compatible):
required to optional — safe; required is a runtime check, not an encoding property
Backward compatible only (new code reads old data; old code cannot read new data):
Breaking changes — never do these:
required — data written by old code before the field existed will fail the required check when new code reads it; every new field added after initial deployment must be optional or have a default value
int32 to string) — type mismatch causes a parse error or silent truncation
Datatype change rules (Protobuf):
- int32 → int64: safe; new code fills missing high bits with zeros; old code reads 64-bit value into 32-bit variable (truncated if value exceeds 32-bit range)
- optional (single-value) → repeated (multi-value): safe; new code reading old data sees a list with zero or one elements; old code reading new data sees the last element of the list
- repeated → optional: safe in the reverse direction only if the new code handles a single-element list
Avro does not use field tags. The encoded data contains only values concatenated in schema field order, with no type annotations and no tags. The reader must have access to both the writer's schema (which defined the byte layout) and the reader's schema (which defines what the application expects). The Avro library resolves the difference by matching fields by name and filling defaults for missing fields.
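The name-matching step can be sketched without any Avro library. The example below assumes the bytes have already been decoded with the writer's schema into a plain dict, and it represents the reader's schema as a list of field dicts; `resolve_record` is a hypothetical helper, and real Avro performs this resolution while decoding and also handles type promotion and aliases, which this sketch omits.

```python
def resolve_record(decoded_with_writer_schema, reader_fields):
    """Project a writer-side record onto the reader's schema by field name."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in decoded_with_writer_schema:
            # Field present in both schemas: take the written value.
            resolved[name] = decoded_with_writer_schema[name]
        elif "default" in field:
            # Field only in the reader's schema: fill in the declared default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"reader field {name!r} has no default and was not written")
    # Fields that exist only in the writer's schema are ignored.
    return resolved


written = {"userName": "Martin", "favoriteNumber": 1337, "legacyFlag": True}
reader_fields = [
    {"name": "userName"},
    {"name": "favoriteNumber", "default": None},
    {"name": "interests", "default": []},
]
resolved = resolve_record(written, reader_fields)
```

This is why defaults carry so much weight in the compatibility rules: a missing default is the only case in which the sketch has to raise.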
Safe changes (backward and forward compatible):
Backward compatible only:
Forward compatible only:
Breaking changes:
Avro null handling: Avro does not allow null as a value for a field unless the field's type is a union that includes null (e.g., union { null, long } favoriteNumber = null). This is more explicit than Protocol Buffers' optional fields and prevents bugs by forcing you to declare nullability in the schema.
Avro schema distribution checklist:
JSON and XML have no built-in compatibility mechanism. Compatibility is achieved by convention and application discipline.
Safe by convention:
Common failure modes:
Action: Apply dataflow-specific guidance for the recommended format. The same format behaves differently in each mode, and there are additional rules and failure modes specific to each.
WHY: The encoded format is not used in isolation — it is used within a dataflow mode that imposes additional constraints. Ignoring these constraints produces systems that are correct in isolation but fail in production: a consumer that reprocesses Kafka messages without schema version handling will fail on old messages; a service that decodes a JSON request into a model object and re-encodes it for storage will silently drop unknown fields added by a newer client; a database that stores model objects will lose unknown fields when old code reads and re-writes a record that contains fields it doesn't understand.
Mode A — Databases:
Mode B — Synchronous service calls (REST, RPC):
/v1/, /v2/) for breaking changes; maintain deprecated versions with explicit sunset dates.
Mode C — Asynchronous message passing (message brokers):
Action: Write a structured recommendation covering format selection, compatibility assessment of planned changes, schema evolution plan, and dataflow-specific guidance. See the full output template in the three examples below.
WHY: A recommendation without explicit rationale cannot be reviewed or revised when requirements change. The schema evolution plan is especially important: it must specify not just which format to use, but the exact rules for each type of change the team will make over the system's lifetime, so that engineers making future changes have explicit guidance rather than relying on informal knowledge.
Required sections:
Related decisions: Data model shape → data-model-selector. Stream processing pipeline → stream-processing-designer.
These are the most common failure modes when selecting encoding formats and planning schema evolution. Review each before finalizing a recommendation.
Adding a required field after initial deployment (Thrift/Protobuf).
Every new field added after first deployment must be optional (or have a default). A required field will fail at parse time when reading old records that never wrote it. This is a silent misconfiguration: the schema compiles and tests pass with new test data, but fails at runtime on real old records. Rule: after initial deployment, required fields are permanently forbidden.
Reusing a field tag number (Thrift/Protobuf).
A retired field's tag number must be permanently marked reserved. Reusing a tag for a new field causes old data with the original field's bytes to be misinterpreted as the new field's type — silent data corruption or a parse error. Use reserved 3; reserved "old_field_name"; in every .proto file when removing a field.
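A schema change can be screened for tag misuse before it is merged. The sketch below is a hypothetical, deliberately conservative check written for illustration; it is not how any particular linter works, and it flags every type change even though some (such as int32 to int64) are permitted.

```python
def find_tag_violations(old_fields, new_fields, reserved_tags):
    """Each *_fields argument maps tag number -> (field name, type name)."""
    problems = []
    for tag, (name, type_name) in sorted(new_fields.items()):
        if tag in reserved_tags:
            problems.append(f"tag {tag} ({name}) reuses a reserved tag")
        elif tag in old_fields and old_fields[tag][1] != type_name:
            old_type = old_fields[tag][1]
            problems.append(f"tag {tag} ({name}) changed type from {old_type} to {type_name}")
    for tag, (name, _) in sorted(old_fields.items()):
        if tag not in new_fields and tag not in reserved_tags:
            problems.append(f"tag {tag} ({name}) was removed but not reserved")
    return problems


old = {1: ("user_name", "string"), 3: ("legacy_score", "float")}
removed_cleanly = find_tag_violations(old, {1: ("user_name", "string")}, {3})
removed_unreserved = find_tag_violations(old, {1: ("user_name", "string")}, set())
reused = find_tag_violations(old, {1: ("user_name", "string"), 3: ("rank", "int32")}, {3})
```

Renames pass this check untouched, which matches the rule that the tag, not the name, identifies a field.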
Avro field without a default value breaks compatibility in one direction. Adding a field without a default breaks backward compatibility (old writers didn't include it; readers have no fallback). Removing a field without a default breaks forward compatibility (new writers omit it; old readers have no fallback). Rule: every Avro field that may be added or removed across versions must declare a default value.
Unknown field loss in read-modify-write cycles (all formats). Reading a record into a typed model object, modifying one field, and writing back silently drops any fields the model type doesn't know about. Affects databases (old code reads and rewrites new records, drops new fields) and message brokers (consumer republishes a modified message, drops new producer fields). Protobuf parsers preserve unknown fields in a side-channel; Avro resolution ignores writer-only fields safely. In JSON, the struct must include an explicit "unknown fields" passthrough map.
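For JSON, the passthrough map mentioned above might look like the following sketch. The `UserRecord` type, its fields, and the `extras` attribute are assumptions made for illustration; the point is only that fields this code version does not recognise are carried through the read-modify-write cycle instead of being dropped.

```python
import json
from dataclasses import dataclass, field

KNOWN_FIELDS = ("id", "name")


@dataclass
class UserRecord:
    id: int
    name: str
    # Fields written by newer code that this version does not understand.
    extras: dict = field(default_factory=dict)


def from_json(text):
    raw = json.loads(text)
    known = {key: raw[key] for key in KNOWN_FIELDS}
    extras = {key: value for key, value in raw.items() if key not in KNOWN_FIELDS}
    return UserRecord(**known, extras=extras)


def to_json(record):
    out = dict(record.extras)  # start from the unknown fields so they survive
    out.update({"id": record.id, "name": record.name})
    return json.dumps(out)


# Old code renames the user; the newer "subscription_tier" field is preserved.
record = from_json('{"id": 7, "name": "Ada", "subscription_tier": "pro"}')
record.name = "Ada L."
rewritten = json.loads(to_json(record))
```

Without the `extras` map, the same rename would silently write back a record that no longer has `subscription_tier`.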
Number precision loss with JSON at scale.
Integers greater than 2^53 cannot be represented exactly in IEEE 754 double-precision float (the JavaScript Number type). Twitter returns tweet IDs as both a JSON number and a decimal string because JavaScript clients parse the numeric form incorrectly. Mitigation: string-encode large integers in JSON APIs, or use a format with explicit 64-bit integer types.
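The precision loss is easy to reproduce. Python's own `json` module keeps integers exact, so this sketch simulates a double-only parser by passing `parse_int=float`; the field names `id` and `id_str` are illustrative.

```python
import json

big_id = 2**53 + 1  # first integer a double cannot represent exactly


def encode_id(record_id):
    """Emit the ID both as a JSON number and as a decimal string."""
    return json.dumps({"id": record_id, "id_str": str(record_id)})


def parse_like_double_only_client(text):
    """Simulate a parser that stores every JSON number as an IEEE 754 double."""
    return json.loads(text, parse_int=float)


parsed = parse_like_double_only_client(encode_id(big_id))
numeric_survived = int(parsed["id"]) == big_id      # False: rounded to 2**53
string_survived = int(parsed["id_str"]) == big_id   # True: string is exact
```

The string form costs a few extra bytes per ID and removes the dependency on how each client's parser handles large numbers.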
Adopting binary format without schema version management (Avro). Avro requires a mechanism for the reader to obtain the writer's schema — file header, schema registry, or connection negotiation. Without this, Avro is unusable. Retrofitting schema version IDs into records after gigabytes of data have been written is expensive. Choose a schema distribution mechanism before writing the first record.
Switching to binary format to solve a performance problem that isn't encoding. For payloads under 1KB at under 10K requests/second, the encoding/decoding difference between JSON and Protobuf is negligible compared to network latency and business logic. Profile first. The operational cost of binary formats (schema management, debugging complexity) is only worth paying when encoding is confirmed as the bottleneck.
The compatibility direction that matters depends on the dataflow mode. In databases, both backward and forward compatibility are required simultaneously (data outlives code). In service calls, you can assume servers upgrade before clients (backward on requests, forward on responses). In async messaging, full bidirectional compatibility is required (decoupled producers and consumers at independent schema versions).
Field tags are a permanent commitment (Thrift/Protobuf). A field's tag number is its identity in the encoded data for the lifetime of the schema. It cannot be changed, cannot be reused after removal, and a required field cannot be removed. Treat tag assignments as permanent as column IDs in a relational database — they outlive any individual deployment.
Avro's writer/reader schema resolution requires infrastructure. Avro achieves the most compact encoding (32 bytes for the example record vs. 33 for Protobuf, 34 for Thrift CompactProtocol, 59 for Thrift BinaryProtocol, 66 for MessagePack, 81 for JSON) by omitting all field identification from the encoded bytes. The cost is that the reader must have access to the writer's schema. This is not optional: it is a hard requirement that must be designed for before adopting Avro.
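Using the byte counts listed for the example record in references/format-comparison-table.md, the relative savings against JSON can be worked out directly. These figures describe that one small example record; real payloads will differ, so treat the percentages as an order-of-magnitude guide.

```python
# Byte counts for the single example record, as listed in the references table.
ENCODED_SIZES = {
    "JSON": 81,
    "MessagePack": 66,
    "Thrift BinaryProtocol": 59,
    "Thrift CompactProtocol": 34,
    "Protocol Buffers": 33,
    "Avro": 32,
}


def savings_vs_json(sizes):
    """Percentage reduction in encoded size relative to the JSON encoding."""
    baseline = sizes["JSON"]
    return {
        name: round(100 * (1 - size / baseline), 1)
        for name, size in sizes.items()
        if name != "JSON"
    }


savings = savings_vs_json(ENCODED_SIZES)
```

The three schema-driven compact encodings land within a couple of bytes of each other, so compactness alone rarely decides between them.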
Data outlives code. A database record written today may be read five years from now by code that uses a schema three versions newer. A Kafka message written by a producer that has since been decommissioned may be replayed by a new consumer. The encoding format you choose today must support reading that data with future schema versions — not just today's.
Schemas are documentation. A schema registry of past schema versions is a historical record of every data structure the system has ever used. It serves as documentation that is guaranteed to be accurate (because decoding fails if it is wrong, unlike manually maintained documentation). Build schema versioning infrastructure even if you don't use it for compatibility checking immediately — the documentation value alone is worth it.
Scenario: A platform team is building a new internal recommendation service written in Go, consumed by a Java API gateway and a Python data pipeline. The service will undergo rolling upgrades — no fleet-wide restarts. The team expects to add fields to the recommendation response over time as the ML model evolves (adding score components, confidence intervals, explanation fields). Payload volume is high: ~50K recommendations/second.
Trigger: "Should we use JSON REST or gRPC Protobuf for our internal recommendation service? We need rolling upgrades."
Process:
- Add confidence_interval field: safe — new tag, optional, default 0.0
- Add explanation field (list of strings): safe — new tag, repeated field
- Remove legacy_score field: safe if retired tag is marked reserved; never reuse tag 3
Output (abbreviated):
## Encoding Format Decision
**System:** Recommendation Service → API Gateway, Data Pipeline
**Dataflow mode:** B — Synchronous service calls (internal)
### Recommended Format: Protocol Buffers (gRPC)
**Primary rationale:** Internal service with Go and Java consumers undergoing
rolling upgrades. Protobuf's field tags provide explicit backward/forward
compatibility; code generation gives typed structs in both languages; 33-byte
encoding vs 81-byte JSON reduces bandwidth at 50K req/sec by ~60%.
### Schema Evolution Plan
| Change | Safe? | Direction | Procedure |
|--------|-------|-----------|-----------|
| Add confidence_interval (float) | Yes | Both | New tag (e.g., 4), optional, default 0.0 |
| Add explanation (repeated string) | Yes | Both | New tag (e.g., 5), no default needed |
| Remove legacy_score | Yes | Both | Mark tag 3 as reserved; never reuse |
| Rename legacy_score to base_score | Yes | Both | Rename in .proto only; tag unchanged |
| Change score from float to double | No | Breaking | float (32-bit) and double (64-bit) use different wire types; add a new double field under a new tag instead |
### Ruled Out
**JSON/REST:** No schema-enforced compatibility; number precision issues for
score floats; human-readability benefit outweighed by 50K req/sec bandwidth
cost at this scale.
**Avro:** Dynamic schema generation not needed; schema registry adds
operational overhead not justified by dynamically generated schemas here.
### Implementation Checklist
- [ ] Define .proto schema file; assign initial field tags (never reuse)
- [ ] Add `reserved 3; reserved "legacy_score";` when legacy_score is removed
- [ ] Set up buf lint and buf breaking in CI to catch incompatible changes before merge
- [ ] Python data pipeline: use protobuf Python library (no code gen needed for dynamic language)
- [ ] Watch: if the schema starts being generated from the ML model's feature definition
rather than hand-authored, re-evaluate Avro (dynamic generation requires tag management)
Scenario: An e-commerce platform publishes OrderPlaced events to a Kafka topic. Three consumer services (inventory, fulfillment, analytics) subscribe to the topic. Services are deployed independently — the inventory service may be running version 1 of the schema when the order service publishes version 2. Messages are retained for 7 days. The analytics team wants to schema-dump the Kafka topic to Parquet files in a data lake.
Trigger: "We're publishing order events to Kafka. How do we handle schema changes when consumers deploy at different times?"
Process:
Output (abbreviated):
## Encoding Format Decision
**System:** Order events — Kafka topic, 3 consumers, data lake export
**Dataflow mode:** C — Async message passing
### Recommended Format: Apache Avro + Confluent Schema Registry
**Primary rationale:** Kafka with independent producers and consumers at
different schema versions requires simultaneous backward and forward compatibility.
Avro's name-based resolution handles field additions and removals with defaults;
schema registry provides schema distribution without embedding full schema in
every message; Avro object container files support archival export to data lake.
### Schema Evolution Plan
| Change | Safe? | Direction | Procedure |
|--------|-------|-----------|-----------|
| Add shipping_address field | Yes (if default) | Both | Add with default null; register new schema version first |
| Add discount_codes (array) | Yes (if default) | Both | Add with default [] (empty array) |
| Remove coupon_code (deprecated) | Yes (if had default) | Both | Confirm default exists; remove; add alias for old readers |
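| Rename order_id to orderId | Backward only | Backward | Rename to orderId in the reader schema and list "order_id" in its aliases; forward breaks |
| Change amount from int to long | Backward only | Backward | A reader using long can read int data; a reader still on int cannot read long, so upgrade all consumers first |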
### Dataflow-Specific Rules
- Register new schema version in registry BEFORE deploying the producer that uses it
- Consumers must implement "preserve unknown fields" pattern when republishing to downstream topics
- Data lake export: use Avro object container files (schema embedded once per file)
- Schema registry: use BACKWARD_TRANSITIVE compatibility mode (new schema must be compatible with ALL previous versions, not just the immediately preceding one)
### Ruled Out
**Protobuf:** Tag management adds friction when analytics team generates schemas
from Parquet column definitions — Avro's name-based approach maps column names
directly to field names.
**JSON:** No schema versioning mechanism; unknown field behavior is parser-dependent
(some parsers drop unknowns, some error); 7-day retention means old records will
definitely be processed by new consumers.
### Implementation Checklist
- [ ] Set up Confluent Schema Registry; configure BACKWARD_TRANSITIVE compatibility mode
- [ ] Write schema as Avro IDL; define default values for all fields
- [ ] Producer: register schema before first publish; include schema ID in message header
- [ ] Consumer: fetch schema by ID; implement unknown field preservation before republish
- [ ] Data lake export: use Avro object container files; Parquet conversion tool reads embedded schema
- [ ] Watch: if a consumer must republish without a schema registry, embed writer's schema version in message metadata instead
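The checklist asks producers to attach a schema ID and consumers to fetch the schema by that ID. One widely used way to carry the ID is the framing used by Confluent's serializers, which prefixes the Avro payload with a zero magic byte and a 4-byte big-endian schema ID. The sketch below shows only that framing; the helper names are illustrative, and the registry lookup and Avro decoding are left out.

```python
import struct

MAGIC_BYTE = 0
HEADER = struct.Struct(">bI")  # 1-byte magic + 4-byte big-endian schema ID


def frame(schema_id, avro_payload):
    """Prefix an already-encoded Avro payload with the schema ID header."""
    return HEADER.pack(MAGIC_BYTE, schema_id) + avro_payload


def unframe(message):
    """Return (schema_id, avro_payload); the caller then looks up the writer's schema."""
    if len(message) < HEADER.size:
        raise ValueError("message too short to contain a schema ID header")
    magic, schema_id = HEADER.unpack(message[:HEADER.size])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte {magic}")
    return schema_id, message[HEADER.size:]


framed = frame(42, b"\x0cMartin")
schema_id, payload = unframe(framed)
```

Five bytes per message is a small price compared with embedding the full writer's schema in every event.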
Scenario: A SaaS company exposes a REST API for third-party integrations. Clients are external developers who cannot be forced to upgrade. The team needs to add subscription_tier to Workspace and deprecate plan_name.
Trigger: "How do we evolve our REST API schema without breaking external clients?"
Process:
Add subscription_tier — safe (additive). Deprecate plan_name (keep it populated, mark deprecated in docs) — safe. Remove plan_name — breaking; requires /v2/ with a minimum 12-month sunset period for /v1/. Change user_id int → string — breaking; API version bump required.
Output (key sections):
Recommended Format: JSON (REST)
Dataflow mode: B — cross-organizational boundary
Schema Evolution Plan:
- Add subscription_tier: Safe — additive; lenient clients ignore unknown fields
- Deprecate plan_name (keep populated): Safe — document as deprecated in OpenAPI spec
- Remove plan_name: Breaking — /v2/ required; 12-month /v1/ sunset window
- Change user_id int → string: Breaking — /v2/ required; document migration guide
Dataflow rules:
- Never remove response fields without a versioned sunset period
- Adding optional request params is safe; adding required params is breaking
- Watch: if IDs exceed 2^53, return as both JSON number and decimal string (Twitter pattern)
Ruled out — Protobuf/Avro: external clients cannot be required to install code-gen
toolchains; binary format is not curl-testable.
| File | Contents |
|---|---|
| references/format-comparison-table.md | Full scoring matrix for all five encoding families; byte counts for the same example record in JSON (81 bytes), MessagePack (66 bytes), Thrift BinaryProtocol (59 bytes), Thrift CompactProtocol (34 bytes), Protocol Buffers (33 bytes), Avro (32 bytes); compatibility matrix comparing each format's handling of add/remove/rename/type-change operations |
| references/schema-evolution-rules.md | Complete per-format compatibility rule reference: all Protobuf/Thrift field tag rules, all Avro writer/reader schema resolution rules, JSON convention guidelines, with explicit permitted/prohibited change classification for each change type |
This skill is licensed under CC-BY-SA-4.0. Source: BookForge — Designing Data-Intensive Applications by Martin Kleppmann.
Install related skills from ClawHub:
clawhub install bookforge-data-model-selector
Or install the full book set from GitHub: bookforge-skills