Updates to your schema aren’t rolled out instantly. For server-side applications, you do a staged rollout (deploy new version to few nodes at a time) to avoid service downtime. For client-side applications, the user decides when to install the update.

That means old and new versions of the code will coexist at the same time and therefore we need to maintain compatibility in both directions:

  • backward compatibility (newer code can read data written by older code)
  • forward compatibility (older code can read data written by newer code)

Programs work usually work with data in two different representations:

  • memory (data structures optimized for manipulation by CPU using pointers)
  • when u want to write data to file or send over network (bytes sequence) Translating between the two is encoding/serialization/marshalling and decoding/parsing/deserialization/unmarshalling.

You shouldn’t use a programming language’s encoding libraries (java.io.Serializible, Ruby’s Marshal, Python’s pickle, etc.) as the encoding is tied to the language, have security risks, don’t have versioning data, and are not efficient.

JSON, XML, and Binary Variants

JSON, XML and CSV are textual formats, with subtle problems:

  • ambiguity around encoding of numbers (xml and csv dont distinguish numbers from strings, json does but doesnt distinguish integers from floats)
  • they dont support binary data unless with base64 but that increases data size by 33%
  • schemas are optional. so encoding/decoding logic is usually hardcoded

JSON and XML can be encoded in binary format, but savings in space and parsing speed are negligible. We can do better with Apache Thrift and Protocol Buffers. These let you define schemas that then are used to generate classes in the programming language of your choice, that can then be used to encode/decode records of the schema.