Updates to your schema aren't rolled out instantly. For server-side applications, you do a staged rollout (deploying the new version to a few nodes at a time) to avoid service downtime. For client-side applications, the user decides when to install the update.
That means old and new versions of the code will coexist, so we need to maintain compatibility in both directions:
- backward compatibility (newer code can read data written by older code)
- forward compatibility (older code can read data written by newer code)
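A minimal sketch of both directions, assuming a hypothetical record that has only a `name` field in version 1 and gains an optional `email` field in version 2. The function names and the JSON layout are illustrative assumptions, not part of any particular system.

```python
import json

# Hypothetical record layouts: v1 has only "name"; v2 adds an optional "email".
def encode_v1(name):
    return json.dumps({"name": name}).encode("utf-8")

def encode_v2(name, email):
    return json.dumps({"name": name, "email": email}).encode("utf-8")

def decode_v1(data):
    # Old code: reads the field it knows about and ignores everything else,
    # which is what keeps it forward compatible.
    record = json.loads(data)
    return {"name": record["name"]}

def decode_v2(data):
    # New code: falls back to a default for a field old writers never set,
    # which is what keeps it backward compatible.
    record = json.loads(data)
    return {"name": record["name"], "email": record.get("email")}

old_bytes = encode_v1("Ada")
new_bytes = encode_v2("Ada", "ada@example.com")

print(decode_v2(old_bytes))  # newer code reading data written by older code
print(decode_v1(new_bytes))  # older code reading data written by newer code
```

Forward compatibility is usually the harder of the two, because the old code has to be written in advance so that it tolerates additions it knows nothing about.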
Programs usually work with data in two different representations:
- in memory (data structures optimized for access and manipulation by the CPU, typically using pointers)
- as a self-contained sequence of bytes, when you want to write data to a file or send it over the network
Translating from the in-memory representation to bytes is encoding (serialization, marshalling); the reverse is decoding (parsing, deserialization, unmarshalling).
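As a small illustration of the two representations, here is a round trip using Python's standard `json` module. The `user` record is made up for the example.

```python
import json

# In-memory representation: a dict holding references to other objects.
user = {"id": 42, "name": "Ada", "interests": ["math", "engines"]}

# Encoding (serialization): turn the structure into a self-contained byte sequence
# that could be written to a file or sent over the network.
encoded = json.dumps(user).encode("utf-8")

# Decoding (deserialization): rebuild an equivalent structure from the bytes.
decoded = json.loads(encoded.decode("utf-8"))

print(type(encoded).__name__, len(encoded))
print(decoded == user, decoded is user)
```

The decoded value is equal to the original but is a separate object: nothing in the byte sequence depends on memory addresses in the process that wrote it.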
You generally shouldn't use a programming language's built-in encoding libraries (java.io.Serializable, Ruby's Marshal, Python's pickle, etc.): the encoding is tied to the language, decoding untrusted data is a security risk, versioning is an afterthought, and they are often inefficient.
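The security risk can be shown with a deliberately harmless sketch. The class name `Innocent` is hypothetical, and the callable used here is the built-in `len`; the point is only that unpickling calls whatever callable the byte stream names.

```python
import pickle

class Innocent:
    # __reduce__ tells pickle how to rebuild the object: "call this callable
    # with these arguments". Here the callable is the harmless built-in len,
    # but a malicious payload could name any importable function instead.
    def __reduce__(self):
        return (len, ("abc",))

payload = pickle.dumps(Innocent())

# Loading does not give back an Innocent instance; it runs len("abc").
result = pickle.loads(payload)
print(result)
```

This is why such formats should only ever be decoded from sources you trust.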
JSON, XML, and Binary Variants
JSON, XML and CSV are textual formats, with subtle problems:
- ambiguity around the encoding of numbers (XML and CSV don't distinguish numbers from strings; JSON does, but it doesn't distinguish integers from floating-point numbers)
- no support for binary data, except by encoding it as Base64 text, which increases the data size by 33%
- schemas are optional, so encoding/decoding logic is usually hardcoded in the application
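The first two problems are easy to reproduce with the Python standard library. This is a sketch with made-up sample values. Note that Python's own `json` module keeps integers exact, so the precision loss is shown by converting to a double, which is what parsers in languages such as JavaScript do.

```python
import base64
import csv
import io

# CSV has no number type: every cell comes back as a string.
row = next(csv.reader(io.StringIO("42,3.5,hello")))
print(row)

# A consumer that parses numbers as IEEE 754 doubles cannot represent
# every integer above 2**53, so two different integers collapse into one.
big = 2**53 + 1
print(float(big) == float(2**53))

# Binary data has to be wrapped in text, e.g. Base64:
# every 3 bytes become 4 characters.
raw = bytes(300)
wrapped = base64.b64encode(raw)
print(len(raw), len(wrapped))
```

The 300 raw bytes become 400 Base64 characters, which is the 33% overhead mentioned above.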
JSON and XML also have binary encodings (such as MessagePack or BSON for JSON), but the savings in space and parsing speed are small. We can do better with Apache Thrift and Protocol Buffers. These let you define a schema, from which a code generation tool produces classes in the programming language of your choice; those classes can then encode and decode records of the schema.
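To see where the space savings come from, here is a toy, hand-written encoder in the style of the Protocol Buffers wire format. It is not the Protocol Buffers library or its generated code. The function names, and the schema mapping field 1 to `user_name` and field 2 to `favorite_number`, are assumptions made for the illustration.

```python
import json

def encode_varint(n):
    # Variable-length integer for non-negative n: 7 bits per byte,
    # high bit set on every byte except the last.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_int_field(field_number, value):
    # Wire type 0 = varint. The tag carries the field number, not the field name.
    return encode_varint((field_number << 3) | 0) + encode_varint(value)

def encode_string_field(field_number, text):
    # Wire type 2 = length-delimited: tag, byte length, then the UTF-8 bytes.
    data = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(data)) + data

# Hypothetical schema: field 1 = user_name (string), field 2 = favorite_number (integer).
binary = encode_string_field(1, "Martin") + encode_int_field(2, 1337)
text = json.dumps({"user_name": "Martin", "favorite_number": 1337}).encode("utf-8")

print(binary.hex())
print(len(binary), len(text))
```

The binary record is much smaller because field names are replaced by small numeric tags that only make sense together with the schema. That is also why the schema has to be available to both the writer and the reader.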
