Monday, 23 November 2015
Recently, while writing an internal protocol, I found myself prospecting for data formats. I'd usually opt for JSON as a great all rounder, being reasonably concise, human-readable, and widely supported, but this time I could afford to dig around for alternatives. After all, JSON makes passable config files, but I'd rather use TOML (which I also wrote a Scheme module for) given the right environment. The same can be true with data, and one of the alternative formats I uncovered, Bencode (pronounced B-Encode), turns out to be pretty neat.
Bencoding is used to format structured data in torrent files. It uses ASCII characters as delimiters and digits, making it easy to debug, and being unaffected by endianness it's easy to support cross-platform. Unlike the more space efficient Protocol Buffers, Bencode does not require a schema definition to parse, a restriction that makes incremental updates painful, it includes all the relevant keys in the message itself. Bencode can safely include binary data too, and as you'll see, is incredibly easy to serialize and parse.
In Bencode, strings are almost identical to Netstrings, a delightfully simple format for bytestrings, where the value is prefixed by it's length in bytes:
This means no character escaping is necessary (eg, NULL characters), and when parsing, the memory required is known in advance. In addition to bytestrings, Bencode provides integers, lists and dictionaries.
Integers use ASCII digits delimited by i and e:
Lists are just multiple bencoded values surrounded by l and e:
Similarly, dictionaries use d and e to wrap a sequence of keys and values:
Dictionary keys must be sorted in lexicographical order, meaning there is only one valid bencoding for a given (possibly complex) value. So bencoded values can be compared or hashed directly without the need to decode.
The simplicity of this system means it's easy to produce correct bencoded messages with minimal effort, even using printf() and friends, a very handy property in embedded environments. It also means a new parser can be added quickly when required, which is exactly what I did for CHICKEN Scheme, implementing a bencode module in very little time:
One advantage to Bencode's simple rules, apart from being easy to implement, is that it's easy to process. Parsing and serializing bencoded data is generally faster than the alternatives I've tried, not because of my great parser code, just as a result of a well considered encoding.
I made a quick test, decoding and then encoding some example data 100,000 times.
This is not chosen to reflect any common usage or benefit any particular parser, I'm just bashing on the keyboard here…
So, that's Bencode, a neat little format I hadn't noticed before. Other JSON alternatives you might consider are: MessagePack, a more complex binary format with better size efficiency though not human-readable, and Tagged Netstrings a backwards compatible extension to Netstrings developed by Zed Shaw as part of the Mongrel2 webserver project. Tagged Netstrings support more data types than Bencode but may not work well for streams since type information is located at the end of each value, not the start. Of course, most of the time I'll still be using JSON, and there's nothing wrong with that!