
Unary-encoded length prefix and big-endian serialization.

There is no good reason to use big-endian serialization. I think they are doing that because they only know about "count leading zeros (CLZ)" to extract the unary length prefix. "Count trailing zeros (CTZ)" is also a commonly supported operation (and can be efficiently emulated with CLZ if it does not exist) which makes it easy to extract the unary length prefix even if it is little-endian.
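For concreteness, here is a sketch of what a CTZ-friendly little-endian layout could look like. The function names are hypothetical; a little-endian host, GCC/Clang's `__builtin_ctzll`, and payloads up to 56 bits (1..8 encoded bytes) are assumed. The unary length prefix sits in the low bits, so the decoder recovers the length with count-trailing-zeros:

```c
#include <stdint.h>
#include <string.h>

/* Encode: n bytes carry 7*n payload bits; the lowest set bit is the
 * unary length marker, so ctz(word) == n - 1. */
static size_t le_varint_encode(uint64_t value, uint8_t out[8]) {
    size_t n = 1;
    while (n < 8 && value >> (7 * n)) n++;
    uint64_t encoded = (value << n) | (1ull << (n - 1));
    memcpy(out, &encoded, 8);                /* little-endian host assumed */
    return n;
}

/* Decode: preload 8 bytes, read the length from the trailing zeros,
 * then drop the prefix and mask off the bytes that belong to later data. */
static uint64_t le_varint_decode(const uint8_t in[8], size_t *len) {
    uint64_t word;
    memcpy(&word, in, 8);
    size_t n = (size_t)__builtin_ctzll(word) + 1;
    *len = n;
    return (word >> n) & ((1ull << (7 * n)) - 1);
}
```

Note the decoder never needs to know the length before loading: it loads a fixed 8 bytes and derives everything from the low bits.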



I can see a valid reason, specifically for variable-length encodings: big-endian makes it easy to compare compound keys lexicographically without decoding (little-endian can still do it, but everything has to be flipped to read from high to low addresses). Unfortunately, encoding continuation bits as 0 instead of 1 prevents that from working.
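The lexicographic property is easiest to see with fixed-width fields: when each field is stored big-endian, a plain `memcmp` over the concatenated bytes orders compound keys the same way as comparing the fields as integers. A minimal sketch with a hypothetical helper:

```c
#include <stdint.h>
#include <string.h>

/* Store a u32 big-endian: the byte order matches numeric significance,
 * so memcmp on the raw bytes agrees with integer comparison. */
static void put_be32(uint8_t out[4], uint32_t v) {
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}
```

A compound key (1, 500) then memcmp-sorts before (2, 3), because the first differing byte belongs to the more significant field — which is exactly what variable-length big-endian encodings try to preserve without decoding.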


That is just type abuse. A compound key is not a single N-bit big-endian value; you are just encoding it as such in memory because the two kind of sort of fit. Such automatic and implicit type punning is a very dangerous line to tread in wire formats, as it makes them very fragile. You are much better off being extremely precise about the encoding steps and bit patterns, and then maybe optimizing the entire wire format into simpler data types if it makes sense as a whole.

A better idea would be to define the individual key formats. You then either serialize them into lexicographically ordered varint keys and overlay the composite onto a byte stream, or you serialize them lexicographically into a composite "integer" which you then serialize as a "varint" (really a variable-sized byte stream serialized like a varint). These two ideas make the lexicographical ordering intent explicit and somewhat independent of the underlying byte format while still enabling efficient packed encoding.


> There is no good reason to use big-endian serialization

For this particular use case of variable length integers, I think there are very good reasons.

Easier to debug, because the encoding keeps the original byte order. Encode 0x01FF45 and you get the bytes 21 FF 45, which is very human-readable. Little-endian gets you the bytes 2C FA 0F, which are completely unrecognizable in a debugger or in files on disk.
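A sketch of a big-endian scheme consistent with those bytes (hypothetical function; the unary length prefix is folded into the leading bits of the first byte, so a decoder can read the length with count-leading-zeros, and `0x01FF45` comes out as `21 FF 45`):

```c
#include <stdint.h>
#include <stddef.h>

/* Encode: n bytes carry 7*n payload bits; setting bit 7*n plants the
 * unary marker, leaving n-1 leading zero bits for CLZ to count. */
static size_t be_varint_encode(uint64_t value, uint8_t out[8]) {
    size_t n = 1;
    while (n < 8 && value >> (7 * n)) n++;
    uint64_t word = (1ull << (7 * n)) | value;
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)(word >> (8 * (n - 1 - i)));  /* big-endian bytes */
    return n;
}
```

For 0x01FF45 this picks n = 3, builds the 24-bit word 0x21FF45, and emits 21 FF 45 — the payload digits stay visible in a hex dump.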

Simpler to implement. If you encode a full uint64_t you get 9 encoded bytes, yet computers can't easily handle 9-byte integers; they only go up to 8.

The implementation is more efficient because you don't need to bit-shift the input numbers. Many modern processors have a BZHI instruction to zero out the higher bits of an integer, which is faster than the BEXTR instruction that extracts a bit field.


Is there a good reason not to use big endian?


For starters, all modern machines are little-endian, so you need to flip the bytes. This is made more confusing when you layer another encoding on top, since now you need to define how they are layered and when each encoding applies (i.e., do you encode the value to big-endian and then attach the prefix, or attach the prefix and then encode to big-endian? In this case their explanation implies that their mental model is the former. Luckily, I think it does not actually matter here.)

Big-endian also suffers from problems around streaming unknown-length values, as decoding a big-endian byte pattern is inherently length-dependent: you cannot interpret the leading bytes until you know how many bytes follow. So even if all machines were big-endian, you should still prefer little-endian encodings for wire formats.

If you try to pre-load the bytes in a little-endian encoding then the bytes at higher memory addresses, which you want to process later, can be trivially masked off. If you try to pre-load the bytes in a big-endian encoding then the bytes at higher memory addresses must be masked off and then the bytes you care about need to be shifted so they get interpreted correctly.
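The difference can be sketched as follows (hypothetical functions; a little-endian host with 8 readable bytes at the pointer is assumed, and `__builtin_bswap64` is a GCC/Clang builtin):

```c
#include <stdint.h>
#include <string.h>

/* Little-endian stream: the first n bytes land in the low bits of the
 * preloaded word, so one mask finishes the job. n in 1..8. */
static uint64_t le_first_n(const uint8_t *p, unsigned n) {
    uint64_t word;
    memcpy(&word, p, 8);                     /* over-read: preload 8 bytes */
    return word & (~0ull >> (64 - 8 * n));   /* mask off the later bytes  */
}

/* Big-endian stream: after byte-swapping the loaded word, the first n
 * bytes sit in the HIGH bits, so they must also be shifted down before
 * they read as a value. */
static uint64_t be_first_n(const uint8_t *p, unsigned n) {
    uint64_t word;
    memcpy(&word, p, 8);
    word = __builtin_bswap64(word);          /* undo host little-endianness */
    return word >> (64 - 8 * n);             /* shift the kept bytes down   */
}
```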


When I wrote a bigint library, it was much easier to store the array of digits in little endian for arithmetic operations like addition and multiplication.
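A minimal sketch of why the little-endian digit order is convenient (hypothetical function, 32-bit digits): addition starts at digit 0, the least significant, and a carry out simply appends a new digit — no existing digits need to move or be reindexed.

```c
#include <stdint.h>
#include <stddef.h>

/* digit[0] is the least significant digit. out must hold max(na,nb)+1
 * digits; returns the digit count of the sum. */
static size_t bigint_add(const uint32_t *a, size_t na,
                         const uint32_t *b, size_t nb,
                         uint32_t *out) {
    uint64_t carry = 0;
    size_t n = na > nb ? na : nb, i;
    for (i = 0; i < n; i++) {
        uint64_t s = carry + (i < na ? a[i] : 0) + (i < nb ? b[i] : 0);
        out[i] = (uint32_t)s;                /* low 32 bits become the digit */
        carry = s >> 32;                     /* high bits carry into digit i+1 */
    }
    if (carry) out[i++] = (uint32_t)carry;   /* growth appends, never shifts */
    return i;
}
```

With most-significant-first storage, that final carry would instead force every existing digit to shift up one slot.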



