> Note that even though the goal, and end result, is that the file is smaller, this is not a compression scheme, since it only works when your values are distributed in one particular way
This statement doesn't sound right. I think he's trying to say it's not a general purpose compression scheme, but every compression scheme only works on certain patterns. In fact, most possible streams of bytes are not compressible at all. Even the general purpose approaches only work on very specific types of inputs (such as natural language) which contain a lot of redundancy.
IMHO, spending 12.5% of all bits just to store the length seems wasteful.
The encoding in QUIC[1] encodes the length in the leading 2 bits. Very fast to read and write, with just one caveat: at most 62 bits are allowed.
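For reference, a minimal sketch of that layout (RFC 9000, section 16; the function names are mine, not from any QUIC library):

    #include <stdint.h>
    #include <stddef.h>

    /* The top 2 bits of the first byte select 1, 2, 4, or 8 total bytes;
       the remaining 6/14/30/62 bits hold the value in big-endian order. */
    size_t quic_varint_encode(uint64_t v, uint8_t out[8]) {
        size_t len = v < (1u << 6) ? 1 : v < (1u << 14) ? 2 : v < (1u << 30) ? 4 : 8;
        uint64_t tag = len == 1 ? 0 : len == 2 ? 1 : len == 4 ? 2 : 3;
        uint64_t word = v | (tag << (8 * len - 2));    /* caller ensures v < 2^62 */
        for (size_t i = 0; i < len; i++)
            out[i] = (uint8_t)(word >> (8 * (len - 1 - i)));
        return len;
    }

    uint64_t quic_varint_decode(const uint8_t *in, size_t *len_out) {
        size_t len = (size_t)1 << (in[0] >> 6);        /* 00/01/10/11 -> 1/2/4/8 */
        uint64_t v = in[0] & 0x3f;                     /* drop the 2 length bits */
        for (size_t i = 1; i < len; i++)
            v = (v << 8) | in[i];
        *len_out = len;
        return v;
    }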
The method in the article adds 14% and then rounds up to the nearest byte. On values up to 63/64 bits (depending on implementation) it wastes 0-1 bytes, and it always wastes 1 byte when you give it a multiple of 8 bits.
The QUIC method adds 2 bits and then rounds up to the nearest power of two bytes. On values up to 62 bits it wastes 0-4 bytes, and it wastes 1-4 bytes when you give it a multiple of 8 bits.
The rounding has a much bigger impact than the bits directly used to encode the length. For example, a full 16-bit value takes 3 bytes under the article's scheme (1 byte wasted) but 4 bytes under QUIC's (2 bytes wasted).
No, virtually all integer VLE schemes have a set of distributions they are particularly good at, and this one is no exception (in fact, its set of optimal distributions is identical to many others'). The only thing that could be considered strictly "wasted" here is the existence of multiple overlong sequences.
Unary-encoded length prefix and big-endian serialization.
There is no good reason to use big-endian serialization. I think they are doing that because they only know about "count leading zeros (CLZ)" to extract the unary length prefix. "Count trailing zeros (CTZ)" is also a commonly supported operation (and can be efficiently emulated with CLZ if it does not exist) which makes it easy to extract the unary length prefix even if it is little-endian.
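A sketch of the two extractions, assuming the first byte is nonzero (i.e. not a 9-byte encoding) and the GCC/Clang builtins:

    #include <stdint.h>

    /* Prefix in the high bits (big-endian layout): count leading zeros.
       Prefix in the low bits (little-endian layout): count trailing zeros. */
    static unsigned len_be(uint8_t b) { return __builtin_clz((unsigned)b << 24) + 1; }
    static unsigned len_le(uint8_t b) { return __builtin_ctz(b) + 1; }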
I can see a valid reason, specifically for variable-length encodings: big-endian makes it easy to compare compound keys lexicographically without decoding (little-endian can still do it, but everything has to be flipped to read from high to low addresses). Unfortunately, encoding continuation bits as 0 instead of 1 prevents that from working.
That is just type-abuse. A compound key is not a single N-bit big-endian value, you are just encoding it as such in memory because they kind of sort of fit. Such automatic and implicit type punning is a very dangerous line to tread in wire formats as it makes them very fragile. You are much better off being extremely precise about the encoding steps and bit patterns and then maybe optimizing the entire wire format into simpler data types if it makes sense as a whole.
A better idea would be defining the individual key formats. You then either serialize them into lexicographically ordered varint keys and overlay the composite onto a byte stream, or you serialize them lexicographically into a composite "integer" which you then serialize as a "varint" (really a variable-sized byte stream serialized like a varint). These two ideas make the lexicographical ordering intent explicit and somewhat independent of the underlying byte format while still enabling an efficient packed encoding.
> There is no good reason to use big-endian serialization
For this particular use case of variable length integers, I think there are very good reasons.
Easier to debug, because the encoding doesn’t scramble the bytes. You encode 0x01FF45 and you get the bytes 21 FF 45, which is very human-readable. Little-endian gets you the bytes 2C FA 0F, which are completely unrecognizable in a debugger or in files on disk.
Simpler to implement. If you encode a full uint64_t you’re going to get 9 encoded bytes, yet computers can’t easily handle 9-byte integers; they only go up to 8.
The implementation is more efficient because you don’t need to bit-shift the input numbers. Many modern processors have the BZHI instruction to zero out the higher bits of an integer, which is faster than the BEXTR instruction to extract a bit field.
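For what it's worth, here is roughly what that decode looks like for the big-endian layout being discussed (prefix in the high bits, matching the 21 FF 45 example above; sketch only, and the 9-byte form where the first byte is 0 is not handled):

    #include <stdint.h>
    #include <stddef.h>

    uint64_t varint_decode_be(const uint8_t *in, size_t *len_out) {
        /* len-1 leading zero continuation bits in the first byte */
        size_t len = (size_t)__builtin_clz((unsigned)in[0] << 24) + 1;
        uint64_t word = 0;
        for (size_t i = 0; i < len; i++)
            word = (word << 8) | in[i];                /* big-endian accumulate */
        *len_out = len;
        /* zero out everything above the 7*len payload bits -- the step
           BZHI does in a single instruction */
        return word & ((UINT64_C(1) << (7 * len)) - 1);
    }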
For starters, all modern machines are little-endian, so you need to flip the bytes. This is made more confusing when you layer another encoding on top since now you need to define how they are layered and when the encodings apply (i.e. Do you encode the value to big-endian then attach the prefix or do you attach the prefix then encode to big-endian? In this case their explanation indicates implicitly that their mental model is the former. Luckily, I think it does not actually matter in this case.)
Big-endian also suffers from problems around streaming unknown-length values as the decoding of a big-endian byte pattern is inherently length-dependent, so even if all machines were big-endian you should still prefer little-endian encodings for wire formats.
If you try to pre-load the bytes in a little-endian encoding then the bytes at higher memory addresses, which you want to process later, can be trivially masked off. If you try to pre-load the bytes in a big-endian encoding then the bytes at higher memory addresses must be masked off and then the bytes you care about need to be shifted so they get interpreted correctly.
When I wrote a bigint library, it was much easier to store the array of digits in little endian for arithmetic operations like addition and multiplication.
My blog post is about a relocation format. I investigated a few schemes and concluded that LEB128 is the best for my use case.
There are multiple reasons, including a super simple implementation:
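The snippet itself isn't quoted above, but the classic LEB128 decode is only a few lines; a sketch from memory (not necessarily the original):

    #include <stdint.h>

    /* 7 payload bits per byte, top bit set = "another byte follows".
       Assumes well-formed input (at most 10 bytes for a uint64_t). */
    uint64_t leb128_decode(const uint8_t **p) {
        uint64_t v = 0;
        unsigned shift = 0;
        uint8_t b;
        do {
            b = *(*p)++;
            v |= (uint64_t)(b & 0x7f) << shift;
            shift += 7;
        } while (b & 0x80);                            /* one branch per byte */
        return v;
    }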
Something along the lines of the snippet below seems pretty OK too and doesn’t force a branch per byte. (Imagine a memcpy instead of the first line if the strict-aliasing UB annoys you.) I guess it does do somewhat more work if the majority of the inputs fit in a byte.
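A rough reconstruction of that kind of decode (the original snippet isn't preserved in this thread): one unaligned 8-byte load, then branch-free extraction for the big-endian layout. It assumes a little-endian host, reads up to 7 bytes past the value, and ignores the 9-byte form:

    #include <stdint.h>
    #include <stddef.h>

    uint64_t decode(const uint8_t *p, size_t *len_out) {
        uint64_t word = __builtin_bswap64(*(const uint64_t *)p);  /* the UB line */
        size_t len = (size_t)__builtin_clzll(word) + 1;   /* unary prefix, len <= 8 */
        *len_out = len;
        /* right-align the len bytes we actually want, then strip the prefix */
        return (word >> (64 - 8 * len)) & ((UINT64_C(1) << (7 * len)) - 1);
    }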
So, UTF-8 without the self-synchronization part. It’s a reasonable encoding, sure. I’m not sure what the cost of making the whole thing big endian is, though (guess I need to look at simdutf8!); at first blush it feels like making this wholly little-endian-adapted, with the continuation bits = byte count in unary located in the low bits, would result in very cheap decoding on little-endian machines.
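That little-endian-adapted variant might look something like this (a sketch under the same assumptions as the snippet above: at most 8 encoded bytes, little-endian host, over-reads allowed). Note that a single right shift removes the entire prefix:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    uint64_t varint_decode_le(const uint8_t *in, size_t *len_out) {
        uint64_t word;
        memcpy(&word, in, 8);                        /* little-endian host assumed */
        size_t len = (size_t)__builtin_ctzll(word) + 1;  /* trailing zeros = len-1 */
        *len_out = len;
        if (len < 8)                                 /* keep only the len bytes we own */
            word &= (UINT64_C(1) << (8 * len)) - 1;
        return word >> len;                  /* one shift drops the unary prefix */
    }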
A third approach not mentioned in the article is to use some form of length prefix (which is itself a fixed number of bits long). This is usually cheaper to decode, but a fixed-length prefix disproportionately "wastes" space in short numbers.
The approach in the article gives the best of both worlds, all while being very cheap to decode. I'm guessing it might also be fairly SIMD-friendly.
I wonder how well branch predictors handle CLZ vs just masking and comparing bytes. I expect the latter to work really well, but who knows how optimized CLZ-derived branches are.
It seems like a more SIMD-friendly version would separately store a byte mask and the packed byte data. Then you can simultaneously expand 16 int32s using e.g. AVX512's _mm512_maskz_expand_epi8(mask,_mm512_loadu_epi8(data)).
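Roughly (untested sketch; needs AVX-512 VBMI2, and the exact mask/stream layout here is made up for illustration):

    #include <immintrin.h>
    #include <stdint.h>

    /* Expand packed significant bytes into 16 int32 lanes in one shot:
       `mask` has a set bit for every destination byte lane that should
       receive a data byte; the unset lanes are zero-filled. */
    void expand16(const uint8_t *data, uint64_t mask, int32_t out[16]) {
        __m512i packed = _mm512_loadu_si512((const void *)data);
        _mm512_storeu_si512((void *)out,
                            _mm512_maskz_expand_epi8((__mmask64)mask, packed));
    }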
Stream VByte[1,2,3] is not far away from that idea, storing the packed lengths in a separate stream from the data bytes. (There’s also a more conventional interleaved design in varint-G8IU, but Stepanov decided to fuck everyone over and patented that.)
The encoding proposed by that blog post matches the encoding used in the MKV container; see section 4, “Variable-Size Integer”, in that specification: https://www.rfc-editor.org/rfc/rfc8794.html
All modern CPUs have an instruction to swap the byte order of an integer, and many programming languages have an intrinsic function to emit it. Quite useful for both the encoder and the decoder of these integers.
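E.g. with GCC/Clang:

    #include <stdint.h>

    /* Compiles to a single BSWAP on x86-64 (REV on ARM64); MSVC has
       _byteswap_uint64, and C++23 adds std::byteswap. */
    uint64_t byte_swap(uint64_t x) { return __builtin_bswap64(x); }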
> And, for reasons we’ll see in a bit, we flip the meaning of the continuation bits: we’ll use 0 to mean “another byte follows”, and 1 to mean “this is the last byte”.
EDIT: just noticed that I'd skim-read over: "imperial varint places all of the continuation bits at the beginning of the encoded value", which makes my comment completely wrong.
~Doesn't even mention possibly one of the most useful features of encoding this way, which is that you can never [1] end up with a 0 byte in the output, meaning that you can treat the output as a C string.~
~[1] or at least never assuming you're trying to create the shortest encoding~
Yeah, I'd re-read that section of the article and edited my post already to reflect this before I saw your reply.
My fault for reading the first proposed solution, thinking it was stupid for having the continuation bits the other way round, and skipping ahead to where I saw they'd inverted them and stopped reading at that point.
A unary length prefix is a solid technique. I would be careful with the performance claims, though: sometimes what the machine can do is surprising, and a system that looks like it needs to loop and branch ends up faster than you expected. The protobuf project, unsurprisingly for a project of its age, has several different varint parsing strategies. One is described at https://github.com/protocolbuffers/protobuf/blob/main/src/go... and another at https://github.com/protocolbuffers/protobuf/blob/main/src/go...