> Note that even though the goal, and end result, is that the file is smaller, this is not a compression scheme, since it only works when your values are distributed in one particular way
This statement doesn't sound right. I think he's trying to say it's not a general purpose compression scheme, but every compression scheme only works on certain patterns. In fact, most possible streams of bytes are not compressible at all. Even the general purpose approaches only work on very specific types of inputs (such as natural language) which contain a lot of redundancy.
IMHO, spending 12.5% of all bits just to store the length seems wasteful.
The encoding in QUIC[1] encodes the length in the leading 2 bits. Very fast to read and write, with just one caveat: at most 62 bits are allowed.
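For reference, a minimal sketch of that layout (RFC 9000, section 16; the function names are mine, not from any QUIC library):

    #include <stdint.h>
    #include <stddef.h>

    /* The top 2 bits of the first byte select 1, 2, 4, or 8 total bytes;
       the remaining 6/14/30/62 bits hold the value in big-endian order. */
    size_t quic_varint_encode(uint64_t v, uint8_t out[8]) {
        size_t len = v < (1u << 6) ? 1 : v < (1u << 14) ? 2 : v < (1u << 30) ? 4 : 8;
        uint64_t tag = len == 1 ? 0 : len == 2 ? 1 : len == 4 ? 2 : 3;
        uint64_t word = v | (tag << (8 * len - 2));    /* caller ensures v < 2^62 */
        for (size_t i = 0; i < len; i++)
            out[i] = (uint8_t)(word >> (8 * (len - 1 - i)));
        return len;
    }

    uint64_t quic_varint_decode(const uint8_t *in, size_t *len_out) {
        size_t len = (size_t)1 << (in[0] >> 6);        /* 00/01/10/11 -> 1/2/4/8 */
        uint64_t v = in[0] & 0x3f;                     /* drop the 2 length bits */
        for (size_t i = 1; i < len; i++)
            v = (v << 8) | in[i];
        *len_out = len;
        return v;
    }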
The method in the article adds 14% and then rounds up to the nearest byte. On values up to 63/64 bits (depending on implementation) it wastes 0-1 bytes, and it always wastes 1 byte when you give it a multiple of 8 bits.
The QUIC method adds 2 bits and then rounds up to the nearest power of two bytes. On values up to 62 bits it wastes 0-4 bytes, and it wastes 1-4 bytes when you give it a multiple of 8 bits.
The rounding has a much bigger impact than the bits directly used to encode the length. For example, a full 16-bit value takes 3 bytes under the article's scheme (1 byte wasted) but 4 bytes under QUIC's (2 bytes wasted).
No, virtually all integer VLE schemes have a set of distributions they are particularly good at, and this one is no exception (in fact, its set of optimal distributions is identical to many others'). The only thing that could be considered strictly "wasted" here is the existence of multiple overlong sequences.
Unary-encoded length prefix and big-endian serialization.
There is no good reason to use big-endian serialization. I think they are doing that because they only know about "count leading zeros (CLZ)" to extract the unary length prefix. "Count trailing zeros (CTZ)" is also a commonly supported operation (and can be efficiently emulated with CLZ if it does not exist) which makes it easy to extract the unary length prefix even if it is little-endian.
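A sketch of the two extractions, assuming the first byte is nonzero (i.e. not a 9-byte encoding) and the GCC/Clang builtins:

    #include <stdint.h>

    /* Prefix in the high bits (big-endian layout): count leading zeros.
       Prefix in the low bits (little-endian layout): count trailing zeros. */
    static unsigned len_be(uint8_t b) { return __builtin_clz((unsigned)b << 24) + 1; }
    static unsigned len_le(uint8_t b) { return __builtin_ctz(b) + 1; }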
I can see a valid reason, specifically for variable-length encodings: big-endian makes it easy to compare compound keys lexicographically without decoding (little-endian can still do it, but everything has to be flipped to read from high to low addresses). Unfortunately, encoding continuation bits as 0 instead of 1 prevents that from working.
That is just type-abuse. A compound key is not a single N-bit big-endian value, you are just encoding it as such in memory because they kind of sort of fit. Such automatic and implicit type punning is a very dangerous line to tread in wire formats as it makes them very fragile. You are much better off being extremely precise about the encoding steps and bit patterns and then maybe optimizing the entire wire format into simpler data types if it makes sense as a whole.
A better idea would be defining the individual key formats. You then either serialize them into lexicographically ordered varint keys and overlay the composite onto a byte stream, or you serialize them lexicographically into a composite "integer" which you then serialize as a "varint" (really a variable-sized byte stream serialized like a varint). These two ideas make the lexicographical ordering intent explicit and somewhat independent of the underlying byte format while still enabling an efficient packed encoding.
> There is no good reason to use big-endian serialization
For this particular use case of variable length integers, I think there are very good reasons.
Easier to debug, because the encoding doesn’t scramble the bytes. You encode 0x01FF45 and you get the bytes 21 FF 45, which is very human-readable. Little-endian gets you the bytes 2C FA 0F, which are completely unrecognizable in a debugger or in files on disk.
Simpler to implement. If you encode a full uint64_t you’re going to get 9 encoded bytes, yet computers can’t easily handle 9-byte integers; they only go up to 8.
The implementation is more efficient because you don’t need to bit-shift the input numbers. Many modern processors have the BZHI instruction to zero out the higher bits of an integer, which is faster than the BEXTR instruction to extract a bit field.
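For what it's worth, here is roughly what that decode looks like for the big-endian layout being discussed (prefix in the high bits, matching the 21 FF 45 example above; sketch only, and the 9-byte form where the first byte is 0 is not handled):

    #include <stdint.h>
    #include <stddef.h>

    uint64_t varint_decode_be(const uint8_t *in, size_t *len_out) {
        /* len-1 leading zero continuation bits in the first byte */
        size_t len = (size_t)__builtin_clz((unsigned)in[0] << 24) + 1;
        uint64_t word = 0;
        for (size_t i = 0; i < len; i++)
            word = (word << 8) | in[i];                /* big-endian accumulate */
        *len_out = len;
        /* zero out everything above the 7*len payload bits -- the step
           BZHI does in a single instruction */
        return word & ((UINT64_C(1) << (7 * len)) - 1);
    }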
For starters, all modern machines are little-endian, so you need to flip the bytes. This is made more confusing when you layer another encoding on top since now you need to define how they are layered and when the encodings apply (i.e. Do you encode the value to big-endian then attach the prefix or do you attach the prefix then encode to big-endian? In this case their explanation indicates implicitly that their mental model is the former. Luckily, I think it does not actually matter in this case.)
Big-endian also suffers from problems around streaming unknown-length values as the decoding of a big-endian byte pattern is inherently length-dependent, so even if all machines were big-endian you should still prefer little-endian encodings for wire formats.
If you try to pre-load the bytes in a little-endian encoding then the bytes at higher memory addresses, which you want to process later, can be trivially masked off. If you try to pre-load the bytes in a big-endian encoding then the bytes at higher memory addresses must be masked off and then the bytes you care about need to be shifted so they get interpreted correctly.
When I wrote a bigint library, it was much easier to store the array of digits in little endian for arithmetic operations like addition and multiplication.
My blog post is about a relocation format. I investigated a few schemes and concluded that LEB128 is the best for my use case.
There are multiple reasons, including a super simple implementation:
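The snippet itself isn't quoted above, but the classic LEB128 decode is only a few lines; a sketch from memory (not necessarily the original):

    #include <stdint.h>

    /* 7 payload bits per byte, top bit set = "another byte follows".
       Assumes well-formed input (at most 10 bytes for a uint64_t). */
    uint64_t leb128_decode(const uint8_t **p) {
        uint64_t v = 0;
        unsigned shift = 0;
        uint8_t b;
        do {
            b = *(*p)++;
            v |= (uint64_t)(b & 0x7f) << shift;
            shift += 7;
        } while (b & 0x80);                            /* one branch per byte */
        return v;
    }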
Something along the lines of the snippet below seems pretty OK too and doesn’t force a branch per byte. (Imagine a memcpy instead of the first line if the strict-aliasing UB annoys you.) I guess it does do somewhat more work if the majority of the inputs fit in a byte.
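A rough reconstruction of that kind of decode (the original snippet isn't preserved in this thread): one unaligned 8-byte load, then branch-free extraction for the big-endian layout. It assumes a little-endian host, reads up to 7 bytes past the value, and ignores the 9-byte form:

    #include <stdint.h>
    #include <stddef.h>

    uint64_t decode(const uint8_t *p, size_t *len_out) {
        uint64_t word = __builtin_bswap64(*(const uint64_t *)p);  /* the UB line */
        size_t len = (size_t)__builtin_clzll(word) + 1;   /* unary prefix, len <= 8 */
        *len_out = len;
        /* right-align the len bytes we actually want, then strip the prefix */
        return (word >> (64 - 8 * len)) & ((UINT64_C(1) << (7 * len)) - 1);
    }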
So, UTF-8 without the self-synchronization part. It’s a reasonable encoding, sure. I’m not sure what the cost of making the whole thing big endian is, though (guess I need to look at simdutf8!); at first blush it feels like making this wholly little-endian-adapted, with the continuation bits = byte count in unary located in the low bits, would result in very cheap decoding on little-endian machines.
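That little-endian-adapted variant might look something like this (a sketch under the same assumptions as the snippet above: at most 8 encoded bytes, little-endian host, over-reads allowed). Note that a single right shift removes the entire prefix:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    uint64_t varint_decode_le(const uint8_t *in, size_t *len_out) {
        uint64_t word;
        memcpy(&word, in, 8);                        /* little-endian host assumed */
        size_t len = (size_t)__builtin_ctzll(word) + 1;  /* trailing zeros = len-1 */
        *len_out = len;
        if (len < 8)                                 /* keep only the len bytes we own */
            word &= (UINT64_C(1) << (8 * len)) - 1;
        return word >> len;                  /* one shift drops the unary prefix */
    }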
A third approach not mentioned in the article is to use some form of length prefix (which is itself a fixed number of bits long). This is usually cheaper to decode, but a fixed-length prefix disproportionately "wastes" space in short numbers.
The approach in the article gives the best of both worlds, all while being very cheap to decode. I'm guessing it might also be fairly SIMD-friendly.
I wonder how well branch predictors handle CLZ vs just masking and comparing bytes. I expect the latter to work really well, but who knows how optimized CLZ-derived branches are.
It seems like a more SIMD-friendly version would separately store a byte mask and the packed byte data. Then you can simultaneously expand 16 int32s using e.g. AVX512's _mm512_maskz_expand_epi8(mask,_mm512_loadu_epi8(data)).
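Roughly (untested sketch; needs AVX-512 VBMI2, and the exact mask/stream layout here is made up for illustration):

    #include <immintrin.h>
    #include <stdint.h>

    /* Expand packed significant bytes into 16 int32 lanes in one shot:
       `mask` has a set bit for every destination byte lane that should
       receive a data byte; the unset lanes are zero-filled. */
    void expand16(const uint8_t *data, uint64_t mask, int32_t out[16]) {
        __m512i packed = _mm512_loadu_si512((const void *)data);
        _mm512_storeu_si512((void *)out,
                            _mm512_maskz_expand_epi8((__mmask64)mask, packed));
    }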
Stream VByte[1,2,3] is not far away from that idea, storing the packed lengths in a separate stream from the data bytes. (There’s also a more conventional interleaved design in varint-G8IU, but Stepanov decided to fuck everyone over and patented that.)
The encoding proposed by that blog post matches the encoding used in the MKV container; see section 4, “Variable-Size Integer”, in that specification: https://www.rfc-editor.org/rfc/rfc8794.html
All modern CPUs have an instruction to swap the byte order of an integer, and many programming languages have an intrinsic function to emit it. Quite useful for both the encoder and the decoder of these integers.
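E.g. with GCC/Clang:

    #include <stdint.h>

    /* Compiles to a single BSWAP on x86-64 (REV on ARM64); MSVC has
       _byteswap_uint64, and C++23 adds std::byteswap. */
    uint64_t byte_swap(uint64_t x) { return __builtin_bswap64(x); }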
> And, for reasons we’ll see in a bit, we flip the meaning of the continuation bits: we’ll use 0 to mean “another byte follows”, and 1 to mean “this is the last byte”.
EDIT: just noticed that I'd skim-read over: "imperial varint places all of the continuation bits at the beginning of the encoded value", which makes my comment completely wrong.
~Doesn't even mention possibly one of the most useful features of encoding this way, which is that you can never [1] end up with a 0 byte in the output, meaning that you can treat the output as a C string.~
~[1] or at least never assuming you're trying to create the shortest encoding~
Yeah, I'd re-read that section of the article and edited my post already to reflect this before I saw your reply.
My fault for reading the first proposed solution, thinking it was stupid for having the continuation bits the other way round, and skipping ahead to where I saw they'd inverted them and stopped reading at that point.
A unary length prefix is a solid technique. I would be careful with the performance claims, though: sometimes what the machine can do is surprising, and a system that looks like it needs to loop and branch ends up faster than you expected. The protobuf project, unsurprisingly for a project of its age, has several different varint parsing strategies. One is described at https://github.com/protocolbuffers/protobuf/blob/main/src/go... and another at https://github.com/protocolbuffers/protobuf/blob/main/src/go...