[HN Gopher] vu128: Efficient variable-length integers
       ___________________________________________________________________
        
       vu128: Efficient variable-length integers
        
       Author : jmillikin
       Score  : 77 points
       Date   : 2024-05-21 10:48 UTC (2 days ago)
        
 (HTM) web link (john-millikin.com)
 (TXT) w3m dump (john-millikin.com)
        
       | lifthrasiir wrote:
       | As with many other variable-length integer encodings, its
       | efficiency really depends on the expected input distribution. One
       | glaring issue in this regard is that its 2-byte encoding is
        | rather wasted: only 6.25% of all possible encodings are
        | useful. This could be fixed by making the 2-byte encoding
       | represent 240..495 instead of 0..255. You can even make use of
       | some bits from the first byte, so that for example the sequence
       | goes like: 0=00 1=01 2=02 .. BF=BF C0=C0C0 C1=C0C1 .. 30FF=F0FF
       | 3100=F13100 .. (all in hex).
       | 
       | A radically different design, used by for example 7z, is also
       | possible if you put all continuation bits into the first byte.
       | For example, instead of having `1xxxxxxx 1yyyyyyy 0zzzzzzz` for
       | 3-byte encoding, use `110xxxxx yyyyyyyy zzzzzzzz`. This is
       | essentially as compact as more common VLQ and LEB128 encodings,
       | but you can determine the length of the whole sequence from the
       | first byte alone! Unfortunately this is less suitable for
       | integers larger than 64 bits, because then you run out of bits in
       | the first byte. But it ends up with a rather elegant encoding for
       | 64-bit integers, where `11111110` indicates 7 bytes and
       | `11111111` indicates 8 bytes with no additional 0 bit needed.
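A Python sketch of that first-byte-length scheme (function names and byte order are my choice, not 7z's actual on-disk details): k leading ones in the first byte mean k payload bytes follow, UTF-8 style, so the total length is known from the first byte alone.

```python
def encode_prefix(n: int) -> bytes:
    """Encode n < 2**64 with all continuation bits in the first byte:
    k leading ones mean k payload bytes follow (7 + 7*k data bits)."""
    assert 0 <= n < 1 << 64
    for k in range(8):
        if n < 1 << (7 + 7 * k):
            first = ((0xFF << (8 - k)) & 0xFF) | (n >> (8 * k))
            low = n & ((1 << (8 * k)) - 1)
            return bytes([first]) + low.to_bytes(k, "big")
    return b"\xff" + n.to_bytes(8, "big")   # 11111111: 8 payload bytes

def decode_prefix(buf: bytes) -> tuple[int, int]:
    """Return (value, encoded length); length comes from buf[0] alone."""
    first = buf[0]
    k = next((i for i in range(8) if not (first >> (7 - i)) & 1), 8)
    if k == 8:
        return int.from_bytes(buf[1:9], "big"), 9
    high = first & ((1 << (7 - k)) - 1)
    return (high << (8 * k)) | int.from_bytes(buf[1:1 + k], "big"), 1 + k
```

Note how the two special cases from the comment fall out naturally: `11111110` leaves zero data bits in the first byte (7 payload bytes, 56 bits), and `11111111` needs no terminating 0 bit (8 payload bytes, 64 bits).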
       | 
       | > An afternoon of web searching has failed to uncover a prior
       | name for this particular encoding, so I'm going to name it vu128
       | and describe it below. If any reader knows of prior art, please
       | send it my way.
       | 
       | I don't know if it has other names, but a similar encoding can be
       | seen from X.690 [1] and the "additional information" bit field of
       | CBOR [2].
       | 
       | [1] https://en.wikipedia.org/wiki/X.690#Length_octets
       | 
       | [2] https://www.rfc-editor.org/rfc/rfc8949.html#section-3
        
         | klabb3 wrote:
         | > Unfortunately this is less suitable for integers larger than
         | 64 bits, because then you run out of bits in the first byte.
         | 
          | Just spitballing here, but is it necessary to support
          | every increment of octets? Do we really need a
          | 7-octet/56-bit width, for instance? Instead couldn't we
          | just allow widths of log2(n_octets)? An int64 would need 2
          | bits of length info (8, 16, 32, 64). Or maybe I'm thinking
          | wrong, because not all bits can be used.
         | 
         | > its efficiency really depends on the expected input
         | distribution
         | 
         | Right. But everything else in computers is rounded to the
         | nearest power of 2, so maybe it works here too haha.
        
         | vlovich123 wrote:
          | If you don't need the full 64-bit space and can go 61-bit,
          | you can actually encode the variable length in 3 bits and
          | have at most an 8-byte value vs the 9-byte value here (ie
          | 0 overhead): three bits to indicate whether the length is
          | 1 to 8 bytes, and the remainder for the value. This is
          | still 5 bits more efficient if you used the full 64 bits,
          | but unless you can make use of those extra 5 bits somehow
          | (eg tightly packing another value that's also partially
          | sliced), it's not buying you anything.
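A sketch of that layout (my own framing of the idea: the top 3 bits of the first byte store length-minus-one in bytes, leaving up to 61 bits of big-endian payload):

```python
def encode61(n: int) -> bytes:
    """Pack a value < 2**61 into 1..8 bytes; top 3 bits = length - 1."""
    assert 0 <= n < 1 << 61
    total = 1
    while n >= 1 << (8 * total - 3):        # 5, 13, 21, ... 61 value bits
        total += 1
    return (((total - 1) << (8 * total - 3)) | n).to_bytes(total, "big")

def decode61(buf: bytes) -> tuple[int, int]:
    """Return (value, bytes consumed); length comes from buf[0] alone."""
    total = (buf[0] >> 5) + 1
    raw = int.from_bytes(buf[:total], "big")
    return raw & ((1 << (8 * total - 3)) - 1), total
```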
        
           | repelsteeltje wrote:
           | Clever. But..
           | 
            | If your worst case is 8 bytes for 61 bits, that's not 0
            | overhead, but 3 bits of overhead.
            | 
            | Another detail is that vu128 covers double that, 128
            | bits, ie 17 bytes worst case or 8 bits of overhead. It
            | would presumably take 4 bits in your scheme to represent
            | a 124-bit payload.
        
             | vlovich123 wrote:
              | 0 overhead in that it takes 8 bytes in memory and up
              | to 8 bytes serialized. You can't slice a byte
              | partially unless you're doing packing, so I think it's
              | fair to characterize it as 0 overhead, but sure, it's
              | 3 bits of overhead vs 8. It came up because I had a
              | counter and just arbitrarily cut off the range because
              | I couldn't possibly ever have it count that high. But
              | yeah, the same idea could be extended to 128 bits.
        
               | repelsteeltje wrote:
                | Fair enough. Since this is a counter, best make sure
                | you get the overflow logic right. The normal 64-bit
                | wrapping won't kick in if you can only (de)serialize
                | 61 bits.
                | 
                | Or (depending on the context of this counter) you
                | may never want to overflow at all - in which case
                | you're best off with a variable-length
                | representation that is open-ended (like R or Python
                | arbitrary precision).
        
               | vlovich123 wrote:
                | It would take your absolute top-of-the-line CPU
                | doing nothing but incrementing the counter for more
                | than 12 years to worry about overflowing a 61-bit
                | number. And what this is counting is much more
                | expensive than incrementing a counter, so you can't
                | run this at 6 GHz. Maybe ~500 MHz, which gets you to
                | > 100 years if that's all you're doing, but this
                | counter is a side operation that won't see much
                | traffic even at scale, so you're talking about
                | millennia in practice, I suspect.
               | 
                | So it's just not a practical concern - I just panic
                | if it hits the limit. I think you may be failing to
                | comprehend just how large numbers like this get and
                | how long they take to count to, because overflow is
                | a very real concern with 32-bit counters. Addition
                | of course can always overflow, but this is a
                | straight-up strictly monotonic counter that counts
                | from 0 to 2^61.
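The back-of-envelope arithmetic here checks out (a quick sanity computation, not from the thread):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Incrementing once per cycle at 6 GHz, flat out:
years_at_6ghz = (1 << 61) / 6e9 / SECONDS_PER_YEAR       # ~12.2 years
# At an effective ~500 MHz increment rate:
years_at_500mhz = (1 << 61) / 500e6 / SECONDS_PER_YEAR   # ~146 years
```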
        
               | repelsteeltje wrote:
               | :-) fair enough.
               | 
                | And panicking is exactly the kind of mitigation I
                | had in mind initially; just don't silently do the
                | unexpected.
               | 
               | > I think you may be failing to comprehend just how large
               | numbers like this get
               | 
               | On the other hand, us programmers are also notorious for
               | failing to comprehend just how long our software will
               | live on...
        
               | vlovich123 wrote:
               | Honestly if any code I've written survives 100 years to
               | cause problems I'll be impressed. 1000 years I'll be
               | ecstatic.
        
         | plesner wrote:
         | I implemented a slight variation of this a while back which
         | worked as described up to 11110xxx but then a prefix byte of
         | 11111xxx meant a payload of 64*2^xxx bits. So 11111000 is
         | followed by a 64-bit value, 11111001 by a 128-bit, up to
         | 11111111 which is 8192 bits.
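In that variant the payload size still comes from the first byte alone; a sketch of the size rule as described (function name is mine):

```python
def payload_bits(prefix: int) -> int:
    """For a prefix byte 11111xxx, the payload is 64 * 2**xxx bits."""
    assert prefix >> 3 == 0b11111
    return 64 << (prefix & 0b111)
```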
        
         | Dylan16807 wrote:
         | > its efficiency really depends on the expected input
         | distribution. One glaring issue in this regard is that its
         | 2-byte encoding is rather wasted
         | 
         | Yeah, and it's worth noting that this encoding is very close to
         | just length then value. It only saves space for numbers up to
         | about +-100. In a lot of situations it wouldn't actually do
         | better than tag-length-value.
        
         | lostmsu wrote:
         | Also, real-world numbers tend to swarm around powers of 10 and,
         | in computers, powers of 2. It might make sense to encode those
         | in a special way. E.g. <pow2-prefix><var-len-power><0/1 - to
         | subtract one> and <pow10-prefix><var-len-power><0/1 - to
         | subtract one>. E.g. assuming 10 prefix reserved for pow2 and 11
         | for pow10, 255 would be encoded as 10_1000_1, 65535 would be
         | 10_10000_1, 65536 would be 10_10000_0, and 10000 would be
         | 11_11_0.
        
       | chubot wrote:
       | FWIW the lobste.rs thread surfaced this 2016 comment thread about
       | PrefixVarint vs. LEB128 varint in protobufs, which has lots of
       | good info, and makes the UTF-8 analogy:
       | 
       | https://news.ycombinator.com/item?id=11263378
       | 
       | Also sqlite varint:
       | 
       | https://sqlite.org/src4/doc/trunk/www/varint.wiki
        
         | akira2501 wrote:
         | > Also sqlite varint:
         | 
         | That's not actually in use. The SQLite3 format is far more
         | basic:
         | 
         | "A variable-length integer or "varint" is a static Huffman
         | encoding of 64-bit twos-complement integers that uses less
         | space for small positive values. A varint is between 1 and 9
         | bytes in length. The varint consists of either zero or more
         | bytes which have the high-order bit set followed by a single
         | byte with the high-order bit clear, or nine bytes, whichever is
         | shorter. The lower seven bits of each of the first eight bytes
         | and all 8 bits of the ninth byte are used to reconstruct the
         | 64-bit twos-complement integer. Varints are big-endian: bits
         | taken from the earlier byte of the varint are more significant
         | than bits taken from the later bytes."
         | 
         | From: https://www.sqlite.org/fileformat2.html Section 1.6
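A direct Python transcription of the quoted spec (decoder only; the unsigned result is the raw 64-bit two's-complement pattern):

```python
def decode_sqlite_varint(buf: bytes) -> tuple[int, int]:
    """Return (value, bytes consumed). Big-endian: each of up to eight
    high-bit-set bytes contributes 7 bits; a ninth byte, if reached,
    contributes all 8 of its bits."""
    n = 0
    for i in range(8):
        n = (n << 7) | (buf[i] & 0x7F)
        if not buf[i] & 0x80:
            return n, i + 1
    return (n << 8) | buf[8], 9
```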
        
       | vitiral wrote:
        | zig-zag encoding is brilliant. Basically you can determine
        | positive or negative based on odd/even, and divide by 2 to
        | get the value. I tried to come up with a similar encoding
        | and never thought of this.
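The mapping interleaves the two signs onto the non-negative integers (0, -1, 1, -2, 2, ... becomes 0, 1, 2, 3, 4, ...); a minimal sketch:

```python
def zigzag_encode(n: int) -> int:
    """Map signed to unsigned: small magnitudes stay small."""
    return 2 * n if n >= 0 else -2 * n - 1

def zigzag_decode(z: int) -> int:
    """Even -> non-negative half, odd -> negative half."""
    return z // 2 if z % 2 == 0 else -(z + 1) // 2
```

(Fixed-width implementations usually express this branchlessly as `(n << 1) ^ (n >> 63)` with an arithmetic right shift.)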
        
         | repelsteeltje wrote:
         | Yeah basically it's ones complement (or twos complement) but
         | with the sign bit on the right, rather than the left.
        
         | edflsafoiewq wrote:
         | It's the same as enumerating 2D points in a spiral, but
         | restricted to 1D.
        
       | Const-me wrote:
       | When I need them I usually implement the codec used in MKV
       | format: https://www.rfc-editor.org/rfc/rfc8794.html
       | 
       | Similar performance characteristics, i.e. just the first byte
       | tells the encoded length, no need to test every byte. Luckily,
       | most programming languages have intrinsic functions for BSWAP
       | instruction.
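A minimal Python decoder for that VINT scheme (here `int.from_bytes` stands in for the BSWAP trick; VINTs wider than 8 bytes are out of scope for this sketch):

```python
def decode_vint(buf: bytes) -> tuple[int, int]:
    """RFC 8794 VINT: leading zeros of the first byte, plus one, give
    the total encoded length; the marker bit is stripped from the value."""
    first = buf[0]
    assert first != 0, "would be a VINT wider than 8 bytes"
    length = 9 - first.bit_length()          # leading zeros + 1
    raw = int.from_bytes(buf[:length], "big")
    return raw - (1 << (7 * length)), length # clear the single marker bit
```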
        
         | repelsteeltje wrote:
          | Mkv (and av1, vp8, vp9) use the LEB128 format, which the
          | article mentions is inefficient to parse on off-the-shelf
          | CPUs.
          | 
          | LEB128 is relatively efficient in the bitstream and
          | sidesteps potential MPEG-LA patent issues. (H264, maybe
          | 14496 part 14)
        
           | Const-me wrote:
           | I don't think they do. These formats are based on EBML, see
           | the RFC link in my previous comment. That VINT encoding is
           | not LEB128, it's better.
           | 
           | BTW I once implemented my own parser in C#, see that source
           | file: https://github.com/Const-
           | me/Vrmac/blob/master/VrmacVideo/Con...
        
       | kunley wrote:
       | There is also SQLite varint,
       | https://sqlite.org/src4/doc/trunk/www/varint.wiki
       | 
        | IIRC it's different from both LEB128 and VLQ, and quite
        | neat. It's for 64-bit (u)ints, though.
        
       | terrelln wrote:
       | Variable-length integer schemes are great when you are encoding
       | one integer. But when you are encoding a list of integers, you
       | should really consider a different scheme.
       | 
       | Variable-length integer schemes generally interleave the control
       | bits & data bits. This means you don't know where the next
       | integer starts until you at least partially decode the current
       | integer.
       | 
       | When encoding a list of integers, you would rather put all the
       | control information together, and all the data together. This
       | generally allows for significantly more efficient decoding using
       | SIMD.
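Lemire et al.'s Stream VByte is built on this idea. A simplified per-group sketch of the control/data split (the real format keeps all control bytes contiguous in their own stream; the layout below is illustrative only):

```python
def encode_group(values: list[int]) -> bytes:
    """One control byte holds four 2-bit length codes (length-1 in
    bytes); the payload bytes follow separately, so all four lengths
    are known before touching any data."""
    assert len(values) == 4 and all(0 <= v < 1 << 32 for v in values)
    control, data = 0, bytearray()
    for i, v in enumerate(values):
        nbytes = max(1, (v.bit_length() + 7) // 8)
        control |= (nbytes - 1) << (2 * i)
        data += v.to_bytes(nbytes, "little")
    return bytes([control]) + bytes(data)

def decode_group(buf: bytes) -> tuple[list[int], int]:
    """Return (four values, bytes consumed)."""
    control, out, pos = buf[0], [], 1
    for i in range(4):
        nbytes = ((control >> (2 * i)) & 3) + 1
        out.append(int.from_bytes(buf[pos:pos + nbytes], "little"))
        pos += nbytes
    return out, pos
```

A SIMD decoder can turn the control byte into a shuffle mask via table lookup and decode all four values in one step, with no data-dependent branches.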
        
         | tatersolid wrote:
          | Don't forget column-stores. Lists of floats and ints are
          | generally stored "full size", but XORed with the preceding
          | value, and the result is then compressed in chunks with a
          | high-speed compressor like LZ4. We see 20x or higher
          | compression ratios on many of our column-stores, which
          | contain data such as invoice amounts as metrics.
          | Decompression overhead is effectively negative (it's much
          | faster than operating on the raw uncompressed data, which
          | is bound by memory bandwidth or IO).
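A sketch of that XOR preprocessing for a float column (the compressor call itself is omitted; function name is mine):

```python
import struct

def xor_delta(column: list[float]) -> bytes:
    """XOR each value's 64-bit pattern with its predecessor. Runs of
    identical or slowly-varying values become mostly zero bytes, which
    an LZ4-style compressor squeezes very well."""
    prev, out = 0, bytearray()
    for f in column:
        bits = struct.unpack("<Q", struct.pack("<d", f))[0]
        out += struct.pack("<Q", bits ^ prev)
        prev = bits
    return bytes(out)
```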
        
       | zigzag312 wrote:
       | How does it compare to vint64[0] performance-wise? For values
       | that fit the 64-bit space.
       | 
       | [0] https://crates.io/crates/vint64
        
       | danking00 wrote:
       | If I'm following, this presents an alternative format and
       | implementation to LEB128 which encodes and decodes substantially
       | faster. Notably, the implementation is quite simple. Cool! And
       | agreed that modern CPUs really suffer from branches.
       | 
        | Should I interpret the plot to mean the average elapsed
        | wall-clock time per integer decoded/encoded? And can I
        | conclude the throughput is the reciprocal? So about 100,000
        | integers per second, or around 1 MB/s of decompressed data.
       | 
       | I know this is a bit unfair because the implementation is much
       | more complex, but my first thought is why I would use vu128
       | instead of Lemire's Stream VByte:
       | https://arxiv.org/abs/1709.08990
       | 
       | A slight tangent but I stumbled on this library which stores
       | floats XOR'ed with the previous float in the stream:
       | https://github.com/velvia/compressed-vec it seems really clever
       | to me! They reference "Gorilla: A Fast, Scalable, In-Memory Time
       | Series Database" which in turn references two 2006 papers: "Fast
       | Lossless Compression of Scientific Floating-Point Data" and "Fast
       | and Efficient Compression of Floating-Point Data". Frustratingly,
       | the FB paper doesn't benchmark their XOR-based floating point
       | encoding but the earlier two papers do.
        
         | jmillikin wrote:
         | The plot is the time to encode/decode a list of 10,000 random
         | integers in a uniform distribution. The throughput is 2-10
         | GiB/s[0] on my desktop, with portable scalar operations (no
         | SIMD). Distributions skewed toward small values are slightly
         | faster[1].
         | 
          | I believe a decoder that used SIMD would perform even
          | faster, since the format is friendly to the BMI
          | instructions found on newer CPUs. You could also store the
          | data as concat(prefix_bytes) + concat(payloads), which
          | would really get things going.
         | 
         | For scientific data with lots of multi-byte f64 values, the
         | single branch for < 0xF0 might be unnecessary and undesirable.
         | There are other formats that can be decoded by completely
         | branchless SIMD, at the cost of larger sizes for very small
         | values. For example, Google has a format that uses the first
         | byte as a bitmask:
         | https://static.googleusercontent.com/media/research.google.c...
         | 
         | [0] vu128 is less affected by value size than VLQ/LEB128, so on
         | fast hardware it decodes 10,000 8-bit values in about the same
         | time as 10,000 64-bit values.
         | 
          | [1] In the discussion on lobste.rs[2] a user wondered what
          | the results would look like with a distribution containing
          | lots of small values. A Zipf distribution's results look
          | like <https://i.imgur.com/WGFtz6f.png>.
         | 
         | [2]
         | https://lobste.rs/s/qvoe7a/vu128_efficient_variable_length#c...
        
       ___________________________________________________________________
       (page generated 2024-05-23 23:01 UTC)