[HN Gopher] vu128: Efficient variable-length integers
___________________________________________________________________
vu128: Efficient variable-length integers
Author : jmillikin
Score : 77 points
Date : 2024-05-21 10:48 UTC (2 days ago)
(HTM) web link (john-millikin.com)
(TXT) w3m dump (john-millikin.com)
| lifthrasiir wrote:
| As with many other variable-length integer encodings, its
| efficiency really depends on the expected input distribution. One
| glaring issue in this regard is that its 2-byte encoding is
| rather wasted: only 6.25% of all possible encodings are
| useful. This can possibly be fixed by making the 2-byte encoding
| represent 240..495 instead of 0..255. You can even make use of
| some bits from the first byte, so that for example the sequence
| goes like: 0=00 1=01 2=02 .. BF=BF C0=C0C0 C1=C0C1 .. 30FF=F0FF
| 3100=F13100 .. (all in hex).
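| A rough Python sketch of the biased scheme above, covering just
| the one- and two-byte forms (the function names are mine, not
| from the thread):

```python
def encode_biased(n: int) -> bytes:
    """Sketch of the biased encoding described above: first bytes
    0x00..0xBF encode themselves; two-byte codes with first byte
    0xC0..0xF0 carry values up to 0x30FF instead of duplicating
    the single-byte range."""
    if n <= 0xBF:
        return bytes([n])
    assert n <= 0x30FF  # larger values need the 3-byte form (F1...)
    return bytes([0xC0 + (n >> 8), n & 0xFF])

def decode_biased(buf: bytes) -> int:
    if buf[0] <= 0xBF:
        return buf[0]
    return ((buf[0] - 0xC0) << 8) | buf[1]
```

| This reproduces the examples in the comment: 0xC0 encodes as
| C0 C0 and 0x30FF as F0 FF.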
|
| A radically different design, used by for example 7z, is also
| possible if you put all continuation bits into the first byte.
| For example, instead of having `1xxxxxxx 1yyyyyyy 0zzzzzzz` for
| 3-byte encoding, use `110xxxxx yyyyyyyy zzzzzzzz`. This is
| essentially as compact as the more common VLQ and LEB128 encodings,
| but you can determine the length of the whole sequence from the
| first byte alone! Unfortunately this is less suitable for
| integers larger than 64 bits, because then you run out of bits in
| the first byte. But it ends up with a rather elegant encoding for
| 64-bit integers, where `11111110` indicates 7 bytes and
| `11111111` indicates 8 bytes with no additional 0 bit needed.
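| A minimal decoder sketch for this all-continuation-bits-up-front
| layout (a hypothetical helper, including the 0xFE/0xFF endings
| described above):

```python
def decode_prefix(buf: bytes) -> tuple[int, int]:
    """Decode one integer; returns (value, bytes_consumed).
    The count of leading 1 bits in the first byte gives the
    number of extra payload bytes; the remaining low bits of
    the first byte are the most significant bits of the value.
    0xFF means 8 full payload bytes with no extra 0 bit."""
    b0 = buf[0]
    if b0 == 0xFF:
        n = 8
        value = 0
    else:
        n = 0
        while b0 & (0x80 >> n):  # count leading 1 bits
            n += 1
        value = b0 & (0x7F >> n)  # low bits left over in byte 0
    for b in buf[1:1 + n]:
        value = (value << 8) | b
    return value, 1 + n
```

| For the 3-byte case this matches `110xxxxx yyyyyyyy zzzzzzzz`:
| two leading 1 bits, five payload bits in the first byte.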
|
| > An afternoon of web searching has failed to uncover a prior
| name for this particular encoding, so I'm going to name it vu128
| and describe it below. If any reader knows of prior art, please
| send it my way.
|
| I don't know if it has other names, but a similar encoding can be
| seen from X.690 [1] and the "additional information" bit field of
| CBOR [2].
|
| [1] https://en.wikipedia.org/wiki/X.690#Length_octets
|
| [2] https://www.rfc-editor.org/rfc/rfc8949.html#section-3
| klabb3 wrote:
| > Unfortunately this is less suitable for integers larger than
| 64 bits, because then you run out of bits in the first byte.
|
| Just spitballing here, but is it necessary to support every
| increment of octets? Do we really need a 7-octet/56-bit width,
| for instance? Instead, couldn't we just allow widths of
| log2(n_octets)? An int64 would need 2 bits of length info (8,
| 16, 32, 64). Or maybe I'm thinking wrong, because not all bits
| can be used.
|
| > its efficiency really depends on the expected input
| distribution
|
| Right. But everything else in computers is rounded to the
| nearest power of 2, so maybe it works here too haha.
| vlovich123 wrote:
| If you don't need the full 64-bit space and can go 61-bit, you
| can actually encode the variable length in 3 bits and have at
| most an 8-byte value vs the 9-byte value here (ie 0 overhead):
| three bits to indicate whether the length is 1 to 8 bytes, and
| the remainder for the value. Even if you used the full 64 bits
| (67 bits with the length field), this is still 5 bits more
| efficient than the 72-bit worst case here, but unless you can
| make use of those extra 5 bits somehow (eg tightly packing
| another value that's also partially sliced), it's not buying
| you anything.
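| The 3-bit-length idea can be sketched like this (names and the
| big-endian byte order are my own choices, not from the comment):

```python
def encode61(n: int) -> bytes:
    """Pack a value < 2**61 with a 3-bit length field in the top
    of the first byte; total size is 1..8 bytes, never 9."""
    assert n < 1 << 61
    nbytes = 1
    while n >= 1 << (8 * nbytes - 3):  # 3 bits reserved for length
        nbytes += 1
    # The length field stores nbytes - 1 (0..7) in the top 3 bits.
    return (((nbytes - 1) << (8 * nbytes - 3)) | n).to_bytes(nbytes, 'big')

def decode61(buf: bytes) -> int:
    nbytes = (buf[0] >> 5) + 1
    raw = int.from_bytes(buf[:nbytes], 'big')
    return raw & ((1 << (8 * nbytes - 3)) - 1)  # strip the length bits
```

| Values up to 31 fit in one byte; the worst case is exactly 8
| bytes for any value below 2^61.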
| repelsteeltje wrote:
| Clever. But..
|
| If your (worst case) is 8 bytes for 61 bits, that's not 0
| overhead, but 3 bits of overhead.
|
| Another detail is that vu128 goes up to double 64 bits, ie:
| 17 bytes worst case, or 8 bits of overhead. That would
| presumably be 4 bits in your example, to represent a 124-bit
| payload.
| vlovich123 wrote:
| 0 overhead in that it takes 8 bytes in memory and up to 8
| bytes serialized. You can't slice a byte partially unless
| you're doing packing so I think it's fair to characterize
| it as 0 overhead, but sure it's 3 bits of overhead vs 8. It
| came up because I had a counter and just arbitrarily cut
| off the range cause I couldn't possibly ever have it count
| that high. But yeah, the same idea could be extended to 128
| bits.
| repelsteeltje wrote:
| Fair enough. In case this was a counter, best make sure
| you get the overflow logic right. The normal 64-bit
| wrapping won't kick in if you can only (de)serialize
| 61 bits.
|
| Or (depending on the context of this counter) you may
| never want to overflow at all - in which case you're best
| off with a variable-length representation that is open-
| ended (like R or Python arbitrary precision).
| vlovich123 wrote:
| It would take an absolute top-of-the-line CPU doing
| nothing but incrementing the counter more than 12 years
| to overflow a 61-bit number. And what this is counting is
| much more expensive than incrementing a counter, so you
| can't run this at 6 GHz. Maybe ~500 MHz, which gets you
| to > 100 years if that's all you're doing; but this
| counter is a side operation that won't see much traffic
| even at scale, so you're talking about millennia in
| practice, I suspect.
|
| So it's just not a practical concern - I just panic if it
| hits the limit. I think you may be failing to comprehend
| just how large numbers like this get and how long they
| take to count through, since overflow is such a very real
| concern with 32-bit counters. Addition of course can
| always overflow, but this is a straight-up strictly
| monotonic counter that counts from 0 to 2^61.
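| The arithmetic checks out, assuming one increment per cycle:

```python
# 2**61 increments at an optimistic 6 GHz, one increment per cycle:
seconds = 2 ** 61 / 6e9
years = seconds / (365 * 24 * 3600)  # roughly 12 years

# At a ~500 MHz effective increment rate instead:
slow_years = (2 ** 61 / 5e8) / (365 * 24 * 3600)  # well over 100 years
```
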
| repelsteeltje wrote:
| :-) fair enough.
|
| And panicking is exactly the kind of mitigation I had in
| mind initially; just don't silently do the unexpected.
|
| > I think you may be failing to comprehend just how large
| numbers like this get
|
| On the other hand, us programmers are also notorious for
| failing to comprehend just how long our software will
| live on...
| vlovich123 wrote:
| Honestly if any code I've written survives 100 years to
| cause problems I'll be impressed. 1000 years I'll be
| ecstatic.
| plesner wrote:
| I implemented a slight variation of this a while back which
| worked as described up to 11110xxx but then a prefix byte of
| 11111xxx meant a payload of 64*2^xxx bits. So 11111000 is
| followed by a 64-bit value, 11111001 by a 128-bit, up to
| 11111111 which is 8192 bits.
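| In that variant the payload size is recoverable directly from
| the low bits of the prefix byte; a tiny sketch (the helper name
| is mine):

```python
def payload_bits(prefix: int) -> int:
    """For a prefix byte of the form 0b11111xxx, the payload that
    follows is 64 * 2**xxx bits wide, per the variant above."""
    assert prefix & 0xF8 == 0xF8  # must be 11111xxx
    return 64 << (prefix & 0x07)
```
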
| Dylan16807 wrote:
| > its efficiency really depends on the expected input
| distribution. One glaring issue in this regard is that its
| 2-byte encoding is rather wasted
|
| Yeah, and it's worth noting that this encoding is very close to
| just length then value. It only saves space for numbers up to
| about +-100. In a lot of situations it wouldn't actually do
| better than tag-length-value.
| lostmsu wrote:
| Also, real-world numbers tend to swarm around powers of 10 and,
| in computers, powers of 2. It might make sense to encode those
| in a special way. E.g. <pow2-prefix><var-len-power><0/1 - to
| subtract one> and <pow10-prefix><var-len-power><0/1 - to
| subtract one>. E.g. assuming 10 prefix reserved for pow2 and 11
| for pow10, 255 would be encoded as 10_1000_1, 65535 would be
| 10_10000_1, 65536 would be 10_10000_0, and 10000 would be
| 11_11_0.
| chubot wrote:
| FWIW the lobste.rs thread surfaced this 2016 comment thread about
| PrefixVarint vs. LEB128 varint in protobufs, which has lots of
| good info, and makes the UTF-8 analogy:
|
| https://news.ycombinator.com/item?id=11263378
|
| Also sqlite varint:
|
| https://sqlite.org/src4/doc/trunk/www/varint.wiki
| akira2501 wrote:
| > Also sqlite varint:
|
| That's not actually in use. The SQLite3 format is far more
| basic:
|
| "A variable-length integer or "varint" is a static Huffman
| encoding of 64-bit twos-complement integers that uses less
| space for small positive values. A varint is between 1 and 9
| bytes in length. The varint consists of either zero or more
| bytes which have the high-order bit set followed by a single
| byte with the high-order bit clear, or nine bytes, whichever is
| shorter. The lower seven bits of each of the first eight bytes
| and all 8 bits of the ninth byte are used to reconstruct the
| 64-bit twos-complement integer. Varints are big-endian: bits
| taken from the earlier byte of the varint are more significant
| than bits taken from the later bytes."
|
| From: https://www.sqlite.org/fileformat2.html Section 1.6
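| A direct transcription of that quoted description into Python
| (decoder only; the function name is mine):

```python
def decode_sqlite3_varint(buf: bytes) -> tuple[int, int]:
    """Decode an SQLite3 varint as quoted above: up to eight
    bytes contribute 7 bits each (high bit = continuation); a
    ninth byte, if reached, contributes all 8 bits. Big-endian:
    earlier bytes hold the more significant bits."""
    value = 0
    for i in range(8):
        value = (value << 7) | (buf[i] & 0x7F)
        if buf[i] & 0x80 == 0:
            return value, i + 1
    value = (value << 8) | buf[8]  # ninth byte: all 8 bits
    return value, 9
```

| Note how nine bytes of 0xFF decode to exactly 2^64 - 1: eight
| 7-bit groups plus one 8-bit group is 64 bits.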
| vitiral wrote:
| zig-zag encoding is brilliant. Basically you can determine
| positive or negative based on odd/even and divide by 2 to get the
| value. I tried to come up with a similar encoding and never
| thought of this.
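| For concreteness, the usual bit-twiddled form of zig-zag
| (assuming values that fit in a signed 64-bit range):

```python
def zigzag_encode(n: int) -> int:
    """Map 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
    (n >> 63 spreads the sign bit across all bits; assumes
    |n| < 2**63 so the shift behaves like 64-bit arithmetic)."""
    return (n << 1) ^ (n >> 63)

def zigzag_decode(z: int) -> int:
    # Even codes are non-negative, odd codes are negative.
    return (z >> 1) ^ -(z & 1)
```

| As the comment says: odd/even gives the sign, and halving
| recovers the magnitude.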
| repelsteeltje wrote:
| Yeah basically it's ones complement (or twos complement) but
| with the sign bit on the right, rather than the left.
| edflsafoiewq wrote:
| It's the same as enumerating 2D points in a spiral, but
| restricted to 1D.
| Const-me wrote:
| When I need them I usually implement the codec used in MKV
| format: https://www.rfc-editor.org/rfc/rfc8794.html
|
| Similar performance characteristics, i.e. just the first byte
| tells the encoded length, no need to test every byte. Luckily,
| most programming languages have intrinsic functions for BSWAP
| instruction.
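| The VINT decode is short: the position of the first set bit in
| the first byte gives the total width. A sketch (assuming width
| <= 8, i.e. buf[0] != 0):

```python
def decode_vint(buf: bytes) -> tuple[int, int]:
    """Decode an EBML/Matroska VINT (RFC 8794): the number of
    leading zero bits in the first byte, plus one, is the total
    width in bytes; the marker bit is stripped and the remaining
    bits are the big-endian value."""
    width = 1
    while not buf[0] & (0x80 >> (width - 1)):
        width += 1
    value = int.from_bytes(buf[:width], 'big')
    value &= (1 << (8 * width - width)) - 1  # clear the marker bit
    return value, width
```

| So 0x82 is a 1-byte VINT for 2, and 0x40 0x02 a 2-byte VINT
| for the same value, exactly the "length from the first byte"
| property mentioned above.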
| repelsteeltje wrote:
| MKV (and AV1, VP8, VP9) use the LEB128 format, which the
| article mentions is inefficient to parse on off-the-shelf
| CPUs.
|
| LEB128 is relatively efficient in the bitstream and sidesteps
| potential MPEG-LA patent issues (H.264, maybe 14496 part 14).
| Const-me wrote:
| I don't think they do. These formats are based on EBML, see
| the RFC link in my previous comment. That VINT encoding is
| not LEB128, it's better.
|
| BTW I once implemented my own parser in C#, see that source
| file: https://github.com/Const-
| me/Vrmac/blob/master/VrmacVideo/Con...
| kunley wrote:
| There is also SQLite varint,
| https://sqlite.org/src4/doc/trunk/www/varint.wiki
|
| IIRC different from both Leb128 and VLQ, and quite neat. It's for
| 64-bit (u)ints, though.
| terrelln wrote:
| Variable-length integer schemes are great when you are encoding
| one integer. But when you are encoding a list of integers, you
| should really consider a different scheme.
|
| Variable-length integer schemes generally interleave the control
| bits & data bits. This means you don't know where the next
| integer starts until you at least partially decode the current
| integer.
|
| When encoding a list of integers, you would rather put all the
| control information together, and all the data together. This
| generally allows for significantly more efficient decoding using
| SIMD.
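| A toy scalar version of that split, in the spirit of Lemire's
| Stream VByte (2-bit length codes packed four per control byte,
| payloads little-endian; names and details are mine, not the
| paper's exact format):

```python
def encode_split(values):
    """Separate control stream (2-bit length codes, meaning
    1..4 payload bytes) from the data stream of payloads."""
    control, data = bytearray(), bytearray()
    for i, v in enumerate(values):
        assert v < 1 << 32
        n = max(1, (v.bit_length() + 7) // 8)  # payload bytes, 1..4
        if i % 4 == 0:
            control.append(0)  # start a new control byte
        control[-1] |= (n - 1) << (2 * (i % 4))
        data += v.to_bytes(n, 'little')
    return bytes(control), bytes(data)

def decode_split(control, data, count):
    out, pos = [], 0
    for i in range(count):
        n = ((control[i // 4] >> (2 * (i % 4))) & 3) + 1
        out.append(int.from_bytes(data[pos:pos + n], 'little'))
        pos += n
    return out
```

| Because all the control bits sit together, a SIMD decoder can
| look up shuffle masks per control byte instead of branching on
| every data byte.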
| tatersolid wrote:
| Don't forget column-stores. Lists of floats and ints are
| generally stored "full size", but XORed with the preceding
| value, and then the result is compressed in chunks with a
| high-speed compressor like LZ4. We see 20x or higher
| compression ratios on many of our column-stores, which contain
| data such as invoice amounts as metrics. Decompression overhead
| is effectively negative (much faster than operating on the raw
| uncompressed data, which is bound by memory bandwidth or IO).
| zigzag312 wrote:
| How does it compare to vint64[0] performance-wise? For values
| that fit the 64-bit space.
|
| [0] https://crates.io/crates/vint64
| danking00 wrote:
| If I'm following, this presents an alternative format and
| implementation to LEB128 which encodes and decodes substantially
| faster. Notably, the implementation is quite simple. Cool! And
| agreed that modern CPUs really suffer from branches.
|
| Should I interpret the plot to mean the average elapsed wall
| clock time per integer decoded/encoded? And can I conclude the
| throughput is the reciprocal? So about 100,000 integers per
| second, or around 1 MB/s of decompressed data.
|
| I know this is a bit unfair because the implementation is much
| more complex, but my first thought is why I would use vu128
| instead of Lemire's Stream VByte:
| https://arxiv.org/abs/1709.08990
|
| A slight tangent but I stumbled on this library which stores
| floats XOR'ed with the previous float in the stream:
| https://github.com/velvia/compressed-vec it seems really clever
| to me! They reference "Gorilla: A Fast, Scalable, In-Memory Time
| Series Database" which in turn references two 2006 papers: "Fast
| Lossless Compression of Scientific Floating-Point Data" and "Fast
| and Efficient Compression of Floating-Point Data". Frustratingly,
| the FB paper doesn't benchmark their XOR-based floating point
| encoding but the earlier two papers do.
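| The Gorilla-style XOR trick is tiny to prototype (this is just
| the preprocessing step; the entropy coding of the zero runs is
| the part those papers vary on):

```python
import struct

def xor_deltas(floats):
    """XOR each f64's bit pattern with the previous one.
    Identical or slowly-changing values produce long runs of
    zero bits, which compress well downstream."""
    prev = 0
    out = []
    for f in floats:
        bits = struct.unpack('<Q', struct.pack('<d', f))[0]
        out.append(bits ^ prev)
        prev = bits
    return out
```

| A repeated value XORs to all-zero words, which is why time
| series with slowly varying samples compress so well.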
| jmillikin wrote:
| The plot is the time to encode/decode a list of 10,000 random
| integers in a uniform distribution. The throughput is 2-10
| GiB/s[0] on my desktop, with portable scalar operations (no
| SIMD). Distributions skewed toward small values are slightly
| faster[1].
|
| I believe a decoder that used SIMD would perform even faster
| since the format is friendly to the BMI instructions found on
| newer CPUs. You could also store the data as
| concat(prefix_bytes) + concat(payloads), which would really get
| things going.
|
| For scientific data with lots of multi-byte f64 values, the
| single branch for < 0xF0 might be unnecessary and undesirable.
| There are other formats that can be decoded by completely
| branchless SIMD, at the cost of larger sizes for very small
| values. For example, Google has a format that uses the first
| byte as a bitmask:
| https://static.googleusercontent.com/media/research.google.c...
|
| [0] vu128 is less affected by value size than VLQ/LEB128, so on
| fast hardware it decodes 10,000 8-bit values in about the same
| time as 10,000 64-bit values.
|
| [1] In the discussion on lobste.rs[2] a user wondered what the
| results would look like with a distribution containing lots of
| small values. A Zipf distribution's results look like
| <https://i.imgur.com/WGFtz6f.png>.
|
| [2]
| https://lobste.rs/s/qvoe7a/vu128_efficient_variable_length#c...
___________________________________________________________________
(page generated 2024-05-23 23:01 UTC)