[HN Gopher] Another variable-length integer encoding (2021)
       ___________________________________________________________________
        
       Another variable-length integer encoding (2021)
        
       Author : fanf2
       Score  : 34 points
       Date   : 2024-08-11 17:42 UTC (5 hours ago)
        
 (HTM) web link (dcreager.net)
 (TXT) w3m dump (dcreager.net)
        
       | Retr0id wrote:
       | I like this a lot.
       | 
       | A third approach not mentioned in the article is to use some form
       | of length prefix (which is itself a fixed number of bits long).
       | This is usually cheaper to decode, but a fixed-length prefix
        | disproportionately "wastes" space on short numbers.
       | 
       | The approach in the article gives the best of both worlds, all
       | while being very cheap to decode. I'm guessing it might also be
       | fairly SIMD-friendly.
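        | 
        | To make "cheap to decode" concrete, here is a rough sketch of
        | a decoder for this kind of unary-prefix scheme. The exact bit
        | layout and names are my guesses rather than the article's
        | code, and it assumes GCC/Clang builtins and at least 9
        | readable bytes at p:
        | 
        |   #include <stddef.h>
        |   #include <stdint.h>
        | 
        |   static uint64_t decode(const uint8_t *p, size_t *len)
        |   {
        |       /* leading zeros of the first byte = extra bytes */
        |       unsigned extra =
        |           p[0] ? (unsigned)__builtin_clz(p[0]) - 24 : 8;
        |       *len = 1 + extra;
        |       /* drop the tag bits, then append the remaining
        |          payload bytes big-endian */
        |       uint64_t v = p[0] & (0x7fu >> extra);
        |       for (unsigned i = 1; i <= extra; i++)
        |           v = (v << 8) | p[i];
        |       return v;
        |   }
        | 
        | The only special case is p[0] == 0 (a full 9-byte encoding),
        | since CLZ of zero is undefined for the builtin.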
        
         | loeg wrote:
         | I wonder how well branch predictors handle CLZ vs just masking
         | and comparing bytes. I expect the latter to work really well,
         | but who knows how optimized CLZ-derived branches are.
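          | 
          | For what it's worth, here is roughly what the two length
          | computations look like for a tag-in-the-first-byte layout
          | (reading "masking and comparing" as a compare chain on the
          | first byte; purely illustrative, GCC/Clang builtin assumed):
          | 
          |   /* CLZ: branch-free except for the zero check */
          |   static unsigned len_clz(uint8_t b)
          |   {
          |       return b ? (unsigned)__builtin_clz(b) - 23 : 9;
          |   }
          | 
          |   /* compare chain: branchy, but the branches should be
          |      very predictable when most values are short */
          |   static unsigned len_cmp(uint8_t b)
          |   {
          |       if (b >= 0x80) return 1;
          |       if (b >= 0x40) return 2;
          |       if (b >= 0x20) return 3;
          |       if (b >= 0x10) return 4;
          |       if (b >= 0x08) return 5;
          |       if (b >= 0x04) return 6;
          |       if (b >= 0x02) return 7;
          |       if (b >= 0x01) return 8;
          |       return 9;
          |   }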
        
       | Veserv wrote:
       | Unary-encoded length prefix and big-endian serialization.
       | 
       | There is no good reason to use big-endian serialization. I think
       | they are doing that because they only know about "count leading
       | zeros (CLZ)" to extract the unary length prefix. "Count trailing
       | zeros (CTZ)" is also a commonly supported operation (and can be
       | efficiently emulated with CLZ if it does not exist) which makes
       | it easy to extract the unary length prefix even if it is little-
       | endian.
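        | 
        | A minimal sketch of what that looks like, with a layout of my
        | own choosing (unary tag in the low bits of the first byte,
        | values up to 56 bits so the whole encoding fits in one 8-byte
        | load, little-endian host, GCC/Clang builtins):
        | 
        |   #include <stddef.h>
        |   #include <stdint.h>
        |   #include <string.h>
        | 
        |   static uint64_t decode_le(const uint8_t *p, size_t *len)
        |   {
        |       uint64_t w;
        |       memcpy(&w, p, 8);          /* one unaligned load */
        |       /* trailing zeros of the tag give the length, 1..8 */
        |       unsigned n = (unsigned)__builtin_ctzll(w) + 1;
        |       *len = n;
        |       /* shift off the tag, mask off the next field */
        |       return (w >> n) & ((1ull << (7 * n)) - 1);
        |   }
        | 
        | No byte swapping: one load, one CTZ, a shift and a mask.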
        
         | rocqua wrote:
         | Is there a good reason not to use big endian?
        
           | Veserv wrote:
           | For starters, all modern machines are little-endian, so you
            | need to flip the bytes. This gets more confusing when you
            | layer another encoding on top, since you now have to define
            | how the two compose and when each applies (i.e., do you
            | encode the value to big-endian and then attach the prefix,
            | or attach the prefix and then encode to big-endian? Their
            | explanation implies the former is their mental model;
            | luckily, I think it does not actually matter in this case).
           | 
            | Big-endian also has problems with streaming values of
            | unknown length, since decoding a big-endian byte pattern is
            | inherently length-dependent; so even if all machines were
            | big-endian, you should still prefer little-endian encodings
            | for wire formats.
           | 
           | If you try to pre-load the bytes in a little-endian encoding
           | then the bytes at higher memory addresses, which you want to
           | process later, can be trivially masked off. If you try to
           | pre-load the bytes in a big-endian encoding then the bytes at
           | higher memory addresses must be masked off and then the bytes
           | you care about need to be shifted so they get interpreted
           | correctly.
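            | 
            | A small illustration of that last point (toy code, not
            | anyone's real implementation; little-endian host and
            | GCC/Clang builtins assumed). Keeping only the first n
            | bytes of an 8-byte pre-load:
            | 
            |   #include <stdint.h>
            |   #include <string.h>
            | 
            |   /* little-endian encoding: mask and you're done */
            |   static uint64_t first_n_le(const uint8_t *p,
            |                              unsigned n) /* 1..8 */
            |   {
            |       uint64_t w;
            |       memcpy(&w, p, 8);
            |       uint64_t m = n == 8 ? ~0ull
            |                           : (1ull << (8 * n)) - 1;
            |       return w & m;   /* value already right-aligned */
            |   }
            | 
            |   /* big-endian encoding: swap, then shift the wanted
            |      bytes down into place as well */
            |   static uint64_t first_n_be(const uint8_t *p,
            |                              unsigned n) /* 1..8 */
            |   {
            |       uint64_t w;
            |       memcpy(&w, p, 8);
            |       w = __builtin_bswap64(w);
            |       return w >> (8 * (8 - n));
            |   }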
        
         | pkhuong wrote:
         | I can see a valid reason, specifically for variable-length
         | encodings: big-endian makes it easy to compare compound keys
         | lexicographically without decoding (little-endian can still do
         | it, but everything has to be flipped to read from high to low
         | addresses). Unfortunately, encoding continuation bits as 0
         | instead of 1 prevents that from working.
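          | 
          | A toy demonstration, with my own encoder for the un-flipped
          | convention (leading 1 bits = extra bytes, big-endian
          | payload, values below 2^56 assumed):
          | 
          |   #include <stdint.h>
          |   #include <stdio.h>
          |   #include <string.h>
          | 
          |   static size_t enc(uint64_t v, uint8_t out[8])
          |   {
          |       size_t n = 1;               /* total bytes */
          |       while (n < 8 && v >> (7 * n))
          |           n++;
          |       /* n-1 leading ones, a zero, then the value */
          |       out[0] = (uint8_t)(((0xff << (9 - n)) & 0xff)
          |                          | (v >> (8 * (n - 1))));
          |       for (size_t i = 1; i < n; i++)
          |           out[i] = (uint8_t)(v >> (8 * (n - 1 - i)));
          |       return n;
          |   }
          | 
          |   int main(void)
          |   {
          |       uint8_t a[8], b[8];
          |       size_t la = enc(100, a), lb = enc(300, b);
          |       /* lexicographic compare without decoding */
          |       int c = memcmp(a, b, la < lb ? la : lb);
          |       if (c == 0)
          |           c = (la > lb) - (la < lb);
          |       printf("%d\n", c);   /* negative: 100 < 300 */
          |       return 0;
          |   }
          | 
          | If I read the article right, its flipped continuation bits
          | would give 0xE4 for 100 and 0x41 0x2C for 300, so the
          | byte-wise comparison comes out the wrong way round.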
        
       | mananaysiempre wrote:
       | So, UTF-8 without the self-synchronization part. It's a
       | reasonable encoding, sure. I'm not sure what the cost of making
       | the whole thing big endian is, though (guess I need to look at
       | simdutf8!); at first blush it feels like making this wholly
       | little-endian-adapted, with the continuation bits = byte count in
       | unary located in the _low_ bits, would result in very cheap
       | decoding on little-endian machines.
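        | 
        | The encode side of that all-little-endian variant is pleasant
        | too. A rough sketch of one possible layout (unary byte count
        | terminated by a 1 bit in the low bits; values below 2^56 and
        | a little-endian host assumed):
        | 
        |   #include <stddef.h>
        |   #include <stdint.h>
        |   #include <string.h>
        | 
        |   static size_t encode_le(uint64_t v, uint8_t out[8])
        |   {
        |       size_t n = 1;                /* total bytes */
        |       while (n < 8 && v >> (7 * n))
        |           n++;
        |       /* value above an n-bit tag: n-1 zeros, then a 1 */
        |       uint64_t w = (v << n) | (1ull << (n - 1));
        |       memcpy(out, &w, n);          /* low n bytes */
        |       return n;
        |   }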
        
       | rocqua wrote:
        | Cool, it certainly makes sense to let the parser know ahead of
        | time how many bytes to read. Not just for cleaner code, but
        | for faster code as well.
        
       | fritzo wrote:
       | It seems like a more SIMD-friendly version would separately store
       | a byte mask and the packed byte data. Then you can simultaneously
       | expand 16 int32s using e.g. AVX512's
       | _mm512_maskz_expand_epi8(mask,_mm512_loadu_epi8(data)).
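        | 
        | Something like this, I think (untested; needs AVX-512 VBMI2
        | and BW, and assumes each value's bytes are packed least-
        | significant first, with the mask bits set at the low byte
        | positions of each 32-bit lane):
        | 
        |   #include <immintrin.h>
        |   #include <stdint.h>
        | 
        |   /* decode 16 uint32s from packed bytes + a byte mask */
        |   static void expand16(const uint8_t *data,
        |                        __mmask64 mask, uint32_t out[16])
        |   {
        |       __m512i packed = _mm512_loadu_epi8(data);
        |       /* scatter the packed bytes, in order, into the byte
        |          lanes selected by mask; the rest become zero */
        |       __m512i lanes =
        |           _mm512_maskz_expand_epi8(mask, packed);
        |       _mm512_storeu_si512(out, lanes);
        |   }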
        
         | mananaysiempre wrote:
         | Stream VByte[1,2,3] is not far away from that idea, storing the
         | packed lengths in a separate stream from the data bytes.
         | (There's also a more conventional interleaved design in
         | varint-G8IU, but Stepanov decided to fuck everyone over and
         | patented that.)
         | 
         | [1] https://lemire.me/blog/2017/09/27/stream-vbyte-breaking-
         | new-...
         | 
         | [2] https://arxiv.org/abs/1709.08990
         | 
         | [3] https://github.com/lemire/streamvbyte
        
       | ralferoo wrote:
       | > And, for reasons we'll see in a bit, we flip the meaning of the
       | continuation bits: we'll use 0 to mean "another byte follows",
       | and 1 to mean "this is the last byte".
       | 
       | EDIT: just noticed that I'd skim-read over: "imperial varint
       | places _all_ of the continuation bits at the beginning of the
        | encoded value", which makes my comment completely wrong.
       | 
       | ~Doesn't even mention possibly one of the most useful features of
       | encoding this way, which is that you can never [1] end up with a
       | 0 byte in the output, meaning that you can treat the output as a
       | C string.~
       | 
        | ~[1] or at least never, assuming you're trying to create the
       | shortest encoding~
        
         | layer8 wrote:
         | This is not true for the described new encoding, as the
         | continuation bits are collected together at the beginning.
        
           | ralferoo wrote:
           | Yeah, I'd re-read that section of the article and edited my
           | post already to reflect this before I saw your reply.
           | 
           | My fault for reading the first proposed solution, thinking it
           | was stupid for having the continuation bits the other way
           | round, and skipping ahead to where I saw they'd inverted them
           | and stopped reading at that point.
        
       | maxmuen wrote:
        | IMHO, spending 12.5% of all bits just to store the length is
        | wasteful. The encoding in QUIC[1] stores the length in the
        | leading 2 bits of the first byte. It's very fast to read and
        | write, with just one caveat: at most 62 bits are allowed.
       | 
       | [1]: https://www.rfc-editor.org/rfc/rfc9000.html#name-variable-
       | le...
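        | 
        | For reference, decoding it is about as simple as it gets. A
        | sketch in the spirit of the RFC's sample code, bounds checks
        | omitted:
        | 
        |   #include <stddef.h>
        |   #include <stdint.h>
        | 
        |   static uint64_t quic_varint(const uint8_t *p, size_t *len)
        |   {
        |       size_t n = (size_t)1 << (p[0] >> 6); /* 1,2,4,8 */
        |       uint64_t v = p[0] & 0x3f;   /* drop the 2 tag bits */
        |       for (size_t i = 1; i < n; i++)
        |           v = (v << 8) | p[i];    /* big-endian payload */
        |       *len = n;
        |       return v;
        |   }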
        
         | Retr0id wrote:
         | It depends on your probability distribution of values.
         | 
         | Storing the number 100 (decimal) under the QUIC scheme would
         | require 2 bytes, but only 1 byte under either "metric" or
         | "imperial" varints.
         | 
          | (The same holds for any value from 64 through 127: QUIC's
          | one-byte form covers 0-63, while a one-byte varint covers
          | 0-127.)
        
       | jeffbee wrote:
       | Unary length prefix is a solid technique. I would be careful with
        | the performance claims, though. Sometimes the machine
        | surprises you, and a system that looks like it needs to loop
        | and branch turns out to be faster than you expected. The
       | protobuf project, unsurprisingly for a project of its age, has
       | several different varint parsing strategies. One is described at
       | https://github.com/protocolbuffers/protobuf/blob/main/src/go...
       | and another at
       | https://github.com/protocolbuffers/protobuf/blob/main/src/go...
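        | 
        | For context, the textbook loop-and-branch decoder for the
        | protobuf wire format (7 value bits per byte, the MSB set
        | meaning "more bytes follow") is something like this; the
        | naive version, not protobuf's actual code:
        | 
        |   #include <stddef.h>
        |   #include <stdint.h>
        | 
        |   /* returns 1 on success, 0 if truncated or overlong */
        |   static int varint_u64(const uint8_t *p, size_t avail,
        |                         uint64_t *out, size_t *len)
        |   {
        |       uint64_t v = 0;
        |       for (size_t i = 0; i < avail && i < 10; i++) {
        |           v |= (uint64_t)(p[i] & 0x7f) << (7 * i);
        |           if (!(p[i] & 0x80)) {   /* last byte */
        |               *out = v;
        |               *len = i + 1;
        |               return 1;
        |           }
        |       }
        |       return 0;
        |   }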
        
       | sethev wrote:
       | > Note that even though the goal, and end result, is that the
       | file is smaller, this is not a compression scheme, since it only
       | works when your values are distributed in one particular way
       | 
       | This statement doesn't sound right. I think he's trying to say
        | it's not a general-purpose compression scheme, but every
        | compression scheme only works on certain patterns. In fact,
        | most possible streams of bytes are not compressible at all.
        | Even the general-purpose approaches only work on very specific
        | types of input (such as natural language) that contain a lot of
       | redundancy.
        
         | Retr0id wrote:
         | Not only that, but I believe these "imperial varints" could be
          | described as a special case of Huffman coding.
        
       ___________________________________________________________________
       (page generated 2024-08-11 23:00 UTC)