[HN Gopher] Another variable-length integer encoding (2021)
___________________________________________________________________
Another variable-length integer encoding (2021)
Author : fanf2
Score : 34 points
Date : 2024-08-11 17:42 UTC (5 hours ago)
(HTM) web link (dcreager.net)
(TXT) w3m dump (dcreager.net)
| Retr0id wrote:
| I like this a lot.
|
| A third approach not mentioned in the article is to use some form
| of length prefix (which is itself a fixed number of bits long).
| This is usually cheaper to decode, but a fixed-length prefix
| disproportionately "wastes" space on small numbers.
|
| The approach in the article gives the best of both worlds, all
| while being very cheap to decode. I'm guessing it might also be
| fairly SIMD-friendly.
| loeg wrote:
| I wonder how well branch predictors handle CLZ vs just masking
| and comparing bytes. I expect the latter to work really well,
| but who knows how optimized CLZ-derived branches are.
| Veserv wrote:
| Unary-encoded length prefix and big-endian serialization.
|
| There is no good reason to use big-endian serialization. I think
| they are doing that because they only know about "count leading
| zeros (CLZ)" to extract the unary length prefix. "Count trailing
| zeros (CTZ)" is also a commonly supported operation (and can be
| efficiently emulated with CLZ if it does not exist) which makes
| it easy to extract the unary length prefix even if it is little-
| endian.
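|
| A minimal sketch of both length extractions, in C; the low-bits
| layout here is hypothetical, mirroring the article's high-bits
| one:
|
|   #include <stdint.h>
|
|   /* Unary tag in the LOW bits of the first byte ((n-1) zeros,
|    * then a 1): count trailing zeros.  first_byte == 0 would
|    * mean a 9+ byte encoding and is not handled here.          */
|   static inline unsigned varint_len_ctz(uint8_t first_byte)
|   {
|       return (unsigned)__builtin_ctz(first_byte) + 1;
|   }
|
|   /* The article's big-endian layout puts the tag in the HIGH
|    * bits, so it uses count-leading-zeros instead.             */
|   static inline unsigned varint_len_clz(uint8_t first_byte)
|   {
|       /* 32-bit clz; the top 24 bits of the promoted byte are 0 */
|       return (unsigned)__builtin_clz(first_byte) - 23;
|   }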
| rocqua wrote:
| Is there a good reason not to use big endian?
| Veserv wrote:
| For starters, all modern machines are little-endian, so you
| need to flip the bytes. This is made more confusing when you
| layer another encoding on top since now you need to define
| how they are layered and when the encodings apply (i.e. Do
| you encode the value to big-endian then attach the prefix or
| do you attach the prefix then encode to big-endian? In this
| case their explanation indicates implicitly that their mental
| model is the former. Luckily, I think it does not actually
| matter in this case.)
|
| Big-endian also suffers from problems around streaming
| unknown-length values as the decoding of a big-endian byte
| pattern is inherently length-dependent, so even if all
| machines were big-endian you should still prefer little-
| endian encodings for wire formats.
|
| If you try to pre-load the bytes in a little-endian encoding
| then the bytes at higher memory addresses, which you want to
| process later, can be trivially masked off. If you try to
| pre-load the bytes in a big-endian encoding then the bytes at
| higher memory addresses must be masked off and then the bytes
| you care about need to be shifted so they get interpreted
| correctly.
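|
| A rough illustration of that masking argument (assumes a
| little-endian host, at least 8 readable bytes at p, and
| 1 <= len <= 8):
|
|   #include <stdint.h>
|   #include <string.h>
|
|   /* Little-endian wire order: pre-load 8 bytes, then just mask
|    * off the bytes past `len` -- no shifting required.          */
|   static uint64_t load_le(const uint8_t *p, unsigned len)
|   {
|       uint64_t v;
|       memcpy(&v, p, 8);                 /* speculative 8-byte load */
|       return len >= 8 ? v : v & ((1ULL << (8 * len)) - 1);
|   }
|
|   /* Big-endian wire order: after the load the value must also
|    * be byte-swapped (on a little-endian host) and shifted down. */
|   static uint64_t load_be(const uint8_t *p, unsigned len)
|   {
|       uint64_t v;
|       memcpy(&v, p, 8);
|       v = __builtin_bswap64(v);         /* wire order -> host order */
|       return v >> (8 * (8 - len));      /* drop the extra bytes     */
|   }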
| pkhuong wrote:
| I can see a valid reason, specifically for variable-length
| encodings: big-endian makes it easy to compare compound keys
| lexicographically without decoding (little-endian can still do
| it, but everything has to be flipped to read from high to low
| addresses). Unfortunately, encoding continuation bits as 0
| instead of 1 prevents that from working.
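|
| A toy illustration of that property (byte values assume
| shortest-form encodings under the two tag conventions):
|
|   #include <assert.h>
|   #include <string.h>
|
|   int main(void)
|   {
|       /* Tag bits as leading ONES: a longer encoding gets a larger
|        * lead byte, so memcmp order matches numeric order.         */
|       const unsigned char enc5[]   = { 0x05 };        /* 5   */
|       const unsigned char enc300[] = { 0x81, 0x2C };  /* 300 */
|       assert(memcmp(enc5, enc300, 1) < 0);   /* lead bytes differ */
|
|       /* With the article's leading-ZEROS tag, 5 encodes as 0x85
|        * and 300 as 0x41 0x2C, so the bytes compare the wrong way. */
|       const unsigned char enc5z[]   = { 0x85 };
|       const unsigned char enc300z[] = { 0x41, 0x2C };
|       assert(memcmp(enc5z, enc300z, 1) > 0);
|       return 0;
|   }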
| mananaysiempre wrote:
| So, UTF-8 without the self-synchronization part. It's a
| reasonable encoding, sure. I'm not sure what the cost of making
| the whole thing big endian is, though (guess I need to look at
| simdutf8!); at first blush it feels like making this wholly
| little-endian-adapted, with the continuation bits = byte count in
| unary located in the _low_ bits, would result in very cheap
| decoding on little-endian machines.
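|
| A sketch of that hypothetical little-endian variant (byte count
| in unary in the low bits, payload above it, the whole encoding
| stored little-endian; handles up to 8 bytes and assumes 8
| readable bytes at p):
|
|   #include <stdint.h>
|   #include <string.h>
|
|   static uint64_t decode_le_varint(const uint8_t *p, unsigned *n_out)
|   {
|       uint64_t v;
|       memcpy(&v, p, 8);                   /* one unaligned load */
|       unsigned n = (unsigned)__builtin_ctzll(v) + 1;  /* byte count */
|       uint64_t mask = n >= 8 ? ~0ULL : (1ULL << (8 * n)) - 1;
|       *n_out = n;
|       return (v & mask) >> n;             /* drop the n-bit tag */
|   }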
| rocqua wrote:
| Cool, certainly makes sense to let the parser know ahead of time
| how many bytes to read. Not just for cleaner code, but for faster
| code as well.
| fritzo wrote:
| It seems like a more SIMD-friendly version would separately store
| a byte mask and the packed byte data. Then you can simultaneously
| expand 16 int32s using e.g. AVX512's
| _mm512_maskz_expand_epi8(mask,_mm512_loadu_epi8(data)).
| mananaysiempre wrote:
| Stream VByte[1,2,3] is not far away from that idea, storing the
| packed lengths in a separate stream from the data bytes.
| (There's also a more conventional interleaved design in
| varint-G8IU, but Stepanov decided to fuck everyone over and
| patented that.)
|
| [1] https://lemire.me/blog/2017/09/27/stream-vbyte-breaking-
| new-...
|
| [2] https://arxiv.org/abs/1709.08990
|
| [3] https://github.com/lemire/streamvbyte
| ralferoo wrote:
| > And, for reasons we'll see in a bit, we flip the meaning of the
| continuation bits: we'll use 0 to mean "another byte follows",
| and 1 to mean "this is the last byte".
|
| EDIT: just noticed that I'd skim-read over: "imperial varint
| places _all_ of the continuation bits at the beginning of the
| encoded value", which makes my comment completely wrong.
|
| ~Doesn't even mention possibly one of the most useful features of
| encoding this way, which is that you can never [1] end up with a
| 0 byte in the output, meaning that you can treat the output as a
| C string.~
|
| ~[1] or at least never, assuming you're trying to create the
| shortest encoding~
| layer8 wrote:
| This is not true for the described new encoding, as the
| continuation bits are collected together at the beginning.
| ralferoo wrote:
| Yeah, I'd re-read that section of the article and edited my
| post already to reflect this before I saw your reply.
|
| My fault for reading the first proposed solution, thinking it
| was stupid for having the continuation bits the other way
| round, and skipping ahead to where I saw they'd inverted them
| and stopped reading at that point.
| maxmuen wrote:
| IMHO, spending 12.5% of all bits just to store the length seems
| wasteful. The encoding in QUIC[1] puts the length in the leading
| 2 bits of the first byte. Very fast to read and write, with just
| one caveat: at most 62 bits are allowed.
|
| [1]: https://www.rfc-editor.org/rfc/rfc9000.html#name-variable-
| le...
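|
| A minimal decoder for that scheme (per RFC 9000: the top two bits
| of the first byte select a length of 1, 2, 4 or 8 bytes; the
| remaining 6/14/30/62 bits are the value, big-endian):
|
|   #include <stdint.h>
|   #include <stddef.h>
|
|   static size_t quic_varint_decode(const uint8_t *p, size_t len,
|                                    uint64_t *out)
|   {
|       if (len == 0)
|           return 0;
|       size_t n = (size_t)1 << (p[0] >> 6);   /* 1, 2, 4 or 8  */
|       if (len < n)
|           return 0;                          /* truncated      */
|       uint64_t v = p[0] & 0x3F;              /* strip tag bits */
|       for (size_t i = 1; i < n; i++)
|           v = (v << 8) | p[i];
|       *out = v;
|       return n;                              /* bytes consumed */
|   }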
| Retr0id wrote:
| It depends on your probability distribution of values.
|
| Storing the number 100 (decimal) under the QUIC scheme would
| require 2 bytes, but only 1 byte under either "metric" or
| "imperial" varints.
|
| (same for any value between 64 and 127, I believe (I might be
| off-by-one there...))
| jeffbee wrote:
| Unary length prefix is a solid technique. I would be careful with
| the performance claims, though. Sometimes what the machine can do
| is surprising, and a system that looks like it needs to loop and
| branch can turn out faster than you expect. The
| protobuf project, unsurprisingly for a project of its age, has
| several different varint parsing strategies. One is described at
| https://github.com/protocolbuffers/protobuf/blob/main/src/go...
| and another at
| https://github.com/protocolbuffers/protobuf/blob/main/src/go...
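|
| For reference, the "loop and branch" baseline being compared
| against is the textbook LEB128-style decoder (a sketch, not
| protobuf's actual code):
|
|   #include <stdint.h>
|   #include <stddef.h>
|
|   static size_t decode_uleb128(const uint8_t *p, size_t len,
|                                uint64_t *out)
|   {
|       uint64_t v = 0;
|       for (size_t i = 0; i < len && i < 10; i++) {
|           v |= (uint64_t)(p[i] & 0x7F) << (7 * i);
|           if ((p[i] & 0x80) == 0) {     /* high bit clear: done */
|               *out = v;
|               return i + 1;             /* bytes consumed       */
|           }
|       }
|       return 0;                         /* truncated / overlong */
|   }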
| sethev wrote:
| > Note that even though the goal, and end result, is that the
| file is smaller, this is not a compression scheme, since it only
| works when your values are distributed in one particular way
|
| This statement doesn't sound right. I think he's trying to say
| it's not a general purpose compression scheme, but every
| compression scheme only works on certain patterns. In fact, most
| possible streams of bytes are not compressible at all. Even the
| general purpose approaches only work on very specific types of
| inputs (such as natural language) which contain a lot of
| redundancy.
| Retr0id wrote:
| Not only that, but I believe these "imperial varints" could be
| described as a special case of Huffman coding.
___________________________________________________________________
(page generated 2024-08-11 23:00 UTC)