[HN Gopher] Making all your integers positive with zigzag encoding
___________________________________________________________________
Making all your integers positive with zigzag encoding
Author : jjgreen
Score : 44 points
Date : 2022-11-25 17:55 UTC (5 hours ago)
(HTM) web link (lemire.me)
(TXT) w3m dump (lemire.me)
| bugfix-66 wrote:
| A really tremendous varint/VLQ encoder (using a zig-zag encoding
| and an generalized base):
|
| https://bugfix-66.com/2c1df73cab89ec76d6fa10caf8a27c1fbe4d16...
|
| and the decoder:
|
| https://bugfix-66.com/1efa93a5eb0cc12b3de7cd1dab8e471a2cc95e...
|
| The common varint, which you see in applications everywhere
| (e.g., git), is just base-128 version of the above general
| scheme!
|
| But base-32 or base-8 or base-64 varints can be a big win for
| some purposes. Remove the zig-zag encoding if negative integers
| don't occur.
| bo1024 wrote:
| I have to fill out a captcha just to read an article / blog post?
| (Edit: and enable cookies)
| pwg wrote:
| Or install Ublock Origin in Firefox blocking the javascript and
| read the entire article without a captcha getting in the way.
| bo1024 wrote:
| I had uBlock Origin and uMatrix installed. Not sure what
| changed, but I can access the article now.
| nynx wrote:
| Used in the [postcard](https://docs.rs/postcard/latest/postcard/)
| crate for encoding variably sized signed integers.
| phoboslab wrote:
| I first encountered this in the Flac source[1]. They do the bit
| twiddling a bit differently than in this article:
| uval = (val<<1) ^ (val>>31);
|
| Variations of this have probably been used countless of times
| in other libraries.
|
| [1]
| https://github.com/LordJZ/libflac/blob/master/src/libFLAC/bi...
| jamesmunns wrote:
| Author of postcard here, I don't remember exactly where I got
| my bit twiddling example from, lemme see if I have notes from
| that.
|
| postcard's zigzag encoding matches phoboslab's psuedocode.
|
| Edit, not totally sure, but this wiki page rings a bell, and
| is probably where I got my impl from:
| https://en.wikipedia.org/wiki/Variable-
| length_quantity#Zigza...
|
| Edit 2: I also explain why I do this (it compresses better)
| in my wire format specification:
| https://postcard.jamesmunns.com/wire-format.html#signed-
| inte...
| deleterofworlds wrote:
| A nice bijection from integers to natural numbers. Mappings exist
| for rationals -> natural numbers and other countable sets, but
| may be less practical.
| fluoridation wrote:
| Easy enough. If you have a rational class represented as a pair
| of irreducible int32_ts, you can simply do ((u64)numerator <<
| 32) | (u64)denominator.
| Someone wrote:
| That either doesn't map to rationals (making 1/1 and 2/2 two
| different numbers) or not a bijection (0x0000000100000001 and
| 0x0000000200000002 both map back to 1) and doesn't work on
| "the integers".
| fluoridation wrote:
| Most people just care about having a lossless round-trip
| conversion, from the source representation to a storage
| representation and then back to the source representation,
| not about having every possible bit sequence in the storage
| representation being valid. Univocally mapping the
| rationals onto a set of contiguous naturals is extremely
| tricky, computationally.
| bob1029 wrote:
| It's pretty cool how a simple encoding scheme can have profound
| effects on compression. This is a big part of what makes DCT-
| style block compression work as well as it does. You rig the game
| such that all of your higher frequency components wind up at the
| end - ideally as zeroes after quantization. So, a simple RLE
| scheme can do a lot more work than it otherwise could.
| benj111 wrote:
| Sorry. Are you inferring this is how DCT works or are you
| making a comparison?
|
| Ive just read up about DCT and don't really have a clue, but it
| doesn't seems applicable?
|
| That being so. I'm not sure how RLE at the bit level would
| help. Surely encoding run lengths of 5ish bits isn't going to
| compress much of anything.
| bob1029 wrote:
| Zigzag is a phase _after_ DCT+quantization where you iterate
| through the resulting matrix in a particular order. Nothing
| to do with the DCT itself.
|
| https://commons.wikimedia.org/wiki/File:JPEG_example_zigzag..
| ..
| oakwhiz wrote:
| The idea behind this is to scan diagonally by frequency so
| that big numbers are grouped together at the start. You can
| choose to zero out some of the bits near the end since those
| only contain high frequency components that are harder to
| notice. RLE then has the ability to work on the trailing
| zeroes, or other methods can be used instead.
| IanWard1 wrote:
| reminds me of this video on using "-2" as a base for numbers (can
| represent all positive and negative integers without a sign bit)
| https://www.youtube.com/watch?v=GWA_NraxOUw
| raphlinus wrote:
| The `fast_decode` as written is undefined behavior for some
| inputs (INT_MAX for example) because of the integer overflow in
| `2*x`. This can be fixed by casting to an unsigned int first.
|
| Also a warning, right shifting a negative number gives the
| expected result (arithmetic right shift) in C++20, but is still
| implementation defined even in the C23 draft spec. Likely you're
| good, but the spec doesn't guarantee it.
| eloff wrote:
| Protobuf was the first major project I was aware of that uses
| zigzag encoding for integers. They then encode those integers
| using the typical 7 bit encoding scheme where the msb indicates
| there's another byte following. I'm sure they weren't the first
| by any means though.
|
| I'm currently using this in a binary format for serializing
| dynamic JSON-like values that I invented and am implementing in
| Rust. I will release it as open source sometime next year.
| tikhonj wrote:
| I first ran into this encoding scheme in Avro, which you could
| also describe as a format for dynamic JSON-like values :)
| [deleted]
| ramranch wrote:
| Does your protocol provide any advantages over bincode?
| DontchaKnowit wrote:
| Found this recently in propreitary code base for (in abstract
| terms) storing enormous numbers of price values. Very interesting
| shoo wrote:
| Can anyone suggest high-quality resources for learning about
| compression?
| k0k0r0 wrote:
| They use a variant of that in most Modern SAT solvers for the
| literal encoding by the way, i.e. in order to encode negation of
| variables, which traditionally are represented with negative
| integers. Mostly, because they then use the encoded literal as
| indeces in an array. Historically, this also had performance
| reasons if I remember correctly. I feel it has no benefit
| anymore, and people simply got used to it (and never touch that
| part of the code anyway). But I might be wrong. I did not yet
| benchmarked any alternative, but it is on my todo-list (woth low
| priority).
| Ferrotin wrote:
| Wouldn't they use a plain bitmask instead, the lsb indicating
| negation, with 3 being the negation of 2? That's zigzag
| encoding with a negative zero.
| k0k0r0 wrote:
| If I understand you correctly, then that exactly what I am
| talking about. You might see this as bitmask. That's correct.
___________________________________________________________________
(page generated 2022-11-25 23:00 UTC)