[HN Gopher] The simple beauty of XOR floating point compression
       ___________________________________________________________________
        
       The simple beauty of XOR floating point compression
        
       Author : cwinter
       Score  : 111 points
       Date   : 2024-04-09 15:15 UTC (2 days ago)
        
 (HTM) web link (clemenswinter.com)
 (TXT) w3m dump (clemenswinter.com)
        
       | sameoldtune wrote:
        | This algorithm is pretty neat. I wonder if, for the use case
        | outlined in the post, they could get even better compression by
        | using fewer significant digits. For the purpose of recording
        | times, 0.00768 is likely just as useful as
        | 0.0076792240142822266
        
         | Chabsff wrote:
          | That's already one of the main properties that the algorithm
          | is exploiting: f64 is a horribly overkill representation for
          | this data, since it was generated using a microsecond clock,
          | leading to a lot of trailing zeros.
         | 
         | So yeah, dropping more significant digits should lead to even
         | better compression.
         | 
         | But if we are going to be massaging the data based on what we
         | know/need, there are better ways to go about compressing it.
         | e.g. I'd expect that simple fixed-point quantization with a
         | delta encoding would compress better than this. What I find
         | cool about this is that it dynamically figures things out
         | without having to deal with any priors.
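          | 
          | A rough sketch of the fixed-point + delta idea (my own
          | hypothetical illustration, assuming the values really are
          | microsecond ticks stored as seconds):
          | 
          |     # Quantize to integer microseconds (fixed point), then
          |     # delta-encode; the small deltas can then be fed to any
          |     # byte-level compressor.
          |     def quantize_delta(seconds):
          |         ticks = [round(s * 1_000_000) for s in seconds]
          |         deltas = [b - a for a, b in zip(ticks, ticks[1:])]
          |         return [ticks[0]] + deltas
          | 
          |     def restore(deltas):
          |         out, acc = [], 0
          |         for d in deltas:
          |             acc += d
          |             out.append(acc)
          |         return [t / 1_000_000 for t in out]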
        
         | cwinter wrote:
         | yes! the very last example in the post shows what happens if
         | you truncate the mantissa to just the 4 most significant bits
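          | 
          | For anyone curious, a sketch of that kind of truncation (a
          | quick hypothetical illustration, not the exact code from the
          | post):
          | 
          |     import struct
          | 
          |     # Lossy: zero out all but the top `keep` mantissa bits of
          |     # an f64, so the XORed values get long runs of trailing
          |     # zeros.
          |     def truncate_mantissa(x, keep=4):
          |         bits = struct.unpack('<Q', struct.pack('<d', x))[0]
          |         mask = ~((1 << (52 - keep)) - 1) & 0xFFFFFFFFFFFFFFFF
          |         packed = struct.pack('<Q', bits & mask)
          |         return struct.unpack('<d', packed)[0]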
        
       | theamk wrote:
       | I wonder how it compares to traditional compression?
       | 
       | I've tried their "Integer that increments by 1 at every
        | timestep." case on 10000 integers. The algorithm described gave
       | "6.98x compression"
       | 
       | bzip2 gave 80000/8642 = 9.2x compression
       | 
       | zstd gave 80000/11316 = 7.1x compression
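        | 
        | (Roughly how the comparison can be reproduced; a sketch with
        | default settings, which may not match my exact numbers. zstd
        | needs the third-party zstandard package.)
        | 
        |     import bz2, struct
        |     # 10000 doubles 0.0, 1.0, 2.0, ... -> 80000 raw bytes
        |     raw = struct.pack('<10000d', *map(float, range(10000)))
        |     packed = bz2.compress(raw)
        |     print(len(raw), '->', len(packed), len(raw) / len(packed))
        |     # for zstd: zstandard.ZstdCompressor().compress(raw)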
        
         | Chabsff wrote:
         | I'm not sure that's a very fair comparison, as this algorithm
         | does point-wise streaming compression/decompression. I could
          | see myself using something like this in an FPGA context or
         | within a tight acquisition inner loop for example.
        
       | mgaunard wrote:
        | A normal network runs at 10Gbps or 1.25GB/s.
       | 
        | 200MiB/s is too slow by a factor of about 6. Though obviously some
       | people have optimised it.
        
         | helf wrote:
          | Uh, no? Most non-corporate networks are gigabit to 2.5gbit at
          | best. Higher-end home users may be doing 5gbit and 10gbit+.
         | 
         | Even top end wifi tends to only be around 1.2gbit.
        
           | vardump wrote:
           | Higher end home users are doing 25, 50 and 100 Gbps. Hardware
           | is actually rather affordable now.
           | 
           | Although there's just not enough PCIe bandwidth in current
           | consumer hardware for 100 Gbps...
        
             | saagarjha wrote:
             | Your high end is like the top 0.0001% of home users.
        
               | vardump wrote:
                | You can get to this for under $1k nowadays. This Mikrotik
               | switch changed everything:
               | https://mikrotik.com/product/crs504_4xq_in
               | 
                | You can buy second-hand 100g networking cards pretty
                | cheaply. The caveat is that they only support older PCIe
                | standards, so you need PCIe x16 to use them at full
                | speed.
               | 
               | Please also don't forget typical NVMe drives are pretty
               | fast now. Just _two_ of them can nearly saturate a 100g
               | link.
               | 
                | Personally I know a _lot_ of people in the IT field who
                | have single-mode fiber SFP28 networks. I'd say 0.001% -
                | 0.01% is closer to reality.
               | 
               | Remember we were talking about high end.
        
               | saagarjha wrote:
               | Like 99% of the people I know who work in IT don't have
               | anything near what you're mentioning. They spend $100 on
               | _all_ their networking equipment, not just a switch.
               | They're not laying fiber in their house. And outside of
               | that group the number is basically 0.
        
               | vardump wrote:
                | I guess the typical gateway drug is to get a 10g internet
                | connection or a 10g NAS at home and find out that
                | copper-based cat6/7 is a lousy solution for actually
                | making use of it.
               | 
               | Surprisingly high latencies (compared to fiber, that is)
               | and noisy fans due to a lot of heat generated.
               | 
               | Shrug. I guess your mileage may vary.
        
             | jdsully wrote:
              | How many homes have fibre runs everywhere? Most homes don't
              | even have cat-5.
        
         | wizzwizz4 wrote:
         | According to https://www.speedtest.net/global-index, 200MiB/s
         | is over five times faster than Singapore's broadband speed -
         | and that's the country with the fastest broadband in the world.
         | 
         | 200Mbps puts you in the top 15 countries by broadband speed
         | (above Japan) and the top 4 by mobile speed (above South
         | Korea). Not that that's relevant to anything. (I haven't edited
         | out _any_ mistakes. The child comment is lying.)
        
           | zamadatix wrote:
           | You're conflating MiB/s with Mbps. Though I believe the GP
           | was actually talking about local server networks, not
           | internet speeds.
           | 
            | If you need to stream sequential floating point data over an
            | internet link, something more space-efficient (and probably
            | chunked) would likely be worthwhile over the simplicity of
            | this approach.
        
         | geon wrote:
         | I'd expect an actually optimized implementation to be at least
         | that much faster.
        
       | shin_lao wrote:
       | How does this compare to regular dictionary compression?
        
         | duskwuff wrote:
         | Dictionary compression isn't really applicable to floating-
         | point data, because it isn't made up of commonly repeated
         | "words" like text is.
        
       | porphyra wrote:
       | In the first several examples, the input seems to have a mantissa
       | with all zeros beyond the first 15 bits. Why isn't the author
       | using float32 for this instead of float64?
        
         | Chabsff wrote:
         | Not having to make that kind of decision in the first place is
          | inherently valuable in and of itself, which I believe is what's
         | being showcased here.
        
       | mzs wrote:
        | Here's a discussion of an article from 2020 that explains this and
       | other time series compression approaches:
       | https://news.ycombinator.com/item?id=31379344
        
       | markrages wrote:
       | For integer values, just subtracting bytes is enough to allow
       | much better compression ratios. See the table at
       | https://opensource.quarq.com/zfit/zfit.pdf
        
         | hcs wrote:
         | Very handy to do with 7-Zip: -mf=Delta:4 (or just f=Delta:4 in
          | the GUI config window). This is for 4-byte little-endian
         | integers, which works pretty well with 16-bit stereo PCM even
         | if there's some unwanted overflow between the channels, if you
         | can't use flac in some context (raw data, etc).
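          | 
          | A sketch of what such a byte-wise delta filter does
          | (hypothetical illustration, mirroring a Delta:4 filter on raw
          | bytes):
          | 
          |     # Each output byte becomes the difference (mod 256) from
          |     # the byte 4 positions earlier, which turns slowly changing
          |     # 4-byte integers into mostly-zero bytes.
          |     def delta_filter(data, n=4):
          |         out = bytearray(data)
          |         for i in range(len(out) - 1, n - 1, -1):
          |             out[i] = (out[i] - out[i - n]) & 0xFF
          |         return bytes(out)
          | 
          |     def delta_unfilter(data, n=4):
          |         out = bytearray(data)
          |         for i in range(n, len(out)):
          |             out[i] = (out[i] + out[i - n]) & 0xFF
          |         return bytes(out)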
        
       | duskwuff wrote:
       | The biggest issue with this approach is that it's bit-aligned,
       | and the alignment is determined by fields within the data. It's
       | hard to make this run fast on most CPUs; a typical implementation
        | ends up having to maintain a bit buffer and shift bits in/out
        | of it a couple of times for every codeword.
       | 
       | I bet you could improve the performance of this algorithm
       | significantly, without harming its compression too badly, by one
       | or both of:
       | 
       | A) Adjusting the sizes of some bit fields to make them evenly
       | divide the system word size, e.g. requiring the run length of the
       | 11-prefix to be a multiple of 4, so that it can be encoded in 4
       | bits instead of 6. (This is a net space savings of half a bit;
       | you have to include an average of 1.5 more bits of nonzero data,
       | but the length is two bits shorter.)
       | 
       | B) Writing the streams of bits for e.g. "mode", "number of
       | leading zeroes", and "bit subsequences" separately for each block
       | of compressed data, rather than combining them into a single
       | stream. As a bonus, each of those streams will now be more
       | amenable to compression by generic compressors.
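        | 
        | For concreteness, a minimal sketch of the kind of bit buffer I
        | mean (hypothetical, not tied to any particular implementation):
        | 
        |     # MSB-first bit writer: every codeword costs shifts, masks
        |     # and an occasional byte flush, which is the per-value
        |     # overhead described above.
        |     class BitWriter:
        |         def __init__(self):
        |             self.out = bytearray()
        |             self.acc = 0     # pending bits, newest at the low end
        |             self.nbits = 0
        | 
        |         def write(self, value, width):
        |             mask = (1 << width) - 1
        |             self.acc = (self.acc << width) | (value & mask)
        |             self.nbits += width
        |             while self.nbits >= 8:
        |                 self.nbits -= 8
        |                 self.out.append((self.acc >> self.nbits) & 0xFF)
        |                 self.acc &= (1 << self.nbits) - 1
        | 
        |         def flush(self):
        |             if self.nbits:
        |                 self.out.append((self.acc << (8 - self.nbits)) & 0xFF)
        |                 self.nbits = 0
        |             return bytes(self.out)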
        
       | jhj wrote:
       | People in the HPC/classical supercomputing space have done this
       | sort of thing for a while. There's a fair amount of literature on
       | lossless floating point compression, such as Martin Burtscher's
       | work or stuff out of LLNL (fpzip):
       | 
       | https://userweb.cs.txstate.edu/~burtscher/
       | https://computing.llnl.gov/projects/floating-point-compressi...
       | 
        | but it tends to be very application specific: there is usually
        | high correlation / small deltas between neighboring values in a
        | 2d/3d/4d/etc. floating point array (e.g., you are compressing
        | neighboring temperature grid points in a PDE weather simulation
        | model; temperatures in neighboring cells won't differ by much).
       | 
        | In a lot of other cases (e.g., machine learning) the floating
        | point significand bits (and sometimes the sign bit) tend to be
        | incompressible noise. The exponent is the only thing that is
        | really compressible, and the xor trick does not help you as much
        | because neighboring values could still vary a bit in terms of
        | exponents. An entropy encoder works well for that instead (it
        | encodes closer to the actual underlying data
        | distribution/entropy), and it doesn't depend upon neighboring
        | floats having similar exponents.
       | 
       | In 2022, I created dietgpu, a library to losslessly
       | compress/decompress floating point data at up to 400 GB/s on an
       | A100. It uses a general-purpose asymmetric numeral system
       | encoder/decoder on GPU (the first such implementation of general
       | ANS on GPU, predating nvCOMP) for exponent compression.
       | 
       | We have used this to losslessly compress floating point data
       | between GPUs (e.g., over Infiniband/NVLink/ethernet/etc) in
       | training massive ML models to speed up overall wall clock time of
       | training across 100s/1000s of GPUs without changing anything
       | about how the training works (it's lossless compression, it
       | computes the same thing that it did before).
       | 
       | https://github.com/facebookresearch/dietgpu
        
         | dragontamer wrote:
         | I'm not exceptionally good with compression, but I did a few
         | experiments.
         | 
         | I've noticed that array-of-structs form is harder to compress
          | than struct-of-arrays form. This is relevant because a
          | floating-point number (say, a 32-bit float) is really a struct
          | containing 3 values: a 1-bit sign, an 8-bit exponent, and a
          | 23-bit mantissa.
         | 
         | --------
         | 
          | If you have 1000 floats, I would expect the 1000 sign bits
          | grouped together to be more compressible (whatever compression
          | algorithm you use, it can better pick up the pattern of the
          | signs).
          | 
          | Then the 8000 bits of exponent would also be more compressible,
          | because the exponents would likely follow their own pattern.
          | 
          | Finally, the 23 bits of mantissa per float would probably be
          | difficult or impossible to compress. But "extracting them out"
          | before running a compression algorithm gives the signs +
          | exponents more opportunity for compression.
         | 
         | --------
         | 
         | So yeah, just think about your data. And favor structure-of-
         | arrays form for better compression.
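          | 
          | A rough sketch of that split for 32-bit floats (my own
          | hypothetical illustration):
          | 
          |     import struct
          | 
          |     # Split float32s into separate sign / exponent / mantissa
          |     # streams (struct-of-arrays) before handing them to a
          |     # general-purpose compressor.
          |     def split_streams(floats):
          |         signs, exponents, mantissas = [], bytearray(), []
          |         for f in floats:
          |             bits = struct.unpack('<I', struct.pack('<f', f))[0]
          |             signs.append(bits >> 31)               # 1 bit
          |             exponents.append((bits >> 23) & 0xFF)  # 8 bits
          |             mantissas.append(bits & 0x7FFFFF)      # 23 bits
          |         return signs, bytes(exponents), mantissas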
        
         | tines wrote:
         | Wow, cool to see Dr. Burtscher's name here! I was a student in
         | his undergraduate parallel computing course and was invited to
         | work in his lab. Very smart guy.
        
       | EdSchouten wrote:
       | As the article states, you need to apply some heuristics to
       | choose between cases (2) and (3). I'm not a big fan of that,
       | because it means the encoding is not deterministic (as in, it
       | leaves room for interpretation/optimization by encoders).
       | 
       | Just for fun I implemented this algorithm:
       | https://go.dev/play/p/72GyYmVnwyc
       | 
       | The algorithm described in the article uses prefixes 0, 10, and
       | 11. My algorithm uses different prefixes: 0, 10, 110, 1110,
       | 11110, [...]. By default, the subsequence that is used is
       | identical to that containing differing bits during the previous
        | iteration. The prefixes described previously can be used to widen
       | the subsequence:
       | 
       | - 0: Keep the subsequence the same width.
       | 
       | - 10: Make the subsequence (1<<1)-1 = 1 bit wider, both on the
       | left and right side.
       | 
       | - 110: Make the subsequence (1<<2)-1 = 3 bits wider, both on the
       | left and right side.
       | 
       | - 1110: Make the subsequence (1<<3)-1 = 7 bits wider, both on the
       | left and right side.
       | 
       | - ...
       | 
       | For the sequence of numbers presented under "Tweaking the
       | algorithm", the algorithm in the article produces 969 bits of
       | data. My algorithm produces just 823.
       | 
       | Edit: Tiny bug in original code.
       | 
       | Edit 2: maybe it's smarter to widen the subsequence using
       | triangular numbers. Doing it exponentially is a bit too
       | aggressive, it seems.
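        | 
        | A tiny sketch of how the exponential-widening prefixes above
        | decode (hypothetical helper; read_bit is assumed to return the
        | next bit of the stream):
        | 
        |     # k one-bits followed by a zero widen the previous
        |     # subsequence by (1 << k) - 1 bits on each side; k == 0
        |     # keeps it unchanged.
        |     def read_widening(read_bit):
        |         k = 0
        |         while read_bit() == 1:
        |             k += 1
        |         return (1 << k) - 1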
        
         | addaon wrote:
         | > By default, the subsequence that is used is identical to that
         | containing differing bits during the previous iteration
         | 
         | If I'm understanding correctly, this means that if the first
          | choice of a sized subsection is oversized, this will carry through
         | to future samples (unless the encoder chooses to re-size it
         | explicitly). I wonder if, with your cheaper "loosen" encoding,
         | it is worth automatically tightening the bit delta. For
         | example, if I go to an initial window of "5 zeros, the 7 bits
         | 1001101, remainder zeros" (which I'll call 5;7), and the next
         | encoding is "same, 0101110", the leading and trailing zeros
         | adjust the window to (in this example) 6;5, "tightening" around
          | the non-zero bits. With your low-cost "loosen", doing this
          | aggressively (rather than, say, only after it happens for
          | multiple samples in a row) seems sane. It also means that the
          | fact that your loosen encoding is symmetric (which makes sense
          | in terms of bit efficiency) is somewhat mitigated, because the
          | implicit tighten can break the symmetry.
        
           | EdSchouten wrote:
           | > > By default, the subsequence that is used is identical to
           | that containing differing bits during the previous iteration
           | 
           | > If I'm understanding correctly, this means that if the
            | first choice of a sized subsection is oversized, this will carry
           | through to future samples (unless the encoder chooses to re-
           | size it explicitly).
           | 
           | And that's exactly what I do: I resize it. The subsequence
           | that is picked during round n is the one that _actually_
           | contained the differing bits in round n-1.
        
       | Manabu-eo wrote:
       | For those interested in the topic, I highly recommend the series
       | of posts from Charles Bloom:
       | 
       | http://cbloomrants.blogspot.com/2023/07/notes-on-float-and-m...
       | 
       | > TLDR:
       | 
        | > For S16 deltas, bias by 0x8080; for S32 deltas, bias by
       | 0x80808080.
       | 
       | > De-interleaving multi-byte integers into homogenous streams can
       | help, particularly with weaker back-end compressors.
       | 
       | > TLDR:
       | 
       | > For floats, just reinterpret F32 as U32.
       | 
       | > If the floats have a mix of positive and negative, convert the
       | sign bit to two's-complement signed S32.
       | 
       | > Consider lossy elimination of negative zero -0.f
       | 
       | > If you didn't actually want the huge precision for floats near
       | zero, a simple lossy encoding is just to do float += constant,
       | which works for non-negative floats where you don't know the high
       | end of the range so you can't just use fixed point.
       | 
       | > TLDR:
       | 
       | > The best way to compress numeric data that is larger than bytes
       | (F32,F16,S32,S16) is usually to delta them in their original size
       | integer, then de-interleave after the delta.
       | 
       | > Sometimes no filter or no deinterleave is best, particularly
       | with stronger compressors, so being able to select filter on/off
       | per-file can give big wins.
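        | 
        | A rough sketch of that last recipe (my own hypothetical
        | illustration, not Bloom's code):
        | 
        |     import struct
        | 
        |     # Delta F32s in their original 32-bit width, then
        |     # de-interleave the delta words into four per-byte streams
        |     # for the back-end compressor.
        |     def delta_then_deinterleave(floats):
        |         n = len(floats)
        |         words = struct.unpack('<%dI' % n,
        |                               struct.pack('<%df' % n, *floats))
        |         deltas = [words[0]] + [(b - a) & 0xFFFFFFFF
        |                                for a, b in zip(words, words[1:])]
        |         return [bytes((d >> (8 * i)) & 0xFF for d in deltas)
        |                 for i in range(4)]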
        
       ___________________________________________________________________
       (page generated 2024-04-11 23:00 UTC)