[HN Gopher] The simple beauty of XOR floating point compression
___________________________________________________________________
The simple beauty of XOR floating point compression
Author : cwinter
Score : 111 points
Date : 2024-04-09 15:15 UTC (2 days ago)
(HTM) web link (clemenswinter.com)
(TXT) w3m dump (clemenswinter.com)
| sameoldtune wrote:
| This algorithm is pretty neat. I wonder if for the use case
| outlined by the post they could get even better compression
| through the use of fewer significant digits. For the purposes of
| recording times, 0.00768 is likely just as useful as
| 0.0076792240142822266
| Chabsff wrote:
| That's already (one of) the main properties that the algorithm
| is exploiting: that f64 is a horribly overkill representation
| for the data, since it was generated using a microsecond clock,
| leading to a lot of trailing zeros.
|
| So yeah, dropping more significant digits should lead to even
| better compression.
|
| But if we are going to be massaging the data based on what we
| know/need, there are better ways to go about compressing it.
| e.g. I'd expect that simple fixed-point quantization with a
| delta encoding would compress better than this. What I find
| cool about this is that it dynamically figures things out
| without having to deal with any priors.
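| A minimal sketch (in Rust, not from the comment) of that
| fixed-point quantization + delta idea; the microsecond scale
| factor is an assumed example:
|
|     // Quantize f64 seconds to integer microseconds (assumed
|     // resolution), then store deltas between consecutive
|     // values. Small integer deltas compress far better than
|     // raw f64 bit patterns.
|     fn quantize_delta(values: &[f64]) -> Vec<i64> {
|         const SCALE: f64 = 1e6; // assumed microsecond scale
|         let mut out = Vec::with_capacity(values.len());
|         let mut prev: i64 = 0;
|         for &v in values {
|             let q = (v * SCALE).round() as i64;
|             out.push(q - prev); // delta against previous value
|             prev = q;
|         }
|         out
|     }
|
|     fn main() {
|         let d = quantize_delta(&[0.007679, 0.007681, 0.007685]);
|         println!("{:?}", d); // [7679, 2, 4]
|     }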
| cwinter wrote:
| yes! the very last example in the post shows what happens if
| you truncate the mantissa to just the 4 most significant bits
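| A minimal sketch (in Rust, not the post's code) of that kind of
| mantissa truncation; the helper name is made up and the 4-bit
| width follows the comment above:
|
|     // Keep only the top `keep_bits` of the 52-bit f64 mantissa,
|     // zeroing the rest. The extra trailing zeros make the XOR
|     // deltas much shorter to encode.
|     fn truncate_mantissa(x: f64, keep_bits: u32) -> f64 {
|         let mask: u64 = !((1u64 << (52 - keep_bits)) - 1);
|         f64::from_bits(x.to_bits() & mask)
|     }
|
|     fn main() {
|         println!("{}", truncate_mantissa(0.0076792240142822266, 4));
|     }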
| theamk wrote:
| I wonder how it compares to traditional compression?
|
| I've tried their "Integer that increments by 1 at every
| timestep." case on 10000 integer. The algorithm described said
| "6.98x compression"
|
| bzip2 gave 80000/8642 = 9.2x compression
|
| zstd gave 80000/11316 = 7.1x compression
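| For a concrete look at that case, a small sketch (in Rust, not
| from the comment) that prints the XOR of consecutive f64 bit
| patterns for the incrementing-integer sequence:
|
|     // Most XOR deltas have many leading and trailing zero bits,
|     // which is exactly what the scheme stores compactly.
|     fn main() {
|         let mut prev = 0.0f64.to_bits();
|         for i in 1..=5u32 {
|             let cur = (i as f64).to_bits();
|             let xor = prev ^ cur;
|             println!("{}: {:2} leading / {:2} trailing zeros",
|                      i, xor.leading_zeros(), xor.trailing_zeros());
|             prev = cur;
|         }
|     }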
| Chabsff wrote:
| I'm not sure that's a very fair comparison, as this algorithm
| does point-wise streaming compression/decompression. I could
| see myself using something like this in an FPGA context or
| within a tight acquisition inner loop, for example.
| mgaunard wrote:
| A normal network runs at 10 Gbps, or 1.25 GB/s.
|
| 200MiB/s is too slow by a factor of 7. Though obviously some
| people have optimised it.
| helf wrote:
| Uh no? Most non-corporate networks are gigabit to 2.5gbit at
| best. Higher end home users may be doing 5gbit and 10gbit+.
|
| Even top end wifi tends to only be around 1.2gbit.
| vardump wrote:
| Higher end home users are doing 25, 50 and 100 Gbps. Hardware
| is actually rather affordable now.
|
| Although there's just not enough PCIe bandwidth in current
| consumer hardware for 100 Gbps...
| saagarjha wrote:
| Your high end is like the top 0.0001% of home users.
| vardump wrote:
| You can get to this for under $1k nowadays. This Mikrotik
| switch changed everything:
| https://mikrotik.com/product/crs504_4xq_in
|
| You can buy second hand 100g networking cards pretty
| cheaply. The caveat is that they only support older PCIe
| standards, so you need PCIe 16x to use them at full
| speed.
|
| Please also don't forget typical NVMe drives are pretty
| fast now. Just _two_ of them can nearly saturate a 100g
| link.
|
| Personally I know a _lot_ of people in the IT field who have
| single mode fiber SFP28 networks. I'd say 0.001% - 0.01%
| is closer to reality.
|
| Remember we were talking about high end.
| saagarjha wrote:
| Like 99% of the people I know who work in IT don't have
| anything near what you're mentioning. They spend $100 on
| _all_ their networking equipment, not just a switch.
| They're not laying fiber in their house. And outside of
| that group the number is basically 0.
| vardump wrote:
| I guess the typical gateway drug is to get a 10g internet
| connection or a 10g NAS at home and find out that copper-based
| Cat6/7 is a lousy solution for actually making use of it.
|
| Surprisingly high latencies (compared to fiber, that is)
| and noisy fans due to a lot of heat generated.
|
| Shrug. I guess your mileage may vary.
| jdsully wrote:
| How many homes have fibre runs everywhere? Most homes don't
| even have cat-5.
| wizzwizz4 wrote:
| According to https://www.speedtest.net/global-index, 200MiB/s
| is over five times faster than Singapore's broadband speed -
| and that's the country with the fastest broadband in the world.
|
| 200Mbps puts you in the top 15 countries by broadband speed
| (above Japan) and the top 4 by mobile speed (above South
| Korea). Not that that's relevant to anything. (I haven't edited
| out _any_ mistakes. The child comment is lying.)
| zamadatix wrote:
| You're conflating MiB/s with Mbps. Though I believe the GP
| was actually talking about local server networks, not
| internet speeds.
|
| If you need to stream sequential floating point data over an
| internet link, something more space efficient (and probably
| chunked instead) would be worth trading the simplicity for.
| geon wrote:
| I'd expect an actually optimized implementation to be at least
| that much faster.
| shin_lao wrote:
| How does this compare to regular dictionary compression?
| duskwuff wrote:
| Dictionary compression isn't really applicable to floating-
| point data, because it isn't made up of commonly repeated
| "words" like text is.
| porphyra wrote:
| In the first several examples, the input seems to have a mantissa
| with all zeros beyond the first 15 bits. Why isn't the author
| using float32 for this instead of float64?
| Chabsff wrote:
| Not having to make that kind of decision in the first place is
| inherently valuable in and of itself, which I believe is what's
| being showcased here.
| mzs wrote:
| Here's discussion of an article from 2020 that explains this and
| other time series compression approaches:
| https://news.ycombinator.com/item?id=31379344
| markrages wrote:
| For integer values, just subtracting bytes is enough to allow
| much better compression ratios. See the table at
| https://opensource.quarq.com/zfit/zfit.pdf
| hcs wrote:
| Very handy to do with 7-Zip: -mf=Delta:4 (or just f=Delta:4 in
| the GUI config window). This is for 4 byte little endian
| integers, which works pretty well with 16-bit stereo PCM even
| if there's some unwanted overflow between the channels, if you
| can't use flac in some context (raw data, etc).
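| A sketch (in Rust, assumed names, not 7-Zip's code) of what a
| distance-4 byte delta filter does to little-endian 32-bit
| samples:
|
|     // Replace each byte with its difference from the byte 4
|     // positions earlier (wrapping); decoding reverses it.
|     fn delta4_encode(data: &mut [u8]) {
|         for i in (4..data.len()).rev() {
|             data[i] = data[i].wrapping_sub(data[i - 4]);
|         }
|     }
|
|     fn delta4_decode(data: &mut [u8]) {
|         for i in 4..data.len() {
|             data[i] = data[i].wrapping_add(data[i - 4]);
|         }
|     }
|
|     fn main() {
|         // Little-endian u32 samples 100, 101, 102 (made-up data).
|         let mut buf: Vec<u8> = [100u32, 101, 102]
|             .iter()
|             .flat_map(|v| v.to_le_bytes())
|             .collect();
|         delta4_encode(&mut buf);
|         println!("{:?}", buf); // mostly 0s and 1s after the first word
|         delta4_decode(&mut buf); // restores the original bytes
|     }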
| duskwuff wrote:
| The biggest issue with this approach is that it's bit-aligned,
| and the alignment is determined by fields within the data. It's
| hard to make this run fast on most CPUs; a typical implementation
| ends up having to maintain a bit buffer, and shifts bits in/out
| of it a couple times for every codeword.
|
| I bet you could improve the performance of this algorithm
| significantly, without harming its compression too badly, by one
| or both of:
|
| A) Adjusting the sizes of some bit fields to make them evenly
| divide the system word size, e.g. requiring the run length of the
| 11-prefix to be a multiple of 4, so that it can be encoded in 4
| bits instead of 6. (This is a net space savings of half a bit;
| you have to include an average of 1.5 more bits of nonzero data,
| but the length is two bits shorter.)
|
| B) Writing the streams of bits for e.g. "mode", "number of
| leading zeroes", and "bit subsequences" separately for each block
| of compressed data, rather than combining them into a single
| stream. As a bonus, each of those streams will now be more
| amenable to compression by generic compressors.
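| To make the bit-alignment point concrete, a minimal sketch (in
| Rust, assumed names, not the article's implementation) of the
| kind of bit buffer such an encoder has to maintain:
|
|     // Codewords of arbitrary width are shifted into a 64-bit
|     // buffer and flushed out a byte at a time.
|     struct BitWriter {
|         buf: u64,     // pending bits
|         count: u32,   // number of pending bits
|         out: Vec<u8>, // completed bytes
|     }
|
|     impl BitWriter {
|         fn new() -> Self {
|             BitWriter { buf: 0, count: 0, out: Vec::new() }
|         }
|
|         // Append the low `n` bits of `bits` (n <= 57 here).
|         fn write(&mut self, bits: u64, n: u32) {
|             self.buf |= (bits & ((1u64 << n) - 1)) << self.count;
|             self.count += n;
|             while self.count >= 8 {
|                 self.out.push(self.buf as u8);
|                 self.buf >>= 8;
|                 self.count -= 8;
|             }
|         }
|
|         fn finish(mut self) -> Vec<u8> {
|             if self.count > 0 {
|                 self.out.push(self.buf as u8); // zero-pad last byte
|             }
|             self.out
|         }
|     }
|
|     fn main() {
|         let mut w = BitWriter::new();
|         w.write(0b11, 2); // e.g. a "11" prefix
|         w.write(12, 6);   // e.g. a 6-bit length field
|         println!("{:?}", w.finish());
|     }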
| jhj wrote:
| People in the HPC/classical supercomputing space have done this
| sort of thing for a while. There's a fair amount of literature on
| lossless floating point compression, such as Martin Burtscher's
| work or stuff out of LLNL (fpzip):
|
| https://userweb.cs.txstate.edu/~burtscher/
| https://computing.llnl.gov/projects/floating-point-compressi...
|
| but it tends to be very application specific, relying on high
| correlation / small deltas between neighboring values in a
| 2d/3d/4d/etc floating point array (e.g., you are compressing
| neighboring temperature grid points in a PDE weather simulation
| model; temperatures in neighboring cells won't differ by that
| much).
|
| In a lot of other cases (e.g., machine learning) the floating
| point significand bits (and sometimes the sign bit) tend to be
| incompressible noise. The exponent is the only thing that is
| really compressible, and the xor trick does not help you as much
| because neighboring values could still vary a bit in terms of
| exponents. An entropy encoder works well for that instead (it
| encodes closer to the actual underlying data
| distribution/entropy), and you don't depend upon neighboring
| floats having similar exponents.
|
| In 2022, I created dietgpu, a library to losslessly
| compress/decompress floating point data at up to 400 GB/s on an
| A100. It uses a general-purpose asymmetric numeral system
| encoder/decoder on GPU (the first such implementation of general
| ANS on GPU, predating nvCOMP) for exponent compression.
|
| We have used this to losslessly compress floating point data
| between GPUs (e.g., over Infiniband/NVLink/ethernet/etc) in
| training massive ML models to speed up overall wall clock time of
| training across 100s/1000s of GPUs without changing anything
| about how the training works (it's lossless compression, it
| computes the same thing that it did before).
|
| https://github.com/facebookresearch/dietgpu
| dragontamer wrote:
| I'm not exceptionally good with compression, but I did a few
| experiments.
|
| I've noticed that array-of-structs form is harder to compress
| than struct-of-arrays form. This is relevant because a
| floating-point number is really a struct containing 3 values:
| for a 32-bit float, a 1-bit sign, 8-bit exponent, and 23-bit
| mantissa.
|
| --------
|
| If you have 1000 floats, I would expect the 1000 sign bits
| stored together to be more compressible (whatever compression
| algorithm you use, it can better determine the pattern of
| signs).
|
| Then the 8000 bits of exponent would also be more compressible,
| because the exponents would likely follow their own pattern.
|
| Finally, the 23 bits of mantissa would probably be difficult or
| impossible to compress. But "extracting them out" before
| running a compression algorithm will give the sign + exponents
| more opportunity for compression.
|
| --------
|
| So yeah, just think about your data. And favor structure-of-
| arrays form for better compression.
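| A sketch (in Rust, with an assumed helper name) of that
| struct-of-arrays split for f32 values:
|
|     // Split f32 values into separate sign, exponent and mantissa
|     // streams before handing them to a generic compressor.
|     fn split_floats(values: &[f32]) -> (Vec<u8>, Vec<u8>, Vec<u32>) {
|         let mut signs = Vec::new();     // 1 bit per value
|         let mut exponents = Vec::new(); // 8 bits per value
|         let mut mantissas = Vec::new(); // 23 bits per value
|         for &v in values {
|             let bits = v.to_bits();
|             signs.push((bits >> 31) as u8);
|             exponents.push(((bits >> 23) & 0xFF) as u8);
|             mantissas.push(bits & 0x007F_FFFF);
|         }
|         (signs, exponents, mantissas)
|     }
|
|     fn main() {
|         let (s, e, m) = split_floats(&[1.5, 1.75, -2.0]);
|         println!("{:?} {:?} {:?}", s, e, m);
|     }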
| tines wrote:
| Wow, cool to see Dr. Burtscher's name here! I was a student in
| his undergraduate parallel computing course and was invited to
| work in his lab. Very smart guy.
| EdSchouten wrote:
| As the article states, you need to apply some heuristics to
| choose between cases (2) and (3). I'm not a big fan of that,
| because it means the encoding is not deterministic (as in, it
| leaves room for interpretation/optimization by encoders).
|
| Just for fun I implemented this algorithm:
| https://go.dev/play/p/72GyYmVnwyc
|
| The algorithm described in the article uses prefixes 0, 10, and
| 11. My algorithm uses different prefixes: 0, 10, 110, 1110,
| 11110, [...]. By default, the subsequence that is used is
| identical to that containing differing bits during the previous
| iteration. The prefixes described previously can be used to widen
| the subsequence:
|
| - 0: Keep the subsequence the same width.
|
| - 10: Make the subsequence (1<<1)-1 = 1 bit wider, both on the
| left and right side.
|
| - 110: Make the subsequence (1<<2)-1 = 3 bits wider, both on the
| left and right side.
|
| - 1110: Make the subsequence (1<<3)-1 = 7 bits wider, both on the
| left and right side.
|
| - ...
|
| For the sequence of numbers presented under "Tweaking the
| algorithm", the algorithm in the article produces 969 bits of
| data. My algorithm produces just 823.
|
| Edit: Tiny bug in original code.
|
| Edit 2: maybe it's smarter to widen the subsequence using
| triangular numbers. Doing it exponentially is a bit too
| aggressive, it seems.
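| A tiny sketch (in Rust, not the linked Go code) of the widening
| rule described above; the (start, width) window representation
| is an assumption for illustration:
|
|     // A prefix of k one bits followed by a zero widens the
|     // previous subsequence by (1 << k) - 1 bits on both sides.
|     fn widen(start: u32, width: u32, k: u32) -> (u32, u32) {
|         let grow = (1u32 << k) - 1;
|         (start.saturating_sub(grow), width + 2 * grow)
|     }
|
|     fn main() {
|         println!("{:?}", widen(20, 7, 0)); // (20, 7)  keep width
|         println!("{:?}", widen(20, 7, 2)); // (17, 13) 3 bits wider per side
|     }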
| addaon wrote:
| > By default, the subsequence that is used is identical to that
| containing differing bits during the previous iteration
|
| If I'm understanding correctly, this means that if the first
| choice of subsequence size is oversized, this will carry through
| to future samples (unless the encoder chooses to re-size it
| explicitly). I wonder if, with your cheaper "loosen" encoding,
| it is worth automatically tightening the bit delta. For
| example, if I go to an initial window of "5 zeros, the 7 bits
| 1001101, remainder zeros" (which I'll call 5;7), and the next
| encoding is "same, 0101110", the leading and trailing zeros
| adjust the window to (in this example) 6;5, "tightening" around
| the non-zero bits. With your low-cost "loosen", doing this
| aggressively (instead of, say, only after it happens for multiple
| samples in a row) seems sane. It also means that the fact that
| your loosen encoding is symmetric (which makes sense in terms
| of bit efficiency) is somewhat mitigated because the implicit
| tighten can break the symmetry.
| EdSchouten wrote:
| > > By default, the subsequence that is used is identical to
| that containing differing bits during the previous iteration
|
| > If I'm understanding correctly, this means that if the
| first choice a sized subsection is oversized, this will carry
| through to future samples (unless the encoder chooses to re-
| size it explicitly).
|
| And that's exactly what I do: I resize it. The subsequence
| that is picked during round n is the one that _actually_
| contained the differing bits in round n-1.
| Manabu-eo wrote:
| For those interested in the topic, I highly recommend the series
| of posts from Charles Bloom:
|
| http://cbloomrants.blogspot.com/2023/07/notes-on-float-and-m...
|
| > TLDR:
|
| > For S16 deltas, bias by 0x8080 , for S32 deltas, bias by
| 0x80808080.
|
| > De-interleaving multi-byte integers into homogenous streams can
| help, particularly with weaker back-end compressors.
|
| > TLDR:
|
| > For floats, just reinterpret F32 as U32.
|
| > If the floats have a mix of positive and negative, convert the
| sign bit to two's-complement signed S32.
|
| > Consider lossy elimination of negative zero -0.f
|
| > If you didn't actually want the huge precision for floats near
| zero, a simple lossy encoding is just to do float += constant,
| which works for non-negative floats where you don't know the high
| end of the range so you can't just use fixed point.
|
| > TLDR:
|
| > The best way to compress numeric data that is larger than bytes
| (F32,F16,S32,S16) is usually to delta them in their original size
| integer, then de-interleave after the delta.
|
| > Sometimes no filter or no deinterleave is best, particularly
| with stronger compressors, so being able to select filter on/off
| per-file can give big wins.
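| A sketch (in Rust, not from the linked posts) of the "delta in
| the original integer size, then de-interleave" filter described
| in those notes, applied to F32-as-U32 data:
|
|     // Reinterpret each f32 as u32, subtract the previous word,
|     // then split the delta words into four byte planes.
|     fn delta_deinterleave(values: &[f32]) -> [Vec<u8>; 4] {
|         let mut planes: [Vec<u8>; 4] = Default::default();
|         let mut prev: u32 = 0;
|         for &v in values {
|             let bits = v.to_bits();
|             let delta = bits.wrapping_sub(prev);
|             prev = bits;
|             for (i, b) in delta.to_le_bytes().iter().enumerate() {
|                 planes[i].push(*b); // plane i holds byte i of every delta
|             }
|         }
|         planes
|     }
|
|     fn main() {
|         let planes = delta_deinterleave(&[1.0, 1.0000001, 1.0000002]);
|         // The high-byte planes end up almost all zeros, which a
|         // generic back-end compressor handles well.
|         println!("{:?}", planes);
|     }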
___________________________________________________________________
(page generated 2024-04-11 23:00 UTC)