[HN Gopher] The simple beauty of XOR floating point compression
___________________________________________________________________
The simple beauty of XOR floating point compression
Author : cwinter
Score : 269 points
Date : 2024-04-09 15:15 UTC (3 days ago)
(HTM) web link (clemenswinter.com)
(TXT) w3m dump (clemenswinter.com)
| sameoldtune wrote:
| This algorithm is pretty neat. I wonder if for the use case
| outlined by the post they could get even better compression
| through the use of fewer significant digits. For the purposes of
| recording times, 0.00768 is likely just as useful as
| 0.0076792240142822266
| Chabsff wrote:
| That's already one of the main properties the algorithm is
| exploiting: that f64 is a horribly overkill representation for
| the data, since it was generated using a microsecond clock,
| leading to a lot of trailing zeros.
|
| So yeah, dropping more significant digits should lead to even
| better compression.
|
| But if we are going to be massaging the data based on what we
| know/need, there are better ways to go about compressing it.
| e.g. I'd expect that simple fixed-point quantization with a
| delta encoding would compress better than this. What I find
| cool about this is that it dynamically figures things out
| without having to deal with any priors.
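|
| Something like this, roughly (a sketch of mine, assuming the
| values are non-negative timings that only need microsecond
| resolution):
|
|     import struct
|     import zlib
|
|     def quantize_delta(values, scale=1_000_000):
|         """Fixed-point quantize (here: to microseconds), then
|         delta-encode, then hand off to a generic compressor."""
|         ints = [round(v * scale) for v in values]
|         deltas = [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]
|         raw = struct.pack("<%dq" % len(deltas), *deltas)
|         return zlib.compress(raw)
|
|     timings = [0.0076792240142822266 + i * 1e-6 for i in range(1000)]
|     print(len(quantize_delta(timings)), "bytes vs", 8 * len(timings), "raw")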
| cwinter wrote:
| yes! the very last example in the post shows what happens if
| you truncate the mantissa to just the 4 most significant bits
| theamk wrote:
| I wonder how it compares to traditional compression?
|
| I've tried their "Integer that increments by 1 at every
| timestep." case on 10000 integer. The algorithm described said
| "6.98x compression"
|
| bzip2 gave 80000/8642 = 9.2x compression
|
| zstd gave 80000/11316 = 7.1x compression
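|
| (For anyone who wants to poke at this themselves, here's a rough
| way to redo the experiment with the stdlib compressors; zstd
| needs the third-party zstandard package, and exact ratios will
| differ from the numbers above.)
|
|     import bz2
|     import struct
|     import zlib
|
|     # "Integer that increments by 1 at every timestep", as f64
|     raw = struct.pack("<10000d", *map(float, range(10000)))
|
|     print("raw  :", len(raw))
|     print("bzip2:", len(bz2.compress(raw)))
|     print("zlib :", len(zlib.compress(raw, 9)))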
| Chabsff wrote:
| I'm not sure that's a very fair comparison, as this algorithm
| does point-wise streaming compression/decompression. I could
| see myself using something like this in an FPGA context or
| within a tight acquisition inner loop, for example.
| mgaunard wrote:
| A normal network runs at 10 Gbps, or 1.25 GB/s.
|
| 200MiB/s is too slow by a factor of 7. Though obviously some
| people have optimised it.
| helf wrote:
| Uh no? Most non-corporate networks are gigabit to 2.5gbit at
| best. Higher end home users may be doing 5gbit and 10gbit+.
|
| Even top end wifi tends to only be around 1.2gbit.
| vardump wrote:
| Higher end home users are doing 25, 50 and 100 Gbps. Hardware
| is actually rather affordable now.
|
| Although there's just not enough PCIe bandwidth in current
| consumer hardware for 100 Gbps...
| saagarjha wrote:
| Your high end is like the top 0.0001% of home users.
| vardump wrote:
| You can get to this for under $1k nowadays. This Mikrotik
| switch changed everything:
| https://mikrotik.com/product/crs504_4xq_in
|
| You can buy second hand 100g networking cards pretty
| cheaply. The caveat is that they only support older PCIe
| standards, so you need PCIe 16x to use them at full
| speed.
|
| Please also don't forget typical NVMe drives are pretty
| fast now. Just _two_ of them can nearly saturate a 100g
| link.
|
| Personally I know a _lot_ of people in the IT field who have
| single mode fiber SFP28 networks. I'd say 0.001% - 0.01%
| is closer to reality.
|
| Remember we were talking about high end.
| saagarjha wrote:
| Like 99% of the people I know who work in IT don't have
| anything near what you're mentioning. They spend $100 on
| _all_ their networking equipment, not just a switch.
| They're not laying fiber in their house. And outside of
| that group the number is basically 0.
| vardump wrote:
| I guess the typical gateway drug is to get a 10g internet
| connection or a 10g NAS at home and find out that copper-based
| cat6/7 is a lousy solution for actually making use of it.
|
| Surprisingly high latencies (compared to fiber, that is)
| and noisy fans due to a lot of heat generated.
|
| Shrug. I guess your mileage may vary.
| saagarjha wrote:
| I say this because I am almost at this point (I run
| copper through my house) but was scared off on going full
| fiber. I know some who have, but it's definitely really,
| really rare.
| vardump wrote:
| I had (well, still have) cat 7 in the house, while
| venturing into fiber as well. 10GBASE-T switches are
| noisy and surprisingly power hungry.
|
| One big reason for fiber is a home lab. Transferring
| large VM images etc. is so much faster. Ability to run
| iSCSI _fast_ is also amazing.
|
| Single mode fiber is pretty nice. It's fast, has low
| power requirements and no need to worry about electrical
| issues. Prices have declined significantly. The only
| minus is that you can't have PoE, so the best option is
| to wire both.
|
| (Ok, there's actually Power over Fiber, but that's just
| crazy stuff: https://en.wikipedia.org/wiki/Power-over-fiber
| A huge fire risk.)
| Dylan16807 wrote:
| It's hard for me to picture doing something with an
| internet/NAS connection and caring about the extra fraction
| of a millisecond that copper adds. And all the 10g cards
| Amazon is showing me are fanless. Even if you need a fan, it
| should be able to run at whisper-quiet speeds.
| jdsully wrote:
| How many homes have fibre runs everywhere? Most homes don't
| even have cat-5.
| helf wrote:
| Yeah, but that is the extreme minority. Claiming it isn't is
| absurd.
| wizzwizz4 wrote:
| According to https://www.speedtest.net/global-index, 200MiB/s
| is over five times faster than Singapore's broadband speed -
| and that's the country with the fastest broadband in the world.
|
| 200Mbps puts you in the top 15 countries by broadband speed
| (above Japan) and the top 4 by mobile speed (above South
| Korea). Not that that's relevant to anything. (I haven't edited
| out _any_ mistakes. The child comment is lying.)
| zamadatix wrote:
| You're conflating MiB/s with Mbps. Though I believe the GP
| was actually talking about local server networks, not
| internet speeds.
|
| If you need to stream sequential floating point data over an
| internet link, something more space efficient (and probably
| chunked) would likely be worth giving up the simplicity for.
| geon wrote:
| I'd expect an actually optimized implementation to be at least
| that much faster.
| shin_lao wrote:
| How does this compare to regular dictionary compression?
| duskwuff wrote:
| Dictionary compression isn't really applicable to floating-
| point data, because it isn't made up of commonly repeated
| "words" like text is.
| GuB-42 wrote:
| Dictionary compression, combined with a form of entropy
| coding, can work. As shown in the article, the exponents often
| repeat themselves, and the low order bits of the mantissa are
| often zero.
|
| Advanced context modeling compression algorithms can work
| even better than the scheme presented in the article. These
| are like mini-AIs, and they can recognize patterns more
| complex than just a string of unchanged bits, and with the
| appropriate entropy coder, give excellent compression. Of
| course the processing cost is huge compared to a few bit
| manipulations, and wouldn't make much sense unless you need
| extreme compression.
| gliptic wrote:
| But context modelling isn't dictionary compression. Parent
| is correct that dictionary compression itself (e.g. LZ4)
| isn't very useful for floating point.
| porphyra wrote:
| In the first several examples, the input seems to have a mantissa
| with all zeros beyond the first 15 bits. Why isn't the author
| using float32 for this instead of float64?
| Chabsff wrote:
| Not having to make that kind of decision in the first place is
| inherently valuable in and of itself, which I believe is what's
| being showcased here.
| porphyra wrote:
| I guess, but most compression algorithms like DEFLATE or zstd
| would compress those big sequences of zeros basically to
| nothing.
| mzs wrote:
| Here's discussion of an article from 2020 that explains this and
| other time series compression approaches:
| https://news.ycombinator.com/item?id=31379344
| markrages wrote:
| For integer values, just subtracting bytes is enough to allow
| much better compression ratios. See the table at
| https://opensource.quarq.com/zfit/zfit.pdf
| hcs wrote:
| Very handy to do with 7-Zip: -mf=Delta:4 (or just f=Delta:4 in
| the GUI config window). This is for 4 byte little endian
| integers, which works pretty well with 16-bit stereo PCM even
| if there's some unwanted overflow between the channels, if you
| can't use flac in some context (raw data, etc).
| duskwuff wrote:
| The biggest issue with this approach is that it's bit-aligned,
| and the alignment is determined by fields within the data. It's
| hard to make this run fast on most CPUs; a typical implementation
| ends up having to maintain a bit buffer, and shifts bits in/out
| of it a couple times for every codeword.
|
| I bet you could improve the performance of this algorithm
| significantly, without harming its compression too badly, by one
| or both of:
|
| A) Adjusting the sizes of some bit fields to make them evenly
| divide the system word size, e.g. requiring the run length of the
| 11-prefix to be a multiple of 4, so that it can be encoded in 4
| bits instead of 6. (This is a net space savings of half a bit;
| you have to include an average of 1.5 more bits of nonzero data,
| but the length is two bits shorter.)
|
| B) Writing the streams of bits for e.g. "mode", "number of
| leading zeroes", and "bit subsequences" separately for each block
| of compressed data, rather than combining them into a single
| stream. As a bonus, each of those streams will now be more
| amenable to compression by generic compressors.
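|
| To make the bit-alignment cost concrete, here's the kind of bit
| buffer a typical implementation ends up maintaining (a minimal
| sketch of mine, not the article's code); every codeword goes
| through this shift/append dance:
|
|     class BitWriter:
|         """Accumulate codewords of arbitrary bit width into bytes."""
|
|         def __init__(self):
|             self.out = bytearray()
|             self.acc = 0      # pending (and already flushed) bits
|             self.nbits = 0    # number of still-pending bits
|
|         def write(self, value, width):
|             self.acc = (self.acc << width) | (value & ((1 << width) - 1))
|             self.nbits += width
|             while self.nbits >= 8:           # flush whole bytes
|                 self.nbits -= 8
|                 self.out.append((self.acc >> self.nbits) & 0xFF)
|
|         def finish(self):
|             if self.nbits:                   # pad the last partial byte
|                 self.out.append((self.acc << (8 - self.nbits)) & 0xFF)
|                 self.nbits = 0
|             return bytes(self.out)
|
|     w = BitWriter()
|     w.write(0b11, 2)    # control bits
|     w.write(12, 5)      # e.g. a leading-zero count
|     w.write(20, 6)      # e.g. a meaningful-bit length
|     print(w.finish().hex())   # -> 'd8a0'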
| jhj wrote:
| People in the HPC/classical supercomputing space have done this
| sort of thing for a while. There's a fair amount of literature on
| lossless floating point compression, such as Martin Burtscher's
| work or stuff out of LLNL (fpzip):
|
| https://userweb.cs.txstate.edu/~burtscher/
| https://computing.llnl.gov/projects/floating-point-compressi...
|
| but it tends to be very application specific, relying on high
| correlation / small deltas between neighboring values in a
| 2d/3d/4d/etc floating point array (e.g., you are compressing
| neighboring temperature grid points in a PDE weather simulation
| model; temperatures in neighboring cells won't differ by that
| much).
|
| In a lot of other cases (e.g., machine learning) the floating
| point significand bits (and sometimes the sign bit) tend to be
| incompressible noise. The exponent is the only thing that is
| really compressible, and the xor trick does not help you as much
| because neighboring values could still vary a bit in terms of
| exponents. An entropy encoder works well for that instead (it
| encodes closer to the actual underlying data
| distribution/entropy), and you don't depend upon neighboring
| floats having similar exponents.
|
| In 2022, I created dietgpu, a library to losslessly
| compress/decompress floating point data at up to 400 GB/s on an
| A100. It uses a general-purpose asymmetric numeral system
| encoder/decoder on GPU (the first such implementation of general
| ANS on GPU, predating nvCOMP) for exponent compression.
|
| We have used this to losslessly compress floating point data
| between GPUs (e.g., over Infiniband/NVLink/ethernet/etc) in
| training massive ML models to speed up overall wall clock time of
| training across 100s/1000s of GPUs without changing anything
| about how the training works (it's lossless compression, it
| computes the same thing that it did before).
|
| https://github.com/facebookresearch/dietgpu
| dragontamer wrote:
| I'm not exceptionally good with compression, but I did a few
| experiments.
|
| I've noticed that array-of-structs form is harder to compress
| than struct-of-arrays form. This is relevant because a 32-bit
| floating-point number is really a struct containing 3 values: a
| 1-bit sign, an 8-bit exponent, and a 23-bit mantissa.
|
| --------
|
| If you have 1000 floats, I would expect the 1000 sign bits
| stored contiguously to be more compressible (whatever
| compression algorithm you use, it can better pick up the
| pattern of signs).
|
| Then the 8000 bits of exponent would also be more compressible,
| because all the exponents would likely follow their own pattern.
|
| Finally, the 23 bits of mantissa would probably be difficult or
| impossible to compress. But "extracting them out" before
| running a compression algorithm gives the sign + exponents
| more opportunity for compression.
|
| --------
|
| So yeah, just think about your data. And favor structure-of-
| arrays form for better compression.
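|
| A minimal sketch of that split for f32 (my code, stdlib only):
|
|     import struct
|
|     def split_f32(values):
|         """Split f32s into separate sign/exponent/mantissa streams."""
|         signs, exps, mants = [], [], []
|         for x in values:
|             bits = struct.unpack("<I", struct.pack("<f", x))[0]
|             signs.append(bits >> 31)           # 1 bit
|             exps.append((bits >> 23) & 0xFF)   # 8 bits
|             mants.append(bits & 0x7FFFFF)      # 23 bits
|         return signs, exps, mants
|
|     signs, exps, mants = split_f32([0.25, 0.5, 0.75, 1.0])
|     print(signs, exps, [hex(m) for m in mants])
|
| Compressing each stream separately (or just concatenating them
| in that order) lets a generic compressor see the per-field
| patterns.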
| dahart wrote:
| Totally, SoAs are often easier to compress. I hadn't thought
| about floats as being AoSs; unpacking the sign, exponent, and
| mantissas of a batch of floats into separate arrays is an
| interesting way to look at it. You could
| probably take that 1 step further with floats or ints, and
| treat an array of numbers as an array of structures of 32 or
| 64 fields where every field is 1 bit. So to compress a batch
| of 1000 32-bit numbers, take all bit 31s in a row, then all
| bit 30s, etc.. Or another way to look at it is to stack the
| numbers in a column, transpose it, and read out the bits.
|
| It's worth noting that the algorithm in the article, and its
| cousin, delta encoding, both already take advantage of any
| cases where the sign & exponent bits don't change with every
| sample, and/or where the mantissa ULPs are constant for a
| while at a time. It'd be interesting to compare the explicit
| SoA of floats compression idea to XOR or delta compression,
| but I think it's also interesting in a meta kind of way that
| these two seemingly different ideas have similar outcomes.
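|
| A quick toy sketch of that bit-plane transpose (mine):
|
|     def bit_transpose(values, width=32):
|         """Emit bit width-1 of every value, then the next bit, etc."""
|         planes = []
|         for bit in range(width - 1, -1, -1):
|             plane = 0
|             for v in values:
|                 plane = (plane << 1) | ((v >> bit) & 1)
|             planes.append(plane)   # one int holding len(values) bits
|         return planes
|
|     print([bin(p) for p in bit_transpose([7, 6, 7, 7, 6, 7], width=3)])
|     # -> ['0b111111', '0b111111', '0b101101']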
| tines wrote:
| Wow, cool to see Dr. Burtscher's name here! I was a student in
| his undergraduate parallel computing course and was invited to
| work in his lab. Very smart guy.
| dekhn wrote:
| I've talked to several world class computer
| scientists/bioinformaticians who realized, on their own, that
| decompression is the fastest way to fill the cache with data
| (because the amount of data streaming in is smaller, and CPUs
| are VERY fast for decompression).
|
| One of the key innovations in the AMBER MD engine that made it
| work OK on cheaper systems was lossless floating point
| compression. It still impresses me that you can compress
| floats, send them over MPI, and decompress them, all faster and
| at lower latency than the transport can send the uncompressed
| data.
| jhj wrote:
| Not just MPI over a network. We can compress floats, send
| them over NVLink or PCIe to another GPU in the same host, and
| decompress, and it can be faster than sending the data raw
| between GPUs. That's the premise behind dietgpu: it's cheap
| compression, not a great compression ratio (like 0.6-0.9x of
| original size), but it's extremely fast, 100s of GB/s of
| throughput, with the idea that you're trying to race something
| that is similarly fast. General floating point data could be
| quite incompressible or highly compressible; it really just
| depends upon what is being passed around.
|
| The interconnects are improving at a slower rate than CPU/GPU
| compute is, and that gap can be exploited.
| throwiforgtnlzy wrote:
| _Waves across I-35 to Reality Labs._
|
| Sounds like something Numba should include perhaps.
|
| I retired from HPC around the era of iPython, Open MPI,
| Infiniband, OFED, GPFS/Panasas/Lustre, Rocks Cluster, and
| Condor/PBS/SGE.
| mandarax8 wrote:
| I'm drawing my entire VRAM's worth of Morton-sorted points; do
| you think this compression technique lends itself well to
| streaming decompression? I.e. decompress one warp of points at a
| time right before drawing? Or do I need larger batches to make
| it viable?
| gcr wrote:
| One of my fun projects was compressing all community levels for
| SRB2Kart (based on the Doom II engine) into a single archive
| suitable for WebGL rendering of map previews.
|
| I was able to get all 220MB of the default SRB2kart maps down
| into 12.9MB, and could even get down to 5MB with heavy texture
| compression. LZMA alone could only do 37MB, so I was pretty
| pumped.
|
| One of the harder parts is compressing the vertex data and
| adjacency information (which linedefs connect which vertices).
| I never figured out a satisfying way of doing that, but this
| makes me want to try again...
| EdSchouten wrote:
| As the article states, you need to apply some heuristics to
| choose between cases (2) and (3). I'm not a big fan of that,
| because it means the encoding is not deterministic (as in, it
| leaves room for interpretation/optimization by encoders).
|
| Just for fun I implemented this algorithm:
| https://go.dev/play/p/72GyYmVnwyc
|
| The algorithm described in the article uses prefixes 0, 10, and
| 11. My algorithm uses different prefixes: 0, 10, 110, 1110,
| 11110, [...]. By default, the subsequence that is used is
| identical to that containing differing bits during the previous
| iteration. The prefixes described previously can be used to widen
| the subsequence:
|
| - 0: Keep the subsequence the same width.
|
| - 10: Make the subsequence (1<<1)-1 = 1 bit wider, both on the
| left and right side.
|
| - 110: Make the subsequence (1<<2)-1 = 3 bits wider, both on the
| left and right side.
|
| - 1110: Make the subsequence (1<<3)-1 = 7 bits wider, both on the
| left and right side.
|
| - ...
|
| For the sequence of numbers presented under "Tweaking the
| algorithm", the algorithm in the article produces 969 bits of
| data. My algorithm produces just 823.
|
| Edit: Tiny bug in original code.
|
| Edit 2: maybe it's smarter to widen the subsequence using
| triangular numbers. Doing it exponentially is a bit too
| aggressive, it seems.
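|
| (If it helps, the prefix-to-widening mapping alone looks roughly
| like this; the Go playground above is the actual implementation,
| and the unary variant mentioned in a later comment would change
| the formula.)
|
|     def read_widening(read_bit):
|         """Count leading 1 bits before the terminating 0 bit;
|         widen the subsequence by (1 << k) - 1 on each side."""
|         k = 0
|         while read_bit() == 1:
|             k += 1
|         return (1 << k) - 1
|
|     bits = iter([1, 1, 0])                       # prefix "110"
|     print(read_widening(lambda: next(bits)))     # -> 3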
| addaon wrote:
| > By default, the subsequence that is used is identical to that
| containing differing bits during the previous iteration
|
| If I'm understanding correctly, this means that if the first
| choice of subsequence size is oversized, this will carry through
| to future samples (unless the encoder chooses to re-size it
| explicitly). I wonder if, with your cheaper "loosen" encoding,
| it is worth automatically tightening the bit delta. For
| example, if I go to an initial window of "5 zeros, the 7 bits
| 1001101, remainder zeros" (which I'll call 5;7), and the next
| encoding is "same, 0101110", the leading and trailing zeros
| adjust the window to (in this example) 6;5, "tightening" around
| the non-zero bits. With your low-cost "loosen", doing this
| aggressively (instead of, say, only after it happens for
| multiple samples in a row) seems sane. It also means that the
| fact that
| your loosen encoding is symmetric (which makes sense in terms
| of bit efficiency) is somewhat mitigated because the implicit
| tighten can break the symmetry.
| EdSchouten wrote:
| > > By default, the subsequence that is used is identical to
| that containing differing bits during the previous iteration
|
| > If I'm understanding correctly, this means that if the
| first choice a sized subsection is oversized, this will carry
| through to future samples (unless the encoder chooses to re-
| size it explicitly).
|
| And that's exactly what I do: I resize it. The subsequence
| that is picked during round n is the one that _actually_
| contained the differing bits in round n-1.
| duskwuff wrote:
| What you're working towards here is essentially Golomb coding.
| The optimal divisor will depend on how the data is distributed.
|
| https://en.wikipedia.org/wiki/Golomb_coding
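|
| For reference, the Rice special case (divisor M = 2^k) is only a
| few lines: quotient in unary, remainder in k bits. A toy encoder
| returning a bit string:
|
|     def rice_encode(n, k):
|         """Golomb-Rice code of a non-negative integer n, M = 2**k."""
|         q, r = n >> k, n & ((1 << k) - 1)
|         return "1" * q + "0" + format(r, "0{}b".format(k))
|
|     for n in (0, 3, 9, 17):
|         print(n, rice_encode(n, k=3))
|     # 0 -> 0000, 3 -> 0011, 9 -> 10001, 17 -> 110001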
| teo_zero wrote:
| > The prefixes describe previously can be used to widen the
| subsequence
|
| So you can only widen the subsequence, never shrink it.
| Shrinking would address the issue raised by TFA:
|
| > If a series contains an outlier value that requires a very
| large window to encode, and all subsequent values have nonzero
| values only in a much smaller subwindow, the inefficient larger
| window will get locked in because we never hit the condition
| that would trigger a reset of the window size.
| EdSchouten wrote:
| Shrinking happens automatically, because every iteration
| takes the actually meaningful width of the previous
| iteration.
| EdSchouten wrote:
| As I'm not allowed to edit the post above: using a unary number
| system instead of exponential/triangular makes the
| implementation even simpler, and cuts down the output for the
| example even further, to 798 bits.
| Manabu-eo wrote:
| For those interested in the topic, I highly recommend the series
| of posts from Charles Bloom:
|
| http://cbloomrants.blogspot.com/2023/07/notes-on-float-and-m...
|
| > TLDR:
|
| > For S16 deltas, bias by 0x8080 , for S32 deltas, bias by
| 0x80808080.
|
| > De-interleaving multi-byte integers into homogenous streams can
| help, particularly with weaker back-end compressors.
|
| > TLDR:
|
| > For floats, just reinterpret F32 as U32.
|
| > If the floats have a mix of positive and negative, convert the
| sign bit to two's-complement signed S32.
|
| > Consider lossy elimination of negative zero -0.f
|
| > If you didn't actually want the huge precision for floats near
| zero, a simple lossy encoding is just to do float += constant,
| which works for non-negative floats where you don't know the high
| end of the range so you can't just use fixed point.
|
| > TLDR:
|
| > The best way to compress numeric data that is larger than bytes
| (F32,F16,S32,S16) is usually to delta them in their original size
| integer, then de-interleave after the delta.
|
| > Sometimes no filter or no deinterleave is best, particularly
| with stronger compressors, so being able to select filter on/off
| per-file can give big wins.
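|
| A small sketch of that last recipe (delta in the original
| integer size, then de-interleave into byte planes), assuming
| f32 data and numpy; not Bloom's code:
|
|     import numpy as np
|
|     def delta_deinterleave_f32(x):
|         """Reinterpret f32 as u32, delta in 32-bit space, then split
|         the deltas into four byte planes (byte 0s, byte 1s, ...)."""
|         u = np.asarray(x, dtype="<f4").view("<u4")
|         deltas = np.diff(u, prepend=u[:1])        # wraps mod 2**32
|         planes = deltas.view("<u1").reshape(-1, 4).T
|         return planes.tobytes()                   # feed any compressor
|
|     x = np.linspace(0.0, 1.0, 16, dtype=np.float32)
|     print(len(delta_deinterleave_f32(x)), "bytes")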
| gafferongames wrote:
| Excellent stuff
| StillBored wrote:
| It's largely a variation of delta encoding
| (https://en.wikipedia.org/wiki/Delta_encoding) which is using xor
| to determine the variation between values. Both are exploiting
| the fact that in certain kinds of series/samples the range
| (relative entropy) between values is far less than the range of
| the underlying value representation.
|
| Whether or not the resulting ranges are encoded in a space saving
| manner (as this algorithm is doing) or used as input into a BW
| transform/Huffman/etc sorta depends on the resulting value
| sequences.
|
| Much of this, though, comes down to the understanding that
| knowledge of the data being compressed can frequently lead to
| data-specific preprocessing that significantly increases
| compression ratios over generic encoders. Just taking the
| output of this routine and running it through deflate might
| yield another couple percent if it were tuned. Or put another
| way, all the bit
| prefix/shifting decisions might be redundant for certain cases
| because the 0's would end up being single bit tokens in a Huffman
| encoder.
| selcuka wrote:
| > Its largely a variation of delta encoding
|
| Amiga famously used a version of hardware backed delta encoding
| [1] to display more colours on the screen than normally
| possible within its memory limits.
|
| [1] https://en.wikipedia.org/wiki/Hold-And-Modify
| bcrl wrote:
| HAM would have worked even better if it had been in the
| hue-saturation-value (HSV) colour space that Jay Miner
| originally intended. Sadly, HSV support was dropped, but due
| to schedule constraints, there wasn't enough time to rework
| the design and remove HAM. I found it fascinating to learn
| these things long after I was no longer using my Amiga for
| day to day work.
| GuB-42 wrote:
| What if the floating point number is, for example, a number of
| seconds coming from a millisecond time source? Binary floating
| point numbers notoriously can't represent decimals precisely.
|
| You would obviously be better off timing in the proper unit so
| that you have an integer number of milliseconds, but that's the
| kind of situation where performance starts to drop and no one
| understands why, until some dev starts digging deeply into the
| encoding and maybe even writes a blog post about how they got a
| massive speedup by multiplying numbers by 1000.
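|
| A tiny illustration (my numbers, not the article's): seconds
| derived from a millisecond clock have messy mantissas, so
| consecutive XORs keep few trailing zeros, while the raw
| millisecond integers delta nicely.
|
|     import struct
|
|     def f64_bits(x):
|         return struct.unpack("<Q", struct.pack("<d", x))[0]
|
|     prev, cur = 1.041, 1.042                 # seconds from a ms clock
|     print(hex(f64_bits(prev) ^ f64_bits(cur)))   # long, messy XOR
|     print(1042 - 1041)                           # vs integer milliseconds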
| bcrl wrote:
| I did something similar in a messaging application... We had a
| data structure consisting of several tables describing various
| message queues in 4GB of persistent storage. The initialization
| of those tables was quite regular, as they were just pointers
| into blocks in the persistent memory.
|
| When the application was ported to run on AWS instances, people
| decided to try to run the application on instances with the
| crappiest 20MB/s storage. I had specified during implementation
| that the journalling layer would not work well on crappy
| storage, as the actual persistent memory the system was
| originally designed for had gobs of bandwidth (it was supercap
| backed DRAM on a PCIe card built using an FPGA and a pair of
| 10Gbps SFP+ ports to mirror to another host -- you had a couple
| of GB/s of write throughput). Customers ignored the requirement,
| and opened a bug saying the system couldn't reinitialize the
| table within the 30 second limit for commands on the CLI.
|
| To fix the "bug", I ended up delta encoding the persistent
| memory tables to make the data more regular, ran it through
| libz at the lowest compression level, then wrote the compressed
| data out to disk. 4GB of data ended up taking less than 100MB
| in the journal. gzip compression on its own was not able to get
| under the 600MB requirement. It was a total hack, but it worked
| and took a few hours to implement. Domain specific knowledge
| really helps improve compression ratios!
| jart wrote:
| If you zig zag encode the deltas you might be able to get it
| down to 10mb.
|
| dzd is the best. https://justine.lol/sizetricks/#dzd
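|
| The usual zigzag mapping, for reference (small negative and
| positive deltas both become small unsigned values):
|
|     def zigzag_encode(n, bits=32):
|         """0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ..."""
|         return ((n << 1) ^ (n >> (bits - 1))) & ((1 << bits) - 1)
|
|     def zigzag_decode(z):
|         return (z >> 1) ^ -(z & 1)
|
|     print([zigzag_encode(n) for n in (0, -1, 1, -2, 2)])  # [0, 1, 2, 3, 4]
|     print([zigzag_decode(z) for z in (0, 1, 2, 3, 4)])    # [0, -1, 1, -2, 2]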
| throwiforgtnlzy wrote:
| IIRC, there was a WAV-era audio format that used PCM delta
| compression by recording differences rather than absolute values.
| adzm wrote:
| Differential PCM basically did this, but against a predicted
| value. Delta modulation is the other scheme.
| throwiforgtnlzy wrote:
| That's probably it. Thanks for easing my perpetual
| forgetfulness!
| chrchang523 wrote:
| One question: it is possible for the XOR of two consecutive
| floating-point numbers to have 32-63 leading zeros; the numbers
| 32-63 do not fit in 5 bits. I imagine this is treated by Gorilla
| like 31 leading zeros?
| infogulch wrote:
| One of the things we've learned from columnar data stores and
| compression formats is that choosing the correct stride order
| _really_ matters.
|
| I noticed this a decade ago when I stumbled on a voxel engine
| that used some complex height map compression scheme for game map
| data. It seemed gratuitous to me, so I tried the simplest thing I
| could come up with: gzipping a 3D array of voxel colors. It
| didn't really work until I chose the right stride order (xyz,
| zyx, yzx, etc) and then suddenly it was 5x better.
|
| The parquet format has the BYTE_STREAM_SPLIT encoding strategy
| which basically splits an array of floats/doubles into 4/8 arrays
| of bytes so they can be compressed independently.
| https://parquet.apache.org/docs/file-format/data-pages/encod...
|
| Just choosing the correct stride order before compression is
| still underutilized. It's also undersupported by our tools;
| e.g. that voxel engine used numpy, but it was a pita to get
| numpy to transparently change the stride order in the memory
| representation without also affecting the indexing order.
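|
| One workaround (a sketch of mine, with a toy volume that repeats
| along one axis): transpose to the stride order you want and
| force a contiguous copy just before compressing, leaving
| indexing code elsewhere untouched.
|
|     import zlib
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     base = rng.integers(0, 256, (256, 256), dtype=np.uint8)
|     vol = np.repeat(base[:, :, None], 4, axis=2)   # identical along z
|
|     for name, order in (("xyz", (0, 1, 2)), ("zyx", (2, 1, 0))):
|         buf = np.ascontiguousarray(vol.transpose(order)).tobytes()
|         print(name, len(zlib.compress(buf)))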
| hugh-avherald wrote:
| What is stride order?
| pxeger1 wrote:
| It means the order in which you serialise a multi-dimensional
| array. So for this 2D table:
|
|     1 2 6 2
|     3 4 1 8
|     9 0 7 6
|
| Because it's 2D it has two stride orders, xy and yx, i.e.
| row-major and column-major, which would be 126234189076 and
| 139240617286 respectively.
| enizor2 wrote:
| The BYTE_STREAM_SPLIT is a Z-order (also known as Morton
| encoding). As it is better at preserving locality, it usually
| performs really well compared to classical "orthogonal" orders.
| It also is really simple to implement as it boils down to
| interleaving bits or bytes.
| einpoklum wrote:
| XOR'ing consecutive elements is the mark of people used to
| sequential, rather than parallel, processing.
|
| Most of the approaches described in that paper should work well
| enough with a base + offset scheme, where the base is RLE-
| compressed or just set once every N elements.
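|
| A sketch of one such block-parallel variant (my reading of the
| suggestion): pick a base per block of N values and XOR every
| element against that base instead of its neighbor, so blocks
| can be encoded and decoded independently.
|
|     import struct
|
|     def block_xor_encode(values, block=64):
|         """Per block: keep the first value as the base, XOR the rest
|         against it. No cross-block or cross-element dependencies."""
|         bits = [struct.unpack("<Q", struct.pack("<d", v))[0]
|                 for v in values]
|         out = []
|         for i in range(0, len(bits), block):
|             base = bits[i]
|             out.append(base)
|             out.extend(b ^ base for b in bits[i + 1:i + block])
|         return out
|
|     print([hex(x) for x in block_xor_encode([1.0, 1.5, 1.25, 2.0], 4)])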
| remram wrote:
| > We then immediately write out the subsequence of the XORed
| value that has the same starting index and length as the last
| subsequence we wrote.
|
| I don't really understand this. Is there an example?
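|
| The best reading I can come up with, as a toy 16-bit sketch: if
| the previous value's nonzero XOR bits spanned index 4 with
| length 6, and the next XOR's nonzero bits still fit inside that
| window, you emit just those 6 bits and skip the header:
|
|     prev_start, prev_len = 4, 6       # window used for the last value
|     xor = 0b0000110100000000          # next XOR, 16-bit toy example
|     window = (xor >> (16 - prev_start - prev_len)) & ((1 << prev_len) - 1)
|     print(format(window, "06b"))      # -> 110100, no new header needed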
| queuebert wrote:
| This appears to be working because the data are highly
| correlated, i.e. similar magnitudes across the vector. I wonder
| how well it would do with completely random input, not just
| uniform floats in the [0, 1] interval, but floats across all
| exponent combinations.
___________________________________________________________________
(page generated 2024-04-12 23:02 UTC)