[HN Gopher] The simple beauty of XOR floating point compression
___________________________________________________________________
The simple beauty of XOR floating point compression
Author : cwinter
Score : 269 points
Date : 2024-04-09 15:15 UTC (3 days ago)
(HTM) web link (clemenswinter.com)
(TXT) w3m dump (clemenswinter.com)
| sameoldtune wrote:
| This algorithm is pretty neat. I wonder if for the use case
| outlined by the post they could get even better compression
| through the use of fewer significant digits. For the purposes of
| recording times, 0.00768 is likely just as useful as
| 0.0076792240142822266
| Chabsff wrote:
| That's already one of the main properties the algorithm is
| exploiting: that f64 is a horribly overkill representation for
| the data, since it was generated using a microsecond clock,
| leading to a lot of trailing zeros.
|
| So yeah, dropping more significant digits should lead to even
| better compression.
|
| But if we are going to be massaging the data based on what we
| know/need, there are better ways to go about compressing it.
| e.g. I'd expect that simple fixed-point quantization with a
| delta encoding would compress better than this. What I find
| cool about this is that it dynamically figures things out
| without having to deal with any priors.
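|
| Something like this, roughly (a sketch of mine, assuming the
| values are non-negative timings that only need microsecond
| resolution):
|
|     import struct
|     import zlib
|
|     def quantize_delta(values, scale=1_000_000):
|         """Fixed-point quantize (here: to microseconds), then
|         delta-encode, then hand off to a generic compressor."""
|         ints = [round(v * scale) for v in values]
|         deltas = [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]
|         raw = struct.pack("<%dq" % len(deltas), *deltas)
|         return zlib.compress(raw)
|
|     timings = [0.0076792240142822266 + i * 1e-6 for i in range(1000)]
|     print(len(quantize_delta(timings)), "bytes vs", 8 * len(timings), "raw")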
| cwinter wrote:
| yes! the very last example in the post shows what happens if
| you truncate the mantissa to just the 4 most significant bits
| theamk wrote:
| I wonder how it compares to traditional compression?
|
| I've tried their "Integer that increments by 1 at every
| timestep." case on 10000 integer. The algorithm described said
| "6.98x compression"
|
| bzip2 gave 80000/8642 = 9.2x compression
|
| zstd gave 80000/11316 = 7.1x compression
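|
| (For anyone who wants to poke at this themselves, here's a rough
| way to redo the experiment with the stdlib compressors; zstd
| needs the third-party zstandard package, and exact ratios will
| differ from the numbers above.)
|
|     import bz2
|     import struct
|     import zlib
|
|     # "Integer that increments by 1 at every timestep", as f64
|     raw = struct.pack("<10000d", *map(float, range(10000)))
|
|     print("raw  :", len(raw))
|     print("bzip2:", len(bz2.compress(raw)))
|     print("zlib :", len(zlib.compress(raw, 9)))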
| Chabsff wrote:
| I'm not sure that's a very fair comparison, as this algorithm
| does point-wise streaming compression/decompression. I could
| see myself using something like this in an FPGA context or
| within a tight acquisition inner loop, for example.
| mgaunard wrote:
| A normal network runs at 10 Gbps, or 1.25 GB/s.
|
| 200MiB/s is too slow by a factor of 7. Though obviously some
| people have optimised it.
| helf wrote:
| Uh no? Most non-corporate networks are gigabit to 2.5gbit at
| best. Higher end home users may be doing 5gbit and 10gbit+.
|
| Even top end wifi tends to only be around 1.2gbit.
| vardump wrote:
| Higher end home users are doing 25, 50 and 100 Gbps. Hardware
| is actually rather affordable now.
|
| Although there's just not enough PCIe bandwidth in current
| consumer hardware for 100 Gbps...
| saagarjha wrote:
| Your high end is like the top 0.0001% of home users.
| vardump wrote:
| You can get to this for under $1k nowadays. This Mikrotik
| switch changed everything:
| https://mikrotik.com/product/crs504_4xq_in
|
| You can buy second hand 100g networking cards pretty
| cheaply. The caveat is that they only support older PCIe
| standards, so you need PCIe 16x to use them at full
| speed.
|
| Please also don't forget typical NVMe drives are pretty
| fast now. Just _two_ of them can nearly saturate a 100g
| link.
|
| Personally I know a _lot_ of people in the IT field who have
| single mode fiber SFP28 networks. I'd say 0.001% - 0.01%
| is closer to reality.
|
| Remember we were talking about high end.
| saagarjha wrote:
| Like 99% of the people I know who work in IT don't have
| anything near what you're mentioning. They spend $100 on
| _all_ their networking equipment, not just a switch.
| They're not laying fiber in their house. And outside of
| that group the number is basically 0.
| vardump wrote:
| I guess the typical gateway drug is to get a 10g internet
| connection or a 10g NAS at home and find out that copper-based
| cat6/7 is a lousy solution for actually making use of it.
|
| Surprisingly high latencies (compared to fiber, that is)
| and noisy fans due to a lot of heat generated.
|
| Shrug. I guess your mileage may vary.
| saagarjha wrote:
| I say this because I am almost at this point (I run
| copper through my house) but was scared off on going full
| fiber. I know some who have, but it's definitely really,
| really rare.
| vardump wrote:
| I had (well, still have) cat 7 in the house, while
| venturing into fiber as well. 10GBASE-T switches are
| noisy and surprisingly power hungry.
|
| One big reason for fiber is a home lab. Transferring
| large VM images etc. is so much faster. Ability to run
| iSCSI _fast_ is also amazing.
|
| Single mode fiber is pretty nice. It's fast, has low
| power requirements and no need to worry about electrical
| issues. Prices have declined significantly. The only
| minus is that you can't have PoE, so the best option is
| to wire both.
|
| (Ok, there's actually Power over Fiber, but that's just
| crazy stuff: https://en.wikipedia.org/wiki/Power-over-fiber
| A huge fire risk.)
| Dylan16807 wrote:
| It's hard for me to picture doing something with an
| internet/NAS connection and caring about the extra fraction
| of a millisecond that copper adds. And all the 10g cards
| Amazon is showing me are fanless. Even if you need a fan, it
| should be able to run at whisper-quiet speeds.
| jdsully wrote:
| How many homes have fibre runs everywhere? Most homes don't
| even have cat-5.
| helf wrote:
| Yeah, but that is the extreme minority. Claiming it isn't is
| absurd.
| wizzwizz4 wrote:
| According to https://www.speedtest.net/global-index, 200MiB/s
| is over five times faster than Singapore's broadband speed -
| and that's the country with the fastest broadband in the world.
|
| 200Mbps puts you in the top 15 countries by broadband speed
| (above Japan) and the top 4 by mobile speed (above South
| Korea). Not that that's relevant to anything. (I haven't edited
| out _any_ mistakes. The child comment is lying.)
| zamadatix wrote:
| You're conflating MiB/s with Mbps. Though I believe the GP
| was actually talking about local server networks, not
| internet speeds.
|
| If you need to stream sequential floating point data over an
| internet link, something more space efficient (and probably
| chunked) would likely be worth giving up the simplicity for.
| geon wrote:
| I'd expect an actually optimized implementation to be at least
| that much faster.
| shin_lao wrote:
| How does this compare to regular dictionary compression?
| duskwuff wrote:
| Dictionary compression isn't really applicable to floating-
| point data, because it isn't made up of commonly repeated
| "words" like text is.
| GuB-42 wrote:
| Dictionary compression, combined with a form of entropy
| coding, can work. As shown in the article, the exponents often
| repeat themselves, and the low order bits of the mantissa are
| often zero.
|
| Advanced context modeling compression algorithms can work
| even better than the scheme presented in the article. These
| are like mini-AIs, and they can recognize patterns more
| complex than just a string of unchanged bits, and with the
| appropriate entropy coder, give excellent compression. Of
| course the processing cost is huge compared to a few bit
| manipulations, and wouldn't make much sense unless you need
| extreme compression.
| gliptic wrote:
| But context modelling isn't dictionary compression. Parent
| is correct that dictionary compression itself (e.g. LZ4)
| isn't very useful for floating point.
| porphyra wrote:
| In the first several examples, the input seems to have a mantissa
| with all zeros beyond the first 15 bits. Why isn't the author
| using float32 for this instead of float64?
| Chabsff wrote:
| Not having to make that kind of decision in the first place is
| inherently valuable in and of itself, which I believe is what's
| being showcased here.
| porphyra wrote:
| I guess, but most compression algorithms like DEFLATE or zstd
| would compress those big sequences of zeros basically to
| nothing.
| mzs wrote:
| Here's discussion of an article from 2020 that explains this and
| other time series compression approaches:
| https://news.ycombinator.com/item?id=31379344
| markrages wrote:
| For integer values, just subtracting bytes is enough to allow
| much better compression ratios. See the table at
| https://opensource.quarq.com/zfit/zfit.pdf
| hcs wrote:
| Very handy to do with 7-Zip: -mf=Delta:4 (or just f=Delta:4 in
| the GUI config window). This is for 4 byte little endian
| integers, which works pretty well with 16-bit stereo PCM even
| if there's some unwanted overflow between the channels, if you
| can't use flac in some context (raw data, etc).
| duskwuff wrote:
| The biggest issue with this approach is that it's bit-aligned,
| and the alignment is determined by fields within the data. It's
| hard to make this run fast on most CPUs; a typical implementation
| ends up having to maintain a bit buffer, and shifts bits in/out
| of it a couple times for every codeword.
|
| I bet you could improve the performance of this algorithm
| significantly, without harming its compression too badly, by one
| or both of:
|
| A) Adjusting the sizes of some bit fields to make them evenly
| divide the system word size, e.g. requiring the run length of the
| 11-prefix to be a multiple of 4, so that it can be encoded in 4
| bits instead of 6. (This is a net space savings of half a bit;
| you have to include an average of 1.5 more bits of nonzero data,
| but the length is two bits shorter.)
|
| B) Writing the streams of bits for e.g. "mode", "number of
| leading zeroes", and "bit subsequences" separately for each block
| of compressed data, rather than combining them into a single
| stream. As a bonus, each of those streams will now be more
| amenable to compression by generic compressors.
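|
| To make the bit-alignment cost concrete, here's the kind of bit
| buffer a typical implementation ends up maintaining (a minimal
| sketch of mine, not the article's code); every codeword goes
| through this shift/append dance:
|
|     class BitWriter:
|         """Accumulate codewords of arbitrary bit width into bytes."""
|
|         def __init__(self):
|             self.out = bytearray()
|             self.acc = 0      # pending (and already flushed) bits
|             self.nbits = 0    # number of still-pending bits
|
|         def write(self, value, width):
|             self.acc = (self.acc << width) | (value & ((1 << width) - 1))
|             self.nbits += width
|             while self.nbits >= 8:           # flush whole bytes
|                 self.nbits -= 8
|                 self.out.append((self.acc >> self.nbits) & 0xFF)
|
|         def finish(self):
|             if self.nbits:                   # pad the last partial byte
|                 self.out.append((self.acc << (8 - self.nbits)) & 0xFF)
|                 self.nbits = 0
|             return bytes(self.out)
|
|     w = BitWriter()
|     w.write(0b11, 2)    # control bits
|     w.write(12, 5)      # e.g. a leading-zero count
|     w.write(20, 6)      # e.g. a meaningful-bit length
|     print(w.finish().hex())   # -> 'd8a0'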
| jhj wrote:
| People in the HPC/classical supercomputing space have done this
| sort of thing for a while. There's a fair amount of literature on
| lossless floating point compression, such as Martin Burtscher's
| work or stuff out of LLNL (fpzip):
|
| https://userweb.cs.txstate.edu/~burtscher/
| https://computing.llnl.gov/projects/floating-point-compressi...
|
| but it tends to be very application specific, relying on high
| correlation / small deltas between neighboring values in a
| 2d/3d/4d/etc floating point array (e.g., you are compressing
| neighboring temperature grid points in a PDE weather simulation
| model; temperatures in neighboring cells won't differ by that
| much).
|
| In a lot of other cases (e.g., machine learning) the floating
| point significand bits (and sometimes the sign bit) tend to be
| incompressible noise. The exponent is the only thing that is
| really compressible, and the xor trick does not help you as much
| because neighboring values could still vary a bit in terms of
| exponents. An entropy encoder works well for that instead (it
| encodes closer to the actual underlying data
| distribution/entropy), and you don't depend upon neighboring
| floats having similar exponents.
|
| In 2022, I created dietgpu, a library to losslessly
| compress/decompress floating point data at up to 400 GB/s on an
| A100. It uses a general-purpose asymmetric numeral system
| encoder/decoder on GPU (the first such implementation of general
| ANS on GPU, predating nvCOMP) for exponent compression.
|
| We have used this to losslessly compress floating point data
| between GPUs (e.g., over Infiniband/NVLink/ethernet/etc) in
| training massive ML models to speed up overall wall clock time of
| training across 100s/1000s of GPUs without changing anything
| about how the training works (it's lossless compression, it
| computes the same thing that it did before).
|
| https://github.com/facebookresearch/dietgpu
| dragontamer wrote:
| I'm not exceptionally good with compression, but I did a few
| experiments.
|
| I've noticed that array-of-structs form is harder to compress
| than struct-of-arrays form. This is relevant because a 32-bit
| floating-point number is really a struct containing 3 values: a
| 1-bit sign, an 8-bit exponent, and a 23-bit mantissa.
|
| --------
|
| If you have 1000 floats, I would expect the 1000 sign bits
| stored contiguously to be more compressible (whatever
| compression algorithm you use, it can better pick up the
| pattern of signs).
|
| Then the 8000 bits of exponent would also be more compressible,
| because all the exponents would likely follow their own pattern.
|
| Finally, the 23 bits of mantissa would probably be difficult or
| impossible to compress. But "extracting them out" before
| running a compression algorithm gives the sign + exponents
| more opportunity for compression.
|
| --------
|
| So yeah, just think about your data. And favor structure-of-
| arrays form for better compression.
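|
| A minimal sketch of that split for f32 (my code, stdlib only):
|
|     import struct
|
|     def split_f32(values):
|         """Split f32s into separate sign/exponent/mantissa streams."""
|         signs, exps, mants = [], [], []
|         for x in values:
|             bits = struct.unpack("<I", struct.pack("<f", x))[0]
|             signs.append(bits >> 31)           # 1 bit
|             exps.append((bits >> 23) & 0xFF)   # 8 bits
|             mants.append(bits & 0x7FFFFF)      # 23 bits
|         return signs, exps, mants
|
|     signs, exps, mants = split_f32([0.25, 0.5, 0.75, 1.0])
|     print(signs, exps, [hex(m) for m in mants])
|
| Compressing each stream separately (or just concatenating them
| in that order) lets a generic compressor see the per-field
| patterns.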
| dahart wrote:
| Totally, SoAs are often easier to compress. I hadn't thought
| about floats as being AoSs; unpacking the sign, exponent, and
| mantissas of a batch of floats into separate arrays is an
| interesting way to look at it. You could
| probably take that 1 step further with floats or ints, and
| treat an array of numbers as an array of structures of 32 or
| 64 fields where every field is 1 bit. So to compress a batch
| of 1000 32-bit numbers, take all bit 31s in a row, then all
| bit 30s, etc.. Or another way to look at it is to stack the
| numbers in a column, transpose it, and read out the bits.
|
| It's worth noting that the algorithm in the article, and its
| cousin, delta encoding, both already take advantage of any
| cases where the sign & exponent bits don't change with every
| sample, and/or where the mantissa ULPs are constant for a
| while at a time. It'd be interesting to compare the explicit
| SoA of floats compression idea to XOR or delta compression,
| but I think it's also interesting in a meta kind of way that
| these two seemingly different ideas have similar outcomes.
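|
| A quick toy sketch of that bit-plane transpose (mine):
|
|     def bit_transpose(values, width=32):
|         """Emit bit width-1 of every value, then the next bit, etc."""
|         planes = []
|         for bit in range(width - 1, -1, -1):
|             plane = 0
|             for v in values:
|                 plane = (plane << 1) | ((v >> bit) & 1)
|             planes.append(plane)   # one int holding len(values) bits
|         return planes
|
|     print([bin(p) for p in bit_transpose([7, 6, 7, 7, 6, 7], width=3)])
|     # -> ['0b111111', '0b111111', '0b101101']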
| tines wrote:
| Wow, cool to see Dr. Burtscher's name here! I was a student in
| his undergraduate parallel computing course and was invited to
| work in his lab. Very smart guy.
| dekhn wrote:
| I've talked to several world class computer
| scientists/bioinformaticians who realized, on their own, that
| decompression is the fastest way to fill the cache with data
| (because the amount of data streaming in is smaller, and CPUs
| are VERY fast for decompression).
|
| One of the key innovations in the AMBER MD engine that made it
| work OK on cheaper systems was lossless floating point
| compression. It still impresses me that you can compress
| floats, send them over MPI, and decompress them, all faster and
| at lower latency than the transport can send the uncompressed
| data.
| jhj wrote:
| Not just MPI over a network. We can compress floats, send
| them over NVLink or PCIe to another GPU in the same host, and
| decompress, and it can be faster than sending the data raw
| between GPUs. That's the premise behind dietgpu: it's cheap
| compression, not a great compression ratio (like 0.6-0.9x of
| original size), but it's extremely fast, 100s of GB/s of
| throughput, with the idea that you're trying to race something
| that is similarly fast. General floating point data could be
| quite incompressible or highly compressible; it really just
| depends upon what is being passed around.
|
| The interconnects are improving at a slower rate than CPU/GPU
| compute is, and that gap can be exploited.
| throwiforgtnlzy wrote:
| _Waves across I-35 to Reality Labs._
|
| Sounds like something Numba should include perhaps.
|
| I retired from HPC around the era of iPython, Open MPI,
| Infiniband, OFED, GPFS/Panasas/Lustre, Rocks Cluster, and
| Condor/PBS/SGE.
| mandarax8 wrote:
| I'm drawing my entire VRAM's worth of Morton-sorted points; do
| you think this compression technique lends itself well to
| streaming decompression? I.e. decompress one warp of points at a
| time right before drawing? Or do I need larger batches to make
| it viable?
| gcr wrote:
| One of my fun projects was compressing all community levels for
| SRB2Kart (based on the Doom II engine) into a single archive
| suitable for WebGL rendering of map previews.
|
| I was able to get all 220MB of the default SRB2kart maps down
| into 12.9MB, and could even get down to 5MB with heavy texture
| compression. LZMA alone could only do 37MB, so I was pretty
| pumped.
|
| One of the harder parts is compressing the vertex data and
| adjacency information (which linedefs connect which vertices).
| I never figured out a satisfying way of doing that, but this
| makes me want to try again...
| EdSchouten wrote:
| As the article states, you need to apply some heuristics to
| choose between cases (2) and (3). I'm not a big fan of that,
| because it means the encoding is not deterministic (as in, it
| leaves room for interpretation/optimization by encoders).
|
| Just for fun I implemented this algorithm:
| https://go.dev/play/p/72GyYmVnwyc
|
| The algorithm described in the article uses prefixes 0, 10, and
| 11. My algorithm uses different prefixes: 0, 10, 110, 1110,
| 11110, [...]. By default, the subsequence that is used is
| identical to that containing differing bits during the previous
| iteration. The prefixes described previously can be used to widen
| the subsequence:
|
| - 0: Keep the subsequence the same width.
|
| - 10: Make the subsequence (1<<1)-1 = 1 bit wider, both on the
| left and right side.
|
| - 110: Make the subsequence (1<<2)-1 = 3 bits wider, both on the
| left and right side.
|
| - 1110: Make the subsequence (1<<3)-1 = 7 bits wider, both on the
| left and right side.
|
| - ...
|
| For the sequence of numbers presented under "Tweaking the
| algorithm", the algorithm in the article produces 969 bits of
| data. My algorithm produces just 823.
|
| Edit: Tiny bug in original code.
|
| Edit 2: maybe it's smarter to widen the subsequence using
| triangular numbers. Doing it exponentially is a bit too
| aggressive, it seems.
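|
| (If it helps, the prefix-to-widening mapping alone looks roughly
| like this; the Go playground above is the actual implementation,
| and the unary variant mentioned in a later comment would change
| the formula.)
|
|     def read_widening(read_bit):
|         """Count leading 1 bits before the terminating 0 bit;
|         widen the subsequence by (1 << k) - 1 on each side."""
|         k = 0
|         while read_bit() == 1:
|             k += 1
|         return (1 << k) - 1
|
|     bits = iter([1, 1, 0])                       # prefix "110"
|     print(read_widening(lambda: next(bits)))     # -> 3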
| addaon wrote:
| > By default, the subsequence that is used is identical to that
| containing differing bits during the previous iteration
|
| If I'm understanding correctly, this means that if the first
| choice of subsequence size is oversized, this will carry through
| to future samples (unless the encoder chooses to re-size it
| explicitly). I wonder if, with your cheaper "loosen" encoding,
| it is worth automatically tightening the bit delta. For
| example, if I go to an initial window of "5 zeros, the 7 bits
| 1001101, remainder zeros" (which I'll call 5;7), and the next
| encoding is "same, 0101110", the leading and trailing zeros
| adjust the window to (in this example) 6;5, "tightening" around
| the non-zero bits. With your low-cost "loosen", doing this
| aggressively (instead of, say, only after it happens for
| multiple samples in a row) seems sane. It also means that the
| fact that
| your loosen encoding is symmetric (which makes sense in terms
| of bit efficiency) is somewhat mitigated because the implicit
| tighten can break the symmetry.
| EdSchouten wrote:
| > > By default, the subsequence that is used is identical to
| that containing differing bits during the previous iteration
|
| > If I'm understanding correctly, this means that if the
| first choice a sized subsection is oversized, this will carry
| through to future samples (unless the encoder chooses to re-
| size it explicitly).
|
| And that's exactly what I do: I resize it. The subsequence
| that is picked during round n is the one that _actually_
| contained the differing bits in round n-1.
| duskwuff wrote:
| What you're working towards here is essentially Golomb coding.
| The optimal divisor will depend on how the data is distributed.
|
| https://en.wikipedia.org/wiki/Golomb_coding
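|
| For reference, the Rice special case (divisor M = 2^k) is only a
| few lines: quotient in unary, remainder in k bits. A toy encoder
| returning a bit string:
|
|     def rice_encode(n, k):
|         """Golomb-Rice code of a non-negative integer n, M = 2**k."""
|         q, r = n >> k, n & ((1 << k) - 1)
|         return "1" * q + "0" + format(r, "0{}b".format(k))
|
|     for n in (0, 3, 9, 17):
|         print(n, rice_encode(n, k=3))
|     # 0 -> 0000, 3 -> 0011, 9 -> 10001, 17 -> 110001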
| teo_zero wrote:
| > The prefixes describe previously can be used to widen the
| subsequence
|
| So you can only widen the subsequence, never shrink it.
| Shrinking would address the issue raised by TFA:
|
| > If a series contains an outlier value that requires a very
| large window to encode, and all subsequent values have nonzero
| values only in a much smaller subwindow, the inefficient larger
| window will get locked in because we never hit the condition
| that would trigger a reset of the window size.
| EdSchouten wrote:
| Shrinking happens automatically, because every iteration
| takes the actually meaningful width of the previous
| iteration.
| EdSchouten wrote:
| As I'm not allowed to edit the post above: using a unary number
| system instead of exponential/triangular makes the
| implementation even simpler, and cuts down the output for the
| example even further, to 798 bits.
| Manabu-eo wrote:
| For those interested in the topic, I highly recommend the series
| of posts from Charles Bloom:
|
| http://cbloomrants.blogspot.com/2023/07/notes-on-float-and-m...
|
| > TLDR:
|
| > For S16 deltas, bias by 0x8080 , for S32 deltas, bias by
| 0x80808080.
|
| > De-interleaving multi-byte integers into homogenous streams can
| help, particularly with weaker back-end compressors.
|
| > TLDR:
|
| > For floats, just reinterpret F32 as U32.
|
| > If the floats have a mix of positive and negative, convert the
| sign bit to two's-complement signed S32.
|
| > Consider lossy elimination of negative zero -0.f
|
| > If you didn't actually want the huge precision for floats near
| zero, a simple lossy encoding is just to do float += constant,
| which works for non-negative floats where you don't know the high
| end of the range so you can't just use fixed point.
|
| > TLDR:
|
| > The best way to compress numeric data that is larger than bytes
| (F32,F16,S32,S16) is usually to delta them in their original size
| integer, then de-interleave after the delta.
|
| > Sometimes no filter or no deinterleave is best, particularly
| with stronger compressors, so being able to select filter on/off
| per-file can give big wins.
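|
| A small sketch of that last recipe (delta in the original
| integer size, then de-interleave into byte planes), assuming
| f32 data and numpy; not Bloom's code:
|
|     import numpy as np
|
|     def delta_deinterleave_f32(x):
|         """Reinterpret f32 as u32, delta in 32-bit space, then split
|         the deltas into four byte planes (byte 0s, byte 1s, ...)."""
|         u = np.asarray(x, dtype="<f4").view("<u4")
|         deltas = np.diff(u, prepend=u[:1])        # wraps mod 2**32
|         planes = deltas.view("<u1").reshape(-1, 4).T
|         return planes.tobytes()                   # feed any compressor
|
|     x = np.linspace(0.0, 1.0, 16, dtype=np.float32)
|     print(len(delta_deinterleave_f32(x)), "bytes")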
| gafferongames wrote:
| Excellent stuff
| StillBored wrote:
| It's largely a variation of delta encoding
| (https://en.wikipedia.org/wiki/Delta_encoding) which is using xor
| to determine the variation between values. Both are exploiting
| the fact that in certain kinds of series/samples the range
| (relative entropy) between values is far less than the range of
| the underlying value representation.
|
| Whether or not the resulting ranges are encoded in a space saving
| manner (as this algorithm is doing) or used as input into a BW
| transform/Huffman/etc sorta depends on the resulting value
| sequences.
|
| Much of this, though, comes down to the understanding that
| knowledge of the data being compressed can frequently lead to
| data-specific preprocessing that significantly increases
| compression ratios over generic encoders. Just taking the
| output of this routine and running it through deflate might
| yield another couple percent if it were tuned. Or put another
| way, all the bit
| prefix/shifting decisions might be redundant for certain cases
| because the 0's would end up being single bit tokens in a Huffman
| encoder.
| selcuka wrote:
| > Its largely a variation of delta encoding
|
| Amiga famously used a version of hardware backed delta encoding
| [1] to display more colours on the screen than normally
| possible within its memory limits.
|
| [1] https://en.wikipedia.org/wiki/Hold-And-Modify
| bcrl wrote:
| HAM would have worked even better if it had been in the
| hue-saturation-value (HSV) colour space that Jay Miner
| originally intended. Sadly, HSV support was dropped, but due
| to schedule constraints, there wasn't enough time to rework
| the design and remove HAM. I found it fascinating to learn
| these things long after I was no longer using my Amiga for
| day to day work.
| GuB-42 wrote:
| What if the floating point number is, for example, a number of
| seconds coming from a millisecond time source? Binary floating
| point numbers notoriously can't represent decimals precisely.
|
| You would obviously be better off timing in the proper unit so
| that you have an integer number of milliseconds, but that's the
| kind of situation where performance starts to drop and no one
| understands why, until some dev starts digging deeply into the
| encoding and maybe even writes a blog post about how they got a
| massive speedup by multiplying numbers by 1000.
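|
| A tiny illustration (my numbers, not the article's): seconds
| derived from a millisecond clock have messy mantissas, so
| consecutive XORs keep few trailing zeros, while the raw
| millisecond integers delta nicely.
|
|     import struct
|
|     def f64_bits(x):
|         return struct.unpack("<Q", struct.pack("<d", x))[0]
|
|     prev, cur = 1.041, 1.042                 # seconds from a ms clock
|     print(hex(f64_bits(prev) ^ f64_bits(cur)))   # long, messy XOR
|     print(1042 - 1041)                           # vs integer milliseconds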
| bcrl wrote:
| I did something similar in a messaging application... We had a
| data structure consisting of several tables describing various
| message queues in 4GB of persistent storage. The initialization
| of those tables was quite regular, as they were just pointers
| into blocks in the persistent memory.
|
| When the application was ported to run on AWS instances, people
| decided to try to run the application on instances with the
| crappiest 20MB/s storage. I had specified during implementation
| that the journalling layer would not work well on crappy
| storage, as the actual persistent memory the system was
| originally designed for had gobs of bandwidth (it was supercap
| backed DRAM on a PCIe card built using an FPGA and a pair of
| 10Gbps SFP+ ports to mirror to another host -- you had a couple
| of GB/s of write throughput). Customers ignored the requirement,
| and opened a bug saying the system couldn't reinitialize the
| table within the 30 second limit for commands on the CLI.
|
| To fix the "bug", I ended up delta encoding the persistent
| memory tables to make the data more regular, ran it through
| libz at the lowest compression level, then wrote the compressed
| data out to disk. 4GB of data ended up taking less than 100MB
| in the journal. gzip compression on its own was not able to get
| under the 600MB requirement. It was a total hack, but it worked
| and took a few hours to implement. Domain specific knowledge
| really helps improve compression ratios!
| jart wrote:
| If you zig zag encode the deltas you might be able to get it
| down to 10mb.
|
| dzd is the best. https://justine.lol/sizetricks/#dzd
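|
| The usual zigzag mapping, for reference (small negative and
| positive deltas both become small unsigned values):
|
|     def zigzag_encode(n, bits=32):
|         """0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ..."""
|         return ((n << 1) ^ (n >> (bits - 1))) & ((1 << bits) - 1)
|
|     def zigzag_decode(z):
|         return (z >> 1) ^ -(z & 1)
|
|     print([zigzag_encode(n) for n in (0, -1, 1, -2, 2)])  # [0, 1, 2, 3, 4]
|     print([zigzag_decode(z) for z in (0, 1, 2, 3, 4)])    # [0, -1, 1, -2, 2]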
| throwiforgtnlzy wrote:
| IIRC, there was a WAV-era audio format that used PCM delta
| compression by recording differences rather than absolute values.
| adzm wrote:
| Differential PCM basically did this, but against a predicted
| value. Delta modulation is the other scheme.
| throwiforgtnlzy wrote:
| That's probably it. Thanks for easing my perpetual
| forgetfulness!
| chrchang523 wrote:
| One question: it is possible for the XOR of two consecutive
| floating-point numbers to have 32-63 leading zeros; the numbers
| 32-63 do not fit in 5 bits. I imagine this is treated by Gorilla
| like 31 leading zeros?
| infogulch wrote:
| One of the things we've learned from columnar data stores and
| compression formats is that choosing the correct stride order
| _really_ matters.
|
| I noticed this a decade ago when I stumbled on a voxel engine
| that used some complex height map compression scheme for game map
| data. It seemed gratuitous to me, so I tried the simplest thing I
| could come up with: gzipping a 3D array of voxel colors. It
| didn't really work until I chose the right stride order (xyz,
| zyx, yzx, etc) and then suddenly it was 5x better.
|
| The parquet format has the BYTE_STREAM_SPLIT encoding strategy
| which basically splits an array of floats/doubles into 4/8 arrays
| of bytes so they can be compressed independently.
| https://parquet.apache.org/docs/file-format/data-pages/encod...
|
| Just choosing the correct stride order before compression is
| still underutilized. It's also undersupported by our tools;
| e.g. that voxel engine used numpy, but it was a pita to get
| numpy to transparently change the stride order in the memory
| representation without also affecting the indexing order.
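|
| One workaround (a sketch of mine, with a toy volume that repeats
| along one axis): transpose to the stride order you want and
| force a contiguous copy just before compressing, leaving
| indexing code elsewhere untouched.
|
|     import zlib
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     base = rng.integers(0, 256, (256, 256), dtype=np.uint8)
|     vol = np.repeat(base[:, :, None], 4, axis=2)   # identical along z
|
|     for name, order in (("xyz", (0, 1, 2)), ("zyx", (2, 1, 0))):
|         buf = np.ascontiguousarray(vol.transpose(order)).tobytes()
|         print(name, len(zlib.compress(buf)))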
| hugh-avherald wrote:
| What is stride order?
| pxeger1 wrote:
| It means the order in which you serialise a multi-dimensional
| array. So for this 2D table:
|
|     1 2 6 2
|     3 4 1 8
|     9 0 7 6
|
| Because it's 2D it has two stride orders, xy and yx, i.e.
| row-major and column-major, which would be 126234189076 and
| 139240617286 respectively.
| enizor2 wrote:
| The BYTE_STREAM_SPLIT is a Z-order (also known as Morton
| encoding). As it is better at preserving locality, it usually
| performs really well compared to classical "orthogonal" orders.
| It also is really simple to implement as it boils down to
| interleaving bits or bytes.
| einpoklum wrote:
| XOR'ing consecutive elements is the mark of people used to
| sequential, rather than parallel, processing.
|
| Most of the approaches described in that paper should work well
| enough with a base + offset scheme, where the base is RLE-
| compressed or just set once every N elements.
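|
| A sketch of one such block-parallel variant (my reading of the
| suggestion): pick a base per block of N values and XOR every
| element against that base instead of its neighbor, so blocks
| can be encoded and decoded independently.
|
|     import struct
|
|     def block_xor_encode(values, block=64):
|         """Per block: keep the first value as the base, XOR the rest
|         against it. No cross-block or cross-element dependencies."""
|         bits = [struct.unpack("<Q", struct.pack("<d", v))[0]
|                 for v in values]
|         out = []
|         for i in range(0, len(bits), block):
|             base = bits[i]
|             out.append(base)
|             out.extend(b ^ base for b in bits[i + 1:i + block])
|         return out
|
|     print([hex(x) for x in block_xor_encode([1.0, 1.5, 1.25, 2.0], 4)])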
| remram wrote:
| > We then immediately write out the subsequence of the XORed
| value that has the same starting index and length as the last
| subsequence we wrote.
|
| I don't really understand this. Is there an example?
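|
| The best reading I can come up with, as a toy 16-bit sketch: if
| the previous value's nonzero XOR bits spanned index 4 with
| length 6, and the next XOR's nonzero bits still fit inside that
| window, you emit just those 6 bits and skip the header:
|
|     prev_start, prev_len = 4, 6       # window used for the last value
|     xor = 0b0000110100000000          # next XOR, 16-bit toy example
|     window = (xor >> (16 - prev_start - prev_len)) & ((1 << prev_len) - 1)
|     print(format(window, "06b"))      # -> 110100, no new header needed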
| queuebert wrote:
| This appears to be working because the data are highly
| correlated, i.e. similar magnitudes across the vector. I wonder
| how well it would do with completely random input, not just
| uniform floats in the [0, 1] interval, but floats across all
| exponent combinations.
___________________________________________________________________
(page generated 2024-04-12 23:02 UTC)