[HN Gopher] Parallelising Huffman decoding and x86 disassembly b...
___________________________________________________________________
Parallelising Huffman decoding and x86 disassembly by synchronising
prefix codes
Author : hasheddan
Score : 51 points
Date : 2022-07-30 16:47 UTC (6 hours ago)
(HTM) web link (dougallj.wordpress.com)
(TXT) w3m dump (dougallj.wordpress.com)
| kyleplum wrote:
| Without taking a concrete stance on the current state of patents:
| I developed something similar using GPU compute shaders while
| employed at AMD/ATI
|
| https://patentimages.storage.googleapis.com/ef/d2/2c/b54bbdf...
|
| Shameless self-promotion aside, I think the patent application
| includes some helpful diagrams for understanding what is being
| done here, specifically pages 25 and 26.
| londons_explore wrote:
| Could the same be done with arithmetic codes?
|
| Wouldn't they also self-synchronize, just after far more bits?
| hinkley wrote:
| These days arithmetic coding has been surpassed by asymmetric
| numeral systems (ANS), except that those are slower still.
|
| Parallelizing ANS would be huge.
|
| Notable uses of ANS: zstandard, JPEG XL. Apple and Dropbox also
| use it.
| jhj wrote:
| ANS is super fast and trivially parallelizable, faster than
| Huffman and especially faster than arithmetic coding, and it is
| superior to Huffman at least as far as compression ratios are
| concerned: symbol probabilities are not restricted to negative
| powers of two. Unlike arithmetic coding, though, one is usually
| limited to symbol probabilities that are (close to) multiples of
| 2^-9 to 2^-13, the limitation being due to the table sizes
| required for rANS or tANS to fit in high-speed memories or in L1
| cache.
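|
| For a sense of where that limit comes from: a tANS/rANS decode
| table needs one entry per probability slot, so 2^-12 resolution
| already means a 4096-entry table per distribution, and the model
| step literally quantizes counts to integers summing to 2^12. A
| minimal sketch of that quantization (not from any particular
| codec, and assuming a small alphabet of at most a few hundred
| symbols):
|
|     #include <stdint.h>
|
|     #define SCALE_BITS 12            /* table has 1 << 12 slots */
|
|     /* Quantize raw symbol counts to integer frequencies summing to
|      * exactly 1 << SCALE_BITS, keeping every observed symbol at
|      * frequency >= 1.  This is what pins probabilities to multiples
|      * of 2^-SCALE_BITS. */
|     static void normalize_freqs(const uint64_t *count, uint32_t *freq,
|                                 int nsym)
|     {
|         uint64_t total = 0;
|         for (int s = 0; s < nsym; s++) total += count[s];
|
|         int32_t assigned = 0;
|         for (int s = 0; s < nsym; s++) {
|             freq[s] = count[s]
|                 ? (uint32_t)((count[s] << SCALE_BITS) / total) : 0;
|             if (count[s] && freq[s] == 0) freq[s] = 1; /* keep codable */
|             assigned += (int32_t)freq[s];
|         }
|
|         /* Push the rounding error onto the currently most probable
|          * symbol, one unit at a time; with a small alphabet the
|          * maximum stays far above 1, so nothing is driven to zero. */
|         while (assigned != (int32_t)(1u << SCALE_BITS)) {
|             int best = 0;
|             for (int s = 1; s < nsym; s++)
|                 if (freq[s] > freq[best]) best = s;
|             if (assigned < (int32_t)(1u << SCALE_BITS)) {
|                 freq[best]++; assigned++;
|             } else {
|                 freq[best]--; assigned--;
|             }
|         }
|     }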
|
| It is fast because it can be machine-word oriented (you can
| read/write whole machine words at a time rather than arbitrary,
| variable-bit-length sequences), and as a result you can
| interleave any number of independent (parallel) encoders in the
| same stream, with just a (parallel) prefix sum to figure out
| where to write state values (as whole machine words) when
| renormalization needs to occur in individual encoders. It is
| super SIMD-friendly as a result (see
| https://arxiv.org/pdf/1402.3392.pdf).
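|
| As a rough sketch of what two-way interleaving looks like on the
| decode side (rANS with 16-bit word renormalization; the
| slot-to-symbol and frequency tables are assumed to be built
| elsewhere from the quantized frequencies, and the encoder is
| assumed to have interleaved its renormalization writes in the
| matching order):
|
|     #include <stddef.h>
|     #include <stdint.h>
|
|     #define SCALE_BITS 12          /* probabilities are k / 2^12 */
|     #define RANS_L     (1u << 16)  /* lower bound of the state range */
|
|     typedef struct {
|         uint8_t  slot_to_sym[1 << SCALE_BITS];  /* slot -> symbol  */
|         uint16_t freq[256];                     /* quantized freq  */
|         uint16_t cum[256];                      /* cumulative freq */
|     } rans_tables;
|
|     /* One rANS decode step: map the low SCALE_BITS of the state to a
|      * symbol, advance the state, and pull in one whole 16-bit word if
|      * the state dropped below RANS_L.  Renormalization is
|      * word-granular, not bit-granular. */
|     static inline uint8_t rans_step(uint32_t *state,
|                                     const uint16_t **in,
|                                     const rans_tables *t)
|     {
|         uint32_t x    = *state;
|         uint32_t slot = x & ((1u << SCALE_BITS) - 1);
|         uint8_t  s    = t->slot_to_sym[slot];
|         x = t->freq[s] * (x >> SCALE_BITS) + slot - t->cum[s];
|         if (x < RANS_L)
|             x = (x << 16) | *(*in)++;
|         *state = x;
|         return s;
|     }
|
|     /* Two interleaved decoders sharing one word stream: lane 0 yields
|      * the even symbols, lane 1 the odd ones.  The state updates of the
|      * two lanes are independent of each other (they only share the
|      * input pointer), so the core can overlap them; the same idea
|      * extends to N lanes held in SIMD registers. */
|     static void rans_decode_2way(const uint16_t *in, uint32_t x0,
|                                  uint32_t x1, const rans_tables *t,
|                                  uint8_t *out, size_t n)
|     {
|         for (size_t i = 0; i + 1 < n; i += 2) {
|             out[i]     = rans_step(&x0, &in, t);
|             out[i + 1] = rans_step(&x1, &in, t);
|         }
|         if (n & 1)
|             out[n - 1] = rans_step(&x0, &in, t);
|     }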
|
| I for one got up to 400 GB/s throughput on A100 GPUs in my
| implementation (https://github.com/facebookresearch/dietgpu),
| which to my knowledge is/was the fastest entropy coder out there
| on a commercially available hardware platform: fast enough that
| we're using it to losslessly compress data before sending it over
| PCIe/NVLink/InfiniBand/etc. during large-scale neural network
| training, and it is still a net win. Nvidia's nvCOMP library also
| recently added similar techniques.
|
| ANS can self-synchronize as well, but chunking the data into
| fixed segments of >= 4 KiB has fairly minimal compression-size
| overhead (you need an index to tell you where each variable-sized
| compressed chunk begins) and is faster.
| londons_explore wrote:
| When decoding a large data block (i.e. megabytes), it really
| doesn't matter if a few tens of bytes get decoded twice to find a
| sync point.
|
| An obvious solution would appear to be to SIMD-decode with, say,
| 16 lanes, each lane starting to decode its own 1/16th of the
| input data. When each decoder gets to the end of its chunk, it
| continues a little further. Then use non-SIMD code to verify that
| every chunk has reached a sync point shared with the following
| chunk, and splice together all the decoded data.
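|
| As a rough sketch of that verify-and-splice step (assuming each
| lane records the input bit offset at which every symbol it
| decoded started, so the arrays below are strictly increasing),
| finding the sync point between one lane's overlap run and the
| next lane's speculative output is a merge-style scan:
|
|     #include <stdbool.h>
|     #include <stddef.h>
|     #include <stdint.h>
|
|     /* prev_bits: symbol start offsets from lane N's decode past its
|      *            chunk boundary (the overlap region).
|      * next_bits: symbol start offsets from lane N+1's speculative
|      *            decode, which began blindly at the chunk boundary.
|      * On success, *pi / *ni index the first symbol both lanes agree
|      * on: keep lane N's output up to *pi, then lane N+1's output
|      * from *ni onward. */
|     static bool find_sync(const uint64_t *prev_bits, size_t prev_n,
|                           const uint64_t *next_bits, size_t next_n,
|                           size_t *pi, size_t *ni)
|     {
|         size_t i = 0, j = 0;
|         while (i < prev_n && j < next_n) {
|             if (prev_bits[i] == next_bits[j]) {
|                 *pi = i;
|                 *ni = j;
|                 return true;
|             }
|             if (prev_bits[i] < next_bits[j]) i++; else j++;
|         }
|         return false;  /* lanes never resynchronized in the overlap */
|     }
|
| If find_sync fails for some pair of lanes, that boundary has to be
| decoded serially (or the overlap extended).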
| hinkley wrote:
| Before I gave up compression as a hobby, my last big observation
| was that many algorithms try to juggle several concerns at once;
| wouldn't it be better to double down on the bzip2 (or, for that
| matter, optimizing-compiler) strategy of making multiple passes?
| That's not strictly speaking parallel decoding, but it can be
| modeled with coroutines or streams, and it can involve quite a
| few cores if you so choose.
|
| And these days we are not as concerned about memory usage as we
| were when a computer with a modem might have only 384K of RAM, so
| a block structure with a length prefix is not such a bad
| strategy. Though should your blocks all be the same size, or
| should you start a new block every time the window needs to be
| reset? One is smaller, but the other has more predictable
| performance.
| londons_explore wrote:
| Does parallelizing Huffman decoding on a modern CPU actually
| speed up decoding?
|
| I would assume that the memory read and write bandwidth of a CPU
| would be the limiting factor for any vaguely efficient
| implementation, not the actual processing.
| [deleted]
| foota wrote:
| A single core can load less memory per second than the whole CPU
| can, because each core has only a limited number of resources for
| outstanding memory requests.
| mananaysiempre wrote:
| It would be, if only you could load the CPU properly. Huffman
| decoding is _highly_ serialized: an iteration per symbol in the
| FSM implementation is somewhat better than the loop iteration per
| bit in the naive one, but every iteration still depends on
| basically everything in the previous one, so you're using only a
| fraction of the scalar processing power of the core. Thus modern
| compressors resort to format-incompatible tricks like interleaved
| streams, but how much to interleave depends on which class of CPU
| you'll want to decode on.
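|
| To make that dependency concrete, here is a minimal single-table
| decode loop (a sketch only: LSB-first bit order, and MAX_CODE_LEN
| and the table contents are assumed to be built elsewhere from the
| code lengths). The table index for symbol i+1 cannot be computed
| until the shift for symbol i has completed:
|
|     #include <stddef.h>
|     #include <stdint.h>
|
|     #define MAX_CODE_LEN 12  /* assumed upper bound on code length */
|
|     typedef struct { uint8_t symbol; uint8_t length; } huff_entry;
|
|     /* Single-table Huffman decode, LSB-first bit order.  `table` has
|      * 1 << MAX_CODE_LEN entries: for each possible window of low bits
|      * it gives the decoded symbol and the length of its code.  The
|      * caller is assumed to supply enough input bytes (including slack
|      * for the refill). */
|     static void huff_decode(const huff_entry *table, const uint8_t *in,
|                             uint8_t *out, size_t n_syms)
|     {
|         uint64_t bits = 0;  /* bit buffer, oldest bits at the low end */
|         unsigned have = 0;  /* number of valid bits in the buffer */
|         size_t   pos  = 0;
|
|         for (size_t i = 0; i < n_syms; i++) {
|             while (have <= 56) {           /* refill a byte at a time */
|                 bits |= (uint64_t)in[pos++] << have;
|                 have += 8;
|             }
|             huff_entry e = table[bits & ((1u << MAX_CODE_LEN) - 1)];
|             out[i] = e.symbol;
|             /* Loop-carried dependency: the next table index needs the
|              * buffer shifted by this symbol's length, so iterations
|              * cannot overlap much. */
|             bits >>= e.length;
|             have  -= e.length;
|         }
|     }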
| wolf550e wrote:
| Modern compression formats are optimized for decoding Huffman
| streams in parallel. See, for example, Fabian Giesen explaining
| RAD's codecs and Yann Collet's zstandard:
| https://fgiesen.wordpress.com/2021/08/30/entropy-coding-in-o...
___________________________________________________________________
(page generated 2022-07-30 23:01 UTC)