[HN Gopher] Minimalist Guide to Lossless Compression (2019)
___________________________________________________________________
Minimalist Guide to Lossless Compression (2019)
Author : marklit
Score : 91 points
Date : 2021-12-12 12:44 UTC (1 day ago)
(HTM) web link (tech.marksblogg.com)
(TXT) w3m dump (tech.marksblogg.com)
| tobijdc wrote:
| Doesn't LZFSE stem from the Finite State Entropy library that
| Yann Collet (LZ4, ZSTD) wrote as a base for ZSTD (together with
| HUFF0), with Apple simply deciding to use it before it was fully
| mature? So shouldn't LZFSE be a predecessor of ZSTD in the tree?
| optimalsolver wrote:
| Lossless compression is apparently equivalent to general
| intelligence:
|
| http://mattmahoney.net/dc/rationale.html
| pornel wrote:
| Yes, but the compression process can be split into two
| components:
|
| 1. modelling (you convert data to symbols and predict
| probabilities of seeing each symbol)
|
| 2. entropy coding (convert symbols + probabilities to minimum
| number of bits)
|
| The second stage is relatively simple, and basically solved. We
| went from Huffman's approximation to Arithmetic Coding, which is
| theoretically ideal, and now have ANS, which is close to ideal
| and pretty fast too.
|
| The modelling part is indeed a completely open problem, and at
| this point as much art as science.
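| As a minimal sketch of that split (illustrative Python, not a
| production codec): an order-0 byte-frequency model feeds a
| Huffman coder built with the standard library.
|
|     import heapq
|     from collections import Counter
|
|     def huffman_code_lengths(data: bytes) -> dict:
|         """Stage 1 (model): count order-0 symbol frequencies.
|         Stage 2 (entropy coder): derive Huffman code lengths."""
|         freq = Counter(data)
|         # Heap entries: (weight, tie_breaker, {symbol: length})
|         heap = [(w, i, {s: 0})
|                 for i, (s, w) in enumerate(freq.items())]
|         heapq.heapify(heap)
|         tie = len(heap)
|         while len(heap) > 1:
|             w1, _, a = heapq.heappop(heap)
|             w2, _, b = heapq.heappop(heap)
|             merged = {s: n + 1 for s, n in {**a, **b}.items()}
|             heapq.heappush(heap, (w1 + w2, tie, merged))
|             tie += 1
|         return heap[0][2] if heap else {}
|
|     data = b"abracadabra abracadabra"
|     lengths = huffman_code_lengths(data)
|     bits = sum(lengths[s] for s in data)
|     print(f"{len(data) * 8} bits raw -> ~{bits} bits coded")
|
| Swapping the Huffman stage for arithmetic coding or ANS leaves
| the modelling stage untouched.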
| Jasper_ wrote:
| Most compression codecs these days have stopped using ANS
| because the benefits aren't worth the tradeoffs. Turns out a
| good arith coder is much faster.
| hexxagone wrote:
| "Turns out a good arith coder is much faster". Citation
| needed.
| selfhoster11 wrote:
| FLAC isn't turning into Skynet that soon.
|
| Besides, humans are lossy. We forget, make mistakes, and think
| via heuristics.
| greypowerOz wrote:
| >"Also, consider that if you can only read at 100 MB/s off a
| mechanical drive but your CPU can decompress data at ~500 MB/s
| then the mechanical drive is able to provide 5x the throughput
| you'd otherwise expect thanks to compression."
|
| I'd not really thought of that aspect before... My old brain is
| hard-coded to save CPU cycles... Time to change my ways :)
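| A back-of-the-envelope helper for that arithmetic (a sketch;
| the 100 MB/s, 500 MB/s and 5x figures are the article's example
| numbers):
|
|     def effective_read_speed(disk_mb_s, decompress_mb_s, ratio):
|         # The disk delivers compressed bytes that expand by
|         # `ratio`, capped by how fast the CPU can decompress.
|         return min(disk_mb_s * ratio, decompress_mb_s)
|
|     # 100 MB/s disk, 500 MB/s decompressor, 5:1 -> 500 MB/s
|     print(effective_read_speed(100, 500, 5))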
| kzrdude wrote:
| This is an older chart for ZSTD, but it is an inspiring
| visualisation - compression algorithm "times" transmission
| bandwidth => effective transmission speed.
|
| http://fastcompression.blogspot.com/2015/01/zstd-stronger-co...
|
| Taken from the fastcompression blog - where one could follow
| ZSTD's author since before ZSTD was even conceived.
|
| "Conveniently" enough the author of the blog has written both
| ZSTD and LZ4, which top the chart for their respective link
| speed domains. (2015 data - things have improved in both ZSTD
| and others since then.)
| flohofwoe wrote:
| This was already a common technique back in the 90's for
| loading asset data in games, and it was even more important then
| than it is today with fast SSDs. Instead of many small files,
| the asset data is loaded from a single big compressed archive
| file. This works around the terrible small-file performance
| (mainly in Windows file systems), and it speeds up loading
| because reading the compressed data and decompressing it is much
| faster than reading the decompressed data directly from the slow
| disk.
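| A minimal sketch of that pattern with the Python standard
| library (the asset paths are made up for illustration): pack
| many small files into one compressed archive, then pull
| individual entries out of it at load time.
|
|     import zipfile
|
|     # Build the archive once, at asset-bake time.
|     assets = ["sprites/player.png", "levels/level1.dat"]
|     with zipfile.ZipFile("assets.pak", "w",
|                          zipfile.ZIP_DEFLATED) as pak:
|         for path in assets:
|             pak.write(path)
|
|     # At load time: one archive open, per-asset decompression
|     # in memory instead of many small-file reads.
|     with zipfile.ZipFile("assets.pak") as pak:
|         level_data = pak.read("levels/level1.dat")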
| Someone wrote:
| A variation on this is also still extremely common in the
| case where the 'file system' is a web server and the program
| accessing it is a web browser
| (https://css-tricks.com/css-sprites/)
|
| In that case, there typically isn't additional explicit
| compression (1). The main gain is in decreasing the number of
| http requests.
|
| (1) the image itself may have inherent compression, and that
| may be improved by combining images with similar content, and
| the web server may be configured to use compression, but the
| first typically isn't a big win, and the second is
| independent of this strategy.
| magicalhippo wrote:
| Everything old is new again, so maybe you're just in time.
| Considering we now have NVMe drives which can read and write at
| 4-5 GB/s [1], that whole equation changes again...
|
| [1]: https://www.anandtech.com/bench/SSD21/3017
| 1_player wrote:
| zstd at compression level 1 can do ~2GB/s per core, and as
| time goes on and processors get more and more cores,
| compressing data by default is a valid proposition.
|
| In fact, if you install Fedora 35 on btrfs, zstd:1 is enabled
| by default, using fs-level heuristics to decide when and when
| not to compress, reducing write amplification on SSD drives
| and gaining some space for free with negligible performance
| impact, which is nice.
|
| My 8GB ~/src directory on encrypted btrfs on NVMe uses 6GB on
| disk and I can easily saturate the link while reading from
| it. Computers are plenty fast.
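| A rough way to check numbers like that on your own hardware,
| assuming the third-party python-zstandard bindings are installed
| (a sketch; throughput depends heavily on the data and the CPU):
|
|     import time
|     import zstandard
|
|     # Any reasonably large sample file will do.
|     data = open("/usr/share/dict/words", "rb").read()
|
|     for level in (1, 3, 9):
|         cctx = zstandard.ZstdCompressor(level=level)
|         t0 = time.perf_counter()
|         out = cctx.compress(data)
|         secs = time.perf_counter() - t0
|         print(f"level {level}: ratio {len(out) / len(data):.2f}, "
|               f"{len(data) / secs / 1e6:.0f} MB/s")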
| magicalhippo wrote:
| > zstd at compression level 1 can do ~2GB/s per core
|
| So unless you can multithread that workload, it's already
| behind by a factor of 2.
|
| > Computers are plenty fast.
|
| My point was you can no longer assume the disk is
| significantly slower, at least for streaming workloads. You
| can often still win by spending CPU cycles doing clever
| stuff, but it's not several orders of magnitude difference
| like it used to be.
| willis936 wrote:
| Most zstd implementations are multithreaded for single
| compression and decompression tasks. My home server has
| 24 cores and is using 16x PCIe 3.0 for storage (16 GB/s).
| Through benchmarking I found that I am storage bound from
| zstd-3+ and compute bound from zstd-2 and zstd-1. I run
| zstd-3.
| 1_player wrote:
| Dumb question, but shouldn't it be the opposite? The
| lower the compression level, the faster the CPU can
| process the data, making storage the bottleneck. Higher
| compression level -> CPU becomes the bottleneck.
| willis936 wrote:
| I believe in the TrueNAS interface I use these are
| negative compression levels. I could be wrong. These were
| just empirical observations. I may flesh out my matrix
| more to paint a better picture.
| [deleted]
| mdp2021 wrote:
| There still exist network connections at 80 Kb/s, still
| serving data repositories.
| bob1029 wrote:
| This idea is very powerful. Creating small compressed batches
| of transactions also lets you think of disk writes in terms of
| "transactions per block IO". When dealing with storage media
| that wears out at block granularity, this can save substantial
| device lifetime.
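| A sketch of that batching idea (the names and the 4 KiB block
| size are illustrative; zlib stands in for whatever codec the
| storage engine actually uses):
|
|     import json
|     import zlib
|
|     def pack_batch(transactions, block_size=4096):
|         # Serialize and compress a batch so that many
|         # transactions land in a single block-sized write.
|         payload = zlib.compress(json.dumps(transactions).encode())
|         assert len(payload) <= block_size, "batch too large"
|         # Pad to the block size: one batch == one block IO.
|         return payload.ljust(block_size, b"\0")
|
|     batch = [{"id": i, "op": "credit", "amount": 100}
|              for i in range(200)]
|     block = pack_batch(batch)
|     print(f"{len(batch)} transactions in one "
|           f"{len(block)}-byte block write")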
| absoflutely wrote:
| >Entropy, an Information Theory term coined by Claude Shannon in
| 1948, describes the minimum number of bits, on average, needed to
| encode a dataset.
|
| Shannon didn't coin the term entropy. He borrowed it from the
| analogous definition in thermodynamics.
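| For reference, the order-0 version of that definition is easy
| to compute over byte frequencies (a small sketch):
|
|     import math
|     from collections import Counter
|
|     def shannon_entropy(data: bytes) -> float:
|         # H = -sum(p * log2(p)) over observed byte frequencies:
|         # the average bits per symbol an ideal coder needs.
|         counts = Counter(data)
|         n = len(data)
|         return -sum((c / n) * math.log2(c / n)
|                     for c in counts.values())
|
|     print(f"{shannon_entropy(b'abracadabra'):.2f} bits per byte")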
| efficientsticks wrote:
| Could anyone explain why LZ77 is preferred by implementors over
| LZ78?
|
| It seems important for compressibility to prepare the data for
| maximum self-similarity, in addition to the choice of LZ
| algorithm (as evidenced by the sort in this article). Could
| someone point towards a good modern summary of the approaches
| or heuristics?
| pornel wrote:
| LZW (the common LZ78 variant) just doesn't compress that well.
| In LZW you can only use a dictionary built gradually from
| length-limited scraps of the previous data, but in LZ77 you can
| directly reference any of the previous data, of any length
| (within implementation limits).
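| A toy LZ77-style tokenizer makes that difference concrete (a
| greedy, quadratic sketch, nothing like a real implementation):
| at every position the encoder may emit a (distance, length)
| reference back into any already-seen data.
|
|     def lz77_tokens(data: bytes, window=32 * 1024, min_match=3):
|         i, out = 0, []
|         while i < len(data):
|             best_len, best_dist = 0, 0
|             # Search the whole window of prior data for the
|             # longest match (matches may overlap position i).
|             for j in range(max(0, i - window), i):
|                 n = 0
|                 while (i + n < len(data)
|                        and data[j + n] == data[i + n]):
|                     n += 1
|                 if n > best_len:
|                     best_len, best_dist = n, i - j
|             if best_len >= min_match:
|                 out.append(("match", best_dist, best_len))
|                 i += best_len
|             else:
|                 out.append(("literal", data[i]))
|                 i += 1
|         return out
|
|     # Three literals, then one long back-reference:
|     print(lz77_tokens(b"abcabcabcabc"))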
| commandlinefan wrote:
| I was going to answer "patents" because that's what I had read
| in the past (that LZ78 was encumbered by patents while LZ77 was
| not), but this Wikipedia talk page:
| https://en.wikipedia.org/wiki/Talk:LZ77_and_LZ78#preference_...
| suggests that it's not just patent encumbrances.
___________________________________________________________________
(page generated 2021-12-13 23:02 UTC)