[HN Gopher] Minimalist Guide to Lossless Compression (2019)
       ___________________________________________________________________
        
       Minimalist Guide to Lossless Compression (2019)
        
       Author : marklit
       Score  : 91 points
        Date   : 2021-12-12 12:44 UTC (1 day ago)
        
 (HTM) web link (tech.marksblogg.com)
 (TXT) w3m dump (tech.marksblogg.com)
        
       | tobijdc wrote:
        | Doesn't LZFSE stem from the Finite State Entropy library that
        | Yann Collet (LZ4, ZSTD) wrote as a basis for ZSTD (together with
        | HUFF0), with Apple deciding to use it before it was fully mature?
        | So shouldn't LZFSE be a predecessor of ZSTD in the tree?
        
       | optimalsolver wrote:
       | Lossless compression is apparently equivalent to general
       | intelligence:
       | 
       | http://mattmahoney.net/dc/rationale.html
        
         | pornel wrote:
         | Yes, but the compression process can be split into two
         | components:
         | 
         | 1. modelling (you convert data to symbols and predict
         | probabilities of seeing each symbol)
         | 
         | 2. entropy coding (convert symbols + probabilities to minimum
         | number of bits)
         | 
          | The second stage is relatively simple and basically solved. We
          | went from Huffman's approximation to arithmetic coding, which
          | is theoretically ideal, and now have ANS, which is close to
          | ideal and pretty fast too.
         | 
         | The modelling part is indeed a completely open problem, and at
         | this point as much art as science.
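          | 
          | To make the split concrete, here is a toy Python sketch: stage
          | 1 is a trivial adaptive order-0 byte model, and stage 2 is
          | idealised as the -log2(p) bit cost a perfect arithmetic/ANS
          | coder would approach. Purely illustrative:
          | 
          |     import math
          |     from collections import Counter
          | 
          |     def ideal_bits(data: bytes) -> float:
          |         # stage 1: adaptive order-0 model (counts)
          |         # stage 2: idealised coder, -log2(p) per symbol
          |         counts = Counter(range(256))  # 1 per symbol
          |         total, bits = 256, 0.0
          |         for b in data:
          |             bits -= math.log2(counts[b] / total)
          |             counts[b] += 1
          |             total += 1
          |         return bits
          | 
          |     sample = b"abracadabra" * 100
          |     print(ideal_bits(sample) / 8, "bytes, ideally")
          | 
          | A better model only changes the predicted probabilities; the
          | coder's half of the job stays the same, which is why the
          | modelling side is where the open problems are.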
        
           | Jasper_ wrote:
           | Most compression codecs these days have stopped using ANS
           | because the benefits aren't worth the tradeoffs. Turns out a
           | good arith coder is much faster.
        
             | hexxagone wrote:
             | "Turns out a good arith coder is much faster". Citation
             | needed.
        
         | selfhoster11 wrote:
         | FLAC isn't turning into Skynet that soon.
         | 
         | Besides, humans are lossy. We forget, make mistakes, think via
          | heuristics.
        
       | greypowerOz wrote:
       | >"Also, consider that if you can only read at 100 MB/s off a
       | mechanical drive but your CPU can decompress data at ~500 MB/s
       | then the mechanical drive is able to provide 5x the throughput
       | you'd otherwise expect thanks to compression."
       | 
        | I'd not really thought of that aspect before... My old brain is
        | hard-coded to save CPU cycles... Time to change my ways :)
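        | 
        | Back-of-the-envelope, the usable read rate is capped by
        | whichever is slower: the decompressor's output rate, or the
        | disk's raw rate multiplied by the compression ratio. A tiny
        | Python sketch with made-up numbers matching the quote (reading
        | the 500 MB/s as the decompressor's output rate):
        | 
        |     disk_mb_s = 100    # raw read speed off the drive
        |     decomp_mb_s = 500  # decompressed output per second
        |     ratio = 5.0        # uncompressed / compressed size
        | 
        |     usable = min(decomp_mb_s, disk_mb_s * ratio)
        |     print(usable, "MB/s of usable data")  # -> 500.0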
        
         | kzrdude wrote:
          | This is an older chart for ZSTD, but it is an inspiring
          | visualisation: compression ratio "times" transmission
          | bandwidth => effective transmission speed.
         | 
         | http://fastcompression.blogspot.com/2015/01/zstd-stronger-co...
         | 
         | Taken from the fastcompression blog - where one could follow
         | ZSTD's author since before ZSTD was even conceived.
         | 
         | "Conveniently" enough the author of the blog has written both
         | ZSTD and LZ4, which top the chart for their respective link
         | speed domains. (2015 data - things have improved in both ZSTD
         | and others since then.)
        
         | flohofwoe wrote:
          | This was already a common technique back in the 90's for
          | loading asset data in games, and it was even more important
          | back then than it is today with fast SSDs. Instead of many
          | small files, the asset data is loaded from a single big
          | compressed archive file. This works around the terrible
          | small-file performance (mainly of Windows file systems), and
          | it speeds up loading because reading the compressed data and
          | decompressing it is much faster than reading the decompressed
          | data directly from the slow disc.
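          | 
          | For example, a rough sketch of the pack-file idea using only
          | Python's stdlib (the assets/ directory and .pak.gz name are
          | made up; real engines use their own formats and faster codecs
          | like LZ4, but the principle is the same):
          | 
          |     import tarfile
          | 
          |     # pack: one big, sequential, compressed archive
          |     with tarfile.open("assets.pak.gz", "w:gz") as pak:
          |         pak.add("assets/", recursive=True)
          | 
          |     # load: one sequential read + decompress, no
          |     # per-file open/seek overhead
          |     with tarfile.open("assets.pak.gz", "r:gz") as pak:
          |         for member in pak.getmembers():
          |             if member.isfile():
          |                 data = pak.extractfile(member).read()
          |                 # hand `data` to the engine...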
        
           | Someone wrote:
            | A variation on this is also still extremely common in the
            | case where the 'file system' is a web server and the program
            | accessing it is a web browser (https://css-tricks.com/css-
            | sprites/).
            | 
            | In that case, there typically isn't additional explicit
            | compression (1). The main gain is in decreasing the number of
            | HTTP requests.
            | 
            | (1) The image itself may have inherent compression, which may
            | be improved by combining images with similar content, and the
            | web server may be configured to use compression, but the
            | first typically isn't a big win, and the second is
            | independent of this strategy.
        
         | magicalhippo wrote:
         | Everything old is new again, so maybe you're just in time.
          | Consider that we now have NVMe drives which can read and write
          | at 4-5 GB/s [1]; that whole equation changes again...
         | 
         | [1]: https://www.anandtech.com/bench/SSD21/3017
        
           | 1_player wrote:
           | zstd at compression level 1 can do ~2GB/s per core, and as
           | time goes on and processors get more and more cores,
           | compressing data by default is a valid proposition.
           | 
           | In fact, if you install Fedora 35 on btrfs, zstd:1 is enabled
           | by default, using fs-level heuristics to decide when and when
           | not to compress, reducing write amplification on SSD drives
           | and gaining some space for free with negligible performance
           | impact, which is nice.
           | 
           | My 8GB ~/src directory on encrypted btrfs on NVMe uses 6GB on
           | disk and I can easily saturate the link while reading from
           | it. Computers are plenty fast.
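            | 
            | Easy to sanity-check with the (third-party) python-zstandard
            | bindings; the file name below is made up and the exact
            | figure depends on the data and the CPU:
            | 
            |     import time
            |     import zstandard  # pip install zstandard
            | 
            |     data = open("big_input.bin", "rb").read()
            |     cctx = zstandard.ZstdCompressor(level=1)
            | 
            |     t0 = time.perf_counter()
            |     out = cctx.compress(data)
            |     dt = time.perf_counter() - t0
            | 
            |     print(f"{len(data) / dt / 1e9:.2f} GB/s/core,"
            |           f" ratio {len(data) / len(out):.2f}")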
        
             | magicalhippo wrote:
             | > zstd at compression level 1 can do ~2GB/s per core
             | 
             | So unless you can multithread that workload, it's already
             | behind by a factor of 2.
             | 
             | > Computers are plenty fast.
             | 
             | My point was you can no longer assume the disk is
             | significantly slower, at least for streaming workloads. You
             | can often still win by spending CPU cycles doing clever
             | stuff, but it's not several orders of magnitude difference
             | like it used to be.
        
               | willis936 wrote:
               | Most zstd implementations are multithreaded for single
               | compression and decompression tasks. My home server has
               | 24 cores and is using 16x PCIe 3.0 for storage (16 GB/s).
               | Through benchmarking I found that I am storage bound from
               | zstd-3+ and compute bound from zstd-2 and zstd-1. I run
               | zstd-3.
        
               | 1_player wrote:
               | Dumb question, but shouldn't it be the opposite? The
               | lower the compression level, the faster the CPU can
               | process the data, making storage the bottleneck. Higher
               | compression level -> CPU becomes the bottleneck.
        
               | willis936 wrote:
               | I believe in the TrueNAS interface I use these are
               | negative compression levels. I could be wrong. It was
               | just some empirical observations. I may flesh out my
               | matrix more to paint a better picture.
        
           | [deleted]
        
           | mdp2021 wrote:
            | There still exist network connections at 80 Kb/s, and they
            | still serve data repositories.
        
         | bob1029 wrote:
          | This idea is very powerful. Creating small compressed batches
          | of transactions also lets you think about disk writes in terms
          | of "transactions per block IO". When dealing with storage media
          | that wears out at block granularity, this can save substantial
          | device lifetime.
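          | 
          | A minimal sketch of the batching idea, with zlib standing in
          | for whatever codec you'd actually use and 4 KiB standing in
          | for the device's block size (both are just assumptions):
          | 
          |     import zlib
          | 
          |     BLOCK = 4096  # assumed device block size
          | 
          |     def batches(txns):
          |         # naive: recompress the pending batch each time
          |         pend, out = [], []
          |         for t in txns:
          |             trial = zlib.compress(b"\n".join(pend + [t]))
          |             if len(trial) > BLOCK and pend:
          |                 out.append(zlib.compress(b"\n".join(pend)))
          |                 pend = [t]
          |             else:
          |                 pend.append(t)
          |         if pend:
          |             out.append(zlib.compress(b"\n".join(pend)))
          |         return out  # one entry == one block-sized write
          | 
          |     txns = [b'{"id": %d}' % i for i in range(1000)]
          |     print(len(txns), "txns ->", len(batches(txns)), "writes")
          | 
          | The point is that each flush is one block-sized write carrying
          | many transactions, rather than a (partial) block write per
          | transaction.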
        
       | absoflutely wrote:
       | >Entropy, an Information Theory term coined by Claude Shannon in
       | 1948, describes the minimum number of bits, on average, needed to
       | encode a dataset.
       | 
       | Shannon didn't coin the term entropy. He borrowed it from the
       | analogous definition in thermodynamics.
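        | 
        | For the order-0 case (bytes treated as independent), that
        | minimum is H = -sum(p * log2 p) bits per symbol, which is easy
        | to compute and compare against a real codec. A Python sketch;
        | the dictionary file is just a handy example input and may not
        | exist on every system:
        | 
        |     import math, zlib
        |     from collections import Counter
        | 
        |     def entropy_bits_per_byte(data: bytes) -> float:
        |         n = len(data)
        |         return -sum(c / n * math.log2(c / n)
        |                     for c in Counter(data).values())
        | 
        |     data = open("/usr/share/dict/words", "rb").read()
        |     h = entropy_bits_per_byte(data)
        |     comp = 8 * len(zlib.compress(data, 9)) / len(data)
        |     print(f"order-0 entropy: {h:.2f} bits/byte")
        |     print(f"zlib level 9:    {comp:.2f} bits/byte")
        | 
        | zlib can land below the order-0 figure because its LZ77 stage
        | exploits repeated strings that a pure byte-frequency model
        | can't see.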
        
       | efficientsticks wrote:
        | Could anyone explain why LZ77 is preferred by implementors over
        | LZ78?
        | 
        | It also seems important for compressibility to prepare the data
        | for maximum self-similarity before applying an LZ algorithm (as
        | evidenced by the sort in this article). Could someone point
        | towards a good modern summary of the approaches or heuristics?
        
         | pornel wrote:
          | LZW (the best-known LZ78 variant) just doesn't compress that
          | well. In LZW you can only use a dictionary built gradually
          | from length-limited scraps of the previous data, but in LZ77
          | you can directly reference any of the previous data of any
          | length (within implementation limits).
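          | 
          | Roughly, the LZ77 side looks like this. A naive Python sketch
          | with a quadratic match search, just to show the idea; real
          | encoders use hash chains or suffix structures, and the window
          | and length limits vary by format:
          | 
          |     def lz77(data, window=32768, min_len=4, max_len=258):
          |         i, out = 0, []
          |         while i < len(data):
          |             best_len, best_dist = 0, 0
          |             for j in range(max(0, i - window), i):
          |                 k = 0
          |                 while (i + k < len(data) and k < max_len
          |                        and data[j + k] == data[i + k]):
          |                     k += 1
          |                 if k > best_len:
          |                     best_len, best_dist = k, i - j
          |             if best_len >= min_len:
          |                 # back-reference into the window
          |                 out.append((best_dist, best_len))
          |                 i += best_len
          |             else:
          |                 out.append(data[i])  # literal
          |                 i += 1
          |         return out
          | 
          |     print(lz77(b"abcabcabcabcxyz"))
          |     # [97, 98, 99, (3, 9), 120, 121, 122]
          | 
          | Note the (3, 9) token: the match overlaps the data it copies,
          | which an LZW-style dictionary can't express until the phrase
          | has been added entry by entry.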
        
         | commandlinefan wrote:
         | I was going to answer "patents" because that's what I had read
         | in the past (that LZ '78 was encumbered by patents while LZ '77
         | was not), but this wikipedia talk page:
         | https://en.wikipedia.org/wiki/Talk:LZ77_and_LZ78#preference_...
         | suggests that it's not just patent encumbrances.
        
       ___________________________________________________________________
       (page generated 2021-12-13 23:02 UTC)