[HN Gopher] Zstandard Worked Example
___________________________________________________________________
Zstandard Worked Example
Author : stargrave
Score : 54 points
Date : 2022-05-17 15:00 UTC (1 day ago)
(HTM) web link (nigeltao.github.io)
(TXT) w3m dump (nigeltao.github.io)
| dietrichepp wrote:
| Nice writeup! FSE, one of the parts of Zstandard, is a very
| elegant entropy coder. If you're following this section of the
| article, you might be wondering how it's encoded, since the
| encoding process is somewhat less obvious than for Huffman
| coding. My intuition tells me that there's probably a
| homomorphism from Huffman coding to FSE that preserves the
| encoded size exactly, but I haven't done the math to check.
|
| What causes trouble is simply: "Once you've emitted a symbol, how
| do you know that the next state lands in that entry's
| [BL, BL+2^NB) range?" The answer is that the encoding is
| performed backwards, starting from the end of the input, and the
| table is constructed so that for any state you need to reach, you
| can always find a _previous_ state whose [BL, BL+2^NB) range
| contains the _current_ one.
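|
| If it helps to see the backwards trick in code, here is a tiny
| toy tANS round trip in Python. It is only a sketch of the state
| mechanics: zstd's real FSE uses a different symbol spread, count
| normalization and bit packing, so the table layout below does not
| match the spec.
|
|     # Toy tANS: build tables, encode backwards, decode forwards.
|     R = 5                    # accuracy log; the table has L = 2**R slots
|     L = 1 << R
|
|     def build_tables(counts):
|         """counts: symbol -> slot count; the counts must sum to L."""
|         assert sum(counts.values()) == L
|         # Naive spread: each symbol gets a contiguous run of slots.
|         spread = [s for s, c in counts.items() for _ in range(c)]
|         decode = [None] * L          # slot -> (symbol, nb, baseline)
|         enc_slot = {}                # (symbol, x) -> slot, for the encoder
|         next_x = dict(counts)
|         for slot, s in enumerate(spread):
|             x = next_x[s]            # x walks through [count, 2*count)
|             next_x[s] += 1
|             nb = R - (x.bit_length() - 1)    # bits to get back into [L, 2L)
|             decode[slot] = (s, nb, (x << nb) - L)
|             enc_slot[(s, x)] = slot
|         return decode, enc_slot
|
|     def encode(msg, counts, enc_slot):
|         # Encode BACKWARDS: the last symbol is handled first, so the
|         # decoder, which runs forwards, pops everything in order.
|         state, bits = L, []          # any start state in [L, 2L) works
|         for s in reversed(msg):
|             c, nb = counts[s], 0
|             while (state >> nb) >= 2 * c:    # shrink state into [c, 2c)
|                 nb += 1
|             for i in range(nb):              # push the nb low bits
|                 bits.append((state >> i) & 1)
|             state = L + enc_slot[(s, state >> nb)]
|         return state, bits
|
|     def decode(state, bits, n, table):
|         out = []
|         for _ in range(n):
|             s, nb, baseline = table[state - L]
|             out.append(s)
|             val = 0
|             for _ in range(nb):              # pop nb bits, MSB first
|                 val = (val << 1) | bits.pop()
|             state = L + baseline + val       # land in the next entry's range
|         return out
|
|     counts = {'a': 20, 'b': 9, 'c': 3}       # must sum to L = 32
|     table, enc_slot = build_tables(counts)
|     msg = list('abacabaabbacaba')
|     state, bits = encode(msg, counts, enc_slot)
|     assert decode(state, bits, len(msg), table) == msg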
|
| There's another writeup about FSE here, which goes into more
| depth about FSE (rather than talking about Zstandard in general):
|
| http://fastcompression.blogspot.com/2013/12/finite-state-ent...
| keyle wrote:
| Interesting, I'd never heard of it before. I was aware that there
| were better standards than zip, but adoption was always an issue
| because of licensing. If this one is just as open, is it purely a
| matter of wide adoption?
|
| Are we stuck in the C++ paradigm (I'm sure there's a better name
| for it), where everyone agrees this isn't "ideal" and that
| something better exists, but nothing else is widely adopted
| enough?
| learndeeply wrote:
| Your questions are answered in the first sentence of the
| article.
|
| > Zstd or Zstandard (RFC 8478, first released in 2015) is a
| popular modern compression algorithm. It's smaller (better
| compression ratio) and faster than the ubiquitous Zlib/Deflate
| (RFC 1950 and RFC 1951, first released in 1995).
| c0balt wrote:
| The article already mentions zstd, but for an IMO simple and
| accurate overview, take a look at the recommendations at
| https://jolynch.github.io/posts/use_fast_data_algorithms/ .
| wolf550e wrote:
| Unless you have to use deflate/zlib/gzip/zip for compatibility,
| or you're very constrained in the amount of memory available, you
| should always prefer zstd over deflate - it's both faster and it
| compresses better, so there is no tradeoff, and using deflate is
| just wasteful.
|
| Google's brotli is a close competitor and got supported in
| Chrome first, so HTTP uses that more than it uses zstd.
|
| zstd has a very wide range of compression levels for trading
| compression ratio against CPU time, much wider than the spread
| between gzip -1, gzip -6 (the default) and gzip -9. zstd's
| default is -3; zstd -1 is very fast, and zstd -19 compresses
| really well.
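|
| As a rough way to see that level range for yourself, here's a
| sketch using the third-party python-zstandard bindings ("pip
| install zstandard"; not in the stdlib) alongside zlib. The
| synthetic data is only illustrative - the numbers depend entirely
| on your data and machine.
|
|     import random, time, zlib
|     import zstandard   # third-party bindings for libzstd
|
|     # Mildly compressible stand-in data; use a real file for a fair test.
|     random.seed(0)
|     data = bytes(random.choices(range(16), k=1_000_000))
|
|     def report(name, compress):
|         t0 = time.perf_counter()
|         out = compress(data)
|         ms = (time.perf_counter() - t0) * 1000
|         print(f"{name:10s} {len(out):8d} bytes {ms:8.1f} ms")
|
|     for lvl in (1, 6, 9):
|         report(f"gzip -{lvl}", lambda d, l=lvl: zlib.compress(d, l))
|     for lvl in (1, 3, 19):
|         report(f"zstd -{lvl}",
|                lambda d, l=lvl: zstandard.ZstdCompressor(level=l).compress(d))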
|
| zstd also has a long-range mode, for archives that contain the
| same file a gigabyte apart. gzip's window is only 32 KB, so it
| can't notice any repetition further away than that.
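|
| The window-size point is easy to demonstrate (same assumption of
| the third-party python-zstandard bindings): two identical 100 KB
| blocks, so the only redundancy is a repeat about 100 KB back -
| beyond deflate's 32 KB window, but well inside zstd's default
| window at level 3.
|
|     import os, zlib
|     import zstandard   # third-party bindings for libzstd
|
|     block = os.urandom(100 * 1024)     # incompressible on its own
|     data = block + block               # repeat distance ~100 KB
|
|     # deflate's 32 KB window can't reach back to the first copy;
|     # zstd's much larger window can.
|     print("raw:    ", len(data))
|     print("zlib -9:", len(zlib.compress(data, 9)))
|     print("zstd -3:", len(zstandard.ZstdCompressor(level=3).compress(data)))
|     # For repeats that are hundreds of MB apart you'd also want zstd's
|     # long-range mode (the CLI's --long flag).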
|
| LZ4, by the same author as zstd, aims at even faster compression
| at the cost of a worse compression ratio. It makes sense to use
| over a fast network. Google's Snappy and LZO are the usual
| competitors.
|
| When you want better compression than zstd (at the cost of CPU
| time, both during compression and during decompression), use
| PPMd for natural-language text or source code and LZMA (xz) for
| everything else.
| londons_explore wrote:
| This post talks about literals still being encoded with Huffman
| tables and stuff.
|
| However, it appears this must be optional, because short strings
| compressed with zstd pass through 'unencoded', apart from an
| added header and footer:
|
|     $ echo "hello world" | zstd | hexdump -C
|     00000000  28 b5 2f fd 04 58 61 00  00 68 65 6c 6c 6f 20 77  |(./..Xa..hello w|
|     00000010  6f 72 6c 64 0a 8c 6d 7d  20                       |orld..m} |
|
| As an aside, I'm fairly disappointed that zstd didn't use a
| single-byte header for incompressible strings, so that it could
| guarantee that the compressed data will never be more than one
| byte larger than the source data. That property is very useful
| when lots of tiny strings are being compressed, such as in a
| database.
| wolf550e wrote:
| This is an uncompressed block.
|
| For database records or log lines, you should use a dictionary
| trained on representative data. Then even short records compress
| well.
|
| See what Facebook does with zstd (they employ its author):
| https://engineering.fb.com/2018/12/19/core-data/zstandard/
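|
| A sketch of that workflow with the third-party python-zstandard
| bindings ("pip install zstandard"); the record format and sizes
| here are made up, and a real dictionary should be trained on
| representative production records:
|
|     import json
|     import zstandard   # third-party bindings for libzstd
|
|     # Stand-in for "lots of tiny, similar records" (DB rows, log lines).
|     def record(i):
|         return json.dumps({"id": i, "user": f"user{i % 50}",
|                            "level": "INFO", "msg": "login ok",
|                            "ip": f"10.0.0.{i % 256}"}).encode()
|
|     samples = [record(i) for i in range(2000)]
|
|     # Train a small shared dictionary, then compress each record
|     # against it. Without a dictionary, records this short barely
|     # shrink at all once the frame overhead is paid.
|     d = zstandard.train_dictionary(4096, samples)
|     cctx = zstandard.ZstdCompressor(level=3, dict_data=d)
|     dctx = zstandard.ZstdDecompressor(dict_data=d)
|
|     r = record(12345)
|     blob = cctx.compress(r)
|     print(len(r), "->", len(blob))
|     assert dctx.decompress(blob) == r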
| pcwalton wrote:
| The first 4 bytes are the magic number and the last 4 bytes are
| the checksum [1], both of which you could just chop off if you
| wanted (it's legal to omit the checksum; see the spec). That
| would get the total overhead down to 5 bytes.
|
| [1]:
| https://github.com/facebook/zstd/blob/dev/doc/zstd_compressi...
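|
| To make the overhead concrete, here's the 25-byte frame from the
| hexdump upthread pulled apart field by field in Python, following
| the frame layout in RFC 8478. It's only a sketch for this one
| frame, not a general parser.
|
|     # The exact bytes produced by `echo "hello world" | zstd` above.
|     frame = bytes.fromhex(
|         "28 b5 2f fd"                          # magic number
|         "04"                                   # frame header descriptor
|         "58"                                   # window descriptor
|         "61 00 00"                             # block header
|         "68 65 6c 6c 6f 20 77 6f 72 6c 64 0a"  # "hello world\n", stored raw
|         "8c 6d 7d 20")                         # xxhash64 checksum (low 32 bits)
|
|     assert int.from_bytes(frame[0:4], "little") == 0xFD2FB528  # zstd magic
|
|     fhd = frame[4]                     # Frame_Header_Descriptor
|     has_checksum = bool(fhd & 0x04)    # True here, hence the last 4 bytes
|     single_segment = bool(fhd & 0x20)  # False, so a Window_Descriptor follows
|     dict_id_flag = fhd & 0x03          # 0: no dictionary ID field
|
|     wd = frame[5]                      # Window_Descriptor
|     window_log = 10 + (wd >> 3)        # exponent in the high 5 bits
|     window = (1 << window_log) + (wd & 7) * (1 << (window_log - 3))
|     print("window:", window)
|
|     bh = int.from_bytes(frame[6:9], "little")  # 3-byte Block_Header
|     last_block = bh & 1                # 1: this is the only block
|     block_type = (bh >> 1) & 3         # 0: Raw_Block, i.e. stored uncompressed
|     block_size = bh >> 3               # 12 bytes of payload
|
|     payload = frame[9:9 + block_size]
|     print(payload)                     # b'hello world\n'
|     # 13 bytes here; 9 without the checksum, 5 if the magic goes too.
|     print("overhead:", len(frame) - block_size)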
___________________________________________________________________
(page generated 2022-05-18 23:00 UTC)