[HN Gopher] Zstandard Worked Example
___________________________________________________________________
Zstandard Worked Example
Author : stargrave
Score : 54 points
Date : 2022-05-17 15:00 UTC (1 day ago)
(HTM) web link (nigeltao.github.io)
(TXT) w3m dump (nigeltao.github.io)
| dietrichepp wrote:
| Nice writeup! FSE, one of the parts of Zstandard, is a very
| elegant entropy coder. If you're following this section of the
| article, you might be wondering how it's encoded, since the
| encoding process is somewhat less obvious than for Huffman
| coding. My intuition tells me that there's probably a
| homomorphism from Huffman coding to FSE that preserves the
| encoded size exactly, but I haven't done the math to check.
|
| What causes trouble is simply: "Once you've emitted a symbol, how
| do you know that the next state lands in that entry's
| [BL, BL+2^NB) range?" The answer is that the encoding is
| performed backwards, starting from the end of the input, and the
| table is constructed so that for any state you need to reach, you
| can always find a _previous_ state whose [BL, BL+2^NB) range
| contains the _current_ one.
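|
| If it helps to see the backwards trick in code, here is a tiny
| toy tANS round trip in Python. It is only a sketch of the state
| mechanics: zstd's real FSE uses a different symbol spread, count
| normalization and bit packing, so the table layout below does not
| match the spec.
|
|     # Toy tANS: build tables, encode backwards, decode forwards.
|     R = 5                    # accuracy log; the table has L = 2**R slots
|     L = 1 << R
|
|     def build_tables(counts):
|         """counts: symbol -> slot count; the counts must sum to L."""
|         assert sum(counts.values()) == L
|         # Naive spread: each symbol gets a contiguous run of slots.
|         spread = [s for s, c in counts.items() for _ in range(c)]
|         decode = [None] * L          # slot -> (symbol, nb, baseline)
|         enc_slot = {}                # (symbol, x) -> slot, for the encoder
|         next_x = dict(counts)
|         for slot, s in enumerate(spread):
|             x = next_x[s]            # x walks through [count, 2*count)
|             next_x[s] += 1
|             nb = R - (x.bit_length() - 1)    # bits to get back into [L, 2L)
|             decode[slot] = (s, nb, (x << nb) - L)
|             enc_slot[(s, x)] = slot
|         return decode, enc_slot
|
|     def encode(msg, counts, enc_slot):
|         # Encode BACKWARDS: the last symbol is handled first, so the
|         # decoder, which runs forwards, pops everything in order.
|         state, bits = L, []          # any start state in [L, 2L) works
|         for s in reversed(msg):
|             c, nb = counts[s], 0
|             while (state >> nb) >= 2 * c:    # shrink state into [c, 2c)
|                 nb += 1
|             for i in range(nb):              # push the nb low bits
|                 bits.append((state >> i) & 1)
|             state = L + enc_slot[(s, state >> nb)]
|         return state, bits
|
|     def decode(state, bits, n, table):
|         out = []
|         for _ in range(n):
|             s, nb, baseline = table[state - L]
|             out.append(s)
|             val = 0
|             for _ in range(nb):              # pop nb bits, MSB first
|                 val = (val << 1) | bits.pop()
|             state = L + baseline + val       # land in the next entry's range
|         return out
|
|     counts = {'a': 20, 'b': 9, 'c': 3}       # must sum to L = 32
|     table, enc_slot = build_tables(counts)
|     msg = list('abacabaabbacaba')
|     state, bits = encode(msg, counts, enc_slot)
|     assert decode(state, bits, len(msg), table) == msg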
|
| There's another writeup about FSE here, which goes into more
| depth about FSE (rather than talking about Zstandard in general):
|
| http://fastcompression.blogspot.com/2013/12/finite-state-ent...
| keyle wrote:
| Interesting, I'd never heard of it before. I was aware that there
| were better standards than zip, but adoption was always an issue
| because of licensing. If this one is just as open, is it purely a
| matter of wide adoption?
|
| Are we stuck in the C++ paradigm (I'm sure there's a better name
| for it), where everyone agrees this isn't "ideal" and that
| something better exists, but nothing else is widely adopted
| enough?
| learndeeply wrote:
| Your questions are answered in the first sentence of the
| article.
|
| > Zstd or Zstandard (RFC 8478, first released in 2015) is a
| popular modern compression algorithm. It's smaller (better
| compression ratio) and faster than the ubiquitous Zlib/Deflate
| (RFC 1950 and RFC 1951, first released in 1995).
| c0balt wrote:
| The article already mentions zstd, but for an IMO simple and
| accurate overview, take a look at the recommendations at
| https://jolynch.github.io/posts/use_fast_data_algorithms/ .
| wolf550e wrote:
| Unless you have to use deflate/zlib/gzip/zip for compatibility,
| or you're very constrained in the amount of memory available, you
| should always prefer zstd over deflate - it's both faster and it
| compresses better, so there is no tradeoff, and using deflate is
| just wasteful.
|
| Google's brotli is a close competitor and got supported in
| Chrome first, so HTTP uses that more than it uses zstd.
|
| zstd has a very wide range of compression levels for trading
| compression ratio against CPU time, much wider than the spread
| between gzip -1, gzip -6 (the default) and gzip -9. zstd's
| default is -3; zstd -1 is very fast, and zstd -19 compresses
| really well.
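|
| As a rough way to see that level range for yourself, here's a
| sketch using the third-party python-zstandard bindings ("pip
| install zstandard"; not in the stdlib) alongside zlib. The
| synthetic data is only illustrative - the numbers depend entirely
| on your data and machine.
|
|     import random, time, zlib
|     import zstandard   # third-party bindings for libzstd
|
|     # Mildly compressible stand-in data; use a real file for a fair test.
|     random.seed(0)
|     data = bytes(random.choices(range(16), k=1_000_000))
|
|     def report(name, compress):
|         t0 = time.perf_counter()
|         out = compress(data)
|         ms = (time.perf_counter() - t0) * 1000
|         print(f"{name:10s} {len(out):8d} bytes {ms:8.1f} ms")
|
|     for lvl in (1, 6, 9):
|         report(f"gzip -{lvl}", lambda d, l=lvl: zlib.compress(d, l))
|     for lvl in (1, 3, 19):
|         report(f"zstd -{lvl}",
|                lambda d, l=lvl: zstandard.ZstdCompressor(level=l).compress(d))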
|
| zstd also has a long-range mode, for archives that contain the
| same file a gigabyte apart. gzip's window is only 32 KB, so it
| can't notice any repetition further away than that.
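|
| The window-size point is easy to demonstrate (same assumption of
| the third-party python-zstandard bindings): two identical 100 KB
| blocks, so the only redundancy is a repeat about 100 KB back -
| beyond deflate's 32 KB window, but well inside zstd's default
| window at level 3.
|
|     import os, zlib
|     import zstandard   # third-party bindings for libzstd
|
|     block = os.urandom(100 * 1024)     # incompressible on its own
|     data = block + block               # repeat distance ~100 KB
|
|     # deflate's 32 KB window can't reach back to the first copy;
|     # zstd's much larger window can.
|     print("raw:    ", len(data))
|     print("zlib -9:", len(zlib.compress(data, 9)))
|     print("zstd -3:", len(zstandard.ZstdCompressor(level=3).compress(data)))
|     # For repeats that are hundreds of MB apart you'd also want zstd's
|     # long-range mode (the CLI's --long flag).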
|
| LZ4, by the same author as zstd, aims at even faster compression
| at the cost of a worse compression ratio. It makes sense to use
| over a fast network. Google's Snappy and LZO are the usual
| competitors.
|
| When you want better compression than zstd (at the cost of CPU
| time, both during compression and during decompression), use
| PPMd for natural-language text or source code and LZMA (xz) for
| everything else.
| londons_explore wrote:
| This post talks about literals still being encoded with Huffman
| tables and stuff.
|
| However, it appears this must be optional, because short strings
| compressed with zstd pass through 'unencoded', apart from an
| added header and footer:
|
|     $ echo "hello world" | zstd | hexdump -C
|     00000000  28 b5 2f fd 04 58 61 00  00 68 65 6c 6c 6f 20 77  |(./..Xa..hello w|
|     00000010  6f 72 6c 64 0a 8c 6d 7d  20                       |orld..m} |
|
| As an aside, I'm fairly disappointed that zstd didn't use a
| single-byte header for incompressible strings, so that it could
| guarantee that the compressed data will never be more than one
| byte larger than the source data. That property is very useful
| when lots of tiny strings are being compressed, such as in a
| database.
| wolf550e wrote:
| This is an uncompressed block.
|
| For database records or log lines, you should use a dictionary
| trained on representative data. Then even short records compress
| well.
|
| See what Facebook does with zstd (they employ its author):
| https://engineering.fb.com/2018/12/19/core-data/zstandard/
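|
| A sketch of that workflow with the third-party python-zstandard
| bindings ("pip install zstandard"); the record format and sizes
| here are made up, and a real dictionary should be trained on
| representative production records:
|
|     import json
|     import zstandard   # third-party bindings for libzstd
|
|     # Stand-in for "lots of tiny, similar records" (DB rows, log lines).
|     def record(i):
|         return json.dumps({"id": i, "user": f"user{i % 50}",
|                            "level": "INFO", "msg": "login ok",
|                            "ip": f"10.0.0.{i % 256}"}).encode()
|
|     samples = [record(i) for i in range(2000)]
|
|     # Train a small shared dictionary, then compress each record
|     # against it. Without a dictionary, records this short barely
|     # shrink at all once the frame overhead is paid.
|     d = zstandard.train_dictionary(4096, samples)
|     cctx = zstandard.ZstdCompressor(level=3, dict_data=d)
|     dctx = zstandard.ZstdDecompressor(dict_data=d)
|
|     r = record(12345)
|     blob = cctx.compress(r)
|     print(len(r), "->", len(blob))
|     assert dctx.decompress(blob) == r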
| pcwalton wrote:
| The first 4 bytes are the magic number and the last 4 bytes are
| the checksum [1], both of which you could just chop off if you
| wanted (it's legal to omit the checksum; see the spec). That
| would get the total overhead down to 5 bytes.
|
| [1]:
| https://github.com/facebook/zstd/blob/dev/doc/zstd_compressi...
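|
| To make the overhead concrete, here's the 25-byte frame from the
| hexdump upthread pulled apart field by field in Python, following
| the frame layout in RFC 8478. It's only a sketch for this one
| frame, not a general parser.
|
|     # The exact bytes produced by `echo "hello world" | zstd` above.
|     frame = bytes.fromhex(
|         "28 b5 2f fd"                          # magic number
|         "04"                                   # frame header descriptor
|         "58"                                   # window descriptor
|         "61 00 00"                             # block header
|         "68 65 6c 6c 6f 20 77 6f 72 6c 64 0a"  # "hello world\n", stored raw
|         "8c 6d 7d 20")                         # xxhash64 checksum (low 32 bits)
|
|     assert int.from_bytes(frame[0:4], "little") == 0xFD2FB528  # zstd magic
|
|     fhd = frame[4]                     # Frame_Header_Descriptor
|     has_checksum = bool(fhd & 0x04)    # True here, hence the last 4 bytes
|     single_segment = bool(fhd & 0x20)  # False, so a Window_Descriptor follows
|     dict_id_flag = fhd & 0x03          # 0: no dictionary ID field
|
|     wd = frame[5]                      # Window_Descriptor
|     window_log = 10 + (wd >> 3)        # exponent in the high 5 bits
|     window = (1 << window_log) + (wd & 7) * (1 << (window_log - 3))
|     print("window:", window)
|
|     bh = int.from_bytes(frame[6:9], "little")  # 3-byte Block_Header
|     last_block = bh & 1                # 1: this is the only block
|     block_type = (bh >> 1) & 3         # 0: Raw_Block, i.e. stored uncompressed
|     block_size = bh >> 3               # 12 bytes of payload
|
|     payload = frame[9:9 + block_size]
|     print(payload)                     # b'hello world\n'
|     # 13 bytes here; 9 without the checksum, 5 if the magic goes too.
|     print("overhead:", len(frame) - block_size)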
___________________________________________________________________
(page generated 2022-05-18 23:00 UTC)