[HN Gopher] Chrome feature: Compression dictionary transport wit...
___________________________________________________________________
Chrome feature: Compression dictionary transport with Shared Brotli
Author : judiisis
Score : 57 points
Date : 2023-09-22 08:57 UTC (14 hours ago)
(HTM) web link (chromestatus.com)
(TXT) w3m dump (chromestatus.com)
| jacobgorm wrote:
| I never really understood why Google keeps pushing Brotli. LZ4
| and ZStd both predate it by a couple of years, and seem to offer
| superior performance overall. Is it a Google-NIH thing?
| lifthrasiir wrote:
| They were designed more or less independently from each other,
| and your statement that Zstandard predates Brotli is also wrong
| (both around 2015). They are both designed to be much more
| efficient than DEFLATE in both time and size axes, but
| otherwise have little in common. Zstandard is notable for its
| use of ANS, while Brotli is probably the first mainstream
| format that prominently uses a 2nd-order context model, which
| was previously thought to be too slow. And Brotli was designed
| with web content in mind, which makes it excellent for small
| inputs with no prior. Zstandard has no such treatment but has
| better performance when I/O can handle it, which makes it
| excellent for server jobs.
| miohtama wrote:
| The performance depends on the data. I have found that for
| English text and source code Brotli outperforms Zstd. Not
| sure if this is because of the static dictionary or some other
| tweaks, though. However none of this is static, as both
| technologies are actively developed.
| lifthrasiir wrote:
| Better _decoding time_ performance, I should have said. The
| use of a 2nd-order context model indeed makes Brotli more
| efficient in compression ratio for many kinds of data.
| pmarreck wrote:
| Yes, but hasn't "separate compression dictionary" been a core
| feature of Zstd since its inception, at least under that
| name?
| lifthrasiir wrote:
| Not even close; zlib had one. (Search for "FDICT" in RFC
| 1950.) What Zstandard added was a CLI to automatically
| generate a good-enough dictionary from sample inputs.
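|
| (For reference, Python's standard zlib module exposes that
| preset-dictionary mechanism via a zdict argument; a minimal
| sketch, with made-up example data:
|
|     import zlib
|
|     # a preset dictionary: bytes likely to recur in the payload
|     preset = b'{"user_id": , "name": "", "email": ""}'
|     data = b'{"user_id": 42, "name": "Ada", "email": "ada@example.org"}'
|
|     co = zlib.compressobj(level=9, zdict=preset)
|     with_dict = co.compress(data) + co.flush()
|
|     do = zlib.decompressobj(zdict=preset)
|     assert do.decompress(with_dict) == data
|
|     print(len(with_dict), "bytes with dict vs",
|           len(zlib.compress(data, 9)), "without")
|
| The dictionary bytes never appear in the output, so both sides
| must already have them; that is exactly the FDICT contract.)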
| pmarreck wrote:
|     FDICT (Preset dictionary)
|        If FDICT is set, a DICT dictionary identifier is present
|        immediately after the FLG byte. The dictionary is a
|        sequence of bytes which are initially fed to the
|        compressor without producing any compressed output. DICT
|        is the Adler-32 checksum of this sequence of bytes (see
|        the definition of ADLER32 below). The decompressor can
|        use this identifier to determine which dictionary has
|        been used by the compressor.
|
| Well, wow. I have to wonder why this wasn't more
| utilized, then. There are a ton of contexts (columnar
| data in databases, for example) where shared-dictionary-
| based compression might have helped a ton before now.
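|
| For a rough feel of the effect on lots of small, similar values,
| here is a sketch using the third-party python-zstandard bindings
| (the sample rows are made up):
|
|     import json
|     import zstandard as zstd
|
|     # many small, similar records, like one column of a table
|     rows = [json.dumps({"id": i, "country": ("DE", "FR", "US")[i % 3],
|                         "status": "active"}).encode()
|             for i in range(1000)]
|
|     # train a shared dictionary from samples, then compress each
|     # row with and without it
|     d = zstd.train_dictionary(4096, rows)
|     with_dict = zstd.ZstdCompressor(dict_data=d)
|     plain = zstd.ZstdCompressor()
|
|     print(sum(len(with_dict.compress(r)) for r in rows),
|           "bytes with shared dict vs",
|           sum(len(plain.compress(r)) for r in rows), "without")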
| jacobgorm wrote:
| ZStd started out as ZHuff, which was first announced in 2008;
| see https://encode.su/threads/521-Zhuff-fast-compression and
| http://fastcompression.blogspot.com/p/zhuff.html
|
| I've been using LZ4 in commercial software since 2011, so
| unless I am some kind of luminary, Google's compression
| experts must have been aware of both LZ4 and ZHuff when they
| started working on Brotli.
| lifthrasiir wrote:
| By that reckoning I can claim that the origin of Brotli
| also dates back to 2012 or earlier. :-) Indeed, Brotli
| authors were also responsible for the VP8 Lossless codec
| (2012), Pik (2017) and eventually JPEG XL (2019, in part).
| The entropy coding scheme is substantially similar among
| those codecs for this reason.
|
| But really, the large number of differences between the two
| formats rather suggests that they were designed with
| different uses in mind, which I believe relates to the
| maximum possible I/O. Formats designed for the web don't
| have to have an extremely fast decoder because I/O is the
| bottleneck: 100 MB/s should be enough, and the time earned
| can be invested in a higher compression ratio, even though
| that affects decoder performance.
| adgjlsfhk1 wrote:
| otoh, they achieve pretty similar compression ratios, and
| faster decode saves battery life and lets you get on to
| running the JavaScript earlier
| jauntywundrkind wrote:
| Brotli has two major advantages. First it encodes to the same
| well known gzip format, so rolling it out didn't require new
| clients. Second, it didn't require dictionary support.
|
| Whereas if you look at Mozilla's standards-positions for zstd,
| they defer because it requires a compression dictionary to
| encode & Moz didn't want to have to "standardize" one specific
| dictionary. https://github.com/mozilla/standards-
| positions/issues/105
| pmarreck wrote:
| > Moz didn't want to have to "standardize" one specific
| dictionary
|
| You could have multiple sources of dictionary identified by
| something like a hash, and cached in the browser. For example
| you could have a general-purpose HTML dictionary, a general-
| purpose CSS dictionary and a general-purpose JS dictionary,
| have canonical/official versions of these hosted somewhere
| (say, on Github) and distributed wherever, and update them on
| a semi-regular basis (at the cost of re-compression of the
| associated data); and in turn these could be overridden,
| perhaps on a site-by-site basis, with site-specific versions of
| these. You could basically associate an overridable but
| defaulted dictionary specification by MIME type (or none at
| all in the case of already-compressed data).
|
| Every file would have to be sent with the hash of the
| dictionary that was used to start compressing it, which would
| then be retrieved and cached somehow.
|
| I could see this resulting in some fairly significant byte
| savings, especially for retrieval of many smaller files.
|
| I bet this would also be useful potentially for email
| clients/servers.
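|
| In code, the per-MIME-type selection with a site override might
| look something like this (a rough sketch; the dictionaries and
| hashes are placeholders and the header plumbing is omitted):
|
|     import hashlib
|
|     def sha256_id(dict_bytes: bytes) -> str:
|         return hashlib.sha256(dict_bytes).hexdigest()
|
|     # hypothetical general-purpose dictionaries, named by content hash
|     GENERIC = {
|         "text/html": sha256_id(b"placeholder html dictionary"),
|         "text/css": sha256_id(b"placeholder css dictionary"),
|         "application/javascript": sha256_id(b"placeholder js dictionary"),
|     }
|
|     def dictionary_for(mime_type, site_overrides=None):
|         """Pick the dictionary hash to advertise for a MIME type,
|         letting a site-specific override beat the generic default."""
|         return (site_overrides or {}).get(mime_type,
|                                           GENERIC.get(mime_type))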
| jauntywundrkind wrote:
| > _You could have multiple sources of dictionary identified
| by something like a hash, and cached in the browser._
|
| The browser actually has a very advanced implementation of
| something like this! They're called URLs. The work here is
| to make it possible to identify dictionaries & retrieve &
| cache them in the browsers.
|
| Mozilla I think smartly avoided creating a spec that
| requires some variety of _centralized_ governance of what
| dictionaries to use, like you describe. Whatever the intent
| of such a community, it creates biases & will have a hard
| time figuring out what sets of dictionaries are worth
| creating, for what intents or code bases, and adapting or
| not as it discovers underserved code bases.
| masklinn wrote:
| > Brotli has two major advantages. First it encodes to the
| same well known gzip format
|
| It does not. You're thinking about zopfli. Brotli is an
| alternative to but not compatible with gzip.
| pmarreck wrote:
| The Zstd guy is employed by Facebook, so I'd suspect it's at
| least in part a NIH thing.
| joelthelion wrote:
| Patents, maybe?
| oynqr wrote:
| Brotli usually compresses plain text better than ZSTD.
| out_of_protocol wrote:
| Both with dictionaries? Both without? Any links to
| benchmarks?
| __alexs wrote:
| Zstd no longer comes with a patent grant, which I guess makes it
| kind of hard to actually use for some people.
| jacobgorm wrote:
| Zstd is already used in the Linux kernel, and I would assume
| that any patents covering Lempel-Ziv and Huffman coding would
| have expired long ago. The situation around ANS also seems to
| have been resolved, see https://en.wikipedia.org/wiki/Asymmet
| ric_numeral_systems#Tab... .
| abdullahhhh wrote:
| [flagged]
| __alexs wrote:
| The Linux kernel is stuffed full of potentially patent
| encumbered code.
| secondcoming wrote:
| huh, isn't zstd the compression used for debian packages?
| tyingq wrote:
| Makes me wonder if rsync-style differential download would be
| more generally useful than "shared dictionary for shared brotli".
| The implementation is letting you reference a previously
| downloaded artifact using a hash value, which is the same
| starting point.
| philsnow wrote:
| Are you referring to courgette (which the Chrome team uses to
| make shipping patches/updates take far fewer bytes than
| previously)? https://www.chromium.org/developers/design-
| documents/softwar...
| tyingq wrote:
| Yes, but exposed to everyone. This proposal settles on
| headers that get passed back and forth, which surely some
| servers would adopt: headers that establish "previously
| downloaded thing" with a hash. It feels like the process
| could be exactly the same at the start, but end instead with
| some data the client could use to do HTTP range
| requests. This proposal feels like it could be extended
| fairly easily to support differential download.
| philsnow wrote:
| Interesting! Privacy concerns aside for the moment, I would
| guess that there is some "optimal" compression dictionary
| over some transferred data set, and that Google/Chrome of
| all people would be in a position to calculate it and
| distribute it along with Chrome.
|
| Then Google could ship that dictionary to all (Chrome)
| browsers and update it once a month, and ship the same
| dictionary to anybody who wants to use it on the server
| side. Browsers can indicate which shared dictionaries they
| have preloaded and if there's overlap with which
| dictionaries the server has, the server can just choose one
| and use that to compress the stream (which seems kind of
| like ciphersuite negotiation). Compression should be faster
| and use way less memory if the sender can assume that the
| receiver already has the same dictionary, and if there's
| any problem, both sides can just fall back to on-the-fly
| compression with inlined dictionary like we've always done.
|
| There are almost certainly different optimal dictionaries
| for different locales / countries, each of which could be
| calculated and distributed in the same way. Even if the
| 'wrong' dictionary gets used for a given request, it's
| (probably) not catastrophic.
|
| I guess it might be possible for an attacker to indicate
| their client only has one dictionary and request pages that
| they know are disadvantageous for the server to have to
| compress with that dictionary. Even then, server-side
| heuristics can account for a lower-than-expected
| compression ratio and, again, fall back to on-the-fly
| dictionary calculation.
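|
| The negotiation step could be as simple as intersecting the two
| sets of dictionary identifiers (a sketch; the identifiers and
| tie-breaking rule are arbitrary here):
|
|     def negotiate(client_dicts, server_dicts):
|         """Pick a dictionary both sides already have, else None,
|         meaning: fall back to ordinary on-the-fly compression."""
|         common = set(client_dicts) & set(server_dicts)
|         return max(common) if common else None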
| [deleted]
| pornel wrote:
| It would be a dud. rsync-style download needs to prepare and
| upload a significant amount of data to describe what the client
| already has. Hashes don't compress, and you want granular ones
| to reuse as much as possible.
|
| The web is more sensitive to latency than bandwidth, and upload is
| more restricted than download. So rsync hits all the worst
| spots, for the least gain. The data that diffs nicely, like
| HTML, is already tiny and compresses very well. The large
| files, like videos, don't diff well at all.
|
| There is a way to make rsync-style differential download really
| quick to request: the client and server need to agree on
| diffing only from a specific shared dataset... like a
| dictionary! And you can replace hashes with smaller variable-
| length integers.
| tyingq wrote:
| Not rsync style in the sense of whole directory trees.
| Differential range-request download of the subsequent version
| of some specific file. Because this scheme means client and
| server already know a prior version via the hash.
| lifthrasiir wrote:
| I think it solves a different problem. One of Brotli's
| strengths was a built-in dictionary tuned for web content,
| which makes it especially suitable for short data with no known
| priors. The compression dictionary transport is the same idea
| but tuned for each website. In comparison, an rsync-like
| algorithm would make sense only when the data is long enough
| (say, more than 1 MB), in which case a custom dictionary
| wouldn't help much because the compressor already has a lot of
| patterns to work with. Ideally both can be implemented, where
| the HTTP Range request can be updated to support rsync-like
| algorithms.
| Scaevolus wrote:
| They already had a very similar feature _15 years ago_ with SDCH,
| which got removed in 2018:
| https://chromestatus.com/feature/5763176272494592
| https://daniel.haxx.se/blog/2008/09/09/shared-dictionary-com...
|
| Presumably they have some reason to think that this will be
| useful now when it wasn't before?
| qingcharles wrote:
| Talked about here:
|
| https://github.com/WICG/compression-dictionary-transport
| rasz wrote:
| >The client will store a hash of the uncompressed response
|
| >client will add ...request header as well as a sec-available-
| dictionary: <SHA-256>
|
| loving it, another thing to track clients with
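|
| Roughly, the client-side flow the draft describes (a sketch using
| the header name quoted above; the URL is made up and the exact
| digest encoding may differ):
|
|     import hashlib
|     import urllib.request
|
|     # body = uncompressed bytes of a previously fetched resource
|     body = b"console.log('app v1');"  # stand-in for a real download
|
|     # the client remembers the SHA-256 of the uncompressed response...
|     digest = hashlib.sha256(body).hexdigest()
|
|     # ...and advertises it when requesting a matching resource, so the
|     # server can compress the new version against what the client has
|     req = urllib.request.Request(
|         "https://example.com/app.v2.js",
|         headers={"sec-available-dictionary": digest},
|     )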
| pimterry wrote:
| From the spec (https://github.com/WICG/compression-dictionary-
| transport#fin...):
|
| > The existence of a dictionary is effectively a cookie for any
| requests that match it and should be treated as such:
|
| > * Storage partitioning for dictionary resource metadata
| should be at least as restrictive as for cookies.
|
| > * Dictionary entries (or at least the metadata) should be
| cleared any time cookies are cleared.
|
| Given that, I think it offers no more client tracking
| possibilities than any other existing web tech (cookies, local
| storage, etc).
| TonyTrapp wrote:
| Maybe no more possibilities than regular cookies, but more
| stealthy for sure. Cookies are (somewhat) easily reviewable
| by the user through the web developer UI of most browsers; I
| doubt there will be any browser UI for managing compression
| dictionaries.
| out_of_protocol wrote:
| should be w3c dicts only, and the number of dicts and versions
| should be heavily limited. well, something like
| js,css,html/xml,json,common-web and a new version every 5 years
| is, i think, good enough for everyone
| tyingq wrote:
| Interesting two-comment snippet from HN in 2018 about this:
| https://news.ycombinator.com/item?id=18720554
___________________________________________________________________
(page generated 2023-09-22 23:01 UTC)