[HN Gopher] Chrome feature: Compression dictionary transport wit...
       ___________________________________________________________________
        
       Chrome feature: Compression dictionary transport with Shared Brotli
        
       Author : judiisis
       Score  : 57 points
       Date   : 2023-09-22 08:57 UTC (14 hours ago)
        
 (HTM) web link (chromestatus.com)
 (TXT) w3m dump (chromestatus.com)
        
       | jacobgorm wrote:
        | I never really understood why Google keeps pushing Brotli. LZ4
       | and ZStd both predate it by a couple of years, and seem to offer
       | superior performance overall. Is it a Google-NIH thing?
        
         | lifthrasiir wrote:
          | They were designed more or less independently of each
          | other, and your statement that Zstandard predates Brotli
          | is also wrong (both appeared around 2015). Both are
          | designed to be much more efficient than DEFLATE on both
          | the time and size axes, but otherwise they have little in
          | common. Zstandard is notable for its use of ANS, while
          | Brotli is probably the first mainstream format to
          | prominently use a 2nd-order context model, which was
          | previously thought to be too slow. Brotli was also
          | designed with web content in mind, which makes it
          | excellent for small inputs with no prior. Zstandard has
          | no such treatment but has better performance when I/O can
          | handle it, which makes it excellent for server jobs.
        
           | miohtama wrote:
            | The performance depends on the data. I have found that
            | for English text and source code, Brotli outperforms
            | Zstd. Not sure if this is because of the static
            | dictionary or some other tweaks, though. None of this
            | is set in stone, either, as both technologies are
            | actively developed.
        
             | lifthrasiir wrote:
              | Better _decoding time_ performance, I should have
              | said. The use of a 2nd-order context model indeed
              | makes Brotli more efficient in compression ratio for
              | many kinds of data.
        
           | pmarreck wrote:
           | Yes, but hasn't "separate compression dictionary" been a core
           | feature of Zstd since its inception, at least under that
           | name?
        
             | lifthrasiir wrote:
              | Not even close; zlib had one. (Search for "FDICT" in
              | RFC 1950.) What Zstandard added was a CLI to
              | automatically generate a good-enough dictionary from
              | sample inputs.
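              | 
              | A minimal sketch of such a preset-dictionary round
              | trip with Python's zlib (the zdict parameter is the
              | library-level counterpart of FDICT; the sample bytes
              | below are made up):
              | 
              |     import zlib
              | 
              |     # Byte strings expected to recur in the payloads.
              |     shared_dict = b'{"status": "ok", "items": []}'
              |     payload = b'{"status": "ok", "items": [1, 2, 3]}'
              | 
              |     # Compress with the preset dictionary.
              |     comp = zlib.compressobj(zdict=shared_dict)
              |     data = comp.compress(payload) + comp.flush()
              | 
              |     # Decompression needs the same dictionary.
              |     decomp = zlib.decompressobj(zdict=shared_dict)
              |     assert decomp.decompress(data) == payload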
        
               | pmarreck wrote:
                |     FDICT (Preset dictionary)
                | 
                |     If FDICT is set, a DICT dictionary identifier
                |     is present immediately after the FLG byte. The
                |     dictionary is a sequence of bytes which are
                |     initially fed to the compressor without
                |     producing any compressed output. DICT is the
                |     Adler-32 checksum of this sequence of bytes
                |     (see the definition of ADLER32 below). The
                |     decompressor can use this identifier to
                |     determine which dictionary has been used by
                |     the compressor.
               | 
               | Well, wow. I have to wonder why this wasn't more
               | utilized, then. There are a ton of contexts (columnar
               | data in databases, for example) where shared-dictionary-
               | based compression might have helped a ton before now.
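                | 
                | For illustration, roughly what the Zstandard
                | dictionary training mentioned above looks like via
                | the zstandard Python bindings, with made-up small
                | records standing in for columnar data:
                | 
                |     import zstandard
                | 
                |     # Many small, similar records compress poorly
                |     # on their own; a trained dictionary recovers
                |     # the redundancy shared across them. (A sketch;
                |     # real training wants a large, varied sample
                |     # set.)
                |     samples = [
                |         b'{"id": %d, "country": "US"}' % i
                |         for i in range(5000)
                |     ]
                | 
                |     # What `zstd --train` does, as a library call.
                |     d = zstandard.train_dictionary(1024, samples)
                | 
                |     cctx = zstandard.ZstdCompressor(dict_data=d)
                |     dctx = zstandard.ZstdDecompressor(dict_data=d)
                |     blob = cctx.compress(samples[0])
                |     assert dctx.decompress(blob) == samples[0]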
        
           | jacobgorm wrote:
           | ZStd started out as ZHuff, which was first announced in 2008;
           | see https://encode.su/threads/521-Zhuff-fast-compression and
           | http://fastcompression.blogspot.com/p/zhuff.html
           | 
            | I've been using LZ4 in commercial software since 2011,
            | so unless I am some kind of luminary, Google's
            | compression experts must have been aware of both LZ4
            | and ZHuff when they started working on Brotli.
        
             | lifthrasiir wrote:
              | By that reckoning I could claim that the origin of
              | Brotli also dates back to 2012 or earlier. :-)
              | Indeed, the Brotli authors were also responsible for
              | the VP8 Lossless codec (2012), Pik (2017) and
              | eventually JPEG XL (2019, in part). The entropy
              | coding scheme is substantially similar among those
              | codecs for this reason.
              | 
              | But really, the large number of differences between
              | the two formats rather suggests that they were
              | designed with different uses in mind, which I believe
              | relates to the maximum possible I/O. Formats designed
              | for the web don't need an extremely fast decoder
              | because I/O is the bottleneck: 100 MB/s should be
              | enough, and the time saved can be invested in a
              | higher compression ratio, which affects decoder
              | performance.
        
               | adgjlsfhk1 wrote:
                | OTOH, they achieve pretty similar compression
                | ratios, and a faster decode saves battery life and
                | lets you get on to running the JavaScript earlier.
        
         | jauntywundrkind wrote:
         | Brotli has two major advantages. First it encodes to the same
         | well known gzip format, so rolling it out didn't require new
         | clients. Second, it didn't require dictionary support.
         | 
          | Whereas if you look at Mozilla's standards-positions for zstd,
         | they defer because it requires a compression dictionary to
         | encode & Moz didn't want to have to "standardize" one specific
         | dictionary. https://github.com/mozilla/standards-
         | positions/issues/105
        
           | pmarreck wrote:
           | > Moz didn't want to have to "standardize" one specific
           | dictionary
           | 
            | You could have multiple sources of dictionary
            | identified by something like a hash, and cached in the
            | browser. For example, you could have a general-purpose
            | HTML dictionary, a general-purpose CSS dictionary and
            | a general-purpose JS dictionary, host
            | canonical/official versions of these somewhere (say,
            | on GitHub), distribute them wherever, and update them
            | on a semi-regular basis (at the cost of re-compressing
            | the associated data); in turn, these could be
            | overridden on a site-by-site basis with site-specific
            | versions. You could basically associate an overridable
            | but defaulted dictionary with each MIME type (or none
            | at all in the case of already-compressed data).
           | 
           | Every file would have to be sent with the hash of the
           | dictionary that was used to start compressing it, which would
           | then be retrieved and cached somehow.
           | 
           | I could see this resulting in some fairly significant byte
           | savings, especially for retrieval of many smaller files.
           | 
            | I bet this could potentially also be useful for email
            | clients/servers.
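            | 
            | A rough sketch of the bookkeeping that idea implies
            | (the helper names and the MIME-type defaults here are
            | hypothetical, not anything from the actual proposal):
            | 
            |     import hashlib
            |     import zlib
            | 
            |     # Dictionaries keyed by their SHA-256 digest.
            |     dictionaries = {}
            | 
            |     def register(dict_bytes):
            |         digest = hashlib.sha256(dict_bytes).hexdigest()
            |         dictionaries[digest] = dict_bytes
            |         return digest
            | 
            |     # Hypothetical per-MIME-type defaults, which a site
            |     # could override with its own dictionaries.
            |     defaults = {
            |         "text/css": register(b"margin:0;padding:0;"),
            |         "text/html": register(b"<!doctype html>"),
            |     }
            | 
            |     def compress_for(mime, body):
            |         digest = defaults[mime]
            |         comp = zlib.compressobj(zdict=dictionaries[digest])
            |         # Send the digest with the payload so the client
            |         # can fetch and cache the matching dictionary.
            |         return digest, comp.compress(body) + comp.flush()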
        
             | jauntywundrkind wrote:
             | > _You could have multiple sources of dictionary identified
             | by something like a hash, and cached in the browser._
             | 
              | The browser actually has a very advanced
              | implementation of something like this! They're
              | called URLs. The work here is to make it possible to
              | identify dictionaries & retrieve & cache them in the
              | browser.
              | 
              | Mozilla, I think, smartly avoided creating a spec
              | that requires some variety of _centralized_
              | governance of which dictionaries to use, like you
              | describe. Whatever the intent of such a community,
              | it creates biases & will have a hard time figuring
              | out which sets of dictionaries are worth creating,
              | for which intents or code bases, and adapting (or
              | not) as it discovers underserved code bases.
        
           | masklinn wrote:
           | > Brotli has two major advantages. First it encodes to the
           | same well known gzip format
           | 
           | It does not. You're thinking about zopfli. Brotli is an
           | alternative to but not compatible with gzip.
        
         | pmarreck wrote:
         | The Zstd guy is employed by Facebook, so I'd suspect it's at
         | least in part a NIH thing.
        
         | joelthelion wrote:
         | Patents, maybe?
        
         | oynqr wrote:
         | Brotli usually compresses plain text better than ZSTD.
        
           | out_of_protocol wrote:
           | Both with dictionaries? Both without? Any links to
           | benchmarks?
        
         | __alexs wrote:
          | Zstd no longer comes with a patent grant, which I guess
          | makes it kind of hard to actually use for some people.
        
           | jacobgorm wrote:
           | Zstd is already used in the Linux kernel, and I would assume
           | that any patents covering Lempel-Ziv and Huffman coding would
           | have expired long ago. The situation around ANS also seems to
           | have been resolved, see https://en.wikipedia.org/wiki/Asymmet
           | ric_numeral_systems#Tab... .
        
             | abdullahhhh wrote:
             | [flagged]
        
             | __alexs wrote:
             | The Linux kernel is stuffed full of potentially patent
             | encumbered code.
        
           | secondcoming wrote:
            | Huh, isn't zstd the compression used for Debian packages?
        
       | tyingq wrote:
       | Makes me wonder if rsync-style differential download would be
       | more generally useful than "shared dictionary for shared brotli".
        | The implementation lets you reference a previously
        | downloaded artifact using a hash value, which is the same
        | starting point.
        
         | philsnow wrote:
         | Are you referring to courgette (which the Chrome team uses to
         | make shipping patches/updates take far fewer bytes than
         | previously)? https://www.chromium.org/developers/design-
         | documents/softwar...
        
           | tyingq wrote:
            | Yes, but exposed to everyone. This proposal settles on
            | headers that get passed back and forth, which surely
            | some servers would adopt: headers that establish a
            | "previously downloaded thing" via a hash. It feels
            | like the process could be exactly the same at the
            | start, but end instead with some data the client could
            | use to make HTTP range requests. This proposal feels
            | like it could be extended fairly easily to support
            | differential download.
        
             | philsnow wrote:
             | Interesting! Privacy concerns aside for the moment, I would
             | guess that there is some "optimal" compression dictionary
             | over some transferred data set, and that Google/Chrome of
             | all people would be in a position to calculate it and
             | distribute it along with Chrome.
             | 
             | Then Google could ship that dictionary to all (Chrome)
             | browsers and update it once a month, and ship the same
             | dictionary to anybody who wants to use it on the server
             | side. Browsers can indicate which shared dictionaries they
             | have preloaded and if there's overlap with which
             | dictionaries the server has, the server can just choose one
             | and use that to compress the stream (which seems kind of
             | like ciphersuite negotiation). Compression should be faster
             | and use way less memory if the sender can assume that the
             | receiver already has the same dictionary, and if there's
             | any problem, both sides can just fall back to on-the-fly
             | compression with inlined dictionary like we've always done.
             | 
             | There are almost certainly different optimal dictionaries
             | for different locales / countries, each of which could be
             | calculated and distributed in the same way. Even if the
             | 'wrong' dictionary gets used for a given request, it's
             | (probably) not catastrophic.
             | 
             | I guess it might be possible for an attacker to indicate
             | their client only has one dictionary and request pages that
             | they know are disadvantageous for the server to have to
             | compress with that dictionary. Even then, server-side
             | heuristics can account for a lower-than-expected
             | compression ratio and, again, fall back to on-the-fly
             | dictionary calculation.
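              | 
              | Something like this on the server side, assuming the
              | client advertises the hashes of its preloaded
              | dictionaries in a request header (the header names
              | and helpers below are made up for the sketch):
              | 
              |     import zlib
              | 
              |     def pick_dictionary(client_hashes, dicts):
              |         # First overlap wins; None means no overlap.
              |         for digest in client_hashes:
              |             if digest in dicts:
              |                 return digest, dicts[digest]
              |         return None
              | 
              |     def respond(body, client_hashes, dicts):
              |         choice = pick_dictionary(client_hashes, dicts)
              |         if choice is not None:
              |             digest, dict_bytes = choice
              |             comp = zlib.compressobj(zdict=dict_bytes)
              |             data = comp.compress(body) + comp.flush()
              |             # Placeholder header naming the dictionary.
              |             return {"used-dictionary": digest}, data
              |         # No overlap: fall back to on-the-fly
              |         # compression with no shared dictionary.
              |         return {}, zlib.compress(body)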
        
           | [deleted]
        
         | pornel wrote:
          | It would be a dud. rsync-style download needs to prepare
          | and upload a significant amount of data to describe what
          | the client already has. Hashes don't compress, and you
          | want granular ones to reuse as much as possible.
          | 
          | The web is more sensitive to latency than bandwidth, and
          | upload is more restricted than download. So rsync hits
          | all the worst spots, for the least gain. The data that
          | diffs nicely, like HTML, is already tiny and compresses
          | very well. The large files, like videos, don't diff well
          | at all.
         | 
         | There is a way to make rsync-style differential download really
         | quick to request: the client and server need to agree on
         | diffing only from a specific shared dataset... like a
         | dictionary! And you can replace hashes with smaller variable-
         | length integers.
        
           | tyingq wrote:
            | Not rsync-style in the sense of whole directory trees,
            | but differential range-request download of the
            | subsequent version of some specific file, since this
            | scheme means the client and server already know a
            | prior version via the hash.
        
         | lifthrasiir wrote:
          | I think it solves a different problem. One of Brotli's
          | strengths was a built-in dictionary tuned for web
          | content, which makes it especially suitable for short
          | data with no known priors. Compression dictionary
          | transport is the same idea, but tuned for each website.
          | By comparison, an rsync algorithm would only make sense
          | when the data is long enough (say, more than 1 MB), in
          | which case a custom dictionary wouldn't help much
          | because the compressor already has a lot of patterns to
          | work with. Ideally both could be implemented, with the
          | HTTP Range request updated to support rsync-like
          | algorithms.
        
       | Scaevolus wrote:
        | They already had a very similar feature _15 years ago_
        | with SDCH, which got removed in 2018:
        | https://chromestatus.com/feature/5763176272494592
        | https://daniel.haxx.se/blog/2008/09/09/shared-dictionary-com...
       | 
       | Presumably they have some reason to think that this will be
       | useful now when it wasn't before?
        
         | qingcharles wrote:
         | Talked about here:
         | 
         | https://github.com/WICG/compression-dictionary-transport
        
       | rasz wrote:
       | >The client will store a hash of the uncompressed response
       | 
       | >client will add ...request header as well as a sec-available-
       | dictionary: <SHA-256>
       | 
       | loving it, another thing to track clients with
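        | 
        | A toy sketch of the client-side bookkeeping those quoted
        | lines describe, keyed by exact URL for simplicity (the
        | real proposal matches URL patterns; only the
        | sec-available-dictionary header name comes from the spec):
        | 
        |     import hashlib
        | 
        |     # URL -> SHA-256 of the uncompressed response body.
        |     # (Hex encoding of the hash is a guess here.)
        |     stored = {}
        | 
        |     def remember(url, uncompressed_body):
        |         digest = hashlib.sha256(uncompressed_body)
        |         stored[url] = digest.hexdigest()
        | 
        |     def request_headers(url):
        |         headers = {"Accept-Encoding": "br, gzip"}
        |         if url in stored:
        |             headers["sec-available-dictionary"] = stored[url]
        |         return headers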
        
         | pimterry wrote:
         | From the spec (https://github.com/WICG/compression-dictionary-
         | transport#fin...):
         | 
         | > The existence of a dictionary is effectively a cookie for any
         | requests that match it and should be treated as such:
         | 
         | > * Storage partitioning for dictionary resource metadata
         | should be at least as restrictive as for cookies.
         | 
         | > * Dictionary entries (or at least the metadata) should be
         | cleared any time cookies are cleared.
         | 
          | Given that, I think it offers no more client-tracking
          | possibilities than any other existing web tech (cookies,
          | local storage, etc.).
        
           | TonyTrapp wrote:
            | Maybe no more possibilities than regular cookies, but
            | more stealthy for sure. Cookies are (somewhat) easily
            | reviewable by the user through the web developer UI of
            | most browsers; I doubt there will be any browser UI
            | for managing compression dictionaries.
        
         | out_of_protocol wrote:
          | Should be W3C dictionaries only, and the number of
          | dictionaries and versions should be heavily limited.
          | Well, something like js, css, html/xml, json and
          | common-web, with a new version every 5 years, is I think
          | good enough for everyone.
        
       | tyingq wrote:
        | Interesting two-comment snippet from HN in 2018 about this:
        | https://news.ycombinator.com/item?id=18720554
        
       ___________________________________________________________________
       (page generated 2023-09-22 23:01 UTC)