[HN Gopher] Compression Dictionary Transport
       ___________________________________________________________________
        
       Compression Dictionary Transport
        
       Author : todsacerdoti
       Score  : 65 points
       Date   : 2025-07-04 15:07 UTC (7 hours ago)
        
 (HTM) web link (developer.mozilla.org)
 (TXT) w3m dump (developer.mozilla.org)
        
       | o11c wrote:
       | That `Link:` header broke my brain for a moment.
        
       | Y-bar wrote:
       | Available-Dictionary: :    =:
       | 
       | It seems very odd to use a colon as starting and ending delimiter
       | when the header name is already using a colon. Wouldn't a comma
       | or semicolon work better?
        
         | judofyr wrote:
          | It's encoded per the structured-fields spec, which says binary
          | data in headers should be enclosed by colons:
          | https://www.rfc-editor.org/rfc/rfc8941.html#name-byte-sequen...
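          | 
          | If I read the draft right, the value is just the SHA-256 hash
          | of the dictionary, base64-encoded and wrapped in colons to mark
          | it as a structured-field byte sequence. Rough Python sketch
          | (the dictionary bytes here are only a stand-in):
          | 
          |     import base64, hashlib
          | 
          |     # Stand-in for the previously fetched dictionary resource.
          |     dictionary = b"...cached dictionary bytes..."
          | 
          |     digest = hashlib.sha256(dictionary).digest()
          |     value = ":" + base64.b64encode(digest).decode() + ":"
          |     print("Available-Dictionary:", value)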
        
           | Y-bar wrote:
           | Oh, thanks, it looked like a string such as a hash or base64
           | encoded data, not binary. Don't think I have ever seen a use
           | case for binary data like this in a header before.
        
       | divbzero wrote:
       | This seems like a lot of added complexity for limited gain. Are
       | there cases where _gzip_ and _br_ at their highest compression
       | levels aren't good enough?
        
         | pmarreck wrote:
         | Every piece of information or file that is compressed sends a
         | dictionary along with it. In the case of, say, many HTML or CSS
         | files, this dictionary data is likely nearly completely
         | redundant.
         | 
         | There's almost no added complexity since zstd already handles
         | separate compression dictionaries quite well.
        
           | pornel wrote:
           | The standard compressed formats don't literally contain a
           | dictionary. The decompressed data becomes its own dictionary
            | while it's being decompressed. This makes the first occurrence
           | of any pattern less efficiently compressed (but usually it's
           | still compressed thanks to entropy coding), and then it
           | becomes cheap to repeat.
           | 
           | Brotli has a default dictionary with bits of HTML and
            | scripts. This is built into the decompressor, and not sent
           | with the files.
           | 
           | The decompression dictionaries aren't magic. They're
           | basically a prefix for decompressed files, so that a first
           | occurrence of some pattern can be referenced from the
           | dictionary instead of built from scratch. This helps only
           | with the first occurrences of data near the start of the
           | file, and for all the later repetitions the dictionary
           | becomes irrelevant.
           | 
           | The dictionary needs to be downloaded too, and you're not
           | going to have dictionaries all the way down, so you pay the
           | cost of decompressing the data without a dictionary whether
           | it's a dictionary + dictionary-using-file, or just the full
           | file itself.
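            | 
            | You can see the "prefix" effect with plain zlib, which takes a
            | preset dictionary (zdict). Rough sketch that uses an old
            | version of a file as the dictionary for the new one (the file
            | names are made up):
            | 
            |     import zlib
            | 
            |     # Hypothetical old and new versions of the same bundle.
            |     old = open("app.v1.js", "rb").read()
            |     new = open("app.v2.js", "rb").read()
            | 
            |     plain = zlib.compress(new, 9)
            | 
            |     co = zlib.compressobj(level=9, zdict=old)
            |     with_dict = co.compress(new) + co.flush()
            | 
            |     # Decompression needs the exact same dictionary bytes.
            |     do = zlib.decompressobj(zdict=old)
            |     assert do.decompress(with_dict) == new
            | 
            |     print(len(plain), len(with_dict))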
        
         | bsmth wrote:
         | If you're shipping a JS bundle, for instance, that has small,
         | frequent updates, this should be a good use case. There's a
          | test site accompanying the explainer that looks interesting
          | for estimates: https://use-as-dictionary.com/generate/
        
         | ks2048 wrote:
          | Some examples here show a significant gain from using a
          | dictionary over compressing without one:
          | https://github.com/WICG/compression-dictionary-transport/blo...
         | 
         | It seems like instead of sites reducing bloat, they will just
          | shift the bloat to your hard drive. Some of the examples used a
          | dictionary of 1 MB, which doesn't seem big but could add up if
          | everyone is doing this.
        
       | bhaney wrote:
       | Cloudflare and similar services seem well positioned to take
       | advantage of this.
       | 
       | Analyze the most common responses of a website on their platform,
       | build an efficient dictionary from that data, and then
       | automatically inject a link to that site-specific dictionary so
       | future responses are optimally compressed and save on bandwidth.
       | All transparent to the customers and end users.
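        | 
        | The zstandard Python bindings can already do the training step;
        | a rough sketch of the idea (paths and sizes are invented):
        | 
        |     import glob
        |     import zstandard as zstd
        | 
        |     # Hypothetical corpus: bodies of a site's common responses.
        |     samples = [open(p, "rb").read()
        |                for p in glob.glob("responses/*.html")]
        | 
        |     # Train a ~64 KB shared dictionary from the samples.
        |     d = zstd.train_dictionary(64 * 1024, samples)
        |     open("site.dict", "wb").write(d.as_bytes())
        | 
        |     # Later, compress responses against that dictionary.
        |     cctx = zstd.ZstdCompressor(level=19, dict_data=d)
        |     body = cctx.compress(open("index.html", "rb").read())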
        
         | pornel wrote:
          | Per-URL dictionaries (where a URL is its own dictionary) are
          | great, because they allow updating a resource to a new version
          | incrementally: an old version of the same resource is the best
          | template, and there's no extra cost when you already have it.
         | 
          | However, I'm sceptical about the usefulness of multi-page
          | shared dictionaries (where you construct one for a site or
          | group of pages). They're a gamble that can backfire.
         | 
         | The extra dictionary needs to be downloaded, so it starts as an
         | extra overhead. It's not enough for it to just match something.
         | It has to beat regular (per-page) compression to be better than
         | nothing, and it must be useful enough to repay its own cost
         | before it even starts being a net positive. This basically
         | means everything in the dictionary must be useful to a user,
         | and has to be used more than once, otherwise it's just an
         | unnecessary upfront slowdown.
         | 
         | Standard (per-page) compression is already very good at
         | removing simple repetitive patterns, and Brotli even comes with
         | a default built-in dictionary of random HTML-like fragments.
          | This further narrows down the usefulness of shared
          | dictionaries, because generic page-like content isn't enough to
          | give them an advantage. They need to contain more specific
          | content to beat standard compression, but the more specific the
          | dictionary is, the lower the chance of it fitting what the user
          | browses.
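          | 
          | Back-of-the-envelope version of that break-even argument (all
          | the numbers here are invented):
          | 
          |     dict_cost = 100_000   # bytes to download the dictionary
          |     plain     = 50_000    # page compressed without it
          |     with_dict = 35_000    # page compressed against it
          | 
          |     # Pages a user must view before the dictionary pays off.
          |     views = dict_cost / (plain - with_dict)
          |     print(views)          # ~6.7 page views to break even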
        
         | creatonez wrote:
         | Excited to see access control mishaps where the training data
         | includes random data from other users
        
       | mlhpdx wrote:
       | This seems very interesting for APIs where clients have chatty
       | and long lived connections. I'm thinking about the GitHub API,
       | for example.
        
       | everfrustrated wrote:
       | No doubt someone will figure out how to abuse this into yet
       | another cookie/tracking technology.
        
       | CottonMcKnight wrote:
       | If this interests you, I highly recommend watching this talk by
       | Pat Meenan.
       | 
       | https://www.youtube.com/watch?v=Gt0H2DxdAPY
        
       | londons_explore wrote:
       | Seems like this would result in quite a lot of increased server
       | load.
       | 
       | Previously servers would cache compressed versions of your static
       | resources.
       | 
       | Whereas now they either have to compress on-the-fly or have a
       | massive cache of not only your most recent static JavaScript
       | blob, but also all past blobs and versions compressed using
       | different combinations of them as a dictionary.
       | 
        | This could easily 10x the resources needed for serving static
        | HTML/CSS/JS.
        
         | magicalist wrote:
         | The past versions stored clientside _are_ the dictionaries.
         | Serverside, just keep the diffs against, say, the last five
         | versions around if storage is an issue, or whatever gets you
         | some high percentage of returning clients, then rebuild when
         | pushing a new release.
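          | 
          | Something like this at release time would cover the server
          | side (rough sketch with the zstandard bindings; version numbers
          | and file names are invented):
          | 
          |     import zstandard as zstd
          | 
          |     new = open("dist/app.v7.js", "rb").read()
          | 
          |     # Precompress the new bundle against each recent release
          |     # that returning clients may still hold as a dictionary.
          |     for v in ["v2", "v3", "v4", "v5", "v6"]:
          |         old = open(f"dist/app.{v}.js", "rb").read()
          |         d = zstd.ZstdCompressionDict(old)
          |         cctx = zstd.ZstdCompressor(level=19, dict_data=d)
          |         out = cctx.compress(new)
          |         open(f"dist/app.v7.js.{v}.zst", "wb").write(out)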
        
       | longhaul wrote:
       | Why can't browsers/servers just store a standard English
        | dictionary and communicate via indexes? Anything that isn't in
       | the dictionary can be sent raw. I've always had this thought but
       | don't see why it isn't implemented. Might get a bit more involved
       | with other languages but the principle remains the same.
       | 
        | Thinking about it a bit more, we already do this at the
        | character level (a Unicode table), so why can't we look up words
        | or maybe even common sentences?
        
       ___________________________________________________________________
       (page generated 2025-07-04 23:00 UTC)