[HN Gopher] Compression Dictionary Transport
___________________________________________________________________
Compression Dictionary Transport
Author : todsacerdoti
Score : 65 points
Date : 2025-07-04 15:07 UTC (7 hours ago)
(HTM) web link (developer.mozilla.org)
(TXT) w3m dump (developer.mozilla.org)
| o11c wrote:
| That `Link:` header broke my brain for a moment.
| Y-bar wrote:
| Available-Dictionary: : =:
|
| It seems very odd to use a colon as the starting and ending
| delimiter when the header name is already followed by a colon.
| Wouldn't a comma or semicolon work better?
| judofyr wrote:
| It's encoded using the spec that binary data in headers should
| be enclosed by colons:
| https://www.rfc-editor.org/rfc/rfc8941.html#name-byte-sequen...
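|
| Concretely, an RFC 8941 byte sequence is base64 data wrapped
| in colons. For this header the payload is the SHA-256 hash of
| the dictionary the client already has, so it ends up looking
| something like this (hash value is illustrative):
|
|   Available-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=: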
| Y-bar wrote:
| Oh, thanks, it looked like a string such as a hash or base64
| encoded data, not binary. Don't think I have ever seen a use
| case for binary data like this in a header before.
| divbzero wrote:
| This seems like a lot of added complexity for limited gain. Are
| there cases where _gzip_ and _br_ at their highest compression
| levels aren't good enough?
| pmarreck wrote:
| Every piece of information or file that is compressed sends a
| dictionary along with it. In the case of, say, many HTML or CSS
| files, this dictionary data is likely nearly completely
| redundant.
|
| There's almost no added complexity since zstd already handles
| separate compression dictionaries quite well.
| pornel wrote:
| The standard compressed formats don't literally contain a
| dictionary. The decompressed data becomes its own dictionary
| while it's being decompressed. This makes the first occurrence
| of any pattern less efficiently compressed (but usually it's
| still compressed thanks to entropy coding), and then it
| becomes cheap to repeat.
|
| Brotli has a default dictionary with bits of HTML and
| scripts. This is built into the decompressor, and not sent
| with the files.
|
| The decompression dictionaries aren't magic. They're
| basically a prefix for decompressed files, so that a first
| occurrence of some pattern can be referenced from the
| dictionary instead of built from scratch. This helps only
| with the first occurrences of data near the start of the
| file, and for all the later repetitions the dictionary
| becomes irrelevant.
|
| The dictionary needs to be downloaded too, and you're not
| going to have dictionaries all the way down, so you pay the
| cost of decompressing the data without a dictionary whether
| it's a dictionary + dictionary-using-file, or just the full
| file itself.
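|
| A rough way to see the "prefix" behaviour is zstd's raw-content
| dictionary mode, using an old version of a file as the
| dictionary for the new one. A sketch with the Python zstandard
| bindings (the data is made up):
|
|   import os
|   import zstandard as zstd
|
|   # Stand-in for the bulk of a bundle that didn't change
|   # between releases, plus a small edit at the end.
|   shared = os.urandom(50_000)
|   old_version = shared + b"// v1\n"
|   new_version = shared + b"// v2: a small change\n"
|
|   # Use the old bytes directly as dictionary content
|   # (no training step involved).
|   prefix = zstd.ZstdCompressionDict(
|       old_version, dict_type=zstd.DICT_TYPE_RAWCONTENT)
|
|   plain = zstd.ZstdCompressor(level=19).compress(new_version)
|   delta = zstd.ZstdCompressor(
|       level=19, dict_data=prefix).compress(new_version)
|   print(len(plain), len(delta))  # delta should be far smaller
|
|   # The client needs the exact same dictionary bytes to decode.
|   out = zstd.ZstdDecompressor(dict_data=prefix).decompress(delta)
|   assert out == new_version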
| bsmth wrote:
| If you're shipping a JS bundle, for instance, that has small,
| frequent updates, this should be a good use case. There's a
| test site here that accompanies the explainer which looks
| interesting for estimates:
| https://use-as-dictionary.com/generate/
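|
| Roughly, the exchange looks something like this (paths and the
| hash are placeholders). The first response marks the bundle as
| a dictionary for future requests matching a pattern:
|
|   HTTP/1.1 200 OK
|   Content-Type: text/javascript
|   Use-As-Dictionary: match="/js/app-*.js"
|
| When the next version is requested, the browser advertises the
| dictionary it already has, and the server can reply with a
| dictionary-compressed encoding:
|
|   GET /js/app-v2.js HTTP/1.1
|   Accept-Encoding: gzip, br, zstd, dcb, dcz
|   Available-Dictionary: :<base64 SHA-256 of app-v1.js>:
|
|   HTTP/1.1 200 OK
|   Content-Encoding: dcz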
| ks2048 wrote:
| Some examples here show significant gains from using a
| dictionary over compressing without one:
| https://github.com/WICG/compression-dictionary-transport/blo...
|
| It seems like instead of sites reducing bloat, they will just
| shift the bloat to your hard drive. Some of the examples
| mentioned a dictionary of 1 MB, which doesn't seem big, but it
| could add up if everyone is doing this.
| bhaney wrote:
| Cloudflare and similar services seem well positioned to take
| advantage of this.
|
| Analyze the most common responses of a website on their platform,
| build an efficient dictionary from that data, and then
| automatically inject a link to that site-specific dictionary so
| future responses are optimally compressed and save on bandwidth.
| All transparent to the customers and end users.
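|
| For what it's worth, zstd already ships the training side of
| this. A rough sketch with the Python zstandard bindings (the
| sample bodies are made-up stand-ins for a site's most common
| responses):
|
|   import zstandard as zstd
|
|   # Pretend these are the most common response bodies.
|   samples = [
|       ("<html><head><title>Page %d</title></head><body>"
|        "<nav>Home / Products / Checkout</nav>"
|        "<p>Order %d shipped to customer %d.</p></body></html>"
|        % (i, i * 7, i * 13)).encode()
|       for i in range(2000)
|   ]
|
|   # Train a small dictionary capturing the shared boilerplate.
|   shared_dict = zstd.train_dictionary(1024, samples)
|
|   # Compress one body with and without the trained dictionary.
|   body = samples[0]
|   plain = zstd.ZstdCompressor(level=19).compress(body)
|   with_dict = zstd.ZstdCompressor(
|       level=19, dict_data=shared_dict).compress(body)
|   print(len(plain), len(with_dict))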
| pornel wrote:
| Per-URL dictionaries (where a URL is its own dictionary) are
| great, because they allow updating to a new version of a
| resource incrementally, and an old version of the same resource
| is the best template, and there's no extra cost when you
| already have it.
|
| However, I'm sceptical about the usefulness of multi-page shared
| dictionaries (where you construct one for a site or group of
| pages). They're a gamble that can backfire.
|
| The extra dictionary needs to be downloaded, so it starts as an
| extra overhead. It's not enough for it to just match something.
| It has to beat regular (per-page) compression to be better than
| nothing, and it must be useful enough to repay its own cost
| before it even starts being a net positive. This basically
| means everything in the dictionary must be useful to a user,
| and has to be used more than once, otherwise it's just an
| unnecessary upfront slowdown.
|
| Standard (per-page) compression is already very good at
| removing simple repetitive patterns, and Brotli even comes with
| a default built-in dictionary of random HTML-like fragments.
| This further narrows down the usefulness of shared
| dictionaries, because generic page-like content alone isn't
| enough to give them an advantage. They need to contain more
| specific content to beat standard compression, but the more
| specific the dictionary is, the lower the chance of it
| matching what the user actually browses.
| creatonez wrote:
| Excited to see access control mishaps where the training data
| includes random data from other users
| mlhpdx wrote:
| This seems very interesting for APIs where clients have chatty
| and long lived connections. I'm thinking about the GitHub API,
| for example.
| everfrustrated wrote:
| No doubt someone will figure out how to abuse this into yet
| another cookie/tracking technology.
| CottonMcKnight wrote:
| If this interests you, I highly recommend watching this talk by
| Pat Meenan.
|
| https://www.youtube.com/watch?v=Gt0H2DxdAPY
| londons_explore wrote:
| Seems like this would result in quite a lot of increased server
| load.
|
| Previously servers would cache compressed versions of your static
| resources.
|
| Whereas now they either have to compress on the fly or keep a
| massive cache of not only your most recent static JavaScript
| blob, but also all past blobs and versions compressed using
| different combinations of them as a dictionary.
|
| This could easily 10x the resources needed for serving static
| HTML/CSS/JS.
| magicalist wrote:
| The past versions stored clientside _are_ the dictionaries.
| Serverside, just keep the diffs against, say, the last five
| versions around if storage is an issue, or whatever gets you
| some high percentage of returning clients, then rebuild when
| pushing a new release.
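|
| A rough sketch of that precompute step with the Python
| zstandard bindings (names and toy data are made up; each
| artifact is keyed by the base64 SHA-256 hash a returning
| client would advertise):
|
|   import base64
|   import hashlib
|   import zstandard as zstd
|
|   def precompress_release(new_bundle, old_versions):
|       artifacts = {}
|       for old in old_versions:
|           # Each old release acts as a raw-content dictionary.
|           prefix = zstd.ZstdCompressionDict(
|               old, dict_type=zstd.DICT_TYPE_RAWCONTENT)
|           delta = zstd.ZstdCompressor(
|               level=19, dict_data=prefix).compress(new_bundle)
|           key = base64.b64encode(
|               hashlib.sha256(old).digest()).decode()
|           artifacts[key] = delta
|       return artifacts
|
|   # Toy data: three old releases and a new one.
|   olds = [b"console.log('v%d');" % i * 1000 for i in range(3)]
|   new = b"console.log('v3');" * 1000
|   deltas = precompress_release(new, olds)
|   print({k[:8]: len(v) for k, v in deltas.items()})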
| longhaul wrote:
| Why can't browsers/servers just store a standard English
| dictionary and communicate via indexes? Anything that isn't in
| the dictionary can be sent raw. I've always had this thought but
| don't see why it isn't implemented. It might get a bit more
| involved with other languages, but the principle remains the
| same.
|
| Thinking about it a bit more, we already do this at the
| character level with a Unicode table, so why can't we look up
| words or maybe even common sentences?
___________________________________________________________________
(page generated 2025-07-04 23:00 UTC)