[HN Gopher] Compression Dictionary Transport
___________________________________________________________________
Compression Dictionary Transport
Author : tosh
Score : 55 points
Date : 2024-09-15 15:02 UTC (7 hours ago)
(HTM) web link (datatracker.ietf.org)
(TXT) w3m dump (datatracker.ietf.org)
| hedora wrote:
| I'm guessing the primary use case for this will be setting
| cookies. (Google Fonts or whatever sends different dictionaries
| to different clients, so the decompression result is unique per
| client).
|
| I wonder what percentage of http traffic is redundant compression
| dictionaries. How much could this actually help in theory?
| JoshTriplett wrote:
| > I'm guessing the primary use case for this will be setting
| cookies.
|
| There's a huge difference between "this could potentially be
| used to fingerprint a client" and "this will primarily be used
| to fingerprint a client".
|
| There are _many_ potential fingerprinting mechanisms that rely
| on tracking what the client already has in its cache. Clients
| could easily partition their cached dictionaries by origin,
| just as they could partition other cached information by
| origin, and that would prevent using this for cross-origin
| tracking.
|
| This proposal would help substantially with smaller requests,
| where a custom dictionary would provide better compression, but
| not enough of an improvement to offset the full size of the
| dictionary.
| derf_ wrote:
| _> I wonder what percentage of http traffic is redundant
| compression dictionaries. How much could this actually help in
| theory?_
|
| A lot of times you will send a (possibly large) blob of text
| repeatedly with a few minor changes.
|
| One example I've used in practice is session descriptions (SDP)
| for WebRTC. Any time you add/remove/change a stream, you have
| to renegotiate the session, and this is done by passing an SDP
| blob that describes _all_ of the streams in a session in and
| out of the Javascript Session Establishment Protocol (JSEP) API
| on both sides of the connection. A video conference with dozens
| of participants each with separate audio and video streams
| joining one at a time might require exchanging hundreds or even
| thousands of SDP messages, and in large sessions these can grow
| to be hundreds of kB each, even though only a tiny portion of
| the SDP changes each time.
|
| Now, you could do a lot of work to parse the SDP locally,
| figure out exactly what changed, send just that difference to
| the other side, and have it be smart enough to patch its local
| idea of the current SDP with that difference to feed into JSEP,
| test it on every possible browser, make it robust to future
| changes in the SDP the browser will generate, etc.
|
| OR
|
| You could just send each SDP message compressed using the last
| SDP you sent as the initial dictionary. It will compress
| _really_ well. Even using gzip with the first ~31 kB of the
| previous SDP will get you in the neighborhood of 200:1
| compression[0]. Now your several-hundred kB SDP fits in a
| single MTU.
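|
| (A rough sketch of that approach in Python, using the stdlib
| zlib's preset-dictionary support; the SDP byte strings here are
| placeholders:)
      import zlib
      def compress_sdp(new_sdp: bytes, prev_sdp: bytes) -> bytes:
          # Deflate can only back-reference a 32 kB window, so at
          # most the last 32 kB of the dictionary is actually used.
          comp = zlib.compressobj(level=9, zdict=prev_sdp)
          return comp.compress(new_sdp) + comp.flush()
      def decompress_sdp(blob: bytes, prev_sdp: bytes) -> bytes:
          # The receiver must supply the same dictionary bytes.
          decomp = zlib.decompressobj(zdict=prev_sdp)
          return decomp.decompress(blob) + decomp.flush()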
|
| I'm sure WebRTC is not the only place you will encounter this
| kind of pattern.
|
| [0] Not that either gzip or using part of a document as a
| dictionary is supported by this draft.
| toast0 wrote:
| > Now, you could do a lot of work to parse the SDP locally,
| figure out exactly what changed, send just that difference to
| the other side, and have it be smart enough to patch its
| local idea of the current SDP with that difference to feed
| into JSEP, test it on every possible browser, make it robust
| to future changes in the SDP the browser will generate, etc.
|
| I work with WebRTC outside a browser context, and we signal
| calls with simpler data structures and generate the SDP right
| near the WebRTC API. Our SFU doesn't ever see an SDP, because
| SDP is a big messy format around simpler parts --- it's
| easier to just communicate the simpler parts, and marshal
| into SDP at the API border. Even for 1:1 calls, we signal the
| parts, and then both ends generate the SDP to feed to WebRTC.
|
| IMHO, you're going to have to test your generated SDPs
| everywhere anyway, regardless of whether clients or servers
| generate them.
|
| Well managed compression would certainly help reduce SDP size
| in transit, but reduce, reuse, recompress in that order.
| bawolff wrote:
| Sheesh, the google conspiracy theories are getting out of hand.
| Why would you use this for setting cookies when you could just
| use cookies? If for some reason you don't want to use real
| cookies, why wouldn't you just use the cache side channel
| directly?
|
| This doesn't really add any fingerprinting that doesn't already
| exist.
| giantrobot wrote:
| Because groups want to track people _without_ cookies. This
| re-introduces the problem of shared caches. Site A can know
| if someone visited Site B by whether or not they have the
| dictionary for some shared resource. Site A never has to show
| a cookie banner to do this tracking because there's no
| cookie set. This is just reintroducing the "feature" of
| cross-site tracking with shared caches in the browser.
| bawolff wrote:
| Are you sure? I would assume that available dictionaries
| would be partitioned by site (eTLD+1).
| patrickmeenan wrote:
| Nope. The dictionaries are partitioned the same way as the
| caches and cookies (whichever is partitioned more
| aggressively for a given browser). Usually by site and
| frame, so it opens no cross-site vectors.
| jepler wrote:
| this was my first thought as well. Authors just acknowledge it
| and move on; it's not like shopify and google care whether
| there's another way to successfully track users online.
        10. Privacy Considerations
        Since dictionaries are advertised in future requests
        using the hash of the content of the dictionary, it is
        possible to abuse the dictionary to turn it into a
        tracking cookie.
| patrickmeenan wrote:
| Which is why they are treated as if they were cookies and are
| cleared any time the cache or cookies are cleared, so they
| cannot provide an additional tracking vector beyond what
| cookies can do (and when 3rd-party cookies are partitioned by
| site/frame, the dictionaries are partitioned the same way).
|
| There are LOTS of privacy teams within the respective
| companies, W3C and IETF that have looked it over to make sure
| that it does not open any new abuse vectors. It's worth
| noting that Google, Mozilla and Apple are all supportive of
| the spec and have all been involved over the last year.
| SahAssar wrote:
| This follows the same origin rules as cookies and caching, so
| it is not useful for tracking any more than those already are.
|
| As for when this is useful, there are many examples:
|
| * Map tiles (like OSM, Mapbox, Google Maps, etc.) often contain
| a lot of repeated tokens/data, but the tiles are served as
| individual requests
|
| * Code splitting of CSS/JS, where they share a lot of the same
| tokens but you don't want to have to load the full bundle on
| every load since much of it won't be used
|
| * Any site where the same terms and/or parts are recurring
| across multiple pages (for many sites like IMDB or similar I'd
| guess 80% of the HTML is non-unique per request)
|
| This has been tried before with SDCH, which unfortunately died;
| hopefully this goes better.
| patrickmeenan wrote:
| FWIW, this addresses the BREACH/CRIME issues that killed SDCH
| by only operating on CORS-readable content.
|
| It also solves the problem that SDCH had where the dictionary
| would be forced on the response and the client would have to
| go fetch it if it didn't have it before it could process the
| response.
|
| The tooling for generating dictionaries is also WAY better
| than it was back in the SDCH days (and they are a much
| cleaner format, being arbitrary byte strings that Brotli and
| ZStandard can back-reference).
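|
| For example, with the third-party python-zstandard bindings, any
| previous payload can be used as a raw "prefix" dictionary (this
| illustrates the general mechanism, not the draft's dcb/dcz wire
| format; the file names are placeholders):
      import zstandard as zstd
      prev = open("app.v1.js", "rb").read()  # version already cached
      new = open("app.v2.js", "rb").read()   # version to deliver
      d = zstd.ZstdCompressionDict(
          prev, dict_type=zstd.DICT_TYPE_RAWCONTENT)
      delta = zstd.ZstdCompressor(dict_data=d).compress(new)
      out = zstd.ZstdDecompressor(dict_data=d).decompress(delta)
      assert out == new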
|
| Lots of people involved in working on it were also involved
| with SDCH and have been trying to find a workable solution
| since it had to be turned down.
| magicalist wrote:
| > _I'm guessing the primary use case for this will be setting
| cookies._
|
| The cache is partitioned by document and resource origins, so
| you might as well just use first party cookies at that point
| (or etags if you insist on being sneaky).
| freeqaz wrote:
| This is the link you want if you want to read the actual meaty
| contents of this page. Took me a sec to find.
| https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-com...
| wmf wrote:
| Interesting that they removed SDCH in 2017 and now they're adding
| it back. Let's hope this version sticks.
| devinplatt wrote:
| I was curious about this given what happened with SDCH.
|
| Here is what Wikipedia[0] says
|
| > Due to the diffing results and the data being compressed with
| the same coding, SDCH dictionaries aged relatively quickly and
| compression density became quickly worse than with the usual
| non-dictionary compression such as GZip. This created extra
| effort in production to keep the dictionaries fresh and reduced
| its applicability. Modern dictionary coding such as Shared
| Brotli has a more effective solution for this that fixes the
| dictionary aging problem.
|
| This new proposal uses Brotli.
|
| [0]: https://en.m.wikipedia.org/wiki/SDCH
| patrickmeenan wrote:
| SDCH was removed when SPECTRE became a thing (CRIME/BREACH)
| because it was open to side-channel attacks.
|
| Yes, it had other problems, not the least of which was that
| it would block the processing of a response while a client
| fetched the dictionary, but the side-channel attacks were
| what killed it.
|
| The compression dictionary transport work addresses all of
| the known issues that we had with SDCH and we're cautiously
| optimistic that this will be around for a long time.
| londons_explore wrote:
| This is going to go the same way as HTTP/2 Server Push (removed
| from Chrome in 2022).
|
| Great tech promising amazing efficiencies and speedups, but
| programmers are too lazy to use it correctly, so it sees barely
| any use, and the few places where it is used correctly are
| overshadowed by the places where it is used wrongly and hurts
| performance.
|
| Lesson: Tech that leads to the same page loading slightly faster
| generally won't be used unless it is fully automatic and enabled
| by default.
|
| Http push required extra config on the server to decide what to
| push. This requires extra headers to determine what content to
| compress against (and the server to store that old/common
| content).
|
| Neither will succeed because the vast majority of developers
| don't care about that last little bit of loading speed.
| magicalist wrote:
| > _Http push required extra config on the server to decide what
| to push. This requires extra headers to determine what content
| to compress against (and the server to store that old/common
| content)._
|
| This does seem like an advanced use case you won't generally
| want to set up manually, but it is pretty different than push.
| Push required knowing what to push, which depends on what a
| page loads but also when the page loads it, which itself
| depends on the speed of the client. Mess that up and you can
| actually slow down the page load (like pushing resources the
| client already started downloading).
|
| Compression dictionaries could be as simple as vercel or
| whoever supporting deltas for your last n deployments, at which
| point the header to include is trivial.
| londons_explore wrote:
| > Push required knowing what to push,
|
| Could have been as simple as a once-per-resource webpack build
| step which detected all resources downloaded per page (easy
| with a headless Chromium) and then pushed all of those (or
| all of the non-cacheable ones if the client sends a recent
| cookie).
|
| Yet frameworks and build systems didn't do that, and the
| feature was never really used.
|
| > you can actually slow down the page load (like pushing
| resources the client already started downloading).
|
| Any semi-smart server can prevent that, because it already
| knows what is currently being transferred to the client, or
| was already transferred in this session. There is no race
| condition as long as the server refuses to double-send
| resources. Granted, some server side architectures make this
| hard without the load balancer being aware.
|
| This compression dictionary will see the exact same fate IMO.
| bcoates wrote:
| Any evidence this actually causes material performance
| improvement?
|
| Pre-shared compression dictionaries are rarely seen in the wild
| because they rarely provide meaningful benefit, particularly in
| the cases you care about most.
| CaptainOfCoit wrote:
| I guess you should outline the cases you care the most about,
| so anyone can answer whether there are any material
| performance improvements.
| hansvm wrote:
| The only requirements are serving a lot of the same "kind" of
| content, enough of it (or some other business reason) that it
| doesn't make sense to send it all at once, and somebody willing
| to spend a bit of time to implement it. Map and local
| geocoding information comes to mind as a decent application,
| and if the proposed implementation weren't so hyper-focused
| on common compression standards you could probably do
| something fancy for, e.g., meme sites (with a finite number
| of common or similar background images).
|
| Assuming (perhaps pessimistically) it won't work well for
| most sites serving images, other possibilities include:
|
| - Real-estate searches (where people commonly tweak tons of
| parameters and geographies, returning frequently duplicated
| information like "This historical unit has an amazing view in
| each of its 2 bedrooms").
|
| - Expanding that a bit, any product search where you expect a
| person to, over time, frequently query for the same sorts of
| stuff.
|
| - Local business information (how common is it to open at
| 10am and close at 6pm every weekday in some specific locale
| for example).
|
| ...
|
| If the JS, image, and structural HTML payloads dwarf
| everything else then maybe it won't matter in practice. I bet
| somebody could make good use of it though.
| therein wrote:
| Yup, we did experiment with SDCH at a large social network
| around 2015 and it didn't massively outperform gzip.
| Outperforming it required creating a pipeline for dynamic
| dictionary generation and distribution.
| itsthecourier wrote:
| SDCH?
| morsch wrote:
| Shared Dictionary Compression for HTTP
|
| https://en.m.wikipedia.org/wiki/SDCH
| csswizardry wrote:
| Huge, huge improvements. I've been following this spec keenly.
|
| 1. https://compression-dictionary-transport-threejs-
| demo.glitch...
|
| 2. https://compression-dictionary-transport-shop-
| demo.glitch.me...
| fnordpiglet wrote:
| If you care about latency or are on very low-bandwidth or
| noisy connections, preshared dictionaries matter a lot. By and
| large they always help, but they come with complexity, so they
| are often avoided in favor of a simpler approach whose
| compression is acceptable rather than optimal. But if there's a
| clear and well implemented standard that's widely adopted I
| would always choose preshared dictionaries. Likewise secure
| connections are hard and without TLS and other standards most
| people wouldn't try unless they really needed it. But with a
| good standard and broad implementation it's basically
| ubiquitous.
| itsthecourier wrote:
| Bro, this stuff will be wild for our iot stuff
| vitus wrote:
| The one example I can think of with a pre-seeded dictionary
| (for web, no less) is Brotli.
|
| https://datatracker.ietf.org/doc/html/rfc7932#appendix-A
|
| You can more or less see what it looks like (per an older
| commit):
| https://github.com/google/brotli/blob/5692e422da6af1e991f918...
|
| Certainly it performs better than gzip by itself.
|
| Some historical discussion:
| https://news.ycombinator.com/item?id=19678985
| duskwuff wrote:
| A more readable version of the Brotli dictionary:
|
| https://gist.github.com/duskwuff/8a75e1b5e5a06d768336c8c7c37...
| patrickmeenan wrote:
| Absolutely, for the use cases where it makes sense. There are
| some examples here: https://github.com/WICG/compression-
| dictionary-transport/blo...
|
| In the web case, it mostly only makes sense if users are using
| a site for more than one page (or over a long time).
|
| Some of the common places where it can have a huge impact:
|
| - Delta-updating WASM or JS/CSS code between releases, like the
| YouTube player JavaScript or Adobe Web's WASM code. Instead of
| downloading the whole thing again, the version in the user's
| cache can be used as a "dictionary" and just the deltas for the
| update are delivered. Typically this is 90-99% smaller than
| using Brotli with no dictionary (a rough sketch of the header
| exchange is below, after this list).
|
| - Lazy-loading a site-specific dictionary for the HTML content.
| Pages after the first one can use the dictionary and just load
| the page-specific content (compresses away the headers,
| template, common phrases, logos, inline SVGs or data URIs,
| etc.). This usually makes the HTML 60-90% smaller depending on
| how much unique content is in the HTML (there is a LOT of site
| boilerplate).
|
| - JSON APIs can load a dictionary that has the keys and common
| values and basically yield a binary format on the wire for JSON
| data, compressing out all of the verbosity.
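|
| Roughly, the header exchange for the delta-update case looks
| like this (header names are from the draft; the URL pattern and
| the hash placeholder are illustrative). On the first visit the
| bundle is fetched normally:
      GET /static/app.v1.js HTTP/1.1
      Accept-Encoding: gzip, br, zstd, dcb, dcz
| and the response marks itself as a dictionary for future
| requests that match the pattern:
      HTTP/1.1 200 OK
      Content-Encoding: br
      Use-As-Dictionary: match="/static/app*.js"
| On a later visit, the browser advertises the cached dictionary
| by its hash:
      GET /static/app.v2.js HTTP/1.1
      Accept-Encoding: gzip, br, zstd, dcb, dcz
      Available-Dictionary: :<base64 SHA-256 of app.v1.js>:
| and the server can respond with just the delta, compressed
| against that dictionary as dictionary-compressed Brotli:
      HTTP/1.1 200 OK
      Content-Encoding: dcb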
|
| I expect we're still just scratching the surface of how they
| will be used but the results are pretty stunning if you have a
| site with regular user engagement.
|
| FWIW, they are not "pre-shared", so it doesn't help the first
| visit to a site. They can use existing requests for delta
| updates, or the dictionaries can be loaded on demand, but it is
| up to the site to load them (and create them).
|
| It will probably fall over if it gets hit too hard, but there
| is some tooling here that can generate dictionaries for you
| (using the brotli dictionary generator) and let you test the
| effectiveness: https://use-as-dictionary.com/
| jiggawatts wrote:
| > if you have a site with regular user engagement.
|
| Ah, gotcha: this is a new Google standard that helps Google
| sites when browsed using Google Chrome.
|
| Everyone else will discover that keeping the previous version
| of a library around at build time doesn't fit into the
| typical build process and won't bother with this feature.
|
| Only Facebook and _maybe_ half a dozen similar other orgs
| will enable this and benefit.
|
| The Internet top-to-bottom is owned by the G in FAANG with
| FAAN just along for the ride.
| dexterdog wrote:
| Why can't we implement caching on the integrity sha of the
| library so libraries can be shared across sites? Sure, there is
| technically an attack vector there, but tampering is pretty easy
| to scan for by verifying against trusted CDNs.
|
| With something like that you could preload all of the most common
| libraries in the background.
| demurgos wrote:
| The problem is timing attacks leaking the navigation history.
| An attacker can load a lib shared with some other site. Based
| on the load time, they may then learn whether the user visited
| the target site recently.
| drdaeman wrote:
| Can't we record the original download timings and replicate
| them artificially, with some noise added for privacy? Of
| course, with a reasonable upper latency threshold (derived from
| overall network performance metrics) to prevent DoS on
| clients.
|
| While this won't improve load times, it can save bandwidth,
| improving user experience on slow connections. And most
| popular hand-selected library versions can get preloaded by
| the browser itself and listed as exceptions with zero added
| latency.
|
| Also, this way the networking telemetry would gain a
| meaningful purpose for end users, not just developers.
| manwe150 wrote:
| My understanding had been that you get a double whammy of
| there being too many versions of the "common" dependencies,
| so it is already in effect a per-site cache (or nearly so),
| but that uniqueness also means a fingerprint can be
| established with fairly few latency tests.
| dexterdog wrote:
| How would you know the original site if it's a common asset?
| josephg wrote:
| No need. Just make a custom asset referenced only by those
| sites, and get the browser to cache it.
| kevincox wrote:
| It is interesting that a proxy won't be able to see the complete
| response anymore. It will see the dictionary ID and hash, but
| without a copy of the dictionary the server's response won't be
| fully intelligible to it.
| patrickmeenan wrote:
| If the proxy is correctly handling the Accept-Encoding
| (rewriting it with only encodings that it understands), it can
| either remove the `dcb` and `dcz` encodings or it can check if
| it knows the announced dictionary and only allow them through
| if it has the dictionary.
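|
| A minimal sketch of that rewrite (plain Python, not tied to any
| particular proxy):
      def strip_dictionary_encodings(accept_encoding: str) -> str:
          # Forward only the content codings this proxy can decode;
          # "dcb" and "dcz" are the dictionary-compressed Brotli and
          # Zstandard codings defined by the draft.
          kept = [t.strip() for t in accept_encoding.split(",")
                  if t.strip().split(";")[0].strip().lower()
                  not in ("dcb", "dcz")]
          return ", ".join(kept)
      # e.g. "gzip, br, zstd, dcb, dcz" -> "gzip, br, zstd"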
|
| MITM devices that just inspect and fail on unknown content-
| encoding values will have a problem and will need to be updated
| (there is an enterprise policy to disable the feature in Chrome
| for that situation until the proxies can be updated).
___________________________________________________________________
(page generated 2024-09-15 23:01 UTC)