[HN Gopher] Compression Dictionary Transport
       ___________________________________________________________________
        
       Compression Dictionary Transport
        
       Author : tosh
       Score  : 55 points
       Date   : 2024-09-15 15:02 UTC (7 hours ago)
        
 (HTM) web link (datatracker.ietf.org)
 (TXT) w3m dump (datatracker.ietf.org)
        
       | hedora wrote:
       | I'm guessing the primary use case for this will be setting
        | cookies. (Google Fonts or whatever sends different dictionaries
        | to different clients, so the decompression result is unique per
        | client.)
       | 
       | I wonder what percentage of http traffic is redundant compression
       | dictionaries. How much could this actually help in theory?
        
         | JoshTriplett wrote:
         | > I'm guessing the primary use case for this will be setting
         | cookies.
         | 
         | There's a huge difference between "this could potentially be
         | used to fingerprint a client" and "this will primarily be used
         | to fingerprint a client".
         | 
         | There are _many_ potential fingerprinting mechanisms that rely
         | on tracking what the client already has in its cache. Clients
         | could easily partition their cached dictionaries by origin,
         | just as they could partition other cached information by
         | origin, and that would prevent using this for cross-origin
         | tracking.
         | 
          | This proposal would help substantially with smaller requests,
          | where a custom dictionary would compress better, but not enough
          | better to offset the cost of transferring the dictionary itself.
        
         | derf_ wrote:
         | _> I wonder what percentage of http traffic is redundant
         | compression dictionaries. How much could this actually help in
         | theory?_
         | 
         | A lot of times you will send a (possibly large) blob of text
         | repeatedly with a few minor changes.
         | 
         | One example I've used in practice is session descriptions (SDP)
         | for WebRTC. Any time you add/remove/change a stream, you have
         | to renegotiate the session, and this is done by passing an SDP
         | blob that describes _all_ of the streams in a session in and
          | out of the JavaScript Session Establishment Protocol (JSEP) API
         | on both sides of the connection. A video conference with dozens
         | of participants each with separate audio and video streams
         | joining one at a time might require exchanging hundreds or even
         | thousands of SDP messages, and in large sessions these can grow
         | to be hundreds of kB each, even though only a tiny portion of
         | the SDP changes each time.
         | 
         | Now, you could do a lot of work to parse the SDP locally,
         | figure out exactly what changed, send just that difference to
         | the other side, and have it be smart enough to patch its local
         | idea of the current SDP with that difference to feed into JSEP,
         | test it on every possible browser, make it robust to future
         | changes in the SDP the browser will generate, etc.
         | 
         | OR
         | 
         | You could just send each SDP message compressed using the last
         | SDP you sent as the initial dictionary. It will compress
         | _really_ well. Even using gzip with the first ~31 kB of the
         | previous SDP will get you in the neighborhood of 200:1
         | compression[0]. Now your several-hundred kB SDP fits in a
         | single MTU.
         | 
         | I'm sure WebRTC is not the only place you will encounter this
         | kind of pattern.
         | 
          | [0] Note that neither gzip nor using part of a document as a
          | dictionary is supported by this draft.
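          | 
          | (If anyone wants to try the trick itself: a minimal sketch
          | with Python's zlib and its preset-dictionary (zdict) support.
          | The 31 kB slice mirrors the figure above, since zlib's window
          | tops out at 32 kB; function names are made up.)
          | 
          |     import zlib
          | 
          |     WINDOW = 31 * 1024  # zlib can only reference ~32 kB back
          | 
          |     def compress_sdp(sdp: bytes, prev_sdp: bytes) -> bytes:
          |         # Prime the compressor with the previously-sent SDP.
          |         co = zlib.compressobj(level=9, zdict=prev_sdp[:WINDOW])
          |         return co.compress(sdp) + co.flush()
          | 
          |     def decompress_sdp(blob: bytes, prev_sdp: bytes) -> bytes:
          |         # The receiver must hold the identical previous SDP.
          |         do = zlib.decompressobj(zdict=prev_sdp[:WINDOW])
          |         return do.decompress(blob) + do.flush()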
        
           | toast0 wrote:
           | > Now, you could do a lot of work to parse the SDP locally,
           | figure out exactly what changed, send just that difference to
           | the other side, and have it be smart enough to patch its
           | local idea of the current SDP with that difference to feed
           | into JSEP, test it on every possible browser, make it robust
           | to future changes in the SDP the browser will generate, etc.
           | 
           | I work with WebRTC outside a browser context, and we signal
            | calls with simpler data structures and generate the SDP right
            | near the WebRTC API. Our SFU doesn't ever see an SDP, because
           | SDP is a big messy format around simpler parts --- it's
           | easier to just communicate the simpler parts, and marshal
           | into SDP at the API border. Even for 1:1 calls, we signal the
           | parts, and then both ends generate the SDP to feed to WebRTC.
           | 
           | IMHO, you're going to have to test your generated SDPs
            | everywhere anyway, regardless of whether clients or servers
           | generate them.
           | 
           | Well managed compression would certainly help reduce SDP size
           | in transit, but reduce, reuse, recompress in that order.
        
         | bawolff wrote:
         | Sheesh, the google conspiracy theories are getting out of hand.
         | Why would you use this for setting cookies when you could just
          | use cookies? If for some reason you don't want to use real
          | cookies, why wouldn't you just use the cache side channel
         | directly?
         | 
          | This doesn't really add any fingerprinting that doesn't already
         | exist.
        
           | giantrobot wrote:
           | Because groups want to track people _without_ cookies. This
           | re-introduces the problem of shared caches. Site A can know
           | if someone visited Site B by whether or not they have the
           | dictionary for some shared resource. Site A never has to show
            | a cookie banner to do this tracking because there's no
           | cookie set. This is just reintroducing the "feature" of
           | cross-site tracking with shared caches in the browser.
        
             | bawolff wrote:
             | Are you sure? I would assume that available dictionaries
              | would be partitioned by site (eTLD+1).
        
             | patrickmeenan wrote:
             | Nope. The dictionaries are partitioned the same way as the
             | caches and cookies (whichever is partitioned more
              | aggressively for a given browser), usually by site and
              | frame, so it opens no cross-site vectors.
        
         | jepler wrote:
          | this was my first thought as well. The authors just acknowledge
          | it and move on; it's not like Shopify and Google care whether
          | there's another way to successfully track users online.
          | 
          |     10.  Privacy Considerations
          | 
          |        Since dictionaries are advertised in future requests
          |        using the hash of the content of the dictionary, it is
          |        possible to abuse the dictionary to turn it into a
          |        tracking cookie.
        
           | patrickmeenan wrote:
           | Which is why they are treated as if they are cookies and are
           | cleared any time the cache or cookies are cleared so that
            | they cannot provide an additional tracking vector beyond
           | what cookies can do (and when 3rd party cookies are
           | partitioned by site/frame, they are also partitioned the
           | same).
           | 
           | There are LOTS of privacy teams within the respective
           | companies, W3C and IETF that have looked it over to make sure
           | that it does not open any new abuse vectors. It's worth
           | noting that Google, Mozilla and Apple are all supportive of
           | the spec and have all been involved over the last year.
        
         | SahAssar wrote:
          | This follows the same origin rules as cookies and caching, so
          | it is not useful for tracking any more than those already are.
          | 
          | As to when this is useful, there are many examples:
         | 
          | * Map tiles (like OSM, Mapbox, Google Maps, etc.) often have
          | many repeated tokens/data but are served as individual
          | requests
         | 
         | * Code splitting of CSS/JS, where they share a lot of the same
         | tokens but you don't want to have to load the full bundle on
          | every load since much of it won't be used
         | 
         | * Any site where the same terms and/or parts are recurring
         | across multiple pages (for many sites like IMDB or similar I'd
         | guess 80% of the HTML is non-unique per request)
         | 
          | This has been tried before with SDCH, which unfortunately
          | died; hopefully this goes better.
        
           | patrickmeenan wrote:
           | FWIW, this addresses the BREACH/CRIME issues that killed SDCH
           | by only operating on CORS-readable content.
           | 
           | It also solves the problem that SDCH had where the dictionary
           | would be forced on the response and the client would have to
           | go fetch it if it didn't have it before it could process the
           | response.
           | 
           | The tooling for generating dictionaries is also WAY better
           | than it was back in the SDCH days (and they are a much
           | cleaner format, being arbitrary byte strings that Brotli and
           | ZStandard can back-reference).
           | 
           | Lots of people involved in working on it were also involved
           | with SDCH and have been trying to find a workable solution
           | since it had to be turned down.
        
         | magicalist wrote:
         | > _I'm guessing the primary use case for this will be setting
         | cookies._
         | 
         | The cache is partitioned by document and resource origins, so
          | you might as well just use first-party cookies at that point
          | (or ETags if you insist on being sneaky).
        
       | freeqaz wrote:
        | This is the link you want for the actual meaty contents of this
        | page. Took me a sec to find.
       | https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-com...
        
       | wmf wrote:
       | Interesting that they removed SDCH in 2017 and now they're adding
       | it back. Let's hope this version sticks.
        
         | devinplatt wrote:
         | I was curious about this given what happened with SDCH.
         | 
          | Here is what Wikipedia[0] says:
         | 
         | > Due to the diffing results and the data being compressed with
         | the same coding, SDCH dictionaries aged relatively quickly and
         | compression density became quickly worse than with the usual
         | non-dictionary compression such as GZip. This created extra
         | effort in production to keep the dictionaries fresh and reduced
         | its applicability. Modern dictionary coding such as Shared
         | Brotli has a more effective solution for this that fixes the
         | dictionary aging problem.
         | 
         | This new proposal uses Brotli.
         | 
         | [0]: https://en.m.wikipedia.org/wiki/SDCH
        
           | patrickmeenan wrote:
            | SDCH was removed around the time SPECTRE became a thing,
            | because it was open to side-channel attacks (CRIME/BREACH).
           | 
           | Yes, it had other problems, not the least of which was that
           | it would block the processing of a response while a client
           | fetched the dictionary, but the side-channel attacks were
           | what killed it.
           | 
           | The compression dictionary transport work addresses all of
           | the known issues that we had with SDCH and we're cautiously
           | optimistic that this will be around for a long time.
        
       | londons_explore wrote:
       | This is going to go the same way as HTTP/2 Server Push (removed
       | from Chrome in 2022).
       | 
        | Great tech promising amazing efficiencies and speedups, but
        | programmers are too lazy to use it correctly, so it sees barely
        | any use, and the few places where it is used correctly are
        | overshadowed by the places where it is used wrongly, hurting
        | performance.
       | 
        | Lesson: Tech that leads to the same page loading slightly faster
        | generally won't be used unless it is fully automatic and
        | enabled by default.
       | 
        | HTTP Push required extra config on the server to decide what to
       | push. This requires extra headers to determine what content to
       | compress against (and the server to store that old/common
       | content).
       | 
       | Neither will succeed because the vast majority of developers
       | don't care about that last little bit of loading speed.
        
         | magicalist wrote:
          | > _HTTP Push required extra config on the server to decide what
          | to push. This requires extra headers to determine what content
          | to compress against (and the server to store that old/common
          | content)._
         | 
         | This does seem like an advanced use case you won't generally
          | want to set up manually, but it is pretty different from push.
         | Push required knowing what to push, which depends on what a
         | page loads but also when the page loads it, which itself
         | depends on the speed of the client. Mess that up and you can
         | actually slow down the page load (like pushing resources the
         | client already started downloading).
         | 
          | Compression dictionaries could be as simple as Vercel or
          | whoever supporting deltas for your last n deployments, at which
          | point the header to include is trivial.
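          | 
          | (For a sense of how mechanical that header is, a sketch
          | assuming only the hashing scheme in the current draft: the
          | client announces a stored dictionary as the SHA-256 hash of
          | its bytes, sent as a structured-field byte sequence. The
          | function name is made up.)
          | 
          |     import base64
          |     import hashlib
          | 
          |     def available_dictionary(dictionary: bytes) -> str:
          |         # Value for the Available-Dictionary request header:
          |         # a structured-field byte sequence (":...:") holding
          |         # the SHA-256 hash of the dictionary's bytes.
          |         digest = hashlib.sha256(dictionary).digest()
          |         return ":" + base64.b64encode(digest).decode() + ":"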
        
           | londons_explore wrote:
           | > Push required knowing what to push,
           | 
            | Could have been as simple as a once-per-resource webpack
            | build step which detected all resources downloaded per page
            | (easy with a headless Chromium) and then pushed all of those
            | (or all of the non-cacheable ones if the client sends a
            | recent cookie).
           | 
           | Yet frameworks and build systems didn't do that, and the
           | feature was never really used.
           | 
           | > you can actually slow down the page load (like pushing
           | resources the client already started downloading).
           | 
           | Any semi-smart server can prevent that, because it already
           | knows what is currently being transferred to the client, or
           | was already transferred in this session. There is no race
           | condition as long as the server refuses to double-send
            | resources. Granted, some server-side architectures make this
            | hard without the load balancer being aware.
           | 
           | This compression dictionary will see the exact same fate IMO.
        
       | bcoates wrote:
       | Any evidence this actually causes material performance
       | improvement?
       | 
       | Pre-shared compression dictionaries are rarely seen in the wild
       | because they rarely provide meaningful benefit, particularly in
       | the cases you care about most.
        
         | CaptainOfCoit wrote:
          | I guess you should outline the cases you care most about, for
          | anyone to be able to answer whether there are any material
          | performance improvements.
        
           | hansvm wrote:
            | The only requirements are serving a lot of the same "kind" of
            | content, enough of it (or some other business reason) that it
            | doesn't make sense to send it all at once, and somebody
            | willing to spend a bit of time to implement it. Map and local
           | geocoding information comes to mind as a decent application,
           | and if the proposed implementation weren't so hyper-focused
           | on common compression standards you could probably do
           | something fancy for, e.g., meme sites (with a finite number
           | of common or similar background images).
           | 
           | Assuming (perhaps pessimistically) it won't work well for
           | most sites serving images, other possibilities include:
           | 
           | - Real-estate searches (where people commonly tweak tons of
           | parameters and geographies, returning frequently duplicated
           | information like "This historical unit has an amazing view in
           | each of its 2 bedrooms").
           | 
           | - Expanding that a bit, any product search where you expect a
           | person to, over time, frequently query for the same sorts of
           | stuff.
           | 
           | - Local business information (how common is it to open at
           | 10am and close at 6pm every weekday in some specific locale
           | for example).
           | 
           | ...
           | 
           | If the JS, image, and structural HTML payloads dwarf
           | everything else then maybe it won't matter in practice. I bet
           | somebody could make good use of it though.
        
         | therein wrote:
         | Yup, we did experiment with SDCH at a large social network
         | around 2015 and it didn't massively outperform gzip.
         | Outperforming it required creating a pipeline for dynamic
         | dictionary generation and distribution.
        
           | itsthecourier wrote:
            | SDCH?
        
             | morsch wrote:
             | Shared Dictionary Compression for HTTP
             | 
             | https://en.m.wikipedia.org/wiki/SDCH
        
         | csswizardry wrote:
         | Huge, huge improvements. I've been following this spec keenly.
         | 
         | 1. https://compression-dictionary-transport-threejs-
         | demo.glitch...
         | 
         | 2. https://compression-dictionary-transport-shop-
         | demo.glitch.me...
        
         | fnordpiglet wrote:
          | If you care about latency or are on very low-bandwidth or
          | noisy connections, preshared dictionaries matter a lot. By and
          | large they help, but they come with complexity, so they are
          | often avoided in favor of a simpler approach whose compression
          | is acceptable rather than optimal. But if there's a
         | clear and well implemented standard that's widely adopted I
         | would always choose preshared dictionaries. Likewise secure
         | connections are hard and without TLS and other standards most
         | people wouldn't try unless they really needed it. But with a
         | good standard and broad implementation it's basically
         | ubiquitous.
        
         | itsthecourier wrote:
          | Bro, this stuff will be wild for our IoT stuff
        
         | vitus wrote:
         | The one example I can think of with a pre-seeded dictionary
         | (for web, no less) is Brotli.
         | 
         | https://datatracker.ietf.org/doc/html/rfc7932#appendix-A
         | 
         | You can more or less see what it looks like (per an older
         | commit):
         | https://github.com/google/brotli/blob/5692e422da6af1e991f918...
         | 
         | Certainly it performs better than gzip by itself.
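          | 
          | A quick way to see the effect yourself (a sketch, assuming
          | the brotli package from PyPI; "page.html" is a hypothetical
          | small HTML-ish sample you have lying around):
          | 
          |     import zlib
          |     import brotli  # pip install brotli
          | 
          |     sample = open("page.html", "rb").read()
          |     # zlib deflate is gzip minus the header, close enough
          |     # for a size comparison.
          |     print("zlib  :", len(zlib.compress(sample, 9)))
          |     print("brotli:", len(brotli.compress(sample, quality=11)))
          |     # Part of brotli's edge on small web payloads is its
          |     # built-in static dictionary of common web strings.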
         | 
         | Some historical discussion:
         | https://news.ycombinator.com/item?id=19678985
        
           | duskwuff wrote:
           | A more readable version of the Brotli dictionary:
           | 
            | https://gist.github.com/duskwuff/8a75e1b5e5a06d768336c8c7c37...
        
         | patrickmeenan wrote:
         | Absolutely, for the use cases where it makes sense. There are
         | some examples here: https://github.com/WICG/compression-
         | dictionary-transport/blo...
         | 
         | In the web case, it mostly only makes sense if users are using
         | a site for more than one page (or over a long time).
         | 
         | Some of the common places where it can have a huge impact:
         | 
          | - Delta-updating WASM or JS/CSS code between releases. Like the
          | YouTube player JavaScript or Adobe Web's WASM code. Instead of
         | downloading the whole thing again, the version in the user's
         | cache can be used as a "dictionary" and just the deltas for the
         | update can be delivered. Typically this is 90-99% smaller than
         | using Brotli with no dictionary.
         | 
         | - Lazy-loading a site-specific dictionary for the HTML content.
         | Pages after the first one can use the dictionary and just load
         | the page-specific content (compresses away the headers,
          | template, common phrases, logos, inline SVGs or data URIs,
         | etc). This usually makes the HTML 60-90% smaller depending on
         | how much unique content is in the HTML (there is a LOT of site
         | boilerplate).
         | 
          | - JSON APIs can load a dictionary that has the keys and common
         | values and basically yield a binary format on the wire for JSON
         | data, compressing out all of the verbosity.
         | 
         | I expect we're still just scratching the surface of how they
         | will be used but the results are pretty stunning if you have a
         | site with regular user engagement.
         | 
          | FWIW, they are not "pre-shared", so this doesn't help the first
          | visit to a site. They can use existing requests for delta
          | updates, or the dictionaries can be loaded on demand, but it is
          | up to the site to load them (and create them).
         | 
         | It will probably fall over if it gets hit too hard, but there
         | is some tooling here that can generate dictionaries for you
         | (using the brotli dictionary generator) and let you test the
         | effectiveness: https://use-as-dictionary.com/
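          | 
          | If you want to see the delta-update effect locally, here is a
          | rough sketch with the python-zstandard bindings, using the
          | previous release of a bundle as a raw "content" dictionary
          | (the same idea dcz applies on the wire; function names are
          | made up):
          | 
          |     import zstandard
          | 
          |     def delta_compress(new: bytes, old: bytes) -> bytes:
          |         # Back-reference the old release directly, rather
          |         # than using a trained dictionary.
          |         d = zstandard.ZstdCompressionDict(
          |             old, dict_type=zstandard.DICT_TYPE_RAWCONTENT)
          |         cctx = zstandard.ZstdCompressor(dict_data=d, level=19)
          |         return cctx.compress(new)
          | 
          |     def delta_decompress(delta: bytes, old: bytes) -> bytes:
          |         # The client rebuilds the new release from the delta
          |         # plus the old release already in its cache.
          |         d = zstandard.ZstdCompressionDict(
          |             old, dict_type=zstandard.DICT_TYPE_RAWCONTENT)
          |         dctx = zstandard.ZstdDecompressor(dict_data=d)
          |         return dctx.decompress(delta)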
        
           | jiggawatts wrote:
           | > if you have a site with regular user engagement.
           | 
           | Ah, gotcha: this is a new Google standard that helps Google
           | sites when browsed using Google Chrome.
           | 
           | Everyone else will discover that keeping the previous version
           | of a library around at build time doesn't fit into the
           | typical build process and won't bother with this feature.
           | 
           | Only Facebook and _maybe_ half a dozen similar other orgs
           | will enable this and benefit.
           | 
           | The Internet top-to-bottom is owned by the G in FAANG with
           | FAAN just along for the ride.
        
       | dexterdog wrote:
        | Why can't we implement caching on the integrity SHA of the
        | library so libraries can be shared across sites? Sure, there is
        | technically an attack vector there, but tampering is pretty easy
        | to scan for by verifying against trusted CDNs.
       | 
       | With something like that you could preload all of the most common
       | libraries in the background.
        
         | demurgos wrote:
          | The problem is timing attacks leaking the navigation history.
          | An attacker can load a lib shared with some other site. Based
          | on the load time, they may then learn whether the user visited
          | the target site recently.
        
           | drdaeman wrote:
           | Can't we record the original download timings and replicate
           | them artificially, with some noise added for privacy? Of
           | course, with reasonable upper latency threshold (derived from
           | overall network performance metrics) to prevent DoS on
           | clients.
           | 
           | While this won't improve load times, it can save bandwidth,
           | improving user experience on slow connections. And most
           | popular hand-selected library versions can get preloaded by
           | the browser itself and listed as exceptions with zero added
           | latency.
           | 
           | Also, this way the networking telemetry would gain a
           | meaningful purpose for end users, not just developers.
        
             | manwe150 wrote:
              | My understanding had been that you get a double whammy:
              | there are too many versions of the "common" dependencies,
              | so it is already in effect a per-site cache (or nearly so),
              | but that uniqueness also means a fingerprint can be
              | established with fairly few latency tests.
        
           | dexterdog wrote:
           | How would you know the original site if it's a common asset?
        
             | josephg wrote:
             | No need. Just make a custom asset referenced only by those
             | sites, and get the browser to cache it.
        
       | kevincox wrote:
        | It is interesting that a proxy won't be able to see the complete
        | response anymore. It will see the dictionary ID and hash, but
        | without a copy of the dictionary, the server's response won't be
        | fully intelligible to it.
        
         | patrickmeenan wrote:
         | If the proxy is correctly handling the Accept-Encoding
         | (rewriting it with only encodings that it understands), it can
         | either remove the `dcb` and `dcz` encodings or it can check if
         | it knows the announced dictionary and only allow them through
         | if it has the dictionary.
         | 
         | MITM devices that just inspect and fail on unknown content-
         | encoding values will have a problem and will need to be updated
         | (there is an enterprise policy to disable the feature in Chrome
         | for that situation until the proxies can be updated).
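          | 
          | For the Accept-Encoding option, a minimal sketch of the
          | rewrite (the proxy strips dcb/dcz, and anything else it can't
          | decode, so the origin falls back to a plain encoding; the
          | function name is made up):
          | 
          |     SUPPORTED = {"gzip", "br", "zstd", "identity"}
          | 
          |     def rewrite_accept_encoding(header: str) -> str:
          |         # Keep only codings this proxy can decode, preserving
          |         # any q-values on the tokens it keeps.
          |         kept = []
          |         for token in header.split(","):
          |             coding = token.split(";", 1)[0].strip().lower()
          |             if coding in SUPPORTED:
          |                 kept.append(token.strip())
          |         return ", ".join(kept)
          | 
          |     # rewrite_accept_encoding("gzip, br, zstd, dcb, dcz")
          |     # -> "gzip, br, zstd"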
        
       ___________________________________________________________________
       (page generated 2024-09-15 23:01 UTC)