[HN Gopher] Compression efficiency with shared dictionaries in C...
       ___________________________________________________________________
        
       Compression efficiency with shared dictionaries in Chrome
        
       Author : chamoda
       Score  : 104 points
       Date   : 2024-03-06 12:29 UTC (10 hours ago)
        
 (HTM) web link (developer.chrome.com)
 (TXT) w3m dump (developer.chrome.com)
        
       | ComputerGuru wrote:
        | This seems like a possibly huge user/browser fingerprinting
        | vector. Yes, CORS has been taken into account, but for massive
        | touch-surface origins (Google, Facebook, doubleclick, etc) this
        | certainly has concerning ramifications.
       | 
       | It's also insanely complicated. All this effort, so many possible
       | tuples of (shared dictionary, requested resource), none of which
       | make sense to compress on-the-fly per-request, mean it's
       | specifically for the benefit of a select few sites.
       | 
       | When I saw the headline I thought that Chrome would ship with
       | specific dictionaries (say one for js, one for css, etc) and
       | advertise them and you could use the same server-side. But this
       | is really convoluted.
        
         | strongpigeon wrote:
         | > [...] mean it's specifically for the benefit of a select few
         | sites.
         | 
          | It does seem like the ones who benefit from this are large web
          | applications that often ship incremental changes. Which, to be
          | fair, are the ones that can use the most help.
         | 
          | This has the potential to move the needle from "the app
          | takes 10 seconds to load" to "it loads instantly" for these
         | scenarios. Say what you want about the fact that maybe they
         | should optimize their stuff better, this does give them an easy
         | out.
         | 
         | That being said, yeah this is really convoluted and does seem
         | like a big fingerprinting surface.
        
         | wongarsu wrote:
         | Don't want to set session cookies? Just provide user-specific
         | compression dictionaries and use them as your session id! After
         | all, how is the user supposed to notice they got a different
          | dictionary than everyone else?
        
           | hinkley wrote:
           | Same problem with etags.
        
         | dspillett wrote:
         | _> I thought that Chrome would ship with specific dictionaries
         | (say one for js, one for css, etc) and advertise them and you
         | could use the same server-side. But this is really convoluted._
         | 
         | More convoluted, but I expect using an old version as the
         | source for the dictionary will yield significantly better
         | results than a generic dictionary for that type of file.
         | 
         | Of course it doesn't help the first load, which might be more
         | noticeable than subsequent loads when not every object has been
          | modified. Perhaps having a standard dictionary for each type
          | for the first request, and using a specific one when an old
          | version is available, would give noticeable extra benefit for
          | those first requests for minimal extra implementation effort.
        
       | jgrahamc wrote:
        | The very first project I worked on at Cloudflare back in 2012 was
       | a delta compression-based service called Railgun. We installed
       | software both on the customer's web server and on our end and
       | thus were able to automatically manage shared dictionaries (in
        | this case versions of pages sent over Railgun were used as
       | dictionaries automatically). You definitely get incredible
       | compression results.
       | 
       | https://blog.cloudflare.com/cacheing-the-uncacheable-cloudfl...
       | 
        | I am glad to see that things have moved on from SDCH. It'll be
        | interesting to see how this measures up in the real world.
        
         | Scaevolus wrote:
         | Delta compression is a huge win for many applications, but it
         | takes a careful hand to make it work well, and inevitably it
         | gets deprecated as the engineers move on and bandwidth stops
         | being a focus-- just like Railgun has been deprecated!
         | https://blog.cloudflare.com/deprecating-railgun
         | 
         | Maybe the basic problem is with how hard it is to find
         | engineers passionate about performance AND compression?
        
           | jgrahamc wrote:
           | I don't think your characterization of why Railgun was
           | deprecated is accurate. From the blog post you link to:
           | 
           | "I use Railgun for performance improvements."
           | 
           | Cloudflare has invested significantly in performance upgrades
           | in the eight years since the last release of Railgun. This
           | list is not comprehensive, but highlights some areas where
           | performance can be significantly improved by adopting newer
           | services relative to using Railgun.
           | 
           | Cloudflare Tunnel features Cloudflare's Argo Smart Routing
           | technology, a service that delivers both "middle mile" and
           | last mile optimization, reducing round trip time by up to
           | 40%. Web assets using Argo perform, on average, 30% faster
           | overall.
           | 
           | Cloudflare Network Interconnect (CNI) gives customers the
           | ability to directly connect to our network, either virtually
           | or physically, to improve the reliability and performance of
           | the connection between Cloudflare's network and your
           | infrastructure. CNI customers have a dedicated on-ramp to
           | Cloudflare for their origins.
        
             | Scaevolus wrote:
             | Right, but isn't that part of the general trend of
             | bandwidth becoming far cheaper in the last decade along
             | with dynamic HTML becoming a smaller fraction of total
             | transit?
             | 
             | A 95%+ reduction in bandwidth usage for dynamic server-
             | side-rendered HTML is much less important in 2023 than
             | 2013.
        
               | Twirrim wrote:
               | Unless you're part of the large majority of people in the
               | world on slower mobile networks. We keep designing and
                | building for people with broadband / wifi, and missing
                | out on just how big the 3G / lousy-latency markets are.
        
               | jgrahamc wrote:
               | I think it's related to the size of the Cloudflare
               | network and how good its connectivity is (and our own
               | fibre backbone). But on the eyeball side bandwidth isn't
               | the only game in town: latency is the silent killer.
        
         | lynguist wrote:
         | I might be naive but isn't that what rsync is doing?
        
       | TacticalCoder wrote:
       | > Available-Dictionary:
       | :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:
       | 
       | The savings are nice in the best case (like in TFA: switching
       | from version 1.3.4 to 1.3.6 of a lib or whatever) but that Base64
       | encoded hash is not compressible and so this line basically adds
       | 60+ bytes to the request.
       | 
       | Kinda ouch for when it's going to be a miss?
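        | 
        | For context, the full exchange looks roughly like this (a sketch
        | based on the article; the URL and match pattern are made up).
        | First visit, the response nominates itself as a dictionary:
        | 
        |     HTTP/1.1 200 OK
        |     Content-Type: text/javascript
        |     Use-As-Dictionary: match="/static/app-*.js"
        | 
        | A later request for a matching URL then advertises the hash, and
        | the server can answer with a dictionary-compressed Brotli or
        | Zstandard response, or ignore it and fall back to plain
        | compression on a miss:
        | 
        |     GET /static/app-1.3.6.js HTTP/1.1
        |     Available-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=: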
        
         | dspillett wrote:
         | Maybe.
         | 
          | Though from the client side 60 bytes is likely not really
          | noticeable [1] as a delay in the request send. Perhaps the
          | server side, which sees many, many client requests, will see
          | an uptick in incoming bandwidth used, but in most cases
          | servers responding to HTTP(S) requests see a _lot_ more
          | outgoing traffic (response sizes are much larger than request
          | sizes, on average), so they have enough incoming bandwidth
          | "spare" that it is not going to be saturated to the point
          | where this has a significant effect.
         | 
         | --
         | 
          | [1] if the link is slow enough that several lots of 60 bytes
          | is going to have much effect [2] it likely also has such high
          | latency that the difference is dwarfed by the existing delays.
          | 
          | [2] a spotty GPRS connection? is anything slower than that in
          | common use anywhere?
        
         | nevir wrote:
         | What are the chances that the ~60 bytes are going to push the
         | request over the frame size and end up splitting into another
         | packet?
        
         | sillysaurusx wrote:
         | Clearly we'll need to use a shared dictionary to compress this.
        
         | sethev wrote:
         | If 60 bytes per request is a material overhead, then your
         | workload is unlikely to benefit from general purpose
         | compression of any kind.
        
           | pornel wrote:
            | Upload is usually slower, more latency sensitive, and suffers
            | from TCP slow start. Pages also make lots of small requests,
           | so header overhead can add up. HTTP/2 added header
           | compression for these reasons.
        
         | lozenge wrote:
          | It might be compressible. HTTP/3 includes compression of
          | request headers, and Base64 only uses 64 of the 256 possible
          | byte values, so it's somewhat compressible.
        
         | tarasglek wrote:
          | The Chrome team usually trials changes like this with
          | extensive A/B testing via telemetry. It's got to be a large
          | overall win even with this.
        
       | eyelidlessness wrote:
       | I agree with other comments concerned with fingerprinting, and it
       | was my second thought reading through the article. But my first
       | thought was how beneficial this could be for return visitors of a
       | web app, and how it could similarly benefit related concerns,
       | such as managing local caches for offline service workers.
       | 
       | True, for _documents_ (as is another comment's focus) this is
       | perhaps overkill. Although even there, a benefit could be
       | imagined for a large body of documents--it's unclear whether this
       | case is addressed, but it certainly could be with appropriate
       | support across say preload links[0]. But if "the web is for
       | documents, not apps" isn't the proverbial hill you're prepared to
       | die on, this is a very compelling story for web apps.
       | 
       | I don't know if it's so compelling that it outweighs privacy
       | implications, but I expect the other browser engines will have
       | some good insights on that.
       | 
       | 0: https://developer.mozilla.org/en-
       | US/docs/Web/HTML/Attributes...
        
       | ramses0 wrote:
       | This plus native web-components is an incredible advance for "the
       | web".
       | 
       | Fingerprinting concerns aside (compression == timing attacks in
       | the general case), the fact that it's nearly network-transparent
       | and framework/webserver compatible is incredible!
        
       | Sigliotio wrote:
       | That should be used together with ML models.
       | 
        | Image compression for example, or voice and video compression
        | like what Nvidia does.
        | 
        | But I do like this implementation focusing on libs, why not?
        
       | matsemann wrote:
        | How would dictionaries pre-made with JS in mind, shipped in the
        | browsers, fare? Aka instead of making a custom dictionary per
        | resource I send to the user, I could say that "my scripts.js
        | file uses the browser's built-in js-es2023-abc dictionary". So
        | the browsers would have some dictionaries others could reuse.
        | 
        | What are the savings of that approach vs a gzipped file without
        | any dictionary?
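        | 
        | A rough way to measure that locally with the Python zstandard
        | bindings, assuming you already have some candidate dictionary
        | file (all file names here are placeholders):
        | 
        |     import zstandard as zstd
        | 
        |     raw = open("scripts.js", "rb").read()
        |     # hypothetical pre-built "JS in general" dictionary
        |     d = zstd.ZstdCompressionDict(open("js.dict", "rb").read())
        | 
        |     plain = zstd.ZstdCompressor(level=19).compress(raw)
        |     dicted = zstd.ZstdCompressor(level=19, dict_data=d).compress(raw)
        |     print(len(raw), len(plain), len(dicted))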
        
         | saagarjha wrote:
         | So Brotli already contains a dictionary that is trained on web
         | traffic. I think the thing here is that Google wants to make
         | sending YouTube 1.1 more efficient if you already have YouTube
         | 1.0, but they can't put YouTube 1.0 into the browser.
        
           | hinkley wrote:
           | This is something game devs have been doing for decades.
           | 
           | If you want to delta 1.0 to 1.1 that's server side work you
           | do once at deployment or build time, not on every request.
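            | 
            | E.g. the zstd CLI can precompute such a delta once per
            | release with --patch-from (file names made up):
            | 
            |     zstd --patch-from=app-1.0.js app-1.1.js -o app-1.0-to-1.1.zst
            |     # a client that still has 1.0 decodes with the same flag
            |     zstd -d --patch-from=app-1.0.js app-1.0-to-1.1.zst -o app-1.1.js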
        
             | Wingy wrote:
             | What happens when you release 1.2 and someone who has 1.0
             | visits? Do you generate a delta for every past version at
             | build time?
        
               | hinkley wrote:
               | You mean if a user who hasn't visited the site in a year
               | comes back?
               | 
               | They download 1.2 because 1.0 is no longer in their
               | browser cache, that's what.
               | 
               | The web is easier than games because "files at rest" are
               | much more volatile on the web.
        
       | saagarjha wrote:
       | Even putting aside CORS because I don't even want to think about
       | how this plays well with requests to another (tracking?) domain,
       | this still doesn't seem worth it. The explicit use case seems to
       | be that it basically tells the server when you last visited the
       | site based on which dictionary you have and then it gives you the
       | moral equivalent of a delta update. Except, most browsers are
       | working hard to expire data of this kind for privacy reasons.
       | What's the lifetime of these dictionaries going to be? I can see
       | it being ok if it's like 1 day but if this outlives how long
       | cookies are stored it's a significant privacy problem. The user
       | visits the site again and essentially a cookie gets sent to the
       | server? The page says "don't put user-specific data in the
       | request" but like nobody is stopping a website from doing this.
        
         | jiripospisil wrote:
         | > Dictionary entries (or at least the metadata) should be
         | cleared any time cookies are cleared.
         | 
         | So it seems it should not get you anything you cannot already
         | do with cookies.
         | 
         | https://github.com/WICG/compression-dictionary-transport/blo...
        
           | twotwotwo wrote:
           | It's interesting this is mentioned specifically about the
           | metadata used by this feature: fingerprinting using this
           | feature has similarities with other cache fingerprinting
            | (I wrote a sibling comment about that).
           | 
           | It's not actively bad to have defense-in-depth measures at
           | the level of the dictionary feature. But if your
           | implementation of dictionaries using your browser's existing
           | cache policies is a privacy problem, I'd consider changing
           | the cache, not just the shared-dictionary implementation.
        
         | charcircuit wrote:
          | Currently the max is temporarily capped at 30 days; otherwise
          | it would work as long as the dictionary is in the cache.
         | 
         | https://source.chromium.org/chromium/chromium/src/+/main:ser...
        
         | twotwotwo wrote:
         | I think fingerprinting using this is mostly like the more
         | direct ways to fingerprint with the cache, and the defenses
         | against one are the defenses against the other.
         | 
         | For the cross-site thing, cache partitioning is the defense. If
         | the cache of facebook.com/file is independent for a.com and
         | b.com, Facebook can't link the visits.
         | 
         | An attacker using the hash of a cached resource as a pseudo-
         | cookie could previously use the _content_ of the resource as
         | the pseudo-cookie. The Use-As-Dictionary wildcard allows
         | cleverer implementations, but it seems like you can fingerprint
          | for the same time period, in the same circumstances, as before.
         | In both cases you might do your tracking by ignoring how you're
         | supposed to be using the feature; as you note, no one's
         | stopping you.
         | 
          | Before and after the compression feature, it is true that
          | anti-tracking laws, etc. should address tracking with
          | persistent storage in general, not only cookies, much as they
          | need to handle localStorage or other hiding places for data.
          | It's also true that for a browser to robustly defend against
          | linking two visits to the same domain (or limit the
          | possibility of tracking to a certain time period, session,
          | origin, etc.), caching is one of the things it has to limit.
         | 
         | I think if they get the expiry, partitioning, etc. right (or
         | wrong) for stopping cache fingerprinting, they also get it
         | right (or wrong) for this.
         | 
         | I was admittedly a fan of the original SDCH that didn't take
         | off, figuring that inter-resource redundancy is a thing. It's a
         | neat spin on it to use the compression algo history windows
          | instead of purpose-built diff tools, and use the existing
          | cache instead of a dictionary store to the side. Seems easier to
         | implement on both ends compared to the previous try. I could
         | see this being helpful for quickly kicking off page load, maybe
         | especially for non-SPAs and imperfectly optimized sites that
         | repeat a not-tiny header across loads.
        
         | hinkley wrote:
         | I think I'd feel better with a fixed set of dictionaries based
         | on a corpus that gets updated every year to match new patterns
         | of traffic and specifications. Even if it's less efficient.
        
           | pyrolistical wrote:
           | Ya. Where is accept-encoding: zstandard-d-es2024
           | 
           | Where it encodes js files with a known dictionary that is
           | ideal for es2024
        
             | hinkley wrote:
             | And here's one tuned for react, and one for svelte...
        
       | falsandtru wrote:
        | Doesn't the fact that the server sends different data for the
        | same resource mean that SRI (Subresource Integrity) checks
        | cannot be performed? As for fingerprinting, it would not be a
        | problem since it is the same as with ETags.
       | 
       | https://developer.mozilla.org/en-US/docs/Web/Security/Subres...
        
         | charcircuit wrote:
         | SRI hashes the decompressed resource
        
       | skybrian wrote:
        | I wonder if this would be a good alternative to minifying
        | JavaScript and having separate sourcemaps?
        
         | kevingadd wrote:
         | JS minification will probably never die, because it makes
         | parsing meaningfully faster.
        
           | adgjlsfhk1 wrote:
           | the fact that the default on the web is to ship something
           | that needs a parser is very silly.
        
             | kevingadd wrote:
             | Depending on how you look at it, Java, .NET and WebAssembly
             | all need parsers too, they just happen to be parsing a
             | binary format instead of text.
        
               | adgjlsfhk1 wrote:
                | Yes, and technically so does x86, but there's a pretty
                | big difference between formats where the data is
                | normalized and expected to be correct and formats that
                | are intended for users and need to do things like name
                | resolution and error checking. Parsing a language made
                | for machines is easy to do faster than you can read the
                | data from RAM, while parsing a high-level language will
                | often happen at <100mbps.
        
         | madeofpalk wrote:
         | Not really.
         | 
          | Compressing JavaScript already gives you tonnes of benefits,
          | but syntax-aware compression (modifying the JS) gives you
          | more.
          | 
          | Besides, this is a form of more efficient caching, in that it
          | only benefits subsequent visits.
        
       | cuckatoo wrote:
       | What stands out to me is that this creates another 'key' that the
       | browser sends on every request which can be fingerprinted or
       | tracked by the server.
       | 
       | I do not want my browser sending anything that looks like it
       | could be used to uniquely identify me. Ever.
       | 
       | I want every request my browser makes to look like any other
       | request made by another user's browser. I understand that this is
       | what Google doesn't want but why can't they just be honest about
       | it? Why come up with these elaborate lies?
       | 
       | Now to limit tracking exposure, in addition to running the
       | AutoCookieDelete extension I'll have to go find some
       | AutoDictionaryDelete extension to go with it. Boy am I glad the
       | internet is getting better every day.
        
         | jsnell wrote:
         | The obvious answer is that they are not lying.
         | 
         | You're making three assertions, none backed by any evidence.
         | That this is a tracking vector, that it's primarily intended to
         | be a tracking vector, and that they're lying about their
         | motivations.
         | 
         | But your reasoning fails already at the first step, since you
          | just assumed malice rather than doing any research. This is not a
         | useful tracking vector. The storage is partitioned by the top
         | window, and it is cleared when cookies are cleared. It's also
         | not really a new tracking vector, it's pretty much the same as
         | ETags.
        
       | jauntywundrkind wrote:
       | The Request For Position on Mozilla Zstd Support (2018) has a ton
       | of interesting discussion on dictionaries.
       | https://github.com/mozilla/standards-positions/issues/105
       | 
        | The original proposal for Zstd was to use a predefined,
        | statically generated dictionary. Mozilla rejected the proposal
        | for that.
        | 
        | But there's a lot of great discussion on what Zstd can do, which
        | is astoundingly flexible & powerful. There's discussion on
        | dynamic adjustment of compression ratios, and discussion around
        | shared dictionaries and their privacy implications. That Mozilla
        | turned around & started supporting Zstd, & has given shared
        | dictionaries a positive "worth prototyping" indicator, is a good
        | initial stamp of approval to see!
       | https://github.com/mozilla/standards-positions/issues/771
       | 
       | One of my main questions after reading this promising update is:
        | how do you pick what to include when generating custom dictionaries?
       | Another comment mentions that brotli has a standard dictionary it
       | uses, and that's some kind of possible starting place. But it
       | feels like tools to build one's custom dictionary would be ideal.
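        | 
        | (Zstd at least ships a trainer: you feed it a corpus of
        | representative files and it emits a dictionary, e.g. something
        | like
        | 
        |     zstd --train samples/*.html -o site.dict --maxdict=131072
        | 
        | with the corpus paths here obviously made up. Picking a good
        | corpus is still the hard part.)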
        
       | netol wrote:
        | The part I'm missing is how these dictionaries are created. Can
        | I use the homepage to create my dictionary, so all other pages
        | that share HTML are more efficiently compressed? How?
        
       | IshKebab wrote:
       | Ah damn I thought this was going to be available to JavaScript.
       | Would be amazing for one use case I have (an HTML page containing
       | inline logs from a load of commands, many of which are
       | substantially similar).
        
         | jauntywundrkind wrote:
         | That would be an excellent web standard!!
         | 
          | There are wasm modules that do something similar, but having
          | it baked into the browser could allow for further optimization
          | than what's possible with wasm. https://github.com/bokuweb/zstd-wasm
         | 
         | I have no idea if it's possible but I wonder if a webgpu port
         | could be made? Alternatively, for your use case, maybe you
          | could try applying something like Basis Universal, a fast
          | compression system for textures that it seems there are some
          | webgpu loaders for... Maybe that could be bent to
          | encoding/decoding text?
        
       | kazinator wrote:
       | With shared dictionaries you can compress everything down to
       | under a byte.
       | 
       | Just put the to-be-compressed item into the shared dictionary,
       | somehow distribute that to everyone, and then the compressed
       | artifact consists of a reference to that item.
       | 
       | If the shared dictionary contains nothing else, it can just be a
       | one-bit message whose meaning is "extract the one and only item
       | out of the dictionary".
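        | 
        | (Toy demonstration with the Python zstandard bindings, treating
        | the payload itself as a raw-content dictionary. Framing overhead
        | means you get a few dozen bytes, for payloads that fit in the
        | match window, rather than one bit, but the point stands.)
        | 
        |     import zstandard as zstd
        | 
        |     payload = open("whole-site.tar", "rb").read()  # placeholder
        |     d = zstd.ZstdCompressionDict(payload,
        |                                  dict_type=zstd.DICT_TYPE_RAWCONTENT)
        |     out = zstd.ZstdCompressor(dict_data=d).compress(payload)
        |     print(len(payload), "->", len(out))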
        
       | raggi wrote:
       | What I really want: dictionaries derived from the standards and
       | standard libraries (perhaps once a year or somesuch), which I'd
       | use independently of build system gunk, and while it wouldn't be
       | the tightest squeeze you can get, it would make my non-built
       | assets get very close to built asset size for small to medium
       | sized deployments.
        
       | jwally wrote:
       | Dumb question, but with respect to fingerprinting - how is this
       | any worse than cookies, service workers, or localstorage?
        
       ___________________________________________________________________
       (page generated 2024-03-06 23:00 UTC)