[HN Gopher] Compression efficiency with shared dictionaries in C...
___________________________________________________________________
Compression efficiency with shared dictionaries in Chrome
Author : chamoda
Score : 104 points
Date : 2024-03-06 12:29 UTC (10 hours ago)
(HTM) web link (developer.chrome.com)
(TXT) w3m dump (developer.chrome.com)
| ComputerGuru wrote:
| This seems like a possibly huge user/browser fingerprint. Yes,
| CORS has been taken into account, but for massive touch surface
| origins (Google, Facebook, doubleclick, etc) this certainly has
| concerning ramifications.
|
| It's also insanely complicated. All this effort, and so many
| possible tuples of (shared dictionary, requested resource), none
| of which make sense to compress on-the-fly per-request, means
| it's specifically for the benefit of a select few sites.
|
| When I saw the headline I thought that Chrome would ship with
| specific dictionaries (say one for js, one for css, etc) and
| advertise them and you could use the same server-side. But this
| is really convoluted.
| strongpigeon wrote:
| > [...] mean it's specifically for the benefit of a select few
| sites.
|
| It does seem like the ones who benefit from this are large web
| applications that often ship incremental changes. Which, to be
| fair, are the ones that can use the most help.
|
| This has the potential of moving the needle from "the app takes
| 10 seconds to load" to "it loads instantly" for these scenarios.
| Say what you want about the fact that maybe they should optimize
| their stuff better, this does give them an easy out.
|
| That being said, yeah this is really convoluted and does seem
| like a big fingerprinting surface.
| wongarsu wrote:
| Don't want to set session cookies? Just provide user-specific
| compression dictionaries and use them as your session id! After
| all, how is the user supposed to notice they got a different
| dictionary than everyone else?
| hinkley wrote:
| Same problem with etags.
| dspillett wrote:
| _> I thought that Chrome would ship with specific dictionaries
| (say one for js, one for css, etc) and advertise them and you
| could use the same server-side. But this is really convoluted._
|
| More convoluted, but I expect using an old version as the
| source for the dictionary will yield significantly better
| results than a generic dictionary for that type of file.
|
| Of course it doesn't help the first load, which might be more
| noticeable than subsequent loads when not every object has been
| modified. Perhaps having a standard dictionary for each type
| for the first request, and using a specific one when an old
| version is available, would give noticeable extra benefit for
| those first requests for minimal extra implementation effort.
| jgrahamc wrote:
| The very first project I worked on at Cloudflare, back in 2012,
| was a delta compression-based service called Railgun. We
| installed software both on the customer's web server and on our
| end and thus were able to automatically manage shared
| dictionaries (in this case, versions of pages sent over Railgun
| were used as dictionaries automatically). You definitely get
| incredible compression results.
|
| https://blog.cloudflare.com/cacheing-the-uncacheable-cloudfl...
|
| I am glad to see that things have moved on from SDCH. Be
| interesting to see how this measures up in the real world.
| Scaevolus wrote:
| Delta compression is a huge win for many applications, but it
| takes a careful hand to make it work well, and inevitably it
| gets deprecated as the engineers move on and bandwidth stops
| being a focus-- just like Railgun has been deprecated!
| https://blog.cloudflare.com/deprecating-railgun
|
| Maybe the basic problem is with how hard it is to find
| engineers passionate about performance AND compression?
| jgrahamc wrote:
| I don't think your characterization of why Railgun was
| deprecated is accurate. From the blog post you link to:
|
| "I use Railgun for performance improvements."
|
| Cloudflare has invested significantly in performance upgrades
| in the eight years since the last release of Railgun. This
| list is not comprehensive, but highlights some areas where
| performance can be significantly improved by adopting newer
| services relative to using Railgun.
|
| Cloudflare Tunnel features Cloudflare's Argo Smart Routing
| technology, a service that delivers both "middle mile" and
| last mile optimization, reducing round trip time by up to
| 40%. Web assets using Argo perform, on average, 30% faster
| overall.
|
| Cloudflare Network Interconnect (CNI) gives customers the
| ability to directly connect to our network, either virtually
| or physically, to improve the reliability and performance of
| the connection between Cloudflare's network and your
| infrastructure. CNI customers have a dedicated on-ramp to
| Cloudflare for their origins.
| Scaevolus wrote:
| Right, but isn't that part of the general trend of
| bandwidth becoming far cheaper in the last decade along
| with dynamic HTML becoming a smaller fraction of total
| transit?
|
| A 95%+ reduction in bandwidth usage for dynamic server-
| side-rendered HTML is much less important in 2023 than
| 2013.
| Twirrim wrote:
| Unless you're part of the large majority of people in the
| world on slower mobile networks. We keep designing and
| building for people with broadband / wifi, and missing just how
| big the 3G / lousy-latency markets are.
| jgrahamc wrote:
| I think it's related to the size of the Cloudflare
| network and how good its connectivity is (and our own
| fibre backbone). But on the eyeball side bandwidth isn't
| the only game in town: latency is the silent killer.
| lynguist wrote:
| I might be naive but isn't that what rsync is doing?
| TacticalCoder wrote:
| > Available-Dictionary:
| :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:
|
| The savings are nice in the best case (like in TFA: switching
| from version 1.3.4 to 1.3.6 of a lib or whatever) but that Base64
| encoded hash is not compressible and so this line basically adds
| 60+ bytes to the request.
|
| Kinda ouch for when it's going to be a miss?
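A rough sketch (Python, with a hypothetical cached file name) of
where those 60+ bytes come from: the client advertises the SHA-256
of the dictionary it holds as a base64 structured-field value.

    import base64, hashlib

    # Hypothetical: the previously cached app.v1.js the browser kept
    # around as a dictionary.
    dictionary = open("app.v1.js", "rb").read()

    digest = hashlib.sha256(dictionary).digest()           # 32 bytes
    value = ":" + base64.b64encode(digest).decode() + ":"  # 46 characters

    header = "Available-Dictionary: " + value
    print(len(header))  # ~68 bytes on every matching request, hit or miss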
| dspillett wrote:
| Maybe.
|
| Though from the client side 60 bytes is likely not really
| noticeable[1] as a delay in the request send. Perhaps the server
| side, which sees many, many client requests, will see an uptick
| in incoming bandwidth used, but in most cases servers responding
| to HTTP(S) requests see a _lot_ more outgoing traffic (response
| sizes are much larger than request sizes, on average), so they
| have enough incoming bandwidth "spare" that it is not going to
| be saturated to the point where this has a significant effect.
|
| --
|
| [1] if the link is slow enough that several lots of 60 bytes is
| going to have much effect[2], it likely also has such high latency
| that the difference is dwarfed by the existing delays.
|
| [2] a spotty GPRS connection? is anything slower than that in
| common use anywhere?
| nevir wrote:
| What are the chances that the ~60 bytes are going to push the
| request over the frame size and end up splitting into another
| packet?
| sillysaurusx wrote:
| Clearly we'll need to use a shared dictionary to compress this.
| sethev wrote:
| If 60 bytes per request is a material overhead, then your
| workload is unlikely to benefit from general purpose
| compression of any kind.
| pornel wrote:
| Upload is usually slower, more latency sensitive, and suffers
| from TCP slow start. Pages also make lots of small requests,
| so header overhead can add up. HTTP/2 added header
| compression for these reasons.
| lozenge wrote:
| It might be compressible. HTTP/3 includes compression of
| request headers, and Base64 only carries 6 bits of entropy per
| byte, so the encoded value is somewhat compressible.
| tarasglek wrote:
| The Chrome team usually trials changes like this with extensive
| A/B testing via telemetry. It's got to be a large overall win
| even with this.
| eyelidlessness wrote:
| I agree with other comments concerned with fingerprinting, and it
| was my second thought reading through the article. But my first
| thought was how beneficial this could be for return visitors of a
| web app, and how it could similarly benefit related concerns,
| such as managing local caches for offline service workers.
|
| True, for _documents_ (as is another comment's focus) this is
| perhaps overkill. Although even there, a benefit could be
| imagined for a large body of documents--it's unclear whether this
| case is addressed, but it certainly could be with appropriate
| support across say preload links[0]. But if "the web is for
| documents, not apps" isn't the proverbial hill you're prepared to
| die on, this is a very compelling story for web apps.
|
| I don't know if it's so compelling that it outweighs privacy
| implications, but I expect the other browser engines will have
| some good insights on that.
|
| 0: https://developer.mozilla.org/en-
| US/docs/Web/HTML/Attributes...
| ramses0 wrote:
| This plus native web-components is an incredible advance for "the
| web".
|
| Fingerprinting concerns aside (compression == timing attacks in
| the general case), the fact that it's nearly network-transparent
| and framework/webserver compatible is incredible!
| Sigliotio wrote:
| That should be used together with ML models.
|
| Image compression, for example, or voice and video compression
| like what Nvidia does.
|
| But I do like this implementation focusing on libs, why not?
| matsemann wrote:
| How would dictionaries built into the browser, pre-made with JS
| in mind, fare? I.e. instead of making a custom dictionary per
| resource I send to the user, I could say "my scripts.js file
| uses the browser's built-in js-es2023-abc dictionary". So the
| browsers would have some dictionaries others could reuse.
|
| What would the savings be with that approach vs a gzipped file
| without any dictionary?
| saagarjha wrote:
| So Brotli already contains a dictionary that is trained on web
| traffic. I think the thing here is that Google wants to make
| sending YouTube 1.1 more efficient if you already have YouTube
| 1.0, but they can't put YouTube 1.0 into the browser.
| hinkley wrote:
| This is something game devs have been doing for decades.
|
| If you want to delta 1.0 to 1.1 that's server side work you
| do once at deployment or build time, not on every request.
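A minimal sketch of that one-time build step, assuming hypothetical
bundle paths and using zstd's dictionary API (the browser feature
negotiates Brotli- or Zstandard-based equivalents of this over HTTP):

    import zstandard as zstd

    # Hypothetical bundles produced by the build.
    old = open("dist/app-1.0.js", "rb").read()
    new = open("dist/app-1.1.js", "rb").read()

    # Use the previous release verbatim as the dictionary and compress
    # the new release against it once, at deploy time, not per request.
    dict_data = zstd.ZstdCompressionDict(old)
    delta = zstd.ZstdCompressor(dict_data=dict_data, level=19).compress(new)

    # Clients that still hold 1.0 can be served this small artifact;
    # everyone else gets the full, conventionally compressed 1.1.
    open("dist/app-1.1.js.zst-from-1.0", "wb").write(delta)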
| Wingy wrote:
| What happens when you release 1.2 and someone who has 1.0
| visits? Do you generate a delta for every past version at
| build time?
| hinkley wrote:
| You mean if a user who hasn't visited the site in a year
| comes back?
|
| They download 1.2 because 1.0 is no longer in their
| browser cache, that's what.
|
| The web is easier than games because "files at rest" are
| much more volatile on the web.
| saagarjha wrote:
| Even putting aside CORS because I don't even want to think about
| how this plays well with requests to another (tracking?) domain,
| this still doesn't seem worth it. The explicit use case seems to
| be that it basically tells the server when you last visited the
| site based on which dictionary you have and then it gives you the
| moral equivalent of a delta update. Except, most browsers are
| working hard to expire data of this kind for privacy reasons.
| What's the lifetime of these dictionaries going to be? I can see
| it being ok if it's like 1 day but if this outlives how long
| cookies are stored it's a significant privacy problem. The user
| visits the site again and essentially a cookie gets sent to the
| server? The page says "don't put user-specific data in the
| request" but like nobody is stopping a website from doing this.
| jiripospisil wrote:
| > Dictionary entries (or at least the metadata) should be
| cleared any time cookies are cleared.
|
| So it seems it should not get you anything you cannot already
| do with cookies.
|
| https://github.com/WICG/compression-dictionary-transport/blo...
| twotwotwo wrote:
| It's interesting this is mentioned specifically about the
| metadata used by this feature: fingerprinting using this
| feature has similarities with other cache fingerprinting
| (wrote a sibling comment about that).
|
| It's not actively bad to have defense-in-depth measures at
| the level of the dictionary feature. But if your
| implementation of dictionaries using your browser's existing
| cache policies is a privacy problem, I'd consider changing
| the cache, not just the shared-dictionary implementation.
| charcircuit wrote:
| Currently the max age is temporarily capped at 30 days; otherwise
| it would work for as long as the dictionary is in the cache.
|
| https://source.chromium.org/chromium/chromium/src/+/main:ser...
| twotwotwo wrote:
| I think fingerprinting using this is mostly like the more
| direct ways to fingerprint with the cache, and the defenses
| against one are the defenses against the other.
|
| For the cross-site thing, cache partitioning is the defense. If
| the cache of facebook.com/file is independent for a.com and
| b.com, Facebook can't link the visits.
|
| An attacker using the hash of a cached resource as a pseudo-
| cookie could previously use the _content_ of the resource as
| the pseudo-cookie. The Use-As-Dictionary wildcard allows
| cleverer implementations, but it seems like you can fingerprint
| for the same time period /in the same circumstances as before.
| In both cases you might do your tracking by ignoring how you're
| supposed to be using the feature; as you note, no one's
| stopping you.
|
| Before and after the compression feature, it is true that anti-
| tracking laws, etc. should address tracking with persistent
| storage in general, not only cookies, much as they need to
| handle localStorage or other hiding places for data. Also true
| that for a browser to robustly defend against linking two
| visits to the same domain (or limit the possibility of tracking
| to a certain time period, session, origin, etc.), caching is
| one of the things it has to limit.
|
| I think if they get the expiry, partitioning, etc. right (or
| wrong) for stopping cache fingerprinting, they also get it
| right (or wrong) for this.
|
| I was admittedly a fan of the original SDCH that didn't take
| off, figuring that inter-resource redundancy is a thing. It's a
| neat spin on it to use the compression algo history windows
| instead of purpose-built diff tools, and to use the existing
| cache instead of a dictionary store to the side. Seems easier to
| implement on both ends compared to the previous try. I could
| see this being helpful for quickly kicking off page load, maybe
| especially for non-SPAs and imperfectly optimized sites that
| repeat a not-tiny header across loads.
| hinkley wrote:
| I think I'd feel better with a fixed set of dictionaries based
| on a corpus that gets updated every year to match new patterns
| of traffic and specifications. Even if it's less efficient.
| pyrolistical wrote:
| Ya. Where is accept-encoding: zstandard-d-es2024
|
| Where it encodes js files with a known dictionary that is
| ideal for es2024
| hinkley wrote:
| And here's one tuned for react, and one for svelte...
| falsandtru wrote:
| Doesn't the fact that the same resource can be sent as different
| data mean that SRI (Subresource Integrity) checks cannot be
| performed? As for fingerprinting, it would not be a problem since
| it is the same as with ETags.
|
| https://developer.mozilla.org/en-US/docs/Web/Security/Subres...
| charcircuit wrote:
| SRI hashes the decompressed resource
| skybrian wrote:
| I wonder if this would be a good alternative to minifying
| JavaScript and having separate sourcemaps?
| kevingadd wrote:
| JS minification will probably never die, because it makes
| parsing meaningfully faster.
| adgjlsfhk1 wrote:
| the fact that the default on the web is to ship something
| that needs a parser is very silly.
| kevingadd wrote:
| Depending on how you look at it, Java, .NET and WebAssembly
| all need parsers too, they just happen to be parsing a
| binary format instead of text.
| adgjlsfhk1 wrote:
| yes, and technically so does x86, but there's a pretty
| big difference between formats where the data is
| normalized and expected to be correct and formats that
| are intended for users and need to do things like name
| resolution and error checking. Parsing a language made
| for machines is easy to do faster than you can read the
| data from ram, while parsing a high level language will
| often happen at <100mbps
| madeofpalk wrote:
| Not really.
|
| Compressing JavaScript already gives you tonnes of benefits,
| but syntax-aware compression (minification, which modifies the
| JS) gives you more.
|
| Besides, this is a form of more efficient caching, in that it
| only benefits subsequent visits.
| cuckatoo wrote:
| What stands out to me is that this creates another 'key' that the
| browser sends on every request which can be fingerprinted or
| tracked by the server.
|
| I do not want my browser sending anything that looks like it
| could be used to uniquely identify me. Ever.
|
| I want every request my browser makes to look like any other
| request made by another user's browser. I understand that this is
| what Google doesn't want but why can't they just be honest about
| it? Why come up with these elaborate lies?
|
| Now to limit tracking exposure, in addition to running the
| AutoCookieDelete extension I'll have to go find some
| AutoDictionaryDelete extension to go with it. Boy am I glad the
| internet is getting better every day.
| jsnell wrote:
| The obvious answer is that they are not lying.
|
| You're making three assertions, none backed by any evidence.
| That this is a tracking vector, that it's primarily intended to
| be a tracking vector, and that they're lying about their
| motivations.
|
| But your reasoning fails already at the first step, since you
| just assumed malice rather than doing any research. This is not a
| useful tracking vector. The storage is partitioned by the top
| window, and it is cleared when cookies are cleared. It's also
| not really a new tracking vector, it's pretty much the same as
| ETags.
| jauntywundrkind wrote:
| The Request For Position on Mozilla Zstd Support (2018) has a ton
| of interesting discussion on dictionaries.
| https://github.com/mozilla/standards-positions/issues/105
|
| The original proposal for Zstd was to use a predefined,
| statistically generated dictionary. Mozilla rejected the
| proposal for that.
|
| But there's a lot of great discussion on what Zstd can do, which
| is astoundingly flexible & powerful. There's discussion on
| dynamic adjustment of compression ratios, and discussion around
| shared dictionaries and their privacy implications. That Mozilla
| turned around & started supporting Zstd, and has stamped a
| positive "worth prototyping" indicator on shared dictionaries,
| is a good initial stamp of approval to see!
| https://github.com/mozilla/standards-positions/issues/771
|
| One of my main questions after reading this promising update is:
| how do you pick what to include when generating custom
| dictionaries? Another comment mentions that Brotli has a standard
| dictionary it uses, and that's some kind of possible starting
| place. But it feels like tools to build one's own custom
| dictionary would be ideal.
| netol wrote:
| The part I'm missing is how these dictionaries are created. Can I
| use the homepage to create my dictionary, so all other pages that
| share HTML are more efficiently compressed? How?
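Besides the "previous version of the same file" flow described in
the article, a dictionary can also be trained on a sample of pages
that share markup; a sketch with zstd's trainer (file names made up):

    import glob
    import zstandard as zstd

    # Hypothetical corpus: pages that share a lot of boilerplate HTML.
    # (The trainer wants a reasonable number of samples.)
    samples = [open(p, "rb").read() for p in glob.glob("pages/*.html")]

    # Train a ~64 KB dictionary capturing the shared substrings.
    trained = zstd.train_dictionary(64 * 1024, samples)
    open("site.dict", "wb").write(trained.as_bytes())

    # Serve site.dict once (flagged with a Use-As-Dictionary pattern
    # covering the pages), then compress each page against it.
    compressor = zstd.ZstdCompressor(dict_data=trained)
    page = open("pages/about.html", "rb").read()
    print(len(page), len(compressor.compress(page)))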
| IshKebab wrote:
| Ah damn I thought this was going to be available to JavaScript.
| Would be amazing for one use case I have (an HTML page containing
| inline logs from a load of commands, many of which are
| substantially similar).
| jauntywundrkind wrote:
| That would be an excellent web standard!!
|
| There are wasm modules that do something similar, but having it
| baked into the browser could allow for further optimization than
| what's possible with wasm. https://github.com/bokuweb/zstd-wasm
|
| I have no idea if it's possible, but I wonder if a WebGPU port
| could be made? Alternatively, for your use case, maybe you
| could try applying something like Basis Universal, a fast
| compression system for textures that it seems there are some
| WebGPU loaders for... Maybe that could be bent to
| encoding/decoding text?
| kazinator wrote:
| With shared dictionaries you can compress everything down to
| under a byte.
|
| Just put the to-be-compressed item into the shared dictionary,
| somehow distribute that to everyone, and then the compressed
| artifact consists of a reference to that item.
|
| If the shared dictionary contains nothing else, it can just be a
| one-bit message whose meaning is "extract the one and only item
| out of the dictionary".
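The degenerate case is easy to reproduce; in practice framing costs
a few dozen bytes rather than one bit, but the point stands (sketch
with zstd, payload made up):

    import os
    import zstandard as zstd

    payload = os.urandom(1 << 17)  # 128 KB of otherwise incompressible data

    # Put the whole payload into the shared dictionary...
    d = zstd.ZstdCompressionDict(payload)

    # ...and the "compressed" artifact is little more than a back-reference.
    frame = zstd.ZstdCompressor(dict_data=d).compress(payload)
    print(len(payload), len(frame))  # 131072 vs. a tiny handful of bytes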
| raggi wrote:
| What I really want: dictionaries derived from the standards and
| standard libraries (perhaps once a year or somesuch), which I'd
| use independently of build system gunk, and while it wouldn't be
| the tightest squeeze you can get, it would make my non-built
| assets get very close to built asset size for small to medium
| sized deployments.
| jwally wrote:
| Dumb question, but with respect to fingerprinting - how is this
| any worse than cookies, service workers, or localstorage?
___________________________________________________________________
(page generated 2024-03-06 23:00 UTC)