[HN Gopher] Bit-sync: synchronizing arbitrary data using rsync a...
___________________________________________________________________
Bit-sync: synchronizing arbitrary data using rsync algorithm in
pure JavaScript
Author : ingve
Score : 97 points
Date : 2021-05-16 11:12 UTC (2 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| blixt wrote:
| Also check out the Rabin rolling hash algorithm for smart
| chunking of files, which can be useful either to dedupe dynamic
| content or to sync files efficiently.
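|
| Roughly, content-defined chunking with a rolling hash looks
| something like this (a simplified sketch, not a real Rabin
| fingerprint; the window size and boundary mask are arbitrary
| illustration values):
|
|     // Toy content-defined chunker: a Rabin-Karp style rolling hash
|     // over a small sliding window; a boundary is declared whenever
|     // the low bits of the hash are zero (~8 KiB average chunks for
|     // a 13-bit mask), so boundaries follow content, not offsets.
|     function chunkBoundaries(data, windowSize = 48, maskBits = 13) {
|       const MOD = 1000000007;            // large prime modulus
|       const BASE = 257;                  // polynomial base
|       const mask = (1 << maskBits) - 1;
|
|       // BASE^windowSize mod MOD, used to drop the oldest byte
|       let basePow = 1;
|       for (let i = 0; i < windowSize; i++) basePow = (basePow * BASE) % MOD;
|
|       const boundaries = [];
|       let hash = 0;
|       for (let i = 0; i < data.length; i++) {
|         hash = (hash * BASE + data[i]) % MOD;
|         if (i >= windowSize) {
|           hash = (hash - (data[i - windowSize] * basePow) % MOD + MOD) % MOD;
|         }
|         if (i >= windowSize && (hash & mask) === 0) boundaries.push(i + 1);
|       }
|       boundaries.push(data.length);
|       return boundaries;
|     }
|
| Because boundaries depend only on nearby bytes, inserting data near
| the start of a file shifts chunk contents locally instead of
| invalidating every fixed-size block after the insertion point.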
| hardwaresofton wrote:
| Related (rsync in the browser):
|
| https://webdeltasync.github.io/
| claytongulick wrote:
| This is a cool project!
|
| Looks like they use my bit-sync lib for the actual rsync part
| [1] which is neat.
|
| I should modernize this lib so that it's easier for them to add
| as a dependency.
|
| https://github.com/WebDeltaSync/WebRsync/blob/master/public/...
| zmj wrote:
| The rsync algorithm is pretty cool. I implemented it in C# a
| couple of years ago: https://github.com/zmj/rsync-delta
| cshipley wrote:
| Does this only work with files or can any source be used as
| long as it is a Stream? Basically, I'd love to have something
| like this for arbitrary data, not just files.
|
| Also, looks like the README has a broken link:
| https://librsync.github.io/rdiff.html
| zmj wrote:
| Yep, Streams or Pipes for IO. No seeking, forward-only reads
| - using this with network in/out was my goal.
|
| Thanks for pointing out the broken link. Looks like it's here
| now: https://librsync.github.io/page_rdiff.html
| tyingq wrote:
| _" The md5 algorithm needs to be rewritten with performance in
| mind"_
|
| Poked around a little, and this area seems a bit under-researched
| since MD5 went out of favor for many use cases. For example, the
| SubtleCrypto digest doesn't support it.
|
| There's this WASM implementation[1] that shows 2-37 MB/sec,
| presumably on fairly modern hardware. But I can also find an old
| pure JavaScript implementation[2] that shows 5-7 MB/sec on an
| ancient Pentium 4 running Opera. Makes you wonder how it would
| fare with a recent Chrome/V8 and hardware. There are online tests,
| but they use small inputs, and Chrome is fast enough that I get a
| lot of NaN/Infinity results.
|
| Maybe it would be better to use SHA-1 or some other digest
| supported by the browser's built-in SubtleCrypto API. It wouldn't
| technically be rsync, but this package seems to be intended for
| use on both sides anyway.
|
| [1] https://www.npmjs.com/package/md5-wasm
|
| [2] http://www.myersdaily.org/joseph/javascript/md5-speed-
| test.h...
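|
| For the curious, swapping the strong per-block hash over to
| SubtleCrypto is roughly this shape (just a sketch, not how bit-sync
| does it today; the function name and block size are my own):
|
|     // Hash each fixed-size block of an ArrayBuffer with the
|     // browser's built-in SubtleCrypto (SHA-1 here) instead of a
|     // pure-JS MD5. crypto.subtle only exists in secure contexts.
|     async function blockDigests(buffer, blockSize = 1024) {
|       const digests = [];
|       for (let offset = 0; offset < buffer.byteLength; offset += blockSize) {
|         const block = buffer.slice(offset, offset + blockSize);
|         const hash = await crypto.subtle.digest('SHA-1', block);
|         digests.push(new Uint8Array(hash));   // 20-byte digest per block
|       }
|       return digests;
|     }
|
| Since the same library runs on both ends, nothing forces the block
| hash to stay MD5; as noted above, it just wouldn't technically be
| rsync any more.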
| claytongulick wrote:
| Hi, author here. It's funny you mention this, I was just
| revisiting this lib and thinking that SubtleCrypto would be a
| better option. Also interesting idea to use the md5-wasm,
| thanks for the link.
|
| It's been years since I made this, probably due for a little
| love.
| londons_explore wrote:
| I'd like to see something like this used for big JavaScript
| bundles.
|
| Many webapps now ship 2MB+ of JavaScript, and lots of web
| companies do daily or even hourly software releases, which
| effectively means that every time I visit your site I have to
| download many megabytes of JavaScript, and that takes many
| seconds. It's frustrating. Diffing the JavaScript and sending just
| the changed lines of code would save hundreds of millions of
| people a few seconds each day, and is totally worth it.
| rakoo wrote:
| All of this has already happened, and all of this will happen
| again.
|
| - The original rsync author wrote rproxy
| (https://rproxy.samba.org/), an ingenious system wrapping HTTP
| inside the rsync protocol. On subsequent visits, the proxies
| see the resemblance and send a diff. I've toyed with it myself
| (https://github.com/rakoo/rproxy/) but I don't think websites
| really care about how disastrously huge and cumbersome they are
| to use.
|
| - another idea is to invert rsync: instead of the client
| calculating a signature, sending it to the server, and the
| server calculating the delta _for every request of every
| client_, it is possible to have the server calculate the
| signature once each time the content actually changes, send the
| signature to the client, and let the client calculate the parts
| it needs and download only those. In short, zsync
| (http://zsync.moria.org.uk/)
|
| There are a million things already done in this sector, but for
| some reason it has never caught on... I think there's just not
| enough interest.
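|
| In browser terms, that inverted flow might look something like this
| (hypothetical signature file name and format, fixed-offset blocks
| only; real zsync also uses a rolling weak hash so matching blocks
| can be found at any offset):
|
|     // zsync-style: the server computes a per-block signature once
|     // per release and publishes it as a static file; every client
|     // hashes its cached copy and works out locally what it needs.
|     async function neededRanges(sigUrl, cachedBytes, blockSize = 2048) {
|       // assumed format: JSON array of hex SHA-256 digests, one per block
|       const remoteSigs = await (await fetch(sigUrl)).json();
|       const ranges = [];
|       for (let i = 0; i < remoteSigs.length; i++) {
|         const start = i * blockSize;
|         const block = cachedBytes.slice(start, start + blockSize);
|         const digest = await crypto.subtle.digest('SHA-256', block);
|         const hex = [...new Uint8Array(digest)]
|           .map(b => b.toString(16).padStart(2, '0')).join('');
|         if (hex !== remoteSigs[i]) ranges.push([start, start + blockSize - 1]);
|       }
|       return ranges;   // byte ranges the client still has to download
|     }
|
| The server does no per-client work at all; everything it publishes
| is a static file that any CDN can cache.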
| tyingq wrote:
| HTTP already has range requests, so it seems like you could do
| this in a relatively straightforward way.
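|
| Something like this, assuming the server honors Range (most static
| file servers and CDNs do):
|
|     // Fetch just one byte range of a static asset; a server that
|     // supports range requests answers 206 Partial Content with
|     // only those bytes.
|     async function fetchRange(url, start, end) {
|       const res = await fetch(url, {
|         headers: { Range: `bytes=${start}-${end}` },
|       });
|       if (res.status !== 206) throw new Error('Range not supported');
|       return new Uint8Array(await res.arrayBuffer());
|     }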
| claytongulick wrote:
| That is a very interesting idea.
|
| I've been thinking about what I should do in a V2 of this
| lib.
| kevmo314 wrote:
| I really like this idea, but I'm not sure if it would work in
| practice. It's basically working against webpack's code
| splitting, because the split chunks are tagged with hashes.
| Additionally, the first load will still be quite heavy, whereas
| with code splitting at least there's some semblance of a smaller
| initial load. Since it's not native to the browser, it would have
| a nontrivial initialization overhead as well.
|
| I think it could be a cool idea if it's offered as a static
| file hosting as a service kind of deal that requires zero
| config, but then again, if you're the kind of person who wants
| zero-config static file hosting, you're probably not pushing
| daily/hourly software releases?
|
| It would be neat to have something like this more transparently
| implemented though. Seeing a 2MB+ module rebuild for a minor
| typo fix is aggravating.
| londons_explore wrote:
| One possible implementation:
|
| * For each bundle, webpack produces the compiled bundle _and_
| a separate diff file between the latest release and every
| previously published release. Filenames could be
| [oldhash]-[newhash].js
|
| * The webpack loader can check if the browser has that
| specific module cached (I believe service workers are able to
| inspect the http cache like this), and then based on the
| version that is cached, request the correct filename for a
| diff to the latest version.
|
| * Then apply the diff and load the javascript as normal.
|
| I just did a quick test with a webapp of mine, and a small
| diff file made with the rsync algorithm (via rdiff-backup) of
| a webpack bundle was just 150 bytes.
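|
| A very rough service-worker sketch of that scheme (the bundle
| naming, the x-bundle-hash header, and applyDelta(), meaning some
| rsync/bsdiff-style patcher, are all hypothetical; real code would
| also need proper cache bookkeeping):
|
|     self.addEventListener('fetch', event => {
|       const match = new URL(event.request.url)
|         .pathname.match(/^\/bundle\.([0-9a-f]+)\.js$/);
|       if (!match) return;           // not a bundle request, pass through
|       const newHash = match[1];
|
|       event.respondWith((async () => {
|         const cache = await caches.open('bundles');
|         const old = await cache.match('/bundle/current');
|         const oldHash = old && old.headers.get('x-bundle-hash');
|
|         if (!old) return fetch(event.request);  // first visit: full download
|         if (oldHash === newHash) return old;    // already up to date
|
|         // Download only the [oldhash]-[newhash].js diff and patch
|         // the cached bundle instead of re-downloading everything.
|         const diff = await fetch(`/diffs/${oldHash}-${newHash}.js`);
|         const patched = applyDelta(await old.arrayBuffer(),
|                                    await diff.arrayBuffer());
|
|         const fresh = new Response(patched, {
|           headers: { 'Content-Type': 'application/javascript',
|                      'x-bundle-hash': newHash },
|         });
|         await cache.put('/bundle/current', fresh.clone());
|         return fresh;
|       })());
|     });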
| kevmo314 wrote:
| > Then apply the diff and load the javascript as normal.
|
| Can service workers mutate the HTTP cache? I haven't seen
| it yet, but if not, it seems like the client might be
| forever pinned to the old version. I guess most likely we'd
| have to avoid the HTTP cache altogether and run our own
| application module caching. That could work, but I wonder if
| the performance overhead would be worth it...
|
| I'll play around with this a little more though, it's a really
| cool idea. Getting a clean API seems tricky but doable.
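|
| As far as I can tell, service workers can't rewrite entries in the
| HTTP cache itself, but the Cache API is separate storage that both
| pages and workers can write to, which is basically the "own module
| caching" idea (the names below are made up):
|
|     // Keep patched bundles in a named cache we control, instead of
|     // relying on the browser's HTTP cache.
|     async function storePatchedModule(name, code) {
|       const cache = await caches.open('app-modules');
|       await cache.put(`/modules/${name}`, new Response(code, {
|         headers: { 'Content-Type': 'application/javascript' },
|       }));
|     }
|
|     async function loadModule(name) {
|       const hit = await caches.match(`/modules/${name}`);
|       return hit ? hit.text() : null;   // caller falls back to the network
|     }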
___________________________________________________________________
(page generated 2021-05-18 23:02 UTC)