[HN Gopher] Bit-sync: synchronizing arbitrary data using rsync a...
       ___________________________________________________________________
        
       Bit-sync: synchronizing arbitrary data using rsync algorithm in
       pure JavaScript
        
       Author : ingve
       Score  : 97 points
       Date   : 2021-05-16 11:12 UTC (2 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | blixt wrote:
        | Also check out the Rabin rolling hash algorithm for smart
        | chunking of files, which can be useful either to dedupe dynamic
        | content or to sync files efficiently.
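        | 
        | To make the idea concrete, here's a rough JS sketch of
        | content-defined chunking, with a simple polynomial rolling hash
        | standing in for a true Rabin fingerprint; the window, modulus,
        | mask, and minimum chunk size are arbitrary illustration values:
        | 
        |     // Cut wherever the rolling hash of the last WINDOW bytes
        |     // matches the mask, so boundaries follow content rather
        |     // than offsets. Constants are illustrative, not tuned.
        |     const WINDOW = 48, BASE = 257, MOD = 2 ** 31 - 1;
        |     const MASK = 0x1fff, MIN_CHUNK = 2048;
        | 
        |     function chunkBoundaries(bytes /* Uint8Array */) {
        |       const cuts = [];
        |       let hash = 0, start = 0;
        |       // BASE^(WINDOW-1) mod MOD, weight of the oldest byte
        |       let msbWeight = 1;
        |       for (let i = 0; i < WINDOW - 1; i++) {
        |         msbWeight = (msbWeight * BASE) % MOD;
        |       }
        |       for (let i = 0; i < bytes.length; i++) {
        |         if (i >= WINDOW) {
        |           // drop the byte that just left the window
        |           hash = (hash - (bytes[i - WINDOW] * msbWeight) % MOD
        |                   + MOD) % MOD;
        |         }
        |         hash = (hash * BASE + bytes[i]) % MOD;
        |         if (i + 1 - start >= MIN_CHUNK &&
        |             (hash & MASK) === MASK) {
        |           cuts.push(i + 1);   // a chunk ends after byte i
        |           start = i + 1;
        |         }
        |       }
        |       return cuts;
        |     }
        | 
        | Because the cut points depend only on the bytes inside the
        | window, identical content still produces the same chunks after
        | insertions or deletions earlier in the file, which is what makes
        | this useful for dedupe and sync.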
        
       | hardwaresofton wrote:
       | Related (rsync in the browser):
       | 
       | https://webdeltasync.github.io/
        
         | claytongulick wrote:
         | This is a cool project!
         | 
          | Looks like they use my bit-sync lib for the actual rsync part
          | [1], which is neat.
         | 
         | I should modernize this lib so that it's easier for them to add
         | as a dependency.
         | 
         | https://github.com/WebDeltaSync/WebRsync/blob/master/public/...
        
       | zmj wrote:
       | The rsync algorithm is pretty cool. I implemented it in C# a
       | couple of years ago: https://github.com/zmj/rsync-delta
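        | 
        | For anyone following along, the part that makes the algorithm
        | cheap is the weak rolling checksum. A minimal JS sketch of the
        | idea (Adler-32 style; illustrative only, not bit-sync's or
        | rsync-delta's exact code):
        | 
        |     // Weak checksum over one block, plus the O(1) "roll" that
        |     // lets the sender slide the window one byte at a time
        |     // while scanning for blocks the receiver already has.
        |     const M = 65536;
        | 
        |     function weakChecksum(block /* Uint8Array */) {
        |       let a = 0, b = 0;
        |       for (let i = 0; i < block.length; i++) {
        |         a = (a + block[i]) % M;
        |         b = (b + (block.length - i) * block[i]) % M;
        |       }
        |       return { a, b, value: a + b * M };
        |     }
        | 
        |     // Slide the window one byte: drop oldByte, append newByte.
        |     function roll({ a, b }, oldByte, newByte, blockSize) {
        |       a = (a - oldByte + newByte + M) % M;
        |       b = (b - blockSize * oldByte + a + blockSize * M) % M;
        |       return { a, b, value: a + b * M };
        |     }
        | 
        | Only offsets whose weak checksum matches a known block get the
        | much more expensive strong hash (MD5 in classic rsync), which is
        | what the performance discussion elsewhere in this thread is
        | about.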
        
         | cshipley wrote:
         | Does this only work with files or can any source be used as
         | long as it is a Stream? Basically, I'd love to have something
         | like this for arbitrary data, not just files.
         | 
         | Also, looks like the README has a broken link:
         | https://librsync.github.io/rdiff.html
        
           | zmj wrote:
           | Yep, Streams or Pipes for IO. No seeking, forward-only reads
           | - using this with network in/out was my goal.
           | 
           | Thanks for pointing out the broken link. Looks like it's here
           | now: https://librsync.github.io/page_rdiff.html
        
       | tyingq wrote:
       | _" The md5 algorithm needs to be rewritten with performance in
       | mind"_
       | 
       | Poked around a little, and this area seems a bit under-researched
       | since MD5 went out of favor for many use cases. For example, the
       | SubtleCrypto digest doesn't support it.
       | 
        | There's this WASM implementation[1], which shows 2-37 MB/sec,
        | presumably on fairly modern hardware. But I can also find an old
        | pure JavaScript implementation[2] that shows 5-7 MB/sec on an
        | ancient Pentium 4 running Opera. Makes you wonder how it would
        | fare with recent Chrome/V8 and hardware. There are online tests,
        | but they use small inputs, and Chrome is fast enough that I get
        | a lot of NaN/Infinity results.
       | 
        | Maybe it would be better to use SHA-1 or some other digest
        | supported by the browser's built-in SubtleCrypto API. It wouldn't
        | technically be rsync, but this library seems to be intended to
        | run on both sides anyway.
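        | 
        | For reference, the built-in digest call looks roughly like this
        | (SHA-256 shown; SHA-1 is also available, MD5 is not):
        | 
        |     // Strong hash for a block using the browser's WebCrypto
        |     // digest; returns a hex string.
        |     async function strongHash(block /* Uint8Array */) {
        |       const buf = await crypto.subtle.digest('SHA-256', block);
        |       return Array.from(new Uint8Array(buf))
        |         .map(b => b.toString(16).padStart(2, '0'))
        |         .join('');
        |     }
        | 
        | The catch is that it's async and the per-call overhead may add
        | up when hashing lots of small blocks, so it would still need
        | benchmarking against the wasm route.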
       | 
       | [1] https://www.npmjs.com/package/md5-wasm
       | 
       | [2] http://www.myersdaily.org/joseph/javascript/md5-speed-
       | test.h...
        
         | claytongulick wrote:
          | Hi, author here. It's funny you mention this; I was just
          | revisiting this lib and thinking that SubtleCrypto would be a
          | better option. Using md5-wasm is also an interesting idea,
          | thanks for the link.
         | 
         | It's been years since I made this, probably due for a little
         | love.
        
       | londons_explore wrote:
       | I'd like to see something like this used for big JavaScript
       | bundles.
       | 
        | Many webapps now ship 2MB+ of JavaScript, and lots of web
        | companies do daily or even hourly releases, which effectively
        | means that every time I visit your site I have to re-download
        | many megabytes of JavaScript, which takes many seconds. It's
        | frustrating. Diffing the JavaScript and sending just the changed
        | lines of code would save hundreds of millions of people a few
        | seconds each day, and is totally worth it.
        
         | rakoo wrote:
         | All of this has already happened, and all of this will happen
         | again.
         | 
          | - The original rsync author wrote rproxy
          | (https://rproxy.samba.org/), an ingenious system that wraps
          | HTTP inside the rsync protocol. On subsequent visits, the
          | proxies see the resemblance and send a diff instead of the full
          | response. I've toyed with it myself
          | (https://github.com/rakoo/rproxy/), but I don't think websites
          | really care about how disastrously huge and cumbersome they are
          | to use.
         | 
          | - Another idea is to invert rsync: instead of the client
          | calculating a signature, sending it to the server, and the
          | server calculating the delta _for every request of every
          | client_, it is possible to have the server calculate the
          | signature once, each time the file actually changes, send that
          | signature to the client, and let the client work out which
          | parts it needs and download only those. In short, zsync
          | (http://zsync.moria.org.uk/) - a rough client-side sketch is at
          | the end of this comment.
         | 
          | There are a million things already done in this sector, but for
          | some reason it has never caught on... I think there's just not
          | enough interest.
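          | 
          | Roughly, the client side of that inverted flow looks like this,
          | simplified to fixed aligned blocks (real zsync also uses a
          | rolling checksum so shifted content still matches); the
          | signature URL and format here are made up for illustration:
          | 
          |     // Fetch the precomputed signature, reuse unchanged blocks
          |     // from the cached copy, and pull only the changed blocks
          |     // with HTTP range requests. Assumes whole blocks.
          |     const hex = async buf => Array.from(new Uint8Array(
          |       await crypto.subtle.digest('SHA-256', buf)))
          |       .map(b => b.toString(16).padStart(2, '0')).join('');
          | 
          |     async function syncFile(url, cached /* Uint8Array */) {
          |       // assumed shape: { blockSize, hashes: ["...", ...] }
          |       const sig = await (await fetch(url + '.sig.json')).json();
          |       const b = sig.blockSize, n = sig.hashes.length;
          |       const out = new Uint8Array(b * n);
          |       for (let i = 0; i < n; i++) {
          |         const start = i * b;
          |         const old = cached.subarray(start, start + b);
          |         if (await hex(old) === sig.hashes[i]) {
          |           out.set(old, start);   // unchanged: reuse local copy
          |         } else {
          |           const res = await fetch(url, {  // just this block
          |             headers: { Range: `bytes=${start}-${start + b - 1}` },
          |           });
          |           const body = await res.arrayBuffer();
          |           out.set(new Uint8Array(body), start);
          |         }
          |       }
          |       return out;
          |     }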
        
         | tyingq wrote:
         | HTTP already has range requests, so it seems like you could do
         | this in a relatively straightforward way.
        
           | claytongulick wrote:
           | That is a very interesting idea.
           | 
           | I've been thinking about what I should do in a V2 of this
           | lib.
        
         | kevmo314 wrote:
          | I really like this idea, but I'm not sure if it would work in
          | practice. It basically works against webpack's code splitting,
          | because the split chunks are tagged with content hashes.
          | Additionally,
         | the first load will still be quite heavy, whereas with code
         | splitting at least there's some semblance of a smaller initial
         | load. Since it's not native to the browser, it would have a
         | nontrivial initialization overhead as well.
         | 
          | I think it could be a cool idea if it's offered as a zero-config
          | static-file-hosting-as-a-service kind of deal, but then again,
          | if you're the kind of person who wants zero-config static file
          | hosting, you're probably not pushing daily/hourly software
          | releases?
         | 
         | It would be neat to have something like this more transparently
         | implemented though. Seeing a 2MB+ module rebuild for a minor
         | typo fix is aggravating.
        
           | londons_explore wrote:
           | One possible implementation:
           | 
            | * For each bundle, webpack produces the compiled bundle _and_
            | a separate diff file between the latest release and every
            | previously published release. Filenames could be
            | [oldhash]-[newhash].js
           | 
           | * The webpack loader can check if the browser has that
           | specific module cached (I believe service workers are able to
           | inspect the http cache like this), and then based on the
           | version that is cached, request the correct filename for a
           | diff to the latest version.
           | 
           | * Then apply the diff and load the javascript as normal.
           | 
            | I just did a quick test with a webapp of mine: a diff of one
            | of its webpack bundles, made with the rsync algorithm (via
            | rdiff-backup), came to just 150 bytes.
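            | 
            | A rough sketch of the loader half of that scheme, assuming the
            | [oldhash]-[newhash].js naming above; getCachedBundle,
            | putCachedBundle and applyDelta are hypothetical helpers (a
            | Cache API wrapper and an rsync-style patch function):
            | 
            |     // If any previous bundle is cached, fetch only the diff
            |     // and patch locally; otherwise fall back to the full
            |     // download. Helpers here are placeholders, not real APIs.
            |     async function loadBundle(newHash) {
            |       const cached = await getCachedBundle();   // or null
            |       if (cached && cached.hash === newHash) {
            |         return cached.bytes;
            |       }
            |       if (cached) {
            |         const url = `/bundles/${cached.hash}-${newHash}.js`;
            |         const res = await fetch(url);  // small diff file
            |         const diff = new Uint8Array(await res.arrayBuffer());
            |         const patched = applyDelta(cached.bytes, diff);
            |         await putCachedBundle(newHash, patched);
            |         return patched;
            |       }
            |       // first visit: no old version to diff against
            |       const res = await fetch(`/bundles/${newHash}.js`);
            |       const full = new Uint8Array(await res.arrayBuffer());
            |       await putCachedBundle(newHash, full);
            |       return full;
            |     }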
        
             | kevmo314 wrote:
             | > Then apply the diff and load the javascript as normal.
             | 
              | Can service workers mutate the HTTP cache? I haven't seen
              | it yet, but if not, it seems like the client might be
              | forever pinned to the old version. I guess most likely we'd
              | have to avoid the HTTP cache altogether and run our own
              | application module caching. That could work, but I wonder
              | if the performance overhead would be worth it...
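              | 
              | For what it's worth, a service worker can't write into the
              | regular HTTP cache, but the Cache API gives it a separate,
              | script-controlled store that patched bundles could live in.
              | A rough sketch (cache name and headers are arbitrary):
              | 
              |     // Put a patched bundle into a service-worker-
              |     // controlled cache, then serve it for later requests.
              |     async function storePatched(url, patchedBytes) {
              |       const cache = await caches.open('app-modules-v1');
              |       const res = new Response(patchedBytes, {
              |         headers: { 'Content-Type': 'text/javascript' },
              |       });
              |       await cache.put(url, res);
              |     }
              | 
              |     self.addEventListener('fetch', event => {
              |       event.respondWith(
              |         caches.match(event.request)
              |           .then(hit => hit || fetch(event.request))
              |       );
              |     });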
             | 
              | I'll play around with this a little more though, it's a
              | really cool idea. Getting a clean API seems tricky but
              | doable.
        
       ___________________________________________________________________
       (page generated 2021-05-18 23:02 UTC)