[HN Gopher] How does rsync work?
___________________________________________________________________
How does rsync work?
Author : secure
Score : 228 points
Date : 2022-07-02 12:41 UTC (10 hours ago)
(HTM) web link (michael.stapelberg.ch)
(TXT) w3m dump (michael.stapelberg.ch)
| lloeki wrote:
| When trying to understand rsync and the rolling checksum I
| stumbled upon a small python implementation in some self-hosted
| corner of the web[0], which I have archived on GH[1] (not the
| author, but things can vanish quickly, as proved by the bzr repo
| which went _poof_ [2]).
|
| [0]: https://blog.liw.fi/posts/rsync-in-python/
|
| [1]: https://github.com/lloeki/rsync/blob/master/rsync.py
|
| [2]: https://code.liw.fi/obsync/bzr/trunk/
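The rolling checksum mentioned above can be sketched in a few lines. This is a minimal illustration of the Adler-32-style weak checksum from the rsync paper (not rsync's actual code): `a` is the byte sum, `b` weights earlier bytes more, and both can be "rolled" forward one byte without rescanning the window.

```python
# Weak rolling checksum sketch, following the rsync paper's a/b recurrences.
M = 1 << 16  # both sums are taken mod 2^16

def weak_checksum(block: bytes):
    """Compute (a, b) for a block from scratch."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a: int, b: int, out_byte: int, in_byte: int, block_len: int):
    """Slide the window one byte: drop out_byte, append in_byte."""
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return a, b
```

Rolling `weak_checksum(data[0:n])` by one byte yields exactly `weak_checksum(data[1:n+1])`, which is what lets rsync scan every window offset cheaply.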
| mmlb wrote:
| For bzr did you try archive.org?
| https://web.archive.org/web/20150321194412/https://code.liw....
| generalizations wrote:
| It's not actually there.
|
| https://web.archive.org/web/20150321212547/http://code.liw.f.
| ..
| jwilk wrote:
| I don't think you can clone that.
| lazypenguin wrote:
| Nice write up. rsync is great as an application but I found it
| more cumbersome to use when wanting to integrate it into my own
| application. There's librsync but the documentation is threadbare
| and it requires an rsync server to run. I found bita/bitar
| (https://github.com/oll3/bita) which is inspired by rsync &
| family. It works more like zsync which leverages HTTP Range
| requests so it doesn't require anything running on the server to
| get chunks. Works like a treat using s3/b2 storage to serve files
| and get incremental differential updates on the client side!
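The zsync-style trick described here is just plain HTTP Range requests, so any static file host works. A minimal sketch with the standard library (URL and offsets are illustrative):

```python
# Build an HTTP request for only bytes [start, end] (inclusive) of a
# remote file; a server that honours it replies 206 Partial Content.
import urllib.request

def build_range_request(url: str, start: int, end: int) -> urllib.request.Request:
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})

def fetch_chunk(url: str, start: int, end: int) -> bytes:
    with urllib.request.urlopen(build_range_request(url, start, end)) as resp:
        return resp.read()
```

Since the server side is just static HTTP, S3/B2-style object storage can serve the chunks directly, as the comment notes.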
| oll3 wrote:
| Always happy to see my pet project mentioned (bita) and that it
| is actually being used by others than me :)
| bigChris wrote:
| Rsync's worst issue is someone port scanning and brute-forcing
| their way into your system. Turn off your port.
| hatware wrote:
| Don't most consumer routers have all ports blocked? Who is
| connecting a computer directly to the modem these days?
| stavros wrote:
| Most of them do NAT, which isn't semantically the same as
| blocking all ports, but functionally it is.
| seunosewa wrote:
| Most people use rsync over ssh
| srvmshr wrote:
| I encountered a strange situation 2 days ago. I rsync my pdf
| files periodically between my harddrives. rsync showed no
| differences between two folder trees, but if I did `diff -r`
| between the two, 3 pdfs came out different.
|
| I checked the three individually but they showed no corruption or
| changes either side. How can this happen?
|
| Edit: the hard drive copy is previously rsynced from this copy &
| both copies are mirrored with google cloud bucket.
|
| The 3 files that showed as different have the same MD5 checksums
| cliZX81 wrote:
| The same happened to me syncing to an SD card. The reason may be
| different timestamp resolutions, which make same files look
| different. I synced from HFS+ to FAT32 back then.
| orangepurple wrote:
| rsync uses a heuristic based on file times and sizes to compare
| files. To compare file content, use the --checksum option
| (computationally expensive to run)
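The "quick check" heuristic described here can be sketched as follows. This is an illustration of the idea (size plus mtime), not rsync's actual implementation:

```python
# Sketch of rsync's default quick check: a file pair is considered
# unchanged when size and modification time match. --checksum replaces
# this with a full-content hash.
import os

def quick_check_same(src: str, dst: str) -> bool:
    s, d = os.stat(src), os.stat(dst)
    return s.st_size == d.st_size and int(s.st_mtime) == int(d.st_mtime)
```

Two files with equal size and mtime but different bytes will pass this check, which is exactly the situation the parent comment ran into.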
| Dylan16807 wrote:
| Rsync can checksum a lot of megabytes per second. In general
| I'd say the disk IO is much more expensive than the
| computation.
| srvmshr wrote:
| Yes but it doesn't answer why rsync & checksum pass the set
| of files as same, but diff reports them different.
| pronoiac wrote:
| It isn't clear from the conversation so far: do you use
| "--checksum"?
| boomskats wrote:
| This was a great write up. I've already sent it to a few people.
|
| On the question of what happens if a file's contents change after
| the initial checksum, the man page for rsync[0] has an
| interesting explanation of the *--checksum* option:
|
| > This changes the way rsync checks if the files have been
| changed and are in need of a transfer. Without this option, rsync
| uses a "quick check" that (by default) checks if each file's size
| and time of last modification match between the sender and
| receiver. This option changes this to compare a 128-bit checksum
| for each file that has a matching size. Generating the checksums
| means that both sides will expend a lot of disk I/O reading all
| the data in the files in the transfer (and this is prior to any
| reading that will be done to transfer changed files), so this can
| slow things down significantly.
|
| > The sending side generates its checksums while it is doing the
| file-system scan that builds the list of the available files. The
| receiver generates its checksums when it is scanning for changed
| files, and will checksum any file that has the same size as the
| corresponding sender's file: files with either a changed size or
| a changed checksum are selected for transfer.
|
| > Note that rsync always verifies that each transferred file was
| correctly reconstructed on the receiving side by checking a
| whole-file checksum that is generated as the file is transferred,
| but that automatic after-the-transfer verification has nothing to
| do with this option's before-the-transfer "Does this file need to
| be updated?" check. For protocol 30 and beyond (first supported
| in 3.0.0), the checksum used is MD5. For older protocols, the
| checksum used is MD4.
|
| [0]: https://linux.die.net/man/1/rsync
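The 128-bit per-file checksum the quoted man page refers to is MD5 for protocol 30 and later. A minimal way to compute it, reading in chunks so memory stays bounded regardless of file size:

```python
# Whole-file MD5, as used by rsync's --checksum comparison and its
# after-transfer verification (protocol 30+).
import hashlib

def file_md5(path: str) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()
```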
| londons_explore wrote:
| Failure cases of the 'quick check':
|
| * Underlying disk device corruption - but modern disks do
| internal error checking, and should emit an IO error.
|
| * Corruption in RAM/software bug in the kernel IO subsystem.
| Should be detected by filesystem checksumming.
|
| * User has accidentally modified file and set mtime back.
| _checksumming fixes this case_.
|
| * User has maliciously modified file and set mtime back. Since
| it's MD5 (broken), the malicious user can make the checksum
| match too. _checksumming doesn't help_.
|
| Given that, I see no users who really benefit from
| checksumming. It isn't sufficient for anyone with really high
| data integrity requirements, while also being overkill for
| typical usecases.
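The accidental-modification case in the list above is easy to reproduce: rewrite a file with same-length content, restore its mtime, and a size+mtime quick check sees no difference. A small self-contained demonstration:

```python
# Rewrite a file with same-size content and set its mtime back; the
# size+mtime "quick check" then reports the file as unchanged.
import os
import tempfile

def demo() -> bool:
    fd, path = tempfile.mkstemp()
    os.close(fd)
    with open(path, "wb") as f:
        f.write(b"hello")
    st = os.stat(path)
    with open(path, "wb") as f:
        f.write(b"jello")                       # same size, new content
    os.utime(path, (st.st_atime, st.st_mtime))  # set mtime back
    now = os.stat(path)
    same_quick = (now.st_size == st.st_size
                  and int(now.st_mtime) == int(st.st_mtime))
    os.unlink(path)
    return same_quick
```

Only --checksum (or a content-hashing tool) catches this case.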
| naniwaduni wrote:
| > * User has maliciously modified file and set mtime back.
| Since it's MD5 (broken), the malicious user can make the
| checksum match too. checksumming doesn't help.
|
| No, md5 is not broken like that (yet). There is no known
| second-preimage attack against md5; the practical collision
| vulns only affect cases where an attacker controls the file
| content both before _and_ after the update.
| tjoff wrote:
| The very existence of filesystem checksumming is because your
| first point isn't always true.
|
| Also, filesystem checksumming does not guard against
| ram/kernel-bugs. On top of that file system checksumming is
| very rare.
| jwilk wrote:
| > https://linux.die.net/man/1/rsync
|
| linux.die.net is horribly outdated. This particular page is
| from 2009.
|
| Up-to-date docs are here:
|
| https://download.samba.org/pub/rsync/rsync.1
| secure wrote:
| Which version is the samba one? Latest release? Git?
|
| If you want to see the man page of the version in Debian,
| that would be
| https://manpages.debian.org/testing/rsync/rsync.1.en.html
|
| Disclaimer: I wrote the software behind manpages.debian.org
| :)
| jwilk wrote:
| "This manpage is current for version 3.2.5dev of rsync" -
| so I guess it's from git.
| kzrdude wrote:
| I guess zfs send and similar are better solutions, but what if
| we could query the filesystem for existing checksums of a file
| and save IO that way, if filesystems on both sides already
| stored usable checksums?
| formerly_proven wrote:
| Unless you are also doing FS-level deduplication using the
| same checksums, it generally makes no sense for these to be
| cryptographic hashes, so they're not necessarily suitable for
| this purpose.
|
| IIRC neither ZFS nor btrfs use cryptographic hashes for
| checksumming by default.
| throw0101a wrote:
| > _on is a short hand for fletcher4 for non-deduped
| datasets and sha256 for deduped datasets_
|
| * https://openzfs.github.io/openzfs-
| docs/Basic%20Concepts/Chec...
|
| * https://people.freebsd.org/~asomers/fletcher.pdf
|
| * https://en.wikipedia.org/wiki/Fletcher%27s_checksum
|
| Strangely enough, SHA-512 is actually ~50% faster than
| SHA-256:
|
| > _ZFS actually uses a special version of SHA512 called
| SHA512t256, it uses a different initial value, and
| truncates the results to 256 bits (that is all the room
| there is in the block pointer). The advantage is only that
| it is faster on 64 bit CPUs._
|
| * https://twitter.com/klarainc/status/1367199461546614788
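The fletcher4 default mentioned above is simple enough to sketch: four 64-bit accumulators run over the data as 32-bit words, each feeding the next. This is an illustrative implementation of the algorithm, not ZFS's code:

```python
# fletcher4 sketch: four 64-bit accumulators over little-endian 32-bit
# words, everything mod 2^64. Fast, but not a cryptographic hash, which
# is why it only suits non-deduped datasets.
import struct

MASK = (1 << 64) - 1

def fletcher4(data: bytes):
    assert len(data) % 4 == 0, "fletcher4 runs over whole 32-bit words"
    a = b = c = d = 0
    for (w,) in struct.iter_unpack("<I", data):
        a = (a + w) & MASK
        b = (b + a) & MASK
        c = (c + b) & MASK
        d = (d + c) & MASK
    return a, b, c, d
```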
| slavik81 wrote:
| It seems that SHA512t256 is another name for SHA-512/256.
| It's a shame that the initialization is different from
| SHA-512, as it would have been very useful to be able to
| convert a SHA-512 hash into a SHA-512/256 hash.
| [deleted]
| js2 wrote:
| > I guess zfs send and similar are better solutions.
|
| It depends. I recently built a new zfs pool server and needed
| to transfer a few TB of data from the old pool to the new
| pool, but I built the new pool with a larger record size. If
| I'd used zfs send the files would have retained their
| existing record size. So rsync it was.
| throw0101a wrote:
| > _For protocol 30 and beyond (first supported in 3.0.0), the
| checksum used is MD5. For older protocols, the checksum used is
| MD4._
|
| Newer versions (>=3.2?) support xxHash and xxHash3:
|
| * https://github.com/WayneD/rsync/blob/master/checksum.c
|
| * https://github.com/Cyan4973/xxHash
|
| * https://news.ycombinator.com/item?id=19402602 (2019 XXH3
| discussion)
| aidenn0 wrote:
| I thought xxHash was only used for the chunk hash, not the
| whole file hash?
| throw0101a wrote:
| This is also available as a video, "Why I wrote my own rsync":
|
| * https://media.ccc.de/v/gpn20-41-why-i-wrote-my-own-rsync
|
| * https://www.youtube.com/watch?v=wpwObdgemoE
| throw0101a wrote:
| See also the 1996 original paper by Tridgell (also of Samba fame)
| and Mackerras:
|
| * https://rsync.samba.org/tech_report/
|
| * https://www.andrew.cmu.edu/course/15-749/READINGS/required/c...
| secure wrote:
| Another great resource from Tridgell himself is this Ottawa
| Linux Symposium talk:
| http://olstrans.sourceforge.net/release/OLS2000-rsync/OLS200...
| CamperBob2 wrote:
| I don't see why any of this is needed. Just install Dropbox,
| and...
| LambdaComplex wrote:
| I'm gonna interpret this in the best faith possible, assume
| you're referencing the infamous Dropbox HN comment, upvote you
| to counteract the downvotes from people who missed the joke,
| and link to the aforementioned comment:
| https://news.ycombinator.com/item?id=9224
| CamperBob2 wrote:
| The irony is that BrandonM's classic HN comment makes _more_
| sense these days, as Dropbox continues its evolution towards
| Microsofthood. Increasingly, using Dropbox means putting up
| with an unceasing deluge of product promotions while you're
| trying to get your work done. No such annoyances with rsync
| and ftp and the like.
|
| Just last week, Dropbox unilaterally decided I didn't want
| local copies of the shared files on my laptop, which made for
| some awkwardness inside a secured facility with no Internet
| access.
| wardedVibe wrote:
| Let them use rsync for you?
___________________________________________________________________
(page generated 2022-07-02 23:00 UTC)