[HN Gopher] CDC File Transfer
___________________________________________________________________
CDC File Transfer
Author : cglong
Score : 131 points
Date : 2023-01-08 21:46 UTC (1 hour ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| teraflop wrote:
| TL;DR they rediscovered the context-dependent variable block size
| technique that Tarsnap uses:
| https://www.tarsnap.com/download/EuroBSDCon13.pdf
| crazygringo wrote:
| They don't claim to have invented CDC or FastCDC; they just
| made and are sharing a useful implementation of it.
|
| And if that Tarsnap presentation is from 2013, and FastCDC was
| published in 2016 [1] according to Wikipedia [2], then
| presumably Tarsnap didn't invent FastCDC either.
|
| [1]
| https://www.usenix.org/system/files/conference/atc16/atc16-p...
|
| [2]
| https://en.wikipedia.org/wiki/Rolling_hash#Gear_fingerprint_...
| jyxent wrote:
| Another well-cited predecessor is "A low-bandwidth network
| file system."
| (https://dl.acm.org/doi/abs/10.1145/502034.502052), which was
| published in 2001. It uses Rabin fingerprinting to define
| chunk boundaries.
| kccqzy wrote:
| The Tarsnap technique is interesting, but that presentation is
| a bit hard to follow. What are example values for alpha and p?
| elromulous wrote:
| I took a quick glance at the repo, but it wasn't totally clear
| to me whether the client is Windows-only. Anyone know more?
| Moissanite wrote:
| I thought the same; seems to be perfectly fine as a Linux-to-
| Linux tool.
| m000 wrote:
| It would be interesting to also try this on more underpowered
| systems, e.g. aging armv5-based NAS boxes. Would we see similar
| performance improvements, or would any improvements be hampered
| by poor I/O performance and architectural inefficiencies?
| zX41ZdbW wrote:
| Content Defined Chunking is one of my favorite algorithms
| because it has some "magic" similar to HyperLogLogs, Bloom
| filters, etc. It is a good algorithm to explain to people to
| get them inspired by computer science. I usually explain the
| simplest variant with rolling hashes.
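|
| For instance, here is a toy version of that simplest variant (a
| polynomial rolling hash over a small window; all constants are
| illustrative rather than from any particular implementation):
|
|     WINDOW = 16            # bytes each cut decision depends on
|     MOD = 1 << 32
|     BASE = 257
|     TOP = pow(BASE, WINDOW - 1, MOD)
|     MASK = (1 << 13) - 1   # fires ~every 8 KiB on average
|
|     def boundaries(data: bytes):
|         h, cuts = 0, []
|         for i, b in enumerate(data):
|             if i >= WINDOW:
|                 # drop the byte sliding out of the window
|                 h = (h - data[i - WINDOW] * TOP) % MOD
|             h = (h * BASE + b) % MOD  # bring in the new byte
|             if i + 1 >= WINDOW and (h & MASK) == 0:
|                 cuts.append(i + 1)    # a chunk ends after byte i
|         return cuts
|
| Because each cut depends only on the previous 16 bytes, an
| insertion elsewhere in the file shifts the cut positions but not
| the chunk contents, so most chunks still deduplicate.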
|
| It would be interesting to see what the result would be (the
| average savings from deduplication) if it were applied globally
| to a large-scale blob storage such as Amazon S3 or Google Drive
| (metadata storage about chunks would be needed, and the chunks
| could be deduplicated).
|
| PS. I don't use this algorithm in ClickHouse, but it always
| remains tempting.
| piceas wrote:
| Implementing FastCDC (2016) is fun.
|
| Do you have a suggestion on what to read on the topic since
| then?
|
| I don't keep up with these things. A quick search came up with
| the following but I haven't read it yet.
|
| Fan Ni and Song Jiang, "RapidCDC: Leveraging Duplicate Locality
| to Accelerate Chunking in CDC-based Deduplication Systems", in
| Proceedings of 2019 ACM Symposium on Cloud Computing (ACM
| SoCC'19), Santa Cruz, CA, November, 2019.
|
| https://ranger.uta.edu/~sjiang/index.html
| windowsworkstoo wrote:
| Microsoft has this for DFS-R, and the spec is open:
| https://learn.microsoft.com/en-
| us/openspecs/windows_protocol... It's pretty straightforward
| to implement.
| sigmonsays wrote:
| Can we gut rsync and implement this?
|
| rsync doesn't even have a great protocol specification... maybe
| it's a new tool...
| progbits wrote:
| > Overall, cdc_rsync syncs files about 3 times faster than
| Cygwin rsync.
|
| Sounds promising.
| mjevans wrote:
| I'd like additional info about the Content Defined Chunking
| part of the specification.
|
| How is the content defined? Where I'd try to begin is with a
| single pass that looks for runs of null ('\0') bytes, even one
| byte long, as potential boundary ends. During that pass, also
| look for "magic signatures" of known stream types, such as
| already-compressed content streams (all the more reason not to
| try compressing them again). The CDC might also be aware of
| some file structures (zip, 7z, tar, etc.) and have a dedicated
| segment-creation algorithm for them.
|
| At a low level, the two ends should exchange a list of segment
| offsets, lengths, checksums (partial?), and maybe some short
| fragments of bytes to check (e.g. 4-byte chunks at various
| powers-of-2 offsets or at major chunk starts). Where the two
| ends differ in which chunks exist, they might also expend some
| minor additional effort to check whether the chunks identified
| on the other side exist locally, in case the two versions are
| using different filters or happened to reach different
| conclusions.
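|
| (From the discussion above, the repo appears to use FastCDC, in
| which the content "defines" boundaries purely through a rolling
| gear hash of the most recent bytes, with no format-specific
| knowledge such as null runs or magic signatures. A rough sketch
| of the boundary test; constants here are illustrative, and real
| FastCDC additionally "normalizes" chunk sizes with two masks:)
|
|     import random
|
|     rng = random.Random(0)  # any fixed table works; real
|                             # implementations ship a constant one
|     GEAR = [rng.getrandbits(64) for _ in range(256)]
|
|     MIN_SIZE = 32 * 1024    # never cut before this many bytes
|     MAX_SIZE = 256 * 1024   # always cut by this many bytes
|     # Test the *high* bits: the low bits of a gear hash depend
|     # only on the last few bytes seen.
|     AVG_MASK = 0xFFFF << 48  # ~64 KiB average chunks
|
|     def next_cut(data: bytes, start: int) -> int:
|         """Return the end offset of the chunk starting at start."""
|         h = 0
|         end = min(len(data), start + MAX_SIZE)
|         for i in range(start, end):
|             h = ((h << 1) + GEAR[data[i]]) & (2**64 - 1)
|             if i + 1 - start >= MIN_SIZE and (h & AVG_MASK) == 0:
|                 return i + 1
|         return end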
| 5440 wrote:
| I'm currently working on a required CDC (Center for Disease
| Control) reporting function for a COVID test. For a second I
| thought this article was going to be extremely helpful.
| brendoncarroll wrote:
| FastCDC is the same chunking algorithm used in Got.
|
| https://github.com/gotvc/got
| wahern wrote:
| Well, that's confusing: there's another Git follow-on project
| from some OpenBSD developers also called Got:
| http://gameoftrees.org/
|
| They both seem like very cool projects, so may the best Got
| win!
| jws wrote:
| To elaborate: rsync chunks in fixed sizes, so inserting or
| deleting a few bytes changes every chunk from that point
| onward.
|
| If instead you chunk based on local content (conceptually like
| chunking text into sentences at periods, except it's a binary
| criterion with an upper and lower size limit; I couldn't find
| the algorithm specification), then shortly after an insertion
| or deletion you start getting the same chunks as before.
|
| This drastically reduces the cost of identifying unmodified
| chunks.
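|
| You can watch that resynchronization happen with a toy chunker
| (the cut criterion and constants are made up purely for
| illustration):
|
|     import os
|
|     W = 32  # each cut decision looks at the last 32 bytes only
|
|     def cuts(data: bytes):
|         out = []
|         for i in range(W, len(data) + 1):
|             if sum(data[i - W:i]) % 64 == 0:  # ~64-byte chunks
|                 out.append(i)
|         return out
|
|     blob = os.urandom(4096)
|     edited = blob[:1000] + b"12345678" + blob[1000:]
|     before, after = cuts(blob), cuts(edited)
|     # Cuts before the edit are unchanged; cuts well past it are
|     # the old cuts shifted by the 8 inserted bytes.
|     print(before[:3] == after[:3])
|     print([c for c in before if c > 1100] ==
|           [c - 8 for c in after if c > 1108])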
| brendoncarroll wrote:
| I don't think rsync uses fixed-size chunks. The algorithm is
| described here; it's a rolling hash.
|
| https://rsync.samba.org/tech_report/node3.html
|
| Your description of content defined chunking is exactly right
| though. There are a number of techniques for doing it.
| FastCDC is one of them, although not the one used in rsync.
|
| https://en.wikipedia.org/wiki/Rolling_hash
|
| EDIT: Corrected in the comments below: fixed-size chunks,
| searched for at any offset with a rolling hash. The rsync
| algorithm description is here.
|
| https://rsync.samba.org/tech_report/node2.html
| teraflop wrote:
| rsync does use fixed-size chunks, but the rolling hash allows
| them to be identified even at offsets that aren't multiples of
| the chunk size.
|
| So a change partway through the file doesn't force rsync to
| actually re-transfer all of the subsequent unmodified
| chunks, but it does incur a computational cost to find them
| since it has to search through all possible offsets.
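|
| The weak checksum from the tech report linked above is what
| makes that search affordable: it can be rolled forward one byte
| in O(1). A sketch in Python (M = 2^16 as in the paper; the
| signature actually compared is s = a + 2^16 * b):
|
|     M = 1 << 16
|
|     def weak(block: bytes):
|         """Initial weak checksum (rsync's a and b sums)."""
|         a = sum(block) % M
|         b = sum((len(block) - i) * x
|                 for i, x in enumerate(block)) % M
|         return a, b
|
|     def roll(a, b, out_byte, in_byte, blocksize):
|         """Slide the window one byte forward in O(1)."""
|         a = (a - out_byte + in_byte) % M
|         b = (b - blocksize * out_byte + a) % M
|         return a, b
|
|     data = bytes(range(100))
|     a, b = weak(data[0:16])
|     a, b = roll(a, b, data[0], data[16], 16)
|     assert (a, b) == weak(data[1:17])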
| Waterluvian wrote:
| Maybe I'm projecting (probably) but I detect some subtext in the
| README's language. Like a "developers who are irate that Stadia
| was shut down, wanting to free anything they can for the
| community" type feeling.
| elromulous wrote:
| Google / Alphabet has a solid track record of open-sourcing
| shut-down projects. See Makani [1] and, IIRC, also Loon.
|
| [1] https://news.ycombinator.com/item?id=24456613
| Waterluvian wrote:
| Aha! I wasn't at all aware of that. Okay, this reads more
| like that to me the second time around. Just excitement to
| open-source a cool piece of tech that came from Stadia.
| [deleted]
| encryptluks2 wrote:
| This is great for syncing, but what about separating Courgette
| from Chromium so that we can finally have a decent delta diff
| program? I am tired of the Windows community relying on
| SmartVersion to create ISO diff files just because you need to
| be a master at compiling Chromium to use Courgette.
___________________________________________________________________
(page generated 2023-01-08 23:00 UTC)