[HN Gopher] CDC File Transfer
       ___________________________________________________________________
        
       CDC File Transfer
        
       Author : GalaxySnail
       Score  : 363 points
       Date   : 2025-10-01 02:38 UTC (20 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | rekttrader wrote:
        | Nice to see Stadia had some long-term benefit. It's a shame
        | they don't make a self-hosted version, but if you did that it
        | would just be piracy in today's DRM world.
        
         | jMyles wrote:
         | > it's just piracy in today's drm world
         | 
          | ...which is more important / needed than ever. I encourage
          | everyone who asks to get my music from BitTorrent instead of
          | Spotify.
        
           | MyOutfitIsVague wrote:
           | Why not something like Bandcamp, or other DRM-free purchase
           | options?
           | 
           | I'm not above piracy if there's no DRM free option (or if the
           | music is very old or the artist is long dead), but I still
           | believe in supporting artists who actively support freedom.
        
             | jMyles wrote:
             | Yep, I put everything on bandcamp.
             | https://justinholmes.bandcamp.com/
             | 
              | Even better, though, is a P2P service that is
              | censorship-resistant.
             | 
             | But yeah I like Bandcamp plenty.
             | 
             | > artists who actively support freedom.
             | 
             | The bluegrass world is quickly becoming this.
             | 
             | https://pickipedia.xyz/wiki/DRM-free
        
           | MaxikCZ wrote:
           | So you create and seed your torrents with your music, and
           | present them prominently on your site?
        
             | jMyles wrote:
             | I was doing that for a while, and running a seedbox.
             | However, on occasions when the seedbox was the only seeder,
             | clients were unable to begin the download, for reasons I've
             | never figured out. If I also seeded from my desktop, then
             | fan downloads were being fed by both the desktop and the
             | seedbox. But without the desktop, the seedbox did nothing.
             | 
             | I need to revisit this in the next few weeks as I release
             | my second record (which, if I may boast, has an incredible
             | ensemble of most of my favorite bluegrass musicians on it;
             | it was a really fun few days at the studio).
             | 
             | Currently I do pin all new content to IPFS and put the
             | hashes in the content description, as with this video of
             | Drowsy Maggie with David Grier:
             | https://www.youtube.com/watch?v=yTI1HoFYbE0
             | 
             | Another note: our study of Drowsy Maggie was largely made
             | possible by finding old-and-nearly-forgotten versions in
             | the Great78 project, which of course the industry attempted
             | to sue out of existence on an IP basis. This is another
             | example of how IP is a conceptual threat to traditional
             | music - we need to be able to hear the tradition in order
             | to honor it.
        
         | oofbey wrote:
          | What do you mean, piracy in a DRM world? Like being able to
          | share your own PC games through the cloud?
        
           | killingtime74 wrote:
           | You can share the games you authored all you like. If you
           | bought a license to play them that's another story.
        
         | kanemcgrath wrote:
          | For self-hosted game streaming you can use Moonlight +
          | Sunshine; they work really well in my experience.
        
           | BrokenCogs wrote:
           | Exactly my experience too. I easily get 60fps at 1080p over
           | wireless LAN with moonlight + sunshine. Parsec is also
           | another option
        
         | sheepscreek wrote:
          | Probably wouldn't have been feasible: I heard developers had
          | to compile their games with Stadia support. Maybe it was an
          | entirely different platform with its own alternative to
          | DirectX, or maybe it had some kind of lightweight
          | compatibility layer (such as Proton), but I vaguely remember
          | that the few games I played had custom Stadia key bindings
          | (with Stadia symbols), and they would display like that
          | within the game. So some customization definitely did
          | happen.
         | 
         | This is unlike the model that PlayStation, Xbox and even Nvidia
         | are following - I don't know about Amazon Luna.
        
           | jakebasile wrote:
            | As I understand it, GeForce Now actually does require
            | changes to a game to run in the standard (and until
            | recently only) option, "Ready To Play". This is the
            | supposed reason that new updates to games sometimes take
            | time to get released on the service: either the developers
            | themselves or Nvidia need to modify them to work correctly
            | on the service. I have no idea if this is true, but it
            | makes sense to me.
           | 
           | They recently added "Install to Play" where you can install
           | games from Steam that aren't modified for the service. They
           | charge for storage for this though.
           | 
            | Sadly, there are still tons of games unavailable because
            | publishers need to opt in and many don't.
        
             | TiredOfLife wrote:
             | GeForce Now doesn't require any changes.
        
           | MindSpunk wrote:
           | Stadia games were just run on Linux with Vulkan + some extra
           | Stadia APIs for their custom swapchain and other bits and
           | pieces. Stadia games were basically just Linux builds.
        
           | numpad0 wrote:
           | They did have a dev console based on a Lenovo workstation, as
           | well as off-menu AMD V340L 2x8GB GPUs, both later leaked into
           | Internet auctions. So some hardware and software
           | customizations had definitely happened.
        
         | laidoffamazon wrote:
         | Stadia was sadly engineered in such a way that this is
         | impossible.
         | 
          | Speaking of which, who thought up the idea of using custom
          | hardware for this that would _already be obsolete_ a year
          | later? Who decided to use native Linux instead of a compat
          | layer? Why did the original Stadia website not even have a
          | search bar??
        
         | nolok wrote:
        | For self-hosted remote streaming of games, look at Moonlight
        | / Sunshine (Apollo).
        | 
        | Stadia required special versions of games, so it wouldn't be
        | that useful.
        
           | asmor wrote:
           | It's a shame that virtual / headless displays are such a mess
           | on both Linux and Windows. I use a 32:9 ultrawide and stream
           | to 16:9/16:10 devices, and even with hours of messing around
           | with an HDMI dummy and kscreen-doctor[1] it was still an
           | unreliable mess. Sometimes it wouldn't work when the machine
           | was locked, and sometimes Sunshine wouldn't restore the
           | resolution on the physical monitor (and there's no session
           | timeout either).
           | 
           | Artemis is a bit better, but it still requires per-device
           | setup of displays since it somehow doesn't disable the
           | physical output next to the virtual one. Those drivers also
           | add latency to the capture (the author of looking glass
           | really dislikes them because they undo all the hard work of
           | near-zero latency).
           | 
           | [1]: https://github.com/acuteaura/universe/blob/main/systems/
           | _mod...
        
             | nolok wrote:
              | Use Apollo (a fork of Sunshine):
             | https://github.com/ClassicOldSong/Apollo
             | 
             | > Built-in Virtual Display with HDR support that matches
             | the resolution/framerate config of your client
             | automatically
             | 
              | It includes a virtual screen driver, and it handles all
              | the crap (it can disable your physical screen when
              | streaming and re-enable it afterwards, it can generate
              | the virtual screen per client to match the client's
              | needs, or do it per game, or ...)
             | 
              | I stream from my main PC to both my laptop and my Steam
              | Deck, and each gets a screen that matches it, without
              | having to do anything more than connect with Moonlight.
        
               | asmor wrote:
                | Artemis/Apollo are mentioned in the post above; yeah,
                | they work better than the out-of-box experience, but
                | you still have to configure your physical screen to be
                | off for every virtual display. Apollo unfortunately
                | only runs on Windows, which my machine usually
                | doesn't. I also only have one dGPU and a Raphael iGPU
                | (which is sensitive to memory overclocks), and I like
                | the Linux gaming experience for the most part, so
                | while I did have a working gaming VM, it wasn't for me
                | (or I'd want another GPU).
        
             | heavyset_go wrote:
             | On Linux with an AMD i/dGPU, you can set the
             | `virtual_display` module parameter for `amdgpu`[1] and do
             | what you want without the need for an HDMI dummy or weird
             | software. It's also hardware accelerated.
             | 
             | > _virtual_display (charp)_
             | 
             | > _Set to enable virtual display feature. This feature
             | provides a virtual display hardware on headless boards or
             | in virtualized environments. It will be set like
             | xxxx:xx:xx.x,x;xxxx:xx:xx.x,x. It's the pci address of the
             | device, plus the number of crtcs to expose. E.g.,
             | 0000:26:00.0,4 would enable 4 virtual crtcs on the pci
             | device at 26:00.0. The default is NULL._
             | 
             | [1]https://www.kernel.org/doc/html/latest/gpu/amdgpu/module
             | -par...
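              | 
              | e.g. (assuming your dGPU sits at 0000:03:00.0, which
              | you'd confirm with lspci), an /etc/modprobe.d entry
              | like:
              | 
              |     options amdgpu virtual_display=0000:03:00.0,1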
        
               | asmor wrote:
               | Unfortunately this seems to disable physical outputs.
               | 
               | https://bugzilla.kernel.org/show_bug.cgi?id=203339
        
               | heavyset_go wrote:
               | I figure if you're using an HDMI dummy you're running
               | headless anyway
               | 
               | edit: didn't realize you're the OP lol
        
         | mrguyorama wrote:
          | I don't understand; "self-hosted Stadia" describes any of
          | the myriad services and tools that do literally that.
          | 
          | Steam has game streaming built in, and it works very well.
          | Both Nvidia and AMD built this into their GPU drivers at one
          | point or another (I think the AMD one was shut down?)
         | 
         | Those are just the solutions I accidentally have installed
         | despite not using that functionality. You can even stream games
         | _from_ the steam deck!
         | 
         | Sony even has a system to let you stream your PS4 to your
         | computer anywhere and play it. I think Microsoft built
         | something similar for Xbox.
        
       | theamk wrote:
       | This CDC is "Content Defined Chunking" - fast incremental file
       | transfer.
       | 
        | The use case is copying a file over a slow network when the
        | previous version is already there, so one can save time by
        | only sending the changed parts of the file.
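        | 
        | The shape of the saving, as a sketch (plan() and the wire
        | format here are hypothetical; both sides must chunk with the
        | same CDC function):
        | 
        |     import hashlib
        | 
        |     def plan(new_chunks, remote_digests):
        |         # remote_digests: hashes of the chunks the receiver
        |         # already holds from the previous file version
        |         ops = []
        |         for c in new_chunks:
        |             h = hashlib.sha1(c).digest()
        |             # ship a tiny reference when the chunk already
        |             # exists remotely, raw bytes only when it is new
        |             ops.append(("ref", h) if h in remote_digests
        |                        else ("data", c))
        |         return ops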
       | 
        | Not to be confused with USB CDC ("communications device
        | class"), a USB device protocol used to present serial ports
        | and network cards. It can also be used to transfer files; the
        | old PC-to-PC cables used it by implementing two network cards
        | connected to each other.
        
         | oofbey wrote:
          | The clever trick is how it recognizes insertions. The
          | standard trick of computing hashes on fixed-size blocks
          | works efficiently for substitutions but is totally defeated
          | by an insertion or deletion.
          | 
          | Instead, with CDC the block boundaries are defined by the
          | content, so an insertion doesn't shift the boundaries that
          | follow it, and the subsequent blocks can be recognized as
          | unchanged. I haven't read the CDC paper, but I'm guessing
          | they just use some probabilistic hash function to define
          | certain strings as block boundaries.
        
           | teraflop wrote:
           | Probably worth noting that ordinary rsync can also handle
           | insertions/deletions because it uses a rolling hash. Rsync's
           | method is bandwidth-efficient, but not especially CPU-
           | efficient.
        
           | adzm wrote:
           | > I haven't read the CDC paper but I'm guessing they just use
           | some probabilistic hash function to define certain strings as
           | block boundaries.
           | 
           | You choose a number of bits (say, 12) and then evenly
           | distribute these in a 48-bit mask; if the hash at any point
           | has all these bits on, that defines a boundary.
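            | 
            | A minimal sketch of that in Python (the seed and mask
            | value here are arbitrary, not FastCDC's real parameters):
            | 
            |     import random
            | 
            |     random.seed(42)  # any fixed seed; see thread below
            |     GEAR = [random.getrandbits(64) for _ in range(256)]
            |     # 12 bits, one per nibble, spread over a 48-bit span
            |     MASK = 0x0011111111111100
            | 
            |     def chunks(data):
            |         h, start = 0, 0
            |         for i, b in enumerate(data):
            |             # gear hash: shift, then add a table entry
            |             h = ((h << 1) + GEAR[b]) & (2**64 - 1)
            |             if h & MASK == MASK:  # all mask bits on
            |                 yield data[start:i + 1]
            |                 h, start = 0, i + 1
            |         if start < len(data):
            |             yield data[start:]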
        
         | NooneAtAll3 wrote:
          | not to be confused with the Centers for Disease Control
        
           | 1ncorrect wrote:
           | ...or cDc[0]
           | 
           | [0] https://en.wikipedia.org/wiki/Cult_of_the_Dead_Cow
        
             | bbkane wrote:
             | Or https://en.wikipedia.org/wiki/Change_data_capture
        
               | monocasa wrote:
               | Or https://en.wikipedia.org/wiki/Control_Data_Corporation
        
           | petsfed wrote:
            | Especially in the context of the recent (that is, last 10
            | years) removal of data from Centers for Disease Control
            | sources due to changing political winds.
        
       | claytongulick wrote:
       | I ran into some of those issues with the chunk size and hash
       | misses when writing bitsync [1], but at the time I didn't want to
       | get too clever with it because I was focused on rsync algorithm
       | compatibility.
       | 
       | This is a cool idea!
       | 
       | [1] https://github.com/claytongulick/bit-sync
        
       | modeless wrote:
       | Does Steam do something like this for game updates?
        
         | Scaevolus wrote:
         | Steam unfortunately doesn't use a rolling hash like this
         | (fastcdc, buzhash, etc.), but rather slices files into 1MB
         | chunks, hashes them, and updates at that granularity.
         | 
         | https://partner.steamgames.com/doc/sdk/uploading#AppStructur...
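          | 
          | In code terms that manifest is just fixed slicing (sketch):
          | 
          |     import hashlib
          | 
          |     CHUNK = 1 << 20  # 1 MiB, fixed
          | 
          |     def manifest(data):
          |         return [hashlib.sha1(data[i:i + CHUNK]).digest()
          |                 for i in range(0, len(data), CHUNK)]
          | 
          | so a one-byte insertion early in a file changes every chunk
          | hash after it.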
        
       | supportengineer wrote:
       | Cygwin? Does anyone still use that?
        
         | cheema33 wrote:
          | Cygwin has its benefits over WSL: it does not run in a VM,
          | for example, and therefore does not suffer from the
          | resulting performance penalty.
        
       | mikae1 wrote:
       | _> cdc_rsync is a tool to sync files from a Windows machine to a
       | Linux device, similar to the standard Linux rsync._
       | 
       | Does this work Linux to Linux too?
        
         | kxrm wrote:
         | No: https://github.com/google/cdc-file-transfer?tab=readme-ov-
         | fi...
        
       | maxlin wrote:
        | Having dabbled in trying to make a quick delta patch system
        | like Steam's, which required me to understand delta patching
        | methods and which made small patches to big files in a 10 GB+
        | installation in a few seconds, this sure is quite interesting!
       | 
       | I wonder if Steam ever decides to supercharge their content
       | handling with some user-space filesystem stuff. With fast
       | connections, there isn't really a reason they couldn't launch
       | games in seconds, streaming data on-demand with smart pre-caching
       | steering based on automatically trained access pattern data. And
       | especially with finely tuned delta patching like this, online
       | game pauses for patching could be almost entirely eliminated.
       | Stop & go instead of a pit stop.
        
         | fsfod wrote:
          | Someone already created that[1] using a custom kernel
          | driver and their own CDN, but they seem to have abandoned
          | it[2], maybe because they would have attracted Valve's wrath
          | trying to monetize it.
         | 
         | [1]
         | https://web.archive.org/web/20250517130138/https://venusoft....
         | 
         | [2] https://venusoft.net/#home
        
           | maxlin wrote:
           | That's actually quite interesting. Not entirely what I had in
           | mind but close! My version would have only the first boot be
           | a bit slow, but the aspect of dynamically replacing local
           | content there is cool.
           | 
           | This would be extra cool for LAN parties with good network
           | hardware
        
         | Zekio wrote:
          | Steam game installs are bottlenecked by CPU speed these days
          | due to the heavy compression, so I doubt it'd be much
          | faster.
        
           | maxlin wrote:
            | Well, the amount of compression isn't set in stone;
            | obviously a system like this would run with a less
            | compressed dataset, balancing game boot time against the
            | CPU time compression takes away from the running game, and
            | scaling with available bandwidth.
            | 
            | With low bandwidth, just downloading the whole thing with
            | enough compression to 80% saturate the local system would
            | be optimal instead, sure.
        
       | ur-whale wrote:
       | Great initiative, especially the new sync algorithm, but giant
       | hurdles to adoption:
       | 
       | - only works on a weird combo of (src platform / dst platform).
       | Why???? How hard is it to write platform-independent code to
       | read/write bytes and send them over the wire in 2025?
       | 
       | - uses bazel, an enormous, Java-based abomination, to build.
       | 
       | Fingers crossed that these can be fixed, or this project is dead
       | in the water.
        
         | hobs wrote:
          | The first thing might be considered a bug by Googlers, but
          | everyone I have talked to LOVED their Bazel, or at least
          | thought of it as superior to any other tool that does the
          | same stuff.
          | 
          | Literally tonight my buddy was talking about his months-long
          | plan to introduce Bazel into his company's infra.
        
         | jve wrote:
          | Hey, the repo is archived, and as I read it, the tool was
          | meant to solve one specific scenario. Not everything has to
          | please the public.
          | 
          | The great thing is that Googlers could make such a tool and
          | publish it in the first place. So you can improve it to use
          | in your scenario. Or become the maintainer of such a tool.
        
         | maccard wrote:
         | > only works on a weird combo of (src platform / dst platform).
         | Why????
         | 
         | Stadia ran on linux, and 99.9999999% of game development is
         | done on windows (and cross compiled for linux).
         | 
         | > Fingers crossed that these can be fixed, or this project is
         | dead in the water.
         | 
         | The project was archived 9 months ago, and hasn't had a commit
         | in 2 years. It's already dead.
        
       | EdSchouten wrote:
       | I've also been doing lots of experimenting with Content Defined
       | Chunking since last year (for https://bonanza.build/). One of the
       | things I discovered is that the most commonly used algorithm
       | FastCDC (also used by this project) can be improved significantly
       | by looking ahead. An implementation of that can be found here:
       | 
       | https://github.com/buildbarn/go-cdc
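        | 
        | One way to picture the lookahead idea (a crude sketch, not
        | necessarily go-cdc's actual rule): instead of cutting at the
        | first position whose hash qualifies, scan the rest of the
        | allowed window and keep the best-scoring cut:
        | 
        |     def pick_cut(hashes, lo, hi):
        |         # hashes[i]: rolling hash value at offset i; choose
        |         # the cut in [lo, hi) with the maximal hash instead
        |         # of the first qualifying one
        |         return max(range(lo, hi), key=hashes.__getitem__)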
        
         | Scaevolus wrote:
         | This lookahead is very similar to the "lazy matching" used in
         | Lempel-Ziv compressors!
         | https://fastcompression.blogspot.com/2010/12/parsing-level-1...
         | 
         | Did you compare it to Buzhash? I assume gearhash is faster
         | given the simpler per iteration structure. (also, rand/v2's
         | seeded generators might be better for gear init than mt19937)
        
           | EdSchouten wrote:
           | Yeah, GEAR hashing is simple enough that I haven't considered
           | using anything else.
           | 
           | Regarding the RNG used to seed the GEAR table: I don't think
           | it actually makes that much of a difference. You only use it
           | once to generate 2 KB of data (256 64-bit constants). My
           | suspicion is that using some nothing-up-my-sleeve numbers
            | (e.g., the first 2048 binary digits of π) would work as
            | well.
        
             | pbhjpbhj wrote:
              | The random number generation could happen to match the
              | first 2048 digits of pi, so if it works with _any_
              | random number...
              | 
              | If it doesn't work with any random number, then some
              | seeds work better than others, and intuitively you can
              | find a best seed (or set of seeds).
        
             | Scaevolus wrote:
             | Right, just one fewer module dependency using the stdlib
             | RNG.
        
         | rokkamokka wrote:
         | What would you estimate the performance implications of using
         | go-cdc instead of fastcdc in their cdc_rsync are?
        
           | EdSchouten wrote:
           | In my case I observed a ~2% reduction in data storage when
           | attempting to store and deduplicate various versions of the
           | Linux kernel source tree (see link above). But that also
           | includes the space needed to store the original version.
           | 
           | If we take that out of the equation and only measure the size
           | of the additional chunks being transferred, it's a reduction
           | of about 3.4%. So it's not an order of magnitude difference,
           | but not bad for a relatively small change.
        
         | quotemstr wrote:
         | I wonder whether there's a role for AI here.
         | 
         | (Please don't hurt me.)
         | 
         | AI turns out to be useful for data compression
         | (https://statusneo.com/creating-lossless-compression-
         | algorith...) and RF modulation optimization
         | (https://www.arxiv.org/abs/2509.04805).
         | 
         | Maybe it'd be useful to train a small model (probably of the
         | SSM variety) to find optimal chunking boundaries.
        
           | EdSchouten wrote:
           | Yeah, that's true. Having some kind of chunking algorithm
           | that's content/file format aware could make it work even
           | better. For example, it makes a lot of sense to chunk source
           | files at function/scope boundaries.
           | 
           | In my case I need to ensure that all producers of data use
           | exactly the same algorithm, as I need to look up build cache
           | results based on Merkle tree hashes. That's why I'm
           | intentionally focusing on having algorithms that are not only
           | easy to implement, but also easy to implement _consistently_.
           | I think that MaxCDC implementation that I shared strikes a
           | good balance in that regard.
        
         | xyzzy_plugh wrote:
         | > https://bonanza.build
         | 
         | I just wanted to let you know, this is really cool. Makes me
         | wish I still used Bazel.
        
       | laidoffamazon wrote:
       | As I've gotten further in my career I've started to wonder - how
       | many engineering quarters did it take to build this for their
       | customers? How did they manage to get this on their own roadmap?
        | This seems like a lot of code surface area for a fairly
        | minimal optimization that would be redundant with a different
        | development substrate (like running Windows on Stadia, the way
        | Amazon Luna worked...)
        
         | jayd16 wrote:
         | It's easy to get work on this problem. Any effort that shortens
         | game deploy time will be highly visible. It's something every
         | game needs, and every member of the team deals with.
        
           | laidoffamazon wrote:
            | I'm sympathetic to this idea, but it seems like this is a
            | situation most game developers don't have, because they
            | just develop locally. Sometimes they do need to push to a
            | console, which this could help with if Microsoft or Sony
            | built it into their dev kit tooling.
        
         | grodes wrote:
         | You are thinking like a manager, but this (as with most of the
         | good things in life) has been built by doers, artisans, and
         | engineers (developers).
         | 
         | This is a problem interesting enough, with huge potential
         | benefits for humanity if it manages to improve anything, which
         | it did.
        
       | AnonC wrote:
       | Does anyone know if there's work being done to integrate this
       | into the standard rsync tool (even as an optional feature)? It
       | seems like a very useful improvement that ought to be available
       | widely. From this website it seems a bit disappointing that it's
       | not even available for Linux to Linux transfers.
        
         | rincebrain wrote:
         | You can find some thoughts on it not working for Linux to
         | Linux, and more broad compatibility, here[1] and here[2].
         | 
         | [1] - https://github.com/google/cdc-file-
         | transfer/issues/56#issuec...
         | 
         | [2] - https://github.com/librsync/librsync/issues/242
        
       | est wrote:
       | I wonder if this could be applied to git.
       | 
        | A git blob is hashed with a header containing its decimal
        | length, so if you change even a slight bit of content, you
        | have to recalculate the hash from the start.
        | 
        | Something like CDC would improve this a lot.
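        | 
        | For reference, a blob id covers the header plus the entire
        | content, so any edit forces a full rehash:
        | 
        |     import hashlib
        | 
        |     def blob_id(content: bytes) -> str:
        |         header = b"blob %d\x00" % len(content)
        |         return hashlib.sha1(header + content).hexdigest()
        | 
        |     # blob_id(b"hello\n")
        |     # -> "ce013625030ba8dba906f756967f9e9ca394464a",
        |     # the same id `git hash-object` prints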
        
         | oac wrote:
         | It's done in xet as a replacement for git lfs:
         | https://huggingface.co/blog/from-files-to-chunks
        
         | pabs3 wrote:
         | Backup tools like restic/borg do this, I wonder if anyone has
         | used them to replace git yet.
        
       | janpmz wrote:
        | Tailscale and python3 -m http.server 1337, then navigating
        | the browser to ip:1337, is a nice way to transfer files too
        | (without chunking). I've made an alias for it:
        | 
        |     alias serveit="python3 -m http.server 1337"
        
       | wheybags wrote:
       | If anyone else was left wondering about the details of how CDC
       | actually generates chunks, I found these two blog posts explained
       | the idea pretty clearly:
       | 
       | https://joshleeb.com/posts/content-defined-chunking.html
       | 
       | https://joshleeb.com/posts/gear-hashing.html
        
         | jcul wrote:
         | Thanks, I was puzzled by that. They kind of gloss over it in
         | the original link.
         | 
         | Looking forward to reading those.
        
       | tgsovlerkhgsel wrote:
       | Key sentence: "The remote diffing algorithm is based on CDC
       | [Content Defined Chunking]. In our tests, it is up to 30x faster
       | than the one used in rsync (1500 MB/s vs 50 MB/s)."
        
       | MayeulC wrote:
       | I am quite confused; doesn't rsync already use content-defined
       | chunk boundaries, with a condition on the rolling hash to define
       | boundaries?
       | 
        | https://en.wikipedia.org/wiki/Rolling_hash#Content-based_sli...
       | 
        | The speed improvements over rsync seem related to a more
        | efficient rolling hash algorithm, and possibly to using native
        | Windows executables instead of Cygwin (Windows file systems
        | are notoriously slow; maybe that plays a role here).
       | 
       | Or am I missing something?
       | 
       | In any case, the performance boost is interesting. Glad the
       | source was opened, and I hope it finds its way into rsync.
        
         | sneak wrote:
         | rsync seems frozen in time; it's been around for ages and there
         | are so many basic and small quality of life improvements that
         | could have been made that haven't been. I have always assumed
         | it's like vim now: only really maintained in theory, not in
         | practice.
        
           | Zardoz84 wrote:
            | So you haven't used vim or neovim in the last 10 years?
        
             | lftl wrote:
              | To be fair, there was a roughly 6-year period when vim
              | saw only one very minor release. That slow development
              | period was the impetus for the Neovim fork.
        
               | Zardoz84 wrote:
                | I know. I use Neovim. But since then, and thanks to
                | Neovim, Vim has sped up and gained some improvements.
        
               | dotancohen wrote:
               | Time for neorsync.
               | 
               | That said, VIM 8 was terrific.
        
           | chasil wrote:
           | Please bear in mind that there are [now] two distinct rsync
           | codebases.
           | 
           | The original is the GPL variant [today displaying "Upgrade
           | required"]:
           | 
           | https://rsync.samba.org/
           | 
           | The second is the BSD clone:
           | 
           | https://www.openrsync.org/
           | 
           | The BSD version would be used on platforms that are
           | intolerant of later versions of the GPL (Apple, Android,
           | etc.).
        
         | re wrote:
         | > doesn't rsync already use content-defined chunk boundaries,
         | with a condition on the rolling hash to define boundaries?
         | 
         | No, it operates on fixed size blocks over the destination file.
         | However, by using a rolling hash, it can detect those blocks at
         | any offset within the source file to avoid re-transferring
         | them.
         | 
         | https://rsync.samba.org/tech_report/node2.html
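          | 
          | For the curious, the weak rolling checksum from that tech
          | report boils down to this (a sketch of the report's a/b
          | sums, with M = 2^16):
          | 
          |     M = 1 << 16
          | 
          |     def weak(block):
          |         # a: plain byte sum; b: position-weighted sum
          |         a = sum(block) % M
          |         b = sum((len(block) - i) * x
          |                 for i, x in enumerate(block)) % M
          |         return a, b
          | 
          |     def roll(a, b, out, inp, blocklen):
          |         # slide the window one byte in O(1): drop `out`,
          |         # take in `inp`
          |         a = (a - out + inp) % M
          |         b = (b - blocklen * out + a) % M
          |         return a, b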
        
         | ohitsdom wrote:
         | The readme very nicely contrasts the approach with rsync.
        
       | exikyut wrote:
       | I'm curious: what does MUC stand for? :)
        
       | bilekas wrote:
        | This is actually kind of cool. I've implemented my own
        | version of this for my job, and it seems to be something
        | that's important when the numbers get tight. But for their
        | case, wouldn't it have been easier to work from rsync?
       | 
       | > scp always copies full files, there is no "delta mode" to copy
       | only the things that changed, it is slow for many small files,
       | and there is no fast compression.
       | 
        | I haven't tried it myself, but doesn't this already suit that
        | requirement? https://docs.rc.fas.harvard.edu/kb/rsync/
       | 
       | > Compression If the SOURCE and DESTINATION are on different
       | machines with fast CPUs, especially if they're on different
       | networks (e.g. your home computer and the FASRC cluster), it's
       | recommended to add the -z option to compress the data that's
       | transferred. This will cause more CPU to be used on both ends,
       | but it is usually faster.
       | 
        | Maybe it's not fast enough, but it seems a better place to
        | start than scp, imo.
        
         | regularfry wrote:
         | > The remote diffing algorithm is based on CDC. In our tests,
         | it is up to 30x faster than the one used in rsync (1500 MB/s vs
         | 50 MB/s).
        
         | rincebrain wrote:
         | rsync in my experience is not optimized for a number of use
         | cases.
         | 
         | Game development, in particular, often involves truly enormous
         | sizes and numbers of assets, particularly for dev build
         | iteration, where you're sometimes working with placeholder or
         | unoptimized assets, and debug symbol bloated things, and in my
         | experience, rsync scales poorly for speed of copying large
         | numbers of things. (In the past, I've used naive wrapper
         | scripts with pregenerated lists of the files on one side and
         | GNU parallel to partition the list into subsets and hand those
         | to N different rsync jobs, and then run a sync pass at the end
         | to cleanup any deletions.)
         | 
         | Just last week, I was trying to figure out a more effective way
         | to scale copying a directory tree that was ~250k files varying
         | in size between 128b and 100M, spread out across a
         | complicatedly nested directory structure of 500k directories,
         | because rsync would serialize badly around the cost of creating
         | files and directories. After a few rounds of trying to do many-
         | way rsync partitions, I finally just gave the directory to
         | syncthing and let its pregenerated index and watching handle
         | it.
        
           | jmuhlich wrote:
           | Try this: https://alexsaveau.dev/blog/projects/performance/fi
           | les/fuc/f...
           | 
           | > The key insight is that file operations in separate
           | directories don't (for the most part) interfere with each
           | other, enabling parallel execution.
           | 
           | It really is magically fast.
           | 
           | EDIT: Sorry, that tool is only for local copies. I just
           | remembered you're doing remote copies. Still worth keeping in
           | mind.
        
       | Sammi wrote:
       | It's dead and archived atm, but it looks like a good candidate
       | for revival as an actual active open source project. If you ever
       | wanted to work on something that looks good on your resume, then
       | this looks like your chance. Basically just get it running and
       | released on all major platforms.
        
       | phyzome wrote:
       | You can see something similar in use in the borg backup tool --
       | content-defined chunking, before deduplication and encryption.
        
       | syngrog66 wrote:
       | CDC is an unfortunately chosen name
        
       | 0xfeba wrote:
        | The name reminds me of Microsoft's RDC, Remote Differential
        | Compression.
       | 
       | https://en.wikipedia.org/wiki/Remote_Differential_Compressio...
        
       | velcrovan wrote:
       | > Download the precompiled binaries from the latest release to a
       | Windows device and unzip them. The Linux binaries are
       | automatically deployed to ~/.cache/cdc-file-transfer by the
       | Windows tools. There is no need to manually deploy them.
       | 
       | Interesting, so unlike rsync there is no need to set up a service
       | on the destination Linux machine. That always annoyed me a bit
       | about rsync.
        
         | justinsaccount wrote:
         | The most common use for rsync is to run it over ssh where it
         | starts the receiving side automatically. cdc is doing the exact
         | same thing.
         | 
         | You were misinformed if you thought using rsync required
         | setting up an rsync service.
        
       | charleshwang wrote:
       | Is this how IBM Aspera works too? I was working QA at a game
       | publisher a while ago, and they used it to upload some screen
       | recordings. I didn't understand how it worked, but it was
       | exceeding the upload speeds of the regular office internet.
       | 
       | https://www.ibm.com/products/aspera
        
       | ksherlock wrote:
        | They should have duck-ducked the initialism. CDC is Control
        | Data Corporation.
        
       | shae wrote:
        | I've read lots about content-defined chunking and recently
        | heard about monoidal hashing. I haven't tried it yet, but
        | monoidal hashing reads like it would be all-around better;
        | does anyone know why or why not?
        
       ___________________________________________________________________
       (page generated 2025-10-01 23:02 UTC)