[HN Gopher] How useful should copy_file_range be?
       ___________________________________________________________________
        
       How useful should copy_file_range be?
        
       Author : chmaynard
       Score  : 42 points
       Date   : 2021-02-18 22:22 UTC (1 day ago)
        
 (HTM) web link (lwn.net)
 (TXT) w3m dump (lwn.net)
        
       | CyberRabbi wrote:
       | Why not optimistically copy the file until EOF and report number
       | of bytes copied? Why is stat() consulted at all? That seems
       | broken.
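
A minimal sketch of the "copy until EOF and report the byte count" approach described above, assuming Python 3.8+ on Linux, where os.copy_file_range is available. The helper name copy_until_eof and the CHUNK size are illustrative choices, not anything from the article. Treating a return value of 0 as EOF is itself an assumption, and the sketch falls back to a plain read/write loop if the call is refused.

```python
import os

CHUNK = 1 << 20  # bytes requested per call; an arbitrary choice


def copy_until_eof(src_fd: int, dst_fd: int) -> int:
    """Copy from src_fd to dst_fd until EOF; return the bytes copied.

    No stat() is consulted: the loop simply stops when a call moves 0 bytes.
    """
    total = 0
    use_cfr = hasattr(os, "copy_file_range")
    while True:
        if use_cfr:
            try:
                n = os.copy_file_range(src_fd, dst_fd, CHUNK)
            except OSError:
                # The call was refused (for example across filesystems on
                # older kernels); continue with plain read/write instead.
                use_cfr = False
                continue
        else:
            buf = os.read(src_fd, CHUNK)
            n = len(buf)
            view = memoryview(buf)
            while view:  # os.write may write fewer bytes than asked
                view = view[os.write(dst_fd, view):]
        if n == 0:  # nothing moved: treated as EOF (an assumption)
            return total
        total += n
```

Both descriptors are used at their current file offsets, so a caller that wants only part of a file can seek first and stop early.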
        
       | tyingq wrote:
       | Something to change the start point of an existing file would be
       | neat. Sort of like truncate(), but for the start.
        
         | jandrese wrote:
         | When you think about how files are stored on the filesystem it
         | becomes clear why this functionality doesn't exist. Each file
         | is basically a list of blocks and some metadata like the total
         | length of the file. What it doesn't have is a length for each
         | block--they are assumed to be full length except for the last,
         | which is stored as the total length of the file.
         | 
         | So if you wanted to add bytes to the front of the file you
         | would have to allocate new blocks to store it, but since there
         | is no map of length for each block you would have to only move
         | it by exact block lengths. Same for shrinking the file by
         | cutting off the head, you can't handle values other than full
         | blocks.
         | 
         | It's certainly possible to build a filesystem where this would
         | work, but when you wrote programs using the feature they
         | wouldn't be portable to any other commonly used filesystem.
         | People also don't change filesystems very often, so even if you
         | got the change into Ext and waited a decade many people would
         | still be incompatible.
         | 
         | Finally, it's a feature that is helpful only rarely. So there
         | isn't enough demand to push through such a massive change given
         | the headwinds it has.
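
The layout described above can be modelled with a toy class. This is a hypothetical in-memory illustration only (ToyFile, BLOCK_SIZE and the method names are made up here), not how any real filesystem is implemented: every block is full size, and a single total length says how much of the last block is valid.

```python
BLOCK_SIZE = 4096  # assumed block size for the illustration


class ToyFile:
    """Hypothetical model: a list of full-size blocks plus one total length."""

    def __init__(self, data: bytes):
        self.length = len(data)
        self.blocks = [
            data[i:i + BLOCK_SIZE].ljust(BLOCK_SIZE, b"\0")
            for i in range(0, len(data), BLOCK_SIZE)
        ]

    def read(self) -> bytes:
        return b"".join(self.blocks)[:self.length]

    def truncate_tail(self, new_length: int) -> None:
        # Any byte count works at the end: drop the blocks that are no
        # longer needed and record the new total length.
        if not 0 <= new_length <= self.length:
            raise ValueError("new_length out of range")
        keep = -(-new_length // BLOCK_SIZE)  # ceiling division
        del self.blocks[keep:]
        self.length = new_length

    def drop_head(self, nbytes: int) -> None:
        # There is no per-block length or start offset, so the head can
        # only be trimmed by whole blocks.
        if not 0 <= nbytes <= self.length:
            raise ValueError("nbytes out of range")
        if nbytes % BLOCK_SIZE != 0:
            raise ValueError("head can only be trimmed in whole blocks")
        del self.blocks[:nbytes // BLOCK_SIZE]
        self.length -= nbytes
```

In this model, ToyFile(data).drop_head(4096) succeeds while drop_head(100) raises ValueError, which mirrors the alignment restriction the comment describes.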
        
         | the8472 wrote:
         | fallocate(..., FALLOC_FL_COLLAPSE_RANGE) will do that but it
         | comes with limitations such as alignment requirements and
         | limited filesystem support.
         | 
         | You could also try creating a new file and use copy_file_range
         | to copy the tail of the file to the new one, then move it over
         | the old one. That might reuse a good chunk of the storage on a
         | CoW filesystem.
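
The second suggestion above (copy the tail into a new file, then move it over the old one) might look roughly like this. It is a sketch assuming Python 3.8+ on Linux; drop_head, _copy_some and the ".tmp" suffix are hypothetical names. Whether any storage is actually shared depends on the filesystem and probably on the alignment of the offset, and the sketch does not preserve ownership or permissions.

```python
import os

CHUNK = 1 << 20  # bytes requested per call; an arbitrary choice


def _copy_some(src_fd: int, dst_fd: int) -> int:
    """Copy up to CHUNK bytes at the current offsets; return the count."""
    try:
        return os.copy_file_range(src_fd, dst_fd, CHUNK)
    except (AttributeError, OSError):
        # copy_file_range missing or refused: plain read/write instead.
        buf = os.read(src_fd, CHUNK)
        view = memoryview(buf)
        while view:
            view = view[os.write(dst_fd, view):]
        return len(buf)


def drop_head(path: str, nbytes: int) -> None:
    """Drop the first nbytes of `path` by copying the tail to a new file."""
    tmp = path + ".tmp"  # hypothetical temporary name beside the original
    src = os.open(path, os.O_RDONLY)
    try:
        dst = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            os.lseek(src, nbytes, os.SEEK_SET)  # start at the new head
            while _copy_some(src, dst) > 0:
                pass
        finally:
            os.close(dst)
    finally:
        os.close(src)
    os.replace(tmp, path)  # rename the new file over the old one
```

Unlike the fallocate() route, this works for any byte offset, at the cost of needing a second file until the rename completes.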
        
       | callesgg wrote:
       | I used to ponder the idea that partial file copies could be
       | done with file fragmentation.
        
       | dataflow wrote:
       | It sounds wrong to depend on the file length for correctness even
       | on physical file systems. What if the file length shrinks during
       | the copy? You need to just keep going until you can't anymore...
        
         | cpuguy83 wrote:
         | That's assuming you want to copy the whole thing. And even
         | if you do want to do that, this is what you do with
         | copy_file_range as well, just that in many cases you can do it
         | with a single call instead of multiple read/write calls in
         | addition to being able to take advantage of performance
         | optimizations (such as reflinking).
        
           | dataflow wrote:
           | > That's assuming you want to copy the whole thing
           | 
           | Right, but if you don't, then you _definitely_ don't need to
           | query the file length in the first place...
        
       | jeffrallen wrote:
       | As a Go user and former contributor, it makes me pleased that the
       | rigor of the Go team occasionally gives the Linux kernel
       | developers heartburn. As long as everyone stays professional, the
       | end result is better for both groups.
       | 
       | Linux gets feature velocity by playing fast and loose sometimes
       | with stability. Demanding users like the Go authors are a
       | necessary and welcome counterbalance.
        
       | jandrese wrote:
       | It seems kind of strange that this is a kernel function at all.
       | It seems like something that should live in libc or the like. Is
       | there a performance benefit from having it up in kernel space? It
       | seems a bit outside of the scope of the kernel IMHO.
       | 
       | I can understand functions like sendfile() being able to cut down
       | on context switches and being helpful for bulk data transfer, but
       | is that the case here? How much benefit do you get from
       | copy_file_range() vs. a read/write loop?
       | 
       | I do note with some amusement how the kernel developer basically
       | went "Why are you using copy_file_range on things that aren't
       | actually files?"
        
         | 7786655 wrote:
         | >The copy_file_range() system call looks like a relatively
         | straightforward feature; it allows user space to ask the kernel
         | to copy a range of data from one file to another, _hopefully
         | applying some optimizations along the way._
        
           | jandrese wrote:
           | That's exactly what I was asking. The Go developers are
           | knocking themselves out chasing this syscall in the hopes
           | that it might improve performance? Has it been benchmarked?
        
         | webstrand wrote:
         | Given all of the effort being put into zero-copy read/write in
         | the kernel, I would assume there are significant performance
         | gains available.
         | 
         | I suspect that some correctly aligned ranges could be copied
         | with CoW semantics, thereby skipping the read/write altogether.
        
           | bonzini wrote:
           | On NFS you can avoid network traffic completely.
        
             | jandrese wrote:
             | Is this theoretical or is there support for it already in
             | the kernel and NFS daemons?
             | 
             | Reading the manpage for copy_file_range, the notes section
             | states:
             | 
             |     copy_file_range() gives filesystems an opportunity to
             |     implement "copy acceleration" techniques, such as the
             |     use of reflinks (i.e., two or more inodes that share
             |     pointers to the same copy-on-write disk blocks) or
             |     server-side-copy (in the case of NFS).
             | 
             | But it doesn't mention whether these techniques are actually
             | used. I guess there's some help in future-proofing your code.
             | 
             | Edit: I tested this on my Ubuntu 20.04 machine and a 1GB
             | file full of random data sitting in the file cache. Using
             | copy_file_range I could make a local-local copy in 0.595
             | seconds on average. Using a primitive read/write loop took
             | 0.600 seconds. But these values are somewhat noisy and the
             | difference is down in the error margin. It doesn't appear
             | that my 5.4.0 kernel on ext4 is employing the reflinks
             | optimization.
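
A comparison like the one described above could be set up roughly as follows. This is a sketch assuming Python 3.8+ on Linux, not the commenter's actual test program; copy_rw, copy_cfr, timed and the 64 KiB buffer size are all illustrative choices, and results will vary with filesystem, kernel and cache state.

```python
import os
import time

BUF = 64 * 1024  # assumed buffer size for the read/write loop


def copy_rw(src_path: str, dst_path: str) -> None:
    """Copy with a plain read/write loop in user space."""
    s = os.open(src_path, os.O_RDONLY)
    d = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        while True:
            buf = os.read(s, BUF)
            if not buf:
                break
            view = memoryview(buf)
            while view:  # handle short writes
                view = view[os.write(d, view):]
    finally:
        os.close(s)
        os.close(d)


def copy_cfr(src_path: str, dst_path: str) -> None:
    """Copy by calling copy_file_range until it reports 0 bytes."""
    s = os.open(src_path, os.O_RDONLY)
    d = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        while os.copy_file_range(s, d, 1 << 30):
            pass
    finally:
        os.close(s)
        os.close(d)


def timed(fn, *args) -> float:
    """Return the wall-clock seconds taken by fn(*args)."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start
```

Calling timed(copy_rw, src, dst) and timed(copy_cfr, src, dst) several times and comparing the averages gives a rough picture; a single run is likely to be dominated by noise, as the comment notes.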
        
               | dallbee wrote:
               | Can't share details, but I've seen the copy_file_range
               | optimization represent roughly 25% increased throughput
               | for a prior employer. It's more than theoretical.
        
               | bonzini wrote:
               | It's totally practical and already implemented by Linux
               | on both the server and the client side.
        
               | bonzini wrote:
               | ext4 does not support sharing blocks across multiple
               | files, indeed. Try btrfs or perhaps xfs.
        
               | the8472 wrote:
               | ext4 has no reflink support. btrfs does by default, xfs
               | under some configurations. server-side offload is
               | supported by nfs and cifs but also needs server-side
               | support.
        
               | gravypod wrote:
               | Even if it wasn't smart you'd still get a huge benefit
               | from not context switching to/from kernel/user space for
               | a standard copy implementation. Far fewer syscalls.
               | There's a real performance benefit to be had there.
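
The "far fewer syscalls" point can be put into rough numbers. The arithmetic below is only a back-of-the-envelope sketch: both function names are made up, the buffer size is an assumption, and the amount a single copy_file_range call actually copies is up to the kernel, so max_per_call is a hypothetical parameter rather than a documented limit.

```python
def rw_syscalls(size: int, buf: int) -> int:
    """Syscalls for a read/write loop, assuming full reads and writes."""
    chunks = -(-size // buf)  # ceiling division
    return 2 * chunks + 1     # one read + one write per chunk, plus EOF read


def cfr_syscalls(size: int, max_per_call: int) -> int:
    """Syscalls for a copy_file_range loop under the stated assumption."""
    return -(-size // max_per_call) + 1  # plus the final call returning 0


# 1 GiB with a 64 KiB buffer: 16384 reads + 16384 writes + 1 EOF read.
print(rw_syscalls(1 << 30, 64 * 1024))   # 32769
# If each call happened to copy the full 1 GiB: 1 copy + 1 EOF probe.
print(cfr_syscalls(1 << 30, 1 << 30))    # 2
```

Whether the saved transitions show up as wall-clock time depends on the workload, since the data still has to be copied unless the filesystem can share blocks.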
        
       ___________________________________________________________________
       (page generated 2021-02-19 23:01 UTC)