[HN Gopher] Disk write buffering and its interactions with write...
       ___________________________________________________________________
        
       Disk write buffering and its interactions with write flushes
        
       Author : ingve
       Score  : 36 points
       Date   : 2024-03-18 07:04 UTC (3 days ago)
        
 (HTM) web link (utcc.utoronto.ca)
 (TXT) w3m dump (utcc.utoronto.ca)
        
       | mgerdts wrote:
       | > Rather than allowing multiple gigabytes of outstanding buffered
       | writes and deferring writeback until a gigabyte or more has
       | accumulated, you'd set things to trigger writebacks almost
       | immediately and then force processes doing write IO to wait for
       | disk writes to complete once you have more than a relatively
       | small volume of outstanding writes.
       | 
       | This is especially true when the thing doing a bunch of buffered
       | writes is in a VM. If the VMM is buffering writes to the host fs,
       | you get the described effects in the host OS and the guest OS.
        
       | M95D wrote:
       | I noticed this problem when I switched to Linux.
       | 
       | On Windows, when I copied a file, disk writes started
       | immediately. On older systems, like Win98, I had to tweak Total
       | Commander's disk buffer to improve copy speed on the same drive.
       | Total Commander even had separate settings for same disk vs.
       | different disk copy buffer sizes.
       | 
       | When I switched to Linux I was immediately surprised that disk
       | writes did not start until the memory was full, and then it would
       | stop reading while flushing dirty data. This happens even if the
       | copy is between different drives: reads stop, writes only, then
       | reads again with no writes to the other disk, repeat. It
       | basically halves the copy speed.
       | 
       | It even happens when I copy to network mounts: reads 20 GB of
       | data into memory, then reading stops and tries to flush the data
       | over NFS. NFS times out, and the transfer fails. I had to use NFS
       | timeouts of 1h just to be able to do a backup.
       | 
       | It drives me crazy. Is there any way to make it write
       | immediately, or at least to put a memory limit on dirty data?
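
One application-level workaround for the question above is to have the copying program force writeback itself at regular intervals, so that dirty data from this one copy never grows beyond a fixed amount. The following is a minimal sketch, not what any existing copy tool does: `copy_with_bounded_dirty` and its `chunk_size` and `flush_every` parameters are hypothetical names chosen for illustration, and the 64 MiB default is an arbitrary assumption.

```python
import os


def copy_with_bounded_dirty(src, dst, chunk_size=1 << 20, flush_every=64 << 20):
    """Copy src to dst, forcing writeback every `flush_every` bytes.

    Hypothetical helper: it bounds the dirty data produced by this copy
    by calling fsync() periodically instead of once at the end.
    """
    copied = 0
    since_flush = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            buf = fin.read(chunk_size)
            if not buf:
                break
            fout.write(buf)
            copied += len(buf)
            since_flush += len(buf)
            if since_flush >= flush_every:
                fout.flush()             # hand Python's userspace buffer to the kernel
                os.fsync(fout.fileno())  # wait until the kernel has written it out
                since_flush = 0
        # Final flush so the tail of the file is also on stable storage.
        fout.flush()
        os.fsync(fout.fileno())
    return copied
```

The trade-off is that reads pause during each `fsync()`, so this limits memory use and timeout risk rather than maximizing throughput.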
        
         | nolist_policy wrote:
         | https://docs.kernel.org/admin-guide/sysctl/vm.html#dirty-byt...
         | 
         | https://docs.kernel.org/admin-guide/sysctl/vm.html#dirty-bac...
         | 
         | The values are crazy high by default (on modern hardware
         | anyway): 10% of available memory via dirty_background_ratio
         | and 20% via dirty_ratio (the dirty_background_bytes and
         | dirty_bytes counterparts read 0 until set). I wonder why no
         | distro touches these.
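
To see which limits are actually in effect on a given machine, the knobs can be read from `/proc/sys/vm`. The sketch below assumes the behavior described in the kernel documentation linked above: each `*_bytes` knob is the counterpart of a `*_ratio` knob, only one of the pair is active at a time, and the inactive one reads as 0. The helper names `read_vm_knob` and `effective_limit` are hypothetical, and the caller has to supply its own estimate of available memory.

```python
from pathlib import Path


def read_vm_knob(name, root="/proc/sys/vm"):
    """Return the integer value of a vm sysctl, or None if it cannot be read."""
    try:
        return int(Path(root, name).read_text().strip())
    except (OSError, ValueError):
        return None


def effective_limit(bytes_value, ratio_value, available_bytes):
    """Compute the limit in bytes from a *_bytes / *_ratio pair.

    A nonzero *_bytes value is used directly; otherwise the *_ratio
    percentage is applied to the caller's estimate of available memory.
    """
    if bytes_value:
        return bytes_value
    return available_bytes * (ratio_value or 0) // 100


if __name__ == "__main__":
    for prefix in ("dirty_background", "dirty"):
        b = read_vm_knob(prefix + "_bytes")
        r = read_vm_knob(prefix + "_ratio")
        print(prefix, "bytes =", b, "ratio =", r)
```

Writing a value to one of these files (as root) changes the limit until the next reboot; a persistent setting would normally go through the system's sysctl configuration.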
        
           | suprjami wrote:
           | Because people complain their system is "slow" if it blocks
           | on disk I/O.
           | 
           | Another set of people also complain Linux takes too long to
           | safely unplug USB drives.
        
             | Rygian wrote:
             | When the choice is between "slow" file transfers and being
             | unable to do file transfers because of NFS timeouts, the
             | choice should be obvious.
             | 
             | I've lost data as a side effect of a simple file transfer
             | timing out.
        
               | suprjami wrote:
               | Just the small number of people using NFS suggests the
               | current default should stay. Nothing is stopping
               | sysadmins from tuning for their environment.
               | 
               | There's no one-size-fits-all answer, which is why it's a
               | tunable.
        
               | tremon wrote:
               | This isn't just about NFS timeouts. Try playing a movie
               | from a rotational disk while simultaneously doing high-
               | volume writes. You _will_ get frequent pauses in your
               | video because the write buffer size is so large that a
               | single writeback will cause the video buffer to drain
               | empty.
               | 
               | On my desktop with 32 GB of RAM, I can even get audio to
               | skip when ripping DVDs to disk. That's because practically
               | the entire movie fits into ram before Linux decides to
               | start the writeback process, and that writeback process
               | will hog the disk for almost a minute. Or it used to,
               | until I reduced the buffer size by a full order of
               | magnitude.
               | 
               | This is just another sad example of buffer bloat: the
               | inability to tune data buffers to the capacity of the
               | underlying stream.
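
The "almost a minute" figure is easy to sanity-check with back-of-the-envelope arithmetic: the time a disk is busy draining the write buffer is roughly the buffer size divided by the disk's sustained write rate. The sketch below assumes a 100 MB/s rotational disk (a number not given in the comment) and, for simplicity, applies the percentage to total RAM rather than to available memory.

```python
GB = 10**9
MB = 10**6


def drain_seconds(ram_bytes, dirty_percent, disk_bytes_per_sec):
    """Seconds needed to write out `dirty_percent` of RAM at a given disk rate."""
    dirty_bytes = ram_bytes * dirty_percent / 100
    return dirty_bytes / disk_bytes_per_sec


if __name__ == "__main__":
    # Assumed: 32 GB of RAM and a disk sustaining 100 MB/s of writes.
    for percent in (20, 10, 1):
        seconds = drain_seconds(32 * GB, percent, 100 * MB)
        print(f"{percent:>2}% of RAM dirty -> about {seconds:.1f} s of writeback")
```

Under these assumptions, a 20% limit corresponds to about 64 seconds of continuous writing, and cutting the limit by an order of magnitude brings that down to a few seconds.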
        
       | pradn wrote:
       | fsync() guarantees that writes have hit the disk. But is there a
       | guarantee about what's written before an fsync()? Can it be
       | anywhere between "nothing" and "everything"? I suppose this must
       | be a loose guarantee if the "write-back" parameter can be tweaked
       | at will.
        
         | nolist_policy wrote:
         | > Can it be anywhere between "nothing" and "everything"?
         | 
         | Yes, and that won't change because the hardware with its own
         | buffers behaves the same way.
        
         | toast0 wrote:
         | > Can it be anywhere between "nothing" and "everything"?
         | 
         | Yes, and there's not (generally) any ordering constraint,
         | either. The last thing you wrote may be persisted, and not the
         | first, etc.
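
Since nothing orders writes between `fsync()` calls, a program that needs one record to be persisted before another has to put an `fsync()` between them. A minimal sketch of that pattern follows; `write_in_order` and `_write_all` are hypothetical helpers, and the durability actually achieved still depends on the filesystem and on the drive honoring flush requests.

```python
import os


def _write_all(fd, data):
    """Write all of `data` to fd, looping over partial writes."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def write_in_order(path, first, second):
    """Append two records so that `first` is flushed before `second` is issued."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        _write_all(fd, first)
        os.fsync(fd)  # barrier: request that `first` reach stable storage
        _write_all(fd, second)
        os.fsync(fd)  # and now `second`
    finally:
        os.close(fd)
```

Without the first `fsync()`, a crash could leave `second` on disk and `first` missing, which is the case described above.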
        
         | loeg wrote:
         | There's some user input to this via posix_fadvise
         | POSIX_FADV_DONTNEED but it doesn't guarantee anything.
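
For reference, this hint is reachable from Python through `os.posix_fadvise`, which is available on Unix-like platforms. The sketch below writes a file and then passes `POSIX_FADV_DONTNEED` for the whole file (offset 0, length 0 meaning "to the end"). As the comment says, it is only advice: the kernel may or may not start writeback or drop the cached pages. The helper name `write_and_hint` is hypothetical.

```python
import os


def write_and_hint(path, data):
    """Write `data` to `path`, then advise the kernel the pages are not needed.

    Returns True if the advice call was made without error, False if the
    platform lacks posix_fadvise or the call was rejected.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        if not hasattr(os, "posix_fadvise"):
            return False
        try:
            # Advice only: no guarantee that anything is written or evicted.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        except OSError:
            return False
        return True
    finally:
        os.close(fd)
```

Code that needs an actual guarantee still has to call `fsync()`.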
        
       | Rygian wrote:
       | My systems routinely report transferring files at speeds much
       | faster than the physical medium can sustain, only to have them
       | get stuck and fail on a timeout a while later.
       | 
       | Having such behavior be the default is, to my limited
       | understanding, a bug in the Linux kernel.
        
         | adr1an wrote:
         | There's "$ sync" to ask the kernel to start the actual write.
         | The "dd" command has an option to do so, too. It's just not
         | the default, like many things on Linux, unfortunately.
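
The gap between the reported speed and the real one can be made visible by timing a write twice: once when `write()` returns (the data may still be only in the page cache) and once after `fsync()` completes. The sketch below uses the hypothetical helper `timed_write` for this. For `dd`, GNU coreutils offers `conv=fsync`, which performs a comparable flush before the command exits.

```python
import os
import time


def timed_write(path, data):
    """Return (buffered_seconds, durable_seconds) for writing `data` to `path`.

    buffered_seconds: elapsed time until write() has returned.
    durable_seconds:  elapsed time until fsync() has also returned.
    """
    start = time.monotonic()
    with open(path, "wb") as f:
        f.write(data)
        f.flush()  # hand the data to the kernel; it may still be cached
        buffered = time.monotonic() - start
        os.fsync(f.fileno())  # wait for the kernel to write it out
        durable = time.monotonic() - start
    return buffered, durable
```

A transfer rate computed from the first number is the optimistic figure that progress bars tend to show; the second is closer to what the medium actually delivered.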
        
       | toast0 wrote:
       | > Rather than allowing multiple gigabytes of outstanding buffered
       | writes and deferring writeback until a gigabyte or more has
       | accumulated, you'd set things to trigger writebacks almost
       | immediately and then force processes doing write IO to wait for
       | disk writes to complete once you have more than a relatively
       | small volume of outstanding writes.
       | 
       | I think having the trigger be size-based rather than time-based
       | is the real problem. Or bounds on both...
       | 
       | I probably don't want to buffer writes for more than X seconds,
       | or let the buffer grow beyond Y% of ram. At least for the time
       | based limit, you'd really want to be able to say start writing to
       | disk when the buffer has data over 10 seconds old, but still
       | accept writes into a new buffer, only blocking writes when
       | there's a buffer being written out _and_ the current buffer is
       | too old or too big.
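
The policy proposed above can be written down as two small predicates: one deciding when to start writing out the current buffer, and one deciding when a writer must block. This is a sketch of the commenter's idea only, not of how the Linux kernel behaves. The class `WritePolicy`, its method names, and the 256 MiB size cap are assumptions made for illustration; the 10-second age limit comes from the comment.

```python
from dataclasses import dataclass


@dataclass
class WritePolicy:
    max_age_s: float = 10.0             # oldest buffered data allowed, in seconds
    max_bytes: int = 256 * 1024 * 1024  # assumed size cap for the current buffer

    def over_limit(self, age_s, size_bytes):
        """True if the current buffer is too old or too big."""
        return age_s > self.max_age_s or size_bytes > self.max_bytes

    def should_start_writeback(self, age_s, size_bytes, flush_in_progress):
        """Start flushing the current buffer once it exceeds a limit,
        provided no other buffer is already being written out."""
        return not flush_in_progress and self.over_limit(age_s, size_bytes)

    def should_block_writer(self, age_s, size_bytes, flush_in_progress):
        """Block writers only when a flush is already running AND the
        buffer that is accepting new writes has itself hit a limit."""
        return flush_in_progress and self.over_limit(age_s, size_bytes)
```

With these rules, writers keep going into a fresh buffer while the old one drains, and only stall when the device cannot keep up with two buffers' worth of data.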
        
         | loeg wrote:
         | Ideally the system starts flushing buffered writes basically
         | immediately, but with low device-level queue depth so that
         | subsequently issued higher priority IOs do not suffer from a
         | ton of additional latency.
        
           | kvemkon wrote:
           | That is ideal only when the size of the complete file is
           | known in advance, so as to minimize avoidable fragmentation.
        
         | magicalhippo wrote:
         | ZFS has both a time-based and a size-based limit. IIRC the
         | defaults are 15 seconds and the lower of 2 GB and some % of
         | system memory.
         | 
         | Though in ZFS' case it's not really a regular write cache as
         | such, as it's used to minimize updates to its on-disk copy-on-
         | write structure.
        
       ___________________________________________________________________
       (page generated 2024-03-21 23:01 UTC)