[HN Gopher] Disk write buffering and its interactions with write...
___________________________________________________________________
Disk write buffering and its interactions with write flushes
Author : ingve
Score : 36 points
Date : 2024-03-18 07:04 UTC (3 days ago)
(HTM) web link (utcc.utoronto.ca)
(TXT) w3m dump (utcc.utoronto.ca)
| mgerdts wrote:
| > Rather than allowing multiple gigabytes of outstanding buffered
| writes and deferring writeback until a gigabyte or more has
| accumulated, you'd set things to trigger writebacks almost
| immediately and then force processes doing write IO to wait for
| disk writes to complete once you have more than a relatively
| small volume of outstanding writes.
|
| This is especially true when the thing doing a bunch of buffered
| writes is in a VM. If the VMM is buffering writes to the host fs,
| you get the described effects in the host OS and the guest OS.
| M95D wrote:
| I noticed this problem when I switched to Linux.
|
| On Windows, when I copied a file, disk writes started
| immediately. On older systems, like Win98, I had to tweak Total
| Commander's disk buffer to improve copy speed on the same drive.
| Total Commander even had separate settings for same disk vs.
| different disk copy buffer sizes.
|
| When I switched to Linux I was immediately surprised that disk
| writes did not start until the memory was full, and then it would
| stop reading while flushing dirty data. This happens even if the
| copy is between different drives: reads stop, writes only, then
| reads again with no writes to the other disk, repeat. It
| basically halves the copy speed.
|
| It even happens when I copy to network mounts: it reads 20 GB of
| data into memory, then reading stops while it tries to flush the
| data over NFS. NFS times out and the transfer fails. I had to use
| NFS timeouts of 1h just to be able to do a backup.
|
| It drives me crazy. Is there any way to make it write
| immediately, or at least to put a memory limit on dirty data?
| nolist_policy wrote:
| https://docs.kernel.org/admin-guide/sysctl/vm.html#dirty-byt...
|
| https://docs.kernel.org/admin-guide/sysctl/vm.html#dirty-bac...
|
| The defaults are crazy high (on modern hardware anyway):
| dirty_background_ratio is 10% of memory and dirty_ratio is 20%
| (the *_bytes variants default to 0, i.e. unset). I wonder why no
| distro touches these.
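For a sense of scale, a rough sketch of what those default percentages come to on a hypothetical 32 GB machine (the kernel actually computes the thresholds against available memory, not total, so real numbers will differ somewhat):

```python
# Rough scale of the default dirty-memory thresholds on a hypothetical
# machine with 32 GiB of RAM.
total_ram = 32 * 1024**3          # 32 GiB in bytes

dirty_background_ratio = 10       # default: background writeback kicks in
dirty_ratio = 20                  # default: writers are throttled/blocked

background_threshold = total_ram * dirty_background_ratio // 100
blocking_threshold = total_ram * dirty_ratio // 100

print(background_threshold // 1024**2, "MiB")  # ~3276 MiB before writeback even starts
print(blocking_threshold // 1024**2, "MiB")    # ~6553 MiB before writers block
```

In other words, gigabytes of dirty data can accumulate before the kernel does anything at all.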
| suprjami wrote:
| Because people complain their system is "slow" if it blocks
| on disk I/O.
|
| Another set of people also complain Linux takes too long to
| safely unplug USB drives.
| Rygian wrote:
| When the choice is between "slow" file transfers and being
| unable to do file transfers because of NFS timeouts, the
| choice should be obvious.
|
| I've lost data as a side effect of a simple file transfer
| timing out.
| suprjami wrote:
| Just the small number of people using NFS suggests this
| tunable should remain at the default. Nothing is stopping
| sysadmins from tuning it for their environment.
|
| There's no one-size-fits-all answer, which is why it's a
| tunable.
| tremon wrote:
| This isn't just about NFS timeouts. Try playing a movie
| from a rotational disk while simultaneously doing high-
| volume writes. You _will_ get frequent pauses in your
| video because the write buffer size is so large that a
| single writeback will cause the video buffer to drain
| empty.
|
| On my desktop with 32GB ram, I can even get audio to skip
| when ripping DVD's to disk. That's because practically
| the entire movie fits into ram before Linux decides to
| start the writeback process, and that writeback process
| will hog the disk for almost a minute. Or it used to,
| until I reduced the buffer size by a full order of
| magnitude.
|
| This is just another sad example of buffer bloat: the
| inability to tune data buffers to the capacity of the
| underlying stream.
| pradn wrote:
| fsync() guarantees that writes have hit the disk. But is there a
| guarantee about what's written before an fsync()? Can it be
| anywhere between "nothing" and "everything"? I suppose this must
| be a loose guarantee if the "write-back" parameter can be tweaked
| at will.
| nolist_policy wrote:
| > Can it be anywhere between "nothing" and "everything"?
|
| Yes, and that won't change, because the hardware, with its own
| buffers, behaves the same way.
| toast0 wrote:
| > Can it be anywhere between "nothing" and "everything"?
|
| Yes, and there's not (generally) any ordering constraint,
| either. The last thing you wrote may be persisted, and not the
| first, etc.
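One consequence: an application that needs "A durable before B" has to create that ordering itself with an fsync() between the dependent writes. A minimal sketch of the pattern (filenames are hypothetical):

```python
# Sketch: since buffered writes can hit disk in any order, an
# application that needs "journal before data" ordering must fsync
# in between. Filenames below are hypothetical.
import os, tempfile

d = tempfile.mkdtemp()
journal = os.path.join(d, "journal.log")
data = os.path.join(d, "data.db")

# Step 1: write the journal entry and force it to stable storage.
with open(journal, "w") as f:
    f.write("intent: update record 42\n")
    f.flush()              # push Python's userspace buffer into the kernel
    os.fsync(f.fileno())   # block until the kernel reports it durable

# Step 2: only now touch the data file. If we crash after this point,
# the journal is guaranteed on disk; without the fsync above, either
# file (or neither, or both) might have made it.
with open(data, "w") as f:
    f.write("record 42 = new value\n")
```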
| loeg wrote:
| There's some user input to this via posix_fadvise
| POSIX_FADV_DONTNEED but it doesn't guarantee anything.
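A sketch of that hint, assuming Linux and a hypothetical output file; as noted, it is purely advisory and guarantees nothing:

```python
# Sketch: advise the kernel that just-written data won't be re-read,
# so its page-cache pages can be dropped. Advisory only; Linux-specific.
# The output path is hypothetical.
import os, tempfile

path = os.path.join(tempfile.mkdtemp(), "bulk-output.bin")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
try:
    os.write(fd, b"x" * (1 << 20))   # 1 MiB of bulk data
    os.fsync(fd)                     # make the pages clean first
    # offset=0, length=0 means "the whole file"
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
finally:
    os.close(fd)
```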
| Rygian wrote:
| My systems routinely report transferring files at speeds much
| faster than the physical medium can sustain, only to have them
| get stuck and fail on a timeout a while later.
|
| Having such behavior be the default is, to my limited
| understanding, a bug in the Linux kernel.
| adr1an wrote:
| There's "$ sync" to ask the kernel to flush dirty data
| immediately, and "dd" has options to do so too (conv=fsync,
| oflag=sync). It's just not the default, as with many things on
| Linux, unfortunately.
| toast0 wrote:
| > Rather than allowing multiple gigabytes of outstanding buffered
| writes and deferring writeback until a gigabyte or more has
| accumulated, you'd set things to trigger writebacks almost
| immediately and then force processes doing write IO to wait for
| disk writes to complete once you have more than a relatively
| small volume of outstanding writes.
|
| I think having the trigger be size-based rather than time-based
| is the real problem. Or have bounds on both...
|
| I probably don't want to buffer writes for more than X seconds,
| or let the buffer grow beyond Y% of ram. At least for the time
| based limit, you'd really want to be able to say start writing to
| disk when the buffer has data over 10 seconds old, but still
| accept writes into a new buffer, only blocking writes when
| there's a buffer being written out _and_ the current buffer is
| too old or too big.
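The scheme described above can be sketched as a toy double-buffering policy (all thresholds and names hypothetical, not how the kernel implements writeback):

```python
# Toy sketch of the policy described above: flush a buffer once its
# oldest data exceeds MAX_AGE or its size exceeds MAX_SIZE, keep
# accepting writes into a fresh buffer meanwhile, and make the writer
# block only if the fresh buffer also hits a limit while a flush is
# still in progress. All numbers are hypothetical.
MAX_AGE = 10.0        # seconds
MAX_SIZE = 64 << 20   # 64 MiB

class WritebackBuffer:
    def __init__(self):
        self.current = []          # (timestamp, payload) pairs
        self.current_bytes = 0
        self.flushing = None       # buffer being written out, or None

    def write(self, now, payload):
        if self._over_limit(now) and self.flushing is not None:
            return False           # caller must block: both buffers full
        if self._over_limit(now):
            self._start_flush()    # rotate buffers, keep accepting
        self.current.append((now, payload))
        self.current_bytes += len(payload)
        return True

    def _over_limit(self, now):
        too_old = bool(self.current) and now - self.current[0][0] > MAX_AGE
        return too_old or self.current_bytes > MAX_SIZE

    def _start_flush(self):
        self.flushing = self.current
        self.current, self.current_bytes = [], 0

    def flush_done(self):
        self.flushing = None
```

With this policy, a write arriving while the oldest buffered data is 15 seconds old rotates the buffer and is still accepted; only a second limit breach during an in-flight flush makes the writer wait.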
| loeg wrote:
| Ideally the system starts flushing buffered writes basically
| immediately, but with low device-level queue depth so that
| subsequently issued higher priority IOs do not suffer from a
| ton of additional latency.
| kvemkon wrote:
| Ideally is only when the size of the complete file is known
| in advance. To minimize avoidable fragmentation.
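When the final size is known, one way to communicate it to the filesystem is preallocation; a sketch assuming Linux (path and size hypothetical):

```python
# Sketch: when the final size is known up front, preallocating lets the
# filesystem pick extents for the whole file at once, reducing
# fragmentation. Linux-specific; path and size are hypothetical.
import os, tempfile

path = os.path.join(tempfile.mkdtemp(), "big-transfer.bin")
final_size = 16 << 20   # 16 MiB, known in advance

fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
try:
    os.posix_fallocate(fd, 0, final_size)   # reserve all blocks now
finally:
    os.close(fd)
```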
| magicalhippo wrote:
| ZFS has both a time-based and a size-based limit. IIRC the
| default is 15 seconds and the lower of 2GB and some % of system
| memory.
|
| Though in ZFS' case it's not really a regular write cache as
| such, as it's used to minimize updates to its on-disk copy-on-
| write structure.
___________________________________________________________________
(page generated 2024-03-21 23:01 UTC)