Subj : Re: Move bookworm system from SSD to NVME
To   : bnl@nowhere.com
From : Richard Kettlewell
Date : Sat Aug 03 2024 19:51:15

Björn Lundin <bnl@nowhere.com> writes:
> On 2024-08-03 03:27, Lawrence D'Oliveiro wrote:
>> On Fri, 2 Aug 2024 15:12:57 +0100, The Natural Philosopher wrote:
>>
>>> SSDs/NVMe have their own internal caching.
>> True for all drives, unfortunately.
>> It’s bloody stupid, because the drive caching is on the wrong side of
>> the drive interface. Better to leave it to the OS, which can use main
>> RAM for its filesystem caching, on the fast side of that drive
>> interface.
>> When a drive says to the OS driver “write is done”, it should mean
>> “write has gone to actual persistent storage”, not “write is in my
>> cache”.
>
> I think they caused grief on the postgres mailing lists some 15-20 years ago.
> They were called 'lying IDE-disks' and were not popular in that crowd.

‘Lying disks’ are those that either disregard flush operations, or lie
about whether they have a write-back cache at all. That is certainly a
stupid outcome, though as a response to operating systems or
applications that are over-eager to flush, you can see why there’d be
pressure from marketing to do it, and acceptance from market segments
that either don’t value data integrity or alternatively assume that the
storage is unreliable and address the issue some other way.

I think what Lawrence is complaining about is that by default, even for
a non-lying disk, a response from a SATA device to the host may indicate
only that data has been transferred to an internal write-back cache
rather than to the underlying medium.

But that’s just the normal engineering response to high physical IO
latency.

Recall that neither traditional hard disks nor SSDs have a 1:1 mapping
between the logical block reads/writes requested by the host and the
physical operations on the medium. In a hard disk it takes time to reach
the correct track, and the order of writes from the host may not match
the track order. In an SSD multiple logical blocks are grouped into
pages, and each page must be written in a single operation.
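
To make the hard disk case concrete, here is a toy sketch (not a model
of any real drive's firmware) comparing total head travel when writes
are serviced in arrival order against the same writes sorted by track,
which is something a write-back cache makes possible. The track numbers
and the head_travel helper are made up for illustration.

```python
def head_travel(start, tracks):
    """Total distance the head moves when visiting tracks in the given order."""
    distance = 0
    position = start
    for track in tracks:
        distance += abs(track - position)
        position = track
    return distance


# Hypothetical track numbers, in the order the host issued the writes.
arrival_order = [90, 10, 80, 20, 70]

# Without a cache the drive must service each write as it arrives.
in_order = head_travel(0, arrival_order)

# With the writes held in a cache, the drive can sort them by track first.
reordered = head_travel(0, sorted(arrival_order))

print(in_order, reordered)  # prints "350 90"
```

Real drives use more elaborate scheduling than a plain sort, but the
shape of the saving is the same.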

The same logic turns up elsewhere. The write() syscall completing
normally only indicates that data has been transferred to the operating
system’s RAM cache, not to your SSD or hard disk (and certainly not to a
remote disk, when using a network filesystem). A memory write
instruction on a modern CPU only transfers a value to an internal write
buffer; the data may only reach the external DRAM hundreds of cycles
later, or not at all if the same location is written again soon.
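
As a minimal sketch of the syscall-level point, assuming a POSIX-like
system: os.write() returning only means the kernel has accepted the
data, and it takes an explicit os.fsync() to ask for it to be pushed to
the device. The durable_write name and the file name are hypothetical.

```python
import os
import tempfile


def durable_write(path, data):
    """Write data to path, then ask the kernel to flush it to the device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may accept fewer bytes than offered, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Up to here the data may exist only in the OS page cache.
        # fsync blocks until the kernel reports it has been flushed.
        os.fsync(fd)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, "example.dat")
    durable_write(example_path, b"important record\n")
    with open(example_path, "rb") as f:
        stored = f.read()
```

For a newly created file, the directory entry may need its own fsync on
the containing directory; that is left out here to keep the sketch
short.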

The alternative is absurdly slow write IO. Usually there is some
combination of flush operations, synchronous IO modes, barrier
operations, etc., to allow data integrity requirements to be met without
sacrificing performance globally (and this is true both at the Linux
syscall layer and in the SATA protocol ... provided of course your disk
does not lie).
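
One example of a synchronous IO mode at the syscall layer, sketched
under the assumption of a Linux-like system: opening a file with
O_DSYNC makes each write() wait for the data to be flushed, so only the
files that need it pay the cost. The append_record_sync helper and the
journal file name are hypothetical, and the fallback to O_SYNC is my
own choice for platforms that lack O_DSYNC.

```python
import os
import tempfile

# Assumption: O_DSYNC exists (it does on Linux); otherwise use O_SYNC.
SYNC_FLAG = getattr(os, "O_DSYNC", None) or os.O_SYNC


def append_record_sync(path, record):
    """Append one small record; the write blocks until it has been flushed."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | SYNC_FLAG
    fd = os.open(path, flags, 0o644)
    try:
        # A single write is assumed to be enough for a small record;
        # production code would loop on short writes.
        return os.write(fd, record)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    journal = os.path.join(tmp, "journal.log")
    append_record_sync(journal, b"begin\n")
    append_record_sync(journal, b"commit\n")
    with open(journal, "rb") as f:
        journal_contents = f.read()
```

Other files on the same system keep using the ordinary cached path.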

--
https://www.greenend.org.uk/rjk/
