Subj : Re: Move bookworm system from SSD to NVME
To   : bnl@nowhere.com
From : Richard Kettlewell
Date : Sat Aug 03 2024 19:51:15

Björn Lundin writes:
> On 2024-08-03 03:27, Lawrence D'Oliveiro wrote:
>> On Fri, 2 Aug 2024 15:12:57 +0100, The Natural Philosopher wrote:
>>
>>> SSDS/NVM have their own internal caching.
>> True for all drives, unfortunately.
>> It’s bloody stupid, because the drive caching is on the wrong side of
>> the drive interface. Better to leave it to the OS, which can use main
>> RAM for its filesystem caching, on the fast side of that drive
>> interface.
>> When a drive says to the OS driver “write is done”, it should mean
>> “write has gone to actual persistent storage”, not “write is in my
>> cache”.
>
> I think they cause grief in the postgres mail lists some 15-20 years ago.
> They were called 'lying IDE-disks' and were not popular in that crowd.

‘Lying disks’ are those that either disregard flush operations, or lie
about whether they have a write-back cache at all. That is certainly a
stupid outcome, though as a response to operating systems or
applications that are over-eager to flush, you can see why there’d be
pressure from marketing to do it, and acceptance from market segments
that either don’t value data integrity or alternatively assume that the
storage is unreliable and address the issue some other way.

I think what Lawrence is complaining about is the default behavior,
even for a non-lying disk: when a SATA device returns a response to the
host, this may indicate only that data has been transferred to an
internal write-back cache rather than to the underlying medium.

But that’s just the normal engineering response to high physical IO
latency. Recall that neither traditional hard disks nor SSDs have a 1:1
mapping between the logical block reads/writes requested by the host
and the physical operations on the medium. In a hard disk it takes time
to reach the correct track, and the order of writes from the host may
not match the track order.
In an SSD, multiple logical blocks are grouped into pages, and pages
must be written in a single operation.

The same logic turns up elsewhere. The write() syscall completing
normally only indicates that data has been transferred to the operating
system’s RAM cache, not to your SSD or hard disk (and certainly not to
a remote disk, when using a network filesystem). A memory write
instruction on a modern CPU only transfers a value to an internal write
buffer; the data may only reach the external DRAM hundreds of cycles
later, or not at all if the same location is written again soon.

The alternative is absurdly slow write IO. Usually there is some
combination of flush operations, synchronous IO modes, barrier
operations, etc., to allow data integrity requirements to be met without
sacrificing performance globally (and this is true both at the Linux
syscall layer and in the SATA protocol ... provided of course your disk
does not lie).

-- 
https://www.greenend.org.uk/rjk/
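
To make the write()-versus-flush distinction above concrete, here is a
minimal Python sketch, assuming a POSIX system. The function names
(write_buffered, write_flushed, write_osync) are made up for
illustration and do not come from the post; os.write, os.fsync and
os.O_SYNC are the standard library's wrappers for the corresponding
syscalls and open flag. It illustrates the idea only and is not a
complete durability recipe.

```python
import os

# Flags shared by all three variants: create or truncate, write-only.
_FLAGS = os.O_WRONLY | os.O_CREAT | os.O_TRUNC


def _write_all(fd: int, data: bytes) -> int:
    """Call write() until the kernel has accepted every byte."""
    view = memoryview(data)
    total = 0
    while total < len(view):
        total += os.write(fd, view[total:])
    return total


def write_buffered(path: str, data: bytes) -> int:
    """Plain write(): on return the data is only known to be in the
    operating system's cache, not on the drive."""
    fd = os.open(path, _FLAGS, 0o600)
    try:
        return _write_all(fd, data)
    finally:
        os.close(fd)


def write_flushed(path: str, data: bytes) -> int:
    """write() followed by an explicit flush: fsync() asks the kernel to
    push the file's data to the device before returning."""
    fd = os.open(path, _FLAGS, 0o600)
    try:
        written = _write_all(fd, data)
        os.fsync(fd)
        return written
    finally:
        os.close(fd)


def write_osync(path: str, data: bytes) -> int:
    """Synchronous IO mode: with O_SYNC (Unix only), each write() itself
    waits for the data to be handed to the device."""
    fd = os.open(path, _FLAGS | os.O_SYNC, 0o600)
    try:
        return _write_all(fd, data)
    finally:
        os.close(fd)
```

For example, write_flushed("journal.dat", b"commit") returns only after
fsync() has returned, whereas write_buffered() may return long before
anything reaches the drive. Whether the bytes are really on the medium
after fsync() still depends on the drive honoring the flush, which is
the "lying disk" caveat above. For a newly created file the containing
directory generally needs its own fsync() as well; that step is left
out of this sketch.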