[HN Gopher] What every programmer should know about SSDs
       ___________________________________________________________________
        
       What every programmer should know about SSDs
        
       Author : sprachspiel
       Score  : 191 points
       Date   : 2021-06-20 17:39 UTC (5 hours ago)
        
 (HTM) web link (databasearchitects.blogspot.com)
 (TXT) w3m dump (databasearchitects.blogspot.com)
        
       | jedberg wrote:
       | This page tells me a lot about SSDs, but it doesn't tell me why I
       | need to know these things. It doesn't really give me any
       | indication about how I should change my behavior if I know that
       | I'll be running on SSD vs spinning disk.
       | 
       | I've always been told, "just treat SSDs like slow, permanent
       | memory".
        
         | danbst wrote:
          | yeah, the article should talk about periodic TRIMming, though
          | this is more of an admin concern
        
           | smt88 wrote:
           | Don't modern OSes transparently TRIM periodically anyway?
        
         | [deleted]
        
       | CoolGuySteve wrote:
       | The claim about parallelism isn't true. Most benchmarks and my
       | own experience show that sequential reads are still significantly
       | faster than random reads on most NVME drives.
       | 
        | However, random read performance is only somewhere between a
        | third and half as fast as sequential, compared to a magnetic disk
        | where it's often 1/10th as fast.
        
         | pkaye wrote:
          | At what queue depth do you test the read performance?
          | Sequential reads can be made fast at low queue depth by the SSD
         | controller doing prefetch reads internally. I've worked on such
         | algorithms myself.
        
       | kortilla wrote:
       | The title should be "why SSDs mean programmers no longer have to
       | think about hard drives".
       | 
       | These are all reasons SSDs are much more pleasant to work with
       | than old platter disks.
        
         | cbsmith wrote:
          | Well, they no longer need to think about hard _disks_, but
          | there are a lot of assumptions from the world of hard disks that
         | play out very differently in the SSD world.
        
       | teddyh wrote:
       | What _everyone_ should know is that flash drives can lose their
       | data when left unpowered for as little as three months.
        
         | mercora wrote:
          | if that is true, disks should come with a very visible note
          | stating this... seriously, 3 months would be nothing. I doubt
          | it is true, because 3 months is a time frame that gets
          | surpassed quite often, which would make this more widely known.
        
           | anticensor wrote:
           | Yep, they are _semivolatile limited write memory module_ s,
           | not disks. Everyone should use that SV-LWMM acronym.
        
           | AtlasBarfed wrote:
            | Is this an actually useful application of Optane, replacing
            | the memory with near-RAM nonvolatile storage?
        
           | teddyh wrote:
           | Depending on manufacturer, and storage conditions, it can be
           | up to about ten years. But the "three months" number is real:
           | https://web.archive.org/web/20210502042514/http://www.dell.c.
           | ..
        
             | crazygringo wrote:
              | That's a document from _nine and a half years ago_, and it
             | states:
             | 
              | > _It depends on the how much the flash has been used (P/E
             | cycle used), type of flash, and storage temperature. In MLC
             | and SLC, this can be as low as 3 months and best case can
             | be more than 10 years. The retention is highly dependent on
             | temperature and workload._
             | 
              | Are there any _modern_ sources that provide _more accurate_
             | stats?  "3 months to 10 years" is so vague as to be
             | useless.
        
         | crazygringo wrote:
         | Do you have a current source for that?
         | 
         | I've turned on plenty of cell phones that hadn't been charged
         | or powered on for a couple of years and everything worked
         | normally. Same with thumb drives I've picked up after years.
         | 
         | I mean, _anything_ can fail after three months. Your statement
          | doesn't really add anything without stating the failure
         | _rates_. For all I know the failure rate could be _less_ than
         | that of physical hard drives.
        
       | dan-robertson wrote:
       | See this paper from 2017, _The unwritten contract of solid state
        | drives_: https://dl.acm.org/doi/10.1145/3064176.3064187
        
       | dang wrote:
       | What someone else said about that in 2014:
       | 
       |  _What every programmer should know about solid-state drives_ -
       | https://news.ycombinator.com/item?id=9049630 - Feb 2015 (31
       | comments)
        
       | FpUser wrote:
       | It is really puzzling why "every programmer" should burden their
       | already overloaded brains with this. If they're reading/writing
       | some config/data files this knowledge would not help one bit. If
        | they're using a database then it falls to the database vendor to
       | optimize for this scenario.
       | 
       | So I think that unless this "every programmer" is a database
       | storage engine developer (not too many of them I guess) their
        | only concern would mostly be how close my SSD is to that magical
       | point where it has to be cloned and replaced before shit hits the
       | fan.
        
       | rossdavidh wrote:
       | Interesting, and fun to read and think about! And, as a
       | professional programmer for 17 years now, not once have I done
       | anything where this would have been important for me to know
        | (even if I had been running my code on a system with SSDs). So,
       | I'm not convinced the title is at all accurate.
       | 
       | But, fun to read and think about.
        
       | rabuse wrote:
       | A little off topic, but I bought a new Macbook Pro with the M1
       | chip with 8GB of RAM, and I'm worried about the swap usage of
       | this machine wearing out the SSD too quickly. Is this an actual
       | concern, as my swap has been in the multiple GB range with my
       | use?
        
         | cbsmith wrote:
         | It's an actual concern for you. For Apple it's a variant on
         | planned obsolescence. ;-)
         | 
          | Note though that memory use metrics on macOS can be
          | misleading. Make sure that you're seeing what's actually there.
        
         | Grazester wrote:
         | Why did you get the 8 gig version? If you are using all this
            | swap then you purchased the wrong MacBook.
        
           | rabuse wrote:
           | Honestly, don't run much, so didn't think it would be that
           | bad stepping down from my 16GB machine.
        
         | 1-6 wrote:
         | From what I've been able to gather, the excessive paging may
         | actually have to do with non-native apps running on the M1.
         | Avoid those.
        
           | rabuse wrote:
            | Most of my programs are JetBrains IDEs and browsers. Don't
           | know if they're optimized for M1.
        
       | [deleted]
        
       | personjerry wrote:
       | How big is the write cache usually and how does it work?
       | Typically I've seen the write caches be something like 32MB in
       | size, but the "top speed" seems to be sustained for files much
       | bigger than 32MB, which doesn't make sense to me if that top
       | speed is supposedly from writing to the cache. How does that
       | work?
        
         | bserge wrote:
          | On SSDs? 32MB is way off: the Samsung 470 had a 256MB RAM cache
          | and the 860 Pro a whopping 4GB for the top model.
         | 
         | Although they started removing it entirely for NVMe SSDs, I
         | guess the direct transfer speed is enough to not need a cache
         | at all.
        
           | mastax wrote:
           | NVMe drives can access system memory over the PCIe bus.
        
           | wtallis wrote:
           | The DRAM you're referring to is for the most part not a write
           | cache for user data. Most of that DRAM is a read cache for
           | the FTL's logical to physical address mapping table. When the
           | FTL is working with the typical granularity of 4kB, you get a
           | requirement of approximately 1GB of DRAM per 1TB of NAND.
           | 
           | Drives that include less than this amount of DRAM show
           | reduced performance, usually in the form of lower random read
           | performance because the physical address of the requested
           | data cannot be quickly found by consulting a table in DRAM
           | and must be located by first performing at least one slow
           | NAND read.
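            | 
            | A back-of-the-envelope sketch of where that ratio comes from
            | (assuming a 4-byte physical address per mapping entry; real
            | FTLs pack this differently):
            | 
            |     #include <stdio.h>
            | 
            |     int main(void) {
            |         unsigned long long nand = 1ULL << 40;  /* 1 TiB NAND */
            |         unsigned long long gran = 4096;  /* mapping granularity */
            |         unsigned long long entry = 4;    /* bytes per entry */
            |         unsigned long long entries = nand / gran;  /* ~268M */
            |         printf("%llu entries, %.2f GiB table\n", entries,
            |                (double)(entries * entry) / (1ULL << 30));
            |         return 0;
            |     }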
        
         | opencl wrote:
         | It varies quite a bit. There are two different types of caches:
         | SLC and DRAM. Most drives use SLC caching, higher end drives
         | often use both.
         | 
         | Typically the SSDs with DRAM have a ratio of 1GB DRAM per TB of
         | flash.
         | 
         | SLC caching is using a portion of the flash in SLC mode, where
         | it stores 1 bit per cell rather than the typical 2-4 (2 for
         | MLC, 3 for TLC, 4 for QLC) in exchange for higher performance.
         | SLC cache size varies wildly. Some SSDs allocate a fixed size
         | cache, some allocate it dynamically based on how much free
         | space is available. It can potentially be 10s of GBs on larger
         | SSDs.
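          | 
          | A rough sketch of the dynamic-cache arithmetic (assuming a TLC
          | drive that borrows free space for the cache; the exact policy
          | is vendor specific):
          | 
          |     #include <stdio.h>
          | 
          |     int main(void) {
          |         double free_gb = 100.0;      /* hypothetical free space */
          |         double bits_per_cell = 3.0;  /* TLC */
          |         /* the same cells run in SLC mode hold 1 bit each */
          |         printf("~%.0f GB of SLC cache\n", free_gb / bits_per_cell);
          |         return 0;
          |     }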
        
         | wtallis wrote:
         | Getting full throughput from the SSD is less about file size
         | and more about how much work is in the SSD's queue at any given
         | moment. If the host system only issues commands one at a time
         | (as would often result from using synchronous IO APIs), then
         | the SSD will experience some idle time between finishing one
         | command and receiving the next from the host system. If the
         | host ensures there are 2+ commands in the SSD's queue, it won't
         | have that idle time.
         | 
         | Then there's the matter of how much data is in the queue,
         | rather than how many commands are queued. Imagine a 4 TB SSD
         | using 512Gbit TLC dies, and an 8-channel controller. That's 64
         | dies with 2 or 4 planes per die. A single page is 16kB for
         | current NAND, so we need 2 or 4 MB of data to write if we want
         | to light up the whole drive at once, and that much again
         | waiting in the queue to ensure the drive can begin the next
         | write as soon as the first batch completes. But you can often
         | hit a bottleneck elsewhere (either the PCIe link, or the
         | channels between the controller and NAND) before you have every
         | plane of every die 100% busy.
         | 
         | If you're working with small files, then your filesystem will
         | be producing several small IOs for each chunk of file contents
         | you read or write from the application layer, and many of those
         | small metadata/fs IOs will be in the critical path, blocking
         | your data IOs. So even though you can absolutely hit speeds in
         | excess of 3 GB/s by issuing 2MB write commands one at a time to
         | a suitably high-end SSD, you may have more difficulty hitting 3
         | GB/s by writing 2MB _files_ one at a time.
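          | 
          | A quick sanity check of that die arithmetic for the
          | hypothetical drive above:
          | 
          |     #include <stdio.h>
          | 
          |     int main(void) {
          |         unsigned long long drive = 4ULL * 8 * (1ULL << 40); /* 4TB in bits */
          |         unsigned long long die = 512ULL * (1ULL << 30);     /* 512Gbit */
          |         unsigned long long dies = drive / die;              /* 64 */
          |         unsigned long long planes = 2, page_kb = 16;
          |         printf("%llu dies, %llu kB to keep every plane busy\n",
          |                dies, dies * planes * page_kb);  /* 64, 2048 kB */
          |         return 0;
          |     }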
        
       | rectang wrote:
       | I wince at the amount of wear the `git clean -dxf; npm ci` cycle
       | must be putting on my SSD.
        
       | 1_player wrote:
       | A lot of talk about pages, but no mention about how big these
       | pages are. From a quick look on Google, most SSDs have 4kB pages,
       | with some reaching 8kB or even 16kB.
        
         | wtallis wrote:
         | SSDs mostly tell the host system that they have 512-byte
         | sectors or sometimes 4kB sectors, and the typical flash
         | translation layer works in 4kB sectors because that's a good
         | fit for the kind of workloads coming from a host system that
         | usually prefers to do things (eg. virtual memory) in 4kB
         | chunks. But the underlying NAND flash page size has been 16kB
         | for years.
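          | 
          | On Linux you can see what the drive advertises to the host (not
          | the underlying NAND page size, which isn't exposed here); a
          | minimal sketch, device name assumed and privileges needed:
          | 
          |     #include <fcntl.h>
          |     #include <linux/fs.h>
          |     #include <stdio.h>
          |     #include <sys/ioctl.h>
          |     #include <unistd.h>
          | 
          |     int main(void) {
          |         int fd = open("/dev/nvme0n1", O_RDONLY);  /* adjust */
          |         if (fd < 0) { perror("open"); return 1; }
          |         int lss = 0;
          |         unsigned int pss = 0;
          |         ioctl(fd, BLKSSZGET, &lss);   /* logical, e.g. 512 */
          |         ioctl(fd, BLKPBSZGET, &pss);  /* physical, e.g. 4096 */
          |         printf("logical %d, physical %u bytes\n", lss, pss);
          |         close(fd);
          |         return 0;
          |     }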
        
           | cbsmith wrote:
           | ...and all that cruft, and the logic to try to make handling
           | of it not so bad, makes for a lot of complexity and
           | unintended consequences.
        
       | wly_cdgr wrote:
       | There's nothing whatsoever I should need to know about SSDs as a
       | Javascript programmer and if there is then the programmers on the
       | lower levels haven't done their jobs right and are wasting my
       | time
        
       | dataflow wrote:
       | What's the flash translation layer made of? Is the flash
       | technology used for that more durable than the rest of the SSD
       | itself? (like say MLC vs. QLC?)
        
         | pkaye wrote:
         | The FTL is like a virtual memory manager. It is
         | firmware/hardware to manage things like the logical to physical
         | mapping table, garbage collection, error correction, bad block
         | management. Yes there will be a lot of FTL data structures
         | stored on the flash. It can be made durable by redundant
         | copies, writing in SLC mode or having recovery algorithms. I
          | used to develop SSD firmware, so feel free to ask if you have
          | further questions.
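          | 
          | A toy illustration of the out-of-place-update idea behind that
          | mapping table (nothing like production firmware, which also has
          | to handle GC, wear leveling and power-loss recovery):
          | 
          |     #include <stdio.h>
          | 
          |     #define PAGES 16
          |     #define FREE  (-1)
          | 
          |     static int l2p[PAGES];    /* logical -> physical page */
          |     static int stale[PAGES];  /* physical pages awaiting GC */
          |     static int next_free = 0;
          | 
          |     /* NAND can't overwrite in place, so every write lands in
          |      * a fresh physical page and the old copy goes stale */
          |     static void write_page(int logical) {
          |         if (l2p[logical] != FREE)
          |             stale[l2p[logical]] = 1;
          |         l2p[logical] = next_free++;
          |     }
          | 
          |     int main(void) {
          |         for (int i = 0; i < PAGES; i++) l2p[i] = FREE;
          |         write_page(3);
          |         write_page(3);  /* "overwrite" uses a new page */
          |         printf("logical 3 -> physical %d, page 0 stale: %d\n",
          |                l2p[3], stale[0]);
          |         return 0;
          |     }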
        
           | jng wrote:
           | Hey that's very interesting! How much of the FTL logic is
           | done with regular MCU code vs custom hardware? Is there any
           | open source SSD firmware out there that one could look at to
           | start experimenting in this field, or at least something
           | pointing in that direction, be it open or affordable
            | software, firmware, FPGA gateware or even IC IP? I believe
           | there is value in integrating that part of the stack with the
           | higher level software, but it seems quite difficult to
           | experiment unless one is in the right circles / close to the
           | right companies. Thanks!
        
         | SeanCline wrote:
          | You're right that the FTL has some durability concerns, which,
          | in addition to performance, are why it's typically cached in
         | DRAM. Older DRAM-less SSDs were unreliable in the long-term but
         | that's been improving with the adoption of HMB, which lets the
         | SSD controller carve out some system RAM to store FTL data.
        
       | effnorwood wrote:
       | they do not spin
        
       | bob1029 wrote:
       | Things I have learned about SSDs:
       | 
       | If you want to go fast & save NAND lifetime, use append-only log
       | structures.
       | 
       | If you want to go even faster & save even more NAND lifetime,
       | batch your writes in software (i.e. some ring buffer with natural
       | back-pressure mechanism) and then serialize them with a single
       | writer into an append-only log structure. Many newer devices have
       | something like this at the hardware level, but your block size is
       | still a constraint when working in hardware. If you batch in
       | software, you can hypothetically write multiple logical business
        | transactions _per_ block I/O. When your physical block size is 4k
       | and your logical transactions are averaging 512b of data, you
       | would be leaving a lot of throughput on the table.
       | 
       | Going down 1 level of abstraction seems important if you want to
       | extract the most performance from an SSD. Unsurprisingly, the
       | above ideas also make ordinary magnetic disk drives more
       | performant & potentially last longer.
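        | 
        | A minimal sketch of that batching idea (a single writer appending
        | fixed 4k blocks to a log fd; a real version would add an fsync
        | policy, back-pressure and torn-write handling):
        | 
        |     #include <string.h>
        |     #include <unistd.h>
        | 
        |     #define BLOCK 4096
        | 
        |     static char   buf[BLOCK];  /* pending records for one block */
        |     static size_t used = 0;
        | 
        |     /* pad and append the whole block as a single write() */
        |     static void flush_block(int log_fd) {
        |         if (used == 0) return;
        |         memset(buf + used, 0, BLOCK - used);
        |         write(log_fd, buf, BLOCK);
        |         used = 0;
        |     }
        | 
        |     /* many small logical transactions share one block I/O */
        |     static void append_record(int log_fd, const void *rec,
        |                               size_t len) {
        |         if (used + len > BLOCK)
        |             flush_block(log_fd);
        |         memcpy(buf + used, rec, len);
        |         used += len;
        |     }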
        
         | remram wrote:
         | Shouldn't the OS or libc take care of that? If I write and
         | don't immediately flush()?
        
         | pclmulqdq wrote:
         | I used to think the same thing, but now that I work on SSD-
         | based storage systems, I'm not sure this holds up in today's
         | storage stacks. Log structuring really helped with HDDs since
         | it meant fewer seeks.
         | 
         | In particular, the filesystem tends to undo a lot of the
         | benefits you get from log-structuring unless you are using a
         | filesystem designed to keep your files log-structured. Using
         | huge writes definitely still helps, though.
         | 
         | A paper that I really like goes deeper into this:
         | http://pages.cs.wisc.edu/~jhe/eurosys17-he.pdf
         | 
         | Edit: I had originally said "designed for flash" instead of
         | "designed to keep your files log-structured." F2FS is designed
         | for flash, but in my testing does relatively poorly with log-
         | structured files because of how it works internally.
         | 
         | Edit 2: de-googled the link. Thank you for pointing that out.
        
           | 10000truths wrote:
           | Achieving cutting-edge storage performance tends to require
           | bypassing the filesystem anyways. Traditionally, that meant
           | using SPDK. Nowadays, opening /dev/nvme* with O_DIRECT and
           | operating on it with io_uring will get you most of the way
           | there.
           | 
           | In either case, the advice given in the article and by the OP
           | is filesystem agnostic.
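            | 
            | A bare-bones sketch of that path with liburing (error
            | handling mostly omitted; the device name is an assumption
            | and opening it raw needs privileges; build with -luring):
            | 
            |     #define _GNU_SOURCE
            |     #include <fcntl.h>
            |     #include <liburing.h>
            |     #include <stdio.h>
            |     #include <stdlib.h>
            | 
            |     int main(void) {
            |         int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
            |         if (fd < 0) { perror("open"); return 1; }
            | 
            |         void *buf;
            |         posix_memalign(&buf, 4096, 4096);  /* aligned buffer */
            | 
            |         struct io_uring ring;
            |         io_uring_queue_init(8, &ring, 0);
            | 
            |         struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            |         io_uring_prep_read(sqe, fd, buf, 4096, 0);
            |         io_uring_submit(&ring);
            | 
            |         struct io_uring_cqe *cqe;
            |         io_uring_wait_cqe(&ring, &cqe);
            |         printf("read returned %d\n", cqe->res);
            |         io_uring_cqe_seen(&ring, cqe);
            | 
            |         io_uring_queue_exit(&ring);
            |         return 0;
            |     }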
        
             | nyanpasu64 wrote:
             | Will an end user downloading a video editing app (or
             | similar) have a NVME drive, know how to give your app
             | direct access to a NVME drive, and will your app not
             | corrupt the rest of the files on the drive?
        
               | 10000truths wrote:
               | Extreme performance requires extreme tradeoffs. As with
               | anything else, you have to evaluate your use cases and
               | determine for yourself whether the tradeoffs are worth
               | it. For a mass-market application that has to play nice
               | with other applications and work with a wide variety of
               | commodity hardware, it's probably not worthwhile. For a
               | state-of-the-art high performance data store that expects
               | low latencies and high throughput (a la ScyllaDB), it may
               | very well be.
        
             | quotemstr wrote:
             | Why would you want to bypass the filesystem by talking to
             | the block device directly? Doesn't O_DIRECT on a
             | preallocated regular file accomplish the same thing with
             | less management complexity and special OS permissions?
             | Granted, the file extents might be fragmented a bit, but
             | that can be fixed.
        
               | 10000truths wrote:
               | A "regular file" might reside in multiple locations on
               | disk for redundancy, or might have a checksum that needs
               | to be maintained alongside it for integrity. Or, as you
               | say, its contents might not reside in contiguous sectors
               | - or you might be writing to a hole in a sparse file.
               | There's a lot of "magic" that could go on behind the
               | scenes when operating on "regular files", depending on
               | what filesystem you're using with what options. Directly
               | operating on the block device makes it easier to reason
               | about the performance guarantees, since your reads and
               | writes map more cleanly to the underlying SCSI/ATA/NVME
               | commands issued.
        
               | lazide wrote:
               | If you understand your workload and the hardware well
               | enough to understand how doing direct I/O on a file will
               | help - then you're going to generally do better against a
               | direct block device because there are fewer intermediate
               | layers doing the wrong optimizations or otherwise messing
               | you up. From a pure performance perspective anyway.
               | Extents are one part of the issue, flushes to disk (and
               | how/when they happen), caching, etc.
               | 
               | Doesn't mean it isn't easier to deal with as a file from
               | an administration perspective (and you can do snapshots,
               | or whatever!), but Lvm can do that too for a block
               | device, and many other things.
        
               | quotemstr wrote:
               | With O_DIRECT though you're opting out of the
               | filesystem's caching (well, VFS's), forced flushes, and
               | most FS level optimizations, so I'd expect it to perform
               | on par with direct partition access.
               | 
               | Do you have numbers showing an advantage of going
               | directly to the block device? Personally, I'd consider
               | the management advantages of a filesystem compelling
               | absent specific performance numbers showing the benefit
               | of direct partition access.
        
           | trulyme wrote:
           | Degoogled link:
           | http://pages.cs.wisc.edu/~jhe/eurosys17-he.pdf
        
         | scns wrote:
         | Like this?
         | 
         | https://en.wikipedia.org/wiki/NILFS?wprov=sfla1
        
         | AtlasBarfed wrote:
         | This is basically the purpose of rocksdb, and to a lesser
         | extent Cassandra
        
       | andrewmcwatters wrote:
       | My opinion is probably... not technically correct... until you
       | have to deal with drive reliability and write guarantees, but I
       | don't think programmers actually have to know anything about SSDs
       | in the same way that developers had to know particular things
       | about HDDs.
       | 
       | This is out of pure speculation, but there had to be a period of
       | time during the mass transition to SSDs that engineers said, OK,
       | how do we get the hardware to be compatible with software that
       | is, for the most part, expecting that hard disk drives are being
       | used, and just behave like really fast HDDs.
       | 
       | So, there's almost certainly some non-zero amount of code out
       | there in the wild that is or was doing some very specific write
       | optimized routine that one day was just performing 10 to 100
       | times faster, and maybe just because of the nature of software is
       | still out there today doing that same routine.
       | 
       | I don't know what that would look like, but my guess would be
       | that it would have something to do with average sized write
       | caches, and those caches look entirely different today or
       | something.
       | 
       | And today, there's probably some SSD specific code doing
       | something out there now, too.
        
         | rzzzt wrote:
          | You can optimize for fewer/shorter drive seeks on rotational
         | media by reordering requests:
         | https://en.wikipedia.org/wiki/Elevator_algorithm
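          | 
          | A toy version of the idea (one sweep up from the head position,
          | then back down; real I/O schedulers add fairness and deadlines):
          | 
          |     #include <stdio.h>
          |     #include <stdlib.h>
          | 
          |     static int cmp(const void *a, const void *b) {
          |         return *(const int *)a - *(const int *)b;
          |     }
          | 
          |     int main(void) {
          |         int head  = 50;                         /* head position */
          |         int req[] = { 95, 12, 57, 82, 3, 64 };  /* pending LBAs */
          |         int n     = sizeof req / sizeof *req;
          | 
          |         qsort(req, n, sizeof *req, cmp);  /* order by position */
          | 
          |         /* sweep up, servicing requests ahead of the head... */
          |         for (int i = 0; i < n; i++)
          |             if (req[i] >= head) printf("service %d\n", req[i]);
          |         /* ...then reverse and take the rest on the way back */
          |         for (int i = n - 1; i >= 0; i--)
          |             if (req[i] < head) printf("service %d\n", req[i]);
          |         return 0;
          |     }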
        
         | forrestthewoods wrote:
          | Games used to spend a lot of time optimizing CD/DVD layout.
         | Because reading from that is REALLY slow. Optimize mostly meant
         | keep data contiguous. But sometimes it meant duplicate data to
         | avoid seeks.
         | 
         | The canonical case is minimize time to load a level. Keep that
         | level's assets contiguous. And maybe duplicate data that is
         | shared across levels. It's a trade off between disc space and
         | load time.
         | 
          | I'm not familiar with major tricks for improving performance
          | after a disc is installed to the drive. (PS4 games always
          | streamed data from
         | HDD, not disc.)
         | 
         | Even consoles use different HDD manufacturers. So it'd be
         | pretty difficult to safely optimize for that. I'm sure a few
         | games do. But it's rare enough I've never heard of it.
        
           | fart32 wrote:
           | While reading through the Quake 3 source code, I noticed that
           | whenever the FS functions were reading from a CD, they were
           | doing so in a loop, because the fread/fopen functions instead
           | of hanging and waiting for the CD to spin up sometimes just
           | returned an error. It wasn't just slow, it was also hilarious
           | at times.
        
           | patmorgan23 wrote:
           | Stream loading is another technique that's used to reduce
           | load time. You start loading data for the next level as the
           | player approaches a boundry and you let them enter the next
           | level before all of the assets(ussally textures) have
           | finished loading.
        
           | andrewmcwatters wrote:
           | This reminds me of Valve's GCF (grid cache file, officially,
           | or game cache file, commonly). The benefits must have purely
           | occurred on consoles for the reasons you outlined, because
           | cracked Valve games that had GCF files extracted ran faster
           | than the official retail releases on PCs!
        
           | alpaca128 wrote:
           | Consoles also do this with HDDs. That's been one of the
           | talking points around the PS5 from the beginning, with Sony
           | saying that games would get more storage space efficient
           | because they don't need redundancy for faster loading
           | anymore.
        
             | maccard wrote:
             | This is very very true. The PS5 does hardware
             | decompression, so games by default are now going to be
             | compressed. For a real world reference of how big a
             | difference that makes, see fortnite turning on compression
             | [0] (disclaimer: I worked for epic on fortnite at the time)
             | 
             | [0] https://www.ign.com/articles/fortnites-latest-patch-
             | makes-it...
        
               | Mikealcl wrote:
               | I had not loaded ign without blockers in years. That was
               | painful.
        
               | maccard wrote:
                | Hah, sorry. I've always found their articles to have the
               | least fluff on the topic, despite the awful awful
               | website.
        
         | hugey010 wrote:
          | Right, the average programmer probably should be, or already
          | is, depending on some existing abstraction that optimizes
          | writes based on the storage medium.
        
       | klodolph wrote:
       | If you care about SSDs, one paper you _should_ read is "Don't
       | Stack Your Log on My Log" by Yang et al. 2014
       | 
       | https://www.usenix.org/system/files/conference/inflow14/infl...
       | 
       | > Log-structured applications and file systems have been used to
       | achieve high write throughput by sequentializing writes. Flash-
       | based storage systems, due to flash memory's out-of-place update
       | characteristic, have also relied on log-structured approaches.
       | Our work investigates the impacts to performance and endurance in
       | flash when multiple layers of log-structured applications and
       | file systems are layered on top of a log-structured flash device.
       | We show that multiple log layers affects sequentiality and
       | increases write pressure to flash devices through randomization
       | of workloads, unaligned segment sizes, and uncoordinated multi-
       | log garbage collection. All of these effects can combine to
       | negate the intended positive affects of using a log. In this
       | paper we characterize the interactions between multiple levels of
       | independent logs, identify issues that must be considered, and
       | describe design choices to mitigate negative behaviors in multi-
       | log configurations.
        
       | DrNuke wrote:
       | A number of high-level techniques help rationalize data
       | management and transfer, but the mileage of practical
       | implementations may vary a lot. Generally speaking, only a small
        | number of applications really need to take special care and add a
        | further layer of abstraction, because the best practices already
        | codified into any widespread language do an acceptable job.
        
       ___________________________________________________________________
       (page generated 2021-06-20 23:00 UTC)