[HN Gopher] What every programmer should know about SSDs
___________________________________________________________________
What every programmer should know about SSDs
Author : sprachspiel
Score : 191 points
Date : 2021-06-20 17:39 UTC (5 hours ago)
(HTM) web link (databasearchitects.blogspot.com)
(TXT) w3m dump (databasearchitects.blogspot.com)
| jedberg wrote:
| This page tells me a lot about SSDs, but it doesn't tell me why I
| need to know these things. It doesn't really give me any
| indication about how I should change my behavior if I know that
| I'll be running on SSD vs spinning disk.
|
| I've always been told, "just treat SSDs like slow, permanent
| memory".
| danbst wrote:
| yeah, the article should talk about periodic TRIMming, though
| this is more of an admin concern
| smt88 wrote:
| Don't modern OSes transparently TRIM periodically anyway?
| [deleted]
| CoolGuySteve wrote:
| The claim about parallelism isn't true. Most benchmarks and my
| own experience show that sequential reads are still significantly
| faster than random reads on most NVME drives.
|
| However, random read performance is only somewhere between a
| third and half as fast as sequential, compared to a magnetic disk
| where it's often a tenth as fast.
| pkaye wrote:
| At what queue depth do you test read performance? Sequential
| reads can be made fast at low queue depth by the SSD controller
| doing prefetch reads internally. I've worked on such algorithms
| myself.
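|
| For anyone who wants to see the effect themselves, here is a
| minimal sketch of that kind of measurement (assuming Linux, a
| file of at least 128 MiB to read from, and a 4 kB logical block
| size; it only tests queue depth 1, so it understates what the
| drive can do with more commands in flight):
|
|     /* seq_vs_rand.c -- sequential vs random 4 kB reads at QD=1 */
|     #define _GNU_SOURCE
|     #include <fcntl.h>
|     #include <stdio.h>
|     #include <stdlib.h>
|     #include <time.h>
|     #include <unistd.h>
|
|     #define BLOCK 4096     /* matches typical FTL granularity */
|     #define COUNT 32768    /* reads per pass: 128 MiB total   */
|
|     static double secs(struct timespec a, struct timespec b)
|     {
|         return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
|     }
|
|     int main(int argc, char **argv)
|     {
|         if (argc < 2) {
|             fprintf(stderr, "usage: %s <file>\n", argv[0]);
|             return 1;
|         }
|         /* O_DIRECT bypasses the page cache so the device is
|          * measured, not DRAM; it needs aligned buffers/offsets. */
|         int fd = open(argv[1], O_RDONLY | O_DIRECT);
|         if (fd < 0) { perror("open"); return 1; }
|
|         void *buf;
|         if (posix_memalign(&buf, BLOCK, BLOCK)) return 1;
|
|         struct timespec t0, t1;
|         double span = (double)BLOCK * COUNT;
|
|         /* Pass 1: sequential offsets. */
|         clock_gettime(CLOCK_MONOTONIC, &t0);
|         for (long i = 0; i < COUNT; i++)
|             if (pread(fd, buf, BLOCK, (off_t)i * BLOCK) != BLOCK)
|                 { perror("pread"); return 1; }
|         clock_gettime(CLOCK_MONOTONIC, &t1);
|         printf("sequential: %.0f MB/s\n", span / secs(t0, t1) / 1e6);
|
|         /* Pass 2: random offsets over the same span. */
|         srand(42);
|         clock_gettime(CLOCK_MONOTONIC, &t0);
|         for (long i = 0; i < COUNT; i++) {
|             off_t off = (off_t)(rand() % COUNT) * BLOCK;
|             if (pread(fd, buf, BLOCK, off) != BLOCK)
|                 { perror("pread"); return 1; }
|         }
|         clock_gettime(CLOCK_MONOTONIC, &t1);
|         printf("random:     %.0f MB/s\n", span / secs(t0, t1) / 1e6);
|
|         free(buf);
|         close(fd);
|         return 0;
|     }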
| kortilla wrote:
| The title should be "why SSDs mean programmers no longer have to
| think about hard drives".
|
| These are all reasons SSDs are much more pleasant to work with
| than old platter disks.
| cbsmith wrote:
| Well, they no longer need to think about hard _disks_, but
| there are a lot of assumptions from the world of hard disks that
| play out very differently in the SSD world.
| teddyh wrote:
| What _everyone_ should know is that flash drives can lose their
| data when left unpowered for as little as three months.
| mercora wrote:
| If that is true, disks should come with a very visible note
| stating this... seriously, 3 months would be nothing. I doubt it
| is true, because a 3-month gap should be surpassed quite often,
| which would make this more widely known.
| anticensor wrote:
| Yep, they are _semivolatile limited write memory modules_,
| not disks. Everyone should use that SV-LWMM acronym.
| AtlasBarfed wrote:
| Is this an actually useful application of Optane: replacing the
| memory with near-RAM nonvolatile storage?
| teddyh wrote:
| Depending on manufacturer, and storage conditions, it can be
| up to about ten years. But the "three months" number is real:
| https://web.archive.org/web/20210502042514/http://www.dell.c.
| ..
| crazygringo wrote:
| That's a document from _nine and a half years ago_, and it
| states:
|
| > _It depends on how much the flash has been used (P/E cycles
| used), type of flash, and storage temperature. In MLC and SLC,
| this can be as low as 3 months and best case can be more than 10
| years. The retention is highly dependent on temperature and
| workload._
|
| Are there any _modern_ sources that provide _more accurate_
| stats? "3 months to 10 years" is so vague as to be useless.
| crazygringo wrote:
| Do you have a current source for that?
|
| I've turned on plenty of cell phones that hadn't been charged
| or powered on for a couple of years and everything worked
| normally. Same with thumb drives I've picked up after years.
|
| I mean, _anything_ can fail after three months. Your statement
| doesn't really add anything without stating the failure
| _rates_. For all I know the failure rate could be _less_ than
| that of physical hard drives.
| dan-robertson wrote:
| See this paper from 2017, _The unwritten contract of solid state
| drives_: https://dl.acm.org/doi/10.1145/3064176.3064187
| dang wrote:
| What someone else said about that in 2014:
|
| _What every programmer should know about solid-state drives_ -
| https://news.ycombinator.com/item?id=9049630 - Feb 2015 (31
| comments)
| FpUser wrote:
| It is really puzzling why "every programmer" should burden their
| already overloaded brains with this. If they're reading/writing
| some config/data files this knowledge would not help one bit. If
| they're using a database, then it falls to the database vendor to
| optimize for this scenario.
|
| So I think that unless this "every programmer" is a database
| storage engine developer (not too many of them, I guess), their
| only concern would mostly be: how close is my SSD to that magical
| point where it has to be cloned and replaced before shit hits the
| fan?
| rossdavidh wrote:
| Interesting, and fun to read and think about! And, as a
| professional programmer for 17 years now, not once have I done
| anything where this would have been important for me to know
| (even if I had been running my code on a system with SSDs). So,
| I'm not convinced the title is at all accurate.
|
| But, fun to read and think about.
| rabuse wrote:
| A little off topic, but I bought a new Macbook Pro with the M1
| chip with 8GB of RAM, and I'm worried about the swap usage of
| this machine wearing out the SSD too quickly. Is this an actual
| concern, as my swap has been in the multiple GB range with my
| use?
| cbsmith wrote:
| It's an actual concern for you. For Apple it's a variant on
| planned obsolescence. ;-)
|
| Note though that memory use metrics on macOS can be a bit
| misleading. Make sure that you're seeing what's actually there.
| Grazester wrote:
| Why did you get the 8 gig version? If you are using all this
| swap then you purchased the wrong MacBook.
| rabuse wrote:
| Honestly, I don't run much, so I didn't think it would be that
| bad stepping down from my 16GB machine.
| 1-6 wrote:
| From what I've been able to gather, the excessive paging may
| actually have to do with non-native apps running on the M1.
| Avoid those.
| rabuse wrote:
| Most of my programs are JetBrains IDEs and browsers. Don't
| know if they're optimized for M1.
| [deleted]
| personjerry wrote:
| How big is the write cache usually and how does it work?
| Typically I've seen the write caches be something like 32MB in
| size, but the "top speed" seems to be sustained for files much
| bigger than 32MB, which doesn't make sense to me if that top
| speed is supposedly from writing to the cache. How does that
| work?
| bserge wrote:
| On SSDs? 32 is way off, the Samsung 470 had 256MB RAM cache and
| the 860 Pro a whopping 4GB for the top model.
|
| Although they started removing it entirely for NVMe SSDs, I
| guess the direct transfer speed is enough to not need a cache
| at all.
| mastax wrote:
| NVMe drives can access system memory over the PCIe bus.
| wtallis wrote:
| The DRAM you're referring to is for the most part not a write
| cache for user data. Most of that DRAM is a read cache for
| the FTL's logical to physical address mapping table. When the
| FTL is working with the typical granularity of 4kB, you get a
| requirement of approximately 1GB of DRAM per 1TB of NAND.
|
| Drives that include less than this amount of DRAM show
| reduced performance, usually in the form of lower random read
| performance because the physical address of the requested
| data cannot be quickly found by consulting a table in DRAM
| and must be located by first performing at least one slow
| NAND read.
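|
| A rough back-of-the-envelope check of that ratio (assuming 4kB
| mapping granularity and 4-byte physical addresses, which is the
| common arrangement):
|
|     1 TB / 4 kB per entry  = ~268 million mapping entries
|     268 million x 4 bytes  = ~1 GiB of mapping table
|
| which is where the "1GB of DRAM per 1TB of NAND" rule of thumb
| comes from.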
| opencl wrote:
| It varies quite a bit. There are two different types of caches:
| SLC and DRAM. Most drives use SLC caching, higher end drives
| often use both.
|
| Typically the SSDs with DRAM have a ratio of 1GB DRAM per TB of
| flash.
|
| SLC caching is using a portion of the flash in SLC mode, where
| it stores 1 bit per cell rather than the typical 2-4 (2 for
| MLC, 3 for TLC, 4 for QLC) in exchange for higher performance.
| SLC cache size varies wildly. Some SSDs allocate a fixed size
| cache, some allocate it dynamically based on how much free
| space is available. It can potentially be 10s of GBs on larger
| SSDs.
| wtallis wrote:
| Getting full throughput from the SSD is less about file size
| and more about how much work is in the SSD's queue at any given
| moment. If the host system only issues commands one at a time
| (as would often result from using synchronous IO APIs), then
| the SSD will experience some idle time between finishing one
| command and receiving the next from the host system. If the
| host ensures there are 2+ commands in the SSD's queue, it won't
| have that idle time.
|
| Then there's the matter of how much data is in the queue,
| rather than how many commands are queued. Imagine a 4 TB SSD
| using 512Gbit TLC dies, and an 8-channel controller. That's 64
| dies with 2 or 4 planes per die. A single page is 16kB for
| current NAND, so we need 2 or 4 MB of data to write if we want
| to light up the whole drive at once, and that much again
| waiting in the queue to ensure the drive can begin the next
| write as soon as the first batch completes. But you can often
| hit a bottleneck elsewhere (either the PCIe link, or the
| channels between the controller and NAND) before you have every
| plane of every die 100% busy.
|
| If you're working with small files, then your filesystem will
| be producing several small IOs for each chunk of file contents
| you read or write from the application layer, and many of those
| small metadata/fs IOs will be in the critical path, blocking
| your data IOs. So even though you can absolutely hit speeds in
| excess of 3 GB/s by issuing 2MB write commands one at a time to
| a suitably high-end SSD, you may have more difficulty hitting 3
| GB/s by writing 2MB _files_ one at a time.
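|
| Working through the numbers in that example:
|
|     64 dies x 2 planes x 16 kB/page = 2 MB per full-width write
|     64 dies x 4 planes x 16 kB/page = 4 MB per full-width write
|
| so roughly 2-4 MB of writes must be outstanding, plus about the
| same amount queued behind them, before every plane can be busy at
| once.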
| rectang wrote:
| I wince at the amount of wear the `git clean -dxf; npm ci` cycle
| must be putting on my SSD.
| 1_player wrote:
| A lot of talk about pages, but no mention of how big these
| pages are. From a quick look on Google, most SSDs have 4kB pages,
| with some reaching 8kB or even 16kB.
| wtallis wrote:
| SSDs mostly tell the host system that they have 512-byte
| sectors or sometimes 4kB sectors, and the typical flash
| translation layer works in 4kB sectors because that's a good
| fit for the kind of workloads coming from a host system that
| usually prefers to do things (eg. virtual memory) in 4kB
| chunks. But the underlying NAND flash page size has been 16kB
| for years.
| cbsmith wrote:
| ...and all that cruft, plus the logic that tries to make
| handling it not so bad, makes for a lot of complexity and
| unintended consequences.
| wly_cdgr wrote:
| There's nothing whatsoever I should need to know about SSDs as a
| JavaScript programmer, and if there is, then the programmers on
| the lower levels haven't done their jobs right and are wasting my
| time.
| dataflow wrote:
| What's the flash translation layer made of? Is the flash
| technology used for that more durable than the rest of the SSD
| itself? (like say MLC vs. QLC?)
| pkaye wrote:
| The FTL is like a virtual memory manager. It is
| firmware/hardware to manage things like the logical to physical
| mapping table, garbage collection, error correction, bad block
| management. Yes there will be a lot of FTL data structures
| stored on the flash. It can be made durable by redundant
| copies, writing in SLC mode or having recovery algorithms. I
| used to develop SSD firmware in the past, so ask if you have
| further questions.
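|
| As a rough mental model of the mapping-table part (a toy sketch,
| not how any real firmware is written; real FTLs allocate per
| erase block, batch table updates, and run garbage collection):
|
|     /* ftl_sketch.c -- toy logical-to-physical mapping: every
|      * overwrite goes to a fresh physical page and the old copy
|      * is just marked invalid, to be reclaimed later by GC. */
|     #include <stdio.h>
|
|     #define PAGES 16            /* tiny "drive" for illustration */
|
|     static int l2p[PAGES];      /* logical page -> physical page */
|     static int valid[PAGES];    /* physical page holds live data */
|     static int next_free;       /* naive append-only allocator   */
|
|     static void ftl_write(int lpage)
|     {
|         if (l2p[lpage] >= 0)
|             valid[l2p[lpage]] = 0;   /* old copy becomes garbage */
|         l2p[lpage] = next_free;      /* remap to a fresh page    */
|         valid[next_free] = 1;
|         next_free++;
|     }
|
|     int main(void)
|     {
|         for (int i = 0; i < PAGES; i++)
|             l2p[i] = -1;
|
|         ftl_write(3);    /* first write of logical page 3   */
|         ftl_write(3);    /* overwrite: lands on a new page  */
|         ftl_write(7);
|
|         printf("logical 3 -> physical %d\n", l2p[3]);  /* 1 */
|         printf("logical 7 -> physical %d\n", l2p[7]);  /* 2 */
|         return 0;
|     }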
| jng wrote:
| Hey that's very interesting! How much of the FTL logic is
| done with regular MCU code vs custom hardware? Is there any
| open source SSD firmware out there that one could look at to
| start experimenting in this field, or at least something
| pointing in that direction, be it open or affordable
| software, firmware, FPGA gateware or even IC IP? I believe
| there is value in integrating that part of the stack with the
| higher level software, but it seems quite difficult to
| experiment unless one is in the right circles / close to the
| right companies. Thanks!
| SeanCline wrote:
| You're right that the FTL has some durability concerns, which,
| in addition to performance, are why it's typically cached in
| DRAM. Older DRAM-less SSDs were unreliable in the long-term but
| that's been improving with the adoption of HMB, which lets the
| SSD controller carve out some system RAM to store FTL data.
| effnorwood wrote:
| they do not spin
| bob1029 wrote:
| Things I have learned about SSDs:
|
| If you want to go fast & save NAND lifetime, use append-only log
| structures.
|
| If you want to go even faster & save even more NAND lifetime,
| batch your writes in software (i.e. some ring buffer with a
| natural back-pressure mechanism) and then serialize them with a
| single writer into an append-only log structure. Many newer
| devices have something like this at the hardware level, but your
| block size is still a constraint when working in hardware. If you
| batch in software, you can hypothetically write multiple logical
| business transactions _per_ block I/O. When your physical block
| size is 4k and your logical transactions are averaging 512 bytes
| of data, you would otherwise be leaving a lot of throughput on
| the table.
|
| Going down 1 level of abstraction seems important if you want to
| extract the most performance from an SSD. Unsurprisingly, the
| above ideas also make ordinary magnetic disk drives more
| performant & potentially last longer.
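|
| A minimal sketch of the software batching idea (assumptions: a
| hypothetical "journal.log" file, a 4k block size, no record may
| span blocks, and a single writer thread owns these buffers; real
| code would also fsync for durability and add the ring-buffer
| hand-off, which is omitted here):
|
|     /* logbatch.c -- pack small records into block-sized appends */
|     #include <fcntl.h>
|     #include <stdio.h>
|     #include <string.h>
|     #include <unistd.h>
|
|     #define BLOCK 4096
|
|     static char   block_buf[BLOCK];  /* batch being filled     */
|     static size_t block_used;        /* bytes currently in use */
|
|     /* Flush the current block, padding the tail so every write
|      * stays block-aligned. */
|     static int log_flush(int fd)
|     {
|         if (block_used == 0)
|             return 0;
|         memset(block_buf + block_used, 0, BLOCK - block_used);
|         if (write(fd, block_buf, BLOCK) != BLOCK)
|             return -1;
|         block_used = 0;
|         return 0;
|     }
|
|     /* Append one record; several records share one block I/O. */
|     static int log_append(int fd, const void *rec, size_t len)
|     {
|         if (len > BLOCK)
|             return -1;
|         if (block_used + len > BLOCK && log_flush(fd) < 0)
|             return -1;
|         memcpy(block_buf + block_used, rec, len);
|         block_used += len;
|         return 0;
|     }
|
|     int main(void)
|     {
|         int fd = open("journal.log",
|                       O_WRONLY | O_CREAT | O_APPEND, 0644);
|         if (fd < 0) { perror("open"); return 1; }
|
|         /* ~512-byte "transactions": eight fit per 4k write. */
|         char rec[512] = "example transaction payload";
|         for (int i = 0; i < 32; i++)
|             log_append(fd, rec, sizeof rec);
|
|         log_flush(fd);
|         close(fd);
|         return 0;
|     }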
| remram wrote:
| Shouldn't the OS or libc take care of that? If I write and
| don't immediately flush()?
| pclmulqdq wrote:
| I used to think the same thing, but now that I work on SSD-
| based storage systems, I'm not sure this holds up in today's
| storage stacks. Log structuring really helped with HDDs since
| it meant fewer seeks.
|
| In particular, the filesystem tends to undo a lot of the
| benefits you get from log-structuring unless you are using a
| filesystem designed to keep your files log-structured. Using
| huge writes definitely still helps, though.
|
| A paper that I really like goes deeper into this:
| http://pages.cs.wisc.edu/~jhe/eurosys17-he.pdf
|
| Edit: I had originally said "designed for flash" instead of
| "designed to keep your files log-structured." F2FS is designed
| for flash, but in my testing does relatively poorly with log-
| structured files because of how it works internally.
|
| Edit 2: de-googled the link. Thank you for pointing that out.
| 10000truths wrote:
| Achieving cutting-edge storage performance tends to require
| bypassing the filesystem anyways. Traditionally, that meant
| using SPDK. Nowadays, opening /dev/nvme* with O_DIRECT and
| operating on it with io_uring will get you most of the way
| there.
|
| In either case, the advice given in the article and by the OP
| is filesystem agnostic.
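|
| A minimal sketch of that approach (assuming Linux with liburing
| installed, building with -luring, and writing to an ordinary
| test file rather than /dev/nvme* so it can run without root;
| the path and sizes are placeholders):
|
|     /* uring_write.c -- keep several O_DIRECT writes in flight */
|     #define _GNU_SOURCE
|     #include <fcntl.h>
|     #include <liburing.h>
|     #include <stdio.h>
|     #include <stdlib.h>
|     #include <string.h>
|     #include <unistd.h>
|
|     #define QD    8              /* commands kept in flight */
|     #define BLOCK (128 * 1024)   /* large, aligned writes   */
|
|     int main(int argc, char **argv)
|     {
|         const char *path = argc > 1 ? argv[1] : "testfile";
|         int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
|         if (fd < 0) { perror("open"); return 1; }
|
|         struct io_uring ring;
|         int ret = io_uring_queue_init(QD, &ring, 0);
|         if (ret < 0) {
|             fprintf(stderr, "queue_init: %s\n", strerror(-ret));
|             return 1;
|         }
|
|         void *buf;
|         if (posix_memalign(&buf, 4096, BLOCK)) return 1;
|         memset(buf, 'x', BLOCK);
|
|         /* Queue QD writes before submitting, so the device never
|          * sits idle waiting for the host's next command. */
|         for (int i = 0; i < QD; i++) {
|             struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
|             io_uring_prep_write(sqe, fd, buf, BLOCK,
|                                 (off_t)i * BLOCK);
|         }
|         io_uring_submit(&ring);
|
|         /* Reap the completions. */
|         for (int i = 0; i < QD; i++) {
|             struct io_uring_cqe *cqe;
|             ret = io_uring_wait_cqe(&ring, &cqe);
|             if (ret < 0) {
|                 fprintf(stderr, "wait_cqe: %s\n", strerror(-ret));
|                 return 1;
|             }
|             if (cqe->res < 0)
|                 fprintf(stderr, "write: %s\n", strerror(-cqe->res));
|             io_uring_cqe_seen(&ring, cqe);
|         }
|
|         io_uring_queue_exit(&ring);
|         free(buf);
|         close(fd);
|         return 0;
|     }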
| nyanpasu64 wrote:
| Will an end user downloading a video editing app (or
| similar) have an NVMe drive, know how to give your app
| direct access to an NVMe drive, and will your app not
| corrupt the rest of the files on the drive?
| 10000truths wrote:
| Extreme performance requires extreme tradeoffs. As with
| anything else, you have to evaluate your use cases and
| determine for yourself whether the tradeoffs are worth
| it. For a mass-market application that has to play nice
| with other applications and work with a wide variety of
| commodity hardware, it's probably not worthwhile. For a
| state-of-the-art high performance data store that expects
| low latencies and high throughput (a la ScyllaDB), it may
| very well be.
| quotemstr wrote:
| Why would you want to bypass the filesystem by talking to
| the block device directly? Doesn't O_DIRECT on a
| preallocated regular file accomplish the same thing with
| less management complexity and special OS permissions?
| Granted, the file extents might be fragmented a bit, but
| that can be fixed.
| 10000truths wrote:
| A "regular file" might reside in multiple locations on
| disk for redundancy, or might have a checksum that needs
| to be maintained alongside it for integrity. Or, as you
| say, its contents might not reside in contiguous sectors
| - or you might be writing to a hole in a sparse file.
| There's a lot of "magic" that could go on behind the
| scenes when operating on "regular files", depending on
| what filesystem you're using with what options. Directly
| operating on the block device makes it easier to reason
| about the performance guarantees, since your reads and
| writes map more cleanly to the underlying SCSI/ATA/NVME
| commands issued.
| lazide wrote:
| If you understand your workload and the hardware well
| enough to understand how doing direct I/O on a file will
| help - then you're going to generally do better against a
| direct block device because there are fewer intermediate
| layers doing the wrong optimizations or otherwise messing
| you up. From a pure performance perspective anyway.
| Extents are one part of the issue, flushes to disk (and
| how/when they happen), caching, etc.
|
| Doesn't mean it isn't easier to deal with as a file from
| an administration perspective (and you can do snapshots,
| or whatever!), but LVM can do that too for a block
| device, and many other things.
| quotemstr wrote:
| With O_DIRECT though you're opting out of the
| filesystem's caching (well, VFS's), forced flushes, and
| most FS level optimizations, so I'd expect it to perform
| on par with direct partition access.
|
| Do you have numbers showing an advantage of going
| directly to the block device? Personally, I'd consider
| the management advantages of a filesystem compelling
| absent specific performance numbers showing the benefit
| of direct partition access.
| trulyme wrote:
| Degoogled link:
| http://pages.cs.wisc.edu/~jhe/eurosys17-he.pdf
| scns wrote:
| Like this?
|
| https://en.wikipedia.org/wiki/NILFS?wprov=sfla1
| AtlasBarfed wrote:
| This is basically the purpose of rocksdb, and to a lesser
| extent Cassandra
| andrewmcwatters wrote:
| My opinion is probably... not technically correct... until you
| have to deal with drive reliability and write guarantees, but I
| don't think programmers actually have to know anything about SSDs
| in the same way that developers had to know particular things
| about HDDs.
|
| This is pure speculation, but there had to be a period of
| time during the mass transition to SSDs that engineers said, OK,
| how do we get the hardware to be compatible with software that
| is, for the most part, expecting that hard disk drives are being
| used, and just behave like really fast HDDs.
|
| So, there's almost certainly some non-zero amount of code out
| there in the wild that is or was doing some very specific write
| optimized routine that one day was just performing 10 to 100
| times faster, and maybe just because of the nature of software is
| still out there today doing that same routine.
|
| I don't know what that would look like, but my guess would be
| that it would have something to do with average sized write
| caches, and those caches look entirely different today or
| something.
|
| And today, there's probably some SSD specific code doing
| something out there now, too.
| rzzzt wrote:
| You can optimize for fewer/shorter drive seeks on rotational
| media by reordering requests:
| https://en.wikipedia.org/wiki/Elevator_algorithm
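|
| A tiny sketch of the idea (the SCAN variant; hypothetical track
| numbers, and a real I/O scheduler of course works on an
| in-kernel queue rather than a static array):
|
|     /* elevator.c -- service requests in one sweep direction,
|      * then reverse, instead of in arrival order. */
|     #include <stdio.h>
|     #include <stdlib.h>
|
|     static int cmp(const void *a, const void *b)
|     {
|         int x = *(const int *)a, y = *(const int *)b;
|         return (x > y) - (x < y);
|     }
|
|     /* Print the service order for n pending track numbers,
|      * starting at `head` and initially sweeping upward. */
|     static void scan(int *req, int n, int head)
|     {
|         qsort(req, n, sizeof *req, cmp);
|
|         int i = 0;
|         while (i < n && req[i] < head)   /* first track >= head */
|             i++;
|
|         for (int j = i; j < n; j++)      /* sweep up            */
|             printf("%d ", req[j]);
|         for (int j = i - 1; j >= 0; j--) /* reverse, sweep down */
|             printf("%d ", req[j]);
|         printf("\n");
|     }
|
|     int main(void)
|     {
|         int pending[] = { 98, 183, 37, 122, 14, 124, 65, 67 };
|         scan(pending, sizeof pending / sizeof *pending, 53);
|         /* prints: 65 67 98 122 124 183 37 14 */
|         return 0;
|     }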
| forrestthewoods wrote:
| Games used to spend a lot of time optimizing CD/DVD layout.
| Because reading from that is REALLY slow. Optimize mostly meant
| keep data contiguous. But sometimes it meant duplicate data to
| avoid seeks.
|
| The canonical case is minimize time to load a level. Keep that
| level's assets contiguous. And maybe duplicate data that is
| shared across levels. It's a trade off between disc space and
| load time.
|
| I'm not familiar with major tricks for improving things after a
| disc is installed to the drive. (PS4 games always streamed data
| from HDD, not disc.)
|
| Even consoles use different HDD manufacturers. So it'd be
| pretty difficult to safely optimize for that. I'm sure a few
| games do. But it's rare enough I've never heard of it.
| fart32 wrote:
| While reading through the Quake 3 source code, I noticed that
| whenever the FS functions were reading from a CD, they were
| doing so in a loop, because the fread/fopen functions, instead
| of hanging and waiting for the CD to spin up, sometimes just
| returned an error. It wasn't just slow, it was also hilarious
| at times.
| patmorgan23 wrote:
| Stream loading is another technique that's used to reduce
| load time. You start loading data for the next level as the
| player approaches a boundary and you let them enter the next
| level before all of the assets (usually textures) have
| finished loading.
| andrewmcwatters wrote:
| This reminds me of Valve's GCF (grid cache file, officially,
| or game cache file, commonly). The benefits must have purely
| occurred on consoles for the reasons you outlined, because
| cracked Valve games that had GCF files extracted ran faster
| than the official retail releases on PCs!
| alpaca128 wrote:
| Consoles also do this with HDDs. That's been one of the
| talking points around the PS5 from the beginning, with Sony
| saying that games would become more storage-space efficient
| because they don't need redundancy for faster loading
| anymore.
| maccard wrote:
| This is very very true. The PS5 does hardware
| decompression, so games by default are now going to be
| compressed. For a real world reference of how big a
| difference that makes, see fortnite turning on compression
| [0] (disclaimer: I worked for epic on fortnite at the time)
|
| [0] https://www.ign.com/articles/fortnites-latest-patch-
| makes-it...
| Mikealcl wrote:
| I had not loaded ign without blockers in years. That was
| painful.
| maccard wrote:
| Hah, sorry. I've always found their articles to have the
| least fluff on the topic, despite the awful awful
| website.
| hugey010 wrote:
| Right, the average programmer probably should be, or already is,
| depending on some existing abstraction that optimizes writes
| based on the storage medium.
| klodolph wrote:
| If you care about SSDs, one paper you _should_ read is "Don't
| Stack Your Log on My Log" by Yang et al. 2014
|
| https://www.usenix.org/system/files/conference/inflow14/infl...
|
| > Log-structured applications and file systems have been used to
| achieve high write throughput by sequentializing writes. Flash-
| based storage systems, due to flash memory's out-of-place update
| characteristic, have also relied on log-structured approaches.
| Our work investigates the impacts to performance and endurance in
| flash when multiple layers of log-structured applications and
| file systems are layered on top of a log-structured flash device.
| We show that multiple log layers affects sequentiality and
| increases write pressure to flash devices through randomization
| of workloads, unaligned segment sizes, and uncoordinated multi-
| log garbage collection. All of these effects can combine to
| negate the intended positive effects of using a log. In this
| paper we characterize the interactions between multiple levels of
| independent logs, identify issues that must be considered, and
| describe design choices to mitigate negative behaviors in multi-
| log configurations.
| DrNuke wrote:
| A number of high-level techniques help rationalize data
| management and transfer, but the mileage of practical
| implementations may vary a lot. Generally speaking, only a small
| number of applications really need to take special care and add
| a further layer of abstraction, because the best practices
| already codified into any widespread language do an acceptable
| job.
___________________________________________________________________
(page generated 2021-06-20 23:00 UTC)