[HN Gopher] When are SSDs slow?
       ___________________________________________________________________
        
       When are SSDs slow?
        
       Author : eatonphil
       Score  : 77 points
       Date   : 2024-06-18 15:26 UTC (1 days ago)
        
 (HTM) web link (cedardb.com)
 (TXT) w3m dump (cedardb.com)
        
       | wmf wrote:
       | This "non-production mode" has a bad side effect where benchmarks
       | on laptops will be artificially fast and people will "forget"
       | about the data loss disclaimer, just like the bad old days of
       | MongoDB.
        
         | jauntywundrkind wrote:
          | The article also mentions that enterprise SSDs often have
          | supercaps, so syncs can be basically free: there's enough
          | runtime (even if power goes out) to persist any in-flight
          | data.
          | 
          | So it seems like it wouldn't be an issue. But it would mean
          | running your dev & prod in two different modes.
        
       | jakedata wrote:
        | In Postgres, synchronous_commit = off can seem like magic
        | during benchmarks, but in real-life applications the
        | anticipated benefits can be underwhelming. Ultimately, a
        | reliable non-volatile write cache, sized for your workload, is
        | the answer. It evens out all the peaks and valleys of flash
        | writes and hands control back to the database engine so it can
        | get on with its work.
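        | 
        | For anyone who wants to try that trade-off themselves, here's
        | a minimal libpq sketch (the connection string and the `events`
        | table are placeholders, not from the article): setting
        | synchronous_commit = off for a single transaction makes its
        | COMMIT return before the WAL record reaches stable storage, so
        | a crash in that window can lose the row.
        | 
        |   #include <stdio.h>
        |   #include <libpq-fe.h>
        |   
        |   int main(void) {
        |       /* Placeholder connection string. */
        |       PGconn *conn = PQconnectdb("dbname=test");
        |       if (PQstatus(conn) != CONNECTION_OK) {
        |           fprintf(stderr, "%s", PQerrorMessage(conn));
        |           return 1;
        |       }
        |   
        |       /* Async commit for this transaction only: COMMIT
        |        * returns before the WAL is flushed to disk.
        |        * (Error checks and PQclear omitted for brevity.) */
        |       PQexec(conn, "BEGIN");
        |       PQexec(conn, "SET LOCAL synchronous_commit = off");
        |       PQexec(conn, "INSERT INTO events(msg) VALUES ('hi')");
        |       PQexec(conn, "COMMIT");
        |   
        |       PQfinish(conn);
        |       return 0;
        |   }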
        
         | lvogel wrote:
          | > Ultimately, a reliable non-volatile write cache, sized for
          | > your workload, is the answer.
         | 
         | Author here, I agree! It's quite sad that we need such an
         | involved solution to offset the inherent complexity of the
         | flash medium (latency spikes, erase blocks, ...). We nearly had
         | the perfect solution with Optane[1]: 100ns latency, instantly
         | persisted writes and all that good stuff.
         | 
         | I'm still not over Intel killing it while I did my PhD on it.
         | 
         | [1] https://en.m.wikipedia.org/wiki/3D_XPoint
        
           | throwaway81523 wrote:
           | There used to be battery backed ramdisks with SATA
           | interfaces. Are they history now? These days it would be NVMe
           | or whatever, of course.
        
             | jakedata wrote:
             | Radian makes ultracapacitor backed RAM cards with flash
             | backup that appear to be NVMe devices. They do a nice job
             | for things like Write Ahead Logs and ZFS ARC/ZIL. They
             | offer effectively unlimited write endurance with a tradeoff
             | of cost and capacity.
        
           | hi-v-rocknroll wrote:
            | More industry experience in the real world would've made
            | it clear that they're totally useless for datacenters,
            | where power outages are rare and electrical uptime has
            | many 9's, measured in months or years.
           | 
           | UPSes and BBWC evolved to bring reliability to production
           | gear running in non-DC environments when mainline servers
           | used spinning rust without backup power. Today, it's largely
           | a vendor up-charge.
           | 
           | Write barriers cause far too much latency in practice on
           | servers in tier IV datacenters, so they're almost always
           | turned off except for a tiny fraction of systems.
           | 
           | There has never been a "perfect" or a universal solution,
           | only a risk budget and suitability for a specific use-case.
        
             | jakedata wrote:
              | Write barriers aren't just for durability - they can
              | also even out major latency spikes when bloated buffers
             | ultimately must be flushed. Database, filesystem, RAID,
             | device, flash management and individual chips all become
             | saturated at some point. Managing performance at the
             | database engine layer gives you visibility into the issue
             | at a high level as opposed to rolling merrily along until
             | commits start taking 2000 ms. As an example, ZFS is
             | terrifyingly unpredictable at high loads.
        
             | otterley wrote:
             | Power availability isn't the only concern when it comes to
             | taking the risk of async writes. Kernel panics will cause
             | unflushed writes to be lost, and storage hardware can
              | experience gray or total failures, sometimes quite
              | suddenly.
        
       | sam_bristow wrote:
       | Related:
       | 
       | "What Modern NVMe Storage Can Do, and How to Exploit it"[1] by
       | Gabriel Haas and Viktor Leis.
       | 
        | [1] https://dl.acm.org/doi/10.14778/3598581.3598584
        
       | soonnow wrote:
        | We used a small embedded DB in our desktop product only for
        | durability and SQL semantics. After we got multiple database
        | corruptions, we switched to a small file-based store written
        | in Java with simple DB semantics.
        | 
        | Looking carefully at what happens when data is lost, we found
        | that it's basically all recoverable.
        | 
        | This led to amazing performance, as data is flushed only when
        | necessary. We now flush only every 10 seconds, or when the
        | write-through cache is full.
        | 
        | My lesson learned from that is to think carefully about the
        | actual requirements when it comes to transactions; loosening
        | constraints may be the quickest way to speed things up.
        | 
        | For many applications, even a second or two of lost data may
        | not be a big issue.
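        | 
        | For illustration, a minimal C sketch of that flush policy (our
        | store is in Java; the names, buffer size and interval here are
        | made up): buffer records in memory and only write() and
        | fdatasync() when the buffer fills or the deadline passes.
        | 
        |   #include <fcntl.h>
        |   #include <string.h>
        |   #include <time.h>
        |   #include <unistd.h>
        |   
        |   #define BUF_CAP (1 << 20)  /* 1 MiB write-through buffer */
        |   #define FLUSH_INTERVAL 10  /* seconds between forced syncs */
        |   
        |   static char   buf[BUF_CAP];
        |   static size_t buf_len;
        |   static time_t last_flush;
        |   
        |   /* Write out buffered data and make it durable. */
        |   static void flush(int fd) {
        |       if (buf_len > 0) {
        |           write(fd, buf, buf_len);  /* error checks omitted */
        |           fdatasync(fd);
        |           buf_len = 0;
        |       }
        |       last_flush = time(NULL);
        |   }
        |   
        |   /* Append a record; flush only when the buffer is full or
        |    * FLUSH_INTERVAL seconds have passed. */
        |   static void append(int fd, const void *rec, size_t len) {
        |       if (buf_len + len > BUF_CAP)
        |           flush(fd);
        |       memcpy(buf + buf_len, rec, len);
        |       buf_len += len;
        |       if (time(NULL) - last_flush >= FLUSH_INTERVAL)
        |           flush(fd);
        |   }
        |   
        |   int main(void) {
        |       int fd = open("store.dat",
        |                     O_WRONLY | O_CREAT | O_APPEND, 0644);
        |       if (fd < 0) return 1;
        |       last_flush = time(NULL);
        |       for (int i = 0; i < 100000; i++)
        |           append(fd, "record\n", 7);
        |       flush(fd);
        |       close(fd);
        |       return 0;
        |   }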
        
       | hi-v-rocknroll wrote:
       | How far we've come from spinning rust on dev boxes and mainline
       | warm servers back in 2011.[0]
       | 
       | 0.
       | https://web.archive.org/web/20111112045055/http://buyafuckin...
        
         | sroussey wrote:
         | People don't warm servers before entering production anymore?
        
       | Havoc wrote:
        | Small Optanes (50-100 gigs) are quite accessible cost-wise and
        | solve most of the queue depth problem.
        
         | ssl-3 wrote:
         | Optane is neat tech, but it's also dead tech.
        
           | lvogel wrote:
            | Tell me about it... It was just too expensive compared to
            | flash. But the tech was definitely awesome and I hope
            | it'll come back in one way or another.
           | 
           | When designing CedarDB, we recently had multiple instances
           | where we thought: "If we had just a few KiB of Optane here
           | ... "
        
             | ssl-3 wrote:
              | Oh, it's amazing. I use a small chunk of it at home for
              | ZIL, even though my sync writes are few and far between.
             | 
             | It's just approximately the Right Thing to use for some
             | things, and I had room for it, and it was rather
             | inexpensive when I bought it. :)
             | 
             | 10/10, would recommend and buy again. (Whoops!)
        
             | the8472 wrote:
             | If it's just a few KiB then various NVRAM techs have been
             | slowly creeping forward for decades. But they've never made
             | it to datacenter storage.
             | 
              | https://www.avalanche-technology.com/wp-content/uploads/1Gb-...
              | https://www.everspin.com/spin-transfer-torque-ddr-products
        
           | Havoc wrote:
           | >dead tech.
           | 
            | Agreed in general, but surely for this use case (i.e. a
            | DB) it would still solve the problem, no?
            | 
            | Or am I missing something here?
        
             | ssl-3 wrote:
             | It does solve the problem for now, but the parts aren't
             | being made anymore.
             | 
             | Maybe that's good enough for some people in some
             | applications, but a lot of folks have very valid
             | reservations about using EOL'd products regardless of how
             | well-suited they may be at solving a problem.
             | 
              | I mean: It's like work trucks. I'm a huge fan of GM's
              | Astro/Safari line of RWD cargo vans. Their 4.3l V6 is
              | solid and adequately powerful, the 4L60E transmission is
              | reliable, and there's a ton of space in the back, with
              | enough height to load 4' tall sheet materials standing
              | on edge and to sit down in front of a bolted-down
              | workbench. The longitudinal RWD layout has advantages,
              | they're even fairly easy to work on when something
              | inevitably wears out, the deck height is low, and
              | they're small enough to drive through a fairly normal
              | parking garage without worry and to fit into a tight
              | space.
             | 
             | They're pretty awesome (except maybe for safety and fuel
             | economy, but perhaps those aren't my primary concerns), and
             | they solve a particular set of problems very well.
             | 
             | But they aren't being made anymore, and haven't been made
              | for over 20 years. I own _one_ example of a GMC Safari
              | that I'd like to get working well again some day and
              | maybe kit
             | out in some interesting way, but I'm absolutely not going
             | to suggest that anyone buy a fleet of them to use in
             | business because:
             | 
             | They, like Optane, are EOL'd. All of them that will ever be
             | produced have already been made.
             | 
             | ---
             | 
             | "So can you explain to me again why you think we should be
             | buying _used, old parts_ for our new datacenter? I know you
             | 've told me before, but still not sure that I have the same
             | perspective as you do."
        
       | justinclift wrote:
        | This doesn't seem right:
        | 
        |   Unfortunately, my laptop, and probably many other
        |   developers' laptops, do not have an enterprise-grade SSD,
        |   or even the option to install one.
       | 
       | Enterprise grade SATA SSDs are fairly easy to obtain (Samsung
       | PM893, Kingston DC600M).
       | 
       | Enterprise M.2 SSDs are also a thing (Kingston DC1000B).
       | 
       | If they're using a mac and are limited to external connectivity,
       | then they could throw the above into a thunderbolt enclosure.
        
         | wtallis wrote:
         | Even when there are enterprise SSDs that are mechanically and
         | electrically compatible with the bays/slots in a laptop,
         | they're often untenable due to the lack of idle power
         | management features on enterprise SSDs (because sleeping when
         | idle kills the latency metrics). Consider that the 480GB
         | DC1000B is officially rated for 1.9W at idle and in practice I
         | wasn't able to get it below 1.2W. Consumer SSDs with working
         | power management would idle under 10mW.
        
           | justinclift wrote:
            | Interesting, I haven't seen that aspect of enterprise SSDs
            | mentioned at all.
        
             | timschmidt wrote:
             | I usually have to tweak my home rackmount equipment over
             | IPMI to set the fan settings to idle when temps are low
             | instead of 24/7/365 jet engine mode which the local
             | university datacenter seems to prefer. I imagine running
             | the fans full blast all the time results in slightly more
             | reliable performance and reliability, but I don't have the
             | luxury of a soundproof HVAC controlled room to lock
             | everything away in, so the lower noise at lower utilization
             | matters to me. This trade-off definitely happens all over
             | the place in enterprise gear.
        
               | justinclift wrote:
               | You've been through all the stuff like setting max device
               | power for the NVMe drives and similar yeah?
               | 
                | https://nvmexpress.org/resource/technology-power-features/
        
               | timschmidt wrote:
                | I just upgraded my main server to an
                | https://www.supermicro.com/products/archive/motherboard/x9dr...
                | with dual Xeon E5-2680v2s. $15 for 20c/40t and 256gb
                | ECC from the local university surplus. Unfortunately,
                | it's still a little too old to offer NVMe. That doesn't
                | matter too much to me, since it's connected via 6gbps
                | SAS to several 24 bay NetApp disk shelves (also $15
                | ea!) full of disks. I have dealt with PCIe ASPM
                | (https://en.wikipedia.org/wiki/Active_State_Power_Management),
                | which similarly trades off between performance, power
                | consumption, and latency.
               | 
               | I am not hugely concerned with power consumption as I
               | have solar panels. But noise and excess heat matter to
               | me. My whole stack is consuming around 650W currently.
               | Wildly removed from the days I was running off a
               | creatively overprovisioned OpenWRT wireless router using
               | < 20W. But I'm happy to have the increased capacity for
               | my use cases. Gigabit synchronous fiber to the home is a
               | dangerous drug :)
        
               | justsomehnguy wrote:
                | Just use a PCIe converter for NVMe.
               | 
               | You wouldn't be able to boot from it, but it would work
               | just fine.
        
               | timschmidt wrote:
               | A good suggestion, but it's a 1U server with both
               | available PCIe slots populated between the external SAS
               | controller and a dual 10gbE CNA. I boot from SATA SSDs
               | currently. Could upgrade to enterprise SAS SSDs, but
               | haven't noticed a need for my current uses. The boot
               | drive is rarely touched, as most data terminates at the
               | NetApp disk shelves, which have been a _far more
               | reliable_ way to connect unholy amounts of storage to a
                | machine than any of the consumer direct-attached
                | storage options I've tried in the past. There's a mix
                | of SSDs and spinning rust in the disk shelves - SSDs
                | for web server stuff and spinning rust for low-cost
                | bulk storage.
        
               | justinclift wrote:
               | > with both available PCIe slots populated between the
               | external SAS controller and a dual 10gbE CNA.
               | 
               | Interestingly, there do appear to be PCIe cards that have
               | both external SAS and high speed networking in one:
               | 
                | https://forums.servethehome.com/index.php?threads/unique-oem...
               | 
                | Looks like a bunch of mucking around is needed to get
                | things set up, but it seems like it works ok from that
                | point onwards:
               | 
               | * https://www.ebay.com/itm/314150370285
               | 
               | Have been considering grabbing one to try in a HP
               | Microserver Gen 8 that runs Proxmox. Not a lot of free
               | time at the moment though. ;)
        
               | timschmidt wrote:
               | Neat! And affordable! Thanks! I will definitely look
               | further into them. I got the 10gbE CNA from another
               | university surplus find where a half dozen were being
               | used as 16gb fibre channel adapters with SFP fibre
               | modules. A little digging revealed that they could be
               | switched to 10gbE CNA mode which is a lot more useful for
               | my purposes. Enterprise gear is full of surprises :D
               | 
                | The only potential downside I see is the included PCIe
                | switch chip. It divides 8gbps between three devices,
                | each of which could independently consume all of that
                | bandwidth. Probably not a huge issue for my uses, but
                | worth considering if you plan on transferring large
                | amounts of data over the network and storage
                | simultaneously.
        
           | kjkjadksj wrote:
            | That's hardly untenable. Most computers draw a lot more
            | watts; you wouldn't miss 1.2W, imo. Even these "low-watt"
            | computers are drawing like 5-7 watts at idle, and once you
            | do anything on them it's a lot more than that.
        
             | wtallis wrote:
             | You wouldn't miss the 1.2-1.9W on a gaming laptop with a
             | discrete GPU. But for more mainstream systems with just
             | integrated graphics, 5W is on the high end for idle power
             | and even Intel's latest mid-range laptop parts can get down
             | to about 3W if you don't have Windows doing dumb shit in
             | the background constantly.
             | 
             | Also, on modern laptops you can't trust that components
             | will be powered off when you close the lid, because
             | Microsoft and Intel have been at war with ACPI S3 suspend
             | for years now. So that 1.2W could easily be the minimum
             | power draw of the SSD any time you haven't hibernated or
             | shut down the system.
        
         | otterley wrote:
         | It's also a strange comment: Enterprise DBs are usually run on
         | enterprise-grade hardware in production. To be sure, you can
         | test them on consumer-grade hardware, but if most folks are
         | going to run this stuff in production using enterprise-grade
         | SSDs, the software should be optimized for that hardware.
         | Mechanical sympathy is an important software engineering
         | practice.
        
       | jeffbee wrote:
       | The problem of commit latency bottlenecking global throughput
       | seems to be a matter of treating the device as a black box that
       | must be quiesced as a unit. If the NVMe device presented multiple
       | namespaces to the operating system, and if other unnecessary
       | abstractions like the filesystem are discarded, then the database
       | can make progress on one namespace while blocked on another, even
       | within a single host.
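        | 
        | A toy sketch of the shape of that (the device paths, the
        | thread pool, and the use of plain fdatasync() are my own
        | simplifications, not anything from the article): with two
        | namespaces exposed as separate block devices, each worker
        | writes and flushes only its own namespace, so a stall on one
        | doesn't gate progress on the other.
        | 
        |   #include <fcntl.h>
        |   #include <pthread.h>
        |   #include <stdio.h>
        |   #include <string.h>
        |   #include <unistd.h>
        |   
        |   /* Hypothetical namespace block devices. */
        |   static const char *ns[] = { "/dev/nvme0n1", "/dev/nvme0n2" };
        |   
        |   static void *worker(void *arg) {
        |       const char *dev = arg;
        |       char block[4096];
        |       memset(block, 'x', sizeof block);
        |   
        |       int fd = open(dev, O_WRONLY);
        |       if (fd < 0) { perror(dev); return NULL; }
        |   
        |       /* Each worker flushes only its own namespace. */
        |       for (int i = 0; i < 100; i++) {
        |           pwrite(fd, block, sizeof block,
        |                  (off_t)i * sizeof block);
        |           fdatasync(fd);
        |       }
        |       close(fd);
        |       return NULL;
        |   }
        |   
        |   int main(void) {
        |       pthread_t t[2];
        |       for (int i = 0; i < 2; i++)
        |           pthread_create(&t[i], NULL, worker, (void *)ns[i]);
        |       for (int i = 0; i < 2; i++)
        |           pthread_join(t[i], NULL);
        |       return 0;
        |   }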
        
         | senderista wrote:
         | Isn't this basically what FUA does on SCSI? Presumably
         | O_DIRECT+O_SYNC/O_DSYNC will invoke this path when possible?
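          | 
          | For reference, a minimal Linux sketch of that path (the file
          | name and sizes are placeholders): O_DIRECT needs aligned
          | buffers, and O_DSYNC asks that each write be durable before
          | it returns, which the block layer can turn into an FUA write
          | on devices that support it.
          | 
          |   #define _GNU_SOURCE          /* for O_DIRECT */
          |   #include <fcntl.h>
          |   #include <stdlib.h>
          |   #include <string.h>
          |   #include <unistd.h>
          |   
          |   int main(void) {
          |       /* O_DIRECT bypasses the page cache; O_DSYNC makes
          |        * each write durable before it returns. */
          |       int fd = open("wal.bin",
          |                     O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC,
          |                     0644);
          |       if (fd < 0) return 1;
          |   
          |       /* O_DIRECT needs sector-aligned buffers, offsets,
          |        * and sizes. */
          |       void *buf;
          |       if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
          |       memset(buf, 0, 4096);
          |   
          |       pwrite(fd, buf, 4096, 0);  /* durable once it returns */
          |   
          |       free(buf);
          |       close(fd);
          |       return 0;
          |   }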
        
         | lmz wrote:
          | There is a feature known as NVMe namespaces, but I'm not
          | sure how well it's supported outside of expensive enterprise
          | gear:
          | https://unix.stackexchange.com/questions/520231/what-are-nvm...
        
           | jeffbee wrote:
           | To a close approximation, NVMe devices that come in U.2 or
           | U.3 form factors support multiple namespaces, and those that
           | come in M.2 support only 1, and 4KiB LBA format support
           | follows the same pattern with a few exceptions. But you don't
            | have to spend an insane amount to get these features. The
            | Micron 7x00 series costs ~$100/TB.
        
       | louwrentius wrote:
       | Cool! They used my bench-fio and fio-plot tool for the 3D graph
       | :-) [0]
       | 
        | What the 3D graph (also) shows is how poorly SSDs actually
        | perform in single-threaded, low queue depth situations. Many
        | enterprise or datacenter-grade SSDs perform much better and
        | more consistently on this front than the average consumer SSD.
       | 
       | [0] https://github.com/louwrentius/fio-plot
        
       | refset wrote:
       | Slightly off-topic but CedarDB is extremely exciting. It's the
       | commercialization of the widely cited Umbra research DBMS [0]
       | that has been in the works for several years, which benchmarks
       | faster than DuckDB for OLAP [1] whilst simultaneously being
       | really strong for transactional workloads. Also discussed
       | recently here [2].
       | 
       | [0] https://umbra-db.com/
       | 
       | [1] https://cedardb.com/blog/ode_to_postgres/
       | 
       | [2] https://news.ycombinator.com/item?id=40241150
        
         | riku_iki wrote:
         | > which benchmarks faster than DuckDB for OLAP [1]
         | 
          | That link doesn't do any performance comparison. They claim
          | that CedarDB executes fewer "code branches" than DuckDB,
          | which may or may not translate to faster performance.
        
           | ChrisWint wrote:
           | Author of that blogpost here.
           | 
            | > fewer "code branches" than DuckDB, which may or may not
            | translate to faster performance.
           | 
           | In that case it was about 2.5x faster than DuckDB end to end,
           | so a bit less than the difference in branches.
           | 
            | If you want to see some independent benchmarks of Umbra,
            | our underlying technology, it's currently in first place
            | on ClickBench [1]. You can compare against DuckDB there as
            | well.
           | 
           | [1] https://benchmark.clickhouse.com/
        
             | riku_iki wrote:
              | ClickBench is a toy benchmark: small dataset, very
              | specific queries.
              | 
              | Benchmarking the full TPC-H (not just one query like in
              | your post) on a sizable dataset (a few TBs) would come
              | much closer to a real-world scenario, but vendors
              | usually avoid this.
        
               | pfent wrote:
               | There are TPC-H numbers in another post:
               | https://cedardb.com/blog/simple_efficient_hash_tables/
        
               | riku_iki wrote:
                | It's a great starting insight, but again it's a small
                | dataset (100GB) which almost fits in memory, and I
                | think many details are missing (for example,
                | ClickBench publishes all configs and queries, and a
                | more detailed report, so vendors can
                | reproduce/optimize/dispute them).
        
         | jauntywundrkind wrote:
          | A little sad finding out it's proprietary, but to be
          | expected, I suppose.
          | 
          | It's neat seeing Postgres gearing up for async support.
          | There are also folks like OrioleDB doing massive revamps of
          | Postgres & doing disaggregated storage in public.
         | https://github.com/orioledb/orioledb
         | https://hn.algolia.com/?query=orioledb&sort=byDate
         | 
         | Oh they were bought by Suprabase two months ago... Fingers
         | crossed! The Suprabase CEO commented at the time, with a nice
         | basic overview, https://news.ycombinator.com/item?id=40039138
        
           | pdimitar wrote:
           | It is Supabase, dude, _not_ SupRRRRRabase.
        
       | JosephRedfern wrote:
        | Maybe slightly off-topic, but does anyone know what the deal
        | is with the NVMe K/V command set spec? There was a fair bit of
        | noise made about it at the time, but I've not seen any drives
        | that support it (despite a few enquiries).
        
         | wmf wrote:
         | I think the plan is for ZNS to never be available at retail and
         | I imagine key-value will be the same.
        
         | wtallis wrote:
         | Those new command sets that require a re-write of your entire
         | software storage stack exist solely to satisfy the hyperscalers
         | for whom re-writing the entire software storage stack can be a
         | worthwhile optimization effort. If you don't command enough
         | purchasing power to already be getting custom SKUs from your
         | drive vendors, they don't have enough incentive to add those
         | features to the models they offer you.
        
           | riku_iki wrote:
            | I imagine it could be cool for some DB vendors too, so
            | they can optimize based on direct access and skip all the
            | OS/FS overhead.
        
       | floating-io wrote:
       | The seemingly critical parallel job performance issue sounds like
       | it can and should be handled transparently by the OS when
       | multiple pages are involved, and should thus be an area of
       | research if it isn't already.
       | 
       | Frankly, the idea that it isn't already a solved problem makes me
       | wonder if that could possibly be the case...
       | 
       | The writes thing... Outside of very rare circumstances, I would
       | think that properly batching inserts/updates within transactions
       | at the app level would resolve half of that, and multi-user
       | concurrency would buy back the rest.
       | 
       | What am I missing?
       | 
        | P.S. I'm mad at the author for enlightening me to the
        | real-world performance impact of enterprise SSDs. I'm about to
        | buy, and that might just cost me...
        
         | wtallis wrote:
         | > The seemingly critical parallel job performance issue sounds
         | like it can and should be handled transparently by the OS when
         | multiple pages are involved,
         | 
         | A thread can only fault on one page at a time, and only
         | sequential access patterns can benefit from the OS bringing in
         | multiple pages per fault. Having all of your CPU cores busy
         | handling page faults and context switches to threads that are
         | about to trigger another page fault is only going to generate
         | enough read traffic to keep a few SSDs busy, not a full shelf
         | of 24+ drives. mmap() and synchronous IO don't scale well
         | enough for today's SSDs.
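          | 
          | To make the contrast concrete, here's a rough liburing
          | sketch (the file name, block size, and batch size are
          | placeholders): one thread queues a whole batch of random
          | reads and submits them together, which is what keeps a
          | modern SSD's queues full without an army of faulting
          | threads.
          | 
          |   #include <fcntl.h>
          |   #include <liburing.h>
          |   #include <stdlib.h>
          |   
          |   #define BATCH 64
          |   #define BLK   4096
          |   
          |   int main(void) {
          |       struct io_uring ring;
          |       io_uring_queue_init(BATCH, &ring, 0);
          |   
          |       int fd = open("data.bin", O_RDONLY);
          |       if (fd < 0) return 1;
          |       char *bufs = malloc((size_t)BATCH * BLK);
          |   
          |       /* Queue a batch of reads, then submit them with a
          |        * single syscall. */
          |       for (int i = 0; i < BATCH; i++) {
          |           struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
          |           off_t off = (off_t)(rand() % 1024) * BLK;
          |           io_uring_prep_read(sqe, fd, bufs + (size_t)i * BLK,
          |                              BLK, off);
          |       }
          |       io_uring_submit(&ring);
          |   
          |       /* Reap completions; the device saw all 64 reads at
          |        * once instead of one page fault at a time. */
          |       for (int i = 0; i < BATCH; i++) {
          |           struct io_uring_cqe *cqe;
          |           io_uring_wait_cqe(&ring, &cqe);
          |           io_uring_cqe_seen(&ring, cqe);
          |       }
          |   
          |       io_uring_queue_exit(&ring);
          |       return 0;
          |   }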
        
       | prewett wrote:
        | A little off-topic, but since the article mentioned that macOS
        | doesn't necessarily do a flush even on fsync(), I've been
        | wondering how long it is before macOS actually writes to the
       | drive. A non-trivial number of my writes are a compile followed
       | by, oh yeah, quick fix, recompile. It'd be nice if it would hold
       | off for a while. Given that I'm on a laptop, the risk of losing
       | data is limited to the very rare times I actually run out of
       | power or a kernel panic.
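        | 
        | For the cases where I do care, macOS documents F_FULLFSYNC as
        | the stronger call (minimal sketch below; the file name is a
        | placeholder): fsync() only pushes data to the drive, while
        | fcntl(fd, F_FULLFSYNC) also asks the drive to flush its own
        | cache.
        | 
        |   #include <fcntl.h>
        |   #include <unistd.h>
        |   
        |   int main(void) {
        |       int fd = open("commit.log", O_WRONLY | O_CREAT, 0644);
        |       if (fd < 0) return 1;
        |   
        |       write(fd, "record\n", 7);
        |   
        |       /* macOS: data reaches the drive, but may still sit in
        |        * the drive's volatile write cache. */
        |       fsync(fd);
        |   #ifdef F_FULLFSYNC
        |       /* Also flush the drive's write cache. */
        |       fcntl(fd, F_FULLFSYNC);
        |   #endif
        |   
        |       close(fd);
        |       return 0;
        |   }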
        
       | the8472 wrote:
       | > If you want your data to be stored persistent for real, you
       | need to issue a sync command to make the operating system block
       | and only return control after the data has been persisted.
       | 
        | Well, on Linux the work that has to be done during a blocking
        | sync() can be minimized by triggering async writeback via
        | sync_file_range() in advance.
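        | 
        | Roughly like this (a sketch of the pattern, with made-up
        | helper names): start writeback for each chunk as soon as it's
        | written, so the later fdatasync() mostly just waits for I/O
        | that is already in flight.
        | 
        |   #define _GNU_SOURCE   /* sync_file_range() is Linux-only */
        |   #include <fcntl.h>
        |   #include <unistd.h>
        |   
        |   /* Write one chunk and kick off async writeback for it. */
        |   static void append_chunk(int fd, const char *buf,
        |                            size_t len, off_t off) {
        |       pwrite(fd, buf, len, off);
        |       sync_file_range(fd, off, (off_t)len,
        |                       SYNC_FILE_RANGE_WRITE);
        |   }
        |   
        |   /* At commit time, wait for everything to be durable; most
        |    * of the work has already happened asynchronously. */
        |   static void commit(int fd) {
        |       fdatasync(fd);
        |   }
        |   
        |   int main(void) {
        |       int fd = open("wal.log", O_WRONLY | O_CREAT, 0644);
        |       if (fd < 0) return 1;
        |       append_chunk(fd, "hello\n", 6, 0);
        |       commit(fd);
        |       close(fd);
        |       return 0;
        |   }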
        
       | cogman10 wrote:
       | > Instead of writing every single change to the SSD immediately,
       | we can instead queue many such writes and persist the whole queue
       | in one go. Instead of having one latency-sensitive round trip per
       | commit, we now have one round trip per queue flush, let's say
       | every 100 commits. We just have to be careful not to tell the
       | user that their data has been committed before the queue
       | containing their data has been flushed.
       | 
        | I'm a little bit unclear: is the suggested method here issuing
        | 100 `write` commands and then a `sync` (or something similar),
        | and only informing the downstream of success when the `sync`
        | has finished? Or is the DB actually maintaining a queue of
        | commits to be written that get fsynced all at once?
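        | 
        | My guess is it's the latter: classic group commit, something
        | like this toy sketch (names and sizes made up; real engines
        | use a background WAL writer and condition variables): buffer
        | the records of many commits, fdatasync() once, and only then
        | report success for all of them.
        | 
        |   #include <fcntl.h>
        |   #include <stdio.h>
        |   #include <string.h>
        |   #include <unistd.h>
        |   
        |   #define GROUP_SIZE 100   /* commits per flush */
        |   
        |   static int         wal_fd;
        |   static const char *pending[GROUP_SIZE];
        |   static int         npending;
        |   
        |   /* Called per transaction: the record is written, but the
        |    * caller must NOT be told "committed" yet. */
        |   static void stage_commit(const char *record) {
        |       write(wal_fd, record, strlen(record));
        |       pending[npending++] = record;
        |   
        |       if (npending == GROUP_SIZE) {
        |           /* One flush covers the whole group. */
        |           fdatasync(wal_fd);
        |           /* Only now is it safe to acknowledge each commit. */
        |           for (int i = 0; i < npending; i++)
        |               printf("committed: %s", pending[i]);
        |           npending = 0;
        |       }
        |   }
        |   
        |   int main(void) {
        |       wal_fd = open("wal.log",
        |                     O_WRONLY | O_CREAT | O_APPEND, 0644);
        |       if (wal_fd < 0) return 1;
        |       for (int i = 0; i < 300; i++)
        |           stage_commit("tx record\n");
        |       close(wal_fd);
        |       return 0;
        |   }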
        
       ___________________________________________________________________
       (page generated 2024-06-19 23:01 UTC)