[HN Gopher] When are SSDs slow?
___________________________________________________________________
When are SSDs slow?
Author : eatonphil
Score : 77 points
Date : 2024-06-18 15:26 UTC (1 day ago)
(HTM) web link (cedardb.com)
(TXT) w3m dump (cedardb.com)
| wmf wrote:
| This "non-production mode" has a bad side effect where benchmarks
| on laptops will be artificially fast and people will "forget"
| about the data loss disclaimer, just like the bad old days of
| MongoDB.
| jauntywundrkind wrote:
| The article also mentions that enterprise SSDs often have
| supercaps, so flushes can be basically free: there's enough
| runtime (even if power goes out) to persist any in-flight data.
|
| So it seems like it wouldn't be an issue. But it would mean
| running your dev & prod in two different modes.
| jakedata wrote:
| In Postgres, synchronous_commit = off can seem like magic during
| benchmarks, but in real-life applications the anticipated benefits
| can be underwhelming. Ultimately, a reliable non-volatile write
| cache, sized for your workload, is the answer. It evens out all
| the peaks and valleys of flash writes and hands control back to
| the database engine so it can get on with its work.
| lvogel wrote:
| > Ultimately, a reliable non-volatile write cache, sized for
| your workload is the answer.
|
| Author here, I agree! It's quite sad that we need such an
| involved solution to offset the inherent complexity of the
| flash medium (latency spikes, erase blocks, ...). We nearly had
| the perfect solution with Optane[1]: 100ns latency, instantly
| persisted writes and all that good stuff.
|
| I'm still not over Intel killing it while I did my PhD on it.
|
| [1] https://en.m.wikipedia.org/wiki/3D_XPoint
| throwaway81523 wrote:
| There used to be battery backed ramdisks with SATA
| interfaces. Are they history now? These days it would be NVMe
| or whatever, of course.
| jakedata wrote:
| Radian makes ultracapacitor backed RAM cards with flash
| backup that appear to be NVMe devices. They do a nice job
| for things like Write Ahead Logs and ZFS ARC/ZIL. They
| offer effectively unlimited write endurance with a tradeoff
| of cost and capacity.
| hi-v-rocknroll wrote:
| More real-world industry experience would've made it clear
| that they're totally useless for datacenters, where power
| outages are rare and electrical uptime has many 9's, measured
| in months or years.
|
| UPSes and BBWC evolved to bring reliability to production
| gear running in non-DC environments when mainline servers
| used spinning rust without backup power. Today, it's largely
| a vendor up-charge.
|
| Write barriers cause far too much latency in practice on
| servers in tier IV datacenters, so they're almost always
| turned off except for a tiny fraction of systems.
|
| There has never been a "perfect" or a universal solution,
| only a risk budget and suitability for a specific use-case.
| jakedata wrote:
| Write barriers aren't just for durability - they can also
| even out major latency spikes when bloated buffers
| ultimately must be flushed. Database, filesystem, RAID,
| device, flash management and individual chips all become
| saturated at some point. Managing performance at the
| database engine layer gives you visibility into the issue
| at a high level as opposed to rolling merrily along until
| commits start taking 2000 ms. As an example, ZFS is
| terrifyingly unpredictable at high loads.
| otterley wrote:
| Power availability isn't the only concern when it comes to
| taking the risk of async writes. Kernel panics will cause
| unflushed writes to be lost, and storage hardware can
| experience gray or total failures, sometimes completely
| suddenly.
| sam_bristow wrote:
| Related:
|
| "What Modern NVMe Storage Can Do, and How to Exploit it"[1] by
| Gabriel Haas and Viktor Leis.
|
| [1] https://dl.acm.org/doi/10.14778/3598581.3598584
| soonnow wrote:
| We used a small embedded db in our desktop product only for
| durability and SQL semantics. After we got multiple database
| corruptions we switched to a small file based store written in
| Java with simple DB semantics.
|
| Looking carefully at what happens when data is lost we found that
| it's basically all recoverable.
|
| This led to amazing performance, as data is flushed only when
| necessary. We now flush only every 10 seconds, or when the
| write-through cache is full.
|
| My lesson learned from that is to think carefully about the
| actual transaction requirements; having loose constraints may
| be the quickest way to improve performance.
|
| For many applications even a second or two of lost data may not
| be a big issue.
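|
| For illustration, a rough sketch in C of that kind of policy (the
| buffer size and record handling are made up; only the 10-second
| interval comes from the description above):
|
|     #include <string.h>
|     #include <time.h>
|     #include <unistd.h>
|
|     #define BUF_CAP (1 << 20)        /* 1 MiB write buffer (illustrative) */
|     #define FLUSH_INTERVAL 10        /* seconds between forced flushes */
|
|     static char   buf[BUF_CAP];
|     static size_t used;
|     static time_t last_flush;
|
|     /* The flush is the only expensive step: one write + one fsync. */
|     static void flush(int fd) {
|         write(fd, buf, used);
|         fsync(fd);
|         used = 0;
|         last_flush = time(NULL);
|     }
|
|     /* Buffer a record; flush only when the buffer is full or the
|        flush interval has elapsed. */
|     void put(int fd, const char *rec, size_t len) {
|         if (used + len > BUF_CAP || time(NULL) - last_flush >= FLUSH_INTERVAL)
|             flush(fd);
|         memcpy(buf + used, rec, len);
|         used += len;
|     }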
| hi-v-rocknroll wrote:
| How far we've come from spinning rust on dev boxes and mainline
| warm servers back in 2011.[0]
|
| 0.
| https://web.archive.org/web/20111112045055/http://buyafuckin...
| sroussey wrote:
| People don't warm servers before entering production anymore?
| Havoc wrote:
| Small Optanes (50-100 gigs) are quite accessible cost-wise and
| solve most of the queue depth problem.
| ssl-3 wrote:
| Optane is neat tech, but it's also dead tech.
| lvogel wrote:
| Tell me about it... Was just too expensive compared to flash.
| But the tech was definitely awesome and I hope it'll come
| back in one way or another.
|
| When designing CedarDB, we recently had multiple instances
| where we thought: "If we had just a few KiB of Optane here
| ... "
| ssl-3 wrote:
| Oh, it's amazing. I use a small chunk of it at home for
| ZIL, even though my synch writes are few and far between.
|
| It's just approximately the Right Thing to use for some
| things, and I had room for it, and it was rather
| inexpensive when I bought it. :)
|
| 10/10, would recommend and buy again. (Whoops!)
| the8472 wrote:
| If it's just a few KiB then various NVRAM techs have been
| slowly creeping forward for decades. But they've never made
| it to datacenter storage.
|
| https://www.avalanche-technology.com/wp-content/uploads/1Gb-...
| https://www.everspin.com/spin-transfer-torque-ddr-products
| Havoc wrote:
| >dead tech.
|
| Agreed in general, but surely for this use case (i.e. a DB) it
| would still solve the problem, no?
|
| Or am I missing something here?
| ssl-3 wrote:
| It does solve the problem for now, but the parts aren't
| being made anymore.
|
| Maybe that's good enough for some people in some
| applications, but a lot of folks have very valid
| reservations about using EOL'd products regardless of how
| well-suited they may be at solving a problem.
|
| I mean: It's like work trucks. I'm a huge fan of GM's
| Astro/Safari line of RWD cargo vans. Their 4.3l V6 is solid
| and adequately powerful, the 4L60E transmission is reliable,
| and there's a ton of space in the back - enough height to load
| 4' tall sheet materials standing on edge, and to sit down in
| front of a bolted-down workbench. The longitudinal RWD layout
| has advantages, they're fairly easy to work on when something
| inevitably wears out, the deck height is low, and they're
| small enough to drive through a fairly normal parking garage
| without worry and to fit into a tight space.
|
| They're pretty awesome (except maybe for safety and fuel
| economy, but perhaps those aren't my primary concerns), and
| they solve a particular set of problems very well.
|
| But they aren't being made anymore, and haven't been made
| for over 20 years. I own _one_ example of a GMC Safari that
| I'd like to get working well again some day and maybe kit
| out in some interesting way, but I'm absolutely not going
| to suggest that anyone buy a fleet of them to use in
| business because:
|
| They, like Optane, are EOL'd. All of them that will ever be
| produced have already been made.
|
| ---
|
| "So can you explain to me again why you think we should be
| buying _used, old parts_ for our new datacenter? I know you
| 've told me before, but still not sure that I have the same
| perspective as you do."
| justinclift wrote:
| This doesn't seem right:
|
| > Unfortunately, my laptop, and probably many other developers'
| > laptops, do not have an enterprise-grade SSD, or even the
| > option to install one.
|
| Enterprise grade SATA SSDs are fairly easy to obtain (Samsung
| PM893, Kingston DC600M).
|
| Enterprise M.2 SSDs are also a thing (Kingston DC1000B).
|
| If they're using a mac and are limited to external connectivity,
| then they could throw the above into a thunderbolt enclosure.
| wtallis wrote:
| Even when there are enterprise SSDs that are mechanically and
| electrically compatible with the bays/slots in a laptop,
| they're often untenable due to the lack of idle power
| management features on enterprise SSDs (because sleeping when
| idle kills the latency metrics). Consider that the 480GB
| DC1000B is officially rated for 1.9W at idle and in practice I
| wasn't able to get it below 1.2W. Consumer SSDs with working
| power management would idle under 10mW.
| justinclift wrote:
| Interesting, I haven't seen that aspect of enterprise SSDs
| mentioned at all.
| timschmidt wrote:
| I usually have to tweak my home rackmount equipment over
| IPMI to set the fan settings to idle when temps are low
| instead of 24/7/365 jet engine mode which the local
| university datacenter seems to prefer. I imagine running
| the fans full blast all the time results in slightly more
| reliable performance and reliability, but I don't have the
| luxury of a soundproof HVAC controlled room to lock
| everything away in, so the lower noise at lower utilization
| matters to me. This trade-off definitely happens all over
| the place in enterprise gear.
| justinclift wrote:
| You've been through all the stuff like setting max device
| power for the NVMe drives and similar, yeah?
|
| https://nvmexpress.org/resource/technology-power-features/
| timschmidt wrote:
| I just upgraded my main server to an
| https://www.supermicro.com/products/archive/motherboard/x9dr...
| with dual Xeon E5-2680v2s. $15 for 20c/40t and 256GB ECC from
| the local university surplus. Unfortunately, it's still a
| little too old to offer NVMe. That doesn't matter too much to
| me, since it's connected via 6Gbps SAS to several 24-bay
| NetApp disk shelves (also $15 ea!) full of disks. I have
| dealt with PCIe ASPM
| (https://en.wikipedia.org/wiki/Active_State_Power_Management),
| which similarly trades off between performance, power
| consumption, and latency.
|
| I am not hugely concerned with power consumption as I
| have solar panels. But noise and excess heat matter to
| me. My whole stack is consuming around 650W currently.
| Wildly removed from the days I was running off a
| creatively overprovisioned OpenWRT wireless router using
| < 20W. But I'm happy to have the increased capacity for
| my use cases. Gigabit synchronous fiber to the home is a
| dangerous drug :)
| justsomehnguy wrote:
| Just use a PCIe converter for NVMe.
|
| You wouldn't be able to boot from it, but it would work
| just fine.
| timschmidt wrote:
| A good suggestion, but it's a 1U server with both
| available PCIe slots populated between the external SAS
| controller and a dual 10gbE CNA. I boot from SATA SSDs
| currently. Could upgrade to enterprise SAS SSDs, but
| haven't noticed a need for my current uses. The boot
| drive is rarely touched, as most data terminates at the
| NetApp disk shelves, which have been a _far more
| reliable_ way to connect unholy amounts of storage to a
| machine than any of the consumer direct attached storage
| options I've tried in the past. There's a mix of SSDs
| and spinning rust in the disk shelves - SSDs for web
| server stuff and spinning rust for low cost bulk storage.
| justinclift wrote:
| > with both available PCIe slots populated between the
| external SAS controller and a dual 10gbE CNA.
|
| Interestingly, there do appear to be PCIe cards that have
| both external SAS and high speed networking in one:
|
| https://forums.servethehome.com/index.php?threads/unique-oem...
|
| Looks like a bunch of mucking around is needed to get
| things set up, but it seems like it works ok from that point
| onwards:
|
| * https://www.ebay.com/itm/314150370285
|
| Have been considering grabbing one to try in an HP
| Microserver Gen 8 that runs Proxmox. Not a lot of free
| time at the moment though. ;)
| timschmidt wrote:
| Neat! And affordable! Thanks! I will definitely look
| further into them. I got the 10gbE CNA from another
| university surplus find where a half dozen were being
| used as 16Gb Fibre Channel adapters with SFP fibre
| modules. A little digging revealed that they could be
| switched to 10gbE CNA mode which is a lot more useful for
| my purposes. Enterprise gear is full of surprises :D
|
| The only potential downside I see is the included PCIe switch
| chip. It divides 8Gbps between three devices, each of which
| could independently consume all of that bandwidth.
| Probably not a huge issue for my uses, but worth
| considering if you plan on transferring large amounts of
| data over network and storage simultaneously.
| kjkjadksj wrote:
| That's hardly untenable. Most computers draw a lot more watts;
| you wouldn't miss 1.2W, IMO. Even these "low watt" computers
| are drawing like 5-7 watts at idle, and once you do anything
| on them it's a lot more than that.
| wtallis wrote:
| You wouldn't miss the 1.2-1.9W on a gaming laptop with a
| discrete GPU. But for more mainstream systems with just
| integrated graphics, 5W is on the high end for idle power
| and even Intel's latest mid-range laptop parts can get down
| to about 3W if you don't have Windows doing dumb shit in
| the background constantly.
|
| Also, on modern laptops you can't trust that components
| will be powered off when you close the lid, because
| Microsoft and Intel have been at war with ACPI S3 suspend
| for years now. So that 1.2W could easily be the minimum
| power draw of the SSD any time you haven't hibernated or
| shut down the system.
| otterley wrote:
| It's also a strange comment: Enterprise DBs are usually run on
| enterprise-grade hardware in production. To be sure, you can
| test them on consumer-grade hardware, but if most folks are
| going to run this stuff in production using enterprise-grade
| SSDs, the software should be optimized for that hardware.
| Mechanical sympathy is an important software engineering
| practice.
| jeffbee wrote:
| The problem of commit latency bottlenecking global throughput
| seems to be a matter of treating the device as a black box that
| must be quiesced as a unit. If the NVMe device presented multiple
| namespaces to the operating system, and if other unnecessary
| abstractions like the filesystem are discarded, then the database
| can make progress on one namespace while blocked on another, even
| within a single host.
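|
| For illustration only, a sketch of that structure (device names
| hypothetical; whether a flush on one namespace really stays
| independent of the other depends entirely on the drive):
|
|     #include <fcntl.h>
|     #include <unistd.h>
|
|     int main(void) {
|         /* Two namespaces of the same controller show up as
|            separate block devices. */
|         int wal_a = open("/dev/nvme0n1", O_WRONLY);   /* namespace 1 */
|         int wal_b = open("/dev/nvme0n2", O_WRONLY);   /* namespace 2 */
|         if (wal_a < 0 || wal_b < 0) return 1;
|
|         /* ... writes issued to wal_a ... */
|         fdatasync(wal_a);   /* quiesce only what was queued to namespace 1 */
|         /* ... meanwhile another thread can keep queueing writes to
|            wal_b rather than waiting behind this flush ... */
|
|         close(wal_a);
|         close(wal_b);
|         return 0;
|     }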
| senderista wrote:
| Isn't this basically what FUA does on SCSI? Presumably
| O_DIRECT+O_SYNC/O_DSYNC will invoke this path when possible?
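|
| For illustration, a minimal Linux sketch (file name, sizes and
| alignment made up) of the O_DIRECT + O_DSYNC path, where each
| write only returns once it is durable and the kernel can use FUA
| writes on devices that support them:
|
|     #define _GNU_SOURCE              /* for O_DIRECT */
|     #include <fcntl.h>
|     #include <stdlib.h>
|     #include <unistd.h>
|
|     int main(void) {
|         /* Each pwrite() returns only after the data is persistent;
|            no separate fsync() is needed for it. */
|         int fd = open("wal.log",
|                       O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
|         if (fd < 0) return 1;
|
|         /* O_DIRECT needs buffers aligned to the logical block size. */
|         void *buf;
|         if (posix_memalign(&buf, 4096, 4096)) return 1;
|         /* ... fill buf with a log record ... */
|
|         ssize_t n = pwrite(fd, buf, 4096, 0);
|         return n == 4096 ? 0 : 1;
|     }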
| lmz wrote:
| There is a feature known as NVMe namespaces, but I'm not sure how
| well it's supported outside of expensive enterprise gear:
| https://unix.stackexchange.com/questions/520231/what-are-nvm...
| jeffbee wrote:
| To a close approximation, NVMe devices that come in U.2 or
| U.3 form factors support multiple namespaces, and those that
| come in M.2 support only 1, and 4KiB LBA format support
| follows the same pattern with a few exceptions. But you don't
| have to spend an insane amount to get these features. The
| Micron 7x00 series costs ~$100/TB.
| louwrentius wrote:
| Cool! They used my bench-fio and fio-plot tool for the 3D graph
| :-) [0]
|
| What the 3D graph (also) shows is how poorly SSDs actually
| perform in single-threaded, low queue depth situations.
| Enterprise or datacenter-grade SSDs often perform much better and
| more consistently on this front than the average consumer SSD.
|
| [0] https://github.com/louwrentius/fio-plot
| refset wrote:
| Slightly off-topic but CedarDB is extremely exciting. It's the
| commercialization of the widely cited Umbra research DBMS [0]
| that has been in the works for several years, which benchmarks
| faster than DuckDB for OLAP [1] whilst simultaneously being
| really strong for transactional workloads. Also discussed
| recently here [2].
|
| [0] https://umbra-db.com/
|
| [1] https://cedardb.com/blog/ode_to_postgres/
|
| [2] https://news.ycombinator.com/item?id=40241150
| riku_iki wrote:
| > which benchmarks faster than DuckDB for OLAP [1]
|
| that link doesn't do any performance comparison. They claim
| that CedarDB executes fewer "code branches" than DuckDB, which
| may or may not translate to faster performance.
| ChrisWint wrote:
| Author of that blogpost here.
|
| > fewer "code branches" than DuckDB, which may or may not
| > translate to faster performance.
|
| In that case it was about 2.5x faster than DuckDB end to end,
| so a bit less than the difference in branches.
|
| If you want to see some independent benchmarks on Umbra, our
| underlying technology, it's currently in first place on
| ClickBench [1]. You can compare against DuckDB there as well.
|
| [1] https://benchmark.clickhouse.com/
| riku_iki wrote:
| ClickBench is a toy benchmark: small dataset, very specific
| queries.
|
| Benchmarking the full TPC-H suite (not just one query like in
| your post) on a sizable dataset (a few TBs) would be much
| closer to a real-world scenario, but vendors usually avoid this.
| pfent wrote:
| There are TPC-H numbers in another post:
| https://cedardb.com/blog/simple_efficient_hash_tables/
| riku_iki wrote:
| It's a great starting insight, but again it's a small dataset
| (100GB) which almost fits in memory, and I think many
| details are missing (for example, ClickBench publishes all
| configs and queries, and a more detailed report, so vendors
| can reproduce/optimize/dispute them).
| jauntywundrkind wrote:
| A little sad finding out it's proprietary but to be expected I
| suppose.
|
| It's neat seeing postgres gearing up for async support. There's
| also folks like OrioleDB doing massive revamps of postgres &
| doing disaggregated storage in public.
| https://github.com/orioledb/orioledb
| https://hn.algolia.com/?query=orioledb&sort=byDate
|
| Oh they were bought by Suprabase two months ago... Fingers
| crossed! The Suprabase CEO commented at the time, with a nice
| basic overview, https://news.ycombinator.com/item?id=40039138
| pdimitar wrote:
| It is Supabase, dude, _not_ SupRRRRRabase.
| JosephRedfern wrote:
| Maybe slightly off-topic, but does anyone know what the deal is
| with the NVMe K/V command set spec? There was a fair bit of noise
| made about it at the time, but I've not seen any drives that
| support it (despite a few enquiries)
| wmf wrote:
| I think the plan is for ZNS to never be available at retail and
| I imagine key-value will be the same.
| wtallis wrote:
| Those new command sets that require a re-write of your entire
| software storage stack exist solely to satisfy the hyperscalers
| for whom re-writing the entire software storage stack can be a
| worthwhile optimization effort. If you don't command enough
| purchasing power to already be getting custom SKUs from your
| drive vendors, they don't have enough incentive to add those
| features to the models they offer you.
| riku_iki wrote:
| I imagine it could be cool for some DB vendors too, so they
| can optimize based on direct access and skip all the OS/FS
| overhead.
| floating-io wrote:
| The seemingly critical parallel job performance issue sounds like
| it can and should be handled transparently by the OS when
| multiple pages are involved, and should thus be an area of
| research if it isn't already.
|
| Frankly, the idea that it isn't already a solved problem makes me
| wonder if that could possibly be the case...
|
| The writes thing... Outside of very rare circumstances, I would
| think that properly batching inserts/updates within transactions
| at the app level would resolve half of that, and multi-user
| concurrency would buy back the rest.
|
| What am I missing?
|
| P.S. I'm mad at the author for enlightening me to the real-world
| performance impact of enterprise SSDs. I'm about to buy, and that
| might just cost me...
| wtallis wrote:
| > The seemingly critical parallel job performance issue sounds
| like it can and should be handled transparently by the OS when
| multiple pages are involved,
|
| A thread can only fault on one page at a time, and only
| sequential access patterns can benefit from the OS bringing in
| multiple pages per fault. Having all of your CPU cores busy
| handling page faults and context switches to threads that are
| about to trigger another page fault is only going to generate
| enough read traffic to keep a few SSDs busy, not a full shelf
| of 24+ drives. mmap() and synchronous IO don't scale well
| enough for today's SSDs.
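|
| For contrast, a rough sketch (assuming liburing; queue depth and
| read size made up) of how a single thread can keep a whole batch
| of reads in flight instead of faulting one page at a time:
|
|     #include <liburing.h>
|     #include <stdlib.h>
|     #include <sys/types.h>
|
|     #define QD 128                   /* queue depth */
|     #define BS 4096                  /* bytes per read */
|
|     /* Submit up to QD reads at the given offsets, then reap them.
|        One thread, many requests in flight at once. */
|     int read_batch(int fd, const off_t *offsets, unsigned n) {
|         if (n > QD) return -1;
|         struct io_uring ring;
|         if (io_uring_queue_init(QD, &ring, 0) < 0) return -1;
|         char *bufs = malloc((size_t)n * BS);
|
|         for (unsigned i = 0; i < n; i++) {
|             struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
|             io_uring_prep_read(sqe, fd, bufs + (size_t)i * BS, BS,
|                                offsets[i]);
|         }
|         io_uring_submit(&ring);      /* all n reads now in flight */
|
|         for (unsigned i = 0; i < n; i++) {
|             struct io_uring_cqe *cqe;
|             io_uring_wait_cqe(&ring, &cqe);
|             io_uring_cqe_seen(&ring, cqe);
|         }
|
|         io_uring_queue_exit(&ring);
|         free(bufs);
|         return 0;
|     }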
| prewett wrote:
| A little off-topic, but since the article mentioned macOS not
| necessarily doing an actual flush even on fsync(), I've been
| wondering how long it is before macOS actually writes to the
| drive. A non-trivial number of my writes are a compile followed
| by, oh yeah, quick fix, recompile. It'd be nice if it would hold
| off for a while. Given that I'm on a laptop, the risk of losing
| data is limited to the very rare times I actually run out of
| power or hit a kernel panic.
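|
| For what it's worth, if you do want a real flush on macOS, the
| usual trick (SQLite does this, for example) is fcntl() with
| F_FULLFSYNC; a minimal sketch:
|
|     #include <fcntl.h>
|     #include <unistd.h>
|
|     /* fsync() on macOS only hands the data to the drive;
|        F_FULLFSYNC asks the drive to flush its own cache too. */
|     int durable_sync(int fd) {
|         if (fcntl(fd, F_FULLFSYNC) == 0)
|             return 0;         /* flushed all the way to media */
|         return fsync(fd);     /* fall back where F_FULLFSYNC isn't supported */
|     }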
| the8472 wrote:
| > If you want your data to be stored persistent for real, you
| need to issue a sync command to make the operating system block
| and only return control after the data has been persisted.
|
| Well, on Linux the work that has to be done during a blocking
| sync() can be minimized by triggering async writeback via
| sync_file_range in advance.
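|
| For illustration, a minimal sketch (Linux-only; names made up).
| Note that sync_file_range() on its own is not a durability
| guarantee; it just starts writeback early so the final
| fdatasync() has less left to do:
|
|     #define _GNU_SOURCE              /* for sync_file_range() */
|     #include <fcntl.h>
|     #include <unistd.h>
|
|     /* Write a chunk and immediately kick off asynchronous
|        writeback of just that range, without waiting for it. */
|     void write_chunk(int fd, const void *buf, size_t len, off_t off) {
|         pwrite(fd, buf, len, off);
|         sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
|     }
|
|     /* The commit point still needs a real barrier: fdatasync()
|        blocks, but most pages are already on their way out. */
|     void commit(int fd) {
|         fdatasync(fd);
|     }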
| cogman10 wrote:
| > Instead of writing every single change to the SSD immediately,
| we can instead queue many such writes and persist the whole queue
| in one go. Instead of having one latency-sensitive round trip per
| commit, we now have one round trip per queue flush, let's say
| every 100 commits. We just have to be careful not to tell the
| user that their data has been committed before the queue
| containing their data has been flushed.
|
| I'm a little bit unclear: is the suggested method here issuing
| 100 `write` commands and then a `sync` (or something similar), and
| only informing the downstream of success when the `sync` has
| finished? Or is the DB actually maintaining a queue of commits to
| be written that get fsynced all at once?
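|
| For what it's worth, a rough sketch of the first reading (group
| commit), with made-up sizes and names - the second reading mostly
| differs in who owns the queue:
|
|     #include <string.h>
|     #include <unistd.h>
|
|     #define BATCH_CAP (100 * 512)    /* room for ~100 log records */
|
|     struct batch {
|         char   buf[BATCH_CAP];
|         size_t used;
|         int    pending;     /* commits waiting to be acknowledged */
|     };
|
|     /* Append a record; the caller must NOT report success yet. */
|     void enqueue_commit(struct batch *b, const void *rec, size_t len) {
|         memcpy(b->buf + b->used, rec, len);
|         b->used += len;
|         b->pending++;
|     }
|
|     /* One write + one sync round trip covers every queued commit. */
|     void flush_batch(int wal_fd, struct batch *b) {
|         write(wal_fd, b->buf, b->used);
|         fdatasync(wal_fd);
|         /* only now is it safe to acknowledge all b->pending commits */
|         b->pending = 0;
|         b->used = 0;
|     }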
___________________________________________________________________
(page generated 2024-06-19 23:01 UTC)