[HN Gopher] SSDs have become fast, except in the cloud
___________________________________________________________________
SSDs have become fast, except in the cloud
Author : greghn
Score : 308 points
Date : 2024-02-20 16:59 UTC (6 hours ago)
(HTM) web link (databasearchitects.blogspot.com)
(TXT) w3m dump (databasearchitects.blogspot.com)
| teaearlgraycold wrote:
| What's a good small cloud competitor to AWS? For teams that just
| need two AZs to get HA and your standard stuff like VMs, k8s,
| etc.
| leroman wrote:
| DigitalOcean has been great, have managed k8s and generally
| good bang for your buck
| ThrowawayTestr wrote:
| Buy a server
| nullindividual wrote:
| I don't like this answer.
|
| When I look at cloud, I get to think "finally! No more
| hardware to manage. No OS to manage". It's the best thing
| about the cloud, provided your workload is amenable to PaaS.
| It's great because I don't have to manage Windows or IIS.
| Microsoft does that part for me and significantly cheaper
| than it would be to employ me to do that work.
| adastra22 wrote:
| There is tooling to provide the same PaaS interface though,
| so your cost doing those things amounts to running OS
| updates.
| cynicalsecurity wrote:
| A cloud is just someone else's computer.
|
| When you rent a bare metal server, you don't manage your
| hardware either. The failed parts are replaced for you.
| Unless you can't figure out what hardware configuration you
| need - which would be a really big red flag for your level
| of expertise.
| tjoff wrote:
| And now you have to manage the cloud instead. Which turns
| out to be more hassle and with no overlap with the actual
| problem you are trying to solve.
|
| So not only do you spend time on the wrong thing, you don't
| even know how it works. And the provider's goals aren't
| aligned with yours either, as all they care about is locking
| you in.
|
| How is that better?
| deathanatos wrote:
| > _No more hardware to manage. No OS to manage_
|
| We must be using different clouds.
|
| For some of the much higher-level services ... maybe some
| semblance of that statement holds. But for VMs? Definitely
| not "no OS to manage" ... the OS is usually on the
| customer. There might be OS-level agents from your cloud of
| choice that make certain operations easier ... but I'm
| still on the hook for updates.
|
| Even "No machine" is a stretch, though I've found this is
| much more dependent on cloud. AWS typically notices
| failures before I do, and by the time I notice something is
| up, the VM has been migrated to a new host and I'm none the
| wiser, save for the reboot it cost. But other clouds I've been
| less lucky with: we've caught host failures well before the
| cloud provider, to an extent where I've wished there was a
| "vote of no confidence" API call I could make to say "give
| me new HW, and I personally think this HW is suss".
|
| Even on higher level services like RDS, or S3, I've noticed
| failures prior to AWS ... or even to the extent that I
| don't know that AWS would have noticed those failures
| unless we had opened the support ticket. (E.g., in the S3
| case, even though we clearly reported the problem, and the
| problem was occurring on basically every request, we
| _still_ had to provide example request IDs before they'd
| believe us. The service was basically in an outage as far
| as we could tell ... though I think AWS ended up claiming
| it was "just us".)
|
| That said, S3 in particular is still an excellent service,
| and I'd happily use it again. But cloud != 0 time on my
| part. How _much_ time depends heavily on the cloud, and less
| heavily on the service, and sometimes it is still worthwhile.
| pritambarhate wrote:
| Digital Ocean. But they don't have a concept of multi-AZ as
| far as I know. They do have multiple data centers in the same
| region, though I'm not aware whether there is any private
| networking between DCs in the same region.
| prisenco wrote:
| Really needs more information for a good answer. How many
| req/s are you handling? Do you have dedicated devops? Write
| heavy, read heavy, or pipeline/ETL heavy? Is autoscaling a
| must, or would you rather accept failure than a bank-breaking
| cloud bill?
| dimgl wrote:
| I am _super_ impressed with Supabase. I can see how a lot of
| people may not like it given how reliant it is on Postgres, but
| I find it to be absolutely genius.
|
| It basically allows me to forego having to make a server for
| the CRUD operations so I can focus on the actual business
| implications. My REST API is automatically managed for me
| (mostly with lightweight views and functions) and all of my
| other core logic is either spread out through edge functions or
| in a separate data store (Redis) where I perform the more CPU
| intensive operations related to my business.
|
| There's some rough edges around their documentation and DX but
| I'm really loving it so far.
| catherinecodes wrote:
| Hetzner and Entrywan are pure-play cloud companies with good
| prices and support. Hetzner is based in Germany and Entrywan in
| the US.
| infamia wrote:
| Thanks for mentioning Entrywan, they look great from what I
| can tell on their site. Have you used their services? If so,
| I'm curious about your experiences with them.
| catherinecodes wrote:
| My irc bouncer and two kubernetes clusters are running
| there. So far the service has been good.
| madars wrote:
| Hetzner has a reputation for locking accounts for "identity
| verification" (Google "hetzner kyc" or "hetzner identity
| verification"). Might be worthwhile to go through a reseller
| just to avoid downtime like that.
| infecto wrote:
| AWS is pretty great and, I think, reasonably cheap; I'd
| include any of the other large cloud players too. The amount
| saved elsewhere is just not large enough for me; I would
| rather work with something that I have used over and over.
|
| Along with that, ChatGPT has knocked down most of the remaining
| barriers I have had when permissions get confusing in one of
| the cloud services.
| mciancia wrote:
| I like using scaleway for personal projects, but they are
| available only in Europe
| bombcar wrote:
| I think the obvious answer is there's not much demand, and
| keeping it "low" allows trickery and funny business with the
| virtualization layer (think: SAN, etc) that you can't do with
| "raw hardware speed".
| _Rabs_ wrote:
| Sure, but it does make me wonder what kind of speeds we are
| paying for if we can't even get raw hardware speeds...
|
| Sounds like one more excuse for AWS to obfuscate any meaning in
| their billing structure and take control of the narrative.
|
| How much are they getting away with through virtualization?
| (Think of how banks use your money for loans and such.)
|
| You don't really get to see the internals, other than IOPS,
| which doesn't help when it's gatekept already.
| bombcar wrote:
| The biggest "scam" if you can call it that is reducing all
| factors of CPU performance to "cores".
| Nextgrid wrote:
| I'd argue the even bigger scam is charging for egress data
| transfer rather than just for the size of the pipe.
| s1gnp0st wrote:
| True but the funny business buys a lot of fault tolerance, and
| predictable performance if not maximum performance.
| zokier wrote:
| There is no trickery with AWS instance stores, they are honest
| to god local disks.
| Aachen wrote:
| I ended up buying a SATA SSD for 50 euros to stick in an old
| laptop that I was already using as server and, my god, it is so
| much faster than the thing I was trying to run on digitalocean.
| The DO VPS barely beat the old 5400 rpm spinning rust that was
| in the laptop originally (the reason why I was trying to rent a
| fast, advertised-with-SSD, server). Doing this i/o task
| effectively in the cloud, at least with DO, seems to require
| putting it in RAM which was a bit expensive for the few hundred
| gigabytes of data I wanted to process into an indexed format
|
| So there is demand, but I'm certainly not interested in paying
| many multiples of 50 euros over an expected lifespan of a few
| years, so it may not make economic sense for them to offer it
| to users like me at least. On the other hand, for the couple
| hours this should have taken (rather than the days it initially
| did), I'd certainly have been willing to pay that cloud premium
| and that's why I tried to get me one of these allegedly SSD-
| backed VPSes... but now that I have a fast system permanently,
| I don't think that was a wise decision of past me
| pclmulqdq wrote:
| This was a huge technical problem I worked on at Google, and is
| sort of fundamental to a cloud. I believe this is actually a big
| deal that drives peoples' technology directions.
|
| SSDs in the cloud are attached over a network, and fundamentally
| have to be. The problem is that this network is so large and slow
| that it can't give you anywhere near the performance of a local
| SSD. This wasn't a problem for hard drives, which were the
| backing technology when a lot of these network-attached
| storage systems were invented, because they are fundamentally
| slow compared to networks, but it is a problem for SSDs.
| brucethemoose2 wrote:
| Yeah this was my impression.
|
| I am but an end user, but I noticed that disk IO for a certain
| app was glacial compared to a local test deployment, and I
| chalked it up to networking/VM overhead
| vlovich123 wrote:
| Why do they fundamentally need to be network attached storage
| instead of local to the VM?
| Filligree wrote:
| They don't. Some cloud providers (e.g. Hetzner) let you rent
| VMs with locally attached NVMe, which is dramatically faster
| than network-attached even factoring in the VM tax.
|
| Of course then you have a single point of failure, in the
| PCIe fabric of the machine you're running on if not the NVMe
| itself. But if you have good backups, which you should, then
| the juice really isn't worth the squeeze for NAS storage.
| ssl-3 wrote:
| A network adds more points of failure. It does not reduce
| them.
| supriyo-biswas wrote:
| A network attached, replicated storage hedges against
| data loss but increases latency; however most customers
| usually prefer higher latency to data loss. As an
| example, see the highly upvoted fly.io thread[1] with
| customers complaining about the same thing.
|
| [1] https://news.ycombinator.com/item?id=36808296
| ssl-3 wrote:
| Locally-attached, replicated storage also hedges against
| data loss.
| supriyo-biswas wrote:
| RAID rebuild times make it an unviable option and
| customers typically expect problematic VMs to be live-
| migrated to other hosts with the disks still having their
| intended data.
|
| The self-hosted versions of this are GlusterFS and Ceph,
| which have the same dynamics as EBS and its equivalents
| in other cloud providers.
| mike_hearn wrote:
| With NVMe SSDs? What makes RAID unviable in that
| environment?
| dijit wrote:
| This depends, like all things.
|
| When you say RAID, what level? Software-raid or hardware
| raid? What controller?
|
| Let's take best-case:
|
| RAID 10, small (but many) NVMe drives, _and_ an LVM/software
| RAID like ZFS, which is data-aware and so only rebuilds
| actual data: rebuilds can still degrade performance enough
| that your application becomes unavailable if your IOPS are at
| 70%+ of maximum.
|
| That's the ideal scenario. If you use hardware RAID, which is
| not data-aware, then your rebuild times depend entirely on
| the size of the drive being rebuilt, _and_ it can punish IOPS
| even more during the rebuild, though it will affect your CPU
| less.
|
| There's no panacea. Most people opt for higher-latency
| distributed storage where the RAID is spread across an
| _enormous_ number of drives, which makes rebuilds much less
| painful.
| crazygringo wrote:
| A network adds more _points_ of failure but also _reduces
| user-facing failures_ overall when properly architected.
|
| If one CPU attached to storage dies, another can take
| over and reattach -- or vice-versa. If one network link
| dies, it can be rerouted around.
| bombcar wrote:
| Using a SAN (which is what networked storage is, after
| all) also lets you get various "tricks" such as
| snapshots, instant migration, etc for "free".
| Retric wrote:
| Redundancy: local storage is a single point of failure.
|
| You can use local SSDs as slow RAM, but anything on them can
| go away at any moment.
| cduzz wrote:
| I've seen SANs get nuked by operator error or by
| environmental issues (overheated DC == SAN shuts itself
| down).
|
| Distributed clusters of things can work just fine on
| ephemeral local storage (aka _local storage_). A Kafka
| cluster or an opensearch cluster will be fine using
| instance local storage, for instance.
|
| As with everything else.... "it depends"
| Retric wrote:
| Sure distributed clusters get back to network/workload
| limitations.
| pclmulqdq wrote:
| Reliability. SSDs break and screw up a lot more frequently
| and more quickly than CPUs. Amazon has published a lot on the
| architecture of EBS, and they go through a good analysis of
| this. If you have a broken disk and you locally attach, you
| have a broken machine.
|
| RAID helps you locally, but fundamentally relies on locality
| and low latency (and maybe custom hardware) to minimize the
| time window where you get true data corruption on a bad disk.
| That is insufficient for cloud storage.
| SteveNuts wrote:
| Because even if you can squeeze 100TB or more of SSD/NVMe in
| a server, and there are 10 tenants using the machine, you're
| limited to 10TB as a hard ceiling.
|
| What happens when one tenant needs 200TB attached to a
| server?
|
| Cloud providers are starting to offer local SSD/NVMe, but
| you're renting the entire machine, and you're still limited
| to exactly what's installed in that server.
| jalk wrote:
| How is that different from how cores, mem and network
| bandwidth is allotted to tenants?
| pixl97 wrote:
| Because a fair number of customers spin up another image
| when cores/mem/bandwidth run low. Dedicated storage
| breaks that paradigm.
|
| Also, if I am on an 8-core machine and need 16, network
| storage can be detached from host A and attached to host B.
| With dedicated storage it must be fully copied over first.
| baq wrote:
| It isn't. You could ask for network-attached CPUs or RAM.
| You'd be the only one, though, so in practice only
| network-attached storage makes sense business-wise. It
| also makes sense if you need to provision larger-than-
| usual amounts like tens of TB - these are usually hard to
| come by in a single server, but quite mundane for storage
| appliances.
| vel0city wrote:
| Given AWS and GCP offer multiple sizes for the same
| processor version with local SSDs, I don't think you have
| to rent the entire machine.
|
| Search for i3en API names and you'll see:
|
| i3en.large, 2x CPU, 1250GB SSD
|
| i3en.xlarge, 4x CPU, 2500GB SSD
|
| i3en.2xlarge, 8x CPU, 2x2500GB SSD
|
| i3en.3xlarge, 12x CPU, 7500GB SSD
|
| i3en.6xlarge, 24x CPU, 2x7500GB SSD
|
| i3en.12xlarge, 48x CPU, 4x7500GB SSD
|
| i3en.24xlarge, 96x CPU, 8x7500GB SSD
|
| i3en.metal, 96x CPU, 8x7500GB SSD
|
| So they've got servers with 96 CPUs and 8x7500GB SSDs. You
| can get a slice of one, or you can get the whole one. All
| of these are the ratio of 625GB of local SSD per CPU core.
|
| https://instances.vantage.sh/
|
| On GCP you can get a 2-core N2 instance type and attach
| multiple local SSDs. I doubt they have many physical 2-core
| Xeons in their datacenters.
| taneq wrote:
| > What happens when one tenant needs 200TB attached to a
| server?
|
| Link to this mythical hosting service that expects far less
| than 200TB of data per client but just pulls a sad face and
| takes the extra cost on board when a client demands it. :D
| drewda wrote:
| The major clouds do offer VMs with fast local storage, such
| as SSDs connected by NVMe connections directly to the VM host
| machine:
|
| - https://cloud.google.com/compute/docs/disks/local-ssd
|
| - https://learn.microsoft.com/en-us/azure/virtual-
| machines/ena...
|
| - https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-
| inst...
|
| They sell these VMs at a higher cost because they require
| more expensive components and are limited to host machines
| with certain configurations. In our experience, it's also harder
| to request quota increases to get more of these VMs -- some
| of the public clouds have a limited supply of these specific
| types of configurations in some regions/zones.
|
| As others have noted, instance storage isn't as dependable.
| But it can be the most performant way to do IO-intense
| processing or to power one node of a distributed database.
| _Rabs_ wrote:
| So much of this. The amount of times I've seen someone complain
| about slow DB performance when they're trying to connect to it
| from a different VPC, and bottlenecking themselves to 100Mbits
| is stupidly high.
|
| Literally depending on where things are in a data center...
| if you're closely coupled, on a 10G line on the same switch,
| going to the same server rack, I bet performance will be much
| more consistent.
| bugbuddy wrote:
| Aren't 10G and 100G connections standard nowadays in data
| centers? Heck, I thought they were standard 10 years ago.
| geerlingguy wrote:
| Datacenters are up to 400 Gbps and beyond (many places are
| adopting 1+ Tbps on core switching).
|
| However, individual servers may still operate at 10, 25, or
| 40 Gbps to save cost on the thousands of NICs in a row of
| racks. Alternatively, servers with multiple 100G
| connections split that bandwidth allocation up among dozens
| of VMs so each one gets 1 or 10G.
| nixass wrote:
| 400G is fairly normal thing in DCs nowadays
| pixl97 wrote:
| Bandwidth-delay product does not help serialized
| transactions. If you're reaching out to disk for results, or
| if you have locking transactions on a table, the achievable
| operation rate drops dramatically as latency between the host
| and the disk increases.
| bee_rider wrote:
| The typical way to trade bandwidth away for latency
| would, I guess, be speculative requests. In the CPU world
| at least. I wonder if any cloud providers have some sort
| of framework built around speculative disk reads (or
| maybe it is a totally crazy trade to make in this
| context)?
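|
| A toy sketch of the idea at the application level, not any
| provider's framework: when a lookup could continue at either
| of two offsets, issue both candidate reads before the
| deciding read returns and discard the loser. The file
| descriptor, offsets, and decision rule here are made-up
| placeholders.
|
|   import os
|   from concurrent.futures import ThreadPoolExecutor
|
|   BLOCK = 4096
|
|   def read_block(fd, offset):
|       # Positional read; os.pread doesn't move the file offset.
|       return os.pread(fd, BLOCK, offset)
|
|   def speculative_lookup(fd, index_off, candidate_offs):
|       # Start the dependent reads before we know which one we
|       # need, trading extra bandwidth for one less round trip.
|       with ThreadPoolExecutor(len(candidate_offs)) as pool:
|           futures = {off: pool.submit(read_block, fd, off)
|                      for off in candidate_offs}
|           index = read_block(fd, index_off)
|           # Stand-in decision rule for the sketch.
|           chosen = candidate_offs[index[0] % len(candidate_offs)]
|           return futures[chosen].result()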
| pixl97 wrote:
| I mean we already have readahead in the kernel.
|
| This said, the problem can get more complex than this really
| fast. Write barriers, for example, and dirty caches. Any
| application that forces writes, where the writes are enforced
| by the kernel, is going to suffer.
|
| The same is true for SSD settings. There are a number of
| tweakable values on SSDs when it comes to write commit and
| cache usage which can affect performance. Desktop OSes tend
| to play more fast and loose with these settings, while server
| defaults tend to be more conservative.
| treflop wrote:
| Often it's the app (or something high level) that would need
| speculative requests, which may not be possible in the given
| domain.
|
| I don't think it's possible in most domains.
| KaiserPro wrote:
| Yes, but you have to think about contention. Whilst the top
| of rack _might_ have 2x400 gig links to the core, that's
| shared with the entire rack, and all the other machines
| trying to shout at the core switching infra.
|
| Then stuff goes away, or routes get congested, etc, etc, etc.
| silverquiet wrote:
| > Literally depending on where things are in a data center
|
| I thought cloud was supposed to abstract this away? That's a
| bit of a sarcastic question from a long-time cloud skeptic,
| but... wasn't it?
| doubled112 wrote:
| Reality always beats the abstraction. After all, it's just
| somebody else's computer in somebody else's data center.
| bombcar wrote:
| Which can cause considerable "amusement" depending on the
| provider - one I won't name directly, but which is much more
| centered on actually renting racks than on their (now) cloud
| offering - if you had a virtual machine older than a year
| or so, deleting and restoring it would get you on a newer
| "host" and you'd be faster for the same cost.
|
| Otherwise it'd stay on the same physical piece of
| hardware it was allocated to when new.
| doubled112 wrote:
| Amusing is a good description.
|
| "Hardware degradation detected, please turn it off and
| back on again"
|
| I could do a migration with zero downtime in VMware for a
| decade but they can't seamlessly move my VM to a machine
| that works in 2024? Great, thanks. Amusing.
| bombcar wrote:
| I have always been incredibly saddened that apparently
| the cloud providers usually have nothing as advanced as
| old VMware was.
| wmf wrote:
| Cloud providers have live migration now but I guess they
| don't want to guarantee anything.
| bombcar wrote:
| It's better (and better still with other providers) but I
| naively thought that "add more RAM" or "add more disk"
| was something they would be able to do with a reboot at
| most.
|
| Nope, some require a full backup and restore.
| wmf wrote:
| Resizing VMs doesn't really fit the "cattle" thinking of
| public cloud, although IMO that was kind of a premature
| optimization. This would be a perfect use case for live
| migration.
| kccqzy wrote:
| It's more of a matter of adding additional abstraction
| layers. For example in most public clouds the best you can
| hope for is to place two things in the same availability
| zone to get the best performance. But when I worked at
| Google, internally they had more sophisticated colocation
| constraints than that: for example, you could require two
| things to be on the same rack.
| treflop wrote:
| Cloud makes provisioning more servers quicker because you
| are paying someone to basically have a bunch of servers
| ready to go right away with an API call instead of a phone
| call, maintained by a team that isn't yours, with economies
| of scale working for the provider.
|
| Cloud does not do anything else.
|
| None of these latency/speed problems are cloud-specific. If
| you have on-premise servers and you are storing your data
| on network-attached storage, you have the exact same
| problems (and also the same advantages).
|
| Unfortunately the gap between local and network storage is
| wide. You win some, you lose some.
| silverquiet wrote:
| Oh, I'm not a complete neophyte (in what seems like a
| different life now, I worked for a big hosting provider
| actually), I was just surprised that there was a big
| penalty for cross-VPC traffic implied by the parent
| poster.
| ejb999 wrote:
| How much faster would the network need to get, in order to meet
| (or at least approach) the speed of a local SSD? are we talking
| about needing to 2x or 3x the speed, or by factors of hundreds
| or thousands?
| Filligree wrote:
| The Samsung 990 in my desktop provides ~3.5 GB/s streaming
| reads, ~2 GB/s 4k random-access reads, all at a latency
| measured at around 20-30 microseconds. My exact numbers might
| be a little off, but that's the ballpark you're looking at,
| and a 990 is a relatively cheap device.
|
| 10GbE is about the best you can hope for from a local network
| these days, but that's 1/5th the bandwidth and many times the
| latency. 100GbE would work, except the latency would still
| mean any read dependencies would be far slower than local
| storage, and I'm not sure there's much to be done about that;
| at these speeds the physical distance matters.
|
| In practice I'm having to architect the entire system around
| the SSD just to not bottleneck it. So far ext4 is the only
| filesystem that even gets close to the SSD's limits, which is
| a bit of a pity.
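|
| For anyone who wants numbers in that ballpark from their own
| box, here's a minimal random-read latency probe. It assumes
| Linux and Python 3.7+, and a pre-existing scratch file; the
| path and sample count are placeholders. O_DIRECT bypasses the
| page cache so you measure the device rather than RAM.
|
|   import os, time, random, mmap
|
|   PATH = "/tmp/scratch.bin"   # placeholder, a few GiB or more
|   BLOCK = 4096
|   SAMPLES = 1000
|
|   fd = os.open(PATH, os.O_RDONLY | os.O_DIRECT)
|   size = os.fstat(fd).st_size
|   buf = mmap.mmap(-1, BLOCK)  # page-aligned, needed for O_DIRECT
|
|   lat = []
|   for _ in range(SAMPLES):
|       off = random.randrange(size // BLOCK) * BLOCK
|       t0 = time.perf_counter()
|       os.preadv(fd, [buf], off)
|       lat.append(time.perf_counter() - t0)
|   os.close(fd)
|
|   lat.sort()
|   print(f"p50 {lat[len(lat) // 2] * 1e6:.0f} us, "
|         f"p99 {lat[int(len(lat) * 0.99)] * 1e6:.0f} us")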
| wmf wrote:
| Around 4x-10x depending on how many SSDs you want. A single
| SSD is around the speed of a 100 Gbps Ethernet link.
| selectodude wrote:
| SATA 3 is 6 Gbit/s, so multiply 6 Gbit/s by the number of VMs
| on the machine. For NVMe, probably closer to 4-5x that. You'd
| need some serious interconnects to give a server rack access
| to un-bottlenecked SSD storage.
| Nextgrid wrote:
| The problem isn't necessarily speed, it's random access
| latency. What makes SSDs fast and "magical" is their low
| random-access latency compared to a spinning disk. The
| sequential-access read speed is merely a bonus.
|
| Networked storage negates that significantly, absolutely
| killing performance for certain applications. You could have
| a 100Gbps network and it still won't match a direct-attached
| SSD in terms of latency (it can only match it in terms of
| sequential access throughput).
|
| For many applications, such as databases, random access is
| crucial, which is why today's mid-range consumer hardware
| often outperforms hosted databases such as RDS unless they're
| so overprovisioned on RAM that the dataset is effectively
| always in there.
| baq wrote:
| 100Gbps direct _shouldn't be_ too bad, but it might be
| difficult to get anyone to sell it to you for exclusive
| usage in a VM...
| Ericson2314 wrote:
| Um... why the hell does the network care whether I am doing
| random or sequential access? You left that part out of your
| argument.
| zokier wrote:
| > SSDs in the cloud are attached over a network, and
| fundamentally have to be
|
| Not on AWS. Instance stores (what the article is about) are
| physical local disks.
| mkoubaa wrote:
| Dumb question. Why does the network have to be slow? If the
| SSDs are two feet away from the motherboard and there's an
| optical connection to it, shouldn't it be fast? Are data
| centers putting SSDs super far away from motherboards?
| bugbuddy wrote:
| > One theory is that EC2 intentionally caps the write speed
| at 1 GB/s to avoid frequent device failure, given the total
| number of writes per SSD is limited.
|
| This is the theory that I would bet on because it lines up
| with their bottom line.
| Dylan16807 wrote:
| But the sentence right after undermines it.
|
| > However, this does not explain why the read bandwidth is
| stuck at 2 GB/s.
|
| Faster read speeds would give them a more enticing product
| without wearing drives out.
| bugbuddy wrote:
| They may be limiting the read artificially to increase
| your resource utilization elsewhere. If you have a disk
| bottleneck then you would be more likely to use more
| instances. It is still about the bottom line.
| Dylan16807 wrote:
| That could be. But it's a completely different reason. If
| you summarize everything as "bottom line", you lose all
| the valuable information.
| formercoder wrote:
| What happens when your vm is live migrated 1000 feet away or
| to a different zone?
| supriyo-biswas wrote:
| It's not that the network is slow; rather, dividing the
| available network bandwidth amongst all users, while also
| reliably distributing the written data to multiple nodes so
| that one tenant doesn't hog resources, is quite challenging.
| The pricing structure is meant to control resource usage; the
| exact prices and how much profit AWS or any other cloud
| provider makes are a separate discussion.
| jsnell wrote:
| According to the submitted article, the numbers are from AWS
| instance types where the SSD is "physically attached" to the
| host, not about SSD-backed NAS solutions.
|
| Also, the article isn't just about SSDs being no faster than a
| network. It's about SSDs being two orders of magnitude slower
| than datacenter networks.
| pclmulqdq wrote:
| It's because the "local" SSDs are not actually physically
| attached and there's a network protocol in the way.
| zokier wrote:
| What makes you think that?
| ddorian43 wrote:
| Do you have a link to explain this? I don't think it's true.
| candiddevmike wrote:
| Depends on the cloud provider. Local SSDs are physically
| attached to the host on GCP, but that makes them only
| useful for temporary storage.
| pclmulqdq wrote:
| If you're at G, you should read the internal docs on
| exactly how this happens and it will be interesting.
| rfoo wrote:
| Why would I lose all data on these SSDs when I initiate a
| power off of the VM on console, then?
|
| I believe local SSDs are definitely attached to the host.
| They are just not exposed via NVMe ZNS hence the
| performance hit.
| manquer wrote:
| It is because on reboot you may not get the same physical
| server. They are not rebooting the physical server for you,
| just the VM.
|
| The same host is not allocated for a variety of reasons:
| scheduled maintenance, proximity to other hosts on the VPC,
| balancing quiet and noisy neighbors, and so on.
|
| It is not that the disk will always be wiped; sometimes the
| data is still there on reboot. There is just no guarantee,
| which allows them to freely move VMs between hosts.
| res0nat0r wrote:
| Your EC2 instance with instance-store storage, when stopped,
| can be launched on any other random host in the AZ when you
| power it back on. Your root disk is an EBS volume attached
| across the network, so when you start your instance back up
| you're likely going to be launched somewhere else with an
| empty slot, and empty local storage. This is why there is
| always a disclaimer that this local storage is ephemeral and
| you shouldn't count on it being around long-term.
| mrcarrot wrote:
| I think the parent was agreeing with you. If the "local"
| SSDs _weren't_ actually local, then presumably they
| wouldn't need to be ephemeral since they could be
| connected over the network to whichever host your
| instance was launched on.
| amluto wrote:
| Which is a weird sort of limitation. For any sort of you-
| own-the-hardware arrangement, NVMe disks are fine for
| long term storage. (Obviously one should have backups,
| but that's a separate issue. One should have a DR plan
| for data on EBS, too.)
|
| You need to _migrate_ that data if you replace an entire
| server, but this usually isn't a very big deal.
| supriyo-biswas wrote:
| This is Hyrum's law at play: AWS wants to make sure that
| the instance stores aren't seen as persistent, and
| therefore enforce the failure mode for normal operations
| as well.
|
| You should also see how they enforce similar things for
| their other products and APIs, for example, most of their
| services have encrypted pagination tokens.
| throwawaaarrgh wrote:
| Yes, that's what their purpose is in cloud applications:
| temporary high performance storage only.
|
| If you want long term local storage you'll have to
| reserve an instance host.
| mike_hearn wrote:
| They do this because they want SSDs to be in a physically
| separate part of the building for operational reasons, or
| what's the point in giving you a "local" SSD that isn't
| actually plugged into the real machine?
| ianburrell wrote:
| The reason for having most instances use network storage
| is that it makes possible migrating instances to other
| hosts. If the host fails, the network storage can be
| pointed at the new host with a reboot. AWS sends out
| notices regularly when they are going to reboot or
| migrate instances.
|
| There probably should be more local instance storage types
| for use with instances that can be recreated without loss.
| But it is simpler for them to have a single way of doing
| things.
|
| At work, someone used fast NVMe instance storage for
| Clickhouse which is a database. It was a huge hassle to
| copy data when instances were going to be restarted
| because the data would be lost.
| mike_hearn wrote:
| Sure, I understand that, but this user is claiming that
| on GCP even local SSDs aren't really local, which raises
| the question of why not.
|
| I suspect the answer is something to do with their
| manufacturing processes/rack designs. When I worked there
| (pre GCP) machines had only a tiny disk used for booting
| and they wanted to get rid of that. Storage was handled
| by "diskful" machines that had dedicated trays of HDDs
| connected to their motherboards. If your datacenters and
| manufacturing processes are optimized for building
| machines that are either compute or storage but not both,
| perhaps the more normal cloud model is hard to support
| and that pushes you towards trying to aggregate storage
| even for "local" SSD or something.
| deadmutex wrote:
| The GCE claim is unverified. OP seems to be referring to
| PD-SSD and not LocalSSD
| youngtaff wrote:
| > At work, someone used fast NVMe instance storage for
| Clickhouse which is a database. It was a huge hassle to
| copy data when instances were going to be restarted
| because the data would be lost.
|
| This post on how Discord RAIDed local NVMe volumes with
| slower remote volumes might be of interest:
| https://discord.com/blog/how-discord-supercharges-
| network-di...
| ianburrell wrote:
| We moved to running Clickhouse on EKS with EBS volumes
| for storage. It can better survive instances going down.
| I didn't work on it so I don't know how much slower it is.
| Lowering the management burden was a big priority.
| wiredfool wrote:
| Are you saying that a reboot wipes the ephemeral disks?
| Or a stop the instance and start the instance from AWS
| console/api?
| ianburrell wrote:
| Reboot keeps the instance storage volumes. Restarting wipes
| them. Starting frequently migrates to a new host. And the
| "restart" notices AWS sends are likely because the host has a
| problem and they need to migrate it.
| yolovoe wrote:
| The comment you're responding to is wrong. AWS offers
| many kinds of storage. Instance local storage is
| physically attached to the droplet. EBS isn't but that's
| a separate thing entirely.
|
| I literally work in EC2 Nitro.
| colechristensen wrote:
| For AWS there are EBS volumes attached through a custom
| hardware NVMe interface and then there's Instance Store
| which is actually local SSD storage. These are different
| things.
|
| https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Instanc
| e...
| kwillets wrote:
| EBS is also slower than local NVMe mounts on i3's.
|
| Also, both features use Nitro SSD cards, according to AWS
| docs. The Nitro architecture is all locally attached --
| instance storage to the instance, EBS to the EBS server.
| jrullman wrote:
| I can attest to the fact that on EC2, "instance store"
| volumes are actually physically attached.
| jsnell wrote:
| I think you're wrong about that. AWS calls this class of
| storage "instance storage" [0], and defines it as:
|
| > Many Amazon EC2 instances can also include storage from
| devices that are located inside the host computer, referred
| to as instance storage.
|
| There might be some wiggle room in "physically attached",
| but there's none in "storage devices located inside the
| host computer". It's not some kind of AWS-only thing
| either. GCP has "local SSD disks"[1], which I'm going to
| claim are likewise local, not over the network block
| storage. (Though the language isn't as explicit as for
| AWS.)
|
| [0] https://aws.amazon.com/ec2/instance-types/
|
| [1] https://cloud.google.com/compute/docs/disks#localssds
| 20after4 wrote:
| If the SSD is installed in the host server, doesn't that
| still allow for it to be shared among many instances
| running on said host? I can imagine that a compute node
| has just a handful of SSDs and many hundreds of instances
| sharing the I/O bandwidth.
| discodave wrote:
| If you have one of the metal instance types, then you get
| the whole host, e.g. i4i.metal:
|
| https://aws.amazon.com/ec2/instance-types/i4i/
| aeyes wrote:
| On AWS yes, the older instances which I am familiar with
| had 900GB drives and they sliced that up into volumes of
| 600, 450, 300, 150, 75GB depending on instance size.
|
| But they also tell you how much IOPS you get: https://doc
| s.aws.amazon.com/AWSEC2/latest/WindowsGuide/stora...
| ownagefool wrote:
| PCI bus, etc too
| throwawaaarrgh wrote:
| Instance storage is not networked. That's why it's there.
| queuebert wrote:
| How _do_ these machines manage the sharing of one local
| SSD across multiple VMs? Is there some wrapper around the
| I/O stack? Does it appear as a network share? Genuinely
| curious...
| felixg3 wrote:
| Probably NVME namespaces [0]?
|
| [0]: https://nvmexpress.org/resource/nvme-namespaces/
| bravetraveler wrote:
| Less fancy, quite often... at least on VPS providers [1].
| They like to use reflinked files off the base images.
| This way they only store what differs.
|
| 1: Which is really a cloud without a certain degree of
| software defined networking/compute/storage/whatever.
| dan-robertson wrote:
| AWS have custom firmware for at least some of their SSDs,
| so could be that
| magicalhippo wrote:
| In say VirtualBox you can create a file backed on the
| physical disk, and attach it to the VM so the VM sees it
| as a NVMe drive.
|
| In my experience this is also orders of magnitude slower than
| true direct access, i.e. PCIe pass-through, as all access has
| to pass through the VM storage driver, and so it _could_
| explain what is happening.
| icedchai wrote:
| With Linux and KVM/QEMU, you can map an entire physical
| disk, disk partition, or file to a block device in the
| VM. For my own VM hosts, I use LVM and map a logical
| volume to the VM. I assumed cloud providers did something
| conceptually similar, only much more sophisticated.
| pclmulqdq wrote:
| That's the abstraction they want you to work with, yes.
| That doesn't mean it's what is actually happening - at
| least not in the same way that you're thinking.
|
| As a hint for you, I said "_a_ network", not "_the_ network."
| You can also look at public presentations about how Nitro
| works.
| jng wrote:
| Nitro "virtual NVME" device are mostly (only?) for EBS --
| remote network storage, transparently managed, using a
| separate network backbone, and presented to the host as a
| regular local NVME device. SSD drives in instances such
| as i4i, etc. are physically attached in a different way
| -- but physically, unlike EBS, they are ephemeral and the
| content becomes unavaiable as you stop the instance, and
| when you restart, you get a new "blank slate". Their
| performance is 1 order of magnitude faster than standard-
| level EBS, and the cost structure is completely different
| (and many orders of magnitude more affordable than EBS
| volumes configured to have comparable I/O performance).
| jsnell wrote:
| I've linked to public documentation that is pretty
| clearly in conflict with what you said. There's no wiggle
| room in how AWS describes their service without it being
| false advertising. There's no "ah, but what if we define
| the entire building to be the host computer, then the
| networked SSDs really _are_ inside the host computer"
| sleight of hand to pull off here.
|
| You've provided cryptic hints and a suggestion to watch
| some unnamed presentation.
|
| At this point I really think the burden of proof is on
| you.
| dekhn wrote:
| it sounds like you're trying to say "PCI switch" without
| saying "PCI switch" (I worked at Google for over a
| decade, including hardware division).
| jasonwatkinspdx wrote:
| Both the documentation and Amazon employees are in here
| telling you that you're wrong. Can you resolve that
| contradiction or do you just want to act coy like you
| know some secret? The latter behavior is not productive.
| wstuartcl wrote:
| the tests were for these local (metal direct-connect) SSDs.
| The issue is not network overhead -- it's that, just like
| everything else in cloud, the performance of 10 years ago was
| used as the baseline, and that carries over today with
| upcharges to buy back the gains.
|
| There is a reason why vCPU performance is still locked to the
| typical core of 10 years ago when every core on a machine in
| those data centers today is 3-5x faster or more: they can
| charge you for 5x the cores to get that gain.
| dekhn wrote:
| I suspect you must be conflating several different storage
| products. Are you saying
| https://cloud.google.com/compute/docs/disks/local-ssd
| devices talk to the host through a network (say, ethernet
| with some layer on top)? Because the documentation very
| clearly says otherwise, "This is because Local SSD disks
| are physically attached to the server that hosts your VM.
| For this same reason, Local SSD disks can only provide
| temporary storage." (at least, I'm presuming that by
| physically attached, they mean it's connected to the PCI
| bus without a network in between).
|
| I suspect you're thinking of SSD-PD. If "local" SSDs are
| not actually local and go through a network, I need to have
| a discussion with my GCS TAM about truth in advertising.
| op00to wrote:
| > physically attached
|
| Believe it or not, superglue and a wifi module! /s
| mint2 wrote:
| I don't really agree with assuming the form of physical
| attachment and interaction unless it is spelled out.
|
| If that's what's meant it will be stated in some fine
| print, if it's not stated anywhere then there is no
| guarantee what the term means, except I would guess they
| may want people to infer things that may not necessarily
| be true.
| dekhn wrote:
| "Physically attached" has had a fairly well defined
| meaning and i don't normally expect a cloud provider to
| play word salad to convince me a network drive is locally
| attached (like I said, if true, I would need to have a
| chat with my TAM about it).
|
| Physically attached for servers, for the past 20+ years,
| has meant a direct electrical connection to a host bus
| (such as the PCI bus attached to the front-side bus). I'd
| like to see some alternative examples that violate that
| convention.
| adgjlsfhk1 wrote:
| Ethernet cables are physical...
| dekhn wrote:
| The NIC is attached to the host bus through the north
| bridge. But other hosts on the same ethernetwork are not
| considered to be "local". We dont need to get crazy about
| teh semantics to know that when a cloud provider says an
| SSD is locally attached, that it's closer than an
| ethernetwork away.
| crotchfire wrote:
| This is incorrect.
|
| Amazon offers both locally-attached storage devices as well
| as instance-attached storage devices. The article is about
| the latter kind.
| bfncieezo wrote:
| Instances can have block storage, which is network-attached,
| or locally attached SSD/NVMe. They're two separate things.
| choppaface wrote:
| Nope! Well not as advertised. There are instances, usually
| more expensive ones, where there are supposed to be local
| NVME disks dedicated to the instance. You're totally right
| that providing good I/O is a big problem! And I have done
| studies myself showing just how bad Google Cloud is here,
| and have totally ditched Google Cloud for providing crappy
| compute service (and even worse customer service).
| yolovoe wrote:
| You're wrong. Instance local means SSD is physically
| attached to the droplet and is inside the server chassis,
| connected via PCIe.
|
| Source: I work on Nitro cards.
| tptacek wrote:
| "Attached to the droplet"?
| hipadev23 wrote:
| digitalocean squad
| sargun wrote:
| Droplets are what EC2 calls their hosts. Confusing? I
| know.
| tptacek wrote:
| Yes! That is confusing! Tell them to stop it!
| hathawsh wrote:
| That seems like a big opportunity for other cloud
| providers. They could provide SSDs that are actually
| physically attached and boast (rightfully) that their SSDs
| are a lot faster, drawing away business from older cloud
| providers.
| solardev wrote:
| For what kind of workloads would a slower SSD be a
| significant bottleneck?
| ddorian43 wrote:
| Next thing the other clouds will offer is cheaper
| bandwidth pricing, right?
| Salgat wrote:
| At first you'd think maybe they can do a volume copy from a
| snapshot to a local drive on instance creation but even at
| 100gbps you're looking at almost 3 minutes for a 2TB drive.
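|
| The arithmetic checks out, assuming the full 100 Gbps is
| dedicated to the copy and ignoring protocol overhead:
|
|   size_bytes = 2e12             # 2 TB volume
|   link_bytes_per_sec = 100e9 / 8
|   print(size_bytes / link_bytes_per_sec / 60)  # ~2.7 minutes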
| crazygringo wrote:
| > _It 's about SSDs being two orders of magnitude slower than
| datacenter networks._
|
| Could that have to do with every operation requiring a round
| trip, rather than being able to queue up operations in a
| buffer to saturate throughput?
|
| It seems plausible if the interface protocol was built for a
| device it assumed was physically local and so waited for
| confirmation after each operation before performing the next.
|
| In this case it's not so much the throughput rate that
| matters, but the latency -- which can also be heavily
| affected by buffering of other network traffic.
| Nextgrid wrote:
| Underlying protocol limitations wouldn't be an issue - the
| cloud provider's implementation can work around that.
| They're unlikely to be sending sequential SCSI/NVMe
| commands over the wire - instead, the hypervisor pretends
| to be the NVME device, but then converts to some internal
| protocol (that's less chatty and can coalesce requests
| without waiting on individual ACKs) before sending that to
| the storage server.
|
| The problem is that ultimately your application often
| requires the outcome of a given IO operation to decide
| which operation to perform next - let's say when it comes
| to a database, it should first read the index (and wait for
| that to complete) before it knows the on-disk location of
| the actual row data which it needs to be able to issue the
| next IO operation.
|
| In this case, there's no other solution than to move that
| application closer to the data itself. Instead of the
| networked storage node being a dumb blob storage returning
| bytes, the networked "storage" node is your database
| itself, returning query results. I believe that's what RDS
| Aurora does for example, every storage node can itself
| understand query predicates.
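|
| A back-of-the-envelope sketch of why those dependency chains
| dominate; the latency figures below are illustrative
| assumptions, not measurements:
|
|   # Each step needs the previous result, so latencies add up
|   # rather than overlap, no matter how much bandwidth the
|   # link has.
|   LOCAL_NVME_US = 50   # assumed local round trip
|   NETWORKED_US = 500   # assumed network-attached round trip
|   CHAIN = 4            # e.g. B-tree root -> inner -> leaf -> row
|
|   for name, lat_us in (("local", LOCAL_NVME_US),
|                        ("networked", NETWORKED_US)):
|       total = CHAIN * lat_us
|       print(f"{name}: {total} us per lookup, "
|             f"{1_000_000 // total} serial lookups/s")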
| wmf wrote:
| No, the i3/i4 VMs discussed in the blog have local SSDs. The
| network isn't the reason local SSDs are slow.
| fngjdflmdflg wrote:
| So do cloud vendors simply not use fast SSDs? If so I would
| expect the SSD manufacturers themselves to work on this
| problem. Perhaps they already are.
| adrr wrote:
| If the local drives are network drives(eg: SAN) then why are
| they ephemeral?
| baq wrote:
| live vm migrations, perhaps
| Dylan16807 wrote:
| Even assuming that "local" storage is a lie, hasn't the network
| gotten a lot faster? The author is only asking for a 5x
| increase at the end of the post.
| boulos wrote:
| I'm not sure which external or internal product you're talking
| about, but there are no networks involved for Local SSD on GCE:
| https://cloud.google.com/compute/docs/disks/local-ssd
|
| Are you referring to PD-SSD? Internal storage usage?
| nostrademons wrote:
| Makes me wonder if we're on the crux of a shift back to client-
| based software. Historically changes in the relative cost of
| computing components have driven most of the shifts in the
| computing industry. Cheap teletypes & peripherals fueled the
| shift from batch-processing mainframes to timesharing
| minicomputers. Cheap CPUs & RAM fueled the shift from
| minicomputers to microcomputers. Cheap and fast networking
| fueled the shift from desktop software to the cloud. Will cheap
| SSDs & TPU/GPUs fuel a shift back toward thicker clients?
|
| There are a bunch of supporting social trends toward this as
| well. Renewed emphasis on privacy. Big Tech canceling beloved
| products, bricking devices, and generally enshittifying
| everything - a lot of people want locally-controlled software
| that isn't going to get worse at the next update. Ever-rising
| prices which make people want to lock in a price for the device
| and not deal with increasing rents for computing power.
| davkan wrote:
| I think a major limiting factor here for many applications is
| that mobile users are a huge portion of the user base. In
| that space storage, and more importantly battery life, are
| still at a premium. Granted the storage cost just seems to be
| gouging from my layman's point of view, so industry needs
| might force a shift upwards.
| tw04 wrote:
| >The problem is that this network is so large and slow that it
| can't give you anywhere near the performance of a local SSD.
|
| *As implemented in the public cloud providers.
|
| You can absolutely get better than local disk speeds from SAN
| devices and we've been doing it for decades. To do it on-prem
| with flash devices will require NVMe over FC or Ethernet and an
| appropriate storage array. Modern all-flash array performance
| is measured in millions of IOPS.
|
| Will there be a slight uptick in latency? Sure, but it's well
| worth it for the data services and capacity of an external
| array for nearly every workload.
| scottlamb wrote:
| > The problem is that this network is so large and slow that it
| can't give you anywhere near the performance of a local SSD.
| This wasn't a problem for hard drives, which was the backing
| technology when a lot of these network attached storage systems
| were invented, because they are fundamentally slow compared to
| networks, but it is a problem for SSD.
|
| Certainly true that SSD bandwidth and latency improvements are
| hard to match, but I don't understand why intra-datacenter
| network latency in particular is so bad. This ~2020-I-think
| version of the "Latency Numbers Everyone Should Know" says 0.5
| ms round trip (and mentions "10 Gbps network" on another line).
| [1] It was the same thing in a 2012 version (that only mentions
| "1 Gbps network"). [2] Why no improvement? I think that 2020
| version might have been a bit conservative on this line, and
| nice datacenters may even have multiple 100 Gbit/sec NICs per
| machine in 2024, but still I think the round trip actually is
| strangely bad.
|
| I've seen experimental networking stuff (e.g. RDMA) that claims
| significantly better latency, so I don't think it's a physical
| limitation of the networking gear but rather something at the
| machine/OS interaction area. I would design large distributed
| systems significantly differently (be much more excited about
| extra tiers in my stack) if the standard RPC system offered say
| 10 us typical round trip latency.
|
| [1]
| https://static.googleusercontent.com/media/sre.google/en//st...
|
| [2] https://gist.github.com/jboner/2841832
| dekhn wrote:
| Modern data center networks don't have full cross
| connectivity. Instead they are built using graphs and
| hierarchies that provide less than the total bandwidth
| required for all pairs of hosts to be communicating. This
| means, as workloads start to grow and large numbers of
| compute hosts demand data IO to/from storage hosts, the
| network eventually gets congested, which typically exhibits
| as higher latencies and more dropped packets. Batch jobs are
| often relegated to "spare" bandwidth while serving jobs often
| get dedicated bandwidth
|
| At the same time, ethernetworks with layered network
| protocols on top typically have a fair amount of latency
| overhead, that makes it much slower than bus-based direct-
| host-attached storage. I was definitely impressed at how
| quickly SSDs reached and then exceeded SATA bandwidth. nvme
| has made a HUGE difference here.
| kccqzy wrote:
| That document is probably deliberately on the pessimistic
| side to encourage your code to be portable across all kinds
| of "data centers" (however that is defined). When I
| previously worked at Google, the standard RPC system
| definitely offered 50 microseconds of round trip latency at
| the median (I measured it myself in a real application), and
| their advanced user-space implementation called Snap could
| offer about 10 microseconds of round trip latency. The latter
| figure comes from page 9 of
| https://storage.googleapis.com/gweb-
| research2023-media/pubto...
|
| > nice datacenters may even have multiple 100 Gbit/sec NICs
| per machine in 2024,
|
| Google exceeded 100Gbps per machine long before 2024. IIRC it
| had been 400Gbps for a while.
| scottlamb wrote:
| Interesting. I worked at Google until January 2021. I see
| 2019 dates on that PDF, but I wasn't aware of snap when I
| left. There was some alternate RPC approach (Pony Express,
| maybe? I get the names mixed up) that claimed 10 us or so
| but was advertised as experimental (iirc had some bad
| failure modes at the time in practice) and was simply
| unavailable in many of the datacenters I needed to deploy
| in. Maybe they're two names for the same thing. [edit: oh,
| yes, starting to actually read the paper now, and: "Through
| Snap, we created a new communication stack called Pony
| Express that implements a custom reliable transport and
| communications API."]
|
| Actual latency with standard Stubby-over-TCP and warmed
| channels...it's been a while, so I don't remember the
| number I observed, but I remember it wasn't _that_ much
| better than 0.5 ms. It was still bad enough that I didn 't
| want to add a tier that would have helped with isolation in
| a particularly high-reliability system.
| kccqzy wrote:
| Snap was the external name for the internal project known
| as User Space Packet Service (abbreviated USPS) so
| naturally they renamed it prior to publication. I
| deployed an app using Pony Express in 2023 and it was
| available in the majority of cells worldwide. Pony
| Express supported more than just RPC though. The
| alternate RPC approach that you spoke of was called Void.
| It had been experimental for a long time and indeed it
| wasn't well known even inside Google.
|
| > but I remember it wasn't that much better than 0.5 ms.
|
| If you and I still worked at Google I'd just give you an
| automon dashboard link showing latency an order of
| magnitude better than that to prove myself...
| scottlamb wrote:
| Interesting, thanks!
|
| > If you and I still worked at Google I'd just give you
| an automon dashboard link showing latency an order of
| magnitude better than that to prove myself...
|
| I believe you, and I think in principle we should all be
| getting the 50 us latency you're describing within a
| datacenter with no special effort.
|
| ...but it doesn't match what I observed, and I'm not sure
| why. Maybe difference of a couple years. Maybe I was
| checking somewhere with older equipment, or some
| important config difference in our tests. And obviously
| my memory's a bit fuzzy by now but I know I didn't like
| the result I got.
| Szpadel wrote:
| With such speeds, and with CXL gaining traction (think RAM
| and GPUs over the network), why is networked SSD still an
| issue? You could have one storage server per rack that serves
| storage only for that particular rack.
|
| You could easily get something like 40 GB/s with some
| overprovisioning / bucketing.
| KaiserPro wrote:
| Networks are not reliable, despite what you hear, so latency
| is used to mask re-tries and delays.
|
| The other thing to note is that big inter-DC links are
| heavily QoS'd and contended, because they are both expensive
| and a bollock to maintain.
|
| Also, from what I recall, 40gig links are just parallel 10
| gig links, so have no lower latency. I'm not sure if 100/400
| gigs are ten/fourty lines of ten gigs in parallel or actually
| able to issue packets at 10/40 times a ten gig link. I've
| been away from networking too long
| scottlamb wrote:
| > Networks are not reliable, despite what you hear, so
| latency is used to mask re-tries and delays.
|
| Of course, but even the 50%ile case is strangely slow, and
| if that involves retries something is deeply wrong.
| KaiserPro wrote:
| You're right, but TCP doesn't like packets being dropped
| halfway through a stream. If you have a highly QoS'd link
| then you'll see latency spikes.
| scottlamb wrote:
| Again, I'm not talking about spikes (though better tail
| latency is always desirable) but poor latency in the
| 50%ile case. And for high-QoS applications, not batch
| stuff. The snap paper linked elsewhere in the thread
| shows 10 us latencies; they've put in some optimization
| to achieve that, but I don't really understand why we
| don't expect close to that with standard kernel
| networking and TCP.
| wmf wrote:
| _40gig links are just parallel 10 gig links, so have no
| lower latency_
|
| That's not correct. Higher link speeds do have lower
| serialization latency, although that's a small fraction of
| overall network latency.
| throwawaaarrgh wrote:
| > SSDs in the cloud are attached over a network, and
| fundamentally have to be
|
| SANs can still be quite fast, and instance storage is fast,
| both of which are available in cloud providers
| samstave wrote:
| Forgive me if this is stupid, but when Netflix was doing all
| their edge content boxes, where they were putting machines
| much more latency-close to customers... could this model
| kinda work for SSDs in an SScDN type of network
| (client -> CDN -> SScDN -> SSD)?
| dan-robertson wrote:
| I can see network attached SSDs having poor latency, but
| shouldn't the networking numbers quoted in the article allow
| for higher throughput than observed?
| paulddraper wrote:
| > and fundamentally have to be
|
| Can you expound?
| PaulHoule wrote:
| They don't have to be. Architecturally there are many benefits
| to storage area networks but I have built plenty of systems
| which are self-contained: download a dataset to a cloud
| instance with a direct-attached SSD, load it into a database,
| and serve it from there, a different way of doing things.
| siliconc0w wrote:
| Core count plus modern NVMe actually make a great case for moving
| away from the cloud. Before, it was "your data probably fits in
| memory"; these drives are so fast that they're close enough to
| memory that it's now "your data surely fits on disk". This reduces
| the complexity of a lot of workloads, so you can just buy a beefy
| server and do pretty insane caching/calculation/serving with a
| single box, or two for redundancy.
| echelon wrote:
| The reasons to switch away from cloud keep piling up.
|
| We're doing some amount of on-prem, and I'm eager to do more.
| aarmenaa wrote:
| I've previously worked for a place that ran most of their
| production network "on-prem". They had a few thousand
| physical machines spread across 6 or so colocation sites on
| three continents. I enjoyed that job immensely; I'd jump at
| the chance to build something like it from the ground up. I'm
| not sure if that actually makes sense for very many
| businesses though.
| malfist wrote:
| I keep hearing that, but that's simply not true. SSDs are fast,
| but they're several orders of magnitude slower than RAM, which
| is orders of magnitude slower than CPU Cache.
|
| A Samsung 990 Pro 2TB has a latency of about 40 us.
|
| DDR4-2133 with CAS 15 has a latency of 14 nanoseconds.
|
| DDR4 latency is 0.035% of one of the fastest SSDs, or to put it
| another way, DDR4 is 2,857x faster than an SSD.
|
| L1 cache is typically accessible in 4 clock cycles; in a 4.8 GHz
| CPU like the i7-10700, L1 cache latency is sub-1 ns.
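|
| (A quick sanity check of that ratio, taking ~40 us for the SSD
| read and 14 ns for the DRAM access:)
|   $ awk 'BEGIN { printf "%.0fx\n", 40e-6 / 14e-9 }'
|   2857x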
| avg_dev wrote:
| pretty cool comparisons. quite some differences there.
|
| tangent, I remember reading some post called something like
| "Latency numbers every programmer should know" and being
| slightly ashamed when I could not internalize it.
| malfist wrote:
| Oh don't feel bad. I had to look up every one of those
| numbers
| nzgrover wrote:
| You might enjoy Grace Hopper's lecture, which includes this
| snippet:
| https://youtu.be/ZR0ujwlvbkQ?si=vjEQHIGmffjqfHBN&t=2706
| darzu wrote:
| probably this one:
| https://gist.github.com/hellerbarde/2843375
| LeifCarrotson wrote:
| I wonder how many people have built failed businesses that
| never had enough customer data to exceed the DDR4 in the
| average developer laptop, and never had so many simultaneous
| queries it couldn't be handled by a single core running
| SQLite, but built the software architecture on a distributed
| cloud system just in case it eventually scaled to hundreds of
| terabytes and billions of simultaneous queries.
| Repulsion9513 wrote:
| A _LOT_... especially here.
| malfist wrote:
| I totally hear you about that. I work for FAANG, and I'm
| working on a service that has to be capable of sending 1.6m
| text messages in less than 10 minutes.
|
| The amount of complexity the architecture has because of
| those constraints is insane.
|
| When I worked at my previous job, management kept asking
| for that scale of designs for less than 1/1000 of the
| throughput and I was constantly pushing back. There's real
| costs to building for more scale than you need. It's not as
| simple as just tweaking a few things.
|
| To me there's a couple of big breakpoints in scale:
|
| * When you can run on a single server
|
| * When you need to run on a single server, but with HA
| redundancies
|
| * When you have to scale beyond a single server
|
| * When you have to adapt your scale to deal with the limits
| of a distributed system, e.g. designing around DynamoDB's
| partition limits.
|
| Each step in that chain adds irrevocable complexity, adds to
| OE, adds to cost to run and cost to build. Be sure you have
| to take those steps before you decide to.
| disqard wrote:
| I'm trying to guess what "OE" stands for... over
| engineering? operating expenditure? I'd love to know what
| you meant :)
| madisp wrote:
| probably operating expenses
| malfist wrote:
| Sorry, thought it was a common term. Operational
| Excellence. All the effort and time it takes to keep a
| service online, on call included
| kuschku wrote:
| Maybe I'm misunderstanding something, but that's about
| 2700 a second. Or about 3Mbps.
|
| Even a very unoptimized application running on a dev
| laptop can serve 1Gbps nowadays without issues.
|
| So what are the constraints that demand a complex
| architecture?
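|
| (For reference, the arithmetic, assuming ~140 bytes per
| message:)
|   $ awk 'BEGIN { r = 1.6e6/600;
|       printf "%.0f msg/s  %.1f Mbps\n", r, r*140*8/1e6 }'
|   2667 msg/s  3.0 Mbps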
| goguy wrote:
| That really doesn't require that much complexity.
|
| I used to send something like 250k a minute complete with
| delivery report processing from a single machine running
| a bunch of other services like 10 years ago.
| icedchai wrote:
| Many. I regularly see systems built for "big data", built
| for scale using "serverless" and some proprietary cloud
| database (like DynamoDB), storing a few hundred megabytes
| total. 20 years ago we would've built this on PHP and MySQL
| and called it a day.
| Szpadel wrote:
| In my day job I often see the opposite. Especially for
| database queries: developers tested on a local machine with
| hundreds of records and everything was quick and snappy, but
| in production, with a mere few million records, I often see
| queries taking minutes up to an hour, just because some
| developer didn't see the need to create indexes, or wrote the
| query in a way that no index could ever help.
| layer8 wrote:
| That's true, but has little to do with distributed cloud
| architecture vs. single local instance.
| kristopolous wrote:
| You're not considered serious if you don't. Kinda stupid.
| BackBlast wrote:
| You're missing the purpose of the cache. At least for this
| argument it's mostly for network responses.
|
| HDD was 10 ms, which was noticeable for a cached network
| request that needs to go back out on the wire. It was also
| bottlenecked by IOPS: after 100-150 IOPS you were done. You
| could do a bit better with RAID, but not the 2-3 orders of
| magnitude you really needed to be an effective cache. So it
| just couldn't work as a serious cache, and the next step up
| was RAM. This is the operational environment in which Redis
| and similar memory caches evolved.
|
| 40 us latency is fine for caching. Even the high load
| 500-600us latency is fine for the network request cache
| purpose. You can buy individual drives with > 1 million read
| IOPS. Plenty for a good cache. HDD couldn't fit the bill for
| the above reasons. RAM is faster, no question, but the lower
| latency of the RAM over the SSD isn't really helping
| performance here as the network latency is dominating.
|
| Rails conference 2023 has a talk that mentions this. They
| moved from a memory based cache system to an SSD based cache
| system. The Redis RAM based system latency was 0.8ms and the
| SSD based system was 1.2ms for some known system. Which is
| fine. It saves you a couple of orders of magnitude on cost
| and you can do much much larger and more aggressive caching
| with the extra space.
|
| Oftentimes these RAM caching servers are a network hop away
| anyway, or at least a loopback TCP request, which makes the
| comparison of SSD latency to RAM latency largely irrelevant.
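|
| (If you want to check numbers like that 40 us figure on your
| own hardware, one simple way is ioping with direct I/O against
| the device; the device name here is just an example:)
|   $ sudo ioping -c 10 -D /dev/nvme0n1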
| jeffbee wrote:
| "I will simply have another box for redundancy" is already a
| system so complex that having it in or out of the cloud won't
| make a difference.
| Nextgrid wrote:
| It really depends on business requirements. Real-time
| redundancy is hard. Taking backups at 15-min intervals and
| having the standby box merely pull down the last backup when
| starting up is much easier, and this may actually be fine for
| a lot of applications.
|
| Unfortunately very few actually think about failure modes,
| set realistic targets, and actually _test_ the process.
| Everyone _thinks_ they need 100% uptime and consistency, few
| actually achieve it in practice (many think they do, but when
| shit hits the fan it uncovers an edge-case they haven't
| thought of), but it turns out that in most cases it doesn't
| matter and they could've saved themselves a lot of trouble
| and complexity.
| littlestymaar wrote:
| So much this.
|
| If GitHub can afford the amount of downtime they have, it's
| likely that your business can afford 15 minutes of downtime
| every once in a while due to a failing server.
|
| Also, the fewer servers you have overall, the less common
| failures will be.
|
| Backups and a cold failover server are mandatory, but
| anything past that should be weighed in a rational
| cost/benefit analysis, and for most people the cost/benefit
| ratio just isn't enough to justify the infrastructure
| complexity.
| zokier wrote:
| > Since then, several NVMe instance types, including i4i and
| im4gn, have been launched. Surprisingly, however, the performance
| has not increased; seven years after the i3 launch, we are still
| stuck with 2 GB/s per SSD.
|
| AWS marketing claims otherwise:
|   Up to 800K random write IOPS
|   Up to 1 million random read IOPS
|   Up to 5600 MB/second of sequential writes
|   Up to 8000 MB/second of sequential reads
|
| https://aws.amazon.com/blogs/aws/new-storage-optimized-amazo...
| sprachspiel wrote:
| This is for 8 SSDs, and a single modern PCIe 5.0 SSD has
| better specs than this.
| jeffbee wrote:
| Those claims are per device. There isn't even an instance in
| that family with 8 devices.
| nik_0_0 wrote:
| Is it? The line preceding the bullet list on that page seems
| to state otherwise:
|
| "" Each storage volume can deliver the
| following performance (all measured using 4 KiB blocks):
| * Up to 8000 MB/second of sequential reads
|
| ""
| sprachspiel wrote:
| Just tested a i4i.32xlarge:
|   $ lsblk
|   NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
|   loop0          7:0   0  24.9M  1 loop /snap/amazon-ssm-agent/7628
|   loop1          7:1   0  55.7M  1 loop /snap/core18/2812
|   loop2          7:2   0  63.5M  1 loop /snap/core20/2015
|   loop3          7:3   0 111.9M  1 loop /snap/lxd/24322
|   loop4          7:4   0  40.9M  1 loop /snap/snapd/20290
|   nvme0n1      259:0   0     8G  0 disk
|   +-nvme0n1p1  259:1   0   7.9G  0 part /
|   +-nvme0n1p14 259:2   0     4M  0 part
|   +-nvme0n1p15 259:3   0   106M  0 part /boot/efi
|   nvme2n1      259:4   0   3.4T  0 disk
|   nvme4n1      259:5   0   3.4T  0 disk
|   nvme1n1      259:6   0   3.4T  0 disk
|   nvme5n1      259:7   0   3.4T  0 disk
|   nvme7n1      259:8   0   3.4T  0 disk
|   nvme6n1      259:9   0   3.4T  0 disk
|   nvme3n1      259:10  0   3.4T  0 disk
|   nvme8n1      259:11  0   3.4T  0 disk
|
| Since nvme0n1 is the EBS boot volume, we have 8 SSDs. And
| here's the read bandwidth for one of them:
| $ sudo fio --name=bla --filename=/dev/nvme2n1 --rw=read \
|     --iodepth=128 --ioengine=libaio --direct=1 --blocksize=16m
| bla: (g=0): rw=read, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB,
|     (T) 16.0MiB-16.0MiB, ioengine=libaio, iodepth=128
| fio-3.28
| Starting 1 process
| ^Cbs: 1 (f=1): [R(1)][0.5%][r=2704MiB/s][r=169 IOPS][eta 20m:17s]
|
| So we should have a total bandwidth of 2.7*8=21 GB/s. Not
| that great for 2024.
| Nextgrid wrote:
| If you still have this machine, I wonder if you can get
| this bandwidth in parallel across all SSDs? There could
| be some hypervisor-level or host-level bottleneck that
| means while any SSD in isolation will give you the
| observed bandwidth, you can't actually reach that if you
| try to access them all in parallel?
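|
| (One way to test that, sketched with the device names from the
| lsblk output above: run one sequential-read job per SSD in
| parallel and add up the reported bandwidths:)
|   $ for d in /dev/nvme{1..8}n1; do
|       sudo fio --name=bw-$(basename $d) --filename=$d --rw=read \
|         --iodepth=128 --ioengine=libaio --direct=1 \
|         --blocksize=16m --runtime=30 --time_based \
|         --output=$(basename $d).txt &
|     done; wait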
| Aachen wrote:
| So if I'm reading it right, the quote from the original
| article that started this thread was ballpark correct?
|
| > we are still stuck with 2 GB/s per SSD
|
| Versus the ~2.7 GiB/s your benchmark shows (bit hard to
| know where to look on mobile with all that line-wrapped
| output, and when not familiar with the fio tool; not your
| fault but that's why I'm double checking my conclusion)
| dangoodmanUT wrote:
| that's 16m blocks, not 4k
| wtallis wrote:
| Last I checked, Linux splits up massive IO requests like
| that before sending them to the disk. But there's no
| benefit to splitting a sequential IO request all the way
| down to 4kB.
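|
| (You can see how far the kernel will split a request for a
| given device via sysfs; the device name is just an example:)
|   $ cat /sys/block/nvme2n1/queue/max_sectors_kb
|   $ cat /sys/block/nvme2n1/queue/max_hw_sectors_kb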
| zokier wrote:
| I wonder if there is some tuning that needs to be done
| here; it seems surprising that the advertised rate would
| be this far off otherwise.
| jeffbee wrote:
| I would start with the LBA format, which is likely to be
| suboptimal for compatibility.
| zokier wrote:
| somehow I4g drives don't like to get formatted:
|   # nvme format /dev/nvme1 -n1 -f
|   NVMe status: INVALID_OPCODE: The associated command opcode
|   field is not valid(0x2001)
|   # nvme id-ctrl /dev/nvme1 | grep oacs
|   oacs      : 0
|
| but the LBA format indeed is sus:
|   LBA Format  0 : Metadata Size: 0 bytes - Data Size: 512
|   bytes - Relative Performance: 0 Best (in use)
| jeffbee wrote:
| It's a shame. The recent "datacenter nvme" standards
| involving fb, goog, et al mandate 4K LBA support.
| dekhn wrote:
| Can you adjust --blocksize to correspond to the block
| size on the device? And try with/without --direct=1?
| cogman10 wrote:
| There's a 4th option. Cost.
|
| The fastest SSDs also tend to be MLC, which tends to have a much
| lower write life than other technologies. This isn't unusual:
| increasing data density generally also makes it easier to
| increase performance. However, it comes at the cost that writes
| are typically done for a whole block/cell rather than for single
| bits, so if one cell goes bad, they all fail.
|
| But even if that's not the problem, there is the problem of
| upgrading the fleet cost-effectively. When you start
| introducing new tech into the stack, replacing that tech now
| requires your datacenters to keep 2 different types of hardware
| on hand AND the techs swapping drives need a way to identify
| and replace the new stuff when it goes bad.
| c0l0 wrote:
| Seeing the really just puny "provisioned IOPS" numbers on hugely
| expensive cloud instances made me chuckle (first in disbelief,
| then in horror) when I joined a "cloud-first" enterprise shop in
| 2020 (having come from a company that hosted their own hardware
| at a colo).
|
| It's no wonder that many people nowadays, esp. those who are so
| young that they've never experienced anything but cloud
| instances, seem to have little idea of how much performance you
| can actually pack in just one or two RUs today. Ultra-fast (I'm
| not parroting some marketing speak here - I just take a look at
| IOPS numbers, and compare them to those from highest-end storage
| some 10-12 years ago) NVMe storage is a _big_ part of that
| astonishing magic.
| zokier wrote:
| Does anyone have disk benchmarks for M7gd (or C/R equivalent)
| instance stores? While probably not at the level of I4g, it
| would still be an interesting comparison.
| Twirrim wrote:
| Disclaimer: I work for OCI, opinion my own etc.
|
| We offer faster NVMe drives in instances. Our E4 Dense shapes
| ship with SAMSUNG MZWLJ7T6HALA-00AU3, which supports Sequential
| Reads of 7000 MB/s, and Sequential Write 3800 MB/s.
|
| From a general perspective, I would say the _likely_ answer to
| why AWS doesn't have faster NVMes at the moment is a lack of
| specific demand. That's a guess, but that's generally how
| things go. If there's not enough specific demand being fed in
| through TAMs and the like for faster disks, upgrades are likely
| to be more of an afterthought, or to reflect the supply chain.
|
| I know there's a tendency, when you engineer things, to just
| work around or work within the constraints and grumble amongst
| your team, but it's incredibly valuable if you can make sure
| your account manager knows what shortcomings you've had to work
| around.
| fabioyy wrote:
| it is not worth using the cloud if you need a lot of
| IOPS/bandwidth
|
| heck, it's not worth it for anything besides scalability
|
| dedicated servers are wayyyy cheaper
| kstrauser wrote:
| I'm not certain that's true if you look at TCO. Yes, you can
| probably buy a server for less than the yearly rent on the
| equivalent EC2 instance. But then you've got to put that server
| somewhere, with reliable power and probably redundant Internet
| connections. You have to pay someone's salary to set it up and
| load it to the point that a user can SSH in and configure it.
| You have to maintain an inventory of spares, and pay someone to
| swap it out if it breaks. You have to pay to put its backups
| somewhere.
|
| Yeah, you _can_ skip a lot of that if your goal is to get _a_
| server online as cheaply as possible, reliability be damned. As
| soon as you start caring about keeping it in a business-ready
| state, costs start to skyrocket.
|
| I've worn the sysadmin hat. If AWS burned down, I'd be ready
| and willing to recreate the important parts locally so that my
| company could stay in business. But wow, would they ever be in
| for some sticker shock.
| Nextgrid wrote:
| > But then you've got to put that server somewhere, with
| reliable power and probably redundant Internet connections.
| You have to pay someone's salary to set it up and load it to
| the point that a user can SSH in and configure it. You have
| to maintain an inventory of spares, and pay someone to swap
| it out if it breaks.
|
| There's a middle-ground between cloud and colocation. There
| are plenty of providers such as OVH, Hetzner, Equinix, etc
| which will do all of the above for you.
| the8472 wrote:
| At least in the workstation segment cloud doesn't compete. We
| use Threadrippers + A6000 GPUs at work. Getting the
| equivalent datacenter-type GPUs and EPYC processors is more
| expensive, even after accounting for IT and utilization.
| layer8 wrote:
| Where I live, a number of SMEs are doing this. It's really
| not that costly, unless you are a tiny startup I guess.
| justsomehnguy wrote:
| > as cheaply as possible, _reliability be damned_. As soon as
| you start caring about keeping it in a business-ready state,
| costs start to skyrocket.
|
| The demand for five-nines is greatly exaggerated.
| BackBlast wrote:
| > I'm not certain that's true if you look at TCO.
|
| Sigh. This old trope from ancient history in internet time.
|
| > Yes, you can probably buy a server for less than the yearly
| rent on the equivalent EC2 instance.
|
| Or a monthly bill... oftentimes I can buy a higher-performing
| server for the cost of a single month's rental.
|
| > But then you've got to put that server somewhere, with
| reliable power and probably redundant Internet connections
|
| Power:
|
| The power problem is a lot lower with modern systems because
| they can use a lot less of it per unit of compute/memory/disk
| performance. Idle power has improved a lot too. You don't
| need 700 watts of server power anymore for a 2 socket 8 core
| monster that is outclassed by a modern $400 mini-pc that
| maxes out at 45 watts.
|
| You can buy server rack batteries now in a modern chemistry
| that'll go 20 years with zero maintenance; a 4U-sized 5 kWh
| unit costs $1,000-1,500. EVs have pushed battery costs down a
| LOT. How much
| do you really need? Do you even need a generator if your
| battery just carries the day? Even if your power reliability
| totally sucks.
|
| Network:
|
| Never been easier to buy network transfer. Fiber is available
| in many places, even cable speeds are well beyond the past,
| and there's starlink if you want to be fully resistant to
| local power issues. Sure, get two vendors for redundancy.
| Then you can hit cloud-style uptimes out of your closet.
|
| Overlay networks like Tailscale put the networking side of
| this within the reach of almost anyone.
|
| > You have to maintain an inventory of spares, and pay
| someone to swap it out if it breaks. You have to pay to put
| its backups somewhere.
|
| Have you seen the size of M.2 sticks? Memory sticks? They
| aren't very big... I happen to like opening up systems and
| actually touching the hardware I use.
|
| But yeah, if you just can't make it work, or can't be bothered
| in the modern era of computing, then stick with the cloud and
| the 10-100x premium they charge for their services.
| pkstn wrote:
| UpCloud has super fast MaxIOPS:
| https://upcloud.com/products/block-storage
|
| Here's a referral link with free credits:
| https://upcloud.com/signup/?promo=J3JYWZ
| rbranson wrote:
| It's awfully similar to a lane of PCIe 4.0, if we're talking
| about instance storage. It reads like they've chosen to map each
| physical device to a single PCIe lane. Surely the AWS Nitro
| hardware platform has longer cycle times than PCIe. Note that
| once the instance type exposes multiple block devices (i.e.
| im4gn.8xlarge or higher), striping across them will reach higher
| throughput (2 devices yield 4 GB/s, 4 yield 8 GB/s).
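|
| (A minimal sketch of that striping with mdadm, assuming two
| instance-store devices named nvme1n1 and nvme2n1; instance-store
| data is ephemeral either way:)
|   $ sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 \
|       /dev/nvme1n1 /dev/nvme2n1
|   $ sudo fio --name=stripe --filename=/dev/md0 --rw=read \
|       --iodepth=128 --ioengine=libaio --direct=1 --blocksize=16m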
| 0cf8612b2e1e wrote:
| Serious question, for a consumer does it make any sense to
| compare SSD benchmarks? I assume the best and worst models give a
| user an identical experience in 99% of cases, and it is only
| prosumer activities (video? sustained writes?) which would
| differentiate them.
| wmf wrote:
| Yeah, that's pretty much the case. Cheap SSDs provide good
| enough performance for desktop use.
| eisa01 wrote:
| Would this be a consequence of the cloud providers not being on
| the latest technology CPU-wise?
|
| At least I have the impression they are lagging, e.g., still
| offering things like:
|   z1d: Skylake (2017)
|     https://aws.amazon.com/ec2/instance-types/z1d/
|   x2i: Cascade Lake (2019) and Ice Lake (2021)
|     https://aws.amazon.com/ec2/instance-types/x2i/
|
| I have not been able to find instances powered by the 4th (Q1
| 2023) or 5th generation (Q4 2023) Xeons?
|
| We solve large capacity-expansion power market models that need
| the fastest single-threaded performance possible coupled with
| lots of RAM (a 32:1 ratio or higher is ideal). One model may take
| 256-512 GB of RAM while not being able to use more than 4 threads
| effectively (interior point algorithms have sharply diminishing
| returns past that point).
|
| Our dispatch models do not have the same RAM requirement, but you
| still wish to have the fastest single-threaded processors
| available (and then parallelize)
| deadmutex wrote:
| You can find Intel Sapphire Rapids powered VM instances on GCE
| kwillets wrote:
| AWS docs and blogs describe the Nitro SSD architecture, which is
| locally attached with custom firmware.
|
| > The Nitro Cards are physically connected to the system main
| board and its processors via PCIe, but are otherwise logically
| isolated from the system main board that runs customer workloads.
|
| https://docs.aws.amazon.com/whitepapers/latest/security-desi...
|
| > In order to make the [SSD] devices last as long as possible,
| the firmware is responsible for a process known as wear
| leveling.... There's some housekeeping (a form of garbage
| collection) involved in this process, and garden-variety SSDs can
| slow down (creating latency spikes) at unpredictable times when
| dealing with a barrage of writes. We also took advantage of our
| database expertise and built a very sophisticated, power-fail-
| safe journal-based database into the SSD firmware.
|
| https://aws.amazon.com/blogs/aws/aws-nitro-ssd-high-performa...
|
| This firmware layer seems like a good candidate for the slowdown.
| dan-robertson wrote:
| Yeah, I'm curious how they would respond to the claims in the
| article. In [1], they talk about aiming for low latency, for
| consistent performance (apparently other SSDs could stall at
| inopportune times), and for on-disk encryption support. Latency is
| often in direct conflict with throughput (eg batching usually
| trades one for the other), and also matters a lot for plenty of
| filesystem or database tasks (indeed the OP links to a paper
| showing that popular databases, even column stores, struggle to
| use the full disk throughput, though I didn't read why).
| Encryption is probably not the reason - dedicated hardware on
| modern chips can do AES at 50 GB/s - though maybe it matters if
| it increases latency? So maybe there's something else to it,
| like sharing between many VMs on one host.
|
| [1] https://m.youtube.com/watch?v=Cxie0FgLogg
| kwillets wrote:
| The Nitro chipset claims 100 GB/s encryption, so that doesn't
| seem to be the reason.
| someguydave wrote:
| Yeah I wonder how Nitro balances the latency & bandwidth
| demands of multiple VMs while also minimizing memory cache
| misses on the CPU (I am assuming it uses DMA to talk to the
| main CPU cores)
| akira2501 wrote:
| I've noticed the firmware programming positions have been more
| common in their job listings lately.
| dan-robertson wrote:
| Is read/write throughput the only difference? E.g. I don't know
| how latency compares, or failure rates, or whether fsync lies
| and writes won't always survive power failures.
| jiggawatts wrote:
| There's a lot of talk about cloud network and disk performance in
| this thread. I recently benchmarked both Azure and AWS and found
| that:
|
| - Azure network latency is about 85 microseconds.
|
| - AWS network latency is about 55 microseconds.
|
| - Both can do better, but only in special circumstances such as
| RDMA NICs in HPC clusters.
|
| - Cross-VPC or cross-VNET is basically identical. Some people
| were saying it's terribly slow, but I didn't see that in my
| tests.
|
| - Cross-zone is 300-1200 microseconds due to the inescapable
| speed of light delay.
|
| - VM-to-VM bandwidth is over 10 Gbps (>1 GB/s) for both clouds,
| even for the _smallest_ two vCPU VMs!
|
| - Azure Premium SSD v1 latency varies between about 800 to 3,000
| microseconds, which is many times worse than the network latency.
|
| - Azure Premium SSD v2 latency is about 400 to 2,000
| microseconds, which isn't that much better, because:
|
| - Local SSD _caches_ in Azure are so much faster than remote disk
| that we found that Premium SSD v1 is almost always faster than
| Premium SSD v2 because the latter doesn't support caching.
|
| - Again in Azure, the local SSD "cache" and also the local "temp
| disks" both have latency as low as 40 microseconds, on par with a
| modern laptop NVMe drive. We found that switching to the latest-
| gen VM SKU and turning on the "read caching" for the data disks
| was the magic "go-fast" button for databases... without the risk
| of losing data.
|
| We investigated the various local-SSD VM SKUs in both clouds such
| as the Lasv3 series, and as the article mentioned, the
| performance delta didn't blow my skirt up, but the data loss risk
| made these not worth the hassle.
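|
| (For anyone wanting to reproduce latency numbers like these,
| the usual approach is fio at queue depth 1, reading the "clat"
| percentiles from its output; the device path is just an
| example:)
|   $ sudo fio --name=lat --filename=/dev/nvme0n1 --rw=randread \
|       --bs=4k --iodepth=1 --direct=1 --ioengine=libaio \
|       --runtime=30 --time_based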
| computerdork wrote:
| Interesting. And would you happen to have the numbers on the
| performance of the local SSD? Is its read and write throughput
| up to the level of modern SSDs?
| jiggawatts wrote:
| It's pretty much like how the article said. The cloud local
| SSDs are notably slower than what you'd get in an ordinary
| laptop, let alone a high-end server.
|
| I'm not an insider and don't have any exclusive knowledge,
| but from reading a lot about the topic my impression is that
| the issue in both clouds is the virtualization overheads.
|
| That is, having the networking or storage go through _any_
| hypervisor software layer is what kills the performance. I've
| seen similar numbers with on-prem VMware, Xen, and
| Nutanix setups as well.
|
| Both clouds appear to be working on next-generation VM SKUs
| where the hypervisor network and storage functions are
| offloaded into 100% hardware, either into FPGAs or custom
| ASICs.
|
| "Azure Boost" is Microsoft's marketing name for this, and it
| basically amounts to both local and remote disks going
| through an NVMe controller directly mapped into the memory
| space of the VM. That is, the VM OS kernel talks _directly_
| to the hardware, bypassing the hypervisor completely. This is
| shown in their documentation diagrams:
| https://learn.microsoft.com/en-us/azure/azure-boost/overview
|
| They're claiming up to 3.8M IOPS for a single VM, which is
| 3-10x what you'd get out of a single NVMe SSD stick, so...
| not too shabby at all!
|
| Similarly, Microsoft Azure Network Adapter (MANA) is the
| equivalent for the NIC, which will similarly connect the VM
| OS directly into the network, bypassing the hypervisor
| software.
|
| I'm not an AWS expert, but from what I've seen they've been
| working on similar tech (Nitro) for years.
| StillBored wrote:
| It's worse than the article mentions, because bandwidth isn't
| the problem; it's IOPS that are the problem.
|
| Last time (about a year ago) I ran a couple of random IO
| benchmarks against storage-optimized instances, and the random
| IOPS behavior was closer to a large spinning RAID array than to
| SSDs if the disk size was over some threshold.
|
| IIRC, what it looks like is that there is a fast local SSD cache
| with a couple hundred GB of storage, and the rest is backed by
| remote spinning media.
|
| It's one of the many reasons I have a hard time taking cloud
| optimization seriously: the lack of direct tiering controls means
| that database-style (etc.) workloads are not going to optimize
| well, and that will end up costing a lot of $$$$$.
|
| So, maybe it was the instance types/configuration I was using,
| but <shrug> it was just something I was testing in passing.
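|
| (For reference, a typical random-read IOPS test looks something
| like this; the device name and job count are just examples:)
|   $ sudo fio --name=iops --filename=/dev/nvme1n1 --rw=randread \
|       --bs=4k --iodepth=256 --numjobs=4 --ioengine=libaio \
|       --direct=1 --runtime=60 --time_based --group_reporting
| A local NVMe drive should report hundreds of thousands of IOPS
| here; numbers far below that suggest network-backed or heavily
| throttled storage.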
| Ericson2314 wrote:
| The cloud really is a scam for those afraid of hardware
___________________________________________________________________
(page generated 2024-02-20 23:00 UTC)