[HN Gopher] SSDs have become fast, except in the cloud
       ___________________________________________________________________
        
       SSDs have become fast, except in the cloud
        
       Author : greghn
       Score  : 308 points
       Date   : 2024-02-20 16:59 UTC (6 hours ago)
        
 (HTM) web link (databasearchitects.blogspot.com)
 (TXT) w3m dump (databasearchitects.blogspot.com)
        
       | teaearlgraycold wrote:
       | What's a good small cloud competitor to AWS? For teams that just
       | need two AZs to get HA and your standard stuff like VMs, k8s,
       | etc.
        
         | leroman wrote:
         | DigitalOcean has been great, have managed k8s and generally
         | good bang for your buck
        
         | ThrowawayTestr wrote:
         | Buy a server
        
           | nullindividual wrote:
           | I don't like this answer.
           | 
           | When I look at cloud, I get to think "finally! No more
           | hardware to manage. No OS to manage". It's the best thing
           | about the cloud, provided your workload is amenable to PaaS.
           | It's great because I don't have to manage Windows or IIS.
           | Microsoft does that part for me and significantly cheaper
           | than it would be to employ me to do that work.
        
             | adastra22 wrote:
             | There is tooling to provide the same PaaS interface though,
             | so your cost doing those things amounts to running OS
             | updates.
        
             | cynicalsecurity wrote:
             | A cloud is just someone else's computer.
             | 
             | When you rent a bare metal server, you don't manage your
             | hardware either. The failed parts are replaced for you.
             | Unless you can't figure out what hardware configuration you
             | need - which would be a really big red flag for your level
             | of expertise.
        
             | tjoff wrote:
             | And now you have to manage the cloud instead. Which turns
             | out to be more hassle and with no overlap with the actual
             | problem you are trying to solve.
             | 
              | So not only do you spend time on the wrong thing, you
              | don't even know how it works. And the provider's goals
              | aren't aligned with yours either, as all they care about
              | is locking you in.
             | 
             | How is that better?
        
             | deathanatos wrote:
             | > _No more hardware to manage. No OS to manage_
             | 
             | We must be using different clouds.
             | 
             | For some of the much higher-level services ... maybe some
             | semblance of that statement holds. But for VMs? Definitely
             | not "no OS to manage" ... the OS is usually on the
             | customer. There might be OS-level agents from your cloud of
             | choice that make certain operations easier ... but I'm
             | still on the hook for updates.
             | 
             | Even "No machine" is a stretch, though I've found this is
             | much more dependent on cloud. AWS typically notices
             | failures before I do, and by the time I notice something is
             | up, the VM has been migrated to a new host and I'm none the
              | wiser, aside from the reboot it cost. But other clouds
              | I've been less lucky with: we've caught host failures
              | well before the
             | cloud provider, to an extent where I've wished there was a
             | "vote of no confidence" API call I could make to say "give
             | me new HW, and I personally think this HW is suss".
             | 
             | Even on higher level services like RDS, or S3, I've noticed
             | failures prior to AWS ... or even to the extent that I
             | don't know that AWS would have noticed those failures
             | unless we had opened the support ticket. (E.g., in the S3
             | case, even though we clearly reported the problem, and the
             | problem was occurring on basically every request, we
              | _still_ had to provide example request IDs before they'd
             | believe us. The service was basically in an outage as far
             | as we could tell ... though I think AWS ended up claiming
             | it was "just us".)
             | 
             | That said, S3 in particular is still an excellent service,
              | and I'd happily use it again. But cloud != 0 time on my
              | part. It depends heavily on the cloud, and less heavily
              | on the service, how _much_ time; and sometimes it is
              | still worthwhile.
        
         | pritambarhate wrote:
          | Digital Ocean. But they don't have a concept of multi-AZ as
          | far as I know. They do have multiple data centers in the
          | same region, though I am not aware whether there is any
          | private networking between DCs in the same region.
        
         | prisenco wrote:
         | Really needs more information for a good answer. How many req/s
         | are you handling? Do you have dedicated devops? Write heavy,
          | read heavy or pipeline/ETL heavy? Is autoscaling a must, or
          | would you rather accept failure over a bank-breaking cloud
          | bill?
        
         | dimgl wrote:
         | I am _super_ impressed with Supabase. I can see how a lot of
         | people may not like it given how reliant it is on Postgres, but
         | I find it to be absolutely genius.
         | 
         | It basically allows me to forego having to make a server for
         | the CRUD operations so I can focus on the actual business
         | implications. My REST API is automatically managed for me
         | (mostly with lightweight views and functions) and all of my
         | other core logic is either spread out through edge functions or
         | in a separate data store (Redis) where I perform the more CPU
         | intensive operations related to my business.
         | 
          | There are some rough edges around their documentation and DX, but
         | I'm really loving it so far.
        
         | catherinecodes wrote:
         | Hetzner and Entrywan are pure-play cloud companies with good
         | prices and support. Hetzner is based in Germany and Entrywan in
         | the US.
        
           | infamia wrote:
           | Thanks for mentioning Entrywan, they look great from what I
           | can tell on their site. Have you used their services? If so,
           | I'm curious about your experiences with them.
        
             | catherinecodes wrote:
             | My irc bouncer and two kubernetes clusters are running
             | there. So far the service has been good.
        
           | madars wrote:
           | Hetzner has a reputation for locking accounts for "identity
           | verification" (Google "hetzner kyc" or "hetzner identity
           | verification"). Might be worthwhile to go through a reseller
           | just to avoid downtime like that.
        
         | infecto wrote:
          | AWS is pretty great and, I think, reasonably cheap; I would
          | include any of the other large cloud players in that. The
          | amount saved elsewhere is just not significant enough for
          | me; I would rather work with something that I have used over
          | and over.
         | 
         | Along with that, ChatGPT has knocked down most of the remaining
         | barriers I have had when permissions get confusing in one of
         | the cloud services.
        
         | mciancia wrote:
         | I like using scaleway for personal projects, but they are
         | available only in Europe
        
       | bombcar wrote:
       | I think the obvious answer is there's not much demand, and
       | keeping it "low" allows trickery and funny business with the
       | virtualization layer (think: SAN, etc) that you can't do with
       | "raw hardware speed".
        
         | _Rabs_ wrote:
         | Sure, but it does make me wonder what kind of speeds we are
         | paying for if we can't even get raw hardware speeds...
         | 
         | Sounds like one more excuse for AWS to obfuscate any meaning in
         | their billing structure and take control of the narrative.
         | 
          | How much are they getting away with through virtualization?
          | (Think of how banks use your money for loans and such.)
          | 
          | You don't actually get to see the internals, other than
          | IOPS, which doesn't help when it's gatekept already.
        
           | bombcar wrote:
           | The biggest "scam" if you can call it that is reducing all
           | factors of CPU performance to "cores".
        
             | Nextgrid wrote:
             | I'd argue the even bigger scam is charging for egress data
             | transfer rather than just for the size of the pipe.
        
         | s1gnp0st wrote:
         | True but the funny business buys a lot of fault tolerance, and
         | predictable performance if not maximum performance.
        
         | zokier wrote:
         | There is no trickery with AWS instance stores, they are honest
         | to god local disks.
        
         | Aachen wrote:
         | I ended up buying a SATA SSD for 50 euros to stick in an old
         | laptop that I was already using as server and, my god, it is so
         | much faster than the thing I was trying to run on digitalocean.
         | The DO VPS barely beat the old 5400 rpm spinning rust that was
         | in the laptop originally (the reason why I was trying to rent a
         | fast, advertised-with-SSD, server). Doing this i/o task
         | effectively in the cloud, at least with DO, seems to require
         | putting it in RAM which was a bit expensive for the few hundred
         | gigabytes of data I wanted to process into an indexed format
         | 
         | So there is demand, but I'm certainly not interested in paying
         | many multiples of 50 euros over an expected lifespan of a few
         | years, so it may not make economic sense for them to offer it
         | to users like me at least. On the other hand, for the couple
         | hours this should have taken (rather than the days it initially
         | did), I'd certainly have been willing to pay that cloud premium
         | and that's why I tried to get me one of these allegedly SSD-
         | backed VPSes... but now that I have a fast system permanently,
          | I don't think that was a wise decision of past me.
        
       | pclmulqdq wrote:
       | This was a huge technical problem I worked on at Google, and is
       | sort of fundamental to a cloud. I believe this is actually a big
       | deal that drives peoples' technology directions.
       | 
       | SSDs in the cloud are attached over a network, and fundamentally
       | have to be. The problem is that this network is so large and slow
       | that it can't give you anywhere near the performance of a local
        | SSD. This wasn't a problem for hard drives, which were the
        | backing technology when a lot of these network-attached
        | storage systems were invented, because they are fundamentally
        | slow compared to networks, but it is a problem for SSDs.
        
         | brucethemoose2 wrote:
         | Yeah this was my impression.
         | 
         | I am but an end user, but I noticed that disk IO for a certain
         | app was glacial compared to a local test deployment, and I
          | chalked it up to networking/VM overhead.
        
         | vlovich123 wrote:
         | Why do they fundamentally need to be network attached storage
         | instead of local to the VM?
        
           | Filligree wrote:
            | They don't. Some cloud providers (e.g. Hetzner) let you rent
           | VMs with locally attached NVMe, which is dramatically faster
           | than network-attached even factoring in the VM tax.
           | 
           | Of course then you have a single point of failure, in the
           | PCIe fabric of the machine you're running on if not the NVMe
           | itself. But if you have good backups, which you should, then
           | the juice really isn't worth the squeeze for NAS storage.
        
             | ssl-3 wrote:
             | A network adds more points of failure. It does not reduce
             | them.
        
               | supriyo-biswas wrote:
               | A network attached, replicated storage hedges against
               | data loss but increases latency; however most customers
               | usually prefer higher latency to data loss. As an
               | example, see the highly upvoted fly.io thread[1] with
               | customers complaining about the same thing.
               | 
               | [1] https://news.ycombinator.com/item?id=36808296
        
               | ssl-3 wrote:
               | Locally-attached, replicated storage also hedges against
               | data loss.
        
               | supriyo-biswas wrote:
               | RAID rebuild times make it an unviable option and
               | customers typically expect problematic VMs to be live-
               | migrated to other hosts with the disks still having their
               | intended data.
               | 
                | The self-hosted versions of this are GlusterFS and
                | Ceph, which have the same dynamics as EBS and its
                | equivalents in other cloud providers.
        
               | mike_hearn wrote:
               | With NVMe SSDs? What makes RAID unviable in that
               | environment?
        
               | dijit wrote:
               | This depends, like all things.
               | 
               | When you say RAID, what level? Software-raid or hardware
               | raid? What controller?
               | 
               | Let's take best-case:
               | 
                | RAID 10, small enough (but many) NVMe drives, _and_ an
                | LVM/software RAID like ZFS, which is data-aware and so
                | only rebuilds actual data. Even then, rebuilds can
                | degrade performance enough that your application
                | becomes unavailable if your IOPS are at 70%+ of
                | maximum.
                | 
                | That's the ideal scenario. If you use hardware RAID,
                | which is not data-aware, then your rebuild times
                | depend entirely on the size of the drive being rebuilt
                | _and_ it can punish IOPS even more during the rebuild.
                | But it will affect your CPU less.
               | 
               | There's no panacea. Most people opt for higher latency
               | distributed storage where the RAID is spread across an
               | _enormous_ amount of drives, which makes rebuilds much
               | less painful.
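                | 
                | To put rough numbers on that rebuild window, a back-
                | of-envelope sketch in Python (drive size and rebuild
                | rate are illustrative assumptions, not measurements):
                | 
                |     # Back-of-envelope rebuild math; numbers are assumptions.
                |     drive_tb = 7.68       # assumed NVMe drive size
                |     rebuild_gbps = 1.0    # assumed sustained rebuild rate
                |                           # (GB/s), throttled to protect
                |                           # production IO
                |     seconds = drive_tb * 1000 / rebuild_gbps
                |     print(f"rebuild: ~{seconds / 3600:.1f} h")  # ~2.1 hours
                |     # For that whole window the array is degraded and every
                |     # rebuild read/write competes with production IOPS.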
        
               | crazygringo wrote:
                | A network adds more _points of failure_ but also
               | _reduces user-facing failures_ overall when properly
               | architected.
               | 
               | If one CPU attached to storage dies, another can take
               | over and reattach -- or vice-versa. If one network link
               | dies, it can be rerouted around.
        
               | bombcar wrote:
               | Using a SAN (which is what networked storage is, after
               | all) also lets you get various "tricks" such as
               | snapshots, instant migration, etc for "free".
        
           | Retric wrote:
           | Redundancy, local storage is a single point of failure.
           | 
            | You can use local SSDs as slow RAM, but anything on them
            | can go away at any moment.
        
             | cduzz wrote:
             | I've seen SANs get nuked by operator error or by
             | environmental issues (overheated DC == SAN shuts itself
             | down).
             | 
             | Distributed clusters of things can work just fine on
              | ephemeral local storage (aka _local storage_). A Kafka
              | cluster or an OpenSearch cluster will be fine using
              | instance-local storage, for instance.
             | 
             | As with everything else.... "it depends"
        
               | Retric wrote:
               | Sure distributed clusters get back to network/workload
               | limitations.
        
           | pclmulqdq wrote:
           | Reliability. SSDs break and screw up a lot more frequently
           | and more quickly than CPUs. Amazon has published a lot on the
           | architecture of EBS, and they go through a good analysis of
           | this. If you have a broken disk and you locally attach, you
           | have a broken machine.
           | 
           | RAID helps you locally, but fundamentally relies on locality
           | and low latency (and maybe custom hardware) to minimize the
           | time window where you get true data corruption on a bad disk.
           | That is insufficient for cloud storage.
        
           | SteveNuts wrote:
           | Because even if you can squeeze 100TB or more of SSD/NVMe in
           | a server, and there are 10 tenants using the machine, you're
           | limited to 10TB as a hard ceiling.
           | 
           | What happens when one tenant needs 200TB attached to a
           | server?
           | 
           | Cloud providers are starting to offer local SSD/NVMe, but
           | you're renting the entire machine, and you're still limited
           | to exactly what's installed in that server.
        
             | jalk wrote:
             | How is that different from how cores, mem and network
             | bandwidth is allotted to tenants?
        
               | pixl97 wrote:
               | Because a fair number of customers spin up another image
               | when cores/mem/bandwidth run low. Dedicated storage
               | breaks that paradigm.
               | 
                | Also: if I am on an 8-core machine and need 16,
                | network storage can be detached from host A and
                | connected to host B. With dedicated local storage it
                | must be fully copied over first.
        
               | baq wrote:
               | It isn't. You could ask for network-attached CPUs or RAM.
               | You'd be the only one, though, so in practice only
               | network-attached storage makes sense business-wise. It
               | also makes sense if you need to provision larger-than-
               | usual amounts like tens of TB - these are usually hard to
               | come by in a single server, but quite mundane for storage
               | appliances.
        
             | vel0city wrote:
             | Given AWS and GCP offer multiple sizes for the same
             | processor version with local SSDs, I don't think you have
             | to rent the entire machine.
             | 
             | Search for i3en API names and you'll see:
             | 
             | i3en.large, 2x CPU, 1250GB SSD
             | 
             | i3en.xlarge, 4x CPU, 2500GB SSD
             | 
             | i3en.2xlarge, 8x CPU, 2x2500GB SSD
             | 
             | i3en.3xlarge, 12x CPU, 7500GB SSD
             | 
             | i3en.6xlarge, 24x CPU, 2x7500GB SSD
             | 
             | i3en.12xlarge, 48x CPU, 4x7500GB SSD
             | 
             | i3en.24xlarge, 96x CPU, 8x7500GB SSD
             | 
             | i3en.metal, 96x CPU, 8x7500GB SSD
             | 
             | So they've got servers with 96 CPUs and 8x7500GB SSDs. You
             | can get a slice of one, or you can get the whole one. All
             | of these are the ratio of 625GB of local SSD per CPU core.
             | 
             | https://instances.vantage.sh/
             | 
             | On GCP you can get a 2-core N2 instance type and attach
             | multiple local SSDs. I doubt they have many physical 2-core
             | Xeons in their datacenters.
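              | 
              | A quick sanity check of that 625GB-per-core ratio, from
              | the sizes listed above:
              | 
              |     # Local SSD GB per vCPU for the i3en sizes above.
              |     sizes = {
              |         "i3en.large": (2, 1250),
              |         "i3en.xlarge": (4, 2500),
              |         "i3en.2xlarge": (8, 2 * 2500),
              |         "i3en.3xlarge": (12, 7500),
              |         "i3en.6xlarge": (24, 2 * 7500),
              |         "i3en.12xlarge": (48, 4 * 7500),
              |         "i3en.24xlarge": (96, 8 * 7500),
              |     }
              |     for name, (vcpus, ssd_gb) in sizes.items():
              |         print(name, ssd_gb / vcpus)  # 625.0 for every size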
        
             | taneq wrote:
             | > What happens when one tenant needs 200TB attached to a
             | server?
             | 
             | Link to this mythical hosting service that expects far less
             | than 200TB of data per client but just pulls a sad face and
             | takes the extra cost on board when a client demands it. :D
        
           | drewda wrote:
           | The major clouds do offer VMs with fast local storage, such
           | as SSDs connected by NVMe connections directly to the VM host
           | machine:
           | 
           | - https://cloud.google.com/compute/docs/disks/local-ssd
           | 
           | - https://learn.microsoft.com/en-us/azure/virtual-
           | machines/ena...
           | 
           | - https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-
           | inst...
           | 
           | They sell these VMs at a higher cost because it requires more
           | expensive components and is limited to host machines with
           | certain configurations. In our experience, it's also harder
           | to request quota increases to get more of these VMs -- some
           | of the public clouds have a limited supply of these specific
           | types of configurations in some regions/zones.
           | 
           | As others have noted, instance storage isn't as dependable.
           | But it can be the most performant way to do IO-intense
           | processing or to power one node of a distributed database.
        
         | _Rabs_ wrote:
         | So much of this. The amount of times I've seen someone complain
         | about slow DB performance when they're trying to connect to it
         | from a different VPC, and bottlenecking themselves to 100Mbits
         | is stupidly high.
         | 
         | Literally depending on where things are in a data center... If
          | you want things closely coupled, on a 10G line on the same
          | switch, going to the same server rack, I bet your
          | performance will be so much more consistent.
        
           | bugbuddy wrote:
           | Aren't 10G and 100G connections standard nowadays in data
           | centers? Heck, I thought they were standard 10 years ago.
        
             | geerlingguy wrote:
             | Datacenters are up to 400 Gbps and beyond (many places are
             | adopting 1+ Tbps on core switching).
             | 
             | However, individual servers may still operate at 10, 25, or
             | 40 Gbps to save cost on the thousands of NICs in a row of
             | racks. Alternatively, servers with multiple 100G
             | connections split that bandwidth allocation up among dozens
             | of VMs so each one gets 1 or 10G.
        
             | nixass wrote:
             | 400G is fairly normal thing in DCs nowadays
        
             | pixl97 wrote:
             | Bandwidth delay product does not help serialized
             | transactions. If you're reaching out to disk for results,
              | or if you have locking transactions on a table, the
              | achievable operation rate drops dramatically as latency
              | between the host and the disk increases.
        
               | bee_rider wrote:
               | The typical way to trade bandwidth away for latency
               | would, I guess, be speculative requests. In the CPU world
               | at least. I wonder if any cloud providers have some sort
               | of framework built around speculative disk reads (or
               | maybe it is a totally crazy trade to make in this
               | context)?
        
               | pixl97 wrote:
               | I mean we already have readahead in the kernel.
               | 
               | This said the problem can get more complex than this
               | really fast. Write barriers for example and dirty caches.
               | Any application that forces writes and the writes are
               | enforced by the kernel are going to suffer.
               | 
               | The same is true for SSD settings. There are a number of
               | tweakable values on SSDs when it comes to write commit
               | and cache usage which can affect performance. Desktop
               | OS's tend to play more fast and loose with these settings
                | and server defaults tend to be more conservative.
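                | 
                | For what it's worth, an application can also hand the
                | kernel explicit prefetch hints itself. A minimal
                | Linux-only sketch (the file path and offsets are made
                | up for illustration):
                | 
                |     import os
                | 
                |     BLOCK = 1 << 20  # 1 MiB
                |     # Hypothetical data file, opened with buffered IO.
                |     fd = os.open("/var/lib/myapp/data.bin", os.O_RDONLY)
                | 
                |     # Hint which ranges we expect to need soon; the
                |     # kernel may prefetch them asynchronously.
                |     for off in (0, 64 * BLOCK, 128 * BLOCK):
                |         os.posix_fadvise(fd, off, BLOCK,
                |                          os.POSIX_FADV_WILLNEED)
                | 
                |     # ... do other work while prefetch (maybe) happens ...
                |     data = os.pread(fd, BLOCK, 64 * BLOCK)  # likely cached
                |     os.close(fd)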
        
               | treflop wrote:
                | Oftentimes it's the app (or something high-level) that
                | would need to make speculative requests, which may not
                | be possible in the given domain.
               | 
               | I don't think it's possible in most domains.
        
             | KaiserPro wrote:
              | Yes, but you have to think about contention. Whilst the
              | top-of-rack switch _might_ have 2x400 gig links to the
              | core, that's shared with the entire rack, and all the
              | other machines trying to shout at the core switching
              | infra.
              | 
              | Then stuff goes away, or routes get congested, etc, etc,
              | etc.
        
           | silverquiet wrote:
           | > Literally depending on where things are in a data center
           | 
           | I thought cloud was supposed to abstract this away? That's a
           | bit of a sarcastic question from a long-time cloud skeptic,
           | but... wasn't it?
        
             | doubled112 wrote:
             | Reality always beats the abstraction. After all, it's just
             | somebody else's computer in somebody else's data center.
        
               | bombcar wrote:
               | Which can cause considerable "amusement" depending on the
               | provider - one I won't name directly but is much more
                | centered on actually renting racks than their (now) cloud
               | offering - if you had a virtual machine older than a year
               | or so, deleting and restoring it would get you on a newer
               | "host" and you'd be faster for the same cost.
               | 
               | Otherwise it'd stay on the same physical piece of
               | hardware it was allocated to when new.
        
               | doubled112 wrote:
               | Amusing is a good description.
               | 
               | "Hardware degradation detected, please turn it off and
               | back on again"
               | 
               | I could do a migration with zero downtime in VMware for a
               | decade but they can't seamlessly move my VM to a machine
               | that works in 2024? Great, thanks. Amusing.
        
               | bombcar wrote:
               | I have always been incredibly saddened that apparently
               | the cloud providers usually have nothing as advanced as
               | old VMware was.
        
               | wmf wrote:
               | Cloud providers have live migration now but I guess they
               | don't want to guarantee anything.
        
               | bombcar wrote:
               | It's better (and better still with other providers) but I
               | naively thought that "add more RAM" or "add more disk"
               | was something they would be able to do with a reboot at
               | most.
               | 
               | Nope, some require a full backup and restore.
        
               | wmf wrote:
               | Resizing VMs doesn't really fit the "cattle" thinking of
               | public cloud, although IMO that was kind of a premature
               | optimization. This would be a perfect use case for live
               | migration.
        
             | kccqzy wrote:
             | It's more of a matter of adding additional abstraction
             | layers. For example in most public clouds the best you can
             | hope for is to place two things in the same availability
             | zone to get the best performance. But when I worked at
             | Google, internally they had more sophisticated colocation
              | constraints than that: for example you can require two
             | things to be on the same rack.
        
             | treflop wrote:
             | Cloud makes provisioning more servers quicker because you
             | are paying someone to basically have a bunch of servers
             | ready to go right away with an API call instead of a phone
             | call, maintained by a team that isn't yours, with economies
             | of scale working for the provider.
             | 
             | Cloud does not do anything else.
             | 
             | None of these latency/speed problems are cloud-specific. If
             | you have on-premise servers and you are storing your data
             | on network-attached storage, you have the exact same
             | problems (and also the same advantages).
             | 
             | Unfortunately the gap between local and network storage is
             | wide. You win some, you lose some.
        
               | silverquiet wrote:
               | Oh, I'm not a complete neophyte (in what seems like a
               | different life now, I worked for a big hosting provider
               | actually), I was just surprised that there was a big
               | penalty for cross-VPC traffic implied by the parent
               | poster.
        
         | ejb999 wrote:
         | How much faster would the network need to get, in order to meet
         | (or at least approach) the speed of a local SSD? are we talking
         | about needing to 2x or 3x the speed, or by factors of hundreds
         | or thousands?
        
           | Filligree wrote:
           | The Samsung 990 in my desktop provides ~3.5 GB/s streaming
           | reads, ~2 GB/s 4k random-access reads, all at a latency
           | measured at around 20-30 microseconds. My exact numbers might
           | be a little off, but that's the ballpark you're looking at,
           | and a 990 is a relatively cheap device.
           | 
           | 10GbE is about the best you can hope for from a local network
           | these days, but that's 1/5th the bandwidth and many times the
           | latency. 100GbE would work, except the latency would still
           | mean any read dependencies would be far slower than local
           | storage, and I'm not sure there's much to be done about that;
           | at these speeds the physical distance matters.
           | 
            | In practice I'm having to architect the entire system
           | around the SSD just to not bottleneck it. So far ext4 is the
           | only filesystem that even gets close to the SSD's limits,
           | which is a bit of a pity.
        
           | wmf wrote:
           | Around 4x-10x depending on how many SSDs you want. A single
           | SSD is around the speed of a 100 Gbps Ethernet link.
        
           | selectodude wrote:
            | SATA3 is 6 Gbit, so multiply the number of VMs on a
            | machine by 6 Gbit. For NVMe, probably closer to 4-5x that.
            | You'd need some serious interconnects to give a server
            | rack access to un-bottlenecked SSD storage.
        
           | Nextgrid wrote:
           | The problem isn't necessarily speed, it's random access
           | latency. What makes SSDs fast and "magical" is their low
           | random-access latency compared to a spinning disk. The
           | sequential-access read speed is merely a bonus.
           | 
           | Networked storage negates that significantly, absolutely
           | killing performance for certain applications. You could have
           | a 100Gbps network and it still won't match a direct-attached
           | SSD in terms of latency (it can only match it in terms of
           | sequential access throughput).
           | 
           | For many applications such as databases, random access is
            | crucial, which is why today's mid-range consumer hardware often
           | outperforms hosted databases such as RDS unless they're so
           | overprovisioned on RAM that the dataset is effectively always
           | in there.
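            | 
            | A toy calculation (the latency figures are assumptions,
            | not benchmarks) of what that means for a pointer-chasing
            | workload like a B-tree lookup storm:
            | 
            |     # Each read depends on the previous one, so latency adds up.
            |     READS = 100_000              # dependent 4 KiB random reads
            |     LOCAL_NVME = 30e-6           # ~30 us per read, assumed
            |     NETWORKED = 500e-6           # ~0.5 ms per read, assumed
            |     for name, lat in [("local NVMe", LOCAL_NVME),
            |                       ("networked", NETWORKED)]:
            |         print(f"{name}: {READS * lat:.0f} s total")
            |     # local NVMe: 3 s  vs  networked: 50 s -- and raising
            |     # bandwidth does not change this at all.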
        
             | baq wrote:
             | 100Gbps direct _shouldn 't be_ too bad, but it might be
             | difficult to get anyone to sell it to you for exclusive
             | usage in a vm...
        
             | Ericson2314 wrote:
             | Um... why the hell does the network care whether I am doing
              | random or sequential access? You left that part out of
              | your argument.
        
         | zokier wrote:
         | > SSDs in the cloud are attached over a network, and
         | fundamentally have to be
         | 
         | Not on AWS. Instance stores (what the article is about) are
         | physical local disks.
        
         | mkoubaa wrote:
         | Dumb question. Why does the network have to be slow? If the
         | SSDs are two feet away from the motherboard and there's an
         | optical connection to it, shouldn't it be fast? Are data
         | centers putting SSDs super far away from motherboards?
        
           | bugbuddy wrote:
           | > One theory is that EC2 intentionally caps the write speed
           | at 1 GB/s to avoid frequent device failure, given the total
           | number of writes per SSD is limited.
           | 
           | This is the theory that I would bet on because it lines up
           | with their bottom line.
        
             | Dylan16807 wrote:
             | But the sentence right after undermines it.
             | 
             | > However, this does not explain why the read bandwidth is
             | stuck at 2 GB/s.
             | 
             | Faster read speeds would give them a more enticing product
             | without wearing drives out.
        
               | bugbuddy wrote:
               | They may be limiting the read artificially to increase
                | your resource utilization elsewhere. If you have a disk
               | bottleneck then you would be more likely to use more
               | instances. It is still about the bottom line.
        
               | Dylan16807 wrote:
               | That could be. But it's a completely different reason. If
               | you summarize everything as "bottom line", you lose all
               | the valuable information.
        
           | formercoder wrote:
            | What happens when your VM is live-migrated 1000 feet away or
           | to a different zone?
        
           | supriyo-biswas wrote:
            | It's not that the network is slow; rather, dividing the
            | available network bandwidth amongst all users, while also
            | reliably distributing the written data to multiple nodes
            | so that one tenant doesn't hog resources, is quite
            | challenging. The
           | pricing structure is meant to control resource usage; a
           | discussion of the exact prices and how much profit AWS or any
           | other cloud provider makes is a separate discussion.
        
         | jsnell wrote:
         | According to the submitted article, the numbers are from AWS
         | instance types where the SSD is "physically attached" to the
         | host, not about SSD-backed NAS solutions.
         | 
         | Also, the article isn't just about SSDs being no faster than a
         | network. It's about SSDs being two orders of magnitude slower
         | than datacenter networks.
        
           | pclmulqdq wrote:
           | It's because the "local" SSDs are not actually physically
           | attached and there's a network protocol in the way.
        
             | zokier wrote:
             | What makes you think that?
        
             | ddorian43 wrote:
             | Do you have a link to explain this? I dont think its true.
        
             | candiddevmike wrote:
             | Depends on the cloud provider. Local SSDs are physically
             | attached to the host on GCP, but that makes them only
             | useful for temporary storage.
        
               | pclmulqdq wrote:
               | If you're at G, you should read the internal docs on
               | exactly how this happens and it will be interesting.
        
               | rfoo wrote:
               | Why would I lose all data on these SSDs when I initiate a
               | power off of the VM on console, then?
               | 
               | I believe local SSDs are definitely attached to the host.
               | They are just not exposed via NVMe ZNS hence the
               | performance hit.
        
               | manquer wrote:
                | It is because on reboot you may not get the same
                | physical server. They are not rebooting the physical
                | server for you, just the VM.
                | 
                | The same physical server is not allocated again for a
                | variety of reasons: scheduled maintenance, proximity
                | to other hosts on the VPC, balancing quiet and noisy
                | neighbors, and so on.
                | 
                | It is not that the disk will always be wiped;
                | sometimes the data is still there on reboot. It's just
                | that there is no guarantee, which lets them freely
                | move VMs between hosts.
        
               | res0nat0r wrote:
               | Your EC2 instance with instance-store storage when
               | stopped can be launched on any other random host in the
                | AZ when you power it back on. Since your root disk is
                | an EBS volume attached across the network, when you
                | start your instance back up you're likely going to be
                | launched somewhere else with an empty slot and empty
                | local storage. This is why there is always a disclaimer that
               | this local storage is ephemeral and don't count on it
               | being around long-term.
        
               | mrcarrot wrote:
               | I think the parent was agreeing with you. If the "local"
               | SSDs _weren't_ actually local, then presumably they
               | wouldn't need to be ephemeral since they could be
               | connected over the network to whichever host your
               | instance was launched on.
        
               | amluto wrote:
               | Which is a weird sort of limitation. For any sort of you-
               | own-the-hardware arrangement, NVMe disks are fine for
               | long term storage. (Obviously one should have backups,
               | but that's a separate issue. One should have a DR plan
               | for data on EBS, too.)
               | 
               | You need to _migrate_ that data if you replace an entire
               | server, but this usually isn't a very big deal.
        
               | supriyo-biswas wrote:
               | This is Hyrum's law at play: AWS wants to make sure that
               | the instance stores aren't seen as persistent, and
                | therefore enforces the failure mode for normal operations
               | as well.
               | 
               | You should also see how they enforce similar things for
               | their other products and APIs, for example, most of their
               | services have encrypted pagination tokens.
        
               | throwawaaarrgh wrote:
               | Yes, that's what their purpose is in cloud applications:
               | temporary high performance storage only.
               | 
               | If you want long term local storage you'll have to
               | reserve an instance host.
        
             | mike_hearn wrote:
             | They do this because they want SSDs to be in a physically
             | separate part of the building for operational reasons, or
             | what's the point in giving you a "local" SSD that isn't
             | actually plugged into the real machine?
        
               | ianburrell wrote:
               | The reason for having most instances use network storage
               | is that it makes possible migrating instances to other
               | hosts. If the host fails, the network storage can be
               | pointed at the new host with a reboot. AWS sends out
               | notices regularly when they are going to reboot or
               | migrate instances.
               | 
                | There probably should be more local instance storage
                | types for use with instances that can be recreated
               | without loss. But it is simple for them to have a single
               | way of doing things.
               | 
               | At work, someone used fast NVMe instance storage for
               | Clickhouse which is a database. It was a huge hassle to
               | copy data when instances were going to be restarted
               | because the data would be lost.
        
               | mike_hearn wrote:
               | Sure, I understand that, but this user is claiming that
               | on GCP even local SSDs aren't really local, which raises
               | the question of why not.
               | 
               | I suspect the answer is something to do with their
               | manufacturing processes/rack designs. When I worked there
               | (pre GCP) machines had only a tiny disk used for booting
               | and they wanted to get rid of that. Storage was handled
               | by "diskful" machines that had dedicated trays of HDDs
               | connected to their motherboards. If your datacenters and
               | manufacturing processes are optimized for building
               | machines that are either compute or storage but not both,
               | perhaps the more normal cloud model is hard to support
               | and that pushes you towards trying to aggregate storage
               | even for "local" SSD or something.
        
               | deadmutex wrote:
               | The GCE claim is unverified. OP seems to be referring to
               | PD-SSD and not LocalSSD
        
               | youngtaff wrote:
               | > At work, someone used fast NVMe instance storage for
               | Clickhouse which is a database. It was a huge hassle to
               | copy data when instances were going to be restarted
               | because the data would be lost.
               | 
               | This post on how Discord RAIDed local NVMe volumes with
                | slower remote volumes might be of interest:
               | https://discord.com/blog/how-discord-supercharges-
               | network-di...
        
               | ianburrell wrote:
               | We moved to running Clickhouse on EKS with EBS volumes
               | for storage. It can better survive instances going down.
                | I didn't work on it so I don't know how much slower it is.
               | Lowering the management burden was big priority.
        
               | wiredfool wrote:
               | Are you saying that a reboot wipes the ephemeral disks?
               | Or a stop the instance and start the instance from AWS
               | console/api?
        
               | ianburrell wrote:
               | Reboot keeps the instance storage volumes. Restarting
                | wipes them. Starting frequently migrates you to a new
                | host. And the "restart" notices AWS sends are likely
                | because the host has a problem and they need to
                | migrate it.
        
               | yolovoe wrote:
               | The comment you're responding to is wrong. AWS offers
               | many kinds of storage. Instance local storage is
               | physically attached to the droplet. EBS isn't but that's
               | a separate thing entirely.
               | 
               | I literally work in EC2 Nitro.
        
             | colechristensen wrote:
             | For AWS there are EBS volumes attached through a custom
             | hardware NVMe interface and then there's Instance Store
             | which is actually local SSD storage. These are different
             | things.
             | 
             | https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Instanc
             | e...
        
               | kwillets wrote:
               | EBS is also slower than local NVMe mounts on i3's.
               | 
               | Also, both features use Nitro SSD cards, according to AWS
               | docs. The Nitro architecture is all locally attached --
               | instance storage to the instance, EBS to the EBS server.
        
             | jrullman wrote:
             | I can attest to the fact that on EC2, "instance store"
             | volumes are actually physically attached.
        
             | jsnell wrote:
             | I think you're wrong about that. AWS calls this class of
             | storage "instance storage" [0], and defines it as:
             | 
             | > Many Amazon EC2 instances can also include storage from
             | devices that are located inside the host computer, referred
             | to as instance storage.
             | 
             | There might be some wiggle room in "physically attached",
             | but there's none in "storage devices located inside the
             | host computer". It's not some kind of AWS-only thing
             | either. GCP has "local SSD disks"[1], which I'm going to
             | claim are likewise local, not over the network block
             | storage. (Though the language isn't as explicit as for
             | AWS.)
             | 
             | [0] https://aws.amazon.com/ec2/instance-types/
             | 
             | [1] https://cloud.google.com/compute/docs/disks#localssds
        
               | 20after4 wrote:
               | If the SSD is installed in the host server, doesn't that
               | still allow for it to be shared among many instances
               | running on said host? I can imagine that a compute node
               | has just a handful of SSDs and many hundreds of instances
               | sharing the I/O bandwidth.
        
               | discodave wrote:
               | If you have one of the metal instance types, then you get
               | the whole host, e.g. i4i.metal:
               | 
               | https://aws.amazon.com/ec2/instance-types/i4i/
        
               | aeyes wrote:
               | On AWS yes, the older instances which I am familiar with
               | had 900GB drives and they sliced that up into volumes of
               | 600, 450, 300, 150, 75GB depending on instance size.
               | 
               | But they also tell you how much IOPS you get: https://doc
               | s.aws.amazon.com/AWSEC2/latest/WindowsGuide/stora...
        
               | ownagefool wrote:
               | PCI bus, etc too
        
               | throwawaaarrgh wrote:
               | Instance storage is not networked. That's why it's there.
        
               | queuebert wrote:
               | How _do_ these machines manage the sharing of one local
               | SSD across multiple VMs? Is there some wrapper around the
                | I/O stack? Does it appear as a network share? Genuinely
               | curious...
        
               | felixg3 wrote:
               | Probably NVME namespaces [0]?
               | 
               | [0]: https://nvmexpress.org/resource/nvme-namespaces/
        
               | bravetraveler wrote:
               | Less fancy, quite often... at least on VPS providers [1].
               | They like to use reflinked files off the base images.
               | This way they only store what differs.
               | 
               | 1: Which is really a cloud without a certain degree of
               | software defined networking/compute/storage/whatever.
        
               | dan-robertson wrote:
               | AWS have custom firmware for at least some of their SSDs,
               | so could be that
        
               | magicalhippo wrote:
               | In say VirtualBox you can create a file backed on the
               | physical disk, and attach it to the VM so the VM sees it
               | as a NVMe drive.
               | 
               | In my experience this is also orders of magnitude slower
                | than true direct access, i.e. PCIe pass-through, as all
               | access has to pass through the VM storage driver and so
               | _could_ explain what is happening.
        
               | icedchai wrote:
               | With Linux and KVM/QEMU, you can map an entire physical
               | disk, disk partition, or file to a block device in the
               | VM. For my own VM hosts, I use LVM and map a logical
               | volume to the VM. I assumed cloud providers did something
               | conceptually similar, only much more sophisticated.
        
               | pclmulqdq wrote:
               | That's the abstraction they want you to work with, yes.
               | That doesn't mean it's what is actually happening - at
               | least not in the same way that you're thinking.
               | 
                | As a hint for you, I said "_a_ network", not "_the_
                | network." You can also look at public presentations
               | about how Nitro works.
        
               | jng wrote:
               | Nitro "virtual NVME" device are mostly (only?) for EBS --
               | remote network storage, transparently managed, using a
               | separate network backbone, and presented to the host as a
               | regular local NVME device. SSD drives in instances such
               | as i4i, etc. are physically attached in a different way
                | -- but, unlike EBS, they are ephemeral and the
                | content becomes unavailable as you stop the instance, and
               | when you restart, you get a new "blank slate". Their
               | performance is 1 order of magnitude faster than standard-
               | level EBS, and the cost structure is completely different
               | (and many orders of magnitude more affordable than EBS
               | volumes configured to have comparable I/O performance).
        
               | jsnell wrote:
               | I've linked to public documentation that is pretty
               | clearly in conflict with what you said. There's no wiggle
               | room in how AWS describes their service without it being
               | false advertising. There's no "ah, but what if we define
               | the entire building to be the host computer, then the
                | networked SSDs really _are_ inside the host computer"
               | sleight of hand to pull off here.
               | 
               | You've provided cryptic hints and a suggestion to watch
               | some unnamed presentation.
               | 
               | At this point I really think the burden of proof is on
               | you.
        
               | dekhn wrote:
               | it sounds like you're trying to say "PCI switch" without
               | saying "PCI switch" (I worked at Google for over a
               | decade, including hardware division).
        
               | jasonwatkinspdx wrote:
               | Both the documentation and Amazon employees are in here
               | telling you that you're wrong. Can you resolve that
               | contradiction or do you just want to act coy like you
               | know some secret? The latter behavior is not productive.
        
               | wstuartcl wrote:
                | The tests were for these local (metal direct-connect)
                | SSDs. The issue is not network overhead -- it's that,
                | just like everything else in cloud, the performance of
                | 10 years ago was used as the baseline that carries
                | over today, with upcharges to buy back the gains.
                | 
                | There is a reason why vCPU performance is still locked
                | to the typical core from 10 years ago when every core
                | on a machine today in those data centers is 3-5x or
                | more the speed. It's because they can charge you for
                | 5x the cores to get that gain.
        
             | dekhn wrote:
             | I suspect you must be conflating several different storage
             | products. Are you saying
             | https://cloud.google.com/compute/docs/disks/local-ssd
             | devices talk to the host through a network (say, ethernet
             | with some layer on top)? Because the documentation very
             | clearly says otherwise, "This is because Local SSD disks
             | are physically attached to the server that hosts your VM.
             | For this same reason, Local SSD disks can only provide
             | temporary storage." (at least, I'm presuming that by
             | physically attached, they mean it's connected to the PCI
             | bus without a network in between).
             | 
             | I suspect you're thinking of SSD-PD. If "local" SSDs are
             | not actually local and go through a network, I need to have
             | a discussion with my GCS TAM about truth in advertising.
        
               | op00to wrote:
               | > physically attached
               | 
               | Believe it or not, superglue and a wifi module! /s
        
               | mint2 wrote:
               | I don't really agree with assuming the form of physical
               | attachment and interaction unless it is spelled out.
               | 
                | If that's what's meant, it will be stated in some fine
                | print. If it's not stated anywhere, then there is no
               | guarantee what the term means, except I would guess they
               | may want people to infer things that may not necessarily
               | be true.
        
               | dekhn wrote:
               | "Physically attached" has had a fairly well defined
               | meaning and i don't normally expect a cloud provider to
               | play word salad to convince me a network drive is locally
               | attached (like I said, if true, I would need to have a
               | chat with my TAM about it).
               | 
               | Physically attached for servers, for the past 20+ years,
               | has meant a direct electrical connection to a host bus
               | (such as the PCI bus attached to the front-side bus). I'd
               | like to see some alternative examples that violate that
               | convention.
        
               | adgjlsfhk1 wrote:
               | Ethernet cables are physical...
        
               | dekhn wrote:
               | The NIC is attached to the host bus through the north
               | bridge. But other hosts on the same ethernetwork are not
                | considered to be "local". We don't need to get crazy about
                | the semantics to know that when a cloud provider says an
               | SSD is locally attached, that it's closer than an
               | ethernetwork away.
        
             | crotchfire wrote:
             | This is incorrect.
             | 
             | Amazon offers both locally-attached storage devices as well
             | as instance-attached storage devices. The article is about
             | the latter kind.
        
             | bfncieezo wrote:
             | instances can have block storage which is network attached,
              | or locally attached SSD/NVMe. It's 2 separate things.
        
             | choppaface wrote:
             | Nope! Well not as advertised. There are instances, usually
             | more expensive ones, where there are supposed to be local
             | NVME disks dedicated to the instance. You're totally right
             | that providing good I/O is a big problem! And I have done
             | studies myself showing just how bad Google Cloud is here,
             | and have totally ditched Google Cloud for providing crappy
             | compute service (and even worse customer service).
        
             | yolovoe wrote:
              | You're wrong. Instance-local means the SSD is physically
              | attached to the droplet and is inside the server chassis,
              | connected via PCIe.
              | 
              | Source: I work on Nitro cards.
        
               | tptacek wrote:
               | "Attached to the droplet"?
        
               | hipadev23 wrote:
               | digitalocean squad
        
               | sargun wrote:
               | Droplets are what EC2 calls their hosts. Confusing? I
               | know.
        
               | tptacek wrote:
               | Yes! That is confusing! Tell them to stop it!
        
             | hathawsh wrote:
             | That seems like a big opportunity for other cloud
             | providers. They could provide SSDs that are actually
             | physically attached and boast (rightfully) that their SSDs
             | are a lot faster, drawing away business from older cloud
             | providers.
        
               | solardev wrote:
               | For what kind of workloads would a slower SSD be a
               | significant bottleneck?
        
               | ddorian43 wrote:
               | Next thing the other clouds will offer is cheaper
               | bandwidth pricing, right?
        
             | Salgat wrote:
              | At first you'd think maybe they could do a volume copy from
              | a snapshot to a local drive on instance creation, but even
              | at 100 Gbps you're looking at almost 3 minutes for a 2TB
              | drive.
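              | 
              | A rough sanity check of that figure (a sketch assuming the
              | full 100 Gbps is sustained and ignoring protocol overhead):
              | 
              |     # Time to copy a 2 TB snapshot over a 100 Gbps link.
              |     volume_bytes = 2 * 10**12        # 2 TB
              |     link_bits_per_s = 100 * 10**9    # 100 Gbps
              |     seconds = volume_bytes * 8 / link_bits_per_s
              |     print(seconds, seconds / 60)     # ~160 s, ~2.7 min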
        
           | crazygringo wrote:
           | > _It 's about SSDs being two orders of magnitude slower than
           | datacenter networks._
           | 
           | Could that have to do with every operation requiring a round
           | trip, rather than being able to queue up operations in a
           | buffer to saturate throughput?
           | 
           | It seems plausible if the interface protocol was built for a
           | device it assumed was physically local and so waited for
           | confirmation after each operation before performing the next.
           | 
           | In this case it's not so much the throughput rate that
           | matters, but the latency -- which can also be heavily
           | affected by buffering of other network traffic.
        
             | Nextgrid wrote:
             | Underlying protocol limitations wouldn't be an issue - the
             | cloud provider's implementation can work around that.
             | They're unlikely to be sending sequential SCSI/NVMe
             | commands over the wire - instead, the hypervisor pretends
             | to be the NVME device, but then converts to some internal
             | protocol (that's less chatty and can coalesce requests
             | without waiting on individual ACKs) before sending that to
             | the storage server.
             | 
              | The problem is that ultimately your application often
              | requires the outcome of a given IO operation to decide
              | which operation to perform next. Take a database: it must
              | first read the index (and wait for that to complete)
              | before it knows the on-disk location of the actual row
              | data, which it needs before it can issue the next IO
              | operation.
             | 
             | In this case, there's no other solution than to move that
             | application closer to the data itself. Instead of the
             | networked storage node being a dumb blob storage returning
             | bytes, the networked "storage" node is your database
             | itself, returning query results. I believe that's what RDS
             | Aurora does for example, every storage node can itself
             | understand query predicates.
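              | 
              | A toy model to make that concrete (the 40 us and 500 us
              | figures below are illustrative assumptions, not
              | measurements from the thread):
              | 
              |     # Toy model: a point lookup that needs a chain of
              |     # dependent reads (a few index levels, then the row).
              |     dependent_reads = 4
              |     local_ssd_us = 40      # assumed local NVMe read
              |     network_rtt_us = 500   # assumed storage round trip
              |     local = dependent_reads * local_ssd_us
              |     remote = dependent_reads * (network_rtt_us + local_ssd_us)
              |     print(local, "us local vs", remote, "us remote")
              |     # 160 us vs 2160 us: each dependency adds a full round
              |     # trip, which extra throughput cannot hide.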
        
         | wmf wrote:
         | No, the i3/i4 VMs discussed in the blog have local SSDs. The
         | network isn't the reason local SSDs are slow.
        
         | fngjdflmdflg wrote:
         | So do cloud vendors simply not use fast SSDs? If so I would
         | expect the SSD manufacturers themselves to work on this
         | problem. Perhaps they already are.
        
         | adrr wrote:
          | If the local drives are network drives (e.g. SAN), then why
          | are they ephemeral?
        
           | baq wrote:
           | live vm migrations, perhaps
        
         | Dylan16807 wrote:
         | Even assuming that "local" storage is a lie, hasn't the network
         | gotten a lot faster? The author is only asking for a 5x
         | increase at the end of the post.
        
         | boulos wrote:
         | I'm not sure which external or internal product you're talking
         | about, but there are no networks involved for Local SSD on GCE:
         | https://cloud.google.com/compute/docs/disks/local-ssd
         | 
         | Are you referring to PD-SSD? Internal storage usage?
        
         | nostrademons wrote:
          | Makes me wonder if we're at the cusp of a shift back to
          | client-based software. Historically, changes in the relative
          | cost of
         | computing components have driven most of the shifts in the
         | computing industry. Cheap teletypes & peripherals fueled the
         | shift from batch-processing mainframes to timesharing
         | minicomputers. Cheap CPUs & RAM fueled the shift from
         | minicomputers to microcomputers. Cheap and fast networking
         | fueled the shift from desktop software to the cloud. Will cheap
         | SSDs & TPU/GPUs fuel a shift back toward thicker clients?
         | 
         | There are a bunch of supporting social trends toward this as
         | well. Renewed emphasis on privacy. Big Tech canceling beloved
         | products, bricking devices, and generally enshittifying
         | everything - a lot of people want locally-controlled software
         | that isn't going to get worse at the next update. Ever-rising
         | prices which make people want to lock in a price for the device
         | and not deal with increasing rents for computing power.
        
           | davkan wrote:
           | I think a major limiting factor here for many applications is
           | that mobile users are a huge portion of the user base. In
           | that space storage, and more importantly battery life, are
           | still at a premium. Granted the storage cost just seems to be
           | gouging from my layman's point of view, so industry needs
           | might force a shift upwards.
        
         | tw04 wrote:
         | >The problem is that this network is so large and slow that it
         | can't give you anywhere near the performance of a local SSD.
         | 
         | *As implemented in the public cloud providers.
         | 
         | You can absolutely get better than local disk speeds from SAN
         | devices and we've been doing it for decades. To do it on-prem
         | with flash devices will require NVMe over FC or Ethernet and an
         | appropriate storage array. Modern all-flash array performance
         | is measured in millions of IOPS.
         | 
         | Will there be a slight uptick in latency? Sure, but it's well
         | worth it for the data services and capacity of an external
         | array for nearly every workload.
        
         | scottlamb wrote:
         | > The problem is that this network is so large and slow that it
         | can't give you anywhere near the performance of a local SSD.
         | This wasn't a problem for hard drives, which was the backing
         | technology when a lot of these network attached storage systems
         | were invented, because they are fundamentally slow compared to
         | networks, but it is a problem for SSD.
         | 
         | Certainly true that SSD bandwidth and latency improvements are
         | hard to match, but I don't understand why intra-datacenter
         | network latency in particular is so bad. This ~2020-I-think
         | version of the "Latency Numbers Everyone Should Know" says 0.5
         | ms round trip (and mentions "10 Gbps network" on another line).
         | [1] It was the same thing in a 2012 version (that only mentions
         | "1 Gbps network"). [2] Why no improvement? I think that 2020
         | version might have been a bit conservative on this line, and
         | nice datacenters may even have multiple 100 Gbit/sec NICs per
         | machine in 2024, but still I think the round trip actually is
         | strangely bad.
         | 
         | I've seen experimental networking stuff (e.g. RDMA) that claims
         | significantly better latency, so I don't think it's a physical
         | limitation of the networking gear but rather something at the
         | machine/OS interaction area. I would design large distributed
         | systems significantly differently (be much more excited about
         | extra tiers in my stack) if the standard RPC system offered say
         | 10 us typical round trip latency.
         | 
         | [1]
         | https://static.googleusercontent.com/media/sre.google/en//st...
         | 
         | [2] https://gist.github.com/jboner/2841832
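          | 
          | For a sense of how far that 0.5 ms figure is from physics, a
          | back-of-envelope sketch (500 m of fiber, a 1 KB request and
          | response on a 10 Gbps link, switching delays ignored - all
          | illustrative assumptions):
          | 
          |     # Rough physical floor for an intra-datacenter round trip.
          |     fiber_m = 500
          |     fiber_m_per_us = 200     # ~2/3 of c
          |     propagation_us = 2 * fiber_m / fiber_m_per_us
          |     serialization_us = 2 * (1000 * 8) / 10_000  # 1 KB at 10 Gbps
          |     print(propagation_us + serialization_us)    # ~6.6 us
          |     # The quoted 0.5 ms round trip is roughly 75x this floor,
          |     # so most of it is software overhead and queueing.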
        
           | dekhn wrote:
           | Modern data center networks don't have full cross
           | connectivity. Instead they are built using graphs and
           | hierarchies that provide less than the total bandwidth
           | required for all pairs of hosts to be communicating. This
           | means, as workloads start to grow and large numbers of
           | compute hosts demand data IO to/from storage hosts, the
           | network eventually gets congested, which typically exhibits
            | as higher latencies and more dropped packets. Batch jobs are
            | often relegated to "spare" bandwidth, while serving jobs
            | often get dedicated bandwidth.
            | 
            | At the same time, Ethernet networks with layered protocols
            | on top typically have a fair amount of latency overhead,
            | which makes them much slower than bus-based, direct-host-
            | attached storage. I was definitely impressed at how quickly
            | SSDs reached and then exceeded SATA bandwidth. NVMe has made
            | a HUGE difference here.
        
           | kccqzy wrote:
           | That document is probably deliberately on the pessimistic
           | side to encourage your code to be portable across all kinds
           | of "data centers" (however that is defined). When I
           | previously worked at Google, the standard RPC system
           | definitely offered 50 microseconds of round trip latency at
           | the median (I measured it myself in a real application), and
           | their advanced user-space implementation called Snap could
           | offer about 10 microseconds of round trip latency. The latter
           | figure comes from page 9 of
           | https://storage.googleapis.com/gweb-
           | research2023-media/pubto...
           | 
           | > nice datacenters may even have multiple 100 Gbit/sec NICs
           | per machine in 2024,
           | 
           | Google exceeded 100Gbps per machine long before 2024. IIRC it
           | had been 400Gbps for a while.
        
             | scottlamb wrote:
             | Interesting. I worked at Google until January 2021. I see
             | 2019 dates on that PDF, but I wasn't aware of snap when I
             | left. There was some alternate RPC approach (Pony Express,
             | maybe? I get the names mixed up) that claimed 10 us or so
             | but was advertised as experimental (iirc had some bad
             | failure modes at the time in practice) and was simply
             | unavailable in many of the datacenters I needed to deploy
             | in. Maybe they're two names for the same thing. [edit: oh,
             | yes, starting to actually read the paper now, and: "Through
             | Snap, we created a new communication stack called Pony
             | Express that implements a custom reliable transport and
             | communications API."]
             | 
             | Actual latency with standard Stubby-over-TCP and warmed
             | channels...it's been a while, so I don't remember the
             | number I observed, but I remember it wasn't _that_ much
              | better than 0.5 ms. It was still bad enough that I didn't
             | want to add a tier that would have helped with isolation in
             | a particularly high-reliability system.
        
               | kccqzy wrote:
               | Snap was the external name for the internal project known
               | as User Space Packet Service (abbreviated USPS) so
               | naturally they renamed it prior to publication. I
               | deployed an app using Pony Express in 2023 and it was
               | available in the majority of cells worldwide. Pony
               | Express supported more than just RPC though. The
               | alternate RPC approach that you spoke of was called Void.
               | It had been experimental for a long time and indeed it
               | wasn't well known even inside Google.
               | 
               | > but I remember it wasn't that much better than 0.5 ms.
               | 
               | If you and I still worked at Google I'd just give you an
               | automon dashboard link showing latency an order of
               | magnitude better than that to prove myself...
        
               | scottlamb wrote:
               | Interesting, thanks!
               | 
               | > If you and I still worked at Google I'd just give you
               | an automon dashboard link showing latency an order of
               | magnitude better than that to prove myself...
               | 
               | I believe you, and I think in principle we should all be
               | getting the 50 us latency you're describing within a
               | datacenter with no special effort.
               | 
               | ...but it doesn't match what I observed, and I'm not sure
               | why. Maybe difference of a couple years. Maybe I was
               | checking somewhere with older equipment, or some
               | important config difference in our tests. And obviously
               | my memory's a bit fuzzy by now but I know I didn't like
               | the result I got.
        
             | Szpadel wrote:
              | With such speeds, and with CXL gaining traction (think RAM
              | and GPUs over the network), why are networked SSDs still an
              | issue? You could have one storage server per rack that
              | serves storage only for that particular rack.
              | 
              | You could easily get something like 40 GB/s with some over-
              | provisioning / bucketing.
        
           | KaiserPro wrote:
            | Networks are not reliable, despite what you hear, so latency
            | is used to mask retries and delays.
            | 
            | The other thing to note is that big inter-DC links are
            | heavily QoS'd and contended, because they are both expensive
            | and a bollock to maintain.
            | 
            | Also, from what I recall, 40 gig links are just parallel 10
            | gig links, so they have no lower latency. I'm not sure if
            | 100/400 gig links are ten/forty lanes of 10 gig in parallel
            | or can actually issue packets at 10/40 times the rate of a
            | ten gig link. I've been away from networking too long.
        
             | scottlamb wrote:
             | > Networks are not reliable, despite what you hear, so
             | latency is used to mask re-tries and delays.
             | 
             | Of course, but even the 50%ile case is strangely slow, and
             | if that involves retries something is deeply wrong.
        
               | KaiserPro wrote:
               | You're right, but TCP doesn't like packets being dropped
               | halfway through a stream. If you have a highly QoS'd link
               | then you'll see latency spikes.
        
               | scottlamb wrote:
               | Again, I'm not talking about spikes (though better tail
               | latency is always desirable) but poor latency in the
               | 50%ile case. And for high-QoS applications, not batch
               | stuff. The snap paper linked elsewhere in the thread
               | shows 10 us latencies; they've put in some optimization
               | to achieve that, but I don't really understand why we
               | don't expect close to that with standard kernel
               | networking and TCP.
        
             | wmf wrote:
             | _40gig links are just parallel 10 gig links, so have no
             | lower latency_
             | 
             | That's not correct. Higher link speeds do have lower
             | serialization latency, although that's a small fraction of
             | overall network latency.
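              | 
              | Rough numbers behind that (a 1500-byte frame; sizes and
              | speeds are illustrative assumptions):
              | 
              |     # Serialization time for one 1500-byte frame.
              |     frame_bits = 1500 * 8
              |     for gbps in (10, 40, 100):
              |         us = frame_bits / (gbps * 1000)  # bits per us
              |         print(gbps, "Gbps:", round(us, 2), "us")
              |     # 1.2 us vs 0.3 us vs 0.12 us: a ~1 us difference,
              |     # tiny next to a 50-500 us end-to-end round trip.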
        
         | throwawaaarrgh wrote:
         | > SSDs in the cloud are attached over a network, and
         | fundamentally have to be
         | 
         | SANs can still be quite fast, and instance storage is fast,
         | both of which are available in cloud providers
        
         | samstave wrote:
          | Forgive me if this is stupid, but when Netflix was building out
          | all their edge content boxes, putting machines much more
          | latency-close to customers... could this model kinda work for
          | SSDs in an SScDN type of network
          | (client ----> CDN -> SScDN ----> SSD)?
        
         | dan-robertson wrote:
         | I can see network attached SSDs having poor latency, but
         | shouldn't the networking numbers quoted in the article allow
         | for higher throughput than observed?
        
         | paulddraper wrote:
         | > and fundamentally have to be
         | 
         | Can you expound?
        
         | PaulHoule wrote:
          | They don't have to be. Architecturally there are many benefits
          | to storage area networks, but I have built plenty of systems
          | which are self-contained: download a dataset to a cloud
          | instance with a direct-attached SSD, load it into a database,
          | and serve it from there.
        
       | siliconc0w wrote:
        | Core counts plus modern NVMe actually make a great case for
        | moving away from the cloud. Before, it was "your data probably
        | fits into memory". These drives are so fast that they're close
        | enough to memory that it's now "your data surely fits on disk".
        | This reduces the complexity of a lot of workloads, so you can
        | just buy a beefy server and do pretty insane
        | caching/calculation/serving with a single box, or two for
        | redundancy.
        
         | echelon wrote:
         | The reasons to switch away from cloud keep piling up.
         | 
         | We're doing some amount of on-prem, and I'm eager to do more.
        
           | aarmenaa wrote:
           | I've previously worked for a place that ran most of their
           | production network "on-prem". They had a few thousand
           | physical machines spread across 6 or so colocation sites on
           | three continents. I enjoyed that job immensely; I'd jump at
           | the chance to build something like it from the ground up. I'm
           | not sure if that actually makes sense for very many
           | businesses though.
        
         | malfist wrote:
          | I keep hearing that, but that's simply not true. SSDs are fast,
          | but they're several orders of magnitude slower than RAM, which
          | is orders of magnitude slower than CPU cache.
          | 
          | A Samsung 990 Pro 2TB has a latency of about 40 microseconds.
          | 
          | DDR4-2133 with CAS 15 has a latency of 14 nanoseconds.
          | 
          | DDR4 latency is 0.035% of one of the fastest SSDs, or to put it
          | another way, DDR4 is 2,857x faster than an SSD.
          | 
          | L1 cache is typically accessible in 4 clock cycles; in a 4.8
          | GHz CPU like the i7-10700, L1 cache latency is sub-1 ns.
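          | 
          | The arithmetic behind those ratios, as a quick check:
          | 
          |     ssd_ns = 40_000    # ~40 us random read
          |     dram_ns = 14       # DDR4-2133, CAS 15
          |     l1_ns = 4 / 4.8    # 4 cycles at 4.8 GHz
          |     print(dram_ns / ssd_ns * 100)   # ~0.035%
          |     print(round(ssd_ns / dram_ns))  # ~2857x
          |     print(round(l1_ns, 2))          # ~0.83 ns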
        
           | avg_dev wrote:
           | pretty cool comparisons. quite some differences there.
           | 
           | tangent, I remember reading some post called something like
           | "Latency numbers every programmer should know" and being
           | slightly ashamed when I could not internalize it.
        
             | malfist wrote:
             | Oh don't feel bad. I had to look up every one of those
             | numbers
        
             | nzgrover wrote:
              | You might enjoy Grace Hopper's lecture, which includes this
              | snippet:
             | https://youtu.be/ZR0ujwlvbkQ?si=vjEQHIGmffjqfHBN&t=2706
        
             | darzu wrote:
             | probably this one:
             | https://gist.github.com/hellerbarde/2843375
        
           | LeifCarrotson wrote:
           | I wonder how many people have built failed businesses that
           | never had enough customer data to exceed the DDR4 in the
           | average developer laptop, and never had so many simultaneous
           | queries it couldn't be handled by a single core running
           | SQLite, but built the software architecture on a distributed
           | cloud system just in case it eventually scaled to hundreds of
           | terabytes and billions of simultaneous queries.
        
             | Repulsion9513 wrote:
             | A _LOT_... especially here.
        
             | malfist wrote:
             | I totally hear you about that. I work for FAANG, and I'm
             | working on a service that has to be capable of sending 1.6m
             | text messages in less than 10 minutes.
             | 
             | The amount of complexity the architecture has because of
             | those constraints is insane.
             | 
             | When I worked at my previous job, management kept asking
             | for that scale of designs for less than 1/1000 of the
              | throughput, and I was constantly pushing back. There are
              | real costs to building for more scale than you need. It's
              | not as simple as just tweaking a few things.
             | 
             | To me there's a couple of big breakpoints in scale:
             | 
             | * When you can run on a single server
             | 
             | * When you need to run on a single server, but with HA
             | redundancies
             | 
             | * When you have to scale beyond a single server
             | 
             | * When you have to adapt your scale to deal with the limits
              | of a distributed system, i.e. designing for DynamoDB's
             | partition limits.
             | 
              | Each step in that chain adds irrevocable complexity, adds
              | to OE, adds to cost to run and cost to build. Be sure you
              | have to take those steps before you decide to.
        
               | disqard wrote:
               | I'm trying to guess what "OE" stands for... over
               | engineering? operating expenditure? I'd love to know what
               | you meant :)
        
               | madisp wrote:
               | probably operating expenses
        
               | malfist wrote:
               | Sorry, thought it was a common term. Operational
               | Excellence. All the effort and time it takes to keep a
               | service online, on call included
        
               | kuschku wrote:
               | Maybe I'm misunderstanding something, but that's about
               | 2700 a second. Or about 3Mbps.
               | 
               | Even a very unoptimized application running on a dev
               | laptop can serve 1Gbps nowadays without issues.
               | 
               | So what are the constraints that demand a complex
               | architecture?
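                | 
                | The arithmetic (the 140-byte message size is an assumed
                | short-SMS payload, not a figure from the thread):
                | 
                |     messages = 1_600_000
                |     window_s = 10 * 60
                |     rate = messages / window_s       # ~2,667/sec
                |     mbps = rate * 140 * 8 / 1e6      # ~3 Mbps
                |     print(round(rate), round(mbps, 1))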
        
               | goguy wrote:
               | That really doesn't require that much complexity.
               | 
               | I used to send something like 250k a minute complete with
               | delivery report processing from a single machine running
               | a bunch of other services like 10 years ago.
        
             | icedchai wrote:
             | Many. I regularly see systems built for "big data", built
             | for scale using "serverless" and some proprietary cloud
             | database (like DynamoDB), storing a few hundred megabytes
             | total. 20 years ago we would've built this on PHP and MySQL
             | and called it a day.
        
             | Szpadel wrote:
              | In my day job I often see the opposite. Especially for
              | database queries: developers tested on a local machine with
              | hundreds of records and everything was quick and snappy,
              | and then in production, with mere millions of records, I
              | often see queries taking minutes up to an hour, just
              | because some developer didn't see the need to create
              | indexes, or wrote the query in a way where there is no way
              | to even create an index that would work.
        
               | layer8 wrote:
               | That's true, but has little to do with distributed cloud
               | architecture vs. single local instance.
        
             | kristopolous wrote:
             | You're not considered serious if you don't. Kinda stupid.
        
           | BackBlast wrote:
            | You're missing the purpose of the cache. At least for this
            | argument, it's mostly for caching network responses.
            | 
            | HDD was 10 ms, which was noticeable for a cached network
            | request that needs to go back out on the wire. It was also
            | bottlenecked by IOPS: after 100-150 IOPS you were done. You
            | could do a bit better with RAID, but not the 2-3 orders of
            | magnitude you really needed to be an effective cache. So it
            | just couldn't work as a serious cache; the next step up was
            | RAM. This is the operational environment in which Redis and
            | similar memory caches evolved.
            | 
            | 40 us latency is fine for caching. Even the high-load
            | 500-600 us latency is fine for the network request cache
            | purpose. You can buy individual drives with > 1 million read
            | IOPS. Plenty for a good cache. HDD couldn't fit the bill for
            | the above reasons. RAM is faster, no question, but the lower
            | latency of RAM over SSD isn't really helping performance
            | here, as the network latency is dominating.
            | 
            | A 2023 Rails conference talk mentions this. They moved from a
            | memory-based cache system to an SSD-based cache system. The
            | Redis RAM-based system latency was 0.8 ms and the SSD-based
            | system was 1.2 ms for a known workload. Which is fine. It
            | saves you a couple of orders of magnitude on cost, and you
            | can do much, much larger and more aggressive caching with
            | the extra space.
            | 
            | Oftentimes these RAM caching servers are a network hop away
            | anyway, or at least a loopback TCP request, making the
            | question of comparing SSD latency to RAM largely irrelevant.
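            | 
            | A toy calculation of the end-to-end numbers (the 500 us
            | network round trip is an assumed figure):
            | 
            |     # Cache-hit time seen by a client one network hop away.
            |     network_rtt_us = 500
            |     for medium, media_us in [("RAM", 0.1),
            |                              ("NVMe SSD", 40),
            |                              ("HDD", 10_000)]:
            |         print(medium, network_rtt_us + media_us, "us")
            |     # RAM and SSD end up within ~10% of each other; only
            |     # the HDD blows the budget, which is the point above.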
        
         | jeffbee wrote:
         | "I will simply have another box for redundancy" is already a
         | system so complex that having it in or out of the cloud won't
         | make a difference.
        
           | Nextgrid wrote:
           | It really depends on business requirements. Real-time
           | redundancy is hard. Taking backups at 15-min intervals and
           | having the standby box merely pull down the last backup when
           | starting up is much easier, and this may actually be fine for
           | a lot of applications.
           | 
           | Unfortunately very few actually think about failure modes,
           | set realistic targets, and actually _test_ the process.
           | Everyone _thinks_ they need 100% uptime and consistency, few
           | actually achieve it in practice (many think they do, but when
            | shit hits the fan it uncovers an edge-case they haven't
           | thought of), but it turns out that in most cases it doesn't
           | matter and they could've saved themselves a lot of trouble
           | and complexity.
        
             | littlestymaar wrote:
             | So much this.
             | 
              | If GitHub can afford the amount of downtime they have, it's
              | likely that your business can afford 15 minutes of downtime
              | every once in a while due to a failing server.
              | 
              | Also, the fewer servers you have overall, the less common
              | failures will be.
              | 
              | Backups and a cold failover server are mandatory, but
              | anything past that should be weighed on a rational
              | cost/benefit analysis, and for most people the cost/benefit
              | ratio just isn't enough to justify the infrastructure
              | complexity.
        
       | zokier wrote:
       | > Since then, several NVMe instance types, including i4i and
       | im4gn, have been launched. Surprisingly, however, the performance
       | has not increased; seven years after the i3 launch, we are still
       | stuck with 2 GB/s per SSD.
       | 
        | AWS marketing claims otherwise:
        | 
        |     Up to 800K random write IOPS
        |     Up to 1 million random read IOPS
        |     Up to 5600 MB/second of sequential writes
        |     Up to 8000 MB/second of sequential reads
       | 
       | https://aws.amazon.com/blogs/aws/new-storage-optimized-amazo...
        
         | sprachspiel wrote:
          | This is for 8 SSDs, and a single modern PCIe 5.0 drive has
          | better specs than this.
        
           | jeffbee wrote:
           | Those claims are per device. There isn't even an instance in
           | that family with 8 devices.
        
           | nik_0_0 wrote:
           | Is it? The line preceding the bullet list on that page seems
           | to state otherwise:
           | 
           | ""                 Each storage volume can deliver the
           | following performance (all measured using 4 KiB blocks):
           | * Up to 8000 MB/second of sequential reads
           | 
           | ""
        
             | sprachspiel wrote:
              | Just tested an i4i.32xlarge:
              | 
              |     $ lsblk
              |     NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
              |     loop0          7:0    0  24.9M  1 loop /snap/amazon-ssm-agent/7628
              |     loop1          7:1    0  55.7M  1 loop /snap/core18/2812
              |     loop2          7:2    0  63.5M  1 loop /snap/core20/2015
              |     loop3          7:3    0 111.9M  1 loop /snap/lxd/24322
              |     loop4          7:4    0  40.9M  1 loop /snap/snapd/20290
              |     nvme0n1      259:0    0     8G  0 disk
              |     +-nvme0n1p1  259:1    0   7.9G  0 part /
              |     +-nvme0n1p14 259:2    0     4M  0 part
              |     +-nvme0n1p15 259:3    0   106M  0 part /boot/efi
              |     nvme2n1      259:4    0   3.4T  0 disk
              |     nvme4n1      259:5    0   3.4T  0 disk
              |     nvme1n1      259:6    0   3.4T  0 disk
              |     nvme5n1      259:7    0   3.4T  0 disk
              |     nvme7n1      259:8    0   3.4T  0 disk
              |     nvme6n1      259:9    0   3.4T  0 disk
              |     nvme3n1      259:10   0   3.4T  0 disk
              |     nvme8n1      259:11   0   3.4T  0 disk
              | 
              | Since nvme0n1 is the EBS boot volume, we have 8 SSDs. And
              | here's the read bandwidth for one of them:
              | 
              |     $ sudo fio --name=bla --filename=/dev/nvme2n1 \
              |         --rw=read --iodepth=128 --ioengine=libaio \
              |         --direct=1 --blocksize=16m
              |     bla: (g=0): rw=read, bs=(R) 16.0MiB-16.0MiB,
              |         (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB,
              |         ioengine=libaio, iodepth=128
              |     fio-3.28
              |     Starting 1 process
              |     ^Cbs: 1 (f=1): [R(1)][0.5%][r=2704MiB/s][r=169 IOPS][eta 20m:17s]
             | 
             | So we should have a total bandwidth of 2.7*8=21 GB/s. Not
             | that great for 2024.
        
               | Nextgrid wrote:
               | If you still have this machine, I wonder if you can get
               | this bandwidth in parallel across all SSDs? There could
               | be some hypervisor-level or host-level bottleneck that
               | means while any SSD in isolation will give you the
               | observed bandwidth, you can't actually reach that if you
               | try to access them all in parallel?
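                | 
                | One way to check (a sketch; it assumes fio is installed
                | and that the eight data devices are named as in the
                | lsblk output above):
                | 
                |     # Drive all 8 instance-store NVMe devices at once and
                |     # compare the aggregate to 8x the single-drive figure.
                |     import subprocess
                |     devs = [f"/dev/nvme{i}n1" for i in range(1, 9)]
                |     cmd = ["sudo", "fio", "--direct=1", "--rw=read",
                |            "--bs=16m", "--iodepth=128",
                |            "--ioengine=libaio", "--runtime=30",
                |            "--time_based", "--group_reporting"]
                |     for i, dev in enumerate(devs):
                |         cmd += [f"--name=job{i}", f"--filename={dev}"]
                |     subprocess.run(cmd, check=True)
                |     # group_reporting prints one aggregate bandwidth line.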
        
               | Aachen wrote:
               | So if I'm reading it right, the quote from the original
               | article that started this thread was ballpark correct?
               | 
               | > we are still stuck with 2 GB/s per SSD
               | 
               | Versus the ~2.7 GiB/s your benchmark shows (bit hard to
               | know where to look on mobile with all that line-wrapped
               | output, and when not familiar with the fio tool; not your
               | fault but that's why I'm double checking my conclusion)
        
               | dangoodmanUT wrote:
               | that's 16m blocks, not 4k
        
               | wtallis wrote:
               | Last I checked, Linux splits up massive IO requests like
               | that before sending them to the disk. But there's no
               | benefit to splitting a sequential IO request all the way
               | down to 4kB.
        
               | zokier wrote:
                | I wonder if there is some tuning that needs to be done
                | here; it seems surprising that the advertised rate would
                | be this far off otherwise.
        
               | jeffbee wrote:
               | I would start with the LBA format, which is likely to be
               | suboptimal for compatibility.
        
               | zokier wrote:
                | somehow I4g drives don't like to get formatted
                | 
                |     # nvme format /dev/nvme1 -n1 -f
                |     NVMe status: INVALID_OPCODE: The associated command
                |         opcode field is not valid(0x2001)
                |     # nvme id-ctrl /dev/nvme1 | grep oacs
                |     oacs      : 0
                | 
                | but the LBA format indeed is sus:
                | 
                |     LBA Format  0 : Metadata Size: 0 bytes -
                |         Data Size: 512 bytes -
                |         Relative Performance: 0 Best (in use)
        
               | jeffbee wrote:
               | It's a shame. The recent "datacenter nvme" standards
               | involving fb, goog, et al mandate 4K LBA support.
        
               | dekhn wrote:
                | Can you adjust --blocksize to correspond to the block
                | size on the device? And with/without --direct=1?
        
       | cogman10 wrote:
       | There's a 4th option. Cost.
       | 
       | The fastest SSDs tend to also be MLC which tend to have much
       | lower write life vs other technologies. This isn't unusual,
       | increasing data density generally also makes it easier to
       | increase performance. However, it's at the cost that the writes
       | are typically done for a block/cell in memory rather than for
       | single bits. So if one cell goes bad, they all fail.
       | 
       | But even if that's not the problem, there is a problem of
       | upgrading the fleet in a cost effective mechanism. When you start
       | introducing new tech into the stack, replacing that tech now
       | requires your datacenters to have 2 different types of hardware
       | on hand AND for the techs swapping drives to have a way to
       | identify and replace that stuff when it goes bad.
        
       | c0l0 wrote:
       | Seeing the really just puny "provisioned IOPS" numbers on hugely
       | expensive cloud instances made me chuckle (first in disbelief,
       | then in horror) when I joined a "cloud-first" enterprise shop in
       | 2020 (having come from a company that hosted their own hardware
       | at a colo).
       | 
       | It's no wonder that many people nowadays, esp. those who are so
       | young that they've never experienced anything but cloud
       | instances, seem to have little idea of how much performance you
       | can actually pack in just one or two RUs today. Ultra-fast (I'm
       | not parroting some marketing speak here - I just take a look at
       | IOPS numbers, and compare them to those from highest-end storage
       | some 10-12 years ago) NVMe storage is a _big_ part of that
       | astonishing magic.
        
       | zokier wrote:
        | Does anyone have disk benchmarks for M7gd (or C/R equivalent)
        | instance stores? While probably not at the level of I4g, it
        | would still be an interesting comparison.
        
       | Twirrim wrote:
       | Disclaimer: I work for OCI, opinion my own etc.
       | 
       | We offer faster NVMe drives in instances. Our E4 Dense shapes
       | ship with SAMSUNG MZWLJ7T6HALA-00AU3, which supports Sequential
        | Reads of 7000 MB/s and Sequential Writes of 3800 MB/s.
       | 
        | From a general perspective, I would say the _likely_ answer to
        | why AWS doesn't have faster NVMes at the moment is a lack of
        | specific demand. That's a guess, but that's generally how things
        | go. If there's not enough specific demand for faster disks being
        | fed in through TAMs and the like, upgrades are likely to be more
        | of an afterthought, or to reflect supply-chain realities.
        | 
        | I know there's a tendency, when you engineer things, to just
        | work around or work with the constraints and grumble amongst
        | your team, but it's invaluable if you can make sure your account
        | manager knows what shortcomings you've had to work around.
        
       | fabioyy wrote:
        | it is not worth using the cloud if you need a lot of
        | iops/bandwidth
       | 
        | heck, it's not worth it for anything besides scalability
       | 
       | dedicated servers are wayyyy cheaper
        
         | kstrauser wrote:
         | I'm not certain that's true if you look at TCO. Yes, you can
         | probably buy a server for less than the yearly rent on the
         | equivalent EC2 instance. But then you've got to put that server
         | somewhere, with reliable power and probably redundant Internet
         | connections. You have to pay someone's salary to set it up and
         | load it to the point that a user can SSH in and configure it.
         | You have to maintain an inventory of spares, and pay someone to
         | swap it out if it breaks. You have to pay to put its backups
         | somewhere.
         | 
         | Yeah, you _can_ skip a lot of that if your goal is to get _a_
         | server online as cheaply as possible, reliability be damned. As
         | soon as you start caring about keeping it in a business-ready
         | state, costs start to skyrocket.
         | 
         | I've worn the sysadmin hat. If AWS burned down, I'd be ready
         | and willing to recreate the important parts locally so that my
         | company could stay in business. But wow, would they ever be in
         | for some sticker shock.
        
           | Nextgrid wrote:
           | > But then you've got to put that server somewhere, with
           | reliable power and probably redundant Internet connections.
           | You have to pay someone's salary to set it up and load it to
           | the point that a user can SSH in and configure it. You have
           | to maintain an inventory of spares, and pay someone to swap
           | it out if it breaks.
           | 
           | There's a middle-ground between cloud and colocation. There
           | are plenty of providers such as OVH, Hetzner, Equinix, etc
           | which will do all of the above for you.
        
           | the8472 wrote:
           | At least in the workstation segment cloud doesn't compete. We
           | use Threadrippers + A6000 GPUs at work. Getting the
           | equivalent datacenter-type GPUs and EPYC processors is more
           | expensive, even after accounting for IT and utilization.
        
           | layer8 wrote:
           | Where I live, a number of SMEs are doing this. It's really
           | not that costly, unless you are a tiny startup I guess.
        
           | justsomehnguy wrote:
           | > as cheaply as possible, _reliability be damned_. As soon as
           | you start caring about keeping it in a business-ready state,
           | costs start to skyrocket.
           | 
           | The demand for five-nines is greatly exaggerated.
        
           | BackBlast wrote:
           | > I'm not certain that's true if you look at TCO.
           | 
           | Sigh. This old trope from ancient history in internet time.
           | 
           | > Yes, you can probably buy a server for less than the yearly
           | rent on the equivalent EC2 instance.
           | 
            | Or a monthly bill... I can oftentimes buy a higher-performing
            | server for the cost of renting one for a single month.
           | 
           | > But then you've got to put that server somewhere, with
           | reliable power and probably redundant Internet connections
           | 
           | Power:
           | 
           | The power problem is a lot lower with modern systems because
           | they can use a lot less of it per unit of compute/memory/disk
           | performance. Idle power has improved a lot too. You don't
           | need 700 watts of server power anymore for a 2 socket 8 core
           | monster that is outclassed by a modern $400 mini-pc that
           | maxes out at 45 watts.
           | 
            | You can buy server rack batteries now in a modern chemistry
            | that'll go 20 years with zero maintenance. A 4U-sized 5 kWh
            | unit costs $1,000-1,500. EVs have pushed battery costs down a
            | LOT. How much
           | do you really need? Do you even need a generator if your
           | battery just carries the day? Even if your power reliability
           | totally sucks.
           | 
           | Network:
           | 
           | Never been easier to buy network transfer. Fiber is available
           | in many places, even cable speeds are well beyond the past,
           | and there's starlink if you want to be fully resistant to
           | local power issues. Sure, get two vendors for redundancy.
           | Then you can hit cloud-style uptimes out of your closet.
           | 
            | Overlay networks like Tailscale put the networking issues
            | within the reach of almost anyone.
           | 
           | > You have to maintain an inventory of spares, and pay
           | someone to swap it out if it breaks. You have to pay to put
           | its backups somewhere.
           | 
           | Have you seen the size of M.2 sticks? Memory sticks? They
            | aren't very big... I happen to like opening up systems and
            | actually touching the hardware I use.
            | 
            | But yeah, if you just can't make it work, or can't be
            | bothered in the modern era of computing, then stick with the
            | cloud and the 10-100x premium they charge for their services.
        
       | pkstn wrote:
       | UpCloud has super fast MaxIOPS:
       | https://upcloud.com/products/block-storage
       | 
        | Here's a referral link with free credits:
       | https://upcloud.com/signup/?promo=J3JYWZ
        
       | rbranson wrote:
       | It's awfully similar to a lane of PCIe 4.0, if we're talking
       | about instance storage. It reads like they've chosen to map each
       | physical device to a single PCIe lane. Surely the AWS Nitro
       | hardware platform has longer cycle times than PCIe. Note that
       | once the instance type has multiple block devices exposed (i.e.
       | im4gn.8xlarge or higher), striping across these will reach higher
       | throughput (2 devices yields 4G/s, 4 yields 8G/s).
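        | 
        | A rough check of the PCIe 4.0 lane comparison (16 GT/s per lane
        | with 128b/130b encoding, as I recall the spec):
        | 
        |     lane_gb_per_s = 16e9 * (128 / 130) / 8 / 1e9
        |     print(round(lane_gb_per_s, 2))  # ~1.97 GB/s, i.e. ~2 GB/s
        |     # Striping N such devices scales linearly: 2 -> ~4 GB/s,
        |     # 4 -> ~8 GB/s, matching the figures above.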
        
       | 0cf8612b2e1e wrote:
       | Serious question, for a consumer does it make any sense to
       | compare SSD benchmarks? I assume the best and worst models give a
       | user an identical experience in 99% of cases, and it is only
       | prosumer activities (video? sustained writes?) which would
       | differentiate them.
        
         | wmf wrote:
         | Yeah, that's pretty much the case. Cheap SSDs provide good
         | enough performance for desktop use.
        
       | eisa01 wrote:
       | Would this be a consequence of the cloud providers not being on
       | the latest technology CPU-wise?
       | 
        | At least I have the impression they are lagging, e.g., still
        | offering things like:
        | 
        | z1d: Skylake (2017)
        | https://aws.amazon.com/ec2/instance-types/z1d/
        | 
        | x2i: Cascade Lake (2019) and Ice Lake (2021)
        | https://aws.amazon.com/ec2/instance-types/x2i/
       | 
       | I have not been able to find instances powered by the 4th (Q1
       | 2023) or 5th generation (Q4 2023) Xeons?
       | 
       | We solve large capacity expansion power market models that need
       | as fast single-threaded performance as possible coupled with lots
        | of RAM (a 32:1 ratio or higher is ideal). One model may take
        | 256-512 GB of RAM, but cannot use more than 4 threads effectively
        | (interior point algorithms have sharply diminishing returns past
        | that point).
       | 
       | Our dispatch models do not have the same RAM requirement, but you
       | still wish to have the fastest single-threaded processors
       | available (and then parallelize)
        
         | deadmutex wrote:
         | You can find Intel Sapphire Rapids powered VM instances on GCE
        
       | kwillets wrote:
       | AWS docs and blogs describe the Nitro SSD architecture, which is
       | locally attached with custom firmware.
       | 
       | > The Nitro Cards are physically connected to the system main
       | board and its processors via PCIe, but are otherwise logically
       | isolated from the system main board that runs customer workloads.
       | 
       | https://docs.aws.amazon.com/whitepapers/latest/security-desi...
       | 
       | > In order to make the [SSD] devices last as long as possible,
       | the firmware is responsible for a process known as wear
       | leveling.... There's some housekeeping (a form of garbage
       | collection) involved in this process, and garden-variety SSDs can
       | slow down (creating latency spikes) at unpredictable times when
       | dealing with a barrage of writes. We also took advantage of our
       | database expertise and built a very sophisticated, power-fail-
       | safe journal-based database into the SSD firmware.
       | 
       | https://aws.amazon.com/blogs/aws/aws-nitro-ssd-high-performa...
       | 
       | This firmware layer seems like a good candidate for the slowdown.
        
         | dan-robertson wrote:
         | Yeah, I'm curious how they would respond to the claims in the
         | article. In [1], they talk about aiming for low latency, for
         | consistent performance (apparently other SSDs could stall at
          | inopportune times), and for on-disk encryption. Latency is
         | often in direct conflict with throughput (eg batching usually
         | trades one for the other), and also matters a lot for plenty of
         | filesystem or database tasks (indeed the OP links to a paper
         | showing that popular databases, even column stores, struggle to
         | use the full disk throughput, though I didn't read why).
         | Encryption is probably not the reason - dedicated hardware on
         | modern chips can do AES at 50GB/s, though maybe it is if it
         | increases latency? So maybe there's something else to it like
         | sharing between many vms on one host
         | 
         | [1] https://m.youtube.com/watch?v=Cxie0FgLogg
        
           | kwillets wrote:
           | The Nitro chipset claims 100 GB/s encryption, so that doesn't
           | seem to be the reason.
        
         | someguydave wrote:
         | Yeah I wonder how Nitro balances the latency & bandwidth
         | demands of multiple VMs while also minimizing memory cache
         | misses on the CPU (I am assuming it uses DMA to talk to the
         | main CPU cores)
        
         | akira2501 wrote:
         | I've noticed the firmware programming positions have been more
         | common in their job listings lately.
        
       | dan-robertson wrote:
        | Is read/write throughput the only difference? E.g. I don't know
        | how latency compares, or failure rates, or whether fsync lies
        | and writes won't always survive power failures.
        
       | jiggawatts wrote:
       | There's a lot of talk about cloud network and disk performance in
       | this thread. I recently benchmarked both Azure and AWS and found
       | that:
       | 
       | - Azure network latency is about 85 microseconds.
       | 
       | - AWS network latency is about 55 microseconds.
       | 
       | - Both can do better, but only in special circumstances such as
       | RDMA NICs in HPC clusters.
       | 
       | - Cross-VPC or cross-VNET is basically identical. Some people
       | were saying it's terribly slow, but I didn't see that in my
       | tests.
       | 
       | - Cross-zone is 300-1200 microseconds due to the inescapable
       | speed of light delay.
       | 
       | - VM-to-VM bandwidth is over 10 Gbps (>1 GB/s) for both clouds,
       | even for the _smallest_ two vCPU VMs!
       | 
       | - Azure Premium SSD v1 latency varies between about 800 to 3,000
       | microseconds, which is many times worse than the network latency.
       | 
       | - Azure Premium SSD v2 latency is about 400 to 2,000
       | microseconds, which isn't that much better, because:
       | 
       | - Local SSD _caches_ in Azure are so much faster than remote disk
       | that we found that Premium SSD v1 is almost always faster than
        | Premium SSD v2 because the latter doesn't support caching.
       | 
       | - Again in Azure, the local SSD "cache" and also the local "temp
       | disks" both have latency as low as 40 microseconds, on par with a
       | modern laptop NVMe drive. We found that switching to the latest-
       | gen VM SKU and turning on the "read caching" for the data disks
       | was the magic "go-fast" button for databases... without the risk
        | of losing our data.
       | 
       | We investigated the various local-SSD VM SKUs in both clouds such
       | as the Lasv3 series, and as the article mentioned, the
       | performance delta didn't blow my skirt up, but the data loss risk
       | made these not worth the hassle.
        
         | computerdork wrote:
         | Interesting. And would you happen to have the numbers on the
          | performance of the local SSD? Is its read and write throughput
          | up to the level of modern SSDs?
        
           | jiggawatts wrote:
           | It's pretty much like how the article said. The cloud local
           | SSDs are notably slower than what you'd get in an ordinary
           | laptop, let alone a high-end server.
           | 
           | I'm not an insider and don't have any exclusive knowledge,
           | but from reading a lot about the topic my impression is that
           | the issue in both clouds is the virtualization overheads.
           | 
           | That is, having the networking or storage go through _any_
            | hypervisor software layer is what kills the performance.
            | I've seen similar numbers with on-prem VMware, Xen, and
           | Nutanix setups as well.
           | 
           | Both clouds appear to be working on next-generation VM SKUs
           | where the hypervisor network and storage functions are
           | offloaded into 100% hardware, either into FPGAs or custom
           | ASICs.
           | 
           | "Azure Boost" is Microsoft's marketing name for this, and it
           | basically amounts to both local and remote disks going
           | through an NVMe controller directly mapped into the memory
           | space of the VM. That is, the VM OS kernel talks _directly_
           | to the hardware, bypassing the hypervisor completely. This is
           | shown in their documentation diagrams:
           | https://learn.microsoft.com/en-us/azure/azure-boost/overview
           | 
           | They're claiming up to 3.8M IOPS for a single VM, which is
           | 3-10x what you'd get out of a single NVMe SSD stick, so...
           | not too shabby at all!
           | 
           | Similarly, Microsoft Azure Network Adapter (MANA) is the
           | equivalent for the NIC, which will similarly connect the VM
           | OS directly into the network, bypassing the hypervisor
           | software.
           | 
           | I'm not an AWS expert, but from what I've seen they've been
           | working on similar tech (Nitro) for years.
        
       | StillBored wrote:
        | It's worse than the article mentions, because bandwidth isn't
        | the problem; it's IOPS that are the problem.
        | 
        | Last time (about a year ago) I ran a couple of random IO
        | benchmarks against storage-optimized instances, and the random
        | IOPS behavior was closer to a large spinning RAID array than to
        | SSDs if the disk size was over some threshold.
        | 
        | IIRC, what it looks like is that there is a fast local SSD cache
        | with a couple hundred GB of storage, and the rest is backed by
        | remote spinning media.
        | 
        | It's one of the many reasons I have a hard time taking cloud
        | optimization seriously: the lack of direct tiering controls
        | means that database/etc-style workloads are not going to
        | optimize well, and that will end up costing a lot of $$$$$.
       | 
       | So, maybe it was the instance types/configuration I was using,
       | but <shrug> it was just something I was testing in passing.
        
       | Ericson2314 wrote:
       | The cloud really is a scam for those afraid of hardware
        
       ___________________________________________________________________
       (page generated 2024-02-20 23:00 UTC)