[HN Gopher] How AWS S3 serves 1 petabyte per second on top of sl...
       ___________________________________________________________________
        
       How AWS S3 serves 1 petabyte per second on top of slow HDDs
        
       Author : todsacerdoti
       Score  : 222 points
       Date   : 2025-09-24 10:05 UTC (12 hours ago)
        
 (HTM) web link (bigdata.2minutestreaming.com)
 (TXT) w3m dump (bigdata.2minutestreaming.com)
        
       | EwanToo wrote:
       | I think a more interesting article on S3 is "Building and
       | operating a pretty big storage system called S3"
       | 
       | https://www.allthingsdistributed.com/2023/07/building-and-op...
        
         | giancarlostoro wrote:
         | Really nice read, thank you for that.
        
           | enether wrote:
           | Author of the 2minutestreaming blog here. Good point! I'll
           | add this as a reference at the end. I loved that piece. My
           | goal was to be more concise and focus on the HDD aspect
        
         | dang wrote:
         | Discussed at the time:
         | 
         |  _Building and operating a pretty big storage system called S3_
         | - https://news.ycombinator.com/item?id=36894932 - July 2023
         | (160 comments)
        
       | crabique wrote:
       | Is there an open source service designed with HDDs in mind that
       | achieves similar performance? I know none of the big ones work
       | that well with HDDs: MinIO, Swift, Ceph+RadosGW, SeaweedFS; they
       | all suggest flash-only deployments.
       | 
       | Recently I've been looking into Garage and liking the idea of it,
       | but it seems to have a very different design (no EC).
        
         | giancarlostoro wrote:
          | Doing some light googling, aside from Ceph being listed, there's
          | one called Gluster as well. It hypes itself as "using common off-
         | the-shelf hardware you can create large, distributed storage
         | solutions for media streaming, data analysis, and other data-
         | and bandwidth-intensive tasks."
         | 
         | It's open source / free to boot. I have no direct experience
         | with it myself however.
         | 
         | https://www.gluster.org/
        
           | a012 wrote:
            | I've used GlusterFS before because I had tens of old PCs,
            | and it worked very well for me. It was more of a PoC to see
            | how it works than a production setup, though.
        
           | epistasis wrote:
           | A decade ago where I worked we used gluster for ~200TB of HDD
           | for a shared file system on a SLURM compute cluster, as a
            | much better clustered version of NFS. And we used Ceph for
            | its S3 interface (RadosGW) for tens of petabytes of back-end
            | storage after the high-IO stages of compute were finished.
            | The Ceph pool was all HDD, though later we added some SSDs
            | for a caching pool.
           | 
           | For single client performance, ceph beat the performance I
           | get from S3 today for large file copies. Gluster had
            | difficult-to-characterize performance, but our setup with big
            | fast RAID arrays seems to still outperform what I see of
            | AWS's Lustre-as-a-service today for our use case of long
           | sequential reads and writes.
           | 
           | We would occasionally try cephFS, the POSIX shared network
           | filesystem, but it couldn't match our gluster performance for
           | our workload. But also, we built the ceph long term storage
           | to maximize TB/$, so it was at a disadvantage compared to our
           | gluster install. Still, I never heard of cephFS being used
           | anywhere despite it being the original goal in the papers
           | back at UCSC. Keep an eye on CERN for news about one of the
           | bigger ceph installs with public info.
           | 
           | I love both of the systems, and see ceph used everywhere
           | today, but am surprised and happy to see that gluster is
           | still around.
        
           | mbreese wrote:
           | Gluster has been slowly declining for a while. It used to be
            | sponsored by Red Hat, but that stopped a few years ago. Since
           | then, development slowed significantly.
           | 
           | I used to keep a large cluster array with Gluster+ZFS
           | (1.5PB), and I can't say I was ever really that impressed
           | with the performance. That said -- I really didn't have
           | enough horizontal scaling to make it worthwhile from a
           | performance aspect. For us, it was mainly used to make a
           | union file system.
           | 
           | But, I can't say I'd recommend it for anything new.
        
         | elitepleb wrote:
         | Any of them will work just as well, but only with many
          | datacenters' worth of drives, which very few deployments can
         | target.
         | 
          | It's the classic horizontal/vertical scaling trade-off; that's
          | why flash tends to be more space/cost-efficient for speedy
          | access.
        
         | bayindirh wrote:
         | Lustre and ZFS can do similar speeds.
         | 
         | However, if you need high IOPS, you need flash on MDS for
         | Lustre and some Log SSDs (esp. dedicated write and read ones)
         | for ZFS.
        
           | crabique wrote:
           | Thanks, but I forgot to specify that I'm interested in
           | S3-compatible servers only.
           | 
           | Basically, I have a single big server with 80 high-capacity
           | HDDs and 4 high-endurance NVMes, and it's the S3 endpoint
           | that gets a lot of writes.
           | 
           | So yes, for now my best candidate is ZFS + Garage, this way I
           | can get away with using replica=1 and rely on ZFS RAIDz for
            | data safety, and the NVMes can get sliced and diced to act as
           | the fast metadata store for Garage, the "special"
           | device/small records store for the ZFS, the ZIL/SLOG device
           | and so on.
           | 
           | Currently it's a bit of a Frankenstein's monster: using
           | XFS+OpenCAS as the backing storage for an old version of
            | MinIO (containerized to run as 5 instances); I'm looking to
            | replace it with a simpler design and hopefully get better
            | performance.
        
             | foobarian wrote:
             | Do you know if some of these systems have components to
             | periodically checksum the data at rest?
        
               | bayindirh wrote:
                | ZFS/OpenZFS can scrub and do block-level recovery. I'm
                | not sure about Lustre, but since petabyte-sized storage
                | is its natural habitat, there should be at least one way
                | to handle that.
        
             | bayindirh wrote:
             | It might not be the most ideal solution, but did you
             | consider installing TrueNAS on that thing?
             | 
              | TrueNAS can handle the OpenZFS (RAIDZ, caches, and logs)
             | part and you can deploy Garage or any other S3 gateway on
             | top of it.
             | 
              | It can be an interesting experiment, and an 80-disk server
              | is not too big for a TrueNAS installation.
        
             | creiht wrote:
             | It is probably worth noting that most of the listed storage
             | systems (including S3) are designed to scale not only in
             | hard drives, but horizontally across many servers in a
             | distributed system. They really are not optimized for a
             | single storage node use case. There are also other things
              | to consider that can limit performance, like what the
              | storage backplane looks like for those 80 HDDs, and how
             | much throughput can you effectively push through that. Then
             | there is the network connectivity that will also be a
             | limiting factor.
        
               | crabique wrote:
               | It's a very beefy server with 4 NVMe and 20 HDD bays + a
               | 60-drive external enclosure, 2 enterprise grade HBA cards
               | set to multipath round-robin mode, even with 80 drives
               | it's nowhere near the data path saturation point.
               | 
               | The link is a 10G 9K MTU connection, the server is only
               | accessed via that local link.
               | 
               | Essentially, the drives being HDD are the only real
               | bottleneck (besides the obvious single-node scenario).
               | 
               | At the moment, all writes are buffered into the NVMes via
               | OpenCAS write-through cache, so the writes are very
               | snappy and are pretty much ingested at the rate I can
               | throw data at it. But the read/delete operations require
               | at least a metadata read, and due to the very high number
               | of small (most even empty) objects they take a lot more
               | time than I would like.
               | 
               | I'm willing to sacrifice the write-through cache benefits
                | (the write performance is actually overkill for my use
               | case), in order to make it a little more balanced for
               | better List/Read/DeleteObject operations performance.
               | 
               | On paper, most "real" writes will be sequential data, so
               | writing that directly to the HDDs should be fine, while
               | metadata write operations will be handled exclusively by
               | the flash storage, thus also taking care of the
               | empty/small objects problem.
        
               | dardeaup wrote:
               | Hope you don't have expectations (over the long run) for
               | high availability. At some point that server will come
               | down (planned or unplanned).
        
               | crabique wrote:
                | For sure, there are zero expectations for any kind of
               | hardware downtime tolerance, it's a secondary backup
               | storage cobbled together from leftovers over many years
               | :)
               | 
               | For software, at least with MinIO it's possible to do
               | rolling updates/restarts since the 5 instances in docker-
               | compose are enough for proper write quorum even with any
               | single instance down.
        
               | edude03 wrote:
               | > Essentially, the drives being HDD are the only real
               | bottleneck
               | 
                | On the low end a single HDD can deliver 100MB/s, so 80 can
                | deliver 8,000MB/s; a single NVMe can do 700MB/s and you
                | have 4, so 2,800MB/s. A 10Gb link can only do ~1,000MB/s,
                | so isn't your bottleneck the network, and then probably
                | the CPU?
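                | 
                | A rough sketch of that arithmetic in Python (all figures
                | are assumed round numbers, sequential workloads only):
                | 
                |   # Back-of-envelope throughput comparison (assumed numbers).
                |   HDD_MBPS = 100        # conservative sequential MB/s per HDD
                |   NVME_MBPS = 700       # low-end MB/s per NVMe drive
                |   LINK_GBPS = 10        # network link speed
                | 
                |   hdd_total = 80 * HDD_MBPS          # 8,000 MB/s from 80 HDDs
                |   nvme_total = 4 * NVME_MBPS         # 2,800 MB/s from 4 NVMe
                |   link_total = LINK_GBPS * 1000 / 8  # ~1,250 MB/s raw on 10Gb
                | 
                |   print(hdd_total, nvme_total, link_total)  # 8000 2800 1250.0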
        
             | uroni wrote:
             | I'm working on something that might be suited for this use-
             | case at https://github.com/uroni/hs5 (not ready for
             | production yet).
             | 
             | It would still need a resilience/cache layer like ZFS,
             | though.
        
             | toast0 wrote:
             | If you can afford it, mirroring in some form is going to
             | give you way better read perf than RAIDz. Using zfs mirrors
             | is probably easiest but least flexible, zfs copies=2 with
             | all devices as top level vdevs in a single zpool is not
             | very unsafe, and something custom would be a lot of work
             | but could get safety and flexibility if done right.
             | 
             | You're basically seek limited, and a read on a mirror is
             | one seek, whereas a read on a RAIDz is one seek per device
             | in the stripe. (Although if most of your objects are under
              | the chunk size, you end up with more mirroring than
              | striping.)
             | 
             | You lose on capacity though.
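              | 
              | A quick sketch of the seek math (assuming ~100 random
              | seeks/s per HDD and that a RAIDz read touches every data
              | disk in the stripe; both numbers are rough assumptions):
              | 
              |   # Seek-limited read rate for a pool of HDDs (assumed numbers).
              |   DISKS = 80
              |   SEEKS_PER_SEC = 100      # random seeks per second per HDD
              | 
              |   def read_iops(seeks_per_read):
              |       # Total seek budget divided by seeks consumed per read.
              |       return DISKS * SEEKS_PER_SEC / seeks_per_read
              | 
              |   print(read_iops(1))  # mirror: 1 seek/read  -> ~8,000 reads/s
              |   print(read_iops(8))  # 8-wide RAIDz stripe  -> ~1,000 reads/s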
        
             | epistasis wrote:
             | It's great to see other people's working solutions, thanks.
             | Can I ask if you have backup on something like this? In
             | many systems it's possible to store some data on ingress or
             | after processing, which serves as something that's
              | rebuildable, even if it's not a true backup. I'm not sure
              | whether your software layer has off-site backup as part of
              | its system, for example, which would be a great feature.
        
             | pjdesno wrote:
             | Ceph's S3 protocol implementation is really good.
             | 
             | Getting Ceph erasure coding set up properly on a big hard
             | disk pool is a pain - you can tell that EC was shoehorned
             | into a system that was totally designed around triple
             | replication.
        
         | olavgg wrote:
          | SeaweedFS has evolved a lot over the last few years, with RDMA
         | support and EC.
        
         | pickle-wizard wrote:
         | At a past job we had an object store that used SwiftStack. We
         | just used SSDs for the metadata storage but all the objects
         | were stored on regular HDDs. It worked well enough.
        
         | epistasis wrote:
         | I would say that Ceph+RadosGW works well with HDDs, as long as
          | 1) you use SSDs for the index pool, and 2) you are realistic
          | about the number of IOPS you can get out of your pool of HDDs.
          | 
          | And remember that there's a multiplication of IOPS for any
          | individual client IOP, whether you're using triplicate storage
          | or erasure coding. S3 also has IOP multiplication, which they
          | solve with tons of HDDs.
         | 
         | For big object storage that's mostly streaming 4MB chunks, this
         | is no big deal. If you have tons of small random reads and
         | writes across many keys or a single big key, that's when you
         | need to make sure your backing store can keep up.
        
         | kerneltime wrote:
         | Apache Ozone has multiple 100+ petabyte clusters in production.
          | The capacity is on HDDs and metadata is on SSDs. Updated docs
          | (staging for new docs):
          | https://kerneltime.github.io/ozone-site/
        
         | cullenking wrote:
         | We've been running a production ceph cluster for 11 years now,
         | with only one full scheduled downtime for a major upgrade in
         | all those years, across three different hardware generations. I
         | wouldn't call it easy, but I also wouldn't call it hard. I used
          | to run it with SSDs for RadosGW indexes as well as a fast pool
          | for some VMs, and hard drives for bulk object storage. Since I
          | was only running 5 nodes with 10 drives each, I was tired of
          | occasional IOPS issues under heavy recovery, so on the last
          | upgrade I just migrated to 100% NVMe drives. To mitigate the
          | price I just bought used enterprise Micron drives off eBay
          | whenever I saw a good deal pop up. Haven't had any performance
         | issues since then no matter what we've tossed at it. I'd
         | recommend it, though I don't have experience with the other
         | options. On paper I think it's still the best option. Stay away
         | from CephFS though, performance is truly atrocious and you'll
         | footgun yourself for any use in production.
        
       | nerdjon wrote:
       | So is any of S3 powered by SSD's?
       | 
       | I honestly figured that it must be powered by SSD for the
       | standard tier and the slower tiers were the ones using HDD or
       | slower systems.
        
         | MDGeist wrote:
         | I always assumed the really slow tiers were tape.
        
           | hobs wrote:
           | Not even the higher tiers of Glacier were tape afaict (at
           | least when it was first created), just the observation that
           | hard drives are much bigger than you can reasonably access in
           | useful time.
        
             | temp0826 wrote:
             | In the early days when there were articles speculating on
             | what Glacier was backed by, it was actually on crusty old
             | S3 gear (and at the very beginning, it was just on S3
              | itself as a wrapper and a hand-wavy price discount, eating
              | the costs to get people to buy into the idea!). Later on
             | (2018 or so) they began moving to a home grown tape-based
             | solution (at least for some tiers).
        
               | luhn wrote:
               | Do you have any sources for that? I'm really curious
               | about Glacier's infrastructure and AWS has been
               | notoriously tight-lipped about it. I haven't found
               | anything better than informed speculation.
        
               | iamtedd wrote:
               | My speculation: writes are to /dev/null, and the fact
               | that reads are expensive and that you need to inventory
               | your data before reading means Amazon is recreating your
               | data from network transfer logs.
        
               | luhn wrote:
               | Maybe they ask the NSA for a copy.
        
               | dekhn wrote:
               | That's surprising given how badly restoration worked
               | (much more like tape than drives).
        
               | roncesvalles wrote:
               | I'd be curious whether simulating a shitty restoration
               | experience was part of the emulation when they first ran
               | Glacier on plain S3 to test the market.
        
               | everfrustrated wrote:
              | I'm not aware of AWS ever confirming tape for Glacier. My
              | own speculation is that they likely use HDD for Glacier -
              | especially so for the smaller regions - and eat the cost.
              | 
              | Someone recently came across planning documents filed in
              | London for a small "datacenter" which wasn't attached to
              | their usual London compute DCs and was built to house tape
              | libraries (this was explicitly called out because there was
              | concern about power - tape libraries don't use much). So I
              | would be fairly confident they wait until the Glacier
              | volumes grow enough on HDD before building out tape infra.
        
           | g-mork wrote:
            | There might be surprisingly little value in going tape due to
            | all the specialization required. As the other comment
            | suggests, many of the lower tiers likely represent IO
            | bandwidth classes: a 16 TB disk with 100 IOPS can only offer
            | 1 IOPS over 160 GB each for 100 customers, or 0.1 IOPS over
            | 16 GB each for 1,000, etc. Just scale that thinking up to a
            | building full of disks; it still applies.
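            | 
            | Roughly, splitting one disk's budget evenly across tenants
            | (16 TB and 100 IOPS are assumed round numbers):
            | 
            |   DISK_TB, DISK_IOPS = 16, 100
            | 
            |   for tenants in (100, 1000):
            |       gb_each = DISK_TB * 1000 / tenants
            |       iops_each = DISK_IOPS / tenants
            |       print(tenants, gb_each, iops_each)
            |   # 100 tenants  -> 160 GB each at 1 IOPS
            |   # 1000 tenants -> 16 GB each at 0.1 IOPS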
        
             | pjdesno wrote:
             | See comments above about AWS per-request cost - if your
             | customers want higher performance, they'll pay enough to
             | let AWS waste some of that space and earn a profit on it.
        
           | derefr wrote:
            | My own assumption was always that the cold tiers are managed
            | by a tape _robot_, but one managing offlined HDDs rather than
            | actual tapes.
        
             | chippiewill wrote:
             | I think that's close to the truth. IIRC it's something like
             | a massive cluster of machines that are effectively powered
             | off 99% of the time with a careful sharding scheme where
             | they're turned on and off in batches over a long period of
             | time for periodic backup or restore of blobs.
        
         | rubiquity wrote:
         | > So is any of S3 powered by SSD's?
         | 
          | S3's KeyMap Index uses SSDs. I also wouldn't be surprised if at
          | this point SSDs are somewhere along the read path for caching
          | hot objects, or in the new One Zone product.
        
         | electroly wrote:
         | It's assumed that the new S3 Express One Zone is backed by SSDs
         | but I believe Amazon doesn't say so explicitly.
        
           | rubiquity wrote:
            | I've always felt it's probably a wrapper around Amazon EFS,
            | due to the similar pricing and the fact that S3 Express One
            | Zone has "Directory" buckets, a very file-system-y idea.
        
         | yabones wrote:
         | The storage itself is probably (mostly) on HDDs, but I'd
         | imagine metadata, indices, etc are stored on much faster flash
         | storage. At least, that's the common advice for small-ish Ceph
         | cluster MDS servers. Obviously S3 is a few orders of magnitude
         | bigger than that...
        
         | pjdesno wrote:
         | Repeating a comment I made above - for standard tier, requests
         | are expensive enough that it's cost-effective to let space on
         | the disks go unused if someone wants an IOPS/TB ratio that's
         | higher than what disk drives can provide. But not much more
         | expensive than that.
         | 
         | The latest generation of drives store about 30TB - I don't know
         | how much AWS pays for them, but a wild-ass guess would be
         | $300-$500. That's a lot cheaper than 30TB of SSD.
         | 
         | Also important - you can put those disks in high-density
         | systems (e.g. 100 drives in 4U) that only add maybe 25% to the
         | total cost, at least if you're AWS, a bit more for the rest of
         | us. The per-slot cost of boxes that hold lots of SSDs seems to
         | be a lot higher.
        
         | UltraSane wrote:
         | I expect they are storing metadata on SSDs. They might have SSD
         | caches for really hot objects that get read a lot.
        
       | wg0 wrote:
        | Does anyone know what the technology stack of S3 is? Monolith or
        | multiple services?
        | 
        | I assume it would have lots of queues, caches, and long-running
        | workers.
        
         | jyscao wrote:
         | > conway's law and how it shapes S3's architecture (consisting
         | of 300+ microservices)
        
         | Twirrim wrote:
          | Amazon biases towards a Systems-Oriented Architecture approach
          | that is in the middle ground between monolith and
          | microservices.
          | 
          | It biases away from lots of small services in favour of larger
          | ones that handle more of the work, so that as much as possible
          | you avoid the costs and latency of preparing, transmitting,
          | receiving, and processing requests.
         | 
         | I know S3 has changed since I was there nearly a decade ago, so
         | this is outdated. Off the top of my head it used to be about a
         | dozen main services at that time. A request to put an object
         | would only touch a couple of services en route to disk, and
         | similar on retrieval. There were a few services that handled
         | fixity and data durability operations, the software on the
         | storage servers themselves, and then stuff that maintained the
         | mapping between object and storage.
        
           | taeric wrote:
           | Amusingly, I suspect that the "dozen main services" is still
           | quite a few more than most smaller companies would consider
           | on their stacks.
        
             | Twirrim wrote:
             | Probably. Conway's law comes into effect, naturally.
        
         | hnexamazon wrote:
         | I was an SDE on the S3 Index team 10 years ago, but I doubt
         | much of the core stack has changed.
         | 
          | S3 is composed primarily of layers of Java-based web services.
          | The hot-path operations (object get / put / list) are all served
          | by synchronous API servers - no queues or workers. It is the best
         | example of how many transactions per second a pretty standard
         | Java web service stack can handle that I've seen in my career.
         | 
         | For a get call, you first hit a fleet of front-end HTTP API
         | servers behind a set of load balancers. Partitioning is based
         | on the key name prefixes, although I hear they've done work to
         | decouple that recently. Your request is then sent to the
         | Indexing fleet to find the mapping of your key name to an
         | internal storage id. This is returned to the front end layer,
         | which then calls the storage layer with the id to get the
         | actual bits. It is a very straightforward multi-layer
         | distributed system design for serving synchronous API responses
         | at massive scale.
         | 
          | The only novel bit is that all the backend communication uses a
          | home-grown, stripped-down HTTP variant, called STUMPY if I
          | recall. It was a dumb idea not to just use HTTP, but the service
         | is ancient and originally built back when principal engineers
         | were allowed to YOLO their own frameworks and protocols so now
         | they are stuck with it. They might have done the massive lift
         | to replace STUMPY with HTTP since my time.
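          | 
          | Very roughly, the GET path described above looks like this
          | (a minimal sketch; all class and method names are made up,
          | not the real internal services):
          | 
          |   # Illustrative sketch of the layered, synchronous GET path.
          |   class IndexFleet:
          |       """Maps (bucket, key) names to internal storage ids."""
          |       def lookup(self, bucket: str, key: str) -> str:
          |           return f"storage-id-for:{bucket}/{key}"  # placeholder
          | 
          |   class StorageFleet:
          |       """Holds the object bytes, addressed by storage id."""
          |       def read(self, storage_id: str) -> bytes:
          |           return b"...object bytes..."             # placeholder
          | 
          |   class FrontEnd:
          |       """HTTP API layer behind the load balancers."""
          |       def __init__(self):
          |           self.index, self.storage = IndexFleet(), StorageFleet()
          | 
          |       def get_object(self, bucket: str, key: str) -> bytes:
          |           sid = self.index.lookup(bucket, key)  # hop 1: index
          |           return self.storage.read(sid)         # hop 2: storage
          | 
          |   print(FrontEnd().get_object("my-bucket", "photos/cat.jpg"))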
        
           | js4ever wrote:
           | "It is the best example of how many transactions per second a
           | pretty standard Java web service stack can handle that I've
           | seen in my career."
           | 
           | can you give some numbers? or at least ballpark?
        
             | hnexamazon wrote:
             | Tens of thousands of TPS per node.
        
           | derefr wrote:
           | > The hot path (... list) are all served by synchronous API
           | servers
           | 
           | Wait; how does that work, when a user is PUTting tons of
           | objects concurrently into a bucket, and then LISTing the
           | bucket during that? If the PUTs are all hitting different
           | indexing-cluster nodes, then...?
           | 
           | (Or do you mean that there _are_ queues /workers, but only
           | outside the hot path; with hot-path requests emitting events
           | that then get chewed through async to do things like cross-
           | shard bucket metadata replication?)
        
             | pjdesno wrote:
             | LIST is dog slow, and everyone expects it to be. (my
             | research group did a prototype of an ultra-high-speed
             | S3-compatible system, and it really helps not needing to
             | list things quickly)
        
           | rubiquity wrote:
           | Rest assured STUMPY was replaced with another home grown
           | protocol! Though I think a stream oriented protocol is a
           | better match for large scale services like S3 storage than a
           | synchronous protocol like HTTP.
        
           | gregates wrote:
            | It's not all Java anymore. There's some Rust now, too -
            | ShardStore, at least (which the article mentions).
        
           | master_crab wrote:
           | _Partitioning is based on the key name prefixes, although I
           | hear they've done work to decouple that recently._
           | 
           | They may still use key names for partitioning. But they now
           | randomly hash the user key name prefix on the back end to
           | handle hotspots generated by similar keys.
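            | 
            | Something in that spirit, as a purely illustrative sketch
            | (not the actual scheme):
            | 
            |   import hashlib
            | 
            |   def internal_key(user_key: str, prefix_len: int = 16) -> str:
            |       # Hash the user-visible prefix so lexically adjacent
            |       # prefixes ("logs/2025-09-24/", "logs/2025-09-25/") map
            |       # to unrelated spots in the partitioned keyspace.
            |       prefix = user_key[:prefix_len]
            |       digest = hashlib.sha256(prefix.encode()).hexdigest()[:8]
            |       return f"{digest}/{user_key}"
            | 
            |   print(internal_key("logs/2025-09-24/server1.gz"))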
        
         | riknos314 wrote:
         | Microservices for days.
         | 
         | I worked on lifecycle ~5 years ago and just the Standard ->
         | Glacier transition path involved no fewer than 7 microservices.
         | 
         | Just determining which of the 400 trillion keys are eligible
         | for a lifecycle action (comparing each object's metadata
         | against the lifecycle policy on the bucket) is a _massive_ big
         | data job.
         | 
          | It was always a fun on-call when some bucket added a lifecycle
          | rule that queued 1PB+ of data for transition or deletion on the
          | same day. At the time our queuing had become good enough to
          | handle these queues gracefully, but our alarming hadn't figured
          | out how to differentiate between the backlog of a single
          | customer with a huge job and the whole system failing to process
          | quickly enough. IIRC this was being fixed as I left.
        
           | rubiquity wrote:
           | I used to work on the backing service for S3's Index and the
           | daily humps in our graphs from lifecycle running were
           | immense!
        
         | charltones wrote:
         | There's a pretty good talk on S3 under the hood from last
         | year's re:Invent: https://www.youtube.com/watch?v=NXehLy7IiPM
        
         | pjdesno wrote:
         | The only scholarly paper they've written about it is this one:
         | https://www.amazon.science/publications/using-lightweight-fo...
         | 
         | (well, I think they may have submitted one or two others, but
         | this is the only one that got published)
        
       | dgllghr wrote:
       | I enjoyed this article but I think the answer to the headline is
       | obvious: parallelism
        
         | ffsm8 wrote:
         | That's like saying "how to get to the moon is obvious:
         | traveling"
        
           | crazygringo wrote:
           | I dunno, the article's tl;dr is just parallelism.
           | 
           | Data gets split into redundant copies, and is rebalanced in
           | response to hot spots.
           | 
           | Everything in this article is the obvious answer you'd
           | expect.
        
             | toolslive wrote:
              | It's not really "redundant copies". It's erasure coding
              | (i.e., your data is recovered as the solution of an
              | overdetermined system of equations).
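              | 
              | A toy numeric illustration of that idea (real systems use
              | Reed-Solomon over finite fields, not floats; k and n here
              | are arbitrary toy sizes):
              | 
              |   import numpy as np
              | 
              |   k, n = 3, 5                   # 3 data values -> 5 shards
              |   data = np.array([7.0, 2.0, 9.0])
              | 
              |   # Each shard stores a different linear combination of the
              |   # data (rows of a Vandermonde matrix), so the shards
              |   # overdetermine the data.
              |   A = np.vander(np.arange(1.0, n + 1), k, increasing=True)
              |   shards = A @ data
              | 
              |   # Lose any n-k shards; any k surviving rows still form a
              |   # solvable square system.
              |   surviving = [0, 2, 4]
              |   print(np.linalg.solve(A[surviving], shards[surviving]))
              |   # -> [7. 2. 9.]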
        
           | DoctorOW wrote:
           | Thank you for setting me up for this...
           | 
           | It's not exactly rocket science.
        
         | timeinput wrote:
         | I generally don't think about storage I/O speed at that scale
         | (I mean really who does?). I once used a RAID0 to store data to
         | HDDs faster, but that was a long time ago.
         | 
         | I would have naively guessed an interesting caching system, and
         | to some degree tiers of storage for hot vs cold objects.
         | 
         | It was obvious after I read the article that parallelism was a
         | great choice, but I definitely hadn't considered the detailed
         | scheme of S3, or the error correction it used. Parallelism is
         | the one word summary, but the details made the article worth
          | reading. I bet MinIO also has a similar scaling story:
          | parallelism.
        
           | 0x457 wrote:
            | My homelab servers all have RAIDZ out of 3 NVMe drives for
            | this reason: higher parallelism without losing redundancy.
           | 
           | > I would have naively guessed an interesting caching system,
           | and to some degree tiers of storage for hot vs cold objects.
           | 
            | Caching in this scenario is usually done outside of S3, in
            | something like CloudFront.
        
           | MrDarcy wrote:
           | If you're curious about this at home, try Ceph in Proxmox.
        
             | glitchcrab wrote:
              | Unless you have a large cluster with many tens of
              | nodes/OSDs (and who does in a homelab?), using Ceph is a
              | bad idea (I've run large Ceph clusters at previous jobs).
        
           | UltraSane wrote:
           | AWS themselves have bragged that the biggest S3 buckets are
           | striped across over 1 million hard drives. This doesn't mean
           | they are using all of the space of all these drives, because
           | one of the key concepts of S3 is to average IO of many
           | customers over many drives.
        
         | gregates wrote:
         | I think the article's title question is a bit misleading
         | because it focuses on peak throughput for S3 as a whole. The
         | interesting question is "How can the throughput for a GET
         | exceed the throughput of an HDD?"
         | 
         | If you just replicated, you could still get big throughput for
         | S3 as a whole by doing many reads that target different HDDs.
         | But you'd still be limited to max HDD throughput * number of
         | GETs. S3 is not so limited, and that's interesting and non-
         | obvious!
        
         | UltraSane wrote:
          | Millions of hard drives cumulatively have enormous IO bandwidth.
        
       | ttfvjktesd wrote:
       | > tens of millions of disks
       | 
        | If we assume enterprise HDDs in the double-digit TB range, then
        | one can estimate that the total S3 storage volume of AWS is in
        | the triple-digit exabyte range. That's probably the biggest
        | storage system on planet Earth.
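        | 
        | Back of the envelope (both numbers assumed):
        | 
        |   drives = 20_000_000   # "tens of millions", assumed midpoint
        |   tb_per_drive = 20     # double-digit TB enterprise drives
        |   print(drives * tb_per_drive / 1_000_000)  # 400.0 EB raw,
        |                                             # before EC overhead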
        
         | yanslookup wrote:
          | Production-scale enterprise HDDs are in the 30TB range, with
          | 50TB on the horizon...
        
           | pjdesno wrote:
           | Google for Seagate Mozaic
        
         | threeducks wrote:
         | A certain data center in Utah might top that, assuming that
         | they have upgraded their hardware since 2013.
         | 
         | https://www.forbes.com/sites/kashmirhill/2013/07/24/blueprin...
        
           | ttfvjktesd wrote:
           | This piece is interesting background, but worth noting that
           | the actual numbers are highly speculative. The NSA has never
           | disclosed hard data on capacity, and most of what's out there
           | is inference from blueprints, water/power usage, or second-
           | hand claims. No verifiable figures exist.
        
       | sciencesama wrote:
        | Can we replicate something similar to this for a homelab?
        
         | ai-christianson wrote:
         | garage
        
       | gregates wrote:
       | A few factual inaccuracies in here that don't affect the general
       | thrust. For example, the claim that S3 uses a 5:9 sharding
       | scheme. In fact they use many different sharding schemes, and
       | iirc 5:9 isn't one of them.
       | 
        | The main reason is that a ratio of 1.8 physical bytes to 1
       | logical byte is awful for HDD costs. You can get that down
       | significantly, and you get wider parallelism and better
       | availability guarantees to boot (consider: if a whole AZ goes
       | down, how many shards can you lose before an object is
       | unavailable for GET?).
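        | 
        | As a sketch of that trade-off (5-of-9 is the article's example;
        | the real schemes differ, and an even spread over 3 AZs is
        | assumed):
        | 
        |   def describe(k, n, azs=3):
        |       overhead = n / k           # physical bytes per logical byte
        |       per_az = n // azs          # shards lost if one AZ goes down
        |       survives = (n - per_az) >= k
        |       print(f"{k}-of-{n}: {overhead:.2f}x overhead, "
        |             f"survives an AZ loss: {survives}")
        | 
        |   describe(5, 9)    # 1.80x overhead, still readable with 6 shards
        |   describe(10, 15)  # a hypothetical wider scheme: 1.50x overhead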
        
         | UltraSane wrote:
         | VAST data uses 146+4
         | 
         | https://www.vastdata.com/whitepaper/#similarity-reduction-in...
        
           | bombela wrote:
            | The page loads, then quickly jumps up as some video loads,
            | and the content is gone.
        
       | Const-me wrote:
       | > full-platter seek time: ~8ms; half-platter seek time (avg):
       | ~4ms
       | 
        | The average distance between two points (the current location and
        | the target location), when both are uniformly distributed in the
        | [0, 1] interval, is not 0.5; it's 1/3. If the full-platter seek
        | time is 8ms, the average seek time should be ~2.67ms.
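        | 
        | The 1/3 comes from the expected absolute difference of two
        | independent uniform variables:
        | 
        |   E|X - Y| = \int_0^1 \int_0^1 |x - y| \, dx \, dy
        |            = 2 \int_0^1 \int_0^y (y - x) \, dx \, dy
        |            = 2 \int_0^1 \frac{y^2}{2} \, dy = \frac{1}{3}
        | 
        | so a purely distance-proportional model gives 8ms / 3 ~= 2.67ms.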
        
         | nick49488171 wrote:
          | The read head has to accelerate to move between tracks, so it
          | may well be 4ms: shorter distances are penalized by a lower
          | peak speed of the read head as well as constant factors
          | (settling at the end of the motion).
        
         | pjdesno wrote:
         | Full seek on a modern drive is a lot closer to 25ms than 8ms.
         | It's pretty easy to test yourself if you have a hard drive in
         | your machine and root access - fire up fio with --readonly and
         | feed it a handmade trace that alternates reading blocks at the
         | beginning and end of disk. (--readonly does a hard disable of
         | any code that could write to the drive)
         | 
         | Here's a good paper that explains why the 1/3 number isn't
         | quite right on any drives manufactured in the last quarter
          | century or so -
          | https://www.msstconference.org/MSST-history/2024/Papers/msst...
         | 
         | I'd be happy to answer any other questions about disk drive
         | mechanics and performance.
        
       | pjdesno wrote:
       | Note that you can kind of infer that S3 is still using hard
       | drives for their basic service by looking at pricing and
       | calculating the IOPS rate that doubles the cost per GB per month.
       | 
       | S3 GET and PUT requests are sufficiently expensive that AWS can
       | afford to let disk space sit idle to satisfy high-performance
       | tenants, but not a lot more expensive than that.
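        | 
        | For example, a rough sketch (using approximate us-east-1 list
        | prices from memory - ~$0.023/GB-month storage and ~$0.0004 per
        | 1,000 GETs - so check current pricing before trusting the
        | numbers):
        | 
        |   STORAGE_PER_GB_MONTH = 0.023   # assumed S3 Standard price
        |   GET_PER_1000 = 0.0004          # assumed GET request price
        |   SECONDS_PER_MONTH = 30 * 24 * 3600
        | 
        |   # GETs per GB-month whose cost equals the storage cost itself:
        |   gets = STORAGE_PER_GB_MONTH / (GET_PER_1000 / 1000)  # ~57,500
        |   iops_per_tb = gets * 1000 / SECONDS_PER_MONTH        # ~22 IOPS/TB
        |   print(round(gets), round(iops_per_tb, 1))
        | 
        | A large HDD only delivers a handful of IOPS per TB, so at that
        | request price AWS can afford to leave disk space empty for
        | read-heavy tenants.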
        
       ___________________________________________________________________
       (page generated 2025-09-24 23:00 UTC)