[HN Gopher] How AWS S3 serves 1 petabyte per second on top of sl...
___________________________________________________________________
How AWS S3 serves 1 petabyte per second on top of slow HDDs
Author : todsacerdoti
Score : 222 points
Date : 2025-09-24 10:05 UTC (12 hours ago)
(HTM) web link (bigdata.2minutestreaming.com)
(TXT) w3m dump (bigdata.2minutestreaming.com)
| EwanToo wrote:
| I think a more interesting article on S3 is "Building and
| operating a pretty big storage system called S3"
|
| https://www.allthingsdistributed.com/2023/07/building-and-op...
| giancarlostoro wrote:
| Really nice read, thank you for that.
| enether wrote:
| Author of the 2minutestreaming blog here. Good point! I'll
| add this as a reference at the end. I loved that piece. My
| goal was to be more concise and focus on the HDD aspect.
| dang wrote:
| Discussed at the time:
|
| _Building and operating a pretty big storage system called S3_
| - https://news.ycombinator.com/item?id=36894932 - July 2023
| (160 comments)
| crabique wrote:
| Is there an open source service designed with HDDs in mind that
| achieves similar performance? I know none of the big ones work
| that well with HDDs: MinIO, Swift, Ceph+RadosGW, SeaweedFS; they
| all suggest flash-only deployments.
|
| Recently I've been looking into Garage and liking the idea of it,
| but it seems to have a very different design (no EC).
| giancarlostoro wrote:
| Doing some light googling, aside from Ceph being listed,
| there's one called Gluster as well. It hypes itself as "using
| common off-the-shelf hardware you can create large,
| distributed storage solutions for media streaming, data
| analysis, and other data- and bandwidth-intensive tasks."
|
| It's open source / free to boot. I have no direct experience
| with it myself however.
|
| https://www.gluster.org/
| a012 wrote:
| I've used GlusterFS before because I had tens of old
| PCs, and it worked very well for me. It was more of a PoC to
| see how it works than a production setup, though.
| epistasis wrote:
| A decade ago where I worked we used gluster for ~200TB of HDD
| for a shared file system on a SLURM compute cluster, as a
| much better clustered version of NFS. And we used ceph for
| its S3 interface (RadosGW) for tens of petabytes of backing
| storage after the high-IO stages of compute were finished.
| The ceph was all HDD, though later we added some SSDs for a
| caching pool.
|
| For single-client performance, ceph beat the performance I
| get from S3 today for large file copies. Gluster had
| difficult-to-characterize performance, but our setup with big
| fast RAID arrays seems to still outperform what I see of
| AWS's Lustre-as-a-service today for our use case of long
| sequential reads and writes.
|
| We would occasionally try cephFS, the POSIX shared network
| filesystem, but it couldn't match our gluster performance for
| our workload. But also, we built the ceph long term storage
| to maximize TB/$, so it was at a disadvantage compared to our
| gluster install. Still, I never heard of cephFS being used
| anywhere despite it being the original goal in the papers
| back at UCSC. Keep an eye on CERN for news about one of the
| bigger ceph installs with public info.
|
| I love both of the systems, and see ceph used everywhere
| today, but am surprised and happy to see that gluster is
| still around.
| mbreese wrote:
| Gluster has been slowly declining for a while. It used to be
| sponsored by Red Hat, but that stopped a few years ago. Since
| then, development has slowed significantly.
|
| I used to keep a large cluster array with Gluster+ZFS
| (1.5PB), and I can't say I was ever really that impressed
| with the performance. That said -- I really didn't have
| enough horizontal scaling to make it worthwhile from a
| performance aspect. For us, it was mainly used to make a
| union file system.
|
| But, I can't say I'd recommend it for anything new.
| elitepleb wrote:
| Any of them will work just as well, but only with many
| datacenters' worth of drives, which very few deployments can
| target.
|
| It's the classic horizontal/vertical scaling trade-off;
| that's why flash tends to be more space/cost-efficient for
| speedy access.
| bayindirh wrote:
| Lustre and ZFS can do similar speeds.
|
| However, if you need high IOPS, you need flash on MDS for
| Lustre and some Log SSDs (esp. dedicated write and read ones)
| for ZFS.
| crabique wrote:
| Thanks, but I forgot to specify that I'm interested in
| S3-compatible servers only.
|
| Basically, I have a single big server with 80 high-capacity
| HDDs and 4 high-endurance NVMes, and it's the S3 endpoint
| that gets a lot of writes.
|
| So yes, for now my best candidate is ZFS + Garage: this way I
| can get away with using replica=1 and rely on ZFS RAIDz for
| data safety, and the NVMes can be sliced and diced to act as
| the fast metadata store for Garage, the "special"
| device/small-records store for ZFS, the ZIL/SLOG device,
| and so on.
|
| Currently it's a bit of a Frankenstein's monster: XFS+OpenCAS
| as the backing storage for an old version of MinIO
| (containerized to run as 5 instances). I'm looking to replace
| it with a simpler design and hopefully get better
| performance.
| foobarian wrote:
| Do you know if some of these systems have components to
| periodically checksum the data at rest?
| bayindirh wrote:
| ZFS/OpenZFS can scrub and do block-level recovery. I'm
| not sure about Lustre, but since petabyte-sized storage
| is its natural habitat, there should be at least one way
| to handle that.
| bayindirh wrote:
| It might not be the most ideal solution, but did you
| consider installing TrueNAS on that thing?
|
| TrueNAS can handle the OpenZFS (RAIDz, caches and logs)
| part and you can deploy Garage or any other S3 gateway on
| top of it.
|
| It can be an interesting experiment, and an 80-disk server
| is not too big for a TrueNAS installation.
| creiht wrote:
| It is probably worth noting that most of the listed storage
| systems (including S3) are designed to scale not only in
| hard drives, but horizontally across many servers in a
| distributed system. They really are not optimized for a
| single-storage-node use case. There are also other things
| to consider that can limit performance, like what the
| storage backplane for those 80 HDDs looks like and how
| much throughput you can effectively push through it. Then
| there is the network connectivity, which will also be a
| limiting factor.
| crabique wrote:
| It's a very beefy server with 4 NVMe and 20 HDD bays plus a
| 60-drive external enclosure, and 2 enterprise-grade HBA cards
| set to multipath round-robin mode; even with 80 drives
| it's nowhere near the data-path saturation point.
|
| The link is a 10G 9K MTU connection, the server is only
| accessed via that local link.
|
| Essentially, the drives being HDD are the only real
| bottleneck (besides the obvious single-node scenario).
|
| At the moment, all writes are buffered into the NVMes via
| OpenCAS write-through cache, so the writes are very
| snappy and are pretty much ingested at the rate I can
| throw data at it. But the read/delete operations require
| at least a metadata read, and due to the very high number
| of small (most even empty) objects they take a lot more
| time than I would like.
|
| I'm willing to sacrifice the write-through cache benefits
| (the write performance is actually overkill for my use
| case) in order to make it a little more balanced, for
| better List/Read/DeleteObject performance.
|
| On paper, most "real" writes will be sequential data, so
| writing that directly to the HDDs should be fine, while
| metadata write operations will be handled exclusively by
| the flash storage, thus also taking care of the
| empty/small objects problem.
| dardeaup wrote:
| Hope you don't have expectations (over the long run) for
| high availability. At some point that server will come
| down (planned or unplanned).
| crabique wrote:
| For sure, there are zero expectations of any kind of
| hardware downtime tolerance; it's a secondary backup
| storage cobbled together from leftovers over many years
| :)
|
| For software, at least with MinIO it's possible to do
| rolling updates/restarts since the 5 instances in docker-
| compose are enough for proper write quorum even with any
| single instance down.
| edude03 wrote:
| > Essentially, the drives being HDD are the only real
| bottleneck
|
| On the low end a single HDD can deliver 100MB/s, so 80 can
| deliver 8,000MB/s; a single NVMe can do 700MB/s and you have
| 4, so 2,800MB/s. A 10Gb link can only do ~1,000MB/s, so isn't
| your bottleneck the network, and then probably the CPU?
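|
| A quick back-of-the-envelope in Python (low-end illustrative
| numbers, not measurements):
|
|     # Aggregate disk throughput for 80 HDDs + 4 NVMes vs a 10Gb link
|     hdd_mbps = 100            # low-end sequential MB/s per HDD
|     nvme_mbps = 700           # conservative MB/s per NVMe
|     disk_total = 80 * hdd_mbps + 4 * nvme_mbps   # 10,800 MB/s
|     link_mbps = 10_000 / 8    # 10Gb/s is ~1,250 MB/s before overhead
|     print(disk_total, link_mbps)   # the link, not the disks, is the cap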
| uroni wrote:
| I'm working on something that might be suited for this use-
| case at https://github.com/uroni/hs5 (not ready for
| production yet).
|
| It would still need a resilience/cache layer like ZFS,
| though.
| toast0 wrote:
| If you can afford it, mirroring in some form is going to
| give you way better read perf than RAIDz. Using zfs mirrors
| is probably easiest but least flexible; zfs copies=2 with
| all devices as top-level vdevs in a single zpool is not
| very unsafe; and something custom would be a lot of work
| but could get safety and flexibility if done right.
|
| You're basically seek-limited, and a read on a mirror is
| one seek, whereas a read on a RAIDz is one seek per device
| in the stripe. (Although if most of your objects are under
| the chunk size, you end up with more mirroring than
| striping.)
|
| You lose on capacity though.
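|
| A rough sketch of the seek math (illustrative numbers, and
| assuming a fully seek-bound workload):
|
|     drive_iops = 120      # illustrative random-read IOPS per HDD
|     drives = 80
|
|     # Mirrors: each read lands on one drive, so seeks add up cleanly.
|     mirror_read_iops = drives * drive_iops        # ~9,600
|
|     # RAIDz: a small read still busies every drive in its vdev, so
|     # with e.g. 8 x RAIDz2 vdevs (8 data + 2 parity each) you get
|     # roughly one drive's worth of read IOPS per vdev.
|     vdevs = 8
|     raidz_read_iops = vdevs * drive_iops          # ~960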
| epistasis wrote:
| It's great to see other people's working solutions, thanks.
| Can I ask if you have backup on something like this? In
| many systems it's possible to store some data on ingress or
| after processing, which serves as something that's
| rebuildable, even if it's not a true backup. I'm not
| familiar if your software layer has backup to off site as
| part of their system, for example, which would be a great
| feature.
| pjdesno wrote:
| Ceph's S3 protocol implementation is really good.
|
| Getting Ceph erasure coding set up properly on a big hard
| disk pool is a pain - you can tell that EC was shoehorned
| into a system that was totally designed around triple
| replication.
| olavgg wrote:
| SeaweedFS has evolved a lot over the last few years, with
| RDMA support and EC.
| pickle-wizard wrote:
| At a past job we had an object store that used SwiftStack. We
| just used SSDs for the metadata storage but all the objects
| were stored on regular HDDs. It worked well enough.
| epistasis wrote:
| I would say that Ceph+RadosGW works well with HDDs, as long as
| 1) you use SSDs for the index pool, and 2) you are realistic
| about the number of IOPS you can get out of your pool of HDDs.
|
| And remember that there's a multiplication of IOPS for any
| individual client op, whether you're using triplicate storage
| or erasure coding. S3 also has IOP multiplication, which they
| solve with tons of HDDs.
|
| For big object storage that's mostly streaming 4MB chunks, this
| is no big deal. If you have tons of small random reads and
| writes across many keys or a single big key, that's when you
| need to make sure your backing store can keep up.
| kerneltime wrote:
| Apache Ozone has multiple 100+ petabyte clusters in production.
| The capacity is on HDDs and metadata is on SSDs. Updated docs
| (staging for new docs): https://kerneltime.github.io/ozone-site/
| cullenking wrote:
| We've been running a production ceph cluster for 11 years now,
| with only one full scheduled downtime for a major upgrade in
| all those years, across three different hardware generations. I
| wouldn't call it easy, but I also wouldn't call it hard. I used
| to run it with SSDs for radosgw indexes as well as a fast pool
| for some VMs, and hard drives for bulk object storage. Since I
| was only running 5 nodes with 10 drives each, I was tired of
| occasional IOPS issues under heavy recovery, so on the last
| upgrade I just migrated to 100% NVMe drives. To mitigate the
| price I just bought used enterprise Micron drives off eBay
| whenever I saw a good deal pop up. Haven't had any performance
| issues since then no matter what we've tossed at it. I'd
| recommend it, though I don't have experience with the other
| options. On paper I think it's still the best option. Stay away
| from CephFS though; its performance is truly atrocious and
| you'll footgun yourself in any production use.
| nerdjon wrote:
| So is any of S3 powered by SSDs?
|
| I honestly figured that it must be powered by SSDs for the
| standard tier, and that the slower tiers were the ones using
| HDDs or slower systems.
| MDGeist wrote:
| I always assumed the really slow tiers were tape.
| hobs wrote:
| Not even the higher tiers of Glacier were tape AFAICT (at
| least when it was first created); it's just the observation
| that hard drives hold much more than you can reasonably
| access in useful time.
| temp0826 wrote:
| In the early days, when there were articles speculating on
| what Glacier was backed by, it was actually on crusty old
| S3 gear (and at the very beginning it was just S3 itself
| with a wrapper and a hand-wavy price discount, eating the
| costs to get people to buy in to the idea!). Later on
| (2018 or so) they began moving to a home-grown tape-based
| solution (at least for some tiers).
| luhn wrote:
| Do you have any sources for that? I'm really curious
| about Glacier's infrastructure and AWS has been
| notoriously tight-lipped about it. I haven't found
| anything better than informed speculation.
| iamtedd wrote:
| My speculation: writes are to /dev/null, and the fact
| that reads are expensive and that you need to inventory
| your data before reading means Amazon is recreating your
| data from network transfer logs.
| luhn wrote:
| Maybe they ask the NSA for a copy.
| dekhn wrote:
| That's surprising given how badly restoration worked
| (much more like tape than drives).
| roncesvalles wrote:
| I'd be curious whether simulating a shitty restoration
| experience was part of the emulation when they first ran
| Glacier on plain S3 to test the market.
| everfrustrated wrote:
| I'm not aware of AWS ever confirming tape for Glacier. My
| own speculation is that they likely use HDD for Glacier -
| especially so for the smaller regions - and eat the cost.
|
| Someone recently came across planning documents filed in
| London for a small "datacenter" which wasn't attached to
| their usual London compute DCs and was built to house tape
| libraries (this was explicitly called out, as there was
| concern about power - tape libraries don't use much). So I
| would be fairly confident they wait until the Glacier
| volumes grow enough on HDD before building out tape infra.
| g-mork wrote:
| There might be surprisingly little value in going tape due to
| all the specialization required. As the other comments
| suggest, many of the lower tiers likely represent IO
| bandwidth classes: a 16 TB disk with 100 IOPS can offer only
| 1 IOPS over 160 GB to each of 100 customers, or 0.1 IOPS over
| 16 GB to each of 1,000, etc. Just scale that thinking up to a
| building full of disks; it still applies.
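|
| The same slicing in Python:
|
|     # Dividing one 16 TB, 100-IOPS disk evenly across N tenants
|     capacity_gb, disk_iops = 16_000, 100
|     for tenants in (100, 1_000):
|         gb_each = capacity_gb / tenants      # 160 GB, then 16 GB
|         iops_each = disk_iops / tenants      # 1 IOPS, then 0.1 IOPS
|         print(tenants, gb_each, iops_each)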
| pjdesno wrote:
| See comments above about AWS per-request cost - if your
| customers want higher performance, they'll pay enough to
| let AWS waste some of that space and earn a profit on it.
| derefr wrote:
| My own assumption was always that the cold tiers are managed
| by a tape _robot_, but one managing offlined HDDs rather than
| actual tapes.
| chippiewill wrote:
| I think that's close to the truth. IIRC it's something like
| a massive cluster of machines that are effectively powered
| off 99% of the time with a careful sharding scheme where
| they're turned on and off in batches over a long period of
| time for periodic backup or restore of blobs.
| rubiquity wrote:
| > So is any of S3 powered by SSD's?
|
| S3's KeyMap Index uses SSDs. I also wouldn't be surprised if at
| this point SSDs are somewhere along the read path for caching
| hot objects, or in the new One Zone product.
| electroly wrote:
| It's assumed that the new S3 Express One Zone is backed by SSDs
| but I believe Amazon doesn't say so explicitly.
| rubiquity wrote:
| I've always felt it's probably a wrapper around Amazon EFS,
| due to the similar pricing and the fact that S3 Express One
| Zone has "Directory" buckets, a very file-system-y idea.
| yabones wrote:
| The storage itself is probably (mostly) on HDDs, but I'd
| imagine metadata, indices, etc are stored on much faster flash
| storage. At least, that's the common advice for small-ish Ceph
| cluster MDS servers. Obviously S3 is a few orders of magnitude
| bigger than that...
| pjdesno wrote:
| Repeating a comment I made above - for the standard tier,
| requests are expensive enough that it's cost-effective to let
| space on the disks go unused if someone wants an IOPS/TB ratio
| higher than what disk drives can provide. But they're not much
| more expensive than that.
|
| The latest generation of drives store about 30TB - I don't know
| how much AWS pays for them, but a wild-ass guess would be
| $300-$500. That's a lot cheaper than 30TB of SSD.
|
| Also important - you can put those disks in high-density
| systems (e.g. 100 drives in 4U) that only add maybe 25% to the
| total cost, at least if you're AWS, a bit more for the rest of
| us. The per-slot cost of boxes that hold lots of SSDs seems to
| be a lot higher.
| UltraSane wrote:
| I expect they are storing metadata on SSDs. They might have SSD
| caches for really hot objects that get read a lot.
| wg0 wrote:
| Does anyone know what the technology stack of S3 is? Monolith
| or multiple services?
|
| I assume it would have lots of queues, caches and long-running
| workers.
| jyscao wrote:
| > conway's law and how it shapes S3's architecture (consisting
| of 300+ microservices)
| Twirrim wrote:
| Amazon biases towards a systems-oriented-architecture
| approach that sits in the middle ground between monolith and
| microservices.
|
| It biases away from lots of small services in favour of
| larger ones that handle more of the work, so that as much as
| possible you avoid the costs and latency of preparing,
| transmitting, receiving and processing requests.
|
| I know S3 has changed since I was there nearly a decade ago, so
| this is outdated. Off the top of my head it used to be about a
| dozen main services at that time. A request to put an object
| would only touch a couple of services en route to disk, and
| similar on retrieval. There were a few services that handled
| fixity and data durability operations, the software on the
| storage servers themselves, and then stuff that maintained the
| mapping between object and storage.
| taeric wrote:
| Amusingly, I suspect that the "dozen main services" is still
| quite a few more than most smaller companies would consider
| on their stacks.
| Twirrim wrote:
| Probably. Conway's law comes into effect, naturally.
| hnexamazon wrote:
| I was an SDE on the S3 Index team 10 years ago, but I doubt
| much of the core stack has changed.
|
| S3 is composed primarily of layers of Java-based web
| services. The hot-path operations (object get / put / list)
| are all served by synchronous API servers - no queues or
| workers. It is the best example I've seen in my career of how
| many transactions per second a pretty standard Java web
| service stack can handle.
|
| For a get call, you first hit a fleet of front-end HTTP API
| servers behind a set of load balancers. Partitioning is based
| on the key name prefixes, although I hear they've done work to
| decouple that recently. Your request is then sent to the
| Indexing fleet to find the mapping of your key name to an
| internal storage id. This is returned to the front end layer,
| which then calls the storage layer with the id to get the
| actual bits. It is a very straightforward multi-layer
| distributed system design for serving synchronous API responses
| at massive scale.
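|
| A toy sketch of that flow in Python (the function names and
| in-memory tables are my own stand-ins, not S3 internals):
|
|     # Toy three-layer GET: front-end -> index -> storage
|     PARTITIONS = 16
|     INDEX = [{} for _ in range(PARTITIONS)]      # key -> storage id
|     BLOBS = {}                                   # storage id -> bits
|
|     def partition_of(key):
|         # front-end fleet routes by key name prefix
|         return hash(key.split("/")[0]) % PARTITIONS
|
|     def put_object(bucket, key, data):
|         sid = f"sid-{len(BLOBS)}"
|         BLOBS[sid] = data                               # storage layer
|         INDEX[partition_of(key)][(bucket, key)] = sid   # indexing fleet
|
|     def get_object(bucket, key):
|         sid = INDEX[partition_of(key)][(bucket, key)]   # index lookup
|         return BLOBS[sid]                               # fetch the bits
|
|     put_object("bkt", "photos/cat.jpg", b"...bytes...")
|     assert get_object("bkt", "photos/cat.jpg") == b"...bytes..."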
|
| The only novel bit is all the backend communication uses a
| home-grown stripped-down HTTP variant, called STUMPY if I
| recall. It was a dumb idea not to just use HTTP, but the
| service is ancient and was originally built back when
| principal engineers were allowed to YOLO their own frameworks
| and protocols, so now they are stuck with it. They may have
| done the massive lift to replace STUMPY with HTTP since my
| time.
| js4ever wrote:
| "It is the best example of how many transactions per second a
| pretty standard Java web service stack can handle that I've
| seen in my career."
|
| Can you give some numbers, or at least a ballpark?
| hnexamazon wrote:
| Tens of thousands of TPS per node.
| derefr wrote:
| > The hot path (... list) are all served by synchronous API
| servers
|
| Wait; how does that work, when a user is PUTting tons of
| objects concurrently into a bucket, and then LISTing the
| bucket during that? If the PUTs are all hitting different
| indexing-cluster nodes, then...?
|
| (Or do you mean that there _are_ queues /workers, but only
| outside the hot path; with hot-path requests emitting events
| that then get chewed through async to do things like cross-
| shard bucket metadata replication?)
| pjdesno wrote:
| LIST is dog slow, and everyone expects it to be. (my
| research group did a prototype of an ultra-high-speed
| S3-compatible system, and it really helps not needing to
| list things quickly)
| rubiquity wrote:
| Rest assured STUMPY was replaced with another home-grown
| protocol! Though I think a stream-oriented protocol is a
| better match for a large-scale service like S3 storage than a
| synchronous protocol like HTTP.
| gregates wrote:
| It's not all Java anymore. There's some Rust now, too.
| ShardStore, at least (which the article mentions).
| master_crab wrote:
| _Partitioning is based on the key name prefixes, although I
| hear they've done work to decouple that recently._
|
| They may still use key names for partitioning. But they now
| randomly hash the user key name prefix on the back end to
| handle hotspots generated by similar keys.
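|
| Something like this (a sketch of the idea, not their actual
| scheme):
|
|     import hashlib
|
|     def partition_for(key, partitions=1024):
|         # Hash the key name prefix so lexically adjacent keys
|         # (e.g. "2025-09-24/...") spread across partitions instead
|         # of piling onto one lexicographic range.
|         prefix = key.split("/", 1)[0]
|         digest = hashlib.sha256(prefix.encode()).digest()
|         return int.from_bytes(digest[:4], "big") % partitions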
| riknos314 wrote:
| Microservices for days.
|
| I worked on lifecycle ~5 years ago and just the Standard ->
| Glacier transition path involved no fewer than 7 microservices.
|
| Just determining which of the 400 trillion keys are eligible
| for a lifecycle action (comparing each object's metadata
| against the lifecycle policy on the bucket) is a _massive_ big
| data job.
|
| It was always a fun on-call when some bucket added a
| lifecycle rule that queued 1PB+ of data for transition or
| deletion on the same day. At the time our queuing had become
| good enough to handle these queues gracefully, but our
| alarming hadn't figured out how to differentiate between the
| backlog for a single customer with a huge job and the whole
| system failing to process quickly enough. IIRC this was being
| fixed as I left.
| rubiquity wrote:
| I used to work on the backing service for S3's Index and the
| daily humps in our graphs from lifecycle running were
| immense!
| charltones wrote:
| There's a pretty good talk on S3 under the hood from last
| year's re:Invent: https://www.youtube.com/watch?v=NXehLy7IiPM
| pjdesno wrote:
| The only scholarly paper they've written about it is this one:
| https://www.amazon.science/publications/using-lightweight-fo...
|
| (well, I think they may have submitted one or two others, but
| this is the only one that got published)
| dgllghr wrote:
| I enjoyed this article but I think the answer to the headline is
| obvious: parallelism
| ffsm8 wrote:
| That's like saying "how to get to the moon is obvious:
| traveling"
| crazygringo wrote:
| I dunno, the article's tl;dr is just parallelism.
|
| Data gets split into redundant copies, and is rebalanced in
| response to hot spots.
|
| Everything in this article is the obvious answer you'd
| expect.
| toolslive wrote:
| It's not really "redundant copies". It's erasure coding
| (i.e., your data is the solution of an overdetermined
| system of equations).
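|
| The flavor of it in a few lines of Python (a toy single-parity
| code; real systems use much wider schemes):
|
|     # 4 data shards + 1 XOR parity shard; any one lost shard is
|     # recoverable from the other four.
|     data = [b"it w", b"as t", b"he b", b"est "]
|     parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*data))
|
|     lost = 2                       # pretend this shard's disk died
|     survivors = [s for i, s in enumerate(data) if i != lost]
|     rebuilt = bytes(w ^ x ^ y ^ p
|                     for w, x, y, p in zip(*survivors, parity))
|     assert rebuilt == data[lost]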
| DoctorOW wrote:
| Thank you for setting me up for this...
|
| It's not exactly rocket science.
| timeinput wrote:
| I generally don't think about storage I/O speed at that scale
| (I mean really who does?). I once used a RAID0 to store data to
| HDDs faster, but that was a long time ago.
|
| I would have naively guessed an interesting caching system, and
| to some degree tiers of storage for hot vs cold objects.
|
| It was obvious after I read the article that parallelism was a
| great choice, but I definitely hadn't considered the detailed
| scheme of S3, or the error correction it used. Parallelism is
| the one word summary, but the details made the article worth
| reading. I bet MinIO also has a similar scaling story:
| parallelism.
| 0x457 wrote:
| My homelab servers all have RAIDz across 3 NVMe drives for
| this reason: higher parallelism without losing redundancy.
|
| > I would have naively guessed an interesting caching system,
| and to some degree tiers of storage for hot vs cold objects.
|
| Caching in this scenario is usually done outside of S3, in
| something like CloudFront.
| MrDarcy wrote:
| If you're curious about this at home, try Ceph in Proxmox.
| glitchcrab wrote:
| Unless you have a large cluster with many tens of
| nodes/OSDs (and who does in a homelab?), using Ceph is a
| bad idea (I've run large Ceph clusters at previous jobs).
| UltraSane wrote:
| AWS themselves have bragged that the biggest S3 buckets are
| striped across over 1 million hard drives. This doesn't mean
| they are using all of the space on all those drives, because
| one of the key concepts of S3 is to average the IO of many
| customers over many drives.
| gregates wrote:
| I think the article's title question is a bit misleading
| because it focuses on peak throughput for S3 as a whole. The
| interesting question is "How can the throughput for a GET
| exceed the throughput of an HDD?"
|
| If you just replicated, you could still get big throughput for
| S3 as a whole by doing many reads that target different HDDs.
| But you'd still be limited to max HDD throughput * number of
| GETs. S3 is not so limited, and that's interesting and non-
| obvious!
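|
| In Python terms (illustrative numbers; the point is the
| multiplier, not the exact figures):
|
|     # One GET against erasure-coded shards reads many disks at once
|     hdd_mbps = 150          # sequential MB/s for one HDD
|     data_shards = 5         # shards fetched in parallel for one GET
|     single_disk_get = hdd_mbps              # replication ceiling
|     striped_get = data_shards * hdd_mbps    # ~750 MB/s for one object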
| UltraSane wrote:
| Millions of hard drives cumulatively have enormous IO bandwidth.
| ttfvjktesd wrote:
| > tens of millions of disks
|
| If we assume enterprise HDDs in the double-digit TB range,
| then one can estimate that the total S3 storage volume of AWS
| is in the triple-digit exabyte range. That's probably the
| biggest storage system on planet Earth.
| yanslookup wrote:
| Production-scale enterprise HDDs are in the 30TB range, with
| 50TB on the horizon...
| pjdesno wrote:
| Google for Seagate Mozaic
| threeducks wrote:
| A certain data center in Utah might top that, assuming that
| they have upgraded their hardware since 2013.
|
| https://www.forbes.com/sites/kashmirhill/2013/07/24/blueprin...
| ttfvjktesd wrote:
| This piece is interesting background, but worth noting that
| the actual numbers are highly speculative. The NSA has never
| disclosed hard data on capacity, and most of what's out there
| is inference from blueprints, water/power usage, or second-
| hand claims. No verifiable figures exist.
| sciencesama wrote:
| Can we replicate something similar to this for a homelab?
| ai-christianson wrote:
| garage
| gregates wrote:
| A few factual inaccuracies in here that don't affect the
| general thrust. For example, the claim that S3 uses a 5:9
| sharding scheme. In fact they use many different sharding
| schemes, and IIRC 5:9 isn't one of them.
|
| The main reason is that a ratio of 1.8 physical bytes to 1
| logical byte is awful for HDD costs. You can get that down
| significantly, and you get wider parallelism and better
| availability guarantees to boot (consider: if a whole AZ goes
| down, how many shards can you lose before an object is
| unavailable for GET?).
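|
| The trade-off in Python (scheme numbers are illustrative, not
| S3's actual parameters):
|
|     from math import ceil
|
|     def ec_stats(data_shards, total_shards, azs=3):
|         overhead = total_shards / data_shards   # physical per logical byte
|         per_az = ceil(total_shards / azs)       # shards lost with one AZ
|         tolerable = total_shards - data_shards  # losses before unavailability
|         return overhead, per_az <= tolerable    # cost vs AZ-loss survival
|
|     print(ec_stats(5, 9))    # (1.8, True): costly, survives an AZ loss
|     print(ec_stats(10, 14))  # (1.4, False): cheaper, but an AZ kills GETs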
| UltraSane wrote:
| VAST Data uses 146+4.
|
| https://www.vastdata.com/whitepaper/#similarity-reduction-in...
| bombela wrote:
| The page loads, then quickly jumps as some video loads, and
| the content is gone.
| Const-me wrote:
| > full-platter seek time: ~8ms; half-platter seek time (avg):
| ~4ms
|
| The average distance between two points (the current location
| and the target location), when both are uniformly distributed
| in the [0..1] interval, is not 0.5, it's 1/3. If the
| full-platter seek time is 8ms, the average seek time should
| be ~2.67ms.
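|
| Quick check in Python (E|X - Y| for X, Y uniform on [0, 1]):
|
|     import random
|
|     # Closed form: 2 * integral_0^1 integral_0^x (x - y) dy dx = 1/3
|     n = 1_000_000
|     est = sum(abs(random.random() - random.random())
|               for _ in range(n)) / n
|     print(est)   # ~0.333 -> ~2.67ms if seek time scaled linearly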
| nick49488171 wrote:
| There's acceleration of the read head as it moves between
| tracks, so it may well be 4ms, because shorter distances are
| penalized by a lower peak head speed as well as by constant
| factors (settling at the end of the motion).
| pjdesno wrote:
| Full seek on a modern drive is a lot closer to 25ms than 8ms.
| It's pretty easy to test yourself if you have a hard drive in
| your machine and root access - fire up fio with --readonly and
| feed it a handmade trace that alternates reading blocks at the
| beginning and end of disk. (--readonly does a hard disable of
| any code that could write to the drive)
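|
| If fio feels heavy, a crude read-only probe in Python gets the
| same idea across (run against a real HDD as root; there's no
| O_DIRECT here, so random offsets stand in for cache avoidance):
|
|     import os, random, time
|
|     dev = "/dev/sdX"                 # your disk here, read-only
|     fd = os.open(dev, os.O_RDONLY)
|     size = os.lseek(fd, 0, os.SEEK_END)
|     n, blk = 200, 4096
|     t0 = time.perf_counter()
|     for i in range(n):
|         # alternate random spots in the first and last 1% of the disk
|         base = 0 if i % 2 == 0 else size - size // 100
|         off = (base + random.randrange(size // 100)) // blk * blk
|         os.pread(fd, blk, off)
|     os.close(fd)
|     print((time.perf_counter() - t0) / n * 1e3,
|           "ms per read (~full-stroke seek + rotation)")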
|
| Here's a good paper that explains why the 1/3 number isn't
| quite right on any drives manufactured in the last quarter
| century or so -
| https://www.msstconference.org/MSST-history/2024/Papers/msst...
|
| I'd be happy to answer any other questions about disk drive
| mechanics and performance.
| pjdesno wrote:
| Note that you can kind of infer that S3 is still using hard
| drives for its basic service by looking at pricing and
| calculating the IOPS rate that doubles the cost per GB per
| month.
|
| S3 GET and PUT requests are sufficiently expensive that AWS can
| afford to let disk space sit idle to satisfy high-performance
| tenants, but not a lot more expensive than that.
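|
| The arithmetic, using published us-east-1 list prices (these
| change, so treat the numbers as illustrative):
|
|     # At what GET rate do request fees equal the storage fee?
|     storage_per_gb_month = 0.023     # S3 Standard, $/GB-month
|     get_per_request = 0.0004 / 1000  # $0.0004 per 1,000 GETs
|     month_seconds = 30 * 86_400
|
|     gets_per_gb_month = storage_per_gb_month / get_per_request  # ~57,500
|     iops_per_tb = gets_per_gb_month / month_seconds * 1000      # ~22
|     # A ~30TB HDD delivers only a few IOPS per TB, so hotter tenants
|     # pay enough in request fees to cover disks left half-empty.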
___________________________________________________________________
(page generated 2025-09-24 23:00 UTC)