[HN Gopher] Building and operating a pretty big storage system c...
___________________________________________________________________
Building and operating a pretty big storage system called S3
Author : werner
Score : 157 points
Date : 2023-07-27 15:20 UTC (7 hours ago)
(HTM) web link (www.allthingsdistributed.com)
(TXT) w3m dump (www.allthingsdistributed.com)
| Twirrim wrote:
| > That's a bit error rate of 1 in 10^15 requests. In the real
| world, we see that blade of grass get missed pretty frequently -
| and it's actually something we need to account for in S3.
|
| One of the things I remember from my time at AWS was
| conversations about how 1 in a billion events end up being a
| daily occurrence when you're operating at S3 scale. Things that
| you'd normally mark off as so wildly improbable it's not worth
| worrying about still have to be considered and handled.
|
| Glad to read about ShardStore, and especially the formal
| verification, property based testing etc. The previous generation
| of services was notoriously buggy, a very good example of the
| usual perils of organic growth (but at least really well designed
| such that they'd fail "safe", ensuring no data loss, something S3
| engineers obsessed about).
| Waterluvian wrote:
| Ever see a UUID collision?
| ignoramous wrote:
| James Hamilton, AWS' chief architect, wrote about this
| phenomenon in 2017: _At scale, rare events aren't rare_;
| https://news.ycombinator.com/item?id=14038044
| ilyt wrote:
| I think Ceph hit similar problems and had to add more robust
| checksumming to the system; relying on just TCP checksums for
| integrity, for example, was no longer enough.
| Twirrim wrote:
| Yes, I remember tcp checksumming coming up as not sufficient
| at one stage. Even saw S3 deal with a real head-scratcher of
| a non-impacting event that came down to a single NIC in a
| single machine corrupting the tcp checksum under very
| specific circumstances.
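|
| The usual mitigation is an end-to-end checksum above the transport
| layer; a minimal sketch of the idea in Python (nothing S3- or
| Ceph-specific here):
|
|     import hashlib
|
|     def put_object(store, key, data):
|         # store a strong digest alongside the bytes at write time
|         store[key] = (data, hashlib.sha256(data).hexdigest())
|
|     def get_object(store, key):
|         data, expected = store[key]
|         # re-verify on read; catches corruption that a 16-bit TCP
|         # checksum can let through
|         if hashlib.sha256(data).hexdigest() != expected:
|             raise IOError(f"checksum mismatch for {key!r}")
|         return data
|
| At this scale a 16-bit checksum will eventually pass corrupted
| data, so integrity has to be verified end to end.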
| mjb wrote:
| > daily occurrence when you're operating at S3 scale
|
| Yeah! With S3 averaging over 100M requests per second, 1 in a
| billion happens every ten seconds. And it's not just S3. For
| example, for Prime Day 2022, DynamoDB peaked at over 105M
| requests per second (just for the Amazon workload):
| https://aws.amazon.com/blogs/aws/amazon-prime-day-2022-aws-f...
|
| In the post, Andy also talks about Lightweight Formal Methods
| and the team's adoption of Rust. When even extremely low
| probability events are common, we need to invest in multiple
| layers of tooling and process around correctness.
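|
| The back-of-the-envelope version of that, using only the 100M
| requests/second figure above:
|
|     requests_per_second = 100e6   # S3 average cited above
|     p_event = 1e-9                # "one in a billion"
|
|     events_per_second = requests_per_second * p_event   # 0.1
|     seconds_between_events = 1 / events_per_second      # 10 seconds
|     events_per_day = events_per_second * 86_400         # 8,640/day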
| ldjkfkdsjnv wrote:
| Also worked at Amazon, saw some issues with major well known
| open source libraries that broke in places nobody would ever
| expect.
| wrboyce wrote:
| Any examples you can share?
| rubiquity wrote:
| Daily? A component I worked on that supported S3's Index could
| hit a 1 in a billion issue multiple times a minute. Thankfully
| we had good algorithms and hardware that is a lot more reliable
| these days!
| Twirrim wrote:
| This was 7-8 years ago now. Lot of scaling up since those
| days :)
| jl6 wrote:
| Great to see Amazon employees being allowed to talk openly about
| how S3 works behind the scenes. I would love to hear more about
| how Glacier works. As far as I know, they have never revealed
| what the underlying storage medium is, leading to a lot of wild
| speculation (tape? offline HDDs? custom HDDs?).
| Twirrim wrote:
| Glacier is a big "keep your lips sealed" one. I'd love AWS to
| talk about everything there, and the entire journey it was on
| because it is truly fascinating.
| [deleted]
| inopinatus wrote:
| Never officially stated, but frequent leaks from insiders
| confirm that Glacier is based on Very Large Arrays of Wax
| Phonograph Records (VLAWPR) technology.
| Twirrim wrote:
| We came up with that idea in Glacier during the run-up to
| April one year (2014, I think?) and half-jokingly suggested it
| as an April Fool's Day joke, but Amazon quite reasonably
| decided against doing such jokes.
|
| One of the tag line ideas we had was "8 out of 10 customers
| say they prefer the feel of their data after it is restored"
| [deleted]
| anderspitman wrote:
| The things we could build if S3 specified a simple OAuth2-based
| protocol for delegating read/write access. The world needs an
| HTTP-based protocol for apps to access data on the user's behalf.
| Google Drive is the closest to this but it only has a single
| provider and other issues[0]. I'm sad remoteStorage never caught
| on. I really hope Solid does well but it feels too complex to me.
| My own take on the problem is https://gemdrive.io/, but it's
| mostly on hold while I'm focused on other parts of the self-
| hosting stack.
|
| [0]: https://gdrivemusic.com/help
| baq wrote:
| > What's interesting here, when you look at the highest-level
| block diagram of S3's technical design, is the fact that AWS
| tends to ship its org chart. This is a phrase that's often used
| in a pretty disparaging way, but in this case it's absolutely
| fascinating.
|
| I'd go even further: at this scale, it is essential if you want
| to develop these kinds of projects with any sort of velocity.
|
| Large organizations ship their communication structure by design.
| The alternative is engineering anarchy.
| hobo_in_library wrote:
| This is also why reorgs tend to be pretty common at large tech
| orgs.
|
| They know they'll almost inevitably ship their org chart. And
| they'll encounter tons of process-based friction if they don't.
|
| The solution: Change your org chart to match what you want to
| ship
| Severian wrote:
| Straight from The Mythical Man-Month: organizations which
| design systems are constrained to produce designs which are
| copies of the communication structures of these organizations.
| epistasis wrote:
| Working in genomics, I've dealt with lots of petabyte data stores
| over the past decade. Having used AWS S3, GCP GCS, and a raft of
| storage systems for collocated hardware (Ceph, Gluster, and an HP
| system whose name I have blocked from my memory), I have no small
| amount of appreciation for the effort that goes into operating
| these sorts of systems.
|
| And the benefits of sharing disk IOPS with untold numbers of
| other customers are hard to overstate. I hadn't heard the term
| "heat" as it's used in the article, but it's incredibly hard to
| mitigate on a single system. For our co-located hardware clusters,
| we would have to customize the batch systems to treat IO as an
| allocatable resource the same as RAM or CPU in order to manage it
| correctly across large jobs. S3 and GCP are super expensive, but
| the performance can be worth it.
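|
| A schematic of what that customization amounts to - not tied to
| any particular batch scheduler, and the capacities are made up:
|
|     # Treat IOPS as a first-class allocatable resource, like CPU/RAM.
|     CAPACITY = {"cpu": 64, "ram_gb": 512, "iops": 20_000}
|
|     def try_admit(job, allocated):
|         """job/allocated: dicts with the same keys as CAPACITY."""
|         if any(allocated[r] + job[r] > CAPACITY[r] for r in CAPACITY):
|             return False              # would oversubscribe something
|         for r in CAPACITY:
|             allocated[r] += job[r]    # reserve IO just like CPU/RAM
|         return True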
|
| This sort of article is some of the best of HN, IMHO.
| deathanatos wrote:
| > _Now, let's go back to that first hard drive, the IBM RAMAC
| from 1956. Here are some specs on that thing:_
|
| > _Storage Capacity: 3.75 MB_
|
| > _Cost: ~$9,200/terabyte_
|
| Those specs can't possibly be correct. If you multiply the cost
| by the storage, the cost of the drive works out to about 3¢.
|
| This site[1] states,
|
| > _It stored about 2,000 bits of data per square inch and had a
| purchase price of about $10,000 per megabyte_
|
| So perhaps the specs should read $9,200 / _megabyte_? (Which
| would put the drive's cost at $34,500, which seems more
| plausible.)
|
| [1]: https://www.historyofinformation.com/detail.php?entryid=952
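|
| The arithmetic, assuming the capacity figure is right and only
| the cost unit is off:
|
|     capacity_mb = 3.75
|     capacity_tb = capacity_mb / 1e6
|
|     cost_as_listed = capacity_tb * 9_200   # ~$0.03, implausible
|     cost_if_per_mb = capacity_mb * 9_200   # $34,500, plausible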
| acdha wrote:
| https://en.m.wikipedia.org/wiki/IBM_305_RAMAC has the likely
| source of the error: 30M bits (using the 6 data bits but not
| parity), but it rented for $3k per month so you didn't have a
| set cost the same as buying a physical drive outright - very
| close to S3's model, though.
| andywarfield wrote:
| oh shoot. good catch, thanks!
| birdyrooster wrote:
| Must've put a decimal point in the wrong place or something. I
| always do that. I always mess up some mundane detail.
| S_A_P wrote:
| Did you get the memo? Yeah I will go ahead and get you
| another copy of that memo.
| jakupovic wrote:
| The part about distributing loads takes me back to S3 KeyMap days
| and me trying to migrate to it from the initial implementation.
| What
| I learned is that even after you identify the hottest
| objects/partitions/buckets you cannot simply move them and be
| done. Everything had to be sorted. The actual solution was to
| sort and then divide the host's partition load into quartiles and
| move the second quartile partitions onto the least loaded hosts.
| If one tried to move the hottest buckets, 1st quartile, it would
| put even more load on the remaining members which would fail,
| over and over again.
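|
| Roughly, the shape of that rebalancing step (hypothetical names,
| nothing like the real KeyMap code):
|
|     def plan_moves(hot_host_partitions, hosts_by_load):
|         """hot_host_partitions: [(partition_id, load), ...] for one
|         overloaded host; hosts_by_load: least loaded hosts first."""
|         ranked = sorted(hot_host_partitions, key=lambda p: p[1],
|                         reverse=True)
|         q = len(ranked) // 4
|         # Move the *second* quartile; moving the hottest (1st
|         # quartile) just shifted too much load and failed repeatedly.
|         second_quartile = ranked[q:2 * q]
|         return [(part, host) for (part, _load), host
|                 in zip(second_quartile, hosts_by_load)]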
|
| Another side effect was that the error rate went from steady ~1%
| to days without any errors. Consequently we updated the alerts to
| be much stricter. This was around 2009 or so.
|
| Also came from an academic background, UM, but instead of getting
| my
| PhD I joined S3. It even rhymes :).
| dsalzman wrote:
| > Imagine a hard drive head as a 747 flying over a grassy field
| at 75 miles per hour. The air gap between the bottom of the plane
| and the top of the grass is two sheets of paper. Now, if we
| measure bits on the disk as blades of grass, the track width
| would be 4.6 blades of grass wide and the bit length would be one
| blade of grass. As the plane flew over the grass it would count
| blades of grass and only miss one blade for every 25 thousand
| times the plane circled the Earth.
| mcapodici wrote:
| S3 is more than storage. It is a standard. I like how you can get
| S3-compatible (usually with some small caveats) storage from a
| few places. I am not sure how open the standard is, or whether
| you have to pay Amazon to say you are "S3 compatible", but it is
| pretty cool.
|
| Examples:
|
| iDrive has E2, Digital Ocean has Object Storage, Cloudflare has
| R2, Vultr has Object Storage, Backblaze has B2
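|
| In practice "S3 compatible" mostly means the standard SDKs work
| with an endpoint override. A sketch with boto3 (endpoint and
| credentials below are placeholders):
|
|     import boto3
|
|     s3 = boto3.client(
|         "s3",
|         endpoint_url="https://s3.example-provider.com",
|         aws_access_key_id="YOUR_KEY",
|         aws_secret_access_key="YOUR_SECRET",
|     )
|     s3.put_object(Bucket="my-bucket", Key="hello.txt", Body=b"hi")
|     print(s3.get_object(Bucket="my-bucket",
|                         Key="hello.txt")["Body"].read())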
___________________________________________________________________
(page generated 2023-07-27 23:00 UTC)