[HN Gopher] Continuous reinvention: A brief history of block storage at AWS
___________________________________________________________________
Continuous reinvention: A brief history of block storage at AWS
Author : riv991
Score : 87 points
Date : 2024-08-22 14:59 UTC (2 hours ago)
(HTM) web link (www.allthingsdistributed.com)
(TXT) w3m dump (www.allthingsdistributed.com)
| mjb wrote:
| Super cool to see this here. If you're at all interested in big
| systems, you should read this.
|
| > Compounding this latency, hard drive performance is also
| variable depending on the other transactions in the queue.
| Smaller requests that are scattered randomly on the media take
| longer to find and access than several large requests that are
| all next to each other. This random performance led to wildly
| inconsistent behavior.
|
| The effect of this can be huge! Given a reasonably sequential
| workload, modern magnetic drives can do >100MB/s of reads or
| writes. Given an entirely random 4kB workload, they can be
| limited to as little as 400kB/s of reads or writes. Queuing and
| scheduling can help avoid the truly bad end of this, but real-
| world performance still varies by over 100x depending on
| workload. That's really hard for a multi-tenant system to deal
| with (especially with reads, where you can't do the "just write
| it somewhere else" trick).
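|
| To make the contrast concrete, here's a back-of-envelope model
| in Python. The seek and rotation numbers are my assumptions for
| a typical 7200 RPM drive, not figures from the post:
|
|     # Why random 4 KiB I/O is so much slower than sequential
|     # I/O on a magnetic drive: every request pays a seek plus
|     # rotational latency before any data moves.
|     SEEK_MS = 8.0       # assumed average seek time
|     ROTATE_MS = 4.17    # half a revolution at 7200 RPM
|     SEQ_MBPS = 150.0    # assumed sequential media rate
|
|     def random_4k_kbps():
|         xfer_ms = 4 / 1024 / SEQ_MBPS * 1000   # ~0.03 ms
|         ms_per_io = SEEK_MS + ROTATE_MS + xfer_ms
|         return 1000 / ms_per_io * 4            # KiB/s
|
|     print(f"random 4 KiB: {random_4k_kbps():.0f} KiB/s")
|     # ~330 KiB/s vs ~150 MB/s sequential -- a gap of several
|     # hundred x, consistent with the numbers above.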
|
| > To know what to fix, we had to know what was broken, and then
| prioritize those fixes based on effort and rewards.
|
| This was the biggest thing I learned from Marc in my career (so
| far). He'd spend time working on visualizations of latency (like
| the histogram time series in this post) which were much richer
| than any of the telemetry we had, then tell a story using those
| visualizations, and completely change the team's perspective on
| the work that needed to be done. Each peak in the histogram came
| with its own story, and its own work to optimize. Really diving
| into
| performance data - and looking at that data in multiple ways -
| unlocks efficiencies and opportunities that are invisible without
| that work and investment.
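|
| The kind of view he's describing is easy to prototype: render
| one latency histogram per time slice and stack them into an
| image, so each mode shows up as a horizontal band. A toy
| version in Python with synthetic data (obviously not real EBS
| telemetry):
|
|     import numpy as np
|     import matplotlib.pyplot as plt
|
|     rng = np.random.default_rng(0)
|     rows = []
|     for t in range(120):                  # 120 time slices
|         fast = rng.normal(0.5, 0.1, 900)  # cache-hit-like mode
|         slow = rng.normal(8.0, 2.0, 100)  # seek-like mode
|         lat = np.concatenate([fast, slow])
|         hist, _ = np.histogram(lat, bins=50, range=(0, 15))
|         rows.append(hist)
|
|     plt.imshow(np.array(rows).T, origin="lower", aspect="auto",
|                extent=(0, 120, 0, 15), cmap="viridis")
|     plt.xlabel("time slice")
|     plt.ylabel("latency (ms)")
|     plt.title("each band is a peak with its own story")
|     plt.show()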
|
| > Armed with this knowledge, and a lot of human effort, over the
| course of a few months in 2013, EBS was able to put a single SSD
| into each and every one of those thousands of servers.
|
| This retrofit project is one of my favorite AWS stories.
|
| > The thing that made this possible is that we designed our
| system from the start with non-disruptive maintenance events in
| mind. We could retarget EBS volumes to new storage servers, and
| update software or rebuild the empty servers as needed.
|
| This is a great reminder that building distributed systems isn't
| just for scale. Here, we see how building the system in a way
| that can seamlessly tolerate the failure of a server, and move
| data around without loss, makes large-scale operations
| (everything from day-to-day software upgrades to a massive
| hardware retrofit project) possible. A "simpler" architecture
| would make these operations much harder, to the point of being
| impossible, and would ultimately make the end-to-end problem
| we're trying to solve for the customer harder.
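|
| I don't know the internals, but the shape of that maintenance
| loop is easy to sketch. A toy model in Python -- every name
| here is my invention, not EBS's actual design:
|
|     from dataclasses import dataclass, field
|
|     @dataclass
|     class Server:
|         name: str
|         volumes: list = field(default_factory=list)
|
|     def drain(server, fleet):
|         """Move every volume off `server` so it can be
|         rebuilt without customer-visible downtime."""
|         for vol in list(server.volumes):
|             # pick the least-loaded peer as the new home
|             target = min((s for s in fleet if s is not server),
|                          key=lambda s: len(s.volumes))
|             # the real system would mirror the data first,
|             # then atomically retarget clients; here we just
|             # move the volume record
|             target.volumes.append(vol)
|             server.volumes.remove(vol)
|         # server is now empty: upgrade software, rebuild, or
|         # retrofit hardware, then return it to the fleet
|
|     fleet = [Server("a", ["v1", "v2"]),
|              Server("b", ["v3"]), Server("c")]
|     drain(fleet[0], fleet)
|     print([(s.name, s.volumes) for s in fleet])
|     # [('a', []), ('b', ['v3', 'v2']), ('c', ['v1'])]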
| tw04 wrote:
| I think the most fascinating thing is watching them relearn every
| lesson the storage industry already knew about a decade earlier.
| Feels like most of this could have been solved by either hiring
| storage industry experts or just acquiring one of the major
| vendors.
| jeeyoungk wrote:
| What is there to learn from a "storage industry expert" or
| major vendors? Network-attached block-level storage at AWS's
| scale hasn't been done before.
| simonebrunozzi wrote:
| If you're curious, this is a talk I gave back in 2009 [0] about
| Amazon S3 internals. It was created from internal assets by the
| S3 team, and a lot in there influenced how EBS was developed.
|
| [0]: https://vimeo.com/7330740
___________________________________________________________________
(page generated 2024-08-22 17:00 UTC)