[HN Gopher] AWS power failure in US-EAST-1 region killed some hardware and instances
___________________________________________________________________
AWS power failure in US-EAST-1 region killed some hardware and
instances
Author : bratao
Score : 83 points
Date : 2021-12-23 18:34 UTC (4 hours ago)
(HTM) web link (www.theregister.com)
(TXT) w3m dump (www.theregister.com)
| iJohnDoe wrote:
| Not piling on AWS. These things happen and I'm sure everyone
| involved is working to improve things. Yes, everyone should
| deploy in multiple availability zones.
|
| My 2 cents: outages happen. Network glitches happen. Bad configs
| and bad updates happen. However, power issues should not really
| happen. One of the primary cost-saving areas of going to the
| cloud is not having to do on-prem power: UPSes, generators,
| maintenance, etc. Not having to do on-prem cooling is another.
| From the customer's perspective, these are supposed to be solved
| problems in a professional data center, the things you no longer
| want to worry about.
| ksec wrote:
| Yes. I expect software and network glitches because they are
| complicated and error-prone. But I fully expect hardware and
| power redundancy. Even if power went down, it shouldn't "kill"
| the hardware. At least not at AWS, the biggest and arguably
| best-run DC operation in the world. But it did.
|
| This suggests to me that the data center, as infrastructure and
| as a building in itself, still has plenty of room for
| improvement.
| kazen44 wrote:
| Getting power right in a DC takes a ton of preventative
| maintenance: doing black-building tests, running frequent load
| tests, managing UPS battery health, and so on.
| spenczar5 wrote:
| Yes, it's hard work... but it's work AWS should be great at.
| Like, _really_ great at.
|
| I used to work at AWS. The most worked-up I ever saw Charlie
| Bell (the de facto head of AWS engineering) was in the weekly
| operations meeting, going over a postmortem that described a
| near-miss in which several generators failed to start properly
| during a power interruption. In this meeting - with nearly
| 1,000 participants! - Charlie got increasingly irate that this
| had happened at all. For the next few weeks, the details of
| some sort of flushing subsystem for the backup generators were
| an every-time topic.
|
| Sadly, Charlie left AWS and now works at Microsoft. I can't
| help but wonder what that operations meeting is like now
| without him.
| CoastalCoder wrote:
| Any idea why a power _failure_ would cause (or reveal) hardware
| damage?
|
| Leading up to Y2K, I remember concerns about spinning hard disks
| not being able to start up again.
|
| And if the power is _flaky_ with spikes and brown-outs, I
| understand that's a problem.
|
| But is either of those relevant to AWS?
| dilyevsky wrote:
| I worked for one of the largest datacenter companies in the
| world, and a significant portion of boxes would never come up
| after being power-cycled.
| mgsouth wrote:
| Slamming on the brakes can be hard on the electronics. It
| causes large, high-speed voltage or current spikes that kill
| (old, tired) components before the spikes can be clamped. It
| can be made worse by old electrolytic caps that no longer
| filter effectively, and is especially hard on power supply
| diodes and MOSFETs.
|
| Here are a few references at random:
|
| https://vbn.aau.dk/ws/portalfiles/portal/108034396/ESREF2013...
|
| https://www.infineon.com/dgdl/s30p5.pdf?fileId=5546d46253360...
|
| https://www.sciencedirect.com/science/article/pii/S003811010...
| Spivak wrote:
| Which is why if you realize that your hard drive is failing
| the worst thing you can do is cut the power [1]. You get your
| data off that box as fast as possible.
|
| [1] Yes, yes - always have backups, and RAID, and replication;
| something about not being able to count that low. Betcha
| most'a'ya don't have any of that on your MBP though, except for
| maybe a backup that lags hours behind your work.
| dahfizz wrote:
| https://news.ycombinator.com/item?id=29666048
|
| TLDR: if electronics are near death, a power cycle is likely to
| kick them over the edge.
| SteveNuts wrote:
| In my experience it's common when there's a full power down of
| a large data center like this. In 2013 the company I was
| working at had an unplanned power outage which took out 200
| spinning drives and 18 servers. I think the motors in the
| drives are fine running at constant speed but don't have the
| juice to spin up from a dead stop.
|
| I remember IBM was renting private planes to get us replacement
| hardware from around the country (we were a very large
| customer).
| jbergens wrote:
| A very long time ago I worked for a medium-sized company and
| there was a power outage. The servers were in a room in our
| building, but they had UPS power. The problem was that the
| UPSes were broken or malfunctioning, and the servers needed to
| do a disk check after the unplanned reboot. It took hours
| before we could work again, even though most drives survived.
| This was also before git, so we could not reach the central
| version control system and could not commit any code. I think
| all the code and workspaces were on servers that were not up
| yet.
| wdfx wrote:
| It's just the numbers. Eventually all hardware will fail, and
| some of it will require a specific set of conditions to fail -
| such as the current inrush from being power-cycled. Given the
| scale of a data centre, power-cycling it all at once means some
| hardware will fail under exactly that condition.
|
| Further to that thought, there must be hardware in that centre
| failing and being replaced all the time due to other
| conditions; we just don't hear about it because it's part of
| normal maintenance.
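|
| A rough back-of-the-envelope sketch of that "just the numbers"
| point, in Python (the failure chance and device count here are
| purely hypothetical):
|
|       # Chance that at least one device fails when an entire
|       # data centre is power-cycled at once.
|       p = 1e-4                 # assumed per-device failure chance
|       n = 100_000              # assumed device count
|       print(1 - (1 - p) ** n)  # ~0.99995 -- near certainty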
| emptybottle wrote:
| It can be a number of things, but oftentimes it's the thermal
| cycle and wear on the components.
|
| After letting devices cool (especially devices that rarely do
| this) and then bringing them back up to temperature, some
| percentage of components may fail or operate out of spec.
|
| Mechanical parts have similar issues. A motor that's been
| happily running for months or years may not restart after
| being stopped.
|
| It's also possible that a power fault itself could damage
| hardware, although in my experience this type of issue is far
| less common.
| michaelrpeskin wrote:
| This might be old lore, but I remember someone telling me many
| years ago that when a spinning disk has been up to
| temperature for a long time there's a bit of lubricant that is
| vaporized and evenly distributed everywhere (like the halides
| in your halogen headlamps). But when it stops spinning that
| stuff cools down and gets sticky and ends up sticking the heads
| to the disk so that it can't get going anymore.
|
| Don't know the truth to that, but it seemed like a reasonable
| explanation at the time.
| _3u10 wrote:
| Think of it like a car with a starter that's wearing out: at
| some point you turn the engine off and it never starts up
| again until you replace the starter. Similarly bad batteries,
| corrosion on terminals, etc. Startup is a much different
| workload than steady running, for electronics and, in the case
| of HDDs, for mechanical parts too.
|
| I have an old fan that will need replacing one day; it takes
| about 15 minutes to reach full speed because the bearings are
| gone. Every time I turn it off I wonder whether the next time
| I turn it on will be the time to replace it.
| kube-system wrote:
| If I were to make some WAGs about things that could generally
| be described this way:
|
| * increased failure rates of drives (or other hardware) due to
| thermal cycling
|
| * cache power failures leading to corrupted data
|
| * previously unknown failures in recovery processes (like
| firmware bugs that might be described as a "hardware problem")
|
| * cooling failures leading to hardware failure
|
| Most of these are pretty rare issues, but each becomes likely
| to happen at least once when you're running freakin' us-east-1.
| zkirill wrote:
| Have there been any incidents that affected more than one AZ?
|
| AWS RDS in a multi-AZ deployment gives you two availability
| zones; Aurora gives you three. What kind of scenario would
| justify three AZs for the purposes of high availability?
| wmf wrote:
| Modern Paxos/Raft HA requires an odd number of nodes, so
| that's how you end up with three AZs.
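|
| A minimal sketch of the quorum arithmetic behind that (node
| counts are illustrative):
|
|       def majority(n):
|           # Smallest group that any two quorums must overlap in.
|           return n // 2 + 1
|
|       for n in (2, 3, 4, 5):
|           print(n, "nodes: quorum", majority(n),
|                 "tolerates", n - majority(n), "failure(s)")
|
| Three nodes tolerate one failure, while four still tolerate
| only one, so even counts buy you nothing - hence odd numbers,
| and hence three AZs.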
| Jweb_Guru wrote:
| Aurora isn't really designed to handle the total failure of two
| availability zones (not short term, anyway). It's designed to
| handle one AZ + one additional node failure (which is
| reasonably likely to happen on large instances due to data from
| a single database being striped across up to 12,800 machines
| per AZ). Due to how quorum systems work, the "simplest" way they
| decided to handle that was six replicas per 10 GiB segment
| across 3 AZs, with three of the replicas being log-only and
| three designated as full replicas (which is the lowest number
| of full replicas you can have given their failure model, and
| hitting that lower bound does require you to deploy across 3
| AZs). If any three of the nodes are dead (log-only or
| otherwise), write traffic is stopped until they can bring up a
| fourth node, though there is some support for backup 3-of-4
| quorums for very long-term AZ outages.
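|
| To make the arithmetic concrete, a toy model of that quorum
| scheme (two replicas per AZ as described above; everything
| else is simplified away):
|
|       REPLICAS = 6            # one full + one log-only copy per AZ
|       WRITE_Q, READ_Q = 4, 3  # votes needed for writes / reads
|
|       def quorums(dead):
|           alive = REPLICAS - dead
|           return alive >= WRITE_Q, alive >= READ_Q
|
|       # A whole AZ down (2 replicas) plus one more node:
|       print(quorums(3))       # (False, True)
|
| Writes stop until a replacement restores a fourth live
| replica, but reads - and therefore recovery - still have
| quorum, matching the "one AZ + one node" failure model.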
| zkirill wrote:
| Amazingly insightful answer. Thank you very much!
| gundmc wrote:
| It's rare to see power issues at a modern data center cause
| downtime. All of those racks should have UPSes and batteries
| to sustain them through an outage until the automatic transfer
| switch can fail over to a redundant feed or generator. I'd be
| interested in reading more about what happened here.
| electroly wrote:
| In previous postmortems[0] they've mentioned their UPSes. They
| definitely have them. They don't seem to write a lot of
| postmortems, though. I'm not sure whether we should expect one
| for this event.
|
| [0] https://aws.amazon.com/premiumsupport/technology/pes/
| 1cvmask wrote:
| A lot more companies will go to a multi-cloud, active-active
| architecture, maybe even with bare-metal redundancy.
| beermonster wrote:
| As large cloud outages become more frequent and the impact
| greater each time, I feel it's more likely people will
| reconsider moving workloads off-prem.
| [deleted]
| tyingq wrote:
| Multi-cloud is odd to me, unless you're a company selling a
| service to cloud customers.
|
| By definition, you would have to either go lowest-common-
| denominator, or build complicated facades in front of like
| services.
|
| If you're going lowest-common-denominator, then multi-old-
| school-hosting would be far cheaper.
| ransom1538 wrote:
| me: "Yeah! AWS went down a few times, I think we should pay
| double!"
|
| vp: no.
| emodendroket wrote:
| More than double, realistically. But you could achieve most of
| the benefit at much lower cost by going multi-AZ.
| profmonocle wrote:
| Egress fees can make replicating databases, storage buckets,
| etc. between clouds _very_ expensive. Multi-region is a much
| more affordable option. Multi-region outages aren't unheard of
| among the major cloud operators, but they're less common than
| single-AZ or single-region outages.
|
| IMO, most companies just aren't so sensitive to downtime that
| multi-AZ + multi-region deployment within a single cloud
| provider isn't good enough for them.
| psanford wrote:
| I would stay away from any company that thinks "we need to go
| multi-cloud" as a response to this outage. This affected a
| single AZ in a single region. If it caused you downtime or a
| partial outage, it means you are not fully resilient to
| single-AZ failures. The correct thing is to fix your
| application to handle that.
|
| If you can't handle a single-AZ failure, there is no way you
| are going to handle failing over across different cloud
| providers correctly.
| dragonwriter wrote:
| > If it caused you downtime or a partial outage, it means you
| are not fully resilient to single-AZ failures.
|
| Given the number of AWS global services that have
| dependencies on infra in US-EAST-1 (and that, judging from the
| impact of this and other past outages, seem vulnerable to
| single-AZ failures in US-EAST-1), that's... less avoidable for
| _certain_ regions/AZs than one might naively expect. Most
| clouds seem to have at least some degree of this kind of
| vulnerability.
| CubsFan1060 wrote:
| There will, however, be a lot of executives _talking_ about
| going multi-cloud.
| nonane wrote:
| > This affected a single AZ in a single region. If it caused
| you downtime or a partial outage, it means you are not fully
| resilient to single-AZ failures.
|
| This is not true. Amazon is not being upfront about what
| happened here. It was simply not a single AZ failure. Our us-
| east-1 ELB load balancers were hosed and were unable to
| direct traffic to other AZs - they simply stopped working and
| were dropping traffic. We tried creating load balancers in
| different AZs and that didn't work either.
|
| How can you be resilient to single AZ failures if load
| balancers stop working region wide during a single AZ outage?
| acdha wrote:
| Did your TAM go into any details on that? Over a couple
| hundred load-balancers, the only issue we had was taking
| longer to register new instances and that affected only a
| couple of them. Running services weren't interrupted, latency
| remained steady, etc., which is what I'd expect for a single
| AZ failure.
| luhn wrote:
| To be fair though, from what I heard the AZ outage caused an
| EC2 API brownout, so people couldn't launch new instances in
| the other AZs. That put a wrench in a lot of multi-AZ
| architectures.
|
| Not advocating for multi-cloud though...
| daneel_w wrote:
| _" As is often the case with a loss of power, there may be some
| hardware that is not recoverable..."_
|
| No. Not even rarely. If they lost hardware because of this
| something much different than just loss of power happened on
| their servers' mains rails.
| jacquesm wrote:
| To forestall that reply: I did think about it before hitting
| the reply button. I've seen a couple of large scale DC power
| outages in my 35 years of IT work. The vast majority of the
| hardware came through just fine. But older data centers with
| gear that has been running uninterrupted for many years tend to
| have at least some hardware that simply won't come up again,
| due to a variety of reasons. One can be that the on-board
| batteries have gone bad, which nobody noticed until the power
| cycle. Another is that some hard drives function well as long
| as they keep spinning, but years of running have worn a nasty
| little spot on the bearing that keeps the spindle aloft. When
| such a drive spins down, depending on its orientation it may
| not be able to start back up again; it may even crash while
| spinning down. Power-cycle
| enough gear that has been running for years and you will most
| likely lose at least some small fraction of it. You could do it
| the next day after taking care of those and everything would
| likely be fine.
| vlovich123 wrote:
| Wouldn't a UPS + generator failover mean that the HW never
| spins down in the first place? That's how I interpreted op's
| statement anyway.
| jacquesm wrote:
| I've seen plenty of UPS's fail, and generator failover is
| that moment when everybody stands there with their fingers
| crossed.
|
| None of those things are foolproof, and in a large-scale
| outage, especially one that lasts more than a couple of
| minutes, there is a fair chance that you will find the
| limitations of some of your contingency plans.
| acdha wrote:
| Hopefully, but those aren't perfect. Large data centers
| regularly test those because things can go wrong with any
| of the UPS, generator[1], or the distribution hardware
| which switches between the line and generator power. One of
| the longer outages I've seen was when the distribution
| hardware itself failed and burnt out multiple parts for which
| the manufacturer did not have sufficient spares in our
| region.
|
| AWS has a lot of very skilled professionals who are quite
| familiar with those issues so I'd be quite surprised if it
| turned out to be something that simple but you always want
| to have a contingency plan you've tested for what happens
| if core infrastructure like that fails and takes multiple
| days to recover.
|
| 1. One nasty example: fuel issues that aren't immediately
| obvious, so someone doing a five-minute test wouldn't learn
| that the system wasn't going to handle an outage longer than
| that.
| Johnny555 wrote:
| If my home UPS failed, I'd be awfully surprised if my home
| fileserver that was plugged into it failed too.
|
| But if I lost power to thousands of servers, I'd expect some
| number of them to fail. I've even lost servers when losing
| power to a single rack.
| joshuamorton wrote:
| Yes, it's a fairly well-known issue at datacenter scale that
| if you power-cycle everything, some percentage of things won't
| turn back on (I recall something like 1% of HDDs just failing
| to work again being the number quoted at me). This obviously
| isn't the case for fresh new HDDs, but it is for ones that
| have been spinning continuously for years.
|
| Other hardware is similar.
| jdsully wrote:
| Starting up is hard on electronics due to the inrush current.
| There may be marginal devices just barely operating that will
| not survive a reboot. Things like bad capacitors aren't always
| detectable during steady state.
|
| At AWS scale it's highly likely there are more than a few of
| these.
| daneel_w wrote:
| Did you think this over before hitting the reply button? Can
| you imagine any server or networking brand surviving if its
| products were at substantial risk of dying when powering up
| or cycling? Have you ever heard of common consumer grade
| computer equipment regularly dying from it? No? Not me
| either.
|
| It's _really_ common to find inrush-limiting thermistors in
| devices, even in cheap electronics. You can bet that PSUs in
| data center equipment use them.
| emodendroket wrote:
| > Have you ever heard of common consumer grade computer
| equipment regularly dying from powering up?
|
| Yes, absolutely.
| daneel_w wrote:
| You're romanticizing here. Electronics eventually dying is
| not the same as their being at substantial risk of dying from
| the power cycle itself.
| emodendroket wrote:
| No, I am not. I personally have experienced this. Pop,
| whoops, it doesn't work anymore.
| daneel_w wrote:
| Same. What you're romanticizing is the notion that it's a
| common and regular occurrence. It's not. It's a fluke.
| emodendroket wrote:
| Now imagine you have one of the world's largest
| assemblages of electronic devices and they all turn on at
| once. How likely are some flukes?
| dpratt wrote:
| And when you have 500,000 copies of something, most of
| which have been in use for some appreciable fraction of
| their effective lifetime, it doesn't have to be "common"
| to occur to some fraction of them.
|
| The post you're responding to never implied that it
| happens to _most_ of them, or even many of them, just
| that when you have a gigantic farm, it's not unreasonable
| to see a small handful of hardware instances release their
| magic smoke when they're all coming back from being powered
| down.
|
| MTBF is a thing.
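|
| As a quick illustration of MTBF at fleet scale (the per-cycle
| failure probability is invented):
|
|       fleet = 500_000   # copies, per the figure above
|       p = 0.001         # assumed 0.1% failure chance per power cycle
|       print(fleet * p)  # 500.0 expected casualties -- "rare" per
|                         # unit, routine across the whole fleet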
| ericd wrote:
| A fluke x massive scale = regular occurrence.
| [deleted]
| [deleted]
| kube-system wrote:
| I am not sure whether or not inrush current (or thermal
| cycling, or something else) is to blame, but server
| equipment in data centers has been fairly well documented
| to exhibit increased hardware failure rates after outages.
| organsnyder wrote:
| Substantial? No. A low percentage that is still large
| enough to be impactful in a datacenter? Absolutely.
| X-Istence wrote:
| When you've got a massive, densely packed datacenter, even
| 0.01% is a meaningful amount of hardware.
| sonofhans wrote:
| It's a scale problem. We're not talking about "substantial
| risk" or "regularly dying." At AWS scale, even a 0.01% cold-
| start failure rate is noticeable. It's not possible to
| guarantee perfect operation of every system or component;
| physical things wear out. If you could find a way to avoid
| cold-start failures 100% of the time, your product would be
| valuable to many businesses.
|
| Also:
|
| > Did you think this over before hitting the reply button?
|
| This is ad hominem, and we should avoid it here.
| jeffbee wrote:
| > Have you ever heard of common consumer grade computer
| equipment regularly dying from it?
|
| You have outed yourself as someone with zero experience
| operating computers, either individually or at scale.
| Computers have moving parts, whether fans or hard disk
| drives. Just because those _were moving_ doesn't mean they
| will _begin moving again_ from a dead stop. There are also
| things like dead CMOS/NVRAM batteries that can prevent
| machines from automatically starting when power is applied.
| hhh wrote:
| I don't know if it's intentional, but your comment comes
| off very hostile.
|
| Recently we had a violent power outage where I work, due to
| the tornadoes in the Midwest. A few of our facilities have
| industrial hardware with batteries that will last for 72
| hours; after that point it's a world of unknowns what will
| happen. The batteries died after a lot of effort to restore
| power, but everything came up just fine.
|
| However, a random Cisco switch elsewhere in the building
| had at least one power supply fail.
|
| This was in one facility where most of these systems have
| been running for 4+ years straight (12 years for the one with
| the 72-hour battery).
|
| I find it hard to imagine that, at the scale of even one AZ,
| there wouldn't be at least one system where this happens.
| acdha wrote:
| > Did you think this over before hitting the reply button?
| Can you imagine any server or networking brand surviving if
| its products were at substantial risk of dying when
| powering up or cycling?
|
| This is hostile enough that I'd have trouble squaring it
| with the site guidelines.
|
| It's especially bad because, as others have been saying,
| your belief isn't supported by real lived experience for
| many of us. A data center has enough devices that even low-
| probability failures will fairly reliably show up, and that
| doesn't usually hurt the manufacturer's reputation because
| because they never promised 100% and will replace it under
| warranty. I've seen this with hard drives especially but
| also things like power supplies and on-board batteries, and
| even things like motherboards where a cooling/heating cycle
| was enough to unseat RAM or cause hairline fractures in
| solder traces.
|
| This can be especially bad with unplanned power outages if
| the power doesn't instantly go out and stay out for the
| duration. For hard drives especially, hitting the power
| up/down cycle a few times was a good way to get extra
| failures.
| jacquesm wrote:
| I've seen even planned outages and tests result in failed
| hardware.
| acdha wrote:
| Ditto -- you never do a substantial shutdown without
| learning something.
| nicolaslem wrote:
| It is common for failing hard drives to continue running until
| they are power-cycled, at which point they never start again.
| X-Istence wrote:
| Having had to restore a NetApp cluster from backup because
| the disks would no longer spin up after having been running
| for 4+ years... yup. The hardware was fully working and,
| according to all internal metrics, still perfectly good - but
| one power outage and the disks never came back up.
| X-Istence wrote:
| Having worked in a datacenter where servers had been running
| for 5+ years in racks, when the datacenter went dark due to a
| UPS going up in flames there were definitely systems that did
| not come back online.
|
| Most notably, we had a large NetApp array that would not
| boot; once we replaced the controllers, we found we had lost
| 13 out of the 60 hard drives in the array. They would no
| longer spin up - physically seized. Because they had been
| spinning for so long, they would likely have kept spinning
| just fine, but with the power gone, they were done.
|
| Fans are another fun one, anything with ball bearings really.
| Power supplies are another issue: due to the sudden large
| inrush of power when the switch was flipped back on, some
| power supplies had their capacitors go up in smoke.
|
| This is not rare; it is a common occurrence. When you have an
| absolutely massive footprint with thousands upon thousands of
| servers that have been running for a long time, there will be
| things that just don't come back once they have stopped
| running or the electricity is gone.
___________________________________________________________________
(page generated 2021-12-23 23:01 UTC)