[HN Gopher] Major data center power failure (again): Cloudflare ...
___________________________________________________________________
Major data center power failure (again): Cloudflare Code Orange
tested
Author : gmemstr
Score : 122 points
Date : 2024-04-08 13:20 UTC (9 hours ago)
(HTM) web link (blog.cloudflare.com)
(TXT) w3m dump (blog.cloudflare.com)
| qmarchi wrote:
| Two major outages less than half a year apart, but with wildly
| different outcomes. It's definitely showing their engineering
| capabilities were targeted at the correct outcomes.
|
| Would definitely be interested to see the detailed RCA on the
| power side of things. Not many people really think about Layer 0
| on the stack.
| mike_d wrote:
| Cloudflare is in EdgeConneX Portland, you can try poking around
| but I haven't seen an RCA of what happened. Details are usually
| only shared with direct customers because it is bad for the
| brand.
|
| https://www.edgeconnex.com/wp-content/uploads/2018/10/ECX-22...
| mannyv wrote:
| Someone set up the breakers incorrectly way back when, and they
| were never adjusted. I'll bet it's not possible to adjust those
| without powering off the downstream equipment.
|
| It reminds me of the Amazon guy discovering that there was no way
| to fail back power without an outage, then them going off and
| building their own equipment.
| SteveNuts wrote:
| > there was no way to fail back power without an outage, then
| them going off and building their own equipment.
|
| Anywhere I can read about that?
| michaelt wrote:
| Maybe [1] which is about per-rack UPSes, to reduce the blast
| radius of UPS failure?
|
| Pretty sensible IMHO - I live in a country with a reliable
| electricity grid, and outages due to UPS malfunction are
| about as common as power outages.
|
| [1] https://www.datacenterdynamics.com/en/news/aws-develops-
| its-...
| nyrikki wrote:
| In large data centers, rack level UPSs are impractical for
| many reasons like cost and efficiency, but the big problem
| is that modern power densities are so high that you want
| rows to fail if cooling isn't available.
|
| It doesn't take long without cooling to cook equipment to
| the point of failure or reduced reliability.
|
| 7 to 16kW per rack is common even in these older colo
| facilities.
|
| And there never would have been enough UPS to make up for
| not enough replacement breakers on site.
| michaelt wrote:
| But isn't the UPS only expected to last for 15 minutes or
| so, to give the backup generators time to start up? Or to
| perform a fast-but-graceful migration when the generator
| doesn't start up?
|
| I thought most DCs just pause the cooling until the
| generator comes up, rather than running the cooling on
| battery power?
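
As a rough back-of-envelope for the figures in this subthread (7 to 16 kW per rack, about 15 minutes of UPS runtime), here is a minimal Python sketch. The function name is hypothetical, and it assumes a constant load with no inverter losses or battery derating, so real sizing would come out higher.

```python
def ups_energy_kwh(load_kw: float, minutes: float) -> float:
    """Energy a UPS must deliver to carry a constant load for `minutes`.

    Assumption: ideal battery and inverter (no losses, no derating).
    """
    return load_kw * minutes / 60.0


# Rack loads and runtime taken from the comments above.
low = ups_energy_kwh(7, 15)
high = ups_energy_kwh(16, 15)
print(f"{low:.2f} kWh to {high:.2f} kWh per rack")
```

This covers only the IT load; whether cooling also runs on battery is exactly the open question in the comment above.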
| jeffbee wrote:
| Google has been using rack-level UPS for many years, so
| I'm curious to hear why you think that is impractical
| specifically in large datacenters.
|
| https://cloud.google.com/blog/products/compute/google-
| joins-....
| justsomehnguy wrote:
| Your typical company, even if it has hundreds of racks,
| usually isn't at the scale _and workloads_ of Google,
| where you can unplug racks and not really notice
| it.
| Dylan16807 wrote:
| You normally reduce the blast radius of UPS failures by
| having two supplies for each server. So I don't think
| that's it.
|
| The link someone else put about overriding faults might be
| it, if OP misremembered the problem.
| michaelt wrote:
| Well in principle, sure.
|
| And yet here we are, reading an article about "a total
| loss of power [...] following a reportedly simultaneous
| failure of four [...] switchboards serving all of
| Cloudflare's cages. This meant both primary and redundant
| power paths were deactivated across the entire
| environment."
| Dylan16807 wrote:
| I'm not sure what you mean. Wouldn't we have the same
| level of outage with per-rack UPSes?
| mysteria wrote:
| > Pretty sensible IMHO - I live in a country with a
| reliable electricity grid, and outages due to UPS
| malfunction are about as common as power outages.
|
| Hence why modern servers and their associated networking
| equipment have dual power supplies which could be connected
| to two separate UPS systems. It would be very unlikely to
| have them both fail at once. In a less important home/small
| business scenario typically one supply is connected to the
| UPS and the other is connected to the wall via a surge
| protector.
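
The "very unlikely to have them both fail at once" reasoning can be sketched as below, assuming the two UPS feeds fail independently. The probabilities are made-up illustrative values, not measured failure rates, and the function name is hypothetical.

```python
def both_feeds_fail(p_a: float, p_b: float) -> float:
    """Probability that both power feeds fail in the same period.

    Assumption: the two failures are statistically independent.
    """
    return p_a * p_b


# Hypothetical per-year failure probability for each UPS feed.
p_single = 0.01
p_dual = both_feeds_fail(p_single, p_single)
print(f"single feed: {p_single:.4f}, both feeds: {p_dual:.6f}")
```

The independence assumption is the weak point: a shared upstream fault, like the switchboard failure quoted earlier in the thread, takes out both paths at once and the product rule no longer applies.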
| jabart wrote:
| Per-Rack UPS is not allowed in our local Flexential DC.
| Reason being is that in case of a fire, the whole room
| needs to go dark on power for fire control. We do have two
| redundant AC circuits on two different breakers. But our DC
| was from a company that got bought out by Flexential so
| maybe, hopefully, it's set up differently.
| hx833001 wrote:
| https://perspectives.mvdirona.com/2017/04/at-scale-rare-
| even...
| decasia wrote:
| As always, it's really impressive to see how much technical
| detail they release publicly in their RCAs. It sets a good
| example for the industry.
|
| Also -- quite impressive to make major infrastructure and
| architecture changes in a few months. Not every organization can
| pull that off.
| Waterluvian wrote:
| I feel there's a sweet spot where if you do it too quickly,
| it's a bad sign. And if it takes years, risk just keeps going
| up and up until it becomes basically impossible to do smoothly.
| andrewaylett wrote:
| I can very definitely empathise with the experience of having
| worked hard at fixing the issues underpinning high priority
| incidents, then noticing that what previously would have taken
| hours to fix is now only visible as a blip on a graph.
| llbeansandrice wrote:
| A single k8s cluster spanning multiple datacenters feels mind
| boggling to me. I know it's not exactly uncommon for HA even if
| you just have a little one in your cloud provider of choice but
| I'm sure it's a totally different beast than the toy ones I've
| created.
___________________________________________________________________
(page generated 2024-04-08 23:00 UTC)