[HN Gopher] Major data center power failure (again): Cloudflare ...
       ___________________________________________________________________
        
       Major data center power failure (again): Cloudflare Code Orange
       tested
        
       Author : gmemstr
       Score  : 122 points
       Date   : 2024-04-08 13:20 UTC (9 hours ago)
        
 (HTM) web link (blog.cloudflare.com)
 (TXT) w3m dump (blog.cloudflare.com)
        
       | qmarchi wrote:
       | Two major outages less than half a year apart, but with wildly
       | different outcomes. It's definitely showing their engineering
       | capabilities were targeted at the correct outcomes.
       | 
       | Would definitely be interested to see the detailed RCA on the
       | power side of things. Not many people really think about Layer 0
       | on the stack.
        
         | mike_d wrote:
         | Cloudflare is in EdgeConneX Portland. You can try poking around,
         | but I haven't seen an RCA of what happened. Details are usually
         | only shared with direct customers because publicizing them is
         | bad for the brand.
         | 
         | https://www.edgeconnex.com/wp-content/uploads/2018/10/ECX-22...
        
       | mannyv wrote:
       | Someone set up the breakers incorrectly way back when, and they
       | were never adjusted. I'll bet it's not possible to adjust those
       | without powering off the downstream equipment.
       | 
       | It reminds me of the Amazon guy discovering that there was no way
       | to fail back power without an outage, then them going off and
       | building their own equipment.
        
         | SteveNuts wrote:
         | > there was no way to fail back power without an outage, then
         | them going off and building their own equipment.
         | 
         | Anywhere I can read about that?
        
           | michaelt wrote:
           | Maybe [1] which is about per-rack UPSes, to reduce the blast
           | radius of UPS failure?
           | 
           | Pretty sensible IMHO - I live in a country with a reliable
           | electricity grid, and outages due to UPS malfunction are
           | about as common as power outages.
           | 
           | [1] https://www.datacenterdynamics.com/en/news/aws-develops-its-...
        
             | nyrikki wrote:
             | In large data centers, rack-level UPSes are impractical for
             | many reasons, like cost and efficiency, but the big problem
             | is that modern power densities are so high that you want
             | rows to fail if cooling isn't available.
             | 
             | It doesn't take long without cooling to cook equipment to
             | the point of failure or reduced reliability.
             | 
             | 7 to 16kW per rack is common even in these older colo
             | facilities.
             | 
             | And no amount of UPS capacity would have made up for not
             | having enough replacement breakers on site.
        
               | michaelt wrote:
               | But isn't the UPS only expected to last for 15 minutes or
               | so, to give the backup generators time to start up? Or to
               | perform a fast-but-graceful migration when the generator
               | doesn't start up?
               | 
               | I thought most DCs just pause the cooling until the
               | generator comes up, rather than running the cooling on
               | battery power?
        
               | jeffbee wrote:
               | Google has been using rack-level UPS for many years, so
               | I'm curious to hear why you think that is impractical
               | specifically in large datacenters.
               | 
               | https://cloud.google.com/blog/products/compute/google-joins-....
        
               | justsomehnguy wrote:
               | Your typical company, even if it has hundreds of racks,
               | usually isn't at the scale _and workloads_ of Google,
               | where you can unplug racks and not really notice it.
        
             | Dylan16807 wrote:
             | You normally reduce the blast radius of UPS failures by
             | having two supplies for each server. So I don't think
             | that's it.
             | 
             | The link someone else put about overriding faults might be
             | it, if OP misremembered the problem.
        
               | michaelt wrote:
               | Well in principle, sure.
               | 
               | And yet here we are, reading an article about "a total
               | loss of power [...] following a reportedly simultaneous
               | failure of four [...] switchboards serving all of
               | Cloudflare's cages. This meant both primary and redundant
               | power paths were deactivated across the entire
               | environment."
        
               | Dylan16807 wrote:
               | I'm not sure what you mean. Wouldn't we have the same
               | level of outage with per-rack UPSes?
        
             | mysteria wrote:
             | > Pretty sensible IMHO - I live in a country with a
             | reliable electricity grid, and outages due to UPS
             | malfunction are about as common as power outages.
             | 
             | Hence why modern servers and their associated networking
             | equipment have dual power supplies which could be connected
             | to two separate UPS systems. It would be very unlikely to
             | have them both fail at once. In a less important home/small
             | business scenario typically one supply is connected to the
             | UPS and the other is connected to the wall via a surge
             | protector.
        
             | jabart wrote:
             | Per-rack UPS is not allowed in our local Flexential DC.
             | The reason is that in case of a fire, the whole room
             | needs to go dark on power for fire control. We do have two
             | redundant AC circuits on two different breakers. But our DC
             | was from a company that got bought out by Flexential, so
             | maybe, hopefully, it's set up differently.
        
           | hx833001 wrote:
           | https://perspectives.mvdirona.com/2017/04/at-scale-rare-even...
        
       | decasia wrote:
       | As always, it's really impressive to see how much technical
       | detail they release publicly in their RCAs. It sets a good
       | example for the industry.
       | 
       | Also -- quite impressive to make major infrastructure and
       | architecture changes in a few months. Not every organization can
       | pull that off.
        
         | Waterluvian wrote:
         | I feel there's a sweet spot where if you do it too quickly,
         | it's a bad sign. And if it takes years, risk just keeps going
         | up and up until it becomes basically impossible to do smoothly.
        
       | andrewaylett wrote:
       | I can very definitely empathise with the experience of having
       | worked hard at fixing the issues underpinning high priority
       | incidents, then noticing that what previously would have taken
       | hours to fix is now only visible as a blip on a graph.
        
       | llbeansandrice wrote:
       | A single k8s cluster spanning multiple datacenters feels
       | mind-boggling to me. I know it's not exactly uncommon for HA, even
       | if you just have a little one in your cloud provider of choice,
       | but I'm sure it's a totally different beast than the toy ones
       | I've created.
        
       ___________________________________________________________________
       (page generated 2024-04-08 23:00 UTC)