[HN Gopher] More details about the October 4 outage
       ___________________________________________________________________
        
       More details about the October 4 outage
        
       Author : moneil971
       Score  : 320 points
       Date   : 2021-10-05 17:30 UTC (5 hours ago)
        
 (HTM) web link (engineering.fb.com)
 (TXT) w3m dump (engineering.fb.com)
        
       | pm2222 wrote:
        | I'm confused by this. Are the DNS servers inside the backbone, or
        | outside?
        | 
        | > To ensure reliable operation, our DNS servers disable those BGP
        | > advertisements if they themselves can not speak to our data
        | > centers, since this is an indication of an unhealthy network
        | > connection. In the recent outage the entire backbone was removed
        | > from operation, making these locations declare themselves
        | > unhealthy and withdraw those BGP advertisements. The end result
        | > was that our DNS servers became unreachable even though they
        | > were still operational. This made it impossible for the rest of
        | > the internet to find our servers.
        
         | net4all wrote:
         | If I understand it correctly they have DNS servers spread out
         | at different locations. These locations are also BGP peering
         | locations. If the DNS server at a location cannot reach the
         | other datacenters via the backbone it stops advertising their
         | IP prefixes. The hope is that traffic will then instead get
         | routed to some other facebook location that is still
         | operational.
        
         | [deleted]
        
         | toast0 wrote:
         | If we make a simplified network map, FB looks more or less like
         | a bunch of PoPs (points of presence) at major peering points
         | around the world, a backbone network that connects those PoPs
         | to the FB datacenters, and the FB operated datacenters
         | themselves. (The datacenters are generally located a bit
         | farther away from population centers, and therefore peering
         | points, so it's sensible to communicate to the outside world
         | through the PoPs only)
         | 
         | The DNS servers run at the PoPs, but only BGP advertise the DNS
         | addresses when the PoP determines it's healthy. If there's no
         | connectivity back to a FB datacenter (or perhaps, no
         | connectivity to the preferred datacenter), the PoP is unhealthy
         | and won't advertise the DNS addresses over BGP.
         | 
         | Since the BGP change that was pushed eliminated the backbone
         | connectivity, none of the PoPs were able to connect to
         | datacenters, and so they all, independently, stopped
         | advertising the DNS addresses.
         | 
         | So that's why DNS went down. Of course, since client access
         | goes through load balancers at the PoPs, and the PoPs couldn't
         | access the datacenters where requests are actually processed,
         | DNS being down wasn't a meaningful impediment to accessing the
         | services. Apparently, it was an issue with management (among
         | other issues).
         | 
         | Disclosure: I worked at WhatsApp until 2019, and saw some of
         | the network diagrams. Network design may have changed a bit in
         | the last 2 years, but probably not too much.
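          | 
          | For the curious, the gating logic is conceptually something
          | like this rough Python sketch (the hostnames, prefix and the
          | bgp_session object are all made up here, not FB's actual
          | tooling): the PoP only announces the anycast DNS prefix
          | while it can still reach at least one datacenter over the
          | backbone.
          | 
          |   import subprocess
          | 
          |   DATACENTERS = ["dc1.example.net", "dc2.example.net"]  # hypothetical
          |   DNS_ANYCAST_PREFIX = "192.0.2.0/24"  # documentation prefix
          | 
          |   def backbone_healthy():
          |       # Healthy if at least one DC answers a single ping.
          |       return any(
          |           subprocess.run(["ping", "-c", "1", "-W", "2", dc],
          |                          capture_output=True).returncode == 0
          |           for dc in DATACENTERS)
          | 
          |   def reconcile(bgp_session):
          |       # Announce the DNS prefix only while a DC is reachable;
          |       # otherwise withdraw it so another PoP serves the anycast.
          |       if backbone_healthy():
          |           bgp_session.announce(DNS_ANYCAST_PREFIX)
          |       else:
          |           bgp_session.withdraw(DNS_ANYCAST_PREFIX)
          | 
          | When the whole backbone disappears, every PoP independently
          | takes the withdraw branch, which is exactly what happened.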
        
           | pm2222 wrote:
           | Ok so the DNS servers at PoPs, outside of backbone, did not
           | go down.
           | 
           | Does it mean they can respond with public IPs meaningful for
           | local PoP only and are not able to respond with IPs as
           | directions to other PoPs or FB's main DCs? So that has to
           | mean different public IPs are handed out at different PoPs,
           | right?
        
             | toast0 wrote:
             | I'm not quite sure I understand the question exactly, but
             | let me give it a try.
             | 
             | So, first off, each pop has a /24, so like the seattle-1
             | pop which is near me has 157.240.3.X addresses; for me,
             | whatsapp.net currently resolves to 157.240.3.54 in the
              | seattle-1 pop. These addresses are used as unicast, meaning
             | they go to one place only, and they're dedicated for
             | seattle-1 (until FB moves them around). But there are also
             | anycast /24s, like 69.171.250.0/24, where 69.171.250.60 is
             | a loadbalancer IP that does the same job as 157.240.3.54,
             | but multiple PoPs advertise 69.171.250.0/24; it's served
             | from seattle-1 for me, but probably something else for you
             | unless you're nearby.
             | 
             | The DNS server IPs are also anycast, so if a PoP is
             | healthy, it will BGP advertise the DNS server IPs (or at
             | least some of them; if I ping {a-d}.ns.whatsapp.net, I see
             | 4 different ping times, so I can tell seattle-1 is only
             | advertising d.ns.whatsapp.net right now, and if I worked a
             | little harder, I could probably figure out the other PoPs).
             | 
             | Ok, so then I think your question is, if my DNS request for
             | whatsapp.net makes it to the seattle1 PoP, will it only
             | respond with a seattle-1 IP? That's one way to do it, but
             | it's not necessarily the best way. Since my DNS requests
             | could make it to any PoP, sending back an answer that
             | points at that PoP may not be the best place to send me.
             | 
             | Ideally, you want to send back an answer that is network
             | local to the requester and also not a PoP that is
             | overloaded. Every fancy DNS server does it a little
             | different, but more or less you're integrating a bunch of
             | information that links resolver IP to network location as
             | well as capacity information and doing the best you can.
             | Sometimes that would be sending users to anycast which
             | should end up network local (but doesn't always), sometimes
             | it's sending them to a specific pop you think is local,
             | sometimes it's sending them to another pop because the
             | usual best pop has some issue (overloaded on CPU, network
             | congestion to the datacenters, network congestion on
             | peering/transit, utility power issue, incoming weather
             | event, fiber cut or upcoming fiber maintenance, etc).
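              | 
              | As a rough illustration only (the real mapping systems
              | are far more involved, and these PoP names, RTTs and
              | load numbers are invented), the selection boils down to
              | something like:
              | 
              |   # hypothetical view of PoPs as seen by the DNS server
              |   POPS = {
              |       "seattle-1": {"rtt_ms": 10, "load": 0.95, "healthy": True},
              |       "sfo-1":     {"rtt_ms": 25, "load": 0.40, "healthy": True},
              |       "nyc-1":     {"rtt_ms": 70, "load": 0.30, "healthy": True},
              |   }
              | 
              |   def pick_pop(pops=POPS, max_load=0.85):
              |       # prefer the network-closest PoP that is healthy
              |       # and not overloaded
              |       ok = [(i["rtt_ms"], n) for n, i in pops.items()
              |             if i["healthy"] and i["load"] < max_load]
              |       if not ok:
              |           return None  # nothing usable locally
              |       return min(ok)[1]
              | 
              |   print(pick_pop())  # seattle-1 is overloaded, so: sfo-1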
             | 
             | But in short, different DNS requests will get different
             | answers. If you've got a few minutes, run these commands to
             | see the range of answers you could get for the same query:
              |   host whatsapp.net                    # using your system resolver settings
              |   host whatsapp.net a.ns.whatsapp.net  # direct to authoritative A
              |   host whatsapp.net b.ns.whatsapp.net  # direct to B
              |   host whatsapp.net 8.8.8.8            # google public DNS
              |   host whatsapp.net 1.1.1.1            # cloudflare public DNS
              |   host whatsapp.net 4.2.2.1            # level 3 not entirely public DNS
              |   host whatsapp.net 208.67.222.222     # OpenDNS
              |   host whatsapp.net 9.9.9.9            # Quad9
             | 
             | You should see a bunch of different addresses for the same
             | service. FB hostnames do similar things of course.
             | 
              | Adding on, the BGP announcements for the unicast /24s of the
             | PoPs didn't go down during yesterday's outage. If you had
             | any of the pop specific IPs for whatsapp.net, you could
             | still use http://whatsapp.net (or https://whatsapp.net ),
             | because the configuration for that hostname is so simple,
             | it's served from the PoPs without going to the datacenters
             | (it just sets some HSTS headers and redirects to
             | www.whatsapp.com, which perhaps despite appearances is a
             | page that is served from the datacenters and so would not
             | have worked during the outage).
        
               | pm2222 wrote:
                | > Ideally, you want to send back an answer that is
                | > network local to the requester and also not a PoP
                | > that is overloaded.
               | 
                | Right, I was hoping FB's DNS servers would be smarter
                | than usual: say, when the DNS at Seattle-1 cannot reach
                | the backbone, it would respond with the IP of perhaps
                | NYC/SF before it starts the BGP withdrawal.
                | 
                | Thanks for the write-up; I enjoyed it.
        
               | toast0 wrote:
               | > Right I was hoping the DNSs of FB ought to be smarter
               | than usual and let's say when DNS at Seattle-1 cannot
               | reach backbone it'd respond with IP of perhaps NYC/SF
               | before it starts the BGP withdrawal.
               | 
               | The problem there is coordination. The PoPs don't
               | generally communicate amongst themselves (and may not
               | have been able to after the FB backbone was broken,
               | although technically, they could have through transit
               | connectivity, it may not be configured to work that way),
               | so when a PoP loses its connection to the FB datacenters,
               | it also loses its source of what PoPs are available and
               | healthy. I think this is likely a classic distributed
               | systems problem; the desired behavior when an individual
               | node becomes unhealthy is different than when all nodes
               | become unhealthy, but the nature of distributed systems
                | is that a node can't tell if it's the only unhealthy node
               | or all nodes became unhealthy together. Each individual
               | PoP did the right thing by dropping out of the anycast,
               | but because they all did it, it was the wrong thing.
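                | 
                | A toy way to see the "individually right, collectively
                | wrong" behavior (purely illustrative Python, nothing
                | FB-specific):
                | 
                |   def pop_decision(can_reach_any_dc):
                |       # each PoP decides alone, with only local info
                |       return "advertise" if can_reach_any_dc else "withdraw"
                | 
                |   # one PoP loses its backbone links: the right outcome
                |   print([pop_decision(ok) for ok in [False, True, True]])
                |   # ['withdraw', 'advertise', 'advertise']
                | 
                |   # the backbone itself dies: every PoP withdraws and
                |   # the anycast DNS prefix vanishes from the internet
                |   print([pop_decision(ok) for ok in [False, False, False]])
                |   # ['withdraw', 'withdraw', 'withdraw']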
        
               | pm2222 wrote:
                | You are to the point and precise. This is exactly the
                | problem:
                | 
                | > Each individual PoP did the right thing by dropping
                | > out of the anycast, but because they all did it, it
                | > was the wrong thing.
                | 
                | Somehow I feel the design is flawed because it abuses
                | the DNS server's health status a bit. I mean, a DNS
                | server being down plus a BGP withdrawal _for the DNS
                | server_ is a perfect combination; however, connectivity
                | between the DNS server and the backend being down, with
                | DNS itself up, plus a BGP withdrawal _for the DNS
                | server_ is not. DNS did not fail, and DNS should just
                | fall back to handing out some other operational
                | destination, perhaps a regional/global default one.
        
         | fanf2 wrote:
         | Facebook's authoritative DNS servers are at the borders between
         | Facebook's backbone and the rest of the Internet.
        
         | cranekam wrote:
         | The wording is a bit unclear (rushed, no doubt) but I expect
         | this means the DNS servers stopped announcing themselves as
         | possible targets for the anycasted IPs Facebook uses for its
         | authoritative DNS [1], since they learned that the network was
         | deemed unhealthy. If they all do that nobody will answer
         | traffic sent to the authoritative DNS IPs and nothing works.
         | 
         | [1] See "our authoritative name servers that occupy well known
         | IP addresses themselves" mentioned earlier
        
           | pm2222 wrote:
            | The anycasted IPs for the DNS servers make sense to me,
            | and the BGP withdrawal too, when the common case is that
            | one or a few PoPs lose connectivity to the backbone/DC; it
            | is rare for every one of them to fail at the same time.
            | 
            | I was hoping the DNS servers at the PoPs could be improved
            | by responding with public IPs for other PoPs/DCs, and only
            | starting the BGP withdrawal when that is not available.
            | Otherwise, I'd presume the available DNS servers at the
            | PoPs decrease over time, with the remaining ones getting
            | more and more requests until finally every one of them is
            | cut off from the internet.
        
       | rvnx wrote:
        | It's a no-apologies message:
       | 
       | "We failed, our processes failed, our recovery process only
       | partially worked, we celebrate failure. Our investors were not
       | happy, our users were not happy, some people probably ended in
       | physically dangerous situations due to WhatsApp being unavailable
       | but it's ok. We believe a tradeoff like this is worth it."
       | 
       | - Your engineering team.
        
         | rPlayer6554 wrote:
         | Literally the first line of their first article.[0] This
         | article is just a technical explanation.
         | 
         | [0] https://engineering.fb.com/2021/10/04/networking-
         | traffic/out...
        
         | jffry wrote:
         | Yesterday's blog post (discussion here:
         | https://news.ycombinator.com/item?id=28754824) was a direct
         | apology.
         | 
         | Should that be repeated in a somewhat more technical discussion
         | of why it happened?
        
         | kzrdude wrote:
         | We (the users) are the product anyway, they can only apologize
         | to themselves for missing out on making money yesterday.
        
       | dmoy wrote:
       | Interesting bit on recovery w.r.t. the electrical grid
       | 
       | > flipping our services back on all at once could potentially
       | cause a new round of crashes due to a surge in traffic.
       | Individual data centers were reporting dips in power usage in the
       | range of tens of megawatts, and suddenly reversing such a dip in
       | power consumption could put everything from electrical systems
       | ...
       | 
       | I wish there was a bit more detail in here. What's the worst case
       | there? Brownouts, exploding transformers? Or less catastrophic?
        
         | nomel wrote:
          | If your system is pulling 500 watts at 120V, that's around 4A
          | of line current. If the line drops about 20% to 100V, the
          | output will happily still hold its regulated voltage, but the
          | supply keeps drawing the same power, so the line-side
          | components now see ~20% more current, about 5A. To survive a
          | brownout, you need to overrate your components, and/or shut
          | everything off if the line voltage goes too low.
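          | 
          | The arithmetic, spelled out (assuming an ideal
          | constant-power load):
          | 
          |   power_w = 500.0  # load keeps drawing 500 W regardless of input
          |   for line_v in (120.0, 100.0):
          |       print(f"{line_v:.0f} V -> {power_w / line_v:.2f} A")
          |   # 120 V -> 4.17 A
          |   # 100 V -> 5.00 A  (~20% more current in the line-side parts)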
         | 
         | I used to do electrical compliance testing in a previous life,
         | with brown out testing being one of our safety tests. You would
         | drape a piece of cheese cloth over the power supply and slowly
         | ramp the line voltage down. At the time, the power supplies
         | didn't have good line side voltage monitoring. There was almost
         | always smoke, and sometimes cheese cloth fires. Since this was
         | safety testing, pass/fail was mostly based on if the cheese
         | cloth caught fire, not if the power supply was damaged.
        
           | dreamlayers wrote:
           | Why wouldn't the power supply shut down due to overcurrent
           | protection?
        
             | detaro wrote:
             | That's likely on the output side, which doesn't see
             | overcurrent.
        
         | kube-system wrote:
         | Likely tripping breakers or overload protection on UPSes?
         | 
         | Often PDUs used in a rack can be configured to start servers up
         | in a staggered pattern to avoid a surge in demand for these
         | reasons.
         | 
         | I'd imagine there's more complications when you're doing an
         | entire DC vs just a single rack, though.
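          | 
          | A minimal sketch of that staggered start (delay value and
          | outlet names invented, not any particular vendor's PDU
          | firmware):
          | 
          |   import time
          | 
          |   OUTLETS = ["outlet-1", "outlet-2", "outlet-3", "outlet-4"]
          |   STAGGER_SECONDS = 5  # spread the inrush current over time
          | 
          |   def power_on_staggered(outlets, turn_on):
          |       for i, outlet in enumerate(outlets):
          |           if i:
          |               time.sleep(STAGGER_SECONDS)
          |           turn_on(outlet)  # whatever actually drives the relay
          | 
          |   power_on_staggered(OUTLETS, lambda o: print("energizing", o))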
        
           | topspin wrote:
           | Disk arrays have been staggering drive startup for a long
           | time for this reason. Sinking current into hundreds of little
           | starting motors simultaneously is a bad idea.
        
           | Johnny555 wrote:
           | I don't see how suddenly running more traffic is going to
           | trip datacenter breakers -- I could see how flipping on power
           | to an entire datacenter's worth of servers could cause a
           | spike in electrical demand that the power infrastructure
            | can't handle, but if suddenly running CPUs at 100% trips
           | breakers, then it seems like that power infrastructure is
           | undersized? This isn't a case where servers were powered off,
           | they were idle because they had no traffic.
           | 
           | Do large providers like Facebook really provision less power
           | than their servers would require at 100% utilization? Seems
           | like they could just use fewer servers with power sized at
            | 100% if their power system is going to constrain utilization
           | anyway?
        
             | kube-system wrote:
             | I don't know the answer. But it's not too uncommon, in
             | general, to provision for reasonable use cases plus a
             | margin, rather than provision for worst case scenario.
        
             | dfcowell wrote:
             | All of the components in the supply chain will be rated for
             | greater than max load, however power generation at grid
             | scale is a delicate balancing act.
             | 
             | I'm not an electrical engineer, so the details here may be
             | fuzzy, however in broad strokes:
             | 
             | Grid operators constantly monitor power consumption across
             | the grid. If more power is being drawn than generated, line
             | frequency drops _across the whole grid._ This leads to
             | brownouts and can cause widespread damage to grid equipment
             | and end-user devices.
             | 
             | The main way to manage this is to bring more capacity
             | online to bring the grid frequency back up. This is slow,
             | since spinning up even "fast" generators like natural gas
             | can take on the order of several minutes.
             | 
             | Notably, this kind of scenario is the whole reason the
             | Tesla battery in South Australia exists. It can respond to
             | spikes in demand (and consume surplus supply!) much faster
             | than generator capacity can respond.
             | 
             | The other option is load shedding, where you just
             | disconnect parts of your grid to reduce demand.
             | 
             | Any large consumers (like data center operators) likely
             | work closely with their electricity suppliers to be good
             | citizens and ramp up and down their consumption in a
             | controlled manner to give the supply side (the power
             | generators) time to adjust their supply as the demand
             | changes.
             | 
             | Note that changes to power draw as machines handle
             | different load will also result in changes to consumption
             | in the cooling systems etc. making the total consumption
             | profile substantially different coming from a cold start.
        
               | Johnny555 wrote:
               | You're talking about the grid, the OP was talking about
               | datacenter infrastructure -- which one is the weak link?
               | 
               | If a datacenter can't go from idle (but powered on)
               | servers to fully utilized servers without taking down the
               | power grid, then it seems that they'd have software
               | controls in place to prevent this, since there are other
               | failure modes that could cause this behavior other than a
               | global Facebook outage.
        
               | dfcowell wrote:
               | Unfortunately the article doesn't provide enough explicit
               | detail to be 100% sure one way or the other, however my
               | read is that it's probably the grid.
               | 
               | > Individual data centers were reporting dips in power
               | usage in the range of tens of megawatts, and suddenly
               | reversing such a dip in power consumption could put
               | everything from electrical systems to caches at risk.
               | 
               | "Electrical systems" is vague and could refer to either
               | internal systems, external systems or both.
               | 
               | That said, if the DC is capable of running under
               | sustained load at peak (which we have to assume it is,
               | since that's its normal state when FB is operational) it
               | seems to me like the externality of the grid is the more
               | likely candidate.
               | 
               | In terms of software controls preventing this kind of
               | failure mode, they do have it - load shedding. They'll
               | cut your supply until capacity is made available.
        
             | cesarb wrote:
             | The key word is "suddenly".
             | 
             | In the electricity grid, demand and generation must always
             | be precisely matched (otherwise, things burn up). This is
             | done by generators automatically ramping up or down
             | whenever the load changes. But most generators cannot
             | change their output instantly; depending on the type of
             | generator, it can take several minutes or even hours to
             | respond to a large change in the demand.
             | 
             | Now consider that, on modern servers, most of the power
             | consumption is from the CPU, and also there's a significant
             | difference on the amount of power consumed between 100% CPU
             | and idle. Imagine for instance 1000 servers (a single rack
             | can hold 40 servers or more), each consuming 2kW of power
             | at full load, and suppose they need only half that at idle
             | (it's probably even less than half). Suddenly switching
             | from idle to full load would mean 1MW of extra power has to
             | be generated; while the generators are catching up to that,
             | the voltage drops, which means the current increases to
             | compensate (unlike incandescent lamps, switching power
             | supplies try to maintain the same output no matter the
             | input voltage), and breakers (which usually are configured
             | to trip on excess current) can trip (without breakers, the
             | wiring would overheat and burn up or start a fire).
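              | 
              | The step in that example, worked out (numbers are
              | hypothetical, as above):
              | 
              |   servers = 1000
              |   full_load_kw = 2.0          # per server at 100% CPU
              |   idle_kw = full_load_kw / 2  # generous idle estimate
              |   step_mw = servers * (full_load_kw - idle_kw) / 1000
              |   print(f"sudden extra demand: {step_mw:.1f} MW")  # 1.0 MW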
             | 
             | If the load changes slowly, on the other hand, there's
             | enough time for the governor on the generators to adjust
             | their power source (opening valves to admit more water or
             | steam or fuel), and overcome the inertia of their large
             | spinning mass, before the voltage drops too much.
        
               | Johnny555 wrote:
               | I get that lots of servers can add up to lots of power,
               | but what is a "lot"? Is 1MW really enough demand to
               | destabilize a regional power grid?
        
         | mschuster91 wrote:
         | One case is automated protection systems in the grid detecting
          | a sudden jump in current and assuming an insulation failure along
         | the path - basically, not enough current to trip the short-
         | circuit breakers, but enough to raise an alarm.
        
         | plorg wrote:
          | Brownouts are probably the most proximate concern - a sudden
         | increase in demand will draw down the system frequency in the
         | vicinity, and if there aren't generation units close enough or
         | with enough dispatchable capacity there's a small chance they
         | would trip a protective breaker.
         | 
         | A person I know on the power grid side said at one data center
         | there were step functions when FB went down and then when it
         | came up, equal to about 20% of the load behind the distribution
         | transformer. That quantity is about as much as an aluminum
         | smelter switching on or off.
        
           | Johnny555 wrote:
           | But don't their datacenters all have backup generators? So
           | worst case in a brownout, they fail over to generator power,
           | then can start to flip back to utility power slowly.
           | 
           | Or do they forgo backup generators and count on shifting
           | traffic to a new datacenter if there's a regional power
           | outage?
        
             | kgermino wrote:
             | Edit to be less snarky:
             | 
             | I assume they do have backup generators, though I don't
             | know.
             | 
             | However if the sudden increase put that much load on the
             | grid it could drop the frequency enough to blackout the
             | entire neighborhood. That would be bad even if FB was able
             | to keep running through it.
        
           | [deleted]
        
         | MrDunham wrote:
         | I'm very close with someone who works at a FB data center and
         | was discussing this exact issue.
         | 
         | I can only speak to one problem I know of (and am rather sure I
         | can share): a spike might trip a bunch of breakers at the data
         | center.
         | 
          | BUT, unlike me at home, FB's policy is to never flip a circuit
         | back on until you're _positive_ of the root cause of said trip.
         | 
         | By itself that could compound issues and delay ramp up time as
         | they'd work to be sure no electrical components actually
          | shorted/blew/etc. A potentially time-sucking task given these
         | buildings could be measured in whole units of football fields.
        
       | [deleted]
        
       | Ansil849 wrote:
       | > The backbone is the network Facebook has built to connect all
       | our computing facilities together, which consists of tens of
       | thousands of miles of fiber-optic cables crossing the globe and
       | linking all our data centers.
       | 
       | This makes it sound like Facebook has physically laid "tens of
       | thousands of miles of fiber-optic cables crossing the globe and
       | linking all our data centers". Is this in fact true?
        
         | colechristensen wrote:
         | Likely a mixture of bought, leased, and self laid fiber. This
         | is not at all uncommon and basically necessary if you have your
         | own data center.
        
         | Closi wrote:
         | Yes, although undoubtedly a lot of this is shared investment -
         | https://www.businessinsider.com/google-facebook-giant-unders...
         | 
         | https://datacenterfrontier.com/facebook-will-begin-selling-w...
        
       | louwrentius wrote:
       | > Our primary and out-of-band network access was down
       | 
       | Don't create circular dependencies.
        
         | oarabbus_ wrote:
          | How do you avoid circular dependencies on an out-of-band
          | network? Seems like the choice is between a circular
         | dependency, or turtles all the way down.
        
           | detaro wrote:
           | How do you go from "have a separate access method that
           | doesn't depend on your main system" to "turtles all the way
           | down"? The secondary access is allowed to have dependencies,
           | just not on _your_ network.
        
             | oarabbus_ wrote:
             | And if the secondary access fails, then what? Backup
             | systems are not reliable 100% of the time.
        
               | Ajedi32 wrote:
               | Then you're SOL. What's your point? The backup might
               | fail, so don't have a backup? I don't understand what
               | you're trying to say.
        
               | _moof wrote:
               | Then you were 1-FT, which is still worlds better than
               | 0-FT.
               | 
               | "Don't put two engines on the plane because both of them
               | might fail" is not how fault tolerance works.
        
         | vitus wrote:
         | With something as fundamental as the network, no way around it.
         | 
         | - Okay, we'll set up a separate maintenance network in case we
         | can't get to the regular network.
         | 
         | - Wait, but we need a maintenance network for the maintenance
         | network...
        
           | packetslave wrote:
           | "Okay, we'll pull in a DSL line from a completely separate
           | ISP for the out-of-band access." (guess what else is in that
           | manhole/conduit?)
           | 
           | "Okay, we'll use LTE for out-of-band!" (oops, the backhaul
           | for the cell tower goes under the same bridge as the real
           | network)
           | 
           | True diversity is HARD (not unsolvable, just hard. especially
           | at scale)!
        
             | chasd00 wrote:
              | Heh, I toured a large data center here in Dallas and
             | listened to them brag about all the redundant connectivity
             | they had while standing next to the conduit where they all
             | entered the building. One person, a pair of wire cutters,
             | and 5 seconds and that whole datacenter is dark.
        
             | detaro wrote:
              | Although the difference here is that losing connection and
             | out-of-band for a single data center shouldn't be as
             | catastrophic for Facebook, so your examples would be
             | tolerable?
        
               | packetslave wrote:
               | That's the trick, though: if you don't do that level of
               | planning for _all_ of your datacenters and POPs (and
                | fiber huts out in the middle of nowhere), it's
               | inevitable that the one you most need to access during an
               | outage will be the one where your OOB got backhoe'd.
               | 
               | Murphy is a jerk.
        
           | tristor wrote:
           | Two is One, One is None. There are absolutely ways around
           | this, it's called redundancy. The marginal cost of laying an
           | extra pair during physical plant installation is basically
           | $0, which is why you'd never go "well we need a backup for
            | the backup, so there's no point in having two pairs".
           | Similarly, the marginal cost for having a second UPS and PDU
           | in a rack is effectively $0 at scale, so nobody would argue
           | this is unnecessary to deal with possible UPS failure or
           | accidentally unplugging a cable.
           | 
           | In this case, there are likely several things that can be
            | changed systemically to mitigate or prevent similar failures
           | in the future, and I have every faith that Facebook's SRE
           | team is capable of identifying and implementing those
           | changes. There is no such thing as "no way around it", unless
           | you're dealing with a law of physics.
        
             | vitus wrote:
             | By "no way around it" I mean you're going to need to create
             | a circular dependency at some point, whether it's a
             | maintenance network that's used to manage itself, or the
             | prod network for managing the maintenance network.
             | 
             | I absolutely agree that installing a maintenance network is
             | a good idea. One of the big challenges, though, is making
             | sure that all your tooling can and will run exclusively on
             | the maintenance network if needed.
             | 
             | (Also, while the marginal cost of laying an extra pair of
             | fiber during physical installation may be low, making sure
             | that you have fully independent failure domains is much
             | higher, whether that's leased fiber, power, etc.)
        
       | HenryKissinger wrote:
       | > During one of these routine maintenance jobs, a command was
       | issued with the intention to assess the availability of global
       | backbone capacity, which unintentionally took down all the
       | connections in our backbone network, effectively disconnecting
       | Facebook data centers globally
       | 
       | Imagine being this person.
       | 
       | Tomorrow on /r/tifu.
        
         | blowski wrote:
         | I doubt Facebook engineers are free-typing commands on Bash, so
         | it's probably not an individual error. More likely to be a race
         | condition or other edge case that wasn't considered during a
         | review. This might be a script that's run 1000s of times before
         | with no problems.
        
           | packetslave wrote:
           | Back in Ye Old Dark Ages, I caused a BIG Google outage by
           | running a routine maintenance script that had been run dozens
           | if not hundreds of times before.
           | 
           | Turns out the underlying network software had a race
           | condition that would ONLY be hit if the script ran at the
           | exact same time as some automated monitoring tools polled the
           | box.
           | 
           | At FAANG scale, "one in a million" happens a lot more often
           | than you'd think.
        
         | savant_penguin wrote:
         | Hahahahahahahaha
        
         | kabdib wrote:
         | "It was the intern" is only cute once.
        
         | Barrin92 wrote:
          | Of course, if one person can knock down an entire global
          | system through a trivial mistake, the problem is obviously
          | not the person to begin with, but the architecture of the
          | system.
        
           | flutas wrote:
           | Or the fact that there was a bug in the tool that should have
           | prevented this.
           | 
           | > Our systems are designed to audit commands like these to
           | prevent mistakes like this, but a bug in that audit tool
           | didn't properly stop the command.
        
             | bastardoperator wrote:
              | So this is really two people's fault. One for issuing the
             | command the other for introducing the bug in the audit
             | tool.
        
           | colechristensen wrote:
           | Not really, there's essentially always a button that blows
           | everything up. Catastrophic failures usually end up being a
           | large set of safety systems malfunctioning which would
           | otherwise prevent the issue when that button is pressed.
           | 
           | But yes, for these types of problem, the ultimate fault is
           | never "that guy Larry is an idiot", it takes a large team of
           | cooperating mistakes.
        
       | harshreality wrote:
       | > To ensure reliable operation, our DNS servers disable those BGP
       | advertisements if they themselves can not speak to our data
       | centers, since this is an indication of an unhealthy network
       | connection.
       | 
       | No, it's (clearly) not a guaranteed indication of that. Logic
       | fail. Infrastructure tools at that scale need to handle all
       | possible causes of test failures. "Is the internet down or only
       | the few sites I'm testing?" is a classic network monitoring
       | script issue.
        
         | jaywalk wrote:
         | I think you're misunderstanding. The DNS servers (at Facebook
         | peering points) had zero access to Facebook datacenters because
         | the backbone was down. That is as unhealthy as the network
         | connection can get, so they (correctly) stopped advertising the
         | routes to the outside world.
         | 
         | By that point, the Facebook backbone was already gone. The DNS
         | servers stopping BGP advertisements to the outside world did
         | not cause that.
        
           | harshreality wrote:
           | You're talking about _backend_ network connections to
            | Facebook's datacenters as if that's the only thing that
           | matters. I'm talking about _overall_ network connection
           | including the internet-facing part.
           | 
           | Facebook's infrastructure at their peering points loses all
           | contact with their respective facebook datacenter(s).
           | 
           | Their response is to automatically withdraw routes to
           | themselves. I suppose they assumed that all datacenters would
           | never go down at the same time, so that client dns redundancy
           | would lead to clients using other dns servers that could
           | still contact facebook datacenters. It's unclear how those
           | routes could be restored without on-site intervention. If
           | they automatically detect when the datacenters are reachable
           | again, that too requires on-site intervention since after
           | withdrawing routes FB's ops tools can't do anything to the
           | relevant peering points or datacenters.
           | 
           | But even without the catastrophic case of all datacenter
           | connections going down, you don't need to be a facebook ops
           | engineer to realize that there are problems that need to be
            | carefully thought through when ops tools depend on the same
           | (public) network routes and DNS entries that the DNS servers
           | are capable of autonomously withdrawing.
        
       | halotrope wrote:
        | It is completely logical but still kind of amazing that
        | Facebook plugged their globally distributed datacenters
        | together with physical wire.
        
         | packetslave wrote:
         | What do you imagine other companies use to connect their
         | datacenters?
        
           | meepmorp wrote:
           | Uh... the cloud?
        
             | zht wrote:
             | isn't that still over wires?
        
       | henrypan1 wrote:
       | Wow
        
       | nonbirithm wrote:
       | DNS seems to be a massive point of failure everywhere, even
       | taking out the tools to help deal with outages themselves. The
       | same thing happened to Azure multiple times in the past, causing
       | complete service outages. Surely there must be some way to better
       | mitigate DNS misconfiguration by now, given the exceptional
       | importance of DNS?
        
         | bryan_w wrote:
         | DNS was very much a proximate cause. In most cases you want
         | your anycast dns servers to shoot themselves in the head if
         | they detect their connection to origin to be interrupted. This
          | would have been a big outage anyway, just at a different
         | layer.
         | 
         | Oddly enough, one could consider that behavior something that
         | was put in place to "mitigate DNS misconfiguration"
        
         | cnst wrote:
         | But DNS didn't actually fail. Their design says DNS must go
         | offline if the rest of the network is offline. That's exactly
         | what DNS did.
         | 
         | Sounds like their design was wrong, but you can't just blame
         | DNS. DNS worked 100% here as per the task that it was given.
         | 
         | > To ensure reliable operation, our DNS servers disable those
         | BGP advertisements if they themselves can not speak to our data
         | centers, since this is an indication of an unhealthy network
         | connection.
        
           | jaywalk wrote:
           | I'm not sure the design was even wrong, since the DNS servers
           | being down didn't meaningfully contribute to the outage. The
           | entire Facebook backbone was gone, so even if the DNS servers
           | continued giving out cached responses clients wouldn't be
           | able to connect anyway.
        
             | Xylakant wrote:
             | DNS being down instead of returning an unreachable
             | destination did increase load for other DNS resolvers
             | though since empty results cannot be cached and clients
             | continued to retry. This made the outage affect others.
        
               | cnst wrote:
               | Source?
               | 
               | DNS errors are actually still cached; it's something that
               | has been debunked by DJB like a couple of decades ago,
               | give or take:
               | 
               | http://cr.yp.to/djbdns/third-party.html
               | 
               | > RFC 2182 claims that DNS failures are not cached; that
               | claim is false.
               | 
               | Here are some more recent details and the fuller
               | explanation:
               | 
               | https://serverfault.com/a/824873
               | 
               | Note that FB.com currently expires its records in 300
               | seconds, which is 5 minutes.
               | 
               | PowerDNS (used by ordns.he.net) caches servfail for 60s
               | by default -- packetcache-servfail-ttl -- which isn't
               | very far from the 5min that you get when things aren't
               | failing.
               | 
               | Personally, I do agree with DJB -- I think it's a better
               | user experience to get a DNS resolution error right away,
               | than having to wait many minutes for the TCP timeout to
               | occur when the host is down anyways.
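                | 
                | If you want to poke at the TTLs yourself, dnspython
                | makes it easy (assuming you pip install dnspython; the
                | 300s figure above may of course change):
                | 
                |   import dns.resolver  # pip install dnspython
                | 
                |   answer = dns.resolver.resolve("fb.com", "A")
                |   print("records:", [r.address for r in answer])
                |   print("TTL (s):", answer.rrset.ttl)  # ~300 as of this writing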
        
             | cnst wrote:
             | Exactly. And it would actually be worse, because the
             | clients would have to wait for a timeout, instead of simply
             | returning a name error right away.
        
               | mbauman wrote:
                | How would it have been worse? Waiting for a timeout is a
               | _good_ thing as it prevents a thundering herd of refresh-
               | smashing (both automated and manual).
               | 
               | I don't know BGP well, but it seems easier for peers to
               | just drop FB's packets on the floor than deal with a DNS
               | stampede.
        
               | cnst wrote:
               | An average webpage today is several megabytes in size.
               | 
               | How would a few bytes over a couple of UDP packets for
               | DNS have any meaningful impact on anyone's network? If
               | anything, things fail faster, so, there's less data to
               | transmit.
               | 
               | For example, I often use ordns.he.net as an open
               | recursive resolver. They use PowerDNS as their software.
               | PowerDNS has the default of packetcache-servfail-ttl of
               | 60s. OTOH, fb.com A response currently has a TTL of 300s
               | -- 5 minutes. So, basically, FB's DNS is cached for
               | roughly the same time whether or not they're actually
               | online.
        
               | mbauman wrote:
               | The rest of the internet sucked yesterday, and my
               | understanding was it was due to a thundering herd of
               | recursive DNS requests. Slowing down clients seems like a
               | good thing.
        
         | adrianmonk wrote:
         | > _DNS seems to be a massive point of failure everywhere_
         | 
         | Emphasis on the "seems". DNS gets blamed a lot because it's the
         | very first step in the process of connecting. When everything
         | is down, you will see DNS errors.
         | 
         | And since you can't get past the DNS step, you never see the
         | other errors that you would get if you could try later steps.
         | If you knew the web server's IP address to try to make a TCP
         | connection to it, you'd get connection timed out errors. But
         | you don't see those errors because you didn't get to the point
         | where you got an IP address to connect to.
         | 
         | It's like if you go to a friend's house but their electricity
         | is out. You ring the doorbell and nothing happens. Your first
         | thought is that the doorbell is messed up. And you're not
         | wrong: it is, but so is everything else. If you could ring it
         | and get their attention to let you inside in their house, you'd
         | see that their lights don't turn on, their TV doesn't turn on,
         | their refrigerator isn't running, etc. But those things are
         | hidden to you because you're stuck on the front porch.
        
         | dbbk wrote:
         | Seems like the simplest solution would be to just move recovery
         | tooling to their own domain / DNS?
        
       | tantalor wrote:
       | So it wasn't a config change, it was a command-of-death.
        
       | rblion wrote:
       | Was this a way to delete a lot of evidence before shit really hit
        | the fan?
       | 
       | After reading this, I can't help but feel this was a calculated
       | move.
       | 
       | It gives FB a chance to hijack media attention from the
        | whistleblower. It gives them a chance to show the average person,
       | 'hey, we make mistakes and we have a review process to improve
       | our systems'.
       | 
       | The timing is too perfect if you ask me.
        
         | FridayoLeary wrote:
         | Inviting further whistleblowing.
        
         | joemi wrote:
         | I'm not usually that cynical, but the timing of it combined
         | with facebook's lengthy abusive relationship with customers'
         | privacy (and what kind of company morals that implies) makes me
         | think that it's definitely a possibility.
        
           | rblion wrote:
           | The testimony in front of congress and the reaction is just
           | making me feel even more like this was a calculated move or
           | internal sabotage.
        
       | Hokusai wrote:
       | > One of the jobs performed by our smaller facilities is to
       | respond to DNS queries. DNS is the address book of the internet,
       | enabling the simple web names we type into browsers to be
       | translated into specific server IP addresses.
       | 
       | What is the target audience of this post? It is too technical for
       | non-technical people, but also it is dumbed down to try to
        | include people who do not know how the internet works. I feel
       | like I'm missing something.
        
         | Florin_Andrei wrote:
         | > _Those data centers come in different forms._
         | 
         | It's like the birds and the bees.
        
         | gist wrote:
         | > What is the target audience of this post?
         | 
         | Separate point to your question.
         | 
         | FB is under no obligation to provide more details than they
         | need to because a small segment of the population (certainly
         | relative to their 'customers') might find it interesting or
         | helpful or entertaining. FB is a business. They can essentially
         | do (and should be able to) do whatever they want. There is no
         | requirement (and there should be no requirement) to provide the
          | general public with more info than they want to, subject to any
         | legal requirement. If the government wants more (and are
         | entitled to more info) they can ask for it and FB can decide if
         | they are required to comply.
         | 
         | FB is a business. Their customers are not the tech community
         | looking to get educated and avoid issues themselves at their
         | companies or (as mentioned) be entertained. And their customers
         | (advertisers or users) can decide if they want to continue to
         | patronize FB.
         | 
         | I always love on HN seeing 'hey where is the post mortem' as if
         | it's some type of defacto requirement to air dirty laundry to
         | others.
         | 
          | If I go to the store and there are no paper towels there, I
         | don't need to know why there are no towels and what the company
         | will do going forward to prevent any errors that caused the
         | lack of that product. I can decide to buy another brand or
         | simply take steps to not have it be an issue.
        
           | jablan wrote:
           | > If I go to the store and there is not paper towels there I
           | don't need to know why there are no towels
           | 
           | You don't _need_ to know, but it's human to want to know, and
           | it's also human to want to satisfy other human's curiosity,
           | especially if it doesn't bring any harm to you.
           | 
            | Also, your post is not really answering any of the GP's
            | questions. I presume you wanted to say that FB doesn't
            | _owe_ us any explanation, but since FB already provided
            | one, the GP asked whom it is addressed to.
        
           | Hokusai wrote:
           | The air industry has this solved, it's mandatory to report
           | certain kind of incidents to avoid them in the future and
            | inform the aviation community.
            | https://www.skybrary.aero/index.php/Mandatory_Occurrence_Rep...
           | 
           | That the main form of personal communication for hundreds of
           | millions of users is down and there is no mandatory reporting
           | is irresponsible. That Facebook is a business does not mean
           | that they do not have responsibilities towards society.
           | 
           | Facebook is not your local supermarket, it has global impact.
        
             | tehjoker wrote:
             | One would imagine a large local supermarket going down
             | would owe the people it serves some explanation. That's
             | where their food comes from.
             | 
             | At this point, I am completely sick of the pro-corporate
             | rhetoric to let businesses do whatever they want. They
             | exist to serve the public and they should be treated as
             | such.
        
         | Gulfick wrote:
         | Poor targeting choices.
        
         | whymauri wrote:
         | Huh? I would hardly describe this as technical. Someone with a
         | high school education can read it and get the gist. It's
          | actually somewhat impressive how it walks the line between
         | accessibility and 'just detailed enough'.
        
         | bovermyer wrote:
         | I'm guessing it has multiple target audiences. Those that won't
         | understand some of the technical jargon (e.g., "IP addresses")
         | will still be able to follow the general flow of the article.
         | 
         | Those of us who are familiar with the domain of knowledge, on
         | the other hand, get a decent summary of events.
         | 
         | It's a balancing act. I think the article does a good enough
         | job of explaining things.
        
         | [deleted]
        
         | thirtyseven wrote:
         | With an outage this big, even a post for a technical audience
         | will get read by non-technical people (including journalists),
         | so I'm sure it helps to include details like this.
        
           | submain wrote:
           | The media: "Facebook engineer typed command. This is what
           | happened next."
        
         | riffic wrote:
         | I'm reading your comment as a form of "feigning surprise", in
         | other words a statement along the lines of "I can't believe
         | target audience doesn't know about x concept".
         | 
         | more on the concept: https://noidea.dog/blog/admitting-
         | ignorance
        
         | simooooo wrote:
         | Stupid journalists
        
         | shadofx wrote:
         | Teenagers who are responsible for managing the family router?
        
         | eli wrote:
         | The media. This was a huge international story.
        
         | blowski wrote:
          | Both those groups of people. I imagine they would be accused
          | of it being either too complicated or too dumbed down, so
         | they do both in the same article.
        
         | 88913527 wrote:
         | There are plenty of technical people (or people employed in
         | technical roles) who don't understand how DNS works. For
         | example, I field questions on why "hostname X only works when
         | I'm on VPN" at work.
        
       | byron22 wrote:
       | What kind of bgp command would do that?
        
       | TedShiller wrote:
       | During the outage, FB briefly made the world a better place
        
       | bilater wrote:
       | I want to know what happened to the poor engineer who issued the
       | command?
        
         | pjscott wrote:
         | The blog post is putting the blame on a bug in the tooling
         | which _should_ have made the command impossible to issue, which
         | is exactly where the blame ought to go.
        
           | bilater wrote:
           | Still I'd hate to be the first 'Why' of a multi-billion
           | dollar outage :D
        
         | chovybizzass wrote:
         | or the manager who told him to "do it anyway" when he raised
         | concern.
        
         | antocv wrote:
         | He will be promoted to street dweller while his managers will
         | fail up.
        
           | runawaybottle wrote:
           | Actually he was promoted to C-Level, CII, Chief Imperial
           | Intern 'For Life'.
           | 
            | It's a great accomplishment, to be fair; it comes with a
           | lifetime weekly stipend and access to whatever Frontend
           | books/courses you need to be a great web developer.
           | 
           | Will never touch ops again.
        
       | cnst wrote:
       | Note that contrary to popular reports, DNS was NOT to blame for
        | this outage -- for once DNS worked exactly as per the spec,
       | design and configuration:
       | 
       | > To ensure reliable operation, our DNS servers disable those BGP
       | advertisements if they themselves can not speak to our data
       | centers, since this is an indication of an unhealthy network
       | connection.
        
       | jimmyvalmer wrote:
       | tldr; a maintenance query was issued that inexplicably severed
       | FB's data centers from the internet, which unnecessarily caused
       | their DNS servers to mark themselves defunct, which made it all
       | but impossible for their guys to repair the problem from HQ,
       | which compelled them to physically dispatch field units whose
       | progress was stymied by recent increased physical security
       | measures.
        
         | antocv wrote:
         | > caused their DNS servers to mark themselves defunct
         | 
         | This is awkward for me too, why should a DNS server withdraw
         | BGP routes? Design fail.
        
           | yuliyp wrote:
           | It's a trade-off.
           | 
           | Imagine you have some DNS servers at a POP. They're connected
           | to a peering router there which is connected to a bunch of
           | ISPs. The POP is connected via a couple independent fiber
           | links to the rest of your network. What happens if both of
           | those links fail?
           | 
           | Ideally the rest of your service can detect that this POP is
           | disconnected, and adjust DNS configuration to point users
           | toward POPs which are not disconnected. But you still have
           | that DNS server which can't see that config change (since
           | it's disconnected from the rest of your network) but is still
           | reachable from a bunch of local ISPs. That DNS server will
           | continue to direct traffic to the POP which can't handle it.
           | 
           | What if that DNS server were to mark itself unavailable? In
           | that case, DNS traffic from ISPs near that POP would instead
           | find another DNS server from a different POP, and get a
           | response which pointed toward some working POP instead. How
           | would the DNS server mark itself unavailable? One way is to
           | see if it stopped being able to communicate with the source
           | of truth.
           | 
           | Yesterday all of the DNS servers stopped being able to
           | communicate with the source of truth, so marked themselves
           | offline. This code assumes a network partition, so can't
           | really rely on consensus to decide what to do.
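           | 
           | A minimal sketch of that health-check idea (the names,
           | endpoint and threshold are hypothetical, not Facebook's
           | actual tooling): each POP-local DNS node keeps probing the
           | internal source of truth and withdraws its anycast BGP
           | announcement when the probe fails, so resolvers fail over
           | to a healthier POP.
           | 
           |   import socket
           |   import time
           | 
           |   ANYCAST_PREFIX = "192.0.2.0/24"  # placeholder prefix
           |   # hypothetical internal endpoint reached over the backbone
           |   SOURCE_OF_TRUTH = ("config.internal.example", 443)
           |   CHECK_INTERVAL = 10
           | 
           |   def backbone_reachable(timeout=2.0):
           |       # A real check would probe several endpoints and
           |       # require a quorum of failures before reacting.
           |       try:
           |           with socket.create_connection(SOURCE_OF_TRUTH,
           |                                         timeout=timeout):
           |               return True
           |       except OSError:
           |           return False
           | 
           |   def set_advertised(prefix, up):
           |       # Stub: the real mechanism (BGP speaker API or
           |       # routing daemon CLI) is deployment-specific.
           |       print(("announce " if up else "withdraw ") + prefix)
           | 
           |   def health_loop():
           |       advertised = True
           |       while True:
           |           healthy = backbone_reachable()
           |           if healthy != advertised:
           |               # Yesterday's failure mode: every POP lost
           |               # the backbone at once, so they all withdrew
           |               # together and the DNS addresses vanished,
           |               # even though the servers still worked.
           |               set_advertised(ANYCAST_PREFIX, healthy)
           |               advertised = healthy
           |           time.sleep(CHECK_INTERVAL)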
        
           | colechristensen wrote:
           | Designed to handle individual POPs/DCs going down
        
           | miyuru wrote:
           | Most of the large DNS services are anycast via BGP (all POPs
           | announce the same IP prefix). It makes sense to stop the BGP
           | announcement if the POP is unhealthy; traffic will flow to
           | the next healthy POP.
           | 
           | In this case, if the DNS service in a POP is unhealthy, the
           | IP addresses belonging to the DNS service are withdrawn from
           | that POP.
        
           | chronid wrote:
           | Note those are anycast addresses, my guess is the DNS server
           | gives out addresses for FB names pointing your traffic to the
           | POP the DNS server is part of.
           | 
           | If the POP is not able to connect to the rest of Facebook's
           | network, the POP stops announcing itself as available and
           | that DNS and part of the network goes away so your traffic
           | can go somewhere else.
        
       | jbschirtzs wrote:
       | "The Devil's Backbone"
        
       | cube00 wrote:
       | _> We've done extensive work hardening our systems to prevent
       | unauthorized access, and it was interesting to see how that
       | hardening slowed us down as we tried to recover from an outage
       | caused not by malicious activity, but an error of our own making.
       | I believe a tradeoff like this is worth it -- greatly increased
       | day-to-day security vs. a slower recovery from a hopefully rare
       | event like this._
       | 
       | If you correctly design your security with appropriate fall backs
       | you don't need to make this trade off.
       | 
       | If that story about the Facebook campus having no physical
       | keyholes on doors is true, it just speaks to an arrogance of
       | assuming things can never fail, so we don't even need to bother
       | planning for it.
        
         | joshuamorton wrote:
         | Can you elaborate on this? There are always going to be
         | security/reliability tradeoffs. Things that fail closed for
         | security reasons will cause slower incident responses. That's
         | unavoidable. Innovation can improve the frontier, but there
         | will always be tradeoffs.
        
           | cube00 wrote:
           | Slower sure, but not five hour slow.
        
       | secguyperson wrote:
       | > a command was issued with the intention to assess the
       | availability of global backbone capacity, which unintentionally
       | took down all the connections in our backbone network,
       | effectively disconnecting Facebook data centers globally
       | 
       | From a security perspective, I'm blown away that a single person
       | apparently had the technical permissions to do such a thing. I
       | can't think of any valid reason that a single person would have
       | the ability to disconnect every single data center globally. The
       | fact that such functionality exists seems like a massive foot-
       | gun.
       | 
       | At a minimum I would expect multiple layers of approval, or
       | perhaps regionalized permissions, so that even if this person did
       | run an incorrect command, the system turns around and says "ok
       | we'll shut down the US data centers but you're not allowed to
       | issue this command for the EU data centers, so those stay up".
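       | 
       | A rough sketch of that regionalized-permission idea (the region
       | names, roles, and Operator structure are all hypothetical, not
       | how Facebook's tooling actually works): execute only the part
       | of a request that falls inside the operator's grant and refuse
       | the rest, so a bad command is capped to a few regions.
       | 
       |   from dataclasses import dataclass
       | 
       |   ALL_REGIONS = frozenset({"us-east", "us-west", "eu", "apac"})
       | 
       |   @dataclass(frozen=True)
       |   class Operator:
       |       name: str
       |       allowed_regions: frozenset
       | 
       |   def authorize(op, requested):
       |       # Act only where the operator has a grant; refuse the
       |       # rest instead of silently executing it.
       |       granted = requested & op.allowed_regions
       |       denied = requested - op.allowed_regions
       |       if denied:
       |           print("refused for", op.name, ":", sorted(denied))
       |       return granted
       | 
       |   alice = Operator("alice", frozenset({"us-east", "us-west"}))
       |   # A "global" request only touches the regions alice may act
       |   # on; EU and APAC stay up even if the command is wrong.
       |   print("acting on:", sorted(authorize(alice, set(ALL_REGIONS))))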
        
       | chomp wrote:
       | So someone ran "clear mpls lsp" instead of "show mpls lsp"?
        
         | slim wrote:
         | My guess : he executed the command from shell history. Commands
         | look similar enough and he hit Enter too quickly
        
         | bryan_w wrote:
         | Seems like it. It's kinda like typing hostname and accidentally
         | poking your yubikey (Not that I've done that...) or the date
         | command that both let's you set the date and format the date
        
         | divbzero wrote:
         | For context, parent comment is trying to decipher this heavily-
         | PR-reviewed paragraph:
         | 
         | > _During one of these routine maintenance jobs, a command was
         | issued with the intention to assess the availability of global
         | backbone capacity, which unintentionally took down all the
         | connections in our backbone network, effectively disconnecting
         | Facebook data centers globally. Our systems are designed to
         | audit commands like these to prevent mistakes like this, but a
         | bug in that audit tool didn't properly stop the command._
        
       | i_like_apis wrote:
       | the total loss of DNS broke many of the internal tools we'd
       | normally use to investigate and resolve outages like this.
       | this took time, because these facilities are designed with high
       | levels of physical and system security in mind. They're hard to
       | get into, and once you're inside, the hardware and routers are
       | designed to be difficult to modify even when you have physical
       | access to them.
       | 
       | Sounds like it was the perfect storm.
        
       | codebolt wrote:
       | Apparently they had to bring in the angle grinder to get access
       | to the server room.
       | 
       | https://twitter.com/cullend/status/1445156376934862848?t=P5u...
        
         | jd3 wrote:
         | Was this ever confirmed? NYT tech reporter Mike Isaac issued a
         | correction to his previous reporting about it.
         | 
         | > the team dispatched to the Facebook site had issues getting
         | in because of physical security but did not need to use a saw/
         | grinder.
         | 
         | https://twitter.com/MikeIsaac/status/1445196576956162050
        
           | tpmx wrote:
           | > so.....they had to use a jackhammer, got it
        
             | 14 wrote:
             | This. We know they lie, so we have to assume the worst
             | when they say something. The correction only mentions the
             | saw/angle grinder; "no tools were needed" would have been
             | clearer. However, I'm not sure why it matters whether they
             | used a key or a grinder.
        
         | packetslave wrote:
         | that didn't happen (NY Times corrected their story)
        
           | [deleted]
        
         | mzs wrote:
         | or not
         | 
         | https://twitter.com/MikeIsaac/status/1445196576956162050
         | 
         | https://twitter.com/cullend/status/1445212476652535815
        
       | nabakin wrote:
       | Yesterday's post from Facebook
       | 
       | https://engineering.fb.com/2021/10/04/networking-traffic/out...
        
       | ghostoftiber wrote:
       | Incidentally the facebook app itself really handled this
       | gracefully. When the app can't connect to facebook, it displays
       | "updates" from a pool of cached content. It looks and feels like
       | facebook is there, but we know it's not. I didn't notice this
       | until the outage and I thought it was neat.
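       | 
       | A small sketch of that fallback pattern (the endpoint, cache
       | path and feed format are made up, not the actual Facebook app):
       | try the network first, and fall back to the last successfully
       | fetched feed when the request fails.
       | 
       |   import json
       |   import urllib.error
       |   import urllib.request
       |   from pathlib import Path
       | 
       |   CACHE = Path("feed_cache.json")
       |   FEED_URL = "https://example.com/feed"  # hypothetical
       | 
       |   def fetch_feed(timeout=3.0):
       |       try:
       |           with urllib.request.urlopen(FEED_URL,
       |                                       timeout=timeout) as r:
       |               feed = json.load(r)
       |           CACHE.write_text(json.dumps(feed))  # refresh cache
       |           return feed
       |       except (urllib.error.URLError, TimeoutError):
       |           # Offline: serve stale content. Ideally the UI also
       |           # flags it as cached, which is the complaint about
       |           # WhatsApp in a comment below.
       |           if CACHE.exists():
       |               return json.loads(CACHE.read_text())
       |           return []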
        
         | divbzero wrote:
         | I'm curious if the app handled posts or likes gracefully too.
         | Did it accept and cache the updates until it could reconnect to
         | Facebook servers?
        
         | [deleted]
        
         | colanderman wrote:
         | WhatsApp failed similarly, but I thought it was a poor design
         | decision to do so. Anyone waiting on communication through
         | WhatsApp had no indication (outside the media) that it was
         | unavailable, and that they should find another communication
         | channel.
         | 
         | Don't paper over connectivity failures. It disempowers users.
        
       | mumblemumble wrote:
       | > Our systems are designed to audit commands like these to
       | prevent mistakes like this, but a bug in that audit tool didn't
       | properly stop the command.
       | 
       | I'm so glad to see that they framed this in terms of a bug in a
       | tool designed to prevent human error, rather than simply blaming
       | it on human error.
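       | 
       | For illustration, a toy version of such a pre-execution audit
       | (the real tool and its bug are not public; the threshold and
       | impact model here are invented): estimate what a command would
       | take down before running it, and refuse anything that would
       | disconnect more than a small fraction of backbone links.
       | 
       |   MAX_FRACTION = 0.05  # arbitrary safety threshold
       | 
       |   def audit(links_affected, total_links):
       |       # Return True only if the command may proceed.
       |       if total_links <= 0:
       |           return False  # fail closed on bad input
       |       fraction = links_affected / total_links
       |       if fraction > MAX_FRACTION:
       |           print("BLOCKED: would disable "
       |                 f"{fraction:.0%} of backbone links")
       |           return False
       |       return True
       | 
       |   # A capacity-assessment command should touch nothing; a bug
       |   # that miscounts links_affected, or a code path that skips
       |   # this check entirely, is the kind of failure described.
       |   assert audit(0, 1200) is True
       |   assert audit(1200, 1200) is False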
        
         | nuerow wrote:
         | > I'm so glad to see that they framed this in terms of a bug in
         | a tool designed to prevent human error, rather than simply
         | blaming it on human error.
         | 
         | Wouldn't human error reflect extremely poorly on the company
         | though? I mean, for human error to be the root cause of this
         | mega-outage, that would imply that the company's infrastructure
         | and operational and security practices were so ineffective that
         | a single person screwing up could inadvertently bring the whole
         | company down.
         | 
         | A freak accident that requires all stars to be aligned to even
         | be possible, on the other hand, does not cause a lot of
         | concerns.
        
         | cecilpl2 wrote:
         | Human error is a cop-out excuse anyway, since it's not
         | something you can fix going forward. Humans err, and if a
         | mistake was made once it could easily be made again.
        
         | tobr wrote:
         | The buggy audit tool was probably made by a human too, though.
        
           | jhrmnn wrote:
           | But reviewed by other humans. At some count, a collective
           | human error becomes a system error.
        
           | cecilpl2 wrote:
           | Of course the system was built by humans, but we are
           | discussing the _proximate_ cause of the outage.
        
       | vitus wrote:
       | Google had a comparable outage several years ago.
       | 
       | https://status.cloud.google.com/incident/cloud-networking/19...
       | 
       | This event left a lot of scar tissue across all of Technical
       | Infrastructure, and the next few months were not a fun time (e.g.
       | a mandatory training where leadership read out emails from
       | customers telling us how we let them down and lost their trust).
       | 
       | I'd be curious to see what systemic changes happen at FB as a
       | result, if any.
        
         | AdamHominem wrote:
         | > This event left a lot of scar tissue across all of Technical
         | Infrastructure, and the next few months were not a fun time
         | (e.g. a mandatory training where leadership read out emails
         | from customers telling us how we let them down and lost their
         | trust).
         | 
         | Bullshit.
         | 
         | I'd believe this if it was not completely impossible for
         | 99.999999% of google "customers" to contact anyone at the
         | company. Or for the decade and a half of personal and
         | professional observations of people getting fucked over by
         | google and having absolutely nobody they could contact to try
         | and resolve the situation.
         | 
         | You googlers can't even deign to talk to other workers at the
         | company who are in a caste lower than you.
         | 
         | The fundamental problem googlers have is that they all think
         | they're so smart/good at what they do, it just doesn't seem to
         | occur to them that they could have _possibly_ screwed
         | something up, or something could go wrong or break, or someone
         | might need help in a way your help page authors didn't
         | anticipate... and people
         | might need to get ahold of an actual human to say "shit's
         | broke, yo." Or worse, none of you give a shit. The company
         | certainly doesn't. When you've got near monopoly and have your
         | fingers in every single aspect of the internet, you don't need
         | to care about fucking your customers over.
         | 
         | I cannot legitimately name a single google product that I, or
         | anyone I know, _likes_ or _wants to use_. We just don't have a
         | choice because of your market dominance.
        
           | ro_bit wrote:
           | > You googlers can't even disdain yourselves to talk to other
           | workers at the company who are in a caste lower than you.
           | 
           | We must know different googlers then. It's good to avoid
           | painting a group with the same brush
        
           | UncleMeat wrote:
           | Hi there. I'm a Googler and I've directly interfaced with a
           | nontrivial number of customers such that _I alone_ have
           | interfaced with more than 0.000001% of the entire world
           | population.
        
             | [deleted]
        
             | lvs wrote:
             | All you need to do is browse any online forum, bug tracker,
             | subreddit dedicated to a consumer-facing Google product to
             | know that Google does not give a rat's ass about customer
             | service. We know the customer is ultimately not the
             | consumer.
        
               | [deleted]
        
           | colechristensen wrote:
           | Maps, Mail, Drive, Scholar, and Search are all the best or
           | near the best available. That doesn't mean I like every one
           | of them or I wouldn't prefer others, but as far as I can tell
           | the competition doesn't exist that works better.
           | 
           | GCP and Pixel phones are a toss-up between them and
           | competitors.
           | 
           | It isn't market dominance, nobody has made anything better.
        
         | silicon2401 wrote:
         | > leadership read out emails from customers telling us how we
         | let them down and lost their trust).
         | 
         | That's amazing. I would never have expected my feedback to a
         | company to actually be read, let alone taken seriously.
         | Hopefully more companies do this than I thought.
        
           | spaetzleesser wrote:
           | In my experience this is done more to make leadership feel
           | better and to deflect blame from themselves.
        
           | variant wrote:
           | I just read your comment out loud if it helps.
        
         | hoppyhoppy2 wrote:
         | Facebook also had a nearly 24-hour outage in 2019.
         | https://www.nytimes.com/2019/03/14/technology/facebook-whats...
         | (or http://archive.today/O7ycB )
        
         | cratermoon wrote:
         | > I'd be curious to see what systemic changes happen at FB as a
         | result, if any.
         | 
         | If history is any guide, Facebook will decide some division
         | charged with preventing problems was an ineffective waste of
         | money, shut it down, and fire a bunch of people.
        
         | vitus wrote:
         | To expand on why this made me think of the Google outage:
         | 
         | It was a global backbone isolation, caused by configuration
         | changes (as they all are...). It was detected fairly early on,
         | but recovery was difficult because internal tools / debugging
         | workflows were also impacted, and even after the problem was
         | identified, it still took time to back out the change.
         | 
         | "But wait, a global backbone isolation? Google wasn't totally
         | down," you might say. That's because Google has two (primary)
         | backbones (B2 and B4), and only B4 was isolated, so traffic
         | spilled over onto B2 (which has much less capacity), causing
         | heavy congestion.
        
           | tantalor wrote:
           | But the FB outage was _not_ a configuration change.
           | 
           | > a command was issued with the intention to assess the
           | availability of global backbone capacity, which
           | unintentionally took down all the connections in our backbone
           | network
        
             | narrator wrote:
             | This is kind of like Chernobyl where they were testing to
             | see how hot they could run the reactor to see how much
             | power it could generate. Then things went sideways.
        
               | formerly_proven wrote:
               | The Chernobyl test was not a test to drive the reactor to
               | the limits, but actually a test to verify that the
               | inertia of the main turbines is big enough to drive the
               | coolant pumps for X amount of time in the case of grid
               | failure.
        
               | trevorishere wrote:
               | Of possible interest:
               | 
               | https://www.youtube.com/watch?v=Ijst4g5KFN0
               | 
               | This is a presentation to students by an MIT professor
               | that goes over exactly what happened, the sequence of
               | events, mistakes made, and so on.
        
               | formerly_proven wrote:
               | Warning for others: I watched the above video and then
               | watched the entire course (>30 hours).
        
               | q845712 wrote:
               | At first i thought it was inappropriate hyperbole to
               | compare Facebook to Chernobyl, but then i realized that i
               | think Facebook (along with twitter and other "web 2.0"
               | graduates) has spread toxic waste across far larger of an
               | area than Chernobyl. But I would still say that it's not
               | the _outage_ which is comparable to Chernobyl, but the
               | steady-state operations.
        
               | [deleted]
        
               | fabian2k wrote:
               | As already said the test was about something entirely
               | different. And the dangerous part was not the test
               | itself, but the way they delayed the test and then
               | continued to perform it despite the reactor being in a
               | problematic state and the night shift being on duty, who
               | were not trained on this test. The main problem was that
               | they ran the reactor at reduced power long enough to have
               | significant xenon poisoning, and then put the reactor at
               | the brink when they tried to actually run the test under
               | these unsafe conditions.
        
               | willcipriano wrote:
               | I'd say the failure at Chernobyl was that anyone who
               | asked questions got sent to a labor camp and the people
               | making the decisions really had no clue about the work
               | being done. Everything else just stems from that. The
               | safest reactor in the world would blow up under the same
               | leadership.
        
             | vitus wrote:
             | From yesterday's post:
             | 
             | "Our engineering teams have learned that _configuration
             | changes_ on the backbone routers that coordinate network
             | traffic between our data centers caused issues that
             | interrupted this communication.
             | 
             | ...
             | 
             | Our services are now back online and we're actively working
             | to fully return them to regular operations. We want to make
             | clear that there was no malicious activity behind this
             | outage -- its root cause was a faulty _configuration
             | change_ on our end."
             | 
             | Ultimately, that faulty command changed router
             | configuration globally.
             | 
             | The Google outage was triggered by a configuration change
             | due to an automation system gone rogue. But hey, it too was
             | triggered by a human issuing a command at some point.
        
               | quietbritishjim wrote:
               | I'm inclined to believe the later post as they've had
               | more time to assess the details. I think the point of the
               | earlier post is really to say "we weren't hacked!" but
               | they didn't want to use exactly that language.
        
           | jeffbee wrote:
           | Google also had a runaway automation outage where a process
           | went around the world "selling" all the frontend machines
           | back to the global resource pool. Nobody was alerted until
           | something like 95% of global frontends had disappeared.
           | 
           | This was an important lesson for SREs inside and outside
           | Google because it shows the dangers of the antipattern of
           | command line flags that narrow the scope of an operation
           | instead of expanding it. I.e. if your command was supposed to
           | be `drain -cell xx` to locally turn-down a small resource
           | pool but `drain` without any arguments drains the whole
           | universe, you have developed a tool which is too dangerous to
           | exist.
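           | 
           | A sketch of the safer pattern (the `drain` tool and its
           | flags come from the hypothetical example above, not a real
           | CLI): the destructive command requires an explicit scope,
           | and "everything" must be spelled out deliberately rather
           | than being the default when no argument is given.
           | 
           |   import argparse
           |   import sys
           | 
           |   def main(argv):
           |       p = argparse.ArgumentParser(prog="drain")
           |       g = p.add_mutually_exclusive_group(required=True)
           |       g.add_argument("--cell",
           |                      help="drain one named cell")
           |       g.add_argument("--all-cells-yes-really",
           |                      action="store_true",
           |                      help="drain every cell (verbose on"
           |                           " purpose)")
           |       args = p.parse_args(argv)
           |       if args.all_cells_yes_really:
           |           print("draining ALL cells (global action)")
           |       else:
           |           print("draining cell", args.cell)
           |       return 0
           | 
           |   if __name__ == "__main__":
           |       # `drain` with no arguments now exits with a usage
           |       # error instead of draining the whole universe.
           |       sys.exit(main(sys.argv[1:]))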
        
             | nathanyz wrote:
             | I feel like this explains so much about why the gcloud
             | command works the way it does. Sometimes feels overly
             | complicated for minor things, but given this logic, I get
             | it.
        
             | vitus wrote:
             | Agreed, but with an amendment:
             | 
             | If your tool is capable of draining the whole universe,
             | _period_, it is too dangerous to exist.
             | 
             | That was one of the big takeaways: global config changes
             | must happen slowly. (Whether we've fully internalized that
             | lesson is a different matter.)
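             | 
             | A sketch of what "slowly" can look like (wave sizes, bake
             | time and the health check are invented, not any particular
             | company's rollout system): apply the change in expanding
             | waves, wait, and halt if health regresses.
             | 
             |   import time
             | 
             |   WAVES = [0.01, 0.05, 0.25, 1.00]  # fraction of sites
             |   BAKE_SECONDS = 600  # bake time between waves
             | 
             |   def apply_to(sites):
             |       print("applying change to", len(sites), "sites")
             | 
             |   def healthy(sites):
             |       # Stub: a real check would watch error rates,
             |       # route counts, traffic levels, and so on.
             |       return True
             | 
             |   def rollout(all_sites):
             |       done = 0
             |       for frac in WAVES:
             |           upto = int(len(all_sites) * frac)
             |           wave = all_sites[done:upto]
             |           if not wave:
             |               continue
             |           apply_to(wave)
             |           time.sleep(BAKE_SECONDS)
             |           if not healthy(wave):
             |               print("regression found; halt and revert")
             |               return False
             |           done = upto
             |       return True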
        
               | ethbr0 wrote:
               | As FB opines at the end, at some point, it's a trade-off
               | between power (being able to access / do everything quickly)
               | and safety (having speed bumps that slow larger
               | operations down).
               | 
               | The pure takeaway is probably that it's important to
               | design systems where "large" operations are rarely
               | required, and frequent ops actions are all "small."
               | 
               | Because otherwise, you're asking for an impossible
               | process (quick and protected).
        
               | bbarnett wrote:
               | _If your tool is capable of draining the whole universe_
               | 
               | Why did I think of humans, when I read this. :P
        
           | ratsmack wrote:
           | >internal tools / debugging workflows were also impacted
           | 
           | That's something that should never happen.
        
         | agwa wrote:
         | > a mandatory training where leadership read out emails from
         | customers telling us how we let them down and lost their trust
         | 
         | Is that normal at Google? Making people feel bad for an outage
         | doesn't seem consistent with the "blameless postmortem" culture
         | promoted in the SRE book[1].
         | 
         | [1] https://sre.google/sre-book/postmortem-culture/
        
           | UncleMeat wrote:
           | "Blameless Postmortem" does not mean "No Consequences", even
           | if people often want to interpret it that way. If an
           | organization determines that a disconnect between ground work
           | and a customer's experience is a contributing factor to poor
           | decision making then they might conclude that making
           | engineers more emotionally invested in their customers could
           | be a viable path forward.
        
             | bob1029 wrote:
             | Relentless customer service is _never_ going to screw you
             | over in my experience... It pains me that we have to
             | constantly play these games of abstraction between engineer
             | and customer. You are presumably working a _job_ which
             | involves some business and some customer. It is not a
             | fucking daycare. If any of my customers are pissed about
             | their experience, I want to be on the phone with them as
             | soon as humanly possible and I want to hear it myself. Yes,
             | it is a dreadful experience to get bitched at, but it also
             | sharpens your focus like you wouldn't believe when you
             | can't just throw a problem to the guy behind you.
             | 
             | By all means, put the support/enhancement requests through
             | a separate channel+buffer so everyone can actually get work
             | done during the day. But, at no point should an engineer
             | ever be allowed to feel like they don't have to answer to
             | some customer. If you are terrified a junior dev is going
             | to say a naughty phrase to a VIP, then invent an internal
             | customer for them to answer to, and diligently proxy the
             | end customer's sentiment for the engineer's benefit.
        
               | ethbr0 wrote:
               | I think of this in terms of empathy: every engineer
               | should be able to provide a quick and accurate answer to
               | "What do our customers want? And how do they use our
               | product?"
               | 
               | I'm not talking esoterica, but at least a first
               | approximation.
        
             | agwa wrote:
             | From the SRE book: "For a postmortem to be truly blameless,
             | it must focus on identifying the contributing causes of the
             | incident without indicting any individual or team for bad
             | or inappropriate behavior. A blamelessly written postmortem
             | assumes that everyone involved in an incident had good
             | intentions and did the right thing with the information
             | they had. If a culture of finger pointing and shaming
             | individuals or teams for doing the 'wrong' thing prevails,
             | people will not bring issues to light for fear of
             | punishment."
             | 
             | If it's really the case that engineers are lacking
             | information about the impact that outages have on customers
             | (which seems rather unlikely), then leadership needs to
             | find a way to provide them with that information without
             | reading customer emails about how the engineers "let them
             | down", which is blameful.
             | 
             | Furthermore, making engineers "emotionally invested"
             | doesn't provide concrete guidance on how to make better
             | decisions in the future. A blameless postmortem does, but
             | you're less likely to get good postmortems if engineers
             | fear shaming and punishment, which reading those customer
             | emails is a minor form of.
        
           | ddalex wrote:
           | Not the original googler responding, but I have never
           | experienced what they describe.
           | 
           | Postmortems are always blameless in the sense that "Somebody
           | fat fingered it" is not an acceptable explanation for the
           | causes of an incident - the possibility to fat finger it in
           | the first place must be identified and eliminated.
           | 
           | Opinions are my own, as always
        
             | vitus wrote:
             | > Not the original googler responding, but I have never
             | experienced what they describe.
             | 
             | I have also never experienced this outside of this single
             | instance. It was bizarre, but it tried to reinforce the
             | point
             | that _something needed to change_ -- it was the latest in a
             | string of major customer-facing outages across various
             | parts of TI, potentially pointing to cultural issues with
             | how we build things.
             | 
             | (And that's not wrong, there are plenty of internal memes
             | about the focus on building new systems and rewarding
             | complexity, while not emphasizing maintainability.)
             | 
             | Usually mandatory trainings are things like "how to avoid
             | being sued" or "how to avoid leaking confidential
             | information". Not "you need to follow these rules or else
             | all of Cloud burns down; look, we're already hemorrhaging
             | customer goodwill."
             | 
             | As I said, there was significant scar tissue associated
             | with this event, probably caused in large part by the
             | initial reaction by leadership.
        
           | fires10 wrote:
           | I don't think Google really cares about listening to their
           | users. I have spent more than 6 hours trying to get simple
           | warranty issues resolved. I wish they had to feel the pain of
           | their actions and decisions.
        
           | lozenge wrote:
           | I assume it was training for all SREs, like "this is why
         | we're all doing so much to prevent it from recurring"
        
         | spaetzleesser wrote:
         | " mandatory training where leadership read out emails from
         | customers telling us how we let them down and lost their trust
         | "
         | 
         | The same leadership that demanded tighter and tighter deadlines
         | and discouraged thinking things through?
        
         | grumple wrote:
         | The most remarkable thing about this is learning that anyone at
         | Google read an email from a customer. Given the automated
         | responses to complaints of account shutdowns, or complaints
         | about app store rejections, etc, this is pretty surprising.
        
           | lvs wrote:
           | I'd love to get a read receipt each time someone at Google
           | has actually read my feedback. Then it might be possible to
           | determine whether I'm just shaking my fists at the heavens or
           | not.
        
       | [deleted]
        
       | Ansil849 wrote:
       | > We've done extensive work hardening our systems to prevent
       | unauthorized access, and it was interesting to see how that
       | hardening slowed us down as we tried to recover from an outage
       | caused not by malicious activity, but an error of our own making.
       | I believe a tradeoff like this is worth it -- greatly increased
       | day-to-day security vs. a slower recovery from a hopefully rare
       | event like this. From here on out, our job is to strengthen our
       | testing, drills, and overall resilience to make sure events like
       | this happen as rarely as possible.
       | 
       | I found this to be an extremely deceptive conclusion. This makes
       | it sound like the issue was that Facebook's physical security is
       | just too gosh darn good. But the issue was not Facebook's data
       | center physical security protocols. The issue was glossed over in
       | the middle of the blogpost:
       | 
       | > Our systems are designed to audit commands like these to
       | prevent mistakes like this, but a bug in that audit tool didn't
       | properly stop the command.
       | 
       | The issue was faulty audit code. It is disingenuous to then
       | attempt to spin this like the downtime was due to Facebook's
       | amazing physec protocols.
        
         | cratermoon wrote:
         | Simple Testing Can Prevent Most Critical Failures[1], "We found
         | the majority of catastrophic failures could easily have been
         | prevented by performing simple testing on error handling code -
         | the last line of defense - even without an understanding of the
         | software design."
         | 
         | 1
         | https://www.eecg.utoronto.ca/~yuan/papers/failure_analysis_o...
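         | 
         | A minimal illustration of the paper's point, in the spirit of
         | this outage (the `audit` guard here is a made-up stand-in for
         | whatever "last line of defense" a tool has): test the unhappy
         | path directly instead of only the happy path.
         | 
         |   import unittest
         | 
         |   def audit(links_affected, total_links):
         |       # Toy guard: allow a command only if it touches
         |       # fewer than 5% of backbone links.
         |       if total_links <= 0:
         |           return False  # fail closed on nonsense input
         |       return links_affected / total_links < 0.05
         | 
         |   class AuditErrorPathTest(unittest.TestCase):
         |       def test_blocks_global_disconnect(self):
         |           self.assertFalse(audit(1200, 1200))
         | 
         |       def test_fails_closed_on_bad_input(self):
         |           self.assertFalse(audit(0, 0))
         | 
         |       def test_allows_small_change(self):
         |           self.assertTrue(audit(1, 1200))
         | 
         |   if __name__ == "__main__":
         |       unittest.main()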
        
           | net4all wrote:
           | That article should be required reading for all of us.
        
           | jeffbee wrote:
           | Having a separate testing instance of the internet might not
           | be practical. How exactly would you test such a change?
           | Simulating the effect of router commands is a very daunting
           | challenge.
        
         | temp_praneshp wrote:
         | No one except you is trying to spin anything.
        
           | Ansil849 wrote:
           | Going on at length about how the trade off between prolonged
           | downtime and strict security protocols is a worthy trade off
           | is erecting a nonsensical strawman, the literal definition of
           | spinning a story. The key issue had nothing to do with
           | Facebook's data center security protocols.
        
             | temp_praneshp wrote:
             | No one except you is erecting a nonsensical strawman.
        
               | elzbardico wrote:
               | The FB PR army is at full force today
        
               | Ansil849 wrote:
               | Except, I quoted exactly where they were.
        
         | detaro wrote:
         | And as they say, it led to _slower recovery_ from the event,
         | which was caused, as they also clearly say, by something else.
         | Given that "why is it taking so long to revert a config
         | change?!!!" was a common comment, it's relevant to the
         | discussion.
         | 
         | It's disingenuous to point to a paragraph in the article and
         | complain that it doesn't mention the root cause when they
         | already said _before that, in the same article_ "This was the
         | source of yesterday's outage" about something else.
        
         | educationcto wrote:
         | The original error was the network command, but the slower
         | response and lengthy outage was partially due to the physical
         | security they put in place to prevent malicious activity. Any
         | event like this has multiple root causes.
        
           | Ansil849 wrote:
           | Yes, but the fact that the blogpost concludes on this
           | relatively tangential note (which notably also conveniently
           | allows Facebook to brag about their security measures) and
           | not on the note that their audit code was apparently itself
           | not sufficiently audited, is what makes this deceptive spin.
        
             | closeparen wrote:
             | Our postmortems have three sections. Prevention, detection,
             | and mitigation. They all matter.
             | 
             | Shit happens. People ship bugs. People fat-finger commands.
             | An engineering team's responsibility doesn't stop there. It
             | also needs to quickly activate responders who know what to
             | do and have the tools & access to fix it. Sometimes the
             | conditions that created the issue are within acceptable
             | bounds; the real need for reform is in why it took so long
             | to fix.
        
             | cranekam wrote:
             | I agree that there's an awkward emphasis on how FB
             | prioritizes security and privacy but nothing is deceptive
             | here. Had the audit bug not subsequently cut off access to
             | internal tools and remote regions it would be easy to
             | revert. Had there not been a global outage nobody would
             | have known that the process for getting access in an
             | emergency was too slow.
             | 
             | Huge events like this _always_ have many factors that have
             | to line up just right. To insist that the one and only true
             | cause was a bug in the auditing system is reductive.
        
               | Ansil849 wrote:
               | > I agree that there's an awkward emphasis on how FB
               | prioritizes security and privacy but nothing is deceptive
               | here.
               | 
               | I guess deceptive was the wrong word, so whatever's the
               | term for "awkward emphasis" :).
        
             | adrianmonk wrote:
             | No, they just wanted to cover both "what caused it?" and
             | "why did it take too long to fix it?" since both are topics
             | people were obviously extremely interested in.
             | 
             | It would have been surprising and disappointing if they
             | didn't cover both of them.
        
             | tedunangst wrote:
             | Seems like appropriate emphasis given how many people
             | yesterday were asking why aren't they back online yet. For
             | every person asking why they deleted their routes there
             | were two people asking why they didn't put them back.
        
       | Jamie9912 wrote:
       | What would they have done if the whole data center was destroyed?
        
         | detaro wrote:
         | Continue working from all other data centers, possibly without
         | users really noticing.
        
       | xphos wrote:
       | I don't know, the -f in rm -rf isn't a bug xD. I feel sorry for
       | the poor engineer who fat-fingered the command. It definitely
       | highlights an anti-pattern in the command line, but the fact
       | that a single console had the power to affect the entire network
       | highlights an "interesting" design choice indeed.
        
       | fakeythrow8way wrote:
       | For a lot of people in countries outside the US, Facebook _is_
       | the internet. Facebook has cut deals with various ISPs outside
       | the US to allow people to use their services without it costing
       | any data. Facebook going down is a mild annoyance for us but a
       | huge detriment to, say, Latin America.
        
       | tigerlily wrote:
       | We want ramenporn
        
         | cesarb wrote:
         | Context for those who didn't see it yesterday:
         | https://news.ycombinator.com/item?id=28749244
        
       | jasonjei wrote:
       | Will somebody lose a job over this?
        
         | mman0114 wrote:
         | They shouldn't. If FB has a proper postmortem culture in
         | place, you figure out why the processes in place failed to
         | catch this kind of change, then add more testing, etc. It
         | should be a blameless exercise.
        
         | _moof wrote:
         | Not if FB has a halfway decent engineering culture. People make
         | mistakes. They're practically fundamental to being a person.
         | You can _minimize_ mistakes, but any system that requires
         | perfect human performance will fail.
        
       ___________________________________________________________________
       (page generated 2021-10-05 23:01 UTC)