[HN Gopher] More details about the October 4 outage
___________________________________________________________________
More details about the October 4 outage
Author : moneil971
Score : 320 points
Date : 2021-10-05 17:30 UTC (5 hours ago)
(HTM) web link (engineering.fb.com)
(TXT) w3m dump (engineering.fb.com)
| pm2222 wrote:
| I'm confused by this. Are the DNS servers inside the backbone, or
| outside?
|
| > To ensure reliable operation, our DNS servers disable those BGP
| advertisements if they themselves can not speak to our data
| centers, since this is an indication of an unhealthy network
| connection. In the recent outage the entire backbone was removed
| from operation, making these locations declare themselves
| unhealthy and withdraw those BGP advertisements. The end result
| was that our DNS servers became unreachable even though they were
| still operational. This made it impossible for the rest of the
| internet to find our servers.
| net4all wrote:
| If I understand it correctly they have DNS servers spread out
| at different locations. These locations are also BGP peering
| locations. If the DNS server at a location cannot reach the
| other datacenters via the backbone, it stops advertising its
| IP prefixes. The hope is that traffic will then instead get
| routed to some other facebook location that is still
| operational.
| [deleted]
| toast0 wrote:
| If we make a simplified network map, FB looks more or less like
| a bunch of PoPs (points of presence) at major peering points
| around the world, a backbone network that connects those PoPs
| to the FB datacenters, and the FB operated datacenters
| themselves. (The datacenters are generally located a bit
| farther away from population centers, and therefore peering
| points, so it's sensible to communicate to the outside world
| through the PoPs only)
|
| The DNS servers run at the PoPs, but only BGP advertise the DNS
| addresses when the PoP determines it's healthy. If there's no
| connectivity back to a FB datacenter (or perhaps, no
| connectivity to the preferred datacenter), the PoP is unhealthy
| and won't advertise the DNS addresses over BGP.
|
| Since the BGP change that was pushed eliminated the backbone
| connectivity, none of the PoPs were able to connect to
| datacenters, and so they all, independently, stopped
| advertising the DNS addresses.
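|
| To make that per-PoP behavior concrete, here is a minimal Python
| sketch of the kind of health check being described -- this is not
| Facebook's actual tooling, and the probe targets are hypothetical:
|
|   import socket
|
|   def backbone_healthy(dc_probes, timeout=2.0):
|       """True if at least one datacenter endpoint answers a TCP probe."""
|       for host, port in dc_probes:
|           try:
|               with socket.create_connection((host, port), timeout=timeout):
|                   return True
|           except OSError:
|               continue
|       return False
|
|   def dns_anycast_action(dc_probes):
|       # No datacenter reachable over the backbone: withdraw the anycast
|       # DNS prefix so clients get steered to some other PoP instead.
|       return "announce" if backbone_healthy(dc_probes) else "withdraw"
|
|   # Hypothetical probe targets; during the outage every PoP saw "withdraw".
|   print(dns_anycast_action([("192.0.2.10", 443), ("192.0.2.20", 443)]))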
|
| So that's why DNS went down. Of course, since client access
| goes through load balancers at the PoPs, and the PoPs couldn't
| access the datacenters where requests are actually processed,
| DNS being down wasn't a meaningful impediment to accessing the
| services. Apparently, it was an issue with management (among
| other issues).
|
| Disclosure: I worked at WhatsApp until 2019, and saw some of
| the network diagrams. Network design may have changed a bit in
| the last 2 years, but probably not too much.
| pm2222 wrote:
| Ok so the DNS servers at PoPs, outside of backbone, did not
| go down.
|
| Does that mean they can only respond with public IPs meaningful
| for the local PoP, and not with IPs pointing to other PoPs or
| FB's main DCs? That would mean different public IPs are handed
| out at different PoPs, right?
| toast0 wrote:
| I'm not quite sure I understand the question exactly, but
| let me give it a try.
|
| So, first off, each pop has a /24, so like the seattle-1
| pop which is near me has 157.240.3.X addresses; for me,
| whatsapp.net currently resolves to 157.240.3.54 in the
| seattle-1 pop. these addresses are used as unicast meaning
| they go to one place only, and they're dedicated for
| seattle-1 (until FB moves them around). But there are also
| anycast /24s, like 69.171.250.0/24, where 69.171.250.60 is
| a loadbalancer IP that does the same job as 157.240.3.54,
| but multiple PoPs advertise 69.171.250.0/24; it's served
| from seattle-1 for me, but probably something else for you
| unless you're nearby.
|
| The DNS server IPs are also anycast, so if a PoP is
| healthy, it will BGP advertise the DNS server IPs (or at
| least some of them; if I ping {a-d}.ns.whatsapp.net, I see
| 4 different ping times, so I can tell seattle-1 is only
| advertising d.ns.whatsapp.net right now, and if I worked a
| little harder, I could probably figure out the other PoPs).
|
| Ok, so then I think your question is, if my DNS request for
| whatsapp.net makes it to the seattle1 PoP, will it only
| respond with a seattle-1 IP? That's one way to do it, but
| it's not necessarily the best way. Since my DNS requests
| could make it to any PoP, sending back an answer that
| points at that PoP may not be the best place to send me.
|
| Ideally, you want to send back an answer that is network
| local to the requester and also not a PoP that is
| overloaded. Every fancy DNS server does it a little
| different, but more or less you're integrating a bunch of
| information that links resolver IP to network location as
| well as capacity information and doing the best you can.
| Sometimes that would be sending users to anycast which
| should end up network local (but doesn't always), sometimes
| it's sending them to a specific pop you think is local,
| sometimes it's sending them to another pop because the
| usual best pop has some issue (overloaded on CPU, network
| congestion to the datacenters, network congestion on
| peering/transit, utility power issue, incoming weather
| event, fiber cut or upcoming fiber maintenance, etc).
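|
| As a toy illustration of that kind of answer selection (not FB's
| actual logic; the PoP names, load numbers, and the 198.51.100.x
| VIPs below are made up), in Python:
|
|   # Each PoP has a region, a load estimate, and a unicast VIP; there
|   # is also a shared anycast VIP as a fallback.
|   POPS = {
|       "seattle-1": {"region": "us-west", "load": 0.45, "vip": "157.240.3.54"},
|       "ashburn-1": {"region": "us-east", "load": 0.90, "vip": "198.51.100.54"},
|       "dublin-1":  {"region": "eu-west", "load": 0.30, "vip": "198.51.100.55"},
|   }
|   ANYCAST_VIP = "69.171.250.60"
|
|   def pick_answer(resolver_region, max_load=0.8):
|       local = [p for p in POPS.values()
|                if p["region"] == resolver_region and p["load"] < max_load]
|       if local:
|           return min(local, key=lambda p: p["load"])["vip"]
|       healthy = [p for p in POPS.values() if p["load"] < max_load]
|       if healthy:
|           return min(healthy, key=lambda p: p["load"])["vip"]
|       return ANYCAST_VIP  # last resort: let BGP routing sort it out
|
|   # ashburn-1 is overloaded, so an east-coast resolver is sent elsewhere.
|   print(pick_answer("us-east"))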
|
| But in short, different DNS requests will get different
| answers. If you've got a few minutes, run these commands to
| see the range of answers you could get for the same query:
|   host whatsapp.net                 # using your system resolver settings
|   host whatsapp.net a.ns.whatsapp.net   # direct to authoritative A
|   host whatsapp.net b.ns.whatsapp.net   # direct to B
|   host whatsapp.net 8.8.8.8         # google public DNS
|   host whatsapp.net 1.1.1.1         # cloudflare public DNS
|   host whatsapp.net 4.2.2.1         # level 3 not entirely public DNS
|   host whatsapp.net 208.67.222.222  # OpenDNS
|   host whatsapp.net 9.9.9.9         # Quad9
|
| You should see a bunch of different addresses for the same
| service. FB hostnames do similar things of course.
|
| Adding on, the BGP announcements for the unicast /24s of the
| PoPs didn't go down during yesterday's outage. If you had
| any of the pop specific IPs for whatsapp.net, you could
| still use http://whatsapp.net (or https://whatsapp.net ),
| because the configuration for that hostname is so simple,
| it's served from the PoPs without going to the datacenters
| (it just sets some HSTS headers and redirects to
| www.whatsapp.com, which perhaps despite appearances is a
| page that is served from the datacenters and so would not
| have worked during the outage).
| pm2222 wrote:
| > Ok, so then I think your question is, if my DNS request
| for whatsapp.net makes it to the seattle1 PoP, will it
| only respond with a seattle-1 IP? That's one way to do
| it, but it's not necessarily the best way. Since my DNS
| requests could make it to any PoP, sending back an answer
| that points at that PoP may not be the best place to send
| me. Ideally, you want to send back an answer
| that is network local to the requester and also not a PoP
| that is overloaded. Every fancy DNS server does it a
| little different, but more or less you're integrating a
| bunch of information that links resolver IP to network
| location as well as capacity information and doing the
| best you can. Sometimes that would be sending users to
| anycast which should end up network local (but doesn't
| always), sometimes it's sending them to a specific pop
| you think is local, sometimes it's sending them to
| another pop because the usual best pop has some issue
| (overloaded on CPU, network congestion to the
| datacenters, network congestion on peering/transit,
| utility power issue, incoming weather event, fiber cut or
| upcoming fiber maintenance, etc).
|
| Right I was hoping the DNSs of FB ought to be smarter
| than usual and let's say when DNS at Seattle-1 cannot
| reach backbone it'd respond with IP of perhaps NYC/SF
| before it starts the BGP withdrawal.
|
| Thanks for the write-up, I enjoyed it.
| toast0 wrote:
| > Right I was hoping the DNSs of FB ought to be smarter
| than usual and let's say when DNS at Seattle-1 cannot
| reach backbone it'd respond with IP of perhaps NYC/SF
| before it starts the BGP withdrawal.
|
| The problem there is coordination. The PoPs don't
| generally communicate amongst themselves (and may not
| have been able to after the FB backbone was broken,
| although technically, they could have through transit
| connectivity, it may not be configured to work that way),
| so when a PoP loses its connection to the FB datacenters,
| it also loses its source of what PoPs are available and
| healthy. I think this is likely a classic distributed
| systems problem; the desired behavior when an individual
| node becomes unhealthy is different than when all nodes
| become unhealthy, but the nature of distributed systems
| is that a node can't tell if it's the only unhealthy node
| or all nodes became unhealthy together. Each individual
| PoP did the right thing by dropping out of the anycast,
| but because they all did it, it was the wrong thing.
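|
| A tiny simulation of that trap, sketched in Python (PoP names
| invented): the per-PoP rule is correct when one PoP is sick, but
| when the backbone itself dies, every PoP withdraws at once and
| nobody is left answering DNS.
|
|   POPS = ["seattle-1", "ashburn-1", "dublin-1", "singapore-1"]
|
|   def advertisers(pops_with_backbone):
|       # Each PoP decides alone; no PoP knows what the others decided.
|       return [p for p in POPS if p in pops_with_backbone]
|
|   # One PoP loses its backbone links: traffic shifts to the other three.
|   print(advertisers({"ashburn-1", "dublin-1", "singapore-1"}))
|   # The backbone itself goes away: every PoP withdraws, DNS goes dark.
|   print(advertisers(set()))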
| pm2222 wrote:
| You are to the point and precise. This is exactly the
| problem.
|
| > Each individual PoP did the right thing by dropping out of the
| anycast, but because they all did it, it was the wrong thing.
|
| Somehow I feel the design is flawed because it abuses the DNS
| server's health status a bit. I mean, DNS server down plus BGP
| withdrawal _for the DNS server_ is a sensible combination;
| however, connectivity between DNS and backend servers down
| while DNS is up, plus BGP withdrawal _for the DNS server_, is
| not. The DNS did not fail, and resolution should just fall back
| to some other operational DNS, perhaps a regional/global
| default one.
| fanf2 wrote:
| Facebook's authoritative DNS servers are at the borders between
| Facebook's backbone and the rest of the Internet.
| cranekam wrote:
| The wording is a bit unclear (rushed, no doubt) but I expect
| this means the DNS servers stopped announcing themselves as
| possible targets for the anycasted IPs Facebook uses for its
| authoritative DNS [1], since they learned that the network was
| deemed unhealthy. If they all do that nobody will answer
| traffic sent to the authoritative DNS IPs and nothing works.
|
| [1] See "our authoritative name servers that occupy well known
| IP addresses themselves" mentioned earlier
| pm2222 wrote:
| The anycasted IPs for DNS servers make sense to me, and the
| BGP withdrawal too, when the common case is that one or a few
| PoPs lose connectivity to the backbone/DC; rarely does every
| one of them fail at the same time.
|
| I was hoping the DNS servers at PoPs could be improved to
| respond with public IPs for other PoPs/DCs, and only start the
| BGP withdrawal when none of those are available. Otherwise, I
| presume the available DNS servers at the PoPs decrease over
| time, with the remaining ones getting more and more requests,
| until finally every one of them is cut off from the internet.
| rvnx wrote:
| It's a no-apologies message:
|
| "We failed, our processes failed, our recovery process only
| partially worked, we celebrate failure. Our investors were not
| happy, our users were not happy, some people probably ended in
| physically dangerous situations due to WhatsApp being unavailable
| but it's ok. We believe a tradeoff like this is worth it."
|
| - Your engineering team.
| rPlayer6554 wrote:
| Literally the first line of their first article.[0] This
| article is just a technical explanation.
|
| [0] https://engineering.fb.com/2021/10/04/networking-
| traffic/out...
| jffry wrote:
| Yesterday's blog post (discussion here:
| https://news.ycombinator.com/item?id=28754824) was a direct
| apology.
|
| Should that be repeated in a somewhat more technical discussion
| of why it happened?
| kzrdude wrote:
| We (the users) are the product anyway, they can only apologize
| to themselves for missing out on making money yesterday.
| dmoy wrote:
| Interesting bit on recovery w.r.t. the electrical grid
|
| > flipping our services back on all at once could potentially
| cause a new round of crashes due to a surge in traffic.
| Individual data centers were reporting dips in power usage in the
| range of tens of megawatts, and suddenly reversing such a dip in
| power consumption could put everything from electrical systems
| ...
|
| I wish there was a bit more detail in here. What's the worst case
| there? Brownouts, exploding transformers? Or less catastrophic?
| nomel wrote:
| If your system is pulling 500 watts at 120V, that's around 4A
| of line current. If the line drops to 100V, the supply will
| happily still hold its regulated output, but now the line
| components are seeing ~20% more current, at 5A. For brownout,
| you need
| to overrate your components, and/or shut everything off if the
| line voltage goes too low.
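|
| In rough numbers (ignoring efficiency and power factor), the
| effect looks like this:
|
|   # A switching supply holds its output roughly constant, so input
|   # current rises as the line voltage sags.
|   power_w = 500.0
|   for line_v in (120.0, 100.0):
|       print(f"{line_v:.0f} V -> {power_w / line_v:.2f} A")
|   # 120 V -> 4.17 A, 100 V -> 5.00 A: ~20% more current through the
|   # same line-side components.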
|
| I used to do electrical compliance testing in a previous life,
| with brown out testing being one of our safety tests. You would
| drape a piece of cheese cloth over the power supply and slowly
| ramp the line voltage down. At the time, the power supplies
| didn't have good line side voltage monitoring. There was almost
| always smoke, and sometimes cheese cloth fires. Since this was
| safety testing, pass/fail was mostly based on if the cheese
| cloth caught fire, not if the power supply was damaged.
| dreamlayers wrote:
| Why wouldn't the power supply shut down due to overcurrent
| protection?
| detaro wrote:
| That's likely on the output side, which doesn't see
| overcurrent.
| kube-system wrote:
| Likely tripping breakers or overload protection on UPSes?
|
| Often PDUs used in a rack can be configured to start servers up
| in a staggered pattern to avoid a surge in demand for these
| reasons.
|
| I'd imagine there's more complications when you're doing an
| entire DC vs just a single rack, though.
| topspin wrote:
| Disk arrays have been staggering drive startup for a long
| time for this reason. Sinking current into hundreds of little
| starting motors simultaneously is a bad idea.
| Johnny555 wrote:
| I don't see how suddenly running more traffic is going to
| trip datacenter breakers -- I could see how flipping on power
| to an entire datacenter's worth of servers could cause a
| spike in electrical demand that the power infrastructure
| can't handle, but if suddenly running CPUs at 100% trips
| breakers, then it seems like that power infrastructure is
| undersized? This isn't a case where servers were powered off,
| they were idle because they had no traffic.
|
| Do large providers like Facebook really provision less power
| than their servers would require at 100% utilization? Seems
| like they could just use fewer servers with power sized at
| 100% if their power system is going to constrain utilization
| anyway?
| kube-system wrote:
| I don't know the answer. But it's not too uncommon, in
| general, to provision for reasonable use cases plus a
| margin, rather than provision for worst case scenario.
| dfcowell wrote:
| All of the components in the supply chain will be rated for
| greater than max load, however power generation at grid
| scale is a delicate balancing act.
|
| I'm not an electrical engineer, so the details here may be
| fuzzy, however in broad strokes:
|
| Grid operators constantly monitor power consumption across
| the grid. If more power is being drawn than generated, line
| frequency drops _across the whole grid._ This leads to
| brownouts and can cause widespread damage to grid equipment
| and end-user devices.
|
| The main way to manage this is to bring more capacity
| online to bring the grid frequency back up. This is slow,
| since spinning up even "fast" generators like natural gas
| can take on the order of several minutes.
|
| Notably, this kind of scenario is the whole reason the
| Tesla battery in South Australia exists. It can respond to
| spikes in demand (and consume surplus supply!) much faster
| than generator capacity can respond.
|
| The other option is load shedding, where you just
| disconnect parts of your grid to reduce demand.
|
| Any large consumers (like data center operators) likely
| work closely with their electricity suppliers to be good
| citizens and ramp up and down their consumption in a
| controlled manner to give the supply side (the power
| generators) time to adjust their supply as the demand
| changes.
|
| Note that changes to power draw as machines handle
| different load will also result in changes to consumption
| in the cooling systems etc. making the total consumption
| profile substantially different coming from a cold start.
| Johnny555 wrote:
| You're talking about the grid, the OP was talking about
| datacenter infrastructure -- which one is the weak link?
|
| If a datacenter can't go from idle (but powered on)
| servers to fully utilized servers without taking down the
| power grid, then it seems that they'd have software
| controls in place to prevent this, since there are other
| failure modes that could cause this behavior other than a
| global Facebook outage.
| dfcowell wrote:
| Unfortunately the article doesn't provide enough explicit
| detail to be 100% sure one way or the other, however my
| read is that it's probably the grid.
|
| > Individual data centers were reporting dips in power
| usage in the range of tens of megawatts, and suddenly
| reversing such a dip in power consumption could put
| everything from electrical systems to caches at risk.
|
| "Electrical systems" is vague and could refer to either
| internal systems, external systems or both.
|
| That said, if the DC is capable of running under
| sustained load at peak (which we have to assume it is,
| since that's its normal state when FB is operational) it
| seems to me like the externality of the grid is the more
| likely candidate.
|
| In terms of software controls preventing this kind of
| failure mode, they do have it - load shedding. They'll
| cut your supply until capacity is made available.
| cesarb wrote:
| The key word is "suddenly".
|
| In the electricity grid, demand and generation must always
| be precisely matched (otherwise, things burn up). This is
| done by generators automatically ramping up or down
| whenever the load changes. But most generators cannot
| change their output instantly; depending on the type of
| generator, it can take several minutes or even hours to
| respond to a large change in the demand.
|
| Now consider that, on modern servers, most of the power
| consumption is from the CPU, and also there's a significant
| difference on the amount of power consumed between 100% CPU
| and idle. Imagine for instance 1000 servers (a single rack
| can hold 40 servers or more), each consuming 2kW of power
| at full load, and suppose they need only half that at idle
| (it's probably even less than half). Suddenly switching
| from idle to full load would mean 1MW of extra power has to
| be generated; while the generators are catching up to that,
| the voltage drops, which means the current increases to
| compensate (unlike incandescent lamps, switching power
| supplies try to maintain the same output no matter the
| input voltage), and breakers (which usually are configured
| to trip on excess current) can trip (without breakers, the
| wiring would overheat and burn up or start a fire).
|
| If the load changes slowly, on the other hand, there's
| enough time for the governor on the generators to adjust
| their power source (opening valves to admit more water or
| steam or fuel), and overcome the inertia of their large
| spinning mass, before the voltage drops too much.
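|
| Plugging in the numbers from that example (the idle figure is the
| assumption stated above):
|
|   servers = 1000
|   full_load_kw = 2.0
|   idle_kw = 1.0   # assumed: roughly half of full load
|   step_mw = servers * (full_load_kw - idle_kw) / 1000
|   print(f"sudden extra demand: {step_mw:.1f} MW per 1000 servers")
|   # Facebook reported dips "in the range of tens of megawatts" per
|   # data center, i.e. tens of thousands of servers' worth of load.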
| Johnny555 wrote:
| I get that lots of servers can add up to lots of power,
| but what is a "lot"? Is 1MW really enough demand to
| destabilize a regional power grid?
| mschuster91 wrote:
| One case is automated protection systems in the grid detecting
| a sudden jump in current and assuming an insulation failure along
| the path - basically, not enough current to trip the short-
| circuit breakers, but enough to raise an alarm.
| plorg wrote:
| Brownouts are probably the most proximate concern - a sudden
| increase in demand will draw down the system frequency in the
| vicinity, and if there aren't generation units close enough or
| with enough dispatchable capacity there's a small chance they
| would trip a protective breaker.
|
| A person I know on the power grid side said at one data center
| there were step functions when FB went down and then when it
| came up, equal to about 20% of the load behind the distribution
| transformer. That quantity is about as much as an aluminum
| smelter switching on or off.
| Johnny555 wrote:
| But don't their datacenters all have backup generators? So
| worst case in a brownout, they fail over to generator power,
| then can start to flip back to utility power slowly.
|
| Or do they forgo backup generators and count on shifting
| traffic to a new datacenter if there's a regional power
| outage?
| kgermino wrote:
| Edit to be less snarky:
|
| I assume they do have backup generators, though I don't
| know.
|
| However if the sudden increase put that much load on the
| grid it could drop the frequency enough to blackout the
| entire neighborhood. That would be bad even if FB was able
| to keep running through it.
| [deleted]
| MrDunham wrote:
| I'm very close with someone who works at a FB data center and
| was discussing this exact issue.
|
| I can only speak to one problem I know of (and am rather sure I
| can share): a spike might trip a bunch of breakers at the data
| center.
|
| BUT, unlike me at home, FBs policy is to never flip a circuit
| back on until you're _positive_ of the root cause of said trip.
|
| By itself that could compound issues and delay ramp up time as
| they'd work to be sure no electrical components actually
| shorted/blew/etc. A potentially time-sucking task given these
| buildings could be measured in whole units of football fields.
| [deleted]
| Ansil849 wrote:
| > The backbone is the network Facebook has built to connect all
| our computing facilities together, which consists of tens of
| thousands of miles of fiber-optic cables crossing the globe and
| linking all our data centers.
|
| This makes it sound like Facebook has physically laid "tens of
| thousands of miles of fiber-optic cables crossing the globe and
| linking all our data centers". Is this in fact true?
| colechristensen wrote:
| Likely a mixture of bought, leased, and self laid fiber. This
| is not at all uncommon and basically necessary if you have your
| own data center.
| Closi wrote:
| Yes, although undoubtedly a lot of this is shared investment -
| https://www.businessinsider.com/google-facebook-giant-unders...
|
| https://datacenterfrontier.com/facebook-will-begin-selling-w...
| louwrentius wrote:
| > Our primary and out-of-band network access was down
|
| Don't create circular dependencies.
| oarabbus_ wrote:
| How do you avoid circular dependencies on an out-of-band-
| network? Seems like the choice is between a circular
| dependency, or turtles all the way down.
| detaro wrote:
| How do you go from "have a separate access method that
| doesn't depend on your main system" to "turtles all the way
| down"? The secondary access is allowed to have dependencies,
| just not on _your_ network.
| oarabbus_ wrote:
| And if the secondary access fails, then what? Backup
| systems are not reliable 100% of the time.
| Ajedi32 wrote:
| Then you're SOL. What's your point? The backup might
| fail, so don't have a backup? I don't understand what
| you're trying to say.
| _moof wrote:
| Then you were 1-FT, which is still worlds better than
| 0-FT.
|
| "Don't put two engines on the plane because both of them
| might fail" is not how fault tolerance works.
| vitus wrote:
| With something as fundamental as the network, no way around it.
|
| - Okay, we'll set up a separate maintenance network in case we
| can't get to the regular network.
|
| - Wait, but we need a maintenance network for the maintenance
| network...
| packetslave wrote:
| "Okay, we'll pull in a DSL line from a completely separate
| ISP for the out-of-band access." (guess what else is in that
| manhole/conduit?)
|
| "Okay, we'll use LTE for out-of-band!" (oops, the backhaul
| for the cell tower goes under the same bridge as the real
| network)
|
| True diversity is HARD (not unsolvable, just hard. especially
| at scale)!
| chasd00 wrote:
| heh i toured a large data center here in dallas and
| listened to them brag about all the redundant connectivity
| they had while standing next to the conduit where they all
| entered the building. One person, a pair of wire cutters,
| and 5 seconds and that whole datacenter is dark.
| detaro wrote:
| Although the difference here is that losing connection and
| out-of-band for a single data center shouldn't be as
| catastrophic for Facebook, so your examples would be
| tolerable?
| packetslave wrote:
| That's the trick, though: if you don't do that level of
| planning for _all_ of your datacenters and POPs (and
| fiber huts out in the middle of nowhere), it's
| inevitable that the one you most need to access during an
| outage will be the one where your OOB got backhoe'd.
|
| Murphy is a jerk.
| tristor wrote:
| Two is One, One is None. There are absolutely ways around
| this, it's called redundancy. The marginal cost of laying an
| extra pair during physical plant installation is basically
| $0, which is why you'd never go "well we need a backup for
| the backup, so there's no point in having two pairs".
| Similarly, the marginal cost for having a second UPS and PDU
| in a rack is effectively $0 at scale, so nobody would argue
| this is unnecessary to deal with possible UPS failure or
| accidentally unplugging a cable.
|
| In this case, there are likely several things that can be
| changed systemically to mitigate or prevent similar failures
| in the future, and I have every faith that Facebook's SRE
| team is capable of identifying and implementing those
| changes. There is no such thing as "no way around it", unless
| you're dealing with a law of physics.
| vitus wrote:
| By "no way around it" I mean you're going to need to create
| a circular dependency at some point, whether it's a
| maintenance network that's used to manage itself, or the
| prod network for managing the maintenance network.
|
| I absolutely agree that installing a maintenance network is
| a good idea. One of the big challenges, though, is making
| sure that all your tooling can and will run exclusively on
| the maintenance network if needed.
|
| (Also, while the marginal cost of laying an extra pair of
| fiber during physical installation may be low, making sure
| that you have fully independent failure domains is much
| higher, whether that's leased fiber, power, etc.)
| HenryKissinger wrote:
| > During one of these routine maintenance jobs, a command was
| issued with the intention to assess the availability of global
| backbone capacity, which unintentionally took down all the
| connections in our backbone network, effectively disconnecting
| Facebook data centers globally
|
| Imagine being this person.
|
| Tomorrow on /r/tifu.
| blowski wrote:
| I doubt Facebook engineers are free-typing commands on Bash, so
| it's probably not an individual error. More likely to be a race
| condition or other edge case that wasn't considered during a
| review. This might be a script that's run 1000s of times before
| with no problems.
| packetslave wrote:
| Back in Ye Old Dark Ages, I caused a BIG Google outage by
| running a routine maintenance script that had been run dozens
| if not hundreds of times before.
|
| Turns out the underlying network software had a race
| condition that would ONLY be hit if the script ran at the
| exact same time as some automated monitoring tools polled the
| box.
|
| At FAANG scale, "one in a million" happens a lot more often
| than you'd think.
| savant_penguin wrote:
| Hahahahahahahaha
| kabdib wrote:
| "It was the intern" is only cute once.
| Barrin92 wrote:
| of course if one person can knock down an entire global system
| through a trivial mistake the problem is obviously not the
| person to begin with, but the architecture of the system.
| flutas wrote:
| Or the fact that there was a bug in the tool that should have
| prevented this.
|
| > Our systems are designed to audit commands like these to
| prevent mistakes like this, but a bug in that audit tool
| didn't properly stop the command.
| bastardoperator wrote:
| So this is really two people's fault. One for issuing the
| command the other for introducing the bug in the audit
| tool.
| colechristensen wrote:
| Not really, there's essentially always a button that blows
| everything up. Catastrophic failures usually end up being a
| large set of safety systems malfunctioning which would
| otherwise prevent the issue when that button is pressed.
|
| But yes, for these types of problem, the ultimate fault is
| never "that guy Larry is an idiot", it takes a large team of
| cooperating mistakes.
| harshreality wrote:
| > To ensure reliable operation, our DNS servers disable those BGP
| advertisements if they themselves can not speak to our data
| centers, since this is an indication of an unhealthy network
| connection.
|
| No, it's (clearly) not a guaranteed indication of that. Logic
| fail. Infrastructure tools at that scale need to handle all
| possible causes of test failures. "Is the internet down or only
| the few sites I'm testing?" is a classic network monitoring
| script issue.
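|
| A minimal sketch (in Python; the internal hostnames are
| hypothetical) of the classic guard being described: check
| independent reference hosts before concluding the targets
| themselves are down.
|
|   import socket
|
|   def reachable(host, port=443, timeout=2.0):
|       try:
|           with socket.create_connection((host, port), timeout=timeout):
|               return True
|       except OSError:
|           return False
|
|   MY_TARGETS = ["dc1.internal.example", "dc2.internal.example"]
|   REFERENCES = ["one.one.one.one", "dns.google"]  # independent third parties
|
|   targets_down = not any(reachable(h) for h in MY_TARGETS)
|   refs_down = not any(reachable(h) for h in REFERENCES)
|
|   if targets_down and refs_down:
|       print("my own connectivity looks broken; don't take drastic action")
|   elif targets_down:
|       print("targets really do look down")
|   else:
|       print("targets reachable")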
| jaywalk wrote:
| I think you're misunderstanding. The DNS servers (at Facebook
| peering points) had zero access to Facebook datacenters because
| the backbone was down. That is as unhealthy as the network
| connection can get, so they (correctly) stopped advertising the
| routes to the outside world.
|
| By that point, the Facebook backbone was already gone. The DNS
| servers stopping BGP advertisements to the outside world did
| not cause that.
| harshreality wrote:
| You're talking about _backend_ network connections to
| facebook's datacenters as if that's the only thing that
| matters. I'm talking about _overall_ network connection
| including the internet-facing part.
|
| Facebook's infrastructure at their peering points loses all
| contact with their respective facebook datacenter(s).
|
| Their response is to automatically withdraw routes to
| themselves. I suppose they assumed that all datacenters would
| never go down at the same time, so that client dns redundancy
| would lead to clients using other dns servers that could
| still contact facebook datacenters. It's unclear how those
| routes could be restored without on-site intervention. If
| they automatically detect when the datacenters are reachable
| again, that too requires on-site intervention since after
| withdrawing routes FB's ops tools can't do anything to the
| relevant peering points or datacenters.
|
| But even without the catastrophic case of all datacenter
| connections going down, you don't need to be a facebook ops
| engineer to realize that there are problems that need to be
| carefully thought through when ops tools depend on the same
| (public) network routes and DNS entries that the DNS servers
| are capable of autonomously withdrawing.
| halotrope wrote:
| It is completely logical but still kind of amazing, that facebook
| plugged their globally distributed datacenters together with
| physical wire.
| packetslave wrote:
| What do you imagine other companies use to connect their
| datacenters?
| meepmorp wrote:
| Uh... the cloud?
| zht wrote:
| isn't that still over wires?
| henrypan1 wrote:
| Wow
| nonbirithm wrote:
| DNS seems to be a massive point of failure everywhere, even
| taking out the tools to help deal with outages themselves. The
| same thing happened to Azure multiple times in the past, causing
| complete service outages. Surely there must be some way to better
| mitigate DNS misconfiguration by now, given the exceptional
| importance of DNS?
| bryan_w wrote:
| DNS was very much a proximate cause. In most cases you want
| your anycast dns servers to shoot themselves in the head if
| they detect their connection to origin to be interrupted. This
| would have been a big outage anyway, just at a different
| layer.
|
| Oddly enough, one could consider that behavior something that
| was put in place to "mitigate DNS misconfiguration"
| cnst wrote:
| But DNS didn't actually fail. Their design says DNS must go
| offline if the rest of the network is offline. That's exactly
| what DNS did.
|
| Sounds like their design was wrong, but you can't just blame
| DNS. DNS worked 100% here as per the task that it was given.
|
| > To ensure reliable operation, our DNS servers disable those
| BGP advertisements if they themselves can not speak to our data
| centers, since this is an indication of an unhealthy network
| connection.
| jaywalk wrote:
| I'm not sure the design was even wrong, since the DNS servers
| being down didn't meaningfully contribute to the outage. The
| entire Facebook backbone was gone, so even if the DNS servers
| continued giving out cached responses clients wouldn't be
| able to connect anyway.
| Xylakant wrote:
| DNS being down instead of returning an unreachable
| destination did increase load for other DNS resolvers
| though since empty results cannot be cached and clients
| continued to retry. This made the outage affect others.
| cnst wrote:
| Source?
|
| DNS errors are actually still cached; it's something that
| has been debunked by DJB like a couple of decades ago,
| give or take:
|
| http://cr.yp.to/djbdns/third-party.html
|
| > RFC 2182 claims that DNS failures are not cached; that
| claim is false.
|
| Here are some more recent details and the fuller
| explanation:
|
| https://serverfault.com/a/824873
|
| Note that FB.com currently expires its records in 300
| seconds, which is 5 minutes.
|
| PowerDNS (used by ordns.he.net) caches servfail for 60s
| by default -- packetcache-servfail-ttl -- which isn't
| very far from the 5min that you get when things aren't
| failing.
|
| Personally, I do agree with DJB -- I think it's a better
| user experience to get a DNS resolution error right away,
| than having to wait many minutes for the TCP timeout to
| occur when the host is down anyways.
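|
| If you want to check the TTL being discussed yourself, one way is
| with the third-party dnspython package (pip install dnspython):
|
|   import dns.resolver
|
|   answer = dns.resolver.resolve("fb.com", "A")
|   print(answer.rrset.ttl, [r.address for r in answer])
|   # A TTL around 300s means even a successful answer is only cached
|   # for ~5 minutes, so a resolver caching SERVFAIL for 60s is not far
|   # off from normal behavior -- the point made above.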
| cnst wrote:
| Exactly. And it would actually be worse, because the
| clients would have to wait for a timeout, instead of simply
| returning a name error right away.
| mbauman wrote:
| How would it have been worse? Waiting for a timeout is a
| _good_ thing as it prevents a thundering herd of refresh-
| smashing (both automated and manual).
|
| I don't know BGP well, but it seems easier for peers to
| just drop FB's packets on the floor than deal with a DNS
| stampede.
| cnst wrote:
| An average webpage today is several megabytes in size.
|
| How would a few bytes over a couple of UDP packets for
| DNS have any meaningful impact on anyone's network? If
| anything, things fail faster, so, there's less data to
| transmit.
|
| For example, I often use ordns.he.net as an open
| recursive resolver. They use PowerDNS as their software.
| PowerDNS has the default of packetcache-servfail-ttl of
| 60s. OTOH, fb.com A response currently has a TTL of 300s
| -- 5 minutes. So, basically, FB's DNS is cached for
| roughly the same time whether or not they're actually
| online.
| mbauman wrote:
| The rest of the internet sucked yesterday, and my
| understanding was it was due to a thundering herd of
| recursive DNS requests. Slowing down clients seems like a
| good thing.
| adrianmonk wrote:
| > _DNS seems to be a massive point of failure everywhere_
|
| Emphasis on the "seems". DNS gets blamed a lot because it's the
| very first step in the process of connecting. When everything
| is down, you will see DNS errors.
|
| And since you can't get past the DNS step, you never see the
| other errors that you would get if you could try later steps.
| If you knew the web server's IP address to try to make a TCP
| connection to it, you'd get connection timed out errors. But
| you don't see those errors because you didn't get to the point
| where you got an IP address to connect to.
|
| It's like if you go to a friend's house but their electricity
| is out. You ring the doorbell and nothing happens. Your first
| thought is that the doorbell is messed up. And you're not
| wrong: it is, but so is everything else. If you could ring it
| and get their attention to let you inside in their house, you'd
| see that their lights don't turn on, their TV doesn't turn on,
| their refrigerator isn't running, etc. But those things are
| hidden to you because you're stuck on the front porch.
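|
| To see the difference concretely, a small Python sketch (hostnames
| here are placeholders): a name-resolution failure masks the
| connection timeout you would otherwise see.
|
|   import socket
|
|   def try_connect(host, port=443, timeout=3.0):
|       try:
|           addr = socket.gethostbyname(host)     # the "DNS step"
|       except socket.gaierror as e:
|           return f"DNS error: {e}"              # all you see when DNS is gone
|       try:
|           with socket.create_connection((addr, port), timeout=timeout):
|               return "connected"
|       except OSError as e:
|           return f"TCP error: {e}"              # hidden behind the DNS error
|
|   print(try_connect("nonexistent.invalid"))  # fails at the DNS step
|   print(try_connect("203.0.113.1"))          # resolves (IP literal), then times out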
| dbbk wrote:
| Seems like the simplest solution would be to just move recovery
| tooling to their own domain / DNS?
| tantalor wrote:
| So it wasn't a config change, it was a command-of-death.
| rblion wrote:
| Was this a way to delete a lot of evidence before shit really hit
| the fan?
|
| After reading this, I can't help but feel this was a calculated
| move.
|
| It gives FB a chance to hijack media attention from the
| whistleblower. It gives them a chance to show the average person,
| 'hey, we make mistakes and we have a review process to improve
| our systems'.
|
| The timing is too perfect if you ask me.
| FridayoLeary wrote:
| Inviting further whistleblowing.
| joemi wrote:
| I'm not usually that cynical, but the timing of it combined
| with facebook's lengthy abusive relationship with customers'
| privacy (and what kind of company morals that implies) makes me
| think that it's definitely a possibility.
| rblion wrote:
| The testimony in front of congress and the reaction is just
| making me feel even more like this was a calculated move or
| internal sabotage.
| Hokusai wrote:
| > One of the jobs performed by our smaller facilities is to
| respond to DNS queries. DNS is the address book of the internet,
| enabling the simple web names we type into browsers to be
| translated into specific server IP addresses.
|
| What is the target audience of this post? It is too technical for
| non-technical people, but also it is dumbed down to try to
| include people that do not know how the internet works. I feel
| like I'm missing something.
| Florin_Andrei wrote:
| > _Those data centers come in different forms._
|
| It's like the birds and the bees.
| gist wrote:
| > What is the target audience of this post?
|
| Separate point to your question.
|
| FB is under no obligation to provide more details than they
| need to because a small segment of the population (certainly
| relative to their 'customers') might find it interesting or
| helpful or entertaining. FB is a business. They can essentially
| do (and should be able to) do whatever they want. There is no
| requirement (and there should be no requirement) to provide the
| general public with more info than they want subject to any
| legal requirement. If the government wants more (and are
| entitled to more info) they can ask for it and FB can decide if
| they are required to comply.
|
| FB is a business. Their customers are not the tech community
| looking to get educated and avoid issues themselves at their
| companies or (as mentioned) be entertained. And their customers
| (advertisers or users) can decide if they want to continue to
| patronize FB.
|
| I always love on HN seeing 'hey where is the post mortem' as if
| it's some type of defacto requirement to air dirty laundry to
| others.
|
| If I go to the store and there are no paper towels there, I
| don't need to know why there are no towels and what the company
| will do going forward to prevent any errors that caused the
| lack of that product. I can decide to buy another brand or
| simply take steps to not have it be an issue.
| jablan wrote:
| > If I go to the store and there is not paper towels there I
| don't need to know why there are no towels
|
| You don't _need_ to know, but it's human to want to know, and
| it's also human to want to satisfy other human's curiosity,
| especially if it doesn't bring any harm to you.
|
| Also, your post is not really answering any of GP's
| questions. I presume you wanted to say that FB doesn't _owe_
| any explanation to us, but since FB already provided one, the
| GP asked to whom it is addressed.
| Hokusai wrote:
| The aviation industry has this solved: it's mandatory to report
| certain kind of incidents to avoid them in the future and
| inform the aviation community. https://www.skybrary.aero/inde
| x.php/Mandatory_Occurrence_Rep...
|
| That the main form of personal communication for hundreds of
| millions of users is down and there is no mandatory reporting
| is irresponsible. That Facebook is a business does not mean
| that they do not have responsibilities towards society.
|
| Facebook is not your local supermarket, it has global impact.
| tehjoker wrote:
| One would imagine a large local supermarket going down
| would owe the people it serves some explanation. That's
| where their food comes from.
|
| At this point, I am completely sick of the pro-corporate
| rhetoric to let businesses do whatever they want. They
| exist to serve the public and they should be treated as
| such.
| Gulfick wrote:
| Poor targeting choices.
| whymauri wrote:
| Huh? I would hardly describe this as technical. Someone with a
| high school education can read it and get the gist. It's
| actually somewhat impressive how it walks the line between
| accessibility and 'just detailed enough'.
| bovermyer wrote:
| I'm guessing it has multiple target audiences. Those that won't
| understand some of the technical jargon (e.g., "IP addresses")
| will still be able to follow the general flow of the article.
|
| Those of us who are familiar with the domain of knowledge, on
| the other hand, get a decent summary of events.
|
| It's a balancing act. I think the article does a good enough
| job of explaining things.
| [deleted]
| thirtyseven wrote:
| With an outage this big, even a post for a technical audience
| will get read by non-technical people (including journalists),
| so I'm sure it helps to include details like this.
| submain wrote:
| The media: "Facebook engineer typed command. This is what
| happened next."
| riffic wrote:
| I'm reading your comment as a form of "feigning surprise", in
| other words a statement along the lines of "I can't believe
| target audience doesn't know about x concept".
|
| more on the concept: https://noidea.dog/blog/admitting-ignorance
| simooooo wrote:
| Stupid journalists
| shadofx wrote:
| Teenagers who are responsible for managing the family router?
| eli wrote:
| The media. This was a huge international story.
| blowski wrote:
| Both those groups of people. I imagine they would be accused
| of it being either too complicated or too dumbed down, so
| they do both in the same article.
| 88913527 wrote:
| There are plenty of technical people (or people employed in
| technical roles) who don't understand how DNS works. For
| example, I field questions on why "hostname X only works when
| I'm on VPN" at work.
| byron22 wrote:
| What kind of bgp command would do that?
| TedShiller wrote:
| During the outage, FB briefly made the world a better place
| bilater wrote:
| I want to know what happened to the poor engineer who issued the
| command?
| pjscott wrote:
| The blog post is putting the blame on a bug in the tooling
| which _should_ have made the command impossible to issue, which
| is exactly where the blame ought to go.
| bilater wrote:
| Still I'd hate to be the first 'Why' of a multi-billion
| dollar outage :D
| chovybizzass wrote:
| or the manager who told him to "do it anyway" when he raised
| concern.
| antocv wrote:
| He will be promoted to street dweller while his managers will
| fail up.
| runawaybottle wrote:
| Actually he was promoted to C-Level, CII, Chief Imperial
| Intern 'For Life'.
|
| It's a great accomplishment to be fair, comes with a
| lifetime weekly stipend and access to whatever Frontend
| books/courses you need to be a great web developer.
|
| Will never touch ops again.
| cnst wrote:
| Note that contrary to popular reports, DNS was NOT to blame for
| this outage -- for once DNS worked exactly as per the spec,
| design and configuration:
|
| > To ensure reliable operation, our DNS servers disable those BGP
| advertisements if they themselves can not speak to our data
| centers, since this is an indication of an unhealthy network
| connection.
| jimmyvalmer wrote:
| tldr; a maintenance query was issued that inexplicably severed
| FB's data centers from the internet, which unnecessarily caused
| their DNS servers to mark themselves defunct, which made it all
| but impossible for their guys to repair the problem from HQ,
| which compelled them to physically dispatch field units whose
| progress was stymied by recent increased physical security
| measures.
| antocv wrote:
| > caused their DNS servers to mark themselves defunct
|
| This is awkward for me too, why should a DNS server withdraw
| BGP routes? Design fail.
| yuliyp wrote:
| It's a trade-off.
|
| Imagine you have some DNS servers at a POP. They're connected
| to a peering router there which is connected to a bunch of
| ISPs. The POP is connected via a couple independent fiber
| links to the rest of your network. What happens if both of
| those links fail?
|
| Ideally the rest of your service can detect that this POP is
| disconnected, and adjust DNS configuration to point users
| toward POPs which are not disconnected. But you still have
| that DNS server which can't see that config change (since
| it's disconnected from the rest of your network) but still
| reachable from a bunch of local ISPs. That DNS server will
| continue to direct traffic to the POP which can't handle it.
|
| What if that DNS server were to mark itself unavailable? In
| that case, DNS traffic from ISPs near that POP would instead
| find another DNS server from a different POP, and get a
| response which pointed toward some working POP instead. How
| would the DNS server mark itself unavailable? One way is to
| see if it stopped being able to communicate with the source
| of truth.
|
| Yesterday all of the DNS servers stopped being able to
| communicate with the source of truth, so marked themselves
| offline. This code assumes a network partition, so can't
| really rely on consensus to decide what to do.
| colechristensen wrote:
| Designed to handle individual POPs/DCs going down
| miyuru wrote:
| Most of the large DNS services are anycasted via BGP. (All
| POPs announce the same IP prefix) It makes sense to stop the
| BGP routing if the POP is unhealthy. Traffic will flow to the
| next healthy POP.
|
| In this case the DNS service in the POP is unhealthy, so the
| IP addresses belonging to the DNS service are removed from the
| POP.
| chronid wrote:
| Note those are anycast addresses, my guess is the DNS server
| gives out addresses for FB names pointing your traffic to the
| POP the DNS server is part of.
|
| If the POP is not able to connect to the rest of Facebook's
| network, the POP stops announcing itself as available and
| that DNS and part of the network goes away so your traffic
| can go somewhere else.
| jbschirtzs wrote:
| "The Devil's Backbone"
| cube00 wrote:
| _> We've done extensive work hardening our systems to prevent
| unauthorized access, and it was interesting to see how that
| hardening slowed us down as we tried to recover from an outage
| caused not by malicious activity, but an error of our own making.
| I believe a tradeoff like this is worth it -- greatly increased
| day-to-day security vs. a slower recovery from a hopefully rare
| event like this._
|
| If you correctly design your security with appropriate fall backs
| you don't need to make this trade off.
|
| If that story of the Facebook campus having no physical key holes
| on doors is true it just speaks to an arrogance of assuming
| things can never fail so we don't even need to bother planning
| for it.
| joshuamorton wrote:
| Can you elaborate on this? There are always going to be
| security/reliability tradeoffs. Things that fail closed for
| security reasons will cause slower incident responses. That's
| unavoidable. Innovation can improve the frontier, but there
| will always be tradeoffs.
| cube00 wrote:
| Slower sure, but not five hour slow.
| secguyperson wrote:
| > a command was issued with the intention to assess the
| availability of global backbone capacity, which unintentionally
| took down all the connections in our backbone network,
| effectively disconnecting Facebook data centers globally
|
| From a security perspective, I'm blown away that a single person
| apparently had the technical permissions to do such a thing. I
| can't think of any valid reason that a single person would have
| the ability to disconnect every single data center globally. The
| fact that such functionality exists seems like a massive foot-
| gun.
|
| At a minimum I would expect multiple layers of approval, or
| perhaps regionalized permissions, so that even if this person did
| run an incorrect command, the system turns around and says "ok
| we'll shut down the US data centers but you're not allowed to
| issue this command for the EU data centers, so those stay up".
| chomp wrote:
| So someone ran "clear mpls lsp" instead of "show mpls lsp"?
| slim wrote:
| My guess : he executed the command from shell history. Commands
| look similar enough and he hit Enter too quickly
| bryan_w wrote:
| Seems like it. It's kinda like typing hostname and accidentally
| poking your yubikey (Not that I've done that...) or the date
| command, which lets you both set the date and format the date
| divbzero wrote:
| For context, parent comment is trying to decipher this heavily-
| PR-reviewed paragraph:
|
| > _During one of these routine maintenance jobs, a command was
| issued with the intention to assess the availability of global
| backbone capacity, which unintentionally took down all the
| connections in our backbone network, effectively disconnecting
| Facebook data centers globally. Our systems are designed to
| audit commands like these to prevent mistakes like this, but a
| bug in that audit tool didn't properly stop the command._
| i_like_apis wrote:
| > the total loss of DNS broke many of the internal tools we'd
| normally use to investigate and resolve outages like this.
|
| > this took time, because these facilities are designed with high
| levels of physical and system security in mind. They're hard to
| get into, and once you're inside, the hardware and routers are
| designed to be difficult to modify even when you have physical
| access to them.
|
| Sounds like it was the perfect storm.
| codebolt wrote:
| Apparently they had to bring in the angle grinder to get access
| to the server room.
|
| https://twitter.com/cullend/status/1445156376934862848?t=P5u...
| jd3 wrote:
| Was this ever confirmed? NYT tech reporter Mike Isaac issued a
| correction to his previous reporting about it.
|
| > the team dispatched to the Facebook site had issues getting
| in because of physical security but did not need to use a saw/
| grinder.
|
| https://twitter.com/MikeIsaac/status/1445196576956162050
| tpmx wrote:
| > so.....they had to use a jackhammer, got it
| 14 wrote:
| This. We know they lie so have to assume the worst when
| they say something. The correction only says saw/angle
| grinder. "No tools were needed" is a little clearer.
| However I am not sure why it matters if they used a key or
| a grinder?
| packetslave wrote:
| that didn't happen (NY Times corrected their story)
| [deleted]
| mzs wrote:
| or not
|
| https://twitter.com/MikeIsaac/status/1445196576956162050
|
| https://twitter.com/cullend/status/1445212476652535815
| nabakin wrote:
| Yesterday's post from Facebook
|
| https://engineering.fb.com/2021/10/04/networking-traffic/out...
| ghostoftiber wrote:
| Incidentally the facebook app itself really handled this
| gracefully. When the app can't connect to facebook, it displays
| "updates" from a pool of cached content. It looks and feels like
| facebook is there, but we know it's not. I didn't notice this
| until the outage and I thought it was neat.
| divbzero wrote:
| I'm curious if the app handled posts or likes gracefully too.
| Did it accept and cache the updates until it could reconnect to
| Facebook servers?
| [deleted]
| colanderman wrote:
| WhatsApp failed similarly, but I thought it was a poor design
| decision to do so. Anyone waiting on communication through
| WhatsApp had no indication (outside the media) that it was
| unavailable, and that they should find another communication
| channel.
|
| Don't paper over connectivity failures. It disempowers users.
| mumblemumble wrote:
| > Our systems are designed to audit commands like these to
| prevent mistakes like this, but a bug in that audit tool didn't
| properly stop the command.
|
| I'm so glad to see that they framed this in terms of a bug in a
| tool designed to prevent human error, rather than simply blaming
| it on human error.
| nuerow wrote:
| > I'm so glad to see that they framed this in terms of a bug in
| a tool designed to prevent human error, rather than simply
| blaming it on human error.
|
| Wouldn't human error reflect extremely poorly on the company
| though? I mean, for human error to be the root cause of this
| mega-outage, that would imply that the company's infrastructure
| and operational and security practices were so ineffective that
| a single person screwing up could inadvertently bring the whole
| company down.
|
| A freak accident that requires all stars to be aligned to even
| be possible, on the other hand, does not cause a lot of
| concerns.
| cecilpl2 wrote:
| Human error is a cop-out excuse anyway, since it's not
| something you can fix going forward. Humans err, and if a
| mistake was made once it could easily be made again.
| tobr wrote:
| The buggy audit tool was probably made by a human too, though.
| jhrmnn wrote:
| But reviewed by other humans. At some count, a collective
| human error becomes a system error.
| cecilpl2 wrote:
| Of course the system was built by humans, but we are
| discussing the _proximate_ cause of the outage.
| vitus wrote:
| Google had a comparable outage several years ago.
|
| https://status.cloud.google.com/incident/cloud-networking/19...
|
| This event left a lot of scar tissue across all of Technical
| Infrastructure, and the next few months were not a fun time (e.g.
| a mandatory training where leadership read out emails from
| customers telling us how we let them down and lost their trust).
|
| I'd be curious to see what systemic changes happen at FB as a
| result, if any.
| AdamHominem wrote:
| > This event left a lot of scar tissue across all of Technical
| Infrastructure, and the next few months were not a fun time
| (e.g. a mandatory training where leadership read out emails
| from customers telling us how we let them down and lost their
| trust).
|
| Bullshit.
|
| I'd believe this if it was not completely impossible for
| 99.999999% of google "customers" to contact anyone at the
| company. Or for the decade and a half of personal and
| professional observations of people getting fucked over by
| google and having absolutely nobody they could contact to try
| and resolve the situation.
|
| You googlers can't even deign to talk to other workers at the
| company who are in a caste lower than you.
|
| The fundamental problem googlers have is that they all think
| they're so smart/good at what they do, it just doesn't seem to
| occur to them that they could have _possibly_ screwed something
| up, or something could go wrong or break, or someone might need
| help in a way your help page authors didn't anticipate...and people
| might need to get ahold of an actual human to say "shit's
| broke, yo." Or worse, none of you give a shit. The company
| certainly doesn't. When you've got near monopoly and have your
| fingers in every single aspect of the internet, you don't need
| to care about fucking your customers over.
|
| I cannot legitimately name a single google product that I, or
| anyone I know, _likes_ or _wants to use_. We just don't have a
| choice because of your market dominance.
| ro_bit wrote:
| > You googlers can't even disdain yourselves to talk to other
| workers at the company who are in a caste lower than you.
|
| We must know different googlers then. It's good to avoid
| painting a group with the same brush
| UncleMeat wrote:
| Hi there. I'm a Googler and I've directly interfaced with a
| nontrivial number of customers such that _I alone_ have
| interfaced with more than 0.000001% of the entire world
| population.
| [deleted]
| lvs wrote:
| All you need to do is browse any online forum, bug tracker,
| subreddit dedicated to a consumer-facing Google product to
| know that Google does not give a rat's ass about customer
| service. We know the customer is ultimately not the
| consumer.
| [deleted]
| colechristensen wrote:
| Maps, Mail, Drive, Scholar, and Search are all the best or
| near the best available. That doesn't mean I like every one of
| them or that I wouldn't prefer alternatives, but as far as I
| can tell, competition that works better just doesn't exist.
|
| GCP and Pixel phones are a toss-up between them and
| competitors.
|
| It isn't market dominance, nobody has made anything better.
| silicon2401 wrote:
| > leadership read out emails from customers telling us how we
| let them down and lost their trust).
|
| That's amazing. I would never have expected my feedback to a
| company to actually be read, let alone taken seriously.
| Hopefully more companies do this than I thought.
| spaetzleesser wrote:
| In my experience this is done more to make leadership feel
| better and to deflect blame from themselves.
| variant wrote:
| I just read your comment out loud if it helps.
| hoppyhoppy2 wrote:
| Facebook also had a nearly 24-hour outage in 2019.
| https://www.nytimes.com/2019/03/14/technology/facebook-whats...
| (or http://archive.today/O7ycB )
| cratermoon wrote:
| > I'd be curious to see what systemic changes happen at FB as a
| result, if any.
|
| If history is any guide, Facebook will decide some division
| charged with preventing problems was an ineffective waste of
| money, shut it down, and fire a bunch of people.
| vitus wrote:
| To expand on why this made me think of the Google outage:
|
| It was a global backbone isolation, caused by configuration
| changes (as they all are...). It was detected fairly early on,
| but recovery was difficult because internal tools / debugging
| workflows were also impacted, and even after the problem was
| identified, it still took time to back out the change.
|
| "But wait, a global backbone isolation? Google wasn't totally
| down," you might say. That's because Google has two (primary)
| backbones (B2 and B4), and only B4 was isolated, so traffic
| spilled over onto B2 (which has much less capacity), causing
| heavy congestion.
| tantalor wrote:
| But the FB outage was _not_ a configuration change.
|
| > a command was issued with the intention to assess the
| availability of global backbone capacity, which
| unintentionally took down all the connections in our backbone
| network
| narrator wrote:
| This is kind of like Chernobyl where they were testing to
| see how hot they could run the reactor to see how much
| power it could generate. Then things went sideways.
| formerly_proven wrote:
| The Chernobyl test was not a test to drive the reactor to
| its limits, but a test to verify that the inertia of the
| main turbines was large enough to drive the coolant pumps
| for X amount of time in the case of grid failure.
| trevorishere wrote:
| Of possible interest:
|
| https://www.youtube.com/watch?v=Ijst4g5KFN0
|
| This is a presentation to students by an MIT professor
| that goes over exactly what happened, the sequence of
| events, mistakes made, and so on.
| formerly_proven wrote:
| Warning for others: I watched the above video and then
| watched the entire course (>30 hours).
| q845712 wrote:
| At first I thought it was inappropriate hyperbole to
| compare Facebook to Chernobyl, but then I realized that I
| think Facebook (along with twitter and other "web 2.0"
| graduates) has spread toxic waste across a far larger area
| than Chernobyl did. But I would still say that it's not
| the _outage_ which is comparable to Chernobyl, but the
| steady-state operations.
| [deleted]
| fabian2k wrote:
| As already said, the test was about something entirely
| different. And the dangerous part was not the test
| itself, but the way they delayed the test and then
| continued to perform it despite the reactor being in a
| problematic state and the night shift, who were not
| trained on this test, being on duty. The main problem
| was that they ran the reactor at reduced power long
| enough to build up significant xenon poisoning, and then
| pushed the reactor to the brink when they tried to
| actually run the test under these unsafe conditions.
| willcipriano wrote:
| I'd say the failure at Chernobyl was that anyone who
| asked questions got sent to a labor camp and the people
| making the decisions really had no clue about the work
| being done. Everything else just stems from that. The
| safest reactor in the world would blow up under the same
| leadership.
| vitus wrote:
| From yesterday's post:
|
| "Our engineering teams have learned that _configuration
| changes_ on the backbone routers that coordinate network
| traffic between our data centers caused issues that
| interrupted this communication.
|
| ...
|
| Our services are now back online and we're actively working
| to fully return them to regular operations. We want to make
| clear that there was no malicious activity behind this
| outage -- its root cause was a faulty _configuration
| change_ on our end."
|
| Ultimately, that faulty command changed router
| configuration globally.
|
| The Google outage was triggered by a configuration change
| due to an automation system gone rogue. But hey, it too was
| triggered by a human issuing a command at some point.
| quietbritishjim wrote:
| I'm inclined to believe the later post as they've had
| more time to assess the details. I think the point of the
| earlier post is really to say "we weren't hacked!" but
| they didn't want to use exactly that language.
| jeffbee wrote:
| Google also had a runaway automation outage where a process
| went around the world "selling" all the frontend machines
| back to the global resource pool. Nobody was alerted until
| something like 95% of global frontends had disappeared.
|
| This was an important lesson for SREs inside and outside
| Google because it shows the dangers of the antipattern of
| command line flags that narrow the scope of an operation
| instead of expanding it. I.e. if your command was supposed to
| be `drain -cell xx` to locally turn down a small resource
| pool but `drain` without any arguments drains the whole
| universe, you have developed a tool which is too dangerous to
| exist.
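|
| A minimal sketch of the safer pattern (the tool name, flags,
| and cells below are hypothetical, not Google's actual
| tooling): the scope is a required argument, and the global
| case is a separate, deliberate flag rather than the default
| you get by omitting arguments.
|
|     import argparse
|     import sys
|
|     def main(argv=None):
|         # Hypothetical "drain" CLI: omitting a scope is an
|         # error, never an implicit "drain everything".
|         parser = argparse.ArgumentParser(prog="drain")
|         scope = parser.add_mutually_exclusive_group(
|             required=True)
|         scope.add_argument(
|             "--cell", action="append",
|             help="drain only this cell (repeatable)")
|         scope.add_argument(
|             "--all-cells", action="store_true",
|             help="drain every cell (requires --reason)")
|         parser.add_argument(
|             "--reason", help="ticket or incident id")
|         args = parser.parse_args(argv)
|
|         if args.all_cells:
|             # A global drain must be explicit and justified.
|             if not args.reason:
|                 parser.error("--all-cells requires --reason")
|             print("draining ALL cells: " + args.reason)
|         else:
|             for cell in args.cell:
|                 print("draining cell " + cell)
|         return 0
|
|     if __name__ == "__main__":
|         sys.exit(main())
|
| With that shape, a bare `drain` fails immediately instead of
| quietly doing the most destructive thing possible.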
| nathanyz wrote:
| I feel like this explains so much about why the gcloud
| command works the way it does. Sometimes feels overly
| complicated for minor things, but given this logic, I get
| it.
| vitus wrote:
| Agreed, but with an amendment:
|
| If your tool is capable of draining the whole universe,
| _period_, it is too dangerous to exist.
|
| That was one of the big takeaways: global config changes
| must happen slowly. (Whether we've fully internalized that
| lesson is a different matter.)
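|
| As a rough illustration of "slowly" (the stage names, soak
| time, and health check below are all made up), a progressive
| rollout widens the blast radius only after the previous
| stage has been verified healthy:
|
|     import time
|
|     # Hypothetical stages, ordered by blast radius.
|     STAGES = [
|         ["canary-cell"],               # one cell
|         ["region-a"],                  # one region
|         ["region-b", "region-c"],      # everything else
|     ]
|
|     def apply_config(target, config):
|         print("applying to " + target)   # stand-in for a push
|
|     def healthy(target):
|         return True                      # stand-in for checks
|
|     def rollout(config, soak_seconds=600):
|         for stage in STAGES:
|             for target in stage:
|                 apply_config(target, config)
|             time.sleep(soak_seconds)     # let problems surface
|             if not all(healthy(t) for t in stage):
|                 raise RuntimeError("unhealthy: " + str(stage))
|
| The point isn't the specific mechanics, it's that nothing can
| jump straight from "one cell" to "the whole universe" in a
| single step.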
| ethbr0 wrote:
| As FB opines at the end, at some point it's a trade-off
| between power (being able to access / do everything
| quickly) and safety (having speed bumps that slow larger
| operations down).
|
| The real takeaway is probably that it's important to
| design systems where "large" operations are rarely
| required, and frequent ops actions are all "small."
|
| Because otherwise, you're asking for an impossible
| process (quick and protected).
| bbarnett wrote:
| _If your tool is capable of draining the whole universe_
|
| Why did I think of humans when I read this? :P
| ratsmack wrote:
| >internal tools / debugging workflows were also impacted
|
| That's something that should never happen.
| agwa wrote:
| > a mandatory training where leadership read out emails from
| customers telling us how we let them down and lost their trust
|
| Is that normal at Google? Making people feel bad for an outage
| doesn't seem consistent with the "blameless postmortem" culture
| promoted in the SRE book[1].
|
| [1] https://sre.google/sre-book/postmortem-culture/
| UncleMeat wrote:
| "Blameless Postmortem" does not mean "No Consequences", even
| if people often want to interpret it that way. If an
| organization determines that a disconnect between ground work
| and a customer's experience is a contributing factor to poor
| decision making then they might conclude that making
| engineers more emotionally invested in their customers could
| be a viable path forward.
| bob1029 wrote:
| Relentless customer service is _never_ going to screw you
| over in my experience... It pains me that we have to
| constantly play these games of abstraction between engineer
| and customer. You are presumably working a _job_ which
| involves some business and some customer. It is not a
| fucking daycare. If any of my customers are pissed about
| their experience, I want to be on the phone with them as
| soon as humanly possible and I want to hear it myself. Yes,
| it is a dreadful experience to get bitched at, but it also
| sharpens your focus like you wouldn't believe when you
| can't just throw a problem to the guy behind you.
|
| By all means, put the support/enhancement requests through
| a separate channel+buffer so everyone can actually get work
| done during the day. But, at no point should an engineer
| ever be allowed to feel like they don't have to answer to
| some customer. If you are terrified a junior dev is going
| to say a naughty phrase to a VIP, then invent an internal
| customer for them to answer to, and diligently proxy the
| end customer's sentiment for the engineer's benefit.
| ethbr0 wrote:
| I think of this in terms of empathy: every engineer
| should be able to provide a quick and accurate answer to
| "What do our customers want? And how do they use our
| product?"
|
| I'm not talking esoterica, but at least a first
| approximation.
| agwa wrote:
| From the SRE book: "For a postmortem to be truly blameless,
| it must focus on identifying the contributing causes of the
| incident without indicting any individual or team for bad
| or inappropriate behavior. A blamelessly written postmortem
| assumes that everyone involved in an incident had good
| intentions and did the right thing with the information
| they had. If a culture of finger pointing and shaming
| individuals or teams for doing the 'wrong' thing prevails,
| people will not bring issues to light for fear of
| punishment."
|
| If it's really the case that engineers are lacking
| information about the impact that outages have on customers
| (which seems rather unlikely), then leadership needs to
| find a way to provide them with that information without
| reading customer emails about how the engineers "let them
| down", which is blameful.
|
| Furthermore, making engineers "emotionally invested"
| doesn't provide concrete guidance on how to make better
| decisions in the future. A blameless postmortem does, but
| you're less likely to get good postmortems if engineers
| fear shaming and punishment, which reading those customer
| emails is a minor form of.
| ddalex wrote:
| Not the original googler responding, but I have never
| experienced what they describe.
|
| Postmortems are always blameless in the sense that "Somebody
| fat fingered it" is not an acceptable explanation for the
| causes of an incident - the possibility to fat finger it in
| the first place must be identified and eliminated.
|
| Opinions are my own, as always
| vitus wrote:
| > Not the original googler responding, but I have never
| experienced what they describe.
|
| I have also never experienced this outside of this single
| instance. It was bizarre, but tried to reinforce the point
| that _something needed to change_ -- it was the latest in a
| string of major customer-facing outages across various
| parts of TI, potentially pointing to cultural issues with
| how we build things.
|
| (And that's not wrong; there are plenty of internal memes
| about the focus on building new systems and rewarding
| complexity, while not emphasizing maintainability.)
|
| Usually mandatory trainings are things like "how to avoid
| being sued" or "how to avoid leaking confidential
| information". Not "you need to follow these rules or else
| all of Cloud burns down; look, we're already hemorrhaging
| customer goodwill."
|
| As I said, there was significant scar tissue associated
| with this event, probably caused in large part by the
| initial reaction by leadership.
| fires10 wrote:
| I don't think Google really cares about listening to their
| users. I have spent more than 6 hours trying to get simple
| warranty issues resolved. I wish they had to feel the pain of
| their actions and decisions.
| lozenge wrote:
| I assume it was training for all SREs, like "this is why
| we're all doing so much to prevent it from reoccurring"
| spaetzleesser wrote:
| " mandatory training where leadership read out emails from
| customers telling us how we let them down and lost their trust
| "
|
| The same leadership that demanded tighter and tighter deadlines
| and discouraged thinking things through?
| grumple wrote:
| The most remarkable thing about this is learning that anyone at
| Google read an email from a customer. Given the automated
| responses to complaints of account shutdowns, or complaints
| about app store rejections, etc, this is pretty surprising.
| lvs wrote:
| I'd love to get a read receipt each time someone at Google
| has actually read my feedback. Then it might be possible to
| determine whether I'm just shaking my fists at the heavens or
| not.
| [deleted]
| Ansil849 wrote:
| > We've done extensive work hardening our systems to prevent
| unauthorized access, and it was interesting to see how that
| hardening slowed us down as we tried to recover from an outage
| caused not by malicious activity, but an error of our own making.
| I believe a tradeoff like this is worth it -- greatly increased
| day-to-day security vs. a slower recovery from a hopefully rare
| event like this. From here on out, our job is to strengthen our
| testing, drills, and overall resilience to make sure events like
| this happen as rarely as possible.
|
| I found this to be an extremely deceptive conclusion. This makes
| it sound like the issue was that Facebook's physical security is
| just too gosh darn good. But the issue was not Facebook's data
| center physical security protocols. The issue was glossed over in
| the middle of the blogpost:
|
| > Our systems are designed to audit commands like these to
| prevent mistakes like this, but a bug in that audit tool didn't
| properly stop the command.
|
| The issue was faulty audit code. It is disingenuous to then
| attempt to spin this like the downtime was due to Facebook's
| amazing physec protocols.
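|
| For what it's worth, the usual fix for that class of bug is
| to make the gate fail closed. A minimal sketch of the idea
| (names invented, not Facebook's actual tooling): if the
| auditor errors out or returns anything other than an explicit
| approval, the command simply does not run.
|
|     # Hypothetical fail-closed audit gate.
|     def audited_run(command, audit, execute):
|         """Run command only if audit explicitly approves."""
|         try:
|             approved = audit(command)
|         except Exception:
|             approved = False    # fail closed, never fail open
|         if approved is not True:
|             raise PermissionError("audit rejected: " + command)
|         return execute(command)
|
| A buggy auditor then blocks commands rather than waving them
| through, which fails safe even if it is more annoying
| day-to-day.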
| cratermoon wrote:
| Simple Testing Can Prevent Most Critical Failures[1], "We found
| the majority of catastrophic failures could easily have been
| prevented by performing simple testing on error handling code -
| the last line of defense - even without an understanding of the
| software design."
|
| 1
| https://www.eecg.utoronto.ca/~yuan/papers/failure_analysis_o...
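|
| The paper's point, concretely, is that the error-handling
| branch is the code most worth exercising. A toy test in that
| spirit (all names invented): force the dependency to fail and
| assert that the caller fails safely instead of carrying on.
|
|     import unittest
|
|     def apply_and_verify(config, apply, verify, rollback):
|         """Apply config; roll back if verification fails."""
|         apply(config)
|         try:
|             ok = verify(config)
|         except Exception:
|             ok = False        # verifier errors count as failure
|         if not ok:
|             rollback()
|             raise RuntimeError("verify failed, rolled back")
|
|     class ErrorPathTest(unittest.TestCase):
|         def test_verifier_crash_triggers_rollback(self):
|             events = []
|             def broken_verify(cfg):
|                 raise IOError("verifier backend unreachable")
|             with self.assertRaises(RuntimeError):
|                 apply_and_verify(
|                     {"route": "x"},
|                     lambda cfg: events.append("apply"),
|                     broken_verify,
|                     lambda: events.append("rollback"))
|             self.assertEqual(events, ["apply", "rollback"])
|
|     if __name__ == "__main__":
|         unittest.main()
|
| It's a trivial test, but it exercises exactly the "last line
| of defense" code the paper is talking about.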
| net4all wrote:
| That article should be required reading for all of us.
| jeffbee wrote:
| Having a separate testing instance of the internet might not
| be practical. How exactly would you test such a change?
| Simulating the effect of router commands is a very daunting
| challenge.
| temp_praneshp wrote:
| No one except you is trying to spin anything.
| Ansil849 wrote:
| Going on at length about how the trade-off between prolonged
| downtime and strict security protocols is worth it is
| erecting a nonsensical strawman, the literal definition of
| spinning a story. The key issue had nothing to do with
| Facebook's data center security protocols.
| temp_praneshp wrote:
| No one except you is erecting a nonsensical strawman.
| elzbardico wrote:
| The FB PR army is at full force today
| Ansil849 wrote:
| Except, I quoted exactly where they were.
| detaro wrote:
| And as they say, it led to _slower recovery_ from the event,
| which was caused, as they also clearly say, by something else.
| Given that "why is it taking so long to revert a config
| change?!!!" was a common comment, that's relevant to the
| discussion.
|
| It's disingenuous to point to a paragraph in the article and
| complain that it doesn't mention the root cause when they
| already said _before that, in the same article_ "This was the
| source of yesterday's outage" about something else.
| educationcto wrote:
| The original error was the network command, but the slower
| response and lengthy outage was partially due to the physical
| security they put in place to prevent malicious activity. Any
| event like this has multiple root causes.
| Ansil849 wrote:
| Yes, but the fact that the blogpost concludes on this
| relatively tangential note (which notably also conveniently
| allows Facebook to brag about their security measures) and
| not on the note that their audit code was apparently itself
| not sufficiently audited, is what makes this deceptive spin.
| closeparen wrote:
| Our postmortems have three sections. Prevention, detection,
| and mitigation. They all matter.
|
| Shit happens. People ship bugs. People fat-finger commands.
| An engineering team's responsibility doesn't stop there. It
| also needs to quickly activate responders who know what to
| do and have the tools & access to fix it. Sometimes the
| conditions that created the issue are within acceptable
| bounds; the real need for reform is in why it took so long
| to fix.
| cranekam wrote:
| I agree that there's an awkward emphasis on how FB
| prioritizes security and privacy but nothing is deceptive
| here. Had the audit bug not subsequently cut off access to
| internal tools and remote regions, it would have been easy
| to revert. Had there not been a global outage, nobody would
| have known that the process for getting access in an
| emergency was too slow.
|
| Huge events like this _always_ have many factors that have
| to line up just right. To insist that the one and only true
| cause was a bug in the auditing system is reductive.
| Ansil849 wrote:
| > I agree that there's an awkward emphasis on how FB
| prioritizes security and privacy but nothing is deceptive
| here.
|
| I guess deceptive was the wrong word, so whatever's the
| term for "awkward emphasis" :).
| adrianmonk wrote:
| No, they just wanted to cover both "what caused it?" and
| "why did it take too long to fix it?" since both are topics
| people were obviously extremely interested in.
|
| It would have been surprising and disappointing if they
| didn't cover both of them.
| tedunangst wrote:
| Seems like appropriate emphasis given how many people
| yesterday were asking why aren't they back online yet. For
| every person asking why they deleted their routes there
| were two people asking why they didn't put them back.
| Jamie9912 wrote:
| What would they have done if the whole data center was destroyed?
| detaro wrote:
| continue working from all other data centers, possibly without
| users really noticing.
| xphos wrote:
| I don't know, the -f in rm -rf isn't a bug xD. I feel sorry for
| the poor engineer who fat-fingered the command. It definitely
| highlights an anti-pattern in the command line, but the fact
| that that single console had the power to affect the entire
| network highlights an "interesting" design choice indeed.
| fakeythrow8way wrote:
| For a lot of people in countries outside the US, Facebook _is_
| the internet. Facebook has cut deals with various ISPs outside
| the US to allow people to use their services without it costing
| any data. Facebook going down is a mild annoyance for us but a
| huge detriment to, say, Latin America.
| tigerlily wrote:
| We want ramenporn
| cesarb wrote:
| Context for those who didn't see it yesterday:
| https://news.ycombinator.com/item?id=28749244
| jasonjei wrote:
| Will somebody lose a job over this?
| mman0114 wrote:
| They shouldn't. If FB has a proper PMA culture in place, you
| figure out why the processes in place failed to prevent this
| kind of change from happening -- more testing, etc. It should
| be a blameless exercise.
| _moof wrote:
| Not if FB has a halfway decent engineering culture. People make
| mistakes. They're practically fundamental to being a person.
| You can _minimize_ mistakes, but any system that requires
| perfect human performance will fail.
___________________________________________________________________
(page generated 2021-10-05 23:01 UTC)