[HN Gopher] Cloudflare outage on June 21, 2022
___________________________________________________________________
Cloudflare outage on June 21, 2022
Author : jgrahamc
Score : 580 points
Date : 2022-06-21 12:39 UTC (10 hours ago)
(HTM) web link (blog.cloudflare.com)
(TXT) w3m dump (blog.cloudflare.com)
| grenbys wrote:
| Would be great if the timeline covered the 19 minutes from
| 06:32 to 06:51. How long did it take to get the right people on
| the call?
| How long did it take to identify deployment as a suspect?
|
| Another massive gap is the rollback: 06:58 - 07:42, 44 minutes!
| What exactly was going on and why did it take so long? What were
| those back-up procedures mentioned briefly? Why were engineers
| stepping on each other's toes? What's the story with reverting
| reverts?
|
| Adding more automation and tests, and fixing that specific
| ordering issue, is of course an improvement. But that adds more
| complexity, and any automation will ultimately fail some day.
|
| Technical details are all appreciated. But it is going to be
| something else next time. Would be great to learn more about
| human interactions. That's where the resilience of a socio-
| technical system comes from, and I bet there is some room for
| improvement there.
| systemvoltage wrote:
| It would be fun to be a fly on the wall when shit hits the fan
| in general. From Nuclear meltdowns to 9/11 ATC recordings, it
| is fascinating to see how emergencies play out and what kind of
| things go on with boots-on-ground, all-hands-on-deck
| situations.
|
| Like, does Cloudflare have an emergency procedure for
| escalation? What does that look like? How does the CTO get
| woken up in the middle of the night? How to get in touch with
| critical and most important engineers? Who noticed Cloudflare
| down first? How do quick decisions get made? Do people get on a
| giant Zoom call? Or are emails going around? What
| if they can't get hold of the most important people that can
| flip switches? Do they have a control room like the movies? CTO
| looking over the shoulder calling "Affirmative, apply the fix."
| followed by a progress bar painfully moving towards completion.
| nijave wrote:
| Sounds like they had engineers connecting to the devices and
| manually rolling back changes. Something like...
|
| Slack: "@here need to connect to <long list of devices> to
| rollback change asap"
| edf13 wrote:
| It's nearly always BGP when this level of failure occurs.
| jgrahamc wrote:
| I dunno man, you can really fuck things up with DNS also.
| ngz00 wrote:
| I was on a severely understaffed edge team fronting several
| thousand engineers at a fortune 500 - every deploy felt like
| a spacex launch from my cubicle. I have a lot of reverence
| for the engineers who take on that kind of responsibility.
| star-glider wrote:
| Generally speaking:
|
| You broke half the internet: BGP.
| You broke half of your company's ability to access the
| internet: DNS.
| sidcool wrote:
| This is a very nice write up.
| testplzignore wrote:
| Are there any steps that can be taken to test these types of
| changes in a non-production environment?
| vnkr wrote:
| It's very difficult, if not impossible, to create a staging
| environment that replicates production well enough at this
| scale. What the blog post suggests as a remediation:
| "There are several opportunities in our automation suite that
| would mitigate some or all of the impact seen from this event.
| Primarily, we will be concentrating on automation improvements
| that enforce an improved stagger policy for rollouts of network
| configuration and provide an automated "commit-confirm"
| rollback. The former enhancement would have significantly
| lessened the overall impact, and the latter would have greatly
| reduced the Time-to-Resolve during the incident."
| malikNF wrote:
| off-topic-ish, this post on /r/ProgrammerHumor gave me a chuckle
|
| https://www.reddit.com/r/ProgrammerHumor/comments/vh9peo/jus...
| jgrahamc wrote:
| That made me smile.
| Cloudef wrote:
| thejosh wrote:
| 07:42: The last of the reverts has been completed. This was
| delayed as network engineers walked over each other's changes,
| reverting the previous reverts, causing the problem to re-appear
| sporadically.
|
| Ouch
| jgrahamc wrote:
| Well, the "we can't reach these data centers at all and need to
| go through the break glass procedure" was pretty "ouch" also.
| gouggoug wrote:
| I'd be super interested in understanding what this means
| concretely. For example, are we talking about reverting
| commits? If so, why were engineers reverting reverts?
| yuliyp wrote:
| Developer 1 fetches code, changes flag A. Rebuilds config.
| Developer 2 fetches code, changes flag B. Rebuilds config.
| Developer 1 deploys built config. Developer 2 deploys built
| config, inadvertently reverts developer 1's changes.
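A minimal sketch of that race, with hypothetical stand-in commands
(set-flag, build-config, deploy) for whatever tooling is actually in
use:

      # both engineers fetch before the other's change has landed
      eng1$ git pull && set-flag A=off && build-config && deploy  # live config: A=off
      eng2$ git pull && set-flag B=off && build-config && deploy  # this build never saw A=off,
                                                                  # so deploying it flips A back on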
| wstuartcl wrote:
| Also can happen when your deploy process has two flows for
| reverting: a forward-movement revert (where new bits are
| committed at head, fixing the items that needed to be
| reverted) and a "previous head" revert, which just goes back
| one revision in the RCS (or to the last tagged version).
|
| Imagine the first eng team did a forward-movement revert that
| corrected the issue and deployed the new head, and shortly
| after another eng fires off the second process type and tells
| the system to pull back to the last revision (which is now the
| bad revision, as it was just replaced with the fresher deploy).
|
| Having two revert processes in the toolkit, and maybe a few
| dispersed teams working to revert the issue without tight
| communication, leads to this kind of issue.
|
| I think this is more likely the underlying issue than a bad
| merge (I assume the root cause was broadcast far and wide to
| anyone making a merge).
| dangrossman wrote:
| I think I experienced first-hand the moment those network
| engineers were reverting their own reverts, breaking the web
| again. For example, DoorDash.com had come back online, then
| went back to serving only HTTP 500 errors from Cloudflare, then
| came back online again. I raised it in the HN discussion and
| @jgrahamc responded minutes later.
|
| https://news.ycombinator.com/item?id=31821290
| michaelmior wrote:
| This was something I was surprised not to see directly
| addressed in terms of follow up steps. When discussing process
| changes, they mention additional testing, but nothing to
| address what seems to be a significant communication gap.
| scottlamb wrote:
| I'm sure they have a more detailed internal postmortem, and I
| imagine it'd go into that. This is a nice high-level
| overview. They probably don't want to bury that under details
| of their communication processes, much less go into exactly
| who did what when for wide consumption by an audience that
| may not be on board with blameless postmortem culture.
| dpz wrote:
| really appreciate the speed, detail and transparency of this
| post-mortem. Really one of the best, if not the best, in the
| industry.
| ElectronShak wrote:
| What's it like to be an engineer designing and working on these
| systems? Must be sooo fulfilling! #Goals; Y'all are my heroes!!
| jgrahamc wrote:
| https://www.cloudflare.com/careers/
| ElectronShak wrote:
| Thanks, unfortunately I live in Africa, no roles yet for my
| location. I'll wait as I use the products :)
| Icathian wrote:
| I'm currently waiting on a recruiter to get my panel
| interviews scheduled. You guys are in "dream gig" territory
| for me. Any tips? ;-)
| CodeWriter23 wrote:
| Gotta hand it to them, a shining example of transparency and
| taking responsibility for mistakes.
| minecraftchest1 wrote:
| Something else that I think would be smart to implement is
| reorder detection. Have the change approval specifically point
| out stuff that gets reordered, and require manual approval for
| each section that gets moved around.
|
| I also think that having a script that walks through the file
| and points out any obvious mistakes would be good to have as
| well.
| dane-pgp wrote:
| Yeah, there's got to be some sweet spot between "formally
| verify all the things" and "i guess this diff looks okay,
| yolo!".
|
| I'd say that if you're designing a system which has the
| potential to disconnect half your customers based on a
| misconfiguration, then you should spend at least an hour
| thinking about what sorts of misconfigurations are possible,
| and how you could prevent or mitigate them.
|
| The cost-benefit analysis of "how likely is it such a mistake
| would get to production (and what would that cost us)?" vs "how
| much effort would it take to write and maintain a verifier that
| prevents this mistake?" should then be fairly easy to estimate
| with sufficient accuracy.
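As a rough illustration of the kind of cheap verifier being discussed
(and of the ordering mistake described elsewhere in this thread, with
announcement terms landing after a reject-everything-else term), a
one-liner along these lines could flag the problem before deployment;
policy.conf and the term names are made up:

      $ awk '/reject-all/ { seen = 1; next }
             seen && /accept/ { print "accept term after reject-all, line " NR ": " $0 }' policy.conf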
| mproud wrote:
| If I use Cloudflare, what can I do -- if anything -- to avoid
| disruption when they go down?
| meibo wrote:
| On the enterprise plans, you are able to set up your own DNS
| server that can route users away from Cloudflare, either to
| your origin or to another CDN/proxy.
| junon wrote:
| Now _this_ is a post mortem.
| weird-eye-issue wrote:
| One of our sites uses Cloudflare and serves 400k pageviews per
| month and generates around $650/day in ad and affiliate revenue.
| If the site is not up the business is not making any money.
|
| Looking at the hourly chart in Google Analytics (compared to the
| previous day) there isn't even a blip during this outage.
|
| So for all the advantages we get from Cloudflare (caching, WAF,
| security [our WP admin is secured with Cloudflare Teams],
| redirects, page rules, etc) I'll take these minor outages that
| make HN go apeshit.
|
| Of course it helped that most of our traffic is from the US and
| this happened when it did, but in the past week alone we served
| over 180 countries, which Cloudflare helps make sure is nice and
| fast :D
| algo_trader wrote:
| Could you possibly, kindly, mention which tools you use to
| track/buy/calculate conversions/revenue?
|
| Many thanks
|
| (Or DM the puppet email in my profile)
| harrydehal wrote:
| Not OP, but my team really, really enjoys using a combination
| of Segment.io for event tracking and piping that data into
| Amplitude for data viz, funnel metrics, A/B tests,
| conversion, etc.
| herpderperator wrote:
| I didn't quite understand this. It sounds like Cloudflare's
| outage didn't affect you despite being their customer. Why did
| their large outage not affect you?
| jgrahamc wrote:
| It wasn't a global outage.
| prophesi wrote:
| I thought it was global? 19 data centers were taken offline
| which "handle a significant proportion of [Cloudflare's]
| global traffic".
| madeofpalk wrote:
| I did not notice Cloudflare going down. Only reason I
| knew was because of this thread. Either it was because I
| was asleep, or my local PoP wasn't affected.
| jgrahamc wrote:
| I am in Lisbon and was not having trouble because
| Cloudflare's Lisbon data center was not affected. But
| over in Madrid there was trouble. It depended on where you
| were.
| mkl wrote:
| From the article: "Depending on your location in the
| world you may have been unable to access websites and
| services that rely on Cloudflare. In other locations,
| Cloudflare continued to operate normally."
| ihaveajob wrote:
| But if your clients are mostly asleep while this is
| happening, they might not notice.
| uhhhhuhyes wrote:
| Because of the time at which the outage occurred, most of
| this person's customers were not trying to access the site.
| markdown wrote:
| Would you mind sharing which site that is?
| keyle wrote:
| How did no one at Cloudflare think that this MCP thing should be
| part of the staging rollout? I imagine that was part of a //
| TODO.
|
| It sounds like it's a key architectural part of the system that
| "[...] convert all of our busiest locations to a more flexible
| and resilient architecture."
|
| 25 years' experience, and it's always the things that are
| supposed to make us "more flexible" and "more resilient" or
| robust/stable/safer <keyword> that end up royally f'ing us where
| the light don't shine.
| kache_ wrote:
| shit dawg i just woke up
| rocky_raccoon wrote:
| Time and time again, this type of response proves that it's the
| right way to handle a bad situation. Be humble, apologize, own your
| mistake, and give a transparent snapshot into what went wrong and
| how you're going to learn from the mistake.
|
| Or you could go the opposite direction and risk turning something
| like this into a PR death spiral.
| can16358p wrote:
| Exactly. I trust businesses/people that are transparent about
| their mistakes/failures much more than the ones that avoid them
| (except Apple which never accepts their mistakes, but I still
| trust their products, I think I'm affected by RDF).
|
| At the end of the day, everybody makes mistakes and that's
| okay. Everybody else also knows that everybody makes mistakes.
| So why not accept it?
|
| I really don't get what's wrong with accepting mistakes,
| learning from them, and moving on.
| coob wrote:
| The exception that proves the rule with Apple:
|
| https://appleinsider.com/articles/12/09/28/apple-ceo-tim-
| coo...
| can16358p wrote:
| Yeah. Forgot that one. When it first came out it was
| terrible.
|
| Apparently so terrible that Apple apologized, perhaps for
| the first (and last) time for something.
| kylehotchkiss wrote:
| They didn't apologize about the direction the pro macs
| were going a few years back but they certainly listened
| and made amends for it with the recent Pro line and
| MacBook Pro enhancements
| dylan604 wrote:
| "Is it Apple Maps bad?" --Gavin Belson, Silicon Valley
|
| This one line will forever cement exactly how bad Apple
| Maps' release was. Thanks Mike Judge!
| stnmtn wrote:
| I agree, but lately (as in the past month) I've been
| finding myself using Apple Maps more and more instead of
| Google. When on a complicated highway interchange, the 3D
| view that Apple Maps gives for which exit to take is a
| life-saver
| can16358p wrote:
| Yup. Just remember the episode. IIRC in that context
| Apple Maps was placed even worse than Windows Vista.
| dylan604 wrote:
| I would agree with that. Apple Maps was worse than the
| hockey puck mouse or the trashcan macpro. trying to
| decide if it is worse than the butterfly keyboard, but I
| think the keyboard wins for the sheer fact that it
| impacted me in a way that was uncorrectable, whereas with
| Maps I could just use a different app
| rocky_raccoon wrote:
| > I really don't get what's wrong with accepting mistakes,
| learning from them, and moving on.
|
| Some people really struggle with this (myself included) but I
| think it's one of the easiest "power ups" you can use in
| business and in life. The key is that you have to actually
| follow through on the "learning from them" clause.
| dylan604 wrote:
| Sure, this can be a good thing when it's a rare occurrence.
| If it is a weekly event, then you just start to look
| incompetent
| gcau wrote:
| Am I the only one who really doesn't think this is a big deal?
| They had an outage, they fixed it very quickly. Life goes on.
| Talking about the outage as if it's a reason for us all to ditch
| CF and buy/run our own hardware (which will be totally better)
| is so hyperbolic.
| Deukhoofd wrote:
| It was a bit of a thing as people in Europe started their
| office work, and found out a lot of their internet services
| were down, and they were unable to access the things they
| needed. It's rather dangerous that we all depend on this one
| service being online.
| nielsole wrote:
| > Talking about the outage as if it's reason for us to all
| ditch CF
|
| at time of writing no comment has done that except you.
| gcau wrote:
| I'm referring to other posts and discussions outside this
| website. I don't expect as much criticism in this post.
| simiones wrote:
| It is kind of a big deal to discover just how much of the
| Internet and the WWW is now dependent on CloudFlare.
|
| For their part, they handled this very well, and are to be
| commended (quick fix, quick explanation of failure).
|
| But you also can't help but see that they have a dangerous
| amount of control over such important systems.
| leetrout wrote:
| BGP changes should be like the display resolution changes on your
| PC...
|
| It should revert as a failsafe if not confirmed within X minutes.
| addingnumbers wrote:
| That's the "commit-confirm" process they mention they will use
| in the write-up:
|
| > Primarily, we will be concentrating on automation
| improvements ... and provide an automated "commit-confirm"
| rollback.
| Melatonic wrote:
| Surprised everyone has not switched to this already - great
| idea
| neuronexmachina wrote:
| I assume there are some non-trivial caveats when using this
| with a widely-distributed system.
| vnkr wrote:
| That's what is suggested in the blog post as one of the future
| prevention plans.
| antod wrote:
| There was a common pattern in use back in the day when I
| managed OpenBSD firewalls (can't remember if it was ipf or pf
| days). When changing firewall rules over ssh, you'd use a
| command line like:
|
| $ apply new rules; sleep 10; apply original rules
|
| If your ssh access was still working and various sites were
| still up during that 10sec you were probably good to go - or at
| least you hadn't shut yourself out.
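A concrete version of the same pattern with pf might look roughly
like this (file names are made up; pfctl -f loads a ruleset from a
file):

      $ sudo sh -c 'pfctl -f /etc/pf.conf.new; sleep 10; pfctl -f /etc/pf.conf.known-good'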
| DRW_ wrote:
| Back when I was briefly a network engineer at the start of my
| career, on cisco equipment we'd do 'reload in 5' before big
| changes - so it'd auto restart after 5 minutes unless
| cancelled.
|
| I'm sure there were and are better ways of doing it, but it was
| simple enough and worked for us.
| kazen44 wrote:
| Most ISP-tier routers have an entire commit engine to load
| and apply configs.
|
| Juniper, for instance, allows one to run the command "commit
| confirmed", which will apply the configuration and revert
| back to the previous version if one does not acknowledge the
| commit within a predefined time. This prevents permanent
| lockout from the system.
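On Junos that looks roughly like this; if no follow-up commit arrives
within the timer, the router rolls the change back on its own:

      [edit]
      user@router# commit confirmed 10   # apply now, auto-rollback in 10 minutes
      user@router# commit                # run after verifying reachability to make it permanent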
| ransom1538 wrote:
| Still seeing failed network calls.
|
| https://i.imgur.com/xHqvOzj.png
| jgrahamc wrote:
| Feel free to email me (jgc) details but based on that error I
| don't think that's us.
| ransom1538 wrote:
| One more? I'll email too. https://i.imgur.com/Cxwv58g.png
| zinekeller wrote:
| Yeah, that's not Cloudflare at all (it's unlikely that CF still
| uses nginx/1.14).
| Jamie9912 wrote:
| Is that actually coming from Cloudflare? IIRC Cloudflare
| reports itself as Cloudflare, not nginx, in the 5xx error pages
| lenn0x wrote:
| Correct, I saw that too. The outage returned 500/nginx, with no
| version number on the footer either. @jgrahamc thought that was
| strange too, as a few commenters last night were caught off
| guard trying to determine if it was their systems or
| Cloudflare. Supposedly it's been forwarded along.
| kc10 wrote:
| yes, there is definitely an nginx service in the path. We
| don't have any nginx in our infrastructure, but this was
| the response we had for our urls during the outage.
|
| <html>
| <head><title>500 Internal Server Error</title></head>
| <body bgcolor="white">
| <center><h1>500 Internal Server Error</h1></center>
| <hr><center>nginx</center>
| </body>
| </html>
| samwillis wrote:
| The outage this morning manifested itself as an nginx error
| page, somewhat unusually for CF.
| Belphemur wrote:
| It's interesting that in 2022 we still have network issues
| caused by rules being in the wrong order.
|
| Everybody at one time experiences the dreaded REJECT that isn't
| at the end of the rule stack but comes just a bit too early.
|
| Kudos to CF for such a good explanation of what caused the issue.
| ec109685 wrote:
| I wonder what tool the engineers used to view that diff. With a
| side by side one, it's a bit more obvious when lines are
| reordered.
|
| Even better if the tool was syntax aware so it could highlight
| the different types of rules in unique colors.
| xiwenc wrote:
| I'm surprised they did not conclude rollouts should be executed
| over a longer period with smaller batches. When a system is as
| complicated as theirs, with so much impact, the only sane
| strategy is slow rolling updates so that you can hit the brake
| when needed.
| jgrahamc wrote:
| That's literally one of the conclusions.
| ttul wrote:
| Every outage represents an opportunity to demonstrate resilience
| and ingenuity. Outages are guaranteed to happen. Might as well
| make the most of it to reveal something cool about their
| infrastructure.
| asadlionpk wrote:
| Been a fan of CF since they were essential for DDoS protection
| for various WordPress sites I deployed back then.
|
| I buy more NET every time I see posts like this.
| malfist wrote:
| Hackernews isn't wallstreetbets.
| nerdbaggy wrote:
| Really interesting that 19 cities handle 50% of the requests.
| JCharante wrote:
| Well, half of those cities are in Asia, where it was business
| hours, so given that the majority of humans live in Asia it
| makes sense. CF data centers in Asia also seem to be less distributed
| than in the West (e.g. Vietnam traffic seems to go to
| Singapore) meanwhile CF has multiple centers distributed
| throughout the US.
| jgrahamc wrote:
| Actually, I think the flip side is even more interesting. If
| you want to give good, low latency service to 50% of the world
| you need a lot of data centers.
| worldofmatthew wrote:
| If you have an efficient website, you can get decent
| performance to most of the world with one PoP on the West
| coast of the USA.
| rubatuga wrote:
| Uh, shouldn't there be a staging environment for this sort of
| change?
| alanning wrote:
| Yes, that was one of the issues they mentioned in the post. Not
| that they didn't have a staging/testing environment but that it
| didn't include the specific type of new architecture
| configuration, "MCP", that ultimately failed.
|
| One of their future changes is to include MCPs in their testing
| environments.
| nijave wrote:
| Ahh the old "dev doesn't quite match prod" issue
| ggalihpp wrote:
| The DNS resolver was also impacted and seems to still have
| issues. We changed to Google DNS and that solved it.
|
| The problem is, we couldn't tell all our clients they should
| change this :(
| jiggawatts wrote:
| The default way that most networking devices are managed is crazy
| in this day and age.
|
| Like the post-mortem says, they will put mitigations in place,
| but this is something every network admin has to implement
| bespoke after learning the hard way that the default management
| approach is dangerous.
|
| I've personally watched admins make routing changes where any
| error would cut them off from the device they are managing and
| prevent them from rolling it back -- pretty much what happened
| here.
|
| What should be the _default_ on every networking device is a two-
| stage commit where the second stage requires a new TCP
| connection.
|
| Many devices still rely on "not saving" the configuration, with a
| power cycle as the rollback to the previous saved state. This is
| a great way to turn a small outage into a big one.
|
| This style of device management may have been okay for small
| office routers where you can just walk into the "server closet"
| to flip the switch. It was okay in the era when device firmware
| was measured in kilobytes and boot times in single digit seconds.
|
| Globally distributed backbone routers are an entirely different
| scenario but the manufacturers use the same outdated management
| concepts!
|
| (I have seen some _small_ improvements in this space, such as
| devices now keeping a history of config files by default instead
| of a single current-state file only.)
| inferiorhuman wrote:
| The power cycle as a rollback is IMO reasonable. If you're
| talking about equipment in a data center you should presumably
| have some sort of remote power management on a separate
| network.
|
| Alternatively some sort of watchdog timer would be a great
| addition (e.g. rollback within X minutes if the changes are not
| confirmed).
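A crude version of that watchdog can be built from nothing more than
at(1); restore-known-good-config here is a hypothetical stand-in for
whatever actually reapplies the last good configuration:

      $ echo 'restore-known-good-config' | at now + 5 minutes  # schedule the rollback first
      $ # ...apply the risky change, check the device is still reachable...
      $ atq                                                    # note the pending job number
      $ atrm 7                                                 # cancel the rollback once confirmed good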
| throwaway_uke wrote:
| I'm gonna go with the less popular view here that overly
| detailed post mortems do little in the grand scheme of things
| other than satisfy tech p0rn for a tiny, highly technical
| audience. Does wonders for hiring, indeed.
|
| Sure, transparency is better than "something went wrong, we take
| this very seriously, sorry." (although the non-technical crowd
| couldn't care less)
|
| Only people who don't do anything make no mistakes, but doing
| such highly impactful changes so quickly (inside one day!) where
| 50% of traffic happens seems a huge red flag to me, no matter
| the procedure and safety valves.
| kylegalbraith wrote:
| As others have said, this is a clear and concise write up of the
| incident. That is underlined even more when you take into account
| how quickly they published this. I have seen some companies take
| weeks or even months to publish an analysis that is half as good
| as this.
|
| Not trying to take the spotlight off the outage; the outage was
| bad. But the relative quickness of the recovery is pretty impressive,
| in my opinion. Sounds like they could have recovered even quicker
| if not for a bit of toe stepping that happened.
| april_22 wrote:
| I think it's even better that they explained the background of
| the outage in a really easy-to-understand way, so that not only
| experts can get a grasp of what was happening.
| sharps_xp wrote:
| who will make the abstraction as a service we all need to protect
| us from config changes
| pigtailgirl wrote:
| -- how much are you willing to pay for said system? --
| sharps_xp wrote:
| depends on how guaranteed your solution is?
| richardwhiuk wrote:
| 100%. You can never roll out any changes.
| ruined wrote:
| happy solstice everyone
| psim1 wrote:
| CF is the only company I have ever seen that can have an outage
| and get pages of praise for it. I don't have any (current) use
| for CloudFlare's products but I would love to see the culture
| that makes them praiseworthy spread to other companies.
| homero wrote:
| I'm also a huge fan
| capableweb wrote:
| I think a lot of companies don't realize the whole
| "Acknowledging our problems in public" thing CF got going for
| it is a positive. Lots of companies don't want to publish
| public post-mortems as they think it'll make them look weak
| rather than showing that they care about transparency in the
| face of failures/downtimes.
| [deleted]
| systemvoltage wrote:
| Nerds in the executive office (CEO & CTO, etc). People just
| like us.
| badrabbit wrote:
| Having been on the other side of similar outages, I am very
| impressed with their response timeline.
| lilyball wrote:
| They said they ran a dry-run. What did that do, just generate
| these diffs? I would have expected them to have some way of
| _simulating_ the network for BGP changes in order to verify that
| they didn't just fuck up their traffic.
| kurtextrem wrote:
| Yet another BGP caused outage. At some point we should collect
| all of them:
|
| - Cloudflare 2022 (this one)
|
| - Facebook 2021: https://news.ycombinator.com/item?id=28752131 -
| this one probably had the single biggest impact, since engineers
| got locked out of their systems, which made the fixing part look
| like a sci-fi movie
|
| - (Indirectly caused by BGP: Cloudflare 2020:
| https://blog.cloudflare.com/cloudflare-outage-on-july-17-202...)
|
| - Google Cloud 2020:
| https://www.theregister.com/2020/12/16/google_europe_outage/
|
| - IBM Cloud 2020:
| https://www.bleepingcomputer.com/news/technology/ibm-cloud-g...
|
| - Cloudflare 2019: https://news.ycombinator.com/item?id=20262214
|
| - Amazon 2018:
| https://www.techtarget.com/searchsecurity/news/252439945/BGP...
|
| - AWS: https://www.thousandeyes.com/blog/route-leak-causes-
| amazon-a... (2015)
|
| - Youtube: https://www.infoworld.com/article/2648947/youtube-
| outage-und... (2008)
|
| And then there are incidents caused by hijacking:
| https://en.wikipedia.org/wiki/BGP_hijacking#:~:text=end%20us...
| eddieroger wrote:
| These are the public-facing BGP announcements that cause
| problems, but this doesn't account for the ones on private LANs
| that also happen. Previous employers of mine have had significant
| internal network issues because internal BGP between sites
| started causing problems. I'm not sure there's anything better
| (I am not a network guy), but this list can't be exhaustive.
| simiones wrote:
| Google in Japan 2017:
| https://www.internetsociety.org/blog/2017/08/google-leaked-p...
| jve wrote:
| Came here to say exactly this... things that mess with BGP have
| the power to wipe you off the internet.
|
| Some more:
|
| - Google 2016, configuration management bug/BGP:
| https://status.cloud.google.com/incident/compute/16007
|
| - Valve 2015: https://www.thousandeyes.com/blog/steam-outage-
| monitor-data-...
|
| - Cloudflare 2013: https://blog.cloudflare.com/todays-outage-
| post-mortem-82515/
| ssms27 wrote:
| The internet runs on BGP, I would think that most internet
| issues would be a result of BGP then.
| perlgeek wrote:
| There are lots of other causes of incidents, like cut cables,
| failed router hardware, data centers losing power etc.
|
| It just seems that most of these are local enough and the
| Internet resilient enough that they don't cause global
| issues. Maybe the exception would be AWS us-east-1 outages
| :-)
| tedunangst wrote:
| BGP is the reason you _don't_ hear about cable cuts taking
| down the internet.
| addingnumbers wrote:
| Maybe a testament to BGP's effectiveness that so many
| large-scale outages are due to misconfiguring BGP rather
| than the frequent cable cuts and hardware failures that BGP
| routes around.
| mlyle wrote:
| > since engineers got locked out of their systems
|
| Sounds like the same happened here:
|
| "Due to this withdrawal, Cloudflare engineers experienced added
| difficulty in reaching the affected locations to revert the
| problematic change. We have backup procedures for handling such
| an event and used them to take control of the affected
| locations."
|
| But Cloudflare had sufficient backup connectivity to fix it.
| I'm curious how Cloudflare does that today-- the solution long
| ago was always a modem on an auxiliary port.
| jve wrote:
| > the solution long ago was always a modem on an auxiliary
| port
|
| Now you can use mobile Internet (4G/5G)
| ccakes wrote:
| Cell coverage inside datacenters isn't always suitable,
| occasionally even by-design.
| cwt137 wrote:
| They have their machines also connected to another AS, so
| when their network doesn't/can't route, they can still get to
| their machines to fix stuff.
| Melatonic wrote:
| Worst case if I was designing this I would probably have a
| satellite connection running over Iridium at each of their
| biggest DC's
|
| Also let's face it - the utility of a trusted security
| guard/staff with an old fashioned physical key is pretty hard
| to screw up!
| nijave wrote:
| Not sure how common it is, but you can get serial OOBM
| devices accessible over cellular which would then give you
| access to your equipment.
|
| I'm surprised more places don't implement a "click here to
| confirm changes or it'll be rolled back in 5 minutes" like
| all those monitor settings dialogues
| merlyn wrote:
| That's like blaming the hammer for breaking.
|
| BGP is just a tool; it would be something else serving the same
| purpose.
| forrestthewoods wrote:
| Some tools are more fragile and error prone than others.
| techsupporter wrote:
| Except that this wasn't an example of BGP being prone to
| error or fragile. This was, as the blog post specifically
| calls out, human error. They put two BGP announcement rules
| after the "deny everything not previously allowed" rule.
| It's the same as if someone did this to a set of ACLs on a
| firewall.
|
| The main difference between BGP and all other tools is that
| if you mess up BGP, you've done a very visible thing
| because BGP underpins how we get to each other's networks.
| But it's not a sign of BGP being fragile, just very
| important.
| witcher_rat wrote:
| You say that like it hasn't been going on since the mid 1990's,
| when it got deployed.
|
| I'm not blaming BGP, since it prevents far more outages than it
| causes, but BGP-based outages have been a thing since its
| beginning. And any other protocol would have outages too - BGP
| just happens to be the protocol being used.
| Tsiklon wrote:
| This is a great concise explanation. Thank you for providing it
| so quickly
|
| If you forgive my prying, was this an implementation issue with
| the maintenance plan (operator or tooling error), a fundamental
| issue with the soundness of the plan as it stood, or an
| unexpected outcome from how the validated and prepared changes
| interacted with the system?
|
| I imagine that an outage of this scope wasn't foreseen in the
| development of the maintenance & rollback plan of the work.
| xtat wrote:
| Feels a little disingenuous to use the first 3/4 of the report to
| advertise.
| DustinBrett wrote:
| I wish computers could stop us from making these kinds of
| mistakes without turning into Skynet.
| terom wrote:
| TODO: use commit-confirm for automated rollbacks
|
| Sounds like a good idea!
| buggeryorkshire wrote:
| Is that the equivalent of Cisco 'save running config' with a
| timer? It's been many years so can't remember the exact
| incantations...
| trollied wrote:
| Sounds like Cloudflare need a small low-traffic MCP that they can
| deploy to first.
| samwillis wrote:
| In a world where it can take weeks for other companies to publish
| a postmortem after an outage (if they ever do), it never ceases
| to amaze me how quickly CF manage to get something like this out.
|
| I think it's a testament to their Ops/Incident response teams and
| internal processes, it builds confidence in their ability to
| respond quickly when something does go wrong. Incredible work!
| thejosh wrote:
| Yep, look at heroku and their big incident, and the amount of
| downtime they've had lately.
| mattferderer wrote:
| To further add to your point, the CTO is the one who shared it
| here & the CEO is incredibly active on forums & social media
| everywhere with customers. Communication has always been one of
| their strengths.
| bluehatbrit wrote:
| I do wonder what would happen if either of them left the
| company. I feel like there's a lot of trust on HN
| (and other places) that's heavily attached to them as
| individuals and their track record of good communication.
| unityByFreedom wrote:
| Good communicators generally foster that environment. And
| their customers appreciate it, so there is an external
| expectation now too. Everything ends some day, but I think
| this will be regarded as a valuable attribute for a while.
| jgrahamc wrote:
| This is deeply, deeply embedded in Cloudflare culture.
| unityByFreedom wrote:
| Devil's advocate, you could get taken over or end up with
| a different board. I wouldn't like to see it but
| someone's got to compete with you or we'll have to send
| in the FTC! :)
| bombcar wrote:
| It could be good or bad; I suspect they've thought about it
| and have worked on succession (I hope!) and have like-
| minded people in the wings.
|
| But once it happens things will change and, to be honest,
| likely for the worse.
|
| edit> fix typo
| unityByFreedom wrote:
| > secession
|
| Succession?
| bombcar wrote:
| Eep yes, auto spell check on macOS is usually good but
| sometimes it causes a civil war.
| formerkrogemp wrote:
| To contrast this with the Atlassian outage recently is night
| and day.
| agilob wrote:
| I'd love to see the postmortem from Facebook :(
| Melatonic wrote:
| To be fair though they sort of MUST do things like this to have
| our confidence - their whole business is about being FAST and
| AVAILABLE. We're not talking about Oracle here :-D
| viraptor wrote:
| I feel like others lose opportunities by not doing the same. By
| publishing early and publishing the details they: keep the
| company in the news with positive stuff (free ad), get an
| internal documentation of the incident (ignoring the customer
| oriented "we're sorry" part), effectively get a free
| recruitment post (you're reading this because you're in tech
| and we do cool stuff, wink), release some internal architecture
| info that people will reference in discussions later. At a
| certain size it feels stupid not to post them publicly. I
| wonder how much those posts are calculated and how much
| organic/culture related.
| zaidf wrote:
| >I feel like others lose opportunities by not doing the same
|
| IMO it is a slippery slope to see this as _opportunity_ too
| strongly. Sure, doing the right thing may be net beneficial
| to the business in the long run...but the $RIGHT_THING should
| be done first and foremost because it's the right thing.
| ethbr0 wrote:
| I believe Marcus Aurelius had something similar to say on
| the matter. :-)
| jjtheblunt wrote:
| quodcumque erat ?
| hunter2_ wrote:
| > I wonder how much those posts are calculated and how much
| organic/culture related.
|
| Don't companies have a fiduciary duty to calculate things, such
| that the reason for doing something can't just be that it's a
| nice thing to do? Not down to the word, but at least the general
| decision to be this way?
| rrss wrote:
| No.
|
| https://scholarship.law.cornell.edu/cgi/viewcontent.cgi?htt
| p...
| viraptor wrote:
| No, they don't have such a duty. In practice, very little
| decision making is based on hard data, in my experience.
| The real world being fuzzy and risk being hard to quantify do
| not help the situation.
| Uehreka wrote:
| Ehhhh... I think it's good (for us) that they do this, but I
| don't think it's a free ad (contrary to popular belief, not
| all news is good news, and this is bad news) and any sort of
| conversion rate on recruitment is probably vanishingly small
| (which would normally be fine, but incidents like these may
| turn off some actual customers, which is where actual revenue
| comes from).
|
| I think their calculation (to the extent you can call it
| that) is that in the interest of PR and damage control, it's
| better to get a thorough postmortem out quickly to stem the
| bleeding and keep people like us from going "I can't wait to
| hear what happened at Cloudflare" for a week. Now we know,
| the customers have an explanation, and this bad news cycle
| has a higher chance of ending quickly.
| smugma wrote:
| I agree that this is a free ad/recruitment. However, it's
| easy to see how more conservative businesses see this as a
| risk. They are highlighting their deficiencies, letting their
| big important clients know that human error can bring their
| network down.
|
| Additionally, these post-mortems work for Cloudflare because
| they have a great reputation and good uptime. If this were
| happening daily or weekly, it _would_ be a warning sign to
| customers.
|
| It's a strategy other companies could adopt, but to do it
| effectively requires changes all across the organization.
| saghm wrote:
| OTOH, I think most actual engineers would know that
| everywhere has deficiencies and can be brought down by
| human error, and I'd personally rather use a product where
| the people running it admit this rather than just claim
| that their genius engineers made it 100% foolproof and
| nothing could ever possibly go wrong
| ethbr0 wrote:
| Absolutely. The first step of good SRE is admitting
| (publicly and within the organization) that you have a
| problem.
| solardev wrote:
| No provider is perfect, but it's because of stuff like this
| that I trust Cloudflare waaaaaaaaaaay more than the likes of
| Amazon. Transparency engenders trust, and eventually, love!
| Thank you, Cloudflare.
|
| The sheer level of technical competence of your engineering
| team continues to astound me. (Yes, they made a mistake and
| didn't catch an error in the diff. But your response process
| went exactly as it should, and your postmortem is excellent.) I
| couldn't even _begin_ to think about designing or implementing
| something of this complexity, much less being able to explain
| it to a layperson after a failure. It is really impressive, and
| I hope you will continue to do so into the future!
|
| Most of the companies I've worked for unfortunately don't use
| your services, but I've always been a staunch advocate and
| converted a few. Maybe the higher-ups only see downtime and
| name recognition (i.e. you're not Amazon), but for what it's
| worth, us devs down the ladder definitely notice your
| transparency and communications, and it means the world. I've
| learned to structure my own postmortems after yours, and it's
| really aided in internal communications.
|
| Thank you again. I can't wait for the day I get to work in a
| fully-Cloudflare stack :)
| nijave wrote:
| AWS is pretty decent if you're in an NDA contract (you have
| paid support). You can request RCAs for any incident you were
| impacted by, and they'll usually get them to you within a day.
|
| Not as transparent as "post it on the internet" but at least
| better than the usual hand wavey bullshit
| kevin_nisbet wrote:
| I agree, I think the transparency builds trust and I encourage
| it where I can. The counter thought I had when reading this
| case though, is that it almost feels too fast. What I mean by that
| is I hope there isn't an incentive to wrap up the internal
| investigation quickly and write the blog and send it, and go
| we're done.
|
| Doing incident response (both outage and security), the
| tactical fixes for a specific problem are usually pretty easy.
| We can fix a bug, or change this specific plan to avoid the
| problem. The search for conditions that allowed the incident to
| occur can be a lot more time-consuming, and most organizations
| I've worked for are happy to make a couple tactical changes and
| move on.
| cowsandmilk wrote:
| I have to agree. The environment that leads to a fast blog
| post may also lead to this quote from the post:
|
| > This was delayed as network engineers walked over each
| other's changes, reverting the previous reverts, causing the
| problem to re-appear sporadically.
|
| They are running as fast as they can and this extended the
| incident. There is a "slow is smooth, smooth is fast" lesson
| in here. I'd rather have a team that takes a day to put up
| the blog post, but doesn't unnecessarily extend downtime
| because they are sprinting.
| jgrahamc wrote:
| There's normal operating procedure and sign offs and
| automation etc. etc. and then there's "we've lost contact
| with these data centers and normal procedures don't work we
| need to break glass and use the secondary channels". In
| that situation you are in an emergency without normal
| visibility.
| bombcar wrote:
| It can be easy to arm-chair it afterwards, but unless
| things can be done in parallel (and systems should be
| designed so this can be done, things like "we're not sure
| what's wrong, we're bringing up a new cluster on the last
| known good version even as we try to repair this one") you
| have to make a choice, and sometimes it won't be optimal.
| jgrahamc wrote:
| _What I mean by that is I hope there isn't an incentive to
| wrap up the internal investigation quickly and write the blog
| and send it, and go we're done._
|
| There is not. From here there's an ongoing process with a
| formal post-mortem, all sorts of tickets tracking work to
| prevent further reoccurrence. This post is just the beginning
| internally.
| kortilla wrote:
| Well cloudflare's entire value is in uptime and preventing
| outages. Showing they have a rapid response and strong
| fundamental technical understanding is much more critical in
| the "prevent downtime" business.
| tyingq wrote:
| >take weeks for other companies to publish a postmortem
|
| And with nowhere near the detail level of what was presented
| here. Typically lots of sweeping generalizations that don't
| tell you much about what happened, or give you any confidence
| they really know what happened or have the right fix in place.
| sschueller wrote:
| Nodejs is still having issues. For example:
| https://nodejs.org/dist/v16.15.1/node-v16.15.1-darwin-x64.ta...
| doesn't download if you do "n lts"
| thomashabets2 wrote:
| tl;dr: Another BGP outage due to bad config changes.
|
| Here's a somewhat old (2016) but very impressive system at a
| major ISP for avoiding exactly this:
| https://www.youtube.com/watch?v=R_vCdGkGeSk
| johnklos wrote:
| ...and yet they still push so hard for recentralization of the
| web...
| goodpoint wrote:
| As if the centralization of email, social networks, VPSes, and
| SaaS was not bad enough.
|
| It's pretty appalling that you are even being downvoted.
| samwillis wrote:
| CloudFlare are a hosting provider and CDN, they aren't
| "push[ing] ... hard for recentralization of the web".
|
| If it was AWS, Akamai, Google Cloud, or any of the other
| massive providers this comment wouldn't be made. I don't really
| understand the association between centralisation and
| CloudFlare, other than it being a Meme.
| viraptor wrote:
| It's often mentioned about AWS, especially when us-east-1
| fails. The others are not big enough to affect basically "the
| internet" when they go down, so don't get pointed out as
| centralisation issues as much.
|
| And yeah, cf is trying to get as much traffic to go through
| them as possible and add edge services for more opportunities
| - that's literally their business. Also now r2 with object
| storage. They're already too big, harmful (as in actually
| putting people in danger) and untouchable in some ways.
| johnklos wrote:
| I think you've already drunk the Flavor Aid.
|
| What do you have when you have all DNS going through them,
| via DoH, and all web requests going through them, if not
| recentralization?
|
| Sure, they want us to think they give us the freedom to host
| our web sites anywhere because they're "protected" by them,
| but that "protection" means we've agreed to recentralize.
|
| It's pretty dismissive to describe something as a meme just
| because you don't understand it, and either you're pretending
| to not understand it, or you truly don't.
|
| Look at it this way: If a single company goes down for an
| hour, and that company going down for an hour causes half the
| web traffic on the Internet to fail for that hour, what is
| that if not recentralization?
| samwillis wrote:
| I understand that for their WAF, DDOS and threat detection
| products they need to have a very large amount of traffic
| going through them. They have been very aggressive with
| their free service to achieve that, to the benefit of all
| their customers (including the free ones). Some could see
| that as a push toward centralisation; I don't.
|
| What I don't understand, or believe, is that they want to
| be the sole (as in centralised) network for the internet. I
| don't believe they as a company, or the people running it,
| want that. They obviously have ambition to be one of the
| largest networking/cloud providers, and are achieving that.
|
| I don't intend either to dismiss your concerns (which are a
| legitimate thing to have, centralisation would be very
| bad), my suggestion with the meme comment is that there is
| at times a trend to "brigade" on large successful companies
| in a meme-like way. That isn't to suggest you were.
| johnklos wrote:
| They want to be a monopoly. They want everyone to depend
| on them. They may not want recentralization in general,
| but they definitely want as much of the Internet to
| depend on them as possible.
| philipwhiuk wrote:
| Is there no system to unit test a rule-set?
| thesuitonym wrote:
| Where does one even start with learning BGP? It always seemed
| super interesting to me, but not really something that could be
| dealt with on a small scale, lab type basis. Or am I wrong there?
| _whiteCaps_ wrote:
| https://github.com/Exa-Networks/exabgp
|
| They've got some Docker examples in the README.
| ThaDood wrote:
| DN42 <https://dn42.eu/Home> gets mentioned a lot. It's basically
| a big dynamic VPN that you can do BGP stuff with. Pretty cool
| but I could never get my node working properly.
| bpye wrote:
| I started setting that up and totally forgot, maybe I should
| actually try and peer with someone.
| jamal-kumar wrote:
| Nah, Cisco has labs you can download and learn from for their
| networking certifications, which are kinda the standard.
|
| Networking talent is kind of hard to find and if you learn that
| your chances of employment get pretty high.
| nonameiguess wrote:
| You can learn BGP with mininet: https://mininet.org/
|
| You can simulate arbitrarily large networks and internetworks
| with this, provided you have the hardware to run a large enough
| number of virtual appliances, but they are pretty lightweight.
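For a quick start, the stock mn command line will stand up a small
emulated topology and run a connectivity test; for actual BGP you
would then run routing daemons such as FRR or Quagga on the emulated
nodes:

      $ sudo mn --topo linear,3 --test pingall   # three switches, one host each, then ping all pairs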
| Icathian wrote:
| Mininet is what the Georgia Tech OMSCS Computer Networking
| labs use. It's not bad, the two labs that stood out to me
| were using it to implement BGP and a Distance Vector Routing
| protocol.
___________________________________________________________________
(page generated 2022-06-21 23:00 UTC)