[HN Gopher] Understanding how Facebook disappeared from the inte...
___________________________________________________________________
Understanding how Facebook disappeared from the internet
Author : jgrahamc
Score : 299 points
Date : 2021-10-04 21:11 UTC (1 hours ago)
(HTM) web link (blog.cloudflare.com)
(TXT) w3m dump (blog.cloudflare.com)
| cwilkes wrote:
| Updating BGP configs should go through a flowchart like this:
|
| Do you want to update BGP?
|
| No: exit
|
| Yes: type this random 100-character phrase to continue, no
| copy-paste
| toomuchtodo wrote:
| "Are you sure? Are the people you need to recover from this change
| already in the building?"
| _joel wrote:
| Maybe a timed rollback with the previous state stored on the
| device that needs to be rolled back, although if you're doing
| this at Facebook scale I'm sure that's a little more
| difficult than it sounds.
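| A minimal sketch of the timed-rollback idea, similar in spirit to
| the "commit confirmed" feature on some router operating systems.
| The class and method names are hypothetical, and a real device
| would persist the previous state rather than hold it in memory.

```python
import threading


class TimedRollbackConfig:
    """Apply a config change that reverts itself unless confirmed in time."""

    def __init__(self, initial):
        self.active = initial
        self._previous = None
        self._timer = None

    def apply(self, new_config, timeout_s):
        # Keep the previous state locally so the rollback needs no network access.
        self._previous = self.active
        self.active = new_config
        self._timer = threading.Timer(timeout_s, self._rollback)
        self._timer.daemon = True
        self._timer.start()

    def confirm(self):
        # The operator could still reach the device, so keep the change.
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        self._previous = None

    def _rollback(self):
        # Nobody confirmed in time: restore the locally stored previous state.
        self.active = self._previous
        self._timer = None
```

| If the change cuts the operator off, confirm() is never called and
| the device returns to its previous state on its own.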
| nabaraz wrote:
| Is there a way to secure BGP? Some kind of BGP table backup and
| restore?
| Wolfspirit wrote:
| I think it is possible to restore the previous state, but the
| question is whether it makes sense. When do you decide that it
| was a failure? Facebook explicitly (even if automatically) told
| everyone else that they shouldn't use those routes anymore.
|
| On Facebook's side I guess they do have backups of
| their BGP config. Applying them (probably remotely), however,
| seems to be harder than expected when the whole infrastructure
| is down.
| mnordhoff wrote:
| "Because of this Cloudflare's 1.1.1.1 DNS resolver could no
| longer respond to queries asking for the IP address of
| facebook.com or instagram.com."
|
| The instagram.com zone itself uses a third-party DNS service and
| didn't go down. (But e.g. www.instagram.com is a CNAME to a zone
| on FB DNS.)
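| A toy sketch of why the apex name can keep resolving while a CNAME
| under it fails. All names, zones, and addresses below are
| hypothetical placeholders, not the real Instagram or Facebook
| records.

```python
# Each record: name -> (zone that serves it, record type, value).
RECORDS = {
    "instagram.example": ("third-party", "A", "192.0.2.10"),
    "www.instagram.example": ("third-party", "CNAME", "edge.fb.example"),
    "edge.fb.example": ("fb", "A", "192.0.2.20"),
}


def resolve(name, records, reachable_zones, max_hops=8):
    """Follow CNAMEs until an A record is found or a zone is unreachable."""
    for _ in range(max_hops):
        zone, rtype, value = records[name]
        if zone not in reachable_zones:
            raise LookupError(f"authoritative servers for zone {zone!r} unreachable")
        if rtype == "A":
            return value
        name = value  # CNAME: restart the lookup at the target name
    raise LookupError("CNAME chain too long")
```

| With only the "third-party" zone reachable, the apex name still
| resolves, while the www name raises LookupError once the chain
| crosses into the unreachable zone.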
| yftsui wrote:
| That's pretty much why visiting instagram.com during the downtime
| showed a 503 from AWS instead.
| mschuster91 wrote:
| Wonder why they're still using AWS given that FB operates its
| own data centers...
| mnordhoff wrote:
| No idea. I'd speculate that it's some kind of historical
| reasons from before FB acquired IG.
| [deleted]
| thezapxs wrote:
| The thing is, because of the deletion of the BGP routes, they have
| ruled out a cyberattack, but what happened doesn't make sense
| andp97 wrote:
| No one.
|
| Cloudflare writing a report on someone else's network outage!
|
| AWESOME COMPANY!!!
| phoe-krk wrote:
| Do you really think that Facebook is going to write one right
| now?
|
| It's good that we have some coverage from companies that have
| some stake in the game because they are also affected by the
| outage, even if only partially.
| powera wrote:
| Facebook has to make some public statement; the shareholders
| will demand it.
|
| I expect the detail level to be roughly "an automated system
| pushed a broken configuration"; that is to say, there
| probably won't be any interesting information at all for the
| Hacker News crowd.
|
| I doubt that this was caused by "hackers" or "hostile
| governments" or "dissident employees upset about Facebook
| privacy issues", and also doubt that Facebook would admit
| such if it were true unless they were legally required to do
| so.
| 0des wrote:
| >Facebook has to make some public statement
|
| History has shown us they can give us zero response, or an
| incorrect response, and we (via our representatives) will
| accept it and continue living life as before.
| bilbo0s wrote:
| The fact is, it's altogether likely that they could be
| legally required NOT to make such a statement outlining the
| cause if it was a hostile actor. I've felt a distinct
| change recently. The US government is not messing around
| about cyber security anymore.
|
| The guys with the blue windbreakers show up, I'd pretty
| much say "yes, sir." Of course, I don't have FB's power,
| but I don't think it matters.
| Wolfspirit wrote:
| Even if they wanted to write one they have no way to host it
| :D
| clemenspw wrote:
| Thanks, great write up!
| amachefe wrote:
| Facebook.com is up now; it looks like the issue now is the billions
| of requests that are effectively a DDoS on the DNS servers
| AlbertCory wrote:
| It sorta sounds like Facebook is Too Big To Fail.
|
| Yet another reason to dismantle it.
| literallyWTF wrote:
| I would rather purge it from this planet and throw Zuck into
| jail.
| 0des wrote:
| Side note: I wonder if this comment would do better with a
| different username
| [deleted]
| robaato wrote:
| From article: [This chart shows] the availability of the DNS name
| 'facebook.com' on Cloudflare's DNS resolver 1.1.1.1. It stopped
| being available at around 15:50 UTC and returned at 21:20 UTC.
| Wolfspirit wrote:
| Seems like someone found the post-it with the admin password for
| the BGP routers now and they're back online
| ruoso wrote:
| > ... but as of 22:28 UTC Facebook appears to be ...
|
| Someone assumed London==UTC, when London is 1 hour ahead :) that
| was actually 21:28 UTC
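| The correction can be checked with a fixed offset. This sketch
| assumes London was on British Summer Time (UTC+1) on 2021-10-04
| and hard-codes that offset rather than consulting a timezone
| database; the function name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: on 2021-10-04 London observed British Summer Time (UTC+1).
BST = timezone(timedelta(hours=1), "BST")


def london_to_utc(local):
    """Attach the BST offset to a naive London wall-clock time, then convert to UTC."""
    return local.replace(tzinfo=BST).astimezone(timezone.utc)


stamp = london_to_utc(datetime(2021, 10, 4, 22, 28))
print(stamp.strftime("%H:%M UTC"))  # 21:28 UTC
```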
| interestica wrote:
| No matter what time of year it is, people tend to use 'EST' for
| 'Eastern Time' even when we might be in Eastern Daylight Time
| rather than Standard.
|
| It's especially annoying when dealing with multiple countries
| that may or may not be using Daylight Saving Time.
| Wolfspirit wrote:
| Even google isn't quite sure about the summer time. Not sure
| if that is just a Google German thing...
|
| A few weeks ago I tried to find out what the current time in
| CET is. Asking google for "CET" gave me: "23:27 CET". Asking
| google for "CET time" (I know that "time" is twice in this
| case) gave me "00:27 CET".
|
| The last one is wrong: it should be labelled CEST or, more
| correctly, just give the same result in CET that I asked for
| Wolfspirit wrote:
| Timezones are the most annoying thing... right after encoding
| paxys wrote:
| I personally find timezones more annoying. At least with
| encoding once you figure things out it will work
| indefinitely. Timezones can simply change from under you with
| or without notice.
| doublerabbit wrote:
| > right after encoding
|
| No joke. Today I ended up writing a whole essay explaining
| the issue I was having and sending it off to the core
| developers because I thought I had discovered an issue with
| the actual language. The bug was because I had forgotten to
| convert to and from utf-8 in these two procedures:
|     # Converts string data to base16 (hex) via its UTF-8 bytes
|     proc 2Hex { input } { binary encode hex [encoding convertto utf-8 "$input"] }
|     # Converts hex data back to a string by decoding the UTF-8 bytes
|     proc 2Base { input } { encoding convertfrom utf-8 [binary decode hex "$input"] }
|
| On the plus side, I now have written documentation of the
| internals of my program.
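| For comparison, a rough Python equivalent of the same round trip.
| The function names are made up here; the point is only that the
| text has to be turned into UTF-8 bytes before hex-encoding and
| decoded from UTF-8 afterwards.

```python
def to_hex(text):
    """Encode text as UTF-8 bytes, then as a hex string."""
    return text.encode("utf-8").hex()


def from_hex(hex_text):
    """Decode a hex string to bytes, then interpret the bytes as UTF-8."""
    return bytes.fromhex(hex_text).decode("utf-8")


def from_hex_wrong_charset(hex_text):
    """The kind of bug described above: the bytes are read with the wrong charset."""
    return bytes.fromhex(hex_text).decode("latin-1")
```

| Here to_hex("é") gives "c3a9"; from_hex turns that back into "é",
| while from_hex_wrong_charset yields the two-character mojibake
| "Ã©".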
| VBprogrammer wrote:
| Oh TCL. I didn't miss you.
| throwdecro wrote:
| This is a great write-up, but one thing I don't understand is why
| the effect of withdrawing the BGP prefixes was instantaneous (if
| I understand that correctly), but it's taking hours (so far) to
| re-announce the prefixes. Why would it take so long to flip the
| switch back the other way?
| adamcharnock wrote:
| I'm pretty new to BGP, but I'd imagine that cutting off access
| to an AS is fast because all it takes is for the neighbouring
| routers to update their routes. At which point any traffic that
| makes it that far is simply dropped.
|
| Whereas to make an announcement, the entire internet (or at
| least all routers between the AS and the user) needs to pick up
| the new announcement.
|
| (Note: I still need to read the article)
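| A toy routing table showing the "withdrawn route means dropped
| traffic" part of this. It is not a BGP implementation: the class
| is hypothetical, the prefix is a documentation range, and it
| assumes the router has no default route to fall back on.

```python
import ipaddress


class RouteTable:
    """Longest-prefix-match table with announce and withdraw."""

    def __init__(self):
        self._routes = {}  # network -> next-hop label

    def announce(self, prefix, next_hop):
        self._routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        # Removing a prefix that is not present is a no-op.
        self._routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        addr = ipaddress.ip_address(address)
        matches = [net for net in self._routes if addr in net]
        if not matches:
            return None  # no route: the packet is dropped
        best = max(matches, key=lambda net: net.prefixlen)
        return self._routes[best]
```

| After announce("192.0.2.0/24", "peer-a"), a lookup of 192.0.2.53
| returns "peer-a"; after withdraw("192.0.2.0/24") the same lookup
| returns None.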
| bsedlm wrote:
| (I'm trying to better understand this)
|
| I think it's not so simple because authoritative DNS systems
| are involved.
|
| So it's not just a BGP error. It's a BGP error which
| disconnected authoritative DNS for all of Facebook. I'm not
| quite sure why that makes it so slow to fix. Is it just
| because of internal difficulties due to having no DNS at all?
| dec0dedab0de wrote:
| I haven't been following closely, but I think once they removed
| the prefixes they could no longer access the routers. Coupled
| with barebones staff at the data center due to the pandemic,
| and all internal communication being disrupted. Though I really
| expected it to be up within an hour or two.
| dr_orpheus wrote:
| Yeah, I think that is true. If you look at the Update near
| the end of the Cloudflare article there is a huge spike in
| the BGP activity (I assume re-announcing all of the routes).
| So that part of it was relatively instantaneous once they
| got all of their ducks in a row: actually getting to the
| routers and locating a BGP config from some earlier version,
| before it went offline this morning, that they could use.
| rifkiamil wrote:
| We have had out-of-band management ports & network designs
| for decades! I know the feeling of driving 8 hours because I
| lost connection to the device I was configuring.
| https://en.wikipedia.org/wiki/Out-of-band_management
| Wolfspirit wrote:
| I'm not sure if that is true (and I hope it is not, because that
| would be fatal), but I read somewhere that Facebook being
| down also means all of Facebook's internal infrastructure isn't
| available at the moment (chats, communication), including remote
| control tools for the BGP routers. Therefore they require people
| to get physical access to the routers while many people are
| working from home because of the pandemic.
| kube-system wrote:
| Given my experience with DNS issues, I am guessing that they
| are running into dependencies along the way that assume/require
| DNS be available to function.
| bink wrote:
| With routing it's even worse than that. If they had no
| out-of-band method to connect to these routers and they botched
| the routing config then they had no way to route any traffic
| to them at all. At least with DNS you can still connect to
| the IPs.
|
| I would find it a bit surprising if Facebook didn't have OOB
| access to their data centers, however.
| withinboredom wrote:
| Assuming you don't need DNS to get authorization to enter
| the OOB access...
| theshadowknows wrote:
| Yeah as part of my job I often have to work with our DNS team
| to provision say a subdomain or get some domain verified.
| They've got like...three people...trying to service thousands
| of teams across the enterprise. I do not envy their job at
| all.
| PeterCorless wrote:
| If an authoritative DNS entry was removed, it can take up to 72
| hours for that change to be propagated around the world, though
| usually just a few hours for some other authoritative DNS
| systems to get you mostly back:
|
| https://ns1.com/resources/dns-propagation#:~:text=DNS%20prop....
| tester756 wrote:
| Why does it take this long?
| withinboredom wrote:
| Caching
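| A minimal sketch of the caching behaviour: a resolver keeps serving
| a stored answer until its TTL runs out, so a change at the
| authoritative server is not seen until then. The class is
| hypothetical and the clock is injected to keep the example
| deterministic; real resolvers also cache negative answers.

```python
class TtlCache:
    """Tiny resolver-style cache: answers expire after their TTL."""

    def __init__(self, clock):
        self._clock = clock  # callable returning the current time in seconds
        self._entries = {}   # name -> (answer, expires_at)

    def put(self, name, answer, ttl_s):
        self._entries[name] = (answer, self._clock() + ttl_s)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        answer, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: forget it so the next lookup goes back to the source.
            del self._entries[name]
            return None
        return answer
```

| With a 300-second TTL, a lookup at t=299 still returns the old
| answer even if the authoritative record changed at t=1; only from
| t=300 onward does the cache miss and force a fresh query.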
| Hikikomori wrote:
| Restoring is just as simple as flipping the switch again, but
| access to that switch is another matter when your internal
| network is also down and you cannot even get access to your
| office or datacenters.
___________________________________________________________________
(page generated 2021-10-04 23:00 UTC)