[HN Gopher] Every device with FB app is now DDoSing recursive DN...
___________________________________________________________________
Every device with FB app is now DDoSing recursive DNS resolvers
Author : doener
Score : 285 points
Date : 2021-10-04 19:26 UTC (3 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| lrem wrote:
| Isn't this "every device with an app using the FB SDK"?
| Fordec wrote:
| Is it just me, or every time there's an issue with facebook
| server side, the FB SDK just absolutely damages any software
| with it installed as a dependency. The thought put into
| failover states under the hood is very lacking. It speaks to an
| ethos that if developers aren't a working data pipeline to
| facebook services, they and their own products can go pound
| sand.
| agilob wrote:
| Every app with facebook SDK used for whatever reason, like
| login or metrics or ads...
| littlecranky67 wrote:
| Oh great, and at some point when key ISP DNS servers crack under
| the load, more and more websites will appear to "go down" from a
| users perspective - suddenly gmail.com and outlook.com not
| working. More and more people reload websites, restart devices
| etc. and increase the load even further. People fall back to
| using SMS/telephone, but since it is not used to that heavy load
| of 2021, soon phone calls fail. With FB, WA, Email and Phone
| "down", engineers can't be reached to fix this. And if they can,
| they fail to call the Uber to get them somewhere. And even if
| they could, the streets are congested with people that cannot
| communicate remotely so try to get somewhere to convey messages
| in-person.
|
| Hope they will just go back to fix FB and this is just in my head
| :)
| littlecranky67 wrote:
| I wrote this as a pure joke, but now that I learned that
| SERVFAIL is not cached on browsers, clients, intermediate DNS
| servers [0] etc. I am curiously wondering what will be going
| on. It is not only FB apps, it is basically every website
| request (that uses FB JS for ads, tracking, etc.) that triggers
| a DNS request, which will be forwarded 1:1 from the ISP's DNS
| to the null-routed FB Subnet. This should put orders of
| magnitude more load on resolving DNS servers than usual.
|
| [0]: https://serverfault.com/questions/479367/how-long-a-dns-
| time...
| bowaggoner wrote:
| I enjoyed the comment. You might like my short story along
| the same lines from a few years ago:
| https://bowaggoner.com/writeups/robust.html
| tarsius wrote:
| For those who understand Swiss German: Mani Matter - I han es
| Zundholzli azundt https://www.youtube.com/watch?v=PkGatIgXERI
| Tainnor wrote:
| I lit a match / And it made a fire. / And for my cigarette /
| I wanted to take the fire from the match. / But the match
| slipped from my hand and landed on the rug. / And it almost
| made a hole in the rug.
|
| Well, you know what can happen / If you're not careful with
| fire. / And for the light on a cigarette / A rug feels rather
| too expensive. / And from the rug the fire, alas, / Might
| have spread to the whole house / And who knows what would
| have happened thereafter?
|
| There would have been a fire in the district / And the fire
| fighters would have had to come. / They would have honked in
| the streets / And unloaded their pipes. / And they would have
| sprayed fire / And it would have been in vain / And the whole
| city would have burned with nothing to protect it.
|
| And the people would have jumped around / Fearing for their
| possessions. / They would have thought somebody had started a
| fire. / They would have grabbed assault rifles. / Everyone
| would have shouted: "Whose fault is it?" / The whole country
| would have rioted. / And they would have shot at the
| ministers behind the lecterns.
|
| The UN would have become involved / And the UN enemies as
| well. / To preserve peace in Switzerland / Both would have
| come with tanks. / It would have spread, little by little, /
| To Europe, to Africa. / There would have been a World War and
| humans would be no more.
|
| I lit a match / And it made a fire. / And for my cigarette /
| I wanted to take the fire from the match. / But the match
| slipped from my hand and landed on the rug. / Thankfully, I
| picked it back up.
| asdff wrote:
| If it's any consolation, I find SMS and telephone to be
| remarkably robust in these situations. In college during
| football games, the campus population would swell to probably
| 200k people within a few square miles each with a cell phone in
| a pocket. 3g and LTE would be worthless. Campus area wifi would
| be worthless. The only thing that would work is shutting all
| that off your phone and resorting to SMS and calling people
| over EDGE, but it worked flawlessly even with all the people
| stressing out a handful of towers at once.
| littlecranky67 wrote:
| Football games are planned events, and the carriers plan
| capacity accordingly. I remember sms+cell phone going down on
| several music festivals with 20.000+ participants (especially
| when it ended at around midnight). Only some carriers
| supported those "spontaneous" gatherings in the middle of
| nowhere with mobile cell towers that would keep connectivity
| going - but low-cost carriers never did.
| mschuster91 wrote:
| German carrier O2 was/is notorious for offering somewhat
| decent-ish service in urban areas under normal conditions -
| but major events that happen to have lots of people moving
| around the city like fans congregating to a soccer game with
| public transport, political rallies or your average drunkard
| festival (=Oktoberfest)? Instant collapse...
| IgorPartola wrote:
| Isn't there an easy fix to just add a bullshit record for FB to
| DNS until this blows over?
|
| I am also thinking that all the poorly coded sites that don't
| work unless the Share on Facebook button loads are also going
| to hemorrhage money. So are all the e-commerce sites that rely
| on Login with FB.
|
| I hope this results in everyone rethinking adding all that shit
| to their infrastructure before the next time.
| noahtallen wrote:
| My understanding is that the Facebook network itself was
| unreachable from the internet because of BGP. So even if an
| IP was resolved from DNS, that IP wouldn't get routed to
| Facebook because it withdrew its routes from its peers where
| it connects to ISPs via BGP
| AnimalMuppet wrote:
| Not if the IP was 127.0.0.1...
| jbotz wrote:
| The problem isn't (wasn't) facebook.com's A records, it's
| that the authoritative nameservers for facebook.com are
| (were) unreachable. In theory someone could change the NS
| records for facebook on the .com nameservers to point
| somewhere else and serve up a fake facebook.com domain,
| but... 1) those NS records have a 6 hour TTL, so it would
| take a while to be effective, and 2) who has the authority to do
| that?
| unanswered wrote:
| ISPs keep floating charging Netflix for their own customers' data
| traffic; by the same logic DNS operators should be charging
| Facebook.
| walrus01 wrote:
| If you're an ISP that is anything bigger than a mom-and-pop
| operation, you should have at least 3 or 4 geographically
| distributed anycast recursive resolvers.
|
| Recursive DNS is pretty easy to do for really large volumes on a
| $600 1U server. It's not like the days of 15 years ago...
| fmajid wrote:
| Ah, but what about all the marketing analytics ISPs deploy on
| their DNS servers so that if your browser ever looked up
| viagra.com, forever will it grace your web browsing retargeting
| ad units? /s
| Jumb0 wrote:
| Could anyone explain this so people with no DNS knowledge could
| understand?
| ricardo81 wrote:
| BGP is like the mail service.
|
| DNS is a translation from human readable addresses to machine
| addresses.
|
| BGP determines how to find those addresses from your server to
| theirs.
| MrStonedOne wrote:
| The internet phone book to convert .coms to ip addresses only
| scales up to the load level of the internet because results are
| cached at multiple layers.
|
| your browser asks your computer which asks your router which
| asks your isp which asks .com's dns servers which ask
| facebook's dns servers for facebook's ip address.
|
| each layer will cache the results so say, even if 100,000
| people in seattle want to know facebook.com's ip address, only
| the 5 or so ISPs that provide internet in seattle have to ask for
| facebook.com's ip address, so 100,000 requests, but only 5
| actual requests.
|
| even the per-device and per-home cache is helpful, because 500
| page loads in 15 minutes still only results in 1 actual dns
| request.
|
| Here's the issue:
|
| Failures aren't cached.
|
| so while 100,000 people in 1 second trying to get facebook's ip
| used to result in only 5 requests going to the core dns servers,
| it now results in 100,000 requests in 1 second going to the
| core servers.
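|
| The effect described above can be sketched with a toy resolver.
| This is only an illustration under simplifying assumptions (one
| cache layer, a fixed TTL, no SERVFAIL caching at all); the class
| and function names are made up, and 192.0.2.1 is a
| documentation-range placeholder, not a real Facebook address.
|

```python
class CachingResolver:
    """Toy recursive resolver: caches successes for a TTL, never caches failures."""

    def __init__(self, upstream, ttl=300):
        self.upstream = upstream      # callable(name) -> ip, raises LookupError on failure
        self.ttl = ttl                # seconds a successful answer stays cached
        self.cache = {}               # name -> (ip, expiry time)
        self.upstream_queries = 0     # how many queries actually reached the upstream

    def resolve(self, name, now):
        entry = self.cache.get(name)
        if entry and entry[1] > now:
            return entry[0]           # cache hit: the upstream is not contacted
        self.upstream_queries += 1
        ip = self.upstream(name)      # a failure propagates here and is NOT cached
        self.cache[name] = (ip, now + self.ttl)
        return ip


def healthy(name):
    return "192.0.2.1"                # placeholder answer (assumption)


def broken(name):
    raise LookupError("SERVFAIL")     # stands in for unreachable authoritative servers


def count_upstream_queries(upstream, clients=100_000):
    """Have `clients` clients ask for the same name at the same instant."""
    resolver = CachingResolver(upstream)
    for _ in range(clients):
        try:
            resolver.resolve("facebook.com", now=0)
        except LookupError:
            pass                      # the client sees the failure; nothing is cached
    return resolver.upstream_queries
```

| With the healthy upstream, all clients share a single upstream
| query; with the broken one, every client request goes upstream.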
| Rd6n6 wrote:
| DDoS is a strong way to put this. Are we talking malware that is
| sending thousands of requests per device, or a bug from a
| connectivity issue?
| Jtsummers wrote:
| It's still a DDoS, just not an _attack_. Slashdotting a site is
| a DDoS, but usually not intended as a deliberate attack. Now
| take every WhatsApp, Facebook, Instagram, and Facebook
| Messenger user, every app that uses FB for user authentication,
| every _site_ that does the same, every app and site that serves
| FB ads, every app and site that uses FB for metrics, and we
| have an unintentional DDoS just waiting to happen.
| tantalor wrote:
| The word "attack" has multiple definitions, including "act
| harmfully on" e.g., a heart attack. It does not require
| intent or aggression.
| protomyth wrote:
| Yeah, I now know every user with the FB app installed. It's just
| wild to watch the log of all the phones asking for facebook.com.
| sschueller wrote:
| Is there no response code for a DNS to say "I don't have what you
| want right now, come back later but wait at least xxx seconds"?
|
| I guess alternatively you could return garbage (127.0.0.1) with a
| 5 min ttl or so to get clients to back off, but that's also
| problematic.
| a1369209993 wrote:
| > you could return garbage (127.0.0.1) with a 5 min ttl
|
| I use 0.0.0.0, though I'm not sure if some layer in that mess
| would interpret it creatively. Has worked on my machine for
| years at least:
|
|     $ ping facebook.com
|     connect: Invalid argument
|
| (If you do this, please set the TTL to at least a month, and
| preferably upwards of a decade.)
| asddubs wrote:
| lots of ISPs straight up ignore ttl
| Computeiful wrote:
| Clients should really be using something like exponential
| backoff ethernet-style.
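|
| A minimal sketch of what such a retry schedule could look like:
| truncated binary exponential backoff with random jitter, in the
| spirit of Ethernet's collision handling. The function name and
| the base/cap defaults are assumptions for illustration, not
| anything taken from the FB SDK.
|

```python
import random


def backoff_delays(attempts, base=1.0, cap=300.0, rng=random.random):
    """Return the wait (in seconds) before each retry.

    The ceiling doubles on every failed attempt up to `cap`, and the actual
    wait is a random fraction of that ceiling so clients do not retry in
    lockstep.
    """
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(rng() * ceiling)
    return delays
```

| A client would sleep for each delay in turn between failed
| lookups instead of retrying immediately.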
| xenonite wrote:
| And even Hacker news is strangely slow to respond.
| mrep wrote:
| Lots of people are interested in the event, and a normal web
| response queries their SQL database for user data like upvotes,
| which is their limiting factor.
|
| Logging out, which I believe they just forced for all users,
| sends you to the cache and it'll be fast. dang@ mentioned it
| during the giant S3 outage a few years ago.
| munk-a wrote:
| Ah nice - I noticed it just got a lot more snappy to respond.
| bifrost wrote:
| Yep, with FB and IG down, people spending time on a much more
| important site.... This site.
| reayn wrote:
| yeah I thought I was the only one, never really noticed this on
| HN before...
| adtac wrote:
| really lol? HN goes down every full moon day or something
| like that
| littlecranky67 wrote:
| I'd thank anybody who would post a tutorial/configfile to set up a
| DNS server (dnsmasq?) that forcefully caches even failed requests
| for a configurable timeout, with large cache sizes. We might need
| them in case DNS servers go down under the load of requests from
| "smart devices" :)
| agilob wrote:
| > forcefully caching even failed requests to a configurable
| timeout
|
| I've been doing ~SRE for 1.5 years and I've worked or helped on
| 3 outages related to negative DNS. Please don't use negative
| caching if you don't know enough about DNS and can't monitor
| it
| littlecranky67 wrote:
| I wouldn't suggest any ISP should do that (and I am none) but
| probably host this for own personal usage/home networks. If
| recursive DNS servers go down under the load of "smart
| devices", having a local copy of a larger number/set of IPs I
| usually visit might come in handy (and none of my requests
| would worsen the issue of server overload).
| agilob wrote:
| This is 'cache', my OpenWRT router has this for thousands
| of records, but negative cache means: "remember this domain
| doesn't exist and don't retry asking other DNS providers".
| This is very dangerous.
|
| Your browser AND operating system AND router already
| provide DNS caching, it's not something the average user should
| even think about. You might want to consider it when things
| in your ISP go wrong (hello BT), or when the majority of computers
| request the same domains frequently, but then again, your
| router should do it already.
| jaywalk wrote:
| This does a good job explaining how SERVFAIL caching works:
| https://serverfault.com/questions/479367/how-long-a-dns-time...
| littlecranky67 wrote:
| From your linked Server Fault post, the accepted answer concludes:
|
| "In summary, SERVFAIL is unlikely to be cached, but even if
| cached, it'll be at most a double- or even a single-digit
| number of seconds."
|
| That would be fatal right now, wouldn't it? That would mean
| every major ISP's DNS server right now forwards millions of
| _identical_ DNS resolve requests to the (currently null-
| routed) Facebook DNS servers. These must be millions, as
| heck, every larger website uses FB tracking tools, "like
| buttons" etc. Are they at least smart enough to throttle
| based on a domain/ip hash? Else it could happen that DNS
| servers of major ISPs are soon overloaded as (constantly
| failing and thus uncached) requests to FB DNS would eat up
| all bandwidth/resources?
| toast0 wrote:
| Not that fatal. I think at least some recursive servers
| will do 'collapsed forwarding', where additional requests
| to resolve the same name while the first request is in
| progress will wait for the first request to finish and send
| the same results to all clients at that point. Although,
| perhaps that's just wishful thinking on my part.
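|
| The 'collapsed forwarding' idea can be sketched as request
| coalescing: while one upstream query for a name is in flight,
| later requests for the same name wait on it instead of starting
| their own. This is an illustrative asyncio sketch, not how any
| particular resolver implements it; the class, `slow_upstream`,
| and the placeholder address are all made up.
|

```python
import asyncio


class CoalescingResolver:
    """Share one in-flight upstream query among all concurrent requests for a name."""

    def __init__(self, upstream):
        self.upstream = upstream      # async callable(name) -> ip
        self.in_flight = {}           # name -> asyncio.Task for the pending query
        self.upstream_queries = 0

    async def resolve(self, name):
        task = self.in_flight.get(name)
        if task is None:
            # First request for this name: start the single upstream query.
            self.upstream_queries += 1
            task = asyncio.ensure_future(self.upstream(name))
            self.in_flight[name] = task
            # Forget the query once it finishes, so nothing is cached afterwards.
            task.add_done_callback(lambda _: self.in_flight.pop(name, None))
        # Every caller, first or later, waits on the same task.
        return await task


async def slow_upstream(name):
    await asyncio.sleep(0.01)         # stands in for network latency
    return "192.0.2.1"                # documentation-range placeholder


async def demo(clients=50):
    resolver = CoalescingResolver(slow_upstream)
    answers = await asyncio.gather(
        *(resolver.resolve("facebook.com") for _ in range(clients))
    )
    return resolver.upstream_queries, answers
```

| Note that this only helps while a query is pending: once the
| upstream answers (or fails), the next request starts a fresh
| query, so it is not the same thing as caching the result.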
|
| Then you have port limits, usually each request goes out on
| a new port, a recursive resolver can only have 64k requests
| outstanding to any given authoritative (or upstream) server
| IP for each IP the recursive uses. Facebook runs with 4
| hostnames listed, so that's a limit of 256k requests
| outstanding, 512k if your recursive does IPv4 and v6 (and 1
| M if they're also making whatsapp requests).
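|
| The arithmetic behind those figures, as a small sketch. It
| assumes the full 16-bit source port space is usable per
| destination IP, which is an upper bound; real ephemeral port
| ranges are usually smaller. The names are illustrative only.
|

```python
PORTS_PER_DESTINATION = 2 ** 16  # 16-bit UDP source port, the "64k" in the comment


def max_outstanding(server_ips, address_families=1, domains=1):
    """Upper bound on queries one resolver IP can have outstanding."""
    return PORTS_PER_DESTINATION * server_ips * address_families * domains


print(max_outstanding(4))                                   # 4 nameservers, IPv4 only
print(max_outstanding(4, address_families=2))               # IPv4 and IPv6
print(max_outstanding(4, address_families=2, domains=2))    # plus a second domain
```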
|
| DNS services for both domains appear to be back up by the
| way.
|
| On the authoritative side, it's not too hard to manage this
| load. If you can't handle the big crush to start with, drop
| all requests, and then accept all the requests from
| 1.0.0.0/8, and add one /8 at a time as CPU permits until
| you're allowing everything. Once you handle the initial
| crush from a resolver, it should go back to normal load,
| and there should be some distribution of load across the
| various /8s. I wouldn't expect it to be evenly distributed,
| but it should be even enough.
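|
| A rough sketch of that slow-start idea: start by dropping
| everything, then admit one more source /8 at a time. It is
| IPv4-only and purely illustrative; the class name is
| hypothetical, and a real deployment would do this in the packet
| filter rather than in application code.
|

```python
import ipaddress


class SlowStartFilter:
    """Admit query sources one /8 at a time, starting from 1.0.0.0/8."""

    def __init__(self):
        self.allowed_octets = 0       # 0 means: drop all requests

    def widen(self):
        """Admit the next /8, to be called as CPU headroom permits."""
        self.allowed_octets = min(255, self.allowed_octets + 1)

    def accepts(self, source_ip):
        # The first octet of an IPv4 address identifies its /8.
        first_octet = ipaddress.IPv4Address(source_ip).packed[0]
        return 1 <= first_octet <= self.allowed_octets
```

| Each call to widen() lets in one more /8; resolvers in already
| admitted ranges get an answer, cache it, and fall back to normal
| load.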
|
| Disclosure: I worked at WhatsApp, but left August 2019. I
| don't know anything about this outage other than idle
| speculation. I don't know if FB has a procedure to slow
| start DNS, but the theory is simple; the practice is
| complicated by the DNS ips being used in Anycast.
| littlecranky67 wrote:
| > servers will do 'collapsed forwarding', [...] perhaps
| that's just wishful thinking on my part
|
| I think it is wishful thinking, because that would
| basically be caching which is not allowed by the RFC. In
| 2017 the BIND implementation changed to a default cache
| time of 1s which would certainly ease the problem.
|
| > then you have port limits, usually each request goes
| out on a new port, a recursive resolver can only have 64k
|
| I'm unsure if this helps or worsens the situation,
| depending if the 'collapsed forwarding'/1s caching is in
| place. If this is not the case, ephemeral port exhaustion
| would kick in, at which point the DNS server will not be
| able to serve other requests.
|
| > On the authoritative side, it's not too hard to manage
| this load
|
| Of course not, all you need to do is just present _any_
| response which will be cached by downstream resolvers. No
| smartphone/end-user device will query the authoritative
| side as long as there is just any (even stale) response.
| toast0 wrote:
| > If this is not the case, ephemeral port exhaustion
| would kick in, at which point the DNS server will not be
| able to serve other requests.
|
| You can use the same local ip/port to contact multiple
| server ip/ports, so filling up connections to FB ips
| shouldn't prevent you from connecting to others (but
| there are plenty of ways to do that wrong, I guess)
|
| >> On the authoritative side, it's not too hard to manage
| this load
|
| > Of course not, all you need to do is just present any
| response which will be cached by downstream resolvers.
|
| You need to present a response before the resolver times
| out. One can certainly imagine a situation where the
| incoming packet processing results in enough delay that
| the responses arrive too late and are discarded. In the
| right conditions, this queuing delay would never clear
| and things just get worse. If it doesn't happen, great,
| but if it does, dropping most of the requests so you can
| timely handle the few you accept is a good way to get
| moving.
| fmajid wrote:
| It's not so much the load as the DNS servers having to
| maintain state for all those queries until they time out.
| Must consume tremendous RAM and servers that are not event-
| driven could also be generating large numbers of threads.
| fermentation wrote:
| Excellent opportunity to set up a DNS-by-mail service. Just send
| me a letter with the names you want and I'll get back to you
| within 3 to 5 business days!
| idatum wrote:
| Never having explicitly queried fb.com before I never noticed how
| they (face:b00c) got clever with their IPv6 address:
|
| 2a03:2880:f1ff:83:face:b00c:0:25de
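|
| A quick way to pull out the vanity groups, sketched with
| Python's standard ipaddress module (the helper name `hextets`
| is made up for illustration):
|

```python
import ipaddress


def hextets(address):
    """Split an IPv6 address into its eight zero-padded 16-bit groups."""
    return ipaddress.IPv6Address(address).exploded.split(":")


groups = hextets("2a03:2880:f1ff:83:face:b00c:0:25de")
print(groups[4], groups[5])  # prints: face b00c
```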
| ceejayoz wrote:
| Facebook also used brute force to get facebookcorewwwi.onion on
| Tor a while back.
| https://en.wikipedia.org/wiki/Facebook_onion_address
| ipaddr wrote:
| facebookwkhpilnemxj7asaniu7vnjjbiltxjqhye3mhbshg7kx5tfyd.onion
| is the latest
| john37386 wrote:
| At this point people hope it will just restart so that we can all
| resume a normal life. There can only be 2 options. It will be fixed
| very soon or it will be a hell of a night with a lot of coffee.
| Ajedi32 wrote:
| Is this specifically an issue with the Facebook app? Or is it
| just a predictable consequence of DNS responses no longer being
| cached due to query failures for a site as popular as Facebook?
| treesknees wrote:
| It is certainly not specific to Facebook, but the scale at
| which Facebook is referenced across websites and apps is pretty
| unique (I can only think of a few key players like Google who
| would cause a similar load.)
|
| And to clarify a bit, the queries aren't "no longer being
| cached due to query failures", it's because their TTL expires
| and the resulting SERVFAIL from the next query (which fails)
| isn't cached at all.
| [deleted]
| justahuman1 wrote:
| Is it possible for facebook to instead rely on an anycast IP
| rather than DNS for their (non-web) phone apps?
| bifrost wrote:
| No. But FB's DNS is anycasted.
|
| FB's eggs were all in one basket, and the basket broke.
| treesknees wrote:
| Yes and no. Yes they could technically hardcode an anycasted IP
| address, however it'd be less reliable. Also you'd run into
| issues with TLS certificates. It'd be very inflexible and would
| probably result in more outages.
|
| But even if they did hardcode an IP, the underlying
| infrastructure for Facebook was also down, not just DNS
| resolution of facebook.com. So even if the FB app didn't need
| to resolve a hostname, it would still be broken.
| [deleted]
| earth2mars wrote:
| Is it just the people trying to connect, or does the app itself
| keep polling and trying to send information from devices to
| Facebook servers continuously?
| slg wrote:
| This technology problem is a good metaphor for Facebook overall
| as a company. There is nothing fundamentally wrong with having
| your app regularly polling for DNS records when it can't find
| them, but that can be an actively harmful approach when you are
| the size of Facebook. Being that size comes with a whole swath of
| extra responsibilities to ensure that your behavior doesn't end
| up harming society as a whole.
| jmalicki wrote:
| There is something inherently wrong with that - it's why
| exponential backoff exists.
___________________________________________________________________
(page generated 2021-10-04 23:00 UTC)