[HN Gopher] Cloudflare 1.1.1.1 Incident on July 14, 2025
___________________________________________________________________
Cloudflare 1.1.1.1 Incident on July 14, 2025
Author : nomaxx117
Score : 521 points
Date : 2025-07-16 03:44 UTC (19 hours ago)
(HTM) web link (blog.cloudflare.com)
(TXT) w3m dump (blog.cloudflare.com)
| Mindless2112 wrote:
| Interesting that traffic didn't return to completely normal
| levels after the incident.
|
| I recently started using the "luci-app-https-dns-proxy" package
| on OpenWrt, which is preconfigured to use both Cloudflare and
| Google DNS, and since DoH was mostly unaffected, I didn't notice
| an outage. (Though if DoH had been affected, it presumably would
| have failed over to Google DNS anyway.)
| anon7000 wrote:
| They go into that more towards the end; it sounds like
| some smaller % of servers needed more direct intervention.
| caconym_ wrote:
| > Interesting that traffic didn't return to completely normal
| levels after the incident.
|
| Anecdotally, I figured out their DNS was broken before it hit
| their status page and switched my upstream DNS over to Google.
| Haven't gotten around to switching back yet.
| radicaldreamer wrote:
| What would be a good reason to switch back from Google DNS?
| sammy2255 wrote:
| Depends who you trust more with your DNS traffic. I know
| who I trust more.
| nojs wrote:
| Who? Honest question
| misiek08 wrote:
| Google is serving you ads, CF isn't.
|
| And it's not a conspiracy theory - it looked very
| suspicious when we did some testing with a small,
| privacy-aware group. The traffic didn't look like it was
| being handled anonymously on Google's side.
| mnordhoff wrote:
| Unless the privacy policy changed recently, Google
| shouldn't be doing anything nefarious with 8.8.8.8 DNS
| queries.
| DarkCrusader2 wrote:
| They weren't supposed to do anything with our gmail data
| as well. That didn't stop them.
| Tijdreiziger wrote:
| [citation needed]
| johnklos wrote:
| Read their TOS.
| Tijdreiziger wrote:
| If it's in the ToS, then it's not true that "[they]
| weren't supposed to do anything with our gmail data".
| daneel_w wrote:
| Yeah it's not like they have a long track record of being
| caught red-handed stepping all over privacy regulations
| and snarfing up user activity data across their entire
| range of free products...
| opan wrote:
| CF breaks half the web with their awful challenges that
| fail in many non-mainstream browsers (even ones based on
| chromium).
| Elucalidavah wrote:
| Realistically, either you ignore the privacy concerns and
| set up routing to multiple providers preferring the
| fastest, or you go all-in on privacy and route DNS over
| Tor via a bridge.
|
| Although, perhaps, having an external VPS with a dns
| proxy could be a good middle ground?
| Tijdreiziger wrote:
| Middle ground is ISP DNS, right?
| davidcbc wrote:
| If privacy is your primary concern I would 100% trust
| Cloudflare or Google over an ISP in the US
| daneel_w wrote:
| If you're the technical type you can run Unbound locally
| (even on Windows) and let it forward queries with DoT. No
| need for either Tor or running your own external
| resolver.
| immibis wrote:
| Myself, I suppose? Recursive resolvers are low-
| maintenance, and you get less exposure to ISP censorship
| (which "developed" countries also do).
| daneel_w wrote:
| Quad9, dns0.
| Algent wrote:
| After trying both several times, I've stayed with Google,
| because Cloudflare kept returning really bad IPs for
| anything involving a CDN. Having users complain that stuff
| takes ages to load because you got matched to an IP on the
| opposite side of the planet is a bit problematic,
| especially when it rarely happens with other DNS
| providers. Maybe there is a way to fix this, but I admit I
| went for the easier option of going back to good old
| 8.8.8.8.
| homebrewer wrote:
| No, it's deliberately not implemented:
|
| https://developers.cloudflare.com/1.1.1.1/faq/#does-1111-
| sen...
|
| I've also changed to 9.9.9.9 and 8.8.8.8 after using
| 1.1.1.1 for several years because connectivity here is
| not very good, and being connected to the wrong data
| center means RTT in excess of 300 ms. Makes the web very
| sluggish.
| Aachen wrote:
| Does that setup fall back to 8.8.8.8 if 9.9.9.9 fails to
| resolve?
|
| Quad9 has a very aggressive blocking policy (my site with
| user-uploaded content was banned without even reporting
| the malicious content; if you're a big brand name it
| seems to be fine to have user-uploaded content though)
| for which this would be a possible workaround - but the
| fallback may not treat an NXDOMAIN response as a resolver
| failure.
| motorest wrote:
| > Interesting that traffic didn't return to completely normal
| levels after the incident.
|
| Clients cache DNS resolutions to avoid having to do that
| request each time they send a request. It's plausible that some
| clients held on to their cache for a significant period.
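The client-side caching described above can be sketched as a small TTL cache (the `DnsCache` class and addresses here are illustrative, not any real client's implementation):

```python
import time

class DnsCache:
    """Toy client-side DNS cache: an answer is reused until its TTL expires."""

    def __init__(self):
        self._entries = {}  # name -> (address, expiry timestamp)

    def put(self, name, address, ttl, now=None):
        now = time.monotonic() if now is None else now
        self._entries[name] = (address, now + ttl)

    def get(self, name, now=None):
        now = time.monotonic() if now is None else now
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if now >= expires_at:
            # Stale entry: drop it so the client re-queries the resolver.
            del self._entries[name]
            return None
        return address
```

A client holding such an entry keeps "working" through a resolver outage until the TTL runs out, which is one reason traffic recovers gradually rather than all at once.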
| bastawhiz wrote:
| If your Internet doesn't work, you'll get up and do other
| things for a while. I strongly suspect most folks didn't switch
| DNS providers in that time.
| CuteDepravity wrote:
| It's crazy that both 1.1.1.1 and 1.0.0.1 were affected by
| the same change.
|
| I guess now we should start using a completely different
| provider as a DNS backup - maybe 8.8.8.8 or 9.9.9.9.
| sammy2255 wrote:
| 1.1.1.1 and 1.0.0.1 are served by the same service. It's not
| advertised as a redundant fully separate backup or anything
| like that...
| yjftsjthsd-h wrote:
| Wait, then why does 1.0.0.1 exist? I'll grant I've never seen
| it advertised/documented as a backup, but I just assumed it
| must be because why else would you have two? (Given that
| 1.1.1.1 already isn't actually a single point, so I wouldn't
| think you need a second IP for load balancing reasons.)
| ta1243 wrote:
| Far quicker to type ping 1.1 than ping 1.1.1.1
|
| 1.0.0.0/24 is a different network than 1.1.1.0/24 too, so
| can be hosted elsewhere. Indeed right now 1.1.1.1 from my
| laptop goes via 141.101.71.63 and 1.0.0.1 via
| 141.101.71.121, which are both hosts on the same LINX/LON1
| peer but presumably from different routers, so there is
| some resilience there.
|
| Given DNS is about the easiest thing to avoid a single
| point of failure on I'm not sure why you would put all your
| eggs in a single company, but that seems to be the modern
| internet - centralisation over resilience because
| resilience is somehow deemed to be hard.
| yjftsjthsd-h wrote:
| > Far quicker to type ping 1.1 than ping 1.1.1.1
|
| I guess. I wouldn't have thought it worthwhile for 4
| chars, but yes.
|
| > 1.0.0.0/24 is a different network than 1.1.1.0/24 too,
| so can be hosted elsewhere.
|
| I thought anycast gave them that on a single IP, though
| perhaps this is even more resilient?
| darkwater wrote:
| Not a network expert but anycast will give you different
| routes depending on where you are. But having 2 IPs will
| give you different routes to them from the same location.
| In this case since the error was BGP related, and they
| clearly use the same system to announce both IPs, both
| were affected.
| ta1243 wrote:
| In the internet world you can't really advertise subnets
| smaller than a /24, so 1.1.1.1/32 isn't a route, it's via
| 1.1.1.0/24
|
| You can see they are separate routes, say looking at
| Telia's routing IP
|
| https://lg.telia.net/?type=bgp&router=fre-
| peer1.se&address=1...
|
| https://lg.telia.net/?type=bgp&router=fre-
| peer1.se&address=1...
|
| In this case they both are advertised from the same peer
| above, I suspect they usually are - they certainly come
| from the same AS, but they don't need to. You could have
| two peers with cloudflare with different weights for each
| /24
| kalmar wrote:
| I don't know if it's the reason, but inet_aton[0] and
| other parsing libraries that match its behaviour will
| parse 1.1 as 1.0.0.1. I use `ping 1.1` as a quick
| connectivity test.
|
| [0] https://man7.org/linux/man-
| pages/man3/inet_aton.3.html#DESCR...
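This behaviour is easy to confirm from Python, whose `socket.inet_aton` wraps the same classic numbers-and-dots parser (on platforms whose C library supports the shorthand, the last number fills all remaining bytes):

```python
import socket

# "1.1" parses as first octet 1, remainder 1 -> 1.0.0.1
assert socket.inet_aton("1.1") == bytes([1, 0, 0, 1])

# "1.1.1" parses as two octets plus a 16-bit remainder -> 1.1.0.1
assert socket.inet_aton("1.1.1") == bytes([1, 1, 0, 1])

# The full dotted-quad form is unchanged.
assert socket.inet_aton("1.1.1.1") == bytes([1, 1, 1, 1])
```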
| tom1337 wrote:
| Wasn't it also because a lot of hotel / public routers
| used 1.1.1.1 internally for captive portals, and
| therefore you couldn't reach the real 1.1.1.1?
| immibis wrote:
| Because operating systems have two boxes for DNS server IP
| addresses, and Cloudflare wants to be in both positions.
| codingminds wrote:
| Hasn't that always been the case?
| globular-toast wrote:
| In general there's no such thing as "DNS backup". Most clients
| just arbitrarily pick one from the list, they don't fall back
| to the other one in case of failure or anything. So if one went
| down you'd still find many requests timing out.
| JdeBP wrote:
| The reality is that it's rather complicated to say what "most
| clients" do, as there is some behavioural variation amongst
| the DNS client libraries when they are configured with
| multiple IP addresses to contact. So whilst it's true to say
| that fallback and redundancy do not always operate as one
| might suppose at the DNS client level, it is untrue to go to
| the opposite extreme and say that there's no such thing at
| all.
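On glibc systems the stub resolver's behaviour with multiple servers is at least somewhat tunable; a sketch of an /etc/resolv.conf (addresses and values are examples):

```
nameserver 1.1.1.1
nameserver 8.8.8.8
# per-server query timeout in seconds, and tries per server,
# before moving to the next nameserver in the list
options timeout:2 attempts:2
```

Even then, many stub resolvers only move to the second server on timeout, not on a SERVFAIL or NXDOMAIN answer, which is the sort of variation being described.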
| 0xbadcafebee wrote:
| In general, the idea of DNS's design is to use the DNS resolver
| closest to you, rather than the one run by the largest company.
|
| That said, it's a good idea to specifically pick multiple
| resolvers in different regions, on different backbones, using
| different providers, and _not_ use an Anycast address, because
| Anycast can get a little weird. However, this can lead to
| hard-to-troubleshoot issues, because DNS doesn't always
| behave the way you expect.
| ben0x539 wrote:
| Isn't the largest company most likely to have the DNS
| resolver closest to me?
| fragmede wrote:
| Your ISP should have a DNS resolver closer to you.
| "Should" doesn't necessarily mean faster, however.
| lxgr wrote:
| I've had ISPs with a DNS server (configured via DHCP)
| farther away than 1.1.1.1 and 8.8.8.8.
| encom wrote:
| In case of Denmark, ISP DNS also means censored. Of
| course it started with CP, as it always does, then
| expanded to copyrights, pharmaceuticals, gambling and
| "terrorism". Except for the occasional Linux ISO, I don't
| partake in any of these topics, but I'm opposed to any
| kind of censorship on principle. And naturally, this
| doesn't stop anyone, but politicians get to stand in
| front of television cameras and say they're protecting
| children and stopping terrorists.
|
| </soapbox>
| nullify88 wrote:
| Not just that. ISPs are often subject to certain data
| retention laws. For Denmark (and other EU countries) that
| may be 6 months to 2 years. And considering close ties
| with "9 eyes" means America potentially has access to my
| information anyway.
|
| Judging by Cloudflare's privacy policy, they hold less
| personally identifiable information than my ISP while
| offering EDNS and low latencies? Win, win, win.
| sschueller wrote:
| No, your ISP can have a server closer than any external
| one.
| dontTREATonme wrote:
| What's your recommendation for finding the dns resolver
| closest to me? I currently use 1.1 and 8.8, but I'm
| absolutely open to alternatives.
| LeoPanthera wrote:
| The closest DNS resolver to you is the one run by your ISP.
| JdeBP wrote:
| Actually, it's about 20cm from my left elbow, which is
| physically several orders of magnitude closer than
| _anything_ run by my ISP, and logically at least 2
| network hops closer.
|
| And the closest resolving proxy DNS server for most of my
| machines is listening on their loopback interface. The
| closest such machine happens to be about 1m away, so is
| beaten out of first place by centimetres. (-:
|
| It's a shame that Microsoft arbitrarily ties such
| functionality to the Server flavour of Windows, and does
| not supply it on the Workstation flavour, but other
| operating systems are not so artificially limited or
| helpless; and even novice users on such systems can get a
| working proxy DNS server out of the box that their sysops
| don't actually _have_ to touch.
|
| The idea that one has to rely upon an ISP, or even upon
| CloudFlare and Google and Quad9, for this stuff is a bit
| of a marketing tale that is put about by these self-same
| ISPs and CloudFlare and Google and Quad9. Not relying
| upon them is not actually limited to people who are
| skilled in system operation, i.e. who they are; but
| rather merely limited by what people run: black box
| "smart" tellies and whatnot, and the Workstation flavour
| of Microsoft Windows. Even for such machines, there's the
| option of a decent quality router/gateway or simply a
| small box providing proxy DNS on the LAN.
|
| In my case, said small box is roughly the size of my hand
| and is smaller than my mass-market SOHO router/gateway.
| (-:
| lxgr wrote:
| Is that really a win in terms of latency, considering
| that the chance of a cache hit increases with the number
| of users?
| 0xbadcafebee wrote:
| Keep in mind that low latency is a different goal than
| reliability. If you want the lowest-latency, the anycast
| address of a big company will often win out, because
| they've spent a couple million to get those numbers. If
| you want most reliable, then the closest hop to you
| _should_ be the most reliable (there's no accounting for
| poor sysadmin'ing), which is often the ISP, but sometimes
| not.
|
| If you run your own _recursive DNS server_ (I keep
| forgetting to use the right term) on a local network, you
| can hit the root servers directly, which makes that the
| most reliable possible DNS resolver. Yes you might get
| more cache misses initially, but I highly doubt you'd
| notice. (note: querying the root nameservers is bad
| netiquette; you should always cache queries to them for
| at least 5 minutes, and always use DNS resolvers to cache
| locally)
| lxgr wrote:
| > If you want most reliable, then the closest hop to you
| should be the most reliable (there's no accounting for
| poor sysadmin'ing), which is often the ISP, but sometimes
| not.
|
| I'd argue that accounting for poorly managed ISP
| resolvers is a critical part of reasoning about
| reliability.
| vel0city wrote:
| I used to run unbound at home as a full resolver, and
| ultimately this was my reason to go back to forwarding to
| other large public resolvers. So many domains seemed to
| be pretty slow to get a first query back, I had all kinds
| of odd behaviors from devices around the house getting a
| slow initial connection.
|
| Changed back to just using big resolvers and all those
| issues disappeared.
| JdeBP wrote:
| It is. If latency were important, one could always
| aggregate across a LAN with forwarding caching proxies
| pointing to a single resolving caching proxy, and gain
| economies of scale by exactly the same mechanisms. But
| latency is largely a wood-for-the-trees thing.
|
| In terms of my everyday usage, for the past couple of
| decades, cache miss delays are largely lost in the noise
| of stupidly huge WWW pages, artificial service
| greylisting delays, CAPTCHA delays, and so forth.
|
| Especially as the first step in any _full_ cache miss, a
| back-end query to the root content DNS server, is also
| just a round-trip over the loopback interface. Indeed, as
| is also the second step sometimes now, since some TLDs
| also let one mirror their data. Thank you, Estonia.
| https://news.ycombinator.com/item?id=44318136
|
| And the gains in other areas are significant. Remember
| that privacy and security are also things that people
| want.
|
| Then there's the fact that things like
| Quad9's/Google's/CloudFlare's anycasting surprisingly
| often results in hitting multiple independent servers for
| successive lookups, not yielding the cache gains that a
| superficial understanding would lead one to expect.
|
| Just for fun, I did Bender's test at
| https://news.ycombinator.com/item?id=44534938 a couple of
| days ago, in a loop. I received reset-to-maximum TTLs
| from multiple successive cache misses, on queries spaced
| merely 10 seconds apart, from all three of Quad9, Google
| Public DNS, and CloudFlare 1.1.1.1. With some maths, I
| could probably make a good estimate as to how many
| separate anycast caches on those services are answering
| me from scratch, and not actually providing the cache
| hits that one would naively think would happen.
|
| I added 127.0.0.1 to Bender's list, of course. That had 1
| cache miss at the beginning and then hit the cache every
| single time, just counting down the TTL by 10 seconds
| each iteration of the loop; although it did decide that
| 42 days was unreasonably long, and reduced it to a week.
| (-:
| baobabKoodaa wrote:
| Windows 11 doesn't allow using that combination
| bigiain wrote:
| I mean, aren't we already?
|
| My Pi-holes both use OpenDNS, Quad9, and CloudFlare for
| upstream.
|
| Most of my devices use both of my Pi-holes.
| johnklos wrote:
| If you're already running Pi-hole, wny not just run your own
| recursive, caching resolver?
| geoffpado wrote:
| This was quite annoying for me, having only switched my DNS
| server to 1.1.1.1 approximately 3 weeks ago to get around my ISP
| having a DNS outage. Is reasonably stable DNS really so much to
| ask for these days?
| bauruine wrote:
| Why not use multiple? You can use 1.1.1.1, your ISPs and google
| at the same time. Or just run a resolver yourself.
| ripdog wrote:
| >Or just run a resolver yourself.
|
| I did this for a while, but ~300ms hangs on every DNS
| resolution sure do get old fast.
| xpe wrote:
| Ouch. What resolver? What hardware?
|
| With something like a N100- or N150-based single board
| computer (perhaps around $200) running any number of open
| source DNS resolvers, I would expect you can average around
| 30 ms for cold lookups and <1 ms for cache hits.
| ripdog wrote:
| Not a hardware issue, but a physics problem. I live in
| NZ. I guess the root servers are all in the US, so that's
| 130ms per trip minimum.
| johnklos wrote:
| They are not all in the US.
| ripdog wrote:
| Well that's the experience I had. Obviously caching was
| enabled (unbound), but most DNS keepalive times are so
| short as to be fairly useless for a single user.
|
| Even if a root server wasn't in the US, it will still be
| pretty slow for me. Europe is far worse. Most of Asia has
| bad paths to me, except for Japan and Singapore which are
| marginally better than the US. Maybe Aus has one...?
| janfoeh wrote:
| According to [0], there is at least one in Auckland. No
| idea about the veracity of that site, though.
|
| [0] https://dnswatch.com/dns-docs/root-server-locations
| encom wrote:
| >DNS keepalive times are so short as to be fairly useless
|
| Incompetent admins. dnsmasq at least has an option to
| override it (--min-cache-ttl=<time>)
| passivegains wrote:
| I was going to reply about how New Zealand is as far from
| almost everywhere else as the US, but I found out
| something way more interesting: Other than servers in
| Australia and New Zealand itself, the closest ones
| actually _are_ in the US, just 3,000km north in American
| Samoa. Basically right next door. (I need to go back to
| work before my boss walks by and sees me screwing around
| on Google Maps, but I'm pretty sure the next closest are
| in French Polynesia.)
| bjoli wrote:
| Run your own forwarder locally. Technitium dns makes it easy.
| codingminds wrote:
| If you consume a service that's free of charge, it's not
| really reasonable to complain if there's an outage.
|
| As mentioned in other comments, do it yourself if you are
| not happy with the stability, or just pay someone to
| provide it - like your ISP.
|
| And TBH I trust my local ISP more than Google or CF. Not
| in availability, but it's covered by my local
| legislation. That's a huge difference - in a positive
| way.
| komali2 wrote:
| > it's at least not reasonable to complain if there's an
| outage.
|
| I don't think this is fair when discussing infrastructure.
| It's reasonable to complain about potholes, undrinkable tap
| water, long lines at the DMV, cracked (or nonexistent)
| sidewalks, etc. The internet is infrastructure and DNS
| resolution is a critical part of it. That it hasn't been
| nationalized doesn't change the fact that it's infrastructure
| (and access absolutely should be free) and therefore everyone
| should feel free to complain about it not working correctly.
|
| "But you pay taxes for drinkable tap water," yes, and we paid
| taxes to make the internet work too. For some reason, some
| governments like the USA feel it to be a good idea to add a
| middle man to spend that tax money on, but, fine, we'll
| complain about the middle man then as well.
| gkbrk wrote:
| But you can just run a recursive resolver. Plenty of
| packages to install. The root DNS servers were not
| affected, so you would have been just fine.
|
| DNS is infrastructure. But "Cloudflare Public Free DNS
| Resolver" is not, it's just a convenience and a product to
| collect data.
| JdeBP wrote:
| One can even run a private root content DNS server, and
| not be affected by root problems _either_.
|
| (This isn't a major concern, of course; and I mention it
| just to extend your argument yet further. The major gain
| of a private root content DNS server is the fraction of
| really stupid nonsense DNS traffic that comes about
| because of various things gets filtered out either on-
| machine or at least without crossing a border router. The
| gains are in security and privacy more than uptime.)
| codingminds wrote:
| You are right, infrastructure is important.
|
| But unlike tap water, there are a lot of different free
| DNS resolvers that can be used.
|
| And I don't see how my taxes funded CFs DNS service. But my
| ISP fee covers their DNS resolving setup. That's the reason
| why I wrote
|
| > a service that's free of charge
|
| Which CF is.
| komali2 wrote:
| DNS shouldn't be privatized at all since it's a critical
| part of internet infrastructure, however at the same time
| the idea that somehow it's something a corporation should
| be allowed to sell to you at all (or "give you for free")
| is silly given that the service is meaningless without
| the infrastructure of the internet, which is built by
| governments (through taxes). I can't even think of an
| equivalent it's so ridiculous that it's allowed at all,
| my best guess would be maybe, if your landlord was
| allowed to charge you for walking on the sidewalk in
| front of the apartment or something.
| codingminds wrote:
| DNS is not privatized. This is not about the root DNS
| servers, it's just about one of many free resolvers out
| there - in this case one of the bigger and popular ones.
| delfinom wrote:
| >That it hasn't been nationalized doesn't change the fact
| that it's infrastructure (and access absolutely should be
| free) and therefore everyone should feel free to complain
| about it not working correctly.
|
| >"But you pay taxes for drinkable tap water," yes, and we
| paid taxes to make the internet work too. For some reason,
| some governments like the USA feel it to be a good idea to
| add a middle man to spend that tax money on, but, fine,
| we'll complain about the middle man then as well.
|
| You don't want DNS to be nationalized. Even the US would
| have half the internet banned by now.
| chii wrote:
| > it's covered by my local legislation
|
| which might not be a good thing in some jurisdictions - see
| the porn block in the UK (it's done via dns iirc, and
| trivially bypassed with a third party dns like cloudflare's).
| codingminds wrote:
| Yeah it has its pros and cons, sadly.
|
| So far I'm lucky and the only ban I'm aware of is on
| gambling. Which is fine for me personally.
|
| But in a UK-style case I'd be using a non-local one as
| well.
| pparanoidd wrote:
| A single incident means 1.1.1.1 is no longer reasonably stable?
| You are the unreasonable one
| yjftsjthsd-h wrote:
| Although I agree 1.1.1.1 is fine: To this particular
| commenter they've had one major outage in 3 total weeks of
| use, which isn't exactly a good record. (And it's
| understandable to weigh personal experience above other
| people claiming this isn't representative.)
| cryptonym wrote:
| I have been online for 30y and can't remember being affected
| by downtime from my ISP DNS.
|
| When a DNS resolver is down, it affects everything; 100%
| uptime is a fair expectation, hence redundancy. It looks
| like both 1.0.0.1 and 1.1.1.1 were down for more than an
| hour - pretty bad TBH, especially when you advise global
| usage.
|
| The RCA is not detailed and feels like the marketing
| stunts we are now getting every other week.
| sophacles wrote:
| I too have been online for 30 years, and frequently had ISP
| caused dns issues, even when I wasn't using their
| resolvers... because of the dns interception fuckery they
| like to engage in. Before I started running my own resolver
| I saw downtime from my ISP's DNS resolver. This is across a
| few ISPs in that time. Anecdata is great isn't it?
| geoffpado wrote:
| Two incidents from two completely different providers in
| three weeks means that my personal experience with DNS is
| remarkably less stable recently than the last 20-ish years
| I've been using the Internet.
| rthnbgrredf wrote:
| Your personal experience is valuable but does not
| generalize in this case. I have had 8.8.8.8 and 1.1.1.1
| (failover) set up forever and have never experienced an
| outage.
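Stub-level failover of this kind is roughly the following logic (a sketch; the function names are hypothetical, and `query` is injected so the control flow is visible without real network I/O):

```python
import socket

def resolve_with_failover(name, servers, query, timeout=2.0):
    """Try each DNS server in order; fall back on timeout or network error."""
    last_error = None
    for server in servers:
        try:
            return query(name, server, timeout)
        except (socket.timeout, OSError) as exc:
            last_error = exc  # this server failed; try the next one
    raise last_error if last_error else OSError("no DNS servers configured")
```

Note that real stub resolvers typically fail over only on timeouts, not on SERVFAIL or NXDOMAIN answers, which is why "backup" servers don't always help.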
| jallmann wrote:
| Good writeup.
|
| > It's worth noting that DoH (DNS-over-HTTPS) traffic remained
| relatively stable as most DoH users use the domain cloudflare-
| dns.com, configured manually or through their browser, to access
| the public DNS resolver, rather than by IP address.
|
| Interesting, I was affected by this yesterday. My router
| (supposedly) had Cloudflare DoH enabled but nothing would
| resolve. Changing the DNS server to 8.8.8.8 fixed the issues.
| bauruine wrote:
| How does DoH work? Somehow you need to know the IP of
| cloudflare-dns.com first. Maybe your router uses 1.1.1.1 for
| this.
| ta1243 wrote:
| And even if you have already resolved it the TTL is only 5
| minutes
| stavros wrote:
| Are we meant to use a domain? I've always just used the IP.
| landgenoot wrote:
| You need a domain in order to get the s in https to work
| federiconafria wrote:
| What about a reverse DNS lookup?
| yread wrote:
| what about certificate for IP address?
| landgenoot wrote:
| What about a route that gets hijacked? There is no HSTS
| for IP addresses.
| sathackr wrote:
| Presumably the route hijacker wouldn't have a valid
| private key for the certificate so they wouldn't pass
| validation
| bigiain wrote:
| That's not correct.
|
| Let's Encrypt is trialling IP address HTTPS/TLS
| certificates right now:
|
| https://letsencrypt.org/2025/07/01/issuing-our-first-ip-
| addr...
|
| They say:
|
| "In principle, there's no reason that a certificate
| couldn't be issued for an IP address rather than a domain
| name, and in fact the technical and policy standards for
| certificates have always allowed this, with a handful of
| certificate authorities offering this service on a small
| scale."
| noduerme wrote:
| right, this was announced about two weeks ago to some
| fanfare. So in principle there was no reason not to do it
| two decades ago? It would've been nice back then. I never
| heard of any certificate authority offering that.
| bombcar wrote:
| At the beginning of HTTPS you were supposed to look for
| the padlock to prove it was a safe site. Scammers
| wouldn't take the time and money to get a cert, after
| all!
|
| So certs were often tied to identity, which an IP really
| isn't, so few providers offered them.
| kbolino wrote:
| An IP is about as much of an identity as a domain is.
|
| There are two main reasons IP certificates were not
| widely used in the past:
|
| - Before the SAN extension, there was just the CN, and
| there's only one CN per certificate. It would generally
| be a waste to set your only CN to a single IP address (or
| spend more money on more certs and the infrastructure to
| maintain them). A domain can resolve to multiple IPs,
| which can also be changed over time; users usually want
| to go to e.g. microsoft.com, not whatever IP that
| currently resolves to. We've had SANs for a while now, so
| this limitation is gone.
|
| - Domain validation (serve this random DNS record)
| involves ordinary forward-lookup records under your
| domain. Trying to validate IP addresses over DNS would
| involve adding records to the reverse-lookup in-addr.arpa
| domain which varies in difficulty from annoying (you work
| for a large org that owns its own /8, /16, or /24) to
| impossible (you lease out a small number of unrelated IPs
| from a bottom-dollar ISP). IP addresses are much more
| doable now thanks to HTTP validation (serve this random
| page on port 80), but that was an unnecessary/unsupported
| modality before.
| fs111 wrote:
| > I never heard of any certificate authority offering
| that.
|
| DigiCert does. That is where 1.1.1.1 and 9.9.9.9 get
| their valid certificates from
| crabique wrote:
| Most CAs offer them, the only requirement is that it's at
| least an OV (not DV) level cert, and the subject
| organization proves it owns the IP address.
| maxloh wrote:
| Nope. That is not correct. https://1.1.1.1/dns-query is a
| perfectly valid DoH resolver address I've been using for
| months.
|
| Your operating system can validate the IP address of the
| DNS response by using the Subject Alternative Name (SAN)
| field within the TLS certificate presented by the DoH
| server: https://g.co/gemini/share/40af4514cb6e
| stingraycharles wrote:
| Yeah I don't understand this part either, maybe it's supposed
| to be bootstrapped using your ISP's DNS server?
| tom1337 wrote:
| Pretty much that. You set up a bootstrap DNS server (could
| be your ISPs or any other server) which then resolves the
| IP of the DoH server which then can be used for all future
| requests.
| maxloh wrote:
| Yeah, your operating system will first need to resolve
| cloudflare-dns.com. This initial resolution will likely occur
| unencrypted via the network's default DNS. Only then will
| your system query the resolved address for its DoH requests.
|
| Note that this introduces one query overhead per DNS request
| if the previous cache has expired. For this reason, I've been
| using https://1.1.1.1/dns-query instead.
|
| In theory, this should eliminate that overhead. Your
| operating system can validate the IP address of the DNS
| response by using the Subject Alternative Name (SAN) field
| within the TLS certificate presented by the DoH server:
| https://g.co/gemini/share/40af4514cb6e
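For reference, the payload a DoH client sends is an ordinary DNS wire-format query (RFC 8484); a sketch of building one without sending it (the helper names and queried domain are illustrative):

```python
import base64
import struct

def build_query(name, qtype=1):  # qtype 1 = A record
    """Build a DNS wire-format query packet for `name`."""
    header = struct.pack(
        "!HHHHHH",
        0,       # ID 0, as RFC 8484 suggests for HTTP cache friendliness
        0x0100,  # flags: standard query, recursion desired
        1, 0, 0, 0,  # one question, no answer/authority/additional records
    )
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.split(".")
    ) + b"\x00"  # root label terminator
    return header + qname + struct.pack("!HH", qtype, 1)  # QTYPE, QCLASS=IN

def doh_get_param(name):
    """base64url-encode the packet for a GET request's ?dns= parameter."""
    return base64.urlsafe_b64encode(build_query(name)).rstrip(b"=").decode()
```

The result would go out as e.g. `GET /dns-query?dns=<param>` with an `Accept: application/dns-message` header.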
| Hamuko wrote:
| My (Unifi) router is set to automatic DoH, and I think that
| means it's using Cloudflare and Google. Didn't notice any
| disruptions so either the Cloudflare DoH kept working or it
| used the Google one while it was down.
| zahrc wrote:
| Check Jallmann's response
| https://news.ycombinator.com/item?id=44578490#44578917
|
| TL;DR: DoH was working
| Thorrez wrote:
| AFAICS, Jallmann just left 1 comment and it was top-level.
| I'm not sure what you mean by "Jallmann's response".
| sneak wrote:
| I disagree. The actual root cause here is shrouded in jargon
| that even experienced admins such as myself have to struggle to
| parse.
|
| It's corporate newspeak. "legacy" isn't a clear term, it's used
| to abstract and obfuscate.
|
| > _Legacy components do not leverage a gradual, staged
| deployment methodology. Cloudflare will deprecate these systems
| which enables modern progressive and health mediated deployment
| processes to provide earlier indication in a staged manner and
| rollback accordingly._
|
| I know what this means, but there's absolutely no reason for it
| to be written in this inscrutable corporatese.
| willejs wrote:
| If you carry on reading, it's quite obvious they
| misconfigured a service and routed production traffic to
| it instead of the correct service, and the system used to
| do that was built in 2018 and is considered legacy
| (probably because you can easily deploy bad configs).
| Given that, I wouldn't say the summary is "inscrutable
| corporatese", whatever that is.
| bigiain wrote:
| I agree it's not "inscrutable corporatese"
|
| It's carefully written so my boss's boss thinks he
| understands it, and that we cannot possibly have that
| problem because we obviously don't have any "legacy
| components" because we are "modern and progressive".
|
| It is, in my opinion, closer to "intentionally misleading
| corporatese".
| noduerme wrote:
| Joe Shmo committed the wrong config file to production.
| Innocent mistake. Sally caught it in 30 seconds. We were
| back up inside 2 minutes. Sent Joe to the margarita shop
| to recover his shattered nerves. Kid deserves a raise.
| Etc.
| sathackr wrote:
| Yea the "timeline" indicating impact start/end is
| entirely false when you look at the traffic graph shared
| later in the post.
|
| Or they have a different definition of impact than I do
| stingraycharles wrote:
| I disagree, the target audience is also going to be less
| technical people, and the gist is clear to everyone: they
| just deploy this config from 0 to 100% to production, without
| feature gates or rollback. And they made changes to the
| config that weren't deployed for weeks until some other change
| was made, which also smells like a process error.
|
| I will not say whether or not it's acceptable for a company
| of their size and maturity, but it's definitely not hidden in
| corporate lingo.
|
| I do believe they could have elaborated more on the follow-up
| steps they will take to prevent this from happening again; I
| don't think staggered roll outs are the only answer to this,
| they're just a safety net.
| noduerme wrote:
| Funny. I was configuring a new domain today, and for about 20
| minutes I could only reach it through Firefox on one laptop.
| Google's DNS tools showed it active. SSH to an Amazon server
| that could resolve it. My local network had no idea of it.
| Flush cache and all. Turns out I had that one FF browser set up
| to use Cloudflare's DoH.
| sathackr wrote:
| Good writeup except the entirely false timeline shared at the
| beginning of the post
| bartvk wrote:
| You need to clarify such a statement, in my opinion.
| angst wrote:
| I wonder how the uptime of 1.1.1.1 compares to that of 8.8.8.8
|
| Maybe there is noticeable difference?
|
| I have seen more outage incident reports of cloudflare than of
| google, but this is just a personal anecdote.
| ta1243 wrote:
| I guess it depends on where you are and what you count as an
| outage. Is a single failed query an outage?
|
| For me cloudflare 1.1.1.1 and 1.0.0.1 have a mean response time
| of 15.5ms over the last 3 months, 8.8.8.8 and 8.8.4.4 are
| 15.0ms, and 9.9.9.9 is 13.8ms.
|
| All of those servers return over 3-nines of uptime when
| quantised in the "worst result in a given 1 minute bucket" from
| my monitoring points, which seem fine to have in your mix of
| upstream providers. Personally I'd never rely on a single
| provider. Google gets 4 nines, but that's only over 90 days so
| I wouldn't draw any long term conclusions.
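For the curious, the parent's "worst result in a given 1 minute bucket" quantisation can be sketched in a few lines; the probe data is invented and `uptime` is a hypothetical helper, not any real monitoring tool:

```python
# Sketch: a minute counts as down if any probe in it failed ("worst
# result wins"), then uptime is the fraction of good minutes.
from collections import defaultdict

def uptime(probes):
    """probes: iterable of (unix_ts, ok) tuples -> uptime fraction."""
    buckets = defaultdict(lambda: True)
    for ts, ok in probes:
        minute = ts // 60
        buckets[minute] = buckets[minute] and ok  # worst result wins
    good = sum(1 for up in buckets.values() if up)
    return good / len(buckets)

# 1000 minutes of one probe each, with a single bad minute:
probes = [(m * 60, m != 500) for m in range(1000)]
print(f"{uptime(probes):.3%}")  # one bad minute out of 1000
```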
| Pharaoh2 wrote:
| https://www.dnsperf.com/#!dns-resolvers
|
| Last 30 days, 8.8.8.8 has 99.99% uptime vs 1.1.1.1 has 99.09%
| dawnerd wrote:
| Oh this explains a lot. I kept having random connection issues
| and when I disabled AdGuard dns (self hosted) it started working
| so I just assumed it was something with my vm.
| thunderbong wrote:
| How does Cloudflare compare with OpenDNS?
| blurrybird wrote:
| You'd be better off comparing it to Quad9 based on performance,
| privacy claims, and response accuracy.
| johnklos wrote:
| Cloudflare is a for-profit company in the US. Their privacy
| claims can't be believed. Even if we did believe them, we have
| no idea whether resolution data is taken by US TLA agencies.
| forbiddenlake wrote:
| Hm, what distinction are you trying to make here? OpenDNS is
| also an American company, acquired by Cisco (an American
| company) in 2015.
| johnklos wrote:
| I don't know much about OpenDNS, but yes, I wouldn't trust
| Cisco to do anything that didn't somehow push money in
| their direction. I was just offering relevant information
| about Cloudflare.
| johnklos wrote:
| It seems we have a lot of Cloudflare fanbois and apologists
| here. This is not unexpected. But is anything I'm writing
| untrue, or just unpopular? Does anyone who's downvoting me
| care to point out any inaccuracies about what I've written?
| nodesocket wrote:
| I used to configure 1.1.1.1 as primary and 8.8.8.8 as secondary
| but noticed that Cloudflare on aggregate was quicker to respond
| to queries and changed everything to use 1.1.1.1 and 1.0.0.1.
| Perhaps I'll switch back to using 8.8.8.8 as secondary, though my
| understanding is DNS will round-robin between primary and
| secondary, it's not primary and then use secondary ONLY if
| primary is down. Perhaps I am wrong though.
|
| EDIT: Appears I was wrong, it is failover not round-robin between
| the primary and secondary DNS servers. Thus, using 1.1.1.1 and
| 8.8.8.8 makes sense.
| ta1243 wrote:
| Depends on how you configure it. In resolv.conf systems for
| example you can set a timeout of say 1 second and do it as
| main/reserve, or set it up to round-robin. From memory it's
| something like "options:rotate"
|
| If you have a more advanced local resolver of some sort
| (systemd for example) you can configure whatever behaviour you
| want.
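For reference, a sketch of the glibc resolv.conf(5) syntax being half-remembered here; the default is strict failover with a per-server timeout, and `rotate` switches to round-robin (values illustrative):

```
# /etc/resolv.conf -- glibc stub resolver (see resolv.conf(5))
nameserver 1.1.1.1             # tried first
nameserver 8.8.8.8             # consulted after the primary times out
options timeout:1 attempts:2   # wait 1s per server, cycle the list twice
# options rotate               # uncomment to round-robin instead
```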
| chrismorgan wrote:
| I'm surprised at the delay in impact detection: it took their
| internal health service _more than five minutes_ to notice (or at
| least alert) that their main protocol's traffic had abruptly
| dropped to around 10% of expected and was staying there. Without
| ever having been involved in monitoring at that kind of scale,
| I'd have pictured alarms firing for something _that_ extreme
| within a minute. I'm curious for description of how and why that
| might be, and whether it's reasonable or surprising to
| professionals in that space too.
| TheDong wrote:
| I'm not surprised.
|
| Let's say you've got a metric aggregation service, and that
| service crashes.
|
| What does that result in? Metrics get delayed until your
| orchestration system redeploys that service elsewhere, which
| looks like a 100% drop in metrics.
|
| Most orchestration systems take a sec to redeploy in this case,
| assuming that it could be a temporary outage of the node (like
| a network blip of some sort).
|
| Sooo, if you alert after just a minute, you end up with people
| getting woken up at 2am for nothing.
|
| What happens if you keep waking up people at 2am for something
| that auto-resolves in 5 minutes? People quit, or eventually
| adjust the alert to 5 minutes.
|
| I know you often can differentiate no data and real drops, but
| the overall point, of "if you page people constantly, people
| will quit" I think is the important one. If people keep getting
| paged for too tight alarms, the alarms can and should be
| loosened... and that's one way you end up at 5 minutes.
| mentalgear wrote:
| It's not wrong for smaller companies. But there's an argument
| that a big system critical company/provider like Cloudflare
| should be able to afford its own always on team with a night
| shift.
| chrismorgan wrote:
| Not even a night shift, just normal working hours in
| another part of the world.
| bigiain wrote:
| There are kind of big steps/jumps as the size of a company
| goes up.
|
| Step 1: You start out with the founders being on call
| 24x7x365 or people in the first 10 or 20 hires "carry the
| pager" on weekends and evenings and your entire company
| is doing unpaid rostered on call.
|
| Step 2: You steal all the underwear.
|
| Step 3: You have follow-the-sun office-hours support
| staff teams distributed around the globe with sufficient
| coverage for vacations and unexpected illness or
| resignations.
| chrismorgan wrote:
| I confess myself bemused by your Step 2.
| bigiain wrote:
| I'm like, come on! It's a South Park reference? Surely
| everybody here gets that???
|
| <google google google>
|
| "Original air date: December 16, 1998"
|
| Oh, right. Half of you weren't even born... Now I feel
| ooooooold.
| misiek08 wrote:
| Please don't. It doesn't make sense, doesn't help, doesn't
| improve anything and is just a waste of money, time, power
| and people.
|
| Now without crying: I saw multiple, big companies getting
| rid of NOC and replacing that with on duties in multiple,
| focused teams. Instead of 12 people sitting 24/7 in group
| of 4 and doing some basic analysis and steps before calling
| others - you page correct people in 3-5 minutes, with exact
| and specific alert.
|
| Incident resolution times went greatly down (2-10x times -
| depends on company), people don't have to sit overnight and
| sleep for most of the time and no stupid actions like
| service restart taken to slow down incident resolution.
|
| And I'm not liking that some platforms hire 1500 people for
| a job that could be done with 50-100, but in terms of
| incident response - if you already have teams with
| separated responsibilities, then a NOC is "legacy".
| immibis wrote:
| 24/7 on-call is basically mandatory at any major network,
| which cloudflare is. Your contractual relations with
| other networks will require it.
| easterncalculus wrote:
| I'm not convinced that the SWE crowd of HN, particularly
| the crowd showing up to every thread about AI 'agents'
| really knows what it takes to run a global network or
| what a NOC is. I know saying this on here runs the risk
| of Vint Cerf or someone like that showing up in my
| replies, but this is seriously getting out of hand now.
| Every HN thread that isn't about fawning over AI
| companies is devolving into armchair redditor analysis of
| topics people know nothing about. This has gotten way
| worse since the pre-ChatGPT days.
| genewitch wrote:
| > Every HN thread that isn't about fawning over AI
| companies is devolving into armchair redditor analysis of
| topics people know nothing about.
|
| It took me a very long time to realize that^. I've worked
| _with_ two NOCs at two _huge_ companies, and i know they
| still exist as teams at those companies. I'm not an SWE,
| though. And I'm not certain i'd qualify either company as
| truly "global" except in the loosest sense - as in, one
| has "American" in the name of the primary subsidiary.
|
| ^ i even regularly have used "the comments were people
| incorrecting each other about <x>", so i knew
| subconsciously that HN is just a different subset of
| general internet comments. The issue comes from this site
| appearing to be moderated, and the group of people that
| select for commenting here seem like they would be above
| average at understanding and backing up claims. The
| "incorrecting" label comes from n-gate, which hasn't been
| updated since the early '20s, last i checked.
| JohnMakin wrote:
| Lol preach
|
| (Have worked as SRE at large global platform)
|
| I just mostly over the last few years tune out such
| responses and try not to engage them. The whole
| uninformed "Well, if it were me, I would simply not do
| that" kind of comment style has been pervasive on this
| site for longer than AI though, IMO.
| degamad wrote:
| The question is, which is better: 24/7 shift work (so
| that someone is always at work to respond, with disrupted
| sleep schedules at regular planned intervals) or 24/7 on-
| call (with monitoring and alerting that results in random
| intermittent disruptions to sleep, sometimes for false
| positives)?
| amelius wrote:
| I think it is reasonable if the alarm trigger time is, say
| 5-10% of the time required to fix most problems.
| amelius wrote:
| Instead of downvoting me, I'd like to know why this is
| not reasonable?
| croemer wrote:
| It's not rocket science. You do a 2 stage thing: Why not
| check if the aggregation service has crashed before firing
| the alarm if it's within the first 5 minutes? How many types
| of false positives can there be? You just need to eliminate
| the most common ones and you gradually end up with fewer of
| them.
|
| Before you fire a quick alarm, check that the node is up,
| check that the service is up etc.
| sophacles wrote:
| > How many types of false positives can there be?
|
| Operating at the scale of cloudflare? A lot.
|
| * traffic appears to be down 90% but we're only getting
| metrics from the regions of the world that are asleep
| because of some pipeline error
|
| * traffic appears to be down 90% but someone put in a
| firewall rule causing the metrics to be dropped
|
| * traffic appears to be down 90% but actually the counter
| rolled over and prometheus handled it wrong
|
| * traffic appears to be down 90% but the timing of the new
| release just caused polling to show wierd numbers
|
| * traffic appears to be down 90% but actually there was a
| metrics reporting spike and there was pipeline lag
|
| * traffic appears to be down 90% but it turns out that the
| team that handles transit links forgot to put the right
| acls around snmp so we're just not collecting metrics for
| 90% of our traffic
|
| * I keep getting alerts for traffic down 90%.... thousands
| and thousands of them, but it turns out that really its
| just that this rarely used alert had some bitrot and
| doesn't use the aggregate metrics but the per-system ones.
|
| * traffic is actually down 90% because theres an internet
| routing issue (not the dns team's problem)
|
| * traffic is actually down 90% at one datacenter because of
| a fiber cut somewhere
|
| * traffic is actually down 90% because the normal usage
| pattern is trough traffic volume is 10% of peak traffic
| volume
|
| * traffic is down 90% from 10s ago, but 10s ago there was
| an unusual spike in traffic.
|
| And then you get into all sorts of additional issues caused
| by the scale and distributed nature of a metrics system
| that monitors a huge global network of datacenters.
| __turbobrew__ wrote:
| The real issue in your hypothetical scenario is that a single
| bad metrics instance can bring the entire thing down. You could
| deploy multiple geographically distributed metrics
| aggregation services which establish the "canonical state"
| through a RAFT/PAXOS quorum. Then as long as a majority of
| metric aggregator instances are up the system will continue
| to work.
|
| When you are building systems like 1.1.1.1 having an alert
| rollup of five minutes is not acceptable as it will hide
| legitimate downtime that lasts between 0 and 5 minutes.
|
| You need to design systems which do not rely on orchestration
| to remediate short transient errors.
|
| Disclosure: I work on a core SRE team for a company with over
| 500 million users.
| perlgeek wrote:
| There's a constant tension between speed of detection and false
| positive rates.
|
| Traditional monitoring systems like Nagios and Icinga have
| settings where they only open events/alerts if a check failed
| three times in a row, because spurious failed checks are quite
| common.
|
| If you spam your operators with lots of alerts for monitoring
| checks that fix themselves, you stress them unnecessarily and
| create alert blindness, because the first reaction will be
| "let's wait if it fixes itself".
|
| I've never operated a service with as much exposure as CF's DNS
| service, but I'm not really surprised that it took 8 minutes to
| get a reliable detection.
| sbergot wrote:
| I work on the SSO stack in a b2b company with about 200k
| monthly active users. One blind spot in our monitoring is
| when an error occurs on the client's identity provider
| because of a problem on our side. The service is unusable and
| we don't have any error logs to raise an alert. We tried to
| set up an alert based on expected vs actual traffic, but we
| concluded that it would create more problems for the reason
| you provided.
| chrismorgan wrote:
| At Cloudflare's scale on 1.1.1.1, I'd imagine you could do
| something comparatively simple like track ten-minute and ten-
| second rolling averages (I know, I know, I make that sound
| much easier and more practical than it actually would be),
| and if they differ by more than 50%, sound the alarm. (Maybe
| the exact numbers would need to be tweaked, e.g. 20 seconds
| or 80%, but it's the idea.)
|
| Were it much less than 1.1.1.1 itself, taking longer than a
| minute to alarm probably wouldn't surprise me, but this is
| 1.1.1.1, they're dealing with _vasts_ amounts of probably
| fairly consistent traffic.
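The two-window idea above can be written down directly; the 10 s/10 min windows and 50% threshold are the ones the comment proposes, while the class name and traffic numbers are illustrative:

```python
# Sketch: compare a short rolling average of query rate against a long
# one and alarm when the short window collapses below half the baseline.
from collections import deque

class DropDetector:
    def __init__(self, short_n=10, long_n=600, ratio=0.5):
        self.short = deque(maxlen=short_n)  # ~10 s of per-second samples
        self.long = deque(maxlen=long_n)    # ~10 min of per-second samples
        self.ratio = ratio

    def observe(self, qps: float) -> bool:
        """Feed one per-second sample; return True if an alarm should fire."""
        self.short.append(qps)
        self.long.append(qps)
        if len(self.long) < self.long.maxlen:
            return False  # not enough history to trust the baseline yet
        short_avg = sum(self.short) / len(self.short)
        long_avg = sum(self.long) / len(self.long)
        return short_avg < self.ratio * long_avg

det = DropDetector()
for _ in range(600):
    det.observe(1000.0)                  # steady traffic: no alarm
fired = any(det.observe(100.0) for _ in range(10))  # 90% drop
print(fired)  # fires within seconds of the drop
```

In production the hard part is everything around this loop (collecting consistent per-second global counts at all), which is much of what the replies below are pointing at.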
| perlgeek wrote:
| I'm sure some engineer at cloudflare is evaluating
| something like this right now, trying it on historical
| data to see how many false positives it would've generated
| in the past, if any.
|
| Thing is, it's probably still some engineering effort, and
| most orgs only really improve their monitoring after it
| turned out to be sub-optimal.
| chrismorgan wrote:
| This is hardly the first 1.1.1.1 outage. It's also
| probably about the first external monitoring behaviour I
| imagine you'd come up with. That's why I'm surprised--
| more surprised the longer I think about it, actually;
| more than five minutes is a _really_ long delay to notice
| such a fundamental breakage.
| roughly wrote:
| Is your external monitor working? How many checks failed,
| in what order? Across how many different regions or
| systems? Was it a transient failure? How many times do
| you retry, and at what cadence? Do you push your success
| or failure metrics? Do you pull? What if your metrics
| don't make it back? How long do you wait before
| considering it a problem? What other checks do you run,
| and how long do those take? What kind of latency is
| acceptable for checks like that? How many false alarms
| are you willing to accept, and at what cadence?
| briangriffinfan wrote:
| I would want to make sure we avoid "We should always do the
| exact specific thing that would have prevented this exact
| specific issue"-style thinking.
| Anon1096 wrote:
| I work on something at a similar scale to 1.1.1.1, if we
| had this kind of setup our oncall would never be asleep
| (well, that is almost already the case, but alas). It's
| easy to say "just implement X monitor and you'd have caught
| this" but there's a real human cost and you have to work
| extremely vigilantly at deleting monitors or you'll be
| absolutely swamped with endless false positive pages. I
| don't think a 5 minute delay is unreasonable for a service
| this scale.
| chrismorgan wrote:
| This just seems kinda fundamental: the _entire service_
| was basically down, and it took 6+ minutes to notice? I'm
| just increasingly perplexed at how that could be. This
| isn't an _advanced_ monitor, this is perhaps the first
| and most important monitor I'd expect to implement (based
| on no closely relevant experience).
| roughly wrote:
| > based on no closely relevant experience
|
| I don't want to devolve this to an argument from
| authority, but - there's a lot of trade offs to
| monitoring systems, especially at that scale. Among other
| things, aggregation takes time at scale, and with enough
| metrics and numbers coming in, your variance is all over
| the place. A core fact about distributed systems at this
| scale is that something is always broken somewhere in the
| stack - the law of averages demands it, and so if you're
| going to do an all-fire-alarm alert any time part of the
| system isn't working, you've got alarms going off 24/7.
| Actually detecting that an actual incident is actually
| happening on a machine of the size and complexity we're
| talking about within 5 minutes is absolutely fantastic.
| philipwhiuk wrote:
| Remember they have no SLA for this service.
| chrismorgan wrote:
| So?
|
| They have a rather significant vested interest in it being
| reliable.
| bombcar wrote:
| This is one of those graphs that would have been on the giant
| wall in the NOC in the old days - someone would glance up and
| see it had dropped and say "that's not right" and start
| scrambling.
| seb1204 wrote:
| That's how I picture it. Is that not how it is? Everyone
| working from home and the big chart is on the TV but someone
| in the family changed channels?
| kccqzy wrote:
| Having alarms firing within a minute just becomes a stress test
| for your alarm infrastructure. Is your alarm infrastructure
| able to get metrics and perform calculations consistently
| within a minute of real time?
| bastawhiz wrote:
| The service almost certainly wasn't completely hard down at the
| time the impact began, especially if that's the start of a
| global rollout. It would have taken time for the impact to
| become measurable.
| egamirorrim wrote:
| What's that about a hijack?
| homero wrote:
| Related, non-causal event: BGP origin hijack of 1.1.1.0/24
| exposed by withdrawal of routes from Cloudflare. This was not a
| cause of the service failure, but an unrelated issue that was
| suddenly visible as that prefix was withdrawn by Cloudflare.
| kylestanfield wrote:
| So someone just started advertising the prefix when it was up
| for grabs? That's pretty funny
| woutifier wrote:
| No they were already doing that, the global withdrawal of
| the legitimate route just exposed it.
| SemioticStandrd wrote:
| How is there absolutely no further comment about that in
| their RCA? That seems like a pretty major thing...
| JdeBP wrote:
| And because people highlighted it on social media at the time
| of the outage, many thought that the bogus route _was_ the
| cause of the problem.
| ollien wrote:
| I'm a bit uneducated here - why was the other 1.1.1.0/24
| announcement previously suppressed? Did it just express a
| high enough cost that no one took it on compared to the CF
| announcement?
| whiatp wrote:
| CF had their route covered by RPKI, which at a high level
| uses certs to formalize delegation of IP address space.
|
| What caused this specific behavior is the dilemma of
| backwards compatibility when it comes to BGP security. We
| are a long way off from all routes being covered by RPKI
| (just 56% of v4 routes according to https://rpki-
| monitor.antd.nist.gov/ROV ) so invalid routes tend to be
| treated as less preferred, not rejected by BGP speakers
| that support RPKI.
| 0xbadcafebee wrote:
| > A configuration change was made for the same DLS service. The
| change attached a test location to the non-production service;
| this location itself was not live, but the change triggered a
| refresh of network configuration globally.
|
| Say what now? A test triggered a global production change?
|
| > Due to the earlier configuration error linking the 1.1.1.1
| Resolver's IP addresses to our non-production service, those
| 1.1.1.1 IPs were inadvertently included when we changed how the
| non-production service was set up.
|
| You have a process that allows some other service to just hoover
| up address routes already in use in production by a different
| service?
| sneak wrote:
| 1.1.1.1 does not operate in isolation.
|
| It is designed to be used in conjunction with 1.0.0.1. DNS has
| fault tolerance built in.
|
| Did 1.0.0.1 go down too? If so, why were they on the same
| infrastructure?
|
| This makes no sense to me. 8.8.8.8 also has 8.8.4.4. The whole
| point is that it can go down at any time and everything keeps
| working.
|
| Shouldn't the fix be to ensure that these are served out of
| completely independent silos and update all docs to make sure
| anyone using 1.1.1.1 also has 1.0.0.1 configured as a backup?
|
| If I ran a service like this I would regularly do blackouts or
| brownouts on the primary to make sure that people's resolvers are
| configured correctly. Nobody should be using a single IP as a
| point of failure for their internet access/browsing.
| detaro wrote:
| You don't need to test whether people's resolvers handle
| this cleanly, because it's already known that many don't.
| DNS fallback behavior across platforms is a mess.
| notpushkin wrote:
| > Did 1.0.0.1 go down too?
|
| Yes.
|
| > Shouldn't the fix be to ensure that these are served out of
| completely independent silos [...]?
|
| Yes.
|
| > If so, why were they on the same infrastructure?
|
| Apparently, they weren't independent enough: something in CF
| has announced both addresses and that got out.
|
| The solution for the end user is, of course, to use 1.1.1.1 and
| 8.8.8.8 (or any other combination of two _different_
| resolvers).
| rswail wrote:
| I now run unbound locally as a recursive DNS server, which really
| should be the default. There's no reason not to in modern
| routers.
|
| Not sure what the "advantage" of stub resolvers is in 2025 for
| anything.
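For anyone wanting to try this, a minimal unbound.conf sketch (option names from unbound.conf(5); values are illustrative, not a hardened config):

```
server:
    interface: 127.0.0.1            # listen on localhost only
    access-control: 127.0.0.0/8 allow
    access-control: 192.168.0.0/16 allow
    # No forward-zone block: unbound recurses from the root servers
    # itself, so no third-party resolver is a single point of failure.
    hide-identity: yes
    hide-version: yes
    prefetch: yes                   # refresh popular records before expiry
```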
| i_niks_86 wrote:
| Many commenters assume fallback behavior exists between DNS
| providers, but in practice, DNS clients - especially at the OS or
| router level -rarely implement robust failover for DoH. If you're
| using cloudflare-dns(.)com and it goes down, unless the stub
| resolver or router explicitly supports multi-provider failover
| (and uses a trust-on-first-use or pinned cert model), you're
| stuck. The illusion of redundancy with DoH needs serious UX
| rethinking.
| tankenmate wrote:
| I use routedns[0] for this specific reason: it handles almost
| all DNS protocols (UDP, TCP, DoT, DoH, DoQ, including 0-RTT).
| But more importantly it has very configurable route steering,
| even down to a record-by-record basis if you want to put up
| with all the configuration involved. It's very robust and
| very handy. I use 1.1.1.1 on my desktops and servers, and
| when the incident happened I didn't even notice, as the
| failover "just worked". I had to actually go look at the
| logs to see it had happened.
|
| [0] https://github.com/folbricht/routedns
| hkon wrote:
| To say I was surprised when I finally checked the status page of
| cloudflare is an understatement.
| v5v3 wrote:
| > For many users, not being able to resolve names using the
| 1.1.1.1 Resolver meant that basically all Internet services were
| unavailable.
|
| Don't you normally have 2 DNS servers listed on any device?
| So was the second also down? If not, why didn't it fail over
| to that?
| rat9988 wrote:
| Not all users have configured two DNS servers?
| quacksilver wrote:
| It is highly recommended to configure two or more DNS servers
| in case one is down.
|
| I would count not configuring at least two as 'user error'.
| Many systems require you to enter a primary and alternate
| server in order to save a configuration.
| tgv wrote:
| The default setting on most computers seems to be: use the
| (wifi) router. I suppose telcos like that because it keeps
| the number of DNS requests down. So I wouldn't necessarily
| see it as user error.
| SketchySeaBeast wrote:
| The funny part with that is that sites like cloudflare say
| "Oh, yeah, just use 1.0.0.1 as your alternate", when, in
| reality, it should be an entirely different service.
| daneel_w wrote:
| OK. But there's no reason or excuse not to, if they already
| manually configured a primary.
| rom1v wrote:
| On Android, in Settings, Network & internet, Private DNS, you
| can only provide one in "Private DNS provider hostname"
| (AFAIK).
|
| Btw, I really don't understand why it does not accept an IP
| (1.1.1.1), so you have to give a hostname (one.one.one.one).
| It would be more sensible to configure a DNS server from an
| IP rather than from a hostname that must itself be resolved
| by a DNS server :/
| quacksilver wrote:
| Private DNS on Android refers to 'DNS over HTTPS' and would
| normally only accept a hostname.
|
| Normal DNS can normally be changed in your connection
| settings for a given connection on most flavours of Android.
| quaintdev wrote:
| It's DNS over TLS. Android does not support DNS over HTTPS
| except Google's DNS
| KoolKat23 wrote:
| As far as I understand it, it's Google or Cloudflare?
| lxgr wrote:
| It does since Android 11.
| Tarball10 wrote:
| For a limited set of DoH providers. It does not let you
| enter a custom DoH URL, only a DoT hostname.
| rom1v wrote:
| > Private DNS on Android refers to 'DNS over HTTPS'
|
| Yes, sorry, I did not mention it.
|
| So if you want to use DNS over HTTPS on Android, it is not
| possible to provide a fallback.
| ignoramous wrote:
| > _So if you want to use DNS over HTTPS on Android, it is
| not possible to provide a fallback._
|
| Not true. If the (DoH) host has multiple A/AAAA records
| (multiple IPs), any decent DoH client would retry its
| requests over multiple or all of those IPs.
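A minimal sketch of that retry-across-IPs behavior; `query_with_failover` and the injected `send` transport are hypothetical stand-ins, not any real client's API:

```python
# Sketch: a DoH client that tries every resolved IP of its configured
# host before giving up. The transport is injected so the failover
# logic itself is testable without a network.
def query_with_failover(ips, send):
    """Try `send(ip)` against each IP in order; return the first answer."""
    last_error = None
    for ip in ips:
        try:
            return send(ip)
        except OSError as exc:  # connect/TLS failure: fall through to next IP
            last_error = exc
    raise last_error or OSError("no IPs to try")

# Example: the first IP is "down", the second answers.
def fake_send(ip):
    if ip == "1.1.1.1":
        raise OSError("connection refused")
    return {"answer": "93.184.216.34", "served_by": ip}

print(query_with_failover(["1.1.1.1", "1.0.0.1"], fake_send))
```

Of course, as this incident showed, per-host failover only helps if the alternate IPs aren't withdrawn at the same time.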
| lxgr wrote:
| Does Cloudflare offer any hostname that also resolves to
| a different organization's resolver (which must also have
| a TLS certificate for the Cloudflare hostname or DoH
| clients won't be able to connect)?
| ignoramous wrote:
| Usually, for plain old DNS, primary and secondary
| resolvers are from the same provider, serving from
| distinct IPs.
| lxgr wrote:
| Yes, but you were talking about DoH. I don't know how
| that could plausibly work.
| ignoramous wrote:
| > _but you were talking about DoH_
|
| DoH hosts can resolve to multiple IPs (and even different
| IPs for different clients)?
|
| Also see TFA:
|
| > _It's worth noting that DoH (DNS-over-HTTPS) traffic
| remained relatively stable as most DoH users use the domain
| cloudflare-dns.com, configured manually or through their
| browser, to access the public DNS resolver, rather than by IP
| address. DoH remained available and traffic was mostly
| unaffected as cloudflare-dns.com uses a different set of IP
| addresses._
| lxgr wrote:
| > DoH hosts can resolve to multiple IPs (and even
| different IPs for different clients)?
|
| Yes, but not from a different organization. That was GPs
| point with
|
| > So if you want to use DNS over HTTPS on Android, it is
| not possible to provide a fallback.
|
| A cross-organizational fallback is not possible with DoH
| in many clients, but it is with plain old DNS.
|
| > It's worth noting that DoH (DNS-over-HTTPS) traffic
| remained relatively stable as most DoH users use the
| domain cloudflare-dns.com
|
| Yes, but that has nothing to do with failovers to an
| infrastructurally/operationally separate secondary
| server.
| ignoramous wrote:
| > _A cross-organizational fallback is not possible with
| DoH in many clients, but it is with plain old DNS._
|
| That's the client implementation lacking, not some issue
| inherent to DoH?
|
| > _The DoH client is configured with a URI Template, which
| describes how to construct the URL to use for resolution.
| Configuration, discovery, and updating of the URI Template is
| done out of band from this protocol. Note that configuration
| might be manual (such as a user typing URI Templates in a
| user interface for "options") or automatic (such as URI
| Templates being supplied in responses from DHCP or similar
| protocols). DoH servers MAY support more than one URI
| Template. This allows the different endpoints to have
| different properties, such as different authentication
| requirements or service-level guarantees._
|
| https://datatracker.ietf.org/doc/html/rfc8484#section-3
| lxgr wrote:
| Yes, but this restriction of only a single DoH URL seems
| to be the norm for many popular implementations. The
| protocol theoretically allowing better behavior doesn't
| really help people using these.
| eptcyka wrote:
| Cloudflare has valid certs for 1.1.1.1
| fs111 wrote:
| No, it is not DNS over HTTPS it is DNS over TLS, which is
| different.
| lxgr wrote:
| Android 11 and newer support both DoH and DoT.
| politelemon wrote:
| Where is this option? How can I distinguish the two, the
| dialog simply asks for a host name
| zamadatix wrote:
| 1.1.1.1 is also what they call the resolver service as a whole,
| the impact section seems to be saying both 1.0.0.0/24 and
| 1.1.1.0/24 were affected (among other ranges).
| Gieron wrote:
| I think normally you pair 1.1.1.1 with 1.0.0.1 and, if I
| understand this correctly, both were down.
| Algent wrote:
| Yeah pretty much. In a perfect world you would pair it with
| another service I guess but usually you use the official
| backup IP because it's not supposed to break at the same time.
| carlhjerpe wrote:
| I would rather fall back to the slow path of resolving
| through root servers than fall back from one recursive
| resolver to another.
| moontear wrote:
| Just pair 1.1.1.1 with 9.9.9.9 (Quad9) so you have fault
| tolerance in terms of provider as well.
| rvnx wrote:
| Quad9 is reselling the traffic logs, so it means if you
| connect to secret hosts (like for your work), they will be
| leaked
| Demiurge wrote:
| Is this true? They claim that they don't keep any logs.
| Do you have a source?
| jeffbee wrote:
| They don't claim that. Less than a week ago HN discussed
| their top resolved domains report. Such a report implies
| they have logs.
| Demiurge wrote:
| From their homepage:
|
| > How Quad9 protects your privacy?
|
| > When your devices use Quad9 normally, no data
| containing your IP address is ever logged in any Quad9
| system.
|
| Of course they have some kinds of logs. Aggregating
| resolved domains without logging client IPs is not what
| the implication of "Quad9 is reselling the traffic logs"
| seems to be.
| jeffbee wrote:
| We're not discussing IP addresses, we are discussing
| whether their logs can leak your secret domain name.
| Demiurge wrote:
| That's clearer, I get your point now. Again, though,
| that's not how most people would read the original
| comment. I've never even contemplated that I might
| generate some hostnames whose existence might be
| considered sensitive. It seems like a terrible idea to
| begin with, as I'm sure there are other avenues for those
| "secret" domains to be leaked. Perhaps name your secret
| VMs vm1, vm2, ..., instead of <your root password>. But
| yeah, this is not my area of expertise, nor a concern for
| the vast majority of internet users who want more privacy
| than their ISP will provide.
|
| I am curious though, do you have any suggestions for
| alternative DNS that is better?
| jeffbee wrote:
| I use Google DNS because I feel it suits my personal
| theory of privacy threats. Among the various public DNS
| resolver services, I feel that they have the best
| technical defenses against insider snooping and outside
| hackers infiltrating their systems, and I am unperturbed
| about their permanent logs. I also don't care about
| Quad9's logs, except to the extent that it seems
| inconsistent with the privacy story they are selling. I
| used Quad9 as my resolver of last resort in my config. I
| doubt any queries actually go there in practice.
| daneel_w wrote:
| Could you show a citation? Your statement completely
| opposes Quad9's official information as published on
| quad9.net, and what's more it doesn't align at all with
| Bill Woodcock's known advocacy for privacy.
| gruez wrote:
| See: https://quad9.net/privacy/policy/
|
| It doesn't say they sell traffic logs outright, but they
| do send telemetry on blocked domains to the blocklist
| provider, and provide "a sparse statistical sampling of
| timestamped DNS responses" to "a very few carefully
| vetted security researchers". That's not exactly "selling
| traffic logs", but it is fairly close. Moreover,
| colloquially speaking, it's not uncommon to claim "Google
| sells your data", even if they don't provide dumps and
| only disclose aggregated data.
| daneel_w wrote:
| Disagree that it's fairly close to the statement "they
| resell traffic logs" and the implication that they leak
| all queried hostnames ("secret hosts, like for your work,
| will be leaked"). Unless Quad9 is deceiving users, both
| statements are, in fact, completely false.
|
| https://quad9.net/privacy/policy/#22-data-collected
| gruez wrote:
| >and the implication that they leak all queried hostnames
| ("secret hosts, like for your work, will be leaked").
|
| The part about sharing data with "a very few carefully
| vetted security researchers" doesn't preclude them from
| leaking domains. For instance if the security researcher
| exports a "SELECT COUNT(*) GROUP BY hostname" query that
| would arguably count as "summary form", and would include
| any secret hostnames.
|
| >https://quad9.net/privacy/policy/#22-data-collected
|
| If you're trying to imply that they can't possibly be
| leaking hostnames because they don't collect hostnames,
| that's directly contradicted by the subsequent sections,
| which specifically mention that they share metrics
| grouped by hostname basis. Obviously they'll need to
| collect hostname to provide such information.
| daneel_w wrote:
| I'm implying that I'm convinced they are not storing
| statistics on (thus leaking) every queried hostname. By
| your very own admission, they clearly state that they
| perform statistics on _a set of malicious domains
| provided by a third party_ , as part of their blocking
| program. Additionally they publish a "top 500 domains"
| list regularly. You're really having a go with the
| shoehorn if you want "secret domains, like for your work"
| (read: every distinct domain queried) to fit here.
| gruez wrote:
| >I'm implying that I'm convinced they are not storing
| statistics on (thus leaking) every queried hostname. By
| your very own admission, they clearly state that they
| perform statistics on a set of malicious domains provided
| by a third party, as part of their blocking program.
|
| Right, but the privacy policy also says there's a
| separate program for "a very few carefully vetted
| security researchers" where they can get data in "summary
| form", which can leak domain name in the manner I
| described in my previous comment. Maybe they have a great
| IRB (or similar) that would prevent this from happening,
| but that's not mentioned in the privacy policy. Therefore
| it's totally in the realm of possibility that secret
| domain names could be leaked, no "really having a go with
| the shoehorn" required.
| sophacles wrote:
| I'm sorry... what is a secret hostname that is publicly
| resolvable?
|
| The very idea strikes me as irresponsible and misguided.
| notpushkin wrote:
| It could be some subdomain that's hard to guess. You
| can't (generally) enumerate all subdomains through DNS,
| and if you use a wildcard TLS certificate (or self-signed
| / no cert at all), it won't be leaked to CT logs either.
| Secret hostname.
| rvnx wrote:
| Examples: github.internal.companyname.com or
| jira.corp.org or jenkins-ci.internal-finance.acme-
| corp.com or grafana.monitoring.initech.io or
| confluence.prod.internal.companyx.com etc
|
| If you don't know these hostnames, you will not be able
| to hit the backend service. But if you do know them, you
| can start exploiting it, either through lack of auth or
| by trying to exploit the software itself.
| Quad9 wrote:
| We are fully committed to end-user privacy. As a result,
| Quad9 is intentionally designed to be incapable of
| capturing end-users' PII. Our privacy policy is clear
| that queries are never associated with individual persons
| or IP addresses, and this policy is embedded in the
| technical (in)capabilities of our systems.
| rvnx wrote:
| It is about the hostnames themselves like:
| git.nationalpolice.se but I understand that there is not
| much choice if you want to keep the service free to use
| so this is fair
| staviette wrote:
| Is that really a concern for most people? Trying to keep
| hostnames secret is a losing battle anyways these days.
|
| You should probably be using a trusted TLS certificate
| for your git hosting. And that means the host name will
| end up in certificate transparency logs which are even
| easier to scrape than DNS queries.
| baobabKoodaa wrote:
| Windows 11 does not allow using this combination
| snickerdoodle12 wrote:
| Huh? Did they break the primary/secondary DNS server
| setup that has been present in all operating systems for
| decades?
| antonvs wrote:
| DNS over HTTPS adds a requirement for an additional field
| - a URL template - and Windows doesn't handle defaulting
| that correctly in all cases. If you set them manually it
| works fine.
| snickerdoodle12 wrote:
| What does that have to do with plain old dns?
| antonvs wrote:
| Nothing, but Windows can automatically use DNS over HTTPS
| if it recognizes the server, which is the source of the
| issue the other commenter mentioned.
| lxgr wrote:
| How so? Does it reject a secondary DNS server that's not
| in the same subnet or something similar?
| antonvs wrote:
| It's using DNS over HTTPS, and it doesn't default the URL
| templates correctly when mixing (some) providers. You can
| set them manually though, and it works.
| lxgr wrote:
| Ah, this is for DoH, gotcha!
|
| This "URL template" thing seems odd - is Windows doing
| something like creating a URL out of the DNS IP and a
| pattern, e.g. 1.1.1.1 + "https://<ip>/foo" would yield
| https://1.1.1.1/foo?
|
| If so, why not just allow providing an actual URL for
| each server?
| antonvs wrote:
| It does allow you to provide a URL for each server. The
| issue is just that its default behavior doesn't work for
| all providers. I have another comment in this thread
| telling the original commenter how to configure it.
| lxgr wrote:
| Very cool, thank you!
| antonvs wrote:
| You can use it, you just need to set the DNS over HTTPS
| templates correctly, since there's an issue with the
| defaults it tries to use when mixing providers.
|
| The templates you need are:
|
| 1.1.1.1: https://cloudflare-dns.com/dns-query
|
| 9.9.9.9: https://dns.quad9.net/dns-query
|
| 8.8.8.8: https://dns.google/dns-query
|
| See https://learn.microsoft.com/en-us/windows-
| server/networking/... for info on how to set the
| templates.
| baobabKoodaa wrote:
| Awesome! Thank you!
| antonvs wrote:
| You're welcome. btw I came across a description of doing
| it via the GUI here: https://github.com/Curious4Tech/DNS-
| over-HTTPS-Set-Up
| Aachen wrote:
| I became a bit disillusioned with quad9 when they started
| refusing to resolve my website. It's like wetransfer but
| supporting wget and without the AI scanning or
| interstitials. A user had uploaded malware and presumably
| sent the link to a malware scanner. Instead of reporting
| the malicious upload or blocking the specific URL1, the
| whole domain is now blocked on a DNS level. The competing
| wetransfer.com resolves just fine at 9.9.9.9
|
| I haven't been able to find any recourse. The malware was
| online for a few hours but it has been weeks and there
| seems to be no way to clear my name. Someone on github (the
| website is open source) suggested that it's probably
| because they didn't know of the website; everyone has
| heard of wetransfer and github, so those don't get the
| whole domain blocked for malicious user content. I can't
| find any other difference, but also no responsible party to
| ask. The false-positive reporting tool on quad9's website
| just reloads the page and doesn't do anything
|
| 1 I'm aware DNS can't do this, but with a direct way of
| contacting a very responsive admin (no captchas or annoying
| forms, just email), I'd not expect scanners to resort to
| blocking the domain outright to begin with, at least not
| after they heard back the first time and the problematic
| content has been cleared swiftly
| mnordhoff wrote:
| You should email them about the form and about your
| domain. Their email address is listed on the website.
| <https://quad9.net/support/contact/>
|
| Sometimes the upstream blocklist provider will be easy to
| contact directly as well. Sometimes not so much.
| ajdude wrote:
| I've been the victim of similar abuse before, for my mail
| servers and one of my community forums that I used to
| run. It's frustrating when you try to do everything right
| but you're at the mercy of a cold and uncompromising
| rules engine.
|
| You just convinced me to ditch quad9.
| Quad9 wrote:
| What is your ticket #? Let's see if we can get this
| resolved for you.
| dmitrygr wrote:
| Why not address the _REAL_ issue:
|
| > I haven't been able to find any recourse. [...] there
| seems to be no way to clear my name.
| seb1204 wrote:
| From the parent comment, the path of recourse is a ticket.
| That doesn't help if HN is needed to get it looked at.
| rvnx wrote:
| 8.8.8.8 + 1.1.1.1 is stable and mostly safe
| baobabKoodaa wrote:
| Windows 11 does not allow using this combination
| heraldgeezer wrote:
| it does if you set it on the interface
| ziml77 wrote:
| This is what I do. I have both services set in my router,
| so the full list it tries are 1.1.1.1, 1.0.0.1, 8.8.8.8,
| and 8.8.4.4
| bmicraft wrote:
| My Mikrotik router (and afaict all of them) don't support more
| than one DoH address.
| ahoka wrote:
| Or run your own, if you are able to.
| sschueller wrote:
| Yes, I would also highly recommend using a DNS server close
| to you (for those whose ISPs don't mess with their DNS via
| blocking etc., you usually get much better response times)
| and multiple servers from different providers.
|
| If your device doesn't support proper failover use a local DNS
| forwarder on your router or an external one.
|
| In Switzerland I would use Init7 (isp that doesn't filter) ->
| quad9 (unfiltered Version) -> eu dns0 (unfiltered Version)
| dylan604 wrote:
| How busy in life are you that we're concerning ourselves with
| nearest DNS? Are you browsing the internet like a high
| frequency stock trader? Seriously, in everyone's day to day,
| other than when these incidents happen, does someone notice a
| delay from resolving a domain name?
|
| I get that in theory blah blah, but we now have choices in
| who gets to see all of our requests and the ISP will always
| lose out to the other losers in the list
| jeffbee wrote:
| You know, I recently went through a period of thinking my
| MacBook was just broken. It had the janks. Everything on
| the browser was just slower than you're used to. After a
| week or two of pulling my hair, I figured it out. The
| newly-configured computer was using the DHCP-assigned DNS
| instead of Google DNS. Switched it, and it made a massive
| difference.
| dylan604 wrote:
| but that's the opposite of the request to move from a
| googDNS to a local one because of latency. so your ISP's
| DNS sucked, which is a broad statement, and is part of
| the why services like 1.1.1.1 or 8.8.8.8 exist. you
| didn't make the change of DNS because you were picking
| one based on nearest location.
| jeffbee wrote:
| There is more to latency than distance. Server response
| time is also important. In my case, the problem was that
| the DNS forwarder in the local wifi access point/router
| was very slow, even though the ICMP latency from my
| laptop to that device is obviously low.
| dylan604 wrote:
| which is well and fine, but my original comment was that
| moving to a closer DNS isn't worth it just for being
| closer especially when it is usually your ISP's server.
| so now, you're confirming that just moving closer isn't
| the solve, so it just reassures that not using the
| closest DNS is just fine.
| tredre3 wrote:
| news.ycombinator.com has a TTL of 1, so every page load
| will do one DNS request (possibly multiple).
|
| If you choose a resolver that is very far away, 100ms-longer
| page loads do add up quickly...
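A rough back-of-the-envelope, with all figures assumed for illustration, shows how a distant resolver compounds with short TTLs like the one mentioned above:

```python
# Extra waiting caused by a far-away resolver when short TTLs
# force a fresh lookup on nearly every page load.
# All figures below are assumptions, not measurements.
extra_rtt_ms = 100        # additional round trip to the distant resolver
page_loads_per_day = 500
lookups_per_load = 3      # page domain plus a few subresource domains

extra_seconds = extra_rtt_ms / 1000 * page_loads_per_day * lookups_per_load
print(f"~{extra_seconds:.0f} extra seconds of DNS waiting per day")
```

With those assumptions, that is a couple of minutes of cumulative waiting per day, which matches the "adds up quickly" intuition.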
| sumtechguy wrote:
| Even something simple like www.google.com serves from 5
| different DNS names. I have seen as high as 50. It is
| surprisingly snappier. Especially on older browsers that
| would only have 2 connections at a time open. It adds up
| faster than you would intuitively think. I used to have
| local resolvers that would mess with the TTL. But that
| was more trouble than it was worth. But it also gave a
| decent speedup. Was it 'worth' doing. Well it was kinda
| fun to mess with, I guess.
| Macha wrote:
| Cloudflare's own suggested config is to use their backup server
| 1.0.0.1 as the secondary DNS, which was also affected by this
| incident.
| stingraycharles wrote:
| TBH at this point the failure modes in which 1.1.1.1 would go
| down and 1.0.0.1 would not are not that many. At Cloudflare's
| scale, it's hardly believable that just one of these DNS
| servers would go down; it's more likely a large-scale system
| failure.
|
| But I understand why Cloudflare can't just say "use 8.8.8.8
| as your backup".
| bombcar wrote:
| At least some machines/routers do NOT have a primary and
| backup but instead randomly round-robin between them.
|
| Which means that you'd be on cloudflare half the time and
| on google half the time which may not be what you wanted.
| toast0 wrote:
| It would depend on how Cloudflare set up their systems.
| From this and other outages, I think it's pretty clear that
| they've set up their systems as a single failure domain.
| But it would be possible for them to have setup for 1.1.1.1
| and 1.0.0.1 to have separate failure domains --- separate
| infrastructure, at least some sites running one but not the
| other.
| Bluescreenbuddy wrote:
| Yup. I have Cloudflare and Quad9
| bongodongobob wrote:
| 3 at every place I've ever worked.
| Polizeiposaune wrote:
| Cloudflare recommends you configure 1.1.1.1 and 1.0.0.1 as DNS
| servers.
|
| Unfortunately, the configuration mistake that caused this
| outage disabled Cloudflare's BGP advertisements of both
| 1.1.1.0/24 and 1.0.0.0/24 prefixes to its peers.
| kingnothing wrote:
| A better recommendation is to use Cloudflare for one of your
| DNS servers and a completely different company for the other.
| butlike wrote:
| Yeah but on paper they're never going to recommend using a
| competitor
| itake wrote:
| Just wondering, how do y'all manage wifi captive portals
| while manually setting DNS servers? I used to use CF's and
| Google's, but it was so annoying to disable and re-enable
| them every time I used a public wifi network.
| udev4096 wrote:
| This is why running your own resolver is so important. Clownflare
| will always break something or backdoor something
| perlgeek wrote:
| An outage of roughly 1 hour is 0.13% of a month or 0.0114% of a
| year.
|
| It would be interesting to see the service level objective (SLO)
| that cloudflare internally has for this service.
|
| I've found https://www.cloudflare.com/r2-service-level-agreement/
| but this seems to be for paid services, so this outage would put
| July in the "< 99.9% but >= 99.0%" bucket, so you'd get a 10%
| refund for the month if you paid for it.
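The arithmetic behind those percentages, checked against the SLA table quoted above:

```python
# One hour of downtime, expressed against July (744 h) and a full year.
outage_h = 1.0
month_h = 31 * 24        # 744 hours in July
year_h = 365 * 24        # 8760 hours

pct_of_month = 100 * outage_h / month_h   # ~0.134%
pct_of_year = 100 * outage_h / year_h     # ~0.0114%
monthly_uptime = 100 - pct_of_month       # ~99.87%

# Per the SLA bucket: monthly uptime < 99.9% but >= 99.0% earns a 10% credit.
credit_pct = 10 if 99.0 <= monthly_uptime < 99.9 else 0
print(f"{pct_of_month:.2f}% of the month down, credit: {credit_pct}%")
```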
| philipwhiuk wrote:
| Probably 99.9% or better annually just from a 'maintaining
| reputation for reliability' standpoint.
| stingraycharles wrote:
| What really matters with these percentages is whether it's
| per month or per year. 99.9% per year allows for much longer
| outages than 99.9% per month.
| kachapopopow wrote:
| Interesting to see that they probably lost 20% of 1.1.1.1 usage
| from a roughly 20 minute incident.
|
| Not sure how cloudflare keeps struggling with issues like these,
| this isn't the first (and probably won't be the last) time they
| have these 'simple', 'deprecated', 'legacy' issues occurring.
|
| 8.8.8.8+8.8.4.4 hasn't had a global(1) second of downtime for
| almost a decade.
|
| 1: localized issues did exist, but that's really the fault of the
| internet and they did remain running when google itself suffered
| severe downtime in various different services.
| Tepix wrote:
| There's more to DNS than just availability (granted, it's very
| important). There's also speed and privacy.
|
| European users might prefer one of the alternatives listed at
| https://european-alternatives.eu/category/public-dns over US
| corporations subject to the CLOUD act.
| immibis wrote:
| HN users might prefer to run their own. It's a low
| maintenance service. It's not like running a mail server.
| daneel_w wrote:
| I think that might be overestimating the technical prowess
| of HN readers on the whole. Sure, it doesn't require
| wizardry to set up e.g. Unbound as a catch-all DoT
| forwarder, but it's not the click'n'play most people
| require. It should be compared to just changing the system
| resolvers to dns0, Quad9 etc.
| lossolo wrote:
| One issue here is that you can be tracked easily.
| kachapopopow wrote:
| Running your own and being the sole user is the exact same
| thing as using a dns server (you need to obtain nameservers
| for any given domain which you have to contact a dns server
| for).
| daneel_w wrote:
| Everyone, European or not, should prefer _anything_ but
| Cloudflare and Google if they feel that privacy has any
| value.
| adornKey wrote:
| I think just setting up Unbound is even less trouble. Servers
| come and go. Getting rid of the dependency altogether is
| better than having to worry who operates the DNS-servers and
| how long it's going to be available.
| genewitch wrote:
| i am 95% certain i run unbound in a datacenter, and i have
| pihole local, my PC connects to pihole first, and if that's
| down, it connects to my DC; pihole connects to the DC and
| one of the filtered DNS providers (don't remember which)
| and GTEi's old server, that still works and has never let
| me down. No, not that one, the other one.
|
| i have musknet, though, so i can't edit the DNS providers
| on the router without buying another router, so cellphones
| aren't automatically on this plan, nor are VMs and the
| like.
| adornKey wrote:
| Having a 2nd trustworthy router consumes extra energy,
| but maybe it's worth it. More than once my router made an
| update and silently disabled the pi-hole.
|
| Having a fully configured spare pi-hole in a box also
| helps. Another time my pi-hole refused to boot after a
| power outage.
| barbazoo wrote:
| Then you'd be using Google DNS though, which is undesirable
| for many.
| kod wrote:
| > Not sure how cloudflare keeps struggling with issues like
| these
|
| Cloudflare has a reasonable culture around incident response,
| but it doesn't incentivize proactive prevention.
| zamadatix wrote:
| Regarding the 20%: some clients/resolvers will mark a server as
| temporarily down if it fails to respond to multiple queries in
| a row. That way the user doesn't have to wait the timeout delay
| 500 times in a row on the next 500 queries.
|
| From the longer term graphs it looks like volume returned to
| normal https://imgur.com/a/8a1H8eL
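The down-marking behavior described above can be sketched roughly like this; the thresholds and hold-down time are made up for illustration, and real resolvers differ in the details:

```python
import time

class UpstreamPool:
    """Toy model of a resolver marking an upstream 'temporarily down'
    after repeated timeouts, so later queries skip it instead of
    waiting out the timeout again. Thresholds are illustrative."""

    def __init__(self, servers, max_fails=3, hold_down=60.0,
                 now=time.monotonic):
        self.max_fails = max_fails
        self.hold_down = hold_down
        self.now = now
        self.state = {s: {"fails": 0, "down_until": 0.0} for s in servers}

    def candidates(self):
        """Servers currently considered usable."""
        t = self.now()
        return [s for s, st in self.state.items() if st["down_until"] <= t]

    def report_failure(self, server):
        st = self.state[server]
        st["fails"] += 1
        if st["fails"] >= self.max_fails:
            # Skip this upstream for a while rather than retrying it.
            st["down_until"] = self.now() + self.hold_down
            st["fails"] = 0

    def report_success(self, server):
        self.state[server] = {"fails": 0, "down_until": 0.0}
```

Once the hold-down expires, the server re-enters the rotation, which is consistent with query volume recovering after the outage rather than during it.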
| heraldgeezer wrote:
| Yes, I honestly switched back to 8.8.8.8 and 8.8.4.4 google
| DNS. 100% stable, no filtering, fast in the EU.
| user3939382 wrote:
| You're not sure how they're struggling to fix an engineering
| problem characterized by complexity and scale encountered by
| 0.001% of network engineers?
| nness wrote:
| Interesting side-effect, the Gluetun docker image uses 1.1.1.1
| for DNS resolution -- as a result of the outage Gluetun's health
| checks failed and the images stopped.
|
| If there were some way to view torrenting traffic, no doubt
| there'd be a 20 minute slump.
| johnklos wrote:
| Personally, I'd consider any Docker image that does its own DNS
| resolution outside of the OS a Trojan.
| greggsy wrote:
| I'd love to know which legacy systems they're referring to.
| wreckage645 wrote:
| This is a good post mortem, but improvements only come with
| changes to processes. It seems every team at Cloudflare is
| approaching this in isolation, without central problem
| management. Every week we see a new Cloudflare global outage.
| It seems like the change management process is broken and
| needs to be looked at.
| sylware wrote:
| cloudflare is providing a service designed to block
| noscript/basic (x)html browsers.
|
| I know.
| trollbridge wrote:
| I got bit by this, so dnsmasq now has 1.1.1.2, Quad9, and
| Google's 8.8.8.8 with both primary and secondary.
|
| Secondary DNS is supposed to be in an independent network to
| avoid precisely this.
| neurostimulant wrote:
| I never noticed the outage because my ISP hijacks all outbound
| UDP traffic to port 53 and redirects it to their own DNS server
| so they can apply government-mandated censorship :)
| nu11ptr wrote:
| Question: Years ago, back when I used to do networking, Cisco
| Wireless controllers used 1.1.1.1 internally. They seemed to
| literally blackhole any comms to that IP in my testing. I assume
| they changed this when 1.0.0.0/8 started routing on the Internet?
| blurrybird wrote:
| Yeah part of the reason why APNIC granted Cloudflare access to
| those very lucrative IPs is to observe the misconfiguration
| volume.
|
| The theory is CF had the capacity to soak up the junk traffic
| without negatively impacting their network.
| yabones wrote:
| The general guidance for networking has been to only use IPs
| and domains that you actually control... But even 5-8 years
| ago, the last time I personally touched a cisco WLC box, it
| still had 1.1.1.1 hardcoded. Cisco loves to break their own
| rules...
| homebrewer wrote:
| This is a good time to mention that dnsmasq lets you set up
| several DNS servers, and can race them. The first responder wins.
| You won't ever notice one of the services being down:
|             all-servers
|             server=8.8.8.8
|             server=9.9.9.9
|             server=1.1.1.1
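The "first responder wins" race can be sketched with a thread pool; the resolver callables below are hypothetical stand-ins for real upstream queries, not how dnsmasq is actually implemented:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def race(resolvers, name):
    """Query every upstream at once and return the first successful
    answer, like dnsmasq's `all-servers` mode. `resolvers` is a list
    of callables standing in for real upstream queries; a failed
    upstream raises, and the race simply continues without it."""
    with ThreadPoolExecutor(max_workers=len(resolvers)) as pool:
        pending = {pool.submit(r, name) for r in resolvers}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                if fut.exception() is None:
                    return fut.result()  # first responder wins
    raise RuntimeError("all upstreams failed")
```

A down upstream contributes nothing to latency here, which is why users of this mode never noticed the outage.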
| mnordhoff wrote:
| Even without "all-servers", DNSMasq will race servers
| frequently (after 20 seconds, unless it's changed), and when
| retrying. A sudden outage should only affect you for a few
| seconds, if at all.
| anthonyryan1 wrote:
| Additionally, as long as you don't set strict-order, dnsmasq
| will automatically use all-servers for retries.
|
| If you were using systemd-resolved however, it retries all
| servers in the order they were specified, so it's important to
| interleave upstreams.
|
| Using the servers in the above example, and assuming IPv4 +
| IPv6:
|             1.1.1.1
|             2001:4860:4860::8888
|             9.9.9.9
|             2606:4700:4700::1111
|             8.8.8.8
|             2620:fe::fe
|             1.0.0.1
|             2001:4860:4860::8844
|             149.112.112.112
|             2606:4700:4700::1001
|             8.8.4.4
|             2620:fe::9
|
| will failover faster and more successfully on systemd-resolved,
| than if you specify all Cloudflare IPs together, then all
| Google IPs, etc.
|
| Also note that Quad9 is filtering by default on this IP while
| the other two are not, so you could get intermittent
| differences in resolution behavior. If this is a problem,
| don't mix filtered and unfiltered resolvers. You definitely
| shouldn't mix DNSSEC-validating and non-validating resolvers
| if you care about
| that (all of the above are DNSSEC validating).
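The interleaving described above can be generated mechanically. A small sketch, using IPv4 lists only for brevity (the comment above interleaves the IPv6 addresses the same way):

```python
from itertools import zip_longest

def interleave(*provider_lists):
    """Round-robin the per-provider address lists so that a strictly
    ordered client (like systemd-resolved) fails over across
    providers first, rather than walking one provider's whole list."""
    flat = []
    for round_ in zip_longest(*provider_lists):
        flat.extend(addr for addr in round_ if addr is not None)
    return flat

cloudflare = ["1.1.1.1", "1.0.0.1"]
quad9 = ["9.9.9.9", "149.112.112.112"]
google = ["8.8.8.8", "8.8.4.4"]
print(interleave(cloudflare, quad9, google))
# → ['1.1.1.1', '9.9.9.9', '8.8.8.8',
#    '1.0.0.1', '149.112.112.112', '8.8.4.4']
```

With this ordering, two consecutive entries never belong to the same provider, so a single-provider outage costs at most one retry.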
| matthewtse wrote:
| wow good tip
|
| I was handling an incident due to this outage. I ended up
| adding Google DNS resolvers using systemd-resolved, but I
| didn't think to interleave them!
| karel-3d wrote:
| dnsdist is AMAZINGLY easy to set up as a secure local resolver
| that forwards all queries to DoH (and checks SSL) and checks
| liveness every second
|
| I need to do a write-up one day
| jzebedee wrote:
| Please do. I'd be curious what a secure-by-default self
| hosted resolver would look like.
| daneel_w wrote:
| For what it may be worth, here's a most basic (but fully
| working) config for running Unbound as a DoT-only
| forwarder:
|             server:
|                 logfile: ""
|                 log-queries: no  # adjust as necessary
|                 interface: 127.0.0.1@53
|                 access-control: 127.0.0.0/8 allow
|                 infra-keep-probing: yes
|                 tls-system-cert: yes
|             forward-zone:
|                 name: "."
|                 forward-tls-upstream: yes
|                 forward-addr: 9.9.9.9@853#dns.quad9.net
|                 forward-addr: 193.110.81.9@853#zero.dns0.eu
|                 forward-addr: 149.112.112.112@853#dns.quad9.net
|                 forward-addr: 185.253.5.9@853#zero.dns0.eu
| karmakaze wrote:
| I don't consider these interchangeable. They have different
| priorities and policies. If anything I'd choose one and use my
| ISP default as fallback.
| nemonemo wrote:
| Agreed in principle, but has anyone seen any practical
| difference between these DNS services? What would be a more
| detailed downside for using these in parallel instead of the
| ISP default as a fallback?
| eli wrote:
| My ISP has already been caught selling personally
| identifiable customer data. I trust them less than any of
| those companies.
| sumtechguy wrote:
| My ISP's DNS got kicked to the curb once they started
| returning results for anything, including invalid sites.
| Basically to try to steer you towards their search.
| outworlder wrote:
| My ISP (one of the largest in the US) likes to hijack DNS
| responses (especially NXDOMAIN) and serve crap. No thanks.
| Which is also why I have to use encryption to talk to public
| DNS servers otherwise they will hijack anyways.
| whitehexagon wrote:
| That sounds good in principle, but is there a more private
| configuration that doesn't send DNS resolutions to Cloudflare,
| Google et al., i.e. avoiding BigTech tracking, and without DoH?
|
| dnsmasq with a list of smaller trusted DNS providers sounds
| perfect, as long as it is not considered bad etiquette to spam
| multiple DNS providers for every resolution?
|
| But where to find a trusted list of privacy focused DNS
| resolvers. The couple I tried from random internet advice
| seemed unstable.
| agolliver wrote:
| There are no good private DNS configurations, but if you
| don't trust the big caching recursive resolvers then I'd
| consider just running your own at home. Unbound is easy to
| set up and you'll probably never notice a speed difference.
| hdgvhicv wrote:
| I trust my isp far more than I trust cloudflare and google
| bagels wrote:
| Why? Some were injecting ads, blocking services,
| degrading video and other wrongdoings.
| sophacles wrote:
| You can just run unbound or similar and do your own recursive
| resolving.
| Tmpod wrote:
| Quad9 and NextDNS are usually thrown around.
| Yeri wrote:
| https://www.dns0.eu/ is an option
| mcpherrinm wrote:
| I've reviewed the privacy policy and performance of various
| DoH servers, and determined in my opinion that Cloudflare and
| Google both provide privacy-respecting policies.
|
| I believe that they follow their published policies and have
| reasonable security teams. They're also both popular
| services, which mitigates many of the other types of DNS
| tracking possible.
|
| https://developers.google.com/speed/public-dns/privacy
| https://developers.cloudflare.com/1.1.1.1/privacy/public-
| dns...
| bsilvereagle wrote:
| I haven't had any problems with OpenNIC: https://opennic.org/
|
| > OpenNIC (also referred to as the OpenNIC Project) is a user
| owned and controlled top-level Network Information Center
| offering a non-national alternative to traditional Top-Level
| Domain (TLD) registries; such as ICANN.
| hamandcheese wrote:
| NextDNS. Generous free tier, very affordable paid tier. Happy
| customer for several years and I've never noticed an outage.
| Melatonic wrote:
| This
| paradao wrote:
| Using DNSCrypt with anonymized DNS could be an option:
| https://github.com/DNSCrypt/dnscrypt-
| proxy/wiki/Anonymized-D...
| localtoast wrote:
| dnsforge.de comes to mind.
| xyst wrote:
| Probably great for users. Awful for trying to reproduce an
| issue. I prefer a more deterministic approach myself.
| itscrush wrote:
| Looks like AdGuard allows for same, thanks for mentioning
| dnsmasq support! I overlooked it on setup.
| heavyset_go wrote:
| I think systemd-resolved does something similar if you use
| that. Does DoT and DNSSEC by default.
|
| If you want to eschew centralized DNS altogether, if you run a
| Tor daemon, it has an option to expose a DNS resolver to your
| network. Multiple resolvers if you want them.
| alyandon wrote:
| Cloudflare's 1.1.1.1 Resolver service became unavailable to the
| Internet starting at 21:52 UTC and ending at 22:54 UTC
|
| Weird. According to my own telemetry from multiple networks they
| were unavailable for a lot longer than that.
| chrisgeleven wrote:
| I've been lazy and have been using only Cloudflare's resolver. In
| hindsight I probably should just setup two instances of Unbound
| on my home network that don't rely on upstream resolvers and call
| it a day. It's unlikely both will go down at the same time and if
| I'm having an total Internet outage (unlikely as I have Comcast
| as primary + T-Mobile Home Internet as a backup), it doesn't
| matter if DNS is or isn't resolving.
| tacitusarc wrote:
| Perhaps I am over-saturated, but this write-up felt like AI, at
| least largely edited by a model.
| xyst wrote:
| Am not a fan of CF in general due to their role in centralization
| of the internet around their services.
|
| But I do appreciate these types of detailed public incident
| reports and RCAs.
| zac23or wrote:
| It's no surprise that Cloudflare is having a service issue again.
|
| I use Cloudflare at work. Cloudflare has many bugs, and some
| technical decisions are absurd, such as the worker's cache.delete
| method, which only clears the cache contents in the data center
| where the Worker was invoked!!!
| https://developers.cloudflare.com/workers/runtime-apis/cache...
|
| In my experience, Cloudflare support is not helpful at all,
| trying to pass the problem on to the user, like "Just avoid
| holding it in that way".
|
| At work, I needed to use Cloudflare. The next job I get, I'll put
| a limit on my responsibilities: I don't work with Cloudflare.
|
| I will never use Cloudflare at home and I don't recommend it to
| anyone.
|
| Next week: A new post about how Cloudflare saved the web from a
| massive DDOS attack.
| kentonv wrote:
| > some technical decisions are absurd, such as the worker's
| cache.delete method, which only clears the cache contents in
| the data center where the Worker was invoked!!!
|
| The Cache API is a standard taken from browsers. In the
| browser, cache.delete obviously only deletes that browser's
| cache, not all other browsers in the world. You could certainly
| argue that a global purge would be more useful in Workers, but
| it would be inconsistent with the standard API behavior, and
| also would be extraordinarily expensive. Code designed to use
| the standard cache API would end up being much more expensive
| than expected.
|
| With all that said, we (Workers team) do generally feel in
| retrospect that the Cache API was not a good fit for our
| platform. We really wanted to follow standards, but this
| standard in this case is too specific to browsers and as a
| result does not work well for typical use cases in Cloudflare
| Workers. We'd like to replace it with something better.
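A sketch of the per-colo semantics discussed above. The `handlePurge` handler and `caches.default` are assumptions about the Workers runtime and only run inside it; the `cacheKey` helper is plain JavaScript:

```javascript
// Pure helper: normalize a URL into a stable cache key, so
// equivalent URLs (reordered query params, fragments) share one
// cache entry. Runnable anywhere.
function cacheKey(url) {
  const u = new URL(url);
  u.hash = "";           // fragments never reach the server
  u.searchParams.sort(); // stable param ordering
  return u.toString();
}

// Hypothetical Worker handler using the standard Cache API.
// Per the standard semantics, delete() only purges the entry in
// the data center running this invocation -- other colos keep
// their copies until they expire. A global purge requires
// Cloudflare's separate cache-purge mechanism instead.
async function handlePurge(request) {
  const deleted = await caches.default.delete(cacheKey(request.url));
  return new Response(deleted ? "purged locally" : "not cached here");
}
```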
| freedomben wrote:
| Just wanted to say, I always appreciate your comments and
| frankness!
| zac23or wrote:
| >cache.delete obviously only deletes that browser's cache,
| not all other browsers in the world.
|
| To me, it only makes sense if the put method creates a cache
| only in the datacenter where the Worker was invoked. Put and
| delete need to be related, in my opinion.
|
| Now I'm curious: what's the point of clearing the cache
| contents in the datacenter where the Worker was invoked? I
| can't think of any use for this method.
|
| My criticisms aren't about functionality per se or about the
| developers. I don't doubt the developers' competence, but I
| feel like there's something wrong with the company culture.
| freedomben wrote:
| Cloudflare is definitely not perfect (and when they make a
| change that breaks the existing API contract it always makes
| for several miserable days for me), but on the whole Cloudflare
| is pretty reliable.
|
| That said, I don't use workers and don't plan to. I personally
| try to stay away from non cross-platform stuff because I've
| been burned too heavily with vendor/platform lock-in in the
| past.
| kentonv wrote:
| > and when they make a change that breaks the existing API
| contract it always makes for several miserable days for me
|
| If we changed an API in Workers in a way that broke any
| Worker in production, we consider that an incident and we
| will roll it back ASAP. We really try to avoid this but
| sometimes it's hard for us to tell. Please feel free to
| contact us if this happens in the future (e.g. file a support
| ticket or file a bug on workerd on GitHub or complain in our
| Discord or email kenton@cloudflare.com).
| freedomben wrote:
| Thank you! To clarify it's been API contracts in the DNS
| record setting API that have hit me. I'm going from memory
| here and it's been a couple years I think so might be a bit
| rusty, but one example was a slight change in data type
| acceptance for TTL on a record. It used to take either a
| string or integer in the JSON but at some point started
| rejecting integers (or strings, whichever one I was sending
| at the time stopped being accepted) so the API calls were
| suddenly failing (to be fair that might not have
| technically been a violation of the contract, but it was a
| change in behavior that had been consistent for years and
| which I would not have expected). Another one was regarding
| returning zone_id for records where the zone_id stopped
| getting populated in the returned record. Luckily my code
| already had the zone_id because it needs that to build the
| URL path, but it was a rough debugging session and then I
| had to hack around it by either re-adding the zone ID to
| the returned record or removing zone ID from my equality
| check, neither of which were preferred solutions.
|
| If we start using workers though I'll definitely let you
| know if any API changes!
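A defensive-parsing sketch for the kind of type change described above. This is a hypothetical client-side helper, not Cloudflare's API: it accepts a TTL as either an int or a numeric string, so an upstream flip between the two doesn't break the caller:

```python
def normalize_ttl(value, minimum=60, maximum=86400):
    """Accept a TTL as an int or numeric string and clamp it to a
    sane range. Shields an API client from upstream type changes
    (int vs. string) like the one described in the comment above.
    """
    if isinstance(value, bool):  # bool subclasses int; reject it
        raise TypeError("TTL must be an integer or numeric string")
    if isinstance(value, str):
        value = int(value.strip())
    if not isinstance(value, int):
        raise TypeError("TTL must be an integer or numeric string")
    return max(minimum, min(maximum, value))
```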
| aftbit wrote:
| >Even though this release was peer-reviewed by multiple engineers
|
| I find it somewhat surprising that none of the multiple engineers
| who reviewed the original change in June noticed that they had
| added 1.1.1.0/24 to the list of prefixes that should be rerouted.
| I wonder what sort of human mistake or malice led to that
| original error.
|
| Perhaps it would be wise to add some hard-coded special-case
| mitigations to DLS such that it would not allow 1.1.1.1/32 or
| 1.0.0.1/32 to be reassigned to a single location.
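A guardrail of the kind suggested above might look like this. Names and the API shape are illustrative, not Cloudflare's actual DLS code: the check refuses any change that assigns address space covering the public resolver IPs to a single location:

```python
import ipaddress

# Addresses that must never collapse to a single location
# (hypothetical hard-coded list, per the suggestion above).
PROTECTED = [ipaddress.ip_network(p) for p in ("1.1.1.1/32", "1.0.0.1/32")]

def validate_prefix_change(prefix: str, locations: list[str]) -> None:
    """Raise ValueError if `prefix` overlaps a protected resolver
    address and the change would pin it to fewer than two locations."""
    net = ipaddress.ip_network(prefix)
    for protected in PROTECTED:
        if net.overlaps(protected) and len(locations) < 2:
            raise ValueError(
                f"{prefix} covers protected resolver address "
                f"{protected}; refusing single-location assignment"
            )
```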
| burnte wrote:
| It's probably much simpler, "I trust Jerry, I'm sure this is
| fine, approved."
| roughly wrote:
| I'm generally more of a "blame the tools" than a "blame the
| people" person - depending on how the system is set up and how
| the configs are generated, it's easy for a change like this to
| slip by -
| especially if a bunch of the diff is autogenerated. It's still
| humans doing code review, and this kind of failure indicates
| process problems, regardless of whether or not laziness or
| stupidity were also present.
|
| But, yes, a second mitigation here would be defense in depth -
| in an ideal world, all your systems use the same ops/deploy/etc
| stack, in this one, you probably want an extra couple steps in
| the way of potentially taking a large public service offline.
| b0rbb wrote:
| I don't know about you all but I love a well written RCA. Nicely
| done.
| alexandrutocar wrote:
| > It's worth noting that DoH (DNS-over-HTTPS) traffic remained
| relatively stable as most DoH users use the domain cloudflare-
| dns.com, configured manually or through their browser, to access
| the public DNS resolver, rather than by IP address.
|
| I use their DNS over HTTPS and if I hadn't seen the issue being
| reported here, I wouldn't have caught it at all. However, this--
| along with a chain of past incidents (including a recent
| cascading service failure caused by a third-party outage)--led me
| to reduce my dependencies. I no longer use Cloudflare Tunnels or
| Cloudflare Access, replacing them with WireGuard and mTLS
| certificates. I still use their compute and storage, but for
| personal projects only.
| cadamsdotcom wrote:
| > The way that Cloudflare manages service topologies has been
| refined over time and currently consist of a combination of a
| legacy and a strategic system that are synced.
|
| This writing is just brilliant. Clear to technical and non-
| technical readers. Makes the in-progress migration sound way more
| exciting than it probably is!
|
| > We are sorry for the disruption this incident caused for our
| customers. We are actively making these improvements to ensure
| improved stability moving forward and to prevent this problem
| from happening again.
|
| This is about as good as you can get it from a company as serious
| and important as Cloudflare. Bravo to the writers and vetters for
| not watering this down.
| kccqzy wrote:
| I can't tell if you are being sarcastic, but "legacy" is a term
| most often used by technical people whereas "strategic" is a
| term most often used by marketing and non-technical leadership.
| Mixing them together annoys both kinds of readers.
| hbay wrote:
| you were annoyed by that sentence?
| marcusb wrote:
| You cannot throw a rock without hitting a product marketer
| describing everything not-their-product as "legacy."
| nixpulvis wrote:
| Fun fact: Verizon cellular blocks 1.1.1.1. I discovered this
| after trying to use my hotspot from my Linux laptop with it set
| as my default DNS.
|
| Very frustrating.
___________________________________________________________________
(page generated 2025-07-16 23:00 UTC)