[HN Gopher] Cloudflare 1.1.1.1 Incident on July 14, 2025
___________________________________________________________________
Cloudflare 1.1.1.1 Incident on July 14, 2025
Author : nomaxx117
Score : 521 points
Date : 2025-07-16 03:44 UTC (19 hours ago)
(HTM) web link (blog.cloudflare.com)
(TXT) w3m dump (blog.cloudflare.com)
| Mindless2112 wrote:
| Interesting that traffic didn't return to completely normal
| levels after the incident.
|
| I recently started using the "luci-app-https-dns-proxy" package
| on OpenWrt, which is preconfigured to use both Cloudflare and
| Google DNS, and since DoH was mostly unaffected, I didn't notice
| an outage. (Though if DoH had been affected, it presumably would
| have failed over to Google DNS anyway.)
| anon7000 wrote:
| They go into that more towards the end; it sounds like
| some smaller % of servers needed more direct intervention.
| caconym_ wrote:
| > Interesting that traffic didn't return to completely normal
| levels after the incident.
|
| Anecdotally, I figured out their DNS was broken before it hit
| their status page and switched my upstream DNS over to Google.
| Haven't gotten around to switching back yet.
| radicaldreamer wrote:
| What would be a good reason to switch back from Google DNS?
| sammy2255 wrote:
| Depends who you trust more with your DNS traffic. I know
| who I trust more.
| nojs wrote:
| Who? Honest question
| misiek08 wrote:
| Google is serving you ads, CF isn't.
|
| And it's not a conspiracy theory - it looked very
| suspicious when we did some testing with a small,
| privacy-aware group. The traffic didn't look like it was
| being handled anonymously on Google's side.
| mnordhoff wrote:
| Unless the privacy policy changed recently, Google
| shouldn't be doing anything nefarious with 8.8.8.8 DNS
| queries.
| DarkCrusader2 wrote:
| They weren't supposed to do anything with our gmail data
| as well. That didn't stop them.
| Tijdreiziger wrote:
| [citation needed]
| johnklos wrote:
| Read their TOS.
| Tijdreiziger wrote:
| If it's in the ToS, then it's not true that "[they]
| weren't supposed to do anything with our gmail data".
| daneel_w wrote:
| Yeah it's not like they have a long track record of being
| caught red-handed stepping all over privacy regulations
| and snarfing up user activity data across their entire
| range of free products...
| opan wrote:
| CF breaks half the web with their awful challenges that
| fail in many non-mainstream browsers (even ones based on
| chromium).
| Elucalidavah wrote:
| Realistically, either you ignore the privacy concerns and
| set up routing to multiple providers preferring the
| fastest, or you go all-in on privacy and route DNS over
| Tor via a bridge.
|
| Although, perhaps, having an external VPS with a dns
| proxy could be a good middle ground?
| Tijdreiziger wrote:
| Middle ground is ISP DNS, right?
| davidcbc wrote:
| If privacy is your primary concern I would 100% trust
| Cloudflare or Google over an ISP in the US
| daneel_w wrote:
| If you're the technical type you can run Unbound locally
| (even on Windows) and let it forward queries with DoT. No
| need for either Tor or running your own external
| resolver.
| immibis wrote:
| Myself, I suppose? Recursive resolvers are low-
| maintenance, and you get less exposure to ISP censorship
| (which "developed" countries also do).
| daneel_w wrote:
| Quad9, dns0.
| Algent wrote:
| After trying both several times, I've stayed with Google,
| because Cloudflare kept returning really bad IPs for
| anything involving a CDN. Having users complain that stuff
| takes ages to load because you got matched to an IP on the
| opposite side of the planet is a bit problematic,
| especially when it rarely happens with other DNS
| providers. Maybe there is a way to fix this, but I admit I
| went for the easier option of going back to good old
| 8.8.8.8.
| homebrewer wrote:
| No, it's deliberately not implemented:
|
| https://developers.cloudflare.com/1.1.1.1/faq/#does-1111-
| sen...
|
| I've also changed to 9.9.9.9 and 8.8.8.8 after using
| 1.1.1.1 for several years because connectivity here is
| not very good, and being connected to the wrong data
| center means RTT in excess of 300 ms. Makes the web very
| sluggish.
| Aachen wrote:
| Does that setup fall back to 8.8.8.8 if 9.9.9.9 fails to
| resolve?
|
| Quad9 has a very aggressive blocking policy (my site with
| user-uploaded content was banned without even reporting
| the malicious content; if you're a big brand name it
| seems to be fine to have user-uploaded content though)
| for which this would be a possible workaround - but the
| fallback may not treat an NXDOMAIN response as a resolver
| failure.
| motorest wrote:
| > Interesting that traffic didn't return to completely normal
| levels after the incident.
|
| Clients cache DNS resolutions to avoid having to do that
| request each time they send a request. It's plausible that some
| clients held on to their cache for a significant period.
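The client-side caching described above can be sketched as a small TTL cache (the `DnsCache` class and addresses here are illustrative, not any real client's implementation):

```python
import time

class DnsCache:
    """Toy client-side DNS cache: an answer is reused until its TTL expires."""

    def __init__(self):
        self._entries = {}  # name -> (address, expiry timestamp)

    def put(self, name, address, ttl, now=None):
        now = time.monotonic() if now is None else now
        self._entries[name] = (address, now + ttl)

    def get(self, name, now=None):
        now = time.monotonic() if now is None else now
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if now >= expires_at:
            # Stale entry: drop it so the client re-queries the resolver.
            del self._entries[name]
            return None
        return address
```

A client holding such an entry keeps "working" through a resolver outage until the TTL runs out, which is one reason traffic recovers gradually rather than all at once.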
| bastawhiz wrote:
| If your Internet doesn't work, you'll get up and do other
| things for a while. I strongly suspect most folks didn't switch
| DNS providers in that time.
| CuteDepravity wrote:
| It's crazy that both 1.1.1.1 and 1.0.0.1 were affected by
| the same change.
|
| I guess now we should start using a completely different
| provider as a DNS backup - maybe 8.8.8.8 or 9.9.9.9.
| sammy2255 wrote:
| 1.1.1.1 and 1.0.0.1 are served by the same service. It's not
| advertised as a redundant fully separate backup or anything
| like that...
| yjftsjthsd-h wrote:
| Wait, then why does 1.0.0.1 exist? I'll grant I've never seen
| it advertised/documented as a backup, but I just assumed it
| must be because why else would you have two? (Given that
| 1.1.1.1 already isn't actually a single point, so I wouldn't
| think you need a second IP for load balancing reasons.)
| ta1243 wrote:
| Far quicker to type ping 1.1 than ping 1.1.1.1
|
| 1.0.0.0/24 is a different network than 1.1.1.0/24 too, so
| can be hosted elsewhere. Indeed right now 1.1.1.1 from my
| laptop goes via 141.101.71.63 and 1.0.0.1 via
| 141.101.71.121, which are both hosts on the same LINX/LON1
| peer but presumably from different routers, so there is
| some resilience there.
|
| Given DNS is about the easiest thing to avoid a single
| point of failure on I'm not sure why you would put all your
| eggs in a single company, but that seems to be the modern
| internet - centralisation over resilience because
| resilience is somehow deemed to be hard.
| yjftsjthsd-h wrote:
| > Far quicker to type ping 1.1 than ping 1.1.1.1
|
| I guess. I wouldn't have thought it worthwhile for 4
| chars, but yes.
|
| > 1.0.0.0/24 is a different network than 1.1.1.0/24 too,
| so can be hosted elsewhere.
|
| I thought anycast gave them that on a single IP, though
| perhaps this is even more resilient?
| darkwater wrote:
| Not a network expert but anycast will give you different
| routes depending on where you are. But having 2 IPs will
| give you different routes to them from the same location.
| In this case since the error was BGP related, and they
| clearly use the same system to announce both IPs, both
| were affected.
| ta1243 wrote:
| In the internet world you can't really advertise subnets
| smaller than a /24, so 1.1.1.1/32 isn't a route, it's via
| 1.1.1.0/24
|
| You can see they are separate routes, say looking at
| Telia's routing IP
|
| https://lg.telia.net/?type=bgp&router=fre-
| peer1.se&address=1...
|
| https://lg.telia.net/?type=bgp&router=fre-
| peer1.se&address=1...
|
| In this case they both are advertised from the same peer
| above, I suspect they usually are - they certainly come
| from the same AS, but they don't need to. You could have
| two peers with cloudflare with different weights for each
| /24
| kalmar wrote:
| I don't know if it's the reason, but inet_aton[0] and
| other parsing libraries that match its behaviour will
| parse 1.1 as 1.0.0.1. I use `ping 1.1` as a quick
| connectivity test.
|
| [0] https://man7.org/linux/man-
| pages/man3/inet_aton.3.html#DESCR...
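This behaviour is easy to confirm from Python, whose `socket.inet_aton` wraps the same classic numbers-and-dots parser (on platforms whose C library supports the shorthand, the last number fills all remaining bytes):

```python
import socket

# "1.1" parses as first octet 1, remainder 1 -> 1.0.0.1
assert socket.inet_aton("1.1") == bytes([1, 0, 0, 1])

# "1.1.1" parses as two octets plus a 16-bit remainder -> 1.1.0.1
assert socket.inet_aton("1.1.1") == bytes([1, 1, 0, 1])

# The full dotted-quad form is unchanged.
assert socket.inet_aton("1.1.1.1") == bytes([1, 1, 1, 1])
```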
| tom1337 wrote:
| Wasn't it also because a lot of hotel / public routers
| used 1.1.1.1 internally for captive portals, and
| therefore you couldn't reach the real 1.1.1.1?
| immibis wrote:
| Because operating systems have two boxes for DNS server IP
| addresses, and Cloudflare wants to be in both positions.
| codingminds wrote:
| Hasn't that always been the case?
| globular-toast wrote:
| In general there's no such thing as "DNS backup". Most clients
| just arbitrarily pick one from the list, they don't fall back
| to the other one in case of failure or anything. So if one went
| down you'd still find many requests timing out.
| JdeBP wrote:
| The reality is that it's rather complicated to say what "most
| clients" do, as there is some behavioural variation amongst
| the DNS client libraries when they are configured with
| multiple IP addresses to contact. So whilst it's true to say
| that fallback and redundancy do not always operate as one
| might suppose at the DNS client level, it is untrue to go to
| the opposite extreme and say that there's no such thing at
| all.
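On glibc systems the stub resolver's behaviour with multiple servers is at least somewhat tunable; a sketch of an /etc/resolv.conf (addresses and values are examples):

```
nameserver 1.1.1.1
nameserver 8.8.8.8
# per-server query timeout in seconds, and tries per server,
# before moving to the next nameserver in the list
options timeout:2 attempts:2
```

Even then, many stub resolvers only move to the second server on timeout, not on a SERVFAIL or NXDOMAIN answer, which is the sort of variation being described.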
| 0xbadcafebee wrote:
| In general, the idea of DNS's design is to use the DNS resolver
| closest to you, rather than the one run by the largest company.
|
| That said, it's a good idea to specifically pick multiple
| resolvers in different regions, on different backbones, using
| different providers, and _not_ use an Anycast address, because
| Anycast can get a little weird. However, this can lead to
| hard-to-troubleshoot issues, because DNS doesn't always
| behave the way you expect.
| ben0x539 wrote:
| Isn't the largest company most likely to have the DNS
| resolver closest to me?
| fragmede wrote:
| Your ISP should have a DNS resolver closer to you.
| "Should" doesn't necessarily mean faster, however.
| lxgr wrote:
| I've had ISPs with a DNS server (configured via DHCP)
| farther away than 1.1.1.1 and 8.8.8.8.
| encom wrote:
| In case of Denmark, ISP DNS also means censored. Of
| course it started with CP, as it always does, then
| expanded to copyrights, pharmaceuticals, gambling and
| "terrorism". Except for the occasional Linux ISO, I don't
| partake in any of these topics, but I'm opposed to any
| kind of censorship on principle. And naturally, this
| doesn't stop anyone, but politicians get to stand in
| front of television cameras and say they're protecting
| children and stopping terrorists.
|
| </soapbox>
| nullify88 wrote:
| Not just that. ISPs are often subject to certain data
| retention laws. For Denmark (and other EU countries) that
| may be 6 months to 2 years. And considering close ties
| with "9 eyes" means America potentially has access to my
| information anyway.
|
| Judging by Cloudflare's privacy policy, they hold less
| personally identifiable information than my ISP while
| offering EDNS and low latencies? Win, win, win.
| sschueller wrote:
| No, your ISP can have a server closer than any external
| one.
| dontTREATonme wrote:
| What's your recommendation for finding the dns resolver
| closest to me? I currently use 1.1 and 8.8, but I'm
| absolutely open to alternatives.
| LeoPanthera wrote:
| The closest DNS resolver to you is the one run by your ISP.
| JdeBP wrote:
| Actually, it's about 20cm from my left elbow, which is
| physically several orders of magnitude closer than
| _anything_ run by my ISP, and logically at least 2
| network hops closer.
|
| And the closest resolving proxy DNS server for most of my
| machines is listening on their loopback interface. The
| closest such machine happens to be about 1m away, so is
| beaten out of first place by centimetres. (-:
|
| It's a shame that Microsoft arbitrarily ties such
| functionality to the Server flavour of Windows, and does
| not supply it on the Workstation flavour, but other
| operating systems are not so artificially limited or
| helpless; and even novice users on such systems can get a
| working proxy DNS server out of the box that their sysops
| don't actually _have_ to touch.
|
| The idea that one has to rely upon an ISP, or even upon
| CloudFlare and Google and Quad9, for this stuff is a bit
| of a marketing tale that is put about by these self-same
| ISPs and CloudFlare and Google and Quad9. Not relying
| upon them is not actually limited to people who are
| skilled in system operation, i.e. who they are; but
| rather merely limited by what people run: black box
| "smart" tellies and whatnot, and the Workstation flavour
| of Microsoft Windows. Even for such machines, there's the
| option of a decent quality router/gateway or simply a
| small box providing proxy DNS on the LAN.
|
| In my case, said small box is roughly the size of my hand
| and is smaller than my mass-market SOHO router/gateway.
| (-:
| lxgr wrote:
| Is that really a win in terms of latency, considering
| that the chance of a cache hit increases with the number
| of users?
| 0xbadcafebee wrote:
| Keep in mind that low latency is a different goal than
| reliability. If you want the lowest-latency, the anycast
| address of a big company will often win out, because
| they've spent a couple million to get those numbers. If
| you want most reliable, then the closest hop to you
| _should_ be the most reliable (there's no accounting for
| poor sysadmin'ing), which is often the ISP, but sometimes
| not.
|
| If you run your own _recursive DNS server_ (I keep
| forgetting to use the right term) on a local network, you
| can hit the root servers directly, which makes that the
| most reliable possible DNS resolver. Yes you might get
| more cache misses initially, but I highly doubt you'd
| notice. (note: querying the root nameservers is bad
| netiquette; you should always cache queries to them for
| at least 5 minutes, and always use DNS resolvers to cache
| locally)
| lxgr wrote:
| > If you want most reliable, then the closest hop to you
| should be the most reliable (there's no accounting for
| poor sysadmin'ing), which is often the ISP, but sometimes
| not.
|
| I'd argue that accounting for poorly managed ISP
| resolvers is a critical part of reasoning about
| reliability.
| vel0city wrote:
| I used to run unbound at home as a full resolver, and
| ultimately this was my reason to go back to forwarding to
| other large public resolvers. So many domains seemed to
| be pretty slow to get a first query back, I had all kinds
| of odd behaviors from devices around the house getting a
| slow initial connection.
|
| Changed back to just using big resolvers and all those
| issues disappeared.
| JdeBP wrote:
| It is. If latency were important, one could always
| aggregate across a LAN with forwarding caching proxies
| pointing to a single resolving caching proxy, and gain
| economies of scale by exactly the same mechanisms. But
| latency is largely a wood-for-the-trees thing.
|
| In terms of my everyday usage, for the past couple of
| decades, cache miss delays are largely lost in the noise
| of stupidly huge WWW pages, artificial service
| greylisting delays, CAPTCHA delays, and so forth.
|
| Especially as the first step in any _full_ cache miss, a
| back-end query to the root content DNS server, is also
| just a round-trip over the loopback interface. Indeed, as
| is also the second step sometimes now, since some TLDs
| also let one mirror their data. Thank you, Estonia.
| https://news.ycombinator.com/item?id=44318136
|
| And the gains in other areas are significant. Remember
| that privacy and security are also things that people
| want.
|
| Then there's the fact that things like
| Quad9's/Google's/CloudFlare's anycasting surprisingly
| often results in hitting multiple independent servers for
| successive lookups, not yielding the cache gains that a
| superficial understanding would lead one to expect.
|
| Just for fun, I did Bender's test at
| https://news.ycombinator.com/item?id=44534938 a couple of
| days ago, in a loop. I received reset-to-maximum TTLs
| from multiple successive cache misses, on queries spaced
| merely 10 seconds apart, from all three of Quad9, Google
| Public DNS, and CloudFlare 1.1.1.1. With some maths, I
| could probably make a good estimate as to how many
| separate anycast caches on those services are answering
| me from scratch, and not actually providing the cache
| hits that one would naively think would happen.
|
| I added 127.0.0.1 to Bender's list, of course. That had 1
| cache miss at the beginning and then hit the cache every
| single time, just counting down the TTL by 10 seconds
| each iteration of the loop; although it did decide that
| 42 days was unreasonably long, and reduced it to a week.
| (-:
| baobabKoodaa wrote:
| Windows 11 doesn't allow using that combination
| bigiain wrote:
| I mean, aren't we already?
|
| My Pi-holes both use OpenDNS, Quad9, and CloudFlare for
| upstream.
|
| Most of my devices use both of my Pi-holes.
| johnklos wrote:
| If you're already running Pi-hole, wny not just run your own
| recursive, caching resolver?
| geoffpado wrote:
| This was quite annoying for me, having only switched my DNS
| server to 1.1.1.1 approximately 3 weeks ago to get around my ISP
| having a DNS outage. Is reasonably stable DNS really so much to
| ask for these days?
| bauruine wrote:
| Why not use multiple? You can use 1.1.1.1, your ISPs and google
| at the same time. Or just run a resolver yourself.
| ripdog wrote:
| >Or just run a resolver yourself.
|
| I did this for a while, but ~300ms hangs on every DNS
| resolution sure do get old fast.
| xpe wrote:
| Ouch. What resolver? What hardware?
|
| With something like a N100- or N150-based single board
| computer (perhaps around $200) running any number of open
| source DNS resolvers, I would expect you can average around
| 30 ms for cold lookups and <1 ms for cache hits.
| ripdog wrote:
| Not a hardware issue, but a physics problem. I live in
| NZ. I guess the root servers are all in the US, so that's
| 130ms per trip minimum.
| johnklos wrote:
| They are not all in the US.
| ripdog wrote:
| Well that's the experience I had. Obviously caching was
| enabled (unbound), but most DNS keepalive times are so
| short as to be fairly useless for a single user.
|
| Even if a root server wasn't in the US, it will still be
| pretty slow for me. Europe is far worse. Most of Asia has
| bad paths to me, except for Japan and Singapore which are
| marginally better than the US. Maybe Aus has one...?
| janfoeh wrote:
| According to [0], there is at least one in Auckland. No
| idea about the veracity of that site, though.
|
| [0] https://dnswatch.com/dns-docs/root-server-locations
| encom wrote:
| >DNS keepalive times are so short as to be fairly useless
|
| Incompetent admins. dnsmasq at least has an option to
| override it (--min-cache-ttl=<time>)
| passivegains wrote:
| I was going to reply about how New Zealand is as far from
| almost everywhere else as the US, but I found out
| something way more interesting: Other than servers in
| Australia and New Zealand itself, the closest ones
| actually _are_ in the US, just 3,000km north in American
| Samoa. Basically right next door. (I need to go back to
| work before my boss walks by and sees me screwing around
| on Google Maps, but I'm pretty sure the next closest are
| in French Polynesia.)
| bjoli wrote:
| Run your own forwarder locally. Technitium dns makes it easy.
| codingminds wrote:
| If you consume a service that's free of charge, it's not
| really reasonable to complain if there's an outage.
|
| As mentioned in other comments, do it yourself if you are
| not happy with the stability, or just pay someone to
| provide it - like your ISP.
|
| And TBH I trust my local ISP more than Google or CF. Not
| in availability, but it's covered by my local
| legislation. That's a huge difference - in a positive
| way.
| komali2 wrote:
| > it's at least not reasonable to complain if there's an
| outage.
|
| I don't think this is fair when discussing infrastructure.
| It's reasonable to complain about potholes, undrinkable tap
| water, long lines at the DMV, cracked (or nonexistent)
| sidewalks, etc. The internet is infrastructure and DNS
| resolution is a critical part of it. That it hasn't been
| nationalized doesn't change the fact that it's infrastructure
| (and access absolutely should be free) and therefore everyone
| should feel free to complain about it not working correctly.
|
| "But you pay taxes for drinkable tap water," yes, and we paid
| taxes to make the internet work too. For some reason, some
| governments like the USA feel it to be a good idea to add a
| middle man to spend that tax money on, but, fine, we'll
| complain about the middle man then as well.
| gkbrk wrote:
| But you can just run a recursive resolver. Plenty of
| packages to install. The root DNS servers were not
| affected, so you would have been just fine.
|
| DNS is infrastructure. But "Cloudflare Public Free DNS
| Resolver" is not, it's just a convenience and a product to
| collect data.
| JdeBP wrote:
| One can even run a private root content DNS server, and
| not be affected by root problems _either_.
|
| (This isn't a major concern, of course; and I mention it
| just to extend your argument yet further. The major gain
| of a private root content DNS server is the fraction of
| really stupid nonsense DNS traffic that comes about
| because of various things gets filtered out either on-
| machine or at least without crossing a border router. The
| gains are in security and privacy more than uptime.)
| codingminds wrote:
| You are right, infrastructure is important.
|
| But unlike tap water, there are a lot of different free
| DNS resolvers that can be used.
|
| And I don't see how my taxes funded CFs DNS service. But my
| ISP fee covers their DNS resolving setup. That's the reason
| why I wrote
|
| > a service that's free of charge
|
| Which CF is.
| komali2 wrote:
| DNS shouldn't be privatized at all since it's a critical
| part of internet infrastructure, however at the same time
| the idea that somehow it's something a corporation should
| be allowed to sell to you at all (or "give you for free")
| is silly given that the service is meaningless without
| the infrastructure of the internet, which is built by
| governments (through taxes). I can't even think of an
| equivalent it's so ridiculous that it's allowed at all,
| my best guess would be maybe, if your landlord was
| allowed to charge you for walking on the sidewalk in
| front of the apartment or something.
| codingminds wrote:
| DNS is not privatized. This is not about the root DNS
| servers, it's just about one of many free resolvers out
| there - in this case one of the bigger and popular ones.
| delfinom wrote:
| >That it hasn't been nationalized doesn't change the fact
| that it's infrastructure (and access absolutely should be
| free) and therefore everyone should feel free to complain
| about it not working correctly.
|
| >"But you pay taxes for drinkable tap water," yes, and we
| paid taxes to make the internet work too. For some reason,
| some governments like the USA feel it to be a good idea to
| add a middle man to spend that tax money on, but, fine,
| we'll complain about the middle man then as well.
|
| You don't want DNS to be nationalized. Even the US would
| have half the internet banned by now.
| chii wrote:
| > it's covered by my local legislation
|
| which might not be a good thing in some jurisdictions - see
| the porn block in the UK (it's done via dns iirc, and
| trivially bypassed with a third party dns like cloudflare's).
| codingminds wrote:
| Yeah it has its pros and cons, sadly.
|
| So far I'm lucky and the only ban I'm aware of is on
| gambling. Which is fine for me personally.
|
| But in a UK-style case I'd be using a non-local one as
| well.
| pparanoidd wrote:
| A single incident means 1.1.1.1 is no longer reasonably stable?
| You are the unreasonable one
| yjftsjthsd-h wrote:
| Although I agree 1.1.1.1 is fine: To this particular
| commenter they've had one major outage in 3 total weeks of
| use, which isn't exactly a good record. (And it's
| understandable to weigh personal experience above other
| people claiming this isn't representative.)
| cryptonym wrote:
| I have been online for 30y and can't remember being affected
| by downtime from my ISP DNS.
|
| When a DNS resolver is down, it affects everything; 100%
| uptime is a fair expectation, hence redundancy. It looks
| like both 1.0.0.1 and 1.1.1.1 were down for more than an
| hour - pretty bad TBH, especially when you advise global
| usage.
|
| The RCA is not detailed and feels like the marketing
| stunts we are now getting every other week.
| sophacles wrote:
| I too have been online for 30 years, and frequently had ISP
| caused dns issues, even when I wasn't using their
| resolvers... because of the dns interception fuckery they
| like to engage in. Before I started running my own resolver
| I saw downtime from my ISP's DNS resolver. This is across a
| few ISPs in that time. Anecdata is great isn't it?
| geoffpado wrote:
| Two incidents from two completely different providers in
| three weeks means that my personal experience with DNS is
| remarkably less stable recently than the last 20-ish years
| I've been using the Internet.
| rthnbgrredf wrote:
| Your personal experience is valuable but does not
| generalize in this case. I have had 8.8.8.8 and 1.1.1.1
| (failover) set up forever and have never experienced an
| outage.
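Stub-level failover of this kind is roughly the following logic (a sketch; the function names are hypothetical, and `query` is injected so the control flow is visible without real network I/O):

```python
import socket

def resolve_with_failover(name, servers, query, timeout=2.0):
    """Try each DNS server in order; fall back on timeout or network error."""
    last_error = None
    for server in servers:
        try:
            return query(name, server, timeout)
        except (socket.timeout, OSError) as exc:
            last_error = exc  # this server failed; try the next one
    raise last_error if last_error else OSError("no DNS servers configured")
```

Note that real stub resolvers typically fail over only on timeouts, not on SERVFAIL or NXDOMAIN answers, which is why "backup" servers don't always help.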
| jallmann wrote:
| Good writeup.
|
| > It's worth noting that DoH (DNS-over-HTTPS) traffic remained
| relatively stable as most DoH users use the domain cloudflare-
| dns.com, configured manually or through their browser, to access
| the public DNS resolver, rather than by IP address.
|
| Interesting, I was affected by this yesterday. My router
| (supposedly) had Cloudflare DoH enabled but nothing would
| resolve. Changing the DNS server to 8.8.8.8 fixed the issues.
| bauruine wrote:
| How does DoH work? Somehow you need to know the IP of
| cloudflare-dns.com first. Maybe your router uses 1.1.1.1 for
| this.
| ta1243 wrote:
| And even if you have already resolved it the TTL is only 5
| minutes
| stavros wrote:
| Are we meant to use a domain? I've always just used the IP.
| landgenoot wrote:
| You need a domain in order to get the s in https to work
| federiconafria wrote:
| What about a reverse DNS lookup?
| yread wrote:
| what about certificate for IP address?
| landgenoot wrote:
| What about a route that gets hijacked? There is no HSTS
| for IP addresses.
| sathackr wrote:
| Presumably the route hijacker wouldn't have a valid
| private key for the certificate so they wouldn't pass
| validation
| bigiain wrote:
| That's not correct.
|
| Let's Encrypt is trialling IP address HTTPS/TLS
| certificates right now:
|
| https://letsencrypt.org/2025/07/01/issuing-our-first-ip-
| addr...
|
| They say:
|
| "In principle, there's no reason that a certificate
| couldn't be issued for an IP address rather than a domain
| name, and in fact the technical and policy standards for
| certificates have always allowed this, with a handful of
| certificate authorities offering this service on a small
| scale."
| noduerme wrote:
| right, this was announced about two weeks ago to some
| fanfare. So in principle there was no reason not to do it
| two decades ago? It would've been nice back then. I never
| heard of any certificate authority offering that.
| bombcar wrote:
| At the beginning of HTTPS you were supposed to look for
| the padlock to prove it was a safe site. Scammers
| wouldn't take the time and money to get a cert, after
| all!
|
| So certs were often tied to identity, which an IP really
| isn't, so few providers offered them.
| kbolino wrote:
| An IP is about as much of an identity as a domain is.
|
| There are two main reasons IP certificates were not
| widely used in the past:
|
| - Before the SAN extension, there was just the CN, and
| there's only one CN per certificate. It would generally
| be a waste to set your only CN to a single IP address (or
| spend more money on more certs and the infrastructure to
| maintain them). A domain can resolve to multiple IPs,
| which can also be changed over time; users usually want
| to go to e.g. microsoft.com, not whatever IP that
| currently resolves to. We've had SANs for a while now, so
| this limitation is gone.
|
| - Domain validation (serve this random DNS record)
| involves ordinary forward-lookup records under your
| domain. Trying to validate IP addresses over DNS would
| involve adding records to the reverse-lookup in-addr.arpa
| domain which varies in difficulty from annoying (you work
| for a large org that owns its own /8, /16, or /24) to
| impossible (you lease out a small number of unrelated IPs
| from a bottom-dollar ISP). IP addresses are much more
| doable now thanks to HTTP validation (serve this random
| page on port 80), but that was an unnecessary/unsupported
| modality before.
| fs111 wrote:
| > I never heard of any certificate authority offering
| that.
|
| DigiCert does. That is where 1.1.1.1 and 9.9.9.9 get
| their valid certificates from
| crabique wrote:
| Most CAs offer them, the only requirement is that it's at
| least an OV (not DV) level cert, and the subject
| organization proves it owns the IP address.
| maxloh wrote:
| Nope. That is not correct. https://1.1.1.1/dns-query is a
| perfectly valid DoH resolver address I've been using for
| months.
|
| Your operating system can validate the IP address of the
| DNS response by using the Subject Alternative Name (SAN)
| field within the TLS certificate presented by the DoH
| server: https://g.co/gemini/share/40af4514cb6e
| stingraycharles wrote:
| Yeah I don't understand this part either, maybe it's supposed
| to be bootstrapped using your ISP's DNS server?
| tom1337 wrote:
| Pretty much that. You set up a bootstrap DNS server (could
| be your ISPs or any other server) which then resolves the
| IP of the DoH server which then can be used for all future
| requests.
| maxloh wrote:
| Yeah, your operating system will first need to resolve
| cloudflare-dns.com. This initial resolution will likely occur
| unencrypted via the network's default DNS. Only then will
| your system query the resolved address for its DoH requests.
|
| Note that this introduces one query overhead per DNS request
| if the previous cache has expired. For this reason, I've been
| using https://1.1.1.1/dns-query instead.
|
| In theory, this should eliminate that overhead. Your
| operating system can validate the IP address of the DNS
| response by using the Subject Alternative Name (SAN) field
| within the TLS certificate presented by the DoH server:
| https://g.co/gemini/share/40af4514cb6e
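For reference, the payload a DoH client sends is an ordinary DNS wire-format query (RFC 8484); a sketch of building one without sending it (the helper names and queried domain are illustrative):

```python
import base64
import struct

def build_query(name, qtype=1):  # qtype 1 = A record
    """Build a DNS wire-format query packet for `name`."""
    header = struct.pack(
        "!HHHHHH",
        0,       # ID 0, as RFC 8484 suggests for HTTP cache friendliness
        0x0100,  # flags: standard query, recursion desired
        1, 0, 0, 0,  # one question, no answer/authority/additional records
    )
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.split(".")
    ) + b"\x00"  # root label terminator
    return header + qname + struct.pack("!HH", qtype, 1)  # QTYPE, QCLASS=IN

def doh_get_param(name):
    """base64url-encode the packet for a GET request's ?dns= parameter."""
    return base64.urlsafe_b64encode(build_query(name)).rstrip(b"=").decode()
```

The result would go out as e.g. `GET /dns-query?dns=<param>` with an `Accept: application/dns-message` header.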
| Hamuko wrote:
| My (Unifi) router is set to automatic DoH, and I think that
| means it's using Cloudflare and Google. Didn't notice any
| disruptions so either the Cloudflare DoH kept working or it
| used the Google one while it was down.
| zahrc wrote:
| Check Jallmann's response
| https://news.ycombinator.com/item?id=44578490#44578917
|
| TL;DR: DoH was working
| Thorrez wrote:
| AFAICS, Jallmann just left 1 comment and it was top-level.
| I'm not sure what you mean by "Jallmann's response".
| sneak wrote:
| I disagree. The actual root cause here is shrouded in jargon
| that even experienced admins such as myself have to struggle to
| parse.
|
| It's corporate newspeak. "legacy" isn't a clear term, it's used
| to abstract and obfuscate.
|
| > _Legacy components do not leverage a gradual, staged
| deployment methodology. Cloudflare will deprecate these systems
| which enables modern progressive and health mediated deployment
| processes to provide earlier indication in a staged manner and
| rollback accordingly._
|
| I know what this means, but there's absolutely no reason for it
| to be written in this inscrutable corporatese.
| willejs wrote:
| If you carry on reading, it's quite obvious they
| misconfigured a service and routed production traffic to
| it instead of the correct service, and the system used to
| do that was built in 2018 and is considered legacy
| (probably because you can easily deploy bad configs).
| Given that, I wouldn't say the summary is "inscrutable
| corporatese", whatever that is.
| bigiain wrote:
| I agree it's not "inscrutable corporatese"
|
| It's carefully written so my boss's boss thinks he
| understands it, and that we cannot possibly have that
| problem because we obviously don't have any "legacy
| components" because we are "modern and progressive".
|
| It is, in my opinion, closer to "intentionally misleading
| corporatese".
| noduerme wrote:
| Joe Shmo committed the wrong config file to production.
| Innocent mistake. Sally caught it in 30 seconds. We were
| back up inside 2 minutes. Sent Joe to the margarita shop
| to recover his shattered nerves. Kid deserves a raise.
| Etc.
| sathackr wrote:
| Yea the "timeline" indicating impact start/end is
| entirely false when you look at the traffic graph shared
| later in the post.
|
| Or they have a different definition of impact than I do
| stingraycharles wrote:
| I disagree, the target audience is also going to be less
| technical people, and the gist is clear to everyone: they
| just deploy this config from 0 to 100% to production, without
| feature gates or rollback. And they made changes to the
| config that weren't deployed for weeks until some other change
| was made, which also smells like a process error.
|
| I will not say whether or not it's acceptable for a company
| of their size and maturity, but it's definitely not hidden in
| corporate lingo.
|
| I do believe they could have elaborated more on the follow-up
| steps they will take to prevent this from happening again; I
| don't think staggered roll outs are the only answer to this,
| they're just a safety net.
| noduerme wrote:
| Funny. I was configuring a new domain today, and for about 20
| minutes I could only reach it through Firefox on one laptop.
| Google's DNS tools showed it active. SSH to an Amazon server
| that could resolve it. My local network had no idea of it.
| Flush cache and all. Turns out I had that one FF browser set up
| to use Cloudflare's DoH.
| sathackr wrote:
| Good writeup except the entirely false timeline shared at the
| beginning of the post
| bartvk wrote:
| You need to clarify such a statement, in my opinion.
| angst wrote:
| I wonder how the uptime of 1.1.1.1 compares to that of 8.8.8.8
|
| Maybe there is noticeable difference?
|
| I have seen more outage incident reports of cloudflare than of
| google, but this is just a personal anecdote.
| ta1243 wrote:
| I guess it depends on where you are and what you count as an
| outage. Is a single failed query an outage?
|
| For me cloudflare 1.1.1.1 and 1.0.0.1 have a mean response time
| of 15.5ms over the last 3 months, 8.8.8.8 and 8.8.4.4 are
| 15.0ms, and 9.9.9.9 is 13.8ms.
|
| All of those servers return over 3-nines of uptime when
| quantised in the "worst result in a given 1 minute bucket" from
| my monitoring points, which seem fine to have in your mix of
| upstream providers. Personally I'd never rely on a single
| provider. Google gets 4 nines, but that's only over 90 days so
| I wouldn't draw any long term conclusions.
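For the curious, the parent's "worst result in a given 1 minute bucket" quantisation can be sketched in a few lines; the probe data is invented and `uptime` is a hypothetical helper, not any real monitoring tool:

```python
# Sketch: a minute counts as down if any probe in it failed ("worst
# result wins"), then uptime is the fraction of good minutes.
from collections import defaultdict

def uptime(probes):
    """probes: iterable of (unix_ts, ok) tuples -> uptime fraction."""
    buckets = defaultdict(lambda: True)
    for ts, ok in probes:
        minute = ts // 60
        buckets[minute] = buckets[minute] and ok  # worst result wins
    good = sum(1 for up in buckets.values() if up)
    return good / len(buckets)

# 1000 minutes of one probe each, with a single bad minute:
probes = [(m * 60, m != 500) for m in range(1000)]
print(f"{uptime(probes):.3%}")  # one bad minute out of 1000
```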
| Pharaoh2 wrote:
| https://www.dnsperf.com/#!dns-resolvers
|
| Last 30 days, 8.8.8.8 has 99.99% uptime vs 1.1.1.1 has 99.09%
| dawnerd wrote:
| Oh this explains a lot. I kept having random connection issues
| and when I disabled AdGuard dns (self hosted) it started working
| so I just assumed it was something with my vm.
| thunderbong wrote:
| How does Cloudflare compare with OpenDNS?
| blurrybird wrote:
| You'd be better off comparing it to Quad9 based on performance,
| privacy claims, and response accuracy.
| johnklos wrote:
| Cloudflare is a for-profit company in the US. Their privacy
| claims can't be believed. Even if we did believe them, we have
| no idea whether resolution data is taken by US TLA agencies.
| forbiddenlake wrote:
| Hm, what distinction are you trying to make here? OpenDNS is
| also an American company, acquired by Cisco (an American
| company) in 2015.
| johnklos wrote:
| I don't know much about OpenDNS, but yes, I wouldn't trust
| Cisco to do anything that didn't somehow push money in
| their direction. I was just offering relevant information
| about Cloudflare.
| johnklos wrote:
| It seems we have a lot of Cloudflare fanbois and apologists
| here. This is not unexpected. But is anything I'm writing
| untrue, or just unpopular? Does anyone who's downvoting me
| care to point out any inaccuracies about what I've written?
| nodesocket wrote:
| I used to configure 1.1.1.1 as primary and 8.8.8.8 as secondary
| but noticed that Cloudflare on aggregate was quicker to respond
| to queries and changed everything to use 1.1.1.1 and 1.0.0.1.
| Perhaps I'll switch back to using 8.8.8.8 as secondary, though my
| understanding is DNS will round-robin between primary and
| secondary, it's not primary and then use secondary ONLY if
| primary is down. Perhaps I am wrong though.
|
| EDIT: Appears I was wrong, it is failover not round-robin between
| the primary and secondary DNS servers. Thus, using 1.1.1.1 and
| 8.8.8.8 makes sense.
| ta1243 wrote:
| Depends on how you configure it. In resolv.conf systems for
| example you can set a timeout of say 1 second and do it as
| main/reserve, or set it up to round-robin. From memory it's
| something like "options:rotate"
|
| If you have a more advanced local resolver of some sort
| (systemd for example) you can configure whatever behaviour you
| want.
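For reference, a sketch of the glibc resolv.conf(5) syntax being half-remembered here; the default is strict failover with a per-server timeout, and `rotate` switches to round-robin (values illustrative):

```
# /etc/resolv.conf -- glibc stub resolver (see resolv.conf(5))
nameserver 1.1.1.1             # tried first
nameserver 8.8.8.8             # consulted after the primary times out
options timeout:1 attempts:2   # wait 1s per server, cycle the list twice
# options rotate               # uncomment to round-robin instead
```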
| chrismorgan wrote:
| I'm surprised at the delay in impact detection: it took their
| internal health service _more than five minutes_ to notice (or at
| least alert) that their main protocol's traffic had abruptly
| dropped to around 10% of expected and was staying there. Without
| ever having been involved in monitoring at that kind of scale,
| I'd have pictured alarms firing for something _that_ extreme
| within a minute. I'm curious for description of how and why that
| might be, and whether it's reasonable or surprising to
| professionals in that space too.
| TheDong wrote:
| I'm not surprised.
|
| Let's say you've got a metric aggregation service, and that
| service crashes.
|
| What does that result in? Metrics get delayed until your
| orchestration system redeploys that service elsewhere, which
| looks like a 100% drop in metrics.
|
| Most orchestration systems take a sec to redeploy in this case,
| assuming that it could be a temporary outage of the node (like
| a network blip of some sort).
|
| Sooo, if you alert after just a minute, you end up with people
| getting woken up at 2am for nothing.
|
| What happens if you keep waking up people at 2am for something
| that auto-resolves in 5 minutes? People quit, or eventually
| adjust the alert to 5 minutes.
|
| I know you often can differentiate no data and real drops, but
| the overall point, of "if you page people constantly, people
| will quit" I think is the important one. If people keep getting
| paged for too tight alarms, the alarms can and should be
| loosened... and that's one way you end up at 5 minutes.
| mentalgear wrote:
| It's not wrong for smaller companies. But there's an argument
| that a big system critical company/provider like Cloudflare
| should be able to afford its own always on team with a night
| shift.
| chrismorgan wrote:
| Not even a night shift, just normal working hours in
| another part of the world.
| bigiain wrote:
| There are kind of big steps/jumps as the size of a company
| goes up.
|
| Step 1: You start out with the founders being on call
| 24x7x365 or people in the first 10 or 20 hires "carry the
| pager" on weekends and evenings and your entire company
| is doing unpaid rostered on call.
|
| Step 2: You steal all the underwear.
|
| Step 3: You have follow-the-sun office-hours support
| staff teams distributed around the globe with sufficient
| coverage for vacations and unexpected illness or
| resignations.
| chrismorgan wrote:
| I confess myself bemused by your Step 2.
| bigiain wrote:
| I'm like, come on! It's a South Park reference? Surely
| everybody here gets that???
|
| <google google google>
|
| "Original air date: December 16, 1998"
|
| Oh, right. Half of you weren't even born... Now I feel
| ooooooold.
| misiek08 wrote:
| Please don't. It doesn't make sense, doesn't help, doesn't
| improve anything and is just a waste of money, time, power
| and people.
|
| Now without crying: I saw multiple, big companies getting
| rid of NOC and replacing that with on duties in multiple,
| focused teams. Instead of 12 people sitting 24/7 in group
| of 4 and doing some basic analysis and steps before calling
| others - you page correct people in 3-5 minutes, with exact
| and specific alert.
|
| Incident resolution times went greatly down (2-10x times -
| depends on company), people don't have to sit overnight and
| sleep for most of the time and no stupid actions like
| service restart taken to slow down incident resolution.
|
| And I'm not liking that some platforms hire 1500 people for
| a job that could be done with 50-100, but in terms of
| incident response - if you already have teams with
| separated responsibilities, then a NOC is "legacy".
| immibis wrote:
| 24/7 on-call is basically mandatory at any major network,
| which cloudflare is. Your contractual relations with
| other networks will require it.
| easterncalculus wrote:
| I'm not convinced that the SWE crowd of HN, particularly
| the crowd showing up to every thread about AI 'agents'
| really knows what it takes to run a global network or
| what a NOC is. I know saying this on here runs the risk
| of Vint Cerf or someone like that showing up in my
| replies, but this is seriously getting out of hand now.
| Every HN thread that isn't about fawning over AI
| companies is devolving into armchair redditor analysis of
| topics people know nothing about. This has gotten way
| worse since the pre-ChatGPT days.
| genewitch wrote:
| > Every HN thread that isn't about fawning over AI
| companies is devolving into armchair redditor analysis of
| topics people know nothing about.
|
| It took me a very long time to realize that^. I've worked
| _with_ two NOCs at two _huge_ companies, and i know they
| still exist as teams at those companies. I'm not an SWE,
| though. And I'm not certain i'd qualify either company as
| truly "global" except in the loosest sense - as in, one
| has "American" in the name of the primary subsidiary.
|
| ^ i even regularly have used "the comments were people
| incorrecting each other about <x>", so i knew
| subconsciously that HN is just a different subset of
| general internet comments. The issue comes from this site
| appearing to be moderated, and the group of people that
| select for commenting here seem like they would be above
| average at understanding and backing up claims. The
| "incorrecting" label comes from n-gate, which hasn't been
| updated since the early '20s, last i checked.
| JohnMakin wrote:
| Lol preach
|
| (Have worked as SRE at large global platform)
|
| I just mostly over the last few years tune out such
| responses and try not to engage them. The whole
| uninformed "Well, if it were me, I would simply not do
| that" kind of comment style has been pervasive on this
| site for longer than AI though, IMO.
| degamad wrote:
| The question is, which is better: 24/7 shift work (so
| that someone is always at work to respond, with disrupted
| sleep schedules at regular planned intervals) or 24/7 on-
| call (with monitoring and alerting that results in random
| intermittent disruptions to sleep, sometimes for false
| positives)?
| amelius wrote:
| I think it is reasonable if the alarm trigger time is, say
| 5-10% of the time required to fix most problems.
| amelius wrote:
| Instead of downvoting me, I'd like to know why this is
| not reasonable?
| croemer wrote:
| It's not rocket science. You do a 2 stage thing: Why not
| check if the aggregation service has crashed before firing
| the alarm if it's within the first 5 minutes? How many types
| of false positives can there be? You just need to eliminate
| the most common ones and you gradually end up with fewer of
| them.
|
| Before you fire a quick alarm, check that the node is up,
| check that the service is up etc.
| sophacles wrote:
| > How many types of false positives can there be?
|
| Operating at the scale of cloudflare? A lot.
|
| * traffic appears to be down 90% but we're only getting
| metrics from the regions of the world that are asleep
| because of some pipeline error
|
| * traffic appears to be down 90% but someone put in a
| firewall rule causing the metrics to be dropped
|
| * traffic appears to be down 90% but actually the counter
| rolled over and prometheus handled it wrong
|
| * traffic appears to be down 90% but the timing of the new
| release just caused polling to show wierd numbers
|
| * traffic appears to be down 90% but actually there was a
| metrics reporting spike and there was pipeline lag
|
| * traffic appears to be down 90% but it turns out that the
| team that handles transit links forgot to put the right
| acls around snmp so we're just not collecting metrics for
| 90% of our traffic
|
| * I keep getting alerts for traffic down 90%.... thousands
| and thousands of them, but it turns out that really its
| just that this rarely used alert had some bitrot and
| doesn't use the aggregate metrics but the per-system ones.
|
| * traffic is actually down 90% because theres an internet
| routing issue (not the dns team's problem)
|
| * traffic is actually down 90% at one datacenter because of
| a fiber cut somewhere
|
| * traffic is actually down 90% because the normal usage
| pattern is trough traffic volume is 10% of peak traffic
| volume
|
| * traffic is down 90% from 10s ago, but 10s ago there was
| an unusual spike in traffic.
|
| And then you get into all sorts of additional issues caused
| by the scale and distributed nature of a metrics system
| that monitors a huge global network of datacenters.
| __turbobrew__ wrote:
| The real issue in your hypothetical scenario is that a single
| bad metrics instance can bring the entire thing down. You could
| deploy multiple geographically distributed metrics
| aggregation services which establish the "canonical state"
| through a RAFT/PAXOS quorum. Then as long as a majority of
| metric aggregator instances are up the system will continue
| to work.
|
| When you are building systems like 1.1.1.1 having an alert
| rollup of five minutes is not acceptable as it will hide
| legitimate downtime that lasts between 0 and 5 minutes.
|
| You need to design systems which do not rely on orchestration
| to remediate short transient errors.
|
| Disclosure: I work on a core SRE team for a company with over
| 500 million users.
| perlgeek wrote:
| There's a constant tension between speed of detection and false
| positive rates.
|
| Traditional monitoring systems like Nagios and Icinga have
| settings where they only open events/alerts if a check failed
| three times in a row, because spurious failed checks are quite
| common.
|
| If you spam your operators with lots of alerts for monitoring
| checks that fix themselves, you stress them unnecessarily and
| create alert blindness, because the first reaction will be
| "let's wait if it fixes itself".
|
| I've never operated a service with as much exposure as CF's DNS
| service, but I'm not really surprised that it took 8 minutes to
| get a reliable detection.
| sbergot wrote:
| I work on the SSO stack in a b2b company with about 200k
| monthly active users. One blind spot in our monitoring is
| when an error occurs on the client's identity provider
| because of a problem on our side. The service is unusable and
| we don't have any error logs to raise an alert. We tried to
| set up an alert based on expected vs actual traffic, but we
| concluded that it would create more problems for the reason
| you provided.
| chrismorgan wrote:
| At Cloudflare's scale on 1.1.1.1, I'd imagine you could do
| something comparatively simple like track ten-minute and ten-
| second rolling averages (I know, I know, I make that sound
| much easier and more practical than it actually would be),
| and if they differ by more than 50%, sound the alarm. (Maybe
| the exact numbers would need to be tweaked, e.g. 20 seconds
| or 80%, but it's the idea.)
|
| Were it much less than 1.1.1.1 itself, taking longer than a
| minute to alarm probably wouldn't surprise me, but this is
| 1.1.1.1, they're dealing with _vasts_ amounts of probably
| fairly consistent traffic.
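The two-window idea above can be written down directly; the 10 s/10 min windows and 50% threshold are the ones the comment proposes, while the class name and traffic numbers are illustrative:

```python
# Sketch: compare a short rolling average of query rate against a long
# one and alarm when the short window collapses below half the baseline.
from collections import deque

class DropDetector:
    def __init__(self, short_n=10, long_n=600, ratio=0.5):
        self.short = deque(maxlen=short_n)  # ~10 s of per-second samples
        self.long = deque(maxlen=long_n)    # ~10 min of per-second samples
        self.ratio = ratio

    def observe(self, qps: float) -> bool:
        """Feed one per-second sample; return True if an alarm should fire."""
        self.short.append(qps)
        self.long.append(qps)
        if len(self.long) < self.long.maxlen:
            return False  # not enough history to trust the baseline yet
        short_avg = sum(self.short) / len(self.short)
        long_avg = sum(self.long) / len(self.long)
        return short_avg < self.ratio * long_avg

det = DropDetector()
for _ in range(600):
    det.observe(1000.0)                  # steady traffic: no alarm
fired = any(det.observe(100.0) for _ in range(10))  # 90% drop
print(fired)  # fires within seconds of the drop
```

In production the hard part is everything around this loop (collecting consistent per-second global counts at all), which is much of what the replies below are pointing at.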
| perlgeek wrote:
| I'm sure some engineer at cloudflare is evaluating
| something like this right now, trying it on historical
| data to see how many false positives it would've generated
| in the past, if any.
|
| Thing is, it's probably still some engineering effort, and
| most orgs only really improve their monitoring after it
| turned out to be sub-optimal.
| chrismorgan wrote:
| This is hardly the first 1.1.1.1 outage. It's also
| probably about the first external monitoring behaviour I
| imagine you'd come up with. That's why I'm surprised--
| more surprised the longer I think about it, actually;
| more than five minutes is a _really_ long delay to notice
| such a fundamental breakage.
| roughly wrote:
| Is your external monitor working? How many checks failed,
| in what order? Across how many different regions or
| systems? Was it a transient failure? How many times do
| you retry, and at what cadence? Do you push your success
| or failure metrics? Do you pull? What if your metrics
| don't make it back? How long do you wait before
| considering it a problem? What other checks do you run,
| and how long do those take? What kind of latency is
| acceptable for checks like that? How many false alarms
| are you willing to accept, and at what cadence?
| briangriffinfan wrote:
| I would want to make sure we avoid "We should always do the
| exact specific thing that would have prevented this exact
| specific issue"-style thinking.
| Anon1096 wrote:
| I work on something at a similar scale to 1.1.1.1, if we
| had this kind of setup our oncall would never be asleep
| (well, that is almost already the case, but alas). It's
| easy to say "just implement X monitor and you'd have caught
| this" but there's a real human cost and you have to work
| extremely vigilantly at deleting monitors or you'll be
| absolutely swamped with endless false positive pages. I
| don't think a 5 minute delay is unreasonable for a service
| this scale.
| chrismorgan wrote:
| This just seems kinda fundamental: the _entire service_
| was basically down, and it took 6+ minutes to notice? I'm
| just increasingly perplexed at how that could be. This
| isn't an _advanced_ monitor, this is perhaps the first
| and most important monitor I'd expect to implement (based
| on no closely relevant experience).
| roughly wrote:
| > based on no closely relevant experience
|
| I don't want to devolve this to an argument from
| authority, but - there's a lot of trade offs to
| monitoring systems, especially at that scale. Among other
| things, aggregation takes time at scale, and with enough
| metrics and numbers coming in, your variance is all over
| the place. A core fact about distributed systems at this
| scale is that something is always broken somewhere in the
| stack - the law of averages demands it, and so if you're
| going to do an all-fire-alarm alert any time part of the
| system isn't working, you've got alarms going off 24/7.
| Actually detecting that an actual incident is actually
| happening on a machine of the size and complexity we're
| talking about within 5 minutes is absolutely fantastic.
| philipwhiuk wrote:
| Remember they have no SLA for this service.
| chrismorgan wrote:
| So?
|
| They have a rather significant vested interest in it being
| reliable.
| bombcar wrote:
| This is one of those graphs that would have been on the giant
| wall in the NOC in the old days - someone would glance up and
| see it had dropped and say "that's not right" and start
| scrambling.
| seb1204 wrote:
| That's how I picture it. Is that not how it is? Everyone
| working from home and the big chart is on the TV but someone
| in the family changed channels?
| kccqzy wrote:
| Having alarms firing within a minute just becomes a stress test
| for your alarm infrastructure. Is your alarm infrastructure
| able to get metrics and perform calculations consistently
| within a minute of real time?
| bastawhiz wrote:
| The service almost certainly wasn't completely hard down at the
| time the impact began, especially if that's the start of a
| global rollout. It would have taken time for the impact to
| become measurable.
| egamirorrim wrote:
| What's that about a hijack?
| homero wrote:
| Related, non-causal event: BGP origin hijack of 1.1.1.0/24
| exposed by withdrawal of routes from Cloudflare. This was not a
| cause of the service failure, but an unrelated issue that was
| suddenly visible as that prefix was withdrawn by Cloudflare.
| kylestanfield wrote:
| So someone just started advertising the prefix when it was up
| for grabs? That's pretty funny
| woutifier wrote:
| No they were already doing that, the global withdrawal of
| the legitimate route just exposed it.
| SemioticStandrd wrote:
| How is there absolutely no further comment about that in
| their RCA? That seems like a pretty major thing...
| JdeBP wrote:
| And because people highlighted it on social media at the time
| of the outage, many thought that the bogus route _was_ the
| cause of the problem.
| ollien wrote:
| I'm a bit uneducated here - why was the other 1.1.1.0/24
| announcement previously suppressed? Did it just express a
| high enough cost that no one took it on compared to the CF
| announcement?
| whiatp wrote:
| CF had their route covered by RPKI, which at a high level
| uses certs to formalize delegation of IP address space.
|
| What caused this specific behavior is the dilemma of
| backwards compatibility when it comes to BGP security. We
| are a long way off from all routes being covered by RPKI
| (just 56% of v4 routes according to https://rpki-
| monitor.antd.nist.gov/ROV ) so invalid routes tend to be
| treated as less preferred, not rejected by BGP speakers
| that support RPKI.
| 0xbadcafebee wrote:
| > A configuration change was made for the same DLS service. The
| change attached a test location to the non-production service;
| this location itself was not live, but the change triggered a
| refresh of network configuration globally.
|
| Say what now? A test triggered a global production change?
|
| > Due to the earlier configuration error linking the 1.1.1.1
| Resolver's IP addresses to our non-production service, those
| 1.1.1.1 IPs were inadvertently included when we changed how the
| non-production service was set up.
|
| You have a process that allows some other service to just hoover
| up address routes already in use in production by a different
| service?
| sneak wrote:
| 1.1.1.1 does not operate in isolation.
|
| It is designed to be used in conjunction with 1.0.0.1. DNS has
| fault tolerance built in.
|
| Did 1.0.0.1 go down too? If so, why were they on the same
| infrastructure?
|
| This makes no sense to me. 8.8.8.8 also has 8.8.4.4. The whole
| point is that it can go down at any time and everything keeps
| working.
|
| Shouldn't the fix be to ensure that these are served out of
| completely independent silos and update all docs to make sure
| anyone using 1.1.1.1 also has 1.0.0.1 configured as a backup?
|
| If I ran a service like this I would regularly do blackouts or
| brownouts on the primary to make sure that people's resolvers are
| configured correctly. Nobody should be using a single IP as a
| point of failure for their internet access/browsing.
| detaro wrote:
| You don't need to test whether people's resolvers handle
| this cleanly, because it's already known that many don't.
| DNS fallback behavior across platforms is a mess.
| notpushkin wrote:
| > Did 1.0.0.1 go down too?
|
| Yes.
|
| > Shouldn't the fix be to ensure that these are served out of
| completely independent silos [...]?
|
| Yes.
|
| > If so, why were they on the same infrastructure?
|
| Apparently, they weren't independent enough: something in CF
| has announced both addresses and that got out.
|
| The solution for the end user is, of course, to use 1.1.1.1 and
| 8.8.8.8 (or any other combination of two _different_
| resolvers).
| rswail wrote:
| I now run unbound locally as a recursive DNS server, which really
| should be the default. There's no reason not to in modern
| routers.
|
| Not sure what the "advantage" of stub resolvers is in 2025 for
| anything.
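For anyone wanting to try this, a minimal unbound.conf sketch (option names from unbound.conf(5); values are illustrative, not a hardened config):

```
server:
    interface: 127.0.0.1            # listen on localhost only
    access-control: 127.0.0.0/8 allow
    access-control: 192.168.0.0/16 allow
    # No forward-zone block: unbound recurses from the root servers
    # itself, so no third-party resolver is a single point of failure.
    hide-identity: yes
    hide-version: yes
    prefetch: yes                   # refresh popular records before expiry
```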
| i_niks_86 wrote:
| Many commenters assume fallback behavior exists between DNS
| providers, but in practice, DNS clients - especially at the OS or
| router level -rarely implement robust failover for DoH. If you're
| using cloudflare-dns(.)com and it goes down, unless the stub
| resolver or router explicitly supports multi-provider failover
| (and uses a trust-on-first-use or pinned cert model), you're
| stuck. The illusion of redundancy with DoH needs serious UX
| rethinking.
| tankenmate wrote:
| I use routedns[0] for this specific reason: it handles almost
| all DNS protocols (UDP, TCP, DoT, DoH, DoQ, including 0-RTT).
| But more importantly it has very configurable route steering,
| even down to a record-by-record basis if you want to put up
| with all the configuration involved. It's very robust and
| very handy. I use 1.1.1.1 on my desktops and servers, and
| when the incident happened I didn't even notice, as the
| failover "just worked". I had to actually go look at the
| logs to see it had happened.
|
| [0] https://github.com/folbricht/routedns
| hkon wrote:
| To say I was surprised when I finally checked the status page of
| cloudflare is an understatement.
| v5v3 wrote:
| > For many users, not being able to resolve names using the
| 1.1.1.1 Resolver meant that basically all Internet services were
| unavailable.
|
| Don't you normally have 2 DNS servers listed on any device?
| So was the second also down? If not, why didn't it fail over
| to that?
| rat9988 wrote:
| Not all users have configured two DNS servers?
| quacksilver wrote:
| It is highly recommended to configure two or more DNS servers
| in case one is down.
|
| I would count not configuring at least two as 'user error'.
| Many systems require you to enter a primary and alternate
| server in order to save a configuration.
| tgv wrote:
| The default setting on most computers seems to be: use the
| (wifi) router. I suppose telcos like that because it keeps
| the number of DNS requests down. So I wouldn't necessarily
| see it as user error.
| SketchySeaBeast wrote:
| The funny part with that is that sites like cloudflare say
| "Oh, yeah, just use 1.0.0.1 as your alternate", when, in
| reality, it should be an entirely different service.
| daneel_w wrote:
| OK. But there's no reason or excuse not to, if they already
| manually configured a primary.
| rom1v wrote:
| On Android, in Settings, Network & internet, Private DNS, you
| can only provide one in "Private DNS provider hostname"
| (AFAIK).
|
| Btw, I really don't understand why it does not accept an IP
| (1.1.1.1), so you have to give a hostname (one.one.one.one).
| It would be more sensible to configure a DNS server from an
| IP rather than from a hostname that must itself be resolved
| by a DNS server :/
| quacksilver wrote:
| Private DNS on Android refers to 'DNS over HTTPS' and would
| normally only accept a hostname.
|
| Normal DNS can normally be changed in your connection
| settings for a given connection on most flavours of Android.
| quaintdev wrote:
| It's DNS over TLS. Android does not support DNS over HTTPS
| except Google's DNS
| KoolKat23 wrote:
| As far as I understand it, it's Google or Cloudflare?
| lxgr wrote:
| It does since Android 11.
| Tarball10 wrote:
| For a limited set of DoH providers. It does not let you
| enter a custom DoH URL, only a DoT hostname.
| rom1v wrote:
| > Private DNS on Android refers to 'DNS over HTTPS'
|
| Yes, sorry, I did not mention it.
|
| So if you want to use DNS over HTTPS on Android, it is not
| possible to provide a fallback.
| ignoramous wrote:
| > _So if you want to use DNS over HTTPS on Android, it is
| not possible to provide a fallback._
|
| Not true. If the (DoH) host has multiple A/AAAA records
| (multiple IPs), any decent DoH client would retry its
| requests over multiple or all of those IPs.
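A minimal sketch of that retry-across-IPs behavior; `query_with_failover` and the injected `send` transport are hypothetical stand-ins, not any real client's API:

```python
# Sketch: a DoH client that tries every resolved IP of its configured
# host before giving up. The transport is injected so the failover
# logic itself is testable without a network.
def query_with_failover(ips, send):
    """Try `send(ip)` against each IP in order; return the first answer."""
    last_error = None
    for ip in ips:
        try:
            return send(ip)
        except OSError as exc:  # connect/TLS failure: fall through to next IP
            last_error = exc
    raise last_error or OSError("no IPs to try")

# Example: the first IP is "down", the second answers.
def fake_send(ip):
    if ip == "1.1.1.1":
        raise OSError("connection refused")
    return {"answer": "93.184.216.34", "served_by": ip}

print(query_with_failover(["1.1.1.1", "1.0.0.1"], fake_send))
```

Of course, as this incident showed, per-host failover only helps if the alternate IPs aren't withdrawn at the same time.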
| lxgr wrote:
| Does Cloudflare offer any hostname that also resolves to
| a different organization's resolver (which must also have
| a TLS certificate for the Cloudflare hostname or DoH
| clients won't be able to connect)?
| ignoramous wrote:
| Usually, for plain old DNS, primary and secondary
| resolvers are from the same provider, serving from
| distinct IPs.
| lxgr wrote:
| Yes, but you were talking about DoH. I don't know how
| that could plausibly work.
| ignoramous wrote:
| > _but you were talking about DoH_
|
| DoH hosts can resolve to multiple IPs (and even different
| IPs for different clients)?
|
| Also see TFA:
|
| > _It's worth noting that DoH (DNS-over-HTTPS) traffic
| remained relatively stable as most DoH users use the domain
| cloudflare-dns.com, configured manually or through their
| browser, to access the public DNS resolver, rather than by IP
| address. DoH remained available and traffic was mostly
| unaffected as cloudflare-dns.com uses a different set of IP
| addresses._
| lxgr wrote:
| > DoH hosts can resolve to multiple IPs (and even
| different IPs for different clients)?
|
| Yes, but not from a different organization. That was GPs
| point with
|
| > So if you want to use DNS over HTTPS on Android, it is
| not possible to provide a fallback.
|
| A cross-organizational fallback is not possible with DoH
| in many clients, but it is with plain old DNS.
|
| > It's worth noting that DoH (DNS-over-HTTPS) traffic
| remained relatively stable as most DoH users use the
| domain cloudflare-dns.com
|
| Yes, but that has nothing to do with failovers to an
| infrastructurally/operationally separate secondary
| server.
| ignoramous wrote:
| > _A cross-organizational fallback is not possible with
| DoH in many clients, but it is with plain old DNS._
|
| That's the client implementation lacking, not some issue
| inherent to DoH?
|
| > _The DoH client is configured with a URI Template, which
| describes how to construct the URL to use for resolution.
| Configuration, discovery, and updating of the URI Template is
| done out of band from this protocol. Note that configuration
| might be manual (such as a user typing URI Templates in a
| user interface for "options") or automatic (such as URI
| Templates being supplied in responses from DHCP or similar
| protocols). DoH servers MAY support more than one URI
| Template. This allows the different endpoints to have
| different properties, such as different authentication
| requirements or service-level guarantees._
|
| https://datatracker.ietf.org/doc/html/rfc8484#section-3
| lxgr wrote:
| Yes, but this restriction of only a single DoH URL seems
| to be the norm for many popular implementations. The
| protocol theoretically allowing better behavior doesn't
| really help people using these.
| eptcyka wrote:
| Cloudflare has valid certs for 1.1.1.1
| fs111 wrote:
| No, it is not DNS over HTTPS it is DNS over TLS, which is
| different.
| lxgr wrote:
| Android 11 and newer support both DoH and DoT.
| politelemon wrote:
| Where is this option? How can I distinguish the two, the
| dialog simply asks for a host name
| zamadatix wrote:
| 1.1.1.1 is also what they call the resolver service as a whole,
| the impact section seems to be saying both 1.0.0.0/24 and
| 1.1.1.0/24 were affected (among other ranges).
| Gieron wrote:
| I think normally you pair 1.1.1.1 with 1.0.0.1 and, if I
| understand this correctly, both were down.
| Algent wrote:
| Yeah pretty much. In a perfect world you would pair it with
| another service I guess but usually you use the official
| backup IP because it's not supposed to break at the same time.
| carlhjerpe wrote:
| I would rather fall back to the slow path of resolving
| through root servers than fall back from one recursive
| resolver to another.
| moontear wrote:
| Just pair 1.1.1.1 with 9.9.9.9 (Quad9) so you have fault
| tolerance in terms of provider as well.
| rvnx wrote:
| Quad9 is reselling the traffic logs, so it means if you
| connect to secret hosts (like for your work), they will be
| leaked
| Demiurge wrote:
| Is this true? They claim that they don't keep any logs.
| Do you have a source?
| jeffbee wrote:
| They don't claim that. Less than a week ago HN discussed
| their top resolved domains report. Such a report implies
| they have logs.
| Demiurge wrote:
| From their homepage:
|
| > How Quad9 protects your privacy?
|
| > When your devices use Quad9 normally, no data
| containing your IP address is ever logged in any Quad9
| system.
|
| Of course they have some kinds of logs. Aggregating
| resolved domains without logging client IPs is not what
| the implication of "Quad9 is reselling the traffic logs"
| seems to be.
| jeffbee wrote:
| We're not discussing IP addresses, we are discussing
| whether their logs can leak your secret domain name.
| Demiurge wrote:
| That's clearer, I get your point now. Again, though,
| that's not how most people would read the original
| comment. I've never even contemplated that I might
| generate some hostnames whose existence might be
| considered sensitive. It seems like a terrible idea to
| begin with, as I'm sure there are other avenues for those
| "secret" domains to be leaked. Perhaps name your secret
| VMs vm1, vm2, ..., instead of <your root password>. But
| yeah, this is not my area of expertise, nor a concern for
| the vast majority of internet users who want more privacy
| than their ISP will provide.
|
| I am curious though, do you have any suggestions for
| alternative DNS that is better?
| jeffbee wrote:
| I use Google DNS because I feel it suits my personal
| theory of privacy threats. Among the various public DNS
| resolver services, I feel that they have the best
| technical defenses against insider snooping and outside
| hackers infiltrating their systems, and I am unperturbed
| about their permanent logs. I also don't care about
| Quad9's logs, except to the extent that it seems
| inconsistent with the privacy story they are selling. I
| used Quad9 as my resolver of last resort in my config. I
| doubt any queries actually go there in practice.
| daneel_w wrote:
| Could you show a citation? Your statement completely
| opposes Quad9's official information as published on
| quad9.net, and what's more it doesn't align at all with
| Bill Woodcock's known advocacy for privacy.
| gruez wrote:
| See: https://quad9.net/privacy/policy/
|
| It doesn't say they sell traffic logs outright, but they
| do send telemetry on blocked domains to the blocklist
| provider, and provide "a sparse statistical sampling of
| timestamped DNS responses" to "a very few carefully
| vetted security researchers". That's not exactly "selling
| traffic logs", but it is fairly close. Moreover,
| colloquially speaking, it's not uncommon to claim "Google
| sells your data", even if they don't provide dumps and
| only disclose aggregated data.
| daneel_w wrote:
| Disagree that it's fairly close to the statement "they
| resell traffic logs" and the implication that they leak
| all queried hostnames ("secret hosts, like for your work,
| will be leaked"). Unless Quad9 is deceiving users, both
| statements are, in fact, completely false.
|
| https://quad9.net/privacy/policy/#22-data-collected
| gruez wrote:
| >and the implication that they leak all queried hostnames
| ("secret hosts, like for your work, will be leaked").
|
| The part about sharing data with "a very few carefully
| vetted security researchers" doesn't preclude them from
| leaking domains. For instance if the security researcher
| exports a "SELECT COUNT(*) GROUP BY hostname" query that
| would arguably count as "summary form", and would include
| any secret hostnames.
|
| >https://quad9.net/privacy/policy/#22-data-collected
|
| If you're trying to imply that they can't possibly be
| leaking hostnames because they don't collect hostnames,
| that's directly contradicted by the subsequent sections,
| which specifically mention that they share metrics
| grouped by hostname basis. Obviously they'll need to
| collect hostname to provide such information.
| daneel_w wrote:
| I'm implying that I'm convinced they are not storing
| statistics on (thus leaking) every queried hostname. By
| your very own admission, they clearly state that they
| perform statistics on _a set of malicious domains
| provided by a third party_ , as part of their blocking
| program. Additionally they publish a "top 500 domains"
| list regularly. You're really having a go with the
| shoehorn if you want "secret domains, like for your work"
| (read: every distinct domain queried) to fit here.
| gruez wrote:
| >I'm implying that I'm convinced they are not storing
| statistics on (thus leaking) every queried hostname. By
| your very own admission, they clearly state that they
| perform statistics on a set of malicious domains provided
| by a third party, as part of their blocking program.
|
| Right, but the privacy policy also says there's a
| separate program for "a very few carefully vetted
| security researchers" where they can get data in "summary
| form", which can leak domain name in the manner I
| described in my previous comment. Maybe they have a great
| IRB (or similar) that would prevent this from happening,
| but that's not mentioned in the privacy policy. Therefore
| it's totally in the realm of possibility that secret
| domain names could be leaked, no "really having a go with
| the shoehorn" required.
| sophacles wrote:
| I'm sorry... what is a secret hostname that is publicly
| resolvable?
|
| The very idea strikes me as irresponsible and misguided.
| notpushkin wrote:
| It could be some subdomain that's hard to guess. You
| can't (generally) enumerate all subdomains through DNS,
| and if you use a wildcard TLS certificate (or self-signed
| / no cert at all), it won't be leaked to CT logs either.
| Secret hostname.
| rvnx wrote:
| Examples: github.internal.companyname.com or
| jira.corp.org or jenkins-ci.internal-finance.acme-
| corp.com or grafana.monitoring.initech.io or
| confluence.prod.internal.companyx.com etc
|
| If you don't know these hostnames, you will not be able
| to hit the backend service. But if you do know them, you
| can start exploiting it, either through lack of auth or
| by trying to exploit the software itself.
| Quad9 wrote:
| We are fully committed to end-user privacy. As a result,
| Quad9 is intentionally designed to be incapable of
| capturing end-users' PII. Our privacy policy is clear
| that queries are never associated with individual persons
| or IP addresses, and this policy is embedded in the
| technical (in)capabilities of our systems.
| rvnx wrote:
| It is about the hostnames themselves like:
| git.nationalpolice.se but I understand that there is not
| much choice if you want to keep the service free to use
| so this is fair
| staviette wrote:
| Is that really a concern for most people? Trying to keep
| hostnames secret is a losing battle anyways these days.
|
| You should probably be using a trusted TLS certificate
| for your git hosting. And that means the host name will
| end up in certificate transparency logs which are even
| easier to scrape than DNS queries.
| baobabKoodaa wrote:
| Windows 11 does not allow using this combination
| snickerdoodle12 wrote:
| Huh? Did they break the primary/secondary DNS server
| setup that has been present in all operating systems for
| decades?
| antonvs wrote:
| DNS over HTTPS adds a requirement for an additional field
| - a URL template - and Windows doesn't handle defaulting
| that correctly in all cases. If you set them manually it
| works fine.
| snickerdoodle12 wrote:
| What does that have to do with plain old dns?
| antonvs wrote:
| Nothing, but Windows can automatically use DNS over HTTPS
| if it recognizes the server, which is the source of the
| issue the other commenter mentioned.
| lxgr wrote:
| How so? Does it reject a secondary DNS server that's not
| in the same subnet or something similar?
| antonvs wrote:
| It's using DNS over HTTPS, and it doesn't default the URL
| templates correctly when mixing (some) providers. You can
| set them manually though, and it works.
| lxgr wrote:
| Ah, this is for DoH, gotcha!
|
| This "URL template" thing seems odd - is Windows doing
| something like creating a URL out of the DNS IP and a
| pattern, e.g. 1.1.1.1 + "https://<ip>/foo" would yield
| https://1.1.1.1/foo?
|
| If so, why not just allow providing an actual URL for
| each server?
| antonvs wrote:
| It does allow you to provide a URL for each server. The
| issue is just that its default behavior doesn't work for
| all providers. I have another comment in this thread
| telling the original commenter how to configure it.
| lxgr wrote:
| Very cool, thank you!
| antonvs wrote:
| You can use it, you just need to set the DNS over HTTPS
| templates correctly, since there's an issue with the
| defaults it tries to use when mixing providers.
|
| The templates you need are:
|
| 1.1.1.1: https://cloudflare-dns.com/dns-query
|
| 9.9.9.9: https://dns.quad9.net/dns-query
|
| 8.8.8.8: https://dns.google/dns-query
|
| See https://learn.microsoft.com/en-us/windows-
| server/networking/... for info on how to set the
| templates.
| baobabKoodaa wrote:
| Awesome! Thank you!
| antonvs wrote:
| You're welcome. btw I came across a description of doing
| it via the GUI here: https://github.com/Curious4Tech/DNS-
| over-HTTPS-Set-Up
| Aachen wrote:
| I became a bit disillusioned with quad9 when they started
| refusing to resolve my website. It's like wetransfer but
| supporting wget and without the AI scanning or
| interstitials. A user had uploaded malware and presumably
| sent the link to a malware scanner. Instead of reporting
| the malicious upload or blocking the specific URL1, the
| whole domain is now blocked on a DNS level. The competing
| wetransfer.com resolves just fine at 9.9.9.9
|
| I haven't been able to find any recourse. The malware was
| online for a few hours but it has been weeks and there
| seems to be no way to clear my name. Someone on github (the
| website is open source) suggested that it's probably
| because they didn't know of the website; everyone has
| heard of wetransfer and github, so those don't get the
| whole domain blocked for malicious user content. I can't
| find any other difference, but also no responsible party to
| ask. The false-positive reporting tool on quad9's website
| just reloads the page and doesn't do anything
|
| 1 I'm aware DNS can't do this, but with a direct way of
| contacting a very responsive admin (no captchas or annoying
| forms, just email), I'd not expect scanners to resort to
| blocking the domain outright to begin with, at least not
| after they heard back the first time and the problematic
| content has been cleared swiftly
| mnordhoff wrote:
| You should email them about the form and about your
| domain. Their email address is listed on the website.
| <https://quad9.net/support/contact/>
|
| Sometimes the upstream blocklist provider will be easy to
| contact directly as well. Sometimes not so much.
| ajdude wrote:
| I've been the victim of similar abuse before, for my mail
| servers and one of my community forums that I used to
| run. It's frustrating when you try to do everything right
| but you're at the mercy of a cold and uncompromising
| rules engine.
|
| You just convinced me to ditch quad9.
| Quad9 wrote:
| What is your ticket #? Let's see if we can get this
| resolved for you.
| dmitrygr wrote:
| Why not address the _REAL_ issue:
|
| > I haven't been able to find any recourse. [...] there
| seems to be no way to clear my name.
| seb1204 wrote:
| From the parent comment, the path of recourse is a ticket.
| That doesn't help if HN is needed to get it looked at.
| rvnx wrote:
| 8.8.8.8 + 1.1.1.1 is stable and mostly safe
| baobabKoodaa wrote:
| Windows 11 does not allow using this combination
| heraldgeezer wrote:
| it does if you set it on the interface
| ziml77 wrote:
| This is what I do. I have both services set in my router,
| so the full list it tries are 1.1.1.1, 1.0.0.1, 8.8.8.8,
| and 8.8.4.4
| bmicraft wrote:
| My Mikrotik router (and afaict all of them) don't support more
| than one DoH address.
| ahoka wrote:
| Or run your own, if you are able to.
| sschueller wrote:
| Yes, I would also highly recommend using a DNS server close
| to you (for those whose ISPs don't mess with their DNS via
| blocking etc., you usually get much better response times)
| and multiple servers from different providers.
|
| If your device doesn't support proper failover use a local DNS
| forwarder on your router or an external one.
|
| In Switzerland I would use Init7 (isp that doesn't filter) ->
| quad9 (unfiltered Version) -> eu dns0 (unfiltered Version)
| dylan604 wrote:
| How busy in life are you that we're concerning ourselves with
| nearest DNS? Are you browsing the internet like a high
| frequency stock trader? Seriously, in everyone's day to day,
| other than when these incidents happen, does someone notice a
| delay from resolving a domain name?
|
| I get that in theory blah blah, but we now have choices in
| who gets to see all of our requests and the ISP will always
| lose out to the other losers in the list
| jeffbee wrote:
| You know, I recently went through a period of thinking my
| MacBook was just broken. It had the janks. Everything on
| the browser was just slower than you're used to. After a
| week or two of pulling my hair, I figured it out. The
| newly-configured computer was using the DHCP-assigned DNS
| instead of Google DNS. Switched it, and it made a massive
| difference.
| dylan604 wrote:
| but that's the opposite of the request to move from a
| googDNS to a local one because of latency. so your ISP's
| DNS sucked, which is a broad statement, and is part of
| the why services like 1.1.1.1 or 8.8.8.8 exist. you
| didn't make the change of DNS because you were picking
| one based on nearest location.
| jeffbee wrote:
| There is more to latency than distance. Server response
| time is also important. In my case, the problem was that
| the DNS forwarder in the local wifi access point/router
| was very slow, even though the ICMP latency from my
| laptop to that device is obviously low.
| dylan604 wrote:
| which is well and fine, but my original comment was that
| moving to a closer DNS isn't worth it just for being
| closer especially when it is usually your ISP's server.
| so now, you're confirming that just moving closer isn't
| the solve, so it just reassures that not using the
| closest DNS is just fine.
| tredre3 wrote:
| news.ycombinator.com has a TTL of 1, so every page load
| will do one DNS request (possibly multiple).
|
| If you choose a resolver that is very far away, 100ms-longer
| page loads do add up quickly...
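A rough back-of-the-envelope, with all figures assumed for illustration, shows how a distant resolver compounds with short TTLs like the one mentioned above:

```python
# Extra waiting caused by a far-away resolver when short TTLs
# force a fresh lookup on nearly every page load.
# All figures below are assumptions, not measurements.
extra_rtt_ms = 100        # additional round trip to the distant resolver
page_loads_per_day = 500
lookups_per_load = 3      # page domain plus a few subresource domains

extra_seconds = extra_rtt_ms / 1000 * page_loads_per_day * lookups_per_load
print(f"~{extra_seconds:.0f} extra seconds of DNS waiting per day")
```

With those assumptions, that is a couple of minutes of cumulative waiting per day, which matches the "adds up quickly" intuition.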
| sumtechguy wrote:
| Even something simple like www.google.com serves from 5
| different DNS names. I have seen as high as 50. It is
| surprisingly snappier. Especially on older browsers that
| would only have 2 connections at a time open. It adds up
| faster than you would intuitively think. I used to have
| local resolvers that would mess with the TTL. But that
| was more trouble than it was worth. But it also gave a
| decent speedup. Was it 'worth' doing. Well it was kinda
| fun to mess with, I guess.
| Macha wrote:
| Cloudflare's own suggested config is to use their backup server
| 1.0.0.1 as the secondary DNS, which was also affected by this
| incident.
| stingraycharles wrote:
| TBH at this point the failure modes in which 1.1.1.1 would go
| down and 1.0.0.1 would not are not that many. At Cloudflare's
| scale, it's hardly believable that just one of these DNS
| servers would go down; it's more likely a large-scale system
| failure.
|
| But I understand why Cloudflare can't just say "use 8.8.8.8
| as your backup".
| bombcar wrote:
| At least some machines/routers do NOT have a primary and
| backup but instead randomly round-robin between them.
|
| Which means that you'd be on cloudflare half the time and
| on google half the time which may not be what you wanted.
| toast0 wrote:
| It would depend on how Cloudflare set up their systems.
| From this and other outages, I think it's pretty clear that
| they've set up their systems as a single failure domain.
| But it would be possible for them to have setup for 1.1.1.1
| and 1.0.0.1 to have separate failure domains --- separate
| infrastructure, at least some sites running one but not the
| other.
| Bluescreenbuddy wrote:
| Yup. I have Cloudflare and Quad9
| bongodongobob wrote:
| 3 at every place I've ever worked.
| Polizeiposaune wrote:
| Cloudflare recommends you configure 1.1.1.1 and 1.0.0.1 as DNS
| servers.
|
| Unfortunately, the configuration mistake that caused this
| outage disabled Cloudflare's BGP advertisements of both
| 1.1.1.0/24 and 1.0.0.0/24 prefixes to its peers.
| kingnothing wrote:
| A better recommendation is to use Cloudflare for one of your
| DNS servers and a completely different company for the other.
| butlike wrote:
| Yeah but on paper they're never going to recommend using a
| competitor
| itake wrote:
| Just wondering, how do y'all manage wifi captive portals
| while manually setting DNS servers? I used to use CF's and
| Google's, but it was so annoying to disable and re-enable
| them every time I used a public wifi network.
| udev4096 wrote:
| This is why running your own resolver is so important. Clownflare
| will always break something or backdoor something
| perlgeek wrote:
| An outage of roughly 1 hour is 0.13% of a month or 0.0114% of a
| year.
|
| It would be interesting to see the service level objective (SLO)
| that cloudflare internally has for this service.
|
| I've found https://www.cloudflare.com/r2-service-level-agreement/
| but this seems to be for paid services, so this outage would put
| July in the "< 99.9% but >= 99.0%" bucket, so you'd get a 10%
| refund for the month if you paid for it.
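The arithmetic behind those percentages, checked against the SLA table quoted above:

```python
# One hour of downtime, expressed against July (744 h) and a full year.
outage_h = 1.0
month_h = 31 * 24        # 744 hours in July
year_h = 365 * 24        # 8760 hours

pct_of_month = 100 * outage_h / month_h   # ~0.134%
pct_of_year = 100 * outage_h / year_h     # ~0.0114%
monthly_uptime = 100 - pct_of_month       # ~99.87%

# Per the SLA bucket: monthly uptime < 99.9% but >= 99.0% earns a 10% credit.
credit_pct = 10 if 99.0 <= monthly_uptime < 99.9 else 0
print(f"{pct_of_month:.2f}% of the month down, credit: {credit_pct}%")
```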
| philipwhiuk wrote:
| Probably 99.9% or better annually just from a 'maintaining
| reputation for reliability' standpoint.
| stingraycharles wrote:
| What really matters with these percentages is whether it's
| per month or per year. 99.9% per year allows for much longer
| outages than 99.9% per month.
| kachapopopow wrote:
| Interesting to see that they probably lost 20% of 1.1.1.1 usage
| from a roughly 20 minute incident.
|
| Not sure how cloudflare keeps struggling with issues like these,
| this isn't the first (and probably won't be the last) time they
| have these 'simple', 'deprecated', 'legacy' issues occurring.
|
| 8.8.8.8+8.8.4.4 hasn't had a global(1) second of downtime for
| almost a decade.
|
| 1: localized issues did exist, but that's really the fault of the
| internet and they did remain running when google itself suffered
| severe downtime in various different services.
| Tepix wrote:
| There's more to DNS than just availability (granted, it's very
| important). There's also speed and privacy.
|
| European users might prefer one of the alternatives listed at
| https://european-alternatives.eu/category/public-dns over US
| corporations subject to the CLOUD act.
| immibis wrote:
| HN users might prefer to run their own. It's a low
| maintenance service. It's not like running a mail server.
| daneel_w wrote:
| I think that might be overestimating the technical prowess
| of HN readers on the whole. Sure, it doesn't require
| wizardry to set up e.g. Unbound as a catch-all DoT
| forwarder, but it's not the click'n'play most people
| require. It should be compared to just changing the system
| resolvers to dns0, Quad9 etc.
| lossolo wrote:
| One issue here is that you can be tracked easily.
| kachapopopow wrote:
| Running your own and being the sole user is the exact same
| thing as using a dns server (you need to obtain nameservers
| for any given domain which you have to contact a dns server
| for).
| daneel_w wrote:
| Everyone, European or not, should prefer _anything_ but
| Cloudflare and Google if they feel that privacy has any
| value.
| adornKey wrote:
| I think just setting up Unbound is even less trouble. Servers
| come and go. Getting rid of the dependency altogether is
| better than having to worry who operates the DNS-servers and
| how long it's going to be available.
| genewitch wrote:
| i am 95% certain i run unbound in a datacenter, and i have
| pihole local, my PC connects to pihole first, and if that's
| down, it connects to my DC; pihole connects to the DC and
| one of the filtered DNS providers (don't remember which)
| and GTEi's old server, that still works and has never let
| me down. No, not that one, the other one.
|
| i have musknet, though, so i can't edit the DNS providers
| on the router without buying another router, so cellphones
| aren't automatically on this plan, nor are VMs and the
| like.
| adornKey wrote:
| Having a 2nd trustworthy router consumes extra energy,
| but maybe it's worth it. More than once my router made an
| update and silently disabled the pi-hole.
|
| Having a fully configured spare pi-hole in a box also
| helps. Another time my pi-hole refused to boot after a
| power outage.
| barbazoo wrote:
| Then you'd be using Google DNS though, which is undesirable
| for many.
| kod wrote:
| > Not sure how cloudflare keeps struggling with issues like
| these
|
| Cloudflare has a reasonable culture around incident response,
| but it doesn't incentivize proactive prevention.
| zamadatix wrote:
| Regarding the 20%: some clients/resolvers will mark a server as
| temporarily down if it fails to respond to multiple queries in
| a row. That way the user doesn't have to wait the timeout delay
| 500 times in a row on the next 500 queries.
|
| From the longer term graphs it looks like volume returned to
| normal https://imgur.com/a/8a1H8eL
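The down-marking behavior described above can be sketched roughly like this; the thresholds and hold-down time are made up for illustration, and real resolvers differ in the details:

```python
import time

class UpstreamPool:
    """Toy model of a resolver marking an upstream 'temporarily down'
    after repeated timeouts, so later queries skip it instead of
    waiting out the timeout again. Thresholds are illustrative."""

    def __init__(self, servers, max_fails=3, hold_down=60.0,
                 now=time.monotonic):
        self.max_fails = max_fails
        self.hold_down = hold_down
        self.now = now
        self.state = {s: {"fails": 0, "down_until": 0.0} for s in servers}

    def candidates(self):
        """Servers currently considered usable."""
        t = self.now()
        return [s for s, st in self.state.items() if st["down_until"] <= t]

    def report_failure(self, server):
        st = self.state[server]
        st["fails"] += 1
        if st["fails"] >= self.max_fails:
            # Skip this upstream for a while rather than retrying it.
            st["down_until"] = self.now() + self.hold_down
            st["fails"] = 0

    def report_success(self, server):
        self.state[server] = {"fails": 0, "down_until": 0.0}
```

Once the hold-down expires, the server re-enters the rotation, which is consistent with query volume recovering after the outage rather than during it.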
| heraldgeezer wrote:
| Yes, I honestly switched back to 8.8.8.8 and 8.8.4.4 google
| DNS. 100% stable, no filtering, fast in the EU.
| user3939382 wrote:
| You're not sure how they're struggling to fix an engineering
| problem characterized by complexity and scale encountered by
| 0.001% of network engineers?
| nness wrote:
| Interesting side-effect, the Gluetun docker image uses 1.1.1.1
| for DNS resolution -- as a result of the outage Gluetun's health
| checks failed and the images stopped.
|
| If there were some way to view torrenting traffic, no doubt
| there'd be a 20 minute slump.
| johnklos wrote:
| Personally, I'd consider any Docker image that does its own DNS
| resolution outside of the OS a Trojan.
| greggsy wrote:
| I'd love to know which legacy systems they're referring to.
| wreckage645 wrote:
| This is a good post mortem, but improvements only come with
| changes to processes. It seems every team at Cloudflare is
| approaching this in isolation, without central problem
| management. Every week we see a new Cloudflare global outage.
| It seems like the change management process is broken and
| needs to be looked at.
| sylware wrote:
| cloudflare is providing a service designed to block
| noscript/basic (x)html browsers.
|
| I know.
| trollbridge wrote:
| I got bit by this, so dnsmasq now has 1.1.1.2, Quad9, and
| Google's 8.8.8.8 with both primary and secondary.
|
| Secondary DNS is supposed to be in an independent network to
| avoid precisely this.
| neurostimulant wrote:
| I never noticed the outage because my ISP hijacks all outbound
| UDP traffic to port 53 and redirects it to their own DNS server
| so they can apply government-mandated censorship :)
| nu11ptr wrote:
| Question: Years ago, back when I used to do networking, Cisco
| Wireless controllers used 1.1.1.1 internally. They seemed to
| literally blackhole any comms to that IP in my testing. I assume
| they changed this when 1.0.0.0/8 started routing on the Internet?
| blurrybird wrote:
| Yeah part of the reason why APNIC granted Cloudflare access to
| those very lucrative IPs is to observe the misconfiguration
| volume.
|
| The theory is CF had the capacity to soak up the junk traffic
| without negatively impacting their network.
| yabones wrote:
| The general guidance for networking has been to only use IPs
| and domains that you actually control... But even 5-8 years
| ago, the last time I personally touched a cisco WLC box, it
| still had 1.1.1.1 hardcoded. Cisco loves to break their own
| rules...
| homebrewer wrote:
| This is a good time to mention that dnsmasq lets you set up
| several DNS servers, and can race them. The first responder wins.
| You won't ever notice one of the services being down:
|             all-servers
|             server=8.8.8.8
|             server=9.9.9.9
|             server=1.1.1.1
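The "first responder wins" race can be sketched with a thread pool; the resolver callables below are hypothetical stand-ins for real upstream queries, not how dnsmasq is actually implemented:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def race(resolvers, name):
    """Query every upstream at once and return the first successful
    answer, like dnsmasq's `all-servers` mode. `resolvers` is a list
    of callables standing in for real upstream queries; a failed
    upstream raises, and the race simply continues without it."""
    with ThreadPoolExecutor(max_workers=len(resolvers)) as pool:
        pending = {pool.submit(r, name) for r in resolvers}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                if fut.exception() is None:
                    return fut.result()  # first responder wins
    raise RuntimeError("all upstreams failed")
```

A down upstream contributes nothing to latency here, which is why users of this mode never noticed the outage.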
| mnordhoff wrote:
| Even without "all-servers", DNSMasq will race servers
| frequently (after 20 seconds, unless it's changed), and when
| retrying. A sudden outage should only affect you for a few
| seconds, if at all.
| anthonyryan1 wrote:
| Additionally, as long as you don't set strict-order, dnsmasq
| will automatically use all-servers for retries.
|
| If you were using systemd-resolved however, it retries all
| servers in the order they were specified, so it's important to
| interleave upstreams.
|
| Using the servers in the above example, and assuming IPv4 +
| IPv6:
|             1.1.1.1
|             2001:4860:4860::8888
|             9.9.9.9
|             2606:4700:4700::1111
|             8.8.8.8
|             2620:fe::fe
|             1.0.0.1
|             2001:4860:4860::8844
|             149.112.112.112
|             2606:4700:4700::1001
|             8.8.4.4
|             2620:fe::9
|
| will failover faster and more successfully on systemd-resolved,
| than if you specify all Cloudflare IPs together, then all
| Google IPs, etc.
|
| Also note that Quad9 is filtering by default on this IP while
| the other two are not, so you could get intermittent
| differences in resolution behavior. If this is a problem,
| don't mix filtered and unfiltered resolvers. You definitely
| shouldn't mix DNSSEC-validating and non-validating resolvers
| if you care about
| that (all of the above are DNSSEC validating).
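The interleaving described above can be generated mechanically. A small sketch, using IPv4 lists only for brevity (the comment above interleaves the IPv6 addresses the same way):

```python
from itertools import zip_longest

def interleave(*provider_lists):
    """Round-robin the per-provider address lists so that a strictly
    ordered client (like systemd-resolved) fails over across
    providers first, rather than walking one provider's whole list."""
    flat = []
    for round_ in zip_longest(*provider_lists):
        flat.extend(addr for addr in round_ if addr is not None)
    return flat

cloudflare = ["1.1.1.1", "1.0.0.1"]
quad9 = ["9.9.9.9", "149.112.112.112"]
google = ["8.8.8.8", "8.8.4.4"]
print(interleave(cloudflare, quad9, google))
# → ['1.1.1.1', '9.9.9.9', '8.8.8.8',
#    '1.0.0.1', '149.112.112.112', '8.8.4.4']
```

With this ordering, two consecutive entries never belong to the same provider, so a single-provider outage costs at most one retry.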
| matthewtse wrote:
| wow good tip
|
| I was handling an incident due to this outage. I ended up
| adding Google DNS resolvers using systemd-resolved, but I
| didn't think to interleave them!
| karel-3d wrote:
| dnsdist is AMAZINGLY easy to set up as a secure local resolver
| that forwards all queries to DoH (and checks SSL) and checks
| liveness every second
|
| I need to do a write-up one day
| jzebedee wrote:
| Please do. I'd be curious what a secure-by-default self
| hosted resolver would look like.
| daneel_w wrote:
| For what it may be worth, here's a most basic (but fully
| working) config for running Unbound as a DoT-only
| forwarder:
|             server:
|                 logfile: ""
|                 log-queries: no  # adjust as necessary
|                 interface: 127.0.0.1@53
|                 access-control: 127.0.0.0/8 allow
|                 infra-keep-probing: yes
|                 tls-system-cert: yes
|             forward-zone:
|                 name: "."
|                 forward-tls-upstream: yes
|                 forward-addr: 9.9.9.9@853#dns.quad9.net
|                 forward-addr: 193.110.81.9@853#zero.dns0.eu
|                 forward-addr: 149.112.112.112@853#dns.quad9.net
|                 forward-addr: 185.253.5.9@853#zero.dns0.eu
| karmakaze wrote:
| I don't consider these interchangeable. They have different
| priorities and policies. If anything I'd choose one and use my
| ISP default as fallback.
| nemonemo wrote:
| Agreed in principle, but has anyone seen any practical
| difference between these DNS services? What would be a more
| detailed downside for using these in parallel instead of the
| ISP default as a fallback?
| eli wrote:
| My ISP has already been caught selling personally
| identifiable customer data. I trust them less than any of
| those companies.
| sumtechguy wrote:
| My ISP's DNS got kicked to the curb once they started
| returning results for anything, including invalid sites.
| Basically to try to steer you towards their search.
| outworlder wrote:
| My ISP (one of the largest in the US) likes to hijack DNS
| responses (especially NXDOMAIN) and serve crap. No thanks.
| Which is also why I have to use encryption to talk to public
| DNS servers otherwise they will hijack anyways.
| whitehexagon wrote:
| That sounds good in principle, but is there a more private
| configuration that doesn't send DNS resolutions to Cloudflare,
| Google et al., i.e. avoiding BigTech tracking, and without DoH?
|
| dnsmasq with a list of smaller trusted DNS providers sounds
| perfect, as long as it is not considered bad etiquette to spam
| multiple DNS providers for every resolution?
|
| But where to find a trusted list of privacy focused DNS
| resolvers. The couple I tried from random internet advice
| seemed unstable.
| agolliver wrote:
| There are no good private DNS configurations, but if you
| don't trust the big caching recursive resolvers then I'd
| consider just running your own at home. Unbound is easy to
| set up and you'll probably never notice a speed difference.
| hdgvhicv wrote:
| I trust my isp far more than I trust cloudflare and google
| bagels wrote:
| Why? Some were injecting ads, blocking services,
| degrading video and other wrongdoings.
| sophacles wrote:
| You can just run unbound or similar and do your own recursive
| resolving.
| Tmpod wrote:
| Quad9 and NextDNS are usually thrown around.
| Yeri wrote:
| https://www.dns0.eu/ is an option
| mcpherrinm wrote:
| I've reviewed the privacy policy and performance of various
| DoH servers, and determined in my opinion that Cloudflare and
| Google both provide privacy-respecting policies.
|
| I believe that they follow their published policies and have
| reasonable security teams. They're also both popular
| services, which mitigates many of the other types of DNS
| tracking possible.
|
| https://developers.google.com/speed/public-dns/privacy
| https://developers.cloudflare.com/1.1.1.1/privacy/public-
| dns...
| bsilvereagle wrote:
| I haven't had any problems with OpenNIC: https://opennic.org/
|
| > OpenNIC (also referred to as the OpenNIC Project) is a user
| owned and controlled top-level Network Information Center
| offering a non-national alternative to traditional Top-Level
| Domain (TLD) registries; such as ICANN.
| hamandcheese wrote:
| NextDNS. Generous free tier, very affordable paid tier. Happy
| customer for several years and I've never noticed an outage.
| Melatonic wrote:
| This
| paradao wrote:
| Using DNSCrypt with anonymized DNS could be an option:
| https://github.com/DNSCrypt/dnscrypt-
| proxy/wiki/Anonymized-D...
| localtoast wrote:
| dnsforge.de comes to mind.
| xyst wrote:
| Probably great for users. Awful for trying to reproduce an
| issue. I prefer a more deterministic approach myself.
| itscrush wrote:
| Looks like AdGuard allows for same, thanks for mentioning
| dnsmasq support! I overlooked it on setup.
| heavyset_go wrote:
| I think systemd-resolved does something similar if you use
| that. Does DoT and DNSSEC by default.
|
| If you want to eschew centralized DNS altogether, if you run a
| Tor daemon, it has an option to expose a DNS resolver to your
| network. Multiple resolvers if you want them.
| alyandon wrote:
| Cloudflare's 1.1.1.1 Resolver service became unavailable to the
| Internet starting at 21:52 UTC and ending at 22:54 UTC
|
| Weird. According to my own telemetry from multiple networks they
| were unavailable for a lot longer than that.
| chrisgeleven wrote:
| I've been lazy and have been using only Cloudflare's resolver. In
| hindsight I probably should just setup two instances of Unbound
| on my home network that don't rely on upstream resolvers and call
| it a day. It's unlikely both will go down at the same time and if
| I'm having an total Internet outage (unlikely as I have Comcast
| as primary + T-Mobile Home Internet as a backup), it doesn't
| matter if DNS is or isn't resolving.
| tacitusarc wrote:
| Perhaps I am over-saturated, but this write-up felt like AI, at
| least largely edited by a model.
| xyst wrote:
| Am not a fan of CF in general due to their role in centralization
| of the internet around their services.
|
| But I do appreciate these types of detailed public incident
| reports and RCAs.
| zac23or wrote:
| It's no surprise that Cloudflare is having a service issue again.
|
| I use Cloudflare at work. Cloudflare has many bugs, and some
| technical decisions are absurd, such as the worker's cache.delete
| method, which only clears the cache contents in the data center
| where the Worker was invoked!!!
| https://developers.cloudflare.com/workers/runtime-apis/cache...
|
| In my experience, Cloudflare support is not helpful at all,
| trying to pass the problem on to the user, like "Just avoid
| holding it in that way".
|
| At work, I needed to use Cloudflare. The next job I get, I'll put
| a limit on my responsibilities: I don't work with Cloudflare.
|
| I will never use Cloudflare at home and I don't recommend it to
| anyone.
|
| Next week: A new post about how Cloudflare saved the web from a
| massive DDOS attack.
| kentonv wrote:
| > some technical decisions are absurd, such as the worker's
| cache.delete method, which only clears the cache contents in
| the data center where the Worker was invoked!!!
|
| The Cache API is a standard taken from browsers. In the
| browser, cache.delete obviously only deletes that browser's
| cache, not all other browsers in the world. You could certainly
| argue that a global purge would be more useful in Workers, but
| it would be inconsistent with the standard API behavior, and
| also would be extraordinarily expensive. Code designed to use
| the standard cache API would end up being much more expensive
| than expected.
|
| With all that said, we (Workers team) do generally feel in
| retrospect that the Cache API was not a good fit for our
| platform. We really wanted to follow standards, but this
| standard in this case is too specific to browsers and as a
| result does not work well for typical use cases in Cloudflare
| Workers. We'd like to replace it with something better.
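A sketch of the per-colo semantics discussed above. The `handlePurge` handler and `caches.default` are assumptions about the Workers runtime and only run inside it; the `cacheKey` helper is plain JavaScript:

```javascript
// Pure helper: normalize a URL into a stable cache key, so
// equivalent URLs (reordered query params, fragments) share one
// cache entry. Runnable anywhere.
function cacheKey(url) {
  const u = new URL(url);
  u.hash = "";           // fragments never reach the server
  u.searchParams.sort(); // stable param ordering
  return u.toString();
}

// Hypothetical Worker handler using the standard Cache API.
// Per the standard semantics, delete() only purges the entry in
// the data center running this invocation -- other colos keep
// their copies until they expire. A global purge requires
// Cloudflare's separate cache-purge mechanism instead.
async function handlePurge(request) {
  const deleted = await caches.default.delete(cacheKey(request.url));
  return new Response(deleted ? "purged locally" : "not cached here");
}
```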
| freedomben wrote:
| Just wanted to say, I always appreciate your comments and
| frankness!
| zac23or wrote:
| >cache.delete obviously only deletes that browser's cache,
| not all other browsers in the world.
|
| To me, it only makes sense if the put method creates a cache
| only in the datacenter where the Worker was invoked. Put and
| delete need to be related, in my opinion.
|
| Now I'm curious: what's the point of clearing the cache
| contents in the datacenter where the Worker was invoked? I
| can't think of any use for this method.
|
| My criticisms aren't about functionality per se or about the
| developers. I don't doubt the developers' competence, but I
| feel like there's something wrong with the company culture.
| freedomben wrote:
| Cloudflare is definitely not perfect (and when they make a
| change that breaks the existing API contract it always makes
| for several miserable days for me), but on the whole Cloudflare
| is pretty reliable.
|
| That said, I don't use workers and don't plan to. I personally
| try to stay away from non cross-platform stuff because I've
| been burned too heavily with vendor/platform lock-in in the
| past.
| kentonv wrote:
| > and when they make a change that breaks the existing API
| contract it always makes for several miserable days for me
|
| If we changed an API in Workers in a way that broke any
| Worker in production, we consider that an incident and we
| will roll it back ASAP. We really try to avoid this but
| sometimes it's hard for us to tell. Please feel free to
| contact us if this happens in the future (e.g. file a support
| ticket or file a bug on workerd on GitHub or complain in our
| Discord or email kenton@cloudflare.com).
| freedomben wrote:
| Thank you! To clarify it's been API contracts in the DNS
| record setting API that have hit me. I'm going from memory
| here and it's been a couple years I think so might be a bit
| rusty, but one example was a slight change in data type
| acceptance for TTL on a record. It used to take either a
| string or integer in the JSON but at some point started
| rejecting integers (or strings, whichever one I was sending
| at the time stopped being accepted) so the API calls were
| suddenly failing (to be fair that might not have
| technically been a violation of the contract, but it was a
| change in behavior that had been consistent for years and
| which I would not have expected). Another one was regarding
| returning zone_id for records where the zone_id stopped
| getting populated in the returned record. Luckily my code
| already had the zone_id because it needs that to build the
| URL path, but it was a rough debugging session and then I
| had to hack around it by either re-adding the zone ID to
| the returned record or removing zone ID from my equality
| check, neither of which were preferred solutions.
|
| If we start using workers though I'll definitely let you
| know if any API changes!
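A defensive-parsing sketch for the kind of type change described above. This is a hypothetical client-side helper, not Cloudflare's API: it accepts a TTL as either an int or a numeric string, so an upstream flip between the two doesn't break the caller:

```python
def normalize_ttl(value, minimum=60, maximum=86400):
    """Accept a TTL as an int or numeric string and clamp it to a
    sane range. Shields an API client from upstream type changes
    (int vs. string) like the one described in the comment above.
    """
    if isinstance(value, bool):  # bool subclasses int; reject it
        raise TypeError("TTL must be an integer or numeric string")
    if isinstance(value, str):
        value = int(value.strip())
    if not isinstance(value, int):
        raise TypeError("TTL must be an integer or numeric string")
    return max(minimum, min(maximum, value))
```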
| aftbit wrote:
| >Even though this release was peer-reviewed by multiple engineers
|
| I find it somewhat surprising that none of the multiple engineers
| who reviewed the original change in June noticed that they had
| added 1.1.1.0/24 to the list of prefixes that should be rerouted.
| I wonder what sort of human mistake or malice led to that
| original error.
|
| Perhaps it would be wise to add some hard-coded special-case
| mitigations to DLS such that it would not allow 1.1.1.1/32 or
| 1.0.0.1/32 to be reassigned to a single location.
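A guardrail of the kind suggested above might look like this. Names and the API shape are illustrative, not Cloudflare's actual DLS code: the check refuses any change that assigns address space covering the public resolver IPs to a single location:

```python
import ipaddress

# Addresses that must never collapse to a single location
# (hypothetical hard-coded list, per the suggestion above).
PROTECTED = [ipaddress.ip_network(p) for p in ("1.1.1.1/32", "1.0.0.1/32")]

def validate_prefix_change(prefix: str, locations: list[str]) -> None:
    """Raise ValueError if `prefix` overlaps a protected resolver
    address and the change would pin it to fewer than two locations."""
    net = ipaddress.ip_network(prefix)
    for protected in PROTECTED:
        if net.overlaps(protected) and len(locations) < 2:
            raise ValueError(
                f"{prefix} covers protected resolver address "
                f"{protected}; refusing single-location assignment"
            )
```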
| burnte wrote:
| It's probably much simpler, "I trust Jerry, I'm sure this is
| fine, approved."
| roughly wrote:
| I'm generally more of a "blame the tools" than a "blame the
| people" person - depending on how the system is set up and how
| the configs are generated, it's easy for a change like this to
| slip by -
| especially if a bunch of the diff is autogenerated. It's still
| humans doing code review, and this kind of failure indicates
| process problems, regardless of whether or not laziness or
| stupidity were also present.
|
| But, yes, a second mitigation here would be defense in depth -
| in an ideal world, all your systems use the same ops/deploy/etc
| stack, in this one, you probably want an extra couple steps in
| the way of potentially taking a large public service offline.
| b0rbb wrote:
| I don't know about you all but I love a well written RCA. Nicely
| done.
| alexandrutocar wrote:
| > It's worth noting that DoH (DNS-over-HTTPS) traffic remained
| relatively stable as most DoH users use the domain cloudflare-
| dns.com, configured manually or through their browser, to access
| the public DNS resolver, rather than by IP address.
|
| I use their DNS over HTTPS and if I hadn't seen the issue being
| reported here, I wouldn't have caught it at all. However, this--
| along with a chain of past incidents (including a recent
| cascading service failure caused by a third-party outage)--led me
| to reduce my dependencies. I no longer use Cloudflare Tunnels or
| Cloudflare Access, replacing them with WireGuard and mTLS
| certificates. I still use their compute and storage, but for
| personal projects only.
| cadamsdotcom wrote:
| > The way that Cloudflare manages service topologies has been
| refined over time and currently consist of a combination of a
| legacy and a strategic system that are synced.
|
| This writing is just brilliant. Clear to technical and non-
| technical readers. Makes the in-progress migration sound way more
| exciting than it probably is!
|
| > We are sorry for the disruption this incident caused for our
| customers. We are actively making these improvements to ensure
| improved stability moving forward and to prevent this problem
| from happening again.
|
| This is about as good as you can get it from a company as serious
| and important as Cloudflare. Bravo to the writers and vetters for
| not watering this down.
| kccqzy wrote:
| I can't tell if you are being sarcastic, but "legacy" is a term
| most often used by technical people whereas "strategic" is a
| term most often used by marketing and non-technical leadership.
| Mixing them together annoys both kinds of readers.
| hbay wrote:
| you were annoyed by that sentence?
| marcusb wrote:
| You cannot throw a rock without hitting a product marketer
| describing everything not-their-product as "legacy."
| nixpulvis wrote:
| Fun fact: Verizon cellular blocks 1.1.1.1. I discovered this
| after trying to use my hotspot from my Linux laptop with it set
| as my default DNS.
|
| Very frustrating.
___________________________________________________________________
(page generated 2025-07-16 23:00 UTC)