[HN Gopher] Questions for Cloudflare
       ___________________________________________________________________
        
       Questions for Cloudflare
        
       Author : todsacerdoti
       Score  : 64 points
       Date   : 2025-11-19 16:49 UTC (6 hours ago)
        
 (HTM) web link (entropicthoughts.com)
 (TXT) w3m dump (entropicthoughts.com)
        
       | mnholt wrote:
       | This website could benefit from a CDN...
        
         | majke wrote:
         | Questions for "questions for cloudflare" owner
        
         | jf wrote:
         | https://web.archive.org/web/20251119165814/https://entropict...
        
         | internetter wrote:
         | 8.5s... yikes... although notably they aren't adopting an anti-
         | CDN or even really anti-Cloudflare perspective, just grievances
         | with software architecture. So the slowness of their site isn't
         | really detrimental to their argument.
        
         | Sesse__ wrote:
         | I loaded it and got an LCP of ~350 ms, which is better than the
         | ~550 ms I got from this very comment page.
        
       | tptacek wrote:
       | It's a detailed postmortem published within a couple of hours of
       | the incident, and this blog post is disappointed that it didn't
       | provide a comprehensive assessment of all the procedural changes
       | inside the engineering organization that came as a consequence.
       | At the point in time when this blog post was written, it would
       | _not have been possible_ for them to answer these questions.
        
         | otterley wrote:
         | "But I need attention _now_! "
        
         | kqr wrote:
         | Part of my argument in the article is that it doesn't take
         | long to come to that realisation when using the right methods.
         | It would absolutely have been possible to identify the problem
         | of missing feedback by that time.
        
           | tptacek wrote:
           | It absolutely does take long with the right methods; in fact,
           | the righter the methods, the longer it takes. You're talking
           | about a postmortem that was up within _single digit hours_ of
           | the initial incident resolution. A lot of orgs would wait on
           | the postmortem just to be sure the system is settling back
           | into a steady state!
           | 
           | You were way off here.
        
             | kqr wrote:
             | To be clear, I'm not expecting a full analysis within
             | hours. I'm hoping for a method of analysis by which the
             | major deficiencies come up at a high level, and then as
             | more effort is spent on it, more details around those
             | deficiencies are revealed.
             | 
             | What otherwise tends to happen, in my experience, is that
             | the initial effort brings up some deficiencies, only some
             | of which are the major ones, and subsequent effort is then
             | spent looking mainly in that same area, never uncovering
             | the major deficiencies that were not initially discovered.
        
       | RationPhantoms wrote:
       | > I wish technical organisations would be more thorough in
       | investigating accidents.
       | 
       | Cloudflare is probably one of the best "voices" in the industry
       | when it comes to post-mortems and root cause analysis.
        
         | tptacek wrote:
         | I wish blog posts like these would be more thorough in simply
         | looking at the timestamps on the posts they're critiquing.
        
         | ItsHarper wrote:
         | If you read their previous article about AWS (linked in this
         | one), they specifically call out root cause analysis as a
         | flawed approach.
        
       | timenotwasted wrote:
       | "I don't know. I wish technical organisations would be more
       | thorough in investigating accidents." - This is just armchair
       | quarterbacking at this point given that they were forthcoming
       | during the incident and had a detailed post-mortem shortly after.
       | The issue is that, not being a fly on the wall in the war room,
       | the OP is making massive assumptions about the level of
       | discussion that takes place about these types of incidents long
       | after they have left the collective consciousness of the
       | mainstream.
        
         | cogman10 wrote:
         | People outside of tech (and some inside) can be really bad at
         | understanding how something like this could slip through the
         | cracks.
         | 
         | Reading Cloudflare's description of the problem, this is
         | something that I could easily see my own company missing. It's
         | the case that a file got too big, which tanked performance
         | enough to bring everything down. That's a VERY hard thing to
         | test for, especially since this appears to have been a
         | configuration file and a regular update.
         | 
         | The reason it's so hard to test for is that all tests would
         | show there's no problem. This wasn't a code update; it was a
         | config update. Without really extensive performance tests
         | (which, when done well, take a long time!) there really wasn't
         | a way to know that a change that appeared safe wasn't.
         | 
         | I personally give Cloudflare a huge pass for this. I don't
         | think this happened due to any sloppiness on their part.
         | 
         | Now, if you want to see a sloppy outage you look at the
         | Crowdstrike outage from a few years back that bricked basically
         | everything. That is what sheer incompetence looks like.
        
           | jsnell wrote:
           | I don't believe that is an accurate description of the
           | issue. It wasn't that the system got too slow due to a big
           | file; it's that the file getting too big was treated as a
           | fatal error rather than causing requests to fail open.
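           | 
           | A minimal sketch of that distinction (not Cloudflare's
           | actual code; the cap and the neutral fallback score here
           | are made up):
           | 
           |   // Hypothetical cap on the number of features the
           |   // config file may carry.
           |   const MAX_FEATURES: usize = 200;
           | 
           |   struct BotScore(u8);
           | 
           |   fn score_request(features: &[f64]) -> BotScore {
           |       // Fail-closed: treat an oversized file as fatal,
           |       // taking request handling down with it.
           |       // assert!(features.len() <= MAX_FEATURES);
           | 
           |       // Fail-open: log, degrade to a neutral score,
           |       // and keep serving the request.
           |       if features.len() > MAX_FEATURES {
           |           eprintln!("feature file too large; degrading");
           |           return BotScore(50); // neutral score
           |       }
           |       BotScore(evaluate(features))
           |   }
           | 
           |   fn evaluate(_features: &[f64]) -> u8 {
           |       30 // stand-in for the real model evaluation
           |   }
           | 
           |   fn main() {
           |       let features = vec![0.0; 250]; // over the cap
           |       let BotScore(s) = score_request(&features);
           |       println!("request served with bot score {}", s);
           |   }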
        
         | kqr wrote:
         | The article makes no claim about the effort that has gone into
         | the analysis. You can apply a lot of effort and still only
         | produce a shallow analysis.
         | 
         | If the analysis has not uncovered the feedback problems (even
         | with large effort, or without it), my argument is that a better
         | method is needed.
        
       | colesantiago wrote:
       | Maybe instead of asking "questions" of a corporation whose only
       | interest is profit, which is now beholden to Wall Street and
       | wouldn't care what we think, we should look for answers and
       | alternatives like BunnyCDN [0], Anubis [1], etc.
       | 
       | [0] https://bunny.net/
       | 
       | [1] https://github.com/TecharoHQ/anubis
        
         | arbll wrote:
         | Ah yes, because both of those alternatives are non-profits,
         | right?
        
           | colesantiago wrote:
           | You can sponsor Anubis right now and start supporting
           | alternatives.
        
         | vlovich123 wrote:
         | Bunny has raised money from VCs, which indicates it's going
         | down the "Wall Street" path.
         | 
         | Anubis is a bot firewall not a CDN.
        
           | koakuma-chan wrote:
           | I wouldn't trust a provider that has "Excellent (underlined)
           | star star star star star STAR TrustPilot 4.8 on G2" on their
           | landing page. I bet they are also award-winning, and one of
           | the 150 best places to work at. Really shows they have no
           | taste.
        
             | colesantiago wrote:
             | ?
             | 
             | I don't remember telling anyone to trust the reviews?
             | 
             | I think it is healthy to try alternatives to Cloudflare and
             | then come to your own decision.
        
               | koakuma-chan wrote:
               | I'm not saying you did, but for me things like what I
               | mentioned are red flags. They also use C#--another red
               | flag. There's OVH, Hetzner, DigitalOcean, etc--all are
               | private companies that aren't on Wall Street.
        
               | colesantiago wrote:
               | No.
               | 
               | DigitalOcean is owned by Wall Street.
               | 
               | Only Hetzner is a good alternative CDN.
        
               | koakuma-chan wrote:
               | You're right, DO is public.
        
           | colesantiago wrote:
           | > Bunny has raised money from VC which indicates it's going
           | the "Wall Street" path.
           | 
           | Yet it _is_ an available alternative to Cloudflare that is
           | _not_ on Wall Street (i.e. not a public company).
           | 
           | If you want to do this 100% yourself there is Apache Traffic
           | Control.
           | 
           | https://github.com/apache/trafficcontrol
           | 
           | > Anubis is a bot firewall not a CDN.
           | 
           | For now. If we support alternatives they can grow into an
           | open source CDN.
        
             | vlovich123 wrote:
             | Anubis is a piece of software, not a CDN service.
             | 
             | You realize that to run a CDN you have to buy massive
             | amounts of bandwidth and computers? DIY here betrays a
             | misunderstanding of what it takes to be DoS resistant and
             | also what it takes to actually have a CDN deliver a
             | performance benefit.
        
               | colesantiago wrote:
               | This is a great idea for Anubis: funding future
               | development and becoming an alternative CDN.
               | 
               | Customers on the enterprise plan can either use Anubis's
               | Managed CDN or host Anubis themselves via an enterprise
               | license!
               | 
               | They can directly receive tech support from the creator
               | of Anubis (as long as they pay for the enterprise plan).
               | 
               | I don't see a problem with this, and it can turn Anubis
               | from "a piece of software" into a CDN.
        
               | HumanOstrich wrote:
               | Has anyone from the Anubis project said anything about
               | aspiring to transform into a CDN?
        
               | akerl_ wrote:
               | Maybe they can also start a marketplace to buy and sell
               | digital goods, like NFTs.
        
       | blixt wrote:
       | It's a bit odd to come from the outside to judge the internal
       | process of an organization with many very complex moving parts,
       | only a fraction of which we have been given context for,
       | especially so soon after the incident and the post-mortem
       | explaining it.
       | 
       | I think the ultimate judgement must come from whether we will
       | stay with Cloudflare now that we have seen how bad it can get.
       | One could also say that this level of outage hasn't happened in
       | many years, and they are now freshly frightened by it happening
       | again, so expect things to get tightened up (probably using
       | different questions than this blog post proposes).
       | 
       | As for what this blog post could have been: maybe an account of
       | how these ideas were actively used by the author at e.g. Tradera
       | or Loop54.
        
         | kqr wrote:
         | > how these ideas were actively used by the author at e.g.
         | Tradera or Loop54.
         | 
         | This would be preferable, of course. Unfortunately both
         | organisations were rather secretive about their technical and
         | social deficiencies and I don't want to be the one to air them
         | out like that.
        
       | otterley wrote:
       | The post is describing a full post-mortem process, including a
       | Five Whys (https://en.wikipedia.org/wiki/Five_whys) inquiry. In a
       | mature organization that follows SRE best practices, this will be
       | performed by the relevant service teams, recorded in the post-
       | mortem document, and used for creating follow-up actions. It's
       | almost always an internal process and isn't shared with the
       | public--and often not even with customers under NDA.
       | 
       | We mustn't assume that Cloudflare isn't undertaking this process
       | just because we're not an audience to it.
        
         | tptacek wrote:
         | It also _couldn't have happened_ by the time the postmortem
         | was produced. The author of this blog post appears not to have
         | noticed that the postmortem was up within a couple of hours of
         | resolving the incident.
        
           | otterley wrote:
           | Exactly. These deeper investigations can sometimes take weeks
           | to complete.
        
       | dkyc wrote:
       | These engineering insights were not worth the 16-second load
       | time of this website.
       | 
       | It's _extremely_ easy, and correspondingly valueless, to ask all
       | kinds of "hard questions" about a system 24h after it had a huge
       | incident. The hard part is doing this appropriately for _every_
       | part of the system _before_ something happens, while maintaining
       | the other equally legitimate goals of the organization (such as
       | cost-efficiency, product experience, performance, etc.). There's
       | little evidence to suggest Cloudflare isn't doing that, and
       | their track record is definitely good for their scale.
        
         | raincole wrote:
         | Every engineer has this phase where you're capable enough to
         | do something at small scale, so you look at the incumbents,
         | who are doing a similar thing but at 1000x scale, and wonder
         | how they are so bad at it.
         | 
         | Some never get out of this phase, though.
        
       | Nextgrid wrote:
       | It is unfair to blame Cloudflare (or AWS, or Azure, or GitHub)
       | for what's happening, and I say that as one of the biggest
       | "yellers at the cloud" on here.
       | 
       | Ultimately end-users don't have a relationship with any of those
       | companies. They have relationships with businesses that chose to
       | rely on them. Cloudflare et al. publish SLAs and compensation
       | schedules in case those SLAs are missed. Businesses chose to
       | accept those SLAs and take on that risk.
       | 
       | If Cloudflare et al. signed a contract promising a certain SLA
       | (with penalties) and then chose not to pay out those penalties,
       | there would be reasons to ask questions, but nothing suggests
       | they're not holding up their side of the deal - you will
       | absolutely get compensated (in the form of a refund on your bill)
       | in case of an outage.
       | 
       | The issue is that businesses accept this deal and then scream
       | when it goes wrong, yet are unwilling to pay for a solution that
       | does not fail in this way. Those solutions exist - you absolutely
       | can build systems that are reliable and/or fail in a predictable
       | _and testable_ manner; it's simply more expensive and requires
       | more skill than just slapping a few SaaSes and CNCF projects
       | together. But it is possible - look at the uptime of card
       | networks, stock exchanges, or airplane avionics. It's just more
       | expensive, and the truth is that businesses don't want to pay
       | for it (and neither do their end-customers - they will bitch
       | about outages, but will immediately run the other way if you ask
       | them to pony up for a more reliable system; the ones that don't
       | are already running such a system and were unaffected by the
       | recent outages).
        
         | psim1 wrote:
         | > It is unfair to blame Cloudflare (or AWS, or Azure, or
         | GitHub) for what's happening
         | 
         | > Ultimately end-users don't have a relationship with any of
         | those companies. They have relationships with businesses that
         | chose to rely on them
         | 
         | Could you not say this about any supplier relationship? No, in
         | this case, we all know the root of the outage is Cloudflare,
         | so it absolutely makes sense to blame Cloudflare, and not
         | their customers.
        
           | Nextgrid wrote:
           | Devil's advocate: I operate the equivalent of an online
           | lemonade stand, some shitty service at a cheap price offered
           | with little guarantees ("if I fuck up I'll refund you the
           | price of your 'lemonade'") for hobbyists to use to host their
           | blog and Visa decides to use it in their critical path. Then
           | this "lemonade stand" goes down. Do you think it's fair to
           | blame me? I never chose to be part of Visa's authorization
           | loop, and after all is done I did indeed refund them the
           | price of their "lemonade". It's Visa's fault they introduced
           | a single point of failure with inadequate compensation
           | schedules in their critical path.
        
             | stronglikedan wrote:
             | > Do you think it's fair to blame me?
             | 
             | Absolutely, yes. Where's your backup plan for when Visa
             | doesn't behave as you expect? It's okay to not have one,
             | but it's also your fault for not having one, and that is
             | the sole reason that the lemonade stand went down.
        
               | Nextgrid wrote:
               | > Where's your backup plan for when Visa doesn't behave
               | as you expect?
               | 
               | I don't have (nor have to have) such a plan; I offer X
               | service with Y guarantees, paying out Z dollars if I
               | don't hold up my part of the bargain. In this
               | hypothetical situation, if Visa signs up, I assume they
               | want to host their marketing website or some other low-
               | hanging fruit; it's not my job to check what they're
               | using it for (in fact it would be preferable for me not
               | to check, as I'd otherwise be seeing unencrypted card
               | numbers and PII).
        
           | stronglikedan wrote:
           | If I'm paying a company that chose Cloudflare, and my SLA
           | with that company entitles me to some sort of compensation
           | for outages, then I expect that company to compensate me
           | regardless of whose fault it is, and regardless of whether
           | they were compensated by Cloudflare. I can know that the
           | cause of the outage is Cloudflare, but also know that the
           | company that I'm paying should have had a backup plan and
           | not be solely reliant on one vendor. In other words, I care
           | about who I pay, not who they decide to use.
        
           | wongarsu wrote:
           | Don't we say that about all supplier relationships? If my
           | Samsung washing machine stops working I blame Samsung. Even
           | when it turns out that it was a broken drive belt I don't
           | blame the manufacturer of the drive belt, or whoever produced
           | the rubber that went into the drive belt, or whoever made the
           | machine involved in the production of this batch of rubber.
           | Samsung chose to put the drive belt in my washing machine;
           | that's where the buck stops. They are free to litigate the
           | matter internally, but I only care about Samsung selling me
           | a washing machine that's now broken.
           | 
           | Same with Cloudflare. If you run your site on Cloudflare,
           | you are responsible for any downtime caused to your site by
           | Cloudflare.
           | 
           | What we can blame Cloudflare for is having so many customers
           | that a Cloudflare outage has outsized impact compared to the
           | more uncorrelated outages we would have if sites were
           | distributed among many smaller providers. But that's not
           | quite the same as blaming any individual site's downtime on
           | Cloudflare.
        
             | raincole wrote:
             | > Don't we say that about all supplier relationships?
             | 
             | Not always. If the farm sells packs of poisoned bacon to
             | the supermarket, we blame the farm.
             | 
             | It's more about whether the website/supermarket can
             | reasonably do the QA.
        
         | mschuster91 wrote:
         | > look at the uptime of card networks, stock exchanges, or
         | airplane avionics.
         | 
         | In fact, I'd say... airplane avionics are _not_ what you should
         | be looking at. Boeing's 787? Reboot every 51 days or risk the
         | pilots getting wrong airspeed indicators! No, I'm not joking
         | [1], and it's not the first time either [2], and it's not just
         | Boeing [3].
         | 
         | [1]
         | https://www.theregister.com/2020/04/02/boeing_787_power_cycl...
         | 
         | [2]
         | https://www.theregister.com/2015/05/01/787_software_bug_can_...
         | 
         | [3]
         | https://www.theregister.com/2019/07/25/a350_power_cycle_soft...
        
           | Nextgrid wrote:
           | > Reboot every 51 days or risk the pilots getting wrong
           | airspeed indicators
           | 
           | If this is documented then fair enough - airlines don't
           | _have_ to buy airplanes that need rebooting every 51 days;
           | they can vote with their wallets, and Boeing is welcome to
           | fix it. If not documented, I hope regulators enforced
           | penalties high enough to force Boeing to get their stuff
           | together.
           | 
           | Either way, the uptime of avionics (and redundancies -
           | including the unreliable airspeed checklists) is much higher
           | than anything conventional software "engineering" has been
           | putting out over the past decade.
        
       | waiwai933 wrote:
       | > Maybe some of these questions are obviously answered in a
       | Cloudflare control panel or help document. I'm not in the market
       | right now so I won't do that research.
       | 
       | I don't love piling on, but it still shocks me that people write
       | without first reading.
        
       | jcmfernandes wrote:
       | The tone is off. Cloudflare shared a post-mortem on the same day
       | as the incident. It's unreasonable to throw an "I wish technical
       | organisations would be more thorough in investigating accidents".
       | 
       | With that said, I would also like to know how it took them ~2
       | hours to see the error. That's a long, long time.
        
       | vlovich123 wrote:
       | A lot of these questions betray a misunderstanding of how it
       | works - bot management is evaluated inline within the proxy as a
       | feature on the site (similar to other features like image
       | optimization).
       | 
       | So during ingress there isn't an async call to the bot management
       | service which intercepts the request before it's outbound to
       | origin - it's literally a Lua script (or Rust module in fl2) that
       | runs inline on ingress as part of handling the request. Thus
       | there's no timeout or other concern about the management service
       | failing to assign a bot score.
       | 
       | There are better questions, but to me the ones posed don't seem
       | particularly interesting.
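       | 
       | To make the inline model described above concrete, a rough
       | sketch under those assumptions (types and names are
       | illustrative, not Cloudflare internals):
       | 
       |   struct Request; // stand-in for an ingress request
       |   struct BotScore(u8);
       | 
       |   // Inline model: scoring runs in the same call stack that
       |   // handles ingress (like a Lua script, or a module compiled
       |   // into the proxy). No network hop to a separate scoring
       |   // service, so no per-request timeout to tune for it.
       |   fn handle_ingress(req: &Request) -> BotScore {
       |       let score = compute_bot_score(req);
       |       // ...then continue to origin with the score attached...
       |       score
       |   }
       | 
       |   fn compute_bot_score(_req: &Request) -> BotScore {
       |       BotScore(30) // placeholder for the model evaluation
       |   }
       | 
       |   fn main() {
       |       let BotScore(s) = handle_ingress(&Request);
       |       println!("bot score: {}", s);
       |   }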
        
         | kqr wrote:
         | Maybe I'm misunderstanding something, but it being a blocking
         | call does not make timeouts less important -- if anything they
         | become more important!
        
           | tptacek wrote:
           | I don't understand how it is you're doing distributed systems
           | design on a system you don't even have access to. Maybe the
           | issue is timeouts, maybe the issue is some other technical
           | change, maybe the issue is human/procedural. How could you
           | possibly know? The owners of the system probably don't have a
           | full answer within just a few hours of handling the incident!
        
             | kqr wrote:
             | I would be worried if they had all the answers within a few
             | hours! I was just caught off guard by the focus on
             | technical control measures when there seem to have been
             | fairly obvious problems with information channels.
             | 
             | For example, "more global kill switches for features" is
             | good, but would "only" have shaved 30% off the time of
             | recovery (if reading the timeline charitably). Being able
             | to identify the broken component faster would have shaved
             | 30-70% off the time of recovery, depending on how fast
             | identification could happen - even with no improvements to
             | the kill switch situation.
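             | 
             | With purely illustrative numbers (not the actual
             | timeline): if recovery took 180 minutes, of which 120
             | were spent identifying the broken component and 60
             | rolling out the fix, a global kill switch can only
             | compress the final 60 minutes - at most a third of the
             | total - while identifying the component in, say, 30
             | minutes instead of 120 saves 90 minutes, half the
             | total, with no change to the kill switch at all.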
        
       | spenrose wrote:
       | I am disappointed to see this article flagged. I thought it was
       | excellent.
        
         | kqr wrote:
         | In defense of your taste, it was updated based on the loud
         | feedback here, so you probably read a slightly better version
         | than the one that was flagged.
        
       ___________________________________________________________________
       (page generated 2025-11-19 23:01 UTC)