[HN Gopher] Questions for Cloudflare
___________________________________________________________________
Questions for Cloudflare
Author : todsacerdoti
Score : 64 points
Date : 2025-11-19 16:49 UTC (6 hours ago)
(HTM) web link (entropicthoughts.com)
(TXT) w3m dump (entropicthoughts.com)
| mnholt wrote:
| This website could benefit from a CDN...
| majke wrote:
| Questions for "questions for cloudflare" owner
| jf wrote:
| https://web.archive.org/web/20251119165814/https://entropict...
| internetter wrote:
| 8.5s... yikes... although notably they aren't adopting an anti-
| CDN or even really anti-Cloudflare perspective, just grievances
| with software architecture. So the slowness of their site isn't
| really detrimental to their argument.
| Sesse__ wrote:
| I loaded it and got an LCP of ~350 ms, which is better than the
| ~550 ms I got from this very comment page.
| tptacek wrote:
| It's a detailed postmortem published within a couple hours of the
| incident and this blog post is disappointed that it didn't
| provide a comprehensive assessment of all the procedural changes
| inside the engineering organization that came as a consequence.
| At the point in time when this blog post was written, it would
| _not have been possible_ for them to answer these questions.
| otterley wrote:
| "But I need attention _now_! "
| kqr wrote:
| Part of my argument in the article is that it doesn't take long
| to come to that realisation when using the right methods. It
| would absolutely have been possible to identify the problem of
| missing feedback by that time.
| tptacek wrote:
| It absolutely does take long with the right methods; in fact,
| the righter the methods, the longer it takes. You're talking
| about a postmortem that was up within _single digit hours_ of
| the initial incident resolution. A lot of orgs would wait on
| the postmortem just to be sure the system is settling back
| into a steady state!
|
| You were way off here.
| kqr wrote:
| To be clear, I'm not expecting a full analysis within
| hours. I'm hoping for a method of analysis by which the
| major deficiencies come up at a high level, and then as
| more effort is spent on it, more details around those
| deficiencies are revealed.
|
| What otherwise tends to happen, in my experience, is that the
| initial effort brings up some deficiencies which are only
| partially the major ones, and subsequent effort is spent
| looking mainly in that same area, never uncovering the major
| deficiencies which were not initially discovered.
| RationPhantoms wrote:
| > I wish technical organisations would be more thorough in
| investigating accidents.
|
| Cloudflare is probably one of the best "voices" in the industry
| when it comes to post-mortems and root cause analysis.
| tptacek wrote:
| I wish blog posts like these would be more thorough in simply
| looking at the timestamps on the posts they're critiquing.
| ItsHarper wrote:
| If you read their previous article about AWS (linked in this
| one), they specifically call out root cause analysis as a
| flawed approach.
| timenotwasted wrote:
| "I don't know. I wish technical organisations would be more
| thorough in investigating accidents." - This is just armchair
| quarterbacking at this point given that they were forthcoming
| during the incident and had a detailed post-mortem shortly after.
| The issue is that, not being a fly on the wall in the war room,
| the OP is making massive assumptions about the level of
| discussion that takes place about these types of incidents long
| after they have left the collective consciousness of the
| mainstream.
| cogman10 wrote:
| People outside of tech (and some inside) can be really bad at
| understanding how something like this could slip through the
| cracks.
|
| Reading Cloudflare's description of the problem, this is
| something that I could easily see my own company missing. It's
| a case of a file getting too big, which tanked performance
| enough to bring everything down. That's a VERY hard thing to
| test for, especially since this appears to have been a
| configuration file and a regular update.
|
| The reason it's so hard to test for is that all tests would
| show that there's no problem. This isn't a code update, it was
| a config update. Without really extensive performance tests
| (which, when done well, take a long time!), there really wasn't
| a way to know that a change that appeared safe wasn't.
|
| I personally give Cloudflare a huge pass for this. I don't
| think this happened due to any sloppiness on their part.
|
| Now, if you want to see a sloppy outage, look at the
| CrowdStrike outage from a few years back that bricked basically
| everything. That is what sheer incompetence looks like.
| jsnell wrote:
| I don't believe that is an accurate description of the issue.
| It wasn't that the system got too slow due to a big file,
| it's that the file getting too big was treated as a fatal
| error rather than causing requests to fail open.
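|
| To make the fail-open point concrete, here is a minimal,
| hypothetical sketch (plain Rust, standard library only; the
| file name, limit and function names are invented for
| illustration, not taken from Cloudflare's code). Any problem
| loading the feature file degrades to "no bot score" instead of
| failing the request:
|
|     // Hypothetical cap on the number of feature entries.
|     const MAX_FEATURES: usize = 200;
|
|     fn load_bot_features(path: &str) -> Option<Vec<String>> {
|         // Missing, unreadable or oversized file? Fail open.
|         let raw = std::fs::read_to_string(path).ok()?;
|         let features: Vec<String> =
|             raw.lines().map(|l| l.to_string()).collect();
|         if features.len() > MAX_FEATURES {
|             eprintln!("feature file too large, skipping scoring");
|             return None; // degraded service, not a per-request error
|         }
|         Some(features)
|     }
|
|     fn main() {
|         match load_bot_features("/etc/proxy/bot-features.conf") {
|             Some(f) => println!("scoring with {} features", f.len()),
|             None => println!("forwarding traffic without bot scores"),
|         }
|     }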
| kqr wrote:
| The article makes no claim about the effort that has gone into
| the analysis. You can apply a lot of effort and still only
| produce a shallow analysis.
|
| If the analysis has not uncovered the feedback problems (even
| with large effort, or without it), my argument is that a better
| method is needed.
| colesantiago wrote:
| Maybe instead of asking "questions" of a corporation whose only
| interest is profit, which is now beholden to Wall Street and
| wouldn't care what we think, we should look for answers and
| alternatives like BunnyCDN [0], Anubis [1], etc.
|
| [0] https://bunny.net/
|
| [1] https://github.com/TecharoHQ/anubis
| arbll wrote:
| Ah yes, because both of those alternatives are non-profits,
| right?
| colesantiago wrote:
| You can sponsor Anubis right now and start supporting
| alternatives.
| vlovich123 wrote:
| Bunny has raised money from VCs, which indicates it's going the
| "Wall Street" path.
|
| Anubis is a bot firewall not a CDN.
| koakuma-chan wrote:
| I wouldn't trust a provider that has "Excellent (underlined)
| star star star star star STAR TrustPilot 4.8 on G2" on their
| landing page. I bet they are also award-winning and one of the
| 150 best places to work at. Really shows they have no taste.
| colesantiago wrote:
| ?
|
| I don't remember telling anyone to trust the reviews?
|
| I think it is healthy to try alternatives to Cloudflare and
| then come to your own decision.
| koakuma-chan wrote:
| I'm not saying you did, but for me things like what I
| mentioned are red flags. They also use C#--another red
| flag. There's OVH, Hetzner, DigitalOcean, etc--all are
| private companies that aren't on Wall Street.
| colesantiago wrote:
| No.
|
| DigitalOcean is owned by Wall Street.
|
| Only Hetzner is a good alternative CDN.
| koakuma-chan wrote:
| You're right, DO is public.
| colesantiago wrote:
| > Bunny has raised money from VC which indicates it's going
| the "Wall Street" path.
|
| Yet it _is_ an available alternative to Cloudflare that is
| _not_ on Wall Street (i.e. not a public company).
|
| If you want to do this 100% yourself there is Apache Traffic
| Control.
|
| https://github.com/apache/trafficcontrol
|
| > Anubis is a bot firewall not a CDN.
|
| For now. If we support alternatives they can grow into an
| open source CDN.
| vlovich123 wrote:
| Anubis is a piece of software, not a CDN service.
|
| You realize that to run a CDN you have to buy massive amounts
| of bandwidth and computers? DIY here betrays a misunderstanding
| of what it takes to be DoS-resistant and also what it takes
| for a CDN to actually deliver a performance benefit.
| colesantiago wrote:
| This is a great idea for Anubis: funding future
| development and becoming an alternative CDN.
|
| Customers on the enterprise plan can either use Anubis's
| Managed CDN or host Anubis themselves via an enterprise
| license!
|
| They can receive tech support directly from the creator
| of Anubis (as long as they pay for the enterprise plan).
|
| I don't see a problem with this and it can turn Anubis
| from "a piece of software" into a CDN.
| HumanOstrich wrote:
| Has anyone from the Anubis project said anything about
| aspiring to transform into a CDN?
| akerl_ wrote:
| Maybe they can also start a marketplace to buy and sell
| digital goods, like NFTs.
| blixt wrote:
| It's a bit odd to come from the outside to judge the internal
| process of an organization with many very complex moving parts,
| only a fraction of which we have been given context for,
| especially so soon after the incident and the post-mortem
| explaining it.
|
| I think the ultimate judgement must come from whether we will
| stay with Cloudflare now that we have seen how bad it can get.
| One could also say that this level of outage hasn't happened in
| many years, and they are now freshly frightened by it happening
| again, so expect things to get tightened up (probably using
| different questions than this blog post proposes).
|
| As for what this blog post could have been: maybe a page out of
| how these ideas were actively used by the author at e.g. Tradera
| or Loop54.
| kqr wrote:
| > how these ideas were actively used by the author at e.g.
| Tradera or Loop54.
|
| This would be preferable, of course. Unfortunately both
| organisations were rather secretive about their technical and
| social deficiencies and I don't want to be the one to air them
| out like that.
| otterley wrote:
| The post is describing a full post-mortem process including a
| Five Whys (https://en.wikipedia.org/wiki/Five_whys) inquiry. In a
| mature organization that follows best SRE practices, this will be
| performed by the relevant service teams, recorded in the post-
| mortem document, and used for creating follow-up actions. It's
| almost always an internal process and isn't shared with the
| public--and often not even with customers under NDA.
|
| We mustn't assume that Cloudflare isn't undertaking this process
| just because we're not an audience to it.
| tptacek wrote:
| It also _couldn't have happened_ by the time the postmortem
| was produced. The author of this blog post appears not to have
| noticed that the postmortem was up within a couple hours of
| resolving the incident.
| otterley wrote:
| Exactly. These deeper investigations can sometimes take weeks
| to complete.
| dkyc wrote:
| These engineering insights were not worth the 16-second load
| time this website took.
|
| It's _extremely_ easy, and correspondingly valueless, to ask all
| kinds of "hard questions" about a system 24h after it had a huge
| incident. The hard part is doing this appropriately for _every_
| part of the system _before_ something happens, while maintaining
| the other, equally legitimate goals of the organization (such as
| cost-efficiency, product experience, performance, etc.). There's
| little evidence that suggests Cloudflare isn't doing that, and
| their track record is definitely good for their scale.
| raincole wrote:
| Every engineer goes through a phase where you're capable enough
| to do something at small scale, so you look at the incumbents,
| who are doing a similar thing but at 1000x scale, and wonder
| how they can be so bad at it.
|
| Some never get out of this phase, though.
| Nextgrid wrote:
| It is unfair to blame Cloudflare (or AWS, or Azure, or GitHub)
| for what's happening, and I say that as one of the biggest
| "yellers at the cloud" on here.
|
| Ultimately end-users don't have a relationship with any of those
| companies. They have relationships with businesses that chose to
| rely on them. Cloudflare et al. publish SLAs and compensation
| schedules in case those SLAs are missed. Businesses choose to
| accept those SLAs and take on that risk.
|
| If Cloudflare et al. signed a contract promising a certain SLA
| (with penalties) and then chose not to pay out those penalties,
| there would be reason to ask questions, but nothing suggests
| they're not holding up their side of the deal - you will
| absolutely get compensated (in the form of a refund on your bill)
| in case of an outage.
|
| The issue is that businesses accept this deal and then scream
| when it goes wrong, yet are unwilling to pay for a solution that
| does not fail in this way. Those solutions exist - you absolutely
| can build systems that are reliable and/or fail in a predictable
| _and testable_ manner; it's simply more expensive and requires
| more skill than just slapping a few SaaSes and CNCF projects
| together. But it is possible - look at the uptime of card
| networks, stock exchanges, or airplane avionics. It's just more
| expensive and the truth is that businesses don't want to pay for
| it (and neither do their end-customers - they will bitch about
| outages, but will immediately run the other way if you ask them
| to pony up for a more reliable system - and the ones that don't,
| already run such a system and were unaffected by the recent
| outages).
| psim1 wrote:
| > It is unfair to blame Cloudflare (or AWS, or Azure, or
| GitHub) for what's happening
|
| > Ultimately end-users don't have a relationship with any of
| those companies. They have relationships with businesses that
| chose to rely on them
|
| Could you not say this about any supplier relationship? No: in
| this case, we all know the root of the outage is Cloudflare, so
| it absolutely makes sense to blame Cloudflare, and not their
| customers.
| Nextgrid wrote:
| Devil's advocate: I operate the equivalent of an online
| lemonade stand, some shitty service at a cheap price offered
| with few guarantees ("if I fuck up I'll refund you the
| price of your 'lemonade'") for hobbyists to use to host their
| blog and Visa decides to use it in their critical path. Then
| this "lemonade stand" goes down. Do you think it's fair to
| blame me? I never chose to be part of Visa's authorization
| loop, and after all is done I did indeed refund them the
| price of their "lemonade". It's Visa's fault they introduced
| a single point of failure with inadequate compensation
| schedules in their critical path.
| stronglikedan wrote:
| > Do you think it's fair to blame me?
|
| Absolutely, yes. Where's your backup plan for when Visa
| doesn't behave as you expect? It's okay to not have one,
| but it's also your fault for not having one, and that is
| the sole reason that the lemonade stand went down.
| Nextgrid wrote:
| > Where's your backup plan for when Visa doesn't behave
| as you expect?
|
| I don't have (nor have to have) such a plan; I offer X
| service with Y guarantees, paying out Z dollars if I don't
| hold up my part of the bargain. In this hypothetical
| situation, if Visa signs up, I assume they want to host
| their marketing website or some other low-hanging fruit;
| it's not my job to check what they're using it for (in fact
| it would be preferable for me not to check, as I'd be seeing
| unencrypted card numbers and PII otherwise).
| stronglikedan wrote:
| If I'm paying a company that chose Cloudflare, and my SLA
| with that company entitles me to some sort of compensation
| for outages, then I expect that company to compensate me
| regardless of whose fault it is, and regardless of whether
| they were compensated by Cloudflare. I can know that the
| cause of the outage is Cloudflare, but also know that the
| company that I'm paying should have had a backup plan and not
| be solely reliant on one vendor. In other words, I care
| about who I pay, not who they decide to use.
| wongarsu wrote:
| Don't we say that about all supplier relationships? If my
| Samsung washing machine stops working I blame Samsung. Even
| when it turns out that it was a broken drive belt I don't
| blame the manufacturer of the drive belt, or whoever produced
| the rubber that went into the drive belt, or whoever made the
| machine involved in the production of this batch of rubber.
| Samsung chose to put the drive belt in my washing machine;
| that's where the buck stops. They are free to litigate the
| matter internally, but I only care about Samsung selling me a
| washing machine that's now broken.
|
| Same with Cloudflare. If you run your site on Cloudflare, you
| are responsible for any downtime caused to your site by
| Cloudflare.
|
| What we can blame Cloudflare for is having so many customers
| that a Cloudflare outage has outsized impact compared to the
| more uncorrelated outages we would have if sites were
| distributed among many smaller providers. But that's not
| quite the same as blaming any individual site's downtime on
| Cloudflare.
| raincole wrote:
| > Don't we say that about all supplier relationships?
|
| Not always. If the farm sells packs of poisoned bacon to the
| supermarket, we blame the farm.
|
| It's more about whether the website/supermarket can reasonably
| do the QA.
| mschuster91 wrote:
| > look at the uptime of card networks, stock exchanges, or
| airplane avionics.
|
| In fact, I'd say... airplane avionics are _not_ what you should
| be looking at. Boeing's 787? Reboot every 51 days or risk the
| pilots getting wrong airspeed indicators! No, I'm not joking
| [1], and it's not the first time either [2], and it's not just
| Boeing [3].
|
| [1]
| https://www.theregister.com/2020/04/02/boeing_787_power_cycl...
|
| [2]
| https://www.theregister.com/2015/05/01/787_software_bug_can_...
|
| [3]
| https://www.theregister.com/2019/07/25/a350_power_cycle_soft...
| Nextgrid wrote:
| > Reboot every 51 days or risk the pilots getting wrong
| airspeed indicators
|
| If this is documented then fair enough - airlines don't
| _have_ to buy airplanes that need rebooting every 51 days;
| they can vote with their wallets, and Boeing is welcome to fix
| it. If it's not documented, I hope regulators enforce penalties
| high enough to force Boeing to get their stuff together.
|
| Either way, the uptime of avionics (and its redundancies -
| including the unreliable-airspeed checklists) is much higher
| than anything conventional software "engineering" has been
| putting out the past decade.
| waiwai933 wrote:
| > Maybe some of these questions are obviously answered in a
| Cloudflare control panel or help document. I'm not in the market
| right now so I won't do that research.
|
| I don't love piling on, but it still shocks me that people write
| without first reading.
| jcmfernandes wrote:
| The tone is off. Cloudflare shared a post-mortem on the same day
| as the incident. It's unreasonable to throw an "I wish technical
| organisations would be more thorough in investigating accidents"
| at them.
|
| With that said, I would also like to know how it took them ~2
| hours to see the error. That's a long, long time.
| vlovich123 wrote:
| A lot of these questions betray a misunderstanding of how it
| works - bot management is evaluated inline within the proxy as a
| feature on the site (similar to other features like image
| optimization).
|
| So during ingress there's not an async call to the bot management
| service which intercepts the request before it's outbound to
| origin - it's literally a Lua script (or Rust module in fl2) that
| runs on ingress inline as part of handling the request. Thus
| there's no timeout or other concerns with the management service
| failing to assign a bot score.
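|
| As a rough illustration of the difference (a toy Rust sketch
| with invented names, not the actual fl/fl2 code):
|
|     struct Request { path: String }
|
|     // The mental model behind the questions: a separate scoring
|     // service reached over the network, where timeouts and
|     // fallbacks are the natural things to ask about.
|     // fn bot_score(req: &Request) -> Result<u8, RpcError> { .. }
|
|     // What the description above amounts to: the scoring module
|     // is linked into the proxy and called as an ordinary
|     // in-process function during ingress, so there is no RPC to
|     // time out.
|     fn bot_score(_req: &Request) -> u8 {
|         42 // feature lookup + heuristics, all in-process
|     }
|
|     fn handle_ingress(req: Request) {
|         let score = bot_score(&req); // inline, same thread
|         println!("{} scored {}", req.path, score);
|         // ...then forward to origin...
|     }
|
|     fn main() {
|         handle_ingress(Request { path: "/".into() });
|     }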
|
| There are better questions but to me the ones posed don't seem
| particularly interesting.
| kqr wrote:
| Maybe I'm misunderstanding something, but it being a blocking
| call does not make timeouts less important -- if anything, they
| become more important!
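|
| For what I mean by a timeout around a blocking step, here is a
| minimal, entirely hypothetical sketch (plain Rust, invented
| names; not a claim about how Cloudflare's proxy is built):
|
|     use std::sync::mpsc;
|     use std::thread;
|     use std::time::Duration;
|
|     // Stand-in for an expensive inline scoring step.
|     fn score_request() -> u8 {
|         42
|     }
|
|     // Give the blocking step a time budget; if it blows the
|     // budget, pass the request through unscored instead of
|     // holding (or failing) it.
|     fn score_with_budget(budget: Duration) -> Option<u8> {
|         let (tx, rx) = mpsc::channel();
|         thread::spawn(move || {
|             let _ = tx.send(score_request());
|         });
|         rx.recv_timeout(budget).ok()
|     }
|
|     fn main() {
|         match score_with_budget(Duration::from_millis(5)) {
|             Some(score) => println!("bot score: {score}"),
|             None => println!("over budget, forwarding unscored"),
|         }
|     }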
| tptacek wrote:
| I don't understand how it is you're doing distributed systems
| design on a system you don't even have access to. Maybe the
| issue is timeouts, maybe the issue is some other technical
| change, maybe the issue is human/procedural. How could you
| possibly know? The owners of the system probably don't have a
| full answer within just a few hours of handling the incident!
| kqr wrote:
| I would be worried if they had all the answers within a few
| hours! I was just caught off guard by the focus on
| technical control measures when there seem to have been
| fairly obvious problems with information channels.
|
| For example, "more global kill switches for features" is
| good, but would "only" have shaved 30 % off the time of
| recovery (if reading the timeline charitably). Being able
| to identify the broken component faster would have shaved
| 30-70 % off the time of recovery depending on how fast
| identification could happen - even with no improvements to
| the kill switch situation.
| spenrose wrote:
| I am disappointed to see this article flagged. I thought it was
| excellent.
| kqr wrote:
| In defense of your taste, it was updated based on the loud
| feedback here, so you probably read a slightly better version
| than that which was flagged.
___________________________________________________________________
(page generated 2025-11-19 23:01 UTC)