[HN Gopher] One Fastly customer triggered internet meltdown
___________________________________________________________________
One Fastly customer triggered internet meltdown
Author : JulianMorrison
Score : 239 points
Date : 2021-06-09 11:14 UTC (11 hours ago)
(HTM) web link (www.bbc.co.uk)
(TXT) w3m dump (www.bbc.co.uk)
| geerlingguy wrote:
| Not in the article: how one Fastly customer triggered the bug.
| Just a quote and a promise that an RCA will be posted.
| tyingq wrote:
| Fastly doesn't appear to be sharing that detail. Their own blog
| post is similarly vague about the exact cause.
| https://www.fastly.com/blog/summary-of-june-8-outage
|
| Edit: That blog post does say this: _"On May 12, we began a
| software deployment that introduced a bug that could be
| triggered by a specific customer configuration under specific
| circumstances."_
|
| The scheduled maintenance on May 12 was this:
| https://status.fastly.com/incidents/dlsphjqst537
|
| Based on that, it sounds like maybe a configuration change
| could deploy new cache nodes with IP addresses that a customer
| hasn't explicitly allowed to talk to their backend:
|
| _" When this change is applied, customers may observe
| additional origin traffic as new cache nodes retrieve content
| from origin. Please be sure to check that your origin access
| lists allow the full range of Fastly IP addresses"_
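|
| If that's the failure mode, the obvious safeguard on the
| customer side is to diff your origin allowlist against
| Fastly's published ranges. A rough sketch in Python (the
| public-ip-list endpoint is Fastly's documented one; the
| allowlist below is a made-up example):
|
|     import ipaddress
|     import json
|     import urllib.request
|
|     # Fastly publishes its address ranges as JSON.
|     with urllib.request.urlopen(
|             "https://api.fastly.com/public-ip-list") as resp:
|         ranges = json.load(resp)["addresses"]
|
|     # Hypothetical origin firewall allowlist.
|     allowed = [ipaddress.ip_network("151.101.0.0/16")]
|
|     for cidr in ranges:
|         net = ipaddress.ip_network(cidr)
|         if not any(net.subnet_of(a) for a in allowed
|                    if a.version == net.version):
|             print("missing from allowlist:", cidr)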
| abluecloud wrote:
| Why is everyone banging on about this? It's a blog post from
| the same day. A decent post mortem takes a while to put
| together, and assuming the bug isn't fully patched across
| their entire CDN, why would they post the information?
| jffry wrote:
| Plus it would be weird to present just that specific
| information, outside of the context of a post mortem /
| failure chain analysis type discussion.
| tyingq wrote:
| That's true, though they are also saying things like _"
| We created a permanent fix for the bug and began
| deploying it at 17:25."_. "Permanent fix" sort of implies
| they understood the issue really well.
| jffry wrote:
| That's my point though. Even though they may understand
| the immediate flaw in their code that caused the issue,
| there's not much use (for them or their customers) in
| just talking in detail about that specific flaw.
|
| I'd go so far as to argue that the specifics of the flaw
| are immaterial right now. At this stage, the important
| thing is that they have identified a specific code change
| that was the proximate cause of the issue, and have a
| mitigation in place. This is contrasted with more
| mysterious and hard-to-track-down failures. ("We are
| working to understand why our systems are down and will
| post another update in 30 minutes")
|
| What will take time, and the thing which will be
| interesting, is failure tree analysis. (You might hear
| the phrase "failure chain" or "root cause" but IMO it's
| quite rare for things to be so linear). That can help
| identify opportunities to improve processes at many
| different levels of the product lifecycle.
|
| Humans are fallible, and there's no way we can write bug-
| free software, so the solution has to be more robust than
| "hope that every member of our organization never makes a
| mistake again"
| tyingq wrote:
| Yes, I was saying I would have avoided words like
| "permanent fix", because it sets unrealistic
| expectations.
| tyingq wrote:
| I wasn't "banging on", I was answering why the article
| didn't mention the cause...because the source didn't
| either.
| notacoward wrote:
| Everyone is "banging on" because there are important
| lessons to be learned from such incidents, and people want
| to learn. They hunger for more details about the
| generalizable aspects of the bug, even if a full post
| mortem that also covers internal processes etc. might take
| longer to do. Having participated in many post mortems, in
| many roles, for systems just as complex, I believe it's
| entirely possible to provide that information the _next_
| (not same) day. Is that still setting the bar too high?
| Perhaps. Fastly deserves kudos for providing even the level
| of information that they have, since that's already above
| the pathetic industry standard, but I don't think there's
| anything bad about wanting more. Defensiveness is the enemy
| of effective post mortems.
| addingnumbers wrote:
| The paragraph you quoted is just describing a side effect of
| adding nodes. That excerpt appears in every one of their
| capacity expansion announcements, going back years.
| tyingq wrote:
| Ah, interesting. Though the change itself reads like it was
| just adding capacity in one location (Newark). I don't see
| any other changes mentioned for that date.
| jedimastert wrote:
| I assume that the bug isn't 100% fixed yet, and the instant
| they publish how the bug took place, 1000 yahoos will
| immediately try to re-create it.
| staticassertion wrote:
| If you find out you have a DoS that can take down the internet,
| you might be wary about sharing details until you've hammered
| things out.
| Dobbs wrote:
| My uneducated hypothesis is that Fastly runs Varnish.
| Presumably they have some process that collects data from
| their config system and generates the VCL (Varnish's
| pseudo-C configuration language, which compiles directly to
| C). Somehow a customer configured something in such a way
| that it generated a bad VCL file, which then either caused it
| to lose all configuration or caused one domain to incorrectly
| garble up traffic for all domains.
|
| I can poke plenty of holes in this hypothesis, like Fastly
| likely not deploying configuration to all nodes but only to
| subsets. Looking forward to the deeper post.
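|
| To illustrate the shape of that failure (toy Python, not
| Fastly's actual pipeline - every name here is made up): if
| customer settings get interpolated into one shared generated
| file, a single bad value can break parsing for every tenant.
|
|     def generate_vcl(customers: dict) -> str:
|         blocks = []
|         for name, cfg in customers.items():
|             # If cfg["host"] contains a quote or a stray
|             # byte, the emitted VCL no longer parses -- for
|             # everyone sharing this file, not just for this
|             # one customer.
|             blocks.append(
|                 'if (req.http.host == "%s") {\n'
|                 '    set req.backend_hint = %s;\n'
|                 '}' % (cfg["host"], name))
|         return "sub vcl_recv {\n%s\n}" % "\n".join(blocks)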
| tyingq wrote:
| It's interesting to me that they still run Varnish, since it
| doesn't have HTTPS/TLS built-in. I do get that VCL is more
| expressive than similar capabilities in Nginx, HAProxy, etc.
| But it would seem like less work to add expressiveness to one
| of those (via Lua maybe?) than to maintain both Varnish and
| the separate components needed for both TLS ingress and
| egress.
| foobarbazetc wrote:
| Varnish and VCL were the hotness at the time Fastly was
| coming up so it sort of makes sense, but Varnish also
| doesn't support WebSockets, can't proxy gRPC, etc., so
| they're very limited in functionality vs Cloudflare.
|
| I doubt they'd build it on Varnish today, but it's a bit
| late now since they allow custom VCL (which has now proven
| to be a terrible idea) and will have to support that for
| eternity.
|
| They can run two or more serving stacks side by side though
| if it comes to that.
|
| And to add to your point, they also have a separate process
| that speaks QUIC. It's an interesting tech stack with a lot
| of technical debt.
| tyingq wrote:
| >it's a bit late now since they allow custom VCL (which
| has now proven to be a terrible idea)
|
| Ah, okay. I took a look, and it appears they at least
| didn't allow varnish modules or inline C. But, still, a
| fairly hefty anchor for the future.
| longwave wrote:
| You can write custom VCL snippets directly in the Fastly
| control panel. Migrating all existing customer VCL to
| another language would be an enormous task.
| LeifCarrotson wrote:
| They say as much here [1].
|
| > _Fastly is a shared infrastructure. By allowing the use of
| inline C code, we could potentially give a single user the
| power to read, write to, or write from everything. As a
| result, our varnish process (i.e., files on disk, memory of
| the varnish user's processes) would become unprotected
| because inline C code opens the potential for users to do
| things like crash servers, steal data, or run a botnet._
|
| Personally, my hypothesis is that somebody uploaded a
| configuration for their domain
| `https://IVCL_{raise(SIGSEGV)}.com` (edit: the preceding URL
| used to contain a heart emoji between I and VCL, apparently
| HN prefers ASCII, too) in a way that, rather than converting
| to Punycode, passed a few bytes that weren't in the 96 legal
| characters accepted by the VCC compiler and caused some kind
| of undefined behavior.
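|
| If that's roughly what happened, the missing step would be
| something like this (a sketch; the character set is my
| approximation of VCC's, and the whole pipeline is guesswork):
|
|     import string
|
|     # Rough stand-in for the compiler's legal characters.
|     VCC_SAFE = set(string.ascii_letters + string.digits
|                    + "-._")
|
|     def normalize_host(host: str) -> str:
|         # Convert to Punycode first, then verify that only
|         # safe ASCII survives before it reaches the compiler.
|         ascii_host = host.encode("idna").decode("ascii")
|         bad = set(ascii_host) - VCC_SAFE
|         if bad:
|             raise ValueError(f"illegal characters: {bad}")
|         return ascii_host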
|
| [1] https://docs.fastly.com/en/guides/guide-to-vcl#embedding-
| inl...
| desireco42 wrote:
| Them pointing to a customer as the source of the issue is
| not OK. This reminds me of that Chase scandal, when they
| transferred a huge amount of money to pay off a full loan
| instead of an installment, then blamed it on some guy in
| India as an alleged mistake - even though 2 or 3 of his
| superiors approved it and the interface was really horrific,
| they decided to pin the blame on that guy.
|
| So, I am glad things are OK, but it definitely is not that one
| customer who is to blame for this outage.
| zomglings wrote:
| I don't think that Fastly is pinning blame on the customer.
| That seems to be the BBC trying to bait us into reading the
| article.
|
| I am sad to report that their clickbait worked on me.
| desireco42 wrote:
| I think you are right, it is my bad.
| KingOfCoders wrote:
| "Fastly senior engineering executive Nick Rockwell said: "This
| outage was broad and severe - and we're truly sorry for the
| impact to our customers and everyone who relies on them.""
|
| I wonder here: are they sorry enough to run the company on
| two different tech and software stacks and data centers?
| Like how people don't buy all their disk drives (SSDs) from
| the same vendor? How much would they spend for the "sorry"?
| mhandley wrote:
| You'd have to consider what actually happens when 50% of your
| infrastructure goes down. Can the remaining 50% cope with 100%
| of the load? If not, then you still get a complete failure. So
| then the question becomes can you reprovision between stack A
| and stack B very rapidly, both running from the same hardware
| pool, while 50% of your infrastructure is down. Now you've
| introduced the potential for correlated failures (single
| hardware pool and network), plus added complexity and load due
| to reprovisioning just when things are already overloaded. Not
| easy to get this right, so might not actually increase
| reliability, as now you've got two different stacks that can
| each independently fail and get you into this mess.
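|
| The arithmetic is unforgiving (illustrative numbers only):
|
|     # If each node normally runs at 60% utilization, losing
|     # half the fleet pushes the survivors to 120% -- an
|     # overload cascade, not graceful degradation.
|     normal_utilization = 0.60
|     surviving_fraction = 0.50
|     print(normal_utilization / surviving_fraction)  # -> 1.2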
| ceejayoz wrote:
| Without more details on what/where the "software bug" was, it's
| hard to say if that would've helped at all.
| spicybright wrote:
| Making your infra heterogeneous is very underrated. Or at least
| I don't hear of many companies doing that, even large ones.
|
| How do you go about different software though? Have one center
| running a version or two behind to fail over to?
| aliasEli wrote:
| In general it is expensive. In most cases you will need
| experts with knowledge of both system A and system B.
| Those people are more difficult to find. Also you have a
| higher chance of errors due to confusion between the two
| systems.
| stingraycharles wrote:
| That only protects against one type of failure, e.g. one of
| your vendors going down. The article suggests this was a
| problem in their own software, which in all likelihood would
| not be protected against by using different stacks.
| KingOfCoders wrote:
| Two developers and two product managers would not create the
| same bugs but different ones, I'd think. Across thousands of
| correct lines and features implemented without bugs, it
| would be a coincidence (except perhaps in not-well-understood
| complex parts) for two developers to create the same bugs.
|
| Past studies, where developers were told to write the same
| code, found that they produced different bugs.
| stingraycharles wrote:
| Are there any organizations out there that actually make
| multiple versions of their product and deploy it together,
| in real time, all the while being completely agnostic to
| the end user?
|
| It seems like it will be very hard to justify the immense
| costs for this.
| dap wrote:
| I've heard this before but I'm skeptical about this approach.
| It's got to cost close to 2x to develop and maintain two stacks
| -- if they're sharing a lot, that would defeat the point. Then
| you'd have two stacks tested about as well as the one today.
| Instead, you could put that investment towards testing and
| improving quality on the one stack.
|
| I know a second stack hopefully has different bugs. But is that
| likely (to a meaningful extent)? Reimplementations often
| reintroduce old issues (which suggests many people make similar
| mistakes), plus it's hard to imagine the first stack not
| influencing the second in various ways.
| tibbetts wrote:
| Redundant implementations only help with uncorrelated problems.
| With complex software, problems are often quite correlated
| across multiple implementations. Add to that the additional
| complexity of managing those multiple implementations and the
| potential for problems may be net worse. The consensus among
| safety experts is that multiple implementations are a bad
| approach to safety or reliability.
| KingOfCoders wrote:
| I would assume different developers and product managers
| create different bugs.
| Simplicitas wrote:
| Still impressed at how quickly Fastly recovers. Kudos.
| ic4l wrote:
| Even if they fixed it in 20 minutes the chain reaction caused
| by Fastly being down took much more than 20 minutes to resolve
| itself.
|
| An example is imgix:
|
| https://status.imgix.com
|
| It took them 11 hours to recover from Fastly going down for
| their claimed 40 minutes.
| busymom0 wrote:
| > We are still seeing increased origin load as a result of
| the earlier outage from our service provider.
|
| Does this mean that some companies using Fastly could have
| major costs because of the increased origin load?
| axlee wrote:
| Absolutely.
| rlv-dan wrote:
| https://www.fastly.com/blog/summary-of-june-8-outage
| kevincox wrote:
| Global shared control plane updates are frighteningly common. For
| example Cloudflare regularly brags about how your configuration
| updates are pushed around the world in single-digit seconds.
| Sure, it is an amazing feature, but it opens you up to this sort
| of issue.
|
| All changes to critical infrastructure should be a gradual
| rollout (emergencies aside). Instant sounds nice, until it isn't.
| If this rolled out to one region for the first hour it likely
| would have been caught and Fastly could press the "stop all
| rollouts" button.
| dang wrote:
| See also: _Summary of June 8 outage_ -
| https://news.ycombinator.com/item?id=27444005 - June 2021
| (ongoing)
| CalChris wrote:
| The BBC title is a _terrible_ title. The customer didn't cause
| the outage. The bug caused the outage.
| justin_oaks wrote:
| Sadly, most news headlines are terrible. When I read the
| headline, I clicked through for the details. Fortunately the
| details made the whole thing clear.
|
| I too wish for a world where headlines aren't terrible, but we
| currently live in a world of clickbait.
| underscore_ku wrote:
| this and the BBC bots that upvote every BBC news story
| mypastself wrote:
| Unless it's been changed since your comment, I don't think
| that's what the title is claiming. A trigger is different from
| a cause.
| Zenst wrote:
| That was my take as well, and titles like that do a real
| disservice to what actually happened.
|
| If you have a design flaw, you don't hint at onus upon the
| first person to fall foul of it - and we all know how many
| people, reading the news quickly, will run fully upon the
| title alone (something we can all do).
|
| But we have seen a drive towards clickbait, search-bot-
| friendly headlines to garner hits. Even the BBC over the
| years have IMHO leant towards such tabloid-style headlines
| more, and that is just sad.
| tantalor wrote:
| The bug is the root cause, the customer's action is the
| proximal cause (or "trigger")
| hnbad wrote:
| Placing the emphasis on "one customer" still creates a false
| narrative though. Note the headline doesn't even say it was a
| bug. This headline provokes the question "Who was it and what
| did they do?" rather than the more insightful "What was the
| bug?".
|
| If a single bug caused the "internet meltdown", it's fairly
| likely that the bug was triggered by one person so there's no
| need to emphasize that part.
| shitgoose wrote:
| Well, it is BBC, what do you want.
| cwkoss wrote:
| This is like saying "Fat man destroys bridge"
|
| no, the engineers/builders who didn't implement proper
| safety tolerance broke the bridge.
|
| The straw is not responsible for breaking the camel's back.
| sporkland wrote:
| Maybe working at a service provider has warped my English,
| but for me, reading "one customer took down X" is usually a
| pejorative about the service, not a customer witch hunt.
| tantalor wrote:
| Who cares about headlines?
|
| Headlines are written by editors, not reporters, to
| maximize CTR and minimize length.
|
| You can't fit detail and nuance in a headline. The point is
| to get people to read the article, not inform.
|
| If people are drawing conclusions from reading just the
| headline, not the article, then you can safely ignore them.
| There's no point in getting mad about it.
| m3kw9 wrote:
| I remember there was a similar bug that hit Cloudflare before??
| oneeyedpigeon wrote:
| Shame on the BBC for such a misleading headline. But Fastly
| probably shouldn't have even given them detail -- it's irrelevant
| and bound to be misreported. Just own up to your bug and be done
| with it.
| Dah00n wrote:
| Again, The Cloud is just someone else's server. Most things in
| The Cloud don't belong there in my opinion, but it is yet
| another trend everyone must follow and check off on their list
| from management. Let's all chant in unison and play out the
| rituals sent down from above.
|
| Umbasa!
| gilbertbw wrote:
| Unless you are a very large company a CDN is not something you
| can realistically build and run in house.
| rocqua wrote:
| If you are not a very large company, you might not need the
| full functionality of a very large CDN either.
|
| Now, building the functionality you require can still be
| unrealistic to build yourself. One thing that springs to mind
| is DDoS protection.
| aliasEli wrote:
| When your site can expect visitors from all around the
| world you need a CDN that also has a world wide presence.
| Using a CDN can have a pretty big impact on the performance
| of your site, and it is well known that users avoid sites
| that are slow.
| doublerabbit wrote:
| Colocation is cheap. But you're correct about DDoS; that is
| the only thing I am not prepared for. I will forever go
| with colocation.
|
| 1Gbit is nothing nowadays and can be saturated in seconds;
| my purposes do not justify the cost of 10g transit. Even
| owning 4U when I only need a VPS is overkill. But I like
| owning a small dusty cube of internet. So there's that.
| ceejayoz wrote:
| > Again, The Cloud is just someone else's server.
|
| Sure, but in a lot of cases, that "someone else" is an entire
| team of experts in their particular niche that can do a better
| job of the specific task at hand than I ever can hope to.
|
| Is this _always_ the case? No. Is it _sometimes_ the case? Yes.
| lrem wrote:
| But this is primarily a CDN, with some edge compute, isn't it?
| What's the functional alternative here, "I'll go ahead and
| lease 2U of rack space and fly my ops people to every major
| city in which I have users"?
| LennyWhiteJr wrote:
| The sad reality is that if your app serves any real business
| value and you can't afford to hire a team that can quickly
| handle scaling attacks, simply running your own server isn't
| really a viable option anymore.
|
| The internet's original distributed nature just isn't
| compatible with the sheer scale of billions of active users.
| mattowen_uk wrote:
| THIS version of the internet isn't.
|
| I hold out hope that the next version won't be controlled by
| corporate entities.
| kissgyorgy wrote:
| So they have an infrastructure where customer data is not
| separated from other customers'? That's pretty terrifying!
| This means another bug of the same kind can also cause a
| global disruption any time in the future (mistakes will
| happen).
| aliasEli wrote:
| Yes, that might happen.
|
| But the only way to avoid that is to give each customer its own
| private hardware, which seems prohibitively expensive (and may
| not even prevent all failure sources).
| napolux wrote:
| cat /users/*.conf >> global.conf && ./restart_everything
| pentagone wrote:
| > The outage has raised questions about relying on a handful of
| companies to run the vast infrastructure that underpins the
| internet.
|
| Big surprise
| paul_f wrote:
| I love how it almost suggests it was a customer's fault. If only
| they hadn't changed their settings!
| DudeInBasement wrote:
| Should be "Untested code triggers problems"
| gonzo41 wrote:
| hey hey, production is just a big test.
| coldcode wrote:
| It could just as easily have been tested extensively but no
| testing is 100% guaranteed, especially in a world wide
| service as complex as a CDN. People who think perfect code
| comes purely from testing are delusional.
| [deleted]
| rchaud wrote:
| It suggests it only as long as someone doesn't read the
| article.
|
| The way I read it, they were trying to communicate the fact
| that a customer fiddling with their own configuration brought
| down large swathes of the internet for everybody else. That
| absolutely deserves to be in the headline.
| NaturalPhallacy wrote:
| > _It suggests it only as long as someone doesn't read the
| article._
|
| Which is why it's clickbait. Sensational title, humdrum
| article.
| politelemon wrote:
| An edge case triggered an edge condition at our edge location.
| Frost1x wrote:
| "Edge cases" and "conditions" are some of my trigger phrases.
| Given enough time or users, they are inevitable and passing
| them off as rare to avoid dealing with them, especially when
| you're aware of their existence, drives me up the wall.
|
| Unless it's an edge _use case_ you're not supporting, don't
| sell me your cost avoidance on any production systems.
| bennyp101 wrote:
| An edge case is just that, though? Something that hadn't
| been thought of, and makes stuff break; then you can figure
| out a fix.
|
| I don't know any dev that would agree with "passing them
| off as rare to avoid dealing with them" - rather, "it's a
| low priority, but it needs fixing" or "ok, this is an edge
| case, but holy crap its a bad one"
| mjthompson wrote:
| A very edgy comment! Well done.
| smlss_sftwr wrote:
| quite edge-ucational I must say
| iainmerrick wrote:
| I know you're mostly joking, but it does no such thing. The
| blog post explicitly states it was a " _valid_ customer
| configuration change".
| pdpi wrote:
| The Fastly blog does not blame the user. The BBC article kind
| of suggests it ("One Fastly customer triggered internet
| meltdown").
| iainmerrick wrote:
| Yeah, you're right, the headline is clickbaity.
| ceejayoz wrote:
| The BBC article includes the line "a customer quite
| legitimately changing their settings had exposed a bug".
| oneeyedpigeon wrote:
| That's 'bottom of a locked filing cabinet in a disused
| lavatory with a sign on the door saying "Beware of the
| Leopard"'-level stuff compared to the clickbait headline,
| though.
| ceejayoz wrote:
| What's your proposed headline?
| oneeyedpigeon wrote:
| Fastly bug triggered internet meltdown
| ceejayoz wrote:
| That's a great headline for yesterday.
|
| This article details a _new_ piece of information, and
| the headline reflects that.
| jiveturkey wrote:
| No, it is designed to be suggestive -- to shift attention and
| perhaps even blame. And of course the BBC picked up on it for
| clickbait reasons. Brilliant (but perhaps evil) PR by Fastly.
|
| Customer configuration is an irrelevant detail that should
| have been left out until a full RCA. What does it matter that
| it was valid? So an invalid configuration would have meant it
| was indeed the _customer's_ fault??
|
| A more fair treatment would have been, "a customer pentested
| us and won".
| jcims wrote:
| Just an unsolicited plug of an interesting podcast I started
| listening to recently (co-hosted by HNer slackerIII)-
| https://downtimeproject.com/
|
| Hope this one gets the treatment.
| bluedevil2k wrote:
| Why is Amazon dependent on someone else's cloud infrastructure?
| Thaxll wrote:
| Also, large companies use multiple CDNs.
| busymom0 wrote:
| Someone answered this yesterday. CloudFront is good for video
| and large download assets (plus very low margins) but not for
| images and smaller stuff which Fastly is much faster at:
|
| https://www.streamingmediablog.com/2020/05/fastly-amazon-hom...
| [deleted]
| typon wrote:
| Hedging your bets
| rchaud wrote:
| They can't afford Cloudfront bills either.
| tyingq wrote:
| I believe Google's Firebase also uses Fastly.
| NaturalPhallacy wrote:
| Modal free: https://archive.is/dKMHE
| flareback wrote:
| fta: "The outage has raised questions about relying on a handful
| of companies to run the vast infrastructure that underpins the
| internet."
|
| I read that phrase every time something like this happens, and
| yet we all still rely on the same handful of companies.
| perlgeek wrote:
| That's because many such businesses have grown so large
| precisely because they benefit from scale.
|
| Some examples:
|
| Social network: you only engage on one if your
| friends/family/coworkers are on the same network.
|
| Search engine: needs to index "the whole Internet", which is
| less expensive per user if you have more users
|
| CDN: works best if you have edge nodes everywhere, which is
| quite capital intensive, which is why you need many customers
| to distribute it over.
|
| ... and so on. We might not like it, but many of these quasi
| monopolies are based on fundamental economics, not (just) on
| the greed of the companies.
| ryukafalz wrote:
| > Social network: you only engage on one if your
| friends/family/coworkers are on the same network.
|
| This one isn't exactly based on fundamental economics given
| that federated social networks exist. Email has similar
| network effects and is not centralized.
| toxik wrote:
| I think this argument could be made for many of these.
| perlgeek wrote:
| None of the federated social networks seems to have reached
| the scale of the biggest centralized social networks.
|
| Which leads me to believe that economics and incentives
| favor big, centralized social networks.
| kleinsch wrote:
| If you're making a purchasing recommendation for your company,
| do you want to tell your boss that you're recommending not
| going with your best option, or even 2nd, but that your company
| should use the 4th or 5th best CDN as a way to diversify the
| Internet? Seems pretty altruistic, but not a great way to keep
| your job.
| axiosgunnar wrote:
| Would a smaller CDN provider have no outages at all?
| unhammer wrote:
| There's kind of a race-to-the-bottom[0] wrt.
| decentralisation.
|
| It'd be better for the internet as a whole if we don't always
| pick the most popular (so when your email's CDN goes down you
| can still communicate on chat, when CNN goes down you can
| still read BBC). But as an individual I have strong
| incentives to pick the one everyone else picks, because
| that's presumably the most stable/documented/lowest cost due
| to volume.
|
| [0] https://slatestarcodex.com/2014/07/30/meditations-on-
| moloch/
| sgtfrankieboy wrote:
| No, but it would take down fewer sites if it did have one.
| gpm wrote:
| Is that a feature?
|
| If 12 small CDNs have 12 outages in a year (combined,
| 1/year each), each time bringing down 10,000 websites, is
| that better than 1 large CDN having one outage during the
| year bringing down 120,000 websites?
|
| If I'm the website owner I think I prefer the latter; my
| customers blame the CDN instead of me. If I'm the CDN owner
| I definitely prefer the latter: more customers to amortize
| my costs over.
| iso1210 wrote:
| From a customer point of view diversity is far better -
| if Sainsburys is closed for some reason I'll go to Tesco.
|
| Certainly don't want a situation where all the shops are
| closed.
| martius wrote:
| It may be more acceptable to have more frequent outages if
| the impact radius (number of websites or services impacted)
| is smaller.
| zelon88 wrote:
| Obviously that isn't an honest comparison. If you're asking
| whether or not 10 small CDN providers could provide a more
| robust, higher quality service with more uptime than one
| large CDN provider, then I think the answer is probably
| "yes."
| notacoward wrote:
| It's the tech equivalent of "thoughts and prayers" isn't it?
| adrr wrote:
| There are at least 5 major CDNs out there: CloudFront,
| Fastly, Cloudflare, Akamai, Google CDN. You can use more than
| just one. Shopify uses two: Akamai and Fastly.
| viraptor wrote:
| If you use them as pure file-serving CDN, then sure. But once
| you start adding extra logic, headers, routing, etc. the
| features don't fully align. Or you need to keep to the
| minimum common featureset.
| 0xcoffee wrote:
| Does anyone know if Azure/Google/Amazon can provide some
| 'multi-cdn' setup out of the box? The way to change these
| points of failure is for the big boys to change their
| defaults.
| adrr wrote:
| Do it at the DNS layer. Route53 has failover support out of
| the box that should work for this. You can set up a monitor
| and it will switch DNS entries on a failure.
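|
| For example, with boto3 (the zone ID and hostnames below are
| placeholders; health checks and failover record sets are
| standard Route53 features):
|
|     import boto3
|
|     r53 = boto3.client("route53")
|
|     # Health-check the primary CDN endpoint.
|     hc = r53.create_health_check(
|         CallerReference="cdn-failover-1",
|         HealthCheckConfig={
|             "Type": "HTTPS",
|             "FullyQualifiedDomainName": "cdn-a.example.com",
|             "ResourcePath": "/healthz",
|             "RequestInterval": 30,
|             "FailureThreshold": 3,
|         })["HealthCheck"]["Id"]
|
|     def change(role, target, hc_id=None):
|         rr = {"Name": "www.example.com", "Type": "CNAME",
|               "SetIdentifier": role, "Failover": role,
|               "TTL": 60,
|               "ResourceRecords": [{"Value": target}]}
|         if hc_id:
|             rr["HealthCheckId"] = hc_id
|         return {"Action": "UPSERT",
|                 "ResourceRecordSet": rr}
|
|     r53.change_resource_record_sets(
|         HostedZoneId="Z0000000000000",  # placeholder
|         ChangeBatch={"Changes": [
|             change("PRIMARY", "cdn-a.example.com", hc),
|             change("SECONDARY", "cdn-b.example.com"),
|         ]})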
| tyingq wrote:
| It is getting increasingly tricky to have enough redundancy
| at a basic level to avoid a major player's outage affecting
| you. For example, you would probably want at least one
| authoritative DNS server that isn't either of your CDN
| providers. And you'd want to know some details about how
| these players sometimes use each other, like the fact that
| Google's Firebase uses Fastly.
| gonzo41 wrote:
| Are you bigger than a major player is the question I'd be
| asking. Maybe risking it is fine.
| tyingq wrote:
| I don't know that size of your operation is the right
| metric to gauge whether to bother with this. If 100% of
| your revenue, for example, is from online sales, it might
| be worth it even if you're small. But yes, it's often not
| worth it.
| gonzo41 wrote:
| I agree, it's just for some DR scenarios there's only so
| much you can do. And 'the internet is down' is hard to
| plan for. If CNN is offline due to some outage and you're
| a smaller enterprise then are people really still doing
| online eCommerce stuff, or are they waiting for their
| favorite sites to come back up as a signal that things
| are back to normal.
| dexterdog wrote:
| But we're talking about what was a 1-hour outage. Does it
| make sense to spend more than 1/8000 of your revenue to
| avoid an hour per year, especially when you will never lose
| a full hour of revenue from being down for one hour, because
| many customers come back to buy the thing they were going to
| buy anyway?
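|
| Back of the envelope, with made-up numbers:
|
|     annual_revenue = 10_000_000
|     hour_share = 1 / (365 * 24)  # one hour ~0.011% of a year
|     come_back = 0.8              # buyers who return later
|     print(annual_revenue * hour_share * (1 - come_back))
|     # -> ~228 dollars actually lost; compare that with the
|     # ongoing cost of a second CDN.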
| darkwater wrote:
| So, after multi-cloud now we need to go multi-CDN? Half
| joking here, it's actually a good idea although probably it's
| not worth the cost. I think GitHub (at least from my casual
| looking at that behavior during the outage) nailed it because
| they must have some kind of active/passive CDN config. They
| were affected by the outage but after a few minutes (less
| than whole Fastly outage duration) they were serving assets
| again.
| politelemon wrote:
| I think it will continue to happen. People and orgs, when
| picking a service, don't have an incentive or a way to ask
| around to see what percentage of similar-service users are
| using that particular service. People and orgs will always flow
| towards cheap/popular/well-known services.
|
| On the other hand, those handful of companies could be asked to
| structure their services so that an outage only affects a
| portion of customers and not all their customers. However! That
| would be more inefficient for them, and more expensive, and
| that cost would cause the people and orgs mentioned earlier to
| just flow towards the company that took those shortcuts.
| vikramkr wrote:
| There is an incentive to see what everyone else is using, so
| that you can use it too. Choose boring technologies applies
| to infra as well. Go with a big, stable, well-known CDN or
| cloud provider instead of trusting your service to some
| fly-by-night startup, etc.
| JulianMorrison wrote:
| The answer is that before the existence of those companies to
| rely upon, people _didn't_ rely upon them: they just accepted
| the lag, or they hand-hacked the same globally distributed
| approach on their own, and it sucked for them, and it wasn't
| too great for users either. CDNs are big because that's what
| their function is: to reach the world and absorb traffic
| spikes, and take the complicated business of distributing
| edge servers out of the hands of the people who just want to
| run a website.
|
| The trade-off here is intrinsic, and accepting the risks of
| big CDNs is the right answer.
| TheDudeMan wrote:
| Well, you could use two CDNs instead of just one. But that
| costs money.
| mathattack wrote:
| Exactly. The question isn't "a few CDNs" vs "Many CDNs".
| There are too many economies of scale. It's really CDN vs
| Not? (Roll your own just isn't feasible except for a half
| dozen of the largest tech firms)
| toyg wrote:
| One of the side effects of the efficiency obsession that
| capitalism generates is the constant drive towards
| centralization.
|
| We took a technology stack designed to survive nuclear attacks,
| and turned it into something where a single bug can take down
| half the services on it. Why? Because on the flip side, a
| single improvement in the centralized service can automatically
| cascade to all the businesses using it.
|
| Efficiency is a double-edged sword.
| koreanguy wrote:
| single point failure, you are what you run
| mikesabbagh wrote:
| The thing with CDNs is that they may have many edge
| locations, but the cache does not sit there.
|
| They frequently have common caching servers located close
| by, so maybe every 10 or 100 edge locations share a single
| cache location.
|
| The edge is a reverse proxy and probably handles SSL
| handshakes. So if your cache is down, all your edge
| locations in that area are down.
| uncertainrhymes wrote:
| This may be true with some, but it is not true of Fastly. Each
| of their edge nodes is a varnish cache. Because they are multi-
| tenant, when varnish crashes it crashes hard and takes everyone
| with it.
|
| The question in various threads is why not have redundancy --
| but the point of a CDN is to have extra servers and capacity
| and lots of locations to make individual crashes just flow
| elsewhere.
|
| But if the single customer with a valid-yet-crashable config
| had lots of traffic all over the world... it'll take everything
| out at once.
|
| Redundancy of CDN is more expensive, and still requires DNS
| failover. People do the calculation and usually decide that 30
| min of downtime every couple years is worth the saving on
| vendors and code and hassle. They don't like it, but every site
| that was down made that decision.
| arrty88 wrote:
| Did they just miss a where clause on the update statement??
| iainmerrick wrote:
| Throwing in my positive hot take among all the negative ones
| here: the immediate response and blog post from Fastly here is
| really good.
|
| A quick fix, a clear apology, enough detail to give an idea of
| what happened, but not so much detail that there might be a
| mistake they'll have to clarify or retract. What more are you
| looking for?
|
| Apart from "not have the bug in the first place" -- and I hope
| and expect they'll go into more detail later when they've had
| time for a proper post mortem -- I'd be interested to hear what
| anyone thinks they could have done better in terms of their
| immediate firefighting.
| oneeyedpigeon wrote:
| I'm sorry, but I disagree. They gave the BBC enough detail that
| a very misleading headline was produced as a result. True, the
| main blame lies with the BBC, but it also comes across -- to
| me, anyway, maybe I'm being too cynical -- as a bit of an
| excuse from Fastly.
| mytailorisrich wrote:
| " _But a customer quite legitimately changing their settings
| had exposed a bug in a software update issued to customers in
| mid-May, causing "85% of our network to return errors_"
|
| They are careful to make clear that the customer did nothing
| wrong and that the problem was a bug in their software.
| oneeyedpigeon wrote:
| I know -- as I've said, the main blame lies with the BBC.
| However, as it's reported, it comes across very much as
| Fastly trying to save face. Maybe the blame is entirely on
| the BBC, maybe Fastly were naive in thinking that giving
| them this information wouldn't result in irresponsible
| headlines.
| Jenk wrote:
| How on earth do you figure that is anyone's fault but the
| BBC?
|
| Read Fastly's statement. There is nothing about it
| blaming the customer(s) at all. There is nothing trying
| to save face.
|
| What is your point here?
| oneeyedpigeon wrote:
| > Early June 8, a customer pushed a valid configuration
| change that included the specific circumstances that
| triggered the bug, which caused 85% of our network to
| return errors.
|
| Is it necessary to refer to "a customer" at all in this
| statement? What would be problematic if the above were
| rewritten as something like:
|
| > Early June 8, a configuration change triggered a bug in
| our software, which caused 85% of our network to return
| errors.
|
| The advantage is that you wouldn't get ignorant reporting
| that "one customer took down the internet". I'm not sure
| there are disadvantages that net outweigh that.
| laumars wrote:
| > _Is it necessary to refer to "a customer" at all in
| this statement?_
|
| That's how autopsies work. You describe the cause and
| resolution. The cause was a bug in the customer's control
| panel.
|
| They're not trying to absolve themselves of
| responsibility.
| Jenk wrote:
| Yes, because it is explaining that it was a valid
| *customer* configuration, which is a separate set of
| concerns from, say, infrastructure config.
|
| The important adjective "valid" means it was completely
| normal/expected input and thus not the fault of the
| customer.
|
| It's perfectly clear you've come at this with a pre-
| determined agenda of "I bet Fastly, like most other
| public statements after corporate booboos I've seen, will
| try to shrug this one off as someone else's fault" after
| reading the BBC's title and haven't bothered to read it at
| all until now.
| madaxe_again wrote:
| Not necessarily valid. Could have been a bad entry that
| passed validation when it shouldn't have, which would
| still not be the customer's fault.
| Jenk wrote:
| > We experienced a global outage due to an undiscovered
| software bug that surfaced on June 8 when it was
| triggered by a valid customer configuration change.
|
| Verbatim from Fastly:
| https://www.fastly.com/blog/summary-of-june-8-outage
| perbu wrote:
| The somewhat awkward "a customer pushed a _valid_
| configuration" is Fastly making sure they aren't pushing
| any blame onto the customer.
|
| There is no customer blaming here. None at all.
| staticassertion wrote:
| > Is it necessary to refer to "a customer" at all in this
| statement? What would be problematic if the above were
| rewritten as something like:
|
| That's literally what happened. They even say it was a
| valid configuration change, it's very blameless.
|
| Saying "a configuration change" loses critical context. I
| would have assumed that this was in some sort of
| deployment update, not something that a customer could
| trigger. Why would you want _less_ information here?
| oneeyedpigeon wrote:
| OK, I'm replying to your comment since it's the least
| aggressive -- thanks for that!
|
| I'll fully retract my statement. This is 100% the BBC's
| fault, 0% Fastly's.
|
| Can I make one small suggestion that might help to
| prevent this kind of misleading reporting in future,
| though? What if Fastly produce the detailed statement
| they have, with as much accurate technical detail as
| possible AND a more general public-facing statement that
| organisations such as the BBC can use for reporting, that
| doesn't include such detailed information that can easily
| be misconstrued?
| laumars wrote:
| Most of the replies to yours haven't been aggressive.
| Ironically it's your comments that have come across the
| worst by using terms like "aggressive", "blame" and
| "fault" in the first place. Calling other people's
| comments aggressive is pretty unfair. One might even say
| hypocritical.
| staticassertion wrote:
| I hate being part of a dogpile, so yeah sorry about that,
| I just open up things to reply to, and then come back
| later and write it up just to find that I'm one of 10
| people saying the same shit.
|
| edit: FWIW I had a very negative initial reaction to the
| headline as well.
| oneeyedpigeon wrote:
| Not at all -- I understand.
| Jenk wrote:
| My apologies for any hostility on my part.
| oneeyedpigeon wrote:
| No worries. I probably didn't take it very well because
| my intentions were genuine and I really wasn't trying to
| level anything beyond the _very mildest_ criticism
| towards Fastly. I recognise, however, that even that was
| misplaced -- I think the BBC headline just got me too
| worked up!
| LambdaComplex wrote:
| The wording "a configuration change triggered a bug" in
| this context sounds (to me) like it was a configuration
| change made by Fastly to something on their backend.
|
| The wording which was actually used makes it clear that
| that was not the case.
| davisoneee wrote:
| In what way is the BBC at fault for this? Their title is
| objectively true. A _valid_ configuration setting that
| was used by a customer _did_ cause fastly to have an
| outage.
|
| It's not limited to one specific customer (i.e. this
| customer isn't the only customer who could have caused
| the issue, presumably), but it _was_ something the
| customer (legitimately) did. It wasn't a server outage.
| It wasn't a fire. It wasn't a cut cable.
|
| "a customer quite legitimately changing their settings
| (BBC: one fastly customer) had exposed a bug (BBC:
| triggered internet meltdown) in a software update issued
| to customers (fastly admitting, when combined with
| 'legitimately', that fastly are at fault) in mid-May".
| iso1210 wrote:
| People love to hate mainstream media
| oneeyedpigeon wrote:
| Not me -- I adore the BBC. I've always paid my licence
| fee gladly, and I've been waxing lyrical about the latest
| BBC drama on Twitter just this very hour. On this issue,
| I believe they've made a mistake.
|
| Whatever happened to nuanced opinion, where you can see
| good and bad in the same entity? Why do some people
| insist so strongly on absolutes?
| jerf wrote:
| What verbiage exactly are you looking for from Fastly
| here? I'm hearing "Nobody else did anything wrong, it was
| 100% a software bug on our end, and we're sorry about
| that." How much more responsibility are you asking them
| to take before you would no longer be considering them to
| "save face"? I'm trying to come up with an ironic
| exaggeration here, but I can't, because it kinda seems
| like Fastly has already taken 100% full responsibility
| and there's no room left for exaggeration.
| bennyp101 wrote:
| Don't forget that the BBC were also initially affected by
| this, and jumped on it a lot sooner than most outlets, so
| they have skin in the game.
| iso1210 wrote:
| Would love it if it was the BBC that triggered the
| problem :D
| thayne wrote:
| I don't blame them for having a bug. I do blame them for
| having a design that doesn't isolate incidents like this
| (although it is hard to know how much without more details).
| And I blame our industry for relying so much on a single
| company (and that isn't a problem unique to Fastly, or even
| our industry).
| mason55 wrote:
| > _And I blame our industry for relying so much on a single
| company (and that isn't a problem unique to Fastly, or even
| our industry)._
|
| The problem is that if Fastly is the best choice for a
| company then there's zero incentive for the company to choose
| another vendor. Everyone acting in their own best interest
| results in a sub-optimal global outcome.
|
| It's actually one of the major problems with the global,
| winner-takes-all marketplace that's evolving with the
| internet.
| yomly wrote:
| Do you roll your own power grid? Do you roll your own ISP +
| telecoms network?
|
| As a software engineer I live by the ethos that coupling
| and dependency is bad, but if you unravel the layers you
| start to realise much of our life is centralised:
|
| Roads, trains, water, electricity, internet
|
| These are quite consolidated and any of these going down
| would be very disruptive to our lives. Connected software,
| i.e. the internet, is still quite new. Being charitable, are
| these just growing pains in the journey to building out
| foundational infrastructure?
| mason55 wrote:
| > _Do you roll your own power grid? Do you roll your own
| ISP + telecoms network?_
|
| > _Roads, trains, water, electricity, internet_
|
| I guess the difference here is that you're (mostly)
| talking about physical infra, which by definition must be
| local to where it's being used. We allow (enforce?) a
| monopoly on power distribution (and separate distribution
| from generation) because it doesn't make sense to have
| every power company run their own lines. But with that
| monopoly comes regulation.
|
| Digital services are different. The entire value prop is
| that you can have an infinite number and the marginal
| cost of "hooking up" a new customer is ~$0. This
| frequently leads to a natural winner-take-all market.
|
| One way to address this is to add regulation to digital
| services, saying that they must be up x% of the time or
| respond to incidents in y minutes or whatever. But
| another way to address it is to ensure it's easy for new
| companies to disrupt the incumbents if they are acting
| poorly. The first still leads to entrenched incumbents
| who act exactly as poorly as they can get away with. The
| second actually has a chance of pushing incumbents out,
| assuming the rules are being enforced. And now you've
| basically re-discovered the current American antitrust
| laws.
|
| As far as any individual company's best interests, like
| anything else in engineering, it's about risk vs. reward.
|
| What's the cost of having a backup CDN (cost of service,
| cost of extra engineering effort, opportunity cost of
| building that instead of something else, etc.) vs. the
| cost of the occasional fastly downtime?
|
| I have to imagine that for most companies the cost of
| being multi-CDN isn't worth what they lose with a little
| down time (or four hours of downtime every four years).
| ak217 wrote:
| > One way to address this is to add regulation to digital
| services, saying that they must be up x% of the time or
| respond to incidents in y minutes or whatever.
|
| This is good reasoning but I don't think it's possible to
| legislate service level objectives like that.
|
| > But another way to address it is to ensure it's easy
| for new companies to disrupt the incumbents if they are
| acting poorly.
|
| I agree but realistically there will be many cases when a
| company is far better at something than anyone else. I
| think the only way to avoid global infra single points of
| failure is competitive bidding and multi-source
| contracts, plus competitive pressure to force robustness
| (which already works quite well).
| tw04 wrote:
| > Do you roll your own power grid?
|
| I know plenty of people in Texas who will be buying solar
| panels and batteries after last winter. I will be doing
| the same.
|
| > Do you roll your own ISP + telecoms network?
|
| If I could magically get fiber directly to an IX I would
| gladly be my own ISP. I have confidence I would do as
| good a job or better than the ISPs I've had over the
| years (yes I realize having hundreds of thousands of
| customers to service is more difficult than a single
| home).
| filleduchaos wrote:
| > I know plenty of people in Texas who will be buying
| solar panels and batteries after last winter. I will be
| doing the same.
|
| I have actually been in the position of having to rely on
| non-mains power all my life.
|
| It bloody sucks.
| iso1210 wrote:
| My datacentres have two sources of power (plus internal
| UPS), two main internet lines to two different exchange
| points (and half a dozen others), plenty of bottled
| water.
|
| At home I have emergency power, water and internet. If
| the trains stop I drive, if the car breaks I take the
| train.
| pklausler wrote:
| What's this "emergency ... internet"? A hot-spot on a
| cellular telephone?
| 867-5309 wrote:
| but do all your emergency backups have emergency backups?
| 411111111111111 wrote:
| But having everything redundantly available costs money.
| While some redundancy is easy to justify... At some point
| it becomes hard when the MBA wants to cut costs so he
| gets a bigger bonus.
|
| There is even a competitive advantage in living with the
| risk, as you have less costs and overhead... Sure, you
| might have an outage once every x years for a few
| minutes... But that's obviously the fault of the
| development team, duh
| jjk166 wrote:
| This is a classic example of an externality. You use
| regulations or lawsuits to force the costs back onto the
| decision makers. Make it so people can collect damages
| from outages, the company then needs insurance to cover
| the potential costs of an outage; if the savings from
| removing a redundancy exceed the increase in insurance
| premium then it is actually efficient, otherwise it is a
| net negative. While an actuary may make a mistake and
| underestimate the likelihood of an outage, they are far
| less incentivized to do so than the MBA looking for a
| bigger bonus.
| thayne wrote:
| Which is one reason why things like roads, trains, water,
| electricity, etc. are so heavily regulated. To prevent the
| companies that hold monopolies over the infrastructure
| from cutting corners like that.
| iseanstevens wrote:
| Sometimes, it's something as small as a missing parenthesis
| that makes things go wrong.
|
| (Slight joke here)
| madaxe_again wrote:
| I would wager that they run a single configuration, as it
| grants a significant economy of scale, rather than vertical
| partitioning of their stack, which would require headroom per
| customer and/or slice. This way you just need global
| headroom.
|
| Having done some similar stuff with varnish in the past
| (ecommerce platform), they're likely taking changes in the
| control panel and deploying them to a global config - and
| someone put something lethal in that somehow passed
| validation and got published, and did not parse.
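|
| The defense, if that's what happened, is to treat "passed
| validation" as insufficient and actually compile the
| generated output before publishing it. A sketch (the
| publish hook is hypothetical; varnishd's -C flag really
| does compile a VCL file and exit):
|
|     import subprocess
|     import tempfile
|
|     def safe_publish(generated_vcl: str, publish) -> None:
|         # Schema checks on the *input* aren't enough; make
|         # sure the *output* parses before it goes global.
|         with tempfile.NamedTemporaryFile(
|                 "w", suffix=".vcl") as f:
|             f.write(generated_vcl)
|             f.flush()
|             subprocess.run(["varnishd", "-C", "-f", f.name],
|                            check=True, capture_output=True)
|         publish(generated_vcl)  # hypothetical deploy step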
| aliasEli wrote:
| This looks like a quite likely scenario.
|
| But then we still don't know what they fixed: was it the
| incorrect configuration or the underlying bug? I would
| expect the former instead of the latter, because it is
| probably not very difficult or dangerous to change that
| specific configuration, while fixing bugs in the code seems
| riskier and would probably take more time for testing.
|
| We'll see if they publish a post-mortem. It has become
| more or less the norm these days (and they are
| frequently quite interesting).
| dkarp wrote:
| They were pretty clear about this in their response
| (linked in the article):
|
|     Once the immediate effects were mitigated, we turned
|     our attention to fixing the bug and communicating with
|     our customers. We created a permanent fix for the bug
|     and began deploying it at 17:25.
|
| So they did both. First they reverted the config, then later
| fixed the bug.
| uncertainrhymes wrote:
| If they have a bug that can crash their servers, they likely
| won't want to publicize the details until it is fully patched.
| I wouldn't expect that detail for a while.
| iainmerrick wrote:
| The blog post actually says it's fixed already (but I would
| definitely expect them to keep the details private until
| they're 100% sure, yeah)
| alvis wrote:
| As an ex-CTO with lots of firefighting experience, I want to
| give credit to the team for identifying such a user-triggered
| bug in such a short period of time. Hardly anyone would
| anticipate that a single user could trigger a meltdown of the
| internet!
| artichokeheart wrote:
| It reminds me of that ancient joke: A QA engineer walks
| into a bar. He orders a beer. Orders 0 beers. Orders
| 99999999999 beers. Orders a lizard. Orders -1 beers. Orders
| a ueicbksjdhd.
|
| First real customer walks in and asks where the bathroom
| is. The bar bursts into flames, killing everyone.
___________________________________________________________________
(page generated 2021-06-09 23:00 UTC)