[HN Gopher] Details of yesterday's Bunny CDN outage
       ___________________________________________________________________
        
       Details of yesterday's Bunny CDN outage
        
       Author : aSig
       Score  : 159 points
       Date   : 2021-06-23 09:05 UTC (13 hours ago)
        
 (HTM) web link (bunny.net)
 (TXT) w3m dump (bunny.net)
        
       | busymom0 wrote:
       | > On June 22nd at 8:25 AM UTC, we released a new update designed
       | to reduce the download size of the optimization database.
       | 
        | That's around 4:25 a.m. EST. Are updates usually done around
        | this time at other companies? Seems like that's cutting it
        | pretty close to around 8 AM, when a lot of employees start
        | working.
        | 
        | The details of the whole incident sound pretty terrifying, and
        | I'm impressed by how much pressure their admins were under and
        | that they still got it working again. Good work.
        
         | throw1742 wrote:
         | They're based in Slovenia, so that was 10:25 AM local time for
         | them.
        
           | busymom0 wrote:
            | I am going based off of the map here; Europe and then
            | North America are their biggest markets:
           | 
           | https://bunny.net/network
           | 
            | Seems like they were updating production during work hours
            | for most people, which is pretty odd imo. Usually I would
            | expect them to get this done between midnight and 2-3am.
        
             | PaywallBuster wrote:
              | If you have a global infrastructure with a worldwide
              | customer base, you'd want to do critical upgrades when
              | everyone's at the office, ready to jump on issues.
        
         | latch wrote:
         | Is there a reason to assume EST?
        
           | busymom0 wrote:
           | I was mostly going based off of their majority market being
           | Europe and North America.
        
         | tpetry wrote:
          | A CDN has a global audience; somewhere in the world, someone
          | is starting their workday at any given time.
        
       | YetAnotherNick wrote:
        | They are making it sound like they did everything right and it
        | was an issue with a third-party library. If we listed all the
        | libraries our code depends on, it would be in the thousands. I
        | can't comprehend how a CDN does not have any canary or staging
        | setup, and how in an update everything could go haywire in
        | seconds. I think it is standard practice in any decent-sized
        | company to have staging/canary and rollbacks.
        
         | bovermyer wrote:
         | That's not the impression I got. Yeah, their takeaway was to
         | stop using BinaryPack, which I disagree with. However, it
         | sounded to me like they very much understood that they made the
         | biggest error in putting all of their eggs in one basket.
         | 
         | Your system WILL go down eventually. The question is how will
         | you recover from it?
        
           | dejangp wrote:
           | Right, this was our biggest failure (not the only one of
           | course, but we are here to improve). Relying on our own
           | systems to maintain our own systems.
           | 
           | We are dropping BinaryPack mainly because we're a small team,
           | and it wasn't really a big benefit anyway, so spending more
           | time than necessary to try and salvage that makes no sense.
           | This was more of a hot-fix since we don't want the same thing
           | repeating in a week.
        
             | bovermyer wrote:
             | That makes sense then with the additional context.
             | 
             | I don't know the details of your operation, but keeping
             | your ability to update your systems separate from your
             | systems is something I'd strongly encourage.
        
         | pantulis wrote:
         | This. While failure, human or not, is unavoidable in the long
         | term, from their writeup they do not seem to have procedures to
         | avoid this particular mode of failure.
        
         | xwolfi wrote:
          | I came here to post that, yeah. I work on a sensitive system
          | where people can lose millions over a few minutes of
          | downtime, and we are a bit anal about week-long pilots where
          | half of prod is in a permanent canary stage.
          | 
          | But it also feels like they used their own infra to set up
          | their stuff, and once that infra was dead they couldn't roll
          | back, which sounds like a case of people being a bit too
          | optimistic.
          | 
          | We've had catastrophes too, notably with poison pills in a
          | record stream we can't alter, but this update cascade crash
          | sounds avoidable.
          | 
          | It's always easy to judge, anyway; it always happens to you
          | eventually :D
        
       | foobarbazetc wrote:
        | I like how something is called "auto-healing" when all it
        | really has is `Restart=on-failure` in systemd.
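        | 
        | For the curious, that kind of "auto-healing" is literally a
        | couple of lines of unit config (a minimal sketch, not Bunny's
        | actual setup; the service path is made up):
        | 
        |     [Service]
        |     # hypothetical daemon path
        |     ExecStart=/usr/local/bin/edge-dns
        |     Restart=on-failure
        |     RestartSec=5
        | 
        | Which is fine right up until every restart crashes on the same
        | corrupted file.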
       | 
       | Anyway, it's always DNS. Always.
       | 
       | "Unfortunately, that allowed something as simple as a corrupted
       | file to crash down multiple layers of redundancy with no real way
       | of bringing things back up."
       | 
       | You can spend many, many millions of $ on multi-AZ Kubernetes
       | microservices blah blah blah and it'll still be taken down by a
       | SPOF, which, 99% of the time, is DNS.
       | 
        | Actual redundancy, as opposed to "redundancy", is extremely
        | difficult to achieve because the incremental cost of one more 9
        | grows almost exponentially.
        | 
        | And then a customer updates their configuration and your entire
        | global service goes down for hours, a la Fastly.
       | 
       | Or a single corrupt file crashes your entire service.
        
         | tyingq wrote:
         | >Anyway, it's always DNS. Always.
         | 
          | Which is disappointing, because DNS is an infrastructure
          | whose backend is VERY easy to make highly redundant. It gets
          | thwarted by decisions not to do that easy work, or by client
          | libraries that don't take advantage of it.
        
       | ram_rar wrote:
       | > On June 22nd at 8:25 AM UTC, we released a new update designed
       | to reduce the download size of the optimization database.
       | Unfortunately, this managed to upload a corrupted file to the
       | Edge Storage.
       | 
        | I wonder if simple checksum verification of the file would have
        | helped avoid this outage altogether.
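        | 
        | Something like publishing a digest next to the file and
        | refusing to load on a mismatch is cheap. A rough Go sketch of
        | the idea (made-up file names, and per the reply from Bunny
        | further down, it only catches corruption introduced after the
        | file was generated):
        | 
        |     package db
        | 
        |     import (
        |         "crypto/sha256"
        |         "encoding/hex"
        |         "fmt"
        |         "os"
        |         "strings"
        |     )
        | 
        |     // loadVerified loads the optimization DB only if it
        |     // matches the SHA-256 digest published alongside it.
        |     func loadVerified(dbPath, sumPath string) ([]byte, error) {
        |         data, err := os.ReadFile(dbPath)
        |         if err != nil {
        |             return nil, err
        |         }
        |         want, err := os.ReadFile(sumPath)
        |         if err != nil {
        |             return nil, err
        |         }
        |         sum := sha256.Sum256(data)
        |         got := hex.EncodeToString(sum[:])
        |         if got != strings.TrimSpace(string(want)) {
        |             return nil, fmt.Errorf("digest mismatch: %s", dbPath)
        |         }
        |         return data, nil
        |     }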
       | 
       | > Turns out, the corrupted file caused the BinaryPack
       | serialization library to immediately execute itself with a stack
       | overflow exception, bypassing any exception handling and just
       | exiting the process. Within minutes, our global DNS server fleet
       | of close to a 100 servers was practically dead
       | 
        | This is exactly why one needs canary-based deployments. I have
        | seen umpteen issues caught in canary, which has saved my team
        | tons of firefighting time.
        
         | dpcx wrote:
          | In the post or comments, they claimed to be using canaries;
          | perhaps their canary simply didn't die in the coal mine?
        
         | jgrahamc wrote:
         | _I wonder, if simple checksum verification of the file would
         | have helped in avoiding this outage all together._
         | 
         | Oh man, you stirred up a really old Cloudflare memory. Back
         | when I was working on our DNS infrastructure I wrote up a task
         | that says: "RRDNS has no way of knowing how many lines to
         | expect or whether what it is read is valid. This could create
         | an issue where the LB map data is not available inside RRDNS."
         | 
         | At the time this "LB map" thing was critical to the mapping
         | between a domain name and its associated IP address(es).
          | Without it Cloudflare wouldn't work. Re-reading the years-old
          | Jira I see myself and Lee Holloway discussing the checksumming
         | of the data. He implemented the writing of the checksum and I
         | implemented the read and check.
         | 
         | I miss Lee.
        
           | methyl wrote:
            | For those who, like myself, don't know the story, here it
            | is:
           | https://www.wired.com/story/lee-holloway-devastating-
           | decline...
           | 
           | I'm deeply moved after reading it. Can't imagine how tragic
           | it must be for people who know Lee.
        
             | dylanz wrote:
             | That was an incredible story, and I went down a rabbit hole
             | of reading more about that disease. Thank you very much for
             | sharing.
        
             | yabones wrote:
             | Wow, that is absolutely tragic. Neurodegenerative diseases
             | are something I fear the most, having seen what
             | Huntington's can do to somebody.
        
       | manigandham wrote:
       | All this focus on redundancy should be replaced with a focus on
       | recovery. Perfect availability is already impossible. For all
       | practical uses, something that recovers within minutes is better
       | than trying to always be online and failing horribly.
        
         | EricE wrote:
         | A backup vendor once pointed out that backup was the most
         | misnamed product/function in all of computerdom. He argued it
         | should really be referred to as restore, since when the chips
         | are down that's what you really, really care about. That really
         | resonated with the young sysadmin I was at the time.
         | 
          | Very similar to the story about the planes coming back with
          | holes in WWII: the initial analysis was to add more armor
          | where the holes were, until someone flipped it and pointed
          | out that armor was needed where the holes _weren't_, since
          | planes with holes in those spots weren't the ones coming
          | back.
        
       | qaq wrote:
        | And thanks to the writeup making it to the top of HN, I (and
        | probably many more people here) have now learned about the
        | existence of Bunny CDN.
        
       | debarshri wrote:
        | I can imagine how stressful the situation was, but it was a
        | pleasure to read. It again goes to show that no matter how
        | prepared or how optimized/over-optimized you want to be, there
        | will always be a situation you never accounted for, and sh*t
        | will hit the fan. That is the reality of IT ops.
        
       | corobo wrote:
        | I didn't notice the outage, but I do appreciate the automatic
        | SLA honouring, plus letting me know.
       | 
       | Nice work Bunny CDN.
        
       | string wrote:
        | Good and clear explanation. This is a risk you take when you
        | use a CDN; I still think the benefits outweigh the occasional
        | downtime. I'm a big fan of BunnyCDN, they've saved me a lot of
        | money over the past few years.
        | 
        | I'm sure I'd be fuming if I worked at some multi-million dollar
        | company, but as someone that mainly works for smaller
        | businesses it's not the end of the world. I suspect most of my
        | clients haven't even noticed yet.
        
         | manishsharan wrote:
          | TIL about BunnyCDN. I had been paying $0.08 per GB on AWS
          | Cloudfront whereas BunnyCDN is only $0.01 per GB. Can you
          | comment on your experience with them? Are the APIs
          | comprehensive, e.g. cache invalidation? Do they support
          | cookie-based authorization? Any support for geo-fencing?
        
           | kalev wrote:
            | Using a CDN for the first time improved our site's [1]
            | performance by a huge amount, thanks to BunnyCDN. Really
            | easy to set up, great dashboard. The image optimizer for a
            | flat rate works really, really well. The only missing
            | option is rotating images, which I opened a feature request
            | for with them.
            | 
            | You can see our CDN usage by inspecting the URLs of the
            | product images. Size attributes are added to the URL and
            | Bunny automatically resizes and compresses the images on
            | the fly.
           | 
           | [1] https://www.airsoftbazaar.nl
        
           | minxomat wrote:
           | I'm using them quite extensively (except the Stream video
           | feature). APIs are good, traffic can be restricted or
            | rerouted based on geo. Not sure what cookie-based auth
            | would do in a CDN, but if it's on the origin it passes
            | through. For authenticating URLs there is a signing scheme
            | you can use.
        
           | string wrote:
            | I noticed another user has already commented; it sounds
            | like they've had more experience with the things you're
            | interested in than I have. FWIW, the APIs have been
            | sufficient for my use cases and you can definitely purge a
            | pull zone's cache with them.
            | 
            | My primary use has been serving image assets. I switched
            | over from Cloudfront and have seen probably a >80% cost
            | reduction with no noticeable performance reduction, but as
            | I mentioned, I'm operating at a scale where milliseconds of
            | difference don't mean much.
        
           | sudhirj wrote:
            | I think the answer is yes to all three questions, depending
            | on the specifics. They've got a nice setup, about ~40+ edge
            | locations compared to Cloudfront's ~200+, but the advantage
            | is they're massively cheaper for a very small increase in
            | latency. They also have the ~5-region high-volume tier,
            | which is something like another order of magnitude cheaper.
           | 
           | The feature set is pretty full, no edge functions, but there
           | is a rule engine you can run on the edge. Fast config
           | updates, nice console and works well enough for most of my
           | projects.
           | 
           | They also have a nice integrated storage solution that's way
           | easier to configure than S3 + Cloudfront, and lots of origin
           | shielding options.
        
           | eeeeeeeee4eeeee wrote:
           | is this google?
        
           | aitchnyu wrote:
           | Can this allow me to route x.mydomain.com (more than one
           | wildcard and top level) to x.a.run.app (Google Cloud Run)?
           | Cloud Run (and the Django app behind it) won't approve domain
           | mapping for Mumbai yet so I am looking for transparent domain
            | rewriting. Cloudflare allows it, but it's kinda expensive.
           | 
           | https://cloud.google.com/run/docs/locations#domains
        
             | gcbirzan wrote:
              | As the docs say, you can use an LB with this. It'll be 18
              | dollars a month, though.
        
       | zerop wrote:
       | on a different note, this outage news will give them more
       | publicity than the product itself, I believe...
        
         | FerretFred wrote:
         | TIL .. of bunny.net :-)
        
       | busymom0 wrote:
       | One of the comments on the post is:
       | 
       | > One thing you could do in future is to url redirect any
       | BunnyCDN url back to the clients original url, in essence
       | disabling the CDN and getting your clients own hosts do what they
       | were doing before they connected to BunnyCDN, yes it means our
       | sites won't be as fast but its better than not loading up the
       | files at all. I wonder if that is possible in technical terms?
       | 
        | Isn't this a horrible idea? If you use Bunny, this would cause
        | a major spike in traffic, and thus costs, from your origin
        | server.
        
         | ev1 wrote:
          | Doing this when you are intentionally trying to protect or
          | hide your origin is effectively a guaranteed kill on the
          | origin. For example, if Cloudflare unproxied one of my
          | subdomains I'd leave them immediately, and likely have to
          | change all my infrastructure and providers due to attacks.
         | 
         | This is also a terrible idea because of ACLs/firewalls only
         | allowing traffic from CDN (this is _extremely_ common for
         | things like Cloudflare and Akamai) and relying on the CDN for
         | access control.
        
         | dindresto wrote:
         | Also, how would this work if their whole infrastructure is
         | down? The same problem that prevented them from fixing the
         | network would also have prevented them from adding such a
         | redirect.
        
         | corobo wrote:
          | Yeah, please don't do this without me having checked a box to
          | opt in, haha.
          | 
          | There's a reason I use a CDN; let me decide whether my site is
          | up or down when the CDN is down. If I want failover, I'll do
          | that bit myself.
        
       | nathanganser wrote:
        | I'm impressed by the transparency and clarity of their
        | explanation! It definitely makes me want to use their solution,
        | even though they messed up big time!
        
       | slackerIII wrote:
       | Oh, this is a great writeup. I co-host a podcast on outages, and
       | over and over we see cases where circular dependencies end up
       | making recovery much harder. Also, not using a staged deployment
       | is a recipe for disaster!
       | 
       | We just wrapped up the first season, but I'm going to put this on
       | the list of episodes for the second season:
       | https://downtimeproject.com.
        
         | 0des wrote:
         | This is great! I love these types of podcasts. Adding this one
         | to my subscriptions list right now. A bunny CDN episode would
         | be fun. Thanks for putting this podcast on my radar.
        
       | path411 wrote:
       | Sounds like they got really lucky they could get it back up so
       | quickly. They must have some very talented engineers working
       | there.
       | 
        | My takeaways, though, were that they should have tested the
        | update better, that they should have their production
        | environment more segmented, with staggered updates, so that
        | disasters are much more contained, and that they should have
        | had much better catastrophic failure plans in place.
        
         | ing33k wrote:
          | They are a very small team. Their CEO codes!
        
           | dejangp wrote:
            | I do! In fact, I work like 90 hours a week. I decided to go
            | the bootstrap way (looking back, I'm not sure that was the
            | best idea, but we are where we are), and we're growing 3X
            | year over year, so things are picking up. :)
        
         | holstvoogd wrote:
          | It was not that quick, tbh. We were seeing intermittent
          | issues for several hours after the initial problem arose.
          | 
          | It taught me a valuable lesson: make sure it is easy to
          | switch to another CDN and to update cached/stored URLs.
        
       | christophilus wrote:
       | Happy BunnyCDN user here. Thanks for the writeup.
       | 
       | > Both SmartEdge and the deployment systems we use rely on Edge
       | Storage and Bunny CDN to distribute data to the actual DNS
       | servers. On the other hand, we just wiped out most of our global
       | CDN capacity.
       | 
       | That's the TLDR. What a stressful couple of hours that must have
       | been for their team.
        
       | lclarkmichalek wrote:
        | These follow-ups aren't super compelling, IMO.
       | 
       | > To do this, the first and smallest step will be to phase out
       | the BinaryPack library and make sure we run a more extensive
       | testing on any third-party libraries we work with in the future.
       | 
       | Sure. Not exactly a structural fix. But maybe worth doing.
       | Another view would be that you've just "paid" a ton to find
       | issues in the BinaryPack library, and maybe should continue to
       | invest in it.
       | 
       | Also, "do more tests" isn't a follow up. What's your process for
       | testing these external libs, if you're making this a core part of
       | your reliability effort?
       | 
       | > We are currently planning a complete migration of our internal
       | APIs to a third-party independent service. This means if their
       | system goes down, we lose the ability to do updates, but if our
       | system goes down, we will have the ability to react quickly and
       | reliably without being caught in a loop of collapsing
       | infrastructure.
       | 
       | Ok, now tell me how you're going to test it. Changing
       | architectures is fine, but until you're running drills of core
       | services going down, you don't actually know you've mitigated the
       | "loop of collapsing infrastructure" issue.
       | 
       | > Finally, we are making the DNS system itself run a local copy
       | of all backup data with automatic failure detection. This way we
       | can add yet another layer of redundancy and make sure that no
       | matter what happens, systems within bunny.net remain as
       | independent from each other as possible and prevent a ripple
       | effect when something goes wrong.
       | 
       | Additional redundancy isn't a great way of mitigating issues
       | caused by a change being deployed. Being 10x redundant usually
       | adds quite a lot of complexity, provides less safety than it
       | seems (again, do you have a plan to regularly test that this
       | failover mode is working?) and can be less effective than
       | preventing issues getting to prod.
       | 
        | What would be nice to see is a full review of the detection,
        | escalation, remediation, and prevention for this incident.
       | 
       | More specifically, the triggering event here, the release of a
       | new version of software, isn't super novel. More discussion of
        | follow-ups that are systematic improvements to the release
       | process would be useful. Some options:
       | 
       | - Replay tests to detect issues before landing changes
       | 
       | - Canaries to detect issues before pushing to prod
       | 
       | - Gradual deployments to detect issues before they hit 100%
       | 
        | - Even better, isolated gradual deployments (i.e. deploy region
        | by region, zone by zone) to mitigate the risk of issues
        | spreading between regions (rough sketch below).
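        | 
        | A very rough sketch of that last option, with hypothetical
        | deploy and health-check hooks and nothing Bunny-specific:
        | 
        |     package rollout
        | 
        |     import (
        |         "fmt"
        |         "time"
        |     )
        | 
        |     // Staged rollout: push a version zone by zone and abort
        |     // the moment a freshly updated zone looks unhealthy.
        |     // deploy and healthy are hypothetical hooks.
        |     func Staged(
        |         version string,
        |         zones []string,
        |         deploy func(zone, version string) error,
        |         healthy func(zone string) bool,
        |     ) error {
        |         for _, zone := range zones {
        |             if err := deploy(zone, version); err != nil {
        |                 return fmt.Errorf("deploy %s: %w", zone, err)
        |             }
        |             // let the zone serve real traffic before the
        |             // rollout is allowed to continue
        |             time.Sleep(10 * time.Minute)
        |             if !healthy(zone) {
        |                 return fmt.Errorf("%s unhealthy, halting", zone)
        |             }
        |         }
        |         return nil
        |     }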
       | 
       | Beyond that, start thinking about all the changing components of
       | your product, and their lifecycle. It sounds like here some data
       | file got screwed up as it was changed. Do you stage those changes
       | to your data files? Can you isolate regional deployments
       | entirely, and control the rollout of new versions of this data
       | file on a regional basis? Can you do the same for all other
       | changes in your system?
        
         | holstvoogd wrote:
          | This. I am not at all reassured that it won't happen again.
          | Next week, perhaps.
          | 
          | Also, their DNS broke last month as well, but I guess we
          | won't mention that, as it would invalidate 2 years of stellar
          | reliability.
        
       | TacticalCoder wrote:
        | Slightly off-topic, but what about the big outage from a few
        | days/weeks ago where half the Internet was down (exaggerating
        | only a little bit)? Has there been a postmortem I missed?
        
         | penguinten wrote:
         | https://www.fastly.com/blog/summary-of-june-8-outage
        
           | altmind wrote:
            | Fastly's RCA is underwhelming: no info on what the
            | component was, what happened, or how the situation was
            | tackled.
        
             | rapsey wrote:
             | I think a public company will never dare go into as much
             | detail as bunny did. Or maybe it is just the size of the
             | organisation that discourages that.
        
               | plett wrote:
               | Cloudflare's RFO blog posts are incredibly detailed. Each
               | time I read one, I feel reasonably confident that they
                | have learnt from the mistakes that led to that outage
               | and that it shouldn't happen again.
               | 
               | https://blog.cloudflare.com/tag/outage/
        
       | ruuda wrote:
       | Not to say that additional mitigations are inappropriate, but a
       | stack overflow when parsing a corrupt file sounds like something
       | that could have easily been found by a fuzzer.
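        | 
        | Go's built-in fuzzing makes this almost a one-liner these days;
        | the same idea applies to whatever fuzzer fits their stack. A
        | minimal sketch against a hypothetical decodeOptimizationDB:
        | 
        |     package db
        | 
        |     import "testing"
        | 
        |     // FuzzDecode throws mutated inputs at the parser. Any
        |     // panic or crash on garbage input is a bug; a corrupt
        |     // file should only ever produce an error.
        |     func FuzzDecode(f *testing.F) {
        |         f.Add([]byte("seed: a known-good serialized blob"))
        |         f.Fuzz(func(t *testing.T, data []byte) {
        |             _, _ = decodeOptimizationDB(data) // hypothetical
        |         })
        |     }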
        
       | dejangp wrote:
       | Dejan here from bunny.net. I was reading some of the comments,
       | but wasn't sure where to reply, so I guess I'll post some
       | additional details here. I tried to keep the blog post somewhat
       | technical, but not overwhelm non-technical readers.
       | 
       | So to add some details, we already use multiple deployment groups
       | (one for each DNS cluster). We always deploy each cluster
       | separately to make sure we're not doing something destructive.
       | Unfortunately this deployment went to a system that we believed
       | was not a critical part of infrastructure (oh look how wrong we
       | were) and was not made redundant, since the rest of the code was
       | supposed to handle it gracefully in case this whole system was
       | offline or broken.
       | 
       | It was not my intention to blame the library, obviously this was
       | our own fault, but I must admit we did not expect a stack
       | overflow out of it, which completely obliterated all of the
       | servers immediately when the "non-critical" component got
       | corrupted.
       | 
        | This piece of data is highly dynamic and is regenerated every
        | 30 seconds or so based on hundreds of thousands of metrics.
        | Running a checksum would have done no good here, because the
        | distributed file was perfectly fine. The issue happened when it
        | was being generated, not distributed.
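        | 
        | The more useful check would be at generation time: round-trip
        | the freshly generated blob through the same deserializer the
        | edge uses before publishing it, and keep serving the last good
        | version if that fails. A rough sketch of the idea, not our
        | actual pipeline (generate, decode and publish are
        | placeholders):
        | 
        |     package publish
        | 
        |     import "fmt"
        | 
        |     // publishIfValid refuses to ship a blob that the edge
        |     // deserializer cannot read. Even if decoding crashes,
        |     // that only takes down the generator, not the DNS fleet.
        |     func publishIfValid(
        |         generate func() ([]byte, error),
        |         decode func([]byte) error,
        |         publish func([]byte) error,
        |     ) error {
        |         blob, err := generate()
        |         if err != nil {
        |             return err
        |         }
        |         if err := decode(blob); err != nil {
        |             return fmt.Errorf("corrupt blob: %w", err)
        |         }
        |         return publish(blob)
        |     }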
       | 
       | Now for the DNS itself, which is a critical part of our
       | infrastructure.
       | 
       | We of course operate a staging environment with both automated
       | testing and manual testing before things go live.
       | 
       | We also operate multiple deployment groups so separate clusters
       | are deployed first, before others go live, so we can catch
       | issues.
       | 
       | We do the same for the CDN and always use canary testing if
       | possible. We unfortunately never assumed this piece of software
       | could cause all the DNS servers to stack overflow.
       | 
        | Obviously, as I mentioned, we are not perfect, but we are
        | trying to improve on what happened. The biggest flaw we
        | discovered was the reliance on our own infrastructure to handle
        | our own infrastructure deployments.
       | 
       | We have code versioning and CI in place as well as the options to
        | do rollbacks as needed. If the issue had happened under normal
        | circumstances, we would have been able to roll back all the
        | software instantly and maybe experience 2-5 minutes of
        | downtime. Instead, we brought the whole system down like
        | dominoes because it all relied on each other.
       | 
       | Migrating deployment services to third-party solutions is
       | therefore our biggest fix at this point.
       | 
        | The reason we are moving away from BinaryPack is that it simply
        | wasn't really providing that much benefit. It was helpful,
       | but it wasn't having a significant impact on the overall
       | behavior, so we would rather stick with something that worked
       | fine for years without issues. As a small team, we don't have the
       | time or resources to spend improving it at this point.
       | 
        | I'm somewhat exhausted after yesterday, so I hope this is not
        | too unstructured, and that it answers some questions rather
        | than creating more of them :)
       | 
       | If I missed any suggestions or something that was unclear, please
       | let me know. We're actively trying to improve all the processes
       | to avoid similar situations in the future.
        
         | jgrahamc wrote:
         | Thanks for the write up. I enjoyed reading it.
        
         | jiofih wrote:
         | > Unfortunately this deployment went to a system that we
         | believed was not a critical part of infrastructure
         | 
          | Deploying to a different, lower-priority system is not a
          | canary. Do you phase deployments to each system, per host or
          | zone?
        
           | dejangp wrote:
           | For critical systems (or let's call them services) such as
           | DNS, CDN, optimizer, storage, we usually deploy either on a
            | server-to-server, regional, or cluster basis before going
            | live. What I meant here was that this was not really a
            | critical service; nobody thought it could actually cause
            | any harm, so we didn't do canary testing there, as it would
            | have added a very high level of complexity.
        
         | ing33k wrote:
          | Hey Dejan, we have been using BunnyCDN for quite some time.
          | Thanks for the detailed writeup.
          | 
          | It looks like storage zones are still not fully stable? After
          | experiencing several issues with storage zones earlier, we
          | migrated to a pull zone. We didn't have any major issues
          | after the migration.
          | 
          | What plans do you have to improve your storage zones?
        
           | dejangp wrote:
           | Hey, glad to hear that and sorry again about any issues. If
           | you're experiencing any ongoing problems, please message our
           | support team. I'm not aware of anything actively broken, but
           | if there's a problem I'm sure we'll be able to help.
        
             | gazby wrote:
             | I have also had problems with storage zones. We experienced
             | multiple periods of timeouts, super long TTFB, and 5xx
             | responses. A ticket was opened (#136096) about the TTFB
             | issue with full headers/curl output with an offer to supply
             | any further useful information, but the response of "can
             | you confirm this is no longer happening?" the following day
             | discouraged me from further time spent there.
             | 
             | To this day US PoPs are still pulling from EU storage
             | servers (our storage zone is in NY, replicated in DE).
              | 
              |     < Server: BunnyCDN-IL1-718
              |     < CDN-RequestCountryCode: US
              |     < CDN-EdgeStorageId: 617
              |     < CDN-StorageServer: DE-51
             | 
             | We've since moved away from Bunny, but if there's anything
             | I can do to help improve this situation I'd be happy to do
             | it because it is otherwise a fantastic product for the
             | price.
        
               | T4m2 wrote:
                | We had the same: super long TTFB and lots of 5xx
                | errors. It seems to be mostly fixed now, and there are
                | definitely things that could be done differently, but
                | given the pricing and feature set I'm happy with the
                | service.
                | 
                | I would love additional capabilities within the image
                | optimizer, such as more cropping methods.
        
         | thekonqueror wrote:
          | Hi Dejan, we are evaluating Bunny for a long-term multi-tenant
          | project. Today your support mentioned that the CDN optimizer
          | strips all origin headers. Is there any way to permit some
          | headers on a per-zone basis?
        
         | sysbot wrote:
         | From the article "Turns out, the corrupted file caused the
         | BinaryPack serialization library to immediately execute itself
         | with a stack overflow exception, bypassing any exception
         | handling and just exiting the process. Within minutes, our
         | global DNS server fleet of close to a 100 servers was
         | practically dead." and from your comment "We do the same for
         | the CDN and always use canary testing if possible. We
         | unfortunately never assumed this piece of software could cause
         | all the DNS servers to stack overflow."
         | 
          | This reads like the DNS software was being changed. As some
          | people have already mentioned: is this a corruption where a
          | checksum would have prevented the stack overflow, or would a
          | canary have detected it? Why would a change to DNS server
          | software not be canaried?
        
           | unilynx wrote:
           | I read it as "DNS software changed, that worked fine, but it
           | turns out we sometimes generate a broken database - not often
           | enough to see it hit during canary, but devastating when it
           | finally happened"
           | 
            | GP also notes that this database changes perhaps every 30
            | seconds.
           | 
            | Just a few guesses: if you have a process that corrupts a
            | random byte every 100,000 runs, and you run it every 30
            | seconds, it might take days before you're at 50% odds of
            | having seen it happen. And if that used to be a text or
            | JSON database, flipping a random bit might not even corrupt
            | anything important. Or if the code swallows the exception
            | at some level, it might even self-heal after 30 seconds
            | when new data comes in, causing an unnoticed blip in the
            | monitoring, if anything at all.
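            | 
            | Back-of-the-envelope with those made-up odds: at 1 in
            | 100,000 per run and one run every 30 seconds, you need
            | about ln(2)/p ~ 69,000 runs for a 50% chance of seeing it,
            | which is roughly 24 days. A canary window of hours would
            | almost never catch it.
            | 
            |     package main
            | 
            |     import (
            |         "fmt"
            |         "math"
            |     )
            | 
            |     func main() {
            |         p := 1.0 / 100000           // odds per run (a guess)
            |         runsPerDay := 24.0 * 60 * 2 // one run per 30 seconds
            |         n := math.Log(2) / p        // runs to reach 50% odds
            |         fmt.Printf("%.0f days\n", n/runsPerDay) // ~24 days
            |     }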
           | 
            | Now I don't know what BinaryPack does exactly, but if you
            | were to replace the above process with something that
            | compresses data, a flipped bit will corrupt a lot more
            | data, often everything from that point forwards (whereas
            | text or JSON is pretty self-synchronizing). And if your new
            | code falls over completely when that happens, there's no
            | more self-healing.
           | 
           | I can totally imagine missing an event like that during
           | canary testing
        
       | EricE wrote:
       | _It's not DNS_
       | 
        |  _There is no way it's DNS_
       | 
       |  _It was DNS_
       | 
       | One of the most bittersweet haikus for any sysadmin :p
        
       | patrickbolle wrote:
       | Great write-up. I've just switched from Cloudinary to Backblaze
       | B2 + Bunny CDN and I am saving a pretty ridiculous amount of
       | money for hosting thousands of customer images.
       | 
        | Bunny has a great interface and service; I'm really surprised
        | how few people know about it. I think I discovered it on some
        | 'top 10 CDNs' list that I usually ignore, but the pricing was
        | too good to pass up.
       | 
       | The team is really on the ball from what I've seen. Appreciate
       | the descriptive post, folks!
        
       | xarope wrote:
       | I would think critical systems and updates should also have some
       | form of out-of-band access channel?
        
       | zamalek wrote:
       | This brings up one of my pet peeves: recursion. Of course there
       | should have been other mitigations in place, but recursion is
       | _such_ a dangerous tool. So far as reasonably possible, I
       | consider its only real purpose to confuse students in 101
       | courses.
       | 
        | I assume that they are using .NET, as stack overflow exceptions
        | (SOEs) bring down .NET processes. While that sounds like a
        | strange implementation detail, the philosophy of the .NET team
        | has always been "how do you reasonably recover from a stack
        | overflow?" Even in C++, what happens if, for example, the
        | allocator experiences a stack overflow while deallocating some
        | RAII resource, or a finally block calls a function and
        | allocates stack space, or... you get the idea.
       | 
       | The obvious thing to do here would be to limit recursion in the
       | library (which amounts to safe recursion usage). BinaryPack does
       | not have a recursion limit option, which makes it unsafe for any
       | untrusted data (and that can include data that you produce, as
       | Bunny experienced). Time to open a PR, I guess.
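        | 
        | The fix is usually small: thread a depth counter through the
        | recursive descent and return an error instead of blowing the
        | stack. A sketch of the shape (not BinaryPack's actual code):
        | 
        |     package decode
        | 
        |     import "errors"
        | 
        |     // maxDepth bounds recursion so corrupt or hostile input
        |     // produces an error instead of a stack overflow.
        |     const maxDepth = 64
        | 
        |     var errTooDeep = errors.New("input nested too deeply")
        | 
        |     func decodeValue(data []byte, depth int) ([]byte, error) {
        |         if depth > maxDepth {
        |             return nil, errTooDeep
        |         }
        |         // ... parse one value; for each nested child, call
        |         // decodeValue(rest, depth+1) and propagate errors ...
        |         return data, nil
        |     }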
       | 
       | This applies to JSON, too. I would suggest that OP configure
       | their serializer with a limit:
       | 
       | [1]: https://www.newtonsoft.com/json/help/html/MaxDepth.htm
        
         | bob1029 wrote:
         | > recursion is such a dangerous tool.
         | 
         | The most effective tools for the job are usually the more
         | dangerous ones. Certainly, you can do anything without
         | recursion, but forcing this makes a lot of problems much harder
         | than they need to be.
        
         | vfaronov wrote:
          | > _While that sounds like a strange implementation detail,
          | the philosophy of the .NET team has always been "how do you
          | reasonably recover from a stack overflow?"_
         | 
         | Can you expand on this or link to any further reading? I just
         | realized that this affects my platform (Go) as well, but I
         | don't understand the reasoning. Why can't stack overflow be
         | treated just like any other exception, unwinding the stack up
         | to the nearest frame that has catch/recover in place (if any)?
        
           | zamalek wrote:
           | > Why can't stack overflow be treated just like any other
           | exception[...]?
           | 
            | Consider the following code:
            | 
            |     func overflows() {
            |         defer a()
            |         fmt.Println("hello") // <-- stack overflow occurs within
            |     }
            | 
            |     func a() {
            |         fmt.Println("hello")
            |     }
           | 
            | The answer lies in trying to figure out how Go would
            | successfully unwind that stack; it can't: when it calls `a`
            | it will simply overflow again. Something that has been
            | discussed is a "StackAboutToOverflowException", but that
            | only kicks the can down the road (unwinding could still
            | cause an overflow).
           | 
           | In truth, the problem exists because of implicit calls at the
           | end of methods interacting with stack overflows, whether
           | that's because of defer-like functionality, structured
            | exception handling, or destructors.
        
         | theandrewbailey wrote:
         | > So far as reasonably possible, I consider its only real
         | purpose to confuse students in 101 courses.
         | 
         | I had a "high school" level programming class with Python
         | before studying CS. I ran into CPython's recursion limit often
         | and wondered why one would use recursion when for loops were a
         | more reliable solution.
         | 
         | Nowadays, my "recursion" is for-looping over an object's
         | children and calling some function on each child object.
        
       ___________________________________________________________________
       (page generated 2021-06-23 23:02 UTC)