[HN Gopher] Details of yesterday's Bunny CDN outage
___________________________________________________________________
Details of yesterday's Bunny CDN outage
Author : aSig
Score : 159 points
Date : 2021-06-23 09:05 UTC (13 hours ago)
(HTM) web link (bunny.net)
(TXT) w3m dump (bunny.net)
| busymom0 wrote:
| > On June 22nd at 8:25 AM UTC, we released a new update designed
| to reduce the download size of the optimization database.
|
| That's around 4:25 a.m. EST. Are updates usually done around
| this time at other companies? Seems like that's cutting it
| pretty close to the 8 a.m. mark when a lot of employees start
| working.
|
| The details of the whole incident sound pretty terrifying, and
| I'm impressed by how much pressure their admins were under and
| that they got it working again. Good work.
| throw1742 wrote:
| They're based in Slovenia, so that was 10:25 AM local time for
| them.
| busymom0 wrote:
| I am going based off of the map here; Europe and then North
| America are their biggest markets:
|
| https://bunny.net/network
|
| Seems like they were updating production during work hours
| for most people, which is pretty odd, IMO. Usually I would
| expect them to get this done between midnight and 2-3 AM.
| PaywallBuster wrote:
| If you have global infrastructure with a worldwide customer
| base, you'd want to do critical upgrades when everyone's at
| the office, ready to jump on issues.
| latch wrote:
| Is there a reason to assume EST?
| busymom0 wrote:
| I was mostly going based off of their majority market being
| Europe and North America.
| tpetry wrote:
| A CDN has a global audience; somewhere in the world, someone
| is always just starting work.
| YetAnotherNick wrote:
| They're making it sound like they did everything right and it
| was an issue with a third-party library. If we list all the
| libraries our code depends on, it will be in the thousands. I
| can't comprehend how a CDN does not have any canary or staging
| setup, and how one update could send everything haywire in
| seconds. I think it is standard practice in any decent-sized
| company to have staging/canary and rollbacks.
| bovermyer wrote:
| That's not the impression I got. Yeah, their takeaway was to
| stop using BinaryPack, which I disagree with. However, it
| sounded to me like they very much understood that they made the
| biggest error in putting all of their eggs in one basket.
|
| Your system WILL go down eventually. The question is how will
| you recover from it?
| dejangp wrote:
| Right, this was our biggest failure (not the only one of
| course, but we are here to improve). Relying on our own
| systems to maintain our own systems.
|
| We are dropping BinaryPack mainly because we're a small team,
| and it wasn't really a big benefit anyway, so spending more
| time than necessary to try and salvage that makes no sense.
| This was more of a hot-fix since we don't want the same thing
| repeating in a week.
| bovermyer wrote:
| That makes sense then with the additional context.
|
| I don't know the details of your operation, but keeping
| your ability to update your systems separate from your
| systems is something I'd strongly encourage.
| pantulis wrote:
| This. While failure, human or not, is unavoidable in the long
| term, from their writeup they do not seem to have procedures to
| avoid this particular mode of failure.
| xwolfi wrote:
| I came to post that, yeah. I work on a sensitive system where
| people can lose millions over a few minutes of downtime, and
| we are a bit anal about week-long pilots where half of prod is
| in a permanent canary stage.
|
| But it also feels like they used their own infra to set up
| their own stuff, and if that infra was dead they couldn't roll
| back, which sounds like a case of people being a bit too
| optimistic.
|
| We had catastrophes too, notably poison pills in a record
| stream we can't alter, but this update cascade crash sounds
| avoidable.
|
| Always easy to judge anyway, always happens to you eventually
| :D
| foobarbazetc wrote:
| I like how something counts as "auto-healing" when all it
| really has is `Restart=on-failure` in systemd.
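|
| (For the record, that flavour of "auto-healing" is often just a
| unit file along these lines; a minimal sketch, not anyone's
| actual setup:)
|
|     [Service]
|     ExecStart=/usr/local/bin/dns-server
|     # Restart the process whenever it exits abnormally, e.g.
|     # after an unhandled exception kills it.
|     Restart=on-failure
|     RestartSec=5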
|
| Anyway, it's always DNS. Always.
|
| "Unfortunately, that allowed something as simple as a corrupted
| file to crash down multiple layers of redundancy with no real way
| of bringing things back up."
|
| You can spend many, many millions of $ on multi-AZ Kubernetes
| microservices blah blah blah and it'll still be taken down by a
| SPOF, which, 99% of the time, is DNS.
|
| Actual redundancy, as opposed to "redundancy", is extremely
| difficult to achieve because the incremental costs of one more 9
| are almost exponential.
|
| And then a customer updates their configuration and your entire
| global service goes down for hours, a la Fastly.
|
| Or a single corrupt file crashes your entire service.
| tyingq wrote:
| >Anyway, it's always DNS. Always.
|
| Which is disappointing. An infrastructure where the backend is
| VERY easy to make highly redundant. Thwarted by decisions not
| to do that easy work, or thwarted by client libraries that
| don't take advantage of it.
| ram_rar wrote:
| > On June 22nd at 8:25 AM UTC, we released a new update designed
| to reduce the download size of the optimization database.
| Unfortunately, this managed to upload a corrupted file to the
| Edge Storage.
|
| I wonder if simple checksum verification of the file would have
| helped avoid this outage altogether.
|
| > Turns out, the corrupted file caused the BinaryPack
| serialization library to immediately execute itself with a stack
| overflow exception, bypassing any exception handling and just
| exiting the process. Within minutes, our global DNS server fleet
| of close to a 100 servers was practically dead
|
| This is exactly why one needs canary-based deployments. I have
| seen umpteen issues caught in canary, which has saved my team
| tons of firefighting time.
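|
| A minimal sketch of that kind of check, in Go with hypothetical
| names (and note, per Dejan's comment further down, the file here
| was corrupted at generation time, so the hash would have to be
| taken before publishing, not only verified on download):
|
|     package main
|
|     import (
|         "crypto/sha256"
|         "encoding/hex"
|         "fmt"
|     )
|
|     // verifyChecksum re-hashes a downloaded blob and compares it
|     // against the checksum published alongside it, so a corrupt
|     // copy is rejected instead of being fed to the parser.
|     func verifyChecksum(blob []byte, wantHex string) error {
|         sum := sha256.Sum256(blob)
|         if got := hex.EncodeToString(sum[:]); got != wantHex {
|             return fmt.Errorf("checksum mismatch: %s != %s",
|                 got, wantHex)
|         }
|         return nil
|     }
|
|     func main() {
|         data := []byte("optimization database contents")
|         sum := sha256.Sum256(data)
|         want := hex.EncodeToString(sum[:])
|
|         fmt.Println(verifyChecksum(data, want))     // <nil>
|         fmt.Println(verifyChecksum(data[1:], want)) // mismatch
|     }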
| dpcx wrote:
| In the post or comments, they claimed to be using canaries;
| perhaps their canary simply didn't die in the coal mine?
| jgrahamc wrote:
| _I wonder, if simple checksum verification of the file would
| have helped in avoiding this outage all together._
|
| Oh man, you stirred up a really old Cloudflare memory. Back
| when I was working on our DNS infrastructure I wrote up a task
| that says: "RRDNS has no way of knowing how many lines to
| expect or whether what it has read is valid. This could create
| an issue where the LB map data is not available inside RRDNS."
|
| At the time this "LB map" thing was critical to the mapping
| between a domain name and its associated IP address(es).
| Without it Cloudflare wouldn't work. Re-reading the years old
| Jira I see myself and Lee Holloway discussing the checksumming
| of the data. He implemented the writing of the checksum and I
| implemented the read and check.
|
| I miss Lee.
| methyl wrote:
| For those who, like myself, don't know the story, here it is:
| https://www.wired.com/story/lee-holloway-devastating-decline...
|
| I'm deeply moved after reading it. Can't imagine how tragic
| it must be for people who know Lee.
| dylanz wrote:
| That was an incredible story, and I went down a rabbit hole
| of reading more about that disease. Thank you very much for
| sharing.
| yabones wrote:
| Wow, that is absolutely tragic. Neurodegenerative diseases
| are something I fear the most, having seen what
| Huntington's can do to somebody.
| manigandham wrote:
| All this focus on redundancy should be replaced with a focus on
| recovery. Perfect availability is already impossible. For all
| practical uses, something that recovers within minutes is better
| than trying to always be online and failing horribly.
| EricE wrote:
| A backup vendor once pointed out that backup was the most
| misnamed product/function in all of computerdom. He argued it
| should really be referred to as restore, since when the chips
| are down that's what you really, really care about. That really
| resonated with the young sysadmin I was at the time.
|
| Very similar to the story about the planes with holes coming
| back in WWII and the initial analysis of adding more armor to
| where the holes were, when someone flipped it and pointed out
| that armor was needed where the holes _weren't_, since planes
| with holes in those spots weren't the ones coming back.
| qaq wrote:
| And thanks to the writeup making it to the top of HN, I and
| probably many more people here have now learned about the
| existence of Bunny CDN.
| debarshri wrote:
| I can imagine how stressful the situation was, but it was a
| pleasure to read. It again goes to show that no matter how
| prepared or how optimized/over-optimized you want to be, there
| will always be a situation you have never accounted for, and
| sh*t will hit the fan. That is the reality of IT ops.
| corobo wrote:
| I didn't notice the outage, but I do appreciate the automatic
| SLA honouring, plus the heads-up.
|
| Nice work Bunny CDN.
| string wrote:
| Good and clear explanation. This is a risk you take when you use
| a CDN, I still think the benefits outweigh the occasional
| downtime. I'm a big fan of BunnyCDN, they've saved me a lot of
| money over the past few years.
|
| I'm sure I'd be fuming if I worked at some multi-million-dollar
| company, but as someone who mainly works for smaller businesses
| it's not the end of the world. I suspect most of my clients
| haven't even noticed.
| manishsharan wrote:
| TIL about BunnyCDN. I had been paying $0.08 per GB on AWS
| CloudFront, whereas BunnyCDN is only $0.01 per GB. Can you
| comment on your experience with them? Are the APIs
| comprehensive, e.g. cache invalidation? Do they support
| cookie-based authorization? Any support for geo-fencing?
| kalev wrote:
| Using a CDN for the first time improved our site [1] performance
| by a huge amount, thanks to BunnyCDN. Really easy to set up,
| great dashboard. The flat-rate image optimizer works really,
| really well. The only missing option is rotating images, which I
| opened a feature request for with them.
|
| You can see our CDN usage by inspecting the URLs of the product
| images. Size attributes are added to the URL and Bunny
| automatically resizes and compresses the images on the fly.
|
| [1] https://www.airsoftbazaar.nl
| minxomat wrote:
| I'm using them quite extensively (except the Stream video
| feature). APIs are good, traffic can be restricted or
| rerouted based on Geo. Not sure what cookie based auth would
| do in a CDN but if it's on the origin it passes through. For
| authenticating URLs there is a signing scheme you can use.
| string wrote:
| I noticed another user has already commented; it sounds like
| they've had more experience with the things you're interested
| in than I have. FWIW, though, the APIs have been sufficient for
| my use cases, and you can definitely purge a pull zone's cache
| with them.
|
| My primary use has been serving image assets. I switched over
| from CloudFront and have seen probably a >80% cost reduction
| with no noticeable performance reduction, but as I mentioned,
| I'm operating at a scale where milliseconds of difference
| don't mean much.
| sudhirj wrote:
| Think the answer is yes to all three questions, depending on
| the specifics. They've got a nice setup, about ~40+ edge
| locations compared to CloudFront's ~200+, but the advantage
| is that they're massively cheaper for a very small increase in
| latency. They also have the ~5-region high-volume tier, which
| is something like another order of magnitude cheaper.
|
| The feature set is pretty full, no edge functions, but there
| is a rule engine you can run on the edge. Fast config
| updates, nice console and works well enough for most of my
| projects.
|
| They also have a nice integrated storage solution that's way
| easier to configure than S3 + Cloudfront, and lots of origin
| shielding options.
| eeeeeeeee4eeeee wrote:
| is this google?
| aitchnyu wrote:
| Can this allow me to route x.mydomain.com (more than one
| wildcard and top level) to x.a.run.app (Google Cloud Run)?
| Cloud Run (and the Django app behind it) won't approve domain
| mapping for Mumbai yet so I am looking for transparent domain
| rewriting. Cloudflare allows it, but it's kinda expensive.
|
| https://cloud.google.com/run/docs/locations#domains
| gcbirzan wrote:
| As the docs say, you can use an LB with this. It'll be 18
| dollars a month, though.
| zerop wrote:
| on a different note, this outage news will give them more
| publicity than the product itself, I believe...
| FerretFred wrote:
| TIL .. of bunny.net :-)
| busymom0 wrote:
| One of the comments on the post is:
|
| > One thing you could do in future is to url redirect any
| BunnyCDN url back to the clients original url, in essence
| disabling the CDN and getting your clients own hosts do what they
| were doing before they connected to BunnyCDN, yes it means our
| sites won't be as fast but its better than not loading up the
| files at all. I wonder if that is possible in technical terms?
|
| Isn't this a horrible idea? If you use bunny, this would cause a
| major spike in the traffic and thus costs from your origin
| server.
| ev1 wrote:
| Doing this when you are intentionally trying to protect or hide
| your origin effectively guarantees killing the origin. For
| example, if Cloudflare unproxied one of my subdomains, I'd
| leave them immediately, and I'd likely have to change all my
| infrastructure and providers due to attacks.
|
| This is also a terrible idea because of ACLs/firewalls that
| only allow traffic from the CDN (this is _extremely_ common
| with things like Cloudflare and Akamai), and because of sites
| relying on the CDN for access control.
| dindresto wrote:
| Also, how would this work if their whole infrastructure is
| down? The same problem that prevented them from fixing the
| network would also have prevented them from adding such a
| redirect.
| corobo wrote:
| Yeah, please don't do this without me having checked a box to
| opt in, haha.
|
| There's a reason I use a CDN; let me decide what happens if the
| CDN is down. If I want failover, I'll set that bit up myself.
| nathanganser wrote:
| I'm impressed by the transparency and clarity of their
| explanation! Definitely makes me want to use their solution,
| even though they messed up big time!
| slackerIII wrote:
| Oh, this is a great writeup. I co-host a podcast on outages, and
| over and over we see cases where circular dependencies end up
| making recovery much harder. Also, not using a staged deployment
| is a recipe for disaster!
|
| We just wrapped up the first season, but I'm going to put this on
| the list of episodes for the second season:
| https://downtimeproject.com.
| 0des wrote:
| This is great! I love these types of podcasts. Adding this one
| to my subscriptions list right now. A bunny CDN episode would
| be fun. Thanks for putting this podcast on my radar.
| path411 wrote:
| Sounds like they got really lucky they could get it back up so
| quickly. They must have some very talented engineers working
| there.
|
| My takeaways, though, were that they should have tested the
| update better, that they should have segmented their production
| environment with staggered updates so disasters stay much more
| contained, and that they should have had much better
| catastrophic failure plans in place.
| ing33k wrote:
| They are a very small team. Their CEO codes!
| dejangp wrote:
| I do! In fact, I work like 90 hours a week. I decided to go
| the bootstrap way (looking back not sure if that was the best
| idea, but we are where we are), but we're growing 3X year
| over year, so things are picking up. :)
| holstvoogd wrote:
| It was not that quick, tbh. We were seeing intermittent issues
| for several hours after the initial problem arose.
|
| It taught me a valuable lesson: make sure it is easy to switch
| to another CDN and to update cached/stored URLs.
| christophilus wrote:
| Happy BunnyCDN user here. Thanks for the writeup.
|
| > Both SmartEdge and the deployment systems we use rely on Edge
| Storage and Bunny CDN to distribute data to the actual DNS
| servers. On the other hand, we just wiped out most of our global
| CDN capacity.
|
| That's the TLDR. What a stressful couple of hours that must have
| been for their team.
| lclarkmichalek wrote:
| These follow-ups aren't super compelling, IMO.
|
| > To do this, the first and smallest step will be to phase out
| the BinaryPack library and make sure we run a more extensive
| testing on any third-party libraries we work with in the future.
|
| Sure. Not exactly a structural fix. But maybe worth doing.
| Another view would be that you've just "paid" a ton to find
| issues in the BinaryPack library, and maybe should continue to
| invest in it.
|
| Also, "do more tests" isn't a follow up. What's your process for
| testing these external libs, if you're making this a core part of
| your reliability effort?
|
| > We are currently planning a complete migration of our internal
| APIs to a third-party independent service. This means if their
| system goes down, we lose the ability to do updates, but if our
| system goes down, we will have the ability to react quickly and
| reliably without being caught in a loop of collapsing
| infrastructure.
|
| Ok, now tell me how you're going to test it. Changing
| architectures is fine, but until you're running drills of core
| services going down, you don't actually know you've mitigated the
| "loop of collapsing infrastructure" issue.
|
| > Finally, we are making the DNS system itself run a local copy
| of all backup data with automatic failure detection. This way we
| can add yet another layer of redundancy and make sure that no
| matter what happens, systems within bunny.net remain as
| independent from each other as possible and prevent a ripple
| effect when something goes wrong.
|
| Additional redundancy isn't a great way of mitigating issues
| caused by a change being deployed. Being 10x redundant usually
| adds quite a lot of complexity, provides less safety than it
| seems (again, do you have a plan to regularly test that this
| failover mode is working?) and can be less effective than
| preventing issues getting to prod.
|
| What would be nice to see is a full review of the detection,
| escalation, remediation, and prevention for this incident.
|
| More specifically, the triggering event here, the release of a
| new version of software, isn't super novel. More discussion of
| follow-ups that are systematic improvements to the release
| process would be useful. Some options:
|
| - Replay tests to detect issues before landing changes
|
| - Canaries to detect issues before pushing to prod
|
| - Gradual deployments to detect issues before they hit 100%
|
| - Even better, isolated gradual deployments (i.e. deploy region
| by region, zone by zone) to mitigate the risk of issues
| spreading between regions (see the sketch after this list).
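|
| A minimal sketch of that last option (illustrative Go,
| hypothetical helper names, not Bunny's tooling): deploy one
| zone at a time, let it soak, and halt the rollout as soon as a
| zone looks unhealthy.
|
|     package main
|
|     import (
|         "errors"
|         "fmt"
|         "time"
|     )
|
|     type Zone struct{ Name string }
|
|     func deploy(z Zone, version string) error {
|         fmt.Printf("deploying %s to %s\n", version, z.Name)
|         return nil // push artifacts to this zone only
|     }
|
|     func healthy(z Zone) bool {
|         // e.g. check error rates / resolver liveness per zone
|         return true
|     }
|
|     // rollout upgrades zones one by one; a failure in any zone
|     // stops the rollout before it can spread further.
|     func rollout(zs []Zone, version string, soak time.Duration) error {
|         for _, z := range zs {
|             if err := deploy(z, version); err != nil {
|                 return fmt.Errorf("deploy failed in %s: %w", z.Name, err)
|             }
|             time.Sleep(soak) // let the canary zone bake
|             if !healthy(z) {
|                 return errors.New("halting rollout: " + z.Name + " unhealthy")
|             }
|         }
|         return nil
|     }
|
|     func main() {
|         zones := []Zone{{"eu-central"}, {"us-east"}, {"ap-south"}}
|         if err := rollout(zones, "v2.1.0", 10*time.Minute); err != nil {
|             fmt.Println(err) // untouched zones keep the old version
|         }
|     }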
|
| Beyond that, start thinking about all the changing components of
| your product, and their lifecycle. It sounds like here some data
| file got screwed up as it was changed. Do you stage those changes
| to your data files? Can you isolate regional deployments
| entirely, and control the rollout of new versions of this data
| file on a regional basis? Can you do the same for all other
| changes in your system?
| holstvoogd wrote:
| This. I am not at all reassured that it won't happen again.
| Next week, perhaps.
|
| Also, their DNS broke last month as well, but I guess we won't
| mention that, as it would invalidate two years of stellar
| reliability.
| TacticalCoder wrote:
| Slightly off-topic, but what about the big outage from a few
| days/weeks ago where half the Internet was down (exaggerating
| only a little bit)? Has there been a postmortem I missed?
| penguinten wrote:
| https://www.fastly.com/blog/summary-of-june-8-outage
| altmind wrote:
| Fastly's RCA is underwhelming: no info on which component was
| involved, what happened, or how the situation was tackled.
| rapsey wrote:
| I think a public company will never dare go into as much
| detail as bunny did. Or maybe it is just the size of the
| organisation that discourages that.
| plett wrote:
| Cloudflare's RFO blog posts are incredibly detailed. Each
| time I read one, I feel reasonably confident that they
| have learnt from the mistakes that led to that outage
| and that it shouldn't happen again.
|
| https://blog.cloudflare.com/tag/outage/
| ruuda wrote:
| Not to say that additional mitigations are inappropriate, but a
| stack overflow when parsing a corrupt file sounds like something
| that could have easily been found by a fuzzer.
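|
| A sketch of that suggestion using Go's built-in fuzzing (Decode
| here is a placeholder for the real deserializer, and the corpus
| seed is made up):
|
|     package parser
|
|     import "testing"
|
|     // Decode is a placeholder for the real deserializer under test.
|     func Decode(data []byte) (any, error) { return nil, nil }
|
|     // FuzzDecode feeds the decoder random mutations of the seed
|     // inputs; the fuzzer flags any input that panics or crashes
|     // the worker, including via a stack overflow.
|     func FuzzDecode(f *testing.F) {
|         f.Add([]byte("valid serialized payload")) // seed corpus
|         f.Fuzz(func(t *testing.T, data []byte) {
|             // Corrupt input must yield an error, never a crash.
|             _, _ = Decode(data)
|         })
|     }
|
| Run with `go test -fuzz=FuzzDecode` and let it churn for a while.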
| dejangp wrote:
| Dejan here from bunny.net. I was reading some of the comments,
| but wasn't sure where to reply, so I guess I'll post some
| additional details here. I tried to keep the blog post somewhat
| technical without overwhelming non-technical readers.
|
| So to add some details, we already use multiple deployment groups
| (one for each DNS cluster). We always deploy each cluster
| separately to make sure we're not doing something destructive.
| Unfortunately this deployment went to a system that we believed
| was not a critical part of infrastructure (oh look how wrong we
| were) and was not made redundant, since the rest of the code was
| supposed to handle it gracefully in case this whole system was
| offline or broken.
|
| It was not my intention to blame the library, obviously this was
| our own fault, but I must admit we did not expect a stack
| overflow out of it, which completely obliterated all of the
| servers immediately when the "non-critical" component got
| corrupted.
|
| This piece of data is highly dynamic and is regenerated every
| 30 seconds or so based on hundreds of thousands of metrics.
| Running a checksum would have done nothing good here, because
| the distributed file itself was perfectly intact: the issue
| happened while the file was being generated, not while it was
| being distributed.
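|
| (For what it's worth, one way to catch generation-time
| corruption is to round-trip the freshly generated blob through
| the same decoder the edge servers use before publishing it; a
| sketch in Go with a JSON stand-in for the real format, not
| Bunny's pipeline:)
|
|     package main
|
|     import (
|         "encoding/json" // stand-in for the real serializer
|         "fmt"
|     )
|
|     type OptimizationDB struct {
|         Entries map[string]string
|     }
|
|     // publishable decodes a freshly generated blob with the same
|     // decoder the edge servers would use. A blob that fails to
|     // decode never leaves the build step, and even a hard crash
|     // here only kills the publish job, not the DNS fleet.
|     func publishable(blob []byte) bool {
|         var db OptimizationDB
|         return json.Unmarshal(blob, &db) == nil
|     }
|
|     func main() {
|         good, _ := json.Marshal(OptimizationDB{
|             Entries: map[string]string{"example.com": "203.0.113.7"},
|         })
|         fmt.Println(publishable(good))               // true
|         fmt.Println(publishable(good[:len(good)-3])) // false
|     }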
|
| Now for the DNS itself, which is a critical part of our
| infrastructure.
|
| We of course operate a staging environment with both automated
| testing and manual testing before things go live.
|
| We also operate multiple deployment groups so separate clusters
| are deployed first, before others go live, so we can catch
| issues.
|
| We do the same for the CDN and always use canary testing where
| possible. We unfortunately never anticipated that this piece of
| software could cause all the DNS servers to stack overflow.
|
| Obviously, as I mentioned, we are not perfect, but we are trying
| to improve based on what happened. The biggest flaw we
| discovered was the reliance on our own infrastructure to handle
| our own infrastructure deployments.
|
| We have code versioning and CI in place, as well as the option
| to do rollbacks as needed. If the issue had happened under
| normal circumstances, we would have had the ability to roll back
| all the software instantly and maybe experience 2-5 minutes of
| downtime. Instead, we brought down the whole system like
| dominoes, because it all relied on each other.
|
| Migrating deployment services to third-party solutions is
| therefore our biggest fix at this point.
|
| The reason we are moving away from BinaryPack is that it
| simply wasn't providing that much benefit. It was helpful,
| but it wasn't having a significant impact on overall
| behavior, so we would rather stick with something that worked
| fine for years without issues. As a small team, we don't have
| the time or resources to spend improving it at this point.
|
| I'm somewhat exhausted after yesterday, so I hope this is not
| super unstructured, but I hope that answers some questions and
| doesn't create more of them :)
|
| If I missed any suggestions or something that was unclear, please
| let me know. We're actively trying to improve all the processes
| to avoid similar situations in the future.
| jgrahamc wrote:
| Thanks for the write up. I enjoyed reading it.
| jiofih wrote:
| > Unfortunately this deployment went to a system that we
| believed was not a critical part of infrastructure
|
| Deploying to a different, lower priority system is not a
| canary. Do you phase deployments to each system, per host or
| zone?
| dejangp wrote:
| For critical systems (or let's call them services) such as
| DNS, CDN, optimizer, storage, we usually deploy either on a
| server-to-server, regional, or cluster basis before going
| live. What I meant here was that this was a service nobody
| really thought was critical or could actually cause any harm,
| so we didn't do canary testing there, as it would have added a
| very high level of complexity.
| ing33k wrote:
| Hey Dejan, we have been using BunnyCDN for quite some time.
| Thanks for the detailed writeup.
|
| It looks like storage zones are still not fully stable? After
| experiencing several issues with storage zones earlier, we
| migrated to a pull zone. We haven't had any major issues since
| the migration.
|
| What plans do you have to improve your storage zones?
| dejangp wrote:
| Hey, glad to hear that and sorry again about any issues. If
| you're experiencing any ongoing problems, please message our
| support team. I'm not aware of anything actively broken, but
| if there's a problem I'm sure we'll be able to help.
| gazby wrote:
| I have also had problems with storage zones. We experienced
| multiple periods of timeouts, super long TTFB, and 5xx
| responses. A ticket was opened (#136096) about the TTFB
| issue with full headers/curl output with an offer to supply
| any further useful information, but the response of "can
| you confirm this is no longer happening?" the following day
| discouraged me from further time spent there.
|
| To this day US PoPs are still pulling from EU storage
| servers (our storage zone is in NY, replicated in DE).
|     < Server: BunnyCDN-IL1-718
|     < CDN-RequestCountryCode: US
|     < CDN-EdgeStorageId: 617
|     < CDN-StorageServer: DE-51
|
| We've since moved away from Bunny, but if there's anything
| I can do to help improve this situation I'd be happy to do
| it because it is otherwise a fantastic product for the
| price.
| T4m2 wrote:
| We had the same: super long TTFB and lots of 5xx errors. It
| seems to be mostly fixed now, and there are definitely things
| that could be done differently, but given the pricing and
| feature set I'm happy with the service.
|
| Would love additional capabilities within the image optimizer,
| such as more cropping methods.
| thekonqueror wrote:
| Hi Dejan, we are evaluating Bunny for a long-term multi-tenant
| project. Today your support mentioned that the CDN optimizer
| strips all origin headers. Is there any way to permit some
| headers on a per-zone basis?
| sysbot wrote:
| From the article "Turns out, the corrupted file caused the
| BinaryPack serialization library to immediately execute itself
| with a stack overflow exception, bypassing any exception
| handling and just exiting the process. Within minutes, our
| global DNS server fleet of close to a 100 servers was
| practically dead." and from your comment "We do the same for
| the CDN and always use canary testing if possible. We
| unfortunately never assumed this piece of software could cause
| all the DNS servers to stack overflow."
|
| This reads like the DNS software itself was being changed. As
| some people have already mentioned: is this a corruption where
| a checksum would have prevented the stack overflow, or would a
| canary have detected it? Why would a change to the DNS server
| software not be canaried?
| unilynx wrote:
| I read it as "DNS software changed, that worked fine, but it
| turns out we sometimes generate a broken database - not often
| enough to see it hit during canary, but devastating when it
| finally happened"
|
| GP also notes that this database changes perhaps every 30
| seconds.
|
| Just a few guesses: if you have a process that corrupts a
| random byte once every 100,000 runs, and you run it every 30
| seconds, it might take days before you're at 50% odds of
| having seen it happen. And if that used to be a text or
| JSON database, flipping a random bit might not even corrupt
| anything important. Or, if the code swallows the exception at
| some level, it might even self-heal after 30 seconds when new
| data comes in, causing an unnoticed blip in the monitoring, if
| it shows up at all.
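|
| (Rough numbers for that guess: with p = 1/100,000 per run, the
| chance of at least one corruption after n runs is 1 - (1-p)^n,
| which crosses 50% at n ~= ln(2)/p ~= 69,000 runs; at one run
| every 30 seconds that's roughly 24 days.)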
|
| Now, I don't know what BinaryPack does exactly, but if you
| were to replace the above process with something that
| compresses data, a flipped bit will corrupt a lot more data,
| often everything from that point forwards (whereas text or
| JSON is pretty self-synchronizing). And if your new code falls
| over completely when that happens, there's no more
| self-healing.
|
| I can totally imagine missing an event like that during
| canary testing.
| EricE wrote:
| _It's not DNS_
|
| _There's no way it's DNS_
|
| _It was DNS_
|
| One of the most bittersweet haikus for any sysadmin :p
| patrickbolle wrote:
| Great write-up. I've just switched from Cloudinary to Backblaze
| B2 + Bunny CDN and I am saving a pretty ridiculous amount of
| money for hosting thousands of customer images.
|
| Bunny has a great interface and service; I'm really surprised
| how few people know about it. I think I discovered it on some
| 'top 10 CDNs' list that I usually ignore, but the pricing was
| too good to pass up.
|
| The team is really on the ball from what I've seen. Appreciate
| the descriptive post, folks!
| xarope wrote:
| I would think critical systems and updates should also have some
| form of out-of-band access channel?
| zamalek wrote:
| This brings up one of my pet peeves: recursion. Of course there
| should have been other mitigations in place, but recursion is
| _such_ a dangerous tool. So far as reasonably possible, I
| consider its only real purpose to be confusing students in 101
| courses.
|
| I assume that they are using .NET, as stack overflow exceptions
| (SOEs) bring down .NET processes. While that sounds like a
| strange implementation detail, the philosophy of the .NET team
| has always been "how do you reasonably recover from a stack
| overflow?" Even in C++, what happens if, for example, the
| allocator experiences a stack overflow while deallocating some
| RAII resource, or a finally block calls a function and
| allocates stack space, or... you get the idea.
|
| The obvious thing to do here would be to limit recursion in the
| library (which amounts to safe recursion usage). BinaryPack does
| not have a recursion limit option, which makes it unsafe for any
| untrusted data (and that can include data that you produce, as
| Bunny experienced). Time to open a PR, I guess.
|
| This applies to JSON, too. I would suggest that OP configure
| their serializer with a limit:
|
| [1]: https://www.newtonsoft.com/json/help/html/MaxDepth.htm
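|
| A minimal sketch of that pattern in Go (illustrative only; the
| linked MaxDepth setting is the Newtonsoft.Json equivalent in
| .NET): thread the remaining depth through the recursion and
| return an ordinary error instead of blowing the stack.
|
|     package main
|
|     import (
|         "errors"
|         "fmt"
|     )
|
|     // Node is a toy recursive structure standing in for whatever
|     // nested data a serializer walks while decoding.
|     type Node struct {
|         Children []*Node
|     }
|
|     var errTooDeep = errors.New("input exceeds maximum depth")
|
|     // depth recurses over the tree but refuses to descend past
|     // maxDepth, turning hostile or corrupt input into a normal,
|     // catchable error instead of a stack overflow.
|     func depth(n *Node, maxDepth int) (int, error) {
|         if maxDepth == 0 {
|             return 0, errTooDeep
|         }
|         deepest := 0
|         for _, c := range n.Children {
|             d, err := depth(c, maxDepth-1)
|             if err != nil {
|                 return 0, err
|             }
|             if d > deepest {
|                 deepest = d
|             }
|         }
|         return deepest + 1, nil
|     }
|
|     func main() {
|         // Build a chain 1,000 nodes deep, the shape a corrupt
|         // or malicious payload could take.
|         root := &Node{}
|         cur := root
|         for i := 0; i < 1000; i++ {
|             next := &Node{}
|             cur.Children = []*Node{next}
|             cur = next
|         }
|         _, err := depth(root, 64)
|         fmt.Println(err) // input exceeds maximum depth
|     }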
| bob1029 wrote:
| > recursion is such a dangerous tool.
|
| The most effective tools for the job are usually the more
| dangerous ones. Certainly, you can do anything without
| recursion, but forcing this makes a lot of problems much harder
| than they need to be.
| vfaronov wrote:
| > _While that sounds like a strange implementation detail, the
| philosophy of the .Net team has always been "how do you
| reasonably recover from an stack overflow?"_
|
| Can you expand on this or link to any further reading? I just
| realized that this affects my platform (Go) as well, but I
| don't understand the reasoning. Why can't stack overflow be
| treated just like any other exception, unwinding the stack up
| to the nearest frame that has catch/recover in place (if any)?
| zamalek wrote:
| > Why can't stack overflow be treated just like any other
| exception[...]?
|
| Consider the following code:
|
|     func overflows() {
|         defer a()
|         fmt.Println("hello") // <-- stack overflow occurs within
|     }
|
|     func a() {
|         fmt.Println("hello")
|     }
|
| The answer lies in trying to figure out how Go would
| successfully unwind that stack: it can't, because when it calls
| `a` it will simply overflow again. Something that has been
| discussed is a "StackAboutToOverflowException", but that only
| kicks the can down the road (unwinding could still cause an
| overflow).
|
| In truth, the problem exists because of implicit calls at the
| end of methods interacting with stack overflows, whether
| that's defer-like functionality, structured exception
| handling, or destructors.
| theandrewbailey wrote:
| > So far as reasonably possible, I consider its only real
| purpose to confuse students in 101 courses.
|
| I had a "high school" level programming class with Python
| before studying CS. I ran into CPython's recursion limit often
| and wondered why one would use recursion when for loops were a
| more reliable solution.
|
| Nowadays, my "recursion" is for-looping over an object's
| children and calling some function on each child object.
___________________________________________________________________
(page generated 2021-06-23 23:02 UTC)