[HN Gopher] Retries - An interactive study of request retry methods
___________________________________________________________________
Retries - An interactive study of request retry methods
Author : whenlambo
Score : 188 points
Date : 2023-11-23 13:32 UTC (9 hours ago)
(HTM) web link (encore.dev)
(TXT) w3m dump (encore.dev)
| samwho wrote:
| Thanks for sharing!
|
| I'm the author of this post, and happy to answer any questions :)
| j1elo wrote:
| There's a subtle insight that could be added to the post if you
| consider it worth it, and it's something that's actually _there_
| already, but easy to miss: clients in your simulation have an
| absolute maximum number of retries.
|
| I noticed this mid-read, when looking at one of the animations
| with 28 clients: they would hammer the server but suddenly go
| into a wait state, for no apparent reason.
|
| Later in the final animation with debug mode enabled, the
| reason becomes apparent for those who click on the Controls
| button:
|
| Retry Strategy > Max Attempts = 10
|
| It makes sense, because in the worst case when everything goes
| wrong, a client should reach a point where it desists and just
| aborts with a "service not available" error.
| samwho wrote:
| You know, I hadn't actually considered mentioning it. Another
| commenter brought it up, too. It's so second nature I forgot
| about it entirely.
|
| I'll look at giving it a nod in the text, thank you for the
| feedback. :)
| fiddlerwoaroof wrote:
| Exponential retries can effectively have a maximum number
| of requests if the gap between retries gets long enough
| quickly enough. In practice, the user will refresh or close
| the page if things look broken for too long.
| marcosdumay wrote:
| Oh, please don't do that.
|
| Unbounded exponential backoff is a horrible experience,
| and improves basically nothing.
|
| If it makes sense to completely fail the request, do it
| before the waiting becomes noticeable. If it's something
| that can't just fail, set a maximum waiting time and add
| jitter.
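|
| A minimal sketch of that in TypeScript (the base delay, cap,
| and attempt limit are made-up numbers, not from the article):
|
|   // Capped exponential backoff with full jitter.
|   function backoffDelay(attempt: number): number {
|     const baseMs = 100;
|     const capMs = 5_000; // never wait longer than this
|     return Math.random() * Math.min(capMs, baseMs * 2 ** attempt);
|   }
|
|   async function fetchWithRetry(url: string, maxAttempts = 5) {
|     for (let attempt = 0; attempt < maxAttempts; attempt++) {
|       try {
|         const res = await fetch(url);
|         if (res.ok) return res;
|       } catch {
|         // network error: fall through to the retry delay
|       }
|       if (attempt < maxAttempts - 1) {
|         await new Promise((r) => setTimeout(r, backoffDelay(attempt)));
|       }
|     }
|     throw new Error("service not available");
|   }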
| codebeaker wrote:
| What technology did you use for the animations? I've a bunch of
| itches I'd like to scratch that would be improved by some
| canvas-animated explainers or UI, but I never clicked with
| anything (D3 back in the day).
|
| A rudimentary look at the source code showed a <traffic-
| simulation/> element, but I'm not up to date enough with web
| standards to know where to look for it in your JS bundle or to
| guess at the framework!
| samwho wrote:
| It uses PixiJS (https://pixijs.com/) for the 2D rendering and
| GSAP3 (https://gsap.com/) for the animation. The <traffic-
| simulation /> blocks are custom HTML elements
| (https://developer.mozilla.org/en-
| US/docs/Web/API/Web_compone...) which I use to encapsulate
| the logic.
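|
| For anyone unfamiliar with custom elements, a stripped-down,
| hypothetical version of the pattern (not the actual code from
| the post) looks something like this:
|
|   // Each <traffic-simulation> tag gets its own instance, so
|   // all of the per-animation state lives inside the class.
|   class TrafficSimulation extends HTMLElement {
|     connectedCallback() {
|       const clients = Number(this.getAttribute("clients") ?? 1);
|       // ...set up the PixiJS canvas and GSAP timeline here...
|       this.textContent = `simulating ${clients} clients`;
|     }
|   }
|   customElements.define("traffic-simulation", TrafficSimulation);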
|
| I've been thinking about creating a separate repo to house
| the source code of posts I've finished so people can see it.
| I don't like all the bundling and minification, but sadly it
| serves a very real purpose for the end-user experience (faster
| load times on slow connections).
|
| Until then feel free to email me (you'll find my address at
| the bottom of my site) and I'd be happy to share a zip of
| this post with you.
| samwho wrote:
| I've uploaded the code for all of my visualisation posts
| here: https://github.com/samwho/visualisations.
|
| Enjoy! :)
| self_awareness wrote:
| Really nice animations. I especially liked the demonstration of
| the effect where, after some servers "explode", any server that
| gets restarted is automatically DoS'ed until we throw a bunch of
| extra temporary servers into the system. Thanks.
| samwho wrote:
| Yeah! An insidious problem that's not obvious when you're
| picking a retry interval.
|
| I had fun with the details of the explosion animation. When it
| explodes, the number of requests that come out is the actual
| number of in-progress requests.
| christophberger wrote:
| A must-read (or rather: must-see) for anyone who thinks
| exponential backoff is overrated.
| rewmie wrote:
| > A must-read (or rather: must-see) for anyone who thinks
| exponential backoff is overrated.
|
| I don't think exponential backoff was ever accused of being
| overrated. Retries in general have been criticized for being
| counterproductive in multiple respects, including the risk of
| creating self-inflicted DDoS attacks, and exponential backoff
| can result in untenable performance and usability problems
| without adding any upside. These are known problems, but none
| of them amounts to calling the technique "overrated".
| whenlambo wrote:
| Remember to cap the exponential backoff interval if you are not
| limiting the number of retries.
| fadhilkurnia wrote:
| The animations are so cool!!!
|
| In general the phenomenon is known as _metastable failure_,
| which can be triggered when there is more work to do during
| failure than during a normal run.
|
| With retries, the client does more work within the same amount
| of time, compared to doing nothing or backing off exponentially.
| lclarkmichalek wrote:
| This still isn't what I'd call "safe". Retries are amazing at
| supporting clients in handling temporary issues, but horrible for
| helping them deal with consistently overloaded servers. While
| jitter & exponential backoff help with the timing, they don't
| reduce the overall load sent to the service.
|
| The next step is usually local circuit breakers. The two easiest
| to implement are terminating the request if the error rate to the
| service over the last <window> is greater than x%, and
| terminating the request (or disabling retries) if the % of
| requests that are retries over the last <window> is greater than
| x%.
|
| i.e. don't bother sending a request if 70% of requests have
| errored in the last minute, and don't bother retrying if 50% of
| the requests we've sent in the last minute have already been
| retries.
|
| The Google SRE book describes lots of other basic techniques to
| make retries safe.
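|
| A rough sketch of those two local checks (the window size and
| thresholds below are the illustrative numbers from above, not a
| recommendation):
|
|   // Sliding one-minute window of request outcomes.
|   type Outcome = { at: number; error: boolean; retry: boolean };
|
|   const outcomes: Outcome[] = [];
|   const windowMs = 60_000;
|
|   function record(error: boolean, retry: boolean) {
|     outcomes.push({ at: Date.now(), error, retry });
|   }
|
|   function rate(pick: (o: Outcome) => boolean): number {
|     const cutoff = Date.now() - windowMs;
|     const recent = outcomes.filter((o) => o.at >= cutoff);
|     if (recent.length === 0) return 0;
|     return recent.filter(pick).length / recent.length;
|   }
|
|   function shouldSend(isRetry: boolean): boolean {
|     if (rate((o) => o.error) > 0.7) return false; // breaker open
|     if (isRetry && rate((o) => o.retry) > 0.5) return false;
|     return true;
|   }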
| samwho wrote:
| Totally! Thanks for bringing those up. I tried to keep the
| scope specifically on retries and client-side mitigation.
| There's a whole bunch of cool stuff to visualise on the server-
| side, and I'm hoping to get to it in the future.
| Axsuul wrote:
| Do you have a newsletter?
| samwho wrote:
| Not a newsletter as such but I do have an email list where
| I post whenever I write something new. You can find it
| here: https://buttondown.email/samwho
| cowsandmilk wrote:
| Your response makes it sound like you think circuit breakers
| are server-side and not related to retries. They are not; they
| are a client-side mitigation and a critical part of a mature
| retry library.
| korm wrote:
| The client can track its own error rate to the service, but
| it would need information from a server to get the overall
| health of the service, which is what the author probably
| means. Furthermore, the load balancer can add a Retry-After
| header to have more control over the client's retries.
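|
| Honouring that header on the client could look something like
| this (only the delta-seconds form of Retry-After is handled,
| for brevity):
|
|   // Prefer the server's Retry-After hint over our own backoff.
|   function retryDelayMs(res: Response, attempt: number): number {
|     const header = res.headers.get("Retry-After");
|     const seconds = header ? Number(header) : NaN;
|     if (!Number.isNaN(seconds)) return seconds * 1000;
|     // fall back to capped exponential backoff with jitter
|     return Math.random() * Math.min(5_000, 100 * 2 ** attempt);
|   }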
| samwho wrote:
| I think I've misunderstood what circuit breakers are for
| years! I did indeed think they were a server-side
| mechanism. The original commenter's description of them
| is great: you can essentially create a heuristic based on
| the observed behaviour of the server and decide against
| overwhelming it further if you think it's unhealthy.
|
| TIL! Seems like it can have tricky emergent behaviour. I
| bet if you implement it wrong you can end up in very
| weird situations. I should visualise it. :)
| lclarkmichalek wrote:
| I mean, they can and should be both. Local decisions can
| be cheap, and very simple to implement. But global
| decisions can be smarter, and more predictable. In my
| experience, it's incredibly hard to make good decisions
| in pathological situations locally, as you often don't
| know you're in a pathological situation with only local
| data. But local data is often enough to "do less harm" :)
| spockz wrote:
| Finagle fixes this with Retry Budgets:
| https://finagle.github.io/blog/2016/02/08/retry-budgets/
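|
| The rough idea (a toy sketch of the concept, not Finagle's
| actual implementation): every original request deposits a
| fraction of a retry token into a budget, and every retry has
| to withdraw a whole one, so retries stay bounded to a
| percentage of real traffic.
|
|   // Toy retry budget: retries capped at ~20% of request volume.
|   class RetryBudget {
|     private tokens = 10; // small initial allowance
|     onRequest() { this.tokens = Math.min(100, this.tokens + 0.2); }
|     canRetry(): boolean {
|       if (this.tokens < 1) return false;
|       this.tokens -= 1;
|       return true;
|     }
|   }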
| usrbinbash wrote:
| This is the client side of things. And I think this is a great
| resource that everyone who writes clients for anything, should
| see.
|
| But there is an additional piece of info everyone who writes
| clients needs to see: And that's what people like me, who
| implement backend services, may do if clients ignore such wisdom.
|
| Because: I'm not gonna let bad clients break my service.
|
| What that means in practice: clients are given a choice. They
| can behave, or they can HTTP 429 Too Many Requests.
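|
| A crude sketch of the server side of that choice (a global
| per-second cap in plain Node; real throttling would be
| per-client, and the numbers are made up):
|
|   import { createServer } from "node:http";
|
|   const limit = 50;       // requests allowed per window
|   const windowMs = 1_000;
|   let windowStart = Date.now();
|   let count = 0;
|
|   createServer((req, res) => {
|     if (Date.now() - windowStart > windowMs) {
|       windowStart = Date.now();
|       count = 0;
|     }
|     if (++count > limit) {
|       res.writeHead(429, { "Retry-After": "1" });
|       res.end("Too Many Requests");
|       return;
|     }
|     res.end("ok");
|   }).listen(8080);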
| rewmie wrote:
| > This is the client side of things.
|
| The article is about making requests, and strategies to
| implement when the request fails. By definition, these are
| clients. Was there any ambiguity?
|
| > But there is an additional piece of info everyone who writes
| clients needs to see: And that's what people like me, who
| implement backend services, may do if clients ignore such
| wisdom.
|
| I don't think this is the obscure detail you are making it out
| to be. A few of the most basic and popular retry strategies are
| designed explicitly to a) handle throttled responses from the
| servers and b) mitigate the risk of causing self-inflicted DDoS
| attacks. This article covers a few of those, such as exponential
| backoff and jitter.
| usrbinbash wrote:
| > Was there any ambiguity?
|
| Did I say there was?
|
| > I don't think this is the obscure detail you are making it
| out to be
|
| Where did I call this detail "obscure"?
|
| My post is meant as a light-hearted, humorous note pointing
| out one of the many reasons why it is in general a good idea
| for clients to implement the principles outlined in the
| article.
| samwho wrote:
| Throttling, tarpitting, and circuit-breakers are something
| I'd love to visualise in future, too. Throttling on its own
| is such a massive topic!
| tyingq wrote:
| This is one of those things that sort of exposes our industry's
| maturity versus other engineering disciplines that have been
| around longer. You would think by now that the various
| frameworks for remote calls would have standardized on the
| best-practice retry patterns, with standard names, setting
| ranges, etc. But we mostly still roll our own in most
| languages/frameworks. And that's full of footguns around DNS
| caching, when/how to retry on certain failures (unauthorized,
| for example), and so on.
|
| (Yes, there should also be the non-abstracted direct path for
| cases where you do want to roll your own).
| sesm wrote:
| Summary of the article: use exponential backoff + jitter for
| retry intervals.
|
| What the author didn't mention: sometimes you want to add
| jitter to delay the first request too, if the request happens
| immediately after some event from the server (like the server
| waking up). If you don't do this, you may crash the server, and
| if your exponential backoff counter is not global you can even
| put the server into a cyclic restart.
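|
| A tiny sketch of jittering that first request (the 30-second
| spread is an arbitrary example):
|
|   // Spread reconnects over a window instead of firing at once.
|   async function reconnectWithJitter(connect: () => Promise<void>) {
|     const spreadMs = 30_000;
|     await new Promise((r) => setTimeout(r, Math.random() * spreadMs));
|     await connect();
|   }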
| whenlambo wrote:
| If you can crash the server with an improperly timed request,
| then you have a much bigger problem than client-side stuff.
| andenacitelli wrote:
| Yes. The worst that should happen is getting a 404 or
| something. A crash due to requesting a piece of data that has
| not yet been created is poor design.
| samwho wrote:
| I think what they mean is something that would cause clients
| to all do something at the same time (it could be all sorts of
| things: a synchronised crash, timers aligned to clock-time,
| etc.). If the requests aren't user-driven then yes, you likely
| would want to include some jitter in the first request too.
|
| Funnily, you'll notice that some of the visualisations have
| the clients staggering their first request. It's exactly for
| this reason. I wanted the visualisations to be as
| deterministic as possible while still feeling somewhat
| realistic. This staggering was a bit of a compromise.
|
| Not sure what is meant by "if your exponential backoff
| counter is not global", though. Would love to know more about
| that.
| sroussey wrote:
| True, but you can imagine something like a websocket to all
| clients getting reset and everyone re-connecting, re-
| authenticating, and getting a new payload.
| __turbobrew__ wrote:
| One example is a datacenter losing power: when all the hosts
| get turned back on at the same time, they can all send
| requests simultaneously and crash a server.
| fooey wrote:
| Yup, classic Thundering Herd Problem
| cratermoon wrote:
| I worked at a company with a self-inflicted wound related to
| retries.
|
| At some point in the distant (internet time) past, a sales
| engineer, or the equivalent, had written a sample script to
| demonstrate basic uses of the API. As many of you quickly
| guessed, customers went on a copy/paste rampage and put this
| sample script into production.
|
| The script went into a tight loop on failure, naively using a
| simple library that did not include any back-off or retry in the
| request. I'm not deeply familiar with how the company dealt with
| this situation. I am aware there was a complex load balancing
| system across distributed infrastructure, but also, just a lot of
| horsepower.
|
| Lesson for anyone offering an API product: don't hand out example
| code with a self-own, because it will become someone's production
| code.
| joshka wrote:
| For a lot of things, retry once and only once (at the outermost
| layer to avoid multiplicative amplification) is more correct. At
| a large enough scale, failing twice is often significantly (like
| 90%+) correlated with the likelihood of failing a third time
| regardless of backoff / jitter. This means that the second retry
| only serves to add more load to an already failing service.
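|
| To make the amplification concrete (numbers purely for
| illustration): if a request passes through three services and
| each caller makes up to 3 attempts on failure, the innermost
| service can see up to 3 x 3 x 3 = 27 attempts for a single
| user action. Retrying once, and only at the outermost layer,
| caps that at 2.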
| tomwt wrote:
| Retrying end-to-end instead of stepwise greatly reduces the
| reliability of a process with a reasonable number of steps.
|
| That being said, processes should ideally fail in ways that
| make it clear whether an error is retryable or not.
| xer wrote:
| Correct. It's also the case that human-generated requests lose
| their relevance within seconds, so a quick retry is all they're
| worth. As for machine-generated requests, a dead letter queue
| would make more sense: poorly engineered backend services will
| OOM and well-engineered ones will load shed, and if the
| requests are queued on the application servers they are doomed
| to be lost anyway.
| davidw wrote:
| I have been thinking about queueing theory lately. I don't have
| the math abilities to do anything deep with it, but it seems like
| even basic applications of certain things could prove valuable in
| real world situations where people are just kind of winging it
| with resource allocation.
___________________________________________________________________
(page generated 2023-11-23 23:00 UTC)