[HN Gopher] Good Retry, Bad Retry: An Incident Story
___________________________________________________________________
Good Retry, Bad Retry: An Incident Story
Author : misonic
Score : 87 points
Date : 2024-10-06 08:56 UTC (1 day ago)
(HTM) web link (medium.com)
(TXT) w3m dump (medium.com)
| sim7c00 wrote:
| Very nice read with lots of interesting points and examples.
| Very thorough, imo. I'm not a microservices guy, but it gives a
| lot of general concepts that are also applicable outside of that
| domain. Very good, thanks!
| duffmancd wrote:
| I missed it on the first read-through but there is a link to the
| code used to run the simulations in the first appendix.
|
| Homegrown python code (i.e. not a library), very nicely laid out.
| And would form a good basis for more experiments for anyone
| interested. I think I'll have a play around later and try and
| train my intuition.
| davedx wrote:
| This is the kind of well-written, in-depth technical narrative I
| visit HN for. I definitely learned from it. Thanks for posting!
| chipdart wrote:
| I agree. What a treat. One of the best submissions gracing HN
| in months.
| easylion wrote:
| Really good article about retries, their consequences and how load
| amplification works. Loved it.
| guideamigo_com wrote:
| I never get this desire for microservices. Your IDE can help if
| there are 500 functions, but nothing will help you if you have
| 500 microservices. Almost no one fully understands such a
| system. It is hard to tell which parts of the code are unused.
| And large-scale refactoring is impossible.
|
| The upside seems to be some mythical infinite scalability which
| will collapse under such positive feedback loops.
| delusional wrote:
| I think the dream is that you can reason locally. I'm not
| convinced that it actually helps any, but the dream is that
| having everything as services, complete with external
| boundaries and enforced constraints, you're able to more
| accurately reason about the orchestration of services. It's
| hard to reason about your order flow if half of it depends on
| some implicit procedure that's part of your shopping cart.
|
| The business I'm part of isn't really after "scalable"
| technology, so that might color my opinion, but a lot of the
| arguments for microservices I hear from my colleagues are
| actually benefits of modular programs. Those two have just
| become synonyms in their minds.
| klabb3 wrote:
| > [...] the dream is that having everything as services,
| [...], you're able to more accurately reason about the
| orchestration of services.
|
| Well.. I mean that's an entirely circular point. Maybe you
| mean something else? That you can individually deploy and
| roll back different functionality that belong to a team?
| There's some appeal for operations yeah.
|
| > but a lot of the arguments for microservices I hear from my
| colleagues are actually benefits of modular programs
|
| Yes I mean from a development perspective a library call is
| far, far superior to an http call. It is much more performant
| and orders of magnitude easier to reason about since the
| caller and callee are running the same version of the code.
| That means that a breaking change is a refactor and a single
| commit, whereas with a service boundary you need a whole
| migration.
|
| You can't avoid services altogether, such as external
| services like a payment portal run by a completely different
| company. But to deliberately create more of these expensive
| boundaries for no reason, within the same small org or team,
| is madness, imo.
| FooBarWidget wrote:
| The point of microservices is not technical, it's so that the
| deployment- and repository ownership structure matches your
| organization structure, and that clear lines are drawn between
| responsibilities.
| sim7c00 wrote:
| It's also easier to find devs who have the skills to create
| and maintain thin services than a large complicated monolith,
| despite the difficulties found when having to debug a
| constellation of microservices during a crisis.
| phil21 wrote:
| For the folks who downvoted this - why? I hire developers
| and this is the absolute truth of the matter.
|
| You can get away with hiring devs able to only debug their
| little micro empire so long as you can retain some super
| senior rockstar level folks able to see the big picture
| when it inevitably breaks down in production under load.
| These skills are becoming rarer by the day, when they used
| to be nearly table stakes for a "senior" dev.
|
| Microservices have their place, but many times you can see
| that it's simply developers saying "not my problem" to the
| actual hard business case things.
| pards wrote:
| > retain some super senior rockstar level folks able to
| see the big picture
|
| This is the critical piece that many organisations miss.
|
| Microservices are the bricks; but the customer needs
| those assembled into a house.
| mannyv wrote:
| You need those senior folks who can see the big picture,
| whether you use monoliths or microservices.
|
| The real benefit of a microservice is that it's easier to
| see the interactions, because you can't call into some
| random and unexpected part of the codebase...or at least
| it's much harder to do something that's not noticeable
| like that.
| morningsam wrote:
| >The upside seems to be some mythical infinite scalability
| which will collapse under such positive feedback loops.
|
| Unless I misunderstand something here, they say pretty early in
| the article that they didn't have autoscaling configured for
| the service in question and there is no indication they scaled
| up the number of replicas manually after the downtime to
| account for the accumulated backlog of requests. So, in my
| mind, of course there can be no infinite, or really any,
| scalability if the service isn't allowed to scale...
| dropofwill wrote:
| The concepts here apply to any client-server networking setup.
| Monoliths could still have web clients, native apps, IoT
| sensors, third party APIs, databases, etc.
| azlev wrote:
| Good reading.
|
| In my last job, the service mesh was responsible for retries.
| It was a startup and the system was changing every day.
|
| After a while, we suspected that some services were not reliable
| enough and that retries were hiding this fact. Turning off
| retries exposed that, in fact, quality had gone down.
|
| In the end, we put retries in just some services.
|
| I never tested either retry budgets or deadline propagation. I
| will suggest this in the future.
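|
| For readers who have not met the two ideas named above, here is a
| minimal sketch, not the article's code. It assumes a simple
| lifetime counter where a production client would more likely use
| a sliding window, and an absolute deadline passed down from the
| caller. RetryBudget, try_spend and should_retry are hypothetical
| names chosen for illustration.

```python
class RetryBudget:
    """Allow retries only while they stay under a fraction of the requests seen."""

    def __init__(self, ratio=0.1, min_retries=1):
        self.ratio = ratio              # assumed cap: retries add at most 10% extra load
        self.min_retries = min_retries  # small allowance so quiet clients can still retry
        self.requests = 0
        self.retries = 0

    def record_request(self):
        """Count one original (non-retry) request."""
        self.requests += 1

    def try_spend(self):
        """Return True and count the retry if the budget allows it, else False."""
        allowed = self.min_retries + self.ratio * self.requests
        if self.retries + 1 <= allowed:
            self.retries += 1
            return True
        return False


def should_retry(budget, deadline, now):
    """Retry only if the caller's deadline has not passed and the budget has room."""
    if now >= deadline:
        # Deadline propagation: the caller has already given up, so do no more work.
        return False
    return budget.try_spend()
```

| The budget caps how much extra load retries can add, and the
| deadline check stops work whose result nobody is waiting for.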
| Rygian wrote:
| Reading this excellent article got me wondering whether job
| interviews for developer positions include enough questions
| about queue management.
|
| "Ben" developed retries without exponential back-off, and only
| learned about that concept in code review. Exponential back-off
| should be part of any basic developer curriculum (except if that
| curriculum does not mention networks of any sort at all).
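|
| As a rough illustration of exponential back-off, here is a
| minimal sketch with full jitter. It is not the code from the
| article; backoff_delays and call_with_retries are hypothetical
| names, and the base, cap and attempt counts are assumed values.

```python
import random
import time


def backoff_delays(base=0.1, cap=5.0, attempts=5, rng=random.random):
    """Yield one delay per retry, drawn from an exponentially growing window."""
    for attempt in range(attempts):
        window = min(cap, base * 2 ** attempt)  # 0.1, 0.2, 0.4, ... up to cap
        yield rng() * window                    # full jitter: uniform in [0, window)


def call_with_retries(operation, attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call operation(); on failure, wait a jittered delay and try again."""
    last_error = None
    delays = backoff_delays(base, cap, attempts - 1)
    for _ in range(attempts):
        try:
            return operation()
        except Exception as exc:  # a real client would retry only retryable errors
            last_error = exc
            delay = next(delays, None)
            if delay is None:     # no retries left
                break
            sleep(delay)
    raise last_error
```

| The jitter spreads clients out so that they do not all retry at
| the same instant after an outage.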
| sim7c00 wrote:
| If you have too many deeper questions, you rule out a lot of
| eager juniors who can learn and grow on the job. It's a fine
| balance, though, and looking at the article, Ben is taking his
| lessons and growing. That's more important, I think, than having
| someone who's a guru from the get-go. Everyone has things
| they are better or worse at, and it's really a team effort to
| do everything right. Presumably someone reviewed and accepted
| his code, and that person also didn't catch it. There's no
| developer who knows everything and writes perfect code and
| designs. It's a well-balanced team that can help move in that
| direction.
| Rygian wrote:
| I wholeheartedly agree, and realize my comment was not really
| clear.
|
| Any training curriculum needs to include exponential back-off
| as a core concept of any system-to-system interaction.
|
| Ben was let out of school without proper training. Kudos to
| the employer for finishing up the training that was missed
| earlier on.
| k3vinw wrote:
| Great food for thought! I'm currently working to stabilize some
| pre-existing REST service integration tests that are executed
| in parallel.
| patrakov wrote:
| To counter the avalanche of retries on different layers, I have
| also seen a custom header being added to all requests that are
| retries. Upon receiving a request with this header, the
| microservice would turn off its own retry logic for this request.
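|
| A minimal sketch of that header scheme, under stated assumptions.
| The header name X-Is-Retry and the handle function are
| hypothetical, and passing the mark further downstream is my own
| assumption rather than something described above.

```python
RETRY_HEADER = "X-Is-Retry"  # hypothetical header name


def handle(incoming_headers, call_downstream, max_attempts=3):
    """Serve one request; skip our own retries if the caller is already retrying."""
    is_retry = incoming_headers.get(RETRY_HEADER) == "1"
    attempts = 1 if is_retry else max_attempts
    last_error = None
    for attempt in range(attempts):
        # Mark our own retries, and pass along the caller's mark (assumption),
        # so that lower layers turn off their retry logic too.
        headers = {RETRY_HEADER: "1"} if (is_retry or attempt > 0) else {}
        try:
            return call_downstream(headers)
        except Exception as exc:
            last_error = exc
    raise last_error
```

| Without such a mark, three attempts at each of three layers can
| turn one user request into up to 27 calls at the bottom layer.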
| patrakov wrote:
| It's worth noting that the logic in the article only applies to
| idempotent requests. See this article (by the same author) for
| the non-idempotent counterpart:
| https://habr.com/ru/companies/yandex/articles/442762/
| (unfortunately, in Russian). I am sure somebody posted a human-
| written English translation back then, but I cannot find it. So
| here is a Google-translated version (scroll past the internal
| error, the text is below):
|
| https://habr-com.translate.goog/ru/companies/yandex/articles...
| ramchip wrote:
| AWS also say they do something interesting:
|
| > When adding jitter to scheduled work, we do not select the
| jitter on each host randomly. Instead, we use a consistent method
| that produces the same number every time on the same host. This
| way, if there is a service being overloaded, or a race condition,
| it happens the same way in a pattern. We humans are good at
| identifying patterns, and we're more likely to determine the root
| cause. Using a random method ensures that if a resource is being
| overwhelmed, it only happens - well, at random. This makes
| troubleshooting much more difficult.
|
| https://aws.amazon.com/builders-library/timeouts-retries-and...
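|
| One way to get jitter that is the same every time on the same
| host is to derive it from a hash of the hostname. This is only a
| sketch of the idea in the quote, not AWS's actual method;
| host_jitter is a hypothetical name and the choice of SHA-256 is
| an assumption.

```python
import hashlib
import socket


def host_jitter(period_seconds, hostname=None):
    """Return a delay in [0, period_seconds) that is stable for a given hostname."""
    name = hostname if hostname is not None else socket.gethostname()
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    # The first 4 bytes give an integer in [0, 2**32), so the fraction is below 1.
    fraction = int.from_bytes(digest[:4], "big") / 2 ** 32
    return fraction * period_seconds
```

| A host would then start its scheduled job at
| host_jitter(period) seconds into each period, so any overload it
| causes recurs at the same offset and shows up as a pattern.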
___________________________________________________________________
(page generated 2024-10-07 23:00 UTC)