[HN Gopher] Good Retry, Bad Retry: An Incident Story
       ___________________________________________________________________
        
       Good Retry, Bad Retry: An Incident Story
        
       Author : misonic
       Score  : 87 points
       Date   : 2024-10-06 08:56 UTC (1 days ago)
        
 (HTM) web link (medium.com)
 (TXT) w3m dump (medium.com)
        
       | sim7c00 wrote:
       | Very nice read with lots of interesting points, examples, and
       | examination. Very thorough, imo. I'm not a microservices guy, but
       | it covers a lot of general concepts that also apply outside of
       | that domain. Very good, thanks!
        
       | duffmancd wrote:
       | I missed it on the first read-through but there is a link to the
       | code used to run the simulations in the first appendix.
       | 
       | Homegrown python code (i.e. not a library), very nicely laid out.
       | And would form a good basis for more experiments for anyone
       | interested. I think I'll have a play around later and try and
       | train my intuition.
        
       | davedx wrote:
       | This is the kind of well written, in depth technical narrative I
       | visit HN for. I definitely learned from it. Thanks for posting!
        
         | chipdart wrote:
         | I agree. What a treat. One of the best submissions gracing HN
         | in months.
        
       | easylion wrote:
       | Really good article about retries, its consequences and how load
       | amplification works. Loved it
        
       | guideamigo_com wrote:
       | I never get this desire for micro services. Your IDE can help if
       | there are 500 functions, but nothing will help you if you have
       | 500 micro services. Almost no one fully understands such a
       | system. It is hard to tell which parts of the code are unused,
       | and large-scale refactoring is impossible.
       | 
       | The upside seems to be some mythical infinite scalability which
       | will collapse under such positive feedback loops.
        
         | delusional wrote:
         | I think the dream is that you can reason locally. I'm not
         | convinced that it actually helps any, but the dream is that
         | having everything as services, complete with external
         | boundaries and enforced constraints, you're able to more
         | accurately reason about the orchestration of services. It's
         | hard to reason about your order flow if half of it depends on
         | some implicit procedure that's part of your shopping cart.
         | 
         | The business I'm part of isn't really after "scalable"
         | technology, so that might color my opinion, but a lot of the
         | arguments for microservices I hear from my colleagues are
         | actually benefits of modular programs. Those two have just
         | become synonyms in their minds.
        
           | klabb3 wrote:
           | > [...] the dream is that having everything as services,
           | [...], you're able to more accurately reason about the
           | orchestration of services.
           | 
           | Well.. I mean that's an entirely circular point. Maybe you
           | mean something else? That you can individually deploy and
           | roll back different functionality that belong to a team?
           | There's some appeal for operations yeah.
           | 
           | > but a lot of the arguments for microservices I hear from my
           | colleagues are actually benefits of modular programs
           | 
           | Yes I mean from a development perspective a library call is
           | far, far superior to an http call. It is much more performant
           | and orders of magnitude easier to reason about since the
           | caller and callee are running the same version of the code.
           | That means that a breaking change is a refactor and a single
           | commit, whereas with a service boundary you need a whole
           | migration.
           | 
           | You can't avoid services altogether, like say external
           | services like a payment portal by a completely different
           | company. But to deliberately create more of these expensive
           | boundaries for no reason, within the same small org or team,
           | is madness, imo.
        
         | FooBarWidget wrote:
         | The point of microservices is not technical, it's so that the
         | deployment- and repository ownership structure matches your
         | organization structure, and that clear lines are drawn between
         | responsibilities.
        
           | sim7c00 wrote:
           | It's also easier to find devs that have the skills to create
           | and maintain thin services than a large complicated monolith,
           | despite the difficulties found when having to debug a
           | constellation of microservices during a crisis.
        
             | phil21 wrote:
             | For the folks who downvoted this - why? I hire developers
             | and this is the absolute truth of the matter.
             | 
             | You can get away with hiring devs able to only debug their
             | little micro empire so long as you can retain some super
             | senior rockstar level folks able to see the big picture
             | when it inevitably breaks down in production under load.
             | These skills are becoming rarer by the day, when they used
             | to be nearly table stakes for a "senior" dev.
             | 
             | Microservices have their place, but many times you can see
             | that it's simply developers saying "not my problem" to the
             | actual hard business case things.
        
               | pards wrote:
               | > retain some super senior rockstar level folks able to
               | see the big picture
               | 
               | This is the critical piece that many organisations miss.
               | 
               | Microservices are the bricks; but the customer needs
               | those assembled into a house.
        
               | mannyv wrote:
               | You need those senior folks who can see the big picture,
               | whether you use monoliths or microservices.
               | 
               | The real benefit of a microservice is that it's easier to
               | see the interactions, because you can't call into some
               | random and unexpected part of the codebase...or at least
               | it's much harder to do something that's not noticeable
               | like that.
        
         | morningsam wrote:
         | >The upside seems to be some mythical infinite scalability
         | which will collapse under such positive feedback loops.
         | 
         | Unless I misunderstand something here, they say pretty early in
         | the article that they didn't have autoscaling configured for
         | the service in question and there is no indication they scaled
         | up the number of replicas manually after the downtime to
         | account for the accumulated backlog of requests. So, in my
         | mind, of course there can be no infinite, or really any,
         | scalability if the service isn't allowed to scale...
        
         | dropofwill wrote:
         | The concepts here apply to any client-server networking setup.
         | Monoliths could still have web clients, native apps, IOT
         | sensors, third party APIs, databases, etc.
        
       | azlev wrote:
       | Good reading.
       | 
       | In my last job, the service mesh was responsible for retries.
       | It was a startup and the system was changing every day.
       | 
       | After a while, we suspected that some services were not reliable
       | enough and that retries were hiding this fact. Turning off
       | retries exposed that quality had, in fact, gone down.
       | 
       | In the end, we put retries in just some services.
       | 
       | I have never tested either retry budgets or deadline
       | propagation. I will suggest them in the future.
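
A minimal sketch of the retry budget idea mentioned above, for readers who have not seen one. It is not taken from the article or from any service mesh; the class name `RetryBudget`, its methods, and the `ratio` and `min_retries` parameters are all hypothetical. The idea shown is only that retries are allowed while they stay under a fixed fraction of the requests seen so far. A production version would normally count over a sliding time window rather than forever.

```python
class RetryBudget:
    """Hypothetical retry budget: retries may not exceed a fraction of requests."""

    def __init__(self, ratio=0.1, min_retries=1):
        self.ratio = ratio              # allowed retries per original request
        self.min_retries = min_retries  # floor so low-traffic clients can still retry
        self.requests = 0
        self.retries = 0

    def record_request(self):
        """Count one original (non-retry) request."""
        self.requests += 1

    def try_acquire_retry(self):
        """Return True and consume budget if one more retry is allowed."""
        allowed = max(self.min_retries, int(self.requests * self.ratio))
        if self.retries < allowed:
            self.retries += 1
            return True
        return False
```

A client would call `record_request()` before each first attempt and only retry when `try_acquire_retry()` returns True, so a failing dependency sees a bounded amount of extra load instead of a multiple of normal traffic.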
        
       | Rygian wrote:
       | Reading this excellent article made me wonder whether
       | job interviews for developer positions include enough questions
       | about queue management.
       | 
       | "Ben" developed retries without exponential back-off, and only
       | learned about that concept in code review. Exponential back-off
       | should be part of any basic developer curriculum (except if that
       | curriculum does not mention networks of any sort at all).
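
For anyone who has not met the concept, here is a minimal sketch of exponential back-off with random jitter. It is an illustration, not the code from the article; the function names and the `base`, `cap`, and `max_attempts` parameters are assumptions made for this example. The waiting ceiling doubles after every failed attempt up to a cap, and the actual wait is drawn uniformly between zero and that ceiling.

```python
import random
import time


def backoff_ceiling(attempt, base=0.1, cap=10.0):
    """Upper bound in seconds for the wait after the given 0-based attempt."""
    return min(cap, base * 2 ** attempt)


def call_with_backoff(operation, max_attempts=5, base=0.1, cap=10.0,
                      sleep=time.sleep):
    """Call operation(), retrying on any exception with jittered back-off."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Out of attempts: let the last error reach the caller.
            if attempt == max_attempts - 1:
                raise
            # Random wait in [0, ceiling] so clients do not retry in lockstep.
            sleep(random.uniform(0, backoff_ceiling(attempt, base, cap)))
```

The `sleep` argument is injectable only so the behaviour can be exercised without real waiting. Real code would also catch a narrower exception type than `Exception`.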
        
         | sim7c00 wrote:
         | If you have too many deep questions, you rule out a lot of
         | eager juniors who can learn and grow on the job. It's a fine
         | balance, though, and looking at the article, Ben is taking his
         | lessons and growing. That's more important, I think, than
         | having someone who's a guru from the get-go. Everyone has
         | things they are better or worse at, and it's really a team
         | effort to do everything right. Presumably someone reviewed and
         | accepted his code, and that person didn't catch it either.
         | There's no developer who knows everything and produces perfect
         | code and design. It's a well-balanced team that can help move
         | in that direction.
        
           | Rygian wrote:
           | I wholeheartedly agree, and realize my comment was not really
           | clear.
           | 
           | Any training curriculum needs to include exponential back-off
           | as a core concept of any system-to-system interaction.
           | 
           | Ben was let out of school without proper training. Kudos to
           | the employer for finishing up the training that was missed
           | earlier on.
        
       | k3vinw wrote:
       | Great food for thought! I'm currently on an endeavor at work to
       | stabilize some pre-existing rest service integration tests
       | executed in parallel.
        
       | patrakov wrote:
       | To counter the avalanche of retries on different layers, I have
       | also seen a custom header being added to all requests that are
       | retries. Upon receiving a request with this header, the
       | microservice would turn off its own retry logic for this request.
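
A rough sketch of how the retry-marker header described above might work. The header name `x-retry`, the function name, and the `send` callback are all hypothetical; the comment does not say which header was used or how the service was written. The sketch assumes the marker is also forwarded downstream when the incoming request was itself a retry, which is a design choice, not something the comment states.

```python
RETRY_HEADER = "x-retry"  # hypothetical header name, compared in lowercase


def call_downstream_with_retries(incoming_headers, send, max_attempts=3):
    """Call send(headers), retrying only if our own caller is not retrying.

    incoming_headers: headers of the request this service received.
    send: callable that performs the downstream call with the given headers.
    """
    incoming = {name.lower() for name in incoming_headers}
    caller_is_retrying = RETRY_HEADER in incoming
    # A retried request gets exactly one downstream attempt from this layer.
    attempts = 1 if caller_is_retrying else max_attempts
    for attempt in range(attempts):
        outgoing = {}
        if caller_is_retrying or attempt > 0:
            # Mark the call so the next layer switches off its own retries.
            outgoing[RETRY_HEADER] = "1"
        try:
            return send(outgoing)
        except ConnectionError:
            if attempt == attempts - 1:
                raise
```

With three layers that each retry three times, an outage can otherwise multiply into 27 calls at the bottom layer; marking retries keeps the amplification closer to additive.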
        
       | patrakov wrote:
       | It's worth noting that the logic in the article only applies to
       | idempotent requests. See this article (by the same author) for
       | the non-idempotent counter-part:
       | https://habr.com/ru/companies/yandex/articles/442762/
       | (unfortunately, in Russian). I am sure somebody posted a human-
       | written English translation back then, but I cannot find it. So
       | here is a Google-translated version (scroll past the internal
       | error, the text is below):
       | 
       | https://habr-com.translate.goog/ru/companies/yandex/articles...
        
       | ramchip wrote:
       | AWS also say they do something interesting:
       | 
       | > When adding jitter to scheduled work, we do not select the
       | jitter on each host randomly. Instead, we use a consistent method
       | that produces the same number every time on the same host. This
       | way, if there is a service being overloaded, or a race condition,
       | it happens the same way in a pattern. We humans are good at
       | identifying patterns, and we're more likely to determine the root
       | cause. Using a random method ensures that if a resource is being
       | overwhelmed, it only happens - well, at random. This makes
       | troubleshooting much more difficult.
       | 
       | https://aws.amazon.com/builders-library/timeouts-retries-and...
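
One way to get the "same number every time on the same host" behaviour from the quote is to derive the jitter from a hash of the hostname. The quote does not say how AWS computes it, so the hashing approach, the function name `host_jitter`, and its parameters are assumptions made purely for illustration.

```python
import hashlib


def host_jitter(hostname, max_jitter_seconds):
    """Return a stable per-host offset in [0, max_jitter_seconds)."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    # The first 6 bytes give a 48-bit integer, which a float represents exactly.
    fraction = int.from_bytes(digest[:6], "big") / 2 ** 48
    return fraction * max_jitter_seconds
```

A host would add `host_jitter(its_own_name, window)` to the start time of each scheduled job. The load is still spread across the window, but any collision between hosts repeats identically on every run, which is the property the quote values for troubleshooting.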
        
       ___________________________________________________________________
       (page generated 2024-10-07 23:00 UTC)