[HN Gopher] The Thundering Herd Problem
       ___________________________________________________________________
        
       The Thundering Herd Problem
        
       Author : ingve
       Score  : 69 points
        Date   : 2024-02-01 07:46 UTC (1 day ago)
        
 (HTM) web link (encore.dev)
 (TXT) w3m dump (encore.dev)
        
       | _whiteCaps_ wrote:
       | When working on an embedded product, we used the last few digits
       | of the MAC address to control timing of API accesses. Obviously
       | doesn't work unless you control the whole stack, but for us, it
       | was an easy fix since we had our own OUI.
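        | 
        | Roughly this shape, for the curious (a Go sketch of the idea;
        | "eth0" and the 60-second window are made-up parameters):
        | 
        |     package main
        | 
        |     import (
        |         "fmt"
        |         "net"
        |         "time"
        |     )
        | 
        |     func main() {
        |         iface, err := net.InterfaceByName("eth0")
        |         if err != nil {
        |             panic(err)
        |         }
        |         mac := iface.HardwareAddr
        |         // The last two bytes of the MAC give each device a
        |         // stable offset, spreading the fleet's API calls
        |         // across the window instead of firing all at once.
        |         n := uint32(mac[4])<<8 | uint32(mac[5])
        |         delay := time.Duration(n%60) * time.Second
        |         fmt.Println("delaying first API call by", delay)
        |         time.Sleep(delay)
        |         // ... make the API call here ...
        |     }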
        
       | pphysch wrote:
       | The single biggest way to mitigate this is to switch from "push"
       | to "pull" architecture where possible. Let the central servers
       | reach out to the "clients" at their own honest pace. Not every
       | process can be modeled this way, but a surprising amount can,
       | particularly internal processes where you control all of the
       | agents.
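        | 
        | A minimal sketch of that shape in Go (agent URLs and the
        | interval are made up):
        | 
        |     package main
        | 
        |     import (
        |         "fmt"
        |         "net/http"
        |         "time"
        |     )
        | 
        |     // The central server visits each agent at a pace it
        |     // controls, so the agents can never stampede it.
        |     func main() {
        |         agents := []string{
        |             "http://agent-1:8080/status",
        |             "http://agent-2:8080/status",
        |         }
        |         for {
        |             for _, url := range agents {
        |                 resp, err := http.Get(url)
        |                 if err != nil {
        |                     continue // skip unreachable agents this round
        |                 }
        |                 resp.Body.Close()
        |                 fmt.Println("pulled", url, resp.Status)
        |             }
        |             time.Sleep(30 * time.Second) // the server sets the pace
        |         }
        |     }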
        
         | cbsmith wrote:
         | Ironically, stream processing, which we tend to call "push"
         | architecture, is largely the same solution.
        
           | pphysch wrote:
           | Do you mean something like "at most once" delivery over UDP?
        
             | cbsmith wrote:
             | No, that's more like traditional load shedding. I'm talking
             | more like Kafka/NATS/0mq/etc.
        
       | cbsmith wrote:
       | The thundering herd problem comes from throughput decreasing due
       | to overhead when you need to scale up capacity (traditionally due
       | to increased context switching, but these days often from
       | autoscaling activity). While rate-limiting, circuit-breaking,
       | load-shedding, etc. are decent band aids for the problem, the
       | better solution tends to be finding ways to increase amortization
       | of request processing overhead as request load increases. Caching
       | is _a_ way to do it, but if done improperly can exacerbate
        | problems. Another common approach, one I didn't see mentioned
        | in the article, is microbatching, where you pay your
       | overhead costs on a per batch basis (often using streaming
       | interfaces like Kafka). As you fall behind, your batch sizes
       | increase, which increases the number of requests you process per
       | batch, spreading the batch processing overhead over more and more
       | requests.
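        | 
        | A toy version of that loop in Go (names and sizes are mine,
        | not from the article):
        | 
        |     package main
        | 
        |     import "fmt"
        | 
        |     // drainBatch blocks for one request, then grabs whatever
        |     // else is already queued, up to max. Under light load
        |     // batches stay small; as you fall behind they grow, so the
        |     // fixed per-batch overhead is spread over more requests.
        |     func drainBatch(ch <-chan string, max int) []string {
        |         batch := []string{<-ch}
        |         for len(batch) < max {
        |             select {
        |             case req := <-ch:
        |                 batch = append(batch, req)
        |             default:
        |                 return batch // queue is drained
        |             }
        |         }
        |         return batch
        |     }
        | 
        |     func main() {
        |         ch := make(chan string, 1024)
        |         for i := 0; i < 250; i++ {
        |             ch <- fmt.Sprintf("req-%d", i)
        |         }
        |         for len(ch) > 0 {
        |             batch := drainBatch(ch, 100)
        |             fmt.Println("processing batch of", len(batch))
        |         }
        |     }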
        
       | rollcat wrote:
       | Adding a little bit of random delay to any scheduled / triggered
       | actions is always a good idea, and is usually very cheap to
       | implement.
       | 
       | If you're on OpenBSD, their cron supports randomized ranges using
       | the ~ operator (see <https://man.openbsd.org/crontab.5>).
       | Otherwise, you can use something like 'sleep $(($RANDOM % 60)) &&
       | some-task', but beware that $RANDOM has a range of 0-65535; you
       | won't get the full, uniform 86400s range for daily jobs.
       | 
       | If you are unable to add a different random delay on every
       | invocation (e.g. you have an RSS reader that only allows you to
       | specify an exact interval), pick some nearby prime number, e.g.
        | 37 instead of 30 or 45, or 71 instead of 60 (the difference
        | 71-60=11 being prime also helps).
       | 
       | If you're deploying a whole fleet at once, you can also vary the
       | exact timings of cron.hourly, cron.daily, etc between machines.
        
         | traceroute66 wrote:
          | For those who don't have the weird systemd allergy: systemd
          | has many great features, including randomized scheduling; see
          | `RandomizedDelaySec`.
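          | 
          | In a timer unit it looks something like this (a sketch; the
          | schedule is arbitrary):
          | 
          |     [Timer]
          |     OnCalendar=daily
          |     RandomizedDelaySec=1h
          | 
          | Each activation then lands at a random point within the hour
          | after the scheduled time instead of exactly on it.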
        
           | rollcat wrote:
           | Oh yes, systemd in general and timers in particular do have a
           | whole bunch of awesome stuff - I just tend to forget any of
           | it exists, it's my coping mechanism for dealing with
           | "heterogeneous" environments. I still have CentOS6 machines
           | in prod...
           | 
            | https://www.freedesktop.org/software/systemd/man/latest/syst...
        
             | jcul wrote:
              | Yeah, a lot of those nice things like random delays,
              | backoffs, even using timezones are not available, even in
              | the CentOS 7 systemd.
        
       | he0001 wrote:
        | My experience is that too many people don't even know what
        | this problem actually is, and yet it exists in many places,
        | in different shapes.
        
       | cortesoft wrote:
       | As someone who has worked on very large networks for 15 years,
       | the author left out some important bits about using caching with
       | thundering herd problems.
       | 
       | You really need to make sure that your cache settings will not
       | generate a ton of requests to the origin when new content is
       | requested.
       | 
       | If you have a bunch of clients that are going to request content
       | at nearly the exact same time (this happens a lot with live video
       | streaming, but can also happen when your clients are configured
       | to download something as soon as it becomes available, or on a
       | schedule, etc), you need to make sure your caches are set up so
       | that the second request will wait for the first request to fill
       | from origin rather than send its own cache fill request. Ideally,
       | if 1000 clients all request the same content from your cache,
       | only ONE request should go back to the origin, and the other 999
        | should wait and use the content retrieved by the first request.
       | Otherwise, your cache isn't going to do anything since every
       | request is going to require pulling from origin anyway.
       | 
       | In nginx, you do this with something like the proxy_cache_lock
       | setting.
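        | 
        | E.g. (a sketch; the timeout needs tuning for your origin):
        | 
        |     proxy_cache_lock on;
        |     proxy_cache_lock_timeout 5s;
        |     # serve slightly stale content while one request refreshes it
        |     proxy_cache_use_stale updating;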
        
         | bitexploder wrote:
          | Caching is generally way harder to get right than a lot of
          | people probably think. Having delved into building caching
          | systems that get the behavior and needs of the system just
          | right, I found it took way more effort and research than I
          | expected. You start with something simple, and by the time
          | you have chased down all the edge cases and tuned the cache
          | behavior to your situation, it is often not so simple after
          | all.
        
         | 01HNNWZ0MV43FF wrote:
          | I recall seeing that called "request coalescing" in Faster
          | Than Lime's article:
          | https://fasterthanli.me/articles/request-coalescing-in-async...
        
           | oops wrote:
           | It is also sometimes called request collapsing.
        
         | omeze wrote:
          | Yep, I've found that, generally, caching logic should be a
          | separate flow from business logic. E.g. wrap your RPC logic such
         | that an in-memory cache transparently sits in your RPC stack.
         | This lets you implement locking on the cache in a single place,
         | and makes it easy to do things like eager prefetching or
         | offloading cache logic to another thread.
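          | 
          | In Go, golang.org/x/sync/singleflight is a ready-made
          | building block for that locking. A sketch (fetchFromOrigin
          | stands in for the real RPC):
          | 
          |     package main
          | 
          |     import (
          |         "fmt"
          |         "sync"
          |         "time"
          | 
          |         "golang.org/x/sync/singleflight"
          |     )
          | 
          |     var group singleflight.Group
          | 
          |     // fetchFromOrigin stands in for the real RPC.
          |     func fetchFromOrigin(key string) (string, error) {
          |         time.Sleep(100 * time.Millisecond)
          |         fmt.Println("origin fetch for", key) // typically prints once
          |         return "value-for-" + key, nil
          |     }
          | 
          |     // get coalesces concurrent callers: one in-flight fetch
          |     // per key, and every waiting caller shares its result.
          |     func get(key string) (string, error) {
          |         v, err, _ := group.Do(key, func() (interface{}, error) {
          |             return fetchFromOrigin(key)
          |         })
          |         if err != nil {
          |             return "", err
          |         }
          |         return v.(string), nil
          |     }
          | 
          |     func main() {
          |         var wg sync.WaitGroup
          |         for i := 0; i < 1000; i++ {
          |             wg.Add(1)
          |             go func() {
          |                 defer wg.Done()
          |                 get("same-key")
          |             }()
          |         }
          |         wg.Wait()
          |     }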
        
       | bluelightning2k wrote:
       | Hey - just wanted to give props to the author. I always enjoy
       | Encore content when it pops up here.
       | 
       | Brand is cool too
        
         | bluelightning2k wrote:
          | Although I just checked your pricing on a whim, and it's the
          | most expensive per-member service I've ever seen for devs. By
          | a LOT.
         | 
         | I mean I still like the brand but wow.
        
           | eandre wrote:
           | Encore CEO here. Thanks for the feedback.
           | 
           | We sell developer productivity and devops automation, and
           | compared to hiring additional engineers Encore is very cheap.
           | 
           | We've tried to align our incentives with the needs of our
           | customers, so there are no usage-based or surprise fees when
           | using Encore. The per-seat price may be higher but it's
           | transparent and predictable.
        
       | matsemann wrote:
       | Once had to interface with an external API for many of our client
       | requests (booking requests for transportation). Due to certain
       | combinations of queries, it could sometimes be slow to get a
       | response if the requested journey would contain multiple legs and
       | combinations. So we needed to have a high timeout for our
       | requests to that system. Which was fine most of the time.
       | 
       | Unfortunately, if that system started to struggle, all our calls
       | would be slow, or even time out. And of course us continuously
       | spamming them wouldn't help (we were like 95% of their traffic).
        | That didn't just wreak havoc on their side; it also completely
        | exhausted all our incoming connections, thread pools, etc.,
        | just waiting for responses that would never come, with long
        | timeouts.
       | 
       | A circuit breaker here was golden. When too many requests in a
       | row were slow, we would break and stop requests from our side.
       | That would allow them to recover on their side, and things on our
       | side to fail fast and let other services still work.
       | 
        | A key here was that the circuit breaker would still allow a
        | small percentage through. How else would you detect things are
       | back up again? And then slowly let more and more through if
       | things are going fine (so not just open completely all at once).
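        | 
        | For anyone who hasn't built one: the skeleton is small. A toy
        | Go sketch (the threshold and the 10% probe rate are made up):
        | 
        |     package main
        | 
        |     import (
        |         "errors"
        |         "math/rand"
        |         "sync"
        |     )
        | 
        |     var ErrOpen = errors.New("circuit open: failing fast")
        | 
        |     // Breaker opens after maxFails failures, but even when
        |     // open it lets ~10% of calls through as probes, so it can
        |     // notice the downstream recovering.
        |     type Breaker struct {
        |         mu       sync.Mutex
        |         fails    int
        |         maxFails int
        |     }
        | 
        |     func (b *Breaker) Call(fn func() error) error {
        |         b.mu.Lock()
        |         open := b.fails >= b.maxFails
        |         b.mu.Unlock()
        |         if open && rand.Intn(100) >= 10 {
        |             return ErrOpen // fail fast without touching the backend
        |         }
        |         err := fn()
        |         b.mu.Lock()
        |         defer b.mu.Unlock()
        |         if err != nil {
        |             b.fails++
        |         } else if b.fails > 0 {
        |             b.fails-- // successful probes slowly close the circuit
        |         }
        |         return err
        |     }
        | 
        |     func main() {
        |         b := &Breaker{maxFails: 5}
        |         for i := 0; i < 20; i++ {
        |             err := b.Call(func() error {
        |                 return errors.New("downstream timeout") // simulated
        |             })
        |             println(i, err.Error())
        |         }
        |     }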
        
       | muffa wrote:
        | Props to Encore for writing interesting blogs. Also, I just
        | want to say I tried the Encore product and it's awesome (I only
        | used it for a hobby project, but still).
        
       | AlbertCory wrote:
       | > user journeys
       | 
       | meaning, they tested that if the user does everything the way
       | they're "supposed to" it works.
       | 
       | That explains a lot.
        
       | PaulDavisThe1st wrote:
       | Once again, what TFA describes is _NOT_ the thundering herd
       | problem.
       | 
       | https://en.wikipedia.org/wiki/Thundering_herd_problem
       | 
       | > In computer science, the thundering herd problem occurs when a
       | large number of processes or threads waiting for an event are
       | awoken when that event occurs, _but only one process is able to
       | handle the event_.
       | 
       | (emphasis mine)
        
         | gilbetron wrote:
          | Thundering herd has evolved beyond the original definition;
          | that happens. Sometimes terms even flip their meaning (like
          | how "one bad apple" has flipped its meaning, or how
          | "premature optimization is the root of all evil" as used
          | these days isn't what the original author actually meant).
          | But in this case I think most of the new usages fit the
          | basic idea, which is that a large amount of unusual traffic
          | happens outside of normal usage patterns.
        
           | PaulDavisThe1st wrote:
           | Fair point that terms like this can change meanings.
           | 
           | But this version absolutely does not fit the original, which
           | had nothing to do with amounts of traffic.
           | 
           | It was a problem with kernel APIs: if N threads sleep waiting
           | for an event on a socket (i.e. a thread pool), there was no
           | way to wake up _only one_ of them when a single msg arrived.
           | They would all wake, all check if there was work to do, then
           | all (but one) would go back to sleep. This is a behavioral
           | flaw even in the case of next to no traffic (though it may
            | not have a material effect).
           | 
           | The problem has been solved with better kernel APIs.
        
       | jodydonetti wrote:
        | Shameless plug: for protection against thundering herd, cache
        | stampede, and various transient failures on .NET, I created
        | FusionCache.
       | 
       | Hope this may be of help to someone.
       | 
       | https://github.com/ZiggyCreatures/FusionCache
        
       | Berniek wrote:
       | Philosophically, the "thundering herd problem" actually CAUSES
       | congestion.
       | 
       | There are 2 ways to handle it.
       | 
        | 1. Stop the thundering herd (make all the clients do something
        | different). That may make things worse. Congestion in networks
        | is usually exponential: you can't fulfill a request, so you
        | repeat the request; it can't be fulfilled, so it repeats. You
        | can add a random delay at the client end, but that is just the
        | US government's answer to the debt problem: it kicks the can
        | down the road in the short term, but it will almost certainly
        | come back and bite you. Mathematically, it is very easy for
        | this scenario to become catastrophically exponential once a
        | threshold is reached.
       | 
        | 2. Stop the congestion (make the handling process faster or
        | add processes).
       | 
        | The system already has a cache to handle this, but if the item
        | is not in the cache it doesn't help. There needs to be an extra
        | request cache exclusively for congestion scenarios. The
        | existing cache lookup is already doing some processing, so
        | extend that to route "thundering herd" requests to this second
        | cache, which does a bit more processing as well: as each new
        | request is routed to it, it checks whether this requestor is
        | already in the cache and removes or overwrites the old entry.
        | It should never contain more than one entry per client.
       | 
        | When no more additions are made to this congestion cache (or
        | the
       | rate has slowed significantly) then the requests can be forwarded
       | and processed via the original cache system.
       | 
       | Under this configuration, the congestion does not become
       | exponential and should only delay the thundering herd requests.
       | All other requests will be handled as per normal.
       | 
       | Once the original cache has the information there is no need for
       | any thundering herd requests to be routed to the congestion
       | cache.
       | 
        | Some clients will encounter some delays, but not all of them,
        | and only on the "thundering herd" path.
        
       ___________________________________________________________________
       (page generated 2024-02-02 23:01 UTC)