[HN Gopher] The Thundering Herd Problem
___________________________________________________________________
The Thundering Herd Problem
Author : ingve
Score : 69 points
Date : 2024-02-01 07:46 UTC (1 day ago)
(HTM) web link (encore.dev)
(TXT) w3m dump (encore.dev)
| _whiteCaps_ wrote:
| When working on an embedded product, we used the last few digits
| of the MAC address to control timing of API accesses. Obviously
| doesn't work unless you control the whole stack, but for us, it
| was an easy fix since we had our own OUI.
| pphysch wrote:
| The single biggest way to mitigate this is to switch from "push"
| to "pull" architecture where possible. Let the central servers
| reach out to the "clients" at their own honest pace. Not every
| process can be modeled this way, but a surprising amount can,
| particularly internal processes where you control all of the
| agents.
| cbsmith wrote:
| Ironically, stream processing, which we tend to call "push"
| architecture, is largely the same solution.
| pphysch wrote:
| Do you mean something like "at most once" delivery over UDP?
| cbsmith wrote:
| No, that's more like traditional load shedding. I'm talking
| more like Kafka/NATS/0mq/etc.
| cbsmith wrote:
| The thundering herd problem comes from throughput decreasing due
| to overhead when you need to scale up capacity (traditionally due
| to increased context switching, but these days often from
| autoscaling activity). While rate-limiting, circuit-breaking,
| load-shedding, etc. are decent band-aids for the problem, the
| better solution tends to be finding ways to increase amortization
| of request processing overhead as request load increases. Caching
| is _a_ way to do it, but if done improperly can exacerbate
| problems. Another common approach that I didn't see mentioned
| in the article is microbatching, where you pay your
| overhead costs on a per batch basis (often using streaming
| interfaces like Kafka). As you fall behind, your batch sizes
| increase, which increases the number of requests you process per
| batch, spreading the batch processing overhead over more and more
| requests.
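A minimal sketch of the microbatching idea described above (the queue-based setup and names are illustrative, not from the article): block for the first request, then greedily drain whatever backlog has accumulated, so per-batch overhead is amortized over more requests as you fall behind.

```python
import queue

def next_batch(q, max_batch=100):
    """Collect one microbatch: block for the first item, then greedily
    drain whatever else is already queued, up to max_batch. When the
    consumer falls behind, the backlog grows and so do the batches,
    spreading per-batch overhead over more and more requests."""
    batch = [q.get()]                      # wait for at least one request
    while len(batch) < max_batch:
        try:
            batch.append(q.get_nowait())   # take the rest of the backlog
        except queue.Empty:
            break
    return batch
```

The key property is that batch size adapts automatically: under light load each batch holds one request; under heavy load batches grow toward max_batch.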
| rollcat wrote:
| Adding a little bit of random delay to any scheduled / triggered
| actions is always a good idea, and is usually very cheap to
| implement.
|
| If you're on OpenBSD, their cron supports randomized ranges using
| the ~ operator (see <https://man.openbsd.org/crontab.5>).
| Otherwise, you can use something like 'sleep $(($RANDOM % 60)) &&
| some-task', but beware that $RANDOM has a range of 0-65535; you
| won't get the full, uniform 86400s range for daily jobs.
|
| If you are unable to add a different random delay on every
| invocation (e.g. you have an RSS reader that only allows you to
| specify an exact interval), pick some nearby prime number, e.g.
| 37 instead of 30 or 45, or 71 instead of 60 (71-60=11 being a
| prime is also great).
|
| If you're deploying a whole fleet at once, you can also vary the
| exact timings of cron.hourly, cron.daily, etc between machines.
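As a small illustration of the jitter idea (the function name is made up), a per-invocation random delay in Python sidesteps the $RANDOM range limitation mentioned above, since random.uniform covers the full window:

```python
import random

def jittered_delay(base_s, jitter_s):
    """Return the base interval plus a fresh uniform random offset.
    Unlike shell's $RANDOM (which tops out at 65535), this can spread
    a daily job uniformly over the full 86400-second range."""
    return base_s + random.uniform(0.0, jitter_s)

# e.g. a "daily" job that actually fires somewhere in a two-hour window:
# time.sleep(jittered_delay(86400 - 7200, 7200))
```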
| traceroute66 wrote:
| For those who don't have the weird systemd allergy, systemd has
| many great features, including random scheduling, see
| `RandomizedDelaySec`.
| rollcat wrote:
| Oh yes, systemd in general and timers in particular do have a
| whole bunch of awesome stuff - I just tend to forget any of
| it exists, it's my coping mechanism for dealing with
| "heterogeneous" environments. I still have CentOS6 machines
| in prod...
|
| https://www.freedesktop.org/software/systemd/man/latest/syst.
| ..
| jcul wrote:
| Yeah, a lot of those nice things like random delays, back-offs,
| even using timezones, are not available even in the centos7
| systemd.
| he0001 wrote:
| My experience is that too many don't even know what this
| problem actually is, yet it exists in many places, in
| different shapes.
| cortesoft wrote:
| As someone who has worked on very large networks for 15 years,
| the author left out some important bits about using caching with
| thundering herd problems.
|
| You really need to make sure that your cache settings will not
| generate a ton of requests to the origin when new content is
| requested.
|
| If you have a bunch of clients that are going to request content
| at nearly the exact same time (this happens a lot with live video
| streaming, but can also happen when your clients are configured
| to download something as soon as it becomes available, or on a
| schedule, etc), you need to make sure your caches are set up so
| that the second request will wait for the first request to fill
| from origin rather than send its own cache fill request. Ideally,
| if 1000 clients all request the same content from your cache,
| only ONE request should go back to the origin, and the other 999
| should wait and use the content retrieved by the first request.
| Otherwise, your cache isn't going to do anything since every
| request is going to require pulling from origin anyway.
|
| In nginx, you do this with something like the proxy_cache_lock
| setting.
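As a rough in-process illustration of the same idea (a toy sketch, not how nginx implements proxy_cache_lock): a per-key lock lets the first miss fill from origin while concurrent requests for the same key wait and reuse that result.

```python
import threading

class CoalescingCache:
    """Toy request-coalescing cache: the first request for a missing key
    fetches from origin while concurrent requests for the same key block
    on its lock and reuse the result, so 1000 simultaneous misses become
    one origin fill."""
    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._data = {}
        self._locks = {}
        self._guard = threading.Lock()

    def get(self, key):
        with self._guard:
            if key in self._data:
                return self._data[key]
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:                          # only one thread fills per key
            if key not in self._data:
                self._data[key] = self._fetch(key)
        return self._data[key]
```

A production version would also need expiry and negative-result handling, but the per-key lock is the part that prevents the herd of origin fills.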
| bitexploder wrote:
| Caching is generally much harder to get right than most
| people probably think. Having delved into building caching
| systems that get the behavior and needs of the system just
| right, I found it took way more effort and research than I
| thought it would. You start with something simple, and by the
| time you have chased down all the edge cases and tuned the
| cache behavior to your situation, it is often not so simple
| after all.
| 01HNNWZ0MV43FF wrote:
| I recall seeing that called "Request coalescing" in Faster Than
| Lime's article https://fasterthanli.me/articles/request-
| coalescing-in-async...
| oops wrote:
| It is also sometimes called request collapsing.
| omeze wrote:
| Yep, I've found that, generally, caching logic should be a
| separate flow from business logic. E.g. wrap your RPC logic such
| that an in-memory cache transparently sits in your RPC stack.
| This lets you implement locking on the cache in a single place,
| and makes it easy to do things like eager prefetching or
| offloading cache logic to another thread.
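One hedged way to sketch that separation in Python (the decorator and all names are hypothetical): business code calls the RPC function normally, and the cache, including its locking, lives entirely in the wrapper.

```python
import functools
import threading

def cached_rpc(rpc_fn):
    """Hypothetical decorator keeping cache logic out of business logic:
    callers just invoke the function; an in-memory cache sits
    transparently in between. The single coarse lock also coalesces
    concurrent calls for the same arguments into one RPC."""
    cache = {}
    lock = threading.Lock()

    @functools.wraps(rpc_fn)
    def wrapper(*args):
        with lock:                 # the one place that knows about locking
            if args not in cache:
                cache[args] = rpc_fn(*args)
            return cache[args]
    return wrapper
```

Because the caching concern is isolated here, swapping in prefetching or moving the cache to another thread would not touch any call sites.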
| bluelightning2k wrote:
| Hey - just wanted to give props to the author. I always enjoy
| Encore content when it pops up here.
|
| Brand is cool too
| bluelightning2k wrote:
| Although I just checked your pricing on a whim and it's the
| most expensive per member service I've ever seen for devs. By a
| LOT.
|
| I mean I still like the brand but wow.
| eandre wrote:
| Encore CEO here. Thanks for the feedback.
|
| We sell developer productivity and devops automation, and
| compared to hiring additional engineers Encore is very cheap.
|
| We've tried to align our incentives with the needs of our
| customers, so there are no usage-based or surprise fees when
| using Encore. The per-seat price may be higher but it's
| transparent and predictable.
| matsemann wrote:
| Once had to interface with an external API for many of our client
| requests (booking requests for transportation). Due to certain
| combinations of queries, it could sometimes be slow to get a
| response if the requested journey would contain multiple legs and
| combinations. So we needed to have a high timeout for our
| requests to that system. Which was fine most of the time.
|
| Unfortunately, if that system started to struggle, all our calls
| would be slow, or even time out. And of course us continuously
| spamming them wouldn't help (we were like 95% of their traffic).
| But that didn't just wreak havoc on their side, but it completely
| exhausted all our incoming connections, thread pools etc. just
| waiting for responses that would never come, with long timeouts.
|
| A circuit breaker here was golden. When too many requests in a
| row were slow, we would break and stop requests from our side.
| That would allow them to recover on their side, and things on our
| side to fail fast and let other services still work.
|
| A key here was that the circuit breaker would still allow a
| small percentage through. How else would you detect that things
| are back up again? And then slowly let more and more through if
| things are going fine (so not just open completely all at once).
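A minimal sketch of such a circuit breaker (thresholds and names are illustrative; it models fail-fast plus a small probe fraction, not the gradual ramp-up):

```python
import random

class CircuitBreaker:
    """After a run of consecutive failures the breaker opens and fails
    fast, shedding load so the struggling upstream can recover, but a
    small probe fraction still gets through so recovery is detected."""
    def __init__(self, failure_threshold=5, probe_rate=0.05):
        self.failure_threshold = failure_threshold
        self.probe_rate = probe_rate
        self.consecutive_failures = 0

    @property
    def open(self):
        return self.consecutive_failures >= self.failure_threshold

    def allow_request(self):
        if not self.open:
            return True
        return random.random() < self.probe_rate   # probe traffic only

    def record(self, success):
        if success:
            self.consecutive_failures = 0          # close on success
        else:
            self.consecutive_failures += 1
```

A fuller version would ramp probe_rate up gradually as probes succeed, as described above, rather than snapping fully open on the first success.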
| muffa wrote:
| Props to Encore for writing interesting blogs. Also, I just
| want to say I tried the Encore product and it's awesome (I only
| used it for a hobby project, but still).
| AlbertCory wrote:
| > user journeys
|
| meaning, they tested that if the user does everything the way
| they're "supposed to" it works.
|
| That explains a lot.
| PaulDavisThe1st wrote:
| Once again, what TFA describes is _NOT_ the thundering herd
| problem.
|
| https://en.wikipedia.org/wiki/Thundering_herd_problem
|
| > In computer science, the thundering herd problem occurs when a
| large number of processes or threads waiting for an event are
| awoken when that event occurs, _but only one process is able to
| handle the event_.
|
| (emphasis mine)
| gilbetron wrote:
| Thundering herd has evolved beyond the original definition;
| that happens. Sometimes terms even flip their meaning (like how
| "one bad apple" has flipped its meaning, or how "premature
| optimization is the root of all evil" as used these days isn't
| what the original author actually meant). But in this case I
| think most of the new usages fit the basic idea, which is a
| large amount of unusual traffic happening outside of normal
| usage patterns.
| PaulDavisThe1st wrote:
| Fair point that terms like this can change meanings.
|
| But this version absolutely does not fit the original, which
| had nothing to do with amounts of traffic.
|
| It was a problem with kernel APIs: if N threads sleep waiting
| for an event on a socket (i.e. a thread pool), there was no
| way to wake up _only one_ of them when a single msg arrived.
| They would all wake, all check if there was work to do, then
| all (but one) would go back to sleep. This is a behavioral
| flaw even in the case of next to no traffic (though it may
| not have a material effect).
|
| The problem has been solved with better kernel APIs.
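The original semantics can be seen in miniature with Python's threading.Condition (a hedged analogy, not the kernel API): notify_all() reproduces the old behavior, where every sleeping waiter wakes, one wins, and the rest go back to sleep, while notify() wakes a single waiter, roughly like the newer kernel mechanisms (e.g. epoll's EPOLLEXCLUSIVE flag).

```python
import threading

cond = threading.Condition()
jobs = []
done = False
handled_by = []

def worker(name):
    # A pool thread sleeping on the condition, as in the kernel scenario.
    with cond:
        while True:
            if jobs:
                jobs.pop()
                handled_by.append(name)  # this waiter won the event
            elif done:
                return
            else:
                cond.wait()              # sleep until notified

workers = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in workers:
    t.start()

with cond:
    jobs.append("msg")
    cond.notify()      # wake ONE waiter; notify_all() would wake the herd

with cond:
    done = True
    cond.notify_all()  # now wake everyone so they can exit

for t in workers:
    t.join()
```

Note the `while` loop around the wait: each woken thread must recheck for work, which is exactly what made the all-waiters wakeup wasteful, since N-1 threads would recheck, find nothing, and sleep again.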
| jodydonetti wrote:
| Shameless plug: for protection against thundering herd, cache
| stampede and various transient failures on .net I created
| FusionCache.
|
| Hope this may be of help to someone.
|
| https://github.com/ZiggyCreatures/FusionCache
| Berniek wrote:
| Philosophically, the "thundering herd problem" actually CAUSES
| congestion.
|
| There are 2 ways to handle it.
|
| 1. Stop the thundering herd (make all the clients do something
| different). That may make things worse. Congestion in networks is
| usually exponential: you can't fulfill a request so you repeat
| the request, it can't be fulfilled so it repeats. You can add a
| random delay at the client end, but that is just the US
| government's answer to the debt problem; it kicks the can down
| the road in the short term but will almost certainly come back
| and bite you. Mathematically it is very easy for this scenario to
| become catastrophically exponential when a threshold is reached.
|
| 2. Stop the congestion (make the handling process faster or add
| processes).
|
| The system already has a cache to handle this, but if it's not in
| the cache it doesn't help. There needs to be an extra request
| cache exclusively for congestion scenarios. The existing cache
| request is already doing some processing, so extend that to route
| "thundering herd request processing". This second cache does a
| bit more processing as well. As each new request is routed to it,
| it checks itself to see if this requestor is in the cache and
| removes it or overwrites it. It should never contain more than
| one entry per client.
|
| When no more additions are made to this congestion cache (or the
| rate has slowed significantly) then the requests can be forwarded
| and processed via the original cache system.
|
| Under this configuration, the congestion does not become
| exponential and should only delay the thundering herd requests.
| All other requests will be handled as per normal.
|
| Once the original cache has the information there is no need for
| any thundering herd requests to be routed to the congestion
| cache.
|
| Some clients will encounter some delays but not all and only on
| the "thundering herd process".
___________________________________________________________________
(page generated 2024-02-02 23:01 UTC)