[HN Gopher] Exponential Rate Limiting
___________________________________________________________________
Exponential Rate Limiting
Author : mooreds
Score : 87 points
Date : 2024-09-09 14:47 UTC (3 days ago)
(HTM) web link (dotat.at)
(TXT) w3m dump (dotat.at)
| hansvm wrote:
| Ha, I wrote one of these a year ago. The problem being solved was
| gracefully degrading a service if the request volume was too
| high.
|
| Some additional tweaks you might care about:
|
| 1. It's easy to change the "numerator" being measured to
| something continuous rather than discrete. As a rate limiter in
| my application I found CPU seconds per second to be much more
| useful than requests per second. If you have more control over
| the low-level networking then queue depth or something might be
| more applicable, but you work with what you have.
|
| 2. It's easy to change the units of the configuration parameters.
| I haven't checked if it's already in these units for example, but
| TFA's "averaging period" can easily be re-cast as a half-life (in
| microseconds or hours or whatever's most convenient).
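|
| A minimal sketch of that recast (hypothetical names; assuming the
| smoothing weight is computed as alpha = exp(-dt/tau)):
|
|     def alpha_from_half_life(dt: float, half_life: float) -> float:
|         # With tau = half_life / ln(2), exp(-dt/tau) == 0.5 ** (dt/half_life),
|         # so the config can be stated directly as a half-life.
|         return 0.5 ** (dt / half_life)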
|
| Anywho, this is a nice write-up on exponential smoothing. I like
| the attention to edge cases affecting rate-limiting specifically.
|
| As a bit of a fun aside, the state-update equation here is an
| Euler update solving the exponential ODE, fusing in the new
| knowledge you have (new requests) as instantaneous tweaks. For
| rate-limiting in particular the exponential is probably a good
| choice (numerically stable, hardware instructions, works even
| when time gaps or other features are far outside of your intended
| configuration space, ...), but in other domains, like memoryless
| (still "stateful", obviously, but only requiring an amalgamation
| condensed into the very last state, as opposed to some O(n)
| thing) animation updates, you might prefer other properties like
| the ability to decay to exactly 0 in finite time. You can also
| recover leaky buckets and other similar algorithms with the right
| choice of ODE.
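|
| As a minimal sketch of that last point (hypothetical names), the
| same update loop recovers either behavior depending on which ODE
| you integrate over the gap dt:
|
|     import math
|
|     def decay_exponential(rate: float, dt: float, tau: float) -> float:
|         # Exact solution of dr/dt = -r/tau: never reaches 0 in finite time.
|         return rate * math.exp(-dt / tau)
|
|     def decay_leaky(level: float, dt: float, drain: float) -> float:
|         # Solution of dl/dt = -drain, clamped: hits exactly 0 in finite
|         # time, which is the leaky-bucket behavior.
|         return max(0.0, level - drain * dt)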
| brightball wrote:
| > As a rate limiter in my application I found CPU seconds per
| second to be much more useful than requests per second.
|
| This is really insightful and makes good sense. I've toyed for
| years with the idea of building a "Cost Benefit Cache" that
| basically tracked the request volume for a piece of data, the
| time to retrieve that data and the size of the data retrieved
| to make decisions about what to cache during system stress.
|
| The idea was basically a cost benefit formula where the cost is
| the size of the data, so how much of the available cache space
| would it use, while the benefit is the seconds of savings from
| requests * average retrieval time. Requests per second alone
| doesn't really matter if the request is low/no impact.
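|
| A minimal sketch of that formula (hypothetical names, just to pin
| down the units):
|
|     def cache_score(hits_per_sec: float, retrieval_sec: float,
|                     size_bytes: int) -> float:
|         # Benefit: seconds of retrieval work saved per second if cached.
|         # Cost: bytes of the available cache budget consumed.
|         # Under stress, evict (or decline to cache) the lowest scores.
|         return (hits_per_sec * retrieval_sec) / size_bytes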
| CBLT wrote:
| Google had a library like that, and one of the foot guns is
| that requests that are malformed or otherwise do 0 work have
| incredibly low cost, which results in them being prioritized
| above all other requests. Every user had to configure some
| obscure options to set cost multipliers based on return code
| to try and correct for this, which felt like a pretty
| fundamental bug that was being hacked around.
|
| Edit: this was actually for cost-based load balancing, not
| load shedding. You obviously can't load shed based on return
| code.
| travisjungroth wrote:
| Interesting that "if the request is bad or there's an
| error" is a difficult condition to implement universally.
| jpollock wrote:
| When you've got 80,000 engineers you end up with lots of
| different opinions on what the wire protocol should look
| like.
|
| * Sometimes, you want to fail closed - errors are
| returned for all errors.
|
| * Sometimes, you want to fail open - ok is returned for
| errors.
|
| * Sometimes, you want to separate "failure" from
| "denied", the code "worked", but it didn't do what the
| client wanted - think "payment declined", so you return
| "OK", with the actual result in the response.
|
| Compare the different costs to charge a credit card that
| is expired vs one that works.
|
| * The expired one can be checked before calling out to
| the processor - cheap, returns "OK + Declined".
|
| * The success ends up with a call to the processor and
| takes a second or two, returns "OK + Success".
| hansvm wrote:
| Some things are really hard to appropriately abstract,
| but there's so much similarity we can't help but be
| tempted to try.
|
| Something that I wrote at $WORK recently was a data-
| oriented exponentially smoothed reservoir sampling order
| statistic randomized splay tree.
|
| All of those components exist individually. Even picking
| on two of them though; given a tree implementation it's
| trivial (as a human) to create an analogous order-
| statistic tree. In any pseudo-mainstream programming
| language though, especially at a system level, how do you
| tack on the "order-statistic" behavior to an arbitrary
| tree?
|
| It's not too bad if you additionally require some interface
| the primary tree needs to adhere to. That interface is,
| necessarily, biased toward the needs of an order-
| statistic tree though. There isn't actually that much
| difference between implementing a tree which adheres to
| the interface and one which just does every operation
| manually. Plus, there are some allocation questions and
| other inefficiencies (or, alternatively, unsoundness)
| that level of abstraction introduces which you can avoid
| by writing the whole thing from scratch.
|
| Bringing it back to the $GOOG library, the problem is
| that you have a lot of code which _looks_ like it's the
| same, but it's impossible to make it actually _be_ the
| same without heavy lifting for callers of that library.
|
| Taking those monolithic frameworks (e.g., inadvisedly
| popular rate-limiting shenanigans) and turning them into
| libraries of helpful utilities often helps. General
| solutions are harder to come by.
| jeffbee wrote:
| CPU rate is only quasi-continuous. The fact of a thread running
| or not running is binary. You cannot be 63% on a CPU, that is
| an illusion of sampling and integration.
|
| But the real trouble with CPU rate is that it represents the
| past, not the future. It tells you what happened to the last
| request and nothing about what will happen to the next request.
| A signal with actual predictive power is how many other
| requests are running or waiting to run.
| hansvm wrote:
| Like everything, the devil's in the details.
|
| In our case, non-idle cumulative thread CPU time (including
| system activity) is a nice metric. If it exceeds 70%, start
| sending some 503s (or 500s depending on the client). A single
| core can easily service 100B requests per day, a few times
| more than we needed, so if we exceeded that by enough
| margin to saturate every core to 70% then the right course of
| action was some combination of (1) scale up, and (2) notify
| engineering that a moderately important design assumption was
| violated. We could still service 100B requests per day per
| core, but the excess resulted in a few 5xx errors for a
| minute while autoscaling kicked in. It's not good enough for
| every service, but it was absolutely what we needed and leaps
| and bounds better than what we had before, with a side-
| benefit of being "obviously" correct (simple code causes
| fewer unexpected bugs).
|
| The core idea in tweaking the numerator of the exponential
| smoothing isn't any particular, easily disputed definition of CPU
| time though. It's that you're able to tailor the algorithm to your
| individual needs. A related metric is just the wall-clock
| time from the beginning of a request being queued till when
| it was serviced (depending on your needs, maybe compared to
| the actual processing time of said request). If you ever get
| to total_time >= 2 * processing_time, that likely indicates a
| queue about to exceed a critical threshold.
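|
| A minimal sketch of that signal (hypothetical names):
|
|     def queue_backing_up(enqueued_at: float, started_at: float,
|                          finished_at: float) -> bool:
|         total_time = finished_at - enqueued_at
|         processing_time = finished_at - started_at
|         # Waiting at least as long as processing suggests the queue
|         # is growing faster than it drains.
|         return total_time >= 2 * processing_time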
|
| > how many other requests are running or waiting to run
|
| I referenced that idea pretty explicitly: "If you have more
| control over the low-level networking then queue depth or
| something might be more applicable, but you work with what
| you have."
|
| Yes, that matters. That doesn't make other approaches
| invalid. Even then, you usually want a smoothed version of
| that stat for automated decision-making.
| alphazard wrote:
| This is like a capacitor discharging. I had always considered
| this a variant of token bucket with exponential instead of linear
| decay.
| fanf2 wrote:
| The main difference from my point of view is the EWMA produces
| a measurement of the rate, whereas token bucket only produces
| the result of a comparison with the limit. If you don't need
| the actual value of the rate then you can save a lot of time
| and space by using GCRA instead
| https://dotat.at/@/2024-08-30-gcra.html
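|
| For a flavor of why it's cheaper, here's a minimal sketch of GCRA
| in its virtual-scheduling form (hypothetical names; see the link
| for the full treatment). The only per-client state is a single
| timestamp, and no rate value is ever computed:
|
|     def gcra_allow(tat: float, now: float, interval: float,
|                    tolerance: float) -> tuple[bool, float]:
|         # tat is the "theoretical arrival time" of the next request.
|         tat = max(tat, now)
|         if tat - now > tolerance:
|             return False, tat              # over limit; state unchanged
|         return True, tat + interval        # admit and push tat forward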
| ted_dunning wrote:
| If doing a small amount of arithmetic and keeping 4 bytes (or
| less) additional memory per client is "a lot of time and
| space", you may need to rethink your priorities.
|
| It is very, very unusual for any kind of request to anything
| to complete in a time anywhere near how long exponential
| averaging takes; requests are slower by many orders of
| magnitude. The difference between a GCRA implementation and a
| complex multi-exponential average will be nanoseconds.
| taneq wrote:
| We were just this week talking at work about all the many and
| varied names that people make up for "weighted average between
| smoothed value and new sample". :D
| tlarkworthy wrote:
| I wrote an interactive notebook visualization and a fast-
| convergence optimization for one of these:
|
| https://observablehq.com/@tomlarkworthy/rate-estimation
| o11c wrote:
| Hm, I really need to study this more. The early claim that it
| requires "complicated float arithmetic" took me by surprise. I'm
| used to fixed-time-period sampling, so you can just do:
|
|     // Textbook version:
|     // average = alpha * new_value + (1 - alpha) * average
|     // Which is equivalent to:
|     // average += (new_value - average) * alpha
|     // To get rid of the float, define w = 1/alpha and make it an
|     // integer. If w is a power of 2 (try 8), this is just a shift.
|     average += (new_value - average) / w
| fanf2 wrote:
| Yeah, it's easy when alpha is fixed. The complications -- the
| exp() and ln() -- come from updating the smoothed rate at
| varying intervals, whenever a request happens to arrive, so
| alpha has to be adjusted according to the interval to make sure
| the exponential decay happens at a fixed speed regardless of
| the interval.
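|
| A minimal sketch of that adjustment (hypothetical names, not TFA's
| exact code):
|
|     import math
|
|     def update_rate(rate: float, dt: float, period: float,
|                     cost: float = 1.0) -> float:
|         alpha = math.exp(-dt / period)   # decay depends on the gap dt
|         instantaneous = cost / dt        # rate implied by this request
|         return alpha * rate + (1 - alpha) * instantaneous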
| ted_dunning wrote:
| Absolutely. You just need an integer fixed-point implementation
| for evenly spaced data.
|
| For irregularly spaced data, you need exp(-t), but that isn't
| hard to do with fixed point.
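|
| A minimal sketch of one crude way to do that (hypothetical names,
| Q16.16 fixed point, roughly 6% worst-case error): split t/ln(2)
| into integer and fractional parts, shift for the integer part, and
| approximate the rest linearly.
|
|     import math
|
|     ONE = 1 << 16                        # 1.0 in Q16.16
|     INV_LN2 = round(ONE / math.log(2))   # 1/ln(2) in Q16.16
|
|     def fixed_exp_neg(t: int) -> int:
|         # exp(-t) = 2 ** -(t / ln 2); t and result are Q16.16, t >= 0.
|         x = (t * INV_LN2) >> 16          # t / ln(2)
|         n, f = x >> 16, x & (ONE - 1)    # integer and fractional parts
|         frac = ONE - (f >> 1)            # 2**-f ~= 1 - f/2 on [0, 1)
|         return frac >> n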
| ted_dunning wrote:
| An often overlooked extension of simple exponential weighting is
| that if you take the difference of two or three exponential
| averages, you can get some very nice behavior.
|
| One particular behavior that may be desirable is that you can
| build a system that allows flexibly defined short bursts, but
| limits long-term request rates, which is really nice for chatty
| protocols where you might want to nearly ignore a burst of 100
| requests as long as the burst takes less than a few milliseconds,
| but clamp down hard if there is a single query per second over
| the last thirty seconds.
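|
| A minimal sketch of that dual-horizon idea (hypothetical names):
| keep two EWMAs and enforce the limit on the slow one, so a
| millisecond-scale burst barely moves it while a sustained rate
| does; their difference isolates the burst component.
|
|     import math
|
|     class DualRate:
|         def __init__(self, tau_fast: float = 0.005, tau_slow: float = 30.0):
|             self.tau_fast, self.tau_slow = tau_fast, tau_slow
|             self.fast = self.slow = 0.0
|             self.last = None
|
|         def hit(self, now: float, cost: float = 1.0) -> None:
|             dt = 1e9 if self.last is None else max(now - self.last, 1e-9)
|             self.last = now
|             inst = cost / dt
|             a_f = math.exp(-dt / self.tau_fast)
|             a_s = math.exp(-dt / self.tau_slow)
|             self.fast = a_f * self.fast + (1 - a_f) * inst
|             self.slow = a_s * self.slow + (1 - a_s) * inst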
|
| Differences in exponential decay can also be used to implement a
| good approximation of a Gaussian kernel of any desired size. In
| image processing, for instance, this allows a Gaussian blur whose
| cost doesn't depend on the scale of the blur (unlike a true
| convolution).
| socketcluster wrote:
| For WebSockets, using SocketCluster (https://socketcluster.io),
| it's possible to queue up all requests from the same client and
| then detect and respond to high backpressure spikes (e.g. by
| disconnecting the client and/or recording the incident).
|
| You can combine different approaches like limiting the number of
| connections from a single IP within a certain timeframe and also
| limiting the backpressure.
|
| The ability to detect and respond to backpressure spikes on a
| per-end-user basis is highly valuable because backpressure in
| SocketCluster takes into account the processing time of client
| requests.
|
| A common strategy that spammers use is to identify and invoke the
| most expensive endpoints in your system.
|
| Unfortunately, a lot of people still don't understand the value
| proposition of being able to process requests from clients in-
| stream and in-order. It's also good for preventing race
| conditions and makes your environment highly predictable.
|
| In terms of security, many of the worst, most expensive hacks in
| history were the result of asynchronous events exploiting
| unexpected race conditions on the server side. The crypto
| industry has been plagued with those since its inception.
|
| People seem to have gotten used to running chaotic, unpredictable
| systems, supporting impossibly complex permutations of concurrent
| events and trying to handle every possible edge case instead of
| constraining each stream to certain sequences.
|
| I don't understand the industry's obsession with type safety in
| the face of far greater problems like this. Maybe we just need a
| new catchphrase: Concurrency safety? Async safety?
|
| Queueing up message processing per-client doesn't add any
| noticeable delays from the user's perspective because most
| operations take a few milliseconds, which is unnoticeable once you
| factor in latency between client and server. It's only noticeable
| when you want it to be; for example when the user uploads a large
| chunk of data which requires heavy processing. You can also
| specify which streams can be parallel and which can be serial.
| deisteve wrote:
| I've written an implementation in PostgreSQL. You can use it like
| this:
|
|     SELECT * FROM check_rate_limit('client1', 1.0, 10.0,
|                                    INTERVAL '1 minute');
|
|     -- Create a table to store client rate limiting data
|     CREATE TABLE rate_limiter (
|         client_id        VARCHAR(255) PRIMARY KEY,
|         last_update_time TIMESTAMP WITH TIME ZONE NOT NULL,
|         average_rate     DOUBLE PRECISION NOT NULL
|     );
|
|     -- Function to check and update the rate limit
|     CREATE OR REPLACE FUNCTION check_rate_limit(
|         client_id   VARCHAR(255),
|         cost        DOUBLE PRECISION DEFAULT 1.0,
|         limit_value DOUBLE PRECISION DEFAULT 600.0,
|         period      INTERVAL DEFAULT INTERVAL '1 hour'
|     ) RETURNS TABLE (
|         allowed           BOOLEAN,
|         next_allowed_time TIMESTAMP WITH TIME ZONE
|     ) AS $$
|     DECLARE
|         -- "current_time" is reserved in SQL, so call it now_ts
|         now_ts             TIMESTAMP WITH TIME ZONE := NOW();
|         time_interval      DOUBLE PRECISION;
|         alpha              DOUBLE PRECISION;
|         instantaneous_rate DOUBLE PRECISION;
|         new_rate           DOUBLE PRECISION;
|         client_data        RECORD;
|     BEGIN
|         -- Get client data, or use defaults for a new client
|         SELECT * INTO client_data FROM rate_limiter
|             WHERE rate_limiter.client_id = check_rate_limit.client_id;
|         IF NOT FOUND THEN
|             client_data := (check_rate_limit.client_id,
|                             now_ts - period, 0.0)::rate_limiter;
|         END IF;
|         -- Elapsed time as a fraction of the smoothing period
|         time_interval := EXTRACT(EPOCH FROM (now_ts - client_data.last_update_time))
|                          / EXTRACT(EPOCH FROM period);
|         time_interval := GREATEST(time_interval, 1.0e-10);
|         -- Exponential smoothing weight
|         alpha := EXP(-time_interval);
|         -- Rate implied by this request alone
|         instantaneous_rate := cost / time_interval;
|         -- New average rate
|         new_rate := (1 - alpha) * instantaneous_rate
|                     + alpha * client_data.average_rate;
|         -- Ensure rare requests are counted in full
|         new_rate := GREATEST(new_rate, cost);
|         IF new_rate > limit_value THEN
|             -- Over the limit: report when the smoothed rate will
|             -- have decayed back down to the limit
|             next_allowed_time := now_ts
|                 + (period * LN(new_rate / limit_value));
|             allowed := FALSE;
|         ELSE
|             -- Allowed: persist the updated state
|             INSERT INTO rate_limiter
|                    (client_id, last_update_time, average_rate)
|                 VALUES (check_rate_limit.client_id, now_ts, new_rate)
|                 ON CONFLICT (client_id) DO UPDATE
|                     SET last_update_time = EXCLUDED.last_update_time,
|                         average_rate     = EXCLUDED.average_rate;
|             next_allowed_time := now_ts;
|             allowed := TRUE;
|         END IF;
|         RETURN NEXT;
|     END;
|     $$ LANGUAGE plpgsql;
|
| Give that a whirl.
___________________________________________________________________
(page generated 2024-09-12 23:01 UTC)