[HN Gopher] Exponential Rate Limiting
       ___________________________________________________________________
        
       Exponential Rate Limiting
        
       Author : mooreds
       Score  : 87 points
       Date   : 2024-09-09 14:47 UTC (3 days ago)
        
 (HTM) web link (dotat.at)
 (TXT) w3m dump (dotat.at)
        
       | hansvm wrote:
       | Ha, I wrote one of these a year ago. The problem being solved was
       | gracefully degrading a service if the request volume was too
       | high.
       | 
       | Some additional tweaks you might care about:
       | 
       | 1. It's easy to change the "numerator" being measured to
       | something continuous rather than discrete. As a rate limiter in
       | my application I found CPU seconds per second to be much more
       | useful than requests per second. If you have more control over
       | the low-level networking then queue depth or something might be
       | more applicable, but you work with what you have.
       | 
       | 2. It's easy to change the units of the configuration parameters.
       | I haven't checked if it's already in these units for example, but
       | TFA's "averaging period" can easily be re-cast as a half-life (in
       | microseconds or hours or whatever's most convenient).
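        | 
        | For example, a minimal sketch of that re-casting (illustrative
        | Python, not TFA's code):
        | 
        |     import math
        | 
        |     def decay_weight(dt, half_life):
        |         # Weight kept by the old average after a gap of dt:
        |         # after one half-life the old value counts for half.
        |         return 0.5 ** (dt / half_life)
        | 
        |     # Same as exp(-dt/period) with period = half_life / ln(2):
        |     assert math.isclose(
        |         decay_weight(3.0, 2.0),
        |         math.exp(-3.0 / (2.0 / math.log(2))))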
       | 
       | Anywho, this is a nice write-up on exponential smoothing. I like
       | the attention to edge cases affecting rate-limiting specifically.
       | 
       | As a bit of a fun aside, the state-update equation here is an
       | Euler update solving the exponential ODE, fusing in the new
       | knowledge you have (new requests) as instantaneous tweaks. For
       | rate-limiting in particular the exponential is probably a good
       | choice (numerically stable, hardware instructions, works even
       | when time gaps or other features are far outside of your intended
       | configuration space, ...), but in other domains, like memoryless
       | (still "stateful", obviously, but only requiring an amalgamation
       | condensed into the very last state, as opposed to some O(n)
       | thing) animation updates, you might prefer other properties like
       | the ability to decay to exactly 0 in finite time. You can also
       | recover leaky buckets and other similar algorithms with the right
       | choice of ODE.
        
         | brightball wrote:
         | > As a rate limiter in my application I found CPU seconds per
         | second to be much more useful than requests per second.
         | 
         | This is really insightful and makes good sense. I've toyed for
         | years with the idea of building a "Cost Benefit Cache" that
         | basically tracked the request volume for a piece of data, the
         | time to retrieve that data and the size of the data retrieved
         | to make decisions about what to cache during system stress.
         | 
          | The idea was basically a cost-benefit formula where the cost
          | is the size of the data (how much of the available cache
          | space it would use), while the benefit is the seconds of
          | savings: requests * average retrieval time. Requests per
          | second alone
         | doesn't really matter if the request is low/no impact.
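          | 
          | A sketch of that scoring formula (illustrative Python, names
          | invented):
          | 
          |     def cache_score(requests_per_sec: float,
          |                     retrieval_secs: float,
          |                     size_bytes: int) -> float:
          |         # Benefit: retrieval seconds saved per second.
          |         # Cost: cache space the entry would occupy.
          |         benefit = requests_per_sec * retrieval_secs
          |         return benefit / max(size_bytes, 1)
          | 
          | Under stress you'd keep the highest-scoring entries, so a
          | high-volume but cheap request never crowds out an expensive
          | one.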
        
           | CBLT wrote:
           | Google had a library like that, and one of the foot guns is
           | that requests that are malformed or otherwise do 0 work have
           | incredibly low cost, which results in them being prioritized
           | above all other requests. Every user had to configure some
           | obscure options to set cost multipliers based on return code
           | to try and correct for this, which felt like a pretty
           | fundamental bug that was being hacked around.
           | 
           | Edit: this was actually for cost-based load balancing, not
           | load shedding. You obviously can't load shed based on return
           | code.
        
             | travisjungroth wrote:
             | Interesting that "if the request is bad or there's an
             | error" is a difficult condition to implement universally.
        
               | jpollock wrote:
               | When you've got 80,000 engineers you end up with lots of
               | different opinions on what the wire protocol should look
               | like.
               | 
                | * Sometimes, you want to fail closed - an error is
                | returned for every failure.
                | 
                | * Sometimes, you want to fail open - OK is returned
                | even on errors.
                | 
                | * Sometimes, you want to separate "failure" from
                | "denied": the code "worked", but it didn't do what the
                | client wanted - think "payment declined" - so you
                | return "OK", with the actual result in the response.
               | 
               | Compare the different costs to charge a credit card that
               | is expired vs one that works.
               | 
               | * The expired one can be checked before calling out to
               | the processor - cheap, returns "OK + Declined".
               | 
               | * The success ends up with a call to the processor and
               | takes a second or two, returns "OK + Success".
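                | 
                | A tiny sketch of the "OK + Declined" shape
                | (hypothetical Python, names invented):
                | 
                |     from dataclasses import dataclass
                | 
                |     @dataclass
                |     class Card:
                |         expired: bool
                | 
                |     def charge(card: Card) -> dict:
                |         if card.expired:
                |             # Cheap local check; no processor call.
                |             return {"http": 200, "result": "declined"}
                |         # Real charge: a remote call taking a second
                |         # or two (stubbed out here).
                |         return {"http": 200, "result": "success"}
                | 
                | Cost accounting that looks only at the transport
                | status can't tell these apart, which is the footgun
                | mentioned upthread.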
        
               | hansvm wrote:
               | Some things are really hard to appropriately abstract,
               | but there's so much similarity we can't help but be
               | tempted to try.
               | 
                | Something that I wrote at $WORK recently was a data-
               | oriented exponentially smoothed reservoir sampling order
               | statistic randomized splay tree.
               | 
               | All of those components exist individually. Even picking
               | on two of them though; given a tree implementation it's
               | trivial (as a human) to create an analogous order-
               | statistic tree. In any pseudo-mainstream programming
               | language though, especially at a system level, how do you
               | tack on the "order-statistic" behavior to an arbitrary
               | tree?
               | 
                | It's not too bad if you additionally require some interface
               | the primary tree needs to adhere to. That interface is,
               | necessarily, biased toward the needs of an order-
               | statistic tree though. There isn't actually that much
               | difference between implementing a tree which adheres to
               | the interface and one which just does every operation
               | manually. Plus, there are some allocation questions and
               | other inefficiencies (or, alternatively, unsoundness)
               | that level of abstraction introduces which you can avoid
               | by writing the whole thing from scratch.
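                | 
                | For the unfamiliar, a minimal sketch of what "tacking
                | on" order-statistic behavior means (illustrative
                | Python, not the $WORK code): augment each node with a
                | subtree size, then select by rank.
                | 
                |     class Node:
                |         def __init__(self, key):
                |             self.key, self.size = key, 1
                |             self.left = self.right = None
                | 
                |     def size(n):
                |         return n.size if n else 0
                | 
                |     def insert(n, key):
                |         if n is None:
                |             return Node(key)
                |         if key < n.key:
                |             n.left = insert(n.left, key)
                |         else:
                |             n.right = insert(n.right, key)
                |         # The augmentation: every mutating operation
                |         # must keep this field up to date.
                |         n.size = 1 + size(n.left) + size(n.right)
                |         return n
                | 
                |     def select(n, k):
                |         # k-th smallest key (0-based) under n.
                |         while n:
                |             left = size(n.left)
                |             if k < left:
                |                 n = n.left
                |             elif k == left:
                |                 return n.key
                |             else:
                |                 k, n = k - left - 1, n.right
                |         raise IndexError(k)
                | 
                | Trivial by hand; the hard part is doing it to an
                | arbitrary tree without rewriting its operations.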
               | 
               | Bringing it back to the $GOOG library, the problem is
               | that you have a lot of code which _looks_ like it's the
               | same, but it's impossible to make it actually _be_ the
               | same without heavy lifting for callers of that library.
               | 
               | Taking those monolithic frameworks (e.g., inadvisedly
               | popular rate-limiting shenanigans) and turning them into
               | libraries of helpful utilities often helps. General
               | solutions are harder to come by.
        
         | jeffbee wrote:
         | CPU rate is only quasi-continuous. The fact of a thread running
          | or not running is binary. You cannot be 63% on a CPU; that
          | is an illusion of sampling and integration.
         | 
         | But the real trouble with CPU rate is that it represents the
         | past, not the future. It tells you what happened to the last
         | request and nothing about what will happen to the next request.
         | A signal with actual predictive power is how many other
         | requests are running or waiting to run.
        
           | hansvm wrote:
           | Like everything, the devil's in the details.
           | 
           | In our case, non-idle cumulative thread CPU time (including
           | system activity) is a nice metric. If it exceeds 70%, start
           | sending some 503s (or 500s depending on the client). A single
           | core can easily service 100B requests per day, a few times
            | more than we needed, so if we exceeded that by enough
           | margin to saturate every core to 70% then the right course of
           | action was some combination of (1) scale up, and (2) notify
           | engineering that a moderately important design assumption was
           | violated. We could still service 100B requests per day per
           | core, but the excess resulted in a few 5xx errors for a
           | minute while autoscaling kicked in. It's not good enough for
           | every service, but it was absolutely what we needed and leaps
           | and bounds better than what we had before, with a side-
           | benefit of being "obviously" correct (simple code causes
           | fewer unexpected bugs).
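            | 
            | A minimal sketch of that scheme (illustrative Python;
            | per-process rather than per-thread, thresholds invented):
            | 
            |     import math, time
            | 
            |     class CpuLoadShedder:
            |         def __init__(self, tau=5.0, threshold=0.70):
            |             self.tau, self.threshold = tau, threshold
            |             self.util = 0.0
            |             self.last_wall = time.monotonic()
            |             self.last_cpu = time.process_time()  # user+sys
            | 
            |         def should_shed(self):
            |             # Smoothed CPU seconds consumed per wall second.
            |             now, cpu = time.monotonic(), time.process_time()
            |             dt = max(now - self.last_wall, 1e-9)
            |             inst = (cpu - self.last_cpu) / dt
            |             alpha = math.exp(-dt / self.tau)
            |             self.util = alpha * self.util + (1 - alpha) * inst
            |             self.last_wall, self.last_cpu = now, cpu
            |             return self.util > self.threshold  # 503 if True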
           | 
            | The core idea (tweaking the numerator in exponential
            | smoothing) isn't tied to any one easily refuted definition
            | of CPU time, though. It's that you're able to tailor the
            | algorithm to your
           | individual needs. A related metric is just the wall-clock
           | time from the beginning of a request being queued till when
           | it was serviced (depending on your needs, maybe compared to
           | the actual processing time of said request). If you ever get
           | to total_time >= 2 * processing_time, that likely indicates a
           | queue about to exceed a critical threshold.
           | 
           | > how many other requests are running or waiting to run
           | 
           | I referenced that idea pretty explicitly: "If you have more
           | control over the low-level networking then queue depth or
           | something might be more applicable, but you work with what
           | you have."
           | 
           | Yes, that matters. That doesn't make other approaches
           | invalid. Even then, you usually want a smoothed version of
           | that stat for automated decision-making.
        
       | alphazard wrote:
       | This is like a capacitor discharging. I had always considered
       | this a variant of token bucket with exponential instead of linear
       | decay.
        
         | fanf2 wrote:
          | The main difference from my point of view is that the EWMA
          | produces a measurement of the rate, whereas a token bucket
          | only produces the result of a comparison with the limit. If
          | you don't need the actual value of the rate then you can
          | save a lot of time and space by using GCRA instead:
          | https://dotat.at/@/2024-08-30-gcra.html
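          | 
          | For reference, a GCRA sketch (Python, parameter names
          | invented; not the code from the linked post): it stores one
          | timestamp per client and never computes a rate.
          | 
          |     import time
          | 
          |     class GCRA:
          |         def __init__(self, rate_per_sec, burst):
          |             self.t = 1.0 / rate_per_sec      # emission interval
          |             self.tau = self.t * (burst - 1)  # burst tolerance
          |             self.tat = {}  # theoretical arrival time per client
          | 
          |         def allow(self, client):
          |             now = time.monotonic()
          |             tat = max(self.tat.get(client, now), now)
          |             if tat - now > self.tau:
          |                 return False  # too early: denied, no update
          |             self.tat[client] = tat + self.t
          |             return True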
        
           | ted_dunning wrote:
           | If doing a small amount of arithmetic and keeping 4 bytes (or
           | less) additional memory per client is "a lot of time and
           | space", you may need to rethink your priorities.
           | 
            | It is very, very unusual for any kind of request to
            | anything to complete within even a few orders of magnitude
            | of the time the exponential-averaging arithmetic takes.
            | The difference between a GCRA implementation and a complex
            | multi-exponential average will be nanoseconds.
        
       | taneq wrote:
       | We were just this week talking at work about all the many and
       | varied names that people make up for "weighted average between
       | smoothed value and new sample". :D
        
       | tlarkworthy wrote:
        | I wrote an interactive notebook visualization and a fast-
        | convergence optimization for one of these:
       | 
       | https://observablehq.com/@tomlarkworthy/rate-estimation
        
       | o11c wrote:
       | Hm, I really need to study this more. The early claim that it
       | requires "complicated float arithmetic" took me by surprise. I'm
       | used to fixed-time-period sampling, so you can just do:
        | 
        |     // Textbook version:
        |     // average = alpha * new_value + (1 - alpha) * average
        |     // Which is equivalent to:
        |     // average += (new_value - average) * alpha
        |     // To get rid of the float, define w = 1/alpha and make it
        |     // an integer. If w is a power of 2 (try 8), this is just
        |     // a shift.
        |     average += (new_value - average) / w
        
         | fanf2 wrote:
         | Yeah, it's easy when alpha is fixed. The complications -- the
         | exp() and ln() -- come from updating the smoothed rate at
         | varying intervals, whenever a request happens to arrive, so
         | alpha has to be adjusted according to the interval to make sure
         | the exponential decay happens at a fixed speed regardless of
         | the interval.
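          | 
          | In code, that adjustment looks something like this (a Python
          | sketch; it mirrors the update in the SQL version downthread):
          | 
          |     import math
          | 
          |     def update_rate(rate, last, now, cost, period):
          |         # Gap since the previous request, in periods.
          |         dt = max((now - last) / period, 1e-10)
          |         alpha = math.exp(-dt)   # decay scales with the gap
          |         inst = cost / dt        # instantaneous rate estimate
          |         return (1 - alpha) * inst + alpha * rate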
        
         | ted_dunning wrote:
          | Absolutely. You just need an integer fixed-point
          | implementation for evenly spaced data.
         | 
         | For irregularly spaced data, you need exp(-t), but that isn't
         | hard to do with fixed point.
        
       | ted_dunning wrote:
       | An often overlooked extension of simple exponential weighting is
       | that if you take the difference of two or three exponential
       | averages, you can get some very nice behavior.
       | 
       | One particular behavior that may be desirable is that you can
       | build a system that allows flexibly defined short bursts, but
        | limits long-term request rates, which is really nice for chatty
       | protocols where you might want to nearly ignore a burst of 100
       | requests as long as the burst takes less than a few milliseconds,
       | but clamp down hard if there is a single query per second over
       | the last thirty seconds.
       | 
       | Differences in exponential decay can also be used to implement a
       | good approximation of a Gaussian kernel of any desired size. In
       | image processing, for instance, this allows a Gaussian blur whose
       | cost doesn't depend on the scale of the blur (unlike a true
       | convolution).
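        | 
        | One concrete form of the difference trick (a sketch; time
        | constants and scaling chosen for illustration):
        | 
        |     import math
        | 
        |     class DualEWMA:
        |         def __init__(self, fast_tau=0.005, slow_tau=30.0):
        |             self.ft, self.st = fast_tau, slow_tau
        |             self.fast = self.slow = 0.0
        |             self.last = None
        | 
        |         def update(self, now, cost=1.0):
        |             dt = (1e-9 if self.last is None
        |                   else max(now - self.last, 1e-9))
        |             self.last = now
        |             inst = cost / dt
        |             a_f = math.exp(-dt / self.ft)
        |             a_s = math.exp(-dt / self.st)
        |             self.fast = a_f * self.fast + (1 - a_f) * inst
        |             self.slow = a_s * self.slow + (1 - a_s) * inst
        |             # A single event bumps each average by ~cost/tau,
        |             # so this scaled difference stays ~0 for a fresh
        |             # burst but converges to the true rate when the
        |             # traffic is sustained.
        |             return ((self.st * self.slow - self.ft * self.fast)
        |                     / (self.st - self.ft))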
        
       | socketcluster wrote:
       | For WebSockets, using SocketCluster (https://socketcluster.io),
       | it's possible to queue up all requests from the same client and
       | then detect and respond to high backpressure spikes (e.g. by
       | disconnecting the client and/or recording the incident).
       | 
       | You can combine different approaches like limiting the number of
       | connections from a single IP within a certain timeframe and also
       | limiting the backpressure.
       | 
       | The ability to detect and respond to backpressure spikes on a
       | per-end-user basis is highly valuable because backpressure in
       | SocketCluster takes into account the processing time of client
       | requests.
       | 
       | A common strategy that spammers use is to identify and invoke the
       | most expensive endpoints in your system.
       | 
       | Unfortunately, a lot of people still don't understand the value
       | proposition of being able to process requests from clients in-
       | stream and in-order. It's also good for preventing race
       | conditions and makes your environment highly predictable.
       | 
       | In terms of security, many of the worst, most expensive hacks in
       | history were the result of asynchronous events exploiting
       | unexpected race conditions on the server side. The crypto
       | industry has been plagued with those since its inception.
       | 
       | People seem to have gotten used to running chaotic, unpredictable
       | systems, supporting impossibly complex permutations of concurrent
        | events and trying to handle every possible edge case instead of
       | constraining each stream to certain sequences.
       | 
       | I don't understand the industry's obsession with type safety in
       | the face of far greater problems like this. Maybe we just need a
       | new catchphrase: Concurrency safety? Async safety?
       | 
       | Queueing up message processing per-client doesn't add any
       | noticeable delays from the user's perspective because most
        | operations take a few milliseconds, which is unnoticeable once you
       | factor in latency between client and server. It's only noticeable
       | when you want it to be; for example when the user uploads a large
       | chunk of data which requires heavy processing. You can also
       | specify which streams can be parallel and which can be serial.
        
       | deisteve wrote:
        | I've written an implementation in PostgreSQL.
        | 
        | You can use it like this:
        | 
        |     SELECT * FROM check_rate_limit('client1', 1.0, 10.0,
        |                                    INTERVAL '1 minute');
        | 
        |     -- Create a table to store client rate limiting data
        |     CREATE TABLE rate_limiter (
        |         client_id        VARCHAR(255) PRIMARY KEY,
        |         last_update_time TIMESTAMP WITH TIME ZONE NOT NULL,
        |         average_rate     DOUBLE PRECISION NOT NULL
        |     );
        | 
        |     -- Function to check and update the rate limit
        |     CREATE OR REPLACE FUNCTION check_rate_limit(
        |         client_id   VARCHAR(255),
        |         cost        DOUBLE PRECISION DEFAULT 1.0,
        |         limit_value DOUBLE PRECISION DEFAULT 600.0,
        |         period      INTERVAL DEFAULT INTERVAL '1 hour'
        |     ) RETURNS TABLE (
        |         allowed           BOOLEAN,
        |         next_allowed_time TIMESTAMP WITH TIME ZONE
        |     ) AS $$
        |     DECLARE
        |         -- "current_time" is reserved in PostgreSQL, so use v_now
        |         v_now              TIMESTAMP WITH TIME ZONE := NOW();
        |         time_interval      DOUBLE PRECISION;
        |         alpha              DOUBLE PRECISION;
        |         instantaneous_rate DOUBLE PRECISION;
        |         new_rate           DOUBLE PRECISION;
        |         client_data        RECORD;
        |     BEGIN
        |         -- Get client data, or use defaults if it doesn't exist
        |         SELECT * INTO client_data
        |         FROM rate_limiter
        |         WHERE rate_limiter.client_id = check_rate_limit.client_id;
        |         IF NOT FOUND THEN
        |             client_data :=
        |                 ROW(client_id, v_now - period, 0.0)::rate_limiter;
        |         END IF;
        | 
        |         -- Interval since the last update, in units of the period
        |         time_interval :=
        |             EXTRACT(EPOCH FROM (v_now - client_data.last_update_time))
        |             / EXTRACT(EPOCH FROM period);
        |         time_interval := GREATEST(time_interval, 1.0e-10);
        | 
        |         -- Exponential smoothing weight
        |         alpha := EXP(-time_interval);
        | 
        |         -- Instantaneous rate represented by this request
        |         instantaneous_rate := cost / time_interval;
        | 
        |         -- New average rate
        |         new_rate := (1 - alpha) * instantaneous_rate
        |                     + alpha * client_data.average_rate;
        | 
        |         -- Ensure rare requests are counted in full
        |         new_rate := GREATEST(new_rate, cost);
        | 
        |         IF new_rate > limit_value THEN
        |             -- Rate limit exceeded: compute the next allowed time
        |             next_allowed_time :=
        |                 v_now + (period * LN(new_rate / limit_value));
        |             allowed := FALSE;
        |         ELSE
        |             -- Within the limit: persist the updated state
        |             INSERT INTO rate_limiter
        |                 (client_id, last_update_time, average_rate)
        |             VALUES (client_id, v_now, new_rate)
        |             ON CONFLICT (client_id) DO UPDATE
        |             SET last_update_time = EXCLUDED.last_update_time,
        |                 average_rate     = EXCLUDED.average_rate;
        |             next_allowed_time := v_now;
        |             allowed := TRUE;
        |         END IF;
        |         RETURN NEXT;
        |     END;
        |     $$ LANGUAGE plpgsql;
        | 
        | Give that a whirl.
        
       ___________________________________________________________________
       (page generated 2024-09-12 23:01 UTC)