[HN Gopher] We scaled the GitHub API with a sharded, replicated ...
       ___________________________________________________________________
        
       We scaled the GitHub API with a sharded, replicated rate limiter in
       Redis
        
       Author : prakhargurunani
       Score  : 118 points
       Date   : 2021-04-08 13:26 UTC (9 hours ago)
        
 (HTM) web link (github.blog)
 (TXT) w3m dump (github.blog)
        
       | tayloramurphy wrote:
       | I originally thought this article was going to be about John
       | Berryman's proposed Redis rate limiter [0]
       | 
       | [0] http://blog.jnbrymn.com/2021/03/18/estimated-average-
       | recent-...
        
       | gigatexal wrote:
        | We had a saying at my old job: if something's broken, it's never
        | Redis. Redis is such a tank in my experience. We set it up.
        | Secured it. And then forgot about it.
        
         | spullara wrote:
         | At Twitter we hit 15s memory allocation pause times due to
         | fragmentation. We had to switch the memory allocator to fix it.
        
           | hrpnk wrote:
           | Which memory allocator did you end up using?
        
             | spullara wrote:
             | I think we settled on jemalloc but my memory may be failing
             | me.
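              | 
              | (If anyone wants to experiment: Redis's Makefile takes an
              | allocator flag, and jemalloc has been the default allocator
              | on Linux builds for a long time.)
              | 
              |     make MALLOC=jemalloc   # or MALLOC=libc to compare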
        
         | rshaw1 wrote:
         | I thought the same up until yesterday!
         | 
          | Yesterday the replica randomly disconnected from the master and
          | could no longer reconnect; the resyncs were failing because a
          | replication buffer limit on the master was being hit
          | (https://redislabs.com/blog/top-redis-headaches-for-devops-
          | re...). Once that limit was raised, the replica was able to sync
          | the snapshot from the master, but for some reason it took a very
          | long time to load it. During that time the sentinels thought the
          | replica was healthy again and started allowing the application
          | to read from it, and of course Redis responded with a "dataset
          | loading" error. We were running Redis 6.0.8, and upgrading the
          | replica to 6.2.1 allowed it to sync and become healthy in
          | seconds.
         | 
          | I'm still not sure why the sentinels thought the replica was
          | healthy, since issuing commands to it always returned an error.
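          | 
          | For anyone who hits the same thing, the knob involved is the
          | replica output buffer limit; the redis.conf defaults (from
          | memory, so double-check) look roughly like:
          | 
          |     # class, hard limit, soft limit, soft limit seconds
          |     client-output-buffer-limit replica 256mb 64mb 60
          |     # a large snapshot transfer during a full resync can blow
          |     # past these and get the replica link dropped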
        
           | [deleted]
        
           | gigatexal wrote:
           | Whoa! You should blog about this for posterity for the next
           | hopeless soul who has to figure out why the thing that never
           | dies died.
        
         | junon wrote:
         | Absolutely. Of course I plan for what happens if Redis were to
         | fail, but I don't remember a single time it ever did. It
         | really, truly is a tank.
         | 
         | Antirez, if you're reading this (I know you'll eventually find
         | your way here), thanks for making my job a little easier over
         | the years <3
        
         | lefrancaiz wrote:
         | >The sharding project bought us some time regarding database
         | capacity, but as we soon found out, there was a huge single
         | point of failure in our infrastructure. All those shards were
         | still using a single Redis. At one point, the outage of that
         | Redis took down all of Shopify, causing a major disruption we
         | later called "Redismageddon". This taught us an important
         | lesson to avoid any resources that are shared across all of
         | Shopify.
         | 
         | It seems that it does happen sometimes, however.
         | 
         | https://shopify.engineering/e-commerce-at-scale-inside-shopi...
        
       | [deleted]
        
       | junon wrote:
        | This is strange to me. Did GitHub do client-based sharding
        | because they were trying to get around the upfront key
        | enumeration requirement in Lua scripts? Why didn't they use the
        | cluster's ability to route requests to the appropriate shard?
        | 
        | As-is, they could have just passed `rate_limit_key+':exp'` as a
        | second KEYS entry, and that would have ensured the key was
        | declared for the operation. They were deriving keys from a priori
        | information, so they could just as easily have forgone the
        | client-side complexity and put the Redis cluster in a sharded
        | configuration.
        | 
        | I wonder what sort of performance impact this had (the page
        | doesn't mention it). Client-side sharding almost certainly
        | increased the codebase complexity, and it doesn't seem like they
        | measured any real impact from doing it this way (or maybe they
        | just chose not to report it).
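        | 
        | (For reference, hash tags are what make that work: Cluster
        | hashes only the part inside the braces, so both keys land in the
        | same slot and an ordinary cluster client can route the call
        | itself. Key names below are invented, not GitHub's.)
        | 
        |     # both keys hash on "api:12345", so they share a slot;
        |     # <sha> is whatever SCRIPT LOAD returned for the limiter
        |     EVALSHA <sha> 2 rate:{api:12345} rate:{api:12345}:exp 3600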
        
         | tlhunter wrote:
          | I was concerned about the missing KEYS entry, too.
          | 
          | I believe the script would fail if the two keys ended up on
          | different machines, assuming both key names were provided via
          | KEYS, though the {} hash-tag syntax should have avoided that.
          | Generating the key name inside the Lua script forces the two
          | related keys onto the same machine.
          | 
          | At the end of the day, calculating the target server
          | client-side isn't necessarily messy. But the Lua script surely
          | deserves a warning comment about the key name being generated
          | within the script.
          | 
          | I think the ideal script would use two keys, the first
          | something like `foo-{1234}` and the second `foo-{1234}:exp`,
          | with both key names provided via KEYS. Then native Redis
          | clustering should just work.
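          | 
          | A rough sketch of what that two-KEYS version could look like
          | (untested, fixed-window style, not the article's actual code):
          | 
          |     -- KEYS[1] = counter, e.g. "foo-{1234}"
          |     -- KEYS[2] = window reset time, e.g. "foo-{1234}:exp"
          |     -- ARGV[1] = window seconds, ARGV[2] = current unix time
          |     local count = redis.call('INCR', KEYS[1])
          |     local reset = tonumber(redis.call('GET', KEYS[2]) or 0)
          |     if reset == 0 or tonumber(ARGV[2]) >= reset then
          |       -- window elapsed (or first request): start a fresh one
          |       reset = tonumber(ARGV[2]) + tonumber(ARGV[1])
          |       redis.call('SET', KEYS[1], 1)
          |       redis.call('SET', KEYS[2], reset)
          |       redis.call('EXPIRE', KEYS[1], ARGV[1])
          |       redis.call('EXPIRE', KEYS[2], ARGV[1])
          |       count = 1
          |     end
          |     -- caller compares count to its limit; reset feeds headers
          |     return { count, reset }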
        
       | fasteo wrote:
       | Related: mailgun's gubernator[1]. No redis.
       | 
       | [1] https://www.mailgun.com/blog/gubernator-cloud-native-
       | distrib...
        
       | rattray wrote:
        | The article mentions they took some inspiration from a Stripe
        | blog post/gist; for convenience, here's the direct link to the
        | relevant Lua code (it helps to compare what is interesting/unique
        | about GitHub's approach):
        | 
        | https://gist.github.com/ptarjan/e38f45f2dfe601419ca3af937fff...
        | 
        | (Disclaimer: I worked on the rate limiter at Stripe a bit, but I
        | can't remember how similar the 2019-era code was to what you see
        | there; I think it was broadly similar.)
        
         | rattray wrote:
          | Ah, it turns out my former colleague @brandur, who worked on
          | rate limiting at Stripe for some time after the blog post was
          | written, has a Redis rate-limiting module published here:
          | https://github.com/brandur/redis-cell/
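          | 
          | From memory of its README (so double-check me), it exposes a
          | single command implementing GCRA, roughly:
          | 
          |     # key, max burst, count per period, period (s), quantity
          |     CL.THROTTLE user123:api 15 30 60 1
          |     # reply: five integers - limited? (0/1), total limit,
          |     # remaining, retry-after (s), reset-after (s)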
        
           | Operyl wrote:
            | Ah, that's great! I was just looking down this path the other
            | day for a project I have coming up soon. This'll save a good
            | day's worth of work :).
        
         | uyt wrote:
          | I also found https://cloud.google.com/solutions/rate-limiting-
          | strategies-... to be useful. In particular, at the very bottom
          | they link to a lot of other blog posts about rate limiting,
          | such as:
          | https://www.figma.com/blog/an-alternative-approach-to-rate-l...
         | 
         | I am only reading about them because for whatever reason this
         | has become a trendy "system design" interview question.
        
           | hrpnk wrote:
            | It's a task whose requirements any engineer can understand
            | quickly and easily. It also offers great depth in terms of
            | approaches, technology selection, and the data volumes to be
            | analyzed.
        
       ___________________________________________________________________
       (page generated 2021-04-08 23:01 UTC)