[HN Gopher] We scaled the GitHub API with a sharded, replicated ...
___________________________________________________________________
We scaled the GitHub API with a sharded, replicated rate limiter in
Redis
Author : prakhargurunani
Score : 118 points
Date : 2021-04-08 13:26 UTC (9 hours ago)
(HTM) web link (github.blog)
(TXT) w3m dump (github.blog)
| tayloramurphy wrote:
| I originally thought this article was going to be about John
| Berryman's proposed Redis rate limiter [0]
|
| [0] http://blog.jnbrymn.com/2021/03/18/estimated-average-
| recent-...
| gigatexal wrote:
| We had a saying at my old job: if something's broken it's never
| Redis. Redis is such a tank in my experience. We set it up.
| Secured it. And then forgot about it.
| spullara wrote:
| At Twitter we hit 15s memory allocation pause times due to
| fragmentation. We had to switch the memory allocator to fix it.
| hrpnk wrote:
| Which memory allocator did you end up using?
| spullara wrote:
| I think we settled on jemalloc but my memory may be failing
| me.
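| (For reference, INFO MEMORY reports which allocator a running
| instance was built with and its current fragmentation ratio; a
| minimal redis-py sketch, with an illustrative host and port:)
|
|     import redis  # assumes the redis-py client is installed
|
|     r = redis.Redis(host="localhost", port=6379)  # illustrative
|     mem = r.info("memory")
|     # mem_allocator is e.g. "jemalloc-5.1.0" or "libc"; a
|     # mem_fragmentation_ratio well above 1.0 points at
|     # fragmentation rather than real data growth
|     print(mem["mem_allocator"], mem["mem_fragmentation_ratio"])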
| rshaw1 wrote:
| I thought the same up until yesterday!
|
| Yesterday the replica randomly disconnected from the master and
| could no longer reconnect; the resyncs were failing because a
| replication buffer limit on the master was being hit
| (https://redislabs.com/blog/top-redis-headaches-for-devops-
| re...). Once that buffer was increased, the replica was able to
| sync the snapshot from the master, but for some reason it was
| taking a very long time to load it. During that time the
| sentinels thought the replica was healthy again and started
| allowing the application to read from it, and of course Redis
| responded with a "dataset loading" error. We were running Redis
| 6.0.8, and upgrading the replica to 6.2.1 allowed it to sync
| and become healthy in seconds.
|
| I'm still not sure why the sentinels thought the replica was
| healthy, as issuing commands to it always returned an error.
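| (For reference, the limit involved here is presumably
| client-output-buffer-limit for the replica class; a minimal
| redis-py sketch of raising it at runtime, with purely
| illustrative sizes:)
|
|     import redis  # assumes the redis-py client is installed
|
|     # host/port are illustrative
|     master = redis.Redis(host="redis-master", port=6379)
|     # hard 512MB, soft 128MB sustained for 120s (values are
|     # illustrative); the class is "replica" on Redis 5+,
|     # "slave" on older versions
|     master.config_set("client-output-buffer-limit",
|                       "replica 512mb 128mb 120")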
| [deleted]
| gigatexal wrote:
| Whoa! You should blog about this for posterity for the next
| hopeless soul who has to figure out why the thing that never
| dies died.
| junon wrote:
| Absolutely. Of course I plan for what happens if Redis were to
| fail, but I don't remember a single time it ever did. It
| really, truly is a tank.
|
| Antirez, if you're reading this (I know you'll eventually find
| your way here), thanks for making my job a little easier over
| the years <3
| lefrancaiz wrote:
| >The sharding project bought us some time regarding database
| capacity, but as we soon found out, there was a huge single
| point of failure in our infrastructure. All those shards were
| still using a single Redis. At one point, the outage of that
| Redis took down all of Shopify, causing a major disruption we
| later called "Redismageddon". This taught us an important
| lesson to avoid any resources that are shared across all of
| Shopify.
|
| It seems that it does happen sometimes, however.
|
| https://shopify.engineering/e-commerce-at-scale-inside-shopi...
| [deleted]
| junon wrote:
| This is strange to me. Did GitHub do client-based sharding
| because they were trying to get around the upfront key
| enumeration limitation in Lua scripts? Why didn't they use the
| cluster's ability to proxy requests to the appropriate sharded
| worker?
|
| As-is, they could have just passed `rate_limit_key+':exp'` as a
| second KEYS entry, which would have ensured the key was declared
| for the operation. They were deriving keys from a priori
| information, so they could just as easily have forgone the
| client-side complexity and put the Redis cluster in a sharded
| configuration.
|
| I wonder what sort of performance impact this had (the page
| doesn't mention it). Client-side sharding almost certainly
| increased the codebase complexity, and it doesn't seem like they
| measured any real impact from doing it this way (or maybe they
| just chose not to report it).
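| (For what it's worth, the {...} hash-tag syntax is enough to
| pin both keys to one slot; a small redis-py sketch with made-up
| key names, checking that CLUSTER KEYSLOT agrees for the pair:)
|
|     import redis  # assumes redis-py and a cluster-enabled node
|
|     r = redis.Redis(host="localhost", port=7000)  # illustrative
|     key = "rate:{user42}"     # hypothetical rate-limit key
|     exp_key = key + ":exp"    # its companion expiry key
|     # only the text inside {...} is hashed, so both keys map
|     # to the same slot and can be touched by one script call
|     slot_a = r.execute_command("CLUSTER", "KEYSLOT", key)
|     slot_b = r.execute_command("CLUSTER", "KEYSLOT", exp_key)
|     assert slot_a == slot_b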
| tlhunter wrote:
| I was concerned about the missing KEYS entry, too.
|
| I believe the script would fail if the two keys were on
| different machines, assuming both key names were provided as
| KEYS, though the {} syntax should have avoided that. By
| generating the key name in a Lua script it forces the two
| related keys to be on the same machine.
|
| At the end of the day, calculating the target server on the
| client side isn't necessarily messy. Surely the Lua script
| deserves a warning about the key name being generated within
| the script.
|
| I think the ideal script would use two keys, the first being
| like `foo-{1234}`, and the second `foo-{1234}:exp`, with both
| key names being provided via KEYS. Then the native Redis
| clustering should work.
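| (Roughly what that might look like, as a sketch only: a
| simplified fixed-window counter rather than GitHub's actual
| script, called via redis-py with both hash-tagged keys passed
| through KEYS:)
|
|     import redis  # assumes the redis-py client is installed
|
|     # KEYS[1] holds the counter, KEYS[2] the expiry bookkeeping;
|     # both are declared up front so cluster routing can validate
|     # them instead of the script deriving the second name itself
|     LUA = """
|     local count = redis.call('INCR', KEYS[1])
|     if count == 1 then
|       redis.call('SET', KEYS[2], ARGV[1])
|       redis.call('EXPIRE', KEYS[1], ARGV[1])
|     end
|     return count
|     """
|
|     r = redis.Redis(host="localhost", port=6379)  # illustrative
|     script = r.register_script(LUA)
|     # the {1234} hash tag keeps both keys in the same slot
|     count = script(keys=["foo-{1234}", "foo-{1234}:exp"],
|                    args=[60])
|     print(count)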
| fasteo wrote:
| Related: Mailgun's gubernator[1]. No Redis.
|
| [1] https://www.mailgun.com/blog/gubernator-cloud-native-
| distrib...
| rattray wrote:
| The article mentions they took some inspiration from a Stripe
| blog post/gist; for convenience, here's the direct link to the
| relevant Lua code (it helps to compare what is interesting/unique
| about GitHub's approach):
|
| https://gist.github.com/ptarjan/e38f45f2dfe601419ca3af937fff...
|
| (Disclaimer: I worked on the rate limiter at Stripe a bit, but I
| can't remember how similar the 2019-era code was to what you see
| there; I think it's broadly similar.)
| rattray wrote:
| Ah, it turns out my former colleague @brandur, who worked on
| rate limiting at Stripe for some time after the blog post was
| written, has a Redis rate-limiting module published here:
| https://github.com/brandur/redis-cell/
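| (If it's useful to anyone: redis-cell exposes the limiter as a
| single CL.THROTTLE command, GCRA under the hood; a quick
| redis-py sketch with made-up parameters:)
|
|     import redis  # assumes redis-py plus the redis-cell module
|
|     r = redis.Redis(host="localhost", port=6379)  # illustrative
|     # CL.THROTTLE <key> <max_burst> <count> <period> [<quantity>]
|     # e.g. a rate of 30 per 60 seconds with a max burst of 15
|     reply = r.execute_command("CL.THROTTLE", "user123",
|                               15, 30, 60, 1)
|     limited, limit, remaining, retry_after, reset_after = reply
|     if limited == 1:
|         print("throttled, retry in", retry_after, "seconds")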
| Operyl wrote:
| Ah, that's great! I was just looking down this path the other
| day for a project I have coming up soon. This'll save a good
| day's worth of work :).
| uyt wrote:
| I also found https://cloud.google.com/solutions/rate-limiting-
| strategies-... to be useful. In particular, at the very bottom
| they link to a lot of other blogs about rate limiting, such as:
| https://www.figma.com/blog/an-alternative-approach-to-rate-l...
|
| I am only reading about them because for whatever reason this
| has become a trendy "system design" interview question.
| hrpnk wrote:
| It's a task whose requirements are quick and easy for any
| engineer to understand. It also offers great depth in terms of
| approaches, technology selection, and data volumes to be
| analyzed.
___________________________________________________________________
(page generated 2021-04-08 23:01 UTC)