[HN Gopher] Reining in the thundering herd: Getting to 80% CPU utilization with Django
___________________________________________________________________
Reining in the thundering herd: Getting to 80% CPU utilization with
Django
Author : domino
Score : 82 points
Date : 2021-08-15 17:19 UTC (5 hours ago)
(HTM) web link (blog.clubhouse.com)
(TXT) w3m dump (blog.clubhouse.com)
| stingraycharles wrote:
| Tangent, but I always had a different understanding of the
| "thundering herd" problem; that is, if a service is down for
| whatever reason, and it's brought back online, it immediately
| grinds to a halt again because there are a bazillion requests
| waiting to be handled.
|
| And the solution to this problem is to bring the service back
| online slowly, in a rate-limited way, rather than letting the
| whole thundering herd through the door immediately.
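|
| (A minimal sketch of the client-side half of that fix, assuming
| a hypothetical call_service function: exponential backoff with
| full jitter spreads the returning herd out over time.)
|
|     import random, time
|
|     def call_with_backoff(call_service, max_retries=5):
|         # Retry with exponential backoff and full jitter so
|         # clients spread out instead of stampeding the
|         # recovering service all at once.
|         for attempt in range(max_retries):
|             try:
|                 return call_service()
|             except ConnectionError:
|                 # Sleep somewhere in [0, 2**attempt) seconds.
|                 time.sleep(random.uniform(0, 2 ** attempt))
|         return call_service()  # final try; let errors propagate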
| Ozzie_osman wrote:
| Yea you are right. It could be a service being down and
| requests piling up, or a cache key expiring and many processes
| trying to regenerate the value at the same time, etc.
|
| I think the article just used this phrase to describe something
| else. (Great article otherwise).
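|
| (For the cache-expiry variant, often called a cache stampede, a
| minimal sketch assuming a hypothetical cache client with a
| memcached-style atomic add(): only the process that wins the
| lock recomputes, while the rest back off briefly.)
|
|     import time
|
|     def get_or_regenerate(cache, key, compute, ttl=300):
|         value = cache.get(key)
|         if value is not None:
|             return value
|         if cache.add(key + ":lock", 1, ttl=30):
|             # We won the lock: regenerate once for everyone.
|             try:
|                 value = compute()
|                 cache.set(key, value, ttl=ttl)
|                 return value
|             finally:
|                 cache.delete(key + ":lock")
|         time.sleep(0.1)  # someone else is regenerating; retry
|         return get_or_regenerate(cache, key, compute, ttl)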
| taylorhughes wrote:
| Phrase borrowed from the excellent uWSGI docs:
| https://uwsgi-docs.readthedocs.io/en/latest/articles/Seriali...
| ambicapter wrote:
| Funny reading this comment after reading the article
|
| > So many options meant plenty of levers to twist around,
| but the lack of clear documentation meant that we were
| frequently left guessing the true intention of a given
| flag.
|
| And then reading your link, they complain >inside the docs<
| that the docs aren't complete. I have no idea what to
| believe anymore :D
| fanf2 wrote:
| There is an explanation of this kind of thundering herd about
| 3/4 of the way down this article:
| https://httpd.apache.org/docs/trunk/misc/perf-scaling.html
|
| The short version is that when you have multiple processes
| waiting on listening sockets and a connection arrives, they
| all get woken up and scheduled to run, but only one will pick
| up the connection, and the rest have to go back to sleep.
| These futile wakeups can be a huge waste of CPU, so on
| systems without accept() scalability fixes, or with more
| tricky server configurations, the web server puts a lock
| around accept() to ensure only one process is woken up at a
| time.
|
| The term (and the fix) dates back to the performance
| improvement work on Apache 1.3 in the mid-1990s.
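|
| (A toy pre-fork server sketching that fix: the workers share
| the listening socket but serialize accept() behind a lock, so
| only one sleeping process is woken per connection. Python 3.8+
| on a fork-based platform assumed.)
|
|     import multiprocessing, socket
|
|     def worker(listener, accept_lock):
|         while True:
|             # Without this lock, every worker blocked in
|             # accept() may be woken for each new connection.
|             with accept_lock:
|                 conn, _addr = listener.accept()
|             conn.sendall(b"HTTP/1.0 200 OK\r\n\r\nhello\n")
|             conn.close()
|
|     if __name__ == "__main__":
|         listener = socket.create_server(("", 8000))
|         lock = multiprocessing.Lock()
|         for _ in range(4):
|             multiprocessing.Process(
|                 target=worker, args=(listener, lock)).start()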
| [deleted]
| j4mie wrote:
| If you're delegating your load balancing to something else
| further up the stack and would prefer a simpler WSGI server than
| Gunicorn, Waitress is worth a look:
| https://github.com/pylons/waitress
| latchkey wrote:
| If it is just a backend, why not port it over to one of the
| myriad cloud autoscaling solutions that are out there?
|
| Weighing the opportunity cost of spending time figuring out why
| only 29 workers are receiving requests against adding new
| features that generate more revenue seems like a quick decision.
|
| Personally, I just start off with that now in the first place;
| the development load isn't any greater, and the solutions that
| are out there are quite good.
| lddemi wrote:
| Author here. We do and did use autoscaling heavily, but at a
| certain scale we just ran out of headroom on the smaller
| instance types we were using. Jumping to a much larger instance
| type meant that we will likely never run into those headroom
| issues again; it also solves other problems, like faster
| spin-up and better sidecar connection pooling, and allows for a
| much higher hit rate on per-instance caching.
| latchkey wrote:
| You were autoscaling a single-threaded process. You had 1000
| connections coming in and were scaling 1000 workers for those
| connections. Everything was filtered through gunicorn and
| nginx, which just adds additional latency and complexity for
| no real benefit.
|
| What I'm talking about is just pointing at something like
| AppEngine, Cloud Functions, etc... (or whatever solution AWS
| has that is similar) and being done with it. I'm talking
| about not running your own infrastructure, at all. Let AWS
| and Google be your devops so that you can focus on building
| features.
| ddorian43 wrote:
| Now you just 5x their costs.
| latchkey wrote:
| Not if you do it right.
|
| a) you get to fire the devops person, which saves $150k+
| a year.
|
| b) you add appropriate caching layers in front of
| everything.
|
| c) you spend time adding features, which generate
| revenue.
|
| I've done all of this before at scale. This whole case
| study was written about work I did [1]. Two devs, 3
| months to release, first year was $80m gross revenue on
| $500/month cloud bills. Infinite scalability, zero
| devops.
|
| [1] https://cloud.google.com/customers/gearlaunch
| TekMol wrote:
| Did you consider switching from CPython to PyPy?
| _tom_ wrote:
| They are on a back-end that does auto-scaling. They stated that
| they had problems when scaling up past 1000 nodes.
|
| Now, maybe they could have fixed that issue instead, but going
| from 29 to 58 workers is easy; it's not the same as going from
| 29,000 to 58,000. And 1000 hosts vs. 500 is a non-trivial cost.
| PaywallBuster wrote:
| containers would've solved it
|
| one process per container, easy peasy
| motoboi wrote:
| You know containers are just processes, right?
|
| This is what they did, but because they didn't need to
| schedule other jobs on the same machine, Kubernetes or even
| Docker would be overkill.
|
| In this case, simple VM orchestration seems like a fine
| solution.
| Spivak wrote:
| This doesn't work so easily with architectures that use process
| pools for workers. Now your app server needs to speak Docker
| (or whatever the control plane is) to spawn new workers and
| deal with more complicated IPC. Also, the startup time is
| brutal.
|
| One process per container plus multiprocessing is a huge lift
| most of the time. I've done it, but it can be a mess because
| you don't have as much of a handle on containers as on
| subprocesses; you can only poke them at a distance through the
| control plane.
| TekMol wrote:
| Performance is the only thing holding me back from considering
| Python for bigger web applications.
|
| Of the 3 main languages for web dev these days - Python, PHP,
| and JavaScript - I like Python the most. But it is scary how
| slow the default runtime, CPython, is. Compared to PHP and
| JavaScript, it crawls like a snake.
|
| PyPy could be a solution, as it seems to be about 6x faster on
| average.
|
| Is anybody here using PyPy for Django?
|
| Did Clubhouse document somewhere whether they are using CPython
| or PyPy?
| IshKebab wrote:
| Typescript is a nicer language than Python in many ways and it
| doesn't suffer from Python's crippling performance issues or
| dubious static typing situation. Plus you can run it in a
| browser so there's only one language to learn.
| TekMol wrote:
| You cannot run TS in a browser.
|
| You can compile it to JS or to WebAssembly. But you can do that
| with every language.
| domino wrote:
| Clubhouse is using CPython
| TekMol wrote:
| Interesting. Is there a reason for this?
| mst wrote:
| It's the standard and best-supported approach, and the level of
| speedup you get from PyPy is significantly workload-dependent.
| truffdog wrote:
| I mean... it's the default?
| fnord77 wrote:
| The top 5 web app programming languages by market share are
| PHP, Java, JS, Lua and Ruby.
| TekMol wrote:
| ...said a stranger on the internet without any sources to
| back up this claim.
| sdze wrote:
| only one data point though, but...
| https://redmonk.com/rstephens/2021/08/05/top-20-june-2021/
| TekMol wrote:
| That one lists the top 5 as: JavaScript, Python, Java, PHP, C#.
|
| But is it about web?
| sdze wrote:
| use PHP ;)
| nsizx wrote:
| So much this. Practically any other option is better than
| Python for web development if you're looking for performance.
| waprin wrote:
| Yet the YouTube, Instagram, Pinterest, Reddit, Robinhood,
| DoorDash, and Lyft backends were originally written primarily
| in Python. What's funny is that nobody can really deny Python
| is slow, yet somehow the biggest websites in the world were
| written in it. More proof that Worse Is Better?
| huffmsa wrote:
| In the early stages:
|
| Speed of development is far more important than optimizing
| CPU usage.
|
| You can fake your way to fast responses with good caching, but
| there aren't really many ways to fake having the best features.
| void_mint wrote:
| By this logic, why not Java, C++, Rust, Go, C#?
|
| They're all web-capable and blow the doors off PHP, Python,
| etc.
| IshKebab wrote:
| Yes all of those would be way better options than Python
| and probably PHP. Well maybe not C++. You'd have to be
| pretty crazy to have web developers writing security
| sensitive code in C++.
|
| The "blame our co-founder for the choice" bit is exactly
| what that graph about the cost of defects vs how early they
| are fixed is talking about.
|
| If they had just picked Go or Java right at the start they
| wouldn't have had to expend all this engineering effort to
| get to a still-not-very-good solution.
| void_mint wrote:
| This thread arose from a person who said "use PHP" as an
| argument against using Python. It's a silly argument.
| sdze wrote:
| It was just a silly remark about the snail-like
| performance of Python.
|
| Another silly thing:
|
| https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
| sdze wrote:
| C#, Java, C++ need application servers, no?
|
| "Serverless" scales infinitely due to its simpler
| request/response lifecycle.
| void_mint wrote:
| Serverless is too overloaded a term to have any meaning.
| I'm not really seeing how Python or PHP "scales
| infinitely" in any way that C#, Java, C++ couldn't.
| sdze wrote:
| It blows my mind how quickly PHP7.4 processes even shitty
| code.
| polote wrote:
| I wouldn't be very proud of writing an article like that.
|
| Usually engineering blogs exist to show that there is fun stuff
| to do at a company. But here it just seems they have no idea
| what they are doing. Which is fine; I'm classifying myself in
| the same category.
|
| Reading the article, I don't feel like they have solved their
| issue; they just created more future problems.
| kvalekseev wrote:
| HAProxy is a beautiful tool, but it doesn't buffer requests,
| which is why NGINX is recommended in front of gunicorn;
| otherwise it's susceptible to a slowloris attack. So either
| Clubhouse can be easily DDoS'd right now, or they have some
| tricky setup that prevents slow POST requests from reaching
| gunicorn. The blog post doesn't mention that problem while
| recommending that others try to replace NGINX with HAProxy.
| lddemi wrote:
| 1. HAProxy does support request buffering
| https://cbonte.github.io/haproxy-dconv/2.2/configuration.htm...
|
| 2. our load balancer buffers requests as well
| kvalekseev wrote:
| From the HAProxy mailing list, about the http-buffer-request
| option:
| https://www.mail-archive.com/haproxy@formilux.org/msg23074.h...
|
| > In fact, with some app-servers (e.g. most Ruby/Rack
| servers, most Python servers, ...) the recommended setup is
| to put a fully buffering webserver in front. Due to it's
| design, HAProxy can not fill this role in all cases with
| arbitrarily large requests.
|
| A year ago I was evaluating a recent version of HAProxy as a
| buffering web server and successfully ran a slowloris attack
| against it. Thus switching from NGINX is not a straightforward
| operation, and your blog post should mention the
| http-buffer-request option and the slow-client problem.
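|
| (For reference, the option under discussion is enabled like
| this in haproxy.cfg; the frontend and backend names here are
| made up. Note the caveat above: buffering is capped at the
| buffer size, unlike NGINX's full spooling.)
|
|     frontend fe_web
|         bind :80
|         # Wait for the full request (up to tune.bufsize, 16kB
|         # by default) before dispatching; blunts slow POSTs.
|         option http-buffer-request
|         default_backend be_gunicorn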
| lmilcin wrote:
| 1M requests per minute on 1000 web instances is not an
| achievement, it is a disaster.
|
| It is ridiculous people brag about it.
|
| Guys, if you have the budget, maybe I can help you up this by a
| couple orders of magnitude.
| smashed wrote:
| To be honest the article does realize this, first blaming it on
| the poor foresight of the original developer (co-founder) and
| in the conclusion musing about maybe rewriting the whole thing.
|
| It seemed to be all about how to extract the most performance
| from the lemon they had to deal with.
|
| I found the linked reference really informative too:
| https://rachelbythebay.com/w/2020/03/07/costly/
| lmilcin wrote:
| I don't know Python or how complex their domain is, but the
| number of workers suggests to me it is not that complex and
| their application spends most of its time switching contexts
| and in inefficient frameworks.
|
| In my experience, most applications that mostly serve
| documents from databases should be able to take on at least
| 10k requests per second on a single node. That is 600k
| requests per minute on one node, compared to their 1M per
| minute on 1000 nodes.
|
| This is what I typically get from a simple setup with Java,
| WebFlux, and MongoDB, with a little bit of experience in what
| stupid things not to do, but without spending much time
| fine-tuning anything.
|
| I think bragging about performance improvements when your
| design and architecture are already completely broken is at
| the very least embarrassing.
|
| > poor foresight of the original developer (co-founder)
|
| Well, you have a choice of technologies to write your
| application in; why choose one that sucks so much when there
| are so many others that suck less?
|
| It is not a poor choice, it is a lack of competency.
|
| You are a co-founder and want your product to succeed? Don't do
| stupid shit like choosing a stack that already makes reaching
| your goal very hard.
| nomdep wrote:
| So do you think using Django is stupid? I guess you think
| the same about every product that uses Ruby on Rails?
| lmilcin wrote:
| No, Django is not stupid.
|
| What is stupid is the decision to choose it to run a load that
| will require 1000s of servers when it could be handled with
| 5-10 servers in another technology without more development
| effort.
| mst wrote:
| I doubt they expected that level of request load that
| early on - I imagine the technology choice was made
| significantly before the whole pandemic thing started.
| taylorhughes wrote:
| (CH employee here)
|
| The job of the cofounder is to create a thing that people
| want, which has nothing to do with performance. The first
| goal is capturing lightning in a bottle with social
| products. Performance doesn't matter until the lightning is
| there, and 99%+ of the time you never have to worry about
| performance, because you don't get the lightning. So,
| probably the correct choice is leveraging the tech stack
| that gives you the best shot at capturing the lightning.
| Django seemed to help!
| jimsimmons wrote:
| Don't sweat it buddy. People here just want to stand on
| your toes and feel taller. Classic HN.
|
| Velocity of development is priority #1 and having
| something that needs to be scaled is a monumental
| achievement.
| mst wrote:
| Plus, if he could've predicted the pandemic that far in
| advance, there would probably have been plenty of
| non-Clubhouse ways to monetise that prescience ;)
| lmilcin wrote:
| This is just a silly excuse.
|
| The job of the cofounder is also to anticipate possible
| risks.
|
| And building your company on an astronomically
| inefficient technology sounds like a huge risk to me.
|
| Those 1000s of servers are probably a very significant cost
| for such a small technical staff. Just by choosing the right
| technology for the problem, most of that cost could have been
| avoided.
|
| Django has nothing special in it that would allow building
| applications faster than in a lot of other frameworks that
| are also much more efficient.
|
| So it is just a matter of simple choice.
|
| Nobody expects people to write webapps in C++ or Rust. Just
| don't choose a technology that is famous for being
| inefficient.
| chimen wrote:
| So Django is an "astronomically inefficient technology"?
| I would just stop if I were you.
| jimsimmons wrote:
| Python is not astronomically inefficient. Instagram serves
| something like a billion users with it. The job of a
| cofounder is to build what people want. You can always scale
| in Silicon Valley by hiring people like you. You can't build
| another viral app like Clubhouse by hiring from the same
| crowd.
|
| This may hurt you, but the truth is that scaling and software
| engineering are highly commoditised. That's the whole point
| of being in the valley. You can hire people for such things
| and forget about it.
|
| Clubhouse is not a tech company. They don't have to care
| about being the best at infra.
| vilified wrote:
| You sound just like the average sports fan commenting after a
| match about what player X should have done or shouldn't have
| done, blaming it on decisions, the style of the trainer, the
| owner, etc. But you're just that: a fan yapping about how
| they could do better.
| luckycharms810 wrote:
| This is a comically bad yet incredibly common engineering take.
| When you run a company there is only one question to answer,
| one north star: does it make money?
| yuliyp wrote:
| Knowing nothing else, it's hard to know if this is good or not.
| It's about 16 requests per second per instance. Are those
| requests something like
| "Render a support article" or are they "Give the user a ranked
| feed of what they should see on their home screen"? Is most of
| the logic run by the web server or some combination of app
| servers / backend services behind it? What kind of hardware
| does the web server have?
|
| All of those would affect the answer, and would preclude being
| able to guarantee "up this by a couple orders of magnitude".
| tbrock wrote:
| Aside: AWS only allows registering 1000 targets in a target
| group... I wonder if that's the limit they hit. If so, it's
| documented.
| vvatsa wrote:
| Ya, I pretty much agree with the 3 suggestions at the end:
|
| * use uWSGI (read the docs, so many options...)
|
| * use HAProxy, so very very good
|
| * scale Python apps by using processes.
| dilyevsky wrote:
| Kinda funny that they decided paying a ton of money to AWS was
| OK but paying for NGINX Plus was not.
| andrenotgiant wrote:
| ClubHouse runs on AWS?
| dilyevsky wrote:
| Hm, actually it might be Google, based on where their traffic
| is going (I only looked just now). OK, now it makes more sense
| why support wasn't able to figure this out =)
| Spivak wrote:
| I kinda get that, honestly. It's why I'll spend $20 without
| even thinking for takeout but not spend $2 for an app. It's
| because the cost of the software is way, way more than the
| money. It's a commitment to actually use it and integrate it,
| deal with their sales team, talk to purchasing, and handle
| licensing, and it introduces friction around replacing it or
| using tools that don't integrate well, because "well, we
| already pay for it." Licensing also complicates deployments
| substantially when you're doing lots of autoscaling.
|
| And on top of that Nginx Plus is also expensive as hell.
| rowanG077 wrote:
| The buy-in to AWS is much, much larger than using a piece of
| software, though.
| dilyevsky wrote:
| Don't you have to integrate the cloud? This whole post is
| about having to put in a bunch of workarounds because the
| cloud can't scale, apparently.
| ClumsyPilot wrote:
| "It's why I'll spend $20 without even thinking for take out
| but not spend $2 for an app."
|
| I pay for apps; it's not a healthy attitude.
| [deleted]
| spullara wrote:
| The difference people see, as far as I can tell, is that AWS is
| charging you cost+ and pure software companies need to charge
| for value or die.
| JanMa wrote:
| Interesting to read that they are using Unix sockets to send
| traffic to their backend processes. I know that it's easily
| done when using HAProxy, but I have never read about people
| using it. I guess the fact that they are not using Docker or
| another container runtime makes sockets rather simple to use.
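|
| (On the gunicorn side it's a one-line change; a minimal
| gunicorn.conf.py sketch, with the socket path as an
| assumption:)
|
|     # gunicorn.conf.py
|     bind = "unix:/run/app/gunicorn.sock"  # vs. "0.0.0.0:8000"
|     workers = 4
|     # HAProxy can then point a server line at the same path,
|     # e.g. server app1 unix@/run/app/gunicorn.sock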
| mst wrote:
| I do that every chance I can get.
|
| At a guess, it's probably most loved by people picking
| old-school simple architectures that aren't the sort of thing
| that goes viral.
| kvalekseev wrote:
| It's the standard way to connect things in UNIX and provides
| better performance. For example, PostgreSQL over TCP+SSL is
| 175% slower than over a Unix socket:
| https://momjian.us/main/blogs/pgblog/2012.html#June_6_2012
| stu2010 wrote:
| Interesting to see this. It sounds like they're not on AWS, given
| that they mentioned that having 1000 instances for their
| production environment made them one of the bigger deployments on
| their hosting provider.
|
| If not for the troubles they experienced with their hosting
| provider and managing deployments / cutting over traffic, it
| possibly could have been the cheaper option to just keep
| horizontally scaling vs putting in the time to investigate these
| issues. I'd also love to see some actual latency graphs; what's
| the P90 like at 25% CPU usage with a simple Gunicorn / gevent
| setup?
| ksec wrote:
| I was wondering that too, but there aren't that many common
| cloud providers that have a 96 vCPU offering.
|
| I am also wondering about 144 workers on 96 _vCPUs_, which is
| not 96 CPU _cores_ but 96 CPU _threads_. So effectively 144
| workers on 48 CPU cores, possibly running at a sub-3GHz clock
| speed. But it seems they got it to work out in the end (maybe
| at the expense of latency).
| mst wrote:
| Assuming you're running a system where normal request/response
| handling blocks on database queries, it's often optimal to
| have more workers than available CPU threads, and 1.5x is a
| common rule of thumb to try first.
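|
| (Expressed as a gunicorn config sketch, an assumption rather
| than anything from the article:)
|
|     import os
|
|     # ~1.5 workers per CPU thread for DB-bound request handling
|     workers = int(os.cpu_count() * 1.5)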
| catillac wrote:
| Famous last words, but I get the sense that the need to handle
| this sort of load on Clubhouse is plateauing and will decline
| from here. The app seems to have shed all the people that drew
| other people initially and lost its small, intimate feel and has
| turned into either crowded rooms where no one can say anything,
| or hyper specific rooms where no one has anything to say.
|
| Good article though! I've dealt with these exact issues and they
| can be very frustrating.
| luhn wrote:
| Unfortunately HAProxy doesn't buffer requests*, and buffering
| is necessary for a production deployment of gunicorn. And for
| anybody using AWS, ALB doesn't buffer requests either. Because of
| this I'm actually running both HAProxy and nginx in front of my
| gunicorn instances--nginx in front for request buffering and
| HAProxy behind that for queuing.
|
| If anybody is interested, I've packaged both as Docker
| containers:
|
| HAProxy queuing/load shedding:
| https://hub.docker.com/r/luhn/spillway
|
| nginx request buffering:
| https://hub.docker.com/r/luhn/gunicorn-proxy
|
| * It does have an http-buffer-request option, but this only
| buffers the first 8kB (?) of the request.
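|
| (A minimal sketch of the nginx half of that chain; the upstream
| name and port are assumptions. proxy_request_buffering is on by
| default, shown here for clarity.)
|
|     server {
|         listen 80;
|         location / {
|             # Spool the whole request body before proxying,
|             # shielding the upstream from slow clients.
|             proxy_request_buffering on;
|             proxy_pass http://haproxy-queue:8080;
|         }
|     }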
___________________________________________________________________
(page generated 2021-08-15 23:01 UTC)