[HN Gopher] Reining in the thundering herd: Getting to 80% CPU u...
       ___________________________________________________________________
        
       Reining in the thundering herd: Getting to 80% CPU utilization with
       Django
        
       Author : domino
       Score  : 82 points
       Date   : 2021-08-15 17:19 UTC (5 hours ago)
        
 (HTM) web link (blog.clubhouse.com)
 (TXT) w3m dump (blog.clubhouse.com)
        
       | stingraycharles wrote:
       | Tangent, but I always had a different understanding of the
       | "thundering herd" problem; that is, if a service is down for
       | whatever reason, and it's brought back online, it immediately
       | grinds to a halt again because there are a bazillion requests
       | waiting to be handled.
       | 
        | And the solution to this problem is to bring the service back
        | online slowly, with rate limiting, rather than letting the
        | whole thundering herd go through the door immediately.
        
         | Ozzie_osman wrote:
          | Yeah, you are right. It could be a service being down and
         | requests piling up, or a cache key expiring and many processes
         | trying to regenerate the value at the same time, etc.
         | 
         | I think the article just used this phrase to describe something
         | else. (Great article otherwise).
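          | 
          | A minimal sketch of that cache-stampede case (assuming a
          | memcached-style client with an atomic add(), as in Django's
          | cache API): one process wins a short lock and regenerates
          | the value, while the rest wait and re-read instead of all
          | hitting the database at once.
          | 
          |     import time
          | 
          |     def get_or_regenerate(cache, key, regen, lock_ttl=30):
          |         value = cache.get(key)
          |         if value is not None:
          |             return value
          |         # add() is atomic: only one caller wins the lock
          |         if cache.add("lock:" + key, 1, lock_ttl):
          |             try:
          |                 value = regen()
          |                 cache.set(key, value, 300)
          |             finally:
          |                 cache.delete("lock:" + key)
          |             return value
          |         while True:  # losers wait for the winner's value
          |             time.sleep(0.05)
          |             value = cache.get(key)
          |             if value is not None:
          |                 return value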
        
           | taylorhughes wrote:
            | Phrase borrowed from the excellent uWSGI docs: https://uwsgi-
           | docs.readthedocs.io/en/latest/articles/Seriali...
        
             | ambicapter wrote:
             | Funny reading this comment after reading the article
             | 
             | > So many options meant plenty of levers to twist around,
             | but the lack of clear documentation meant that we were
             | frequently left guessing the true intention of a given
             | flag.
             | 
             | And then reading your link, they complain >inside the docs<
             | that the docs aren't complete. I have no idea what to
             | believe anymore :D
        
           | fanf2 wrote:
            | There is an explanation of this kind of thundering herd
            | about 3/4 of the way down this article:
           | https://httpd.apache.org/docs/trunk/misc/perf-scaling.html
           | 
           | The short version is that when you have multiple processes
           | waiting on listening sockets and a connection arrives, they
           | all get woken up and scheduled to run, but only one will pick
           | up the connection, and the rest have to go back to sleep.
           | These futile wakeups can be a huge waste of CPU, so on
           | systems without accept() scalability fixes, or with more
           | tricky server configurations, the web server puts a lock
           | around accept() to ensure only one process is woken up at a
           | time.
           | 
           | The term (and the fix) dates back to the performance
           | improvement work on Apache 1.3 in the mid-1990s.
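            | 
            | A toy illustration of the fix (a Python sketch, not
            | Apache's actual code): pre-forked workers serialize
            | accept() with a shared lock, so only one process at a
            | time is sleeping in accept() and woken per connection.
            | 
            |     import socket
            |     from multiprocessing import Lock, Process
            | 
            |     def worker(sock, accept_lock):
            |         while True:
            |             with accept_lock:  # one process in accept()
            |                 conn, _ = sock.accept()
            |             conn.sendall(b"HTTP/1.0 200 OK\r\n\r\nok\n")
            |             conn.close()
            | 
            |     if __name__ == "__main__":
            |         sock = socket.socket()
            |         sock.setsockopt(socket.SOL_SOCKET,
            |                         socket.SO_REUSEADDR, 1)
            |         sock.bind(("127.0.0.1", 8000))
            |         sock.listen(128)
            |         lock = Lock()
            |         for _ in range(4):
            |             Process(target=worker,
            |                     args=(sock, lock)).start()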
        
         | [deleted]
        
       | j4mie wrote:
       | If you're delegating your load balancing to something else
       | further up the stack and would prefer a simpler WSGI server than
       | Gunicorn, Waitress is worth a look:
       | https://github.com/pylons/waitress
        
       | latchkey wrote:
       | If it is just a backend, why not port it over to one of the
       | myriad of cloud autoscaling solutions that are out there?
       | 
        | The opportunity cost of spending time figuring out why only 29
        | workers are receiving requests, versus adding new features
        | that generate more revenue, makes this seem like a quick
        | decision.
        | 
        | Personally, I just start off with that now in the first place;
        | the development load isn't any greater, and the solutions that
        | are out there are quite good.
        
         | lddemi wrote:
          | Author here. We do and did use autoscaling heavily, but at a
          | certain scale we just ran out of headroom on the smaller
          | instance types we were using. Jumping to much larger
          | instance types meant that we would likely never run into
          | those headroom issues again, plus it solved other problems:
          | faster spin-up, better sidecar connection pooling, and a
          | much higher hit rate on per-instance caching.
        
           | latchkey wrote:
            | You were autoscaling a single-threaded process. You had 1000
           | connections coming in and scaling 1000 workers for those
           | connections. Everything was filtered through gunicorn and
           | nginx, which just adds additional latencies and complexity,
           | for no real benefit.
           | 
           | What I'm talking about is just pointing at something like
           | AppEngine, Cloud Functions, etc... (or whatever solution AWS
           | has that is similar) and being done with it. I'm talking
           | about not running your own infrastructure, at all. Let AWS
           | and Google be your devops so that you can focus on building
           | features.
        
             | ddorian43 wrote:
              | Now you just 5x'd their costs.
        
               | latchkey wrote:
               | Not if you do it right.
               | 
               | a) you get to fire the devops person, which saves $150k+
               | a year.
               | 
               | b) you add appropriate caching layers in front of
               | everything.
               | 
               | c) you spend time adding features, which generate
               | revenue.
               | 
               | I've done all of this before at scale. This whole case
               | study was written about work I did [1]. Two devs, 3
               | months to release, first year was $80m gross revenue on
               | $500/month cloud bills. Infinite scalability, zero
               | devops.
               | 
               | [1] https://cloud.google.com/customers/gearlaunch
        
           | TekMol wrote:
            | Did you consider switching from CPython to PyPy?
        
         | _tom_ wrote:
         | They are on a back-end that does auto-scaling. They stated that
         | they had problems when scaling up past 1000 nodes.
         | 
          | Now, maybe they could have fixed that issue instead, but
          | going from 29 to 58 workers is easy; it's not the same as
          | going from 29,000 to 58,000. And 1000 hosts vs 500 is a
          | non-trivial cost.
        
         | PaywallBuster wrote:
         | containers would've solved it
         | 
         | one process per container, easy peasy
        
           | motoboi wrote:
            | you know containers are just processes, right?
           | 
           | This is what they did, but because they didn't need to
           | schedule other jobs on the same machine, kubernetes or even
           | docker would be overkill.
           | 
           | In this case, simple VM orchestration seems like a fine
           | solution.
        
           | Spivak wrote:
           | This doesn't work so easily with architectures with process
           | pools for workers. So now your app server needs to speak
           | docker (or whatever control plane) to spawn new workers and
           | deal with more complicated IPC. Also the startup time is
           | brutal.
           | 
            | One process per container plus multiprocessing is a huge
            | lift most of the time. I've done it, but it can be a mess
            | because you don't have as much of a handle on containers
            | as on subprocesses; you can only poke them at a distance
            | through the control plane.
        
       | TekMol wrote:
        | Performance is the only thing holding me back from considering
        | Python for bigger web applications.
       | 
       | Of the 3 main languages for web dev these days - Python, PHP and
       | Javascript - I like Python the most. But it is scary how slow the
       | default runtime, CPython, is. Compared to PHP and Javascript, it
       | crawls like a snake.
       | 
        | PyPy could be a solution, as it seems to be about 6x faster on
        | average.
        | 
        | Is anybody here using PyPy for Django?
        | 
        | Did Clubhouse document somewhere whether they are using
        | CPython or PyPy?
        
         | IshKebab wrote:
         | Typescript is a nicer language than Python in many ways and it
         | doesn't suffer from Python's crippling performance issues or
         | dubious static typing situation. Plus you can run it in a
         | browser so there's only one language to learn.
        
           | TekMol wrote:
           | You cannot run TS in a browser.
           | 
            | You can compile it to JS or to WebAssembly. But you can do
           | that with every language.
        
         | domino wrote:
         | Clubhouse is using CPython
        
           | TekMol wrote:
           | Interesting. Is there a reason for this?
        
             | mst wrote:
              | It's the standard and best-supported approach, and the
              | level of speedup you get from PyPy is significantly
              | workload-dependent.
        
             | truffdog wrote:
             | I mean... it's the default?
        
         | fnord77 wrote:
         | The top 5 web app programming languages by market share are
         | PHP, Java, JS, Lua and Ruby.
        
           | TekMol wrote:
           | ...said a stranger on the internet without any sources to
           | back up this claim.
        
             | sdze wrote:
              | only one data point, but...
             | https://redmonk.com/rstephens/2021/08/05/top-20-june-2021/
        
               | TekMol wrote:
                | That one lists the top 5 as:
                | 
                |     Javascript
                |     Python
                |     Java
                |     PHP
                |     C#
                | 
                | But is it about web?
        
       | sdze wrote:
       | use PHP ;)
        
         | nsizx wrote:
         | So much this. Practically any other option is better than
         | Python for web development if you're looking for performance.
        
           | waprin wrote:
            | Yet the YouTube, Instagram, Pinterest, Reddit, Robinhood,
            | DoorDash, and Lyft backends were originally written
            | primarily in Python. What's funny is that nobody can
            | really deny Python is slow, yet somehow the biggest
            | websites in the world were written in it. More proof that
            | Worse Is Better?
        
             | huffmsa wrote:
             | In the early stages:
             | 
             | Speed of development is far more important than optimizing
             | CPU usage.
             | 
              | You can fake your way to fast responses with good
              | caching, but there aren't really many ways to fake
              | having the best features.
        
           | void_mint wrote:
           | By this logic, why not Java, C++, Rust, Go, C#?
           | 
           | They're all web-capable and blow the doors off PHP, Python,
           | etc.
        
             | IshKebab wrote:
              | Yes, all of those would be way better options than
              | Python, and probably PHP. Well, maybe not C++. You'd
              | have to be pretty crazy to have web developers writing
              | security-sensitive code in C++.
             | 
             | The "blame our co-founder for the choice" bit is exactly
             | what that graph about the cost of defects vs how early they
             | are fixed is talking about.
             | 
             | If they had just picked Go or Java right at the start they
             | wouldn't have had to expend all this engineering effort to
             | get to a still-not-very-good solution.
        
               | void_mint wrote:
                | This thread arose from a person who said "Use PHP" as
                | an argument against using Python. It's a silly
                | argument.
        
               | sdze wrote:
               | It was just a silly remark about the snail-like
               | performance of Python.
               | 
               | Another silly thing:
               | 
               | https://benchmarksgame-
               | team.pages.debian.net/benchmarksgame/...
        
             | sdze wrote:
             | C#, Java, C++ need application servers, no?
             | 
             | "Serverless" scales infinitely due to its simpler
             | request/response lifecycle.
        
               | void_mint wrote:
               | Serverless is too overloaded a term to have any meaning.
               | I'm not really seeing how Python or PHP "scales
               | infinitely" in any way that C#, Java, C++ couldn't.
        
           | sdze wrote:
            | It blows my mind how quickly PHP 7.4 processes even shitty
            | code.
        
       | polote wrote:
        | I wouldn't be very proud of writing an article like that.
        | 
        | Usually engineering blogs exist to show that there is fun
        | stuff to do at a company. But here it just seems they have no
        | idea what they are doing. Which is fine; I'd classify myself
        | in the same category.
        | 
        | Reading the article, I don't feel like they have solved their
        | issue; they just created more future problems.
        
       | kvalekseev wrote:
        | HAProxy is a beautiful tool, but it doesn't buffer requests,
        | which is why NGINX is recommended in front of gunicorn;
        | otherwise it's susceptible to slowloris attacks. So either
        | Clubhouse can be easily DDoS'd right now, or they have some
        | tricky setup that prevents slow POST requests from reaching
        | gunicorn. The blog post doesn't mention that problem while
        | recommending that others try to replace NGINX with HAProxy.
        
         | lddemi wrote:
         | 1. HAProxy does support request buffering
         | https://cbonte.github.io/haproxy-dconv/2.2/configuration.htm...
         | 
          | 2. Our load balancer buffers requests as well.
        
           | kvalekseev wrote:
            | From the HAProxy mailing list about the http-buffer-request
            | option:
           | https://www.mail-
           | archive.com/haproxy@formilux.org/msg23074.h...
           | 
           | > In fact, with some app-servers (e.g. most Ruby/Rack
           | servers, most Python servers, ...) the recommended setup is
           | to put a fully buffering webserver in front. Due to it's
           | design, HAProxy can not fill this role in all cases with
           | arbitrarily large requests.
           | 
            | A year ago I was evaluating a recent version of HAProxy as
            | a buffering web server and successfully ran a slowloris
            | attack against it. Thus switching from NGINX is not a
            | straightforward operation, and your blog post should
            | mention the http-buffer-request option and the slow-client
            | problem.
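            | 
            | For anyone who wants to test their own stack, a
            | hypothetical slow-POST client looks roughly like this
            | (host/port assumed; against an unbuffered proxy this pins
            | one sync worker for the entire trickled upload):
            | 
            |     import socket, time
            | 
            |     def slow_post(host="127.0.0.1", port=80, n=600):
            |         s = socket.create_connection((host, port))
            |         s.sendall(b"POST / HTTP/1.1\r\nHost: t\r\n"
            |                   b"Content-Length: %d\r\n\r\n" % n)
            |         for _ in range(n):
            |             s.sendall(b"x")   # one byte per second
            |             time.sleep(1)
            |         print(s.recv(4096))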
        
       | lmilcin wrote:
        | 1M requests per minute on 1000 web instances is not an
        | achievement, it is a disaster.
        | 
        | It is ridiculous people brag about it.
        | 
        | Guys, if you have the budget, maybe I can help you up this by
        | a couple of orders of magnitude.
        
         | smashed wrote:
          | To be honest, the article does acknowledge this, first
          | blaming it on the poor hindsight of the original developer
          | (co-founder) and then, in the conclusion, musing about maybe
          | rewriting the whole thing.
         | 
         | It seemed to be all about how to extract the most performance
         | from the lemon they had to deal with.
         | 
         | I found the linked reference really informative too:
         | https://rachelbythebay.com/w/2020/03/07/costly/
        
           | lmilcin wrote:
            | I don't know Python or how complex their domain is, but
            | the number of workers suggests to me that it is not that
            | complex, and that their application spends most of its
            | time switching contexts and in inefficient frameworks.
            | 
            | In my experience, most applications that mostly serve
            | documents from databases should be able to take on at
            | least 10k requests per second on a single node. This is
            | 600k requests per minute on one node, compared to their
            | 1M per 1000 nodes.
           | 
            | This is what I typically get from a simple setup with
            | Java, WebFlux and MongoDB, with a little experience in
            | what stupid things not to do, but without spending much
            | time fine-tuning anything.
           | 
           | I think bragging about performance improvements when your
           | design and architecture is already completely broken is at
           | the very least embarrassing.
           | 
           | > poor hindsight from original developer (co-founder)
           | 
            | Well, you have a choice of technologies to write your
            | application in; why choose one that sucks so much when
            | there are so many others that suck less?
            | 
            | It is not a poor choice, it is a lack of competency.
            | 
            | You are a co-founder and want your product to succeed?
            | Don't do stupid shit like choosing a stack that already
            | makes reaching your goal very hard.
        
             | nomdep wrote:
             | So do you think using Django is stupid? I guess you think
             | the same about every product that uses Ruby on Rails?
        
               | lmilcin wrote:
               | No, Django is not stupid.
               | 
                | What is stupid is the decision to choose it to run a
                | load that will require 1000s of servers, when it could
                | be handled with 5-10 servers in another technology
                | without more development effort.
        
               | mst wrote:
               | I doubt they expected that level of request load that
               | early on - I imagine the technology choice was made
               | significantly before the whole pandemic thing started.
        
             | taylorhughes wrote:
             | (CH employee here)
             | 
             | The job of the cofounder is to create a thing that people
             | want, which has nothing to do with performance. The first
             | goal is capturing lightning in a bottle with social
             | products. Performance doesn't matter until the lightning is
             | there, and 99%+ of the time you never have to worry about
             | performance, because you don't get the lightning. So,
             | probably the correct choice is leveraging the tech stack
             | that gives you the best shot at capturing the lightning.
             | Django seemed to help!
        
               | jimsimmons wrote:
                | Don't sweat it, buddy. People here just want to stand
                | on your toes and feel taller. Classic HN.
               | 
               | Velocity of development is priority #1 and having
               | something that needs to be scaled is a monumental
               | achievement.
        
               | mst wrote:
                | Plus, if he could've predicted the pandemic that far
                | in advance, there would probably have been plenty of
                | non-Clubhouse ways to monetise that prescience ;)
        
               | lmilcin wrote:
                | This is just a silly excuse.
               | 
               | The job of the cofounder is also to anticipate possible
               | risks.
               | 
               | And building your company on an astronomically
               | inefficient technology sounds like a huge risk to me.
               | 
                | Those 1000s of servers are probably a very significant
                | cost with such a small technical staff. Just by
                | choosing the right technology for the problem, most of
                | that cost could have been avoided.
               | 
                | Django has nothing special in it that would allow
                | building applications faster than in a lot of other
                | frameworks that are also much more efficient.
               | 
               | So it is just a matter of simple choice.
               | 
               | Nobody expects people to write webapps in C++ or Rust.
               | Just don't choose technology that is famous for being
               | inefficient.
        
               | chimen wrote:
               | So Django is an "astronomically inefficient technology"?
               | I would just stop if I were you.
        
               | jimsimmons wrote:
                | Python is not astronomically inefficient. Instagram
                | serves like a billion users with it. The job of a
                | cofounder is to build what people want. You can always
                | scale in Silicon Valley by hiring people like you. You
                | can't build another viral app like Clubhouse by hiring
                | from the same crowd.
               | 
                | This may hurt you, but the truth is that scaling and
                | software engineering are highly commoditised. That's
                | the whole point of being in the valley. You can hire
                | people for such things and forget about it.
               | 
                | Clubhouse is not a tech company. They don't have to
                | care about being the best at infra.
        
               | vilified wrote:
                | You sound just like the average sports fan commenting
                | after a match about what player X should have done or
                | shouldn't have done, blaming it on decisions, the
                | style of the trainer, the owner, etc. But you're just
                | that: a fan yapping about how they could do better.
        
         | luckycharms810 wrote:
          | This is a comically bad yet incredibly common engineering
          | take. When you run a company there is only one question to
          | answer, one north star - does it make money?
        
         | yuliyp wrote:
          | Knowing nothing else, it's hard to know if this is good or
          | not. It's about 17 requests per second per instance (1M
          | req/min / 1,000 instances / 60 s/min). Are those requests
          | something like "Render a support article" or are they "Give
          | the user a ranked feed of what they should see on their
          | home screen"? Is most of the logic run by the web server or
          | some combination of app servers / backend services behind
          | it? What kind of hardware does the web server have?
         | 
          | All of those would affect the answer, and would preclude
          | being able to guarantee "up this by a couple of orders of
          | magnitude".
        
       | tbrock wrote:
        | Aside: AWS only allows registering 1000 targets in a target
        | group... I wonder if that's the limit they hit. If so, it's
        | documented.
        
       | vvatsa wrote:
        | Ya, I pretty much agree with the 3 suggestions at the end:
       | 
       | * use uWSGI (read the docs, so many options...)
       | 
       | * use HAProxy, so very very good
       | 
       | * scale python apps by using processes.
        
       | dilyevsky wrote:
        | Kinda funny they decided paying a ton of money to AWS was OK
        | but paying for NGINX Plus was not.
        
         | andrenotgiant wrote:
         | ClubHouse runs on AWS?
        
           | dilyevsky wrote:
            | Hm, actually it might be Google, based on where their
            | traffic is going (I only looked just now). OK, now it
            | makes more sense why support wasn't able to figure this
            | out =)
        
         | Spivak wrote:
         | I kinda get that honestly. It's why I'll spend $20 without even
         | thinking for take out but not spend $2 for an app. It's because
          | the cost of the software is way, way more than the money. It's
         | a commitment to actually use it and integrate it, deal with
         | their sales team, talk to purchasing, handle licensing, and
         | introducing friction to replacing it or using tools that don't
         | integrate well because "well we already pay for it." Licensing
         | also complicates deployments substantially when you're doing
         | lots of autoscaling.
         | 
         | And on top of that Nginx Plus is also expensive as hell.
        
           | rowanG077 wrote:
            | The buy-in to AWS is much, much larger than using a piece
            | of software, though.
        
           | dilyevsky wrote:
            | Don't you have to integrate the cloud? This whole post is
            | about having to put in a bunch of workarounds because the
            | cloud can't scale, apparently.
        
           | ClumsyPilot wrote:
           | "It's why I'll spend $20 without even thinking for take out
           | but not spend $2 for an app."
           | 
           | I pay for apps, its not a healthy attotude
        
         | [deleted]
        
         | spullara wrote:
         | The difference people see, as far as I can tell, is that AWS is
         | charging you cost+ and pure software companies need to charge
         | for value or die.
        
       | JanMa wrote:
        | Interesting to read that they are using Unix sockets to send
        | traffic to their backend processes. I know that it's easily
        | done when using HAProxy, but I have never read about people
        | using it. I guess the fact that they are not using Docker or
        | another container runtime makes sockets rather simple to use.
        
         | mst wrote:
         | I do that every chance I can get.
         | 
         | At a guess, it's probably most loved by people picking old
         | school simple architectures that aren't the sort of thing that
         | goes viral.
        
         | kvalekseev wrote:
          | It's the standard way to connect things on UNIX and provides
          | better performance. For example, PostgreSQL over TCP+SSL is
          | 175% slower than over a socket:
          | https://momjian.us/main/blogs/pgblog/2012.html#June_6_2012
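          | 
          | A minimal sketch of such a setup (paths assumed), taking
          | advantage of the fact that gunicorn config files are just
          | Python:
          | 
          |     # gunicorn.conf.py
          |     # The HAProxy side points at the same path, e.g.:
          |     #   server app unix@/run/gunicorn/app.sock
          |     bind = "unix:/run/gunicorn/app.sock"
          |     workers = 4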
        
       | stu2010 wrote:
       | Interesting to see this. It sounds like they're not on AWS, given
       | that they mentioned that having 1000 instances for their
       | production environment made them one of the bigger deployments on
       | their hosting provider.
       | 
       | If not for the troubles they experienced with their hosting
       | provider and managing deployments / cutting over traffic, it
       | possibly could have been the cheaper option to just keep
       | horizontally scaling vs putting in the time to investigate these
        | issues. I'd also love to see some actual latency graphs:
        | what's the P90 like at 25% CPU usage with a simple Gunicorn /
        | gevent setup?
        
         | ksec wrote:
          | I was wondering that too, but there aren't that many common
          | cloud providers that have a 96 vCPU offering.
          | 
          | I am also wondering about 144 workers on 96 _vCPUs_, which
          | are not 96 CPU _cores_ but 96 CPU _threads_. So effectively
          | 144 workers on 48 CPU cores, possibly running at a sub-3GHz
          | clock speed. But it seems they got it to work out in the end
          | (maybe at the expense of latency).
        
           | mst wrote:
            | Assuming you're running a system where normal
            | request/response handling blocks on database queries, it's
            | often optimal to have more workers than available CPU
            | threads, and 1.5x is a common rule of thumb to try first.
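            | 
            | As a sketch (the 1.5x is just a starting point to measure
            | against, not a law), in a gunicorn config that would be:
            | 
            |     # gunicorn.conf.py
            |     import multiprocessing
            | 
            |     # ~1.5 workers per CPU thread for workloads that
            |     # block on the database; tune against real latency
            |     workers = int(multiprocessing.cpu_count() * 1.5)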
        
       | catillac wrote:
       | Famous last words, but I get the sense that the need to handle
       | this sort of load on Clubhouse is plateauing and will decline
       | from here. The app seems to have shed all the people that drew
       | other people initially and lost its small, intimate feel and has
       | turned into either crowded rooms where no one can say anything,
       | or hyper specific rooms where no one has anything to say.
       | 
       | Good article though! I've dealt with these exact issues and they
       | can be very frustrating.
        
       | luhn wrote:
        | Unfortunately HAProxy doesn't buffer requests*, and buffering
        | is necessary for a production deployment of gunicorn. And for
        | anybody using AWS, ALB doesn't buffer requests either. Because
        | of this I'm actually running both HAProxy and nginx in front
        | of my gunicorn instances--nginx in front for request buffering
        | and HAProxy behind that for queuing.
       | 
       | If anybody is interested, I've packaged both as Docker
       | containers:
       | 
       | HAProxy queuing/load shedding:
       | https://hub.docker.com/r/luhn/spillway
       | 
       | nginx request buffering: https://hub.docker.com/r/luhn/gunicorn-
       | proxy
       | 
        | * It does have an http-buffer-request option, but this only
        | buffers the first 8kB (?) of the request.
        
       ___________________________________________________________________
       (page generated 2021-08-15 23:01 UTC)