[HN Gopher] The Container Throttling Problem
___________________________________________________________________
The Container Throttling Problem
Author : rognjen
Score : 214 points
Date : 2021-12-26 08:42 UTC (14 hours ago)
(HTM) web link (danluu.com)
(TXT) w3m dump (danluu.com)
| jeffrallen wrote:
| Tl;dr, which is too bad, because normally danluu's stuff is
| great.
|
| From the bit I had the patience to read, it sounds like "we made
| a complicated thing and it's doing complicated things wrong in
| complicated ways".
|
| It is hard to believe that some of these CPU-heavy,
| latency-sensitive servers should really be in containers. Why are
| they not on dedicated machines? KISS.
| marcosdumay wrote:
| Linux is optimized for desktops and shared servers. When you
| own the entire machine and want to use it fully, that
| optimization gets in your way.
| londons_explore wrote:
| I think this problem would have been debugged and solved much
| quicker if they'd done a CPU scheduling trace. Then they could
| see, microsecond by microsecond, exactly which processes were
| doing which work, and what incoming requests are still waiting.
|
| Then, let a human go in and say "How come request #77 hasn't been
| processed yet, even though CPU #3 is busy printing unused debug
| data for a low-priority request and #77 is well past its
| deadline!?".
|
| Then you debug deeper and deeper, adjusting parameters and
| patching algorithms till you can get a CPU trace that a human can
| look at and think "yeah, I couldn't adjust this schedule by hand
| to get this work done better".
|
| In this process, most people/teams will find at least 10x
| performance gains if they've never done it before, and usually
| still 2x if you limit changes to one layer of the stack (e.g.
| "I'm just tweaking the application code - we won't touch the
| runtime, VM, OS, or hypervisor parameters").
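|
| (If you want to try this on Linux: a minimal sketch using perf,
| one widely available tool for scheduler traces - check the flags
| against your perf version:
|
|     perf sched record -- sleep 10    # capture scheduler events for 10s
|     perf sched timehist              # per-event wait/run latencies
|     perf sched latency               # per-task latency summary
|
| "timehist" is where you'd spot a runnable thread that isn't
| getting a core.)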
| neerajsi wrote:
| I don't know why the negative reaction. I've done the kind of
| analysis you've described many times over the years and been able
| to quickly identify exactly these kinds of issues. We had a
| similar problem in Windows when we first implemented the Dynamic
| Fair Share thread scheduler. It took a couple of months to build
| the right tooling for a proper scheduler trace, but with that
| available the problem was well understood within a week. I
| eventually rewrote the scheduler component and added a control
| law that gives better burstable behavior than the hard-cap quota
| this article seems to be describing.
| ghusbands wrote:
| That covers almost nothing in the article. It's a long article,
| so maybe you could quote the bit you're responding to.
|
| A CPU scheduling trace wouldn't easily show you the details of
| the kernel-level group throttling that was causing a lot of the
| issues, for example. They weren't having an issue with threads
| fighting other threads; they were having an issue with threads
| being penalised now for activity from several seconds ago,
| drastically reducing the amount of available CPU.
|
| The article clearly shows a lot of debugging and diagnostic
| patching ability, so it's unlikely they missed the simple
| options. Rather, they probably didn't mention them because they
| were obvious to try and didn't help.
| londons_explore wrote:
| > threads being penalised now for activity from several
| seconds ago,
|
| Exactly... They would have found this out much quicker with a
| trace. They would have seen "how come this application-level
| request is being handled on thread X, yet that thread is not
| running on any core, and many cores are idle?" Then they could
| quickly see why that thread isn't scheduled by enabling extra
| tracing detail and inspecting the internal data structures the
| scheduler uses to decide whether something is schedulable at
| that instant.
| jeffbee wrote:
| I completely agree. KUTrace would have been ideal for this
| and indeed KUTrace was developed to diagnose this exact
| problem.
| ghusbands wrote:
| I think you're suffering hindsight bias, here. A trace is
| rarely as clear as that, and it's hard to see the details
| it's not designed to expose.
|
| Your original message would probably be better received if
| you'd omitted the "I think this problem would have been
| debugged and solved much quicker [...]" and its insulting
| implications and instead started with "Sometimes, I find
| that CPU activity traces can really help with diagnosing
| this sort of problem".
| The_rationalist wrote:
| Please stop advocating for politeness over correctness. Sure,
| hindsight helps, but regardless, a company such as Twitter
| should have experts at tracing with tools and knowledge beyond
| what the average developer knows about tracing methodologies.
| Excusing that is an appeal to a lowering of technical
| excellence worldwide, which is hugely important and matters
| more than hypothetical feelings.
| londons_explore wrote:
| > a company such as Twitter should have experts at
| tracing
|
| In a big company, getting the person with the most skills
| to solve a problem to be the one actually tasked with
| solving the problem is very hard. This particular problem
| had many avenues to find a solution - and while I think
| my proposed route would have been quicker, if you aren't
| aware of those tools or techniques, then other avenues
| might be much quicker. When starting an investigation
| like this, you don't know where you're going to end up
| either - if it turned out that the performance cliff was
| caused by CPU thermal throttling, it would be hard to see
| in a scheduling trace - everything would just seem
| universally slow all of a sudden.
| neerajsi wrote:
| On Windows, we have the xperf and wpa toolset, which makes
| looking at holistic scheduling performance, including processor
| power management and device IO, tractable. Even then, the
| skillset to analyze an issue like the one presented here takes
| months to acquire, and only a few engineers can do it. We have
| dedicated teams to do this performance analysis work, and
| they're always in high demand.
| treffer wrote:
| I have been running k8s clusters at utilizations far beyond 50%
| (up to 90% during incidents), for web services/microservices, so
| tail latencies were important.
|
| The way we solved this:
|
| 1. Kernel settings. Check the settings of the Ubuntu low-latency
|    kernel, for example.
|
| 2. CFS tuning: short timeslices. There is good documentation on
|    how to do that (see the sketch below).
|
| 3. CPU pressure. We cordoned and load-shed overloaded nodes
|    (k8s-pressurecooker).
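|
| For item 2, the kind of knobs I mean (a sketch - exact values are
| workload-dependent, and on newer kernels, 5.13+, these sysctls
| moved under /sys/kernel/debug/sched/):
|
|     sysctl -w kernel.sched_min_granularity_ns=1000000    # shorter timeslices
|     sysctl -w kernel.sched_wakeup_granularity_ns=2000000
|     sysctl -w kernel.sched_latency_ns=6000000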
|
| By limiting the maximum CPU pressure to 20% you can say "every
| service will get all the CPU it needs at least 80% of the time on
| most nodes". This is what you want: a low chance of seeing CPU
| exhaustion, which is what predictable and stable tail latencies
| require.
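|
| A minimal sketch of the pressure check (not the actual
| k8s-pressurecooker code; assumes PSI is enabled and the node name
| equals the hostname):
|
|     # "some" = share of time at least one task was stalled on CPU
|     avg60=$(awk '/^some/ { sub("avg60=", "", $3); print $3 }' /proc/pressure/cpu)
|     if [ "$(echo "$avg60 > 20" | bc -l)" -eq 1 ]; then
|         kubectl cordon "$(hostname)"    # stop scheduling new pods here
|     fi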
|
| There are a few more knobs. E.g. scale services so that each uses
| at least one core: requests are effectively limits under
| congestion, and you can't get half a core continuously.
|
| Very nice to see that people go public about this. We need to
| drop the footprint of services. It is straight up wasted money
| and CO2.
| diegocg wrote:
| Quite an interesting problem. It is indeed a contradiction to
| make a service use all the CPUs on a system and, at the same
| time, put an upper limit on how much CPU it can use.
|
| The thread pool size negotiation seems a necessary fix -
| applications shouldn't be pre-calculating their pool sizes on
| their own anyway. But you get additional (smaller) problems, like
| giving more or fewer threads to a service depending on its
| priority.
|
| One of the big problems here, as I understand it, is trying to
| match a resource whose "size" changes dynamically (max CPU usage
| on a cgroup, which can change depending on whether another
| prioritised service is currently running) with a fixed-size
| resource (the number of threads when a service starts).
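|
| (Today an application can at least derive its pool size from its
| own cgroup instead of the host CPU count - a sketch assuming
| cgroup v1 paths:
|
|     quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
|     period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
|     if [ "$quota" -gt 0 ]; then
|         threads=$(( (quota + period - 1) / period ))   # ceil(quota/period)
|     else
|         threads=$(nproc)                               # quota of -1 = unlimited
|     fi
|
| but that snapshot goes stale the moment the quota changes, which
| is exactly the negotiation problem.)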
|
| As the number of cores per CPU grows, I wonder if this whole
| approach of scheduling tasks based on their CPU "usage" makes any
| sense. At some point, the basic scheduling unit should be one
| core, and tasks should be assigned a number of core units on the
| system for a given time.
| mabbo wrote:
| I have to wonder why the authors skipped the potential solution
| of removing containers and mesos from the equation entirely.
|
| If you gave this service a dedicated, non-co-located fleet,
| running the JVM directly on the OS, and ran basic autoscaling of
| the number of hosts, you'd eliminate a huge number of the moving
| parts of the system that are causing these issues.
|
| Yes, that would add to ops costs (edit: _human_ ops costs) for
| this service, but when you're spending 8 figures per year on it,
| clearly the budget is available.
|
| To quote the great philosopher Avril Lavigne: "Why'd you have to
| go and make things so complicated?"
| marcinzm wrote:
| Isn't the problem then that each host would be underutilized on
| average by a lot? It has X CPUs and the service can never use
| more than X CPUs. If a service has any spiky loads then it'd need
| to be overprovisioned on CPU to handle them at good latency.
|
| That seems significantly more expensive at scale.
| xorcist wrote:
| > that would add to ops costs for this service,
|
| Wouldn't fewer moving parts mean lower operational costs?
| Kalium wrote:
| Only to the extent that cost is a function of complexity, which
| isn't always the case. In a case like this, going to bare metal
| likely brings significant drawbacks in organizational complexity,
| orchestrational complexity, and more, whereas the current setup
| allows for much better utilization of memory and CPU resources.
|
| Telling someone whose car is making some funny noises that
| it's simpler to go back to horse-and-buggy times would both
| increase costs and decrease the number of user-serviceable
| moving parts. There's some significant overhead attached.
| xorcist wrote:
| Bare metal has nothing to do with this. It isn't even
| touched upon in the article. The article discusses a scheduler,
| and the parent post suggests exempting these kinds of jobs from
| the scheduler in question, which they obviously aren't a very
| good fit for.
|
| Should you wish to really stretch that car analogy, a bit more
| appropriate than a horse would be: if you aren't happy that your
| travel agency isn't booking your taxi trips in time, try booking
| with the taxi company directly.
| mabbo wrote:
| Yes and no.
|
| It would hopefully lower the hardware costs (that's the entire
| goal of this article), but you'd need more people to manage it, I
| would guess. Mesos and containers automate a lot of thinking
| work.
| Kalium wrote:
| Once you move to hosts dedicated to specific services, as
| seems to be the suggestion here, you might increase the overall
| hardware cost across your set of services, even if the cost of
| some individual services decreases.
| toast0 wrote:
| I suspect it's the temptation of oversubscription. If service A
| and service B each use 50% of a server, it's so tempting to put
| them both on one server to maximize efficiency - even if you
| sometimes end up needing 4 servers running A and B together to
| serve load that one server of A and one of B could handle.
|
| Or, if you've broken things up into small pieces that aren't big
| enough to use a whole server, dedicating a server to each can
| feel inefficient as well.
| nvarsj wrote:
| CFS quotas have been broken for a long time, with processes being
| throttled well below full utilisation of their quota. I think
| every serious user of k8s discovers this the hard way. Recent
| changes have improved the scheduler's handling of quotas, but I'm
| surprised Twitter was using them at all in 2019. Java GC also
| suffers badly under quotas. Pinning CPUs is probably the best
| compromise; otherwise, just use CPU requests with no limits.
| genewitch wrote:
| I can't imagine the man-hours that went into creating this.
| Knowing that core contention is still an issue that isn't solved
| will let me waltz into contract jobs and save companies money,
| e-waste, and power costs - this causes hope, joy, something like
| that.
|
| In case anyone missed it, the removal of throttling in certain
| circumstances saved Twitter ~$5mm/year, if I read it correctly -
| with a naive kernel patch. While it takes dedicated engineers
| decades of knowledge to know where to aim an intern, an intern
| still banged out a kernel scheduling patch that made what I
| assume is a huge difference.
|
| Dan Luu is a gem.
| euiq wrote:
| Note that the intern in question was close to finishing their
| PhD in a related area.
| wolf550e wrote:
| "Low 8 figures" is more like $25 per year, and that's a single
| service. Across all services it's more.
| fulafel wrote:
| Self-teergrubing by cpu quotas.
|
| I wonder what mechanism could be used to communicate the
| available timeslice length, so the app/thread could stop taking
| on requests when throttling is imminent.
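|
| The closest existing signal I know of is the cgroup's cpu.stat
| (cgroup v1 field names), which only tells you after the fact that
| throttling happened:
|
|     cat /sys/fs/cgroup/cpu/cpu.stat
|     # example output:
|     # nr_periods 1523
|     # nr_throttled 214             periods in which the quota ran out
|     # throttled_time 42305000000   total ns spent throttled
|
| Polling that is reactive, though, not the predictive "quota
| remaining in this period" signal you'd want.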
| mkhnews wrote:
| Hi, I recently found similar behavior in an app for our company.
| A simple threaded CPU benchmark shows:
|
|     % numactl -C 0,5 ./ssp 12
|     elapsed time: 99943 ms
|
|     # cpu.cfs_quota_us = 200000, cpu.cfs_period_us = 100000
|     % cgexec -g cpu:cgtestq ./ssp 12
|     elapsed time: 420888 ms
|
|     # cpu.cfs_quota_us = 2000, cpu.cfs_period_us = 1000
|     % cgexec -g cpu:cgtestqx ./ssp 12
|     elapsed time: 168104 ms
|
| Also interesting: our app uses some RR thread priorities, and
| those do not get controlled via the cgroup cpu.cfs settings.
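|
| You can reproduce that with chrt - something like
|
|     cgexec -g cpu:cgtestq chrt -r 10 ./ssp 12
|
| runs the benchmark under SCHED_RR, which (on cgroup v1, as far as
| I understand) is budgeted by cpu.rt_runtime_us rather than the
| cpu.cfs_* settings.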
| tybit wrote:
| I realise that Twitter is using Mesos, but for those of us on
| Kubernetes, does guaranteed QoS solve this?
| https://kubernetes.io/docs/tasks/configure-pod-container/qua...
| mac-chaffee wrote:
| QoS classes are only used "to make decisions about scheduling
| and evicting Pods." It still uses the Completely Fair
| Scheduler, which is where the problem came from (as far as I
| understand).
| KptMarchewa wrote:
| I think they are not using Mesos now.
|
| https://dzone.com/articles/what-can-we-learn-from-twitters-m...
| bboreham wrote:
| If you also use the CPU Manager feature and request an integer
| number of cores, yes. Then for example if you request 3 cores
| your process will be pinned onto 3 specific cores and nothing
| else will be scheduled onto those cores, and CFS will not
| throttle your process.
|
| https://kubernetes.io/docs/tasks/administer-cluster/cpu-mana...
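|
| You can verify the result from inside the container (paths assume
| cgroup v1 and the static CPU Manager policy):
|
|     cat /sys/fs/cgroup/cpuset/cpuset.cpus   # e.g. "4-6": your dedicated cores
|     taskset -cp 1                           # affinity of the container's PID 1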
| d3nj4l wrote:
| As a newbie developer who hasn't dug into this stuff before but
| found this post fascinating: does anybody have any good pointers,
| like books/articles/videos, to learn about low-level details like
| this?
| closeparen wrote:
| Computer Systems: A Programmer's Perspective.
|
| Operating Systems: Three Easy Pieces.
|
| The most important parts of my undergrad - much more so than
| Algorithms or anything mathematical.
| mochomocha wrote:
| At Netflix, we're doing a mix of what Dan calls "CPU Pinning and
| Isolation" (i.e., host-level scheduling controlled by user-space
| logic) [1] and "Oversubscription at the cluster scheduler level"
| (through a bunch of custom k8s controllers) to avoid placing
| unhappy neighbors on the same box in the first place, while
| oversubscribing the machines based on containers' usage patterns.
|
| [1]: https://netflixtechblog.com/predictive-cpu-isolation-of-
| cont...
| tyingq wrote:
| That's a really terrific article, thanks for sharing. I wonder
| if Linux will eventually tie the CPU scheduler together with
| the cgroup cpu affinity functionality, and some awareness of
| cores, smt, shared cache, etc. Seems a shame that you have to
| tie all that together yourself, including a solver.
| eternalban wrote:
| The article mentions "nice values". What does that mean?
| Underutilization/under-provisioning?
|
| [p.s. thanks for the replies]
| bboreham wrote:
| "nice" in Unix is a way to lower the priority of a process,
| so that others are more likely to be scheduled.
|
| Eg https://man7.org/linux/man-pages/man2/nice.2.html
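|
| For example (the command and PID here are placeholders):
|
|     nice -n 10 ./batch-job    # start at niceness 10 (lower priority)
|     renice -n 5 -p 1234       # lower the priority of a running process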
| chrisoverzero wrote:
| It's the kind of value set by things like `nice(2)`:
| https://linux.die.net/man/2/nice
|
| In short, an offset from the base process priority.
| staticassertion wrote:
| I remember working in Java where we'd have huge threadpools that
| sat idle 90% of the time.
|
| It feels like you can eliminate most of this problem in other
| languages by using a much smaller pool and leveraging userland
| concurrency/scheduling. You probably don't want N + K threads on
| N cores, but in some languages you don't have much choice. Java
| has options for userland concurrency, but they're pretty ugly and
| I don't think you'll find a lot of integration.
|
| Containers make this a bit harder, and the Linux kernel sounds
| like it had a pretty silly default behavior, but how much of this
| is also just Java?
| le-mark wrote:
| I don't think blaming the JVM is productive, but identifying the
| "JVM" as a proxy for "language runtime designed to make full use
| of multiprocessor machines" is the core element here.
|
| One could imagine a VM or runtime that is async and multiprocess
| and that also enforces quotas on cycles and heap, such that these
| kinds of "noisy neighbor" events aren't a problem.
|
| There have been solutions in this direction that haven't caught
| on: a multi-tenant JVM existed at one time, and at least one JS
| implementation has this ability. I've often thought Lua would be
| ideal for this.
| staticassertion wrote:
| Yeah to be clear I'm not saying all fault lies with the JVM
| here. But a lack of concurrency primitives exacerbates the
| problem by encouraging very large threadpools.
| neerajsi wrote:
| I haven't used it in anger, but it looks to me like the C#
| async compiler and library support helps reduce the need
| for large threadpools.
|
| But it also looks like the GC was a major contributor, so
| that would not be as influenced by the differences between
| dotnet and Java.
| YokoZar wrote:
| > The gains for doing this for individual large services are
| significant (in the case of service-1, it's [mid 7 figures per
| year] for the service and [low 8 figures per year] including
| services that are clones of it), but tuning every service by hand
| isn't scalable.
|
| This point seems wrong to me, bound too much by requiring
| solutions to be done by a small team of engineers who already
| have a mandate to work on the problem.
|
| With numbers like that Twitter could, profitably, hire dozens of
| engineers that do literally nothing else. Just tweak thread pool
| sizes all day, every day, for service after service. Even though
| it's a boring, manual, "noncomplex" thing, this type of work is
| clearly valuable and should have happened years ago.
|
| Most likely Twitter's job ladder, promotion process, and hiring
| pipeline strongly incentivize people to avoid such work even when
| it has clear impact. They are very much not alone in that regard.
| xyzzy_plugh wrote:
| I solved this same problem for a company also in 2019 (as the
| CPU quota bug hadn't been fixed yet) and it resulted in
| something like 8 figures of yearly cost savings.
|
| You are correct in that most companies are not equipped to
| staff issues like this. Most places just accept their bills as
| a cost of doing business, not something that can be optimized.
| karmakaze wrote:
| I can see how there are diminishing returns when optimizing, but
| I would never say that server bills are not a metric to be aware
| of and address. I've always had some idea of what's practically
| achievable in terms of efficiency within a given architecture,
| and I aim for something that gets a good amount of the way there
| without undue effort. I also enjoy thinking about longer-term
| efficiency improvements, whether they improve latency or the
| bottom line, while knowing that's secondary to providing
| additional value and gaining customers during a growth period.
| [deleted]
| eloff wrote:
| This. I've never worked anywhere that had dedicated ongoing
| effort toward cost reduction in compute services. It's always a
| once-every-couple-of-years exercise: look at the cloud spending
| and spend a little effort on the low-hanging fruit.
| _jal wrote:
| A side effect of deciding systems engineers can be replaced
| by devops.
|
| In reality, you want both. A good systems person can save you
| a ton of money.
| ithkuil wrote:
| A nice approach is to staff a DevOps team with people from
| diverse backgrounds, some more toward the systems side of the
| spectrum and some more toward the dev side, as long as everybody
| knows a little bit of the other side. This helps avoid a culture
| where devs "throw some code over the fence" and sysops people
| just moan that devs are careless and/or should do things
| differently, but without a clear way of showing exactly how
| differently things should be done (and also without a clear
| understanding of why devs ended up choosing the way they chose).
| spockz wrote:
| Afaik, Twitter already has significant and mature infrastructure
| in place for running a plethora of different instances on
| shadowed traffic and comparing settings. It is used at least by
| the people working on the optimised JRE.
| mnutt wrote:
| There was a startup I talked to at one point that had a service
| where they'd run an agent on your instances to collect
| performance data and live-tune kernel parameters, with some AI to
| find the best parameters for your workload. No idea how well it
| worked, but it seems like a potentially good application of AI.
| servytor wrote:
| Do you remember the name for it? Sounds really useful.
| syngrog66 wrote:
| Log4Shell.com
| vegetablepotpie wrote:
| They could hire contractors or consultants to do the job, no?
| That class of worker would not be concerned about promotion
| opportunities. For some reason they haven't done that either.
| joshuamorton wrote:
| > With numbers like that Twitter could, profitably, hire dozens
| of engineers that do literally nothing else. Just tweak thread
| pool sizes all day, every day, for service after service. Even
| though it's a boring, manual, "noncomplex" thing, this type of
| work is clearly valuable and should have happened years ago.
|
| The issue is that once you hire a dozen engineers to do this
| (say for 5M a year in total), and they do it for a year, they
| save mid 8 figures (keep in mind this was the largest service,
| so the savings across other services will be smaller).
|
| Then can they keep saving mid 8 figures every year?
|
| I'll paraphrase something I previously wrote privately, but
| imagine you have some team that's able to save 10% of your
| fleetwide resources this year. They densify and optimize and
| improve defaults. So you can now grow your services by 10%
| without any additional cost increase, and you do! The next year,
| they've already saved the easiest 10%. Can they save 10% again?
| Can they keep it up every year? How long until they're saving 3%
| a year, or 1% a year? And that's if you keep the team the same
| size, at which point it's clearly losing money! If you could
| afford a dozen people to save 10%, you can only really afford 1-2
| to save 1%, but then you're likely to get an even smaller return.
|
| Unless you expect to be able to maintain the same relative value
| of optimizations every year, 3 or 5 years out, it's not worth
| hiring an FTE to work on them.
|
| I should note that I've experienced this myself: I was working in
| an area where resource optimization could lead to "significant"
| savings (not 8 figures, but 6 or maybe 7). In my first 6 months
| working in this area, I found all sorts of low-hanging fruit and
| saved a fair amount. The second six months, 5-10x less ROI. I
| gave up even trying in the third six months: if I come across a
| thing, I'll fix it, but it's no longer worthwhile to _look_.
| ZephyrBlu wrote:
| If you're looking in the same area/domain, what you're saying
| is almost certainly true.
|
| If you're looking across the business as a whole, it seems
| likely that there is a lot of this kind of work lying around
| because there is not much incentive for people to tackle it
| as described in this comment:
| https://news.ycombinator.com/item?id=29691847.
| ghusbands wrote:
| Certainly, at one place I worked, the higher-ups were very
| clear that any work on cost reduction was wasted and devs and
| ops should always work on increasing net income, not decreasing
| costs. It was consistently claimed that cost reduction can only
| get you small percentage decreases, whereas increases in income
| are larger and compound better.
| avianlyric wrote:
| The big difference between cost reduction and income
| increase, is that one has a hard limit on possible upside,
| whereas the other does not. You can reduce your costs by more
| than your total costs, but it's quite possible to increase
| your income by many multiples of your existing income.
|
| Result is that maximising income is generally better than
| reducing cost. Of course, as with all generalisation, there
| are situations where this approach doesn't hold true. But as
| a high level first order strategy, it's a good one to adopt.
| ClumsyPilot wrote:
| "one has a hard limit on possible upside, whereas the other
| does not."
|
| That's plain wrong; the global market for cars, bicycles, and
| what have you has a limited size. Every large company that's a
| market leader understands that.
| beiller wrote:
| Would capturing 100% of the market, granting a monopoly,
| essentially grant unlimited upside? Because you could just jack
| the price up to absurdity. Also, wouldn't reducing cost to zero
| have an infinite upside as well? With zero cost you can produce
| infinite output. Getting real pedantic here, heh.
| wolf550e wrote:
| You meant to write "You cannot reduce your costs"
| lostdog wrote:
| You can do similar types of work, but target speed increases
| instead. Getting all the batch jobs to finish faster can help
| developer productivity, and is even worthwhile at a new
| startup.
| tomrod wrote:
| The higher-ups need some basic economics education, it appears.
| Certainly you shouldn't invest everything in long-term returns,
| but you should be open to them.
|
| Instead, when something has a 3-year payoff, executives get antsy
| in orgs that have a 2-year cycle on exec positions.
| kevin_nisbet wrote:
| I've had similar messages articulated to me by my manager,
| and have found myself articulating similar messages to my
| team.
|
| For my team, the key is that, given the state of the
| project/product we manage, cost optimization is likely one of the
| lowest-ROI activities we could sink time into. That doesn't mean
| we don't tackle clear low-hanging fruit when we see it, or use
| low-hanging fruit as training opportunities to onboard new team
| members, but we need to be conscious of where we make
| investments, and for the stage we're at, the more important
| investment is in areas that make our product more appealing to
| more customers.
|
| I think it's easy to say that someone like an intern could pay
| for themselves with savings. But to me this overlooks that
| someone has to manage that intern, get them into change
| management, review the work, investigate mistakes or
| interruptions from the changes, etc. And then they're still the
| lowest-earning employee, since most of us aren't hired to pay
| for ourselves, but to turn a profit for the company.
|
| So while I'm not sure I agree with the message "whereas increases
| in income are larger and compound better", I certainly understand
| and have pushed a similar message: that we be conscious of where
| we're spending our time and select the highest-impact activities
| given the resources and team we have. Sometimes that may be
| fixing high wastage, but very frequently it will be investing in
| the product. And I think for the stage of the product we manage,
| that is the best choice for us.
| R0b0t1 wrote:
| This was verified by (what should be) a famous Harvard Business
| School study: quality before cost, revenue before expenses, and
| there is no three.
| richardwhiuk wrote:
| This article is quite old - the kernel patch has been available
| for a while now, I believe, and CMK is no longer in beta (the
| article references K8s 1.8 and 1.10, but the current latest
| version is 1.23).
| cpitman wrote:
| There are updates from this month at the bottom!
| [deleted]
| throwaway984393 wrote:
| I wonder if k8s' bin-packing features would help here.
|
| The graphs seem to validate my general assumption that large-load
| tasks just suck at scaling, whereas small-load tasks can be
| horizontally scaled more easily without falling over. The general
| assumption being that for most applications, if you ignore
| everything else about an operation and assume a somewhat random
| distribution of load, several small-load services make better use
| of the available resources on average than a single large-load
| service. That's just been an assumption in my head; I can't
| remember any data to back it up.
|
| Back in the day when I worked on a large-traffic internet site,
| we tried jiggering with the scheduler and other kernel tweaks,
| and in the end we literally just routed certain kinds of requests
| to certain machines and said "these requests need loads of cache"
| (memcache, local disk, nfs cache, etc) and "these requests need
| fast io" and "these requests need a ton of cpu". It was "dumb"
| engineering but it worked out.
| zekrioca wrote:
| I think there is another solution, not discussed in the article,
| which lies between CPU isolation and pinning: virtualize the
| container's /proc so it doesn't think the number of available
| (logical) processors is larger than a certain limit set by the
| cluster operator - a limit which is actually lower than the
| physical capacity of the server (so as to allow overbooking and
| increase their 'redacted' savings in $M). This is basically
| presenting a container/application with a number of vCPUs that it
| can use in any way it sees fit, but with all the (invisible)
| control group (quota) limits (i.e., "throttling") the author
| discusses in the text, and it prevents the application from
| spawning so many threads that it inevitably overloads the
| physical server and destroys tail latency.
|
| This is at the kernel level, as opposed to paravirtualization.
| And I guess this is Twitter's use case, but it should not be
| confused with the typical vCPU offerings one sees in most cloud
| providers, which are usually implemented through hypervisors such
| as QEMU/KVM, VMware, or Xen.
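|
| You can see the mismatch from inside any quota-limited container
| today (assuming no cpuset pinning):
|
|     nproc                                      # reports all host CPUs, e.g. 64
|     cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us    # e.g. 200000
|     cat /sys/fs/cgroup/cpu/cpu.cfs_period_us   # 100000, i.e. ~2 CPUs' worth
|
| A virtualized /proc would make the first number agree with the
| quota.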
|
| I'm not sure why Mesos (maybe it tried and didn't succeed), K8s
| (where this is available through external Intel code), or even
| Docker never really pursued this, but I guess they want to keep
| their internal (operational) overheads below a limit, and
| possibly also to maintain the metastability of their services
| [1]. But now we see where it leads, with all these redacted
| numbers in the article.
|
| [1]
| https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s...
|
| Ps: edits for clarifications.
| KaiserPro wrote:
| We had a similar problem, but it exhibited differently.
|
| We had two lumps of compute:
|
| 1) A huge render farm, at the time around 36k CPUs. The driving
| goal was 100% utilisation. It was a shared resource, and when
| someone wasn't using their share it was aggressively loaned out
| (both CPU and licenses). Latency wasn't an issue.
|
| 2) A much smaller VM fleet, where latency was an issue, even
| though both the contention and the utilisation were much lower.
|
| Number two was the bigger issue. We had a number of processes
| that needed 100% of one CPU all the time, and they were
| stuttering. Even though the VM thought they were getting 100% of
| a core, they were in practice getting ~50% according to the
| hypervisor. (This was a 24-core box with only one CPU-heavy
| process.)
|
| After much graphing, it turned out to be because we had too many
| VMs defined with 4-8 CPUs on a machine. Because the hypervisor
| won't allocate only 2 CPUs to a 4-CPU VM, there was a lot of
| spin-locking while waiting for enough free cores to schedule the
| VM. So even though the VMs thought they were getting 100% CPU,
| the host was actually giving each VM 25%.
|
| The solution was more, smaller machines. The more threads you ask
| to be scheduled at the same time, the less ability the hypervisor
| has to share.
|
| We didn't see this on the big farm, because the only thing we
| constrained was memory. The orchestrator would make sure that a
| thing configured for 4 threads was put in a 4-thread slot, but we
| would configure each machine to have 125% of its CPU allocated.
| 3np wrote:
| The way the issue is presented, it sounds to me like context
| switching should be one of the major considerations, especially
| when talking about CPU pinning, yet it's only mentioned in
| passing. How come?
| bboreham wrote:
| Dave Chiluk did a great talk covering a similar scheduler
| throttling problem.
|
| https://m.youtube.com/watch?v=UE7QX98-kO0
___________________________________________________________________
(page generated 2021-12-26 23:01 UTC)