[HN Gopher] Go, Containers, and the Linux Scheduler
___________________________________________________________________
Go, Containers, and the Linux Scheduler
Author : rbanffy
Score : 120 points
Date : 2023-11-07 19:10 UTC (3 hours ago)
(HTM) web link (www.riverphillips.dev)
(TXT) w3m dump (www.riverphillips.dev)
| ntonozzi wrote:
| I've been bitten many times by the CFS scheduler while using
| containers and cgroups. What's the new scheduler? Has anyone here
| tried it in a production cluster? We're now going on two decades
| of wasted cores:
| https://people.ece.ubc.ca/sasha/papers/eurosys16-final29.pdf.
| donaldihunter wrote:
| https://kernelnewbies.org/Linux_6.6#New_task_scheduler:_EEVD...
| the8472 wrote:
| The problem here isn't the scheduler. It's that the container
| imposes resource restrictions but the containerized process
| (Go) doesn't check the OS features used to impose them when
| calculating the available amount of parallelism.
| dilyevsky wrote:
| This is subtly incorrect - as far as Docker is concerned, the
| CFS cgroup extension has several knobs to tune: cfs_quota_us,
| cfs_period_us (the typical default is 100ms, not a second),
| and shares. When you set shares you get weighted proportional
| scheduling (but only when there's contention), while the former
| two enforce a strict quota. Don't use Docker's --cpu flag;
| use --cpu-shares instead to avoid (mostly useless) quota
| enforcement.
|
| From the Linux docs:
|
| - cpu.shares: The weight of each group living in the same
| hierarchy, which translates into the amount of CPU it is
| expected to get. Upon cgroup creation, each group gets assigned
| a default of 1024. The percentage of CPU assigned to the cgroup
| is the value of shares divided by the sum of all shares in all
| cgroups at the same level.
|
| - cpu.cfs_period_us: The duration in microseconds of each
| scheduler period, for bandwidth decisions. This defaults to
| 100000us or 100ms. Larger periods will improve throughput at
| the expense of latency, since the scheduler will be able to
| sustain a cpu-bound workload for longer. The opposite is true
| for smaller periods. Note that this only affects non-RT tasks
| that are scheduled by the CFS scheduler.
|
| - cpu.cfs_quota_us: The maximum time in microseconds during
| each cfs_period_us for which the current group will be allowed
| to run. For instance, if it is set to half of cfs_period_us,
| the cgroup will only be able to run for 50% of the time at
| peak. One should note that this represents aggregate time over
| all CPUs in the system. Therefore, in order to allow full usage
| of two CPUs, for instance, one should set this value to twice
| the value of cfs_period_us.
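|
| To make the quota arithmetic concrete, here is a minimal Go
| sketch of roughly what automaxprocs does. It assumes the cgroup
| v1 paths quoted above; cgroup v2's single cpu.max file is not
| handled:
|
|     package main
|
|     import (
|         "fmt"
|         "os"
|         "strconv"
|         "strings"
|     )
|
|     // effectiveCPUs derives a CPU count from the CFS bandwidth
|     // settings described above: quota / period, so a quota of
|     // 200000us with a 100000us period allows ~2 CPUs.
|     func effectiveCPUs() (float64, error) {
|         read := func(p string) (int64, error) {
|             b, err := os.ReadFile(p)
|             if err != nil {
|                 return 0, err
|             }
|             return strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
|         }
|         quota, err := read("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
|         if err != nil {
|             return 0, err
|         }
|         period, err := read("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
|         if err != nil {
|             return 0, err
|         }
|         if quota <= 0 || period <= 0 {
|             // A quota of -1 means no limit is set.
|             return 0, fmt.Errorf("no quota set")
|         }
|         return float64(quota) / float64(period), nil
|     }
|
|     func main() {
|         if n, err := effectiveCPUs(); err == nil {
|             fmt.Printf("cgroup allows ~%.2f CPUs\n", n)
|         }
|     }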
| Thaxll wrote:
| People using Kubernetes don't tune or change those settings;
| it's up to the app to behave properly.
| dilyevsky wrote:
| False. Kubernetes cpu request sets the shares; cpu limit sets
| the cfs quota.
| Thaxll wrote:
| You said to change Docker flags. Anyway, your post is
| irrelevant; the goal is to let the runtime know how many POSIX
| threads it should use.
|
| If you set request/limit to 1 core but run on a 64-core node,
| the runtime will see all 64 cores, which will bring performance
| down.
| dilyevsky wrote:
| The original article is about Docker. That's the point of my
| comment - don't set a cpu limit.
| riv991 wrote:
| I intended it to be applicable to all containerised
| environments. Docker is just easiest on my local machine.
|
| I still believe it's best to set these variables regardless of
| cpu limits and/or cpu shares.
| dilyevsky wrote:
| All you did is kneecap your app to lower performance so it fits
| under your arbitrary limit. Hardly what most people describe as
| "best" - only useful in a small percentage of use cases (like
| reselling compute).
| riv991 wrote:
| I've seen significant performance gains from this in
| production.
|
| Other people have encountered it too, hence libraries like
| automaxprocs existing and issues being open with Go for it.
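|
| For reference, automaxprocs only needs a blank import; its
| init() reads the cgroup quota and adjusts GOMAXPROCS. A minimal
| sketch of typical usage:
|
|     package main
|
|     import (
|         "fmt"
|         "runtime"
|
|         // The blank import's init() sets GOMAXPROCS to match
|         // the container's CPU quota.
|         _ "go.uber.org/automaxprocs"
|     )
|
|     func main() {
|         // GOMAXPROCS(0) reports the current value without
|         // changing it.
|         fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
|     }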
| riv991 wrote:
| Hi, I'm the blog author - thanks for the feedback.
|
| I'll try and clarify this. I think this is how the symptom
| presents, but I should be clearer.
| mratsim wrote:
| > Don't use Docker's --cpu flag and instead use --cpu-shares to
| avoid (mostly useless) quota enforcement.
|
| One caveat is that an application can detect when --cpu is
| used, as I think it's using cpuset. When quotas are used, it
| cannot detect them, and more threads than necessary will likely
| be spawned.
| cpuguy83 wrote:
| It is not using cpuset (there is a separate flag for this).
| --cpus tweaks the cfs quota based on the number of cpus on
| the system and the requested amount.
| dilyevsky wrote:
| --cpu sets the quota; there is a --cpuset-cpus flag for cpuset,
| and you can detect both by looking at /sys/fs/cgroup.
| cpuguy83 wrote:
| > "Don't use Docker's --cpu flag and instead use"
|
| This is rather strong language without any real qualifiers. It
| is definitely not "mostly useless". Shares and quotas are for
| different use-cases, that's all. Understand your use-case and
| choose accordingly.
| dilyevsky wrote:
| It doesn't make any sense to me why the --cpu flag tweaks the
| quota and not shares, since quota is useful in a tiny minority
| of use cases. A lot of people waste a ton of time debugging
| weird latency issues as a result of this decision.
| the8472 wrote:
| With shares you're going to experience worse latency if all
| the containers on the system size their thread pool to the
| maximum that's available during idle periods and then
| constantly context-switch due to oversubscription under
| load. With quotas you can do fixed resource allocation and
| the runtimes (not Go apparently) can fit themselves into
| that and not try to service more requests than they can
| currently execute given those resources.
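|
| In Go terms, a pool that fits itself to the effective
| parallelism rather than the host's core count might look like
| this sketch (assuming GOMAXPROCS has been set from the quota,
| e.g. by automaxprocs):
|
|     package main
|
|     import (
|         "fmt"
|         "runtime"
|         "sync"
|     )
|
|     func main() {
|         // Size the pool to the runtime's effective parallelism,
|         // not runtime.NumCPU(), which reflects the whole host.
|         workers := runtime.GOMAXPROCS(0)
|
|         jobs := make(chan int)
|         var wg sync.WaitGroup
|         for i := 0; i < workers; i++ {
|             wg.Add(1)
|             go func() {
|                 defer wg.Done()
|                 for j := range jobs {
|                     _ = j * j // placeholder work
|                 }
|             }()
|         }
|         for j := 0; j < 100; j++ {
|             jobs <- j
|         }
|         close(jobs)
|         wg.Wait()
|         fmt.Println("done with", workers, "workers")
|     }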
| gregfurman wrote:
| Discovered this sometime last year in my previous role as a
| platform engineer managing our on-prem kubernetes cluster as well
| as the CI/CD pipeline infrastructure.
|
| Although I saw this dissonance between actual and assigned CPU
| causing issues, particularly CPU throttling, I struggled to find
| a scalable solution that would affect all Go deployments on the
| cluster.
|
| Getting all devs to include the automaxprocs dependency was not
| exactly an option for hundreds of projects. Alternatively,
| setting every CPU request/limit to a whole number and then
| assigning that to a GOMAXPROCS environment variable in a k8s
| manifest was also clunky and infeasible.
|
| I ended up just using this GOMAXPROCS variable for some of our
| more highly multithreaded applications, which yielded some
| improvements, but I've yet to find a solution that is
| applicable to all deployments in a microservices architecture
| with high variability of CPU requirements across projects.
| jeffbee wrote:
| There isn't one answer for this. Capping GOMAXPROCS may cause
| severe latency problems if your process gets a burst of traffic
| and has naive queueing. It's best really to set GOMAXPROCS to
| whatever the hardware offers regardless of your ideas about how
| much time the process will use on average.
| linuxftw wrote:
| You could define a mutating webhook to inject GOMAXPROCS into
| all pod containers.
| hiroshi3110 wrote:
| How about GKE and containerd?
| rickette wrote:
| Besides GOMAXPROCS there's also GOMEMLIMIT in recent Go
| releases. You can use
| https://github.com/KimMachineGun/automemlimit to automatically
| set this limit, kinda like
| https://github.com/uber-go/automaxprocs.
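|
| Usage is analogous to automaxprocs - a blank import whose
| init() sets GOMEMLIMIT from the cgroup memory limit. A minimal
| sketch (by default the library reportedly uses ~90% of the
| cgroup's limit):
|
|     package main
|
|     import (
|         "fmt"
|         "runtime/debug"
|
|         // The blank import's init() sets GOMEMLIMIT from the
|         // cgroup memory limit.
|         _ "github.com/KimMachineGun/automemlimit"
|     )
|
|     func main() {
|         // A negative argument leaves the limit unchanged and
|         // returns the current value.
|         fmt.Println("GOMEMLIMIT:", debug.SetMemoryLimit(-1))
|     }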
| ImJasonH wrote:
| Thanks for sharing this!
|
| And as a maintainer of ko[1], it was a pleasant surprise to see
| ko mentioned briefly, so thanks for that too :)
|
| 1: https://ko.build
| dekhn wrote:
| The common problem I see across many languages is: applications
| detect machine cores by looking at /proc/cpuinfo. However, in a
| docker container (or other container technology), that file looks
| the same as the container host (listing all cores, regardless of
| how few have been assigned to the container).
|
| I wondered for a while if docker could make a fake /proc/cpuinfo
| that apps could parse that just listed "docker cpus" allocated to
| the job, but upon further reflection, that probably wouldn't work
| for many reasons.
| dharmab wrote:
| Point of clarification: Containers, when using quota based
| limits, can use all of the CPU cores on the host. They're
| limited in how much time they can spend using them.
|
| (There are exceptions, such as documented here:
| https://kubernetes.io/docs/tasks/administer-cluster/cpu-
| mana...)
| dekhn wrote:
| Maybe I should be clearer: Let's say I have a 16 core host
| and I start a flask container with cpu=0.5 that forks and has
| a heavy post-fork initializer.
|
| flask/gunicorn will fork 16 processes (by reading
| /proc/cpuinfo and counting cores) all of which will try to
| share 0.5 cores worth of CPU power (maybe spread over many
| physical CPUs; I don't really care about that).
|
| I can solve this by passing a flag to my application; my
| complaint is more that apps shouldn't consult /proc/cpuinfo,
| but have another standard interface to ask "what should I set
| my max parallelism to (NOT CONCURRENCY, ROB) so my worker
| threads get adequate CPU time and the framework doesn't time
| out on startup?"
| status_quo69 wrote:
| https://stackoverflow.com/questions/65551215/get-docker-
| cpu-...
|
| It's been a bit, but I do believe that dotnet does exactly
| this. Sounds like gunicorn needs a PR to mimic it, if they want
| to replicate this behavior.
|
| https://github.com/dotnet/runtime/issues/8485
| Volundr wrote:
| It's not clear to me what the max parallelism should actually
| be in a container with a CPU limit of .5. To my understanding,
| that limits the CPU time the container can use within a certain
| interval, but doesn't actually limit the parallel processes an
| application can run. In other words, that container with a .5
| CPU limit can indeed use all 16 physical cores of the machine.
| It'll just burn through its budget 16x faster (with the default
| 100ms period, a .5 limit is 50ms of quota, which 16 cores
| running flat out exhaust in ~3ms). Whether that's desirable vs
| limiting itself to one process is going to be highly
| application dependent and not something Kubernetes and Docker
| can just tell you.
| jeffbee wrote:
| That's not what Go does, though. Go looks at the population of
| the CPU mask at startup. It never looks again, which is
| problematic in K8s, where the visible CPUs may change while
| your process runs.
| dekhn wrote:
| What is the population of the CPU mask at startup? Is this a
| kernel call? A /proc file? Some register?
| EdSchouten wrote:
| On Linux, it likely calls sched_getaffinity().
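|
| For illustration, the same query is reachable from Go via
| golang.org/x/sys/unix (a minimal sketch):
|
|     package main
|
|     import (
|         "fmt"
|
|         "golang.org/x/sys/unix"
|     )
|
|     func main() {
|         var set unix.CPUSet
|         // Pid 0 means the calling process.
|         if err := unix.SchedGetaffinity(0, &set); err == nil {
|             // This count is what the Go runtime bases its
|             // startup CPU count on.
|             fmt.Println("CPUs in affinity mask:", set.Count())
|         }
|     }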
| dekhn wrote:
| hmm, I can see that as being useful but I also don't see
| that as the way to determine "how many worker threads I
| should start"
| jeffbee wrote:
| It's not a bad way to guess, up to maybe 16 or so. Most
| Go server programs aren't going to just scale up forever,
| so having 188 threads might be a waste.
|
| Just setting it to 16 will satisfy 99% of users.
| dekhn wrote:
| There's going to be a bunch of missing info, though, in
| some cases I can think of. For example, more and more
| systems have asymmetric cores. /proc/cpuinfo can expose
| that information in detail, including (current) clock
| speed, processor type, etc, while cpu_set is literally
| just a bitmask (if I read the man pages right) of system
| cores your process is allowed to schedule on.
|
| Fundamentally, intelligent apps need to interrogate their
| environment to make concurrency decisions. But I agree - Go
| would probably work best if it just picked a standard
| parallelism constant like 16 and let users know it can be tuned
| if they have additional context.
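|
| A sketch of that idea - cap the default at a constant unless
| the operator overrides it (the constant 16 is just the number
| floated in this thread):
|
|     package main
|
|     import (
|         "fmt"
|         "os"
|         "runtime"
|     )
|
|     func main() {
|         // Respect an explicit GOMAXPROCS env var; otherwise cap
|         // the default at 16.
|         const maxDefault = 16
|         if os.Getenv("GOMAXPROCS") == "" && runtime.NumCPU() > maxDefault {
|             runtime.GOMAXPROCS(maxDefault)
|         }
|         fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
|     }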
| jeffbee wrote:
| Yes, running on a set of heterogeneous CPUs presents
| further challenges, for the program and the thread
| scheduler. Happily there are no such systems in the
| cloud, yet.
|
| Most people are running on systems where the CPU capacity
| varies and they haven't even noticed. For example in EC2
| there are 8 victim CPUs that handle all the network
| interrupts, so if you have an instance type with 32 CPUs,
| you already have 24 that are faster than the others.
| Practically nobody even notices this effect.
| bruh2 wrote:
| As someone not that familiar with Docker or Go, is this
| behavior intentional? Could the Go team make it aware of the
| cgroups limit? Do other runtimes behave similarly?
| yjftsjthsd-h wrote:
| I'm fairly certain that .NET had to deal with it, and Java had
| or still has a problem, I forget which. (Or did you mean
| runtimes like containerd?)
| evntdrvn wrote:
| I know that the .NET CLR team adjusted its behavior to address
| this scenario, fwiw!
| the8472 wrote:
| So did OpenJDK and the Rust standard library.
___________________________________________________________________
(page generated 2023-11-07 23:00 UTC)