[HN Gopher] CPU Pinning and CPU Sets (2020)
___________________________________________________________________
CPU Pinning and CPU Sets (2020)
Author : arnold_palmur
Score : 50 points
Date : 2021-11-29 13:45 UTC (9 hours ago)
(HTM) web link (www.netmeister.org)
(TXT) w3m dump (www.netmeister.org)
| bogomipz wrote:
| This class looks great. I noticed the course page states:
|
| >"This class overlaps significantly with CS392 ``Systems
| Programming'' -- if you have taken this class, please talk to me
| in person before trying to register for CS631."[1]
|
| Does anyone know if the videos for CS392 might also be online? I
| tried some basic URL substitutions, but I came up empty.
|
| [1] https://stevens.netmeister.org/631/
| Sohcahtoa82 wrote:
| Tangential, but does anyone know of a Windows utility for
| automatically pinning processes?
|
| I like to keep up with several cryptocurrency prices on Coinbase,
| but the Coinbase Pro pages consume a pretty significant amount of
| CPU time. I'd love to be able to just shove all of those
| processes to a single CPU thread to reduce the impact on overall
| system performance.
|
| I suppose it wouldn't be too hard to write a Python script that
| does this automatically: scan window titles to look for
| "Coinbase Pro", find the owning PID, then call SetAffinity...
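|
| Roughly, as a sketch (assuming pywin32 and psutil are
| installed; the title match and the target CPU are just
| placeholders):
|
|   import psutil
|   import win32gui
|   import win32process
|
|   def collect(hwnd, pids):
|       # record the PID behind any visible window whose title
|       # mentions Coinbase Pro
|       title = win32gui.GetWindowText(hwnd)
|       visible = win32gui.IsWindowVisible(hwnd)
|       if visible and "Coinbase Pro" in title:
|           _tid, pid = win32process.GetWindowThreadProcessId(hwnd)
|           pids.add(pid)
|       return True  # keep enumerating
|
|   pids = set()
|   win32gui.EnumWindows(collect, pids)
|   for pid in pids:
|       try:
|           # SetProcessAffinityMask under the hood
|           psutil.Process(pid).cpu_affinity([0])
|       except psutil.Error:
|           pass  # the process may have exited
|
| Scheduled to run every minute or so, something like this
| would re-pin things after a browser restart.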
| ayende wrote:
| The Windows Task Manager has the ability to set process
| affinity.
| Sohcahtoa82 wrote:
| Well, yeah, but I'm looking for a way to automate it. If I
| restart Firefox, all those affinities get reset.
| lclarkmichalek wrote:
| Much cheaper than CPU cgroups if you want some coarse-grained
| isolation when stacking workloads.
| 1_player wrote:
| CPU pinning is pretty useful for virtual machines, e.g. I've
| used it myself to improve performance on a VFIO setup by
| limiting which cores qemu runs on and thus improving cache
| locality.
|
| https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF#CP...
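|
| For anyone who wants to script it outside libvirt, a minimal
| Linux-only sketch (the qemu PID is a placeholder; libvirt's
| <vcpupin>, as in the linked wiki, is more precise since it
| pins each vCPU thread to its own core):
|
|   import os
|   import psutil
|
|   QEMU_PID = 12345        # hypothetical
|   CORES = {2, 3, 4, 5}
|
|   proc = psutil.Process(QEMU_PID)
|   for t in proc.threads():
|       # every qemu thread is a schedulable task of its own,
|       # so pin each one to the allowed core set
|       os.sched_setaffinity(t.id, CORES)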
|
| What are other real-world uses of CPU pinning?
| spacechild1 wrote:
| The Supernova audio server (https://github.com/supercollider/su
| percollider/tree/develop/...) pins each thread of its DSP
| thread pool to a dedicated core.
| gpderetta wrote:
| When implementing one-thread-per-core software architectures,
| explicit pinning is pretty much a requirement.
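|
| A minimal Python rendition of the pattern, just as a sketch
| (Linux-only; real implementations are typically C/C++/Rust
| using pthread_setaffinity_np, but the shape is the same):
|
|   import os
|   import threading
|
|   def worker(cpu):
|       # pid 0 == the calling thread on Linux, so each worker
|       # binds only itself to its own logical CPU
|       os.sched_setaffinity(0, {cpu})
|       # ... run this shard's queue / event loop here ...
|       print("worker bound to CPU", cpu)
|
|   threads = [threading.Thread(target=worker, args=(cpu,))
|              for cpu in sorted(os.sched_getaffinity(0))]
|   for t in threads:
|       t.start()
|   for t in threads:
|       t.join()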
| mugsie wrote:
| Memory and PCIe lanes in larger systems can be attached to
| particular CPUs, or to subsections of a single CPU (e.g. AMD
| Threadrippers / Epycs in particular), where traversing the
| inter-CPU / CCX links can cause latency or bandwidth issues.
|
| The software will be pinned to CPU cores close to the RAM or
| PCIe device it is using.
|
| I've only really seen it be an issue in crazy large-scale
| systems, or where you have 4 CPUs, but I haven't spent a huge
| amount of time on microsecond-critical workloads.
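|
| On Linux the device-to-CPU locality is visible in sysfs, so
| the pinning can be derived rather than hard-coded. A rough
| sketch (the PCI address is made up):
|
|   import os
|
|   BDF = "0000:41:00.0"   # hypothetical NIC/GPU address
|   path = "/sys/bus/pci/devices/%s/local_cpulist" % BDF
|
|   def parse_cpulist(s):
|       # expands e.g. "0-7,64-71" into a set of CPU numbers
|       cpus = set()
|       for part in s.strip().split(","):
|           lo, _, hi = part.partition("-")
|           cpus.update(range(int(lo), int(hi or lo) + 1))
|       return cpus
|
|   with open(path) as f:
|       local = parse_cpulist(f.read())
|   # bind the calling thread to the device-local cores
|   os.sched_setaffinity(0, local)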
| amarshall wrote:
| Isn't this particular issue partially solved with proper NUMA
| support in whatever kernel or scheduler is being used?
| jandrewrogers wrote:
| Databases and other high-throughput data infrastructure
| software use CPU pinning, as does HPC. The reasons are similar:
| higher cache locality, reduced latency, and more predictable
| scheduling. It is most useful when the process is taking over
| part or all of the resources of the machine anyway.
| foton1981 wrote:
| Kubernetes makes CPU pinning rather simple. You just need to
| meet the conditions for the Guaranteed QoS class.
| https://kubernetes.io/docs/tasks/administer-cluster/cpu-mana...
|
| We are running lots of Erlang on k8s and CPU pinning improves
| performance of Erlang schedulers tremendously.
| bogomipz wrote:
| Interesting. I would be curious to hear why pinning here
| improves performance. Is this something specific to the BEAM
| VM? Does this come at a hit to K8S scheduler flexibility?
| toast0 wrote:
| I don't have experience with k8s, but with BEAM on a
| traditional system, if BEAM is using the bulk of your CPU,
| you'll tend to get better results if each of the (main) BEAM
| scheduler threads is pinned to one CPU thread. Then all of
| the BEAM scheduler balancing can work properly. If both the
| OS and BEAM are trying to balance things, you can end up with
| a lot of extra task movement or extra lock contention when a
| BEAM thread gets descheduled by the OS to run a different
| BEAM thread that wants the same lock.
|
| On most of the systems I ran, we didn't tend to have much of
| anything running on BEAM's dirty schedulers or other OS
| processes. If you have more of a mix of things, leaving
| things unpinned may work better.
| versale wrote:
| Is your setup open source? I'd love to know more about the
| upsides of Erlang/OTP on top of k8s. Do you use hot code
| reloads?
| inetknght wrote:
| CPU pinning can be particularly important if you're running
| virtual machines and/or hyperthreading-friendly workloads
| jeffbee wrote:
| Glad you mentioned hyperthreading. That can be easy to
| overlook. You reserved CPU 1 for a given workload? Did you
| remember CPU 49 as well?
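|
| The sibling map is exposed in sysfs, so this is scriptable
| rather than something to remember. A quick sketch:
|
|   def smt_siblings(cpu):
|       # returns the set of hyperthreads sharing a core with
|       # `cpu`, e.g. {1, 49} in the numbering above
|       base = "/sys/devices/system/cpu/cpu%d" % cpu
|       cpus = set()
|       with open(base + "/topology/thread_siblings_list") as f:
|           for part in f.read().strip().split(","):
|               lo, _, hi = part.partition("-")
|               cpus.update(range(int(lo), int(hi or lo) + 1))
|       return cpus
|
|   print(smt_siblings(1))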
| krona wrote:
| True, however, CPU pinning is not the same as
| reserving/isolating the CPU. This is often not made clear in
| articles about CPU pinning.
| thanatos519 wrote:
| The main point of HT is to reduce the cost of context
| switching by keeping twice the number of contexts close to
| the core. I would guess that parts of the process context
| like program counter, TLB, etc live inside the 'HT' and would
| have to be saved/restored every time the process moves
| between threads, even on the same core. Reserving both 'HT'
| on a core gets you cache locality, but isn't there a cost to
| moving the process back and forth, even if that data is in
| L1/L2?
|
| (I'm looking at 'lstopo' from package 'hwloc', Linux on my
| Haswell Xeon: 10MB shared L3, 256KB L2, 32KB L1{d,i} per
| core)
|
| Given my (educated) guess, I've told irqbalance to put
| interrupts only on 'thread 0' and then I schedule cpu-
| intensive tasks to 'thread 1' and schedule them very-not-
| nicely. Linux seems pretty good about keeping everything else
| on 'thread 0' when I have 'thread 1' busy so I don't do any
| further management.
|
| I can have the 'thread 1' of 4 cores pegged at 100% with no
| impact on interactive or I/O performance.
| jeffbee wrote:
| In the context of the article, if you are trying to keep
| foreign processes "off my cores" then you can't neglect to
| keep them off the adjacent hyperthreads, because those
| share some of the resources. If you have 8 threads on 4
| cores then, at least the way Linux numbers them, CPUs 0 and
| 4 share some caches and all backend execution resources. So
| if you have isolated CPU 0 but not CPU 4 you might as well
| not have done anything at all.
| thanatos519 wrote:
| This makes sense in general, because the caches are the
| most precious resource.
|
| However, in my case the working set is small enough and
| the processes are top-priority so they probably stay in
| the L2 if not the L1. Also ... I want to keep using my
| desktop so I don't mind the intrusion of my interactive
| processes.
|
| Hmm. Is there a way to check how much L1/L2/L3 a process
| is occupying?
| wmf wrote:
| pqos?
| jeffbee wrote:
| Even RDT isn't going to give you insight into L1
| occupancy.
| jeffbee wrote:
| No, but it is possible on certain top-end Intel SKUs to
| partition the last-level caches such that they are
| effectively reserved to certain processes.
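|
| (That's Intel CAT, part of the RDT mentioned above; on Linux
| it's driven through the resctrl filesystem. A rough sketch,
| with a made-up group name, way mask, and PID:)
|
|   import os
|
|   # assumes resctrl is mounted:
|   #   mount -t resctrl resctrl /sys/fs/resctrl
|   group = "/sys/fs/resctrl/db_group"
|   os.makedirs(group, exist_ok=True)
|
|   # give this group 8 ways of L3 cache id 0 (on a multi-socket
|   # box you would list every cache id, e.g. "L3:0=ff0;1=ff0")
|   with open(os.path.join(group, "schemata"), "w") as f:
|       f.write("L3:0=ff0\n")
|
|   # processes moved into the group only fill that LLC slice
|   with open(os.path.join(group, "tasks"), "w") as f:
|       f.write("12345\n")   # placeholder PID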
| nuclx wrote:
| Does anyone know how the methods mentioned by the author map to
| 'taskset'?
| StillBored wrote:
| Or numactl; the latter is where this really starts to make a
| lot of sense. The perf improvements from keeping individual
| threads/processes pinned to a small core group (say, one
| sharing an L2 cache on Arm machines) tend to be fairly trivial
| in comparison to what happens when something gets migrated to
| a different NUMA node with a large latency to the
| memory/resident cache data.
| sm_ts wrote:
| I've maintained a QEMU fork with pinning support, and even
| coauthored a research paper on the Linux pinning performance
| topic, and the results have been... underwhelming; "sadly" the
| Linux kernel does a pretty good job at scheduling :)
|
| I advise pinning users to carefully measure the supposed
| performance improvement, as there is a tangible risk of spending
| time on imaginary gains.
| Agingcoder wrote:
| Agreed, in my case it became very useful on large boxes (96
| physical cores). The performance gain was about 10%.
| mochomocha wrote:
| In a setup with a high level of container colocation on large
| EC2 instances, we've seen the opposite behavior at Netflix:
| default CFS performing badly. We've A/B tested our flavor of
| custom pinning and measured substantial benefits:
| https://netflixtechblog.com/predictive-cpu-isolation-of-cont...
|
| PMC data at scale is pretty clear: very often, CFS won't do the
| right thing and will leave bad HT neighbors on the same core,
| leading to L1 thrashing, or keep a high level of imbalance
| between NUMA sockets, leading to a degraded LLC hit rate.
| guilhas wrote:
| I have looked around a bit; it's complicated to get right, and
| most people doing it for gaming report very little performance
| gain.
| mochomocha wrote:
| YMMV. We've seen millions of dollars' worth of cloud savings
| at Netflix doing pinning right. Knowing that the task
| scheduler is also heavily forked in Google's kernel, I'm
| ready to bet they've seen an order of magnitude higher
| savings in their own DCs as well.
| waynesonfire wrote:
| Not sure how maintaining QEMU makes you a credible source for
| evaluating a scheduler's performance. It's apparent to me that
| the performance of the scheduler is a function of the
| workload, so YMMV.
|
| I worked on a project where we collected detailed production
| runtime characteristics and evaluated scheduler algorithms
| against it. Tiny improvements made for massive savings.
| darnir wrote:
| Would you mind sharing the paper on pinning? I'd be interested.
| wmf wrote:
| At my last job we initially saw performance loss due to
| pinning; I think multiple QEMU I/O threads got pinned to a
| single CPU. It's very easy to do it wrong.
___________________________________________________________________
(page generated 2021-11-29 23:02 UTC)