[HN Gopher] CPU Pinning and CPU Sets (2020)
       ___________________________________________________________________
        
       CPU Pinning and CPU Sets (2020)
        
       Author : arnold_palmur
       Score  : 50 points
       Date   : 2021-11-29 13:45 UTC (9 hours ago)
        
 (HTM) web link (www.netmeister.org)
 (TXT) w3m dump (www.netmeister.org)
        
       | bogomipz wrote:
       | This class looks great. I noticed the course page states:
       | 
       | >"This class overlaps significantly with CS392 ``Systems
       | Programming'' -- if you have taken this class, please talk to me
       | in person before trying to register for CS631."[1]
       | 
        | Does anyone know if the videos for CS392 might also be online? I
        | tried some basic URL substitutions, but came up empty.
       | 
       | [1] https://stevens.netmeister.org/631/
        
       | Sohcahtoa82 wrote:
        | Tangential, but does anyone know of a Windows utility for
       | automatically pinning processes?
       | 
       | I like to keep up with several cryptocurrency prices on Coinbase,
       | but the Coinbase Pro pages consume a pretty significant amount of
       | CPU time. I'd love to be able to just shove all of those
       | processes to a single CPU thread to reduce the impact on overall
       | system performance.
       | 
        | I suppose it wouldn't be too hard to write a Python script that
        | does this automatically... scan window titles looking for
        | "Coinbase Pro", find the owning PID, then call SetAffinity...
        
         | ayende wrote:
         | The windows task manager has the ability to set process
         | affinity
        
           | Sohcahtoa82 wrote:
           | Well, yeah, but I'm looking for a way to automate it. If I
           | restart Firefox, all those affinities get reset.
        
       | lclarkmichalek wrote:
        | Much cheaper than CPU cgroups if you want some coarse-grained
        | isolation when stacking workloads.
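        | 
        | For the simple cases it really is just one syscall. A minimal
        | sketch (Linux-only; the PID and CPU numbers are made up):
        | 
        |   import os
        | 
        |   # Confine PID 1234 to CPUs 0-3. The cgroup equivalent would
        |   # be creating a cpuset cgroup, writing "0-3" to cpuset.cpus
        |   # and moving the PID into it; more machinery for the same
        |   # coarse result.
        |   os.sched_setaffinity(1234, {0, 1, 2, 3})
        |   print(os.sched_getaffinity(1234))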
        
       | 1_player wrote:
        | CPU pinning is pretty useful for virtual machines, e.g. I've used
        | it myself to improve performance on a VFIO setup by limiting
        | which cores qemu runs on and thus improving cache locality.
       | 
       | https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF#CP...
       | 
       | What are other real-world uses of CPU pinning?
        
         | spacechild1 wrote:
          | The Supernova audio server
          | (https://github.com/supercollider/supercollider/tree/develop/...)
          | pins each thread of its DSP thread pool to a dedicated core.
        
         | gpderetta wrote:
         | When implementing one-thread-per-core software architectures,
         | explicit pinning is pretty much a requirement.
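          | 
          | Roughly what that looks like, as a toy Python sketch
          | (Linux-only, and ignoring the GIL; it only exists to show
          | the per-thread affinity call, the worker body is made up):
          | 
          |   import os
          |   import threading
          | 
          |   def worker(core):
          |       # pid 0 means "the calling thread" for this syscall,
          |       # so each worker pins only itself to its own core.
          |       os.sched_setaffinity(0, {core})
          |       while True:
          |           pass  # per-core event loop / run queue goes here
          | 
          |   for core in sorted(os.sched_getaffinity(0)):
          |       threading.Thread(target=worker, args=(core,),
          |                        daemon=True).start()
          | 
          |   threading.Event().wait()  # park the main thread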
        
         | mugsie wrote:
         | Memory and PCIE lanes in larger systems can be attached to
         | particular CPUs, or to sub sections of a single CPU (i.e. AMD
         | Threadrippers / Eypcs in particular) where traversing the the
         | inter CPU / CCX links can cause latency or bandwidth issues.
         | 
         | The software will be pinned to CPU cores close to the RAM or
         | PCIE device they are using.
         | 
         | Only really seen it be an issue in crazy large scale systems,
         | or where you have 4 CPUs, but I haven't spent a huge amount of
         | time on microsecond critical workloads.
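          | 
          | For the PCIe side the kernel exposes enough in sysfs to do
          | this by hand. A Linux-only sketch (the device address is
          | made up, and numa_node reads -1 on non-NUMA boxes):
          | 
          |   import os
          | 
          |   def parse_cpulist(text):
          |       # "0-7,64-71" -> {0, ..., 7, 64, ..., 71}
          |       cpus = set()
          |       for part in text.strip().split(","):
          |           lo, _, hi = part.partition("-")
          |           cpus.update(range(int(lo), int(hi or lo) + 1))
          |       return cpus
          | 
          |   # Which NUMA node is this NIC/GPU/NVMe attached to?
          |   dev = "/sys/bus/pci/devices/0000:41:00.0"
          |   node = open(dev + "/numa_node").read().strip()
          | 
          |   # Pin ourselves to that node's CPUs.
          |   path = "/sys/devices/system/node/node%s/cpulist" % node
          |   os.sched_setaffinity(0, parse_cpulist(open(path).read()))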
        
           | amarshall wrote:
           | Isn't this particular issue partially solved with proper NUMA
           | support in whatever kernel or scheduler is being used?
        
         | jandrewrogers wrote:
         | Databases and other high-throughput data infrastructure
         | software use CPU pinning, also HPC. The reasons are similar:
         | higher cache locality, reduced latency, and more predictable
         | scheduling. It is most useful when the process is taking over
         | part or all of the resources of the machine anyway.
        
       | foton1981 wrote:
        | Kubernetes makes CPU pinning rather simple. You just need to meet
        | the conditions for the Guaranteed QoS class.
       | https://kubernetes.io/docs/tasks/administer-cluster/cpu-mana...
       | 
       | We are running lots of Erlang on k8s and CPU pinning improves
       | performance of Erlang schedulers tremendously.
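        | 
        | For anyone trying this: the pod has to land in the Guaranteed
        | QoS class (requests == limits, integer CPU count) and the
        | kubelet needs the static CPU manager policy. A quick sanity
        | check from inside the container, as a sketch (not from the
        | linked docs):
        | 
        |   import os
        | 
        |   # With the static policy and Guaranteed QoS the affinity
        |   # mask should be exactly the N exclusive CPUs requested;
        |   # with the default "none" policy it's the whole shared pool.
        |   cpus = os.sched_getaffinity(0)
        |   print(len(cpus), sorted(cpus))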
        
         | bogomipz wrote:
         | Interesting. I would be curious to hear why pinning here
         | improves performance. Is this something specific to the BEAM
          | VM? Does this come at a hit to K8s scheduler flexibility?
        
           | toast0 wrote:
           | I don't have experience with k8s, but with BEAM on a
           | traditional system, if BEAM is using the bulk of your CPU,
           | you'll tend to get better results if each of the (main) BEAM
           | scheduler threads is pinned to one CPU thread. Then all of
           | the BEAM scheduler balancing can work properly. If both the
           | OS and BEAM are trying to balance things, you can end up with
           | a lot of extra task movement or extra lock contention when a
           | BEAM thread gets descheduled by the OS to run a different
           | BEAM thread that wants the same lock.
           | 
           | On most of the systems I ran, we didn't tend to have much of
            | anything running on BEAM's dirty schedulers or in other OS
           | processes. If you have more of a mix of things, leaving
           | things unpinned may work better.
        
         | versale wrote:
          | Is your setup open source? I'd love to know more about the
          | upsides of Erlang/OTP on top of k8s. Do you use hot code
          | reloads?
        
       | inetknght wrote:
       | CPU pinning can be particularly important if you're running
        | virtual machines and/or hyperthreading-friendly workloads.
        
         | jeffbee wrote:
         | Glad you mentioned hyperthreading. That can be easy to
         | overlook. You reserved CPU 1 for a given workload? Did you
         | remember CPU 49 as well?
        
           | krona wrote:
           | True, however, CPU pinning is not the same as
           | reserving/isolating the CPU. This is often not made clear in
           | articles about CPU pinning.
        
           | thanatos519 wrote:
           | The main point of HT is to reduce the cost of context
           | switching by keeping twice the number of contexts close to
           | the core. I would guess that parts of the process context
            | like program counter, TLB, etc. live inside the 'HT' and would
           | have to be saved/restored every time the process moves
           | between threads, even on the same core. Reserving both 'HT'
           | on a core gets you cache locality, but isn't there a cost to
           | moving the process back and forth, even if that data is in
           | L1/L2?
           | 
           | (I'm looking at 'lstopo' from package 'hwloc', Linux on my
           | Haswell Xeon: 10MB shared L3, 256KB L2, 32KB L1{d,i} per
           | core)
           | 
            | Given my (educated) guess, I've told irqbalance to put
            | interrupts only on 'thread 0', and then I schedule CPU-
            | intensive tasks on 'thread 1', very-not-nicely. Linux seems
            | pretty good about keeping everything else
           | on 'thread 0' when I have 'thread 1' busy so I don't do any
           | further management.
           | 
            | I can have 'thread 1' on 4 cores pegged at 100% with no
            | impact on interactive or I/O performance.
        
             | jeffbee wrote:
             | In the context of the article, if you are trying to keep
             | foreign processes "off my cores" then you can't neglect to
             | keep them off the adjacent hyperthreads, because those
              | share some of the resources. If you have 8 threads on 4
              | cores then, at least the way Linux counts them, cores 0
              | and 4 share some caches and all back-end execution
              | resources. So if you have isolated core 0 but not core 4,
              | you might as well not have done anything at all.
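              | 
              | The sibling map is in sysfs, so a pinning/isolation
              | script can expand "core 0" to "core 0 plus whatever
              | shares it" first. A small Linux-only sketch:
              | 
              |   def siblings(cpu):
              |       path = ("/sys/devices/system/cpu/cpu%d/"
              |               "topology/thread_siblings_list" % cpu)
              |       out = set()
              |       text = open(path).read().strip()
              |       for part in text.split(","):
              |           lo, _, hi = part.partition("-")
              |           out.update(range(int(lo), int(hi or lo) + 1))
              |       return out
              | 
              |   # e.g. {0, 4} on the 4-core/8-thread layout above
              |   print(siblings(0))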
        
               | thanatos519 wrote:
               | This makes sense in general, because the caches are the
               | most precious resource.
               | 
               | However, in my case the working set is small enough and
               | the processes are top-priority so they probably stay in
               | the L2 if not the L1. Also ... I want to keep using my
               | desktop so I don't mind the intrusion of my interactive
               | processes.
               | 
               | Hmm. Is there a way to check how much L1/L2/L3 a process
               | is occupying?
        
               | wmf wrote:
               | pqos?
        
               | jeffbee wrote:
               | Even RDT isn't going to give you insight into L1
               | occupancy.
        
               | jeffbee wrote:
               | No, but it is possible on certain top-end Intel SKUs to
               | partition the last-level caches such that they are
                | effectively reserved for certain processes.
        
       | nuclx wrote:
       | Does anyone know how the methods mentioned by the author map to
       | 'taskset'?
        
         | StillBored wrote:
         | Or numactl, the latter is where this really starts to make a
         | lot of sense. The perf improvements of keeping individual
         | threads/processes pinned to a small core group (say sharing a
         | L2 cache on Arm machines) tend to be fairly trivial in
         | comparison to what happens when something gets migrated to a
         | different numa node with a large latency to the memory/resident
         | cache data.
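          | 
          | Both tools mostly boil down to sched_setaffinity(2), which
          | Python happens to expose, so the mapping is roughly this
          | (Linux-only sketch; the PID and CPU numbers are invented):
          | 
          |   import os
          | 
          |   # taskset -cp 0-3 1234
          |   os.sched_setaffinity(1234, {0, 1, 2, 3})
          | 
          |   # taskset -p 1234   (query the current mask)
          |   print(os.sched_getaffinity(1234))
          | 
          |   # numactl --cpunodebind=0 is the same idea with the CPU
          |   # list taken from /sys/devices/system/node/node0/cpulist.
          |   # Its --membind half (set_mempolicy/mbind) has no stdlib
          |   # equivalent, and that's exactly the part that matters
          |   # once memory locality is the issue.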
        
       | sm_ts wrote:
       | I've maintained a QEMU fork with pinning support, and even
       | coauthored a research paper on the Linux pinning performance
       | topic, and the results have been... underwhelming; "sadly" the
       | Linux kernel does a pretty good job at scheduling :)
       | 
       | I advise pinning users to carefully measure the supposed
       | performance improvement, as there is a tangible risk of spending
       | time on imaginary gains.
        
         | Agingcoder wrote:
         | Agreed, in my case it became very useful on large boxes (96
         | physical cores). The performance gain was about 10%.
        
         | mochomocha wrote:
          | In a setup with a high level of container collocation on large
          | EC2 instances, we've seen the opposite behavior at Netflix:
          | the default CFS performing badly. We've A/B tested our flavor
          | of custom pinning and measured substantial benefits:
         | https://netflixtechblog.com/predictive-cpu-isolation-of-cont...
         | 
         | PMC data at scale is pretty clear: very often, CFS won't do the
         | right thing and will leave bad HT neighbors on the same core,
          | leading to L1 thrashing, or keep a high level of imbalance
          | between NUMA sockets, leading to a degraded LLC hit rate.
        
         | guilhas wrote:
          | I have looked around a bit: it's complicated to get right, and
          | most people doing it for gaming report very little performance
          | gain.
        
           | mochomocha wrote:
           | YMMV. We've seen M$ worth of cloud savings at Netflix doing
           | pinning right. Knowing that the task scheduler is also
            | heavily forked in Google's kernel, I'm ready to bet they've
            | seen an order of magnitude higher savings in their own DCs
            | as well.
        
         | waynesonfire wrote:
          | Not sure how maintaining QEMU makes you a credible source for
          | evaluating a scheduler's performance. It's apparent to me that
         | the performance of the scheduler is a function of the workload,
         | so YMMV.
         | 
         | I worked on a project where we collected detailed production
         | runtime characteristics and evaluated scheduler algorithms
         | against it. Tiny improvements made for massive savings.
        
         | darnir wrote:
         | Would you mind sharing the paper on pinning? I'd be interested
        
         | wmf wrote:
         | At my last job we initially saw performance loss due to
         | pinning; I think multiple QEMU I/O threads got pinned to a
         | single CPU. It's very easy to do it wrong.
        
       ___________________________________________________________________
       (page generated 2021-11-29 23:02 UTC)