[HN Gopher] Measuring CPU core-to-core latency
___________________________________________________________________
Measuring CPU core-to-core latency
Author : nviennot
Score : 130 points
Date : 2022-09-18 17:15 UTC (5 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| jtorsella wrote:
| If anyone is interested, here are the results on my M1 Pro
| running Asahi Linux:
|
| Min: 48.3 Max: 175.0 Mean: 133.0
|
| I'll try to copy the exact results once I have a browser on
| Asahi, but the general pattern is most pairs have >150ns and a
| few (0-1; 2-3,4,5; 3-4,5; 4-5; 6-7,8,9; 7-8,9; 8-9) are faster at
| about 50ns.
| jesse__ wrote:
| This is absolutely the coolest thing I've seen in a while.
| fideloper wrote:
| Because I'm ignorant: What are the practical takeaways from
| this?
|
| When is a cpu core sending a message to another core?
| SkipperCat wrote:
| In HFT, we typically pin processes to run on a single isolated
| core (on a multicore machine). That allows the process to avoid
| a lot of kernel and other interrupts which could cause the
| process to not operate in a low latency manner.
|
| If we have two of these processes, each on separate cores, and
| they occasionally need to talk to each other, then knowing the
| best choice of process/core location can keep the system
| operating in the lowest latency setup.
|
| So, an app like this could be very helpful for determining
| where to place pinned processes onto specific cores.
|
| There are also some common rules of thumb, such as: don't put
| pinned processes that need to communicate on cores separated
| by the QPI link, since that just adds latency; and if you're
| communicating with a NIC, find out which socket has the
| shortest path on the PCI bus to that NIC; and other fun stuff.
| I never even thought about NUMA until I started to work with
| folks in HFT. It really makes you dig into the internals of
| the hardware to squeeze the most out of it.
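|
| For illustration, a minimal sketch of pinning the current
| thread to a chosen core (not from the original comment; it
| assumes the `core_affinity` crate, and in practice you'd pick
| the core pair using a latency map like the one in the linked
| repo):
|
|     fn main() {
|         // Enumerate the logical cores the OS exposes to us.
|         let cores = core_affinity::get_core_ids()
|             .expect("could not list cores");
|
|         // Pin this thread to, say, core 2. A paired process
|         // would pin to a nearby core (same socket / cache
|         // cluster) so their communication stays on the fast
|         // path.
|         let target = cores[2];
|         assert!(core_affinity::set_for_current(target));
|
|         // ...latency-critical work runs here, free of
|         // cross-core migrations...
|     }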
| suprjami wrote:
| I'm surprised how much crossing NUMA nodes can affect
| performance. We've seen NICs halve their throughput with
| (intentionally) wrong setups.
|
| I think of NUMA nodes as multiple computers which just happen
| to share a common operating system.
| ls65536 wrote:
| In general this makes sense, but I think you need to be
| careful in some cases where the lowest latency between two
| logical "cores" is likely to be between those which are SMT
| siblings on the same physical core (assuming you have an SMT-
| enabled system). These logical "cores" will be sharing much
| of the same physical core's resources (such as the low-
| latency L1/L2 and micro-op caches), so depending on the
| particular workload, pinning two threads to these two logical
| "cores" could very well result in worse performance overall.
| slabity wrote:
| SMT is usually disabled in these situations to prevent it
| from being a concern.
| nextaccountic wrote:
| Doesn't this leave some performance on the table? Each
| core has more ports than a single thread could reasonably
| use, exactly because two threads can run on a single core
| slabity wrote:
| In terms of _throughput_, technically yes, you are
| leaving performance on the table. However, in HFT the
| throughput is greatly limited by IO anyways, so you don't
| get much benefit with it enabled.
|
| What you want is to minimize _latency_, which means you
| don't want to be waiting for _anything_ before you start
| processing whatever information you need. To do this, you
| need to ensure that the correct things are cached where
| they need to be, and SMT means that you have multiple
| threads fighting each other for that precious cache
| space.
|
| In non-FPGA systems I've worked with, I've seen dozens of
| microseconds of latency added with SMT enabled vs
| disabled.
| bitcharmer wrote:
| No one in the HFT space runs with SMT enabled.
| eternalban wrote:
| Hm. Use Ice Lake, with an aggregator process sitting on core
| 11, and have all the others run purely on their own input and
| then report to core 11. (Core 11 in that heatmap appears to be
| the only CPU with a sweetheart core, i.e. low latency to all
| other cores.) I wonder how hard it would be to write a
| rewriter to map an executable onto the CPU's architectural
| characteristics. Something
| like graph transformations to create clusters (of memory
| addresses) that are then mapped to a core.
| electricshampo1 wrote:
| Answering only the latter question:
|
| A Primer on Memory Consistency and Cache Coherence, Second
| Edition
|
| https://www.morganclaypool.com/doi/10.2200/S00962ED2V01Y2019...
|
| (free online book) would help
| crazytalk wrote:
| It's mentioned in the readme - this is measuring the latency of
| cache coherence. Depending on architecture, some sets of cores
| will be organized with shared L2/L3 cache. In order to acquire
| exclusive access to a cache line (memory range of 64-128ish
| bytes), caches belonging to other sets of cores need to be
| waited on to release their own exclusive access, or to be
| informed they need to invalidate their caches. This is
| observable as a small number of cycles additional memory access
| latency that is heavily dependent on hardware cache design,
| which is what is being measured
|
| Cross-cache communication may simply happen by reading or
| writing to memory touched by another thread that most recently
| ran on another core
|
| Check out https://en.wikipedia.org/wiki/MOESI_protocol for
| starters, although I think modern CPUs implement protocols more
| advanced than this (I think MOESI is decades old at this point)
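|
| This cache-line granularity is also why "false sharing" hurts:
| two unrelated hot variables that land on the same line force
| their writers to keep stealing exclusive ownership from each
| other. A minimal sketch of padding two counters onto separate
| lines (assuming a 64-byte line, which is common but not
| universal):
|
|     use std::sync::atomic::{AtomicU64, Ordering};
|     use std::thread;
|
|     // The alignment forces each counter onto its own 64-byte
|     // cache line, so the two writers never contend for the
|     // same line's exclusive state.
|     #[repr(align(64))]
|     struct Padded(AtomicU64);
|
|     static A: Padded = Padded(AtomicU64::new(0));
|     static B: Padded = Padded(AtomicU64::new(0));
|
|     fn main() {
|         let t = thread::spawn(|| {
|             for _ in 0..1_000_000 {
|                 A.0.fetch_add(1, Ordering::Relaxed);
|             }
|         });
|         for _ in 0..1_000_000 {
|             B.0.fetch_add(1, Ordering::Relaxed);
|         }
|         t.join().unwrap();
|     }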
| haroldrijn wrote:
| AMD processors also use a hierarchical coherence directory,
| where the global coherence directory on the IO die enforces
| coherence across chiplets and a local coherence directory on
| each chiplet enforces coherence on-die http://www.cs.cmu.edu/
| afs/cs/academic/class/15740-f03/www/le...
| aseipp wrote:
| The example code uses an atomic store instruction in order to
| write values from threads to a memory location, and then an
| atomic load to read them back. The system guarantees that a
| read of a previously written location observes that write,
| i.e. "you always read the thing you just wrote" (on x86, the
| memory-ordering model that provides this guarantee is called
| "Total Store Ordering"). Reads and
| writes to a memory location are translated to messages on a
| memory bus, and that is connected to a memory controller, which
| the CPUs use to talk to the memory they have available. The
| memory controller is responsible for ensuring every CPU sees a
| consistent view of memory according to the respective platform
| memory ordering rules, and with respect to the incoming
| read/write requests from various CPUs. (There are also caches
| between the DRAM and CPU here but they are just another layer
| in the hierarchy and aren't so material to the high-level view,
| because you can keep adding layers, and indeed some systems
| even have L1, L2, L3, and L4 caches!)
|
| A CPU will normally translate atomic instructions like "store
| this 32-bit value to this address" into special messages on the
| memory bus. Atomic operations, it turns out, are already normally
| implemented in the message protocol between cores and memory
| fabric, so you just translate the atomic instructions into
| atomic messages "for free" and let the controller sort it out.
| But the rules for how these messages flow across the memory
| bus are complicated, because the topology of modern CPUs is
| complicated. They are partitioned into NUMA domains, have
| various caches that may or may not be shared between 1-, 2-,
| or 4-way clusters, et cetera, and they (along with all the
| caches and interconnects between them) must still obey the
| memory consistency rules defined by the platform. As a
| result, there isn't
| necessarily a uniform measurement of time for any particular
| write to location X from a core to be visible to another core
| when it reads X; you have to measure it to see how the system
| responds, which might include expensive operations like
| flushing the cache. It turns out that two cores which are
| physically far apart simply take more time to see each other's
| messages, since the bus path is likely longer -- so the latency
| before a write from one core becomes consistently visible to
| another core is higher.
|
| So when you're designing high performance algorithms and
| systems, you want to keep the CPU topology and memory hierarchy
| in mind. That's the most important takeaway. From that
| standpoint, these heatmaps are simply useful ways of
| characterizing the baseline performance of _some_ basic
| operations between CPUs, so you might get an idea of how
| topology affects memory latency.
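|
| As a rough illustration of what such a measurement looks like,
| here is a simplified sketch (not the linked repo's actual code;
| no core pinning, no per-pair statistics, just a shared-variable
| ping-pong timed with std::time::Instant):
|
|     use std::sync::atomic::{AtomicU64, Ordering};
|     use std::thread;
|     use std::time::Instant;
|
|     // The second thread echoes every value the first writes,
|     // so one round trip costs roughly two core-to-core
|     // cache-line transfers.
|     static PING: AtomicU64 = AtomicU64::new(0);
|     static PONG: AtomicU64 = AtomicU64::new(0);
|
|     fn main() {
|         const ROUNDS: u64 = 100_000;
|
|         let echo = thread::spawn(|| {
|             for i in 1..=ROUNDS {
|                 while PING.load(Ordering::Acquire) != i {} // wait
|                 PONG.store(i, Ordering::Release);          // echo
|             }
|         });
|
|         let start = Instant::now();
|         for i in 1..=ROUNDS {
|             PING.store(i, Ordering::Release);
|             while PONG.load(Ordering::Acquire) != i {}     // wait
|         }
|         let elapsed = start.elapsed();
|         echo.join().unwrap();
|
|         println!("~{} ns per one-way hop",
|                  elapsed.as_nanos() as u64 / (ROUNDS * 2));
|     }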
| apaolillo wrote:
| We published a paper where we captured the same kind of insights
| (deep numa hierarchies including cache levels, numa nodes,
| packages) and used them to tailor spinlocks to the underlying
| machine: https://dl.acm.org/doi/10.1145/3477132.3483557
| zeristor wrote:
| I realise these were run on AWS instances, but could this be run
| locally on Apple Silicon?
|
| Erm, I guess I should try.
| wyldfire wrote:
| This is a cool project.
|
| It looks kinda like the color scales are normalized to just-this-
| CPU's latency? It would be neater if the scale represented the
| same values among CPUs. Or rather, it would be neat if there were
| an additional view for this data that could make it easier to
| compare among them.
|
| I think the differences are really interesting to consider. What
| if the scheduler could consider these designs when weighing how
| to schedule each task? Either statically or somehow empirically?
| I think I've seen sysfs info that describes the cache
| hierarchies, so maybe some of this info is available already.
| That nest [1] scheduler was recently shared on HN, I suppose it
| may be taking advantage of some of these properties.
|
| [1] https://dl.acm.org/doi/abs/10.1145/3492321.3519585
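|
| For reference, that sysfs info can be read directly; a small
| sketch (Linux-specific paths) that prints which CPUs share
| each of cpu0's caches:
|
|     use std::fs;
|
|     fn main() {
|         // Each index* directory describes one cache reachable
|         // from cpu0: its level, type, and the CPUs sharing it.
|         for idx in 0.. {
|             let base = format!(
|                 "/sys/devices/system/cpu/cpu0/cache/index{idx}");
|             let level = match fs::read_to_string(format!("{base}/level")) {
|                 Ok(l) => l,
|                 Err(_) => break, // no more cache levels
|             };
|             let kind = fs::read_to_string(format!("{base}/type"))
|                 .unwrap_or_default();
|             let shared = fs::read_to_string(
|                 format!("{base}/shared_cpu_list")).unwrap_or_default();
|             println!("L{} {} shared with CPUs {}",
|                      level.trim(), kind.trim(), shared.trim());
|         }
|     }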
| dan-robertson wrote:
| It would be interesting to have a more detailed understanding
| of why the latencies are what they are, e.g. this repo shows
| 'clusters', but there is surely some architectural reason for
| these clusters. Is
| it just physical distance on the chip or is there some other
| design constraint?
|
| I find it pretty interesting where the interface that cpu makers
| present (eg a bunch of equal cores) breaks down.
| xani_ wrote:
| Just look at the processor architecture diagram.
|
| But TL;DR: modern big processors are not one big piece of
| silicon but basically "SMP in a box": a bunch of smaller
| chiplets interconnected with each other. That helps with yield
| (a "bad" chiplet costs you just 8 cores, not the whole
| 16/24/48/64-core chip). Those chiplets also usually come with
| their own memory controllers.
|
| And so you basically have NUMA on a single processor, with all
| of the optimization challenges that come with it.
| bitcharmer wrote:
| Most of this cross-core overhead diversity is gone on Skylake
| and newer chips, because Intel moved from a ring topology to a
| mesh design for their L3 caches.
| ip26 wrote:
| Some of it is simple distance. Some of it is architectural
| choices _because_ of the distance. A sharing domain that spans
| a large distance performs poorly because of the latency.
| Therefore domains are kept modest, but the consequence is
| crossing domains has an extra penalty.
| rigtorp wrote:
| I have something similar but in C++:
| https://github.com/rigtorp/c2clat
| virgulino wrote:
| I went to your homepage. Your gif "Programming in C++" made me
| really laugh, thanks for that! 8-)
|
| https://rigtorp.se/
| sgtnoodle wrote:
| I've been doing some latency measurements like this, but between
| two processes using unix domain sockets. I'm measuring more on
| the order of 50µs on average, when using FIFO RT scheduling. I
| suspect the kernel is either letting processes linger for a
| little bit, or perhaps the "idle" threads tend to call into the
| kernel and let it do some non-preemptable book keeping.
|
| If I crank up the amount of traffic going through the sockets,
| the average latency drops, presumably due to the processes being
| able to batch together multiple packets rather than having to
| block on each one.
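|
| A minimal version of that kind of measurement, sketched with a
| socketpair between two threads rather than two pinned RT
| processes (so the absolute numbers will differ, but the method
| is the same):
|
|     use std::io::{Read, Write};
|     use std::os::unix::net::UnixStream;
|     use std::thread;
|     use std::time::Instant;
|
|     fn main() -> std::io::Result<()> {
|         const ROUNDS: u32 = 100_000;
|         let (mut a, mut b) = UnixStream::pair()?;
|
|         // Echo thread: bounce every byte straight back.
|         let echo = thread::spawn(move || {
|             let mut buf = [0u8; 1];
|             for _ in 0..ROUNDS {
|                 b.read_exact(&mut buf).unwrap();
|                 b.write_all(&buf).unwrap();
|             }
|         });
|
|         let mut buf = [0u8; 1];
|         let start = Instant::now();
|         for _ in 0..ROUNDS {
|             a.write_all(&[42])?;
|             a.read_exact(&mut buf)?;
|         }
|         println!("~{} ns per round trip",
|                  start.elapsed().as_nanos() as u64 / ROUNDS as u64);
|         echo.join().unwrap();
|         Ok(())
|     }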
| Const-me wrote:
| I only use AF_UNIX sockets when I need to pass open file
| handles between processes. I generally prefer message queues:
| https://linux.die.net/man/7/mq_overview
|
| I haven't measured myself, but other people did, and they found
| the latency of message queues is substantially lower:
| https://github.com/goldsborough/ipc-bench
| [deleted]
| jeffbee wrote:
| Fails to build from source with Rust 1.59 so I tried the C++
| `c2clat` from elsewhere in the thread. Quite interesting on Alder
| Lake, because the quartet of Atom cores has uniform latency (they
| share an L2 cache and other resources) while the core-to-core
| latency of the Core side of the CPU varies. Note the logical
| numbering: 0,1 are SMT threads of the first core, and so on
| through 14,15; 16-19 are Atom cores with one thread each.
|
|     CPU    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
|       0    0  12  60  44  60  44  60  43  50  47  56  48  58  49  60  50  79  79  78  79
|       1   12   0  45  45  44  44  60  43  51  49  55  47  57  49  56  51  76  76  76  76
|       2   60  45   0  13  42  43  53  43  48  37  52  41  53  42  53  42  72  72  72  72
|       3   44  45  13   0  42  43  53  42  47  37  51  40  53  41  53  42  72  72  72  72
|       4   60  44  42  42   0  13  56  43  49  52  54  41  56  42  42  41  75  75  74  75
|       5   44  44  43  43  13   0  56  43  51  54  55  41  56  42  56  42  77  77  77  77
|       6   60  60  53  53  56  56   0  13  49  54  56  41  57  42  57  42  78  78  78  78
|       7   43  43  43  42  43  43  13   0  46  47  54  41  41  41  55  41  72  71  71  71
|       8   50  51  48  47  49  51  49  46   0  12  51  51  54  56  55  56  75  75  75  75
|       9   47  49  37  37  52  54  54  47  12   0  49  53  54  56  55  54  74  69  67  68
|      10   56  55  52  51  54  55  56  54  51  49   0  13  53  58  56  59  75  75  76  75
|      11   48  47  41  40  41  41  41  41  51  53  13   0  51  52  55  59  75  75  75  75
|      12   58  57  53  53  56  56  57  41  54  54  53  51   0  13  55  60  77  77  77  77
|      13   49  49  42  41  42  42  42  41  56  56  58  52  13   0  55  54  77  77  77  77
|      14   60  56  53  53  42  56  57  55  55  55  56  55  55  55   0  12  74  70  78  78
|      15   50  51  42  42  41  42  42  41  56  54  59  59  60  54  12   0  75  74  74  77
|      16   79  76  72  72  75  77  78  72  75  74  75  75  77  77  74  75   0  55  55  55
|      17   79  76  72  72  75  77  78  71  75  69  75  75  77  77  70  74  55   0  55  55
|      18   78  76  72  72  74  77  78  71  75  67  76  75  77  77  78  74  55  55   0  55
|      19   79  76  72  72  75  77  78  71  75  68  75  75  77  77  78  77  55  55  55   0
___________________________________________________________________
(page generated 2022-09-18 23:00 UTC)