[HN Gopher] NUMA Siloing in the FreeBSD Network Stack (2019) [pdf]
___________________________________________________________________
NUMA Siloing in the FreeBSD Network Stack (2019) [pdf]
Author : harporoeder
Score : 64 points
Date : 2021-10-07 14:26 UTC (8 hours ago)
(HTM) web link (2019.eurobsdcon.org)
(TXT) w3m dump (2019.eurobsdcon.org)
| throw0101a wrote:
| Video of the talk:
|
| * https://www.youtube.com/watch?v=_o-HcG8QxPc
|
| * https://2021.eurobsdcon.org/home/speakers/#serving
|
| Playlist of all the 2021 EuroBSDCon videos:
|
| * https://www.youtube.com/playlist?list=PLskKNopggjc4dadqaCDmc...
| drewg123 wrote:
| This is my EuroBSDcon talk from 2 years ago. The more recent one
| (about getting to 400G) is up now at
| https://www.youtube.com/watch?v=_o-HcG8QxPc
| infogulch wrote:
| Great talk, thanks for sharing!
|
| Have you looked into something like DirectStorage to allow the
| nic to request data directly over PCIe, thus cutting out memory
| bandwidth limitations entirely? This is what new generation
| consoles are using to load static scene data from NVMe drives
| to GPUs directly over PCIe, your workload also seems to match
| this model well.
|
| Edit to add: Nvidia branded this as GPUDirect Storage in 2019:
| https://developer.nvidia.com/blog/gpudirect-storage/
|
| Edit2: Oh neat this is mentioned in the connectx-7 doc linked
| in sibling comment.
| drewg123 wrote:
| The problem is that with current generation NICs and NVME
| drives, you need a buffer somewhere, and there is no place
| for the buffer. You need a buffer because the NVME drive
| speaks in 4K chunks, and doing anything sub-4K is
| non-optimal.
|
| So picture that TCP wants to send 2 segments (1448 * 2 ==
| 2896), possibly from a strange offset (say 7240, or 5
| segments in). You'd be asking the NVME drive to read two 4K
| blocks and pull 2896 bytes out of the middle. What you really
| want to do is read the first 12K and keep it buffered.
|
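| A minimal sketch of that arithmetic (a hypothetical helper,
| not the actual Netflix code), showing which 4K blocks the
| drive has to read to cover that 2-segment send:
|
|     /* Which 4K NVMe blocks cover a 2-segment TCP send that
|      * starts 5 segments into the file, and how many bytes
|      * must be read to get the 2896 useful ones. */
|     #include <stdio.h>
|     #include <stdint.h>
|
|     #define NVME_BLK 4096u
|     #define TCP_SEG  1448u
|
|     int main(void)
|     {
|         uint64_t off = 5 * TCP_SEG;                     /* 7240 */
|         uint64_t len = 2 * TCP_SEG;                     /* 2896 */
|
|         uint64_t first = off / NVME_BLK;                /* block 1 */
|         uint64_t last  = (off + len - 1) / NVME_BLK;    /* block 2 */
|         uint64_t bytes = (last - first + 1) * NVME_BLK; /* 8192 */
|
|         printf("read %llu bytes to deliver %llu useful bytes\n",
|                (unsigned long long)bytes, (unsigned long long)len);
|         return 0;
|     }
|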
| With the current generation NICs, there is no place to
| buffer. With non-gold-plated enterprise NVME, there is no
| place to buffer. NVME does have a thing called CMB, but the
| last time I searched, I could not find a drive with CMB that
| was large enough to be useful at a decent price.
|
| The alternative is to have a box of RAM sitting on the PCIe
| bus. That's pretty much a GPU, but GPUs are expensive, power-
| hungry, and unobtainable.
|
| So for now, it's just not practical to do this.
| dragontamer wrote:
| > Have you looked into something like DirectStorage to allow
| the nic to request data directly over PCIe, thus cutting out
| memory bandwidth limitations entirely? This is what new
| generation consoles are using to load static scene data from
| NVMe drives to GPUs directly over PCIe, your workload also
| seems to match this model well.
|
| Let's say an Ethernet port on NUMA#1 is asked for a movie
| found on an NVMe SSD on NUMA#3.
|
| There's simply no way for you to get the data unless you
| traverse the NUMA-fabric somehow. Either NUMA#1 tells NUMA#3
| "Yo, start sending me data so that I can reply to this
| TCP/TLS/HTTP message"... or NUMA#1 offloads the job to NUMA#3
| somehow (it's now NUMA#3's problem to figure out how to talk
| back).
|
| ----------
|
| There are many ways to implement the system, each apparently
| with its own pros and cons. There are further constraints: in
| an earlier post a few weeks ago, drewg123 explained that his
| system / bosses demanded that each server be allowed only 1
| IP address (meaning all 4 NICs had to cooperate using
| link-aggregation).
|
| You can effectively think of this as "each Netflix stream
| comes in on one of the 4 ports at random" (which may be on
| NUMA#1 or NUMA#2), while the data itself is scattered across
| NUMA#1, #2, #3, and #4.
|
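| A rough illustration of why the port looks random (a toy
| hash, not FreeBSD's actual lagg/LACP code): the egress port
| is picked by hashing the flow's addresses and ports, so each
| TCP stream sticks to one NIC, and that NIC may sit on either
| NUMA node.
|
|     #include <stdint.h>
|
|     /* Toy FNV-style mix over the TCP 4-tuple; real lagg uses a
|      * configurable L2/L3/L4 hash, but the effect is the same:
|      * one flow -> one port. */
|     static unsigned pick_lagg_port(uint32_t src_ip, uint32_t dst_ip,
|                                    uint16_t src_port, uint16_t dst_port,
|                                    unsigned nports)
|     {
|         uint32_t h = 2166136261u;
|         uint32_t v[3] = { src_ip, dst_ip,
|                           ((uint32_t)src_port << 16) | dst_port };
|         for (int i = 0; i < 3; i++) {
|             h ^= v[i];
|             h *= 16777619u;
|         }
|         return h % nports;  /* e.g. one of 4 ports, 2 per NUMA node */
|     }
|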
| ---------
|
| Note: NUMA#1 and NUMA#2 have the NICs, which means that they
| have fewer remaining PCIe lanes for storage (The ConnectX NIC
| uses multiple PCIe lanes). As a result, most of the movie's
| data will be on NUMA#3 or NUMA#4.
|
| -------
|
| Note: I expect the NUMA-fabric to be faster than the PCIe
| fabric. The NUMA-fabric is designed so that different chips
| can access each other's RAM at roughly 50+GBps of bandwidth
| at something like 300ns latencies. In contrast, PCIe 4.0 x16
| is only 30GBps or so... and a lot of these NVMe SSDs are only
| x4 (aka: 8GBps).
|
| You also need to handle all the networking tidbits: HTTPS has
| state (not just the TLS state, but also the state of the
| stream: whose turn it is to talk and such), which means the
| CPU / application needs to be in the loop somehow. I know
| that the most recent talk offloads the process as much as
| possible thanks to the "sendfile" interface on FreeBSD (also
| available on Linux), which allows you to "pipe" data from one
| file descriptor to another.
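|
| A minimal sketch of that interface (FreeBSD's sendfile(2);
| error handling trimmed): the kernel moves the file data from
| the buffer cache to the socket, so the payload never has to
| be copied through userspace.
|
|     #include <sys/types.h>
|     #include <sys/socket.h>
|     #include <sys/uio.h>
|
|     /* Send `len` bytes of `file_fd`, starting at `off`, down
|      * the TCP socket `sock_fd` without a userspace copy. */
|     ssize_t send_chunk(int file_fd, int sock_fd, off_t off, size_t len)
|     {
|         off_t sent = 0;
|
|         /* NULL hdtr, 0 flags: no header/trailer, default behavior */
|         if (sendfile(file_fd, sock_fd, off, len,
|                      NULL, &sent, 0) == -1)
|             return -1;
|         return (ssize_t)sent;
|     }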
| toast0 wrote:
| From drewg123's other threads, it seems like their machines
| are also hitting (or approaching) memory bandwidth limits,
| so being able to reduce some of the memory write/read
| requirements should help with that.
|
| I think what was working best was having the disk I/O write
| to RAM that was NUMA aligned with the NIC, so disk -> PCIe
| -> NUMA Fabric -> RAM -> PCIe -> NIC.
|
| If instead you could do disk -> PCIe -> NUMA Fabric -> PCIe
| -> NIC, at least for a portion of the disk reads, that
| would still be the same amount of traffic on the NUMA
| Fabric, but less traffic on the memory bus. This probably
| means that the NIC would be doing more high latency reads
| though, so you need more in-flight sends to keep throughput
| up.
| infogulch wrote:
| Both NICs and SSDs are connected to the CPU via PCIe: all
| SSD and NIC traffic already goes through PCIe before it hits
| main memory today.
|
| PCIe 4.0 x16 is 64GB/s (bidirectional), and there are _8_
| groups of x16 lanes in a 128-lane EPYC, for a total of
| 512GB/s or 4Tb/s, more than double the max fabric bandwidth
| of 200GB/s.
|
| Let's take the original diagram from the video:
|
|                  CPU
|                 |   |
|         Storage - Memory - NIC
|
| He was able to use NIC-kTLS offloading to simplify it to
| this:
|
|         Storage - Memory - NIC
|
| Now let's add a bit more detail to the second diagram,
| expanding it to include PCIe:
|
|                 Memory
|                 |    |
|         Storage - PCIe - NIC
|
| This third diagram is the same as the diagram above it,
| except it explicitly describes how data gets from storage
| to memory and from memory to the NIC.
|
| So the story for a request is something like:
|
|     1. request comes in for a file
|     loop {
|       2. CPU requests chunks of data
|       3. data is delivered to memory via PCIe and signals the
|          CPU to handle it
|       4. CPU tells the NIC to send data from memory
|       5. NIC requests data from memory via PCIe, sends it out
|          on the port
|     }
|
| If you squint this looks kinda like the first diagram where
| there were extra unnecessary data transfers going up through
| the CPU, except now they're going up through main memory. My
| proposal is to skip main memory and go straight from storage
| to NIC as in:
|
|         Storage - PCIe - NIC
|
| The story for serving a request would now be:
|
|     1. request comes in for a file
|     2. CPU tells NIC: "ok get it from SSD 13"
|     3. NIC requests data from SSD via PCIe, sends it out on
|        the port
|     4. CPU & main memory: _crickets_
| throw0101a wrote:
| Looking to the 800Gb/s next year. :)
|
| Is that with the ConnectX-7?
|
| * https://www.nvidia.com/content/dam/en-zz/Solutions/networkin...
| drewg123 wrote:
| No, just multiple CX6-DX
|
| I'm currently stuck because the only boards we can find that
| have enough PCIe lanes exposed to do 800G (64 for NICs +
| close to 64 for NVME, with them more or less equally divided
| between sockets) have only 3 xGMI links to the 2nd socket,
| rather than 4. That causes uneven loading on the xGMI links,
| which causes saturation issues well below 800G.
|
| EDIT: I didn't realize the CX7 specs were public; I no longer
| feel quite so special :)
| jeffbee wrote:
| I live for getting downvoted on HN so I'd just like to point out
| that this deck supports my previously-expressed opinion that the
| AMD EPYC architecture is harder to use. Out of the box, the Intel
| machine that is obsolete on paper was beating the EPYC machine by
| more than 50%.
| drewg123 wrote:
| Note that this was with the chip running in NUMA (4-NPS) mode.
| It did much better in 1-NPS mode when paired with a kernel
| without NUMA optimizations.
|
| Rome in 1-NPS mode is just fine, far better than Naples.
| monocasa wrote:
| EPYC isn't NUMA anymore.
| MisterTea wrote:
| I cannot find anything to back up this claim. How else is
| AMD linking the multiple dies/sockets together?
| monocasa wrote:
| The DDR PHYs are on the I/O die, so all of the core
| complexes have the same-length path to DRAM.
|
| Multi socket is still NUMA, but that's true of Intel as
| well.
| dragontamer wrote:
| > The DDR PHYs are on the I/O die, so all of the core
| complexes have the same-length path to DRAM.
|
| The I/O die has 4 quadrants. The 2 chiplets attached to the
| 1st quadrant access the 1st quadrant's 2 memory channels
| slightly faster than they access the 4th quadrant's.
|
| > Multi socket is still NUMA, but that's true of Intel as
| well.
|
| Intel has 6 memory channels split into two groups of 3 IIRC
| (I'm going off of my memory here). The "left" 3 memory
| channels reach the "left" 9 cores a bit faster than the
| "right" 9 cores in an 18-core Intel Skylake-X chip.
|
| --------
|
| Both AMD and Intel have non-uniform latency/bandwidth
| even within the chips that they make.
| monocasa wrote:
| There's a few cycles' difference based on how the on-chip
| network works, but variability in the number of off-chip
| links between you and memory is what dominates the design.
| And in the context of what the parent said (but has since
| edited out), that was what was being discussed.
| dragontamer wrote:
| Both EPYC and Intel Skylake-X are NUMA.
|
| Yes, Skylake-X. It turns out that the placement of those L3
| caches matters, and some cores are closer to some memory
| controllers than others.
|
| https://software.intel.com/content/www/us/en/develop/article...
|
| ------------
|
| Some cores have lower-latency access to some memory channels
| than to others. Our modern CPUs are so big that even when
| everything is on a single chip, the difference in latency can
| be measured.
|
| The only question that matters is: what are the bandwidth and
| latency of _EACH_ core to _EACH_ memory channel? The answer
| is "it varies". "It varies" a bit for Skylake, "it varies a
| bit more" for Rome (Zen 2), and "it varies a lot, lot more"
| for Naples (Zen 1).
|
| ---------
|
| For simplicity, both AMD and Intel offer memory layouts
| (usually round-robin) that "mix" the memory channels across
| the cores, giving an average latency.
|
| But for more complexity / slightly better performance, both
| AMD and Intel also offer NUMA modes: 4-NPS for AMD's EPYCs,
| or SNC (Sub-NUMA Clustering) for Intel chips. There is always
| a set of programmers who care enough about latency/bandwidth
| to drop down to this level.
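|
| As a minimal sketch of looking at that level on FreeBSD
| (assuming the vm.ndomains sysctl): it reports how many NUMA
| domains the kernel sees, e.g. 4 per socket in 4-NPS mode
| versus 1 in 1-NPS mode.
|
|     #include <sys/types.h>
|     #include <sys/sysctl.h>
|     #include <stdio.h>
|
|     int main(void)
|     {
|         int ndomains = 0;
|         size_t len = sizeof(ndomains);
|
|         /* Ask the kernel how many NUMA domains it is managing. */
|         if (sysctlbyname("vm.ndomains", &ndomains, &len,
|                          NULL, 0) == -1) {
|             perror("sysctlbyname");
|             return 1;
|         }
|         printf("NUMA domains: %d\n", ndomains);
|         return 0;
|     }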
| monocasa wrote:
| It looks like the parent edited the context out of their
| post.
|
| They were specifically calling out EPYC's extreme NUMAness,
| in contrast to Intel's, as the cause of their problems.
| That distinction has more or less been fixed since Zen 2,
| to the point that the NUMA considerations are basically the
| same between Intel and AMD (and really would be for any
| similar high core count design).
| jeffbee wrote:
| Don't blame me for editing out something you imagined. I
| didn't touch it. If you're having problems with
| hallucinations and memory see a neurologist.
| adrian_b wrote:
| Intel Xeons for servers have a few features outside the CPU
| cores that AMD Epyc is lacking for now, e.g. the ability to
| transfer data directly between the network interface cards
| and the CPU cache.
|
| These features are usually exploited by high-performance
| networking applications and they can provide superior
| performance on Intel, even if the Intel CPUs are inferior.
|
| As long as the application is dominated by the data transfers,
| such extra features of the Intel uncore can provide superior
| performance, but when the application needs heavy processing on
| the cores, the low energy efficiency of the Intel Ice Lake
| Server or older Xeons allows Epyc to deliver better results.
| drewg123 wrote:
| The feature you're talking about, DDIO, is worse than useless
| for our application. It wastes cache ways on I/O that has
| long, long, long since been evicted from the cache by the
| time we go to look at it.
|
| It might be helpful for a low-latency polling sort of
| scenario, but with interrupt coalescing all it does is waste
| cache.
| wmf wrote:
| Intel's unified mesh does have some advantages over AMD's
| quadrants, but Netflix's workload is pretty unusual. Most
| people are seeing better performance on AMD due to more cores
| and much larger cache.
| thinkingkong wrote:
| Previous discussion https://news.ycombinator.com/item?id=28584738
| detaro wrote:
| This is not a previous discussion of this talk.
| thinkingkong wrote:
| You're totally right. The talk I linked simply references NUMA
| siloing as a way of boosting throughput for their use case.
| perihelions wrote:
| Related (from a more recent talk):
|
| https://news.ycombinator.com/item?id=28584738 (_"Serving
| Netflix Video at 400Gb/s on FreeBSD [pdf]"_)
|
| There's also an AMA by the author (in the HN thread).
___________________________________________________________________
(page generated 2021-10-07 23:01 UTC)