[HN Gopher] NUMA Siloing in the FreeBSD Network Stack (2019) [pdf]
       ___________________________________________________________________
        
       NUMA Siloing in the FreeBSD Network Stack (2019) [pdf]
        
       Author : harporoeder
       Score  : 64 points
       Date   : 2021-10-07 14:26 UTC (8 hours ago)
        
 (HTM) web link (2019.eurobsdcon.org)
 (TXT) w3m dump (2019.eurobsdcon.org)
        
       | throw0101a wrote:
       | Video of the talk:
       | 
       | * https://www.youtube.com/watch?v=_o-HcG8QxPc
       | 
       | * https://2021.eurobsdcon.org/home/speakers/#serving
       | 
       | Playlist of all the 2021 EuroBSDCon videos:
       | 
       | * https://www.youtube.com/playlist?list=PLskKNopggjc4dadqaCDmc...
        
       | drewg123 wrote:
       | This is my EuroBSDcon talk from 2 years ago. The more recent one
       | (about getting to 400G) is up now at
       | https://www.youtube.com/watch?v=_o-HcG8QxPc
        
         | infogulch wrote:
         | Great talk, thanks for sharing!
         | 
         | Have you looked into something like DirectStorage to allow the
         | nic to request data directly over PCIe, thus cutting out memory
         | bandwidth limitations entirely? This is what new generation
         | consoles are using to load static scene data from NVMe drives
          | to GPUs directly over PCIe; your workload also seems to match
         | this model well.
         | 
         | Edit to add: Nvidia branded this as GPUDirect Storage in 2019:
         | https://developer.nvidia.com/blog/gpudirect-storage/
         | 
          | Edit2: Oh neat, this is mentioned in the ConnectX-7 doc linked
          | in the sibling comment.
        
           | drewg123 wrote:
           | The problem is that with current generation NICs and NVME
           | drives, you need a buffer somewhere, and there is no place
           | for the buffer. You need a buffer because the NVME drive
            | speaks in 4K chunks, and doing anything sub-4K is
            | non-optimal.
           | 
            | So picture that TCP wants to send 2 segments (1448 * 2 ==
           | 2896), possibly from a strange offset (say 7240, or 5
           | segments in). You'd be asking the NVME drive to read 2 4K
           | blocks and pull 2896 bytes out of the middle. What you really
           | want to do is read the first 12K and keep it buffered.
           | 
           | With the current generation NICs, there is no place to
           | buffer. With non gold-plated enterprise NVME, there is no
           | place to buffer. NVME does have a thing called CMB, but the
           | last time I searched, I could not find a drive with CMB that
           | was large enough to be useful at a decent price.
           | 
           | The alternative is to have a box of RAM sitting on the PCIe
           | bus. That's pretty much a GPU, but GPUs are expensive, power-
           | hungry, and unobtainable.
           | 
            | So for now, it's just not practical to do this.
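            | 
            | (A toy sketch of the alignment math above, not Netflix code:
            | it just computes which 4K NVMe blocks a 2-segment send at a
            | 5-segment offset touches, assuming the 1448-byte MSS from the
            | example.)
            | 
            |     /* Which 4K NVMe blocks does a small TCP send touch? */
            |     #include <stdio.h>
            | 
            |     #define MSS        1448   /* TCP payload bytes per segment */
            |     #define NVME_BLOCK 4096   /* NVMe read granularity */
            | 
            |     int main(void) {
            |         size_t off = 5 * MSS;   /* send starts 7240 bytes in */
            |         size_t len = 2 * MSS;   /* two segments: 2896 bytes  */
            | 
            |         size_t first = off / NVME_BLOCK;              /* block 1 */
            |         size_t last  = (off + len - 1) / NVME_BLOCK;  /* block 2 */
            | 
            |         printf("read blocks %zu-%zu for %zu useful bytes\n",
            |                first, last, len);
            |         return 0;
            |     }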
        
           | dragontamer wrote:
           | > Have you looked into something like DirectStorage to allow
           | the nic to request data directly over PCIe, thus cutting out
           | memory bandwidth limitations entirely? This is what new
           | generation consoles are using to load static scene data from
            | NVMe drives to GPUs directly over PCIe; your workload also
           | seems to match this model well.
           | 
            | Let's say an Ethernet port on NUMA#1 is asking for a movie
            | found on an NVMe SSD on NUMA#3.
           | 
           | There's simply no way for you to get the data unless you
           | traverse the NUMA-fabric somehow. Either NUMA#1 tells NUMA#3
           | "Yo, start sending me data so that I can reply to this
           | TCP/TLS/HTTP message"... or NUMA#1 offloads the job to NUMA#3
            | somehow (it's now NUMA#3's problem to figure out how to talk
           | back).
           | 
           | ----------
           | 
            | There are many ways to implement the system, each with its
            | own apparent pros and cons. There are further issues:
           | in an earlier post a few weeks ago, drewg123 explained that
           | his system / bosses demanded that each server was only
           | allowed to have 1 IP Address (meaning all 4 NICs had to
           | cooperate using link-aggregation).
           | 
            | You can effectively think of this as "each Netflix stream
            | comes in on one of the 4 ports at random" (which may be on
            | NUMA#1 or NUMA#2), while the data itself is scattered across
            | NUMA#1, #2, #3, and #4.
           | 
           | ---------
           | 
           | Note: NUMA#1 and NUMA#2 have the NICs, which means that they
           | have fewer remaining PCIe lanes for storage (The ConnectX NIC
           | uses multiple PCIe lanes). As a result, most of the movie's
           | data will be on NUMA#3 or NUMA#4.
           | 
           | -------
           | 
           | Note: I expect the NUMA-fabric to be faster than PCIe fabric.
            | NUMA-fabric is designed so that different chips can access
            | each other's RAM as if it were local, at roughly 50+GBps of
            | bandwidth and something like 300ns latency. In contrast,
            | PCIe 4.0 x16 is only 30GBps or so... and a lot of these NVMe
            | SSDs are only 4x lanes (aka: 8GBps).
           | 
           | You also need to handle all the networking tidbits: HTTPS has
           | state (not just the TLS state, but also the state of the
            | stream: whose turn it is to talk and such), which means the
            | CPU / application needs to be in the loop somehow. I
           | know that the most recent talk offloads the process as much
           | as possible thanks to the "sendfile" interface on FreeBSD
           | (also available on Linux), which allows you to "pipe" data
           | from one file-descriptor to another.
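            | 
            | (For reference, a minimal sketch of that call on FreeBSD;
            | error handling trimmed, and "fd"/"sock" are just assumed to
            | be an open file and a connected TCP socket.)
            | 
            |     #include <sys/types.h>
            |     #include <sys/socket.h>
            |     #include <sys/uio.h>
            | 
            |     /* Push a whole file out a socket without copying the
            |      * payload through userspace. */
            |     static int send_whole_file(int fd, int sock)
            |     {
            |         off_t sent = 0;
            | 
            |         /* offset 0 and nbytes 0 mean "send until EOF";
            |          * sbytes reports how much was handed to the socket. */
            |         if (sendfile(fd, sock, 0, 0, NULL, &sent, 0) == -1)
            |             return -1;
            |         return 0;
            |     }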
        
             | toast0 wrote:
             | From drewg123's other threads, it seems like their machines
             | are also hitting (or approaching) memory bandwidth limits,
             | so being able to reduce some of the memory write/read
             | requirements should help with that.
             | 
             | I think what was working best was having the disk I/O write
             | to RAM that was NUMA aligned with the NIC, so disk -> PCIe
             | -> NUMA Fabric -> RAM -> PCIe -> NIC.
             | 
             | If instead you could do disk -> PCIe -> NUMA Fabric -> PCIe
             | -> NIC, at least for a portion of the disk reads, that
             | would still be the same amount of traffic on the NUMA
             | Fabric, but less traffic on the memory bus. This probably
             | means that the NIC would be doing more high latency reads
             | though, so you need more in-flight sends to keep throughput
             | up.
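              | 
              | (Back-of-the-envelope numbers, mine not drewg123's,
              | assuming the 400Gb/s target from the talk: on the buffered
              | path every payload byte gets written to RAM once by the
              | disk and read back once by the NIC.)
              | 
              |     /* DRAM traffic of the buffered path at 400Gb/s. */
              |     #include <stdio.h>
              | 
              |     int main(void) {
              |         double line_gbps = 400.0;            /* serving rate */
              |         double payload   = line_gbps / 8.0;  /* 50 GB/s      */
              | 
              |         /* one DRAM write (disk->RAM) + one read (RAM->NIC) */
              |         printf("DRAM traffic: %.0f GB/s\n", 2.0 * payload);
              |         return 0;
              |     }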
        
             | infogulch wrote:
             | Both NICs and SSDs are connected to the CPU via PCIe: all
              | SSD and NIC traffic already goes through PCIe before it
              | hits main memory today.
             | 
              | PCIe 4.0 x16 is 64GB/s (counting both directions), and
              | there are _8_ groups of x16 lanes in a 128-lane EPYC, for a
              | total of 512GB/s or 4Tb/s, more than double the max fabric
              | bandwidth of 200GB/s.
             | 
              | Let's take the original diagram from the video:
              | 
              |                    CPU
              |                   |   |
              |         Storage - Memory - NIC
              | 
              | He was able to use NIC-kTLS offloading to simplify it to
              | this:
              | 
              |         Storage - Memory - NIC
              | 
              | Now let's add a bit more detail to the second diagram,
              | expanding it to include PCIe:
              | 
              |                  Memory
              |                   |  |
              |         Storage - PCIe - NIC
             | 
             | This third diagram is the same as the diagram above it,
             | except it explicitly describes how data gets from storage
             | to memory and from memory to NIC.
             | 
              | So the story for a request is something like:
              | 
              | 1. request comes in for a file, loop {
              | 2. CPU requests chunks of data,
              | 3. data is delivered to memory via PCIe and signals CPU to
              | handle it,
              | 4. CPU tells NIC to send data from memory,
              | 5. NIC requests data from memory via PCIe, sends it out on
              | port }
             | 
             | If you squint this looks kinda like the first diagram where
             | there were extra unnecessary data transfers going up
             | through the CPU, except now they're going up through main
              | memory. My proposal is to skip main memory and go straight
              | from storage to NIC as in:
              | 
              |         Storage - PCIe - NIC
             | 
              | The story for serving a request would now be:
              | 
              | 1. request comes in for a file,
              | 2. CPU tells NIC: "ok get it from SSD 13",
              | 3. NIC requests data from SSD via PCIe, sends it out on
              | port,
              | 4. cpu & main memory: _crickets_
        
         | throw0101a wrote:
          | Looking forward to the 800Gb/s next year. :)
         | 
         | Is that with the ConnectX-7?
         | 
         | * https://www.nvidia.com/content/dam/en-
         | zz/Solutions/networkin...
        
           | drewg123 wrote:
           | No, just multiple CX6-DX
           | 
           | I'm currently stuck because the only boards we can find that
           | have enough PCIe lanes exposed to do 800g (64 for NICs +
           | close to 64 for NVME, with them more or less equally divided
           | between sockets) have only 3 xgmi links to the 2nd socket,
           | rather than 4. That causes uneven loading on the xgmi links,
           | which causes saturation issues well below 800g
           | 
           | EDIT: I didn't realize the CX7 specs were public; I no longer
           | feel quite so special :)
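            | 
            | (Rough lane math for readers following along, with numbers
            | I'm assuming rather than the exact BOM: 800Gb/s is 100GB/s of
            | payload, and a PCIe 4.0 x16 slot moves roughly 32GB/s in one
            | direction, so you end up at four x16 slots, i.e. the 64 lanes
            | for NICs mentioned above.)
            | 
            |     /* Why 800G of NICs wants ~64 PCIe 4.0 lanes. */
            |     #include <stdio.h>
            | 
            |     int main(void) {
            |         int target_gbs = 100;  /* 800Gb/s of payload           */
            |         int x16_gbs    = 32;   /* ~PCIe 4.0 x16, one direction */
            |         int slots      = (target_gbs + x16_gbs - 1) / x16_gbs;
            | 
            |         printf("x16 NIC slots: %d (%d lanes)\n", slots, slots * 16);
            |         return 0;
            |     }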
        
       | jeffbee wrote:
       | I live for getting downvoted on HN so I'd just like to point out
       | that this deck supports my previously-expressed opinion that the
       | AMD EPYC architecture is harder to use. Out of the box, the Intel
       | machine that is obsolete on paper was beating the EPYC machine by
       | more than 50%.
        
         | drewg123 wrote:
         | Note that this was with the chip running in NUMA (4-NPS) mode.
         | It did much better in 1-NPS mode when paired with a kernel
         | without NUMA optimizations.
         | 
          | Rome in 1-NPS mode is just fine, far better than Naples.
        
         | monocasa wrote:
         | EPYC isn't NUMA anymore.
        
           | MisterTea wrote:
           | I can not find anything to back up this claim. How else is
           | AMD linking the multiple dies/sockets together?
        
             | monocasa wrote:
             | The DDR phys are on the I/O die, so all of the core
             | complexes have the same length path to DRAM.
             | 
             | Multi socket is still NUMA, but that's true of Intel as
             | well.
        
               | dragontamer wrote:
               | > The DDR phys are on the I/O die, so all of the core
               | complexes have the same length path to DRAM.
               | 
                | The I/O die has 4 quadrants. The 2 chips in the 1st
                | quadrant access the 1st quadrant's 2 memory channels
                | slightly faster than the 4th quadrant's.
               | 
               | > Multi socket is still NUMA, but that's true of Intel as
               | well.
               | 
                | Intel has 6 memory channels split into two groups of 3
                | IIRC (I'm going off of my memory here). The "left" 3
                | memory channels access the "left 9 cores" a bit faster
                | than the "right 9 cores" in an 18-core Intel Skylake-X
                | chip.
               | 
               | --------
               | 
               | Both AMD and Intel have non-uniform latency/bandwidth
               | even within the chips that they make.
        
               | monocasa wrote:
                | There's a few cycles' difference based on how the on-chip
                | network works, but variability in the number of off-chip
                | links between you and memory is what dominates the
                | design. And that, in the context of what the parent said
                | (but has since edited out), was what was being discussed.
        
           | dragontamer wrote:
           | Both EPYC and Intel Skylake-X are NUMA.
           | 
           | Yes, Skylake-X. It turns out that the placement of those L3
            | caches matters, and some cores are closer to some memory
           | controllers than others.
           | 
           | https://software.intel.com/content/www/us/en/develop/article.
           | ..
           | 
           | ------------
           | 
           | Some cores have lower latency access to some memory channels
            | than other memory channels. Our modern CPUs are so big that
            | even if everything is on a single chip, the difference in
           | latency can be measured.
           | 
            | The only question that matters is: what are the bandwidth
            | and latency from _EACH_ core to _EACH_ memory channel?
           | The answer is "it varies". "It varies" a bit for Skylake, and
           | "it varies a bit more" for Rome (Zen 2), and "it varies a lot
           | lot more" for Naples (Zen1).
           | 
           | ---------
           | 
           | For simplicity, both AMD and Intel offer memory-layouts
            | (usually round-robin) that "mix" the memory channels
            | across the cores, so every core sees an average latency.
           | 
           | But for complexity / slightly better performance, both AMD
            | and Intel also offer NUMA modes: either NPS4 (4 NUMA domains
            | per socket) for AMD's EPYCs, or SNC (sub-NUMA clustering) for
            | Intel chips. There is always a set of programmers who care
            | enough about
           | latency/bandwidth to drop down to this level.
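            | 
            | (As a concrete example of dropping down to that level, this
            | is roughly what preferring NUMA domain 0 for a process's
            | memory looks like with FreeBSD's cpuset_setdomain(2); a
            | sketch from memory, not code from the talk.)
            | 
            |     #include <sys/param.h>
            |     #include <sys/cpuset.h>
            |     #include <sys/domainset.h>
            | 
            |     /* Prefer allocations from NUMA domain 0 for this process. */
            |     static int prefer_domain0(void)
            |     {
            |         domainset_t mask;
            | 
            |         DOMAINSET_ZERO(&mask);
            |         DOMAINSET_SET(0, &mask);
            |         return cpuset_setdomain(CPU_LEVEL_WHICH, CPU_WHICH_PID,
            |                                 -1, sizeof(mask), &mask,
            |                                 DOMAINSET_POLICY_PREFER);
            |     }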
        
             | monocasa wrote:
             | It looks like the parent edited the context out of their
             | post.
             | 
             | They were specifically calling out EPYC's extreme NUMAness,
             | in contrast to Intel's, as the cause of their problems.
             | That distinction has more or less been fixed since Zen 2,
             | to the point that the NUMA considerations are basically the
             | same between Intel and AMD (and really would be for any
             | similar high core count design).
        
               | jeffbee wrote:
               | Don't blame me for editing out something you imagined. I
               | didn't touch it. If you're having problems with
                | hallucinations and memory, see a neurologist.
        
         | adrian_b wrote:
          | Intel Xeons for servers have a few features outside the CPU
          | cores that AMD Epyc lacks for now, e.g. the ability to transfer
         | data directly between the network interface cards and the CPU
         | cache.
         | 
         | These features are usually exploited by high-performance
         | networking applications and they can provide superior
         | performance on Intel, even if the Intel CPUs are inferior.
         | 
         | As long as the application is dominated by the data transfers,
         | such extra features of the Intel uncore can provide superior
         | performance, but when the application needs heavy processing on
         | the cores, the low energy efficiency of the Intel Ice Lake
         | Server or older Xeons allows Epyc to deliver better results.
        
           | drewg123 wrote:
           | The feature you're talking about, DDIO, is worse than useless
           | for our application. It wastes cache ways on I/O that has
           | long, long, long since been evicted from the cache by the
           | time we go to look at it.
           | 
           | It might be helpful for a low-latency polling sort of
           | scenario, but with interrupt coalescing all it does is waste
           | cache.
        
         | wmf wrote:
         | Intel's unified mesh does have some advantages over AMD's
         | quadrants, but Netflix's workload is pretty unusual. Most
         | people are seeing better performance on AMD due to more cores
         | and much larger cache.
        
       | thinkingkong wrote:
       | Previous discussion https://news.ycombinator.com/item?id=28584738
        
         | detaro wrote:
         | this is not a previous discussion of this talk.
        
           | thinkingkong wrote:
            | You're totally right. The talk I linked simply references
            | NUMA siloing as a way of boosting throughput for their use
            | case.
        
       | perihelions wrote:
       | Related (from a more recent talk):
       | 
        | https://news.ycombinator.com/item?id=28584738 (_"Serving
        | Netflix Video at 400Gb/s on FreeBSD [pdf]"_)
       | 
       | There's also an AMA by the author (in the HN thread).
        
       ___________________________________________________________________
       (page generated 2021-10-07 23:01 UTC)