[HN Gopher] Vector Packet Processing
       ___________________________________________________________________
        
       Vector Packet Processing
        
       Author : teleforce
       Score  : 46 points
       Date   : 2023-11-21 06:32 UTC (16 hours ago)
        
 (HTM) web link (www.netgate.com)
 (TXT) w3m dump (www.netgate.com)
        
       | Traubenfuchs wrote:
       | What kept this from being available til now? Seems like Cisco had
       | it for ages.
       | 
       | > With experimental technologies, Linux has been shown to make
       | some gains in artificial benchmarks, such as dropping all
       | received packets
       | 
       | Is this a joke? A jab?
        
         | readams wrote:
         | This is the same project that was developed at Cisco originally
        
         | nyx wrote:
         | Oh man, on first reading I definitely thought it was saying
         | "with experimental technologies, Linux can drop all traffic",
         | which would have been a hilarious dig... but I think "drop all
         | incoming packets" is a useful benchmark for evaluating
         | overhead, like doing conntrack for every packet that hits then
         | immediately discarding it.
        
         | pyvpx wrote:
          | VPP has been publicly available since 2017 or before. It's
          | incredibly fast and feature-rich.
          | 
          | Dropping a packet after processing overhead and/or explicit
          | classification work is a useful benchmark, yes.
        
         | tw04 wrote:
         | >What kept this from being available til now? Seems like Cisco
         | had it for ages.
         | 
         | Cisco is the one who wrote it and open sourced it. Netgate is
         | just putting a wrapper around other people's (Cisco's) code.
         | 
         | >Is this a joke? A jab?
         | 
         | No, dropping packets is step 1 to proving you can get past some
         | of the current CPU bottlenecks. Actually doing something useful
         | is obviously significantly more work, but no point bothering
         | with that work if the CPU is still the bottleneck.
        
           | gonzo wrote:
           | > Netgate is just putting a wrapper around other people's
           | (Cisco's) code.
           | 
           | Cisco open sourced VPP in 2016 and we've been busy working on
           | it ever since.
           | 
           | https://www.stackalytics.io/unaffiliated?module=github.com/f.
           | ..
        
         | wmf wrote:
         | Dropping packets efficiently is really important for handling
         | DDoS. AFAIK that was the original motivation for Cloudflare to
         | adopt XDP.
        
       | nayuki wrote:
       | Reminds me of: "PacketShader - GPU-accelerated Software Router".
       | https://shader.kaist.edu/packetshader/ ,
       | http://keonjang.github.io/papers/sigcomm10ps.pdf
        
         | pyvpx wrote:
          | You trade a lot of latency to make GPU parallelism work for
          | packet processing/classification. There's some massively
          | clever work around hiding it -- but simply no way to avoid
          | it. Thus it's a niche solution.
        
           | touisteur wrote:
            | With GPUDirect and active-wait kernels, you can get tight,
            | controlled latency and saturate PCIe bandwidth without
            | touching main memory. GPUDirect Storage if you need to
            | write to (or read from) disk.
        
             | pyvpx wrote:
              | Packet sojourn time is bounded by the latency of the GPU
              | memory architecture, which as I understand it has the
              | design dial cranked to ten for parallelism and not so
              | much for low latency.
        
       | nyx wrote:
       | VPP is really neat tech. I recently worked on a product that
       | employed it, and it was impressive to see a commodity low-power
       | CPU pushing tens of gigabits of traffic.
        
       | jauntywundrkind wrote:
       | Calico CNI notably has beta support for VPP, including the
       | userland _memif_ interface if you really really really need
       | speed. https://www.tigera.io/blog/high-throughput-kubernetes-
       | cluste... https://docs.tigera.io/calico/latest/getting-
       | started/kuberne...
       | 
        | With memif especially, it's fast as heck. But you need to
        | rebuild your apps to target memif. There are some pretty good
        | drop-in stdlib replacements for languages like Go, but it's
        | still some work to use the DMA-accelerated, shared-memory,
        | high-speed userland packet processing that VPP is capable of.
        | Ex: https://github.com/KusakabeShi/wireguard-go-vpp
        
       | dragontamer wrote:
        | 100x faster than Linux is certainly fast. IIRC, Linux packet
        | processing is considered rather slow (though full-featured,
        | well-behaved and configurable).
       | 
       | VPP here seems to be a "user-mode network stack", as far as I can
       | tell. I was kind of attracted to the title because I was hoping
       | for SIMD / Vector compute maybe even GPUs, but that doesn't seem
       | to be the case.
       | 
       | Still, a usermode network stack is apparently a must-have for any
       | very-high performance network application. I've never needed it,
       | but a lot of optimizers talk about how "slow" Linux networking is
       | when you actually benchmark it.
        
         | crotchfire wrote:
         | > "user-mode network stack"
         | 
         | Kernels typically cannot use vector instructions because if
         | they did they would need to save and restore the vector
         | register state when servicing interrupts. There is a very large
         | performance cost to doing that.
         | 
         | Moving packet processing into userspace means adding latency,
         | including TLB pressure, in order to do the context switch.
         | 
          | I imagine that we might get some innovation by allowing the
          | system to be configured such that the kernel owns the vector
          | registers and userspace is not allowed to use them. If your
          | primary interest in vector registers/instructions is packet
          | processing, and you're doing that in kernelspace, you might
          | not mind it if userspace can't use those registers.
        
           | wmf wrote:
           | _Moving packet processing into userspace means adding
           | latency, including TLB pressure, in order to do the context
           | switch._
           | 
           | This isn't the case because VPP polls the NIC from userspace
           | and never enters the kernel. There are no context switches.
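            | 
            | A stripped-down sketch of what such a poll loop looks like
            | in the DPDK style (VPP is built on DPDK, and its receive
            | path works along these lines); EAL, mempool and port setup
            | are omitted, so this is a fragment rather than a complete
            | program:
            | 
            |   #include <rte_ethdev.h>
            |   #include <rte_mbuf.h>
            |   
            |   #define BURST 32
            |   
            |   /* Busy-poll one RX queue forever; no system calls on
            |      the data path, hence no context switches. */
            |   static void rx_loop(uint16_t port)
            |   {
            |       struct rte_mbuf *pkts[BURST];
            |   
            |       for (;;) {
            |           uint16_t n = rte_eth_rx_burst(port, 0, pkts, BURST);
            |           for (uint16_t i = 0; i < n; i++) {
            |               /* ... process pkts[i] here ... */
            |               rte_pktmbuf_free(pkts[i]);
            |           }
            |       }
            |   }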
        
             | vacuity wrote:
             | Is there something like an IOMMU to provide secure access
             | to the NIC?
        
               | ADSSDA wrote:
               | Yes, DPDK (which VPP is built on) heavily utilizes IOMMU
               | to provide host protection.
               | 
               | https://doc.dpdk.org/guides/linux_gsg/linux_drivers.html
               | discusses it a bit.
        
           | crest wrote:
           | The vector in vector packet processing has little to nothing
           | to do with vector instruction sets (SSE/AVX, VMX, RVV, etc.).
           | Only the very latest CPUs (and historical supercomputers)
           | have scatter/gather instructions capable of efficiently
           | extracting packet header fields from multiple packets in
           | parallel and if you have a large enough batch of packets
           | process switching the vector register file is worth it to the
           | kernel. It's just that most kernels don't spill the userspace
           | vector registers on every context switch because it's more
           | common to switch back to the same thread (or an other
           | userspace thread) than using the vector registers inside the
           | kernel. Both Linux and *BSD can and do make limited use of
           | vector registers inside the kernel e.g. for fast
           | encryption/decryption because it's worth the start up cost.
           | 
            | If I understood the VPP design and implementation details
            | correctly, they try to reduce the amortised cache misses
            | for a batch of packets by running all packets in a batch
            | through each software pipeline stage before processing the
            | next stage. This should result in very good average
            | instruction cache hit rates and should also help with data
            | cache hit rates, because packet headers are small and can
            | be prefetched, while the forwarding data structures (e.g. 1
            | million IPv4 prefixes and their next hops) can be hard to
            | fit into L2 data caches and won't fit into L1 data caches.
           | 
            | I assume a carefully tuned implementation can make further
            | gains by dedicating cores to specific pipeline stages to
            | keep the data caches hotter, at the cost of copying
            | processed packet headers to their next stage in new
            | (sub-)batches. The actual packet content is only relevant
            | for a few operations like encryption/decryption, and modern
            | high-end NICs have line-rate crypto engines to help with
            | IPsec or TLS.
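            | 
            | A minimal sketch of that batch-per-stage idea in plain C
            | (hypothetical stage functions and packet struct, not VPP's
            | actual code): the batched version runs every packet through
            | one stage before touching the next, so each stage's
            | instructions and tables stay hot across the whole batch.
            | 
            |   #include <stdint.h>
            |   #include <stdio.h>
            |   
            |   #define BATCH 256
            |   
            |   /* Hypothetical per-packet state touched by each stage. */
            |   typedef struct {
            |       uint32_t dst_addr;
            |       uint16_t next_hop;
            |       uint8_t  ttl;
            |   } pkt_t;
            |   
            |   static void stage_parse(pkt_t *p, int n)
            |   {
            |       for (int i = 0; i < n; i++)
            |           p[i].ttl--;
            |   }
            |   
            |   static void stage_lookup(pkt_t *p, int n)
            |   {
            |       for (int i = 0; i < n; i++)
            |           p[i].next_hop = p[i].dst_addr & 0xffff;
            |   }
            |   
            |   static void stage_rewrite(pkt_t *p, int n)
            |   {
            |       for (int i = 0; i < n; i++)
            |           p[i].dst_addr ^= p[i].next_hop;
            |   }
            |   
            |   int main(void)
            |   {
            |       pkt_t batch[BATCH] = {0};
            |   
            |       /* Scalar style: each packet walks the whole
            |          pipeline, so the next packet finds the first
            |          stage's code and tables cold again. */
            |       for (int i = 0; i < BATCH; i++) {
            |           stage_parse(&batch[i], 1);
            |           stage_lookup(&batch[i], 1);
            |           stage_rewrite(&batch[i], 1);
            |       }
            |   
            |       /* Vector style: the whole batch goes through one
            |          stage at a time, amortizing misses per batch. */
            |       stage_parse(batch, BATCH);
            |       stage_lookup(batch, BATCH);
            |       stage_rewrite(batch, BATCH);
            |   
            |       printf("ttl of packet 0: %u\n", batch[0].ttl);
            |       return 0;
            |   }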
        
             | benou wrote:
              | Batching packets brings several benefits:
              | 
              | - amortizing cache misses, as you mentioned
              | - better use of out-of-order, superscalar processors: by
              |   processing multiple independent packets in parallel,
              |   the processor can fill more execution units
              | - enabling the use of vector instructions (SSE/AVX, VMX,
              |   etc.): again, processing multiple independent packets
              |   in parallel means you can leverage SIMD. SIMD
              |   instructions are used pervasively in VPP
        
         | reflexe wrote:
          | Actually, at its root it is based on SIMD and prefetching.
          | In short, each part of the packet processing graph is a
          | node. It receives a vector of packets (represented as a
          | vector of packet indexes), then the output is one or more
          | vectors, each of which goes as an input to the next node in
          | the processing graph. This architecture maximizes cache hits
          | and keeps the branch predictor warm (since we run the same
          | small code for many packets instead of the whole graph for
          | each packet).
         | 
         | You can read more about it here:
         | https://s3-docs.fd.io/vpp/24.02/aboutvpp/scalar-vs-vector-pa...
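          | 
          | A rough, self-contained sketch of that node idea (made-up
          | function signature and node names, not VPP's actual API): a
          | node consumes a vector of packet indexes and splits it into
          | per-next-node output vectors.
          | 
          |   #include <stdint.h>
          |   #include <stdio.h>
          |   
          |   #define BATCH 4
          |   
          |   /* Toy packet store, addressed by packet index. */
          |   typedef struct { uint8_t is_ip4; } pkt_t;
          |   static pkt_t pkts[BATCH] = { {1}, {0}, {1}, {1} };
          |   
          |   /* One graph node: take a vector of packet indexes and
          |      emit one vector per next node. */
          |   static void ethernet_input_node(const uint32_t *from, int n,
          |                                   uint32_t *to_ip4, int *n_ip4,
          |                                   uint32_t *to_drop, int *n_drop)
          |   {
          |       for (int i = 0; i < n; i++) {
          |           uint32_t pi = from[i];
          |           if (pkts[pi].is_ip4)
          |               to_ip4[(*n_ip4)++] = pi;   /* next: ip4 lookup */
          |           else
          |               to_drop[(*n_drop)++] = pi; /* next: error drop */
          |       }
          |   }
          |   
          |   int main(void)
          |   {
          |       uint32_t from[BATCH] = {0, 1, 2, 3};
          |       uint32_t ip4[BATCH], drop[BATCH];
          |       int n_ip4 = 0, n_drop = 0;
          |   
          |       ethernet_input_node(from, BATCH, ip4, &n_ip4,
          |                           drop, &n_drop);
          |       printf("%d to ip4, %d to drop\n", n_ip4, n_drop);
          |       return 0;
          |   }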
        
           | dragontamer wrote:
            | I can certainly imagine some SIMD concepts in that.
            | Particularly stream compaction (or, in the AVX-512 case,
            | the VPCOMPRESSD and VPEXPANDD instructions).
            | 
            | EDIT: I guess from a SIMD perspective, I'd have expected an
            | interleaved set of packets, à la struct-of-arrays rather
            | than array-of-structs. But maybe that doesn't make sense
            | for packet formats.
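            | 
            | A small example of that kind of stream compaction, assuming
            | an AVX-512F machine (illustrative only, not taken from
            | VPP): verdicts and packet indexes live in separate arrays,
            | struct-of-arrays style, and VPCOMPRESSD packs the kept
            | indexes densely.
            | 
            |   /* gcc -O2 -mavx512f compress.c */
            |   #include <immintrin.h>
            |   #include <stdint.h>
            |   #include <stdio.h>
            |   
            |   int main(void)
            |   {
            |       int32_t verdict[16] = {1,0,1,1,0,1,0,0,
            |                              1,1,1,0,0,1,0,1}; /* 1 = keep */
            |       int32_t index[16]   = {0,1,2,3,4,5,6,7,
            |                              8,9,10,11,12,13,14,15};
            |       int32_t kept[16];
            |   
            |       __m512i v   = _mm512_loadu_si512(verdict);
            |       __m512i idx = _mm512_loadu_si512(index);
            |   
            |       /* Mask of lanes whose verdict is non-zero... */
            |       __mmask16 keep =
            |           _mm512_cmpneq_epi32_mask(v, _mm512_setzero_si512());
            |       /* ...then compress (VPCOMPRESSD) the corresponding
            |          packet indexes into a dense output array. */
            |       _mm512_mask_compressstoreu_epi32(kept, keep, idx);
            |   
            |       printf("kept %d packets, first index %d\n",
            |              __builtin_popcount((unsigned) keep), kept[0]);
            |       return 0;
            |   }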
        
       | reflexe wrote:
        | I have been developing a product that uses VPP in production
        | for a few years now. It is very cool to see how much you can
        | squeeze out of cheap low-power CPUs. You can easily handle
        | tens of Gbit/s of iMIX traffic with a few Arm Cortex-A72s.
        | 
        | VPP has very good documentation: https://s3-docs.fd.io/vpp/24.02/
        | A very cool unique feature is the graph representation for
        | packet processing, and the ability to insert processing nodes
        | into the graph dynamically, per interface, at some point in
        | the processing, using "features"
        | (https://s3-docs.fd.io/vpp/24.02/developer/corearchitecture/f...)
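        | 
        | Roughly what that looks like inside a plugin, adapted from
        | memory of the VPP sample plugin (the names here are
        | placeholders; see the linked developer docs for the
        | authoritative form): a node is registered on a feature arc,
        | then enabled per interface at runtime.
        | 
        |   /* Register "my-node" on the device-input feature arc so it
        |      runs before ethernet-input. */
        |   VNET_FEATURE_INIT (my_feature, static) =
        |   {
        |     .arc_name = "device-input",
        |     .node_name = "my-node",
        |     .runs_before = VNET_FEATURES ("ethernet-input"),
        |   };
        |   
        |   /* Later, e.g. from a CLI or API handler, enable or disable
        |      it on a specific interface. */
        |   vnet_feature_enable_disable ("device-input", "my-node",
        |                                sw_if_index, 1 /* enable */,
        |                                0, 0);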
        
       | nik736 wrote:
        | We ran our core routers with BGP and VPP for several years,
        | pushing around 40-50 Gbps on a software stack without the need
        | for expensive ASICs. It worked great and was stable. VPP is a
        | great piece of technology.
        
       | AceJohnny2 wrote:
       | In short: batch processing at the multi-packet level. Increases
       | throughput, at the cost of latency.
        
       ___________________________________________________________________
       (page generated 2023-11-21 23:03 UTC)