[HN Gopher] Vector Packet Processing
___________________________________________________________________
Vector Packet Processing
Author : teleforce
Score : 46 points
Date : 2023-11-21 06:32 UTC (16 hours ago)
(HTM) web link (www.netgate.com)
(TXT) w3m dump (www.netgate.com)
| Traubenfuchs wrote:
| What kept this from being available til now? Seems like Cisco had
| it for ages.
|
| > With experimental technologies, Linux has been shown to make
| some gains in artificial benchmarks, such as dropping all
| received packets
|
| Is this a joke? A jab?
| readams wrote:
| This is the same project that was developed at Cisco originally
| nyx wrote:
| Oh man, on first reading I definitely thought it was saying
| "with experimental technologies, Linux can drop all traffic",
| which would have been a hilarious dig... but I think "drop all
| incoming packets" is a useful benchmark for evaluating
| overhead, like doing conntrack for every packet that hits then
| immediately discarding it.
| pyvpx wrote:
 | VPP has been publicly available since 2017 or before. It's
 | incredibly fast and feature-rich.
 |
 | Dropping a packet after processing overhead and/or explicit
 | classification work is a useful benchmark, yes.
| tw04 wrote:
| >What kept this from being available til now? Seems like Cisco
| had it for ages.
|
| Cisco is the one who wrote it and open sourced it. Netgate is
| just putting a wrapper around other people's (Cisco's) code.
|
| >Is this a joke? A jab?
|
| No, dropping packets is step 1 to proving you can get past some
| of the current CPU bottlenecks. Actually doing something useful
| is obviously significantly more work, but no point bothering
| with that work if the CPU is still the bottleneck.
| gonzo wrote:
| > Netgate is just putting a wrapper around other people's
| (Cisco's) code.
|
| Cisco open sourced VPP in 2016 and we've been busy working on
| it ever since.
|
| https://www.stackalytics.io/unaffiliated?module=github.com/f.
| ..
| wmf wrote:
| Dropping packets efficiently is really important for handling
| DDoS. AFAIK that was the original motivation for Cloudflare to
| adopt XDP.
| nayuki wrote:
| Reminds me of: "PacketShader - GPU-accelerated Software Router".
| https://shader.kaist.edu/packetshader/ ,
| http://keonjang.github.io/papers/sigcomm10ps.pdf
| pyvpx wrote:
 | You trade a lot of latency to make GPU parallelism work for
 | packet processing/classification. There's some massively clever
 | work around hiding it -- but simply no way to avoid it. Thus
 | it's a niche solution.
| touisteur wrote:
 | With GPUDirect and active-wait kernels, you can get tightly
 | controlled latency and saturate PCIe bandwidth without
 | touching main memory. StorageDirect if you need to write to
 | (or read from) disk.
| pyvpx wrote:
 | Packet sojourn time is bounded by the latency of the GPU
 | memory architecture, which as I understand it has the design
 | dial cranked to ten for parallelism and not so much for
 | expediency.
| nyx wrote:
| VPP is really neat tech. I recently worked on a product that
| employed it, and it was impressive to see a commodity low-power
| CPU pushing tens of gigabits of traffic.
| jauntywundrkind wrote:
| Calico CNI notably has beta support for VPP, including the
| userland _memif_ interface if you really really really need
| speed. https://www.tigera.io/blog/high-throughput-kubernetes-
| cluste... https://docs.tigera.io/calico/latest/getting-
| started/kuberne...
|
 | With memif especially, it's fast as heck. But you need to rebuild
 | your apps to target memif. There are some pretty good drop-in
 | stdlib replacements for languages like Go, but it's still some
 | work to use the DMA-accelerated, shared-memory, high-speed
 | userland packet processing that VPP is capable of. Ex:
 | https://github.com/KusakabeShi/wireguard-go-vpp
| dragontamer wrote:
| 100x faster than Linux is certainly fast. IIRC, Linux packet
| processing is considered rather slow (though full featured, well
| behaved and configurable).
|
| VPP here seems to be a "user-mode network stack", as far as I can
| tell. I was kind of attracted to the title because I was hoping
| for SIMD / Vector compute maybe even GPUs, but that doesn't seem
| to be the case.
|
| Still, a usermode network stack is apparently a must-have for any
| very-high performance network application. I've never needed it,
| but a lot of optimizers talk about how "slow" Linux networking is
| when you actually benchmark it.
| crotchfire wrote:
| > "user-mode network stack"
|
| Kernels typically cannot use vector instructions because if
| they did they would need to save and restore the vector
| register state when servicing interrupts. There is a very large
| performance cost to doing that.
|
| Moving packet processing into userspace means adding latency,
| including TLB pressure, in order to do the context switch.
|
 | I imagine that we might get some innovation by allowing the
 | system to be configured such that the kernel owns the vector
 | registers and userspace is not allowed to use them. If your
 | primary interest in vector registers/instructions is packet
 | processing, and you're doing that in kernelspace, you might not
 | mind it if userspace can't use those registers.
| wmf wrote:
| _Moving packet processing into userspace means adding
| latency, including TLB pressure, in order to do the context
| switch._
|
| This isn't the case because VPP polls the NIC from userspace
| and never enters the kernel. There are no context switches.
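 |
 | For a concrete picture, here is a minimal sketch (illustrative
 | only, not VPP source) of a DPDK-style poll-mode receive loop;
 | EAL, port, and queue setup are assumed to have been done
 | elsewhere, and the per-packet work is just a placeholder:
 |
 |   #include <rte_ethdev.h>
 |   #include <rte_mbuf.h>
 |
 |   #define BURST 32
 |
 |   /* Busy-poll one RX queue from userspace; the loop never
 |      enters the kernel and never blocks. */
 |   static void rx_loop(uint16_t port, uint16_t queue)
 |   {
 |       struct rte_mbuf *pkts[BURST];
 |       for (;;) {
 |           uint16_t n = rte_eth_rx_burst(port, queue,
 |                                         pkts, BURST);
 |           for (uint16_t i = 0; i < n; i++) {
 |               /* ... classify/forward pkts[i] here ... */
 |               rte_pktmbuf_free(pkts[i]);
 |           }
 |       }
 |   }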
| vacuity wrote:
| Is there something like an IOMMU to provide secure access
| to the NIC?
| ADSSDA wrote:
| Yes, DPDK (which VPP is built on) heavily utilizes IOMMU
| to provide host protection.
|
| https://doc.dpdk.org/guides/linux_gsg/linux_drivers.html
| discusses it a bit.
| crest wrote:
| The vector in vector packet processing has little to nothing
| to do with vector instruction sets (SSE/AVX, VMX, RVV, etc.).
| Only the very latest CPUs (and historical supercomputers)
| have scatter/gather instructions capable of efficiently
| extracting packet header fields from multiple packets in
| parallel and if you have a large enough batch of packets
| process switching the vector register file is worth it to the
| kernel. It's just that most kernels don't spill the userspace
| vector registers on every context switch because it's more
| common to switch back to the same thread (or an other
| userspace thread) than using the vector registers inside the
| kernel. Both Linux and *BSD can and do make limited use of
| vector registers inside the kernel e.g. for fast
| encryption/decryption because it's worth the start up cost.
|
 | If I understood the VPP design and implementation details
 | correctly, they try to reduce the amortised cache misses for a
 | batch of packets by running all packets in a batch through
 | each software pipeline stage before processing the next
 | stage. This should result in very good average instruction
 | cache hit rates and should also help with data cache hit
 | rates, because packet headers are small and can be prefetched,
 | while the forwarding data structures (e.g. 1 million IPv4
 | prefixes and their next hops) can be hard to fit into L2 data
 | caches and won't fit into L1 data caches.
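 |
 | A rough sketch of that idea (not VPP code; the stage functions
 | are hypothetical): run the whole batch through one stage before
 | starting the next, so each stage's code stays hot in the
 | instruction cache:
 |
 |   #include <stddef.h>
 |
 |   struct pkt;                      /* opaque packet handle */
 |
 |   /* Hypothetical pipeline stages, for illustration only. */
 |   typedef void (*stage_fn)(struct pkt *p);
 |   extern void parse(struct pkt *p);
 |   extern void lookup(struct pkt *p);
 |   extern void rewrite(struct pkt *p);
 |
 |   static void process_batch(struct pkt *batch[], size_t n)
 |   {
 |       static const stage_fn stages[] = { parse, lookup, rewrite };
 |       /* Every packet passes through stage s before any packet
 |          reaches stage s+1 (batch-per-stage, not per-packet). */
 |       for (size_t s = 0; s < 3; s++)
 |           for (size_t i = 0; i < n; i++)
 |               stages[s](batch[i]);
 |   }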
|
 | I assume a carefully tuned implementation can make further
 | gains by dedicating cores to specific pipeline stages to keep
 | the data caches hotter, at the cost of copying processed
 | packet headers to their next stage in new (sub-)batches. The
 | actual packet content is only relevant for a few operations
 | like encryption/decryption, and modern high-end NICs have
 | line-rate crypto engines to help with IPsec or TLS.
| benou wrote:
 | Batching packets brings several benefits:
 |
 | - amortizing cache misses, as you mentioned
 |
 | - better use of out-of-order, superscalar processors: by
 |   processing multiple independent packets in parallel, the
 |   processor can fill more execution units
 |
 | - enabling the use of vector instructions (SSE/AVX, VMX, etc.):
 |   again, processing multiple independent packets in parallel
 |   means you can leverage SIMD. SIMD instructions are used
 |   pervasively in VPP
| reflexe wrote:
 | Actually, at its root it is based on SIMD and prefetching. In
 | short, each part of the packet processing graph is a node. It
 | receives a vector of packets (represented as a vector of packet
 | indexes), and the output is one or more vectors, each of which
 | goes as an input to the next step in the processing graph. This
 | architecture maximizes cache hits and warms the branch
 | predictor (since we run the same small code for many packets
 | instead of the whole graph for each packet).
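 |
 | The shape of a node, very roughly (illustrative only, not VPP
 | source; hdr_of() and classify() are hypothetical helpers): take
 | a vector of packet indexes, prefetch a few packets ahead so
 | their headers are in cache by the time they're touched, and
 | emit a next-node choice per packet:
 |
 |   #include <stddef.h>
 |   #include <stdint.h>
 |
 |   struct pkt_hdr;
 |   extern struct pkt_hdr *hdr_of(uint32_t buf_index);
 |   extern uint32_t classify(struct pkt_hdr *h);
 |
 |   static void node_run(const uint32_t in[], uint32_t next[],
 |                        size_t n)
 |   {
 |       for (size_t i = 0; i < n; i++) {
 |           /* Prefetch a header a few packets ahead of where
 |              we're currently working. */
 |           if (i + 4 < n)
 |               __builtin_prefetch(hdr_of(in[i + 4]));
 |           next[i] = classify(hdr_of(in[i]));
 |       }
 |   }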
|
| You can read more about it here:
| https://s3-docs.fd.io/vpp/24.02/aboutvpp/scalar-vs-vector-pa...
| dragontamer wrote:
 | I can certainly imagine some SIMD concepts in that.
 | Particularly stream compaction (or, in the AVX-512 case, the
 | VPCOMPRESSD and VPEXPANDD instructions).
 |
 | EDIT: I guess from a SIMD perspective, I'd have expected an
 | interleaved set of packets, a la struct-of-arrays rather than
 | array-of-structs. But maybe that doesn't make sense for
 | packet formats.
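 |
 | For what it's worth, a tiny stream-compaction example with
 | AVX-512 intrinsics (illustrative only, not from VPP): given 16
 | packet lengths, pack the indexes of the packets that exceed an
 | MTU contiguously into an output array:
 |
 |   #include <immintrin.h>
 |   #include <stdint.h>
 |
 |   /* Requires AVX-512F; returns how many indexes were kept. */
 |   static int compact_oversized(const int32_t len[16],
 |                                const int32_t idx[16],
 |                                int32_t out[16], int32_t mtu)
 |   {
 |       __m512i lens = _mm512_loadu_si512(len);
 |       __m512i idxs = _mm512_loadu_si512(idx);
 |       __mmask16 over =
 |           _mm512_cmpgt_epi32_mask(lens, _mm512_set1_epi32(mtu));
 |       /* VPCOMPRESSD: store only the selected lanes, packed. */
 |       _mm512_mask_compressstoreu_epi32(out, over, idxs);
 |       return __builtin_popcount((unsigned) over);
 |   }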
| reflexe wrote:
 | I have been developing a product that uses VPP in production for
 | a few years now. It is very cool to see how much you can squeeze
 | out of cheap low-power CPUs. You can easily handle tens of
 | gigabits of IMIX traffic with a few ARM Cortex-A72s.
 |
 | VPP has very good documentation: https://s3-docs.fd.io/vpp/24.02/
 | A very cool and unique feature is the graph representation of
 | packet processing, and the ability to insert processing nodes
 | into the graph dynamically, per interface, at a given point in
 | the processing, using "features"
 | (https://s3-docs.fd.io/vpp/24.02/developer/corearchitecture/f...)
| nik736 wrote:
 | We were running our core routers with BGP and VPP for several
 | years, pushing around 40-50 Gbps on a software stack without the
 | need for expensive ASICs. It worked great and was stable. VPP is
 | a great piece of technology.
| AceJohnny2 wrote:
| In short: batch processing at the multi-packet level. Increases
| throughput, at the cost of latency.
___________________________________________________________________
(page generated 2023-11-21 23:03 UTC)