[HN Gopher] Linux network performance parameters
___________________________________________________________________
Linux network performance parameters
Author : dreampeppers99
Score : 285 points
Date : 2023-09-06 11:56 UTC (8 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| freedomben wrote:
| Could anyone recommend a video or video series covering similar
| material?
|
| There's lots on networking in general, but I've had a hard time
| finding anything on the Linux-specific implementation
| 8K832d7tNmiQ wrote:
| I'm also seconding this, but from a microcontroller perspective.
|
| I want to try developing a simple TCP echo server for a
| microcontroller, but most examples just use the vendor's own
| TCP library and put no effort into explaining how to manually
| set up and establish a connection to the router.
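|
| Something like this minimal sketch, using the plain BSD sockets
| API, is roughly what I mean (lwIP's socket mode mirrors this
| API; the vendor-specific part I'm missing is bringing the link
| up and getting an address):
|
|     /* Minimal TCP echo server over the BSD sockets API. Many
|      * embedded stacks (e.g. lwIP in socket mode) expose a very
|      * similar API; link/DHCP setup is still vendor-specific. */
|     #include <unistd.h>
|     #include <arpa/inet.h>
|     #include <sys/socket.h>
|
|     int main(void) {
|         int srv = socket(AF_INET, SOCK_STREAM, 0);
|         struct sockaddr_in addr = {0};
|         addr.sin_family = AF_INET;
|         addr.sin_addr.s_addr = htonl(INADDR_ANY);
|         addr.sin_port = htons(7);          /* classic echo port */
|
|         bind(srv, (struct sockaddr *)&addr, sizeof(addr));
|         listen(srv, 1);
|
|         for (;;) {
|             int c = accept(srv, NULL, NULL);
|             char buf[512];
|             ssize_t n;
|             while ((n = recv(c, buf, sizeof(buf), 0)) > 0)
|                 send(c, buf, n, 0);     /* echo back what we read */
|             close(c);
|         }
|     }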
| patmorgan23 wrote:
| Well you can always read the standard
| [deleted]
| mikece wrote:
| How long until Linux reaches network performance parity with
| FreeBSD (or surpasses it)?
| [deleted]
| [deleted]
| dekhn wrote:
| Linux has better network performance than FreeBSD over nearly
| every use case I've seen.
| nolist_policy wrote:
| From every benchmark I've seen so far, Linux has always been
| faster than the BSDs.
|
| For example, look at these benchmarks from 2003[1]. Makes you
| wonder where the myth comes from.
|
| The newest benchmark I could find[2] points in the same
| direction.
|
| Does anyone have more recent data?
|
| [1] http://bulk.fefe.de/scalability/ [2]
| https://matteocroce.medium.com/linux-and-freebsd-networking-...
| Thaxll wrote:
| This is kind of an urban legend. Do you think the millions of
| servers at Google, Amazon, etc. have those performance issues?
| dijit wrote:
| This is "appeal to authority" fallacy incarnate.
|
| Google/Amazon "etc;" are likely happy to pay the cost because
| it really is "good enough" and the benefits of Linux over
| FreeBSD are otherwise quite considerable.
|
| Google in particular seems blissfully happy to literally
| throw hardware at problems; since hardware is (for them
| especially) fundamentally extremely cheap.
|
| Even multiple percentage points of gain in _throughput_ are not
| necessary for most applications, and Linux is decent enough
| with latency if you avoid complex IP/NFTables rules and avoid
| CONNTRACK like the plague.
|
| As u/jeffbee says, most of the larger tech companies these days
| use userland networking and bypass the kernel almost completely.
| [deleted]
| Thaxll wrote:
| I know they bypass the kernel, but my point still stands: most
| of the servers on the internet run on Linux, that's a fact, so
| more money, time, and manpower have been invested in that OS
| than in any other.
| dijit wrote:
| Your point is that popularity means that it will improve.
|
| This is true, to a point.
|
| Counterpoint: Windows Desktop Experience.
|
| EDIT: that comment was glib, let me do a proper
| counterpoint.
|
| Common areas are some of the least maintained in reality; I can
| think of meet-me-rooms or central fibre hubs in major cities;
| they are expensive and subject to a lot of the whims of the
| major provider.
|
| Crucially, despite large amounts of investment, the
| underlying architecture or infrastructure remains, even
| if the entire fabric of the area changes around it.
| _Most_ providers using these kinds of common areas do
| everything they can to _avoid touching_ the area itself,
| especially as after a while it becomes very difficult to
| navigate and politically charged.
|
| Fundamentally, the _architecture_ of Linux's network stack
| really is "good enough", which is almost worse than it sounds,
| since "good enough" means there's no reason to look there.
| There is an old adage about "worse is better": if something is
| truly broken, people will put effort into fixing it.
|
| Linux's networking stack is _fine_, it's just not quite as good
| an architecture as the FreeBSD one. The FreeBSD one has a lot
| less attention on it, but fundamentally it's a cleaner
| implementation and it's easier to get much more out of it.
|
| You will find the same argument ad infinitum regarding other
| subjects such as epoll vs IOCP vs kqueue (epoll was _abysmally
| terrible_ though, and ended up being replaced by io_uring, but
| even that took over a decade).
| [deleted]
| Thaxll wrote:
| Yes, things improve when we're talking about billions of
| dollars of infra cost.
|
| Linux is not some "random on the side" feature that is merely
| good enough.
| dijit wrote:
| To start with: it's not that much infra cost.
|
| Especially since you don't even know what you're
| attempting to optimise for.
|
| Latency? p99 on Linux is _fine_; nobody is going to care that
| the request took 300ms longer. Even in aggregate across a huge
| fleet of machines, waiting an extra 3ms is totally, _totally_
| fine.
|
| Throughput? You'll most likely bottleneck on something else
| anyway; getting a storage array to hydrate at line rate for
| 100Gb/s is difficult, and _anyway_ you want to do
| authentication, distribution of chunks, and metadata
| operations, right?
|
| You're forgetting that it's _likely_ an additional cost
| of a couple million dollars per year in absolute hardware
| to solve that issue with throughput, which is, in TCO
| terms, a couple of developers.
|
| Engineering effort to replace the foundation of an OS? Probably
| an order of magnitude more. It definitely carries significantly
| more risk, plus the potential for political backlash from
| upheaving some other company's weird workflow.
|
| Hardware isn't so expensive really.
|
| Of course, you could just bypass the kernel with much
| less effort and avoid all of this shit entirely.
| tptacek wrote:
| Do you know for a fact that Google primarily uses userland
| networking, or does that just seem accurate to you?
| drewg123 wrote:
| Google makes heavy use of userspace networking. I was
| there roughly a decade ago. At least at that time, a
| major factor in the choice of userspace over kernel
| networking was time to deployment. Services like the ones
| described above were built on the monorepo, and could be
| deployed in seconds at the touch of a button.
|
| Meanwhile, Google had a building full of people
| maintaining the Google kernel (eg, maintaining rejected
| or unsubmitted patches that were critical for business
| reasons), and it took many months to do a kernel release.
| tptacek wrote:
| Yes. I don't think anyone is disputing that Google does
| significant userspace networking things. But the premise
| of this thread is that "ordinary" (ie: non-network-
| infrastructure --- SDN, load balancer, routing)
| applications, things that would normally just get BSD
| sockets, are based on userspace networking. That seems
| not to be the case.
| dijit wrote:
| I can't honestly answer that with the NDA I signed.
|
| However there is some public information on _some_
| components that has been shared in this thread which
| allows you to draw your own conclusion.
| tptacek wrote:
| Yes, the one link shared says essentially the opposite
| thing.
| dijit wrote:
| You may have read it wrong.
| tptacek wrote:
| Did I? How?
| dijit wrote:
| For one, by assuming that the work done primarily for
| microkernels/appliances is the absolute limit of userspace
| networking at Google, and that similar work would not go into a
| hypervisor (hypervisors are universally treated as a vSwitch in
| almost all virtual environments the world over).
|
| And making that assumption when there are many public
| examples of Google doing this in other areas such as
| gVisor and Netstack?
| tptacek wrote:
| If you have information about other userspace networking
| projects at Google, I'd love to read it, but the Snap
| paper repeatedly suggests that the userspace networking
| characteristics of the design are distinctive. Certainly,
| most networking at Google isn't netstack. Have you done
| much with netstack? It is many things, but ultra-high-
| performance isn't one of them.
| dijit wrote:
| userspace networking will take different forms depending
| on the use-case.
|
| Which is one of the arguments of why to do it that way;
| instead of using general purpose networking.
|
| I haven't the time or inclination to find anything public
| on this, nor am I interested really in convincing you.
| Ask a former googler.
| tptacek wrote:
| OK. I did. They said "no, it's not the case that
| networking at Google is predominately user-mode". (They
| also said "it depends on what you mean by most"). Do you
| have more you want me to relay to them? Did you work on
| this stuff at Google?
|
| Per the Snap thread above: if you're building a router or
| a load balancer or some other bit of network
| infrastructure, it's not unlikely that there's userland
| IP involved. But if you're shipping a normal program on,
| like, Borg or whatever, it's kernel networking.
| dijit wrote:
| I worked as a Google partner for some specialised
| projects within AAA online gaming.
|
| I continue in a similar position today, and thus my NDA is
| still in full effect, which limits what I can say when there's
| nothing public.
|
| I have not worked for Google, just very closely.
| tptacek wrote:
| Oh. Then, unless a Googler jumps in here and says I'm
| wrong: no, ordinary applications at Google are not as a
| rule built on userspace networking. That's not my opinion
| (though: it was my prior, having done a bunch of
| userspace networking stuff), it's the result of asking
| Google people about it.
|
| Maybe it's all changed in the last year! But then: that
| makes all of this irrelevant to the thread, about FreeBSD
| vs. Linux network stack performance.
| ori_b wrote:
| Do you think that (outside of a few special cases) they're
| using anything near the network bandwidth available to them?
|
| I would expect something in the 1% to 10% bandwidth utilization
| range, on average. From my vague recollection, that's what it
| was at FB when I was there. They put in stupid amounts of
| network capacity so that engineers rarely have to think about
| the capacity of the links they're using, and so that if their
| needs grow, they're not bottlenecked on a build-out.
|
| To answer the original question, it's complicated. I have a
| weird client where FreeBSD gets 450 MiB/s and Linux gets 85
| MiB/s with the default congestion control algorithm. Changing
| the congestion control algorithm can get me anywhere between
| 1.7 MiB/s and 470 MiB/s. So, better performance... under what
| circumstances?
| jeffbee wrote:
| The big guys don't have the patience to wait for Linux kernel
| networking to be fast and scalable. They bypass the kernel
| and take over the hardware.
|
| https://blog.acolyer.org/2019/11/11/snap-networking/
| tptacek wrote:
| _Over the course of several years, the architecture
| underpinning Snap has been used in production for multiple
| networking applications, including network virtualization
| for cloud VMs [19], packet-processing for Internet peering
| [62], scalable load balancing [22], and Pony Express, a
| reliable transport and communications stack that is our
| focus for the remainder of this paper._
|
| This paper suggests, as I would have expected, that Google
| uses userland networking in strategic spots where low-level
| network development is important (SDNs and routing), and
| not for normal applications.
| jeffbee wrote:
| "and Pony Express" is the operative phrase. As the paper
| states on page 1, "Snap is deployed to over half of our
| fleet of machines and supports the needs of numerous
| teams." According to the paper it is not niche.
| nolist_policy wrote:
| Makes sense, they're probably using QUIC in lots of
| products and the kernel can't accelerate that anyways, it
| would only pass opaque UDP packets to and from the
| application.
| devonkim wrote:
| Last I remember, as of at least 7 years ago Google et al. were
| using custom NIC firmware to avoid having the kernel get
| involved in general (I think they managed to do a lot of Maglev
| directly on the NICs), because latency is so dang important at
| high networking speeds that letting anything context switch and
| wait on the kernel is a big performance hit. Not a lot of room
| for latency when you're working at 100 Gbps.
| tptacek wrote:
| Isn't Pony Express a ground-up replacement for all of
| TCP/IP? It doesn't even present a TCP/UDP socket
| interface.
| jeffbee wrote:
| Correct. That is my point. The sockets interface, and
| design choices within the Linux kernel, make ordinary TCP
| sockets too difficult to exploit in a datacenter
| environment. The general trend is away from TCP sockets.
| QUIC (HTTP/3) is a less extreme retreat from TCP, moving
| all the flow control, congestion, and retry logic out of
| the kernel and into the application.
|
| An example of how Linux TCP is unsuitable for datacenters
| is that the minimum RTO is hard-coded to 200ms, which is
| essentially forever. People have been trying to land
| better or at least more configurable parameters upstream
| for decades. I am hardly the first person to point out
| the deficiencies. Google presented tuning Linux for
| datacenter applications at LPC 2022, and their deck has
| barely changed in 15 years.
| tptacek wrote:
| At the point where we're talking about applications that
| don't even use standard protocols, we've stopped
| supplying data points about whether FreeBSD's stack is
| faster than Linux's, which is the point of the thread.
|
| _Later_
|
| Also, the idea that QUIC is a concession made to
| intractable Linux stack problems (the subtext I got from
| that comment) seems pretty off, since the problems QUIC
| addresses (HOLB, &c) are old, well known, and were the
| subject of previous attempts at new transports (SCTP,
| notably).
| corbet wrote:
| That's funny ... the "big guys" are some of the biggest
| contributors to the Linux network stack, almost as if they
| were actually using it and cared about how well it works.
| jeffbee wrote:
| History has shown that tons of Linux networking
| scalability and performance contributions have been
| rejected by the gatekeepers/maintainers. The upstream
| kernel remains unsuitable for datacenter use, and all the
| major operators bypass or patch it.
| eddtests wrote:
| Do you have links on this? I've not heard anything about
| it
| tptacek wrote:
| I believe they're paraphrasing the Snap paper, and also
| that they're extrapolating too far from it.
| sophacles wrote:
| All the major operators sometimes bypass or patch it for
| some use cases. For others they use it as is. For others still
| they laugh at you for taking the kind of drugs that make one
| think any CPU is sufficient to handle networking in code.
|
| Networking isn't a one size fits all thing - different
| networks have different needs, and different systems in
| any network will have different needs.
|
| Userland networking is great until you start needing to deal
| with weird flows or unexpected traffic - then you end up
| needing something a bit more robust, and your performance
| starts dropping because you added a bunch of branches to your
| code or switched over to a kernel implementation that handles
| those cases. I've seen a few cases of userland networking being
| slower than just using the kernel - and being kept anyway,
| because sometimes what you care about is control over the
| packet lifecycle more than raw throughput.
|
| Kernels prioritize robust network stacks that can handle a lot
| of cases well enough. Different implementations handle
| different scenarios better - there's plenty of very high
| performance networking done with vanilla Linux and vanilla
| FreeBSD.
| sophacles wrote:
| Performance parity on which axis? For which use case?
|
| Talking generally about "network performance" is approximately
| as useful as talking generally about "engine performance". Just
| like it makes no sense to compare a weed-eater engine to a
| locomotive diesel without talking about use case and desired
| outcomes, it makes no sense to compare "performance of FreeBSD
| network stack" and "Linux network stack" without understanding
| the role those systems will be playing in the network.
|
| Depending on context, FreeBSD, Linux, or various userland
| stacks can be great, average, or terrible choices.
| circularfoyers wrote:
| Can you provide some examples of different contexts where
| Linux or FreeBSD might be better or worse choices?
| sophacles wrote:
| Sure:
|
| Linux is a networking Swiss Army knife (or maybe a Dremel). It
| can do a lot of stuff reasonably well. It has all sorts of
| knobs and levers, so you can often configure it to do really
| weird stuff. I tend to reach for it first to understand the
| shape of a problem/solution.
|
| BSD is fantastic for a lot of server applications, particularly
| single-tenant, high-throughput ones like mail servers,
| dedicated app servers, etc. A great series of case studies has
| come out of Netflix on this (google for "800Gbps on freebsd
| netflix", for example - every iteration of that presentation is
| fantastic and usually discussed here at least once, and Drew G.
| shows up in the comments and answers questions).
|
| It's also pretty nice for firewalling/routing small and
| medium networks - (opn|pf)sense are both great systems for
| this built on FreeBSD (apologies for the drama injection
| this may cause below).
|
| One of the reasons I reach for Linux first, unless I already
| know the scope and shape of the problem, is that the entire
| "userland vs kernel" distinction is much blurrier there. Linux
| allows you to pass some or all traffic to userland at various
| points in the stack and in various ways, and to inject code at
| the kernel level via eBPF, leading to a lot of hybrid solutions
| - this is nice in middleboxes where you want some dynamism and
| control, particularly in multi-tenant networks (and that's the
| space my work is in, so it's what I know best).
|
| Please bear in mind that these are my opinions and
| uses/takes on the tools. Just like with programming there's
| a certain amount of "art" (or maybe "craft") to this, and
| other folks will have different (but likely just as valid)
| views - there are a lot of ways to do anything in networking.
| [deleted]
| doctorpangloss wrote:
| Does performance tuning for Wi-Fi adapters matter?
|
| On desktops, other than disabling features, can anything fix the
| problems with i210 and i225 ethernet? Those seem to be the two
| most common NICs nowadays.
|
| I don't really understand why common networking hardware and
| drivers are so flawed. There is a lot of attention paid to
| RISC-V. How about starting with a fully open and correct NIC?
| They'll shove it in there if it's cheaper than an i210. Or maybe
| that's impossible.
| jeffbee wrote:
| i225 is just broken, but I get excellent performance from the
| i210. 1Gb/s is hardly challenging on a contemporaneous CPU, and
| the i210 offers 4 queues. What's your beef with the i210?
| doctorpangloss wrote:
| There are a lot of problems with the i210. Here's a sample:
|
| https://www.google.com/search?q=i210+proxmox+e1000e+disable
|
| Most people don't really use their NICs "all the time" "with
| many hosts." The i210 in particular will hang after a few
| months of e.g. etcd cluster traffic on 9th and 10th gen
| Intel, which is common for SFFPCs.
|
| On Windows, the NDIS driver works a lot better, though there
| are many disconnects under a similar traffic load as on Linux,
| and features like receive side coalescing are broken. They also
| don't provide proper INFs for Windows Server editions, just
| because.
|
| I assume Intel does all of this on purpose. I don't think
| their functionally equivalent server SKUs are this broken.
|
| Apparently the 10Gig patents are expiring very soon. That
| will make Realtek, Broadcom and Aquantia's chips a lot
| cheaper. IMO, motherboards should be much smaller, shipping
| with BMC and way more rational IO: SFP+, 22110, Oculink, U.2,
| and PCIe spaced for Infinity Fabric & NVLink. Everyone should
| be using LVFS for firmware - NVMe firmware, despite having a
| standardized update mechanism, is a complete mess, with bugs on
| every major controller.
|
| I share all of this as someone with experience in operating
| commodity hardware at scale. People are so wasteful with
| their hardware.
| trustingtrust wrote:
| There are 3 revisions of the i225, and Intel essentially got
| rid of it and launched the i226. That one also seems to be
| problematic [1]. Why is it exponentially harder to make a
| 2.5Gbps NIC when the 1Gbps NICs (i210 and i211) have worked
| well for them? Shouldn't it be trivial to make it 2.5x? They
| seem to make good 10Gbps NICs, so I would assume 2.5Gbps
| shouldn't need a 5th try from Intel?
| [1] - https://shorturl.at/esCNP
| jeffbee wrote:
| The bugs I am aware of are on the PCIe side. i225 will lock
| up the bus if it attempts to do PTM to support PTP. That's
| a pretty serious bug. You would think Intel has this nailed
| since they invented PCIe and PCI for that matter.
| Apparently not. Maybe they outsourced it.
| elabajaba wrote:
| > Does performance tuning for Wi-Fi adapters matter?
|
| If you're willing to potentially sacrifice 10-20% of (max local
| network) throughput you can drastically improve wifi fairness
| and improve ping times/reduce bufferbloat (random ping spikes
| will still happen on wifi though).
|
| There's a huge thread https://forum.openwrt.org/t/aql-and-the-
| ath10k-is-lovely/590... that has stuff about enabling and
| tuning AQM, and some of the tradeoffs between throughput and
| latency.
| gjulianm wrote:
| This is great, not just the parameters themselves but all the
| steps that a packet follows from the point it enters the NIC
| until it gets to userspace.
|
| Just one thing to add regarding network performance: if you're
| working in a system with multiple CPUs (which is usually the case
| in big servers), check NUMA allocation. Sometimes the network
| card will be attached to one CPU's NUMA node while the
| application is executing on a different one, and that can
| affect performance too.
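|
| A quick way to check the NIC side of that (a sketch; "eth0" is
| a placeholder interface name, and the sysfs node is only
| present for PCI devices):
|
|     /* Print which NUMA node a NIC hangs off of, so the
|      * application can be pinned to CPUs/memory on that node
|      * (e.g. with numactl --cpunodebind/--membind). */
|     #include <stdio.h>
|
|     int main(void) {
|         FILE *f = fopen("/sys/class/net/eth0/device/numa_node",
|                         "r");
|         int node = -1;
|         if (f && fscanf(f, "%d", &node) == 1)
|             printf("NIC is on NUMA node %d\n", node);
|         else
|             perror("numa_node");    /* -1 means no NUMA info */
|         if (f) fclose(f);
|         return 0;
|     }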
| klabb3 wrote:
| A random thing I ran into with the defaults (Ubuntu Linux):
|
| - net.ipv4.tcp_rmem ~ 6MB
|
| - net.core.rmem_max ~ 1MB
|
| So... the tcp_rmem value takes precedence by default, meaning
| that the TCP receive window for a vanilla TCP socket actually
| goes up to 6MB if needed (in reality 3MB because of the
| halving, but let's ignore that for now since it's a constant).
|
| But if I "setsockopt SO_RCVBUF" in a user-space application,
| I'm actually capped at a maximum of 1MB, even though I already
| have 6MB. If I try to _reduce it_ from 6MB to e.g. 4MB, it will
| result in 1MB. This seems very strange. (Perhaps I'm holding it
| wrong?)
|
| (Same applies to SO_SNDBUF/wmem...)
|
| To me, it seems like Linux is confused about the precedence order
| of these options. Why not have core.rmem_max be larger and the
| authoritative directive? Is there some historical reason for
| this?
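|
| A small sketch that reproduces what I'm describing (the exact
| numbers depend on your sysctls, and the kernel stores double
| the requested value to account for bookkeeping overhead):
|
|     /* Observe how an explicit SO_RCVBUF interacts with
|      * net.core.rmem_max. A fresh TCP socket starts at
|      * tcp_rmem's default and is auto-tuned up to tcp_rmem's
|      * max; once setsockopt(SO_RCVBUF) is called, auto-tuning
|      * is off and the request is clamped to rmem_max (then
|      * doubled internally). */
|     #include <stdio.h>
|     #include <sys/socket.h>
|     #include <netinet/in.h>
|
|     static void show(int fd, const char *when) {
|         int val = 0;
|         socklen_t len = sizeof(val);
|         getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &val, &len);
|         printf("%s: SO_RCVBUF = %d bytes\n", when, val);
|     }
|
|     int main(void) {
|         int fd = socket(AF_INET, SOCK_STREAM, 0);
|         show(fd, "before setsockopt");
|
|         int req = 4 * 1024 * 1024;          /* ask for 4MB */
|         setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
|                    &req, sizeof(req));
|         show(fd, "after asking for 4MB");
|         return 0;
|     }
|
| The second getsockopt comes back clamped relative to rmem_max
| rather than to tcp_rmem's max.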
| pengaru wrote:
| The max in net.ipv4.tcp_rmem is a limit for the auto-tuning the
| kernel performs.
|
| Once you set SO_RCVBUF, the auto-tuning is out of the picture
| for that socket, and net.core.rmem_max becomes the max.
|
| It's pretty clearly documented @ Documentation/networking/ip-
| sysctl.rst
|
| Edit: downvotes, really? smh
| dekhn wrote:
| And to add: the kernel autotunes better than you can, so leave
| that enabled unless you're Vint Cerf, Jim Gettys, or Vern
| Paxson.
| napkin wrote:
| Just changing Linux's default congestion control
| (net.ipv4.tcp_congestion_control) to 'bbr' can make a _huge_
| difference in some scenarios, I guess over distances with
| sporadic packet loss and jitter, and encapsulation.
|
| Over the last year, I was troubleshooting issues with the
| following connection flow:
|
| client host <-- HTTP --> reverse proxy host <-- HTTP over
| Wireguard --> service host
|
| On average, I could not get better than 20% of the theoretical
| max throughput. Also, connections tended to slow to a crawl
| over time. I had hacky workarounds like forcing connections to
| close frequently. Finally, switching congestion control to
| 'bbr' gave close to the theoretical max throughput and reliable
| connections.
|
| I don't really understand enough about TCP to understand why it
| works. The change needed to be made on both sides of Wireguard.
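|
| For reference, besides the sysctl, a process can also opt in
| per socket (a sketch; it needs the tcp_bbr module available,
| and for unprivileged processes the algorithm may also have to
| be listed in net.ipv4.tcp_allowed_congestion_control):
|
|     /* Select BBR for a single TCP socket via TCP_CONGESTION.
|      * The system-wide default remains
|      * net.ipv4.tcp_congestion_control. */
|     #include <stdio.h>
|     #include <string.h>
|     #include <sys/socket.h>
|     #include <netinet/in.h>
|     #include <netinet/tcp.h>
|
|     int main(void) {
|         int fd = socket(AF_INET, SOCK_STREAM, 0);
|         const char *cc = "bbr";
|         if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
|                        cc, strlen(cc)) < 0)
|             perror("TCP_CONGESTION");  /* e.g. bbr not loaded */
|
|         char cur[16] = {0};
|         socklen_t len = sizeof(cur);
|         getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cur, &len);
|         printf("congestion control: %s\n", cur);
|         return 0;
|     }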
| drewg123 wrote:
| The difference is that BBR does not use loss as a signal of
| congestion. Most TCP stacks will cut their send windows in half
| (or otherwise greatly reduce them) at the first sign of loss.
| So if you're on a lossy VPN, or sending a huge burst at 1Gb/s
| on a 10Mb/s VPN uplink, TCP will normally see loss, and back
| way off.
|
| BBR tries to find the bottleneck bandwidth rate, i.e., the
| bandwidth of the narrowest or most congested link. It does this
| by measuring the round-trip time and increasing the transmit
| rate until the RTT increases. When the RTT increases, the
| assumption is that a queue is building at the narrowest portion
| of the path, and the increase in RTT is proportional to the
| queue depth. It then drops the rate until the RTT normalizes
| due to the queue draining. It sends at that rate for a period
| of time, and then slightly increases the rate to see if the RTT
| increases again (if not, it means the queuing it saw before was
| due to competing traffic which has since cleared).
|
| I upgraded from a 10Mb/s cable uplink to 1Gb/s symmetrical
| fiber a few years ago. When I did so, I was ticked that my
| upload speed on my corp. VPN remained at 5Mb/s or so. When I
| switched to RACK TCP (or BBR) on FreeBSD, my upload went up by
| a factor of 8 or so, to about 40Mb/s, which is the limit of the
| VPN.
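|
| Not BBR itself, but here's a toy simulation of that probing
| idea (a fake 50 Mb/s bottleneck with a 20 ms base RTT; all the
| numbers and gains are made up for illustration):
|
|     /* Toy model of delay-based rate probing: push the pacing
|      * rate up until the RTT inflates past the observed minimum
|      * (a queue is building at the bottleneck), then back off
|      * so the queue can drain. NOT the real BBR state machine. */
|     #include <stdio.h>
|
|     static double queue_ms = 0.0;      /* simulated queue delay */
|
|     static double measure_rtt_ms(double rate_mbps) {
|         double bottleneck = 50.0, base = 20.0;
|         queue_ms += (rate_mbps - bottleneck) * 0.5;
|         if (queue_ms < 0) queue_ms = 0; /* queue grows or drains */
|         return base + queue_ms;
|     }
|
|     int main(void) {
|         double rate = 10.0;             /* pacing rate, Mb/s */
|         double min_rtt = measure_rtt_ms(rate);
|
|         for (int i = 0; i < 20; i++) {
|             double probe = rate * 1.25; /* probe for more bw */
|             double rtt = measure_rtt_ms(probe);
|             if (rtt > min_rtt * 1.25)   /* queue is building */
|                 rate *= 0.75;           /* back off, let it drain */
|             else
|                 rate = probe;           /* no queue yet: keep it */
|             if (rtt < min_rtt) min_rtt = rtt;
|             printf("step %2d: rate %5.1f Mb/s, rtt %5.1f ms\n",
|                    i, rate, rtt);
|         }
|         return 0;
|     }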
| [deleted]
___________________________________________________________________
(page generated 2023-09-06 20:00 UTC)