[HN Gopher] Better visibility into Linux packet-dropping decisions
___________________________________________________________________
Better visibility into Linux packet-dropping decisions
Author : rwmj
Score : 88 points
Date : 2022-03-03 17:39 UTC (5 hours ago)
(HTM) web link (lwn.net)
(TXT) w3m dump (lwn.net)
| egberts1 wrote:
| There are arguably 2,700 places where the kernel can drop a packet,
| and only about six-tenths of one percent of them are reported to
| the log.
|
| One thing I've noticed is that introducing another function
| argument can perturb tightly written loops and call sites (in
| other words, it busts the instruction cache).
|
| I am quite sure Jens Axboe has something to say about this while
| he is busy pushing the Linux kernel toward multi-million IOPS and
| packet-processing throughput.
| rwmj wrote:
| Error/packet-drop paths can be moved to a cold path outside of
| hot loops using if(unlikely(...)), although I agree that does
| require analysis & code modifications.
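|
| A minimal sketch of that pattern (hypothetical names: struct ring,
| struct queue, fetch_next_packet(), no_buffer_space() and deliver()
| are made up for illustration):
|
|     /* Hypothetical RX poll loop.  unlikely() wraps
|      * __builtin_expect(), so the compiler lays the drop branch out
|      * of the hot instruction stream. */
|     static void rx_poll(struct ring *ring, struct queue *q)
|     {
|             struct sk_buff *skb;
|
|             while ((skb = fetch_next_packet(ring)) != NULL) {
|                     if (unlikely(no_buffer_space(q))) {
|                             q->drop_count++;        /* cold path */
|                             kfree_skb(skb);
|                             continue;
|                     }
|                     deliver(skb);    /* hot path stays compact */
|             }
|     }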
| contingencies wrote:
| Cool, but I think the following are safe assumptions: most packets
| are dropped "elsewhere". Most packets are dropped due to
| misconfiguration, route loss (often due to power loss, link
| change, or link-integrity issues), or firewalling. Well-written
| applications tend to survive regardless. IPv6 is nominally better
| than IPv4 at surviving link-state changes.
|
| Also: The recent trend of putting everything down web sockets is
| bad, as it effectively regresses to circuit-switched networking
| with undefined QoS guarantees / failure types and crap tooling.
| Hopefully we've already passed 'peak websocket'.
| toast0 wrote:
| > Most packets are dropped "elsewhere".
|
| This is probably true, but elsewhere packet loss is often
| diffuse. If you're looking into packet loss, it's likely
| because something is overloaded on your machine resulting in a
| burst of loss that affects user experience or your stats.
| Surviving is great, but eliminating bottlenecks you can control
| is better.
| jeffbee wrote:
| I don't take the statement that packets are dropped elsewhere
| as obviously true. Your own box will drop due to queue
| discipline buffer sizes, NIC queue pruning, and a hundred
| other reasons.
| predakanga wrote:
| I recently came across another useful utility for debugging
| unexpected packet drops - PWRU[0] (Packet, Where Are You) by
| Cilium.
|
| It uses eBPF to try to trace the path of the packet through the
| kernel. Haven't needed to use it yet, but it could have saved me
| a lot of trouble in the past.
|
| [0]: https://github.com/cilium/pwru
| tptacek wrote:
| drop_mon (or whatever it's called) is one of the weirder things
| in the Linux kernel. It has only one implementation I've found,
| "dropwatch", which is, to put it gently, not a great example of a
| modern C CLI program. For instance, the kernel subsystem gives
| you snapshots of the packet contents themselves, and there is
| already a very flexible and easy-to-use library for filtering
| packets based on their contents, with an enormous ecosystem, but
| all dropwatch will do is print dumps.
|
| I threw together a half-assed POC alternative implementation in
| Go a couple months ago, using Matt Layher's fantastic netlink
| libraries:
|
| https://github.com/superfly/dropspy
|
| I have the impression that the drop_mon stuff isn't taken super
| seriously by anyone, but it's incredibly useful when you're
| debugging complicated networking stuff.
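|
| For reference, a hedged sketch (in C with libnl, not how dropspy
| does it) of how a client talks to the "NET_DM" generic-netlink
| family that the drop monitor registers; the genl version number
| here is an assumption, error handling and the alert-receive loop
| are omitted:
|
|     #include <netlink/netlink.h>
|     #include <netlink/genl/genl.h>
|     #include <netlink/genl/ctrl.h>
|     #include <linux/net_dropmon.h>
|
|     int main(void)
|     {
|             struct nl_sock *sk = nl_socket_alloc();
|             int family;
|
|             genl_connect(sk);
|             /* drop_monitor registers the "NET_DM" genl family */
|             family = genl_ctrl_resolve(sk, "NET_DM");
|
|             /* Ask the kernel to start emitting drop alerts. */
|             genl_send_simple(sk, family, NET_DM_CMD_START, 1,
|                              NLM_F_REQUEST | NLM_F_ACK);
|             nl_wait_for_ack(sk);
|
|             /* ... read NET_DM_CMD_ALERT messages here ... */
|
|             genl_send_simple(sk, family, NET_DM_CMD_STOP, 1,
|                              NLM_F_REQUEST | NLM_F_ACK);
|             nl_wait_for_ack(sk);
|             nl_socket_free(sk);
|             return 0;
|     }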
| suprjami wrote:
| There's a dropwatch script in SystemTap which produces output
| similar to the dropwatch program but has been significantly more
| useful for our purposes. Adding a backtrace to that can be
| useful. Beyond that, we usually instrument the kernel to either
| printk when drop decisions are made, or add fields to the skb,
| set those fields as packets pass through different parts of the
| stack, and finally add code that reacts to those fields. That's
| a lot quicker and more efficient than looking at every packet.
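|
| A rough sketch of the skb-field variant (hypothetical debugging-
| only code, not an upstream patch; it borrows skb->mark as the
| scratch field, which assumes nothing else on the box is using it):
|
|     #include <linux/skbuff.h>
|     #include <linux/netdevice.h>
|     #include <linux/net.h>
|     #include <linux/printk.h>
|
|     /* Call wherever a drop decision is made, with a reason code. */
|     static inline void dbg_tag_drop(struct sk_buff *skb, u32 reason)
|     {
|             skb->mark = reason;
|     }
|
|     /* Call where the skb is finally freed; rate-limited so a flood
|      * of drops doesn't flood the log. */
|     static inline void dbg_report_drop(const struct sk_buff *skb)
|     {
|             if (skb->mark && net_ratelimit())
|                     printk(KERN_DEBUG "drop reason %u on %s\n",
|                            skb->mark,
|                            skb->dev ? skb->dev->name : "(no dev)");
|     }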
| spockz wrote:
| Is there any way to get statistics about the number of packet
| drops, retransmits, etc. from the application level?
|
| We are running JVM (Netty client and Tomcat server) applications
| in K8s and are experiencing p99 delays for some requests. Server-
| side, application metrics have a p99 of 8ms, but clients
| experience a p99 of 2000ms. In the same interval other requests
| continue happily. I suspect either scheduling somehow, or
| something in the network.
|
| Any suggestions on how to detect where this happens? (Putting
| Wireshark in between would be a last resort.)
| jeffbee wrote:
| Better than Wireshark would be a BPF program. For established
| sockets you can get retransmission stats out of the kernel with
| the `ss` tool, but that leaves you no visibility into transient
| sockets.
|
| Switching to QUIC or another user-space protocol would give you
| the best observability.
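|
| If you can get at the socket itself (easier from native code than
| from the JVM), TCP_INFO exposes the same per-connection counters
| that `ss -i` prints. A minimal sketch in C, assuming an already-
| connected fd:
|
|     #include <stdio.h>
|     #include <string.h>
|     #include <sys/socket.h>
|     #include <netinet/in.h>
|     #include <netinet/tcp.h>   /* TCP_INFO, struct tcp_info */
|
|     static void print_tcp_stats(int fd)
|     {
|             struct tcp_info ti;
|             socklen_t len = sizeof(ti);
|
|             memset(&ti, 0, sizeof(ti));
|             if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
|                     printf("rtt=%uus retrans=%u total_retrans=%u lost=%u\n",
|                            ti.tcpi_rtt, ti.tcpi_retrans,
|                            ti.tcpi_total_retrans, ti.tcpi_lost);
|     }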
| xxpor wrote:
| If it's almost exactly 2000 ms, it sounds like a retransmission
| timer, i.e. packet loss.
| spockz wrote:
| Thanks. I'll look into that. It actually appears to be
| multiples of 2000 ms plus some additional processing time.
| jeffbee wrote:
| More specifically, it's probably a lost SYN. 2s is too long
| to be a single retransmission on an established flow.
| namibj wrote:
| If it's using kernel TCP, switch to a TCP congestion control that
| uses tail loss probes. If you have enough control and the network
| you control supports it, consider L4S TCP technology, or at least
| normal (old-style) ECN-based congestion control.
| toast0 wrote:
| netstat -s gives protocol statistics, some of which might be
| useful. You might also have Ethernet interface statistics
| somewhere, in case the interface is dropping packets (on FreeBSD,
| interface stats for nice drivers will be in sysctl dev.X, where X
| is the driver name; some drivers have better data than others,
| but I haven't debugged Linux issues enough to find the same data
| there).
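|
| (On Linux the per-interface counters live under
| /sys/class/net/<iface>/statistics/. A quick sketch that reads a
| few of them; "eth0" is just an example device name:)
|
|     #include <stdio.h>
|
|     static unsigned long long read_counter(const char *ifname,
|                                            const char *stat)
|     {
|             char path[256];
|             unsigned long long v = 0;
|             FILE *f;
|
|             snprintf(path, sizeof(path),
|                      "/sys/class/net/%s/statistics/%s", ifname, stat);
|             f = fopen(path, "r");
|             if (f) {
|                     if (fscanf(f, "%llu", &v) != 1)
|                             v = 0;
|                     fclose(f);
|             }
|             return v;
|     }
|
|     int main(void)
|     {
|             printf("rx_dropped=%llu rx_errors=%llu tx_dropped=%llu\n",
|                    read_counter("eth0", "rx_dropped"),
|                    read_counter("eth0", "rx_errors"),
|                    read_counter("eth0", "tx_dropped"));
|             return 0;
|     }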
|
| Either way, tcpdump on the host or the client (sometimes you
| need dumps from both) should tell the tale. You probably don't
| need or want Wireshark in between the peers; a capture from
| either end can be loaded at your leisure. If you suspect a non-
| application issue, you can do -s 64 or so and not capture most
| of the application data, but still have all the TCP headers.
| pcaps aren't always exactly the truth about what's on the wire,
| but they're pretty close.
|
| Personally, I get a lot of mileage out of Wireshark, so every
| problem report looks like a pcap. If the server is really busy,
| the pcaps get too big, and you have to do things like capture in
| short bursts (and hope you see the issue in a burst) or sample.
| But if your server isn't very busy, you can pcap all the packets,
| ask for problem reports to include the client IP and port number,
| and then find the issue in the big pcap a lot more easily.
| tptacek wrote:
| Wireshark, tcpdump, and pcap tools in general are probably
| way overkill for basic network statistics issues.
| toast0 wrote:
| Overkill is the best kind of kill! When client and server
| disagree on p99 times by a lot, and there are enough reports to
| actually look into it, there's probably something interesting
| going on, and a pcap will tell you what it is. And while you're
| looking at pcaps you'll almost always see at least a few other
| things that might be worth fixing, some of which might actually
| be fixable.
|
| Sometimes you even get lucky and can see what traffic was
| going on before the mysterious gap in packets, and make a
| good guess at what's blocking your rx queues or whatever.
|
| Stats are nice too, of course.
| tptacek wrote:
| Yes, there are a bunch of different network statistics, which
| you can scrape, for instance, out of /proc (see the "net"
| subdirectory and, in particular, "net/snmp"; it's trivially
| parseable).
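|
| A quick sketch of what "trivially parseable" means here:
| /proc/net/snmp is laid out as pairs of lines, a header row of
| counter names followed by a row of values for each protocol
| ("Ip:", "Tcp:", "Udp:", ...). The counter names below are real,
| but which ones matter is workload-specific:
|
|     #include <stdio.h>
|     #include <stdlib.h>
|     #include <string.h>
|
|     /* Return one counter (e.g. "RetransSegs" from the "Tcp:" rows),
|      * or -1 if not found. */
|     static long long snmp_counter(const char *proto, const char *name)
|     {
|             char hdr[1024], val[1024];
|             long long result = -1;
|             FILE *f = fopen("/proc/net/snmp", "r");
|
|             if (!f)
|                     return -1;
|             while (fgets(hdr, sizeof(hdr), f) &&
|                    fgets(val, sizeof(val), f)) {
|                     char *hs, *vs, *h, *v;
|
|                     if (strncmp(hdr, proto, strlen(proto)) != 0)
|                             continue;
|                     h = strtok_r(hdr, " \n", &hs);
|                     v = strtok_r(val, " \n", &vs);
|                     while (h && v) {
|                             if (strcmp(h, name) == 0) {
|                                     result = atoll(v);
|                                     break;
|                             }
|                             h = strtok_r(NULL, " \n", &hs);
|                             v = strtok_r(NULL, " \n", &vs);
|                     }
|                     if (result >= 0)
|                             break;
|             }
|             fclose(f);
|             return result;
|     }
|
|     int main(void)
|     {
|             printf("Tcp RetransSegs = %lld\n",
|                    snmp_counter("Tcp:", "RetransSegs"));
|             printf("Tcp OutSegs     = %lld\n",
|                    snmp_counter("Tcp:", "OutSegs"));
|             return 0;
|     }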
|
| A more ordinary way to come at this would be to hook up a
| Prometheus node_exporter and just look at this stuff in
| Grafana.
___________________________________________________________________
(page generated 2022-03-03 23:00 UTC)