[HN Gopher] Tracking NFS problems down to the SFP level
___________________________________________________________________
Tracking NFS problems down to the SFP level
Author : CaliforniaKarl
Score : 30 points
Date : 2021-02-05 20:17 UTC (1 day ago)
(HTM) web link (news.sherlock.stanford.edu)
(TXT) w3m dump (news.sherlock.stanford.edu)
| lykr0n wrote:
| This seems like an issue that could have been resolved a lot
| quicker if they had been doing network monitoring on both the
| host side and the switch side.
|
| Ideally you would be able to spot a large number of link errors
| on a port/switch/host and fix it before it becomes a problem.
| toast0 wrote:
| I no longer have access to it, but I wrote a tool to find these
| types of problems at my last job. It didn't seem generally
| applicable enough to try to get it open sourced (and I didn't
| want to polish it enough for that either).
|
| The key insight is that LACP is almost always configured to use
| a hash of { Source IP, Dest IP, Protocol, Source Port, Dest
| Port }, so that packets from each TCP and UDP flow will always
| be sent on the same individual link. (This is directional,
| though, so it may go from peer A to peer B on cable X and from
| B to A on cable Y.)
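|
| A minimal sketch of that hashing in Python (real switches use
| vendor-specific hash functions, often an XOR or CRC over these
| fields; this only illustrates that a fixed 5-tuple always lands
| on the same member link):
|
|     # Illustrative only: maps a 5-tuple to one of n_links.
|     import hashlib
|
|     def lag_member(src_ip, dst_ip, proto, sport, dport,
|                    n_links):
|         key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}"
|         digest = hashlib.sha256(key.encode()).digest()
|         return int.from_bytes(digest[:4], "big") % n_links
|
|     # The same flow hashes to the same link every time.
|     print(lag_member("10.0.0.1", "10.0.0.2", "udp", 40000,
|                      2049, n_links=4))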
|
| So the way to confirm a broken link is to connect a bunch of UDP
| flows (on different ports) between peer A and peer B, send data,
| and measure loss (and/or delay!). If you see zero loss and
| uniform delay, then either none of your flows cross the broken
| link, the problem isn't a broken link, or the error rate is too
| low to detect. Once you've found a broken flow, you can use a
| 'paris traceroute' tool to confirm which IP routers it runs
| between. Paris
| traceroute holds the UDP source and destination ports fixed, so
| the route on LACP should stay the same. I contributed support for
| this in mtr, but I'm not sure if it still works; if you see 100%
| packet loss with mtr in fixed udp port mode, send me an email
| (address in profile) and I'll ask you for data and try to debug.
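|
| A rough sketch of that probe, assuming the far end runs a plain
| UDP echo service on port 9000 (the address and port here are
| made-up placeholders):
|
|     # Spray UDP flows from many source ports and count the
|     # replies; a flow with heavy loss likely hashed onto the
|     # bad member link.
|     import socket
|
|     def probe(dst, dport=9000, sports=range(40000, 40064),
|               per_flow=50, timeout=0.2):
|         results = {}
|         for sport in sports:
|             s = socket.socket(socket.AF_INET,
|                               socket.SOCK_DGRAM)
|             s.bind(("", sport))
|             s.settimeout(timeout)
|             got = 0
|             for i in range(per_flow):
|                 s.sendto(str(i).encode(), (dst, dport))
|                 try:
|                     s.recvfrom(64)
|                     got += 1
|                 except socket.timeout:
|                     pass
|             results[sport] = 1 - got / per_flow
|             s.close()
|         return results  # source port -> loss fraction
|
|     for sport, loss in probe("192.0.2.10").items():
|         if loss > 0.01:
|             print(f"sport {sport}: {loss:.0%} loss")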
|
| Once you narrow down to the two routers the link is between, it
| should be easy enough to confirm: usually through link quality
| counters and, if not, by just pulling links and seeing whether
| things get better or worse.
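|
| For the counter check on a Linux host, one quick way is to read
| the per-interface statistics out of sysfs (a sketch; switch-side
| counters need the vendor CLI or SNMP instead):
|
|     # Print the non-zero error/drop counters for an interface.
|     import pathlib
|
|     def error_counters(iface):
|         base = pathlib.Path("/sys/class/net") / iface
|         out = {}
|         for f in (base / "statistics").iterdir():
|             if "err" in f.name or "drop" in f.name:
|                 value = int(f.read_text())
|                 if value:
|                     out[f.name] = value
|         return out
|
|     print(error_counters("eth0"))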
|
| If you have a long network path, and most links are LACP, the
| total number of possible paths between two peers gets large, and
| there's a chance that you might not be able to survey them all
| from a pair of servers; so you may have to try a few different
| hosts.
|
| You can find not only packet loss but also congestion/buffering
| this way. In an ideal world, all the link error counters would
| be collected and anomalies would be addressed, but it seems that
| doesn't always happen.
| pwarner wrote:
| Yeah, we had major problems with a top-2 cloud provider where
| our on-prem-to-cloud link dropped packets. We narrowed it down
| with iperf to the packet loss only happening on some ephemeral
| source ports: a given port was either always ok or always slow.
| We destroyed and recreated the cloud gateways and all was well.
| I should say another engineer figured it out. The cloud provider
| tried to blame our side. They did not excel at operations...
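|
| That kind of source-port sweep can be scripted; a rough sketch
| using iperf3's --cport option to pin the client port (the server
| address, port range, and JSON field here are assumptions to
| check against your iperf3 version):
|
|     # Run a short UDP test from each source port and report
|     # the loss iperf3 measures for that flow.
|     import json
|     import subprocess
|
|     def loss_for_sport(server, sport):
|         out = subprocess.run(
|             ["iperf3", "-c", server, "-u", "-t", "2",
|              "--cport", str(sport), "--json"],
|             capture_output=True, text=True, check=True)
|         report = json.loads(out.stdout)
|         return report["end"]["sum"]["lost_percent"]
|
|     for sport in range(50000, 50032):
|         print(sport, loss_for_sport("203.0.113.5", sport))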
| azinman2 wrote:
| Given enough of these fault-analysis articles, I wonder if it's
| possible to compile them into some kind of decision-tree-like
| interface where you can describe your problem and have it walk
| you through possible failure scenarios (plus diagnostic steps).
| It would be cool to collect all of this knowledge beyond Google,
| as this is the type of stuff where Google often breaks down.
| toast0 wrote:
| 1) run tcpdump on both sides and compare
|
| 2) If both sides have the same tcpdump, it's not a network
| problem. Find the software problem. truss or strace can help
|
| 3) If the sides differ, figure out whether the network is
| broken or the os/network card is lying.
|
| 4) If the network is broken, fix it ;)
|
| 5) If the os/network card is lying, turn off the lying (mostly
| offloads, like segmentation, large receive, and checksum) and
| go back to step 1 (sketched after this list)
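|
| A minimal sketch of steps 1 and 5 on a Linux host (the interface
| name and peer address are placeholders; comparing the two
| capture files afterwards is a manual or scripted diff):
|
|     # Capture traffic to/from the peer (run the same capture
|     # on the other side), and disable the offloads that make
|     # tcpdump "lie" about what is actually on the wire.
|     import subprocess
|
|     def disable_offloads(iface):
|         # gro/gso/tso are segmentation offloads; rx/tx are
|         # checksum offloads.
|         subprocess.run(
|             ["ethtool", "-K", iface, "gro", "off",
|              "gso", "off", "tso", "off",
|              "rx", "off", "tx", "off"], check=True)
|
|     def capture(iface, peer, out_file, count=1000):
|         subprocess.run(
|             ["tcpdump", "-i", iface, "-c", str(count),
|              "-w", out_file, "host", peer], check=True)
|
|     disable_offloads("eth0")
|     capture("eth0", "192.0.2.10", "side_a.pcap")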
|
| This is basically a common debugging pattern: I'm not getting
| the results I expect, so find a way to observe when/where the
| data in flight changes from what I expect to something else,
| being as explicit as needed along the way about what data is
| expected. The closer you can narrow down where the failure
| occurs, the more likely you are to find it yourself, or to find
| the person responsible for fixing it, who you can hand your
| investigation results to.
| physicsgraph wrote:
| I agree with your suggestion conceptually, but the starting
| point of a decision tree strongly depends on many factors:
| which version of the software is being used, how the software
| was compiled, running on what OS, with what patches, on what
| hardware, under what environmental conditions, in support of
| what application usage patterns, and under what load, etc. Merely
| capturing the (potentially) relevant input conditions becomes
| challenging, never mind the process of eliminating irrelevant
| variables. And that's all premised on the concept that a
| problem is recurring (rather than some fluke that no one else
| has encountered).
|
| I think that's why Stack Overflow websites focused on shallow
| conditions flourish -- the deep dives are usually specific to a
| given situation.
| sneak wrote:
| My meta-fix for such a thing would be to hack up a script that
| puts these interface error counters into Prometheus, and then to
| alert on a spike above some threshold.
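|
| A bare-bones version of that script, assuming the
| prometheus_client library (node_exporter already exposes these
| counters as node_network_*_total, so in practice an alert rule
| on rate() over those may be all you need):
|
|     # Expose per-interface error/drop counters on :9101 for
|     # Prometheus to scrape; alert on spikes in rate().
|     import pathlib
|     import time
|     from prometheus_client import Gauge, start_http_server
|
|     errors = Gauge("net_iface_error_count",
|                    "Interface error/drop counters",
|                    ["iface", "counter"])
|
|     def collect():
|         for iface in pathlib.Path("/sys/class/net").iterdir():
|             stats = iface / "statistics"
|             if not stats.is_dir():
|                 continue
|             for f in stats.iterdir():
|                 if "err" in f.name or "drop" in f.name:
|                     errors.labels(iface.name, f.name).set(
|                         int(f.read_text()))
|
|     start_http_server(9101)
|     while True:
|         collect()
|         time.sleep(15)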
___________________________________________________________________
(page generated 2021-02-06 23:00 UTC)