[HN Gopher] WAN router IP address change blamed for global Micro...
___________________________________________________________________
WAN router IP address change blamed for global Microsoft 365 outage
Author : mikece
Score : 79 points
Date : 2023-01-30 13:53 UTC (9 hours ago)
(HTM) web link (www.theregister.com)
(TXT) w3m dump (www.theregister.com)
| kloch wrote:
| > As part of a planned change to update the IP address on a WAN
| router, a command given to the router caused it to send messages
| to all other routers in the WAN, which resulted in all of them
| recomputing their adjacency and forwarding tables. During this
| re-computation process, the routers were unable to correctly
| forward packets traversing them. "The command that caused the
| issue has different behaviors on different network devices, and
| the command had not been vetted using our full qualification
| process on the router on which it was executed."
|
| From this it sounds like they might have changed the primary
| loopback IP, which by default is the "router-id" for various
| routing protocols, causing the entire network to have to
| reconverge. You can override the default router-id with an
| explicit address that does not depend on lo0 but lots of networks
| don't do that.
|
| It's extremely uncommon to change the primary loopback address.
| It's less uncommon to add an additional one but as the article
| says that syntax varies by vendor: Juniper will add as additional
| by default, Cisco and Arista will replace the existing primary
| one (IPv4) unless you include the "secondary" keyword...
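|
| A toy sketch of the distinction described above, assuming a
| router-id that falls back to the primary loopback address unless
| one is set explicitly. The class and function names (Loopback,
| set_address, add_address, router_id) are hypothetical and do not
| model any vendor's actual CLI or implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Loopback:
    """Toy loopback interface: the first address is the primary."""
    addresses: List[str] = field(default_factory=list)

    def set_address(self, addr: str, secondary: bool = False) -> None:
        # "Replace" semantics: without secondary=True the primary is overwritten.
        if secondary or not self.addresses:
            self.addresses.append(addr)
        else:
            self.addresses[0] = addr

    def add_address(self, addr: str) -> None:
        # "Add" semantics: existing addresses are never touched.
        self.addresses.append(addr)


def router_id(lo: Loopback, explicit: Optional[str] = None) -> Optional[str]:
    # An explicitly configured router-id wins; otherwise fall back
    # to the primary loopback address (None if there is none).
    if explicit is not None:
        return explicit
    return lo.addresses[0] if lo.addresses else None


replace_style = Loopback(["192.0.2.1"])
replace_style.set_address("192.0.2.2")      # primary silently replaced
add_style = Loopback(["192.0.2.1"])
add_style.add_address("192.0.2.2")          # primary preserved
print(router_id(replace_style), router_id(add_style))
```

| With the same intent ("give lo0 a second address"), only the
| replace-style call changes the derived router-id; passing an
| explicit value to router_id() makes both cases stable.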
| AdamJacobMuller wrote:
| I feel like they intended to /ADD/ a new loopback IP and in the
| process accidentally removed the existing one and replaced it,
| because I think anyone intentionally changing the loopback IP
| knows it's going to reset all BGP sessions. I think more modern
| Cisco/Arista platforms now add the address as "secondary" by
| default and perhaps that is what bit them?
| kemals wrote:
| This was a rather interesting event. In general, changing the
| IP address (even the loopback address) shouldn't have caused
| this from the BGP perspective. For example, if you were to
| change the IP address of a BGP-enabled router that has multiple
| BGP sessions, all other routers would tear down their sessions
| to it and withdraw the prefixes. BGP reconvergence events take
| time, but less than this one took (90+ minutes and then a few
| more hours until __full__ recovery).
|
| This seems like one of the events in which they changed the IP
| on Route Reflector routers that were pretty busy, which would
| cause reconvergence and CPU spikes for all routers that had
| sessions with them. Also, there was a lot of volatility, as
| part of which re-advertisements were happening continuously.
| They also attempted a rollback, which caused the reverse
| operation and triggered reconvergence again. The other scenario
| is doing this change on the SDN controller, which affected all
| other routers.
|
| More details: https://www.thousandeyes.com/blog/microsoft-
| outage-analysis-... https://www.thousandeyes.com/resources/na-
| microsoft-outage-a...
| nzgrover wrote:
| It's not DNS
|
| There's no way it's DNS
|
| It was DNS
|
| Credit: https://www.cyberciti.biz/humour/a-haiku-about-dns/
| jimmyl02 wrote:
| It seems that in modern large scale systems networking continues
| to be one of the few things where a seemingly small and
| inconsequential change can cause entire cloud providers and
| highly redundant systems to go down. It makes sense as networking
| is the fabric connecting all systems together but each time an
| incident like this occurs I'm reminded of just how important
| networking is.
|
| Network engineers and the people handling network ops always
| amaze me.
| iso1631 wrote:
| IME Network engineers put too much faith in vendors. They think
| "the vendor says this is a resilient virtual chassis so it
| can't break", rather than thinking "ok, if this breaks what
| happens"
|
| A crash affecting both sides of a "resilient" virtual chassis I
| had to work with took off a major broadcast last year (it was a
| last minute favour I was doing, and I rerouted to a tertiary
| route in a couple of minutes).
|
| Meanwhile I ran a rather large event going out to some hundred
| million listeners via two crappy PS300 switches which were
| completely independent of each other, into two independent
| routers, running via two separate systems (one on a UPS, one on
| mains). If one of them broke the other one was completely
| independent and the broadcast would have continued just fine.
|
| As far as I am concerned, that is far better than a virtual
| chassis.
| ccakes wrote:
| This may be true of enterprise network engineers but I've
| worked across a lot of very large networks (telco, not cloud)
| and we never _ever_ trust the vendor.
|
| The kind of bugs that I've read about in errata notes over
| the years is wild and truly unpredictable.
| Spooky23 wrote:
| Enterprise is definitely different - network guys need
| multiple customers to develop the vendor skepticism. I used
| to get into brutal internal fights with network directors
| over whatever bullshit the Cisco salesman said offhand that
| was treated as though it was delivered by Moses off the
| mountain. One guy tried to get me fired because I offended
| an SE. lol.
|
| I worked on systems and platforms at the time, and we were
| more cynical even about vendors we liked.
| jacquesm wrote:
| It wouldn't be the first time that your redundant vendors
| end up sharing a conduit for a bunch of fiber somewhere.
| Guess where that backhoe will start digging?
| oarsinsync wrote:
| Redundant vendors in the GP's context referred to using
| multiple router vendors, eg Cisco and Juniper.
|
| Using multiple connectivity vendors doesn't guarantee
| path diversity. Demanding fibre maps, ensuring that your
| connectivity has separate points of entry into the
| building and doesn't cross outside the building, and
| validating with your DC provider that your cross connects
| aren't crossing either is what guarantees path diversity /
| redundancy.
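|
| A minimal sketch of the fibre-map check described above, assuming
| each circuit can be written down as a list of named physical
| segments (building entries, ducts, PoPs). The function name and
| all segment names are made up for illustration.

```python
from typing import Dict, List, Set


def shared_segments(paths: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map every segment used by more than one circuit to those circuits."""
    users: Dict[str, Set[str]] = {}
    for circuit, segments in paths.items():
        for segment in set(segments):
            users.setdefault(segment, set()).add(circuit)
    # Keep only segments that are a shared risk between circuits.
    return {seg: circuits for seg, circuits in users.items() if len(circuits) > 1}


# Hypothetical fibre maps for two "redundant" carriers.
fibre_maps = {
    "carrier-a": ["entry-north", "duct-17", "pop-1"],
    "carrier-b": ["entry-south", "duct-17", "pop-2"],
}
for segment, circuits in shared_segments(fibre_maps).items():
    print(f"shared risk: {segment} carries {sorted(circuits)}")
```

| An empty result only means the maps you were given show no
| overlap; it says nothing about segments missing from the maps.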
| jacquesm wrote:
| The GP was clearly talking about whole networks, not just
| the hardware vendors. If I read that differently than the
| GP intended, I'll wait for their correction.
|
| One of the problems that I've seen in practice is that, with
| the degree of virtualization at play, it has at the same
| time become much easier to be guaranteed 100% independence
| _in principle_ and much harder to verify _in practice_ that
| this is the case, because of all of the abstraction layers
| underneath the topology. One of my customers specializes in
| software
| that allows one to make such guarantees and this is a
| non-trivial problem, to put it mildly, especially when
| the situation becomes more dynamic due to outages from
| various causes.
| jacquesm wrote:
| The network _is_ a single point of failure, even if the network
| itself is redundant!
| wmf wrote:
| One possible way to fix that is to replace _the_ network with
| multiple independent networks. It's really expensive though.
| jacquesm wrote:
| Yes, exactly. Most really mission critical places do
| exactly that.
|
| The first time I saw something like that put into practice
| was when an experiment in the oil and gas industry that was
| scheduled to run for years delivered their network design.
| On the runtime cost of the experiment the extra network
| wasn't a big deal, but a service interruption would have
| been and would have caused them to have to restart the
| whole thing from scratch. It's more than a decade ago and I
| forgot what the exact context was but the whole thing was
| fascinating from a redundancy perspective as well as the
| degree of thinking that had gone into the risk assessment.
| Those guys _really_ knew their business. Also the amount of
| data that experiment was expected to generate was off the
| scale. Multiple petabytes, which at the time (a decade ago
| or so) was a non trivial amount of data.
| bogomipz wrote:
| This doesn't really make sense. The modern WAN operates on
| multiple independent networks - SD-WANs, multiple transit
| providers, fiber-ring MPLS, EVPN etc. If you propagate a
| bad network change throughout your autonomous system or
| backbone you can still have an outage on your hands.
| jgrahamc wrote:
| Having, uh, had bad things happen with router configuration I
| feel for them.
|
| https://blog.cloudflare.com/cloudflare-outage-on-july-17-202...
| newah1 wrote:
| I remember this happening. The 20 some sites we ran went down
| as they were supported by cloudflare. I spent a panicked 30
| minutes trying to figure out what I had done wrong, to
| eventually find out it was on CF's end.
|
| I remember voicing at our team meeting "boy, they must be
| panicking at CloudFlare."
|
| Cloudflare works so spectacularly we just wrote it off as a one
| time thing.
| jgrahamc wrote:
| There was no panic but there was a lot of VUF (Very Urgent
| Focus)!
| alex-mohr wrote:
| It does seem like network configuration remains rather manual
| compared to other large scale systems that include more
| automation.
|
| In Microsoft's case, the remediation is not to put in place
| higher level systems to safely accomplish the goal of the
| command. Instead:
|
| - "We have blocked highly impactful commands from getting
| executed on the devices (Completed)"
|
| - "We will require all command execution on the devices to
| follow safe change guidelines (Estimated completion: February
| 2023)"
|
| Requiring commands to follow guidelines sounds suspiciously
| like they're requiring network ops not to break things.
| iso1631 wrote:
| That's the norm in network ops. Automated testing is pretty
| much impossible, easy rollback may be possible depending on
| exactly what was screwed, but not always.
|
| Take this for example, looks like the problem was an
| unplanned recalculation of routing tables. That's not going
| to be the case on a small scale test network, and rolling
| back won't help, indeed in this case it likely would cause
| more problems.
| jsz0 wrote:
| One of the reasons I got out of network engineering was how
| frequently the work I was required to do would cause
| unintended consequences. You can do all your due diligence,
| get your work blessed by vendor support, and still get
| blown up by a bug or undocumented behaviors on a regular
| basis. The conspiratorial part of my brain says these
| network device makers intentionally provide unreliable
| software and terrible documentation to bolster their
| support contract profits. I was just the guy typing in the
| commands and getting all the blame.
| candiddevmike wrote:
| Networking and storage changes are always butt clenching
| affairs. Way more stressful than anything else in IT due
| to their blast radius if something shits the bed.
| spookthesunset wrote:
| I remember the first time I got access to an employer's
| production Cisco router. It's pretty scary how easy it is
| to majorly fuck something up.
|
| There isn't a concept of a transaction or a rollback. You
| just enter a command, press enter and it's live.
|
| To counter this we'd write all the commands we planned on
| executing and peer review it. Nothing was to be done "on
| the fly" (at least in theory)
|
| In short, coming from a developer perspective with ample
| version controls and gated releases... networking is a
| very wild ride.
| simoncion wrote:
| > There isn't a concept of a transaction or a rollback.
|
| Yeah, Cisco gear is bonkers.
|
| Mikrotik has "Safe Mode", which undoes all commands since
| you entered "Safe Mode" if the connection that created
| the shell gets interrupted. It has saved my bacon on
| several occasions, but there are several obvious
| situations in which you can get yourself locked out.
|
| Juniper gear has "commit confirmed $NUMBER_OF_MINUTES",
| which will roll back everything since your last commit if
| you don't do a "commit" within $NUMBER_OF_MINUTES. It
| will also apply all of the changes you've staged all at
| once (and do configuration sanity checking before it
| performs the commit).
|
| I have no idea how Juniper's rollback works when
| multiple users are doing simultaneous config editing...
| maybe don't do that?
| atxbcp wrote:
| That's not entirely true, you can rollback a change on
| modern switches/routers, either via a rollback command,
| or with a revert timer (configure terminal revert timer
| X) (because the new configuration might have made the
| router unreachable, so you're never sure you'll be able
| to rollback manually if you're working remotely).
| bogomipz wrote:
| >"There isn't a concept of a transaction or a rollback.
| You just enter a command, press enter and it's live."
|
| This hasn't been true for a very long time. Juniper
| routers have rollbacks, commits and revisions:
|
| https://www.juniper.net/documentation/us/en/software/juno
| s/c...
|
| and
|
| https://www.juniper.net/documentation/us/en/software/juno
| s/c...
|
| Cisco has similar:
|
| https://www.cisco.com/c/en/us/td/docs/ios/ios_xe/fundamen
| tal...
| ccakes wrote:
| Modern router operating systems have this.
|
| It's been a long time since I've touched IOS-XE (Cisco
| enterprise gear) but Cisco IOS-XR, Junos, Arista EOS and
| the Nokia SRs all support some combination of
| configuration transactions with rollback and commit
| confirm on a timer
|
| This definitely doesn't stop you shooting yourself in the
| foot, similar to how you can still push broken config to
| a k8s controller, but it's some level of protection for
| certain types of changes.
| meltyness wrote:
| Interesting. There's also some stuff in Cisco that can't
| be done both atomically and remotely, so you may have to
| push a change as a file to the router and then source the
| file into the running config with some permutation of
| `copy`.
| meltyness wrote:
| Hadn't thought about it from the perspective of support
| contract profits, but they also have their friendship
| stick firmly planted in technicians via the semi-required
| training since as you indicate the manuals are deficient.
|
| At some point network vendors switched manuals from
| engineers documenting features whitebox to educated techs
| documenting features blackbox.
|
| There's a clear transition for docs produced after 2008,
| prior to which more care went into tech notes and
| interpreting technologies -- after you're lucky to even
| get a complete set of steps and caveats without having to
| cross-reference bugs, release notes, old-manuals, new-
| manuals, draft manuals, reference manuals, licensing
| manuals, the inevitable errors that appear in the logs,
| and of course the configuration guide where this should
| all be in the first place.
|
| In short, yes, this.
| Cyph0n wrote:
| > The conspiratorial part of my brain says these network
| device makers intentionally provide unreliable software
| and terrible documentation to bolster their support
| contract profits.
|
| As a dev who has worked at one of the major networking
| vendors, I can assure you that is not the case. You'd
| be surprised by how major bugs are handled internally,
| especially if the bug affects "important" customers.
| throw0101a wrote:
| > _That's the norm in network ops. Automated testing is
| pretty much impossible, easy rollback may be possible
| depending on exactly what was screwed, but not always._
|
| Ansible/Napalm is a thing in NetOps in some places. Some
| folks use Eve-ng / GNS3 to spin up virtual networks to test
| config changes, and it may be possible to do CI/CD changes
| if you track things in a repo.
|
| Juniper JunOS has auto-rollback if you don't confirm the
| change after "x" minutes:
|
| * https://www.juniper.net/documentation/us/en/software/juno
| s/c...
|
| So if you did something that causes breakage and
| disconnection from the router, you (ideally) don't have to
| do anything but wait it out.
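|
| A rough sketch of the confirm-or-roll-back idea, assuming a
| config that is just a dictionary and a clock that can be
| injected for testing. ConfirmedCommitStore and its methods are
| hypothetical names; this is not how JunOS implements it.

```python
import time
from typing import Callable, Dict, Optional


class ConfirmedCommitStore:
    """Toy config store: a commit reverts unless confirmed before a deadline."""

    def __init__(self, config: Dict[str, str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.active: Dict[str, str] = dict(config)
        self._previous: Optional[Dict[str, str]] = None
        self._deadline: Optional[float] = None
        self._clock = clock

    def commit_confirmed(self, candidate: Dict[str, str], minutes: float) -> None:
        self.tick()  # settle any earlier commit that already expired
        if self._previous is None:
            self._previous = dict(self.active)  # snapshot to roll back to
        self.active = dict(candidate)
        self._deadline = self._clock() + minutes * 60

    def confirm(self) -> bool:
        """Make the pending commit permanent; False if nothing was pending."""
        self.tick()
        pending = self._deadline is not None
        self._previous = None
        self._deadline = None
        return pending

    def tick(self) -> None:
        # Roll back automatically once the deadline passes unconfirmed.
        if self._deadline is not None and self._clock() >= self._deadline:
            self.active = self._previous or {}
            self._previous = None
            self._deadline = None


now = [0.0]  # fake clock so the example runs instantly
store = ConfirmedCommitStore({"lo0": "192.0.2.1"}, clock=lambda: now[0])
store.commit_confirmed({"lo0": "192.0.2.2"}, minutes=5)
now[0] = 301.0   # operator got locked out and never confirmed
store.tick()
print(store.active)
```

| The point of the design is that the safe outcome needs no action:
| losing the session is the same as declining to confirm.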
| AdamJacobMuller wrote:
| commit confirmed is such a life-saver. I ran a production
| network which spanned multiple continents and even though
| I probably only ever actually needed commit confirmed a
| single digit number of times, the fact that it was there
| made every change I did 99% less stressful. I knew that
| even if I made a mistake, all I had to do was wait 5-10
| minutes and it would all revert.
|
| Compare this to my cisco/foundry/other experience where I
| would delay changes until I was in the office (physically
| colocated with main routers) or calling people to be
| onsite for what was 99% of the time an innocuous change.
| The stress of it led to me deferring changes or just
| skipping them entirely which led to more
| issues/stress/etc.
|
| I'm really not sure there is a single software feature
| which improved my life as much as "commit confirmed"
| bogomipz wrote:
| Your CEO sure doesn't seem to have much empathy when it's
| someone else though:
|
| https://twitter.com/eastdakota/status/1143182575680143361
| latchkey wrote:
| How about describing how you implement systems that prevent
| this? You kind of talk about what was 'fixed', but not how.
| CI/CD is pretty hard to do for global networking changes. I'm
| sure whatever CF has done in this area is a lot of magic sauce
| and it would be super interesting to learn more about it, even
| at a high level.
| libraryatnight wrote:
| Holy shit I have been there and it sucks. I wasn't the guy who
| made the change, but I was on the long call that followed.
| int0x2e wrote:
| Time to share one of my favorite talks (and speakers) ever -
|
| "Debugging Under Fire: Keep your Head when Systems have Lost
| their Mind" (Bryan Cantrill, GOTO 2017)
|
| https://www.youtube.com/watch?v=30jNsCVLpAE
| libraryatnight wrote:
| This was an awesome lunch listen, thank you for sharing!
| rexarex wrote:
| The curse of network engineering. You're invisible and
| insignificant when everything is running well, and public enemy
| number one if you make a mistake!
| asmor wrote:
| Security too.
| septune wrote:
| I miss the old days of IOS :
|
| switchport trunk allowed vlan (add) xxx
|
| Can't imagine how many outages were caused by the missing << add
| >> command.
| candiddevmike wrote:
| Too many Cisco commands would truncate the syntax if you didn't
| know better:
|
| no access-list 101 permit something
|
| so long access-list 101!
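|
| A small sketch of a pre-review check for a planned command list,
| flagging the two footguns mentioned in this subthread. The
| patterns and warning texts are illustrative assumptions, not a
| complete or vendor-verified rule set.

```python
import re
from typing import List, Tuple

# (pattern, warning) pairs; both rules are hypothetical examples.
RISKY_PATTERNS = [
    (re.compile(r"^switchport trunk allowed vlan (?!add\b|remove\b)\S"),
     "replaces the whole allowed-VLAN list; did you mean 'add'?"),
    (re.compile(r"^no access-list \d+\s+\S"),
     "may remove the entire access list, not just this entry"),
]


def review(commands: List[str]) -> List[Tuple[int, str, str]]:
    """Return (line number, command, warning) for each risky command."""
    findings = []
    for number, raw in enumerate(commands, start=1):
        # Normalise whitespace and case before matching.
        line = " ".join(raw.split()).lower()
        for pattern, warning in RISKY_PATTERNS:
            if pattern.search(line):
                findings.append((number, raw.strip(), warning))
    return findings


plan = [
    "interface Ethernet1",
    "switchport trunk allowed vlan 30",
    "switchport trunk allowed vlan add 40",
    "no access-list 101 permit ip any any",
]
for number, command, warning in review(plan):
    print(f"line {number}: {command!r}: {warning}")
```

| A check like this only catches patterns someone has already been
| burned by, so it complements peer review rather than replacing it.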
| raffraffraff wrote:
| Token-ring network. Someone configured their printer to use the
| gateway address in the ip address field. Idiot. "Turn off all
| devices on the internet, then turn them all on again one by one
| until we find the bastard who did this"
___________________________________________________________________
(page generated 2023-01-30 23:00 UTC)