[HN Gopher] All Circuits are Busy Now: The 1990 AT&T Long Distan...
___________________________________________________________________
All Circuits are Busy Now: The 1990 AT&T Long Distance Network
Collapse (1995)
Author : hexbus
Score : 63 points
Date : 2023-02-05 14:13 UTC (8 hours ago)
(HTM) web link (users.csc.calpoly.edu)
(TXT) w3m dump (users.csc.calpoly.edu)
| pifm_guy wrote:
| Obviously you do everything possible to stop an outage like this
| happening...
|
| But when it inevitably does, you should be prepared for a full
| simultaneous system restart, i.e. so that no 'bad' signals or data
| from the old system can impact the new.
|
| That is the sort of thing you should practice in the staging
| environment from time to time, just for when it might be needed.
| It could have taken this outage from many hours down to just many
| minutes.
| pifm_guy wrote:
| You should also design all your code to be rollbackable... But
| for the very rare case that a rollback won't solve the problem
| (e.g. an outage is caused by changes outside your organisation's
| control), you also need to be able to do a rapid code change,
| recompile and push. Many companies aren't able to do this; for
| example, their release process involves multiple days' worth of
| interlocked manual steps.
|
| Don't get yourself in that position.
| vhold wrote:
| This is an example of why you want interoperable diversity in
| complex distributed systems.
|
| By having everything so standardized and consistent, they had the
| exact same failure mode everywhere and lost redundant fault
| tolerance. If they had different interoperable switches, running
| different software, the outage wouldn't have been absolute.
|
| When large complex distributed systems grow organically over
| time, they tend to wind up with diversity. It usually takes a big
| centralized project focused on efficiency to destroy that
| property.
| yusyusyus wrote:
| I appreciate this comment. In my world of packet pushing, I try
| to promote vendor diversity for this reason.
|
| The practical downsides of this diversity live in the
| complexity of the interop (often slowing feature velocity),
| operations, and procurement/support.
|
| But issues like the AT&T 4ESS outage have occurred before in IP
| networks, as an example, in some BGP bug. Diversity alleviates
| some of the global impact.
| vlovich123 wrote:
| There are other ways of accomplishing this like doing staged
| rollouts without giving up the cost efficiencies of
| implementing your own network only once and avoiding a
| combinatorial explosion in testing complexity.
|
| You can sometimes play this game with vendors because you want
| _them_ to give you an interoperable interface so that you avoid
| vendor lock-in and have better pricing, but that's a secondary
| benefit and staged rollouts should still be performed even if
| you have heterogeneous software.
| kortilla wrote:
| Staged rollouts do not protect you from long lurking bugs.
| Even in this AT&T case they almost certainly did a staged
| rollout, simply because they couldn't shut off the entire
| phone network to run an update across all systems.
| a-dub wrote:
| these days i think the remediation would be "fuzz test the timing
| of critical network protocols to find nonobvious edge cases and
| state machine implementation faults"
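
A minimal sketch of the timing-fuzz idea in the comment above, in Python. Everything here is hypothetical: `ToySwitch`, its `BUSY_WINDOW`, and the planted fault are illustrations only and do not model the real 4ESS software. The fuzzer varies the spacing between two messages and records the spacings that violate an invariant.

```python
import random


class ToySwitch:
    """Hypothetical receiver with a planted timing bug (not real 4ESS logic)."""

    BUSY_WINDOW = 0.010  # assumed seconds spent processing one message

    def __init__(self):
        self.busy_until = float("-inf")
        self.corrupted = False

    def on_message(self, now):
        # Planted fault: a message that arrives while the previous one is
        # still being processed corrupts state instead of being queued.
        if now < self.busy_until:
            self.corrupted = True
        self.busy_until = now + self.BUSY_WINDOW


def fuzz_timing(trials=1000, max_gap=0.050, seed=0):
    """Send message pairs with random spacing; return the gaps that fail."""
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    failing_gaps = []
    for _ in range(trials):
        gap = rng.uniform(0.0, max_gap)
        switch = ToySwitch()
        switch.on_message(0.0)
        switch.on_message(gap)
        if switch.corrupted:  # invariant: state is never corrupted
            failing_gaps.append(gap)
    return failing_gaps


failures = fuzz_timing()
print(f"{len(failures)} failing timings found")
```

Seeding the generator is the main design choice: once a bad spacing turns up, the same seed reproduces it for debugging.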
| hexbus wrote:
| Break statements contained in an if-then-else loop are a bad
| idea.
| phkahler wrote:
| Skipping the full test suite is also a bad idea.
| DangitBobby wrote:
| It can be hard to test what you didn't anticipate!
| acadiel wrote:
| I think the article mentioned they were fanatical about their
| test suite - but this somehow still slipped through.
|
| I wonder if peer review would have caught it.
| gumby wrote:
| I don't think that was the case in this situation
| carl_sandland wrote:
| would complete code coverage from tests of this line have found the
| problem, or was it temporal somehow?
| twisteriffic wrote:
| > Clearly, the use of C programs and compilers contributed to the
| breakdown. A more structured programming language with stricter
| compilers would have made this particular defect much more
| obvious.
|
| Nice to see that "should have used Rust" has been a thing since
| before Rust existed.
| bee_rider wrote:
| I'm sure people have been saying "shouldn't have used C" for
| longer than most of us here have been alive.
| EGreg wrote:
| Should have used Q !
| conjectureproof wrote:
| Why?
|
| I briefly had an interest in learning Q, then looked at some
| code: https://github.com/KxSystems/cookbook/blob/master/start
| /buil...
|
| Why not just build what you need with C/arrow/parquet?
| EGreg wrote:
| Not that Q.
|
| The one I am talking about hasn't been released yet!!
| gumby wrote:
| I remember that outage. It was finally blamed (as described in
| this brief) on phone switch manufacturer DSC. IIRC this killed
| the company. Their SLA with their customers was something like
| three minutes of downtime per decade.
|
| DSC was our customer at Cygnus. They were interesting as a
| customer (tough requirements but they paid a lot for them). For
| example if they reported a bug and got a fix they diffed the
| binaries and looked at every difference to be sure that the
| change was a result of the fix, and nothing else (no, they didn't
| want upgrades).
| JosephRedfern wrote:
| > For example if they reported a bug and got a fix they diffed
| the binaries and looked at every difference to be sure that the
| change was a result of the fix, and nothing else (no, they
| didn't want upgrades).
|
| That sounds pretty diligent!
| peteradio wrote:
| Sounds like a pain in the ass, it means custom branching for
| eternity.
| gumby wrote:
| It was a pain, but they paid a hefty premium for this level
| of service.
|
| We required that all other customers upgrade at least once
| a year (or maybe 18 months? I don't remember).
| yunohn wrote:
| > Their SLA with their customers was something like three
| minutes of downtime per decade.
|
| That is insane. I really feel like modern SLAs are only getting
| worse - so much so that most companies fudge them, and try
| their hardest to never declare any sort of outage.
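
For scale, the quoted "three minutes of downtime per decade" can be turned into an availability figure. This is a rough sketch: the figure itself is recalled from memory in the thread ("something like"), and the function names and the 365.25-day average year are assumptions made here.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, leap days included


def availability(downtime_minutes, years):
    """Fraction of time the service is up over the given period."""
    return 1.0 - downtime_minutes / (years * MINUTES_PER_YEAR)


def allowed_downtime_minutes(availability_fraction, years):
    """Downtime budget, in minutes, implied by an availability target."""
    return (1.0 - availability_fraction) * years * MINUTES_PER_YEAR


print(f"{availability(3, 10):.7%}")
print(f"{allowed_downtime_minutes(0.99999, 10):.1f} minutes per decade")
```

Three minutes per decade works out to an unavailability of about 5.7e-7, which is better than "six nines". A "five nines" target, by comparison, allows roughly 52.6 minutes per decade.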
| wincy wrote:
| What do you mean, Microsoft 358 is an excellent product that
| has uptime for exactly what they've got on the label!
| [deleted]
| rubatuga wrote:
| Worst are SLAs where you have to prove the outage as a
| customer like wtf
| foobiekr wrote:
| Most modern SLAs are worthless. The penalty is meaningless
| and/or "up" is carefully defined in such a way that even a
| service failing 100% of requests is "up" because it's
| responding, or is defined such that a single customer can
| have a total outage but the service is up because it's
| servicing others.
|
| Networking is the last bastion of SLAs that actually seem to
| matter.
| gumby wrote:
| The Bell System standard was an extremely high Erlang number,
| which of course they strove to provide at the lowest cost
| which meant extremely high utilization rate on hardware which
| in turn meant extreme uptime (compare to the then
| contemporary PTT QoS even in major economies like France).
|
| This is also why the software itself was designed with so
| many internal defenses and what I would consider an "immune
| system". I've never seen anything like it even on an aircraft
| control system. That is mentioned in passing in the brief
| article but is easily missed if you don't know what it's
| referring to.
|
| Most of what is done on the Internet at, say, "layer 5 or
| above" isn't at all important so there's no need for this
| level of SLA, but the actual backbone carriers do still carry
| SLAs at around that level. With packet switching it's easier
| for them to provide than it was in the days of the 4ESS and
| 5ESS.
| vatys wrote:
| > The Bell System standard was an extremely high Erlang
| number
|
| I wasn't aware of erlang the unit (measuring telephone
| circuit load) and at first thought this had something to do
| with the language.
|
| https://en.wikipedia.org/wiki/Erlang_(unit)
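
A small sketch of the unit in Python: offered load in erlangs is the call arrival rate multiplied by the mean holding time. The Erlang B blocking formula is added as a standard companion calculation; it is not mentioned in the thread, and the function names and example numbers are assumptions.

```python
def offered_load_erlangs(calls_per_hour, mean_hold_seconds):
    """Offered traffic: arrival rate times mean holding time, in erlangs."""
    return calls_per_hour * mean_hold_seconds / 3600.0


def erlang_b(load_erlangs, circuits):
    """Blocking probability from the standard Erlang B recurrence."""
    blocking = 1.0  # with zero circuits every call is blocked
    for n in range(1, circuits + 1):
        blocking = load_erlangs * blocking / (n + load_erlangs * blocking)
    return blocking


load = offered_load_erlangs(calls_per_hour=120, mean_hold_seconds=180)
print(load, erlang_b(load, circuits=10))
```

Here 120 calls per hour at 3 minutes each is 6 erlangs. Adding circuits for the same load lowers the blocking probability, which is the trade-off between utilization and hardware cost.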
| pacificmint wrote:
| If someone wants to read a lot more detail about this
| incident, there is a book about it. It's been a decade or two since
| I read it, but I remember it being well written.
|
| 'The Day the Phones Stopped Ringing' by Leonard Lee
| acadiel wrote:
| > 'The Day the Phones Stopped Ringing' by Leonard Lee
|
| Definitely going to look into this! The whole *ESS architecture
| still underpins a lot of the telephony system. There are quite
| a few still running, even though other TDM equipment is being
| phased out.
| mdmglr wrote:
| So what was the fix? Remove lines 9-10? Or do "set up pointers to
| optional parameters" before break?
___________________________________________________________________
(page generated 2023-02-05 23:00 UTC)