[HN Gopher] All Circuits are Busy Now: The 1990 AT&T Long Distan...
       ___________________________________________________________________
        
       All Circuits are Busy Now: The 1990 AT&T Long Distance Network
       Collapse (1995)
        
       Author : hexbus
       Score  : 63 points
       Date   : 2023-02-05 14:13 UTC (8 hours ago)
        
 (HTM) web link (users.csc.calpoly.edu)
 (TXT) w3m dump (users.csc.calpoly.edu)
        
       | pifm_guy wrote:
       | Obviously you do everything possible to stop an outage like this
       | happening...
       | 
        | But when it inevitably does, you should be prepared for a full
        | simultaneous restart of the system, i.e. so that no 'bad'
        | signals or data from the old system can impact the new.
        | 
        | That is the sort of thing you should practice in the staging
        | environment from time to time, just for when it might be
        | needed. It could have taken this outage from many hours down
        | to minutes.
        
         | pifm_guy wrote:
          | You should also design all your code to be rollbackable...
          | But for the very rare case where a rollback won't solve the
          | problem (e.g. an outage caused by changes outside your
          | organisation's control), you also need to be able to do a
          | rapid code change, recompile and push. Many companies aren't
          | able to do this; for example, their release process involves
          | multiple days' worth of interlocked manual steps.
          | 
          | Don't get yourself in that position.
        
       | vhold wrote:
       | This is an example of why you want interoperable diversity in
       | complex distributed systems.
       | 
       | By having everything so standardized and consistent, they had the
       | exact same failure mode everywhere and lost redundant fault
       | tolerance. If they had different interoperable switches, running
       | different software, the outage wouldn't have been absolute.
       | 
       | When large complex distributed systems grow organically over
       | time, they tend to wind up with diversity. It usually takes a big
       | centralized project focused on efficiency to destroy that
       | property.
        
         | yusyusyus wrote:
         | I appreciate this comment. In my world of packet pushing, I try
         | to promote vendor diversity for this reason.
         | 
         | The practical downsides of this diversity live in the
         | complexity of the interop (often slowing feature velocity),
         | operations, and procurement/support.
         | 
          | But issues like the AT&T 4ESS outage have occurred in IP
          | networks too, for example in the occasional BGP bug.
          | Diversity alleviates some of the global impact.
        
         | vlovich123 wrote:
          | There are other ways of accomplishing this, like staged
          | rollouts, without giving up the cost efficiencies of
          | implementing your network only once and without a
          | combinatorial explosion in testing complexity.
          | 
          | You can sometimes play this game with vendors because you
          | want _them_ to give you an interoperable interface so that
          | you avoid vendor lock-in and get better pricing, but that's
          | a secondary benefit, and staged rollouts should still be
          | performed even if you have heterogeneous software.
        
           | kortilla wrote:
            | Staged rollouts do not protect you from long-lurking bugs.
            | Even in this AT&T case they most certainly did a staged
            | rollout, if only because they couldn't shut off the entire
            | phone network to run an update across all systems.
        
       | a-dub wrote:
       | these days i think the remediation would be "fuzz test the timing
       | of critical network protocols to find nonobvious edge cases and
       | state machine implementation faults"
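        | 
        | in miniature, something like this: hammer a toy state machine
        | with random event orderings and assert an invariant after
        | every transition. (the states and events below are invented
        | for illustration, nothing from the 4ESS.)
        | 
        |     #include <assert.h>
        |     #include <stdio.h>
        |     #include <stdlib.h>
        | 
        |     /* toy trunk state machine, invented for illustration */
        |     enum st { IDLE, RINGING, BUSY };
        |     enum ev { RING, ANSWER, HANGUP, STATUS };
        | 
        |     static enum st step(enum st s, enum ev e) {
        |         switch (s) {
        |         case IDLE:    return e == RING ? RINGING : IDLE;
        |         case RINGING: return e == ANSWER ? BUSY
        |                            : e == HANGUP ? IDLE : RINGING;
        |         case BUSY:    return e == HANGUP ? IDLE : BUSY;
        |         }
        |         return IDLE;
        |     }
        | 
        |     int main(void) {
        |         srand(1990);
        |         for (int run = 0; run < 100000; run++) {
        |             enum st s = IDLE;
        |             for (int i = 0; i < 64; i++) {
        |                 s = step(s, (enum ev)(rand() % 4));
        |                 /* invariant: never leave the known states */
        |                 assert(s == IDLE || s == RINGING ||
        |                        s == BUSY);
        |             }
        |         }
        |         puts("no invariant violations");
        |         return 0;
        |     }
        | 
        | the real thing would of course also randomize timing and
        | message interleaving across nodes, which, as i understand it,
        | is where the 1990 bug actually lived.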
        
       | hexbus wrote:
        | Break statements contained in an if-then-else clause inside a
        | switch are a bad idea.
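        | 
        | Roughly the shape of it, as I understand the write-ups (a toy
        | reconstruction, not the actual 4ESS code): the break inside
        | the if exits the whole switch, so the statement after the
        | if/else, the one that sets up the pointers to the optional
        | parameters, never runs.
        | 
        |     #include <stdio.h>
        | 
        |     /* Toy reconstruction of the pattern, not the real 4ESS
        |        code. The variable names are invented. */
        |     int main(void) {
        |         for (int known = 0; known <= 1; known++) {
        |             int optional_ready = 0;
        |             switch (1) {       /* stand-in for the msg type */
        |             case 1:
        |                 if (known) {
        |                     /* handle the common case */
        |                     break;     /* exits the switch, not the
        |                                   if */
        |                 } else {
        |                     /* handle the other case */
        |                 }
        |                 optional_ready = 1; /* "set up pointers to
        |                                        optional parameters",
        |                                        skipped when the break
        |                                        above fires */
        |                 break;
        |             }
        |             printf("known=%d optional_ready=%d\n",
        |                    known, optional_ready);
        |         }
        |         return 0;
        |     }
        | 
        | Running it, the known=1 path comes out with optional_ready
        | still 0, which is exactly the kind of silently skipped setup
        | step you don't want in a switch processing live traffic.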
        
         | phkahler wrote:
         | Skipping the full test suite is also a bad idea.
        
           | DangitBobby wrote:
           | It can be hard to test what you didn't anticipate!
        
           | acadiel wrote:
           | I think the article mentioned they were fanatical about their
           | test suite - but this somehow still slipped through.
           | 
           | I wonder if peer review would have caught it.
        
           | gumby wrote:
           | I don't think that was the case in this situation
        
         | carl_sandland wrote:
          | Would complete code coverage from tests of this line have
          | found the problem, or was it temporal somehow?
        
       | twisteriffic wrote:
       | > Clearly, the use of C programs and compilers contributed to the
       | breakdown. A more structured programming language with stricter
       | compilers would have made this particular defect much more
       | obvious.
       | 
       | Nice to see that "should have used Rust" has been a thing since
       | before Rust existed.
        
         | bee_rider wrote:
         | I'm sure people have been saying "shouldn't have used C" for
         | longer than most of us here have been alive.
        
         | EGreg wrote:
         | Should have used Q !
        
           | conjectureproof wrote:
           | Why?
           | 
           | I briefly had an interest in learning Q, then looked at some
           | code: https://github.com/KxSystems/cookbook/blob/master/start
           | /buil...
           | 
           | Why not just build what you need with C/arrow/parquet?
        
             | EGreg wrote:
             | Not that Q.
             | 
             | The one I am talking about hasn't been released yet!!
        
       | gumby wrote:
       | I remember that outage. It was finally blamed (as described in
       | this brief) on phone switch manufacturer DSC. IIRC this killed
       | the company. Their SLA with their customers was something like
       | three minutes of downtime per decade.
       | 
       | DSC was our customer at Cygnus. They were interesting as a
       | customer (tough requirements but they paid a lot for them). For
       | example if they reported a bug and got a fix they diffed the
       | binaries and looked at every difference to be sure that the
       | change was a result of the fix, and nothing else (no, they didn't
       | want upgrades).
        
         | JosephRedfern wrote:
         | > For example if they reported a bug and got a fix they diffed
         | the binaries and looked at every difference to be sure that the
         | change was a result of the fix, and nothing else (no, they
         | didn't want upgrades).
         | 
         | That sounds pretty diligent!
        
           | peteradio wrote:
            | Sounds like a pain in the ass; it means custom branching
            | for eternity.
        
             | gumby wrote:
             | It was a pain, but they paid a hefty premium for this level
             | of service.
             | 
             | We required that all other customers upgrade at least once
             | a year (or maybe 18 months? I don't remember).
        
         | yunohn wrote:
         | > Their SLA with their customers was something like three
         | minutes of downtime per decade.
         | 
         | That is insane. I really feel like modern SLAs are only getting
         | worse - so much so that most companies fudge them, and try
         | their hardest to never declare any sort of outage.
        
           | wincy wrote:
           | What do you mean, Microsoft 358 is an excellent product that
           | has uptime for exactly what they've got on the label!
        
             | [deleted]
        
           | rubatuga wrote:
            | Worst are SLAs where you as the customer have to prove the
            | outage, like wtf.
        
           | foobiekr wrote:
           | Most modern SLAs are worthless. The penalty is meaningless
           | and/or "up" is carefully defined in such a way that even a
           | service failing 100% of requests is "up" because it's
           | responding, or is defined such that a single customer can
           | have a total outage but the service is up because it's
           | servicing others.
           | 
           | Networking is the last bastion of SLAs that actually seem to
           | matter.
        
           | gumby wrote:
            | The Bell System standard was an extremely high Erlang
            | number, which of course they strove to provide at the
            | lowest cost. That meant an extremely high utilization rate
            | on the hardware, which in turn meant extreme uptime
            | (compare to the then-contemporary PTT QoS even in major
            | economies like France).
           | 
           | This is also why the software itself was designed with so
           | many internal defenses and what I would consider an "immune
           | system". I've never seen anything like it even on an aircraft
           | control system. That is mentioned in passing in the brief
           | article but is easily missed if you don't know what it's
           | referring to.
           | 
           | Most of what is done on the Internet at, say, "layer 5 or
           | above" isn't at all important so there's no need for this
           | level of SLA, but the actual backbone carriers do still carry
           | SLAs at around that level. With packet switching it's easier
           | for them to provide than it was in the days of the 4ESS and
           | 5ESS.
        
             | vatys wrote:
             | > The Bell System standard was an extremely high Erlang
             | number
             | 
             | I wasn't aware of erlang the unit (measuring telephone
             | circuit load) and at first thought this had something to do
             | with the language.
             | 
             | https://en.wikipedia.org/wiki/Erlang_(unit)
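              | 
              | For anyone else who was confused: one erlang is one
              | circuit held busy continuously, and the classic use of
              | the unit is the Erlang B formula, which gives the
              | probability that a call is blocked for a given offered
              | load and trunk count. A quick sketch (the traffic
              | numbers below are made up):
              | 
              |     #include <stdio.h>
              | 
              |     /* Erlang B blocking probability for offered
              |        load e (in erlangs) on m circuits, via the
              |        standard recurrence B(e,0) = 1,
              |        B(e,k) = e*B(e,k-1) / (k + e*B(e,k-1)). */
              |     static double erlang_b(double e, int m) {
              |         double b = 1.0;
              |         for (int k = 1; k <= m; k++)
              |             b = e * b / (k + e * b);
              |         return b;
              |     }
              | 
              |     int main(void) {
              |         /* made-up example: 100 erlangs of offered
              |            traffic on a group of 110 trunks */
              |         printf("P(block) = %.4f\n",
              |                erlang_b(100.0, 110));
              |         return 0;
              |     }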
        
       | pacificmint wrote:
        | If someone wants to read a lot more details about this
        | incident, there is a book about it. It's been a decade or two
        | since I read it, but I remember it being well written.
       | 
       | 'The Day the Phones Stopped Ringing' by Leonard Lee
        
         | acadiel wrote:
         | > 'The Day the Phones Stopped Ringing' by Leonard Lee
         | 
         | Definitely going to look into this! The whole *ESS architecture
         | still underpins a lot of the telephony system. There are quite
         | a few still running, even though other TDM equipment is being
         | phased out.
        
       | mdmglr wrote:
       | So what was the fix? Remove lines 9-10? Or do "set up pointers to
       | optional parameters" before break?
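        | 
        | I'd guess one of these two shapes, against the same sort of
        | toy fragment as upthread (speculation on my part, not AT&T's
        | actual change):
        | 
        |     #include <stdio.h>
        | 
        |     /* Option A: drop the early break so the setup after
        |        the if/else runs on both paths. */
        |     static int option_a(int known) {
        |         int optional_ready = 0;
        |         switch (1) {
        |         case 1:
        |             if (known) {
        |                 /* handle the common case, no break */
        |             } else {
        |                 /* handle the other case */
        |             }
        |             optional_ready = 1;
        |             break;
        |         }
        |         return optional_ready;
        |     }
        | 
        |     /* Option B: do the setup before breaking out. */
        |     static int option_b(int known) {
        |         int optional_ready = 0;
        |         switch (1) {
        |         case 1:
        |             if (known) {
        |                 optional_ready = 1;
        |                 break;
        |             } else {
        |                 /* handle the other case */
        |             }
        |             optional_ready = 1;
        |             break;
        |         }
        |         return optional_ready;
        |     }
        | 
        |     int main(void) {
        |         printf("A: %d %d  B: %d %d\n",
        |                option_a(0), option_a(1),
        |                option_b(0), option_b(1));  /* all 1s */
        |         return 0;
        |     }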
        
       ___________________________________________________________________
       (page generated 2023-02-05 23:00 UTC)