[HN Gopher] Why do systems fail? Tandem NonStop system and fault...
       ___________________________________________________________________
        
       Why do systems fail? Tandem NonStop system and fault tolerance
        
       Author : PaulHoule
       Score  : 33 points
       Date   : 2024-10-11 17:10 UTC (5 hours ago)
        
 (HTM) web link (www.erlang-solutions.com)
 (TXT) w3m dump (www.erlang-solutions.com)
        
       | 082349872349872 wrote:
       | at Tandem, even the company coffee mugs had redundancy:
       | https://i.etsystatic.com/33311136/r/il/08fbca/5271808290/il_...
        
       | sillywalk wrote:
       | I'm still hoping to find a more detailed article about modern
       | X86-64 NonStop, complete with Mackie Diagrams.
       | 
       | The last one I can find is for the NonStop Advanced Architecture
       | (on Itanium), with ServerNet. I gather that this was replaced
       | with the NonStop Multicore Architecture (also on Itanium), with
       | Infiniband, and I assume x86-64 is basically the same but on
       | x86-64, but in pseudo big-endian.
        
       | macintux wrote:
       | 10 years ago I used Jim Gray's piece about Tandem fault tolerance
       | in a talk about Erlang at Midwest.io (RIP, was a great
       | conference).
       | 
       | https://youtu.be/E18shi1qIHU
       | 
       | Because it's a small world, a former Tandem employee was
       | attending the talk. Unfortunately it's been long enough that I
       | don't remember much of our conversation, but it was impressive to
       | hear how they moved a computer between data centers; IIRC, they
       | simply turned it off, and when they powered it back on, the CPU
       | resumed precisely where it had been executing before.
       | 
       | (I have no idea how they handled the system clock.)
       | 
       | Jim Gray's paper:
       | 
       | https://jimgray.azurewebsites.net/papers/TandemTR86.2_FaultT...
        
         | sillywalk wrote:
          | > (I have no idea how they handled the system clock.)
         | 
         | It is or was on the Internet Archive and probably elsewhere -
         | 
         | Tandem Systems Review, Volume 2, Number 1 (February 1986) -
         | "Managing System Time Under Guardian 90"
        
           | macintux wrote:
           | Nice, thanks, will have to look that up.
        
       | Animats wrote:
       | Tandem was interesting. They had a lot of good ideas, many
       | unusual today.
       | 
       | * Databases reside on raw disks. There is no file system
       | underneath the databases. If you want a flat file, it has to be
       | in the database. Why? Because databases can be made with good
       | reliability properties and made distributed and redundant.
       | 
       | * Processes can be moved from one machine to another. Much like
       | the Xen hypervisor, which was a high point in that sort of thing.
       | 
       | * Hardware must have built in fault detection. Everything had
       | ECC, parity, or duplication. It's OK to fail, but not make
       | mistakes. IBM mainframes still have this, but few microprocessors
       | do, even though the necessary transistors would not be a high
       | cost today. (It's still hard to get ECC RAM on the desktop,
       | even.)
       | 
       | * Most things are transactions. All persistent state is in the
       | database. Think REST with CGI programs, but more efficient.
       | That's what makes this work. A transaction either runs to
       | successful completion, or fails and has no lasting effect.
       | Database transactions roll back on failures.
       | 
       | The Tandem concept lived on through several changes of ownership
       | and hardware. Unfortunately, it ended up at HP in the Itanium
       | era, where it seems to have died off.
       | 
       | It's a good architecture. The back ends of banks still look much
       | like that, because that's where the money is. But not many
       | programmers think that way.
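
The all-or-nothing behavior described in the "Most things are transactions" bullet above can be illustrated with a minimal sketch. This uses Python's standard `sqlite3` module rather than anything Tandem-specific; the `accounts` table, the `transfer` helper, and the account names are assumptions made up for the example.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates land or neither does."""
    # Used as a context manager, the connection commits if the block
    # finishes normally and rolls back if the block raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            # Raising here undoes the debit that was just applied.
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )


def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

transfer(conn, "alice", "bob", 30)        # succeeds and is committed
try:
    transfer(conn, "alice", "bob", 500)   # fails part-way and is rolled back
except ValueError:
    pass
print(balances(conn))
```

The failed transfer leaves no trace: the debit made before the error is rolled back along with everything else in that transaction.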
        
         | spockz wrote:
         | Not to take away from your main point: The only reason it is
         | hard to get ECC in a desktop is because it is used as customer
          | segmentation, not because it is technically hard or because it
         | would drive the actual cost of the hardware up.
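
The "OK to fail, but not make mistakes" idea from the parent comment can be sketched with a single parity bit. This is a toy illustration only: real ECC memory uses stronger codes that can also correct errors, and the function names and 9-bit word layout here are assumptions for the example.

```python
def with_parity(byte):
    """Return a 9-bit word: the 8 data bits plus an even-parity bit in bit 8."""
    parity = bin(byte).count("1") % 2
    return (parity << 8) | byte


def read_checked(word):
    """Return the data byte, or raise if the stored parity bit does not match."""
    byte = word & 0xFF
    parity = word >> 8
    if bin(byte).count("1") % 2 != parity:
        # Fail loudly rather than hand back data that may be wrong.
        raise RuntimeError("parity error: refusing to return corrupt data")
    return byte


word = with_parity(0b10110010)
print(read_checked(word) == 0b10110010)   # an intact word reads back unchanged

corrupted = word ^ (1 << 3)               # simulate a single flipped bit
try:
    read_checked(corrupted)
except RuntimeError as err:
    print(err)
```

A single parity bit only detects an odd number of flipped bits, so it shows the fail-fast principle rather than a complete protection scheme.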
        
         | sillywalk wrote:
         | > Databases reside on raw disks. There is no file system
         | underneath the databases.
         | 
         | The terminology of "filesystem" here is confusing. The original
         | database system was/is called Enscribe, and was/is similar to
         | VMS Record Management Services - it had different types of
          | structured file types, in addition to unstructured
         | unix/dos/windows stream-of-byte "flat" files. Around 1987
         | Tandem added NonStop SQL files. They're accessed through a
         | PATH: Volume.SubVolume.Filename, but depending on the file
          | type, there are different things you can do with them.
         | 
         | > If you want a flat file, it has to be in the database.
         | 
         | You could create unstructured files as well.
         | 
         | > Processes can be moved from one machine to another
         | 
         | Critical system processes are process-pairs, where a Primary
         | process does the work, but sends checkpoint messages to a
         | Backup process on another processor. If the Primary process
         | fails, the Backup process transparently takes over and becomes
         | the Primary. Any messages to the process-pair are automatically
         | re-routed.
         | 
         | > Unfortunately, it ended up at HP in the Itanium era, where it
         | seems to have died off.
         | 
         | It did get ported to Xeon processors around 10 years ago, and
          | is still around. Unlike OpenVMS, HPE still works on it, but
          | I don't think there is even a link to it on the HPE website*.
         | It still runs on (standard?) HPE x86 servers connected to HPE
         | servers running Linux to provide storage/networking/etc.
          | Apparently it is also supported under VMware of some kind.
         | 
         | * Something something Greenlake?
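
The process-pair mechanism described in the comment above (a Primary checkpoints to a Backup, which takes over on failure while requests are re-routed) can be sketched in a few lines. This is an in-process toy under stated assumptions, not how Guardian implements it: the `Worker` and `ProcessPair` classes, the `alive` flag, and the worker names are all hypothetical, and real process pairs live on separate processors and exchange messages.

```python
class Worker:
    """One half of a pair: applies requests to its own copy of the state."""

    def __init__(self, name, state=None):
        self.name = name
        self.state = dict(state or {})
        self.alive = True

    def apply(self, key, amount):
        if not self.alive:
            raise RuntimeError(f"{self.name} is down")
        self.state[key] = self.state.get(key, 0) + amount
        return self.state[key]

    def checkpoint(self, state):
        # The backup keeps a copy of the primary's state as of the last request.
        self.state = dict(state)


class ProcessPair:
    """Routes requests to the primary and promotes the backup if it fails."""

    def __init__(self):
        self.primary = Worker("worker-0")
        self.backup = Worker("worker-1")
        self.spawned = 2

    def send(self, key, amount):
        try:
            result = self.primary.apply(key, amount)
        except RuntimeError:
            self._fail_over()
            result = self.primary.apply(key, amount)   # retry on the new primary
        self.backup.checkpoint(self.primary.state)
        return result

    def _fail_over(self):
        # The backup becomes primary; a fresh backup starts from its state.
        self.primary = self.backup
        self.backup = Worker(f"worker-{self.spawned}", self.primary.state)
        self.spawned += 1


pair = ProcessPair()
pair.send("acct", 10)
pair.send("acct", 5)
pair.primary.alive = False                 # simulate the primary's CPU failing
print(pair.send("acct", 1), pair.primary.name)
```

Callers only ever talk to the pair, so the takeover is invisible to them apart from the retried request, which mirrors the "automatically re-routed" point in the comment.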
        
         | kev009 wrote:
         | Yes, IBM mainframes employ or have analogous concepts to all of
         | this which may be one of many reasons they haven't disappeared.
         | A lot of it was built up over time whereas Tandem started from
         | the HA specification so the concepts and marketing are clearer.
         | 
         | Stratus was another interesting HA vendor, particularly the
         | earlier VOS systems as their modern systems are a bit more
         | pedestrian. http://www.teamfoster.com/stratus-computer
        
           | sillywalk wrote:
           | I present to you "Commercial Fault Tolerance: A Tale of Two
           | Systems" [2004][0] - a paper comparing the similarities and
            | differences towards reliability/availability/integrity between
           | Tandem Nonstop and IBM Mainframe systems,
           | 
           | and the book "Reliable Computer Systems - Design and
           | Evaluation"[1] which has general info on reliability, and
           | specific looks at IBM Mainframe, Tandem, and Stratus, plus
           | AT&T switches and spaceflight computers.
           | 
           | [0] https://pages.cs.wisc.edu/~remzi/Classes/838/Fall2001/Pap
           | ers...
           | 
           | [1] https://archive.org/download/reliablecomputer00siew/relia
           | ble...
        
       ___________________________________________________________________
       (page generated 2024-10-11 23:01 UTC)