[HN Gopher] Why do systems fail? Tandem NonStop system and fault...
___________________________________________________________________
Why do systems fail? Tandem NonStop system and fault tolerance
Author : PaulHoule
Score : 33 points
Date : 2024-10-11 17:10 UTC (5 hours ago)
(HTM) web link (www.erlang-solutions.com)
(TXT) w3m dump (www.erlang-solutions.com)
| 082349872349872 wrote:
| at Tandem, even the company coffee mugs had redundancy:
| https://i.etsystatic.com/33311136/r/il/08fbca/5271808290/il_...
| sillywalk wrote:
| I'm still hoping to find a more detailed article about modern
| X86-64 NonStop, complete with Mackie Diagrams.
|
| The last one I can find is for the NonStop Advanced Architecture
| (on Itanium), with ServerNet. I gather that this was replaced
| with the NonStop Multicore Architecture (also on Itanium), with
| Infiniband, and I assume x86-64 is basically the same but on
| x86-64, but in pseudo big-endian.
| macintux wrote:
| 10 years ago I used Jim Gray's piece about Tandem fault tolerance
| in a talk about Erlang at Midwest.io (RIP, was a great
| conference).
|
| https://youtu.be/E18shi1qIHU
|
| Because it's a small world, a former Tandem employee was
| attending the talk. Unfortunately it's been long enough that I
| don't remember much of our conversation, but it was impressive to
| hear how they moved a computer between data centers; IIRC, they
| simply turned it off, and when they powered it back on, the CPU
| resumed precisely where it had been executing before.
|
| (I have no idea how they handled the system clock.)
|
| Jim Gray's paper:
|
| https://jimgray.azurewebsites.net/papers/TandemTR86.2_FaultT...
| sillywalk wrote:
| > (I have no idea how they handled the system clock.)
|
| It is or was on the Internet Archive and probably elsewhere -
|
| Tandem Systems Review, Volume 2, Number 1 (February 1986) -
| "Managing System Time Under Guardian 90"
| macintux wrote:
| Nice, thanks, will have to look that up.
| Animats wrote:
| Tandem was interesting. They had a lot of good ideas, many
| unusual today.
|
| * Databases reside on raw disks. There is no file system
| underneath the databases. If you want a flat file, it has to be
| in the database. Why? Because databases can be made with good
| reliability properties and made distributed and redundant.
|
| * Processes can be moved from one machine to another. Much like
| the Xen hypervisor, which was a high point in that sort of thing.
|
| * Hardware must have built in fault detection. Everything had
| ECC, parity, or duplication. It's OK to fail, but not make
| mistakes. IBM mainframes still have this, but few microprocessors
| do, even though the necessary transistors would not be a high
| cost today. (It's still hard to get ECC RAM on the desktop,
| even.)
|
| * Most things are transactions. All persistent state is in the
| database. Think REST with CGI programs, but more efficient.
| That's what makes this work. A transaction either runs to
| successful completion, or fails and has no lasting effect.
| Database transactions roll back on failures.
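The all-or-nothing behaviour described in the last bullet can be sketched with any transactional database. The example below is a minimal illustration using Python's built-in sqlite3 module, not Tandem's own software; the `accounts` table and the `transfer`, `make_db`, and `balances` helpers are hypothetical names invented for this sketch.

```python
import sqlite3


def make_db():
    # Hypothetical two-account ledger held in an in-memory database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("alice", 100), ("bob", 0)])
    conn.commit()
    return conn


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit or neither does."""
    try:
        # The connection context manager commits on success and rolls
        # back if an exception escapes the block.
        with conn:
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            row = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                               (src,)).fetchone()
            if row is None or row[0] < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except ValueError:
        # The debit above was rolled back, so the failure leaves no trace.
        return False
```

A failed transfer leaves the balances exactly as they were before the attempt, which is the property the comment attributes to Tandem transactions.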
|
| The Tandem concept lived on through several changes of ownership
| and hardware. Unfortunately, it ended up at HP in the Itanium
| era, where it seems to have died off.
|
| It's a good architecture. The back ends of banks still look much
| like that, because that's where the money is. But not many
| programmers think that way.
| spockz wrote:
| Not to take away from your main point: The only reason it is
| hard to get ECC in a desktop is because it is used as customer
| segmentation, not because it is technically hard or because it
| would drive the actual cost of the hardware up.
| sillywalk wrote:
| > Databases reside on raw disks. There is no file system
| underneath the databases.
|
| The terminology of "filesystem" here is confusing. The original
| database system was/is called Enscribe, and was/is similar to
| VMS Record Management Services - it had different types of
| structured file types, in addition to unstructured
| unix/dos/windows stream-of-byte "flat" files. Around 1987
| Tandem added NonStop SQL files. They're accessed through a
| PATH: Volume.SubVolume.Filename, but depending on the file
| type, there are different things you can do with them.
|
| > If you want a flat file, it has to be in the database.
|
| You could create unstructured files as well.
|
| > Processes can be moved from one machine to another
|
| Critical system processes are process-pairs, where a Primary
| process does the work, but sends checkpoint messages to a
| Backup process on another processor. If the Primary process
| fails, the Backup process transparently takes over and becomes
| the Primary. Any messages to the process-pair are automatically
| re-routed.
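The process-pair idea described above can be sketched as a toy, single-process simulation. This is only an illustration under stated assumptions: real NonStop pairs run on separate processors and exchange checkpoint messages over the interconnect, and the `Worker` and `ProcessPair` classes, the running-total state, and the sequence numbers are all hypothetical names chosen for the sketch.

```python
class Worker:
    """One half of a pair; its state is a running total (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.total = 0
        self.last_seq = 0   # sequence number of the last request applied
        self.alive = True

    def apply_checkpoint(self, seq, total):
        # The backup does no work itself; it only records the primary's state.
        self.last_seq, self.total = seq, total


class ProcessPair:
    """Routes requests to the primary and promotes the backup on failure."""

    def __init__(self):
        self.primary = Worker("A")
        self.backup = Worker("B")

    def send(self, seq, amount):
        if not self.primary.alive:
            self._take_over()
        p = self.primary
        if seq <= p.last_seq:
            # A retried request that was already applied: do not apply twice.
            return p.total
        new_total = p.total + amount
        # Checkpoint to the backup before the new state becomes visible.
        if self.backup is not None and self.backup.alive:
            self.backup.apply_checkpoint(seq, new_total)
        p.total, p.last_seq = new_total, seq
        return p.total

    def _take_over(self):
        if self.backup is None or not self.backup.alive:
            raise RuntimeError("both halves of the pair have failed")
        # The backup becomes the primary; callers keep using the same pair.
        self.primary, self.backup = self.backup, None
```

Callers only ever talk to the pair, so a takeover is invisible to them apart from a possible retry, which the sequence number makes safe to repeat.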
|
| > Unfortunately, it ended up at HP in the Itanium era, where it
| seems to have died off.
|
| It did get ported to Xeon processors around 10 years ago, and
| is still around. Unlike OpenVMS, HPE still works on it, but
| I don't think there is even a link to it on the HPE website*.
| It still runs on (standard?) HPE x86 servers connected to HPE
| servers running Linux to provide storage/networking/etc.
| Apparently it is also supported running under VMware of some kind.
|
| * Something something Greenlake?
| kev009 wrote:
| Yes, IBM mainframes employ or have analogous concepts to all of
| this which may be one of many reasons they haven't disappeared.
| A lot of it was built up over time whereas Tandem started from
| the HA specification so the concepts and marketing are clearer.
|
| Stratus was another interesting HA vendor, particularly the
| earlier VOS systems as their modern systems are a bit more
| pedestrian. http://www.teamfoster.com/stratus-computer
| sillywalk wrote:
| I present to you "Commercial Fault Tolerance: A Tale of Two
| Systems" [2004][0] - a paper comparing the similarities and
| differences in reliability/availability/integrity between
| Tandem NonStop and IBM Mainframe systems,
|
| and the book "Reliable Computer Systems - Design and
| Evaluation"[1] which has general info on reliability, and
| specific looks at IBM Mainframe, Tandem, and Stratus, plus
| AT&T switches and spaceflight computers.
|
| [0] https://pages.cs.wisc.edu/~remzi/Classes/838/Fall2001/Papers...
|
| [1] https://archive.org/download/reliablecomputer00siew/reliable...
___________________________________________________________________
(page generated 2024-10-11 23:01 UTC)