[HN Gopher] Why do systems fail? Tandem NonStop system and fault...
       ___________________________________________________________________
        
       Why do systems fail? Tandem NonStop system and fault tolerance
        
       Author : PaulHoule
       Score  : 108 points
       Date   : 2024-10-11 17:10 UTC (1 day ago)
        
 (HTM) web link (www.erlang-solutions.com)
 (TXT) w3m dump (www.erlang-solutions.com)
        
       | 082349872349872 wrote:
       | at Tandem, even the company coffee mugs had redundancy:
       | https://i.etsystatic.com/33311136/r/il/08fbca/5271808290/il_...
        
       | sillywalk wrote:
       | I'm still hoping to find a more detailed article about modern
       | X86-64 NonStop, complete with Mackie Diagrams.
       | 
        | The last one I can find is for the NonStop Advanced Architecture
        | (on Itanium), with ServerNet. I gather that this was replaced
        | with the NonStop Multicore Architecture (also on Itanium), with
        | InfiniBand, and I assume the x86-64 version is basically the
        | same architecture, but in pseudo big-endian.
        
         | hi-v-rocknroll wrote:
          | A hypervisor (software) approach is one way to accomplish it
          | far more cheaply, and with much more configurability and
          | reusability, than relying on dedicated hardware. VMware's
          | x86_64 fault-tolerance feature runs 2 VMs on different hosts
          | using the lockstep method. If either fails, the hypervisor
          | moves the (V)IP over to the surviving VM with gratuitous ARP
          | and spawns another to replace it. More often than not, it's a
          | way to run a critical machine that cannot accept any downtime
          | and cannot otherwise be (re)engineered in a conventional HA
          | manner with other building blocks. In general, one should
          | avoid this and instead prefer always-consistent quorum two-
          | phase-commit transactions, at the cost of availability or
          | throughput, or eventual consistency through gossip updates,
          | at the cost of temporary inconsistency and potential data
          | loss.
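          | 
          | (For concreteness, a minimal two-phase-commit sketch in plain
          | Python with toy in-memory participants; real implementations
          | add stable prepare logs, timeouts, and coordinator recovery.)
          | 
          |     class Participant:
          |         # Toy resource; a real one writes a prepare
          |         # record to stable storage before voting yes.
          |         def __init__(self, ok=True):
          |             self.ok, self.state = ok, None
          |         def prepare(self):          # phase 1: vote
          |             self.state = "prepared" if self.ok else "aborted"
          |             return self.ok
          |         def commit(self):           # phase 2: make durable
          |             self.state = "committed"
          |         def rollback(self):
          |             self.state = "aborted"
          | 
          |     def two_phase_commit(parts):
          |         if all(p.prepare() for p in parts):   # phase 1
          |             for p in parts:
          |                 p.commit()                    # phase 2
          |             return True
          |         for p in parts:
          |             p.rollback()     # any "no" vote aborts all
          |         return False
          | 
          |     print(two_phase_commit([Participant(), Participant()]))
          |     # True; pass Participant(ok=False) to see the abort path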
        
         | adastra22 wrote:
         | What do you want to know?
        
           | sillywalk wrote:
            | What has changed since Itanium? What counts as a logical
            | NonStop CPU now? As I (mis?)understand it, under Itanium a
            | physical server blade was called a slice. It had multiple
            | CPU sockets (called Processing Elements), and memory on the
            | slice was partitioned with MMU mapping and Itanium security
            | keys so each Processing Element could only access a portion
            | of it. All IO on a Processing Element went out over
            | ServerNet (or InfiniBand) to a pair of Logical Sync Units,
            | and was checked/compared with IO from another Processing
            | Element running the same code on a different physical
            | server blade. The 2 (or 3) Processing Elements combined to
            | form a single logical CPU. I wonder if this is still the
            | case? I believe there was a follow-on (I assume when
            | Itanium went multi-core) called the NonStop Multicore
            | Architecture, but I haven't found a paper on it.
           | 
            | Also, I'm curious how the Disk Process fits in with Storage
            | Clustered IO Modules (CLIMs)? Do CLIMs just act as a raw
            | disk, with the Disk Process talking to them the way it
            | would talk to a locally attached disk? Or is there more
            | integration with the CLIM, like a portion of the Disk
            | Process having been ported to Linux, or Enscribe having
            | been ported to run on the CLIMs?
           | 
           | The same thing with how Networking CLIMs fit in.
        
       | macintux wrote:
       | 10 years ago I used Jim Gray's piece about Tandem fault tolerance
       | in a talk about Erlang at Midwest.io (RIP, was a great
       | conference).
       | 
       | https://youtu.be/E18shi1qIHU
       | 
       | Because it's a small world, a former Tandem employee was
       | attending the talk. Unfortunately it's been long enough that I
       | don't remember much of our conversation, but it was impressive to
       | hear how they moved a computer between data centers; IIRC, they
       | simply turned it off, and when they powered it back on, the CPU
       | resumed precisely where it had been executing before.
       | 
       | (I have no idea how they handled the system clock.)
       | 
       | Jim Gray's paper:
       | 
       | https://jimgray.azurewebsites.net/papers/TandemTR86.2_FaultT...
        
         | sillywalk wrote:
          | > (I have no idea how they handled the system clock.)
         | 
         | It is or was on the Internet Archive and probably elsewhere -
         | 
         | Tandem Systems Review, Volume 2, Number 1 (February 1986) -
         | "Managing System Time Under Guardian 90"
        
           | macintux wrote:
           | Nice, thanks, will have to look that up.
        
           | throw0101c wrote:
           | > _Tandem Systems Review, Volume 2, Number 1 (February 1986)
           | - "Managing System Time Under Guardian 90"_
           | 
           | * https://vtda.org/pubs/Tadem_Systems_Review/
           | 
           | * https://www.mrynet.com/FTP/os/DEC/www.hpl.hp.com/hpjournal/
           | t...
        
         | abrookewood wrote:
         | That is crazy! I assume that all the RAM was battery backed?
         | What about the CPU cache, the OS state etc? I'm struggling to
         | see how this was possible.
        
       | Animats wrote:
       | Tandem was interesting. They had a lot of good ideas, many
       | unusual today.
       | 
       | * Databases reside on raw disks. There is no file system
       | underneath the databases. If you want a flat file, it has to be
       | in the database. Why? Because databases can be made with good
       | reliability properties and made distributed and redundant.
       | 
       | * Processes can be moved from one machine to another. Much like
       | the Xen hypervisor, which was a high point in that sort of thing.
       | 
        | * Hardware must have built-in fault detection. Everything had
        | ECC, parity, or duplication. It's OK to fail, but not to make
        | mistakes. IBM mainframes still have this, but few
        | microprocessors do, even though the necessary transistors would
        | not be a high cost today. (It's still hard to get ECC RAM on
        | the desktop, even.)
       | 
       | * Most things are transactions. All persistent state is in the
       | database. Think REST with CGI programs, but more efficient.
       | That's what makes this work. A transaction either runs to
       | successful completion, or fails and has no lasting effect.
       | Database transactions roll back on failures.
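        | 
        | (For flavor, a minimal sketch of that all-or-nothing contract
        | using Python's built-in sqlite3; this is not Tandem's TMF, just
        | the same property: commit on success, roll back on any
        | failure.)
        | 
        |     import sqlite3
        | 
        |     conn = sqlite3.connect(":memory:")
        |     conn.execute(
        |         "CREATE TABLE acct (id TEXT PRIMARY KEY, bal INT)")
        |     conn.execute("INSERT INTO acct VALUES ('a', 100), ('b', 0)")
        |     conn.commit()
        | 
        |     def transfer(conn, src, dst, amount):
        |         # Either both updates land, or neither does.
        |         try:
        |             conn.execute(
        |                 "UPDATE acct SET bal = bal - ? WHERE id = ?",
        |                 (amount, src))
        |             (bal,) = conn.execute(
        |                 "SELECT bal FROM acct WHERE id = ?",
        |                 (src,)).fetchone()
        |             if bal < 0:
        |                 raise ValueError("insufficient funds")
        |             conn.execute(
        |                 "UPDATE acct SET bal = bal + ? WHERE id = ?",
        |                 (amount, dst))
        |             conn.commit()       # success: lasting effect
        |         except Exception:
        |             conn.rollback()     # failure: no lasting effect
        |             raise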
       | 
       | The Tandem concept lived on through several changes of ownership
       | and hardware. Unfortunately, it ended up at HP in the Itanium
       | era, where it seems to have died off.
       | 
       | It's a good architecture. The back ends of banks still look much
       | like that, because that's where the money is. But not many
       | programmers think that way.
        
         | spockz wrote:
          | Not to take away from your main point: the only reason it is
          | hard to get ECC in a desktop is that it is used for customer
          | segmentation, not because it is technically hard or because
          | it would drive up the actual cost of the hardware.
        
           | sitkack wrote:
            | ECC should be mandatory in consumer CPUs and memory. In
            | the future this will be seen like cars with fins and no
            | seatbelts.
        
             | Animats wrote:
              | I have a desktop where the CPU, OS, and motherboard all
              | support it. But ECC memory was hard to find. Memory with
              | useless LEDs, though, is easily available.
        
               | spockz wrote:
                | That is because it doesn't make sense to produce a
                | product that cannot be used at all. ECC memory just
                | doesn't work in consumer boards, due to lack of support
                | in consumer CPUs. Again, due to artificial customer
                | segmentation.
        
               | c0balt wrote:
                | Most Ryzen CPUs have supported some ECC RAM for
                | multiple years now. The HEDT platforms, like
                | Threadripper, did too. It just hasn't really been
                | advertised much, because most consumers don't appear to
                | be willing to pay the higher cost.
        
             | PhilipRoman wrote:
             | Ok, I'll bite - what tangible benefit would ECC give to the
             | average consumer? I'd wager in the real world 1000x more
             | data loss/corruption happens due to HDD/SSD failure with no
             | backups.
             | 
             | Personally I genuinely don't care about ECC ram and I would
             | not pay more than $10 additional price to get it.
        
               | adastra22 wrote:
                | Most users experience data loss due to the lack of ECC
                | these days. They just might not attribute it to cosmic
                | rays. It's kinda hard to tell bit-flip data loss apart
                | from intermittent hardware failure. It can be just as
                | catastrophic though, if the flip hits a critical bit of
                | information and ends up corrupting the disk entirely.
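                | 
                | (A toy illustration in Python, not the DIMM
                | circuitry: Hamming(7,4) locating and fixing one
                | flipped bit. Real ECC applies the same idea to
                | wider words.)
                | 
                |     def encode(d1, d2, d3, d4):
                |         # parity bits cover overlapping
                |         # subsets of the 7 positions
                |         p1 = d1 ^ d2 ^ d4
                |         p2 = d1 ^ d3 ^ d4
                |         p3 = d2 ^ d3 ^ d4
                |         return [p1, p2, d1, p3, d2, d3, d4]
                | 
                |     def correct(c):
                |         # syndrome = 1-based index of bad bit
                |         s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
                |         s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
                |         s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
                |         s = s1 | (s2 << 1) | (s3 << 2)
                |         if s:
                |             c[s - 1] ^= 1   # repair the flip
                |         return c
                | 
                |     w = encode(1, 0, 1, 1)
                |     w[4] ^= 1               # simulate a bit flip
                |     assert correct(w) == encode(1, 0, 1, 1)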
        
               | immibis wrote:
               | My Threadripper 7000 system with ECC DDR5 and MCE logging
               | reports a corrected bit error every few hours, but I've
               | got no idea if that's normal. I assume it was a tradeoff
               | for memory density.
        
               | MichaelZuo wrote:
                | This. Memory densities are so high nowadays that it's
                | almost guaranteed a new computer bought in 2024 will
                | hard fault with actual consequences (crashing,
                | corrupted data, etc.) at least once a year due to lack
                | of ECC.
        
         | sillywalk wrote:
         | > Databases reside on raw disks. There is no file system
         | underneath the databases.
         | 
          | The terminology of "filesystem" here is confusing. The
          | original database system was/is called Enscribe, and was/is
          | similar to VMS Record Management Services: it had several
          | types of structured files, in addition to unstructured
          | unix/dos/windows stream-of-bytes "flat" files. Around 1987
          | Tandem added NonStop SQL files. They're accessed through a
          | path, Volume.SubVolume.Filename, but depending on the file
          | type there are different things you can do with them.
         | 
         | > If you want a flat file, it has to be in the database.
         | 
         | You could create unstructured files as well.
         | 
         | > Processes can be moved from one machine to another
         | 
         | Critical system processes are process-pairs, where a Primary
         | process does the work, but sends checkpoint messages to a
         | Backup process on another processor. If the Primary process
         | fails, the Backup process transparently takes over and becomes
         | the Primary. Any messages to the process-pair are automatically
         | re-routed.
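          | 
          | (A minimal sketch of the checkpointing idea in Python, with
          | toy in-process objects standing in for the two CPUs; real
          | process-pairs checkpoint over the interprocessor bus, and
          | takeover is driven by the OS, not the caller.)
          | 
          |     class Process:
          |         def __init__(self):
          |             self.state = {}
          |         def apply(self, key, value):
          |             self.state[key] = value
          | 
          |     def run_pair(ops, primary, backup, fail_at=None):
          |         for i, (k, v) in enumerate(ops):
          |             if i == fail_at:
          |                 primary, backup = backup, None  # takeover
          |             primary.apply(k, v)                 # do the work
          |             if backup is not None:
          |                 backup.apply(k, v)      # checkpoint message
          |         return primary.state
          | 
          |     ops = [("a", 1), ("b", 2), ("c", 3)]
          |     print(run_pair(ops, Process(), Process(), fail_at=2))
          |     # {'a': 1, 'b': 2, 'c': 3}: no update was lost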
         | 
         | > Unfortunately, it ended up at HP in the Itanium era, where it
         | seems to have died off.
         | 
          | It did get ported to Xeon processors around 10 years ago, and
          | is still around. Unlike OpenVMS, HPE still works on it, but I
          | don't think there is even a link to it on the HPE website* .
          | It still runs on (standard?) HPE x86 servers connected to HPE
          | servers running Linux to provide storage/networking/etc.
          | Apparently it is also supported running under VMware of some
          | kind.
         | 
         | * Something something Greenlake?
        
           | Animats wrote:
           | > Critical system processes are process-pairs, where a
           | Primary process does the work, but sends checkpoint messages
           | to a Backup process on another processor. If the Primary
           | process fails, the Backup process transparently takes over
           | and becomes the Primary. Any messages to the process-pair are
           | automatically re-routed.
           | 
           | Right. Process migration was possible, but you're right in
           | that it didn't work like Xen.
           | 
           | > It still runs on (standard?) HPE x86 servers connected to
           | HPE servers running Linux to provide storage/networking/etc.
           | 
           | HP is apparently still selling some HPE gear. But it looks
           | like all that stuff transitions to "mature support" at the
           | end of 2025.[1] "Standard support for Integrity servers will
           | end December 31, 2025. Beyond Standard support, HPE Services
           | may provide HPE Mature Hardware Onsite Support, Service
           | dependent on HW spares availability." The end is near.
           | 
           | [1] https://www.hpe.com/psnow/doc/4aa3-9071enw?jumpid=in_hpes
           | ite...
        
             | sillywalk wrote:
             | It looks like that Mature Support stuff is all for
             | Integrity i.e. Itanium servers. As long as HPE still makes
             | x86 servers for Linux/Windows, I assume NonStop can tag
             | along.
        
               | Animats wrote:
               | Right, that's just the Itanium machines. I'm not current
               | on HP buzzwords.
               | 
               | The HP NonStop systems, Xeon versions, are here.[1] The
               | not-very-informative white paper is here.[2] Not much
               | about how they do it. Especially since they talk about
               | running "modern" software, like Java and Apache.
               | 
               | [1] https://www.hpe.com/us/en/compute/nonstop-
               | servers.html
               | 
               | [2] https://www.hpe.com/psnow/doc/4aa6-5326enw?jumpid=in_
               | pdfview...
        
               | lazide wrote:
               | As a side point - that is some _amazing_ lock in.
        
               | MichaelZuo wrote:
                | They were pretty much the only game in town, other than
                | IBM and smaller mainframe vendors, if you wanted actual
                | written, binding guarantees of performance with penalty
                | clauses (e.g. with real consequences for system
                | failure, such as being credited back X millions of
                | dollars after Y failure).
                | 
                | At least that's what I heard pre-HP acquisition. So
                | it's not 'amazing lock in', just that, if you didn't
                | want a mainframe and needed such guarantees, there was
                | literally no other choice.
        
               | lazide wrote:
               | Notably, that _is_ amazing lock in. What else would it
               | look like?
        
               | MichaelZuo wrote:
                | Well, if price/performance alone is enough to qualify
                | (viz. IBM), then the moment another mainframe vendor
                | decided to undercut them by, say, 20%, the lock-in
                | would evaporate. Of course no mainframe vendor would
                | likely do so, but the latent possibility is always
                | there.
               | 
               | Facebook is an example of 'amazing lock in' where it's
               | not theoretically possible for any potential competitor
               | to just negate it with the stroke of a pen.
        
         | kev009 wrote:
          | Yes, IBM mainframes employ analogous concepts to all of this,
          | which may be one of many reasons they haven't disappeared. A
          | lot of it was built up over time, whereas Tandem started from
          | the HA specification, so the concepts and marketing are
          | clearer.
         | 
         | Stratus was another interesting HA vendor, particularly the
         | earlier VOS systems as their modern systems are a bit more
         | pedestrian. http://www.teamfoster.com/stratus-computer
        
           | sillywalk wrote:
            | I present to you "Commercial Fault Tolerance: A Tale of Two
            | Systems" [2004][0], a paper comparing the similarities and
            | differences in approach to reliability/availability/
            | integrity between Tandem NonStop and IBM mainframe systems,
            | 
            | and the book "Reliable Computer Systems: Design and
            | Evaluation"[1], which has general info on reliability, and
            | specific looks at IBM mainframes, Tandem, and Stratus, plus
            | AT&T switches and spaceflight computers.
           | 
           | [0] https://pages.cs.wisc.edu/~remzi/Classes/838/Fall2001/Pap
           | ers...
           | 
           | [1] https://archive.org/download/reliablecomputer00siew/relia
           | ble...
        
           | mech422 wrote:
            | Yeah, Stratus rocked :-) The 'big battle' used to be
            | between NonStop's more 'software-based' fault tolerance vs.
            | Stratus's fully hardware-level high availability. I used to
            | love demoing our Stratus systems to clients and letting
            | them pull boards while the machine was running... Just
            | don't pull 2 next to each other :-)
            | 
            | Also, I think Stratus was the first (only?) computer IBM
            | re-badged at the time; IBM sold Stratuses as the System/88,
            | IIRC.
        
         | adastra22 wrote:
         | > Unfortunately, it ended up at HP in the Itanium era, where it
         | seems to have died off.
         | 
         | My dad continues to maintain NonStop systems under the umbrella
         | of DXC. (Which is a spinoff of HP? Or something? Idk the
         | details.) He worked at Tandem back in the day, and has stayed
         | with it ever since. I think he'd love to retire, but he never
         | ends up as part of the layoffs that get sweet severance
         | packages, because he's literally irreplaceable.
         | 
         | The whole stack got moved to run on top of Linux, IIRC, with
         | all these features being emulated. It still exists though, for
         | the handful of customers that use it.
        
           | Sylamore wrote:
            | Kinda the other way around: the NonStop kernel can present
            | a Guardian personality or an OSS (Open System Services)
            | Unix-like personality. The OSS layer basically runs on top
            | of the NSK/Guardian native layer but allows you to compile
            | most Linux software.
        
             | adastra22 wrote:
             | No, I meant the other way around. I don't know to what
             | degree it ever got released, but he spent years getting it
             | to work on "commodity" mainframe hardware running Linux, as
             | HP wanted to get out of the business of maintaining special
             | equipment and OS just for this customer.
        
         | Sylamore wrote:
          | Speaking of Tandem databases: HP released the SQL engine
          | behind SQL/MX[0] as open source (Trafodion), running in front
          | of Hadoop, to the Apache Software Foundation, but it appears
          | the project has since been retired[1].
         | 
         | [0]: https://thenewstack.io/sql-hadoop-database-trafodion-
         | bridges...
         | 
         | [1]: https://attic.apache.org/projects/trafodion.html
        
         | mannyv wrote:
         | Oracle has had raw disk support for a long time. I'm pretty
         | sure it's the last 'mainstream' database that does.
        
       | vivzkestrel wrote:
        | Completely unrelated to the topic, but I wanted to point it
        | out: there is an accessibility issue with this page. The up and
        | down arrow keys do not scroll the page on Firefox 131.0.2 on an
        | M1 Mac.
        
       | hi-v-rocknroll wrote:
       | Stanford's Forsythe DC had a Tandem mainframe just inside the
       | main floor area. It was a short beast standing on its own about
       | 1.5m / 4' tall, and not in a 19" rack.
        
       | redbluff wrote:
        | As someone who has worked on NonStops for 35 years (and still
        | counting!) it's nice to see them get a mention on here. I even
        | have two at home: a K2000 (MIPS) machine from the 90s and an
        | Itanium server from the mid 10s. I am pretty sure the suburb's
        | lights dim when I fire them up :).
        | 
        | It's an interesting machine architecture to work on, especially
        | the "Guardian 90" personality, and it's quite amazing that you
        | can run late-70s programs, written for a CPU built from TTL
        | logic, without recompilation on a MIPS, Itanium or x86 CPU; not
        | all of them, mind you, and not if they were natively compiled.
        | The note on Stratus was quite interesting; for a long time the
        | only real direct competitor NonStop had was Stratus. The other
        | thing that makes these systems interesting is that they have a
        | Unix-like personality called "OSS" that allows you to run quite
        | a few POSIX-style Unix programs.
       | 
        | My favourite NonStop story was from the big LA earthquake
        | (89?). A friend of mine was working at a POS processor. When
        | they returned to the building, the Tandem machine was lying on
        | its side, unplugged, and still operating (these machines had
        | their own battery backup). They righted it, plugged everything
        | back in, and the machine continued operating as though nothing
        | had happened. The fact that pretty much all the network comms
        | were down made this somewhat moot, but it was fascinating
        | nonetheless. Pulling a CPU board, network board, disc
        | controller, or disc: all doable with no impact to transaction
        | flow. The discs themselves were both mirrored and shadowed,
        | which back in the day made these systems very expensive.
        
       | lostemptations5 wrote:
       | So if Tandem is so out of favour these days, what do people and
       | organizations use? AWS availability zones, etc?
        
       ___________________________________________________________________
       (page generated 2024-10-12 23:01 UTC)