[HN Gopher] Crash-only software: More than meets the eye (2006)
       ___________________________________________________________________
        
       Crash-only software: More than meets the eye (2006)
        
       Author : hui-zheng
       Score  : 37 points
       Date   : 2024-04-30 16:37 UTC (6 hours ago)
        
 (HTM) web link (lwn.net)
 (TXT) w3m dump (lwn.net)
        
       | kayodelycaon wrote:
       | Crash-only is really hard to implement if another system is
       | involved that isn't crash-only. If you crash in the middle of a
       | network request, you may not know what state the other system is
       | in.
       | 
       | I've had to deal with buggy mainframe software whose error
       | messages bore no relation to how much of an operation had
       | succeeded. (And there was no way to ask it after the fact...)
       | Welcome to the special hell.
        
         | TheDudeMan wrote:
         | Idempotent APIs + sane timeouts + retries.
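
A minimal sketch of how idempotency keys make retries safe, assuming the remote side can deduplicate by key. `FlakyLedger` and `deposit_with_retries` are hypothetical names invented for illustration, and the "network" is simulated in-process: the service applies the deposit, then loses the reply, which is the ambiguous case described in the parent comment.

```python
import uuid


class FlakyLedger:
    """Hypothetical remote service that deduplicates requests by idempotency key."""

    def __init__(self, drop_first_n_replies):
        self.balance = 0
        self.seen = {}  # idempotency key -> result of the first apply
        self.drops_left = drop_first_n_replies

    def deposit(self, key, amount):
        # Apply the operation at most once per key; replays reuse the stored result.
        if key not in self.seen:
            self.balance += amount
            self.seen[key] = self.balance
        if self.drops_left > 0:
            # Simulate the reply being lost after the work was already done.
            self.drops_left -= 1
            raise TimeoutError("reply lost")
        return self.seen[key]


def deposit_with_retries(ledger, amount, max_attempts=5):
    # One key for the whole logical operation, reused on every attempt.
    key = str(uuid.uuid4())
    for _ in range(max_attempts):
        try:
            return ledger.deposit(key, amount)
        except TimeoutError:
            continue  # safe to retry: the key makes the call idempotent
    raise RuntimeError("gave up after %d attempts" % max_attempts)
```

With two lost replies, the caller still sees a single deposit applied. A real client would presumably add backoff between attempts and persist the key so that a restarted caller can resume the same operation. None of this helps if the other system offers no deduplication at all, which is the situation the parent describes.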
        
       | ashleyn wrote:
       | Probably should consider crash recovery as a second line of
       | defense against lost data, not the primary line of defense. What
       | are the stats on how often crash recovery failed?
        
         | cryptonector wrote:
         | Crashing more often would test that better.
        
         | eslaught wrote:
         | The entire point is that crash recovery fails because you
         | rarely test it. By making it the one and only code path, by
         | definition, you will be testing it all the time, so it is much
         | more likely to work in the first place.
         | 
         | (The obvious counterargument being that if there are different
         | _ways_ in which the software can crash, this is still not an
         | adequate defense.)
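
To make the "one and only code path" idea concrete, here is a small sketch, assuming state is kept in an append-only log of JSON lines. `CrashOnlyCounter` is a hypothetical class, not something from the article. It has no clean-shutdown method, so every start goes through the same recovery routine, including the case where a crash left a half-written record at the end of the log.

```python
import json
import os


class CrashOnlyCounter:
    """Counter whose only startup path is recovery from an append-only log."""

    def __init__(self, path):
        self.path = path
        self.value = 0
        self._recover()  # there is no separate "clean start" path
        self.log = open(path, "ab")

    def _recover(self):
        good_bytes = 0
        if not os.path.exists(self.path):
            return
        with open(self.path, "rb") as f:
            for raw in f:
                if not raw.endswith(b"\n"):
                    break  # torn final write from a crash
                try:
                    record = json.loads(raw)
                except ValueError:
                    break  # corrupt record: stop replaying here
                self.value += record["delta"]
                good_bytes += len(raw)
        # Drop any damaged tail so later appends start on a clean boundary.
        with open(self.path, "r+b") as f:
            f.truncate(good_bytes)

    def add(self, delta):
        self.log.write(json.dumps({"delta": delta}).encode("utf-8") + b"\n")
        self.log.flush()
        os.fsync(self.log.fileno())  # make the record durable before applying it
        self.value += delta
```

Stopping the program is just killing it; the next construction of `CrashOnlyCounter` replays the log. As the parenthetical above notes, this only exercises the crash modes that actually occur, so a sketch like this says nothing about, for example, corruption in the middle of the log.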
        
       | mjb wrote:
       | > Crash-only software is actually more reliable because it takes
       | into account from the beginning an unavoidable fact of computing
       | - unexpected crashes.
       | 
       | This is a critical point for reliable single-machine systems, and
       | for reliable distributed systems. Distributed systems avoid many
       | classes of crashes through redundancy, allowing the overall
       | system to recover (often with no impact) from the failure or
       | crash of a single node. This provides an additional path to crash
       | recovery: recovering from peers or replicas rather than from
       | local state. In turn, this can simplify the tracking of local
       | state (especially the kind of per-replica WAL or redo log
       | accounting that database systems have to do), leading to improved
       | performance and avoiding bugs.
       | 
       | But, as with single-system crashes, distributed systems need to
       | deal with their own reality: correlated failures. These can be
       | caused by correlated infrastructure failures (power, cooling,
       | etc), by operations (e.g. deploying buggy software), or by the
       | very data they're processing (e.g. a "poison pill" that crashes
       | all the redundant nodes at once). And so, like the crash-only
       | case with single-system software, reliable distributed systems
       | need to be designed to recover from these correlated failure
       | cases.
       | 
       | The constants are interestingly different, though. Single-system
       | annual interrupt rates (AIR) are typically in the 1-10% range,
       | while systems spread over multiple datacenters can feasibly see
       | correlated failure rates several orders of magnitude lower. This
       | could argue that having a "bad day" recovery path that's more
       | expensive than regular node recovery is OK. Or, it could argue
       | that the only feasible way of making sure that "bad day" recovery
       | works is to exercise it often (which goes back to the crash-only
       | argument).
        
       | Liftyee wrote:
       | In embedded systems, watchdog timers are often used as a crash
       | mechanism outside of the software itself: the timer will crash
       | the program unless the program resets it in time. I found this
       | concept of crash-only software pretty neat - time to see if I
       | can apply it.
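
The watchdog pattern can be approximated in software, as a rough sketch only. The `Watchdog` class below is hypothetical and built on `threading.Timer` from the Python standard library. Unlike a hardware watchdog it lives inside the process it monitors, so it is an illustration of the kick-or-die contract rather than an equivalent safeguard.

```python
import threading


class Watchdog:
    """Calls on_expire unless kick() is called again within timeout_s seconds."""

    def __init__(self, timeout_s, on_expire):
        self.timeout_s = timeout_s
        self.on_expire = on_expire  # e.g. os.abort in a real deployment
        self._lock = threading.Lock()
        self._timer = None

    def kick(self):
        """Restart the countdown; the main loop must call this regularly."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.timeout_s, self.on_expire)
            self._timer.daemon = True
            self._timer.start()

    def stop(self):
        """Cancel the countdown (used here only to keep examples tidy)."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
```

A main loop would call `kick()` once per iteration; if an iteration hangs, the callback fires and the process is killed, after which the ordinary crash-recovery path takes over on restart.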
        
       ___________________________________________________________________
       (page generated 2024-04-30 23:00 UTC)