[HN Gopher] Crash-only software: More than meets the eye (2006)
___________________________________________________________________
Crash-only software: More than meets the eye (2006)
Author : hui-zheng
Score : 37 points
Date : 2024-04-30 16:37 UTC (6 hours ago)
(HTM) web link (lwn.net)
| kayodelycaon wrote:
| Crash-only is really hard to implement if another system is
| involved that isn't crash-only. If you crash in the middle of a
| network request, you may not know what state the other system is
| in.
|
| I've had to deal with buggy mainframe software whose error
| messages had no relation to how much of an operation had
| succeeded. (And there was no way to ask it after the fact...)
| Welcome to the special hell.
| TheDudeMan wrote:
| Idempotent APIs + sane timeouts + retries.
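|
| A minimal sketch of that combination, assuming the remote side
| can deduplicate on a client-supplied key. FlakyServer, deposit and
| deposit_with_retries are hypothetical names; the "network" is
| simulated in-process, and a lost reply stands in for a timeout.

```python
import uuid


class FlakyServer:
    """Hypothetical remote system: applies a request, then may lose the reply."""

    def __init__(self, fail_first_n):
        self.balance = 0
        self.seen = {}                    # idempotency key -> stored result
        self.fail_remaining = fail_first_n

    def deposit(self, key, amount):
        if key not in self.seen:          # apply at most once per key
            self.balance += amount
            self.seen[key] = self.balance
        if self.fail_remaining > 0:       # simulate the reply timing out
            self.fail_remaining -= 1
            raise TimeoutError("reply lost")
        return self.seen[key]             # replayed requests get the same answer


def deposit_with_retries(server, amount, max_attempts=5):
    key = str(uuid.uuid4())               # one key for every attempt of this operation
    for _ in range(max_attempts):
        try:
            return server.deposit(key, amount)
        except TimeoutError:
            continue                      # a real client would back off here
    raise RuntimeError("gave up after retries")
```

| The caller cannot tell whether a timed-out attempt was applied,
| but because every retry reuses the same key it does not need to
| know. This only helps if the other system actually honours the
| key, which is the hard part kayodelycaon describes.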
| ashleyn wrote:
| Probably should consider crash recovery as a second line of
| defense against lost data, not the primary line of defense. What
| are the stats on how often crash recovery failed?
| cryptonector wrote:
| Crashing more often would test that better.
| eslaught wrote:
| The entire point is that crash recovery fails because you
| rarely test it. By making it the one and only code path, by
| definition, you will be testing it all the time, so it is much
| more likely to work in the first place.
|
| (The obvious counterargument being that if there are different
| _ways_ in which the software can crash, this is still not an
| adequate defense.)
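|
| A toy illustration of "recovery is the one and only code path",
| assuming an append-only journal on local disk. The Counter class
| and its journal format are invented for this sketch; there is
| deliberately no clean-shutdown method, so every start goes
| through recovery.

```python
import json
import os


class Counter:
    """Toy crash-only service: state is rebuilt from a journal on every start."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.value = 0
        self._recover()                           # the only startup path
        self.journal = open(journal_path, "ab")

    def _recover(self):
        if not os.path.exists(self.journal_path):
            return
        good = 0                                  # byte offset of the last intact record
        with open(self.journal_path, "rb") as f:
            for raw in f:
                try:
                    if not raw.endswith(b"\n"):   # record cut off by a crash mid-write
                        raise ValueError("torn record")
                    self.value += json.loads(raw)["add"]
                except (ValueError, KeyError, TypeError):
                    break
                good += len(raw)
        with open(self.journal_path, "r+b") as f:
            f.truncate(good)                      # drop the torn tail before appending

    def add(self, n):
        self.journal.write(json.dumps({"add": n}).encode() + b"\n")
        self.journal.flush()
        os.fsync(self.journal.fileno())           # durable before we acknowledge
        self.value += n
```

| Stopping this service means killing it, and starting it means
| replaying the journal, so the recovery code runs on every single
| start. As the counterargument notes, that exercises only the
| crash shapes that actually occur (here, a torn final record).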
| mjb wrote:
| > Crash-only software is actually more reliable because it takes
| into account from the beginning an unavoidable fact of computing
| - unexpected crashes.
|
| This is a critical point for reliable single-machine systems, and
| for reliable distributed systems. Distributed systems avoid many
| classes of crashes through redundancy, allowing the overall
| system to recover (often with no impact) from the failure or
| crash of a single node. This provides an additional path to crash
| recovery: recovering from peers or replicas rather than from
| local state. In turn, this can simplify the tracking of local
| state (especially the kind of per-replica WAL or redo log
| accounting that database systems have to do), leading to improved
| performance and avoiding bugs.
|
| But, as with single-system crashes, distributed systems need to
| deal with their own reality: correlated failures. These can be
| caused by correlated infrastructure failures (power, cooling,
| etc.), by operations (e.g. deploying buggy software), or by the
| very data they're processing (e.g. a "poison pill" that crashes
| all the redundant nodes at once). And so, like the crash-only
| case with single-system software, reliable distributed systems
| need to be designed to recover from these correlated failure
| cases.
|
| The constants are interestingly different, though. Single-system
| annual interrupt rates (AIR) are typically in the 1-10% range,
| while systems spread over multiple datacenters can feasibly see
| correlated failure rates several orders of magnitude lower. This
| could argue that having a "bad day" recovery path that's more
| expensive than regular node recovery is OK. Or, it could argue
| that the only feasible way of making sure that "bad day" recovery
| works is to exercise it often (which goes back to the crash-only
| argument).
| Liftyee wrote:
| In embedded systems, watchdog timers are often used as a crash
| mechanism outside of the software itself: the watchdog resets the
| program unless the program periodically resets the timer. I found
| this concept of crash-only software pretty neat - time to see if
| I can apply it.
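|
| For readers who have not used one, a host-side simulation of the
| watchdog idea. Real watchdogs are hardware peripherals with
| vendor-specific registers; the Watchdog class, kick() and poll()
| below are hypothetical stand-ins, and the injectable clock exists
| only to make the sketch testable.

```python
import time


class Watchdog:
    """Software stand-in for a hardware watchdog timer (hypothetical API)."""

    def __init__(self, timeout_s, on_expire, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.on_expire = on_expire        # on real hardware: a forced reset
        self.clock = clock
        self.deadline = clock() + timeout_s

    def kick(self):
        """Called by healthy application code to push the deadline out."""
        self.deadline = self.clock() + self.timeout_s

    def poll(self):
        """Stands in for the hardware counter; fires if the deadline has passed."""
        if self.clock() >= self.deadline:
            self.on_expire()
            return True
        return False
```

| The main loop calls kick() once per healthy iteration. If it
| hangs, nothing kicks the timer and the expiry handler forces a
| restart, which in a crash-only design is just the normal startup
| path.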
___________________________________________________________________
(page generated 2024-04-30 23:00 UTC)