[HN Gopher] On rebooting: the unreasonable effectiveness of turn...
___________________________________________________________________
On rebooting: the unreasonable effectiveness of turning computers
off and on ag
Author : todsacerdoti
Score : 26 points
Date : 2022-05-24 21:08 UTC (1 hours ago)
(HTM) web link (keunwoo.com)
(TXT) w3m dump (keunwoo.com)
| tingletech wrote:
| The first law of tech support. Second law is check (unplug and
| re-plug) all the cables.
| sbf501 wrote:
| Opposite of the robustness principle:
|
| https://en.wikipedia.org/wiki/Robustness_principle
|
| It's funny, I spent some time developing tools for CPU
| architects. Both the concept in the OP's anecdote, and the above
| principle don't really apply because logic doesn't break in the
| same way source code breaks. You don't run a program in HDL, you
| synthesize a logic flow. One could conceivably test all possible
| combinations of that logic for errors, but it becomes 2^N
| combinations where N is the number of inputs and the number of
| state elements. Since this cannot be tested because the space is
| huge (excluding hierarchical designs and emulation), you generate
| targeted test patterns (and many many mutexes) to pare down the
| space, and perhaps randomize some of the targeted tests to verify
| you don't execute out of bounds. And even "out of bounds" is
| defined by however smart the microarchitect was when they wrote
| the spec, and that can be wrong too.
|
| The only way to find and fix these bugs is to run trillions of
| test vectors (called "coverage") and hope you've passed all the
| known bugs, and not found any new ones.
|
| There are four decades of papers written on hardware validation,
| so I'm barely scratching the surface, but I think it's a very
| different perspective compared to how programmers approach the
| world. I think a lot of the bugs that OP is talking about fall
| into the hardware logic domain. There isn't really a fallback
| "throw" or "return status" that you can even check for. Just
| fault handlers (for the most part).
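To make the 2^N point above concrete, here is a minimal sketch in Python (not HDL). It assumes a toy ripple-carry adder as a stand-in for synthesized logic; the names `ripple_adder`, `reference_adder`, `exhaustive_check`, and `random_check` are hypothetical and do not come from the comment. Exhaustive checking is feasible only for tiny widths, while randomized vectors touch a vanishing fraction of the input space at realistic widths.

```python
import itertools
import random


def reference_adder(a, b, width):
    """Golden model: addition modulo 2**width."""
    return (a + b) % (1 << width)


def ripple_adder(a, b, width):
    """Bit-level ripple-carry adder, standing in for the logic under test."""
    carry = 0
    result = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        s = x ^ y ^ carry
        carry = (x & y) | (carry & (x ^ y))
        result |= s << i
    return result  # the final carry-out is dropped, matching the modulus


def exhaustive_check(width):
    """Check all 2**(2*width) input pairs; returns how many were tested."""
    tested = 0
    for a, b in itertools.product(range(1 << width), repeat=2):
        assert ripple_adder(a, b, width) == reference_adder(a, b, width)
        tested += 1
    return tested


def random_check(width, vectors, seed=0):
    """Check randomly drawn input pairs; returns the fraction of the space hit."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(vectors):
        a = rng.getrandbits(width)
        b = rng.getrandbits(width)
        assert ripple_adder(a, b, width) == reference_adder(a, b, width)
        seen.add((a, b))
    return len(seen) / (1 << (2 * width))
```

With two 4-bit inputs the exhaustive loop covers 256 cases. With two 32-bit inputs the space is 2^64 pairs, so even a very large random run samples almost none of it, and that is before counting any internal state elements.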
| layer8 wrote:
| The winning strategy of Mike's parable is also known as
| _offensive programming_ :
| https://en.m.wikipedia.org/wiki/Offensive_programming
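As a rough illustration of the fail-fast idea behind offensive programming, the sketch below contrasts two ways of handling an input that can only be wrong because of a bug in the caller. The discount functions are hypothetical examples written for this note, not taken from the article or the linked page.

```python
def apply_discount_defensive(price_cents, percent):
    """Defensive style: silently clamp an impossible value and carry on."""
    percent = max(0, min(100, percent))
    return price_cents * (100 - percent) // 100


def apply_discount_offensive(price_cents, percent):
    """Offensive style: an out-of-range percent means a caller bug, so stop."""
    if not 0 <= percent <= 100:
        # Raised explicitly (rather than via `assert`) so that running
        # Python with -O does not strip the check.
        raise AssertionError(f"percent out of range: {percent!r}")
    return price_cents * (100 - percent) // 100
```

The defensive version turns a 150% discount into a free item and hides the bug. The offensive version crashes at the point where the invariant was violated, which is usually the cheapest place to diagnose it.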
| mjevans wrote:
| The missing link is RE-initialization or re-validation of the
| expected state.
|
| Offhand I remember some discussion of how old dialup friendly
| multiplayer games would transfer state. Differential state would
| be transferred. There might or might not be a checksum. There
| would be global state refreshes (either periodically or as
| bandwidth allowed).
|
| The global state refreshes are a different form of
| re-initialization: the current state is discarded in favor of a
| canonically approved state.
| foobiekr wrote:
| We call this station keeping or anti-entropy in our designs.
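The scheme described in this subthread (differential updates, an optional checksum, and periodic global refreshes) can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not how any particular game did it: the `Server` and `Client` classes, the message layout, and the choice of CRC-32 over a JSON encoding are all hypothetical.

```python
import json
import zlib


def checksum(state):
    """CRC-32 over a canonical JSON encoding of the state dictionary."""
    return zlib.crc32(json.dumps(state, sort_keys=True).encode("utf-8"))


class Server:
    def __init__(self):
        self.state = {}

    def update(self, key, value):
        """Change one field and return a small differential message."""
        self.state[key] = value
        return {"delta": {key: value}, "crc": checksum(self.state)}

    def snapshot(self):
        """Return the full canonical state (the global refresh)."""
        return {"full": dict(self.state), "crc": checksum(self.state)}


class Client:
    def __init__(self):
        self.state = {}

    def receive(self, msg):
        """Apply a message; return True if local state matches the server's."""
        if "full" in msg:
            # Re-initialization: discard local state entirely.
            self.state = dict(msg["full"])
        else:
            self.state.update(msg["delta"])
        return checksum(self.state) == msg["crc"]
```

If a delta is lost in transit, later deltas apply cleanly but the checksum no longer matches. The client cannot repair the divergence from deltas alone; the next snapshot replaces its state wholesale, which is the same move as a reboot on a smaller scale.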
| RajT88 wrote:
| > Is this the best that anyone can do?
|
| No. It's the best that can sometimes be done _quickly_.
|
| Additionally, this doesn't mention the value of a good
| postmortem. Or the horrors of cloud computing, where restarting
| things is deemed a good enough mitigation because in the cloud
| _these things happen_ , and nobody pushes for a good postmortem
| and repair items.
| mannykannot wrote:
| The whole article here is an argument for the proposition that,
| in many cases, this is _not_ just the best compromise taking
| expediency into account, but the best thing to do, period
| (there are some caveats in the section titled
| "Complications"). You are entitled to a contrary view, but you
| have not said anything to counter the arguments presented here.
|
| The last two sections do include the value of postmortems (or
| forensic analysis, as the author puts it) though from the
| perspective that this is more feasible and effective on a
| system that promptly crashes when things go wrong.
___________________________________________________________________
(page generated 2022-05-24 23:00 UTC)