[HN Gopher] On rebooting: the unreasonable effectiveness of turn...
       ___________________________________________________________________
        
       On rebooting: the unreasonable effectiveness of turning computers
       off and on ag
        
       Author : todsacerdoti
       Score  : 26 points
       Date   : 2022-05-24 21:08 UTC (1 hour ago)
        
 (HTM) web link (keunwoo.com)
 (TXT) w3m dump (keunwoo.com)
        
       | tingletech wrote:
       | The first law of tech support. Second law is check (unplug and
       | re-plug) all the cables.
        
       | sbf501 wrote:
       | Opposite of the robustness principle:
       | 
       | https://en.wikipedia.org/wiki/Robustness_principle
       | 
       | It's funny, I spent some time developing tools for CPU
       | architects. Both the concept in the OP's anecdote and the above
       | principle don't really apply because logic doesn't break in the
       | same way source code breaks. You don't run a program in HDL, you
       | synthesize a logic flow. One could conceivably test all possible
       | combinations of that logic for errors, but it becomes 2^N
       | combinations where N is the number of inputs and the number of
       | state elements. Since this cannot be tested because the space is
       | huge (excluding hierarchical designs and emulation), you generate
       | targeted test patterns (and many, many mutexes) to pare down the
       | space, and perhaps randomize some of the targeted tests to verify
       | you don't execute out of bounds. And even "out of bounds" is
       | defined by however smart the microarchitect was when they wrote
       | the spec, and that can be wrong too.
       | 
       | The only way to find and fix these bugs is to run trillions of
       | test vectors (called "coverage") and hope you've passed all the
       | known bugs, and not found any new ones.
       | 
       | There are four decades of papers written on hardware validation,
       | so I'm barely scratching the surface, but I think it's a very
       | different perspective compared to how programmers approach the
       | world. I think a lot of the bugs that OP is talking about fall
       | into the hardware logic domain. There isn't really a fallback
       | "throw" or "return status" that you can even check for. Just
       | fault handlers (for the most part).
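
A minimal sketch of the 2^N argument above, in Python rather than HDL. The
"design" here is a hypothetical toy (a parity function with one planted
corner-case bug), and every name in it is an assumption made for
illustration, not anything from the comment or from real validation tooling.
It shows why exhaustive checking works only for tiny N, and why a fixed
budget of random vectors can easily miss a single bad combination.

```python
import itertools
import random


def reference_model(bits):
    """The 'spec': parity (XOR) of all input and state bits."""
    return sum(bits) % 2


def buggy_design(bits):
    """Hypothetical design under test: wrong for exactly one pattern."""
    if all(bits):  # planted corner-case bug: every bit set
        return 1 - reference_model(bits)
    return reference_model(bits)


def state_space_size(n_inputs, n_state):
    """Number of combinations to cover: 2^(inputs + state elements)."""
    return 2 ** (n_inputs + n_state)


def exhaustive_check(n):
    """Compare design and spec on all 2**n patterns; return the failures."""
    return [
        bits
        for bits in itertools.product((0, 1), repeat=n)
        if buggy_design(bits) != reference_model(bits)
    ]


def random_check(n, num_vectors, seed=0):
    """Compare design and spec on num_vectors random patterns."""
    rng = random.Random(seed)
    failures = []
    for _ in range(num_vectors):
        bits = tuple(rng.randint(0, 1) for _ in range(n))
        if buggy_design(bits) != reference_model(bits):
            failures.append(bits)
    return failures
```

With n = 10, exhaustive_check walks all 1024 patterns and finds the one bad
case. With 32 inputs and 32 state bits, state_space_size gives 2^64
patterns, and a random run of a few thousand vectors has only a tiny chance
of landing on the single failing one, which is why targeted patterns are
needed.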
        
       | layer8 wrote:
       | The winning strategy of Mike's parable is also known as
       | _offensive programming_ :
       | https://en.m.wikipedia.org/wiki/Offensive_programming
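
A rough illustration of the contrast, assuming a hypothetical two-state
traffic light (the names Light, defensive_next and offensive_next are made up
for this sketch). The defensive version quietly maps an impossible state to a
default and keeps running; the offensive version treats it as a bug and fails
immediately, at the point where the bad state is first seen.

```python
from enum import Enum


class Light(Enum):
    RED = "red"
    GREEN = "green"


def defensive_next(light):
    """Tolerates garbage: anything that is not RED is treated as GREEN."""
    if light is Light.RED:
        return Light.GREEN
    return Light.RED  # an invalid value is silently absorbed here


def offensive_next(light):
    """Fails loudly on any state the program should never be in."""
    if light is Light.RED:
        return Light.GREEN
    if light is Light.GREEN:
        return Light.RED
    # Explicit raise (not an assert statement) so it survives python -O.
    raise AssertionError(f"unreachable state: {light!r}")
```

Calling defensive_next("purple") returns Light.RED and hides the corruption;
offensive_next("purple") raises, which gives a crash and a stack trace close
to the actual defect.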
        
       | mjevans wrote:
       | The missing link is RE-initialization or re-validation of the
       | expected state.
       | 
       | Offhand I remember some discussion of how old dialup friendly
       | multiplayer games would transfer state. Differential state would
       | be transferred. There might or might not be a checksum. There
       | would be global state refreshes (either periodically or as
       | bandwidth allowed).
       | 
       | The global state refreshes are a different form of
       | re-initialization. The current state is discarded in favor of a
       | canonically approved state.
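
A small sketch of the scheme described above, not taken from any particular
game: the message layout, the Client class, and the choice of CRC-32 over a
JSON encoding are all assumptions made for illustration. The server sends
deltas tagged with a checksum of the full state they should produce; a client
whose state has drifted notices the mismatch and falls back to a full
snapshot, discarding whatever it had.

```python
import json
import zlib


def checksum(state):
    """CRC-32 over a canonical (key-sorted) JSON encoding of the state."""
    return zlib.crc32(json.dumps(state, sort_keys=True).encode("utf-8"))


def make_delta(old, new):
    """Differential update: changed keys, removed keys, checksum of new."""
    changed = {k: v for k, v in new.items() if old.get(k) != v}
    removed = [k for k in old if k not in new]
    return {"set": changed, "del": removed, "crc": checksum(new)}


def apply_delta(state, delta):
    """Return (candidate_state, ok); ok is False if the checksum disagrees."""
    candidate = dict(state)
    candidate.update(delta["set"])
    for key in delta["del"]:
        candidate.pop(key, None)
    return candidate, checksum(candidate) == delta["crc"]


class Client:
    def __init__(self, snapshot):
        self.state = dict(snapshot)

    def on_delta(self, delta):
        """Apply a delta; a False result means 'ask for a full refresh'."""
        self.state, ok = apply_delta(self.state, delta)
        return ok

    def on_full_refresh(self, snapshot):
        """Re-initialize: drop local state in favor of the canonical copy."""
        self.state = dict(snapshot)
```

If a client misses one delta and then applies the next, the checksum no
longer matches and on_delta returns False. The refresh does not try to
repair the drifted state; it replaces it, which is the same move as a reboot
on a smaller scale.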
        
         | foobiekr wrote:
         | We call this station-keeping or anti-entropy in our designs.
        
       | RajT88 wrote:
       | > Is this the best that anyone can do?
       | 
       | No. It's the best that can sometimes be done _quickly_.
       | 
       | Additionally, this doesn't mention the value of a good
       | postmortem. Or the horrors of cloud computing, where restarting
       | things is deemed a good enough mitigation because in the cloud
       | _these things happen_ , and nobody pushes for a good postmortem
       | and repair items.
        
         | mannykannot wrote:
         | The whole article here is an argument for the proposition that,
         | in many cases, this is _not_ just the best compromise taking
         | expediency into account, but the best thing to do, period
         | (there are some caveats in the section titled
         | "Complications"). You are entitled to a contrary view, but you
         | have not said anything to counter the arguments presented here.
         | 
         | The last two sections do include the value of postmortems (or
         | forensic analysis, as the author puts it) though from the
         | perspective that this is more feasible and effective on a
         | system that promptly crashes when things go wrong.
        
       ___________________________________________________________________
       (page generated 2022-05-24 23:00 UTC)