[HN Gopher] Silent Data Corruptions at Scale
       ___________________________________________________________________
        
       Silent Data Corruptions at Scale
        
       Author : zdw
       Score  : 53 points
       Date   : 2021-06-12 14:49 UTC (8 hours ago)
        
 (HTM) web link (muratbuffalo.blogspot.com)
 (TXT) w3m dump (muratbuffalo.blogspot.com)
        
       | paulsutter wrote:
        | Perhaps run a complex "idle process" that exercises the major
        | functional areas of the CPU and can detect such failures, so
        | that the affected cores/CPUs can be isolated or decommissioned.
        | This really should be part of Linux.
       | 
       | Anyone know of a suitable program for such a process?
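        | 
        | A minimal sketch of what such a known-answer check might look
        | like on Linux (the buffer, digest, and per-core iteration count
        | are illustrative placeholders, not a vetted production test):
        | 
        |     import hashlib
        |     import os
        | 
        |     # Fixed buffer plus reference digest. In a real fleet the
        |     # reference would be precomputed on known-good hardware.
        |     BUF = b"\xa5" * (1 << 20)
        |     EXPECTED = hashlib.sha256(BUF).hexdigest()
        | 
        |     def check_core(core: int, iterations: int = 200) -> bool:
        |         """Pin to one core (Linux); repeat a known-answer test."""
        |         os.sched_setaffinity(0, {core})
        |         for _ in range(iterations):
        |             if hashlib.sha256(BUF).hexdigest() != EXPECTED:
        |                 return False      # candidate silent corruption
        |         return True
        | 
        |     if __name__ == "__main__":
        |         cores = sorted(os.sched_getaffinity(0))
        |         suspect = [c for c in cores if not check_core(c)]
        |         print("suspect cores:", suspect or "none")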
        
         | makomk wrote:
          | The general reckoning seems to be that it wouldn't be
          | possible to write such a program without internal knowledge
          | available only to the CPU manufacturer, and maybe not even
          | then - modern CPUs are too complicated and have too much
          | going on that's not under the direct control of the software
          | running on them.
        
         | imperialdrive wrote:
         | This article is a little scary to me, and your idea/solution
         | sounds pretty clever!
        
         | ot wrote:
         | I would assume that most large fleets have background periodic
         | tasks that perform basic self-checks (the fleets I know about
         | certainly do).
         | 
         | Using "idle" cycles is not a great idea though:
         | 
          | - They may seem "free", but in fact you would end up using
          | more power: the CPU turns itself off during idle time, and
          | you'd be replacing that with an intensive process. Power
          | (and, as a consequence, cooling) is one of the main costs of
          | a data center.
         | 
         | - Machines that are properly utilized (close to 100% CPU
         | utilization) would get less coverage, and those are the ones
         | that need it the most.
         | 
          | So it is better to allocate a certain percentage of your CPU
          | budget to self-checks, based on risk and the sensitivity of
          | the tests, and to have an easy way to put a machine under
          | stress testing if it is suspected of having rare memory or
          | CPU errors.
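          | 
          | A duty-cycled wrapper along those lines might look like this
          | (the 1% budget and slice length are illustrative defaults):
          | 
          |     import time
          | 
          |     def run_budgeted(check, budget=0.01, slice_s=0.05):
          |         """Run `check` in bursts that consume roughly
          |         `budget` of one core, sleeping to hold the ratio."""
          |         while True:
          |             start = time.monotonic()
          |             while time.monotonic() - start < slice_s:
          |                 check()            # e.g. a known-answer test
          |             busy = time.monotonic() - start
          |             time.sleep(busy * (1.0 - budget) / budget)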
        
         | kevingadd wrote:
         | There are a few game engines that do this while running to
         | detect bad hardware. The resulting 'bad hardware flag' is
         | tracked and forwarded in crash reports to help sort them out
         | from the 'real' crash reports (caused by bugs), and the
         | information is also shown to the user when they hit an issue
         | ('you seem to have bad RAM', etc). It'd be cool to see this
          | turned into a reusable library that could be included by the
          | kinds of software that are more likely to be affected by bad
          | hardware and can afford to burn a little CPU/GPU while
          | running - you wouldn't want it in background services, but a
          | game or a high-performance server app might be able to
          | justify it.
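          | 
          | A rough sketch of that pattern - a sticky bad-hardware flag
          | set by a background probe and attached to crash reports (the
          | probe, interval, and report format are all illustrative):
          | 
          |     import sys
          |     import threading
          |     import time
          |     import zlib
          | 
          |     _REF = zlib.crc32(b"\x5a" * 65536)
          |     bad_hardware = False     # sticky for the process lifetime
          | 
          |     def _hw_probe(interval_s=60.0):
          |         """Lightweight periodic check on a daemon thread."""
          |         global bad_hardware
          |         while True:
          |             if zlib.crc32(b"\x5a" * 65536) != _REF:
          |                 bad_hardware = True
          |             time.sleep(interval_s)
          | 
          |     def _crash_hook(exc_type, exc, tb):
          |         # Tag reports so triage can split out hardware crashes.
          |         print(f"crash: {exc_type.__name__}: {exc} "
          |               f"(bad_hardware={bad_hardware})", file=sys.stderr)
          |         sys.__excepthook__(exc_type, exc, tb)
          | 
          |     threading.Thread(target=_hw_probe, daemon=True).start()
          |     sys.excepthook = _crash_hook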
        
         | anotherhue wrote:
         | zfs scrub might be compelling.
        
         | willvarfar wrote:
         | There are programs, used for testing and "fuzzing" compilers,
         | that generate random programs based on a seed.
         | 
         | When a node runs a random program, it doesn't know if the
         | output is correct. But it could report the seed and result to a
         | central database.
         | 
          | Then, if you had several nodes, and two nodes ran the same
          | seed and got different output, that would mean something was
          | wrong and needed investigation.
         | 
         | There are also programs for reducing such programs down to a
         | minimum test case. So once a discrepancy is found, it can be
         | reduced to some small program that recreates it.
         | 
          | I once worked on a compiler backend where a CI job generated
          | random C programs and compared the x86 output against output
          | from the novel CPU simulator. Any discrepancies found were
          | automatically reduced by these tools and a ticket was created
          | automatically. Lots of our bugs were found and fixed this
          | way.
         | 
          | (My memory is we used C-Reduce for the reductions. I can't
         | remember the tool we used for generating the test programs, but
         | there are several.)
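          | 
          | A toy Python analogue of that seed-and-compare idea - a
          | deterministic random workload whose (seed, result) pair is
          | reported for cross-node comparison (the arithmetic mix and
          | report format are made up for illustration):
          | 
          |     import json
          |     import random
          |     import socket
          | 
          |     MASK = (1 << 64) - 1
          | 
          |     def random_program(seed, ops=100_000):
          |         """Generate and evaluate a random arithmetic program.
          |         Healthy nodes on the same interpreter agree per seed."""
          |         rng = random.Random(seed)
          |         acc = 1
          |         for _ in range(ops):
          |             x = rng.getrandbits(64)
          |             op = rng.randrange(4)
          |             if op == 0:
          |                 acc = (acc + x) & MASK
          |             elif op == 1:
          |                 acc = (acc - x) & MASK
          |             elif op == 2:
          |                 acc = (acc * x) & MASK
          |             else:
          |                 acc ^= x
          |         return acc
          | 
          |     if __name__ == "__main__":
          |         # In a fleet this record would go to a central store;
          |         # two nodes disagreeing on a seed flags one for triage.
          |         seed = 42
          |         print(json.dumps({"host": socket.gethostname(),
          |                           "seed": seed,
          |                           "result": random_program(seed)}))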
        
       | hermitcrab wrote:
       | I thought this was going to be about the UK government.
        
       | jboggan wrote:
        | As someone building numerous Spark workloads, I find this
        | quite concerning.
        
       | TazeTSchnitzel wrote:
       | > introduce more [...] asserts statements
       | 
       | I agree that's a good idea, but don't people usually disable
       | asserts in production?
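        | 
        | For example, in Python `assert` statements are stripped under
        | `-O`, so a check meant to catch corruption in production has
        | to be an explicit branch (a minimal sketch; the function and
        | CRC check are illustrative):
        | 
        |     import zlib
        | 
        |     def store_block(data: bytes, expected_crc: int) -> None:
        |         # Vanishes when the interpreter runs with `python -O`:
        |         assert zlib.crc32(data) == expected_crc
        |         # An explicit check stays on in production builds:
        |         if zlib.crc32(data) != expected_crc:
        |             raise IOError("checksum mismatch: possible silent "
        |                           "data corruption")
        |         ...  # write `data` somewhere durable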
        
       ___________________________________________________________________
       (page generated 2021-06-12 23:01 UTC)