[HN Gopher] Silent Data Corruptions at Scale
___________________________________________________________________
Silent Data Corruptions at Scale
Author : zdw
Score : 53 points
Date : 2021-06-12 14:49 UTC (8 hours ago)
(HTM) web link (muratbuffalo.blogspot.com)
(TXT) w3m dump (muratbuffalo.blogspot.com)
| paulsutter wrote:
| Perhaps run a complex "idle process" that exercises the major
| functional areas of the CPU and can detect such failures so that
| such cores/CPUs can be isolated or decommissioned. This really
| should be part of Linux.
|
| Anyone know of a suitable program for such a process?
| makomk wrote:
| The general reckoning seems to be that it wouldn't be possible
| to write such a program without internal knowledge available
| only to the CPU manufacturer, and maybe not even then - modern
| CPUs are too complicated and have too much stuff going on
| that's not under direct control of the software running on
| them.
| imperialdrive wrote:
| This article is a little scary to me, and your idea/solution
| sounds pretty clever!
| ot wrote:
| I would assume that most large fleets have background periodic
| tasks that perform basic self-checks (the fleets I know about
| certainly do).
|
| Using "idle" cycles is not a great idea though:
|
| - They may seem "free", but in fact you would end up using more
| power: CPU turns itself off during idle time, and you'd replace
| that with an intensive process. Power (and, as a consequence,
| cooling) is one of the main costs of a data center.
|
| - Machines that are properly utilized (close to 100% CPU
| utilization) would get less coverage, and those are the ones
| that need it the most.
|
| So it is better to allocate a certain percentage of your CPU
| budget to self checks, based on risk and sensitivity of the
| tests. And have some easy way to put a machine under stress
| testing if it is suspected of having rare memory or CPU errors.
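|
| A minimal sketch of the budgeted self-check idea, in Python. The
| names golden_check and run_with_budget are hypothetical. Running
| the workload twice on one machine only catches intermittent
| faults; a real check would compare against a reference digest
| taken from known-good hardware.

```python
import hashlib
import time


def golden_check(rounds: int = 1000) -> bool:
    """Run a deterministic integer workload twice and compare digests."""

    def workload() -> str:
        h = hashlib.sha256()
        x = 1
        for i in range(1, rounds + 1):
            # 64-bit multiply-add, then feed the result to the hash.
            x = (x * 6364136223846793005 + i) % (1 << 64)
            h.update(x.to_bytes(8, "little"))
        return h.hexdigest()

    return workload() == workload()


def run_with_budget(check, budget, iterations,
                    sleep=time.sleep, clock=time.perf_counter):
    """Run `check` repeatedly, sleeping so it uses about `budget` of wall time.

    `budget` is a fraction in (0, 1]. Returns the number of failed checks.
    """
    if not 0.0 < budget <= 1.0:
        raise ValueError("budget must be in (0, 1]")
    failures = 0
    for _ in range(iterations):
        start = clock()
        if not check():
            failures += 1
        busy = clock() - start
        # Sleep long enough that busy / (busy + idle) == budget.
        sleep(busy * (1.0 - budget) / budget)
    return failures
```

| With budget=0.05 the checker sleeps 19 times as long as each
| check took, so it uses roughly 5% of one core regardless of how
| busy the machine is.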
| kevingadd wrote:
| There are a few game engines that do this while running to
| detect bad hardware. The resulting 'bad hardware flag' is
| tracked and forwarded in crash reports to help sort them out
| from the 'real' crash reports (caused by bugs), and the
| information is also shown to the user when they hit an issue
| ('you seem to have bad RAM', etc). It'd be cool to see this
| turned into a reusable library that could be included by
| various types of software that is more likely to be impacted by
| bad hardware and can afford to burn a little cpu/gpu while it's
| running - you wouldn't want it in background services, but a
| game or high performance server app might be able to justify
| it.
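|
| Roughly what such a flag could look like, as a Python sketch.
| The function names and the report fields are assumptions, not
| taken from any particular engine, and a pattern test from a
| high-level language is a weak check because caches can hide a
| bad DIMM.

```python
import json


def memory_pattern_ok(size: int = 1 << 16) -> bool:
    """Write fixed bit patterns to a buffer and verify they read back."""
    buf = bytearray(size)
    for pattern in (0x00, 0xFF, 0xAA, 0x55):
        buf[:] = bytes([pattern]) * size
        # Every byte must still hold the pattern we just wrote.
        if buf.count(pattern) != size:
            return False
    return True


def build_crash_report(error: str, hardware_suspect: bool) -> str:
    """Attach the hardware flag so triage can separate it from real bugs."""
    return json.dumps(
        {"error": error, "bad_hardware_flag": hardware_suspect},
        sort_keys=True,
    )
```

| Usage would be something like
| build_crash_report("segfault", not memory_pattern_ok()).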
| anotherhue wrote:
| zfs scrub might be compelling.
| willvarfar wrote:
| There are programs, used for testing and "fuzzing" compilers,
| that generate random programs based on a seed.
|
| When a node runs a random program, it doesn't know if the
| output is correct. But it could report the seed and result to a
| central database.
|
| Then, if you had several nodes, and two nodes ran the same
| seed and got different output, that would mean something was
| wrong and needed investigation.
|
| There are also programs for reducing such programs down to a
| minimum test case. So once a discrepancy is found, it can be
| reduced to some small program that recreates it.
|
| I once worked on a compiler backend and a CI job generated
| random C programs and compared x86 output against the novel
| CPU simulator. Any discrepancies found were auto reduced by
| these tools and then a ticket was automatically created. Lots
| of our bugs were found and fixed this way.
|
| (My memory is we used C-reduce for the reductions. I can't
| remember the tool we used for generating the test programs, but
| there are several.)
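|
| The seed-and-compare part can be sketched in Python. This is a
| simplification: a seeded arithmetic loop stands in for a
| generated program, and ResultStore is a hypothetical in-memory
| stand-in for the central database.

```python
import hashlib
import random
from collections import defaultdict


def run_seeded_workload(seed: int, steps: int = 1000) -> str:
    """Deterministic integer workload derived entirely from the seed."""
    rng = random.Random(seed)
    acc = 0
    for _ in range(steps):
        a = rng.getrandbits(32)
        b = rng.getrandbits(32) | 1  # force odd so b is never zero
        acc = (acc * 31 + a * b + a // b) % (1 << 64)
    return hashlib.sha256(str(acc).encode()).hexdigest()


class ResultStore:
    """Collects (node, seed, digest) reports and finds disagreements."""

    def __init__(self):
        self._results = defaultdict(dict)  # seed -> {node: digest}

    def report(self, node: str, seed: int, digest: str) -> None:
        self._results[seed][node] = digest

    def discrepancies(self):
        """Seeds for which at least two nodes reported different digests."""
        return sorted(
            seed
            for seed, by_node in self._results.items()
            if len(set(by_node.values())) > 1
        )
```

| A disagreement says only that one of the nodes is wrong, not
| which one; a third node running the same seed would be needed
| to break the tie.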
| hermitcrab wrote:
| I thought this was going to be about the UK government.
| jboggan wrote:
| As someone building numerous Spark workloads this is quite
| concerning.
| TazeTSchnitzel wrote:
| > introduce more [...] asserts statements
|
| I agree that's a good idea, but don't people usually disable
| asserts in production?
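|
| In Python, for instance, running with -O strips assert
| statements, and in C defining NDEBUG does the same. A check
| that has to survive in production needs to be an ordinary
| function call, something like this sketch (check, checked_sum
| and InvariantError are made-up names):

```python
class InvariantError(RuntimeError):
    """Raised when a production invariant check fails."""


def check(condition: bool, message: str) -> None:
    """Like assert, but not removed when Python runs with -O."""
    if not condition:
        raise InvariantError(message)


def checked_sum(values):
    """Sum a list of integers and cross-check by summing in reverse."""
    values = list(values)
    total = sum(values)
    # For integers both orders must agree exactly; a mismatch
    # points at corruption rather than rounding.
    check(total == sum(reversed(values)), "sum mismatch")
    return total
```

| The cost is that the redundant work runs on every call, so it
| only makes sense around computations where a silent wrong
| answer is expensive.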
___________________________________________________________________
(page generated 2021-06-12 23:01 UTC)