[HN Gopher] Computer chips are mercurial: Rare miscalculations f...
___________________________________________________________________
Computer chips are mercurial: Rare miscalculations frequent at
cloud hyperscale
Author : pasttense01
Score : 33 points
Date : 2021-06-05 20:56 UTC (2 hours ago)
(HTM) web link (www.theregister.com)
(TXT) w3m dump (www.theregister.com)
| jrockway wrote:
| Previously: https://news.ycombinator.com/item?id=27378624
| makomk wrote:
| If this is process node dependent and has started turning up
| relatively frequently in places like Facebook and Google, I
| wonder if it's more common in current-gen AMD CPUs than others...
| the timing would seem to fit, and they're on the smallest process
| node of anything currently in datacenter use.
| H8crilA wrote:
| I've seen this for years, it is not new. You have to be running
| at the scale of tens/hundreds of petabytes of data to
| experience really quirky anomalies.
|
| The failure rate and failure modes vary from model to model and
| even from batch to batch. The biggest issue is not that
| failures exist, but that they're hard to trigger in testing and
| when they trigger they do not cause a crash. Any CPU that real
| computation actually runs on has already been through a lot of
| testing (manufacturer, cloud provider), so the failures have to
| slip through all of it.
|
| And the failures can be really nasty: imagine some distributed
| DB server damaging a large % of database metadata records because
| 1 instruction malfunctioned, but didn't crash the process. By
| the time you know something's wrong you cannot recover 70% of
| the data because you have no idea which data blocks correspond
| to which logical rows (remember, the metadata is garbled),
| except through a manual process that can take weeks or months.
| Or something like a v-table "miss" where instead of something
| like Table.Info() the CPU calls Table.Drop() because that
| function happens to be exactly 64 bytes lower in the v-table
| and has a similar enough signature for the call to succeed. Those
| are two real examples.
| FartyMcFarter wrote:
| How do companies deal with this in practice? Running 3 or
| more redundant servers which vote on the correct result of an
| operation?
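The voting idea raised in this question can be sketched in a few lines. This is only an illustration of majority voting over redundant results, not something the thread says any company actually deploys. The names `majority_vote`, `run_redundantly`, `healthy`, and `faulty` are hypothetical, and the "replicas" here are plain local functions standing in for separate machines.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


class NoMajorityError(Exception):
    """Raised when no single result is returned by more than half the replicas."""


def majority_vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value reported by a strict majority of replicas."""
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise NoMajorityError(f"no strict majority among {list(results)!r}")
    return value


def run_redundantly(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Run the same operation on every replica and vote on the answer."""
    return majority_vote([replica(x) for replica in replicas])


def healthy(x: int) -> int:
    # Stand-in for a replica running on a correctly working core.
    return x * x


def faulty(x: int) -> int:
    # Simulated silent miscalculation: wrong answer, no crash.
    return x * x + 1


if __name__ == "__main__":
    print(run_redundantly([healthy, healthy, faulty], 12))  # prints 144
```

With three replicas, one silently wrong result is outvoted; if all three disagree, the sketch raises instead of guessing. The obvious cost is running every operation at least three times.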
| H8crilA wrote:
| You don't deal with this, other than fixing the damage and
| removing the bad machines.
|
| The only way forward is expanded testing, that's what the
| paper from Google is (also) about. I think this issue will
| always be with us, to a larger or smaller extent.
|
| There's probably a ton of data corruption out there that
| happens to be in places where it doesn't really cause big
| problems.
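As a rough illustration of the "expanded testing" mentioned above, the sketch below repeatedly runs a deterministic workload and compares each result with a reference value. It is not the method from the Google paper. It assumes the reference values were computed on a machine believed to be good, and the names `workload`, `screen`, and `flaky` are hypothetical. A real screener would target specific instructions and operating conditions rather than a Python-level loop.

```python
import hashlib
from typing import Callable, Dict, Iterable, List


def workload(seed: int, rounds: int = 1000) -> str:
    """Deterministic integer workload; returns a digest of every intermediate state."""
    state = seed
    digest = hashlib.sha256()
    for _ in range(rounds):
        # 64-bit linear congruential step: same input must give the same output.
        state = (state * 6364136223846793005 + 1442695040888963407) % (1 << 64)
        digest.update(state.to_bytes(8, "little"))
    return digest.hexdigest()


def screen(
    compute: Callable[[int], str],
    seeds: Iterable[int],
    reference: Dict[int, str],
    repeats: int = 3,
) -> List[int]:
    """Return the seeds for which any repeat disagreed with the reference."""
    mismatched = []
    for seed in seeds:
        for _ in range(repeats):
            if compute(seed) != reference[seed]:
                mismatched.append(seed)
                break
    return mismatched


if __name__ == "__main__":
    seeds = range(5)
    # Assumption: in practice these come from a known-good machine.
    reference = {s: workload(s) for s in seeds}

    def flaky(seed: int) -> str:
        # Simulated bad core: silently wrong for one particular input.
        return "corrupted" if seed == 2 else workload(seed)

    print(screen(workload, seeds, reference))  # prints []
    print(screen(flaky, seeds, reference))     # prints [2]
```

Repeating each input matters because, as noted above, these failures are hard to trigger; a single clean pass says little about a core that miscalculates only occasionally.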
___________________________________________________________________
(page generated 2021-06-05 23:01 UTC)