[HN Gopher] Computer chips are mercurial: Rare miscalculations f...
       ___________________________________________________________________
        
       Computer chips are mercurial: Rare miscalculations frequent at
       cloud hyperscale
        
       Author : pasttense01
       Score  : 33 points
       Date   : 2021-06-05 20:56 UTC (2 hours ago)
        
 (HTM) web link (www.theregister.com)
 (TXT) w3m dump (www.theregister.com)
        
       | jrockway wrote:
       | Previously: https://news.ycombinator.com/item?id=27378624
        
       | makomk wrote:
       | If this is process node dependent and has started turning up
       | relatively frequently in places like Facebook and Google, I
       | wonder if it's more common in current-gen AMD CPUs than others...
       | the timing would seem to fit, and they're on the smallest process
       | node of anything currently in datacenter use.
        
         | H8crilA wrote:
         | I've seen this for years; it is not new. You have to be running
         | at the scale of tens or hundreds of petabytes of data to
         | experience really quirky anomalies.
         | 
         | The failure rate and failure modes vary from model to model and
         | even from batch to batch. The biggest issue is not that
         | failures exist, but that they're hard to trigger in testing and
         | when they trigger they do not cause a crash. The CPUs that real
         | workloads run on have already been through a lot of testing
         | (manufacturer, cloud provider), so the failures have to slip
         | through all of it.
         | 
         | And the failures can be really nasty: imagine some distributed
         | DB server damaging a large % of database metadata records because
         | 1 instruction malfunctioned, but didn't crash the process. By
         | the time you know something's wrong you cannot recover 70% of
         | the data because you have no idea which data blocks correspond
         | to which logical rows (remember, the metadata is garbled),
         | except through a manual process that can take weeks or months.
         | Or something like a v-table "miss" where instead of something
         | like Table.Info() the CPU calls Table.Drop() because that
         | function happens to be exactly 64 bytes lower in the v-table
         | and has a similar enough signature for the call to succeed. Those
         | are two real examples.
        
           | FartyMcFarter wrote:
           | How do companies deal with this in practice? Running 3 or
           | more redundant servers which vote on the correct result of an
           | operation?
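
The voting scheme asked about above can be sketched roughly as follows. This is only an illustration of the idea in the question, not something the thread says any company actually deploys; the function name `majority_vote` and the simulated replica results are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value reported by a strict majority of replicas.

    Raises ValueError when no value has a strict majority, so the caller
    can retry or flag the machines instead of trusting a bad result.
    """
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no strict majority among replica results")
    return value


# Hypothetical example: three replicas compute 7 * 7, and one of them
# silently returns a wrong answer without crashing.
replica_results = [49, 49, 48]
agreed = majority_vote(replica_results)
```

The obvious cost is that every operation runs three or more times, and the vote itself still executes on some CPU.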
        
             | H8crilA wrote:
             | You don't deal with this, other than fixing the damage and
             | removing the bad machines.
             | 
             | The only way forward is expanded testing; that's what the
             | paper from Google is (also) about. I think this issue will
             | always be with us, to a larger or smaller extent.
             | 
             | There's probably a ton of data corruption out there that
             | happens to be in places where it doesn't really cause big
             | problems.
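
As a minimal sketch of what "expanded testing" could look like in its simplest form, the snippet below repeats deterministic computations and compares each result with a known answer. It is an assumption-laden illustration, not the method from the Google paper: the function name `run_known_answer_checks` and the choice of checks are hypothetical, and a real screener would need to exercise specific instructions on each individual core.

```python
import hashlib
from typing import List

# Known answers, fixed ahead of time on hardware assumed to be good.
SHA256_ABC = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
PRODUCT = 121932631112635269  # 123456789 * 987654321


def run_known_answer_checks(rounds: int = 1000) -> List[str]:
    """Repeat deterministic computations; return a description of each mismatch."""
    failures = []
    # Operands live in variables so the multiply happens at run time
    # rather than being folded into a constant by the compiler.
    a, b = 123456789, 987654321
    for i in range(rounds):
        if a * b != PRODUCT:
            failures.append(f"round {i}: integer multiply mismatch")
        if hashlib.sha256(b"abc").hexdigest() != SHA256_ABC:
            failures.append(f"round {i}: SHA-256 mismatch")
    return failures


failures = run_known_answer_checks(rounds=100)
```

An empty list only means these particular checks passed on this run; as noted in the thread, the failures are hard to trigger in testing, so a clean result is weak evidence.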
        
       ___________________________________________________________________
       (page generated 2021-06-05 23:01 UTC)