[HN Gopher] Cores that don't count
       ___________________________________________________________________
        
       Cores that don't count
        
       Author : canthandle
       Score  : 48 points
       Date   : 2021-06-26 07:10 UTC (15 hours ago)
        
 (HTM) web link (muratbuffalo.blogspot.com)
 (TXT) w3m dump (muratbuffalo.blogspot.com)
        
       | YetAnotherNick wrote:
       | > A deterministic AES mis-computation, which was "self-
       | inverting": encrypting and decrypting on the same core yielded
       | the identity function, but decryption elsewhere yielded
       | gibberish.
       | 
       | This is such an unlikely thing to happen that it is likely a fault
       | in the software used to validate it.
        
         | throwaway2048 wrote:
         | all it takes is something that reliably flips a bit in a
         | key-holding register
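
The key-register explanation above can be illustrated with a minimal sketch. It is not AES: it assumes a toy XOR stream cipher built on SHA-256, and `faulty_core` is a hypothetical stand-in for a core that deterministically flips one bit of whatever key it loads. Because that core uses the same wrong key in both directions, its round trip is the identity, while a healthy core decrypting the same ciphertext gets garbage.

```python
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from `key` (toy construction, not AES)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def healthy_core(key: bytes) -> bytes:
    """A core that loads the key into its register unchanged."""
    return key

def faulty_core(key: bytes) -> bytes:
    """Hypothetical defective core: always flips the lowest bit of the first key byte."""
    return bytes([key[0] ^ 0x01]) + key[1:]

def encrypt(core, key: bytes, data: bytes) -> bytes:
    """XOR the data with a keystream derived from the key as the core sees it."""
    stream = keystream(core(key), len(data))
    return bytes(d ^ s for d, s in zip(data, stream))

# An XOR stream cipher is its own inverse, so decryption is the same operation.
decrypt = encrypt

key = bytes(range(16))
message = b"cores that don't count"
ciphertext = encrypt(faulty_core, key, message)

same_core = decrypt(faulty_core, key, ciphertext)    # round trip on the bad core
other_core = decrypt(healthy_core, key, ciphertext)  # decryption elsewhere
```

Here `same_core` equals the original message and `other_core` does not, which is the "self-inverting" behaviour described in the quote, with no fault needed in the validation software.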
        
       | mrtweetyhack wrote:
       | MS will send you more malware in Win11
        
       | sennight wrote:
       | Detection is certainly the harder part - but the rest of it is
       | pretty well-trodden ground. Some POWER9 users noticed that Linux
       | was under-reporting their core count and further investigation
       | showed that the CPU's RAS had identified some cores that were
       | acting a little funky - so it automatically dropped them into a
       | guard partition and logged the action in the persistent circular
       | buffer that nobody had bothered writing open source code to
       | watch. Of course IBM had already written a ton of software that
       | had it covered in AIX. I've noticed the same thing on a lot of
       | other platforms: I'm looking at an HP MicroServer right now that
       | has low-level hardware error logging that never makes it past the
       | BIOS firmware - and I've been meaning to fix that for about two
       | years now. It might be a good idea to tackle that issue before we
       | start getting too clever with solutions.
       | 
       | Dynamic reconfiguration: Basic building blocks for autonomic
       | computing on IBM pSeries servers
       | 
       | https://sci-hub.se/10.1147/sj.421.0029
        
       | extrapickles wrote:
       | What needs to happen is to bring ECC to all levels of a chip's
       | logic to solve this. ARM vs RISC-V vs x64 doesn't address the
       | problem as nothing in them inherently solves the problem. Making
       | adders that add with ECC will.
       | 
       | If all of the logic also operates on ECC with the data, chip
       | yields will also be improved. Say a core of the chip only
       | produces the correct result 99% of the time; currently you have
       | to disable that core. With ECC logic, you can still use it, as it
       | doesn't matter if it has an additional 1% chance of a bit flip,
       | as all of your logic is now immune to single bitflips. For
       | mission critical logic/applications, one can scale up the ECC so
       | it's immune to more bitflips before an error is introduced.
        
         | Someone wrote:
         | I'm fairly sure you can't do that. For example, suppose you
         | have logic that uses some inputs to compute an output:
         | A, B = C
         | 
         | Add ECC bits to the inputs, and you want
         | Aa, Bb = Cc
         | 
         | Now, if you want this to detect errors made by the "=" part,
         | you can't do this as "drop the ECC bits, compute the result,
         | compute the ECC bits of the result".
         | 
         | So, how do you compute the ECC bits from only the Aa and Bb
         | bits without having to compute the C part?
         | 
         | Depending on the ECC logic chosen, that might be doable for bit
         | shifts, but for addition? For multiplication? For IEEE float
         | square roots?
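
One textbook answer to the question above, for integer addition and multiplication at least, is a residue code: the check symbol is the operand modulo a small constant, and the check symbol of the result is computed from the operands' check symbols alone, without looking at C. A minimal sketch, assuming modulus 3, unbounded Python integers (so no fixed-width wrap-around), and hypothetical helper names. It only detects errors, does not correct them, and covers only integer add and multiply.

```python
import operator

# Assumed check modulus. 2**k % 3 is never 0, so flipping any single bit
# of a result always changes its residue mod 3.
MODULUS = 3

def encode(value):
    """Pair an integer with its residue check symbol."""
    return (value, value % MODULUS)

def checked_op(a, b, datapath, residue_op):
    """Run `datapath` on the values and verify it against the check symbols."""
    (a_val, a_res), (b_val, b_res) = a, b
    result = datapath(a_val, b_val)                 # main, possibly faulty, logic
    expected = residue_op(a_res, b_res) % MODULUS   # check path: never sees `result`
    if result % MODULUS != expected:
        raise ArithmeticError("residue mismatch: datapath result is wrong")
    return (result, expected)

def checked_add(a, b, adder=operator.add):
    return checked_op(a, b, adder, operator.add)

def checked_mul(a, b, multiplier=operator.mul):
    return checked_op(a, b, multiplier, operator.mul)

def flaky_adder(x, y):
    """Hypothetical defective adder that flips bit 4 of every sum."""
    return (x + y) ^ 0b10000
```

Calling `checked_add(encode(20), encode(22), adder=flaky_adder)` raises `ArithmeticError`, because the check path arrives at its expected residue independently of the broken adder. The check logic is itself hardware that can fail, and errors that are a multiple of 3 slip through, so this narrows the problem without closing it.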
        
         | jeffbee wrote:
         | What leads us to believe that there is not already fault
         | detection in execution units? We have really no idea what's
         | going on at the gate level in CPUs.
        
           | convolvatron wrote:
           | i know Intel and AMD like to hide hardware features - but
           | given the design cost and the non-negligible overhead..one
           | would suppose that they would at least surface a counter in
           | the documentation..and probably even...you know, market it as
           | a feature.
        
         | Dylan16807 wrote:
         | Is there a way to add ECC into an ALU without making it
         | massively slower?
        
       | evancox100 wrote:
       | The conclusion is an utter joke: "Maybe this will lead to
         | abandonment of complex deep-optimizing chipsets like Intel
       | chipsets, and make simpler chipsets, like ARM chipsets, more
       | popular for datacenter deployments."
        
         | rejectedandsad wrote:
         | Why is that a joke? There's a reason why AWS is pushing
         | Graviton so hard (and they are pretty damn fast)
        
           | Nacraile wrote:
           | Why would one presume that the reason has anything to do with
           | computation error rates, rather than something obvious and
           | mundane like "Graviton instances are more profitable"?
        
             | rejectedandsad wrote:
             | Controlling the entire stack, from control plane to core
             | layout is absolutely better for isolating computation error
             | rates if they occur.
        
       | etaioinshrdlu wrote:
       | Maybe we should run the code twice on different cpu architectures
       | and only accept the results if they are identical. I've heard of
       | high reliability environments doing this, but maybe there are
       | cases for doing it in web/IT as well.
       | 
       | This would help catch a large variety of possible errors,
       | including but not limited to cpu bit flips, cpu bugs, memory
       | errors.
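
The run-twice-and-compare idea can be sketched as follows. In this minimal sketch the "replicas" are plain Python callables standing in for the same job dispatched to machines of different architectures. `run_redundantly` and `ResultMismatch` are hypothetical names, and results are compared with `==`, which assumes the computation is exactly deterministic across replicas.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

class ResultMismatch(Exception):
    """Raised when the replicas disagree; carries every result for inspection."""

def run_redundantly(replicas: Sequence[Callable[[], T]]) -> T:
    """Run the same job on every replica and accept the result only if all agree."""
    if len(replicas) < 2:
        raise ValueError("need at least two replicas to compare")
    results = [replica() for replica in replicas]
    first = results[0]
    if any(result != first for result in results[1:]):
        raise ResultMismatch(results)
    return first

def job_on_arch_a() -> int:
    """Stand-in for the job running on one architecture."""
    return sum(range(10))

def job_on_arch_b() -> int:
    """Stand-in for the same job on a second architecture."""
    return sum(reversed(range(10)))

def job_on_bad_core() -> int:
    """Stand-in for a replica that silently computes the wrong answer."""
    return sum(range(10)) ^ 0b100

agreed = run_redundantly([job_on_arch_a, job_on_arch_b])
```

With two replicas a mismatch only says that something went wrong, not which side is at fault; a third replica would be needed to vote.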
        
         | cmeacham98 wrote:
         | In my experience, achieving perfect determinism across two
         | architectures is a problem that cannot be easily hand waved
         | away.
         | 
         | For 99% of cases it is fine, but when you run into that last 1%
         | it hurts.
        
         | [deleted]
        
         | Dylan16807 wrote:
         | Maybe, but in most situations you're probably better off adding
         | some more developer time to work on software bugs than you are
         | doubling your computation cost.
        
           | slver wrote:
           | The kind of problems we're discussing and software bugs are
           | not the same thing. A memory bit flip is not really a
           | software bug, is it?
        
             | Dylan16807 wrote:
             | Of course they're not the same thing. But they both make
             | things go wrong. There are situations where hardware bugs
             | are important, and there are situations where hardware bugs
             | don't make a meaningful difference.
        
       | einpoklum wrote:
       | Cores that don't count = floating-point coprocessor cores? :-)
       | 
       | But seriously, though,
       | 
       | > I think fail-silent CEEs is weaker than the adversary Byzantine
       | failure model.
       | 
       | Of course they're weaker than byzantine failures. There's time
       | locality, and the failures themselves are not particularly hard
       | to detect if some other core checks the results (although that
       | obviously doesn't happen after every single computation).
        
         | [deleted]
        
         | hypertele-Xii wrote:
         | wait... consensus algorithms exist? can they work for human
         | brains?
         | 
         | experiment :
         | 
         | tell a bunch of people that if they form consensus about, say,
         | a color, they all get $5 (or whatever). have them attempt to
         | reach consensus only using the exact mechanisms of a consensus
         | algorithm. (research what they are)
        
           | ithkuil wrote:
           | You need to specify further environment parameters: Is the
           | communication channel lossy? Can all the participants be
           | trusted? Is there a time limit?
        
           | PaulDavisThe1st wrote:
           | behind the scenes, someone offers one of the participants $10
           | to prevent consensus.
        
           | remram wrote:
           | You mean if people perform a successful vote on something
           | they don't care about, you'll give each $5? You don't think
           | anyone could do it in under a minute?
        
             | hypertele-Xii wrote:
             | Yeah but like not face-to-face. Using the algorithm, but
             | running on humans. Like the old human computers.
        
       | jsnell wrote:
       | Previous discussion:
       | 
       | https://news.ycombinator.com/item?id=27378624
       | 
       | https://news.ycombinator.com/item?id=27408398
        
       ___________________________________________________________________
       (page generated 2021-06-26 23:02 UTC)