[HN Gopher] Cores that don't count
___________________________________________________________________
Cores that don't count
Author : canthandle
Score : 48 points
Date : 2021-06-26 07:10 UTC (15 hours ago)
(HTM) web link (muratbuffalo.blogspot.com)
(TXT) w3m dump (muratbuffalo.blogspot.com)
| YetAnotherNick wrote:
| > A deterministic AES mis-computation, which was "self-
| inverting": encrypting and decrypting on the same core yielded
| the identity function, but decryption elsewhere yielded
| gibberish.
|
| This is such an unlikely thing to happen that it is more
| likely a fault in the software used to validate it.
| throwaway2048 wrote:
| all it takes is something that reliably bitflips a key holding
| register
| mrtweetyhack wrote:
| MS will send you more malware in Win11
| sennight wrote:
| Detection is certainly the harder part - but the rest of it is
| pretty well-trodden ground. Some POWER9 users noticed that linux
| was under-reporting their core count and further investigation
| showed that the CPU's RAS had identified some cores that were
| acting a little funky - so it automatically dropped them into a
| guard partition and logged the action in the persistent circular
| buffer that nobody had bothered writing open source code to
| watch. Of course IBM had already written a ton of software that
| had it covered in AIX. I've noticed the same thing on a lot of
| other platforms, I'm looking at an HP MicroServer right now that
| has low level hardware error logging that never makes it past the
| bios firmware - and I've been meaning to fix that for about two
| years now. It might be a good idea to tackle that issue before we
| start getting too clever with solutions.
|
| Dynamic reconfiguration: Basic building blocks for autonomic
| computing on IBM pSeries servers
|
| https://sci-hub.se/10.1147/sj.421.0029
| extrapickles wrote:
| What needs to happen is to bring ECC to all levels of a chip's
| logic to solve this. ARM vs RISC-V vs x64 doesn't address the
| problem, as nothing in them inherently solves it. Making
| adders that add with ECC will.
|
| If all of the logic also operates on ECC with the data, chip
| yields will improve as well. Say a core of the chip only
| produces the correct result 99% of the time; currently you
| have to disable that core. With ECC logic you can still use
| it, as an additional 1% chance of a bit flip doesn't matter
| when all of your logic is immune to single bit flips. For
| mission-critical logic/applications, one can scale up the ECC
| so it's immune to more bit flips before an error is
| introduced.
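For intuition, the classic single-error-correcting Hamming(7,4) code shows how redundancy absorbs one bit flip. A minimal Python sketch (this illustrates ECC on data words; protecting the logic gates themselves, as the comment proposes, is the harder problem):

```python
def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(c):
    """Correct any single bit flip and return the 4 data bits."""
    c = c[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # parity over positions 2,3,6,7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]   # parity over positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based index of flipped bit; 0 = clean
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

Flipping any one of the seven codeword bits still decodes to the original four data bits, which is the sense in which an occasional single flip "doesn't matter".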
| Someone wrote:
| I'm fairly sure you can't do that. For example, suppose you
| have logic that uses some inputs to compute an output:
| A, B = C
|
| Add ECC bits to the inputs, and you want Aa, Bb
| = Cc
|
| Now, if you want this to detect errors made by the "=" part,
| you can't do this as "drop the ECC bits, compute the result,
| compute the ECC bits of the result".
|
| So, how do you compute the ECC bits from only the Aa and Bb
| bits without having to compute the C part?
|
| Depending on the ECC logic chosen, that might be doable for bit
| shifts, but for addition? For multiplication? For IEEE float
| square roots?
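For addition specifically there is a classical answer: arithmetic residue codes. Because (a + b) mod m == ((a mod m) + (b mod m)) mod m, the check digit of a sum can be predicted from the operands' check digits alone, without trusting the adder. A hedged Python sketch (M = 3 is an illustrative choice; note this detects rather than corrects errors, and misses any fault that happens to preserve the residue):

```python
M = 3  # illustrative check modulus; hardware designs pick cheap-to-check moduli

def residue(x):
    """Check digit carried alongside a value."""
    return x % M

def checked_add(a, ra, b, rb):
    """Add with a residue check. The expected check digit of the sum is
    derived from the operands' check digits only, never from the sum
    itself, so a fault in the adder shows up as a residue mismatch."""
    s = a + b                   # the (possibly faulty) adder
    expected = (ra + rb) % M    # predicted without computing a + b
    if s % M != expected:
        raise ArithmeticError("adder fault detected")
    return s, expected
```

A faulty sum such as 5 + 7 -> 13 fails the check, since 13 mod 3 = 1 but (2 + 1) mod 3 = 0. Multiplication admits the same trick; IEEE square roots do not, which is in line with the comment's skepticism.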
| jeffbee wrote:
| What leads us to believe that there is not already fault
| detection in execution units? We have really no idea what's
| going on at the gate level in CPUs.
| convolvatron wrote:
| i know Intel and AMD like to hide hardware features - but
| given the design cost and the non-negligible overhead..one
| would suppose that they would at least surface a counter in
| the documentation..and probably even...you know, market it as
| a feature.
| Dylan16807 wrote:
| Is there a way to add ECC into an ALU without making it
| massively slower?
| evancox100 wrote:
| The conclusion is an utter joke: "Maybe this will lead to
| abandonment of complex deep-optimizing chipsets like Intel
| chipsets, and make simpler chipsets, like ARM chipsets, more
| popular for datacenter deployments."
| rejectedandsad wrote:
| Why is that a joke? There's a reason why AWS is pushing
| Graviton so hard (and they are pretty damn fast)
| Nacraile wrote:
| Why would one presume that the reason has anything to do with
| computation error rates, rather than something obvious and
| mundane like "Graviton instances are more profitable"?
| rejectedandsad wrote:
| Controlling the entire stack, from control plane to core
| layout is absolutely better for isolating computation error
| rates if they occur.
| etaioinshrdlu wrote:
| Maybe we should run the code twice on different CPU
| architectures and only accept the results if they are
| identical. I've heard of high-reliability environments doing
| this, but maybe there are cases for doing it in web/IT as
| well.
|
| This would help catch a large variety of possible errors,
| including but not limited to CPU bit flips, CPU bugs, and
| memory errors.
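As a single-machine stand-in for that idea, the compare step can be sketched as: run the same program in two independent processes and accept the output only on an exact match. (Both runs here use the same interpreter, so this won't catch architecture-specific divergence; it is purely illustrative.)

```python
import subprocess
import sys

def run_twice_and_compare(code):
    """Run the same computation in two independent interpreter processes
    (stand-ins for two different machines/architectures) and accept the
    result only if the outputs match byte-for-byte."""
    outs = [
        subprocess.run([sys.executable, "-c", code],
                       capture_output=True, check=True).stdout
        for _ in range(2)
    ]
    if outs[0] != outs[1]:
        raise RuntimeError(f"divergent results: {outs[0]!r} vs {outs[1]!r}")
    return outs[0]
```

Real lockstep systems do this comparison in hardware per instruction or per transaction; the software version trades latency and cost for coverage, as the parent comment suggests.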
| cmeacham98 wrote:
| In my experience, achieving perfect determinism across two
| architectures is a problem that cannot easily be hand-waved
| away.
|
| For 99% of cases it is fine, but when you run into that last
| 1%, it hurts.
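A concrete instance of that last 1%: floating-point reductions are bit-identical only if accumulation order and rounding behavior match, which varies with vector width, FMA availability, and compiler choices. Even on one machine, two reasonable summation strategies disagree:

```python
import math

xs = [0.1] * 10

naive = sum(xs)        # left-to-right accumulation, rounding at every step
exact = math.fsum(xs)  # correctly rounded sum of the same values

# The two differ in the last bit, so a byte-for-byte comparison of
# results from differently-ordered reductions would flag a false fault.
print(naive == exact, naive, exact)
```

A cross-architecture checker has to either force one evaluation order everywhere or compare with a tolerance, and both options have their own failure modes.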
| [deleted]
| Dylan16807 wrote:
| Maybe, but in most situations you're probably better off adding
| some more developer time to work on software bugs than you are
| doubling your computation cost.
| slver wrote:
| This kind of problems we're discussing and software bugs are
| not the same thing. A memory bit flip is not really a
| software bug is it
| Dylan16807 wrote:
| Of course they're not the same thing. But they both make
| things go wrong. There are situations where hardware bugs
| are important, and there are situations where hardware bugs
| don't make a meaningful difference.
| einpoklum wrote:
| Cores that don't count = floating-point coprocessor cores? :-)
|
| But seriously, though,
|
| > I think fail-silent CEEs is weaker than the adversary Byzantine
| failure model.
|
| Of course they're weaker than Byzantine failures. There's time
| locality, and the failures in themselves are not particularly
| hard to detect if some other core checks the results (although
| that obviously doesn't happen after every single computation).
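The "some other core checks the results" idea is, in spirit, N-modular redundancy: run the computation on several units and take the majority. A toy Python sketch (FlakyCore is a made-up stand-in for a mercurial core; real TMR dispatches to physically distinct units):

```python
from collections import Counter

def majority_vote(f, x, replicas=3):
    """Evaluate f(x) several times (stand-ins for distinct cores) and
    return the majority answer; a single silent mis-computation is
    outvoted by the healthy replicas."""
    results = [f(x) for _ in range(replicas)]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= replicas // 2:
        raise RuntimeError(f"no majority among {results!r}")
    return value

class FlakyCore:
    """Simulated core that silently miscomputes on its second call."""
    def __init__(self):
        self.calls = 0

    def square(self, v):
        self.calls += 1
        return v * v + (1 if self.calls == 2 else 0)
```

Here `majority_vote(FlakyCore().square, 6)` sees votes of 36, 37, 36 and returns 36; the fail-silent error is caught precisely because a second and third "core" checked the result.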
| [deleted]
| hypertele-Xii wrote:
| wait... consensus algorithms exist? can they work for human
| brains?
|
| experiment :
|
| tell a bunch of people that if they form consensus about, say,
| a color, they all get $5 (or whatever). have them attempt to
| reach consensus only using the exact mechanisms of a consensus
| algorithm. (research what they are)
| ithkuil wrote:
| You need to specify further environment parameters: Is the
| communication channel lossy? Can all the participants be
| trusted? Is there a time limit?
| PaulDavisThe1st wrote:
| behind the scenes, someone offers one of the participants $10
| to prevent consensus.
| remram wrote:
| You mean if people perform a successful vote on something
| they don't care about, you'll give each $5? You don't think
| anyone could do it in under a minute?
| hypertele-Xii wrote:
| Yeah but like not face-to-face. Using the algorithm, but
| running on humans. Like the old human computers.
| jsnell wrote:
| Previous discussion:
|
| https://news.ycombinator.com/item?id=27378624
|
| https://news.ycombinator.com/item?id=27408398
___________________________________________________________________
(page generated 2021-06-26 23:02 UTC)