[HN Gopher] Cores that don't count [pdf]
       ___________________________________________________________________
        
       Cores that don't count [pdf]
        
       Author : rajeevk
       Score  : 159 points
       Date   : 2021-06-03 08:46 UTC (14 hours ago)
        
 (HTM) web link (sigops.org)
 (TXT) w3m dump (sigops.org)
        
       | tylfin wrote:
       | Can't reproduce the issue after a few minutes? Sorry wont-fix,
       | mercurial core.
       | 
       | Joking aside, it's really neat to see the scale and
       | sophistication of error detection appearing in these data
       | centers.
        
       | guidovranken wrote:
       | Site is down, archive link:
       | https://web.archive.org/web/20210602080638/https://sigops.or...
       | 
       | What stands out to me:
       | 
       | - "Mercurial cores are extremely rare" but "we observe on the
       | order of a few mercurial cores per several thousand machines". On
       | average one core per 1000 machines is faulty? That's quite a high
       | rate.
       | 
       | - Vendors surely must know about this? If not by testing then
       | through experiencing the failures in their company servers.
       | 
       | - I've read the whole paper and I see no mention of them even
        | reaching out to vendors about this issue. There are strong
       | incentives on both sides to solve or mitigate this issue so why
       | aren't they working together?
        
         | thanatos519 wrote:
         | https://www.goodreads.com/quotes/691547-wherever-i-m-going-i...
        
         | tails4e wrote:
          | That rate sounds too high. Typically scan test gives 99.9%
          | logic coverage, which means random defects must hit that exact
          | uncovered subset of logic to cause a fault undetectable by
          | production test. Given defect rates are low, 1 in 1000 having a
          | fault that got past these tests seems too high. Unless of
          | course Intel does not use scan test and has a more functional
          | type of test method, though even then I'd imagine they must
          | have a high coverage rate.
        
         | ffff1312 wrote:
          | I strongly suspect that they must have reached out and reported
          | these issues to, presumably, Intel as the biggest player here.
          | Likely they're just not disclosing these numbers, and generally
          | not many are talking about these and potentially other CPU
          | issues in public due to NDAs. Either way, it's quite amazing to
          | dig down so deep in the production stack that you must conclude
          | that it's the CPU at fault here. I presume academic research
          | might have a hard time with this given the scale needed to run
          | into these issues, but hopefully we'll see more on this
          | research in the future.
        
           | scottlamb wrote:
           | > Either way, it's quite amazing to dig down so deep in the
           | production stack that you must conclude that it's the CPU at
           | fault here.
           | 
           | Google's internal production stack is much more amenable to
           | that kind of digging than public cloud products:
           | 
           | * You can easily find out what machine a given borg task was
           | running on. In fact, not just your own borg job but anyone's.
           | You can query live state, or you can use Dremel to look up
           | history.
           | 
           | * Similarly, even as a client of Bigtable or Spanner, you can
           | find out the specific tabletservers/spanservers operating on
           | a portion of your database and what machines they're running
           | on. (Not as easy to cross this layer and get to the relevant
           | D servers actually storing the data but I think it's all
           | checksummed here anyway.) If your team has your own
           | partition, you can see tabletserver/spanserver debug logs
           | yourself also.
           | 
           | * There's a convenient frontend for looking up a bunch of
           | diagnostic info for the machine, including failures of borg
           | tasks (were other people's tasks crashing at the same time
           | mine did? what was their crash message?), syslog-level stuff,
           | other machine diagnostics like ECC / MCE errors, and repair
           | history (swapped this DIMM, next attempt will swap this CPU).
           | 
           | It's not unusual for application teams to suspect a machine
           | and basically vote it off the island (I don't want my jobs
           | running here anymore, I cast a vote for it to be repaired /
            | Office Spaced). It's rarer for them to take the time to
            | really understand the problem in detail, like "core 34
            | sometimes returns incorrect results on this computation",
           | although there's nothing in particular stopping them from
           | doing so (other than lack of expertise and a long list of
           | other things to do). The platforms team gets involved
           | sometimes and really digs in--iirc in one bug they mentioned
           | sending a CPU back to the vendor to examine with an electron
           | microscope.
           | 
           | I'm not sure what lessons that offers for a public cloud
           | where that kind of transparency isn't realistic...
        
             | H8crilA wrote:
              | So much for non-disclosure agreements. Or are they not
             | respected at Google?
        
               | scottlamb wrote:
               | I'm not disclosing anything new. Google's official SRE
               | book, research publications, and conference presentations
               | describe the systems I mentioned in more detail.
        
         | redis_mlc wrote:
         | Not sure why you're surprised. Ever seen displays with dead
         | pixels? Same thing.
         | 
         | Usually CPUs are speed-binned and tested for how many cores
         | work, then given different part numbers (and prices.)
         | 
         | Increasing temperature or stressing them will reveal more
         | problems.
        
       | gbrown_ wrote:
       | I'd love to see more details on the defective parts, particularly
       | counts of CPU model (anonymized if needs be) and counts of which
       | part of the architecture exhibited faults.
       | 
       | From working in HPC I've handled reports of things like FMA units
       | producing incorrect results or random appearance of NaNs. Were it
       | not for the fact that we knew these things could happen and
        | customers' intimate knowledge of their codes, I dread to think
        | how "normal" operations would track these issues down. Bad
       | parts went back to the CPU manufacturer and further testing
       | typically confirmed the fault. But that end of the process was
       | pretty much a black box to anyone but the CPU manufacturer. I'd
       | be keen to know more about this too.
        
       | dekhn wrote:
       | I worked on this problem for the past year at Google. It's a
       | fascinating problem. In my subarea I focused on accelerators
       | (like GPUs) running machine learning training.
       | 
       | Many users report problems like "NaN" during training- at some
       | point, the gradients blow up and the job crashes. Sometimes these
       | are caused by specific examples, or numerical errors on the part
       | of the model developer, but sometimes, they are the result of
       | errors from bad cores (during matrix multiplication, embedding
       | lookup, vector op, whatever).
       | 
       | ML is usually pretty tolerant of small amounts of added noise
        | (especially if it's got nice statistical properties) and some
        | training jobs will ride through a ton of uncorrected and
        | undetected errors with few problems. It's a very challenging
        | field to work in because it's hard to know if your NaN is because
        | of your model or your chip.
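        | 
        | A minimal sketch of the kind of triage logic I mean (the
        | retry-on-other-hardware hook is hypothetical, not our actual
        | tooling):
        | 
        |   import numpy as np
        | 
        |   def grads_look_sane(grads):
        |       # Reject a training step whose gradients have NaN/Inf.
        |       return all(np.isfinite(g).all() for g in grads)
        | 
        |   def step_with_recheck(compute_grads, batch, run_elsewhere):
        |       # If a batch blows up, recompute it on other hardware.
        |       # Clean on retry -> suspect the chip; fails again ->
        |       # suspect the model or the data.
        |       grads = compute_grads(batch)
        |       if grads_look_sane(grads):
        |           return grads, "ok"
        |       retry = run_elsewhere(compute_grads, batch)
        |       if grads_look_sane(retry):
        |           return retry, "suspect-hardware"
        |       return retry, "suspect-model-or-data"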
        
         | gbrown_ wrote:
         | Are you able to say anything about the distribution across
          | hardware? E.g., is there any correlation across such faulty
         | parts with serials/production dates or is it very
         | random/insufficient frequency to say?
        
       | dataflow wrote:
        | This is fascinating. I feel like the most straightforward (but
        | hardly efficient) solution would be to provide a way for kernels
        | to ask CPUs to "mirror" pairs of cores, and have the CPUs
        | internally check that the behaviors are identical. Seems like a
        | good way to avoid large-scale data corruption until we develop
        | better techniques...
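        | 
        | Something like this userspace toy is roughly what I have in mind
        | (a Linux-only sketch using os.sched_setaffinity; no claim it
        | catches what a real hardware lockstep design would):
        | 
        |   import multiprocessing as mp
        |   import os
        | 
        |   def _run_pinned(core, fn, args, out):
        |       # Pin this worker to a single core, then run the work.
        |       os.sched_setaffinity(0, {core})
        |       out.put((core, fn(*args)))
        | 
        |   def mirrored(fn, args, cores=(0, 1)):
        |       # Run fn on two different cores and insist they agree.
        |       out = mp.Queue()
        |       procs = [mp.Process(target=_run_pinned,
        |                           args=(c, fn, args, out))
        |                for c in cores]
        |       for p in procs: p.start()
        |       for p in procs: p.join()
        |       results = dict(out.get() for _ in cores)
        |       vals = list(results.values())
        |       if any(v != vals[0] for v in vals):
        |           raise RuntimeError(f"core disagreement: {results}")
        |       return vals[0]
        | 
        |   print(mirrored(pow, (3, 1000)))  # fn must be picklable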
        
         | jacques_chester wrote:
         | Tandem used to do this. By descent the technology wound up with
         | HPE.
         | 
         | Their Tech Reports are worth a sample and fortunately they're
         | online: https://www.hpl.hp.com/hplabs/index/Tandem
         | 
         | Probably the best one to start at:
         | https://www.hpl.hp.com/techreports/tandem/TR-90.5.pdf
        
           | electricshampo1 wrote:
           | Thanks for the reference.
        
         | meepmorp wrote:
         | Mainframes do this. They'll also disable the failing CPUs and
         | place a service call to IBM to get someone to swap out the
         | part.
        
           | dataflow wrote:
           | Wow that's cool. It'd be quite interesting if the conclusion
           | ends up being that we should go back to mainframes...
        
         | smallpipe wrote:
         | That's called dual core lockstep and it's very common in
         | automotive and other applications where reliability is
         | paramount.
        
           | dataflow wrote:
           | Yeah I didn't know! And I just realized this is mentioned in
           | the paper just a little further below where I paused. It
           | seems like it would significantly affect anything shared
           | (like L3 cache)... would Intel and AMD have appetite for
           | adding this kind of thing to x86?
        
             | yaantc wrote:
             | The pair in lockstep is "close", in that it only includes
              | the core and deterministic private resources such as the
              | core-private caches. Shared resources like an L3 cache are
              | outside of the whole pair, and can be seen as accessed by
             | the pair. All output is from the pair and checked for
             | consistency (same for both cores in lockstep) before going
             | out.
             | 
             | Not directly related but some platforms supporting lockstep
             | are flexible: you can use a pair as either 2 cores (perf)
             | or a single logical one (lockstep).
        
       | temac wrote:
        | I don't completely understand the perception that standard
        | non-hardened high-perf CPUs, especially in an industry and more
        | specifically in a segment that has been reported as consistently
        | cutting a few corners in recent years (maybe _somehow_ less than
        | client CPUs, but still), should somehow be exempt from silent
        | faults, because... magic?
       | 
       | If you want extremely high reliability, for critical
       | applications, you use other CPUs. Of course, they are slower.
       | 
        | So the only interesting info that remains is that the defect rate
        | seems way too high and maybe the quality is decreasing in recent
        | years. In which case, when you are Google, you probably could and
        | should complain (strongly) to your CPU vendors, because likely
        | their testing is lacking and their engineering margins too low...
        | (at least if that's really the silicon that is at fault, and not,
        | say for example, the MB)
       | 
        | Now of course it's a little late for the existing ones, but still
        | the sudden realization that "OMG, CPUs do sometimes fail, with a
        | variety of modes, and for a variety of reasons" (including,
        | surprise(?!), aging) seems, if not concentrating on the defect
        | rate, naive. And the potential risk of sometimes having high-rate
        | errors was already very well known, esp. in the presence of
        | software changes and/or heterogeneous software and/or
        | heterogeneous hardware, due to the existence of logical CPU bugs,
        | sometimes also resulting in silent data corruption, and sometimes
        | also with non-deterministic-like behaviors (so the same code can
        | work on one core but not another because of "random" memory
        | controller pressure and delays, and the next time with the two
        | cores reversed).
        
         | ng55QPSK wrote:
          | I think the main point is: we have reached a time in which
         | there are no guarantees anymore that your HW works (until
         | recently we only had no guarantee that the SW works).
         | 
         | Correctly.
        
           | antonvs wrote:
           | When was there ever a guarantee that HW worked?
           | 
           | ECC memory was deprecated in consumer machines 15 years ago
           | or more. This was a conscious industry choice that
           | reliability in hardware could be sacrificed to other
           | concerns. That's just an example.
        
             | dataflow wrote:
             | People use ECC to protect against misbehavior that is
             | random both spatially and temporally. That is, it's not
             | meant to protect against the same transistors producing
             | incorrect outputs consistently/systematically. Put another
             | way, we felt we had a rather safe guarantee that
             | consistent/systematic misbehavior of the same portion of
             | hardware would be either testable (like with memory
             | diagnostics) or nonexistent. This paper is tearing apart
             | that assumption.
        
         | jeffbee wrote:
         | Why assume it was the manufacturer silently cutting the
          | engineering safety margins rather than the customer asking them
         | to do it?
        
           | temac wrote:
            | Because benchmarks sell and are easy to perform, whereas
            | characterising spurious failures is hard, and unreliable CPUs
            | are not in the interest of users of server products (and
            | arguably not in the interest of users of other products as
            | well, but I can see gamers willing to trade stability for a
            | marginal perf improvement).
        
             | jeffbee wrote:
             | OK but we're right now discussing the findings of an
             | organization that owns (or at least once owned) acres of
             | factory-overclocked servers.
        
       | yummypaint wrote:
       | I wonder about the larger feedback loops between hardware error
       | checking in software and the optimizations hardware manufacturers
       | are making at the fab. Presumably more robust software would
       | result in buggier cores being shipped, but would this actually
       | result in more net computation per dollar spent on processors?
        
       | sehugg wrote:
       | Maybe we need something like SMART but for CPU cores?
        
         | mywittyname wrote:
         | A lot of these errors are subtle, making it incredibly
         | difficult to generate a test suite to find them.
        
         | gbrown_ wrote:
          | We have machine-check exceptions; the purpose of this paper is
         | to draw attention to erroneous behavior that is silent.
        
       | smallpipe wrote:
       | Modern embedded cores have self-testing code that detects
       | anywhere from 50% to 90%[1] of faults in the hardware, including
       | from ageing.
       | 
        | If Google and the other hyperscalers complain enough, there's no
        | reason Intel couldn't give them some self-test to run every hour
        | or so.
        | 
        | [1] depends on how complex the CPU is, how long you're willing to
        | run the self-test code, and how well it was done.
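        | 
        | Even without vendor help you can do a crude software version of
        | this today; a sketch (the workload and golden-value scheme are
        | my own assumptions, nothing like a vendor-supplied self-test
        | library):
        | 
        |   import hashlib
        |   import os
        | 
        |   def _workload():
        |       # Deterministic mix of integer, float and hashing work.
        |       h = hashlib.sha256()
        |       acc = 0.0
        |       for i in range(1, 20000):
        |           acc += (i ** 0.5) / (i + 1)
        |           h.update(str(acc).encode())
        |       return h.hexdigest()
        | 
        |   def self_test():
        |       # Run the same workload pinned to every core and flag
        |       # any core whose answer differs. This only catches
        |       # gross, repeatable faults, which is the paper's point.
        |       original = os.sched_getaffinity(0)
        |       golden, bad = None, []
        |       try:
        |           for core in sorted(original):
        |               os.sched_setaffinity(0, {core})
        |               result = _workload()
        |               golden = golden or result
        |               if result != golden:
        |                   bad.append(core)
        |       finally:
        |           os.sched_setaffinity(0, original)
        |       return bad  # suspect core ids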
        
         | gmueckl wrote:
         | These periodic self tests are required for some safety critical
         | applications. My reaction to this paper is that this approach
         | might have to be used in data centers as well. Unfortunately,
         | it can't be done without help from the CPU designers because
         | the test sensitivity relies on knowledge of the exact
         | implementation details of the underlying hardware units.
        
       | yummypaint wrote:
       | Fault tolerance seems to be the fundamental issue looming in the
       | background of both traditional and quantum computing at the
       | moment. Silicon is already at the point where there are only a
       | dozen or so dopant atoms per gate, so a fluctuation of one or two
       | atoms can be enough to impact behavior. It's amazing to me that
       | with billions of transistors things work as well as they do. At
       | some point it might be good to try to re-approach computation
        | from some kind of error-prone analogue of the Turing machine.
        
         | minikites wrote:
         | Is silicon cheap enough that we can do what we do for
         | spacecraft and have multiple processors (where the majority
         | "wins") for fault tolerance?
        
           | icegreentea2 wrote:
            | I think it honestly all depends on what the dominant causal
            | factors are and how they scale with node size. Effectively, if
           | unreliability increases at the same rate or faster than the
           | performance increase as node size decreases, and 'high
           | reliability' compute can be easily and generally segregated
           | from other compute, then it would probably be easier just to
           | not decrease node size rather than parallelize at the
           | chip/core level. Certainly, the software cost would be much
           | easier.
        
           | colechristensen wrote:
           | You could probably do something like using idle cores (or
           | idle hyperthreads) to duplicate instructions on an
           | opportunistic basis to verify outcomes on a less than
           | complete basis. There would be thermal and power consequences
           | but some situations care about that less than others.
        
             | haneefmubarak wrote:
             | Unless the idle hyperthreads are on different cores, you'd
             | most likely have the same execution results. Using idle
             | cores could be interesting, but your thermal and power
             | budget would be shared so your overall performance would
             | still decrease.
             | 
             | This is probably difficult to do at a fine grained level,
             | but I imagine that coarser synchronization and checks (both
             | in software) could provide the necessary assurances that
             | code executing on a single core is consistent with that of
             | other cores.
        
           | maweki wrote:
           | Cheap is probably not the question. The synchronization and
           | communication overhead is enormous.
           | 
           | Would you check that every register assignment matches? Or
           | every page write?
        
             | mirker wrote:
             | You can have a log of N register "transactions", where N is
             | large enough to hide core-to-core communication. If any of
             | N transactions fail due to mismatch between cores, you roll
             | back and throw exceptions.
             | 
             | A lot of this logic is already in out of order execution
             | (e.g., Tomasulo algorithm). Memory has ECC and is probably
             | a different problem.
        
           | mywittyname wrote:
           | If data integrity issues become a problem, it might be
           | cheaper to mark certain cores as fault-tolerant and provide
           | the OS with mechanisms to denote certain threads as requiring
           | fault-tolerance.
        
             | yummypaint wrote:
             | This is a good idea. In a sense this is somewhat available
             | now when choosing whether to run certain things on gpu vs
             | cpu. My understanding is gpus tend to play faster and
             | looser since people don't tend to notice a few occasional
             | weird pixels in a single frame. What if it could be made
             | finer grained by the instruction?
        
           | toast0 wrote:
           | I think there would be a pretty large hardware cost to ensure
           | the input signals come to both processors at the same clock
            | every time on the many high-speed interfaces a modern CPU is
           | using.
           | 
           | And you'd need to eliminate use of non-deterministic CPU
           | local data, like RDRAND and on die temperature, power, etc
           | sensors. Most likely, you'd want to run the CPUs at fixed
           | clock speed to avoid any differences in settling time when
           | switching speeds.
           | 
            | This could probably effectively find broken CPUs (although
           | you wouldn't know which of the pair was broken), but you
           | could still have other broken devices resulting in bad
           | computations. It might be better to run calculations on two
           | separate nodes and compare; possibly only for important
           | calculations.
        
             | mirker wrote:
             | It's not necessary to serialize the full execution to
             | detect errors. On an out of order processor, there is
             | already buffering of results happening that is eventually
             | serialized to visible state. To check errors, you could
             | just have one buffer per processor and compare results
             | before serialization, raising an error on mismatch between
             | processors. Serialization is merely indicating that both
             | _visible_ executions have agreed up to that instruction but
             | it still allows for some local/internal disagreements. For
             | example, two instructions can finish in opposite orders
             | across cores and that is fine as long as they are
             | serialized in order.
             | 
             | As for settling times, those are random anyway. Processors
             | are binned according to how good the settling times ended
             | up being. It's unlikely to have two homogeneous chips.
        
           | jeffbee wrote:
           | The economics will never favor this approach. Customers will
           | not choose to pay double to avoid the 1-in-a-million chance
           | of occasionally getting a slightly wrong answer.
        
             | mikepurvis wrote:
             | Does it have to be double? I know it's not a direct
             | analogue, but parity schemes like RAID 6 or ECC RAM don't
             | double the cost.
             | 
             | So the question is, how do you check these results without
             | actually doing them twice? Is there a role here for
             | frameworks or OS to impose sanity checks? Obviously we
             | already have assertions, but something gentler than a
             | panic, where it says "this is suspect, let's go back and
             | confirm."
        
             | theevilsharpie wrote:
             | > Customers will not choose to pay double to avoid the
             | 1-in-a-million chance of occasionally getting a slightly
             | wrong answer.
             | 
             | With today's high-speed multi-core processors, a 1-in-a-
             | million chance of a computation error would mean tens to
             | hundreds of thousands of errors per second.
        
               | eric__cartman wrote:
               | I can imagine most consumers that do any sort of work
               | with their computer would appreciate close to 100%
               | stability when they need to get work done.
               | 
               | That's usually why no one that depends on their computers
               | to work day in and day out overclocks their components.
               | The marginal performance gains aren't worth the added
               | unreliability and added power/heat/noise footprint.
        
               | sokoloff wrote:
               | The lack of adoption/demand for just ECC RAM by consumers
               | would seem to be an argument in the opposing direction.
               | (Yes, it's not widely available currently, but I think
                | it's safe to say that availability is driven by
               | predictions about adoption given past market behavior.)
        
               | temac wrote:
                | Who decided there is a lack of _consumer_ demand? There
                | is a lack of OEM demand for sure, which is driven by the
                | fact that most companies are willing to sell crap if it
                | can save 1 cent per product. The average consumer does
                | not even know this problem exists. Add to that the
                | artificial market segmentation by Intel, which is
                | absolutely stupid, and the consumer actually _can not_
                | buy a consumer CPU that supports ECC. The situation is
                | then locked into a vicious circle where all the
                | components have ridiculous premiums and lower volumes.
        
               | sokoloff wrote:
               | Do you think the OEMs and Intel generally ignore what
               | consumers demand (and are willing to pay for)? I don't.
        
               | a1369209993 wrote:
               | Yes. For example, Intel Management Engine.
        
               | ncmncm wrote:
               | Consumers are not the customer. System integrators are
               | the customer. They are motivated to minimize the number
               | of distinct manufacturing targets. Consumers have no
               | choice but to take what is offered.
        
           | einpoklum wrote:
           | > Is silicon cheap enough etc.
           | 
           | No, it is not. You can always trade off performance for
           | reliability by repeating your computations several times,
           | preferably with some variation in the distribution/timing of
           | work to avoid the same potential hardware failure pattern.
        
         | dataflow wrote:
         | > Silicon is already at the point where there are only a dozen
         | or so dopant atoms per gate, so a fluctuation of one or two
         | atoms can be enough to impact behavior.
         | 
         | How in the world do they get such a precise number of atoms to
         | land on billions of transistors? It seems so hard for even one
         | transistor.
        
       | freehrtradical wrote:
       | The article references Dixit et al. for an example of a root
        | cause investigation of a CEE, which is an interesting read:
       | https://arxiv.org/pdf/2102.11245.pdf
       | 
       | > After a few iterations, it became obvious that the computation
       | of Int(1.153)=0 as an input to the math.pow function in Scala
       | would always produce a result of 0 on Core 59 of the CPU.
       | However, if the computation was attempted with a different input
       | value set Int(1.152)=142 the result was accurate.
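        | 
        | A rough Python analogue of that kind of pinned reproduction loop
        | (the core number, inputs and reference value are placeholders
        | loosely taken from the quote, not their actual harness):
        | 
        |   import math
        |   import os
        | 
        |   # Reference value computed ahead of time on a known-good
        |   # machine (assumption).
        |   EXPECTED = 1.5328086  # ~= 1.153 ** 3
        | 
        |   def hammer_core(core, trials=100_000):
        |       # Pin to the suspect core and repeat the computation.
        |       os.sched_setaffinity(0, {core})
        |       bad = 0
        |       for _ in range(trials):
        |           if abs(math.pow(1.153, 3) - EXPECTED) > 1e-5:
        |               bad += 1
        |       return bad
        | 
        |   print(hammer_core(59))  # nonzero if the core misbehaves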
        
       | sfvisser wrote:
        | > A deterministic AES mis-computation, which was "self-inverting":
       | encrypting and decrypting on the same core yielded the identity
       | function, but decryption elsewhere yielded gibberish.
       | 
       | Incredible
        
         | buildbot wrote:
         | This specifically stood out to me - it implies it is possible
         | to hide weakened/bad crypto via manufacturing bugs
        
           | muricula wrote:
           | I'm not sure about that. I think it's just that the AES
           | hardware was busted somehow and didn't actually perform the
            | AES algorithm correctly. The Intel AES hardware just
           | deterministically performs the algorithm, so Intel can't just
           | weaken the algorithm somehow, at least if you're not worrying
           | about local side channels.
        
         | twic wrote:
         | Depends on the mode. AES-CTR encrypts an increasing counter to
         | make a keystream, then xors that with the plaintext to make the
         | ciphertext, or with the ciphertext to make the plaintext. Any
         | consistent error in encryption will lead to a consistently
         | wrong keystream, which will round-trip successfully.
         | 
         | It's possible other modes have this property because of the
         | structure of the cipher itself, but that's way out of my
         | league.
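          | 
          | You can see the round-trip property without real AES at all;
          | a toy sketch where the "block cipher" is deliberately broken
          | in a consistent way (the hash-based keystream is just a
          | stand-in, not how AES-NI works):
          | 
          |   import hashlib
          | 
          |   def keystream_block(key, counter, faulty=False):
          |       # Stand-in for AES(key, counter); the "faulty core"
          |       # consistently flips one bit in every block it makes.
          |       blk = hashlib.sha256(
          |           key + counter.to_bytes(16, "big")).digest()[:16]
          |       if faulty:
          |           blk = bytes([blk[0] ^ 0x40]) + blk[1:]
          |       return blk
          | 
          |   def ctr_xor(key, data, faulty=False):
          |       out = bytearray()
          |       for i in range(0, len(data), 16):
          |           ks = keystream_block(key, i // 16, faulty)
          |           chunk = data[i:i + 16]
          |           out += bytes(a ^ b for a, b in zip(chunk, ks))
          |       return bytes(out)
          | 
          |   key, msg = b"0123456789abcdef", b"attack at dawn"
          |   ct = ctr_xor(key, msg, faulty=True)    # bad core encrypts
          |   print(ctr_xor(key, ct, faulty=True))   # round-trips
          |   print(ctr_xor(key, ct, faulty=False))  # gibberish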
        
           | muricula wrote:
            | The Intel AES instructions work at a block level. You build
           | the mode on top of the block level primitives intel gives
           | you. https://en.wikipedia.org/wiki/AES_instruction_set
        
         | formerly_proven wrote:
         | Sounds like a corrupted S-box or something like that in the
         | hardware implementing AES-NI
        
       ___________________________________________________________________
       (page generated 2021-06-03 23:01 UTC)