[HN Gopher] Cores that don't count [pdf]
___________________________________________________________________
Cores that don't count [pdf]
Author : rajeevk
Score : 159 points
Date : 2021-06-03 08:46 UTC (14 hours ago)
(HTM) web link (sigops.org)
(TXT) w3m dump (sigops.org)
| tylfin wrote:
| Can't reproduce the issue after a few minutes? Sorry, won't-fix:
| mercurial core.
|
| Joking aside, it's really neat to see the scale and
| sophistication of error detection appearing in these data
| centers.
| guidovranken wrote:
| Site is down, archive link:
| https://web.archive.org/web/20210602080638/https://sigops.or...
|
| What stands out to me:
|
| - "Mercurial cores are extremely rare" but "we observe on the
| order of a few mercurial cores per several thousand machines". On
| average one core per 1000 machines is faulty? That's quite a high
| rate.
|
| - Vendors surely must know about this? If not by testing then
| through experiencing the failures in their company servers.
|
| - I've read the whole paper and I see no mention of them even
| reaching out to vendors about this issue. There are strong
| incentives on both sides to solve or mitigate this issue, so why
| aren't they working together?
| thanatos519 wrote:
| https://www.goodreads.com/quotes/691547-wherever-i-m-going-i...
| tails4e wrote:
| That rate sounds too high. Typically scan test gives 99.9%
| logic coverage, which means random defects must hit exactly that
| uncovered subset of logic to cause a fault undetectable by
| production test. Given that defect rates are low, 1 in 1000 parts
| having a fault that got past these tests seems too high. Unless of
| course Intel does not use scan test and relies on a more
| functional style of testing, though even then I'd imagine they
| must have a high coverage rate.
| ffff1312 wrote:
| I strongly suspect they must have reached out and reported
| these issues to, presumably, Intel as the biggest player here.
| Likely they're just not disclosing these numbers, and generally
| not many are talking about this and potentially other CPU
| issues in public due to NDAs. Either way, it's quite amazing to
| dig down so deep in the production stack that you must conclude
| that it's the CPU at fault here. I presume academic research
| might have a hard time with this given the scale needed to run
| into these issues, but hopefully we'll see more of this
| research in the future.
| scottlamb wrote:
| > Either way, it's quite amazing to dig down so deep in the
| production stack that you must conclude that it's the CPU at
| fault here.
|
| Google's internal production stack is much more amenable to
| that kind of digging than public cloud products:
|
| * You can easily find out what machine a given borg task was
| running on. In fact, not just your own borg job but anyone's.
| You can query live state, or you can use Dremel to look up
| history.
|
| * Similarly, even as a client of Bigtable or Spanner, you can
| find out the specific tabletservers/spanservers operating on
| a portion of your database and what machines they're running
| on. (Not as easy to cross this layer and get to the relevant
| D servers actually storing the data but I think it's all
| checksummed here anyway.) If your team has your own
| partition, you can see tabletserver/spanserver debug logs
| yourself also.
|
| * There's a convenient frontend for looking up a bunch of
| diagnostic info for the machine, including failures of borg
| tasks (were other people's tasks crashing at the same time
| mine did? what was their crash message?), syslog-level stuff,
| other machine diagnostics like ECC / MCE errors, and repair
| history (swapped this DIMM, next attempt will swap this CPU).
|
| It's not unusual for application teams to suspect a machine
| and basically vote it off the island (I don't want my jobs
| running here anymore, I cast a vote for it to be repaired /
| Office Spaced). It's rarer for them to take the time to really
| understand the problem in detail, like "core 34 sometimes
| returns incorrect results on this computation",
| although there's nothing in particular stopping them from
| doing so (other than lack of expertise and a long list of
| other things to do). The platforms team gets involved
| sometimes and really digs in--iirc in one bug they mentioned
| sending a CPU back to the vendor to examine with an electron
| microscope.
|
| I'm not sure what lessons that offers for a public cloud
| where that kind of transparency isn't realistic...
| H8crilA wrote:
| So much for non-disclosure agreements. Or are they not
| respected at Google?
| scottlamb wrote:
| I'm not disclosing anything new. Google's official SRE
| book, research publications, and conference presentations
| describe the systems I mentioned in more detail.
| redis_mlc wrote:
| Not sure why you're surprised. Ever seen displays with dead
| pixels? Same thing.
|
| Usually CPUs are speed-binned and tested for how many cores
| work, then given different part numbers (and prices.)
|
| Increasing temperature or stressing them will reveal more
| problems.
| gbrown_ wrote:
| I'd love to see more details on the defective parts, particularly
| counts per CPU model (anonymized if need be) and counts of which
| part of the architecture exhibited faults.
|
| From working in HPC I've handled reports of things like FMA units
| producing incorrect results or random appearances of NaNs. Were it
| not for the fact that we knew these things could happen, and the
| customers' intimate knowledge of their own codes, I dread to think
| how "normal" operations would track these issues down. Bad
| parts went back to the CPU manufacturer and further testing
| typically confirmed the fault. But that end of the process was
| pretty much a black box to anyone but the CPU manufacturer. I'd
| be keen to know more about this too.
| dekhn wrote:
| I worked on this problem for the past year at Google. It's a
| fascinating problem. In my subarea I focused on accelerators
| (like GPUs) running machine learning training.
|
| Many users report problems like "NaN" during training- at some
| point, the gradients blow up and the job crashes. Sometimes these
| are caused by specific examples, or numerical errors on the part
| of the model developer, but sometimes, they are the result of
| errors from bad cores (during matrix multiplication, embedding
| lookup, vector op, whatever).
|
| ML is usually pretty tolerant of small amounts of added noise
| (especially if it's got nice statistical properties) and some
| training jobs will ride through a ton of uncorrected and
| undetected errors with few problems. It's a very challenging
| field to work in because it's hard to know if your NaN is because
| of your model or your chip.
| gbrown_ wrote:
| Are you able to say anything about the distribution across
| hardware? E.g., is there any correlation of such faulty
| parts with serials/production dates, or is it too random /
| too infrequent to say?
| dataflow wrote:
| This is fascinating. I feel like the most straightforward (but
| hardly efficient) solution is to provide a way for kernels to ask
| CPUs to "mirror" pairs of cores, and have the CPUs internally
| check that the behaviors are identical? Seems like a good way to
| avoid large scale data corruption until we develop better
| techniques...
| jacques_chester wrote:
| Tandem used to do this. By descent the technology wound up with
| HPE.
|
| Their Tech Reports are worth a sample and fortunately they're
| online: https://www.hpl.hp.com/hplabs/index/Tandem
|
| Probably the best one to start at:
| https://www.hpl.hp.com/techreports/tandem/TR-90.5.pdf
| electricshampo1 wrote:
| Thanks for the reference.
| meepmorp wrote:
| Mainframes do this. They'll also disable the failing CPUs and
| place a service call to IBM to get someone to swap out the
| part.
| dataflow wrote:
| Wow that's cool. It'd be quite interesting if the conclusion
| ends up being that we should go back to mainframes...
| smallpipe wrote:
| That's called dual core lockstep and it's very common in
| automotive and other applications where reliability is
| paramount.
| dataflow wrote:
| Yeah I didn't know! And I just realized this is mentioned in
| the paper just a little further below where I paused. It
| seems like it would significantly affect anything shared
| (like L3 cache)... would Intel and AMD have appetite for
| adding this kind of thing to x86?
| yaantc wrote:
| The pair in lockstep is "close", in that it only includes
| the core and deterministic private resources like core-
| private caches. Shared resources like an L3 cache are
| outside of the pair, and can be seen as accessed by the
| pair as a whole. All output from the pair is checked for
| consistency (it must be the same for both cores in lockstep)
| before going out.
|
| Not directly related but some platforms supporting lockstep
| are flexible: you can use a pair as either 2 cores (perf)
| or a single logical one (lockstep).
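|
| A minimal software analogy of that arrangement (everything here is
| invented for illustration; real lockstep is done in hardware at the
| point results leave the pair, not in Python):
|
|     # Toy sketch: the pair's result only reaches shared state after
|     # both copies agree. Running fn twice in one process obviously
|     # doesn't exercise two physical cores; the point is only where
|     # the comparison sits relative to shared resources.
|     def lockstep(fn, *args):
|         a = fn(*args)   # "core A" of the pair
|         b = fn(*args)   # "core B" of the pair
|         if a != b:
|             raise RuntimeError(f"lockstep mismatch: {a!r} vs {b!r}")
|         return a        # only now does the value leave the pair
|
|     shared_l3 = {}      # stands in for state outside the pair (L3/DRAM)
|     shared_l3["sum"] = lockstep(sum, range(1000))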
| temac wrote:
| I don't completely understand the perception that standard non-
| hardened high-perf CPUs, especially in an industry and more
| specifically in a segment that has been reported as consistently
| cutting a few corners in recent years (maybe _somewhat_ less than
| client CPUs, but still), should somehow be exempt from silent
| defects, because... magic?
|
| If you want extremely high reliability, for critical
| applications, you use other CPUs. Of course, they are slower.
|
| So the only interesting info that remains is that the defect rate
| seems way too high and maybe the quality is decreasing in recent
| years. In which case, when you are Google, you probably could and
| should complain (strongly) to your CPU vendors, because likely
| their testing is lacking and their engineering margins too low...
| (at least if it's really the silicon that is at fault, and not,
| say, the motherboard)
|
| Now of course it's a little late for the existing parts, but still
| the sudden realization that "OMG, CPUs do sometimes fail, with a
| variety of modes, and for a variety of reasons" (including,
| surprise(?!), aging) seems naive, if the point is not the defect
| rate itself. And the potential risk of sometimes having high error
| rates was already very well known, esp. in the presence of software
| changes and/or heterogeneous software and/or heterogeneous
| hardware, due to the existence of logical CPU bugs, which sometimes
| also result in silent data corruption, and sometimes with
| non-deterministic-looking behavior (so a workload can work on one
| core but not another because of "random" memory controller pressure
| and delays, and the next time with the two cores reversed).
| ng55QPSK wrote:
| I think the main point is: we have reached a time in which
| there are no guarantees anymore that your HW works (until
| recently we only had no guarantee that the SW works).
|
| Correctly.
| antonvs wrote:
| When was there ever a guarantee that HW worked?
|
| ECC memory was deprecated in consumer machines 15 years ago
| or more. This was a conscious industry choice that
| reliability in hardware could be sacrificed to other
| concerns. That's just an example.
| dataflow wrote:
| People use ECC to protect against misbehavior that is
| random both spatially and temporally. That is, it's not
| meant to protect against the same transistors producing
| incorrect outputs consistently/systematically. Put another
| way, we felt we had a rather safe guarantee that
| consistent/systematic misbehavior of the same portion of
| hardware would be either testable (like with memory
| diagnostics) or nonexistent. This paper is tearing apart
| that assumption.
| jeffbee wrote:
| Why assume it was the manufacturer silently cutting the
| engineering safety margins rather than the customer asking them
| to do it?
| temac wrote:
| Because benchmarks sell and are easy to perform, whereas
| characterising spurious failures is hard, and unreliable CPUs
| are not in the interest of users of server products (and
| arguably not in the interest of users of other products either,
| but I can see gamers willing to trade stability for a marginal
| perf improvement).
| jeffbee wrote:
| OK but we're right now discussing the findings of an
| organization that owns (or at least once owned) acres of
| factory-overclocked servers.
| yummypaint wrote:
| I wonder about the larger feedback loops between hardware error
| checking in software and the optimizations hardware manufacturers
| are making at the fab. Presumably more robust software would
| result in buggier cores being shipped, but would this actually
| result in more net computation per dollar spent on processors?
| sehugg wrote:
| Maybe we need something like SMART but for CPU cores?
| mywittyname wrote:
| A lot of these errors are subtle, making it incredibly
| difficult to generate a test suite to find them.
| gbrown_ wrote:
| We have machine-check exceptions; the purpose of this paper is
| to draw attention to erroneous behavior that is silent.
| smallpipe wrote:
| Modern embedded cores have self-testing code that detects
| anywhere from 50% to 90%[1] of faults in the hardware, including
| faults from ageing.
|
| If Google and the other hyperscalers complain enough, there's no
| reason Intel couldn't give them a self test to run every hour
| or so.
|
| [1] Depends on how complex the CPU is, how long you're willing to
| run the self-test code, and how well it was written.
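|
| A very rough software-level stand-in for that idea (names here are
| made up; a real self-test library is written by the vendor against
| the actual gate-level design and targets specific logic; this just
| re-checks a few known answers periodically on whatever core the OS
| happens to schedule it on):
|
|     import hashlib
|     import math
|     import time
|
|     # In practice the golden values would be precomputed constants,
|     # not recomputed at startup by a possibly-faulty core.
|     GOLDEN = {
|         "int":   sum(i * i for i in range(10_000)),
|         "float": math.fsum(1.0 / i for i in range(1, 10_000)),
|         "hash":  hashlib.sha256(b"known answer" * 1000).hexdigest(),
|     }
|
|     def self_test():
|         checks = {
|             "int":   sum(i * i for i in range(10_000)),
|             "float": math.fsum(1.0 / i for i in range(1, 10_000)),
|             "hash":  hashlib.sha256(b"known answer" * 1000).hexdigest(),
|         }
|         return [name for name in GOLDEN if checks[name] != GOLDEN[name]]
|
|     while True:
|         failed = self_test()
|         if failed:
|             print("self-test mismatch in:", failed)
|         time.sleep(3600)   # "every hour or so"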
| gmueckl wrote:
| These periodic self tests are required for some safety critical
| applications. My reaction to this paper is that this approach
| might have to be used in data centers as well. Unfortunately,
| it can't be done without help from the CPU designers because
| the test sensitivity relies on knowledge of the exact
| implementation details of the underlying hardware units.
| yummypaint wrote:
| Fault tolerance seems to be the fundamental issue looming in the
| background of both traditional and quantum computing at the
| moment. Silicon is already at the point where there are only a
| dozen or so dopant atoms per gate, so a fluctuation of one or two
| atoms can be enough to impact behavior. It's amazing to me that
| with billions of transistors things work as well as they do. At
| some point it might be good to try to re-approach computation
| from some kind of error-prone analogue of the Turing machine.
| minikites wrote:
| Is silicon cheap enough that we can do what we do for
| spacecraft and have multiple processors (where the majority
| "wins") for fault tolerance?
| icegreentea2 wrote:
| I think it honestly all depends on what the dominant causal
| factors are and how they scale with node size. Effectively, if
| unreliability increases at the same rate or faster than the
| performance increase as node size decreases, and 'high
| reliability' compute can be easily and generally segregated
| from other compute, then it would probably be easier just to
| not decrease node size rather than parallelize at the
| chip/core level. Certainly, the software cost would be much
| lower.
| colechristensen wrote:
| You could probably do something like using idle cores (or
| idle hyperthreads) to duplicate instructions opportunistically,
| verifying outcomes on a less-than-complete basis. There would be
| thermal and power consequences, but some situations care about
| that less than others.
| haneefmubarak wrote:
| Unless the idle hyperthreads are on different cores, you'd
| most likely have the same execution results. Using idle
| cores could be interesting, but your thermal and power
| budget would be shared so your overall performance would
| still decrease.
|
| This is probably difficult to do at a fine grained level,
| but I imagine that coarser synchronization and checks (both
| in software) could provide the necessary assurances that
| code executing on a single core is consistent with that of
| other cores.
| maweki wrote:
| Cheap is probably not the question. The synchronization and
| communication overhead is enormous.
|
| Would you check that every register assignment matches? Or
| every page write?
| mirker wrote:
| You can have a log of N register "transactions", where N is
| large enough to hide core-to-core communication. If any of the
| N transactions fails due to a mismatch between cores, you roll
| back and throw an exception.
|
| A lot of this logic is already in out-of-order execution
| (e.g., the Tomasulo algorithm). Memory has ECC and is probably
| a different problem.
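|
| A rough sketch of that batching idea (all names and the value of N
| are invented here; real hardware would compare at retirement, not in
| software): buffer N results per replica, compare the buffers, and
| only commit to visible state when they match, otherwise discard the
| whole batch.
|
|     N = 64
|
|     def run_batch(fn, inputs, visible_state):
|         log_a = [fn(x) for x in inputs]   # replica A's uncommitted log
|         log_b = [fn(x) for x in inputs]   # replica B's uncommitted log
|         if log_a != log_b:
|             raise RuntimeError("replica mismatch, rolling back batch")
|         visible_state.extend(log_a)       # serialize to visible state
|
|     visible = []
|     run_batch(lambda x: x * x, list(range(N)), visible)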
| mywittyname wrote:
| If data integrity issues become a problem, it might be
| cheaper to mark certain cores as fault-tolerant and provide
| the OS with mechanisms to denote certain threads as requiring
| fault-tolerance.
| yummypaint wrote:
| This is a good idea. In a sense this is somewhat available
| now when choosing whether to run certain things on GPU vs
| CPU. My understanding is GPUs tend to play faster and
| looser, since people don't tend to notice a few occasional
| weird pixels in a single frame. What if it could be made
| finer grained, per instruction?
| toast0 wrote:
| I think there would be a pretty large hardware cost to ensure
| the input signals come to both processors at the same clock
| every time on the many high-speed interfaces a modern CPU is
| using.
|
| And you'd need to eliminate use of non-deterministic CPU-
| local data, like RDRAND and on-die temperature, power, etc.
| sensors. Most likely, you'd want to run the CPUs at a fixed
| clock speed to avoid any differences in settling time when
| switching speeds.
|
| This could probably effectively find broken CPUs (although
| you wouldn't know which of the pair was broken), but you
| could still have other broken devices resulting in bad
| computations. It might be better to run calculations on two
| separate nodes and compare; possibly only for important
| calculations.
| mirker wrote:
| It's not necessary to serialize the full execution to
| detect errors. On an out-of-order processor, there is
| already buffering of results happening that is eventually
| serialized to visible state. To check errors, you could
| just have one buffer per processor and compare results
| before serialization, raising an error on a mismatch between
| processors. Serialization merely indicates that both
| _visible_ executions have agreed up to that instruction, but
| it still allows for some local/internal disagreements. For
| example, two instructions can finish in opposite orders
| across cores and that is fine as long as they are
| serialized in order.
|
| As for settling times, those are random anyway. Processors
| are binned according to how good the settling times ended
| up being. It's unlikely you'd get two perfectly homogeneous chips.
| jeffbee wrote:
| The economics will never favor this approach. Customers will
| not choose to pay double to avoid the 1-in-a-million chance
| of occasionally getting a slightly wrong answer.
| mikepurvis wrote:
| Does it have to be double? I know it's not a direct
| analogue, but parity schemes like RAID 6 or ECC RAM don't
| double the cost.
|
| So the question is, how do you check these results without
| actually doing them twice? Is there a role here for
| frameworks or OS to impose sanity checks? Obviously we
| already have assertions, but something gentler than a
| panic, where it says "this is suspect, let's go back and
| confirm."
| theevilsharpie wrote:
| > Customers will not choose to pay double to avoid the
| 1-in-a-million chance of occasionally getting a slightly
| wrong answer.
|
| With today's high-speed multi-core processors, a 1-in-a-
| million chance of a computation error would mean tens to
| hundreds of thousands of errors per second.
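|
| A back-of-envelope version of that claim (the numbers below are
| illustrative assumptions, not measurements):
|
|     ops_per_sec_per_core = 3e9   # a few GHz, very roughly ~1 op/cycle
|     cores = 64                   # a modern server part
|     error_rate = 1e-6            # "1-in-a-million" per operation
|     print(ops_per_sec_per_core * cores * error_rate)  # ~192,000 /sec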
| eric__cartman wrote:
| I can imagine most consumers that do any sort of work
| with their computer would appreciate close to 100%
| stability when they need to get work done.
|
| That's usually why no one that depends on their computers
| to work day in and day out overclocks their components.
| The marginal performance gains aren't worth the added
| unreliability and added power/heat/noise footprint.
| sokoloff wrote:
| The lack of adoption/demand for just ECC RAM by consumers
| would seem to be an argument in the opposing direction.
| (Yes, it's not widely available currently, but I think
| it's safe to say that availability is driven by
| predictions about adoption given past market behavior.)
| temac wrote:
| Who decided there is a lack of _consumer_ demand? There
| is a lack of OEM demand for sure, which is driven by the
| fact that most companies are willing to sell crap if it
| can save 1 cent per product. The average consumer does
| not even know this problem exists. Add to that the
| artificial market segmentation by Intel, which is absolutely
| stupid, and the consumer actually _can not_ buy a consumer
| CPU that supports ECC. The situation is then locked into a
| vicious circle where all the components have ridiculous
| premiums and lower volumes.
| sokoloff wrote:
| Do you think the OEMs and Intel generally ignore what
| consumers demand (and are willing to pay for)? I don't.
| a1369209993 wrote:
| Yes. For example, Intel Management Engine.
| ncmncm wrote:
| Consumers are not the customer. System integrators are
| the customer. They are motivated to minimize the number
| of distinct manufacturing targets. Consumers have no
| choice but to take what is offered.
| einpoklum wrote:
| > Is silicon cheap enough etc.
|
| No, it is not. You can always trade off performance for
| reliability by repeating your computations several times,
| preferably with some variation in the distribution/timing of
| work to avoid the same potential hardware failure pattern.
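|
| One hedged sketch of that trade-off (the function and replica count
| are made up for illustration): run the same work three times in
| separate worker processes, which the OS will typically spread across
| cores, and take the majority answer.
|
|     from collections import Counter
|     from concurrent.futures import ProcessPoolExecutor
|
|     def work(x):
|         return x ** 3          # stand-in for the real computation
|
|     def checked(fn, arg, replicas=3):
|         with ProcessPoolExecutor(max_workers=replicas) as pool:
|             futures = [pool.submit(fn, arg) for _ in range(replicas)]
|             results = [f.result() for f in futures]
|         value, votes = Counter(results).most_common(1)[0]
|         if votes < 2:          # no majority: something is badly wrong
|             raise RuntimeError(f"no majority among {results!r}")
|         return value
|
|     if __name__ == "__main__":
|         print(checked(work, 1.1))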
| dataflow wrote:
| > Silicon is already at the point where there are only a dozen
| or so dopant atoms per gate, so a fluctuation of one or two
| atoms can be enough to impact behavior.
|
| How in the world do they get such a precise number of atoms to
| land on billions of transistors? It seems so hard for even one
| transistor.
| freehrtradical wrote:
| The article references Dixit et al. for an example of a root
| cause investigation of a CEE, which is an interesting read:
| https://arxiv.org/pdf/2102.11245.pdf
|
| > After a few iterations, it became obvious that the computation
| of Int(1.1^53)=0 as an input to the math.pow function in Scala
| would always produce a result of 0 on Core 59 of the CPU.
| However, if the computation was attempted with a different input
| value set Int(1.1^52)=142 the result was accurate.
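|
| A hedged sketch of that kind of per-core check, in Python rather
| than the Scala from the report (the probe() helper is made up here,
| and it's Linux-only because of sched_setaffinity; a healthy core
| should give int(1.1**53) == 156, where the report's bad core 59
| returned 0):
|
|     import math
|     import multiprocessing as mp
|     import os
|
|     def probe(core, out):
|         os.sched_setaffinity(0, {core})   # pin this process to one core
|         out.put((core, int(math.pow(1.1, 53))))
|
|     if __name__ == "__main__":
|         q = mp.Queue()
|         for core in sorted(os.sched_getaffinity(0)):
|             p = mp.Process(target=probe, args=(core, q))
|             p.start(); p.join()
|             c, val = q.get()
|             if val != 156:
|                 print(f"core {c}: int(1.1**53) = {val}, expected 156")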
| sfvisser wrote:
| > A deterministic AES mis-computation, which was "self-inverting":
| encrypting and decrypting on the same core yielded the identity
| function, but decryption elsewhere yielded gibberish.
|
| Incredible
| buildbot wrote:
| This specifically stood out to me - it implies it is possible
| to hide weakened/bad crypto via manufacturing bugs
| muricula wrote:
| I'm not sure about that. I think it's just that the AES
| hardware was busted somehow and didn't actually perform the
| AES algorithm correctly. The Intel AES hardware just
| deterministically performs the algorithm, so Intel can't just
| weaken the algorithm somehow, at least if you're not worrying
| about local side channels.
| twic wrote:
| Depends on the mode. AES-CTR encrypts an increasing counter to
| make a keystream, then xors that with the plaintext to make the
| ciphertext, or with the ciphertext to make the plaintext. Any
| consistent error in encryption will lead to a consistently
| wrong keystream, which will round-trip successfully.
|
| It's possible other modes have this property because of the
| structure of the cipher itself, but that's way out of my
| league.
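|
| A toy demonstration of exactly that round-trip property (the "block
| cipher" below is just SHA-256 of key||counter, not AES, and the
| "faulty core" is simulated by flipping a bit of every keystream
| block; only the property matters, not the crypto):
|
|     import hashlib
|
|     def keystream_block(key, counter, faulty=False):
|         block = hashlib.sha256(key + counter.to_bytes(16, "big")).digest()
|         if faulty:   # consistent mis-computation: same wrong bits every time
|             block = bytes(b ^ 0x40 for b in block)
|         return block
|
|     def ctr_xor(key, data, faulty=False):
|         out = bytearray()
|         for i in range(0, len(data), 32):
|             ks = keystream_block(key, i // 32, faulty)
|             out += bytes(a ^ b for a, b in zip(data[i:i + 32], ks))
|         return bytes(out)
|
|     key, msg = b"k" * 16, b"attack at dawn"
|     ct = ctr_xor(key, msg, faulty=True)            # encrypt on "bad core"
|     print(ctr_xor(key, ct, faulty=True) == msg)    # True: self-inverting
|     print(ctr_xor(key, ct, faulty=False) == msg)   # False: gibberish elsewhere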
| muricula wrote:
| The Intel AES instructions work at a block level. You build
| the mode on top of the block level primitives intel gives
| you. https://en.wikipedia.org/wiki/AES_instruction_set
| formerly_proven wrote:
| Sounds like a corrupted S-box or something like that in the
| hardware implementing AES-NI
___________________________________________________________________
(page generated 2021-06-03 23:01 UTC)