[HN Gopher] July 2024 Update on Instability Reports on Intel Cor...
___________________________________________________________________
July 2024 Update on Instability Reports on Intel Core 13th/14th Gen
Desktop CPUs
Author : acrispino
Score : 307 points
Date : 2024-07-22 20:55 UTC (1 days ago)
(HTM) web link (community.intel.com)
(TXT) w3m dump (community.intel.com)
| firebaze wrote:
| Nice that Intel acknowledges there are problems with that CPU
| generation. If I read this right, the CPUs have been supplied
| with a too-high voltage across the board, with some tolerating
| the higher voltages for longer, others not so much.
|
| Curious to see how this develops in terms of fixing defective
| silicon.
| tux3 wrote:
| Remains to be seen how the microcode patch affects performance,
| and how these CPUs that have been affected by over-voltage to the
| point of instability will have aged in 6 months, or a few years
| from now.
|
| More voltage generally improves stability, because there is more
| slack to close timing. Instability with high voltage suggests
| dangerous levels. A software patch can lower the voltage from
| this point on, but it can't take back any accumulated fatigue.
| giantg2 wrote:
| I was recently looking at building and buying a couple systems.
| I've always liked Intel. I went AMD this time.
|
| It seemed like the base frequencies vs boost frequencies were
| much farther apart on Intel than with most of the AMDs. This
| was especially true on the laptops where cooling is a larger
| concern. So I suspect they were pushing limits.
|
| Also, the performance core vs efficiency core stuff seemed kind
| of gimmicky with so few performance cores and so many
| efficiency cores. Like look at this 20 core processor! Oh wait,
| it's really an 8 core when it comes to performance. Hard to
| compare that to a 12 core 3D cached Ryzen with even higher
| clock...
|
| I will say, it seems intel might still have some advantages. It
| seems AMD had an issue supporting ECC with the current
| chipsets. I almost went Intel because of it. I ended up
| deciding that DDR5 built in error correction was enough for me.
| The performance graphs also seem to indicate a smoother
| throughput suggesting more efficient or elegant execution (less
| blocking?). But on the average the AMDs seem to be putting out
| similar end results even if the graph is a bit more "spikey".
| nullindividual wrote:
| > It seems AMD had an issue supporting ECC with the current
| chipsets.
|
| AMD has the advantage with regards to ECC. Intel doesn't
| support ECC at all on consumer chips, you need to go Xeon.
| AMD supports it on all chips, but it is up to the motherboard
| vendor to (correctly) implement. You can get consumer-class
| AM4/5 boards that have ECC support.
| jwond wrote:
| Actually some of the 13th and 14th gen Intel Core
| processors support ECC.
| mackal wrote:
| Intel has always had randomly supported ECC on desktop
| CPUs. Sometimes it was just a few low end SKUs, sometimes
| higher end SKUs. 14th gen it appears i9s and i7s do,
| didn't check i5s, but i3s did not.
| ThatMedicIsASpy wrote:
| You need W680 boards (starting at around 500 bucks) for ECC
| on desktop intel chips.
| giantg2 wrote:
| I was seeing them around $400 (still expensive).
| giantg2 wrote:
| My understanding is that it's screwed up for multiple
| vendors and chipsets. The boards might say they support it,
| but there are some updates saying it's not. It seemed
| extremely hard to find any that actually supported it. It
| was actually easier to find new Intel boards supporting
| ECC.
| paulmd wrote:
| yeah wendell put out a video a few weeks ago exploring a
| bunch of problems with asrock rack-branded server-market
| B650 motherboards and basically the ECC situation was
| exactly what everyone warns about: the various BIOS
| versions wandered between "works, but doesn't forward the
| errors", "doesn't work, and doesn't forward the errors",
| and (excitingly) "doesn't work and doesn't even post". We
| are a year and a half after zen4 launched and there
| barely are any server-branded boards to begin with, and
| even _those boards_ don't work right.
|
| https://youtu.be/RdYToqy05pI?t=503
|
| I don't know how many times it has to be said but
| "doesn't explicitly disable" is not the same thing as
| "support". There are lots of other enablement steps that
| are required to get ECC to work properly, and they really
| need to be explicitly tested with each release (which if
| it is "not explicitly disabled", it's not getting
| tested). Support means you can complain to someone when
| it doesn't work right.
|
| AMD churns AGESA really, really hard and it breaks all
| the time. Partners have to try and chase the upstream and
| sometimes it works and sometimes it doesn't. Elmor
| (Asus's Bios Guy) talked about this on Overclock.net back
| around 2017-2018 when AMD was launching X399 and talked
| about some of the troubles there and with AM4.
|
| That said, the current situation has seemingly lit a fire
| under the board partners, with Intel out of commission
| and all these customers _desperate_ for an alternative to
| their W680/raptor lake systems (which do support ecc
| _officially_, btw) in these performance-sensitive niches
| or power-limited datacenter layouts, they are finally
| cleaning up the mess like, within the last 3 weeks or so.
| They've very quickly gone from not caring about these
| boards to seeing a big market opportunity.
|
| https://www.youtube.com/watch?v=n1tXJ8HZcj4
|
| can't believe how many times I've explained in the last
| month that yes, people do actually run 13700Ks in the
| datacenter... with ECC... and actually it's probably some
| pretty big names in fact. A previous video dropped the
| tidbit that one of the major affected customers is
| Citadel Capital - and yeah, those are the guys who used
| to get special EVEREST and BLACK OPS skus from intel for
| the same thing. Client platform is better at that, the
| very best sapphire rapids or epyc -F or -X3D sku is going
| to be like 75% of the performance at best. It's also the
| fastest thing available for serving NVMe flash storage
| (and Intel specifically targeted this, the Xeon E-2400
| series with the C266 chipset can talk NVMe SAS _natively_
| on its chipset with up to 4 slimsas ports...)
|
| it's somewhere in this one I think:
| https://www.youtube.com/watch?v=5KHCLBqRrnY
| justinclift wrote:
| The new EPYC processors for AM5 though look like they'll
| be ok for ECC ram though, at least in the coming months
| onwards.
| paulmd wrote:
| Yeah I think that's the bright spot, now that there's a
| branded offering for server-flavored Ryzen now maybe
| there is a permanent justification for doing proper
| validation.
|
| I just feel vindicated lol, it always comes up that "well
| works fine for me!" and the reality is it's a _total_
| crapshoot with even server-branded boards often not
| working. There is zero chance your gigabyte UD3 or
| whatever is going to be consistently supported across
| bios and often it will not be.
|
| And AMD is really really tied to AGESA releases, so it's
| fairly important on that side. Although I guess maybe
| we're seeing now what happens if you let too much be
| abstracted away... but on the other hand partners were
| blowing up AMD chips last year too.
|
| If you're comfortable always testing, and always having
| the possibility of there being some big AGESA problem and
| ecc being broken on the new versions... ok I guess.
|
| There is a reason the i3 chips were perennial favorites
| for edge servers and NASs. And I think it's really,
| really hard to overstate the long-term damage from
| reputation loss here. Intel, meltdown aside, was always
| no-drama in terms of reliability. Other than C2000/C3000,
| I guess.
| mananaysiempre wrote:
| > AMD supports [ECC RAM] on all chips
|
| There was a strange happening with AMD laptop CPUs
| ("APUs"): the non-soldered DDR5 variants of the 7x40's were
| advertised to support ECC RAM on AMD's website up until a
| couple months before any actual laptops were sold, then
| that was silently changed and ECC is only on the PRO models
| now. I still don't know if this is a straightforward
| manufacturing or chipset issue of some kind or a sign of
| market segmentation to come.
|
| (I'm quite salty I couldn't get my Framework 13 with ECC
| RAM because of this.)
| ploek wrote:
| > AMD supports it on all chips
|
| Unfortunately not. I can't say for current gen, but the
| 5000 series APUs like the 5600G do _not_ support ECC. I
| know, I tried...
|
| But yes, most Ryzen CPUs do have ECC functionality, and
| have had it since the 1000 series, even if not officially
| supported. Official support for ECC is only on Ryzen PRO
| parts.
| smolder wrote:
| ECC support wasn't good initially on AM5, but there are now
| Epyc branded chips for the AM5 socket which officially
| support ECC DDR5. They come in the same flavors as the
| Ryzen 7xx0 chips, but are branded as Epyc.
| fomine3 wrote:
| More E-cores are reasonable for multi-threaded application
| performance. They're efficient in power and die area, as the
| name indicates, so Intel can fit more E-cores than P-cores
| into the same power/area budget. That's not suitable for those
| who need many cores with high single-threaded performance, such
| as VM servers, but I don't know if there is any major consumer
| usage that requires such performance.
| giantg2 wrote:
| I can sort of see that. The way I saw it explained was that
| they run at a much lower clock and have a pretty small shared
| cache. I could see E cores as being great for running
| background processes and stuff. All the benchmarks seem to
| show the AMDs with 2/3rd the cores being around the same
| performance and with similar power draw. I'm not putting
| them down. I'm just saying it seems gimmicky to say "look
| at our 20 core!" with the implicit idea that people will
| compare that with an AMD 12 core seeing 20>12, but not
| seeing the other factors like cost and benchmarks.
| philistine wrote:
| It's the megahertz wars all over again!
| cyanydeez wrote:
| Computers have taught us the rubric of truth: Numbers go
| Up.
| Sohcahtoa82 wrote:
| > but I don't know if there is any major consumer usage
| that requires such performance.
|
| Gaming.
|
| There are some games that will benefit from greater single-
| core performance.
| akira2501 wrote:
| > so few performance cores and so many efficiency cores
|
| I was baffled by this too but what they don't make clear is
| the performance cores have hyperthreading the efficiency
| cores do not.
|
| So what they call 2P+4E actually becomes an 8 core system as
| far as something like /proc/cpuinfo is concerned. They're
| also the same architecture so code compiled for a particular
| architecture will run on either core set and can be moved
| from one to the other as the scheduler dictates.
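|
| A minimal sketch of that counting, assuming two hardware threads
| per P-core and one per E-core (the function name and its default
| are illustrative, not anything Intel publishes):
|
```python
def logical_cpus(p_cores: int, e_cores: int, threads_per_p_core: int = 2) -> int:
    """Logical CPUs the OS would list: each P-core exposes
    `threads_per_p_core` hardware threads, each E-core exposes one."""
    return p_cores * threads_per_p_core + e_cores


print(logical_cpus(2, 4))   # the 2P+4E example -> 8
print(logical_cpus(8, 12))  # an 8P+12E part -> 28
```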
| Dylan16807 wrote:
| > They're also the same architecture so code compiled for a
| particular architecture will run on either core set and can
| be moved from one to the other as the scheduler dictates.
|
| I don't know if that has done more good than harm, since
| they ripped AVX-512 out for multiple generations to ensure
| parity.
| Dylan16807 wrote:
| > Like look at this 20 core processor! Oh wait, it's really
| an 8 core when it comes to performance.
|
| The E cores are about half as fast as the P cores depending
| on use case, at about 30% of the size. If you have a program
| that can use more than 8 cores, then that 8P+12E CPU should
| approach a 14P CPU in speed. (And if it can't use more than 8
| cores then P versus E doesn't matter.) (Or if you meant
| 4P+16E then I don't think those exist.)
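|
| The back-of-the-envelope math behind that estimate, as a rough
| sketch. The 0.5 speed ratio and 0.3 area ratio are the ballpark
| figures from this comment, not measurements, and the function
| names are made up for illustration:
|
```python
def p_core_equivalents(p_cores: int, e_cores: int, e_speed: float = 0.5) -> float:
    """Throughput in units of one P-core, assuming a perfectly parallel
    workload and each E-core running at `e_speed` times a P-core."""
    return p_cores + e_cores * e_speed


def area_in_p_cores(p_cores: int, e_cores: int, e_area: float = 0.3) -> float:
    """Die area in units of one P-core, assuming each E-core takes
    `e_area` times the area of a P-core."""
    return p_cores + e_cores * e_area


print(p_core_equivalents(8, 12))         # 14.0
print(round(area_in_p_cores(8, 12), 1))  # 11.6
```
|
| Under those assumptions an 8P+12E chip delivers roughly 14 P-cores
| of throughput in about 11.6 P-cores of area, and only when the
| workload scales across all 20 cores.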
|
| > Hard to compare that to a 12 core 3D cached Ryzen with even
| higher clock...
|
| Only half of those cores properly get the advantage of the 3D
| cache. And I doubt those cores have a higher clock.
|
| AMD's doing quite well but I think you're exaggerating a good
| bit.
| ComputerGuru wrote:
| > If you have a program that can use more than 8 cores,
| then that 8P+12E CPU should approach a 14P CPU in speed
|
| Only if you use work stealing queues or (this is
| ridiculously unlikely) run multithreaded algorithms that
| are aware of the different performance and split the work
| unevenly to compensate.
| Dylan16807 wrote:
| Or if you use a single queue... which I would expect to
| be the default.
|
| Blindly dividing work units across cores sounds like a
| terrible strategy for a general program that's sharing
| those cores with who-knows-what.
| ComputerGuru wrote:
| It's a common strategy for small tasks where the overhead
| of dispatching the task greatly exceeds the computation
| of it. It's also a better way to maximize L1/L2 cache hit
| rates by improving memory locality.
|
| Eg you have 100M rows and you want to cluster them by a
| distance function (naively), running dist(arr[i], arr[j])
| is crazy fast, the problem is just that you have so many
| of them. It is faster to run it on one core than dispatch
| it from one queue to multiple cores, but best to assign
| the work ahead of time to n cores and have them crunch
| the numbers.
| Dylan16807 wrote:
| It has always been a bad idea to dispatch so naively
| _and_ dispatch to the same number of threads as you have
| cores. What if a couple cores are busy, and you spend
| almost twice as much time as you need waiting for the
| calculation to finish? I don't know how much software
| does that, and most of it can be easily fixed to dispatch
| half a million rows at a time and get better performance
| on all computers.
|
| Also on current CPUs it'll be affected by hyperthreading
| and launch 28 threads, which would probably work out
| pretty well overall.
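|
| The two strategies argued about above can be compared with a toy
| simulation. This is only a sketch under simplifying assumptions:
| 8 cores at speed 1.0 and 12 at speed 0.5, identical work units, no
| dispatch overhead, no cache effects, no hyperthreading, and nothing
| else running. All names here are hypothetical.
|
```python
import heapq


def static_split_time(total_units, speeds):
    """Each core gets an equal share up front; the slowest core
    determines when the whole job finishes."""
    share = total_units / len(speeds)
    return max(share / s for s in speeds)


def shared_queue_time(total_units, speeds, chunk):
    """Cores pull fixed-size chunks from a single queue as they
    become free; returns the time the last chunk completes."""
    # Min-heap of (time at which the core is free, core index).
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    remaining = total_units
    finish = 0.0
    while remaining > 0:
        now, i = heapq.heappop(free_at)
        work = min(chunk, remaining)
        remaining -= work
        done = now + work / speeds[i]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish


speeds = [1.0] * 8 + [0.5] * 12   # assumed 8P+12E with E at half speed
total = 1400                      # arbitrary number of equal work units

print(static_split_time(total, speeds))     # 140.0: everyone waits for the E cores
print(total / sum(speeds))                  # 100.0: ideal "14P" finish time
print(shared_queue_time(total, speeds, 1))  # close to the ideal
```
|
| With small chunks the single queue lands near the 14P ideal, while
| the even split is bounded by the slow cores. Real workloads with
| per-task dispatch overhead or locality concerns can shift that
| balance, which is the counterpoint raised earlier in the thread.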
| chmod775 wrote:
| > What if a couple cores are busy
|
| If you don't pin them to cores, the OS is still free to
| assign threads to cores as it pleases. Assuming the
| scheduler is somewhat fair, threads will progress at
| roughly the same rate.
| Sohcahtoa82 wrote:
| > run multithreaded algorithms that are aware of the
| different performance and split the work unevenly to
| compensate.
|
| This is what the Intel Thread Director [0] solves.
|
| For high-intensity workloads, it will prioritize
| assigning them to P-cores.
|
| [0] https://www.intel.com/content/www/us/en/support/artic
| les/000...
| ComputerGuru wrote:
| Then you no longer have 14 cores in this example, but
| only len(P) cores. Also most code written in the wild
| isn't going to use an architecture-specific library for
| this.
| dwattttt wrote:
| The P cores being presented as two logical cores and E
| cores presented as a single logical core results in this
| kind of split already.
| giantg2 wrote:
| Yeah, the 20 core Intels are benchmarking about the same as
| the 12 core AMD X3Ds. But many people just see 20>12.
| Either one is more than fine for most people.
|
| "Oh wait, it's really an 8 core when it comes to
| performance [cores]". So yes, should not be an 8 core all
| together, but like you said about 14 cores, or 12 with the
| 3D cache.
|
| "And I doubt those cores have a higher clock."
|
| I'm not sure what we're comparing them to. They should be
| capable of higher clock than the E cores. I thought all the
| AMD cores had the ability to hit the max frequency (but not
| necessarily at the same time). And some of the cores might
| not be able to take advantage of the 3D cache, but that
| doesn't limit their frequency, from my understanding.
| Dylan16807 wrote:
| > I'm not sure what we're comparing them to. They should
| be capable of higher clock than the E cores.
|
| Oh, just higher clocked than the E cores. Yeah that's
| true, but if you're using that many cores at once you
| probably only care about total speed.
|
| You said 12 core with higher clock versus 8, so I thought
| you were comparing to the performance cores.
|
| > I thought all the AMD cores had the ability to hit the
| max frequency (but not necessarily at the same time).
|
| The cores under the 3D cache have a notable clock penalty
| on existing CPUs.
|
| > And some of the cores might not be able to take
| advantage of the 3D cache, but that doesn't limit their
| frequency, from my understanding.
|
| Right, but my point is it's misleading to call out higher
| core count _and_ the advantages of 3D stacking. The 3D
| stacking mostly benefits the cores it's on top of, which
| is 6-8 of them on existing CPUs.
| giantg2 wrote:
| "The cores under the 3D cache have a notable clock
| penalty on existing CPUs."
|
| Interesting. I can't find any info on that. It seems that
| makes sense though since the 7900X is 50 W TDP higher than
| the 7900X3D.
|
| "Right, but my point is it's misleading to call out
| higher core count and the advantages of 3D stacking"
|
| Yeah, that makes sense. I didn't realize there was a
| clock penalty on some of the cores with the 3D cache and
| that only some cores could use it.
| LtdJorge wrote:
| It's due to the stacked cache being harder to cool and
| not supporting as high of a voltage. So the 3D CCD clocks
| lower, but for some workloads it's still faster (mainly
| ones dealing with large buffers, like games, most compute
| heavy benchmarks fit in normal caches and the non 3D
| V-Cache variants take the win).
| hellotomyrars wrote:
| It's kind of funny and reminiscent of the AMD bulldozer
| days where they had a ton of cores compared to the
| contemporary Intel chips, especially at low/mid price
| points but the AMD chips were laughably underwhelming for
| single core performance which was even more important
| then.
|
| I can't speak to the Intel chips because I've been out of
| the Intel game for a long time but my 5700X3D does seem
| to happily run all cores at max clock speed.
| jiggawatts wrote:
| A major differentiator is that Intel CPUs with E cores don't
| allow the use of AVX-512, but all current AMD CPUs do. The
| new Zen 5 chips will run circles around Intel for any such
| workload. Video encoding, 3D rendering, and AI come to mind.
| For developers: many database engines can use AVX-512
| automatically.
| paulmd wrote:
| > Remains to be seen how the microcode patch affects
| performance
|
| intel is claiming 4% performance hit on the final patch
| https://youtu.be/wkrOYfmXhIc?t=308
| oasisbob wrote:
| Maybe a stretch - but this reminds me of blood sugar regulation
| for people with type 1 diabetes.
|
| Too low is dangerous because you lose rational thought, and the
| ability to maintain consciousness or self-recover. However,
| despite not having the immediate dangers of being low, having
| high blood sugar over time is the condition which causes long-
| term organ damage.
| loufe wrote:
| Intel cannot afford to be anything but outstanding in terms of
| customer experience right now. They are getting assaulted on all
| fronts and need to do a lot to improve their image to stay
| competitive.
| Joel_Mckay wrote:
| Their acquisition of Altera seemed to harm both companies
| irreparably.
|
| Any company can reach a state where the Process people take
| over, and the Product people end up at other firms.
|
| Intel could have grown a pair, and spun the 32 core RISC-V DSP
| SoC + gpu for mobile... but there is little business incentive
| to do so.
|
| Like any rotting whale, they will be stinking up the place for
| a long time yet. =)
| beacon294 wrote:
| Could you elaborate on the process people versus product
| people?
| basementcat wrote:
| I would argue the fabrication process people at Intel are
| core to their business. Without the ability to reliably
| manufacture chips, they're dead in the water.
| Joel_Mckay wrote:
| You mean manufacturing "working chips" is supposed to be
| their business.
|
| It is just performance art with proofing wafers unless
| the designs work =3
| bgmeister wrote:
| I assume they're referring to Steve Jobs' comments in this
| (Robert Cringely IIRC) interview:
| https://www.youtube.com/watch?v=l4dCJJFuMsE (not a great
| copy, but should be good enough)
| beacon294 wrote:
| Oh yeah, this got rehashed as builders versus talkers
| too. Yeah, there's a lot of this creative vibe type
| dividing. It's pretty complicated, I don't even think
| individual people operate the same when placed in a
| different context. Usually their output is a result of
| their incentives, so typically management failure or
| technical architect failure.
| Joel_Mckay wrote:
| Partly true, Steve Jobs had a charismatic tone when
| describing these problems in public.
|
| Have a great day, =3
| Joel_Mckay wrote:
| It is an old theory that accurately points out
| Marketing/Sales division people inevitably out-compete
| product innovation people in a successful firm.
|
| https://en.wikipedia.org/wiki/Competitive_exclusion_princip
| l...
|
| And yes, the Steve Jobs interview does document how this
| almost destroyed Apple's core business. =)
| scrlk wrote:
| Intel should take a page out of HP's book when it came to
| dealing with a bug in the HP-35 (first pocket scientific
| calculator):
|
| > The HP-35 had numerical algorithms that exceeded the
| precision of most mainframe computers at the time. During
| development, Dave Cochran, who was in charge of the algorithms,
| tried to use a Burroughs B5500 to validate the results of the
| HP-35 but instead found too little precision in the former to
| continue. IBM mainframes also didn't measure up. This forced
| time-consuming manual comparisons of results to mathematical
| tables. A few bugs got through this process. For example: 2.02
| ln e^x resulted in 2 rather than 2.02. When the bug was
| discovered, HP had already sold 25,000 units which was a huge
| volume for the company. In a meeting, Dave Packard asked what
| they were going to do about the units already in the field and
| someone in the crowd said "Don't tell?" At this Packard's
| pencil snapped and he said: "Who said that? We're going to tell
| everyone and offer them, a replacement. It would be better to
| never make a dime of profit than to have a product out there
| with a problem". It turns out that less than a quarter of the
| units were returned. Most people preferred to keep their buggy
| calculator and the notice from HP offering the replacement.
|
| https://www.hpmuseum.org/hp35.htm
| basementcat wrote:
| I wonder if Mr. Packard's answer would have been different if
| a recall would have bankrupted the company or necessitated
| layoff of a substantial percentage of staff.
| scrlk wrote:
| I can't speak for Dave Packard (or Bill Hewlett) - but I
| will try to step in to their shoes:
|
| 1) HP started off in test and measurement equipment
| (voltmeters, oscilloscopes etc.) and built a good
| reputation up. This was their primary business at the time.
|
| 2) The customer base of the HP-35 and test and measurement
| equipment would have a pretty good overlap.
|
| Suppose the bug had been covered up, found, and then the
| news about the cover up came to light? Would anyone trust
| HP test and measurement equipment after that? It would
| probably destroy the company.
| rasz wrote:
| Or potential of killing couple hundred passengers, or few
| astronauts. Oh, wait...
| wnevets wrote:
| Are the CPUs that received elevated operating voltage permanently
| damaged?
| Pet_Ant wrote:
| This is the most pressing question. If it was just a microcode
| issue a cooloff and power cycle ought to at least reset things
| but according to Wendell from Level1Techs, that doesn't seem to
| always be the case.
| kevingadd wrote:
| The problem is that running at too high of a voltage for
| sustained periods can cause physical degradation of the chip
| in some cases. Hopefully not here!
| chmod775 wrote:
| > can cause physical degradation of the chip in some cases.
|
| Not in some cases. Chips always physically degrade
| regardless of voltage. Higher voltages will make it happen
| faster.
| Pet_Ant wrote:
| Why do chips degrade? Is this due to the whiskers I've
| heard about?
| cesarb wrote:
| > Why do chips degrade? Is this due to the whiskers I've
| heard about?
|
| No, tin whiskers are a separate issue, which happens
| mostly outside the chips. The keyword you're looking for
| is electromigration
| (https://en.wikipedia.org/wiki/Electromigration).
| layer8 wrote:
| Not instantly it seems, but there have been reports of
| degradation over time. It will be a case-by-case thing.
| userbinator wrote:
| Possible electromigration damage, yes.
| NBJack wrote:
| I was concerned this would happen to them, given how much power
| was being pushed through their chips to keep them competitive. I
| get the impression their innovation has either truly slowed down,
| or AMD thought enough 'moves' ahead with their
| tech/marketing/patents to paint them into a corner.
|
| I don't think Intel is done though, at least not yet.
| magicalhippo wrote:
| There was recently[1] some talk about how the 13th/14th gen
| mobile chips also had similar issues, though Intel insisted it's
| something else.
|
| Will be interesting to see how that pans out.
|
| [1]: https://news.ycombinator.com/item?id=41026123
| tardy_one wrote:
| For server CPUs, is there no similar problem, or do they realize
| server purchasers may be less willing to tolerate it? I'm not
| all that thrilled with the prospect of buying Intel, especially
| when planning to wait about 5 years before replacement, compared
| to a few generations ago, but AMD server choices can be a bit
| limited and I'm not really sure how to evaluate whether there
| may be more surprises across the board.
| sirn wrote:
| Are you talking about Xeon Scalable? Although they share the
| same core design as the desktop counterpart (Xeon Scalable
| 4th Gen shares the same Golden Cove as 12th Gen, Xeon
| Scalable 5th Gen shares the same Raptor Cove as 13th/14th
| Gen), they're very different from the desktop counterpart
| (monolithic vs tile/EMIB-based, ring bus vs mesh, power gate
| vs FIVR), and often running in a more conservative
| configuration (lower max clock, more conservative V/F curves,
| etc.). There has been a rumor about Xeon Scalable 5th Gen
| having the same issue, but it's more of a gossip rather than
| a data point.
|
| The issue does happen with desktop chips that are being used
| in a server context when pairing with workstation chipset
| such as W680. However, there haven't been any reports of Xeon
| E-2400/E-3400 (which is essentially a desktop chip repurposed
| as a server) with C266 having these issues, though it may be
| because there hasn't been a large deployment of these chips
| on the server just yet (or even if there are, it's still too
| early to tell).
|
| Do note that even without this particular issue, Xeon
| Scalable 4th Gen (Sapphire Rapids) is not a good chip
| (speaking from experience, I'm running w-3495x). It has
| plenty of issues such as slow clock ramp, high latency, high
| idle power draw, and the list goes on. While Xeon Scalable
| 5th Gen (Emerald Rapids) seems to have fixed most of these
| issues, Zen 4 EPYC is still a much better choice.
| tedunangst wrote:
| The mobile issue seems more anecdote than data? Almost as if
| people on Reddit heard the 13/14 CPUs were bad, then their
| laptop crashed, and they decided "it happened to me too".
| magicalhippo wrote:
| Well it's not just[1] redditors from what I can gather:
|
| _Now Alderon Games reports that Raptor Lake crashes impact
| Intel's 13th and 14th-Gen processors in laptops as well._
|
| _"Yes we have several laptops that have failed with the
| same crashes. It's just slightly more rare then the desktop
| CPU faults," the dev posted._
|
| These are the guys who publicly claimed[2] Intel sold
| defective chips based on the desktop chips crashing.
|
| [1]: https://www.tomshardware.com/pc-components/cpus/dev-
| reports-...
|
| [2]: https://www.tomshardware.com/pc-components/cpus/game-
| publish...
| sirn wrote:
| The problem may exist, but Alderon Games' report on the
| mobile chip is more of an anecdote here because there's not
| enough data points (unlike their desktop claims), and the
| only SKU they give (13900HX) is actually a desktop chip in
| a mobile package (BGA instead of LGA, so we're back into
| the original issue). So in the end, even with Alderon's
| claims, there's really not enough data points to come to a
| conclusion on the mobile side of things.
| Sakos wrote:
| Why are you downplaying it too?
|
| > "The laptops crash in the exact same way as the desktop
| parts including workloads under Unreal Engine,
| decompression, ycruncher or similar. Laptop chips we have
| seen failing include but not limited to 13900HX etc.,"
| Cassells said.
|
| > "Intel seems to be down playing the issues here most
| likely due to the expensive costs related to BGA rework
| and possible harm to OEMs and Partners," he continued.
| "We have seen these crashes on Razer, MSI, Asus Laptops
| and similar used by developers in our studio to work on
| the game. The crash reporting data for my game shows a
| huge amount of laptops that could be having issues."
|
| https://old.reddit.com/r/hardware/comments/1e13ipy/intel_
| is_...
| sirn wrote:
| I'm not denying that the problem exists, but I don't
| think Alderon provided enough data to come to a
| conclusion, unlike on the desktop, where it's supported
| by other parties in addition to Alderon's data (where you
| can largely point to 14900KS/K/non-K/T,
| 13900KS/K/non-K/T, 14700K, and 13700K being the one
| affected)
|
| Right now, the only example given is HX (which is a
| repackaged desktop chip[^], as mentioned), so I'm not
| denying that the problem is happening on HX based on
| their claims (and it makes a lot of sense that HX is
| affected! See below), but what about H CPUs? What about P
| CPUs? What about U CPUs? The difference in impact between
| "only HX is impacted" and "HX/H/P/U parts are all
| affected" is a few orders of magnitude (a very top-end
| 13th Gen mobile SKUs versus every 13th Gen mobile SKUs).
| Currently, we don't have enough data how widespread the
| issue is, and that makes it difficult to assess who is
| impacted by this issue from this data alone.
|
| [^]: HX is the only mobile CPU with B0 stepping, which is
| the same as desktop 13th/14th Gen, while the mobile H/P/U
| family are J0 and Q0, which are essentially a higher
| clocked 12th Gen (i.e., using Golden Cove rather than
| Raptor Cove)
| paulmd wrote:
| Alderon are the people claiming 100% of units fail which
| doesn't seem supported by anyone else either. Wendell and
| GN seem to have scoped the issue to around 10-25% across
| multiple different sources.
|
| Like they are the most extreme claimants at this point.
| _Are_ they really credible?
| paulmd wrote:
| Ok, I take it back, this looks pretty indicative of a
| low-load problem and evidently failure rates are _much_
| higher in that scenario.
|
| https://www.youtube.com/watch?v=yYfBxmBfq7k
| christkv wrote:
| The amount of current their chips pull on full boost is pretty
| crazy. It would definitely not surprise me if some could get
| damaged by extensive boosting.
| brynet wrote:
| Curious why Intel announced this on their community forums,
| rather than somewhere more official.
| samtheprogram wrote:
| Optics / stock price
| guywithahat wrote:
| That's probably where people are most likely to understand
| it. A lot of companies do this, especially while they're still
| learning things.
| wmf wrote:
| These days people are more likely to see the announcement on
| YouTube, TikTok, or Twitter.
| slaymaker1907 wrote:
| The first two require a lot more effort in video editing
| than creating a forum post. Plus, it's just going to be
| digested and regurgitated for the masses by people much
| better at communicating technical information.
| cyanydeez wrote:
| So weird hearing high noise channels as the preferred
| distribution
| paulmd wrote:
| they did that too
|
| https://youtu.be/wkrOYfmXhIc
| beart wrote:
| Based on what I know about corporations, it's entirely
| plausible that the folks posting the information don't actually
| have access to the communication channels you are referring to.
| I don't even know how I would issue an official communication
| at my own company if the need ever came up... so you go with
| what you have.
| langsoul-com wrote:
| Note how they mentioned it's still going to be tested with
| various partners before being released.
|
| I.e., we think this might solve it, but if it doesn't we can roll
| back with the least amount of PR attention.
| Covzire wrote:
| Just want to say, I'm incredibly happy with my 7800X3D. It runs
| ~70C max, like Intel chips used to, with a $35 air cooler, and
| it's on average the fastest chip for gaming workloads right now.
| amiga-workbench wrote:
| I'm also very happy with my 5800X3D, it was wonderful value
| back when AM5 had just released and DDR5/Motherboards still
| cost an arm and a leg.
|
| The energy efficiency is much appreciated in the UK with our
| absurd price of electricity.
| SushiHippie wrote:
| Same, in my BIOS I can activate an "ECO Mode", which lets me
| decide if I want to run my 7950X on full 170W TDP, 105W TDP
| or 60W TDP.
|
| I benchmarked it, the difference between 170 and 105 is
| basically zero, and the difference to 60W is just a few
| percent of a performance hit, but way worth it, as it's
| ~0.3EUR/kWh over here.
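|
| A rough sketch of the arithmetic behind that trade-off, in Python.
| The 0.30 EUR/kWh price and the 170/105/60W settings come from the
| comment above; the 8 hours of full load per day is a made-up figure,
| and treating the TDP setting as the actual sustained draw is a
| simplifying assumption (real package power can differ from the TDP
| label).
|

```python
def annual_cost_eur(watts, hours_per_day, eur_per_kwh=0.30):
    """Yearly electricity cost of a constant load, in EUR."""
    kwh_per_year = watts / 1000 * hours_per_day * 365
    return kwh_per_year * eur_per_kwh


# Hypothetical usage pattern: 8 hours of full CPU load every day.
HOURS_PER_DAY = 8

if __name__ == "__main__":
    for tdp_w in (170, 105, 60):
        cost = annual_cost_eur(tdp_w, HOURS_PER_DAY)
        print(f"{tdp_w:>3} W -> {cost:.2f} EUR/year")
```

|
| Under these assumptions the cost scales linearly with both the power
| setting and the hours of load, so the saving only matters for
| machines that spend a lot of time at full load.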
| aruametello wrote:
| (if you are running Windows)
|
| You might want to check out a tool called PBO2 Tuner
| (https://www.cybermania.ws/apps/pbo2-tuner/). You can tweak
| values like EDC, TDC and PPT (power limit) from the GUI, and
| it also accepts command-line arguments so you can automate
| those tasks.
|
| I made scripts that "cap" the power consumption of the CPU
| based on what applications are running (i.e. only going
| all in on certain games, dynamically swapping between
| handmade 65/90/120/180W profiles).
|
| I made them with power saving in mind, given that idle power
| consumption is rather high on modern Ryzens.
|
| Edit: I actually made a mistake, given that PBO2 Tuner is for
| Zen 3 CPUs, and you mentioned Zen 4.
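|
| The profile-switching idea could be sketched roughly like this in
| Python. Only the 65/90/120/180W tiers come from the comment; the
| process names, the mapping, and the function name are hypothetical,
| and the step that actually applies the cap through a tuning tool is
| deliberately left out, since its command-line syntax isn't given
| here.
|

```python
# Hypothetical mapping from process name to the power cap (in watts)
# that should apply while that process is running.
PROFILE_BY_PROCESS = {
    "heavy_game.exe": 180,
    "video_encoder.exe": 120,
    "light_game.exe": 90,
}

# Cap used when none of the listed applications is running.
DEFAULT_CAP_W = 65


def pick_power_cap(running_processes):
    """Return the highest cap requested by any running process."""
    caps = [
        PROFILE_BY_PROCESS[name.lower()]
        for name in running_processes
        if name.lower() in PROFILE_BY_PROCESS
    ]
    return max(caps, default=DEFAULT_CAP_W)


if __name__ == "__main__":
    print(pick_power_cap(["explorer.exe", "light_game.exe"]))
```

|
| Picking the maximum means the most demanding running application
| wins; a script would poll the process list periodically and only
| re-apply the cap when the result changes.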
| fefe23 wrote:
| So on one hand they are saying it's voltage (i.e. something
| external, not their fault, bad mainboard manufacturers!).
|
| On the other hand they are saying they will fix it in microcode.
| How is that even possible?
|
| Are they saying that their CPUs are signaling the mainboards to
| give them too much voltage?
|
| Can someone make sense of this? It reminds me of Steve Jobs' You
| Are Holding It Wrong moment.
| cqqxo4zV46cp wrote:
| The "you're holding it wrong!" angle is all your take. They
| don't make that claim.
| k12sosse wrote:
| "OK, great, let's give everybody a case" lives on
| ls612 wrote:
| The claim seems to be that the microcode on the CPU is in
| certain circumstances requesting the wrong (presumably too
| high) voltage from the motherboard. If that is the case fixing
| the microcode will solve the issue going forward but won't help
| people whose chips have already been damaged by excessive
| voltage.
| pitaj wrote:
| > Are they saying that their CPUs are signaling the mainboards
| to give them too much voltage?
|
| Yes that's exactly what they said.
| dboreham wrote:
| So it's a 737 MAX problem: the software is running a control
| loop that doesn't have deflection limits. So it tells the
| stabilizer (or voltage reg in this case) to go hard nose
| down.
| nahnahno wrote:
| lol what a stretch of an analogy
| wtallis wrote:
| The voltage supplied by the motherboard isn't supposed to be
| constant. The CPU is continuously varying the voltage it's
| requesting, based primarily on the highest frequency any of the
| CPU cores are trying to run at. The motherboard is supposed to
| know what the resistive losses are from the VRMs to the CPU
| socket, so that it can deliver the requested voltage at the CPU
| socket itself. There's room for either party to screw up: the
| CPU could ask for too much voltage in some scenarios, or the
| motherboard's voltage regulation could be poorly calibrated (or
| deliberately skewed by overclocking presets).
|
| On top of all this mess: these products were part of Intel's
| repeated attempts to move the primary voltage rail (the one
| feeding the CPU cores) to use on-die voltage regulators (DLVR).
| They're present in silicon but unused. So it's not entirely
| surprising if the fallback plan of relying solely on external
| voltage regulation wasn't validated thoroughly enough.
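|
| As a simplified illustration of the resistive-loss compensation
| described above, here is a purely ohmic model in Python. All numbers
| (1.40V request, 200A, milliohm-scale path resistance) are
| hypothetical, and real boards use more elaborate load-line schemes;
| this only shows how a miscalibrated resistance estimate turns into
| extra voltage at the socket.
|

```python
def vrm_setpoint(v_requested, current_a, r_assumed_ohm):
    """VRM output needed so the socket sees v_requested,
    given the board's *assumed* path resistance."""
    return v_requested + current_a * r_assumed_ohm


def socket_voltage(v_vrm, current_a, r_actual_ohm):
    """Voltage arriving at the socket after the real resistive drop."""
    return v_vrm - current_a * r_actual_ohm


if __name__ == "__main__":
    v_req, amps = 1.40, 200.0   # hypothetical request and load current
    r_actual = 0.0011           # hypothetical true path resistance (ohms)

    for r_assumed in (0.0011, 0.0017):  # calibrated vs. skewed estimate
        v_out = vrm_setpoint(v_req, amps, r_assumed)
        v_cpu = socket_voltage(v_out, amps, r_actual)
        print(f"assumed {r_assumed * 1000:.1f} mOhm -> "
              f"socket sees {v_cpu:.3f} V")
```

|
| In this model the error at the socket is just the current times the
| difference between the assumed and actual resistance, so it grows
| with load.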
| basementcat wrote:
| My guess is something like the following:
|
| Modern CPU's are incredibly complex machines with a
| ridiculously large amount of possible configuration states (too
| large to exhaustively test after manufacture or sim during
| design), e.g. a vector multiply in flight with an AES encode in
| flight with x87 sincos, etc. Each operation is going to draw a
| certain amount of current. It is impractical to guarantee each
| functional unit with the required current but the supply rails
| are sized for a "reasonable worst case".
|
| Perhaps an underestimate was mistakenly made somewhere and not
| caught until recently. Therefore the fix might be to modify the
| instruction dispatcher (via microcode) to guarantee that
| certain instruction configurations cannot happen (e.g. let the
| x87 sincos stall until the vector multiply is done) to reduce
| pressure on the voltage regulator.
| hedgehog wrote:
| It's worse than that, thermal management is part of the
| puzzle. Think of that as heat generation happening across
| three dimensions (X + Y + time) along with diffusion in 3D
| through the package.
| CoastalCoder wrote:
| It's an interesting idea, but there's a caveat: time flows
| in just one direction.
| aseipp wrote:
| Saying "elevated voltage causes damage" is not attributing
| blame to anyone. In the very next sentence, they then attribute
| the reason for that elevated voltage to their own microcode,
| and so it is responsible for the damage. I literally do not
| know how they could be any clearer on that.
| TazeTSchnitzel wrote:
| After watching https://youtube.com/watch?v=gTeubeCIwRw and some
| related content, I personally don't believe it's an issue fixable
| with microcode. I guess we'll see.
| jpk wrote:
| Because HN doesn't provide link previews, I'd recommend adding
| some information about the content to your comment. Otherwise
| we have to click through to YouTube for the comment to make any
| sense.
|
| That said, the video is the GamersNexus one where they talk
| about an unverified claim that this is a fabrication process
| issue caused by oxidation between atomic deposition layers. If
| that's the case, then yeah, microcode can only do so much. But
| like Steve says in the video, the oxidation theory has yet to
| be proven and they're just reporting what they have so far
| ahead of the Zen 5 reviews coming soon.
| mjevans wrote:
| Hopefully Intel ships them the current pre-release microcode
| revision and allows them to test and publish benchmarks with it
| for review comparison.
| acrispino wrote:
| An Intel employee is posting on reddit:
| https://www.reddit.com/r/intel/comments/1e9mf04/intel_core_1...
|
| A recent YouTube video by GamersNexus speculated the cause of
| instability might be a manufacturing issue. The employee's
| response follows.
|
| _Questions about manufacturing or Via Oxidation as reported by
| Tech outlets:_
|
| _Short answer: We can confirm there was a via Oxidation
| manufacturing issue (addressed back in 2023) but it is not
| related to the instability issue._
|
| _Long answer: We can confirm that the via Oxidation
| manufacturing issue affected some early Intel Core 13th Gen
| desktop processors. However, the issue was root caused and
| addressed with manufacturing improvements and screens in 2023. We
| have also looked at it from the instability reports on Intel Core
| 13th Gen desktop processors and the analysis to-date has
| determined that only a small number of instability reports can be
| connected to the manufacturing issue._
|
| _For the Instability issue, we are delivering a microcode patch
| which addresses exposure to elevated voltages which is a key
| element of the Instability issue. We are currently validating the
| microcode patch to ensure the instability issues for 13th /14th
| Gen are addressed_
| hsbauauvhabzb wrote:
| So they were producing defective CPUs, identified & addressed
| the issue but didn't issue a recall, defect notice or public
| statement relating to the issue?
|
| Good to know.
| thelastparadise wrote:
| Dude's gonna be canned so hard.
| Dylan16807 wrote:
| It sounds like their analysis is that the oxidation issue is
| comfortably below the level of "defective".
|
| No product will ever be perfect. You don't need to do a
| recall for a sufficiently rare problem.
|
| And in case anyone skims, I will be extra clear, this is
| based on the claim that the oxidation is _separate_ from the
| real problem here.
| abracadaniel wrote:
| They could recall the defective batch. All of the cpus with
| that defect will fail from it. They seem to have been
| content to hope no one noticed.
| Dylan16807 wrote:
| What makes you think there was a "defective batch"? What
| makes you think all the CPUs affected by that production
| issue will fail from it?
|
| That description sounds to me like it affected the entire
| production line for months. It's only worth a recall if a
| sufficient percent of those CPUs will fail. (I don't want
| to argue about what particular percent that should be.)
| hsbauauvhabzb wrote:
| My CPU was unstable for months, I spent tens of hours and
| hundreds on equipment to troubleshoot (I _never_ thought
| my CPU would be the cause). Had I known this, I would
| have scrutinised the CPU a lot sooner than I did.
|
| Intel could have made a public statement about potentially
| defective products with good PR spin:
| 'we detected an issue, believe the defect rate will be <
| 0.25%, here's a test suite you can run, call if you think
| you're one of the .25%!' But they didn't.
|
| I'm never buying an intel product again. Fuck intel.
| Dylan16807 wrote:
| This comment chain is talking about the oxidation in
| particular, and specifically the situation where the
| oxidation is _not_ the cause of the instability in the
| title. That's the only way they "identified & addressed
| the issue but didn't issue a recall".
|
| Do you have a reason to think the oxidation is the cause
| of your problems?
|
| Did you not read my first post trying to clarify the two
| separate issues?
|
| Am I misunderstanding something?
| hsbauauvhabzb wrote:
| Oxidisation in the context of CPU fabrication sounds
| pretty bad, I find it hard to believe it would have no
| impact on CPU stability regardless of what Intel's PR team
| has to say while minimizing any actual impacts caused.
|
| Edit: it sounds like intel have been aware of stability
| issues for some time and have said nothing, I'm not sure
| we have any reason to trust anything they say moving
| forward, relating to oxidisation or any other claims they
| make.
| Dylan16807 wrote:
| Well they didn't notice it for a good while, so it's
| really hard to say how much impact it had.
|
| And at a certain point if you barely believe anything
| they say, then you shouldn't be using their statement to
| get mad about. The complaint you're making depends on
| very particular parts of their statement being true but
| other very particular parts being not true. I don't think
| we have the evidence to do that right now.
| hsbauauvhabzb wrote:
| > Well they didn't notice it for a good while, so it's
| really hard to say how much impact it had.
|
| That negates any arguments you had related to failure
| rates.
|
| > The complaint you're making depends on very particular
| parts of their statement being true but other very
| particular parts being not true
|
| Er, I'm not even sure how to respond to this. GamersNexus
| has indicated they know about the oxidisation issue,
| intel *subsequently* confirm it was known internally but
| no public statement was made until now. I'm not
| unreasonably cherry picking parts of their statement and
| then drawing unreasonable conclusions. Intel have very
| clearly demonstrated they would have preferred to not
| disclose an issue in fabrication processes which very
| probably caused defective CPUs, they have demonstrated
| untrustworthy behaviour related to this entire thing
| (L1techs and GN are breaking the defective cpu story
| following leaks from major intel clients who have
| indicated that intel is basically refusing to cooperate).
|
| Intel has known about these issues for some time and said
| nothing. They have cost organisations and individuals
| time and money. Nothing they say now can be trusted
| unless it involves them admitting fault.
| Dylan16807 wrote:
| > That negates any arguments you had related to failure
| rates.
|
| I mean it's hard for _us_ to say, without sufficient
| data. But Intel might have that much data.
|
| Also what argument about failure rates? The one where I
| said "if" about failure rates?
|
| > Er, I'm not even sure how to respond to this.
| GamersNexus has indicated they know about the oxidisation
| issue, intel _subsequently_ confirm it was known
| internally but no public statement was made until now.
|
| GamersNexus thinks the oxidation might be the cause of
| the instability everyone is having. Intel claims
| otherwise.
|
| Intel has no reason to lie about this detail. It doesn't
| matter if the issue is oxidation versus something else.
|
| Also the issue Intel admits to can't be the problem with
| 14th gen, because it only happened to 13th gen chips.
|
| > Intel has known about these issues for some time and
| said nothing. Nothing they say now can be trusted unless
| it involves them admitting fault.
|
| If you don't trust what Intel said today at all, then you
| can't make good claims about what they knew or didn't
| know. You're picking and choosing what you believe to an
| extent I can't support.
| wslh wrote:
| It is the Pentium FDIV drama all over again! [1]. It is even
| in chapter 4 of Andrew Grove's book!
|
| [1] https://en.wikipedia.org/wiki/Pentium_FDIV_bug
| PedroBatista wrote:
| Good for Intel to finally "figure it out" but I'm not 100% sure
| microcode is 100% of the problem. As in everything complex
| enough, the "problem" can actually be many compounded problems,
| MB vendors' "special" tunes come to mind.
|
| But this is already a mess that is very hard to clean up, since I
| feel many of these CPUs will die in a year or 2 because of these problems
| today but by then nobody will remember this and an RMA will be
| "difficult" to say the least.
| johnklos wrote:
| You're right - at least partly. If the issue is that Intel was
| too aggressive with voltages, they can use microcode updates as
| 1) an excuse to rejigger the power levels and voltages the BIOS
| uses as part of the update, and 2) they can have the processor
| itself be more conservative with the voltages and clocking it
| calculates itself.
|
| Anything Intel announces, in my experience, is half true, so
| I'm interested to see what's actually true and what Intel will
| just forget to mention or will outright hide.
| nubinetwork wrote:
| They already tried bios updates when they pushed out the "intel
| defaults" a couple months ago...
| wmf wrote:
| Firmware and microcode aren't the same thing.
| jeffbee wrote:
| Very true and that's why it is odd that microcode has been
| mentioned here. Surely they mean PCU software (Pcode), or
| code for whatever they are calling the PCU these days.
| wmf wrote:
| I assume Intel's "microcode" updates include the PCU code,
| maybe some ME code, and whatever other little cores are
| hiding in the chip.
| jeffbee wrote:
| Well, do they? The operating system can provide microcode
| updates to a running CPU. Can the operating system patch
| the PCU, too?
|
| When I look at a "BIOS update" it usually seems to
| include UEFI, peripheral option ROMs, ME updates, and
| microcode. So if the PCU is getting patched I would think
| of it as a BIOS update. I think the ergonomics will be
| indistinguishable for end users.
| nicman23 wrote:
| firmware can include microcode though
| tedunangst wrote:
| Except they didn't.
| https://www.pcworld.com/article/2326812/intel-is-not-recomme...
| ChrisArchitect wrote:
| (updated from other post about mobile crashes)
|
| Related:
|
| Complaints about crashing 13th,14th Gen Intel CPUs now have data
| to back them up
|
| https://news.ycombinator.com/item?id=40962736
|
| Intel is selling defective 13-14th Gen CPUs
|
| https://news.ycombinator.com/item?id=40946644
|
| Intel's woes with Core i9 CPUs crashing look worse than we
| thought
|
| https://news.ycombinator.com/item?id=40954500
|
| Warframe devs report 80% of game crashes happen on Intel's Core
| i9 chips
|
| https://news.ycombinator.com/item?id=40961637
| tedunangst wrote:
| Not a dupe.
| silisili wrote:
| That one is mobile, this one is desktop, which they claim are
| different causes.
| HeliumHydride wrote:
| https://scholar.harvard.edu/files/mickens/files/theslowwinte...
|
| "Unfortunately for John, the branches made a pact with Satan and
| quantum mechanics [...] In exchange for their last remaining bits
| of entropy, the branches cast evil spells on future generations
| of processors. Those evil spells had names like "scaling-induced
| voltage leaks" and "increasing levels of waste heat" [...] the
| branches, those vanquished foes from long ago, would have the
| last laugh."
|
| "John was terrified by the collapse of the parallelism bubble,
| and he quickly discarded his plans for a 743-core processor that
| was dubbed The Hydra of Destiny and whose abstract Platonic ideal
| was briefly the third-best chess player in Gary, Indiana.
| Clutching a bottle of whiskey in one hand and a shotgun in the
| other, John scoured the research literature for ideas that might
| save his dreams of infinite scaling. He discovered several papers
| that described software-assisted hardware recovery. The basic
| idea was simple: if hardware suffers more transient failures as
| it gets smaller, why not allow software to detect erroneous
| computations and re-execute them? This idea seemed promising
| until John realized THAT IT WAS THE WORST IDEA EVER. Modern
| software barely works when the hardware is correct, so relying on
| software to correct hardware errors is like asking Godzilla to
| prevent Mega-Godzilla from terrorizing Japan. THIS DOES NOT LEAD
| TO RISING PROPERTY VALUES IN TOKYO. It's better to stop scaling
| your transistors and avoid playing with monsters in the first
| place, instead of devising an elaborate series of monster checks-
| and-balances and then hoping that the monsters don't do what
| monsters are always going to do because if they didn't do those
| things, they'd be called dandelions or puppy hugs."
| mattnewton wrote:
| I haven't read this piece before but I just knew it was going
| to be written by Mickens about halfway through your comment.
| throwup238 wrote:
| The "mickens" in the URL on the first line was a dead
| giveaway :-)
| yieldcrv wrote:
| > According to my dad, flying in airplanes used to be fun...
| Everybody was attractive ....
|
| this is how I feel about electric car supercharging stations at
| the moment. There is definitely a privilege aspect, which
| some attractive people are beneficiaries of in a predictable
| way, as well as other expensive maintenance for their health
| and attraction.
|
| so I could see myself saying the same thing to my children
| officeplant wrote:
| I'm ruining that trend by charging my E-Transit in nice
| places and dressing poorly.
| tibbydudeza wrote:
| Thanks - it is rather funny.
| tpurves wrote:
| I think it's telling that they are delaying the microcode patch
| until _after_ all the reviewers publish their Zen5 reviews and
| the comparisons of those chips against current Raptor Lake
| performance.
| zenonu wrote:
| Why even publish a comparison? Raptor Lake processors aren't a
| functioning product to benchmark against.
| tankenmate wrote:
| Because if publishers don't publish then they don't make
| money.
| AnthonyMouse wrote:
| Because the benchmarks will still exist on the sites after
| the microcode is released and a lot of the sites won't bother
| to go back and update them with the accurate performance
| level.
| Night_Thastus wrote:
| "Elevated operating voltage" my foot.
|
| We've already seen examples of this happening on non-OC'd server-
| style motherboards that perfectly adhere to the intel spec. This
| isn't like ASUS going 'hur dur 20% more voltage' and frying
| chips. If that's all it was it would be obvious.
|
| Lowering voltage may help mitigate the problem, but it sure as
| shit isn't the cause.
| dwattttt wrote:
| They also admit a microcode algorithm produces incorrect
| requests for voltages, it doesn't sound like they're trying to
| shift the blame; ASUS doesn't write that microcode
| sirn wrote:
| It's worth noting that W680 boards are not server boards;
| they're workstation boards, and oftentimes they're
| overclockable (or even overclocked by default). Wendell
| actually showed the other day that the ASUS W680 board was
| feeding 253W into a 35W (106W boost) 13700T CPU by default[1].
|
| Supermicro and ASRock Rack do sell W680 as a server (because it
| took Intel a really long time to release C266), but while
| they adhere strictly to the spec, some boards are really not meant
| for K CPUs. For example, the Supermicro MBI-311A-1T2N is only
| certified for non-TVB E/T CPUs, and trying to run a K CPU
| on it can result in the board plumbing 1.55V into the CPU
| during single-core load (where 1.4V would already be on the
| higher side)[2].
|
| In this particular case, the "non-OC'd server-style
| motherboard" doesn't really mean anything (even more so in the
| context of this announcement).
|
| [1]: https://x.com/tekwendell/status/1814329015773086069
|
| [2]: https://x.com/Buildzoid1/status/1814520745810100666
| paulmd wrote:
| Specifically I think the concerns are around idle voltage and
| overshoot at this point, which is indeed something configured
| by OEMs.
|
| edit: BZ just put out a video talking about running Minecraft
| servers destroying CPUs reliably, topping out at 83C, normally
| in the 50s, running 3600 speeds. Which is a clear issue with
| low-thread loads.
|
| https://m.youtube.com/watch?v=yYfBxmBfq7k
| userbinator wrote:
| Reminds me of Sudden Northwood Death Syndrome, 2002.
|
| Looks like history may be repeating itself, or at least rhyming
| somewhat.
|
| Back then, CPUs ran on fixed voltages and frequencies and only
| overclockers discovered the limits. Even then, it was rare to
| find reports of CPUs killed via overvolting, unless it was to an
| extreme extent --- thermal throttling, instability, and shutdown
| (THERMTRIP) seemed to occur before actual damage, preventing the
| latter from happening.
|
| Now, with CPU manufacturers attempting to squeeze all the
| performance they can, they are essentially doing this
| overclocking/overvolting automatically and dynamically in
| firmware (microcode), and it's not surprising that some bug or
| (deliberate?) ignorance that overlooked reliability may have
| pushed things too far. Intel may have been more conservative with
| the absolute maximum voltages until recently, and of course small
| process sizes with higher potential for electromigration are a
| source of increased fragility.
|
| Also anecdotal, but I have an 8th-gen mobile CPU that has been
| running hard against the thermal limits (100C) 24/7 for over 5
| years (stock voltage, but with power limits all unlocked), and it
| is still 100% stable. This and other stories of CPUs in use for
| many years with clogged or even detached heatsinks seem to
| contribute to the evidence that high voltage is what kills CPUs,
| not heat or frequency.
|
| Edit: I just looked up the VCore maximum for the 13th/14th
| processors - the datasheet says 1.72V! That is far more than I
| expected for a 10nm process. For comparison, a 1st-gen i7 (45nm)
| was specified at 1.55V absolute maximum, and in the 32nm version
| they reduced that to 1.4V; then for the 22nm version it went up
| slightly to 1.52V.
| slaymaker1907 wrote:
| Interesting, I hadn't heard about the Pentium overclocking
| issues. My theory on the current issue is that running chips for
| long periods of time at 100C is not good for chip longevity,
| but voltages could also be an issue. I came up with this theory
| last summer when I built my rig with a 13900k, though I was
| doing it with the intention of trying to set things up so the
| CPU could last 10 years.
|
| Anecdotally, my CPU has been a champ and I haven't noticed any
| stability issues despite doing both a lot of gaming and a lot
| of compiling on it. I lost a bit of performance but not much
| setting a power limit of 150W.
| cyanydeez wrote:
| I believe the first round of Intel excuses here blamed the
| motherboard manufacturers for trying to "auto" overclock these
| CPUs.
| unregistereddev wrote:
| > Back then, CPUs ran on fixed voltages and frequencies and
| only overclockers discovered the limits. Even then, it was rare
| to find reports of CPUs killed via overvolting, unless it was
| to an extreme extent --- thermal throttling, instability, and
| shutdown (THERMTRIP) seemed to occur before actual damage,
| preventing the latter from happening.
|
| Oh the memories. I had a Thunderbird-core Athlon with a stock
| frequency of (IIRC) 1050MHz. It was stable at 1600MHz, and I
| ran it that way for years. I was able to get it to 1700MHz, but
| then my CPU's stability depended on ambient temperatures. When
| the room got hot in the summer my workstation would randomly
| kernel panic.
| whalesalad wrote:
| If I didn't just recently invest in 128gb of DDR4 I'd jump ship
| to AMD/AM5. My 13900k has been (knock on wood) solid though -
| with 24/7 uptime since July 2023.
| thangngoc89 wrote:
| I guess you're lucky. I own 2 machines for small scale CNN
| training, one 13900k and one 14900k. I have to throttle the CPU
| performance to 90% for stable running. This costs me about 1
| hour / 100 hours of training.
| whalesalad wrote:
| Are you using any motherboard overclocking stuff? A lot of
| mobos are pushing these chips pretty hard right out of the
| box.
|
| I have mine at a factory setting that Intel would suggest,
| not the asus multi core enhancement crap. noctua dh15 cooler.
| It's really been a stable setup.
| thangngoc89 wrote:
| I didn't set up anything in the BIOS, but my motherboards are
| from ASUS. I will look into this. Thanks for your
| suggestion.
| Dunati wrote:
| My 13900k has definitely degraded over time. I was
| running BIOS defaults for everything and the PC was fine
| for several months. When I started getting crashes it
| took me a long time to diagnose it as a CPU problem.
| Changing the mobo vdroop setting made the problem go away
| for a while, but it came back. I then got it stable again
| by dropping the core multipliers down to 54x, but then a
| couple months later I had to drop to 53x. I just got an
| RMA replacement and it has made it 12 hours without
| issue.
| J_Shelby_J wrote:
| I evaluated DDR4 vs DDR5 a year ago, and it wasn't worth it.
| I was chasing FPS, and the cost to hit the same speed in DDR5 was
| just too high, so I stayed on DDR4 and I'm glad I did. I'm on a
| 13700k and I'm also very stable. However, with the stock XMP
| profile for my RAM I was very much not stable, getting errors and
| BSODs within minutes on an OCCT burn-in test. All I had to do was
| roll back the memory clock speed a few hundred MHz.
| phire wrote:
| I find it hard to believe that it actually is a microcode issue.
|
| Mostly because Intel has way too much motivation to pass it off
| as a microcode issue, as they can fix a microcode issue for free,
| by pushing out a patch. If it's an actual hardware issue, then
| Intel will be forced to actually recall all the faulty CPUs,
| which could cost them billions.
|
| The other reason is that it took them way too long to give
| details. If it's as simple as a buggy microcode requesting an
| out-of-spec voltage from the motherboard, they should have been
| able to diagnose the problem extremely quickly and fix it in just
| a few weeks. They would have detected the issue as soon as they
| put voltage logging on the motherboard's VRM. And according to
| some sources, Intel have apparently been shipping non-faulty CPUs
| for months now (since April, from memory), and those don't have
| an updated microcode.
|
| This long delay and silence feels like they spent months of R&D
| trying to create a workaround, create a new voltage spec to
| provide the lowest voltage possible. Low enough to work around a
| hardware fault on as many units as possible, without too large of
| a performance regression, or creating new errors on other CPUs
| because of undervolting.
|
| I suspect that this microcode update will only "fix" the crashes
| for some CPUs. My prediction is that in another month Intel will
| claim there are actually two completely independent issues, and
| reluctantly issue a recall for anything not fixed by the
| microcode.
| worthless-trash wrote:
| I believe that the waters may be muddied enough that they won't
| have to do a full recall, only a replacement if you 'provide
| evidence' the system is still crashing.
| RedShift1 wrote:
| As I understand it, there are multiple voltages inside the CPU,
| so just monitoring the motherboard VRM won't cut it.
|
| That said, I too am very skeptical. I just issued a moratorium
| on the purchase of anything Intel 13th/14th gen in our company
| and am waiting for some actual proof that the issue is fully
| resolved.
| phire wrote:
| It's complicated.
|
| On Raptor Lake, there are a few integrated voltage regulators
| which provide new voltages for specialised uses (like the
| E core's L2 cache, parts of DDR memory IO, PCI-E IO), but the
| current draw on those regulators is pretty low. The bulk of
| the power comes directly from motherboard VRMs on one of
| several rails with no internal regulation. Most of the power
| draw is grouped onto just two rails, VccGT for the GPU, and
| VccCore (also known as VccIA in other generations) which
| powers all the P-cores, all the E-cores, the ring bus and
| the last-level cache.
|
| Which means all cores share the same voltage, and it's
| trivial to monitor externally.
|
| I guess it's possible the bug could be with only one of the
| integrated voltage regulators, but those seem to only power
| various IO devices, and I struggle to see how they could
| trigger this type of instability.
| BeeOnRope wrote:
| What's special about the E core's L2 cache such that it
| gets on-chip regulated voltage?
| jfindley wrote:
| The months of R&D to create a workaround could simply be
| because the subset of motherboards which trigger this issue are
| doing something borderline/unexpected with their voltage
| management, and finding a workaround for that behaviour in CPU
| microcode is non-trivial. Not all motherboard models appear to
| trigger the fault, which suggests that motherboard behaviour is
| at least a contributing factor to the problem.
| ploxiln wrote:
| I think this issue was sort of cracked-open and popularized
| recently by this particular video from Level1Techs:
| https://www.youtube.com/watch?v=QzHcrbT5D_Y
|
| Towards the middle of the video it brings up some very
| interesting evidence, from online game server farms that use
| 13900 and 14900 variants for their high single-core
| performance for the cost, but with server-grade motherboards
| and chipsets that do not do any overclocking, and would be
| considered "conservative". But these environments show a very
| high statistical failure rate for these particular CPU
| models. This suggests that some high percentage of CPUs
| produced are affected, and it's long run-time over which the
| problem can develop, not just enthusiast/gamer motherboards
| pushing high power levels.
| starspangled wrote:
| All modern CPUs come out of the factory with many many bugs.
| The errata you see published are only the ones that they find
| after shipping (if you're lucky, they might not even publish
| all errata). Many bugs are fixed in testing and qualification
| before shipping.
|
| That's how CPU design goes. The way that is done is by pushing
| as much to firmware as possible, adding chicken switches and
| fallback paths, and all sorts of ways to intercept regular
| operation and replace it with some trap to microcode or flush
| or degraded operation.
|
| Applying fixes and workarounds might cost quite a bit of
| performance (think Spectre disabling of some kinds of branch
| predictors for an obvious very big one). And in some cases you
| even see in published errata they leave some theoretical
| correctness bugs unfixed entirely. Where is the line before
| accepting returns? Very blurry and unclear.
|
| Almost certainly, huge parts of their voltage regulation (which
| goes along with frequency, thermal, and logic throttling) will
| be highly configurable. Quite likely it's run by entirely
| programmable microcontrollers on chip. Things that are baked
| into silicon might be voltage/droop sensors, temperature
| sensors, etc., and those could behave unexpectedly, although
| even then there might be redundancy or ways to compensate for
| small errors.
|
| I don't see that they "passed it off" as a microcode issue, just
| said that a microcode patch could fix it. As you see it's very
| hard from the outside to know if something can be reasonably
| fixed by microcode or to call it a "microcode issue". Most
| things can be fixed with firmware/microcode patches, by design.
| And many things are. For example if some voltage sensor circuit
| on the chip behaved a bit differently than expected in the
| design but they could correct it by adding some offsets to a
| table, then the "issue" is that silicon deviates from the model
| / design and that can not be changed, but firmware update would
| be a perfectly good fix, to the point they might never bother
| to redo the sensor even if they were doing a new spin of the
| masks.
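|
| A minimal sketch of the "add some offsets to a table" idea above,
| assuming a simple step-wise calibration table. The table, its
| values and the function name are hypothetical and are not taken
| from any Intel firmware:

```python
from bisect import bisect_right

# Hypothetical calibration table: (raw sensor reading in mV, offset in mV).
# Every number here is invented for illustration.
OFFSET_TABLE = [(800, 0), (1000, 5), (1200, 12), (1400, 20)]


def corrected_reading(raw_mv: int) -> int:
    """Add the offset of the nearest table row at or below the raw reading."""
    keys = [key for key, _ in OFFSET_TABLE]
    idx = bisect_right(keys, raw_mv) - 1
    idx = max(idx, 0)  # readings below the first row reuse the first row's offset
    return raw_mv + OFFSET_TABLE[idx][1]
```

| The silicon stays the same in this picture; only the table that
| firmware consults changes.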
|
| On the voltage issue, they did not say it was requesting an out
| of spec voltage, they said it was incorrect. This is not
| necessarily detectable out of context. Dynamic voltage and
| frequency scaling and all the analog issues that go with it are
| fiendishly complicated: voltage requested from a regulator is
| not what gets seen at any given component of the chip, and loads,
| switching, capacitance, frequency, temperature, etc., can all
| conspire to change these things. And modern CPUs run as close
| to absolute minimum voltage/timing guard bands as possible to
| improve efficiency, and they boost up to as high voltages as
| they can to increase performance. A small bug or error in some
| characterization data in this very complicated algorithm of
| many variables and large multi dimensional tables could easily
| cause voltage/timing to go out of spec and cause instability.
| And it does not necessarily leave some nice log you can debug
| because you can't measure voltage from all billion components
| in the chip on a continuous basis.
|
| And some bugs just take a while to find and fix. I'm not a
| tester per se but I found a logic bug in a CPU (not Intel but
| commercial CPU) that was quickly reproducible and resulted in a
| very hard lockup of a unit in the core, but it still took weeks
| to find it. Imagine some ephemeral analog bug lurking in a
| dusty corner of their operating envelope.
|
| Then you actually have to develop the fix, then you have to run
| that fix through quite a rigorous testing process and get
| reasonable confidence that it solves the problem, before you
| would even make this announcement to say you've solved it. Add
| N more weeks for that.
|
| So, not to say a dishonest or bad motivation from Intel is out
| of the question. But it seems impossible to make such
| speculations from the information we have. This announcement
| would be quite believable to me.
| ChoGGi wrote:
| I agree with most of what you said, so cherry picking one
| thingy to reply to isn't my intention, but
|
| "And some bugs just take a while to find and fix."
|
| I think it's less that it took a while to find the bug/etc,
| more so that they've been pretty much radio silent for six
| months. AMD had the issue with burning 7 series CPUs, they
| were quick to at least put out a statement that they'll make
| customers whole again.
| sqeaky wrote:
| > As you see it's very hard from the outside to know if
| something can be reasonably fixed by microcode or to call it
| a "microcode issue"
|
| They claimed:
|
| > a microcode algorithm resulting in incorrect voltage
| requests to the processor.
| gwbas1c wrote:
| It's most likely both a hardware issue and a microcode issue.
|
| Making CPUs is kind-of like sorting eggs. When they're made,
| they all have slightly different characteristics and get placed
| into bins (i.e., "binned") based on how they meet the specs.
|
| To oversimplify, the cough "better" chips are sold at higher
| prices because they can run at higher clock speeds and/or
| handle higher voltages. If there's a speck of dust on the die, a
| feature gets turned off and the chip is sold for a lower price.
|
| In this case, this is most likely an edge case _that would not
| be a defect_ if shipping microcode already handled it.
| (Although it is appropriate to ask if it would result in
| affected chips going into a lower-price bin if they are
| affected.)
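|
| A rough sketch of the binning idea described above, assuming
| chips are sorted only by their highest stable clock. The bin
| names and thresholds are hypothetical; real binning weighs many
| more characteristics (voltage, leakage, defective units):

```python
# Hypothetical bins: (bin name, minimum stable clock in MHz), best bin first.
BINS = [("top", 6000), ("mid", 5600), ("low", 5200)]


def assign_bin(max_stable_mhz: int) -> str:
    """Return the best bin whose minimum clock the chip meets, else 'reject'."""
    for name, min_mhz in BINS:
        if max_stable_mhz >= min_mhz:
            return name
    return "reject"
```

| In this picture, a chip that was characterized too optimistically
| lands in a bin it cannot sustain.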
| nequo wrote:
| > If there's a speck of dust on the die, a feature gets turned
| off and the chip is sold for a lower price.
|
| Do you mean that if a 13900KS CPU has a manufacturing defect,
| it gets downgraded and sold as 13900F or something else
| according to the nature of the defect?
| gwbas1c wrote:
| It's been almost 20 years since I worked in the industry,
| so I don't want to make assumptions about _specific
| products._
|
| When I was in the industry, it would be things like
| disabling caches, disabling cores, etc. I don't remember
| specific products, though.
|
| Likewise, some dies can handle higher voltages, clock
| speeds, etc.
| colejohnson66 wrote:
| Yes. It's called the silicon lottery.
| burnte wrote:
| > I find it hard to believe that it actually is a microcode
| issue.
|
| They learned a lot from the Pentium disaster. Even if it's a
| hardware issue, they can address it with microcode at least,
| which is just as good.
| yencabulator wrote:
| Except normally the result of a microcode workaround is that
| the chip no longer performs at its claimed/previously-
| measured level. Not "as good" by any standard.
|
| For example, Intel CPU + Spectre mitigation is not "as good"
| as a CPU that didn't have the vulnerability in the first
| place.
| sqeaky wrote:
| At least with spectre applying the mitigation was a choice.
| You could turn it off and game at full speed, while turning
| it on for servers and web browsing for safety.
|
| This is busted or working.
| burnte wrote:
| Microcode changes don't have to affect performance
| negatively. Do you have any evidence this one will? If it's
| a voltage algorithm failure, then I would expect that they
| could run it as advertised with corrected microcode.
| Unstable power is a massive issue for electronics like this
| and I have no problem believing their explanation. Bad
| power causes all sorts of weird issues.
| yencabulator wrote:
| If it was a microcode bug to begin with, fixing the bug
| wouldn't need to degrade performance. If it was e.g. a
| bad sensor, that you can "correct" well enough by
| postprocessing, it doesn't need to degrade performance.
| But if it's essentially incorrect binning -- the hardware
| can't function as they thought it would, use microcode to
| limit e.g. voltage to the range where it works right --
| then that will degrade performance.
| xyst wrote:
| Wonder what Linus has to say on this. Dude knows how to rip into
| crappy Intel products
| weberer wrote:
| Torvalds or the Youtube guy?
| happosai wrote:
| Yes
| aruametello wrote:
| I can imagine both will bash intel a bit.
|
| "Linus Tech Tips" for the gaming crowd situation (loss of
| "paid for" premium performance) and Torvalds for the
| hardware vendor lack of transparency with the community.
| salamo wrote:
| Is there any info on how to diagnose this problem? Having just
| put together a computer with the 14900KF, I _really_ don't want
| to swap it out if not necessary.
| sudosysgen wrote:
| Running a full memtest overnight and a day of Prime95 with
| validation is the traditional way of sussing out instability.
| paulmd wrote:
| it's also a terrible stability test these days for the same
| reasons Wendell talks about with cinebench in his video with
| Ian (and Ian agrees too). Doesn't work like 90% of the chip -
| it's purely a cache/avx benchmark. You can have a completely
| unstable frontend and it'll just work fine because prime95
| fits in icache and doesn't need the decoder, and it's just
| vector op, vector op, vector op forever.
|
| You can have a system that's 24/7 prime95 stable that crashes
| as soon as you exit out, because it tests so very little of
| it. That's actually not uncommon due to the changes in
| frequency state that happen once the chip idles down... and
| it's been this way for more than a decade, speedstep used to
| be one of the things overclockers would turn off because it
| posed so many problems vs just a stable constant frequency
| load.
| J_Shelby_J wrote:
| OCCT burn in test with AVX and XMP disabled.
|
| Tbh, XMP is probably the cause of most modern crashes on gaming
| rigs. It does not guarantee stability. After finding a stable
| cpu frequency, enable xmp and roll back the memory frequency
| until you have no errors in OCCT. The whole thing can be done
| in 20 minutes and your machine will have 24/7/365 uptime.
| LtdJorge wrote:
| This is good advice for overclocking, but how does it help
| with the 13th/14th Gen issue? The issue is not due to clocks,
| or at least doesn't appear to be.
| Fabricio20 wrote:
| There is no reliable way to diagnose this issue with the 14th
| gen, the chip slowly degrades over time and you start getting
| more and more (usually gpu driver under windows) crashes. I
| believe the easy way might be to run decompression stress tests
| if I remember correctly from Wendell's (Level1Techs) video.
|
| I highly recommend going into your motherboard right now and
| manually setting your configurations to the current intel
| recommendation to prevent it from degrading to the point where
| you'd need to RMA it. I have a 14900K and it took about 2.5
| months before it started going south and it was getting worse
| by the DAY for me. Intel has closed my RMA ticket since
| changing the bios settings to very-low-compared-to-what-the-
| original-is has made the system stable again, so I guess I have
| a 14900K that isn't a high end chip anymore.
|
| Below are the configs intel provided to me on my RMA ticket
| that have made my clearly degraded chip stable again:
|
| CEP (Current Excursion Protection) > Enable
| eTVB (Enhanced Thermal Velocity Boost) > Enable
| TVB (Thermal Velocity Boost) > Enable
| TVB Voltage Optimization > Enable
| ICCMAX Unlimited bit > Disable
| TjMAX Offset > 0
| C-States (Including C1E) > Enable
| ICCMAX > 249A
| ICCMAX_APP > 200A
| Power limit 1 (PL1) > 125W
| Power limit 2 (PL2) > 188W
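|
| A small sketch for checking a machine's settings against the
| list above, assuming you have copied your current BIOS values
| into a dictionary by hand. The key names and the function are
| hypothetical labels; the values are the ones quoted in this
| comment:

```python
# Values copied from the comment above; key names are hypothetical labels.
RECOMMENDED = {
    "CEP": "Enable",
    "eTVB": "Enable",
    "TVB": "Enable",
    "TVB Voltage Optimization": "Enable",
    "ICCMAX Unlimited bit": "Disable",
    "TjMAX Offset": 0,
    "C-States": "Enable",
    "ICCMAX (A)": 249,
    "ICCMAX_APP (A)": 200,
    "PL1 (W)": 125,
    "PL2 (W)": 188,
}


def mismatches(current: dict) -> dict:
    """Map each differing setting to (current value, recommended value)."""
    return {
        key: (current.get(key), wanted)
        for key, wanted in RECOMMENDED.items()
        if current.get(key) != wanted
    }
```

| An empty result means every listed setting matches; a missing
| key shows up with None as its current value.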
| Havoc wrote:
| > Intel is delivering a microcode patch which addresses the root
| cause of exposure to elevated voltages.
|
| That's great news for intel. If that's correct. If not that'll be
| a PR bloodbath
| eigenform wrote:
| by "microcode" i assume they meant "pcode" for the PCU? (but they
| decided not to make that distinction here for whatever reason?)
| ChoGGi wrote:
| Hmm, mid August is after the new Ryzens are out, I wonder how bad
| of a performance hit this microcode update will bring?
|
| And will it actually fix the issue?
|
| https://www.youtube.com/watch?v=QzHcrbT5D_Y
| uticus wrote:
| Dumb question: let's say I am in charge of procurement for a
| significant amount of machines, do I not have the option of
| ordering machines from three generations back? Are older (proven
| reliable) processors just not available because they're no longer
| made, like my 1989 Camry?
| wmf wrote:
| Yeah, 12th gen is probably still available.
| cdchn wrote:
| I built a system last fall with an i9-13900K and have been having
| the weirdest crashing problems with certain games that I never
| had problems with before. NEVER been able to track it down, no
| thermal issues, no overclocking, all updated drivers and BIOS.
| Maybe this is finally the answer I've been looking for.
___________________________________________________________________
(page generated 2024-07-23 23:11 UTC)