[HN Gopher] July 2024 Update on Instability Reports on Intel Cor...
       ___________________________________________________________________
        
       July 2024 Update on Instability Reports on Intel Core 13th/14th Gen
       Desktop CPUs
        
       Author : acrispino
       Score  : 307 points
        Date   : 2024-07-22 20:55 UTC (1 day ago)
        
 (HTM) web link (community.intel.com)
 (TXT) w3m dump (community.intel.com)
        
       | firebaze wrote:
       | Nice that Intel acknowledges there are problems with that CPU
       | generation. If I read this right, the CPUs have been supplied
       | with a too-high voltage across the board, with some tolerating
       | the higher voltages for longer, others not so much.
       | 
       | Curious to see how this develops in terms of fixing defective
       | silicon.
        
       | tux3 wrote:
       | Remains to be seen how the microcode patch affects performance,
       | and how these CPUs that have been affected by over-voltage to the
       | point of instability will have aged in 6 months, or a few years
       | from now.
       | 
       | More voltage generally improves stability, because there is more
       | slack to close timing. Instability with high voltage suggests
       | dangerous levels. A software patch can lower the voltage from
       | this point on, but it can't take back any accumulated fatigue.
        
         | giantg2 wrote:
         | I was recently looking at building and buying a couple systems.
         | I've always liked Intel. I went AMD this time.
         | 
         | It seemed like the base frequencies vs boost frequencies were
         | much farther apart on Intel than with most of the AMDs. This
          | was especially true on the laptops, where cooling is a larger
         | concern. So I suspect they were pushing limits.
         | 
         | Also, the performance core vs efficiency core stuff seemed kind
         | of gimmicky with so few performance cores and so many
         | efficiency cores. Like look at this 20 core processor! Oh wait,
         | it's really an 8 core when it comes to performance. Hard to
         | compare that to a 12 core 3D cached Ryzen with even higher
         | clock...
         | 
         | I will say, it seems intel might still have some advantages. It
         | seems AMD had an issue supporting ECC with the current
         | chipsets. I almost went Intel because of it. I ended up
         | deciding that DDR5 built in error correction was enough for me.
         | The performance graphs also seem to indicate a smoother
         | throughput suggesting more efficient or elegant execution (less
          | blocking?). But on average the AMDs seem to be putting out
          | similar end results even if the graph is a bit more "spiky".
        
           | nullindividual wrote:
           | > It seems AMD had an issue supporting ECC with the current
           | chipsets.
           | 
           | AMD has the advantage with regards to ECC. Intel doesn't
           | support ECC at all on consumer chips, you need to go Xeon.
           | AMD supports it on all chips, but it is up to the motherboard
            | vendor to implement it (correctly). You can get consumer-class
           | AM4/5 boards that have ECC support.
        
             | jwond wrote:
             | Actually some of the 13th and 14th gen Intel Core
             | processors support ECC.
        
               | mackal wrote:
                | Intel has always randomly supported ECC on desktop
               | CPUs. Sometimes it was just a few low end SKUs, sometimes
               | higher end SKUs. 14th gen it appears i9s and i7s do,
               | didn't check i5s, but i3s did not.
        
             | ThatMedicIsASpy wrote:
             | You need W680 boards (starting at around 500 bucks) for ECC
             | on desktop intel chips.
        
               | giantg2 wrote:
               | I was seeing them around $400 (still expensive).
        
             | giantg2 wrote:
             | My understanding is that it's screwed up for multiple
             | vendors and chipsets. The boards might say they support it,
              | but there are some updates saying they don't. It seemed
             | extremely hard to find any that actually supported it. It
             | was actually easier to find new Intel boards supporting
             | ECC.
        
               | paulmd wrote:
               | yeah wendell put out a video a few weeks ago exploring a
               | bunch of problems with asrock rack-branded server-market
               | B650 motherboards and basically the ECC situation was
               | exactly what everyone warns about: the various BIOS
               | versions wandered between "works, but doesn't forward the
               | errors", "doesn't work, and doesn't forward the errors",
               | and (excitingly) "doesn't work and doesn't even post". We
               | are a year and a half after zen4 launched and there
               | barely are any server-branded boards to begin with, and
                | even _those boards_ don't work right.
               | 
               | https://youtu.be/RdYToqy05pI?t=503
               | 
               | I don't know how many times it has to be said but
               | "doesn't explicitly disable" is not the same thing as
               | "support". There are lots of other enablement steps that
               | are required to get ECC to work properly, and they really
               | need to be explicitly tested with each release (which if
               | it is "not explicitly disabled", it's not getting
               | tested). Support means you can complain to someone when
               | it doesn't work right.
               | 
               | AMD churns AGESA really, really hard and it breaks all
               | the time. Partners have to try and chase the upstream and
               | sometimes it works and sometimes it doesn't. Elmor
               | (Asus's Bios Guy) talked about this on Overclock.net back
               | around 2017-2018 when AMD was launching X399 and talked
               | about some of the troubles there and with AM4.
               | 
               | That said, the current situation has seemingly lit a fire
               | under the board partners, with Intel out of commission
               | and all these customers _desperate_ for an alternative to
                | their W680/raptor lake systems (which do support ecc
                | _officially_, btw) in these performance-sensitive niches
               | or power-limited datacenter layouts, they are finally
               | cleaning up the mess like, within the last 3 weeks or so.
               | They've very quickly gone from not caring about these
               | boards to seeing a big market opportunity.
               | 
               | https://www.youtube.com/watch?v=n1tXJ8HZcj4
               | 
               | can't believe how many times I've explained in the last
               | month that yes, people do actually run 13700Ks in the
               | datacenter... with ECC... and actually it's probably some
               | pretty big names in fact. A previous video dropped the
               | tidbit that one of the major affected customers is
               | Citadel Capital - and yeah, those are the guys who used
               | to get special EVEREST and BLACK OPS skus from intel for
               | the same thing. Client platform is better at that, the
               | very best sapphire rapids or epyc -F or -X3D sku is going
               | to be like 75% of the performance at best. It's also the
               | fastest thing available for serving NVMe flash storage
               | (and Intel specifically targeted this, the Xeon E-2400
               | series with the C266 chipset can talk NVMe SAS _natively_
               | on its chipset with up to 4 slimsas ports...)
               | 
               | it's somewhere in this one I think:
               | https://www.youtube.com/watch?v=5KHCLBqRrnY
        
               | justinclift wrote:
                | The new EPYC processors for AM5 look like they'll be ok
                | for ECC ram though, at least in the coming months.
        
               | paulmd wrote:
               | Yeah I think that's the bright spot, now that there's a
                | branded offering for server-flavored Ryzen, maybe there
                | is a permanent justification for doing proper
               | validation.
               | 
               | I just feel vindicated lol, it always comes up that "well
               | works fine for me!" and the reality is it's a _total_
               | crapshoot with even server-branded boards often not
               | working. There is zero chance your gigabyte UD3 or
               | whatever is going to be consistently supported across
               | bios and often it will not be.
               | 
               | And AMD is really really tied to AGESA releases, so it's
               | fairly important on that side. Although I guess maybe
               | we're seeing now what happens if you let too much be
               | abstracted away... but on the other hand partners were
               | blowing up AMD chips last year too.
               | 
               | If you're comfortable always testing, and always having
               | the possibility of there being some big AGESA problem and
               | ecc being broken on the new versions... ok I guess.
               | 
               | There is a reason the i3 chips were perennial favorites
               | for edge servers and NASs. And I think it's really,
               | really hard to overstate the long-term damage from
               | reputation loss here. Intel, meltdown aside, was always
               | no-drama in terms of reliability. Other than C2000/C3000,
               | I guess.
        
             | mananaysiempre wrote:
             | > AMD supports [ECC RAM] on all chips
             | 
             | There was a strange happening with AMD laptop CPUs
             | ("APUs"): the non-soldered DDR5 variants of the 7x40's were
             | advertised to support ECC RAM on AMD's website up until a
             | couple months before any actual laptops were sold, then
             | that was silently changed and ECC is only on the PRO models
             | now. I still don't know if this is a straightforward
             | manufacturing or chipset issue of some kind or a sign of
             | market segmentation to come.
             | 
             | (I'm quite salty I couldn't get my Framework 13 with ECC
             | RAM because of this.)
        
             | ploek wrote:
             | > AMD supports it on all chips
             | 
             | Unfortunately not. I can't say for current gen, but the
             | 5000 series APUs like the 5600G do _not_ support ECC. I
             | know, I tried...
             | 
             | But yes, most Ryzen CPUs do have ECC functionality, and
             | have had it since the 1000 series, even if not officially
             | supported. Official support for ECC is only on Ryzen PRO
             | parts.
        
             | smolder wrote:
             | ECC support wasn't good initially on AM5, but there are now
             | Epyc branded chips for the AM5 socket which officially
             | support ECC DDR5. They come in the same flavors as the
             | Ryzen 7xx0 chips, but are branded as Epyc.
        
           | fomine3 wrote:
            | More E-cores are reasonable for multi-threaded application
            | performance. They're efficient for power and die area, as the
            | name indicates, so they can implement more E-cores than
            | P-cores for the same power/area budget. It's not suitable for
            | those who need many single-threaded performance cores, like a
            | VM server, but I don't know of any major consumer usage that
            | requires such performance.
        
             | giantg2 wrote:
              | I can sort of see that. The way I saw it explained was
              | them being much lower clock and having a pretty small shared
             | cache. I could see E cores as being great for running
             | background processes and stuff. All the benchmarks seem to
             | show the AMDs with 2/3rd the cores being around the same
             | performance and with similar power draw. I'm not putting
             | them down. I'm just saying it seems gimmicky to say "look
             | at our 20 core!" with the implicit idea that people will
             | compare that with an AMD 12 core seeing 20>12, but not
             | seeing the other factors like cost and benchmarks.
        
               | philistine wrote:
               | It's the megahertz wars all over again!
        
               | cyanydeez wrote:
                | Computers have taught us the rubric of truth: Numbers go
               | Up.
        
             | Sohcahtoa82 wrote:
              | > but I don't know of any major consumer usage that
              | requires such performance.
             | 
             | Gaming.
             | 
             | There are some games that will benefit from greater single-
             | core performance.
        
           | akira2501 wrote:
           | > so few performance cores and so many efficiency cores
           | 
            | I was baffled by this too, but what they don't make clear
            | is that the performance cores have hyperthreading while the
            | efficiency cores do not.
           | 
           | So what they call 2P+4E actually becomes an 8 core system as
           | far as something like /proc/cpuinfo is concerned. They're
           | also the same architecture so code compiled for a particular
           | architecture will run on either core set and can be moved
           | from one to the other as the scheduler dictates.
        
             | Dylan16807 wrote:
             | > They're also the same architecture so code compiled for a
             | particular architecture will run on either core set and can
             | be moved from one to the other as the scheduler dictates.
             | 
             | I don't know if that has done more good than harm, since
             | they ripped AVX-512 out for multiple generations to ensure
             | parity.
        
           | Dylan16807 wrote:
           | > Like look at this 20 core processor! Oh wait, it's really
           | an 8 core when it comes to performance.
           | 
           | The E cores are about half as fast as the P cores depending
           | on use case, at about 30% of the size. If you have a program
           | that can use more than 8 cores, then that 8P+12E CPU should
           | approach a 14P CPU in speed. (And if it can't use more than 8
           | cores then P versus E doesn't matter.) (Or if you meant
           | 4P+16E then I don't think those exist.)
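            | (Back of the envelope: 8 + 12 x 0.5 = 14 P-core
            | equivalents.)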
           | 
           | > Hard to compare that to a 12 core 3D cached Ryzen with even
           | higher clock...
           | 
           | Only half of those cores properly get the advantage of the 3D
           | cache. And I doubt those cores have a higher clock.
           | 
           | AMD's doing quite well but I think you're exaggerating a good
           | bit.
        
             | ComputerGuru wrote:
             | > If you have a program that can use more than 8 cores,
             | then that 8P+12E CPU should approach a 14P CPU in speed
             | 
             | Only if you use work stealing queues or (this is
             | ridiculously unlikely) run multithreaded algorithms that
             | are aware of the different performance and split the work
             | unevenly to compensate.
        
               | Dylan16807 wrote:
               | Or if you use a single queue... which I would expect to
               | be the default.
               | 
               | Blindly dividing work units across cores sounds like a
               | terrible strategy for a general program that's sharing
               | those cores with who-knows-what.
        
               | ComputerGuru wrote:
               | It's a common strategy for small tasks where the overhead
               | of dispatching the task greatly exceeds the computation
               | of it. It's also a better way to maximize L1/L2 cache hit
               | rates by improving memory locality.
               | 
               | Eg you have 100M rows and you want to cluster them by a
               | distance function (naively), running dist(arr[i], arr[j])
               | is crazy fast, the problem is just that you have so many
               | of them. It is faster to run it on one core than dispatch
               | it from one queue to multiple cores, but best to assign
               | the work ahead of time to n cores and have them crunch
               | the numbers.
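                | 
                | Roughly this shape, as a toy sketch (dist() and the
                | details are stand-ins, not real code):
                | 
                |   import os
                |   from concurrent.futures import ProcessPoolExecutor
                | 
                |   def dist(a, b):
                |       return abs(a - b)  # stand-in distance fn
                | 
                |   def crunch(chunk):
                |       # each worker walks its own contiguous slice, so
                |       # dispatch overhead is paid once per slice and
                |       # locality stays good
                |       return [dist(chunk[i - 1], chunk[i])
                |               for i in range(1, len(chunk))]
                | 
                |   def run(rows):
                |       n = os.cpu_count() or 1
                |       step = (len(rows) + n - 1) // n  # one slice/core
                |       slices = [rows[i:i + step]
                |                 for i in range(0, len(rows), step)]
                |       with ProcessPoolExecutor(max_workers=n) as pool:
                |           return list(pool.map(crunch, slices))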
        
               | Dylan16807 wrote:
               | It has always been a bad idea to dispatch so naively
               | _and_ dispatch to the same number of threads as you have
               | cores. What if a couple cores are busy, and you spend
               | almost twice as much time as you need waiting for the
                | calculation to finish? I don't know how much software
               | does that, and most of it can be easily fixed to dispatch
               | half a million rows at a time and get better performance
               | on all computers.
               | 
               | Also on current CPUs it'll be affected by hyperthreading
               | and launch 28 threads, which would probably work out
               | pretty well overall.
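                | 
                | The single-queue-of-chunks version is something like
                | this (just a sketch, chunk size etc. made up):
                | 
                |   from concurrent.futures import ProcessPoolExecutor
                | 
                |   CHUNK = 500_000  # "half a million rows at a time"
                | 
                |   def run_chunked(rows, crunch):
                |       # workers pull the next chunk whenever they
                |       # finish one, so a busy core just does fewer
                |       # chunks instead of stalling the whole job
                |       chunks = [rows[i:i + CHUNK]
                |                 for i in range(0, len(rows), CHUNK)]
                |       with ProcessPoolExecutor() as pool:
                |           return list(pool.map(crunch, chunks))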
        
               | chmod775 wrote:
               | > What if a couple cores are busy
               | 
               | If you don't pin them to cores, the OS is still free to
               | assign threads to cores as it pleases. Assuming the
               | scheduler is somewhat fair, threads will progress at
               | roughly the same rate.
        
               | Sohcahtoa82 wrote:
               | > run multithreaded algorithms that are aware of the
               | different performance and split the work unevenly to
               | compensate.
               | 
               | This is what the Intel Thread Director [0] solves.
               | 
               | For high-intensity workloads, it will prioritize
               | assigning them to P-cores.
               | 
               | [0] https://www.intel.com/content/www/us/en/support/artic
               | les/000...
        
               | ComputerGuru wrote:
               | Then you no longer have 14 cores in this example, but
               | only len(P) cores. Also most code written in the wild
               | isn't going to use an architecture-specific library for
               | this.
        
               | dwattttt wrote:
               | The P cores being presented as two logical cores and E
               | cores presented as a single logical core results in this
               | kind of split already.
        
             | giantg2 wrote:
             | Yeah, the 20 core Intels are benchmarking about the same as
             | the 12 core AMD X3Ds. But many people just see 20>12.
             | Either one is more than fine for most people.
             | 
             | "Oh wait, it's really an 8 core when it comes to
             | performance [cores]". So yes, should not be an 8 core all
             | together, but like you said about 14 cores, or 12 with the
             | 3D cache.
             | 
             | "And I doubt those cores have a higher clock."
             | 
             | I'm not sure what we're comparing them to. They should be
             | capable of higher clock than the E cores. I thought all the
             | AMD cores had the ability to hit the max frequency (but not
             | necessarily at the same time). And some of the cores might
             | not be able to take advantage of the 3D cache, but that
             | doesn't limit their frequency, from my understanding.
        
               | Dylan16807 wrote:
               | > I'm not sure what we're comparing them to. They should
               | be capable of higher clock than the E cores.
               | 
               | Oh, just higher clocked than the E cores. Yeah that's
               | true, but if you're using that many cores at once you
               | probably only care about total speed.
               | 
               | You said 12 core with higher clock versus 8, so I thought
               | you were comparing to the performance cores.
               | 
               | > I thought all the AMD cores had the ability to hit the
               | max frequency (but not necessarily at the same time).
               | 
               | The cores under the 3D cache have a notable clock penalty
               | on existing CPUs.
               | 
               | > And some of the cores might not be able to take
               | advantage of the 3D cache, but that doesn't limit their
               | frequency, from my understanding.
               | 
               | Right, but my point is it's misleading to call out higher
               | core count _and_ the advantages of 3D stacking. The 3D
                | stacking mostly benefits the cores it's on top of, which
               | is 6-8 of them on existing CPUs.
        
               | giantg2 wrote:
               | "The cores under the 3D cache have a notable clock
               | penalty on existing CPUs."
               | 
               | Interesting. I can't find any info on that. It seems that
                | makes sense though, since the 7900X has a 50W higher TDP
                | than the 7900X3D.
               | 
               | "Right, but my point is it's misleading to call out
               | higher core count and the advantages of 3D stacking"
               | 
               | Yeah, that makes sense. I didn't realize there was a
               | clock penalty on some of the cores with the 3D cache and
               | that only some cores could use it.
        
               | LtdJorge wrote:
               | It's due to the stacked cache being harder to cool and
               | not supporting as high of a voltage. So the 3D CCD clocks
               | lower, but for some workloads it's still faster (mainly
                | ones dealing with large buffers, like games; most compute-
                | heavy benchmarks fit in normal caches and the non-3D
               | V-Cache variants take the win).
        
               | hellotomyrars wrote:
               | It's kind of funny and reminiscent of the AMD bulldozer
               | days where they had a ton of cores compared to the
               | contemporary Intel chips, especially at low/mid price
               | points but the AMD chips were laughably underwhelming for
               | single core performance which was even more important
               | then.
               | 
               | I can't speak to the Intel chips because I've been out of
               | the Intel game for a long time but my 5700X3D does seem
               | to happily run all cores at max clock speed.
        
           | jiggawatts wrote:
           | A major differentiator is that Intel CPUs with E cores don't
           | allow the use of AVX-512, but all current AMD CPUs do. The
           | new Zen 5 chips will run circles around Intel for any such
           | workload. Video encoding, 3D rendering, and AI come to mind.
           | For developers: many database engines can use AVX-512
           | automatically.
        
         | paulmd wrote:
         | > Remains to be seen how the microcode patch affects
         | performance
         | 
         | intel is claiming 4% performance hit on the final patch
         | https://youtu.be/wkrOYfmXhIc?t=308
        
         | oasisbob wrote:
         | Maybe a stretch - but this reminds me of blood sugar regulation
         | for people with type 1 diabetes.
         | 
         | Too low is dangerous because you lose rational thought, and the
         | ability to maintain consciousness or self-recover. However,
         | despite not having the immediate dangers of being low, having
         | high blood sugar over time is the condition which causes long-
         | term organ damage.
        
       | loufe wrote:
       | Intel cannot afford to be anything but outstanding in terms of
       | customer experience right now. They are getting assaulted on all
       | fronts and need to do a lot to improve their image to stay
       | competitive.
        
         | Joel_Mckay wrote:
         | Their acquisition of Altera seemed to harm both companies
         | irreparably.
         | 
         | Any company can reach a state where the Process people take
         | over, and the Product people end up at other firms.
         | 
         | Intel could have grown a pair, and spun the 32 core RISC-V DSP
         | SoC + gpu for mobile... but there is little business incentive
         | to do so.
         | 
         | Like any rotting whale, they will be stinking up the place for
         | a long time yet. =)
        
           | beacon294 wrote:
           | Could you elaborate on the process people versus product
           | people?
        
             | basementcat wrote:
             | I would argue the fabrication process people at Intel are
             | core to their business. Without the ability to reliably
             | manufacture chips, they're dead in the water.
        
               | Joel_Mckay wrote:
               | You mean manufacturing "working chips" is supposed to be
               | their business.
               | 
               | It is just performance art with proofing wafers unless
               | the designs work =3
        
             | bgmeister wrote:
             | I assume they're referring to Steve Jobs' comments in this
             | (Robert Cringely IIRC) interview:
             | https://www.youtube.com/watch?v=l4dCJJFuMsE (not a great
             | copy, but should be good enough)
        
               | beacon294 wrote:
               | Oh yeah, this got rehashed as builders versus talkers
               | too. Yeah, there's a lot of this creative vibe type
               | dividing. It's pretty complicated, I don't even think
               | individual people operate the same when placed in a
               | different context. Usually their output is a result of
               | their incentives, so typically management failure or
               | technical architect failure.
        
               | Joel_Mckay wrote:
               | Partly true, Steve Jobs had a charismatic tone when
               | describing these problems in public.
               | 
               | Have a great day, =3
        
             | Joel_Mckay wrote:
             | It is an old theory that accurately points out
             | Marketing/Sales division people inevitably out-compete
             | product innovation people in a successful firm.
             | 
             | https://en.wikipedia.org/wiki/Competitive_exclusion_princip
             | l...
             | 
             | And yes, the Steve Jobs interview does document how this
              | almost destroyed Apple's core business. =)
        
         | scrlk wrote:
         | Intel should take a page out of HP's book when it came to
         | dealing with a bug in the HP-35 (first pocket scientific
         | calculator):
         | 
         | > The HP-35 had numerical algorithms that exceeded the
         | precision of most mainframe computers at the time. During
         | development, Dave Cochran, who was in charge of the algorithms,
         | tried to use a Burroughs B5500 to validate the results of the
         | HP-35 but instead found too little precision in the former to
         | continue. IBM mainframes also didn't measure up. This forced
         | time-consuming manual comparisons of results to mathematical
         | tables. A few bugs got through this process. For example: 2.02
          | ln e^x resulted in 2 rather than 2.02. When the bug was
         | discovered, HP had already sold 25,000 units which was a huge
         | volume for the company. In a meeting, Dave Packard asked what
         | they were going to do about the units already in the field and
         | someone in the crowd said "Don't tell?" At this Packard's
         | pencil snapped and he said: "Who said that? We're going to tell
          | everyone and offer them a replacement. It would be better to
         | never make a dime of profit than to have a product out there
         | with a problem". It turns out that less than a quarter of the
         | units were returned. Most people preferred to keep their buggy
         | calculator and the notice from HP offering the replacement.
         | 
         | https://www.hpmuseum.org/hp35.htm
        
           | basementcat wrote:
           | I wonder if Mr. Packard's answer would have been different if
           | a recall would have bankrupted the company or necessitated
           | layoff of a substantial percentage of staff.
        
             | scrlk wrote:
             | I can't speak for Dave Packard (or Bill Hewlett) - but I
              | will try to step into their shoes:
             | 
             | 1) HP started off in test and measurement equipment
             | (voltmeters, oscilloscopes etc.) and built a good
             | reputation up. This was their primary business at the time.
             | 
             | 2) The customer base of the HP-35 and test and measurement
             | equipment would have a pretty good overlap.
             | 
             | Suppose the bug had been covered up, found, and then the
             | news about the cover up came to light? Would anyone trust
             | HP test and measurement equipment after that? It would
             | probably destroy the company.
        
             | rasz wrote:
              | Or the potential of killing a couple hundred passengers,
              | or a few astronauts. Oh, wait...
        
       | wnevets wrote:
       | Are the CPUs that received elevated operating voltage permanently
       | damaged?
        
         | Pet_Ant wrote:
          | This is the most pressing question. If it was just a microcode
          | issue, a cool-off and power cycle ought to at least reset
          | things, but according to Wendell from Level1Techs, that doesn't
          | seem to always be the case.
        
           | kevingadd wrote:
           | The problem is that running at too high of a voltage for
           | sustained periods can cause physical degradation of the chip
           | in some cases. Hopefully not here!
        
             | chmod775 wrote:
             | > can cause physical degradation of the chip in some cases.
             | 
             | Not in some cases. Chips always physically degrade
             | regardless of voltage. Higher voltages will make it happen
             | faster.
        
               | Pet_Ant wrote:
                | Why do chips degrade? Is this due to the whiskers I've
               | heard about?
        
               | cesarb wrote:
                | > Why do chips degrade? Is this due to the whiskers I've
               | heard about?
               | 
               | No, tin whiskers are a separate issue, which happens
               | mostly outside the chips. The keyword you're looking for
               | is electromigration
               | (https://en.wikipedia.org/wiki/Electromigration).
        
         | layer8 wrote:
         | Not instantly it seems, but there have been reports of
         | degradation over time. It will be a case-by-case thing.
        
         | userbinator wrote:
         | Possible electromigration damage, yes.
        
       | NBJack wrote:
       | I was concerned this would happen to them, given how much power
       | was being pushed through their chips to keep them competitive. I
       | get the impression their innovation has either truly slowed down,
       | or AMD thought enough 'moves' ahead with their
       | tech/marketing/patents to paint them into a corner.
       | 
       | I don't think Intel is done though, at least not yet.
        
       | magicalhippo wrote:
       | There was recently[1] some talk about how the 13th/14th gen
       | mobile chips also had similar issues, though Intel insisted it's
       | something else.
       | 
       | Will be interesting to see how that pans out.
       | 
       | [1]: https://news.ycombinator.com/item?id=41026123
        
         | tardy_one wrote:
          | For server CPUs is there not a similar problem, or do they
          | realize server purchasers may be less willing to tolerate it?
          | I'm not all that thrilled with the prospect of buying Intels,
          | especially when wondering whether they'll last to a replacement
          | 5 years out compared to a few generations ago, but AMD server
          | choices can be a bit limited and I'm not really sure how to
          | evaluate whether there may be increasing surprises across the
          | board.
        
           | sirn wrote:
           | Are you talking about Xeon Scalable? Although they share the
           | same core design as the desktop counterpart (Xeon Scalable
           | 4th Gen shares the same Golden Cove as 12th Gen, Xeon
           | Scalable 5th Gen shares the same Raptor Cove as 13th/14th
           | Gen), they're very different from the desktop counterpart
           | (monolithic vs tile/EMIB-based, ring bus vs mesh, power gate
           | vs FIVR), and often running in a more conservative
           | configuration (lower max clock, more conservative V/F curves,
           | etc.). There has been a rumor about Xeon Scalable 5th Gen
            | having the same issue, but it's more gossip than a data
            | point.
           | 
           | The issue does happen with desktop chips that are being used
           | in a server context when pairing with workstation chipset
           | such as W680. However, there haven't been any reports of Xeon
           | E-2400/E-3400 (which is essentially a desktop chip repurposed
           | as a server) with C266 having these issues, though it may be
           | because there hasn't been a large deployment of these chips
           | on the server just yet (or even if there are, it's still too
           | early to tell).
           | 
           | Do note that even without this particular issue, Xeon
           | Scalable 4th Gen (Sapphire Rapids) is not a good chip
           | (speaking from experience, I'm running w-3495x). It has
           | plenty of issues such as slow clock ramp, high latency, high
           | idle power draw, and the list goes on. While Xeon Scalable
           | 5th Gen (Emerald Rapids) seems to have fixed most of these
           | issues, Zen 4 EPYC is still a much better choice.
        
         | tedunangst wrote:
         | The mobile issue seems more anecdote than data? Almost as if
         | people on Reddit heard the 13/14 CPUs were bad, then their
         | laptop crashed, and they decided "it happened to me too".
        
           | magicalhippo wrote:
           | Well it's not just[1] redditors from what I can gather:
           | 
           |  _Now Alderon Games reports that Raptor Lake crashes impact
            | Intel's 13th and 14th-Gen processors in laptops as well._
           | 
           |  _" Yes we have several laptops that have failed with the
           | same crashes. It's just slightly more rare then the desktop
           | CPU faults," the dev posted._
           | 
           | These are the guys who publicly claimed[2] Intel sold
           | defective chips based on the desktop chips crashing.
           | 
           | [1]: https://www.tomshardware.com/pc-components/cpus/dev-
           | reports-...
           | 
           | [2]: https://www.tomshardware.com/pc-components/cpus/game-
           | publish...
        
             | sirn wrote:
             | The problem may exist, but Alderon Games' report on the
             | mobile chip is more of an anecdote here because there's not
             | enough data points (unlike their desktop claims), and the
             | only SKU they give (13900HX) is actually a desktop chip in
             | a mobile package (BGA instead of LGA, so we're back into
             | the original issue). So in the end, even with Alderon's
             | claims, there's really not enough data points to come to a
             | conclusion on the mobile side of things.
        
               | Sakos wrote:
               | Why are you downplaying it too?
               | 
               | > "The laptops crash in the exact same way as the desktop
               | parts including workloads under Unreal Engine,
               | decompression, ycruncher or similar. Laptop chips we have
               | seen failing include but not limited to 13900HX etc.,"
               | Cassells said.
               | 
               | > "Intel seems to be down playing the issues here most
               | likely due to the expensive costs related to BGA rework
               | and possible harm to OEMs and Partners," he continued.
               | "We have seen these crashes on Razer, MSI, Asus Laptops
               | and similar used by developers in our studio to work on
               | the game. The crash reporting data for my game shows a
               | huge amount of laptops that could be having issues."
               | 
               | https://old.reddit.com/r/hardware/comments/1e13ipy/intel_
               | is_...
        
               | sirn wrote:
               | I'm not denying that the problem exists, but I don't
               | think Alderon provided enough data to come to a
               | conclusion, unlike on the desktop, where it's supported
               | by other parties in addition to Alderon's data (where you
               | can largely point to 14900KS/K/non-K/T,
                | 13900KS/K/non-K/T, 14700K, and 13700K being the ones
                | affected)
               | 
               | Right now, the only example given is HX (which is a
               | repackaged desktop chip[^], as mentioned), so I'm not
               | denying that the problem is happening on HX based on
               | their claims (and it makes a lot of sense that HX is
               | affected! See below), but what about H CPUs? What about P
               | CPUs? What about U CPUs? The difference in impact between
               | "only HX is impacted" and "HX/H/P/U parts are all
               | affected" is a few orders of magnitude (a very top-end
                | 13th Gen mobile SKU versus every 13th Gen mobile SKU).
                | Currently, we don't have enough data on how widespread the
               | issue is, and that makes it difficult to assess who is
               | impacted by this issue from this data alone.
               | 
               | [^]: HX is the only mobile CPU with B0 stepping, which is
               | the same as desktop 13th/14th Gen, while the mobile H/P/U
               | family are J0 and Q0, which are essentially a higher
               | clocked 12th Gen (i.e., using Golden Cove rather than
               | Raptor Cove)
        
               | paulmd wrote:
               | Alderon are the people claiming 100% of units fail which
               | doesn't seem supported by anyone else either. Wendell and
               | GN seem to have scoped the issue to around 10-25% across
               | multiple different sources.
               | 
               | Like they are the most extreme claimants at this point.
               | _Are_ they really credible?
        
               | paulmd wrote:
               | Ok, I take it back, this looks pretty indicative of a
               | low-load problem and evidently failure rates are _much_
               | higher in that scenario.
               | 
               | https://www.youtube.com/watch?v=yYfBxmBfq7k
        
       | christkv wrote:
       | The amount of current their chips pull on full boost is pretty
        | crazy. It would definitely not surprise me if some could get
       | damaged by extensive boosting.
        
       | brynet wrote:
       | Curious why Intel announced this on their community forums,
       | rather than somewhere more official.
        
         | samtheprogram wrote:
         | Optics / stock price
        
         | guywithahat wrote:
          | That's probably where people are most likely to understand
         | it. A lot of companies do this, especially while they're still
         | learning things.
        
           | wmf wrote:
           | These days people are more likely to see the announcement on
           | YouTube, TikTok, or Twitter.
        
             | slaymaker1907 wrote:
             | The first two require a lot more effort in video editing
             | than creating a forum post. Plus, it's just going to be
             | digested and regurgitated for the masses by people much
             | better at communicating technical information.
        
             | cyanydeez wrote:
              | So weird hearing high noise channels as the preferred
              | distribution
        
             | paulmd wrote:
             | they did that too
             | 
             | https://youtu.be/wkrOYfmXhIc
        
         | beart wrote:
         | Based on what I know about corporations, it's entirely
         | plausible that the folks posting the information don't actually
         | have access to the communication channels you are referring to.
         | I don't even know how I would issue an official communication
         | at my own company if the need ever came up... so you go with
         | what you have.
        
         | langsoul-com wrote:
          | Note how they mentioned it's still going to be tested with
          | various partners before release.
         | 
         | Ie we think this might solve it, but if it doesn't we can roll
         | back with the least amount of PR attention.
        
       | Covzire wrote:
       | Just want to say, I'm incredibly happy with my 7800X3D. It runs
        | ~70C max with a $35 air cooler, like Intel chips used to, and
        | it's on average the fastest chip for gaming workloads right now.
        
         | amiga-workbench wrote:
         | I'm also very happy with my 5800X3D, it was wonderful value
         | back when AM5 had just released and DDR5/Motherboards still
         | cost an arm and a leg.
         | 
         | The energy efficiency is much appreciated in the UK with our
         | absurd price of electricity.
        
           | SushiHippie wrote:
            | Same, in my BIOS I can activate an "ECO Mode", which lets me
           | decide if I want to run my 7950x on full 170W TDP, 105W TDP
           | or 60W TDP.
           | 
            | I benchmarked it: the difference between 170 and 105 is
            | basically zero, and the difference at 60W is just a few
            | percent of a performance hit, but well worth it, as
            | electricity is ~0.3 EUR/kWh over here.
        
             | aruametello wrote:
             | (if you are running windows)
             | 
              | you might want to check a tool called PBO2 Tuner
              | (https://www.cybermania.ws/apps/pbo2-tuner/), you can tweak
              | values like EDC, TDC and PPT (power limit) from the GUI, and
              | it also accepts command-line commands so you can automate
             | those tasks.
             | 
             | I made scripts that "cap" the power consumption of the cpu
             | based on what applications are running. (i.e. only going
             | all in on certain games, dynamically swaping between
             | 65-90-120-180w handmade profiles)
             | 
             | i made with power saving in mind given the idle power
             | consumption is rather high on modern ryzens.
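              | 
              | The gist of the script, heavily simplified (the process
              | names, wattages, and the PBO2 Tuner invocation below are
              | placeholders, adapt them to your setup):
              | 
              |   import subprocess, time
              | 
              |   PROFILES = {"game.exe": 180, "encoder.exe": 120}
              |   DEFAULT_W = 65
              | 
              |   def running_processes():
              |       out = subprocess.run(["tasklist"], text=True,
              |                            capture_output=True)
              |       return out.stdout.lower()
              | 
              |   def apply_limit(watts):
              |       # placeholder invocation: substitute the real
              |       # PBO2 Tuner command-line syntax here
              |       subprocess.run(["pbo2tuner.exe", str(watts)])
              | 
              |   while True:
              |       procs = running_processes()
              |       wanted = max((w for exe, w in PROFILES.items()
              |                     if exe in procs), default=DEFAULT_W)
              |       apply_limit(wanted)
              |       time.sleep(30)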
             | 
              | edit: actually I made a mistake, given that PBO2 Tuner is
              | for Zen 3 CPUs, and you mentioned Zen 4.
        
       | fefe23 wrote:
       | So on one hand they are saying it's voltage (i.e. something
       | external, not their fault, bad mainboard manufacturers!).
       | 
       | On the other hand they are saying they will fix it in microcode.
       | How is that even possible?
       | 
       | Are they saying that their CPUs are signaling the mainboards to
       | give them too much voltage?
       | 
       | Can someone make sense of this? It reminds me of Steve Jobs' You
       | Are Holding It Wrong moment.
        
         | cqqxo4zV46cp wrote:
         | The "you're holding it wrong!"angle is all your take. They
         | don't make that claim.
        
           | k12sosse wrote:
           | "OK, great, let's give everybody a case" lives on
        
         | ls612 wrote:
         | The claim seems to be that the microcode on the CPU is in
         | certain circumstances requesting the wrong (presumably too
         | high) voltage from the motherboard. If that is the case fixing
         | the microcode will solve the issue going forward but won't help
         | people whose chips have already been damaged by excessive
         | voltage.
        
         | pitaj wrote:
         | > Are they saying that their CPUs are signaling the mainboards
         | to give them too much voltage?
         | 
         | Yes that's exactly what they said.
        
           | dboreham wrote:
           | So it's a 737 MAX problem: the software is running a control
           | loop that doesn't have deflection limits. So it tells the
           | stabilizer (or voltage reg in this case) to go hard nose
           | down.
        
             | nahnahno wrote:
             | lol what a stretch of an analogy
        
         | wtallis wrote:
         | The voltage supplied by the motherboard isn't supposed to be
         | constant. The CPU is continuously varying the voltage it's
         | requesting, based primarily on the highest frequency any of the
         | CPU cores are trying to run at. The motherboard is supposed to
         | know what the resistive losses are from the VRMs to the CPU
         | socket, so that it can deliver the requested voltage at the CPU
         | socket itself. There's room for either party to screw up: the
         | CPU could ask for too much voltage in some scenarios, or the
         | motherboard's voltage regulation could be poorly calibrated (or
         | deliberately skewed by overclocking presets).
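          | 
          | As a rough sense of the magnitudes involved (ballpark numbers,
          | not spec values):
          | 
          |   # back-of-the-envelope Vdroop illustration
          |   vid      = 1.40    # volts the CPU is requesting at load
          |   loadline = 0.0011  # ~1.1 mOhm effective VRM-to-socket path
          |   current  = 250.0   # amps during a heavy transient
          | 
          |   v_socket = vid - current * loadline
          |   print(v_socket)    # ~1.125 V at the socket unless the
          |                      # board compensates for the drop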
         | 
         | On top of all this mess: these products were part of Intel's
         | repeated attempts to move the primary voltage rail (the one
         | feeding the CPU cores) to use on-die voltage regulators (DLVR).
         | They're present in silicon but unused. So it's not entirely
         | surprising if the fallback plan of relying solely on external
         | voltage regulation wasn't validated thoroughly enough.
        
         | basementcat wrote:
         | My guess is something like the following:
         | 
          | Modern CPUs are incredibly complex machines with a
         | ridiculously large amount of possible configuration states (too
         | large to exhaustively test after manufacture or sim during
         | design), e.g. a vector multiply in flight with an AES encode in
         | flight with x87 sincos, etc. Each operation is going to draw a
          | certain amount of current. It is impractical to guarantee
          | every functional unit its required current at once, so the
          | supply rails are sized for a "reasonable worst case".
         | 
         | Perhaps an underestimate was mistakenly made somewhere and not
         | caught until recently. Therefore the fix might be to modify the
         | instruction dispatcher (via microcode) to guarantee that
         | certain instruction configurations cannot happen (e.g. let the
         | x87 sincos stall until the vector multiply is done) to reduce
         | pressure on the voltage regulator.
        
           | hedgehog wrote:
           | It's worse than that, thermal management is part of the
           | puzzle. Think of that as heat generation happening across
           | three dimensions (X + Y + time) along with diffusion in 3D
           | through the package.
        
             | CoastalCoder wrote:
             | It's an interesting idea, but there's a caveat: time flows
             | in just one direction.
        
         | aseipp wrote:
         | Saying "elevated voltage causes damage" is not attributing
         | blame to anyone. In the very next sentence, they then attribute
         | the reason for that elevated voltage to their own microcode,
         | and so it is responsible for the damage. I literally do not
         | know how they could be any clearer on that.
        
       | TazeTSchnitzel wrote:
       | After watching https://youtube.com/watch?v=gTeubeCIwRw and some
       | related content, I personally don't believe it's an issue fixable
       | with microcode. I guess we'll see.
        
         | jpk wrote:
         | Because HN doesn't provide link previews, I'd recommend adding
         | some information about the content to your comment. Otherwise
         | we have to click through to YouTube for the comment to make any
         | sense.
         | 
         | That said, the video is the GamersNexus one where they talk
         | about an unverified claim that this is a fabrication process
         | issue caused by oxidation between atomic deposition layers. If
         | that's the case, then yeah, microcode can only do so much. But
         | like Steve says in the video, the oxidation theory has yet to
         | be proven and they're just reporting what they have so far
         | ahead of the Zen 5 reviews coming soon.
        
           | mjevans wrote:
            | Hopefully Intel ships them the current pre-release microcode
            | revision, and allows them to test and publish benchmarks with
            | it for review comparison.
        
       | acrispino wrote:
       | An Intel employee is posting on reddit:
       | https://www.reddit.com/r/intel/comments/1e9mf04/intel_core_1...
       | 
       | A recent YouTube video by GamersNexus speculated the cause of
       | instability might be a manufacturing issue. The employee's
       | response follows.
       | 
       |  _Questions about manufacturing or Via Oxidation as reported by
       | Tech outlets:_
       | 
       |  _Short answer: We can confirm there was a via Oxidation
       | manufacturing issue (addressed back in 2023) but it is not
       | related to the instability issue._
       | 
       |  _Long answer: We can confirm that the via Oxidation
       | manufacturing issue affected some early Intel Core 13th Gen
       | desktop processors. However, the issue was root caused and
       | addressed with manufacturing improvements and screens in 2023. We
       | have also looked at it from the instability reports on Intel Core
       | 13th Gen desktop processors and the analysis to-date has
       | determined that only a small number of instability reports can be
       | connected to the manufacturing issue._
       | 
       |  _For the Instability issue, we are delivering a microcode patch
       | which addresses exposure to elevated voltages which is a key
       | element of the Instability issue. We are currently validating the
       | microcode patch to ensure the instability issues for 13th /14th
       | Gen are addressed_
        
         | hsbauauvhabzb wrote:
         | So they were producing defective CPUs, identified & addressed
         | the issue but didn't issue a recall, defect notice or public
         | statement relating to the issue?
         | 
         | Good to know.
        
           | thelastparadise wrote:
           | Dude's gonna be canned so hard.
        
           | Dylan16807 wrote:
           | It sounds like their analysis is that the oxidation issue is
           | comfortably below the level of "defective".
           | 
           | No product will ever be perfect. You don't need to do a
           | recall for a sufficiently rare problem.
           | 
           | And in case anyone skims, I will be extra clear, this is
           | based on the claim that the oxidation is _separate_ from the
           | real problem here.
        
             | abracadaniel wrote:
              | They could recall the defective batch. All of the CPUs with
              | that defect will fail from it. They seem to have been
              | content to hope no one noticed.
        
               | Dylan16807 wrote:
               | What makes you think there was a "defective batch"? What
               | makes you think all the CPUs affected by that production
               | issue will fail from it?
               | 
               | That description sounds to me like it affected the entire
               | production line for months. It's only worth a recall if a
               | sufficient percent of those CPUs will fail. (I don't want
               | to argue about what particular percent that should be.)
        
               | hsbauauvhabzb wrote:
               | My CPU was unstable for months, I spent tens of hours and
               | hundreds on equipment to troubleshoot (I _never_ thought
                | my CPU would be the cause). Had I known this, I would
                | have scrutinised the CPU a lot sooner than I did.
               | 
                | Intel could have made a public statement about potentially
                | defective products with good PR spin: 'we detected an
                | issue, we believe the defect rate will be < 0.25%, here's
                | a test suite you can run, call if you think you're one of
                | the 0.25%!' But they didn't.
               | 
               | I'm never buying an intel product again. Fuck intel.
        
               | Dylan16807 wrote:
               | This comment chain is talking about the oxidation in
               | particular, and specifically the situation where the
               | oxidation is _not_ the cause of the instability in the
                | title. That's the only way they "identified & addressed
               | the issue but didn't issue a recall".
               | 
               | Do you have a reason to think the oxidation is the cause
               | of your problems?
               | 
               | Did you not read my first post trying to clarify the two
               | separate issues?
               | 
               | Am I misunderstanding something?
        
               | hsbauauvhabzb wrote:
                | Oxidisation in the context of CPU fabrication sounds
                | pretty bad. I find it hard to believe it would have no
                | impact on CPU stability, regardless of what Intel's PR
                | team says while minimizing any actual impacts caused.
               | 
               | Edit: it sounds like intel have been aware of stability
                | issues for some time and have said nothing. I'm not sure
               | we have any reason to trust anything they say moving
               | forward, relating to oxidisation or any other claims they
               | make.
        
               | Dylan16807 wrote:
               | Well they didn't notice it for a good while, so it's
               | really hard to say how much impact it had.
               | 
                | And at a certain point, if you barely believe anything
                | they say, then you shouldn't be using their statement as
                | the thing to get mad about. The complaint you're making
                | depends on very particular parts of their statement being
                | true but other very particular parts being false. I don't
                | think we have the evidence to make that distinction right
                | now.
        
               | hsbauauvhabzb wrote:
               | > Well they didn't notice it for a good while, so it's
               | really hard to say how much impact it had.
               | 
               | That negates any arguments you had related to failure
               | rates.
               | 
               | > The complaint you're making depends on very particular
               | parts of their statement being true but other very
               | particular parts being not true
               | 
                | Er, I'm not even sure how to respond to this. GamersNexus
                | has indicated they know about the oxidisation issue;
                | Intel *subsequently* confirmed it was known internally but
                | no public statement was made until now. I'm not
                | unreasonably cherry-picking parts of their statement and
                | then drawing unreasonable conclusions. Intel have very
                | clearly demonstrated they would have preferred not to
                | disclose an issue in their fabrication processes which
                | very probably caused defective CPUs; they have
                | demonstrated untrustworthy behaviour related to this
                | entire thing (L1Techs and GN are breaking the defective
                | CPU story following leaks from major Intel clients who
                | have indicated that Intel is basically refusing to
                | cooperate).
               | 
               | Intel has known about these issues for some time and said
               | nothing. They have cost organisations and individuals
               | time and money. Nothing they say now can be trusted
               | unless it involves them admitting fault.
        
               | Dylan16807 wrote:
               | > That negates any arguments you had related to failure
               | rates.
               | 
               | I mean it's hard for _us_ to say, without sufficient
               | data. But Intel might have that much data.
               | 
               | Also what argument about failure rates? The one where I
               | said "if" about failure rates?
               | 
               | > Er, I'm not even sure how to respond to this.
               | GamersNexus has indicated they know about the oxidisation
                | issue; Intel _subsequently_ confirmed it was known
                | internally, but no public statement was made until now.
               | 
               | GamersNexus thinks the oxidation might be the cause of
               | the instability everyone is having. Intel claims
               | otherwise.
               | 
               | Intel has no reason to lie about this detail. It doesn't
               | matter if the issue is oxidation versus something else.
               | 
               | Also the issue Intel admits to can't be the problem with
               | 14th gen, because it only happened to 13th gen chips.
               | 
               | > Intel has known about these issues for some time and
               | said nothing. Nothing they say now can be trusted unless
               | it involves them admitting fault.
               | 
               | If you don't trust what Intel said today at all, then you
               | can't make good claims about what they knew or didn't
               | know. You're picking and choosing what you believe to an
               | extent I can't support.
        
           | wslh wrote:
           | It is the Pentium FDIV drama all over again! [1]. It is even
            | in chapter 4 of Andrew Grove's book!
           | 
           | [1] https://en.wikipedia.org/wiki/Pentium_FDIV_bug
        
       | PedroBatista wrote:
        | Good for Intel to finally "figure it out", but I'm not 100% sure
        | microcode is 100% of the problem. As with anything complex
        | enough, the "problem" can actually be many compounded problems;
        | MB vendors' "special" tuning comes to mind.
        | 
        | But this is already a mess that's very hard to clean up, since I
        | feel many of these CPUs will die in a year or 2 because of
        | today's problems, but by then nobody will remember this and an
        | RMA will be "difficult" to say the least.
        
         | johnklos wrote:
          | You're right - at least partly. If the issue is that Intel was
          | too aggressive with voltages, they can use microcode updates 1)
          | as an excuse to rejigger the power levels and voltages the BIOS
          | uses as part of the update, and 2) to have the processor itself
          | be more conservative with the voltages and clocks it calculates
          | for itself.
         | 
         | Anything Intel announces, in my experience, is half true, so
         | I'm interested to see what's actually true and what Intel will
         | just forget to mention or will outright hide.
        
       | nubinetwork wrote:
       | They already tried bios updates when they pushed out the "intel
       | defaults" a couple months ago...
        
         | wmf wrote:
         | Firmware and microcode aren't the same thing.
        
           | jeffbee wrote:
           | Very true and that's why it is odd that microcode has been
           | mentioned here. Surely they mean PCU software (Pcode), or
           | code for whatever they are calling the PCU these days.
        
             | wmf wrote:
             | I assume Intel's "microcode" updates include the PCU code,
             | maybe some ME code, and whatever other little cores are
             | hiding in the chip.
        
               | jeffbee wrote:
               | Well, do they? The operating system can provide microcode
               | updates to a running CPU. Can the operating system patch
               | the PCU, too?
               | 
               | When I look at a "BIOS update" it usually seems to
               | include UEFI, peripheral option ROMs, ME updates, and
               | microcode. So if the PCU is getting patched I would think
               | of it as a BIOS update. I think the ergonomics will be
               | indistinguishable for end users.
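                | 
                | For what it's worth, you can at least see which
                | microcode revision the OS ended up loading. A rough
                | sketch, assuming Linux on x86, where /proc/cpuinfo
                | exposes a "microcode" field (whether the PCU/Pcode
                | side got touched isn't visible there, as far as I
                | know):
                | 
                |     revs = set()
                |     with open("/proc/cpuinfo") as f:
                |         for line in f:
                |             if line.startswith("microcode"):
                |                 revs.add(line.split(":")[1].strip())
                |     print("loaded microcode revision(s):", revs)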
        
           | nicman23 wrote:
           | firmware can include microcode though
        
         | tedunangst wrote:
         | Except they didn't.
         | https://www.pcworld.com/article/2326812/intel-is-not-recomme...
        
       | ChrisArchitect wrote:
       | (updated from other post about mobile crashes)
       | 
       | Related:
       | 
       | Complaints about crashing 13th,14th Gen Intel CPUs now have data
       | to back them up
       | 
       | https://news.ycombinator.com/item?id=40962736
       | 
       | Intel is selling defective 13-14th Gen CPUs
       | 
       | https://news.ycombinator.com/item?id=40946644
       | 
       | Intel's woes with Core i9 CPUs crashing look worse than we
       | thought
       | 
       | https://news.ycombinator.com/item?id=40954500
       | 
       | Warframe devs report 80% of game crashes happen on Intel's Core
       | i9 chips
       | 
       | https://news.ycombinator.com/item?id=40961637
        
         | tedunangst wrote:
         | Not a dupe.
        
         | silisili wrote:
         | That one is mobile, this one is desktop, which they claim are
         | different causes.
        
       | HeliumHydride wrote:
       | https://scholar.harvard.edu/files/mickens/files/theslowwinte...
       | 
       | "Unfortunately for John, the branches made a pact with Satan and
       | quantum mechanics [...] In exchange for their last remaining bits
        | of entropy, the branches cast evil spells on future generations
        | of processors. Those evil spells had names like "scaling-induced
        | voltage leaks" and "increasing levels of waste heat" [...] the
       | branches, those vanquished foes from long ago, would have the
       | last laugh."
       | 
       | "John was terrified by the collapse of the parallelism bubble,
       | and he quickly discarded his plans for a 743-core processor that
       | was dubbed The Hydra of Destiny and whose abstract Platonic ideal
       | was briefly the third-best chess player in Gary, Indiana.
        | Clutching a bottle of whiskey in one hand and a shotgun in the
       | other, John scoured the research literature for ideas that might
       | save his dreams of infinite scaling. He discovered several papers
       | that described software-assisted hardware recovery. The basic
       | idea was simple: if hardware suffers more transient failures as
       | it gets smaller, why not allow software to detect erroneous
       | computations and re-execute them? This idea seemed promising
       | until John realized THAT IT WAS THE WORST IDEA EVER. Modern
       | software barely works when the hardware is correct, so relying on
       | software to correct hardware errors is like asking Godzilla to
       | prevent Mega-Godzilla from terrorizing Japan. THIS DOES NOT LEAD
        | TO RISING PROPERTY VALUES IN TOKYO. It's better to stop scaling
       | your transistors and avoid playing with monsters in the first
        | place, instead of devising an elaborate series of monster
        | checks-and-balances and then hoping that the monsters don't do what
       | monsters are always going to do because if they didn't do those
       | things, they'd be called dandelions or puppy hugs."
        
         | mattnewton wrote:
         | I haven't read this piece before but I just knew it was going
         | to be written by Mickens about halfway through your comment.
        
           | throwup238 wrote:
           | The "mickens" in the URL on the first line was a dead
           | giveaway :-)
        
         | yieldcrv wrote:
         | > According to my dad, flying in airplanes used to be fun...
         | Everybody was attractive ....
         | 
         | this is how I feel about electric car supercharging stations at
          | the moment. There is definitely a privilege aspect, which
         | some attractive people are beneficiaries of in a predictable
         | way, as well as other expensive maintenance for their health
         | and attraction.
         | 
         | so I could see myself saying the same thing to my children
        
           | officeplant wrote:
           | I'm ruining that trend by charging my E-Transit in nice
           | places and dressing poorly.
        
         | tibbydudeza wrote:
         | Thanks - it is rather funny.
        
       | tpurves wrote:
       | I think it's telling that they are delaying the microcode patch
       | until _after_ all the reviewers publish their Zen5 reviews and
        | the comparisons of those chips against current Raptor Lake
       | performance.
        
         | zenonu wrote:
         | Why even publish a comparison? Raptor Lake processors aren't a
         | functioning product to benchmark against.
        
           | tankenmate wrote:
           | Because if publishers don't publish then they don't make
           | money.
        
           | AnthonyMouse wrote:
           | Because the benchmarks will still exist on the sites after
           | the microcode is released and a lot of the sites won't bother
           | to go back and update them with the accurate performance
           | level.
        
       | Night_Thastus wrote:
       | "Elevated operating voltage" my foot.
       | 
       | We've already seen examples of this happening on non-OC'd server-
       | style motherboards that perfectly adhere to the intel spec. This
       | isn't like ASUS going 'hur dur 20% more voltage' and frying
       | chips. If that's all it was it would be obvious.
       | 
       | Lowering voltage may help mitigate the problem, but it sure as
       | shit isn't the cause.
        
         | dwattttt wrote:
          | They also admit a microcode algorithm produces incorrect
          | voltage requests; it doesn't sound like they're trying to shift
          | the blame. ASUS doesn't write that microcode.
        
         | sirn wrote:
          | It's worth noting that W680 boards are not server boards,
          | they're workstation boards, and oftentimes they're
          | overclockable (or even overclocked by default). Wendell
          | actually showed the other day that the ASUS W680 board was
          | feeding 253W into a 35W (106W boost) 13700T CPU by default[1].
         | 
         | Supermicro and ASRock Rack do sell W680 as a server (because it
         | took Intel a really long time to release C266), but while
         | they're strictly to the spec, some boards are really not meant
          | for K CPUs. For example, the Supermicro MBI-311A-1T2N is only
          | certified for non-TVB E/T CPUs, and trying to run a K CPU on
          | these can result in the board plumbing 1.55V into the CPU
          | during single-core loads (where 1.4V would already be on the
          | higher side)[2].
         | 
         | In this particular case, the "non-OC'd server-style
         | motherboard" doesn't really mean anything (even more so in the
         | context of this announcement).
         | 
         | [1]: https://x.com/tekwendell/status/1814329015773086069
         | 
         | [2]: https://x.com/Buildzoid1/status/1814520745810100666
        
         | paulmd wrote:
         | Specifically I think the concerns are around idle voltage and
         | overshoot at this point, which is indeed something configured
         | by OEMs.
         | 
         | edit: BZ just put out a video talking about running Minecraft
         | servers destroying CPUs reliably, topping out at 83C, normally
         | in the 50s, running 3600 speeds. Which is a clear issue with
         | low-thread loads.
         | 
         | https://m.youtube.com/watch?v=yYfBxmBfq7k
        
       | userbinator wrote:
       | Reminds me of Sudden Northwood Death Syndrome, 2002.
       | 
       | Looks like history may be repeating itself, or at least rhyming
       | somewhat.
       | 
       | Back then, CPUs ran on fixed voltages and frequencies and only
       | overclockers discovered the limits. Even then, it was rare to
       | find reports of CPUs killed via overvolting, unless it was to an
       | extreme extent --- thermal throttling, instability, and shutdown
       | (THERMTRIP) seemed to occur before actual damage, preventing the
       | latter from happening.
       | 
       | Now, with CPU manufacturers attempting to squeeze all the
       | performance they can, they are essentially doing this
       | overclocking/overvolting automatically and dynamically in
       | firmware (microcode), and it's not surprising that some bug or
       | (deliberate?) ignorance that overlooked reliability may have
       | pushed things too far. Intel may have been more conservative with
       | the absolute maximum voltages until recently, and of course small
       | process sizes with higher potential for electromigration are a
       | source of increased fragility.
       | 
       | Also anecdotal, but I have an 8th-gen mobile CPU that has been
       | running hard against the thermal limits (100C) 24/7 for over 5
       | years (stock voltage, but with power limits all unlocked), and it
       | is still 100% stable. This and other stories of CPUs in use for
       | many years with clogged or even detached heatsinks seem to
        | contribute to the evidence that it's high voltage that kills
        | CPUs, not heat or frequency.
       | 
        | Edit: I just looked up the VCore maximum for the 13th/14th gen
       | processors - the datasheet says 1.72V! That is far more than I
       | expected for a 10nm process. For comparison, a 1st-gen i7 (45nm)
       | was specified at 1.55V absolute maximum, and in the 32nm version
       | they reduced that to 1.4V; then for the 22nm version it went up
       | slightly to 1.52V.
        
         | slaymaker1907 wrote:
          | Interesting, I hadn't heard about the Pentium overclocking
          | issues. My theory on the current issue is that running chips
          | for long periods of time at 100C is not good for chip
          | longevity, but voltages could also be an issue. I came up with
          | this theory
         | last summer when I built my rig with a 13900k, though I was
         | doing it with the intention of trying to set things up so the
         | CPU could last 10 years.
         | 
         | Anecdotally, my CPU has been a champ and I haven't noticed any
         | stability issues despite doing both a lot of gaming and a lot
         | of compiling on it. I lost a bit of performance but not much
         | setting a power limit of 150W.
        
         | cyanydeez wrote:
         | I believe the first round of Intel excuses here blamed the
         | motherboard manufacturers for trying to "auto" overclock these
         | CPUs.
        
         | unregistereddev wrote:
         | > Back then, CPUs ran on fixed voltages and frequencies and
         | only overclockers discovered the limits. Even then, it was rare
         | to find reports of CPUs killed via overvolting, unless it was
         | to an extreme extent --- thermal throttling, instability, and
         | shutdown (THERMTRIP) seemed to occur before actual damage,
         | preventing the latter from happening.
         | 
         | Oh the memories. I had a Thunderbird-core Athlon with a stock
          | frequency of (IIRC) 1050MHz. It was stable at 1600MHz, and I
          | ran it that way for years. I was able to get it to 1700MHz, but
         | then my CPU's stability depended on ambient temperatures. When
         | the room got hot in the summer my workstation would randomly
         | kernel panic.
        
       | whalesalad wrote:
       | If I didn't just recently invest in 128gb of DDR4 I'd jump ship
       | to AMD/AM5. My 13900k has been (knock on wood) solid though -
       | with 24/7 uptime since July 2023.
        
         | thangngoc89 wrote:
          | I guess you're lucky. I own 2 machines for small-scale CNN
          | training, one 13900k and one 14900k. I have to throttle CPU
          | performance to 90% for them to run stably. That costs me about
          | 1 hour per 100 hours of training.
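          | 
          | (In case it helps anyone: one way to apply that kind of 90%
          | cap on Linux is the intel_pstate driver's max_perf_pct knob;
          | this is just a sketch, the path only exists when that driver
          | is active, and on Windows the rough equivalent is lowering
          | "maximum processor state" in the power plan.)
          | 
          |     PSTATE = "/sys/devices/system/cpu/intel_pstate"
          |     with open(PSTATE + "/max_perf_pct", "w") as f:
          |         f.write("90")   # cap at ~90% of max perf (root)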
        
           | whalesalad wrote:
            | Are you using any motherboard overclocking stuff? A lot of
            | mobos are pushing these chips pretty hard right out of the
            | box.
            | 
            | I have mine at a factory setting that Intel would suggest,
            | not the ASUS Multi-Core Enhancement crap. Noctua NH-D15
            | cooler. It's really been a stable setup.
        
             | thangngoc89 wrote:
              | I didn't set up anything in the BIOS. But my motherboard is
              | from ASUS. I will look into this. Thanks for your
              | suggestion.
        
               | Dunati wrote:
                | My 13900k has definitely degraded over time. I was
                | running BIOS defaults for everything and the PC was fine
                | for several months. When I started getting crashes it
                | took me a long time to diagnose it as a CPU problem.
                | Changing the mobo vdroop setting made the problem go away
                | for a while, but it came back. I then got it stable again
                | by dropping the core multipliers down to 54x, but then a
                | couple months later I had to drop to 53x. I just got an
                | RMA replacement and it has made it 12 hours without
                | issue.
        
         | J_Shelby_J wrote:
          | I evaluated DDR4 vs DDR5 a year ago, and it wasn't worth it:
          | chasing FPS, the cost to hit the same speed with DDR5 was just
          | too high, and I'm glad I stayed on DDR4. I'm on a 13700k and
          | I'm also very stable. However, with the stock XMP profile for
          | my RAM I was very much not stable, getting errors and BSODs
          | within minutes on an OCCT burn-in test. All I had to do was
          | roll back the memory clock speed a few hundred MHz.
        
       | phire wrote:
       | I find it hard to believe that it actually is a microcode issue.
       | 
       | Mostly because Intel has way too much motivation to pass it off
       | as a microcode issue, as they can fix a microcode issue for free,
       | by pushing out a patch. If it's an actual hardware issue, then
       | Intel will be forced to actually recall all the faulty CPUs,
       | which could cost them billions.
       | 
        | The other reason is that it took them way too long to give
       | details. If it's as simple as a buggy microcode requesting an
       | out-of-spec voltage from the motherboard, they should have been
       | able to diagnose the problem extremely quickly and fix it in just
       | a few weeks. They would have detected the issue as soon as they
       | put voltage logging on the motherboard's VRM. And according to
       | some sources, Intel have apparently been shipping non-faulty CPUs
       | for months now (since April, from memory), and those don't have
       | an updated microcode.
       | 
        | This long delay and silence feels like they spent months of R&D
        | trying to create a workaround: a new voltage spec that provides
        | the lowest voltage possible, low enough to work around a
        | hardware fault on as many units as possible without too large a
        | performance regression, and without creating new errors on other
        | CPUs because of undervolting.
       | 
       | I suspect that this microcode update will only "fix" the crashes
       | for some CPUs. My prediction is that in another month Intel will
       | claim there are actually two completely independent issues, and
       | reluctantly issue a recall for anything not fixed by the
       | microcode.
        
         | worthless-trash wrote:
          | I believe the waters may be muddied enough that they won't
          | have to do a full recall, and will only replace a chip if you
          | 'provide evidence' the system is still crashing.
        
         | RedShift1 wrote:
         | As I understand it, there are multiple voltages inside the CPU,
         | so just monitoring the motherboard VRM won't cut it.
         | 
          | That said, I too am very skeptical. I just issued a moratorium
          | on the purchase of anything Intel 13th/14th gen in our company
          | and am waiting for some actual proof that the issue is fully
          | resolved.
        
           | phire wrote:
           | It's complicated.
           | 
            | On Raptor Lake, there are a few integrated voltage regulators
            | which provide voltages for specialised uses (like the E-cores'
            | L2 cache, parts of the DDR memory IO, PCI-E IO), but the
            | current draw on those regulators is pretty low. The bulk of
            | the power comes directly from motherboard VRMs on one of
            | several rails with no internal regulation. Most of the power
            | draw is grouped onto just two rails: VccGT for the GPU, and
            | VccCore (also known as VccIA in other generations), which
            | powers all the P-cores, all the E-cores, the ring bus, and
            | the last-level cache.
           | 
           | Which means all cores share the same voltage, and it's
           | trivial to monitor externally.
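            | 
            | ("Trivial" in the sense that most boards expose the core
            | voltage through a Super I/O chip that the kernel's hwmon
            | drivers pick up. A rough sketch, assuming Linux; the label
            | your board gives the VCore rail, if any, varies by vendor
            | and driver:)
            | 
            |     import glob, os
            |     pattern = "/sys/class/hwmon/hwmon*/in*_input"
            |     for inp in glob.glob(pattern):    # values are in mV
            |         lbl = inp.replace("_input", "_label")
            |         name = (open(lbl).read().strip()
            |                 if os.path.exists(lbl) else inp)
            |         print(name, int(open(inp).read()) / 1000.0, "V")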
           | 
            | I guess it's possible the bug could be with one of the
            | integrated voltage regulators, but those seem to only power
            | various IO devices, and I struggle to see how they could
            | trigger this type of instability.
        
             | BeeOnRope wrote:
             | What's special about the E core's L2 cache such that it
             | gets on-chip regulated voltage?
        
         | jfindley wrote:
         | The months of R&D to create a workaround could simply be
         | because the subset of motherboards which trigger this issue are
         | doing something borderline/unexpected with their voltage
         | management, and finding a workaround for that behaviour in CPU
         | microcode is non-trivial. Not all motherboard models appear to
         | trigger the fault, which suggests that motherboard behaviour is
         | at least a contributing factor to the problem.
        
           | ploxiln wrote:
           | I think this issue was sort of cracked-open and popularized
           | recently by this particular video from Level1Techs:
           | https://www.youtube.com/watch?v=QzHcrbT5D_Y
           | 
           | Towards the middle of the video it brings up some very
           | interesting evidence, from online game server farms that use
           | 13900 and 14900 variants for their high single-core
           | performance for the cost, but with server-grade motherboards
           | and chipsets that do not do any overclocking, and would be
           | considered "conservative". But these environments show a very
           | high statistical failure rate for these particular CPU
           | models. This suggests that some high percentage of CPUs
           | produced are affected, and it's long run-time over which the
           | problem can develop, not just enthusiast/gamer motherboards
           | pushing high power levels.
        
         | starspangled wrote:
         | All modern CPUs come out of the factory with many many bugs.
         | The errata you see published are only the ones that they find
         | after shipping (if you're lucky, they might not even publish
         | all errata). Many bugs are fixed in testing and qualification
         | before shipping.
         | 
         | That's how CPU design goes. The way that is done is by pushing
         | as much to firmware as possible, adding chicken switches and
         | fallback paths, and all sorts of ways to intercept regular
         | operation and replace it with some trap to microcode or flush
         | or degraded operation.
         | 
         | Applying fixes and workaround might cost quite a bit of
         | performance (think spectre disabling of some kinds of branch
         | predictors for an obvious very big one). And in some cases you
         | even see in published errata they leave some theoretical
         | correctness bugs unfixed entirely. Where is the line before
         | accepting returns? Very blurry and unclear.
         | 
         | Almost certainly, huge parts of their voltage regulation (which
         | goes along with frequency, thermal, and logic throttling) will
         | be highly configurable. Quite likely it's run by entirely
         | programmable microcontrollers on chip. Things that are baked
         | into silicon might be voltage/droop sensors, temperature
         | sensors, etc., and those could behave unexpectedly, although
         | even then there might be redundancy or ways to compensate for
         | small errors.
         | 
          | I don't see that they "passed it off" as a microcode issue;
          | they just said that a microcode patch could fix it. As you can
          | see, it's very hard from the outside to know if something can
          | be reasonably
         | fixed by microcode or to call it a "microcode issue". Most
         | things can be fixed with firmware/microcode patches, by design.
         | And many things are. For example if some voltage sensor circuit
         | on the chip behaved a bit differently than expected in the
         | design but they could correct it by adding some offsets to a
         | table, then the "issue" is that silicon deviates from the model
         | / design and that can not be changed, but firmware update would
         | be a perfectly good fix, to the point they might never bother
         | to redo the sensor even if they were doing a new spin of the
         | masks.
         | 
         | On the voltage issue, they did not say it was requesting an out
         | of spec voltage, they said it was incorrect. This is not
         | necessarily detectable out of context. Dynamic voltage and
         | frequency scaling and all the analog issues that go with it are
         | fiendishly complicated, voltage requested from a regulator is
         | not what gets seen at any given component of the chip, loads,
         | switching, capacitance, frequency, temperature, etc., can all
         | conspire to change these things. And modern CPUs run as close
         | to absolute minimum voltage/timing guard bands as possible to
         | improve efficiency, and they boost up to as high voltages as
         | they can to increase performance. A small bug or error in some
         | characterization data in this very complicated algorithm of
          | many variables and large multi-dimensional tables could easily
         | cause voltage/timing to go out of spec and cause instability.
         | And it does not necessarily leave some nice log you can debug
         | because you can't measure voltage from all billion components
         | in the chip on a continuous basis.
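          | 
          | To make the table point concrete with a toy (made-up numbers,
          | nothing to do with Intel's real characterization data):
          | interpolating requests from a handful of points means a
          | single bad entry quietly pushes the top bins out of spec.
          | 
          |     # toy V/F characterization table: (MHz, mV), made up
          |     TABLE = [(800, 700), (3000, 950),
          |              (5500, 1250), (6000, 1450)]
          |     VMAX_MV = 1400   # pretend absolute spec limit
          | 
          |     def requested_mv(mhz):
          |         # linear interpolation between table points
          |         for (f0, v0), (f1, v1) in zip(TABLE, TABLE[1:]):
          |             if f0 <= mhz <= f1:
          |                 return v0 + (v1 - v0) * (mhz - f0) / (f1 - f0)
          |         return TABLE[-1][1]
          | 
          |     # say the 6000 MHz point should have been 1350 mV; with
          |     # 1450 in the table, the top bins quietly go over spec
          |     for mhz in (4000, 5700, 5950):
          |         mv = requested_mv(mhz)
          |         flag = "over spec!" if mv > VMAX_MV else "ok"
          |         print(mhz, round(mv), flag)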
         | 
         | And some bugs just take a while to find and fix. I'm not a
         | tester per se but I found a logic bug in a CPU (not Intel but
         | commercial CPU) that was quickly reproducible and resulted in a
         | very hard lockup of a unit in the core, but it still took weeks
         | to find it. Imagine some ephemeral analog bug lurking in a
         | dusty corner of their operating envelope.
         | 
         | Then you actually have to develop the fix, then you have to run
         | that fix through quite a rigorous testing process and get
         | reasonable confidence that it solves the problem, before you
         | would even make this announcement to say you've solved it. Add
         | N more weeks for that.
         | 
         | So, not to say a dishonest or bad motivation from Intel is out
         | of the question. But it seems impossible to make such
         | speculations from the information we have. This announcement
         | would be quite believable to me.
        
           | ChoGGi wrote:
           | I agree with most of what you said, so cherry picking one
           | thingy to reply to isn't my intention, but
           | 
           | "And some bugs just take a while to find and fix."
           | 
            | I think it's less that it took a while to find the bug/etc,
            | and more that they've been pretty much radio silent for six
            | months. AMD had the issue with burning 7-series CPUs; they
            | were quick to at least put out a statement that they'll make
            | customers whole again.
        
           | sqeaky wrote:
           | > As you see it's very hard from the outside to know if
           | something can be reasonably fixed by microcode or to call it
           | a "microcode issue
           | 
           | They claimed:
           | 
           | > a microcode algorithm resulting in incorrect voltage
           | requests to the processor.
        
         | gwbas1c wrote:
         | It's most likely both a hardware issue and a microcode issue.
         | 
         | Making CPUs is kind-of like sorting eggs. When they're made,
         | they all have slightly different characteristics and get placed
         | into bins (IE, "binned") based on how they meet the specs.
         | 
         | To oversimplify, the cough "better" chips are sold at higher
         | prices because they can run at higher clock speeds and/or
         | handle higher voltages. If there's a spec of dust on the die, a
         | feature gets turned off and the chip is sold for a lower price.
         | 
         | In this case, this is most likely an edge case _that would not
         | be a defect_ if shipping microcode already handled it.
            | (Although it is appropriate to ask whether affected chips
            | would end up in a lower-price bin as a result.)
        
           | nequo wrote:
           | > If there's a spec of dust on the die, a feature gets turned
           | off and the chip is sold for a lower price.
           | 
           | Do you mean that if a 13900KS CPU has a manufacturing defect,
           | it gets downgraded and sold as 13900F or something else
           | according to the nature of the defect?
        
             | gwbas1c wrote:
             | It's been almost 20 years since I worked in the industry,
             | so I don't want to make assumptions about _specific
             | products._
             | 
              | When I was in the industry, it would be things like
              | disabling caches, disabling cores, etc. I don't remember
              | specific products, though.
              | 
              | Likewise, some die can handle higher voltages, clock
              | speeds, etc.
        
             | colejohnson66 wrote:
             | Yes. It's called the silicon lottery.
        
         | burnte wrote:
         | > I find it hard to believe that it actually is a microcode
         | issue.
         | 
          | They learned a lot from the Pentium disaster. Even if it's a
          | hardware issue, they can at least address it with microcode,
          | which is just as good.
        
           | yencabulator wrote:
           | Except normally the result of a microcode workaround is that
           | the chip no longer performs at its claimed/previously-
           | measured level. Not "as good" by any standard.
           | 
           | For example, Intel CPU + Spectre mitigation is not "as good"
           | as a CPU that didn't have the vulnerability in the first
           | place.
        
             | sqeaky wrote:
              | At least with Spectre, applying the mitigation was a choice.
              | You could turn it off and game at full speed, while turning
              | it on for servers and web browsing for safety.
              | 
              | This is either busted or working.
        
             | burnte wrote:
             | Microcode changes don't have to affect performance
             | negatively. Do you have any evidence this one will? If it's
             | a voltage algorithm failure, then I would expect that they
             | could run it as advertised with corrected microcode.
             | Unstable power is a massive issue for electronics like this
             | and I have no problem believing their explanation. Bad
             | power causes all sorts of weird issues.
        
               | yencabulator wrote:
               | If it was a microcode bug to begin with, fixing the bug
               | wouldn't need to degrade performance. If it was e.g. a
               | bad sensor, that you can "correct" well enough by
               | postprocessing, it doesn't need to degrade performance.
               | But if it's essentially incorrect binning -- the hardware
               | can't function as they thought it would, use microcode to
               | limit e.g. voltage to the range where it works right --
               | then that will degrade performance.
        
       | xyst wrote:
       | Wonder what Linus has to say on this. Dude knows how to rip into
       | crappy Intel products
        
         | weberer wrote:
         | Torvalds or the Youtube guy?
        
           | happosai wrote:
           | Yes
        
             | aruametello wrote:
             | I can imagine both will bash intel a bit.
             | 
             | "Linus Tech Tips" for the gaming crowd situation (loss of
             | "paid for" premium performance) and Torvalds for the
             | hardware vendor lack of transparency with the community.
        
       | salamo wrote:
       | Is there any info on how to diagnose this problem? Having just
        | put together a computer with the 14900KF, I _really_ don't want
       | to swap it out if not necessary.
        
         | sudosysgen wrote:
         | Running a full memtest overnight and a day of Prime95 with
         | validation is the traditional way of sussing out instability.
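          | 
          | For a quick smoke test of the "with validation" part before
          | committing a full day to it, the idea is just: hammer every
          | core with work whose answer is known in advance and flag any
          | mismatch. A rough sketch, far less thorough than the real
          | tools:
          | 
          |     # every worker repeats a deterministic computation and
          |     # compares against a reference; mismatch = corruption
          |     import hashlib, multiprocessing as mp
          | 
          |     BLOB = b"x" * 1_000_000
          | 
          |     def digest():
          |         return hashlib.sha256(BLOB).hexdigest()
          | 
          |     def worker(rounds):
          |         ref = digest()
          |         return sum(digest() != ref for _ in range(rounds))
          | 
          |     if __name__ == "__main__":
          |         with mp.Pool() as pool:
          |             jobs = [2000] * mp.cpu_count()
          |             errors = sum(pool.map(worker, jobs))
          |         print("computation errors:", errors)  # >0 is bad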
        
           | paulmd wrote:
           | it's also a terrible stability test these days for the same
           | reasons Wendell talks about with cinebench in his video with
            | Ian (and Ian agrees too). It doesn't exercise like 90% of the
            | chip - it's purely a cache/AVX benchmark. You can have a
            | completely unstable frontend and it'll just work fine because
            | prime95 fits in icache and doesn't need the decoder, and it's
            | just vector op, vector op, vector op forever.
           | 
           | You can have a system that's 24/7 prime95 stable that crashes
           | as soon as you exit out, because it tests so very little of
           | it. That's actually not uncommon due to the changes in
           | frequency state that happen once the chip idles down... and
            | it's been this way for more than a decade; SpeedStep used to
            | be one of the things overclockers would turn off because it
            | posed so many problems vs just a stable constant-frequency
            | load.
        
         | J_Shelby_J wrote:
          | OCCT burn-in test with AVX and XMP disabled.
          | 
          | Tbh, XMP is probably the cause of most modern crashes on gaming
          | rigs. It does not guarantee stability. After finding a stable
          | CPU frequency, enable XMP and roll back the memory frequency
          | until you have no errors in OCCT. The whole thing can be done
          | in 20 minutes and your machine will have 24/7/365 uptime.
        
           | LtdJorge wrote:
           | This is good advice for overclocking, but how does it help
           | with the 13th/14th Gen issue? The issue is not due to clocks,
           | or at least doesn't appear to be.
        
         | Fabricio20 wrote:
          | There is no reliable way to diagnose this issue with the 14th
          | gen; the chip slowly degrades over time and you start getting
          | more and more crashes (usually GPU driver crashes under
          | Windows). I believe the easy way might be to run decompression
          | stress tests, if I remember correctly from Wendell's
          | (Level1Techs) video.
         | 
         | I highly recommend going into your motherboard right now and
         | manually setting your configurations to the current intel
         | recommendation to prevent it from degrading to the point where
         | you'd need to RMA it. I have a 14900K and it took about 2.5
         | months before it started going south and it was getting worse
          | by the DAY for me. Intel has closed my RMA ticket since
          | changing the BIOS settings to very-low-compared-to-what-the-
          | original-is has made the system stable again, so I guess I have
          | a 14900K that isn't a high-end chip anymore.
         | 
         | Below are the configs intel provided to me on my RMA ticket
         | that have made my clearly degraded chip stable again:
         | 
          | CEP (Current Excursion Protection) > Enable
          | eTVB (Enhanced Thermal Velocity Boost) > Enable
          | TVB (Thermal Velocity Boost) > Enable
          | TVB Voltage Optimization > Enable
          | ICCMAX Unlimited Bit > Disable
          | TjMAX Offset > 0
          | C-States (including C1E) > Enable
          | ICCMAX > 249A
          | ICCMAX_APP > 200A
          | Power Limit 1 (PL1) > 125W
          | Power Limit 2 (PL2) > 188W
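          | 
          | (If you're on Linux and want to sanity-check or apply just
          | the PL1/PL2 part without a reboot, the RAPL powercap sysfs
          | exposes the same limits. A sketch, assuming the usual
          | intel-rapl layout; BIOS settings can still override it and
          | it resets on reboot:)
          | 
          |     BASE = "/sys/class/powercap/intel-rapl:0"
          | 
          |     def read_w(name):
          |         with open(f"{BASE}/{name}") as f:
          |             return int(f.read()) / 1e6   # file is in uW
          | 
          |     def write_w(name, watts):
          |         with open(f"{BASE}/{name}", "w") as f:
          |             f.write(str(int(watts * 1e6)))
          | 
          |     # constraint_0 is the long-term (PL1-ish) limit and
          |     # constraint_1 the short-term (PL2-ish) one, usually
          |     print(read_w("constraint_0_power_limit_uw"),
          |           read_w("constraint_1_power_limit_uw"))
          |     write_w("constraint_0_power_limit_uw", 125)  # root
          |     write_w("constraint_1_power_limit_uw", 188)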
        
       | Havoc wrote:
       | > Intel is delivering a microcode patch which addresses the root
       | cause of exposure to elevated voltages.
       | 
       | That's great news for intel. If that's correct. If not that'll be
       | a PR bloodbath
        
       | eigenform wrote:
       | by "microcode" i assume they meant "pcode" for the PCU? (but they
       | decided not to make that distinction here for whatever reason?)
        
       | ChoGGi wrote:
       | Hmm, mid August is after the new Ryzens are out, I wonder how bad
       | of a performance hit this microcode update will bring?
       | 
       | And will it actually fix the issue?
       | 
       | https://www.youtube.com/watch?v=QzHcrbT5D_Y
        
       | uticus wrote:
       | Dumb question: let's say I am in charge of procurement for a
       | significant amount of machines, do I not have the option of
       | ordering machines from three generations back? Are older (proven
       | reliable) processors just not available because they're no longer
       | made, like my 1989 Camry?
        
         | wmf wrote:
         | Yeah, 12th gen is probably still available.
        
       | cdchn wrote:
       | I built a system last fall with an i9-13900K and have been having
       | the weirdest crashing problems with certain games that I never
       | had problems with before. NEVER been able to track it down, no
       | thermal issues, no overclocking, all updated drivers and BIOS.
       | Maybe this is finally the answer I've been looking for.
        
       ___________________________________________________________________
       (page generated 2024-07-23 23:11 UTC)