[HN Gopher] Intel Processor Instability Causing Oodle Decompress...
       ___________________________________________________________________
        
       Intel Processor Instability Causing Oodle Decompression Failures
        
       Author : firebaze
       Score  : 329 points
       Date   : 2024-02-23 09:14 UTC (13 hours ago)
        
 (HTM) web link (www.radgametools.com)
 (TXT) w3m dump (www.radgametools.com)
        
       | eqvinox wrote:
       | The article is a bit unclear on whether this happens with
       | standard/default settings, though that's probably because they
       | don't know themselves. The workarounds, which change things from
       | "Auto" to "disabled" or even increase voltage settings, certainly
       | make it seem like this also applies with defaults, and isn't
       | some overclocking/tuning side effect.
       | 
       | If that is the case... ouch.
        
         | yetihehe wrote:
         | It seems like it happens only on some select cpu specimens
         | (apparently works after replacing cpu with another one of the
         | same model), so probably a small binning failure?
        
           | eqvinox wrote:
           | I'm not sure I would call a binning failure "small" -- mostly
           | because I can't remember this ever happening before. Binning
           | is a core aspect of managing yields. And it seems that this
           | is breaking for a sufficient number of people to have a game
           | tooling vendor investigate. How many bug reports would it
           | take to get them into action?
        
             | rygorous wrote:
             | Person who actually did the investigation here. It took
             | exactly one bug report.
             | 
             | RAD/Epic Games Tools is a small B2B company. Oodle has one
             | person working full-time on it, namely me, and I do coding,
             | build/release engineering, docs, tech support, the works.
             | There's no multiple support tiers or anything like that,
             | all issues go straight into my inbox. Oodle Data in
             | particular is a lossless data compression API and many
             | customers use two entry points total, "compress" and
             | "decompress".
             | 
             | I get a single-digit number of support requests in any
             | given month, most of which is actually covered in the docs
             | and takes me all of 5 minutes to resolve to the customer's
             | satisfaction. The 3-4 actual bug reports I get in any given
             | year, I will investigate.
        
           | wmf wrote:
           | This is the usual silicon lottery. Every chip will work at
           | stock settings. Some will be stable when overclocked and some
           | won't.
        
         | scrlk wrote:
         | This sounds like motherboard manufacturers pushing aggressive
         | OOTB performance settings, likely in excess of the Intel spec.
        
           | eqvinox wrote:
           | That's... an assumption. At least 3 motherboard vendors are
           | affected, and going by the Gigabyte/MSI workarounds at the
           | end of the article, it looks like things need to be adjusted
           | away from Intel defaults.
           | 
           | ...it'll need a statement from Intel for some clarity on
           | this...
        
             | scrlk wrote:
             | > "Intel's default maximum TDP for the 13900K is 253 watts,
             | though it can easily consume 300 watts or more when given a
             | higher power limit. In our testing, manually setting the
             | power limit to 275-300 watts and the amperage limit to
             | 350A, proved to be perfectly stable for our 13900K. That
             | required going into the advanced CPU settings in the BIOS
             | to change the PL1/PL2 limits -- called short and long
             | duration power limits in our particular case. The
             | motherboard's default "Auto" power and current limits
             | meanwhile created instability issues -- which correspond to
             | a power limit of 4,096 watts and 4,096 amps." [0]
             | 
             | The motherboard manufacturers are setting default/auto
             | power and current limits that are way outside of Intel's
             | specs (253 W, 307 A) [1].
             | 
             | [0] https://www.tomshardware.com/pc-components/cpus/is-
             | your-inte...
             | 
             | [1] https://www.intel.com/content/www/us/en/content-
             | details/7438... (see pg. 98 and 184, the 13900K/14900K is
             | 8P + 16E 125 W)
        
               | eqvinox wrote:
               | Your [0] also says:
               | 
               | > It's not exactly clear why the 13900K suffers from
               | these instability problems, and how exactly downclocking,
               | lowering the power/current limits, and undervolting
               | prevent further crashes. Clearly, something is going
               | wrong with some CPUs. Are they "defective" or merely not
               | capable of running the out of spec settings used by many
               | motherboards?
        
               | scrlk wrote:
               | > Are they "defective" or merely not capable of running
               | the out of spec settings used by many motherboards?
               | 
               | I'd wager good money on the latter. Why would Intel
               | validate their CPUs against power and current limits that
               | are outside of spec? The users reporting issues probably
               | have CPUs that just made it into the performance
               | envelope to be binned as a 13900K, so running out of spec
               | settings on these weaker chips results in instability.
               | 
               | It's cases like this where I wish Intel didn't exit the
               | motherboard space, they were known to be reliable but
               | typically at the cost of having a more limited feature
               | set.
        
               | RetroTechie wrote:
               | > I'd wager good money on the latter.
               | 
               | Don't guess, measure! The proper action here would be to
               | change BIOS settings from their default / "auto" settings
               | to per-Intel-spec safe ones. Same for RAM, and on systems
               | with known good power supplies, CPU cooling, software
               | installs etc. Then one of the following will happen:
               | 
               | a) BIOS ignores user settings & problem persists.
               | 
               | b) BIOS applies user settings & problem goes away.
               | 
               | c) BIOS applies user settings but problem persists.
               | 
               | Cases a & b count as "faulty BIOS" (motherboard
               | manufacturer caused). Case c counts as "faulty CPU", and
               | replacement cpu may or may not fix that.
               | 
               | No need to guess. Just do the legwork on systems where
               | problem occurs & power supply, RAM, CPU cooling & OS
               | install can be ruled out. Sadly, no doubt there's many
               | systems out there where that last condition doesn't hold.
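
The three outcomes listed above can be written down as a small decision function. This is only a sketch of that reasoning: the name `classify_outcome` and its two boolean inputs are made up for illustration, and the fourth combination (settings ignored, yet the problem disappears) is not covered by the comment, so it is reported as inconclusive here.

```python
def classify_outcome(settings_applied: bool, problem_persists: bool) -> str:
    """Map a test result on an otherwise known-good system to a likely culprit."""
    if not settings_applied and problem_persists:
        # Case a: the BIOS ignored the in-spec settings the user entered.
        return "faulty BIOS"
    if settings_applied and not problem_persists:
        # Case b: in-spec settings fixed it, so the BIOS defaults were at fault.
        return "faulty BIOS"
    if settings_applied and problem_persists:
        # Case c: the chip misbehaves even within spec.
        return "faulty CPU"
    # Settings ignored but problem gone: not one of the three listed cases.
    return "inconclusive"


if __name__ == "__main__":
    for applied in (False, True):
        for persists in (False, True):
            print(applied, persists, classify_outcome(applied, persists))
```

The function only makes sense once power supply, RAM, cooling and OS install have been ruled out, as the comment stresses.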
        
               | vdaea wrote:
               | I have a 13900K. The default BIOS settings set a maximum
               | wattage of 4096W (!!!) that makes Prime95 fail. If I
               | change the settings back to 253W, what Intel says is the
               | maximum wattage, Prime95 stops failing.
               | 
               | Still, I don't know if I should RMA. I got the K version
               | because I intended to overclock in the future. And all of
               | this sounds like I won't be able to. I think increasing
               | the voltage a little bit makes the system more stable. I
               | have to play with it. (Really, if someone can say whether
               | I should RMA or not, I would appreciate some input)
               | 
               | Edit: decided to RMA. I have no patience for a CPU that
               | cost me +600EUR
        
               | eqvinox wrote:
               | > I'd wager good money on the latter.
               | 
               | I don't disagree, but I'm cautious about making a call
               | with the current information available. For example: yes,
               | a "4096W / 4096A" power limit _sounds_ odd, but it's not
               | an automatic conclusion that this limit is intended to
               | work to protect the CPU. Instead, it is a function that
               | allows building a system with a particular PSU dimension
               | -- it would be odd if that were overloaded to protect the
               | chip itself. Maybe it is, maybe it isn't.
               | 
               | It's also very much possible that the M/B vendors altered
               | other defaults, but... I don't see
               | information/confirmation on that yet. It used to be that
               | at least one of the settings is the original CPU vendor
               | default, but last I looked at these things was >5 years
               | ago :(.
               | 
               | > It's cases like this where I wish Intel didn't exit the
               | motherboard space,
               | 
               | Full ACK.
        
               | sirn wrote:
               | Modern CPUs have many limits to protect the CPU and alter
               | the clock behavior. For example, clock limit, current
               | limit (IccMax, 307A for 13900K), long power limit (PL1,
               | 125W), short power limit (PL2, 253W), transient peak
               | limit (PL3), overcurrent limit (PL4), thermal limit
               | (TjMax, 100c), Fast Throttle threshold (aka Per-Core
               | Thermal Limit, 107c), etc. It also has Voltage/Frequency
               | curves (V/F curve) to map how much voltage needs to drive
               | a certain frequency.
               | 
               | Intel 13900K has a fused V/F curve until its maximum
               | Turbo Boost 2.0 (5.5 GHz) in all cores, and two cores at
               | its Thermal Velocity Boost (aka favored cores, 5.8 GHz).
               | How much to boost depends on By Core Turbo Ratio. For
               | stock 13900K, this is 5.8 GHz for 2 cores, and 5.5 GHz
               | for up to 8 cores with E-cores capped at 4.3 GHz.
               | 
               | As you may have noticed, the CPU has a very coarse Turbo
               | Ratio beyond the first 2 cores. This is to allow the
               | clock to be regulated by one of the limits rather than a
               | fixed number. In reality, 253W PL2 can sustain around 5.1
               | GHz all P-cores, and after 56 seconds it will switch to
               | 125W PL1 which should give it around 4.7 GHz-ish (IIRC).
               | 
               | This is why when a motherboard manufacturer decides to
               | set PL1=PL2=4096 without touching other limits, it
               | results in a higher number in benchmark. The CPU will
               | consume as much power as it can to boost to 5.5 GHz,
               | until it hits one of the other limits (usually 100c
               | TjMax). This is how we ended up in this mess in the
               | consumer market.
               | 
               | Xeon, on the other hand, has a very conservative and
               | granular Turbo Ratio. My Xeon w9-3495x does have a fused
               | All Core Boost that does not exceed PL1 (56 cores 2.9 GHz
               | at 350W), which makes PL2 exist only for AVX512/AVX
               | workload.
               | 
               | (Side note: I always think that PL1=PL2=4096W is dumb
               | since performance gain is marginal at best, and always
               | set PL1=PL2=253W in all machines that I assembled. I
               | think even PL1=PL2=125W makes sense for most usage. I
               | do overclock my Xeon to sustain PL1=PL2=420W though (this
               | is around 3.6 GHz, which is enough to make it faster than
               | 64-cores Threadripper 5995WX))
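
To make the limits in the comment above concrete, here is a minimal sketch that stores them as data and compares the motherboard "Auto" values against them. The class and function names are hypothetical, the numbers are the ones quoted in this thread (not checked against a datasheet here), and `power_budget_w` is a deliberate simplification: it treats the 56 second window as a hard timer, whereas real hardware is understood to track a running average of power.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerLimits:
    pl1_w: float          # long-duration power limit, watts
    pl2_w: float          # short-duration power limit, watts
    icc_max_a: float      # current limit, amps
    tau_s: float = 56.0   # how long PL2 is allowed, per the comment


# Stock 13900K figures as quoted in the thread.
INTEL_SPEC = PowerLimits(pl1_w=125, pl2_w=253, icc_max_a=307)
# Motherboard "Auto" defaults as reported in the thread.
BOARD_AUTO = PowerLimits(pl1_w=4096, pl2_w=4096, icc_max_a=4096)


def exceeds_spec(cfg: PowerLimits, spec: PowerLimits = INTEL_SPEC) -> list[str]:
    """Return the names of the limits that cfg sets higher than spec."""
    fields = ("pl1_w", "pl2_w", "icc_max_a")
    return [name for name in fields if getattr(cfg, name) > getattr(spec, name)]


def power_budget_w(cfg: PowerLimits, seconds_under_load: float) -> float:
    """Toy model: PL2 applies for the first tau seconds, PL1 afterwards."""
    return cfg.pl2_w if seconds_under_load < cfg.tau_s else cfg.pl1_w


if __name__ == "__main__":
    print(exceeds_spec(BOARD_AUTO))
    print(power_budget_w(INTEL_SPEC, 10), power_budget_w(INTEL_SPEC, 120))
```

With PL1=PL2=4096 the budget never drops, so in this model only the remaining limits (in practice usually temperature, as the comment says) hold the clock back.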
        
               | mschuster91 wrote:
               | > amperage limit to 350A
               | 
               | Jesus. By German electrical code, you need a 70 mm2
               | cross-section of copper to transfer that kind of current
               | without the cable heating up to a point that it endangers
               | the insulation. How do mainboard manufacturers supply
               | that kind of current without resistive loss from the
               | traces frying everything?
        
               | bonzini wrote:
               | Those electrical code cross sections are for 350A at
               | 230V, corresponding to about 80 kW (400V is the same as
               | it's actually three 230V wires).
               | 
               | Processors operate at about 1V. At 300W it's enough to
               | use a much smaller cross section, which is split across
               | many traces.
        
               | coryrc wrote:
               | That's not how I^2R losses work. Voltage is not relevant.
        
               | planede wrote:
               | I agree. However voltage is relevant for insulation,
               | which also affects how heat can dissipate for the wire,
               | and might also be relevant for a failing wire, when
               | higher and higher voltage can build up at the point of
               | failure (not sure if it's a common engineering
               | consideration outside of fuses, which are designed to
               | fail).
        
               | coryrc wrote:
               | That code is for round wire (minimal surface area per
               | volume) that can be placed inside insulation in walls.
               | 
               | This 350A is flat conductors (maximal surface area thus
               | heat dissipation) and very short (not that much power to
               | dissipate so the things it connects to have a significant
               | effect on heat dissipation).
        
               | michaelt wrote:
               | The traces are extremely short. Look at a modern
               | motherboard and you'll find a bank of capacitors and
               | regulators about 2cm away from the CPU socket.
               | 
               | If you've got 4 layers of 2oz copper, and you make the
               | positive and negative traces 10mm wide, you'll only be
               | dissipating 28 watts when the CPU is dissipating 300
               | watts. And most motherboards have more than 4 layers and
               | have space for more than 10mm of power trace width. And
               | there's a bunch of forced air cooling, due to that 300
               | watts of heat the CPU is producing.
               | 
               | Electrical code doesn't let buildings use cables that
               | dissipate 28 watts for 2cm of distance because it would
               | be extremely problematic if your 3m long EV charge cable
               | dissipated 4200 watts.
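
The back-of-envelope numbers above can be roughly reproduced. The sketch below assumes room-temperature copper resistivity, 1 oz copper being about 35 micrometres thick, four 2 oz layers for each polarity, and 300 A (300 W at roughly 1 V). These are assumptions for illustration, not the commenter's stated inputs, so the result should land in the same tens-of-watts range as the quoted 28 W rather than on the exact figure. Note that voltage does not appear anywhere in the loss formula, only current and resistance.

```python
RHO_CU = 1.68e-8         # copper resistivity in ohm*m at about 20 C
OZ_THICKNESS_M = 35e-6   # 1 oz/ft^2 copper foil is about 35 micrometres thick


def trace_resistance_ohm(length_m, width_m, copper_oz, layers):
    """Resistance of `layers` identical flat traces connected in parallel."""
    area_m2 = width_m * copper_oz * OZ_THICKNESS_M * layers
    return RHO_CU * length_m / area_m2


def loop_loss_w(current_a, length_m, width_m, copper_oz, layers):
    """I^2 * R loss for the supply path plus an identical return path."""
    r_loop = 2 * trace_resistance_ohm(length_m, width_m, copper_oz, layers)
    return current_a ** 2 * r_loop


def scale_by_length(loss_w, from_cm, to_cm):
    """Same conductor and current: loss grows linearly with length."""
    return loss_w * to_cm / from_cm


if __name__ == "__main__":
    # 2 cm long, 10 mm wide, 2 oz copper, 4 layers per polarity, 300 A
    print(loop_loss_w(300, 0.02, 0.010, 2, 4))
    # The comment's 28 W over 2 cm, stretched to a 3 m cable
    print(scale_by_length(28, 2, 300))
```

The second call is the comment's own scaling argument: 3 m is 150 times 2 cm, so 28 W becomes 4200 W.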
        
               | crote wrote:
               | Bursty current spikes, short and fat traces, using the
               | motherboard as a heat sink, active cooling, and allowing
               | the temperature to rise quite a bit. If you look at
               | thermal camera videos[0], it's pretty clear where all the
               | heat is going (although a significant part of that is
               | coming from the voltage regulators).
               | 
               | On the other hand, your national electrical code is going
               | to assume you're running that 350A cable at peak capacity
               | 24/7, right next to other similarly-loaded cables,
               | stuffed in an insulated wall, for very long runs - and it
               | _still_ has to remain at acceptable temperatures during a
               | hot summer day.
               | 
               | [0]: https://www.youtube.com/watch?v=YyDMlXEZqb0
        
               | michaelt wrote:
               | _> The motherboard manufacturers are setting default/auto
               | power and current limits that are way outside of
               | Intel's specs_
               | 
               | The CPU only draws as much power as it needs, though?
               | 
               | I mean, if you plug a 20 watt phone into a 60 watt USB-C
               | power supply, or a 60 watt laptop into a 100 watt USB-C
               | power supply the device doesn't get overloaded with
               | power. It draws no more current than it needs.
               | 
               | The motherboard's power limits should state the amount of
               | power the PCB traces and buck regulators are rated to
               | provide to the socket - and if that's more than the
               | processor needs that's good, as it avoids throttling.
        
               | rfoo wrote:
               | The problem is these processors are unstable if they are
               | not properly throttled.
               | 
               | Of course users, especially enthusiast motherboard
               | consumers, hate throttling, hence the default.
        
               | scrlk wrote:
               | * User is running a demanding application (e.g. game).
               | 
               | * CPU clock speed increases (turbo boost), as long as the
               | CPU isn't hitting: 1) Tj_MAX (max temp before thermal
               | throttling kicks in); 2) the power and current limits
               | specified by the motherboard (in this case, effectively
               | disabled by the out of spec settings).
               | 
               | * Weaker chips will require more power to hit or maintain
               | a given turbo clock speed: with the power and current
               | limits disabled, the CPU will attempt to draw out of spec
               | power & current, causing issues for the on die fully-
               | integrated voltage regulator (noting that there's also
               | performance/quality variance for the FIVR), resulting in
               | the user experiencing instability.
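
The chain of events above can be illustrated with a toy model. Everything numeric here is invented for illustration except the 5.5 GHz ceiling and the 253 W and 4096 W limits, which come from the thread: power is assumed to scale as `k * f**3`, and `k` is a made-up per-chip quality factor where a weaker chip has a larger `k`. Thermal and current limits are left out entirely.

```python
F_MAX_GHZ = 5.5  # all-P-core turbo ceiling quoted in the thread


def power_draw_w(k: float, f_ghz: float) -> float:
    """Toy power model: draw grows with the cube of the clock."""
    return k * f_ghz ** 3


def settled_clock_ghz(k: float, power_limit_w: float) -> float:
    """Highest clock the chip reaches without exceeding the power limit."""
    f_at_limit = (power_limit_w / k) ** (1 / 3)
    return min(F_MAX_GHZ, f_at_limit)


if __name__ == "__main__":
    weak_chip_k = 1.9  # hypothetical: needs more power per clock than average
    for limit in (253, 4096):
        f = settled_clock_ghz(weak_chip_k, limit)
        print(limit, round(f, 2), round(power_draw_w(weak_chip_k, f), 1))
```

In this model the in-spec limit makes the weaker chip settle at a lower clock, while the effectively disabled limit lets it run at the ceiling and draw more than 253 W to do so.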
        
               | wongarsu wrote:
               | Drawing peak power far in excess of the TDP is what all
               | Intel processors have been designed to do for many years
               | now.
               | 
               | Some consider it cheating the benchmarks, but the
               | justification is that TDP is the _Thermal_ Design Power.
               | It's about the cooling system you need, not the power
               | delivery. If you make reasonable assumptions about the
               | thermal inertia of the cooling system you can Turbo Boost
               | at higher power and hope the workload is over before you
               | are forced to throttle down again.
               | 
               | Any mainboard that sets power limits to the TDP would be
               | considered wrong by both the community and Intel. This
               | looks like a solid indication that the issue is with
               | Intel.
        
             | lupusreal wrote:
             | > _At least 3 motherboard vendors are affected_
             | 
             | Boss says, _"Do the thing."_ Engineer says, _"The thing
             | is out of spec!"_ Boss says, _"Competitor is doing the
             | thing already and it works."_ Engineer does the thing.
        
               | eqvinox wrote:
               | Yes, but also no. The motherboard market is "timing-
               | competitive", your product needs to be ready when the CPU
               | launches, especially for the kind of flagship CPU that
               | this specific issue is about. You can't wait and see what
               | the competitors are doing.
        
               | lupusreal wrote:
               | Fair point. Maybe _"This sort of thing worked fine in
               | the past."_
        
               | crote wrote:
               | Or perhaps they're all copying from the same reference
               | design?
        
             | paulmd wrote:
             | all three motherboard vendors enabling some out-of-spec
             | defaults wouldn't actually be surprising, though?
             | 
             | people forget, blowing up AM5 cpus wasn't just an Asus
             | thing... they were just the most ham-handed with the
             | voltages. _Everyone_ was operating out of spec, there were
             | chips that blew up on MSI and Gigabyte boards, and it
             | wasn't just X3D either.
             | 
             | Intel is no different - nobody enforces the power limit out
             | of the box, and XMP will happily punch voltages up to
             | levels that result in eventual degradation/electromigration
             | of processors (on the order of years). Every enthusiast
             | knows that CPU failures are "rare" and yet either has had
             | some, or knows someone who's had some in their immediate
             | circles. Because XMP actually has caused non-trivial
             | degradation even on most DDR4 platforms.
             | 
             | In fact it's entirely possible that this is an
             | electromigration issue right here too - notice how this
             | affects some 13700Ks and 13900Ks too? Those chips have been
             | run for a year or two now. And if the processors were
             | marginal to begin with, and operated at out-of-spec
             | voltages (intentionally or not)... they could be starting
             | to wear out a little bit under the heaviest loads. Or the
             | memory controllers could be starting to lose stability at
             | the highest clocks (under the heaviest loads). That's a
             | thing that's not uncommon on 7nm and 5nm tier nodes.
        
               | rygorous wrote:
               | This is blowing up now, but the first report of this kind
               | of issue that reached me (I'm the current Oodle
               | maintainer) was in spring of last year. We've been trying
               | to track it down (and been in contact with Intel) since
               | then. The page linked in the OP has been up since
               | December.
               | 
               | Epic Games Tools is B2B and we don't generally get bug
               | reports from end users (although later last year, we did
               | have 2 end users write to us directly because of this
               | problem - first time this has happened for Oodle that I
               | can think of, and I've been working on this project since
               | 2015). Point being, we're normally at least one level
               | removed from end user bug reports, so add at least a few
               | weeks while our customers get bug reports from end users
               | but haven't seen enough of them yet to get in touch with
               | us (this is a rare failure that only affects a small
               | fraction of machines).
               | 
               | 13900Ks have been out since late Oct 2022. It's possible
               | that this doesn't show up on parts right out of the box
               | and takes a few months. It's equally plausible that it's
               | been happening for some people for as long as they've had
               | those CPUs, and the first such customers just bought
               | their new machines late 2022, maybe reported a bug around
               | the holidays/EOY that nobody looked at until January, and
               | then it took another 2-3 months for 3-4 other similar
               | crashes to show up that ultimately resulted in this case
               | getting escalated to us.
        
         | rygorous wrote:
         | (I'm Oodle maintainer and did most of this investigation.)
         | 
         | For the majority of systems "in the wild", I don't know. We had
         | two people with affected machines contact us and consent to do
         | some testing for us, and in both cases the issue still
         | reproduced after resetting the BIOS settings to defaults.
        
       | yetihehe wrote:
       | TL;DR: some motherboards by default overclock too much on some
       | intel processors, causing instability.
        
         | worewood wrote:
         | It's MCE all over again
        
         | eqvinox wrote:
         | That's not my interpretation. Cf the following:
         | 
         | > For MSI:
         | 
         | > Solution A): In BIOS, select "OC", select "CPU Core Voltage
         | Mode", select "Offset Mode", select "+(By PWM)", adjust the
         | voltage until the system is stable, recommend not to exceed
         | 0.025V for a single increase.
         | 
         | This really sounds like the Intel defaults are broken too.
        
           | yetihehe wrote:
           | Yes, that or insufficient quality checks, meaning some units
           | will fail, some will work. Apparently it was only a subset of
           | each model failing.
        
         | haunter wrote:
         | >default overclock too much
         | 
         | Per the article MSI literally suggests to OC the CPU to fix the
         | problem
        
           | xcv123 wrote:
           | Article recommends disabling overclocking. The MSI
           | recommendation is only to increase voltage.
        
       | londons_explore wrote:
       | If Oodle has control of this code, the logical thing for them to
       | do, when they detect a decompression checksum failure, is to
       | re-decompress the same data (perhaps single-threaded rather than
       | multithreaded).
       | 
       | Sure, the user has a broken CPU, but if you can work around it
       | and still let the user play their games, you should.
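
A minimal sketch of the retry idea proposed above, using Python's `zlib` as a stand-in for Oodle (whose API is not shown in this thread). The function name, the exception class and the CRC32 check are assumptions made for illustration. On healthy hardware the retry changes nothing, since decompression is deterministic; it only matters when the machine itself produced the wrong result.

```python
import zlib


class DecompressionError(Exception):
    """Raised when no attempt produces data matching the expected checksum."""


def decompress_checked(blob: bytes, expected_crc: int, attempts: int = 2) -> bytes:
    """Decompress blob, retrying if the output checksum does not match."""
    for _ in range(attempts):
        try:
            data = zlib.decompress(blob)
        except zlib.error:
            continue  # corrupt stream or a glitch mid-decode; try again
        if zlib.crc32(data) == expected_crc:
            return data
    raise DecompressionError(
        f"checksum mismatch after {attempts} attempts; "
        "the input may be corrupt or the hardware may be unstable"
    )


if __name__ == "__main__":
    payload = b"example asset data " * 50
    blob = zlib.compress(payload)
    assert decompress_checked(blob, zlib.crc32(payload)) == payload
```

A retry that succeeds hides the symptom in this one code path only; it says nothing about whether other computations on the same machine went wrong.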
        
         | yetihehe wrote:
         | Yes, but then the processor will fail at another task during
         | the game and corrupt some other memory. The only solution for
         | an unstable processor is to make it stable or replace it.
        
           | mike_hock wrote:
           | Yes. Props to Oodle for not passing on the hot potato but
           | trying to get the root cause fixed. This hack would have been
           | the easy way out for them so their product doesn't get
           | blamed.
        
         | lifthrasiir wrote:
         | As noted in the linked page, this issue would affect any heavy
         | use of CPU. Oodle happened to be optimized well to hit this
         | issue earlier than most other applications, but nothing can
         | really be trusted at that point. There is a reason that they
         | recommend disabling overclocking if possible, because such
         | issues are in general linked to instability due to excessive
         | overclocking.
        
           | barrkel wrote:
           | Retry is not an unusual response to unreliable hardware, and
           | all hardware is ultimately unreliable.
           | 
           | Software running at scale in the cloud is written to be
           | resilient to errors of this nature; jobs are cattle, if jobs
           | get stuck or fail they are retried, and sometimes duplicate
           | jobs are started concurrently to finish the entire batch
           | earlier.
        
             | lifthrasiir wrote:
             | Cloud machines have a way better guarantee about such
             | errors though. You will eventually see some errors at
             | scale, but that error rate can be reliably quantified and
             | handled accordingly.
             | 
             | Consumer machines are comparably wild. Remember that this
             | issue was mainly spotted from Unreal error messages. Some
             | do too much overclocking without enough testing, which
             | _will_ eventually harm the hardware anyway. Some happen to
             | live in places where single-event upsets are more frequent
             | (for example, high altitude or more radioactive bedrock).
             | Some have an insufficient power supply that causes erratic
             | behaviors only on heavy load. All those conditions _can_ be
             | handled in principle, but doing so is much harder in
             | practice. So giving up is much more reasonable in this
             | context.
        
         | cesaref wrote:
         | quote: However, this problem does not only affect Oodle, and
         | machines that suffer from this instability will also exhibit
         | failures in standard benchmark and stress test programs.
         | 
         | It sounds like a hardware issue, I'm guessing over-aggressive
         | memory/cpu tuning, underpowered PSU triggering off behaviour
         | etc. The fact that replacing the processor makes the problem go
         | away does not in itself point to the processor as the issue -
         | you may find that changing the memory also 'fixes' the problem.
        
         | crest wrote:
         | I disagree. This code is just lucky enough to be able to detect
         | the data corruption, but without very deep understanding
         | afterward the whole system state has to be assumed to be
         | corrupted. You can retry until the code doesn't _detect_ data
         | corruption, but you have to assume other state is also
         | corrupted. The right thing would be to scream loudly at the
         | user that their system is unstable and to expect (future) data
         | corruption unless the hardware + firmware (or its
         | configuration) is fixed.
         | 
         | Sure it's unpleasant to be the messenger of bad news, but the
         | alternatives are far worse unless the system is just a
         | dedicated game console without any background processes (which
         | isn't how those CPUs are used).
        
           | BlueTemplar wrote:
           | Yeah, I had a RAM issue that might or might not have involved
           | turning on XMP, with eventually as a result several RAM
           | sticks with errors (BadRAM is amazing BTW, sadly, *nix-only),
           | and worse, corrupted storage partitions !
           | 
           | Was a real pain to deal with the fallout...
        
         | aChattuio wrote:
         | Needs to be fixed by microcode, BIOS update or recall from
         | Intel and partners
        
         | flohofwoe wrote:
         | That same hardware bug will also result in other corruption in
         | other places where it's not detected, which may then spiral
         | into much more catastrophic behaviour. Oodle is just special in
         | that it detects the corruption and throws an error (which is
         | the right thing to do in this situation IMHO).
         | 
         | The ball is in Intel's court, such faulty CPUs should never
         | have made it out into the wild.
        
         | mnw21cam wrote:
         | I'll add to the chorus of other responses that the whole
         | computer is generally unreliable.
         | 
         | However, the really useful thing that this software _could_ do
         | is make the error message much better - explaining the likely
         | thing that is causing the decompression to fail, and advising
         | the computer should be fixed.
        
         | zX41ZdbW wrote:
         | We have the same check for silent data corruption in
         | ClickHouse. After it detects a checksum mismatch, it also tries
         | to check if it is caused by a single bit flip. If it is, we can
         | provide a more precise diagnostic for a user about possible
         | causes:
         | https://github.com/ClickHouse/ClickHouse/issues?q=is%253Aiss...
         | 
         | Then the natural question arises: if we detect that it is a
         | single bit flip, should we "un-flip" that bit, fix the data,
         | and continue? The answer is: no. These types of errors should
         | be explicit. They successfully help to detect broken RAM and
         | broken network devices, that have to be replaced. However, the
         | error is fixed automatically anyway by downloading the data
         | from a replica.
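
A minimal sketch of the single-bit-flip diagnostic described above. It
is not ClickHouse's actual code: the function name
`find_single_bit_flip` and the use of CRC32 as the checksum are
assumptions made for illustration. After a checksum mismatch it tries
flipping each bit in turn and reports the position that makes the
checksum match, so the error message can point at likely hardware
trouble. It only diagnoses; it does not hand back "repaired" data.

```python
import zlib
from typing import Optional


def find_single_bit_flip(data: bytes, expected_crc: int) -> Optional[int]:
    """Return the index of the one bit whose flip makes CRC32 match.

    Returns None if the data already matches or if no single flip explains
    the mismatch. Brute force: one CRC pass per candidate bit, so this is
    only meant for the rare failure path.
    """
    if zlib.crc32(data) == expected_crc:
        return None  # no mismatch, nothing to diagnose
    buf = bytearray(data)
    for bit in range(len(buf) * 8):
        byte_index, bit_in_byte = divmod(bit, 8)
        buf[byte_index] ^= 1 << bit_in_byte      # try flipping this bit
        if zlib.crc32(buf) == expected_crc:
            return bit
        buf[byte_index] ^= 1 << bit_in_byte      # restore and keep looking
    return None
```

A caller would log the returned bit position as a hint about faulty RAM
or a faulty network device, and still treat the block as corrupt.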
        
         | rygorous wrote:
         | (I'm the person who did most of the investigation.)
         | 
         | A relatively major realization during the investigation was
         | that a different mystery bug that also seemed to be affecting
         | many Unreal Engine games, namely a spurious "out of video
         | memory" error reported by the graphics driver, seemed to be
         | occurring not just on similar hardware, but in fact the exact
         | same machines.
         | 
         | For a public example, if you google for "gamerevolution the
         | finals crash on launch" and "gamerevolution the finals out of
         | video memory", you'll find a pair of articles describing
         | different errors, one resulting from an Oodle decompression
         | error, and one from the graphics driver spuriously reporting
         | out-of-memory errors, both posted on the same day with the same
         | suggested fix (lower P-core max clock multiplier).
         | 
         | That's the problem right there in a nutshell. It's not just
         | Oodle detecting spurious errors during its validation. Other
         | code on the same machine is glitching too. And "just try
         | repeating" is not a great fix because we can't trust the
         | "should we repeat?" check any more on that machine than we can
         | trust any of the other consistency checks that we already know
         | are spuriously failing at a high rate.
         | 
         | Many known HW issues you can work around in software just fine,
         | but frequent spurious CPU errors don't fall into that category.
        
       | lifthrasiir wrote:
       | This page doesn't seem to be linked from any other public page,
       | so I think it was a response to unwanted complaints from users
       | who tried to track the "oodle" thing in the error log---like
       | SQLite back in 2006 [1].
       | 
       | [1] https://news.ycombinator.com/item?id=36302805
        
         | eqvinox wrote:
         | It's linked from https://www.radgametools.com/tech.htm (click
         | "support" at the top, look next to "Oodle" logo - "Note: If you
         | are having trouble with an Intel 13900K or 14900K CPU, please
         | [[read this page]].")
        
           | lifthrasiir wrote:
           | Ooh, thank you! I looked so long at the Oodle section and
           | skimmed other sections as well (even searched for the
           | `oodleintel.htm` link in their source codes), but somehow
           | missed that...
        
         | atesti wrote:
         | There are some pages that are not linked, wondered what
         | happened to these products
         | 
         | https://www.radgametools.com/granny.html
         | https://www.radgametools.com/iggy.htm
         | https://www.radgametools.com/milesperf.htm
        
           | rygorous wrote:
           | Granny, Iggy and Miles are all discontinued as stand-alone
           | products. We're still providing support to existing customers
           | but not selling any new licenses.
        
             | pixelpoet wrote:
             | While we've got you, any chance you'll attend another
             | demoparty here in Germany? :)
             | 
             | Big thanks for your awesome blog, learnt much from it over
             | the years.
        
               | rygorous wrote:
               | Thanks!
               | 
               | Chance, sure, it's just a matter of logistics. Revision
               | is a bit tricky since it's usually shortly after GDC, a
               | very busy time in the game engine/middleware space I work
               | in, so not usually when I feel up to a pair of
               | international flights. :) Best odds are for something
               | between Christmas and New Year's Eve since that's when
               | I'm usually in Germany visiting family and friends
               | anyway.
        
       | imdsm wrote:
       | 1994 all over again!
        
         | cwillu wrote:
         | FDIV was a bug in the logical design; this is over-aggressive
         | clock tuning. I fail to see any resemblance whatsoever beyond
         | Intel being inside.
        
           | crest wrote:
           | It's no longer black and white like the FDIV bug, but if the
           | default configuration leads to data corruption in heavy SIMD
           | workloads... sure you can reduce clock speed or increase
           | voltage until it works, but unless the mainboards violate the
           | specs this is at least partly an Intel CPU flaw leading to
           | data corruption.
        
       | mhio wrote:
       | This sounds familiar... ye olde Pentium III 1.13 GHz
       | 
       | https://www.tomshardware.com/reviews/intel-admits-problems-p...
        
         | ManuelKiessling wrote:
         | Unrelated to the actual topic, but kudos to the Tom's Hardware
         | site that they serve a 24-year-old web posting flawlessly.
        
       | franzb wrote:
       | Reminds me of this saga I went through as an early adopter of AMD
       | Threadripper 3970X:
       | 
       | https://forum.level1techs.com/t/amd-threadripper-3970x-under...
       | 
       | HN discussion: https://news.ycombinator.com/item?id=22382946
       | 
       | Ended up investigating the issue with AMD for several months, was
       | generously compensated by AMD for all the troubles (sending
       | motherboards and CPUs back and forth, a real PITA), but the
       | outcome is that I've been running since then with a custom BIOS
       | image provided by AMD. I think at the end the fault was on
       | Gigabyte's side.
        
         | rwmj wrote:
         | Reminded me of the Intel Skylake bug found by the OCaml
         | compiler developers:
         | https://tech.ahrefs.com/skylake-bug-a-detective-story-ab1ad2...
        
         | rkagerer wrote:
         | Holy cow I had no idea CPU vendors would do this for you.
        
           | devmor wrote:
           | When you're not only helping them debug their own hardware
           | but are also spending money on their ridiculously overpriced
           | HEDT platform, it probably makes them want to keep you happy.
        
             | zitterbewegung wrote:
             | That is true and also lots of people use OCaml
        
           | zare_st wrote:
           | Supermicro gave us the same type of assistance. The then-new
           | bifurcation feature did not work correctly. Without it, an
           | enterprise telecommunications peripheral that costs 10x more
           | than a 4-socket Xeon motherboard can't run at nominal speed,
           | and it was run on real lines, not test data.
           | 
           | They sent us custom BIOSes until it got stabilized and said
           | they'll put the patch in the following BIOS releases.
           | 
           | The thing is neither Intel nor AMD nor Supermicro can test
           | edge cases at max usage in niche environments without paying
           | money, but they would really love to claim with backup they
           | can be integrated for such solutions. If Intel wants to test
           | stuff in space for free they have to cooperate with NASA; the
           | alternative is in-house launch.
        
             | deepsun wrote:
             | NASA has super-elaborate testbeds and simulators. Maybe
             | producers can provide some format/interfaces/simulators for
             | users, users would write test-cases for it, and give back
             | to providers to run in-house.
             | 
             | If users pay seven figures+ it might make sense.
        
       | vdaea wrote:
       | I have a 13900K and I am affected. Out of the box BIOS settings
       | cause my CPU to fail Prime95, and it's always the same CPU cores
       | failing. Lowering the power limit slightly will make it stable. I
       | intended to better refrigerate the CPU and change the power limit
       | back to the default and if the problems continued I would RMA the
       | CPU, but now I'm not so sure that the BIOS is not pushing it
       | beyond the operating limits.
        
       | Havoc wrote:
       | Same cpus as the unity engine (or was it unreal?) with issues
       | 
       | Not a good look but at least it's fixable with bios tweaks rather
       | than a silicon flaw that's permanent
        
       | zvmaz wrote:
       | Does it mean that a formally verified piece of software like seL4
       | can still fail because of a potential "bug" in the hardware?
        
         | flumpcakes wrote:
         | I would assume that _any_ software, formally verified or not,
         | could fail due to a hardware problem. A cosmic ray could flip a
         | bit in a CPU register. The chances of that happening, and that
         | affecting anything in any meaningful way is probably
         | astronomically low. We probably have thousands of hardware
         | failures every day and don't notice them. This is why I think
         | rust in a kernel is probably a bad idea if it doesn't change
         | from the default 'panic on error'.
        
         | hmottestad wrote:
         | I would assume that software can always fail in the event of a
         | bug in the hardware. That's why systems that are really
         | redundant, for instance flight control computers, have several
         | computers that have to form a consensus of sorts.
        
           | eqvinox wrote:
           | It doesn't even need a bug in the hardware; cosmic rays or
           | alpha particles can also cause the same type of issue. For
           | those, making systems redundant is indeed a good solution.
           | 
           | For the situation of an actual (_consistent_) hardware
           | _bug_, redundancy wouldn't help... the redundant system
           | would have the same bug. Redundancy only helps for
           | random-style issues. (Which, to be fair, the one we're
           | talking about here seems to be.)
        
             | davrosthedalek wrote:
             | That's why some redundant systems use alternative
             | implementations for the parallel paths. Less likely that a
             | hardware bug will manifest the same way in all
             | implementations.
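
The consensus idea in this subthread can be sketched as a simple
majority vote over redundant computations. This is an illustration
only, not how any real flight control system is built; the name
`majority_vote` and the strict-majority rule are assumptions. Each
replica is a callable, which could be the same code on separate
hardware or, as suggested above, an independent implementation.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_vote(replicas: Sequence[Callable[[], Hashable]]) -> Hashable:
    """Run every replica and return the value a strict majority agrees on."""
    results = [run() for run in replicas]
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        # No strict majority: refuse to pick a winner.
        raise RuntimeError(f"no majority among results: {results!r}")
    return value
```

With three replicas, a single glitching unit is outvoted. A consistent
bug shared by all replicas still wins the vote, which is the case where
diverse implementations help.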
        
         | rygorous wrote:
         | Absolutely, yes.
         | 
         | It can also misbehave without any hardware bugs due to
         | glitching. Rates of incidence of this must be quite low or that
         | would be considered a HW bug, but it's never zero. Run code for
         | enough hours on enough machines collecting stack traces or core
         | dumps on crashes and you will notice that there's a low base
         | rate of failures that make absolutely no sense. (E.g. a null
         | pointer dereference literally right after a successful non-null
         | pointer check 2 instructions above it in the disassembly.)
         | 
         | You will also notice that many machines in a big fleet that log
         | such errors do so exactly once and never again, but some
         | reoccur several times and have a noticeably elevated failure
         | rate even though they're running the exact same code as
         | everyone else. This too is normal. These machines are, due to
         | manufacturing variation on the CPU, RAM, or whatever, much
         | glitchier than the baseline. Once you've identified such a
         | machine, you will want to replace it before it causes any
         | persistent data corruption, not just transient crashes or
         | glitches.
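
A rough sketch of the fleet triage described above: count unexplained
crash reports per machine and flag the ones that keep coming back. The
function name `suspect_machines`, the input shape (one machine ID per
crash report), and the default threshold of 3 are all assumptions for
illustration; a real fleet would pick a threshold from its own base
rate.

```python
from collections import Counter
from typing import Iterable, List


def suspect_machines(crash_reports: Iterable[str],
                     min_repeats: int = 3) -> List[str]:
    """Return machine IDs with at least `min_repeats` crash reports.

    `crash_reports` holds one machine ID per nonsensical crash. Machines
    that appear once are treated as ordinary background glitches.
    """
    counts = Counter(crash_reports)
    return sorted(m for m, n in counts.items() if n >= min_repeats)
```

For example, `suspect_machines(["a", "b", "a", "c", "a"])` flags only
machine `a`, the one worth replacing before it corrupts persistent
data.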
        
       | enraf wrote:
       | I got one of the faulty 13900k, at least in my case I can confirm
       | that the fault appeared using the default settings for pl1/pl2.
       | 
       | I was doing reinforcement learning on that system and it was
       | always crashing. I spent quite a bit of time trying to find the
       | problem, then swapped the CPU for a 13700kf I was using in
       | another PC, and the problem was solved.
       | 
       | So I contacted Intel to start the RMA process. Intel said that
       | the MSI motherboard I was using doesn't support Linux. I emailed
       | them the official Intel GitHub repo with the microcode that
       | enables the support. They switched agents at that point, but it
       | was clear to me at that moment that Intel was trying their best
       | to avoid the RMA. Luckily I live in Europe, so I contacted my
       | local consumer protection agency and did the RMA through them.
       | In the meantime I saw a good offer for a 7950x + motherboard at
       | an online retailer, bought it, and sold my old motherboard and
       | the RMA'd 13900k on the second-hand market when I got it.
       | 
       | Not buying Intel ever again, I was using Intel because they
       | sponsor some projects in DS but damn.
        
         | hopfenspergerj wrote:
         | I've had instability with my 7700k since I bought it, and 16
         | months of bios updates haven't helped. Maybe this latest
         | generation of processors just has more trouble than older,
         | simpler designs.
        
           | acdha wrote:
           | Intel has been struggling with CPU performance for a decade,
           | and has been trying to regain their position in absolute
           | performance and performance/{price,watt} comparisons. I think
           | that means they're being less conservative than they used to
           | be on the hardware margins and also that their teams are
           | likely demoralized, too.
        
           | smolder wrote:
           | Possibly. I would start swapping parts around at that point.
           | Different memory, different CPU, or different motherboard.
           | Just 1 more anecdote, but my r7-7700x has been a dream (won
           | the silicon lottery). It runs at the maximum undervolt & RAM
           | at 6000 with no stability problems.
        
       | mattgreenrocks wrote:
       | I can't say I'm surprised by this at all.
       | 
       | I bought my 4790k's ASUS TUF board awhile back because I wanted
       | something basic enough and wasn't interested in overclocking or
       | tweaking. The BIOS had other ideas. I had to manually configure a
       | lot more things just to avoid overclocking, including setting RAM
       | timing and going through each BIOS setting to ensure it wasn't
       | overclocking in some way. The "optimal" setting would turn on
       | aggressive changes like playing with bus speed multipliers, etc.
        
         | layer8 wrote:
         | Few people buy a K processor who aren't interested in
         | overclocking and tweaking. I wouldn't be surprised if the BIOS
         | of a gaming mainboard sets the "optimal" defaults on that
         | basis, since the gaming market is all about benchmarks.
        
           | phil21 wrote:
           | I'm pretty much the same as OP. I almost always buy the K
           | version of the processor, but never intend to overclock. I
           | just figure I want the theoretical ability to, and the more
           | volume they have on those SKUs the less likely they are to
           | take it away entirely.
           | 
           | That or I'm just rewarding shitty corporate product
           | segmentation behavior. I never can quite decide.
           | 
           | I do agree over the recent years getting a "boring" higher-
           | end configuration is getting more and more difficult.
        
             | bonton89 wrote:
             | K chips often came with higher default clocks and
             | definitely have better resale value so they're often worth
             | buying even if you don't overclock.
        
               | smolder wrote:
               | Yes, the overclockable chips are better-binned/faster
               | chips even without enabling overclocking. (Unless you're
               | talking about X3D chips, which have most overclocking
               | features turned off due to thermal limitations of stacked
               | cache.)
        
           | Sweepi wrote:
           | "Few people buy a K processor who aren't interested in
           | overclocking and tweaking." The opposite is true: Most people
           | who buy a 'K' CPU don't do any tweaking, I would bet a
           | majority does not even activate things like XMP. The 'K' SKUs
           | are 1) the highest SKU in the lineup in a given class, and
           | 2) faster than the non-'K' SKUs out of the box.
        
             | crote wrote:
             | > I would bet a majority does not even activate things like
             | XMP
             | 
             | I _highly_ doubt that. XMP is pretty much mandatory to get
             | even remotely close to the intended performance. Without
             | XMP your DDR4 memory isn't going beyond 2400MHz - but you
             | almost have to _try_ to find a motherboard, memory, or CPU
             | which can't run at 3200MHz or even higher. It has all been
             | designed for speeds like that, it's just not part of the
             | official DDR4 spec.
             | 
             | It's less critical with DDR5, but you're still expected to
             | enable it.
        
               | paulmd wrote:
               | nevertheless, both AMD and Intel refuse to warranty
               | processors operated outside of the spec, including when
               | done via XMP/Expo. AMD has gone so far as to add an
               | e-fuse in recent generations that permanently marks
               | processors that have been operated outside the official
               | spec.
               | 
               | https://www.extremetech.com/computing/amds-new-threadripper-...
               | 
               | As much as enthusiasts would like this to be "normalized"
               | - from the perspective of the vendor it is _not_, they
               | are very clear that this is something they do not cover.
               | And it will become more and more of a problem as
               | generations go forward - electromigration is happening
               | faster and faster (sometimes explosively, in the case of
               | AMD).
               | 
               | But it is quite difficult to get a gamer to understand
               | something when their framerate depends on not
               | understanding it.
               | 
               | https://semiengineering.com/uneven-circuit-aging-becoming-a-...
               | 
               | https://semiengineering.com/3d-ic-reliability-degrades-with-...
               | 
               | https://semiengineering.com/mitigating-electromigration-in-c...
               | 
               | > GD-106: Overclocking AMD processors, including without
               | limitation, altering clock frequencies / multipliers or
               | memory timing / voltage, to operate beyond their stock
               | specifications will void any applicable AMD product
               | warranty, even when such overclocking is enabled via AMD
               | hardware and/or software. This may also void warranties
               | offered by the system manufacturer or retailer. Users
               | assume all risks and liabilities that may arise out of
               | overclocking AMD processors, including, without
               | limitation, failure of or damage to hardware, reduced
               | system performance and/or data loss, corruption or
               | vulnerability.
               | 
               | > GD-112: Overclocking memory will void any applicable
               | AMD product warranty, even if such overclocking is
               | enabled via AMD hardware and/or software. This may also
               | void warranties offered by the system manufacturer or
               | retailer or motherboard vendor. Users assume all risks
               | and liabilities that may arise out of overclocking
               | memory, including, without limitation, failure of or
               | damage to RAM/hardware, reduced system performance and/or
               | data loss, corruption or vulnerability.
        
               | bee_rider wrote:
               | I wonder if XMP is typically enabled by reviewers or on
               | marketing slides.
        
           | dmvdoug wrote:
           | > since the gaming market is all about benchmarks.
           | 
           | Why is that? I'm not a gamer so legit asking. It would seem
           | to me that what would be most important is do the actual
           | games that exist perform well, not some random, hypothetical
           | maximum performance that benchmarks can game.
        
             | Modified3019 wrote:
             | My impression is that people looking at gaming benchmarks
             | are looking at comparisons of FPS and frame times taken
             | from just running recent high end games, which sometimes
             | have settings to run through a repeatable demo for exactly
             | this purpose.
        
           | kllrnohj wrote:
           | The K chips aren't just unlocked, they're also significantly
           | faster out of the box. I'd guess very few K owners have any
           | intention of overclocking, especially as the gains are _very_
           | small, and instead just want the higher out of box
           | performance
        
         | ajross wrote:
         | As others are pointing out, that "basic enough" CPU is in fact
         | aimed directly at the overclocking market, as (likely) is the
         | motherboard you put it on. This isn't basic at all, this is a
         | high end tweaker rig. It's just a _decade old_ tweaker rig.
        
           | mattgreenrocks wrote:
           | Fair enough. I build it to last 7-10 years typically, so
           | happy to spend a little more on a quality board.
           | 
           | What's the go-to basic mobo brand/board for non-tweakers
           | these days?
        
             | ajross wrote:
             | There's not a lot, honestly. Pretty much all discrete
             | motherboards are gaming rigs of some form. The basic
             | computer for general users is now a "laptop" (which tend to
             | work quite well for general gaming, FWIW). But the low end
             | choices from the regular suspects (Gigabyte, MSI, Asus) are
             | generally fine in my experience. You do occasionally get a
             | weird/flawed device, like you do with many product areas.
        
               | Arrath wrote:
               | Yeah it really seems the market has bifurcated into "DIY
               | build-a-computer" targeted towards gamers, bedazzled with
               | RGB and all that jazz, and "Buy a used/refurb Dell mini-
               | atx office desktop computer", assuming they don't just
               | default to 'buy a laptop' as you point out.
        
             | rpcope1 wrote:
             | Supermicro is usually a good bet.
        
       | IYasha wrote:
       | (c) 1991 - 2024 Epic Games Tools LLC
       | 
       | Wow. RAD was bought by Epic? I kinda missed that. Feels old. :(
        
         | mobilio wrote:
         | Yup
         | 
         | https://www.epicgames.com/site/en-US/news/epic-acquires-rad-...
        
       | tibbydudeza wrote:
       | This is why I chose an i9 13900 (non-K) variant instead - my PC
       | earns me money as a freelance software dev so I can't stand
       | weird issues like this
        
         | svantana wrote:
         | As another software dev, I would pay big money for the "worst
         | possible computer" that exhibits all of the glitches and issues
         | that end users see. It's so annoying to get bug reports that I
         | can't reproduce.
        
           | tibbydudeza wrote:
           | I had my time during my embedded days - did a site visit 1000
           | km away and discovered no wonder the serial port and
           | scanner/printer is going wonky.
           | 
           | No shielding, earth - using the crappiest/cheapest PC they
           | could get instead of using the recommended kit as the sales
           | droid wanted a bigger commission.
           | 
           | Said call me when you replaced the h/w - I walked out and
           | went to the airport. They never called me.
        
         | fabianhjr wrote:
         | If that was the case why not go for Ryzen + ECC memory?
        
           | tibbydudeza wrote:
           | Got ECC memory in my server - I am a value for money - my
           | previous kit was i7-6700 system 48GB so I really sweated it
           | until Jetbrains let me know "She canno go more Captain".
           | 
           | DDR4/Intel motherboards are cheaper than AM5/DDR5 - also a
           | Ryzen laptop foobarred on my daughter so to me Intel kit was
           | just more stable - no weird XMP issues or overclocking to the
           | nines.
        
         | codexon wrote:
         | I got a 13900 non-K on a Linux server and it randomly locked up
         | the system after a month.
        
       | op00to wrote:
       | This sounds a lot like the behavior I see when I have overclocked
       | my processor too far, and try to run AVX-heavy workloads!
       | Cranking down the frequency during AVX seems to stabilize things.
        
         | Arech wrote:
         | Had the same experience overclocking an old AMD Phenom II a
         | while ago. Worked flawlessly in all publicly available test
         | software I tried, until I ran some custom heavily vectorized
         | code, which eventually required shaving off almost all of the
         | overclock :D
        
           | op00to wrote:
           | There's a way (at least on my Intel) to tell the processor to
           | clock down a certain number of steps depending on whether AVX
           | is being executed. So, for the majority of stuff that didn't
           | use AVX I let 'er rip, but when AVX is running it clocks down
           | a couple steps. I could use less voltage, and this CPU is
           | fast enough. I think it's a 13900k.
        
             | Arech wrote:
             | Ah, that's an interesting feature of new CPUs, I didn't
             | know about it! Thanks for telling!
        
       | newsclues wrote:
       | Remember when Intel had Intel-branded reference boards? I would
       | like a comeback please
        
       | mbrumlow wrote:
       | I recently built a new system with an i9 14900KF and an ASUS
       | Formula motherboard, as a VFIO system so I could run Windows and
       | play some games.
       | 
       | It was a nightmare to get running stable. None of the default
       | settings the motherboard used worked. Games crashed, kernel and
       | emacs compiles failed.
       | 
       | End result I had to cap turbo to 5.4ghz on a 6ghz chip, and
       | enable settings that capped max watts and temperature for
       | throttling to 90c.
       | 
       | System seems stable now. Can get sustained 5.4ghz without
       | throttling and enjoying games at 120fps with 4k resolution.
       | 
       | Even though it is working I do feel a way about not being able to
       | run the system at any of the advertised numbers I paid for.
        
         | doubled112 wrote:
         | What I'm not happy about is the marketing around turbo boost.
         | 
         | You know how ISPs used to sell "up to X Mbps"? Same idea. Your
         | chip will turbo boost "up to 6.00 GHz".
         | 
         | It's basically automated overclocking, and as you learned,
         | sometimes it can't even do it in a stable fashion. Some of
         | those chips will never clock "up to 6.00 GHz" but they didn't
         | lie. "up to"
        
           | wtallis wrote:
           | It's particularly bad when they stop telling you what clock
           | speeds are achievable with more than one core active. At best
           | these days you get a "base clock" spec that's very slow that
           | doesn't correspond to any operating mode that occurs in real
           | life. You used to get a table of x GHz for y active cores,
           | but then the core counts got too large and the limits got
           | fuzzier.
           | 
           | And laptops have another layer of bullshit, because the
           | theoretical boost clocks the chip is capable of will in
           | practice be limited by the power delivery and cooling
           | provided by that specific machine, and the OEMs never tell
           | you what those limits are. So they'll happily take an extra
           | $200 for another 100MHz that you'll never see for more than a
           | few milliseconds while a different model with a slower-on-
           | paper CPU with better cooling can easily be more than 20%
           | faster.
        
             | kilolima wrote:
             | Yes, this is the situation for my dell laptop's i7-1165G7.
             | Alleged turbo boost to 4.7ghz! In reality, it will hit that
             | for a sec and then throttle to ~1ghz. I had to disable the
             | turbo boost AND two cores in bios to let it even achieve
             | ~2.0ghz speeds consistently. It's a total scam. Turns out
             | my 8th-Gen i5 laptop is almost the same speed on
             | benchmarks, just because it's a few mm thicker with better
             | cooling.
        
         | CooCooCaCha wrote:
         | Do you think that happened because you had insufficient
         | cooling?
         | 
         | It's hard to cool these new chips. AMD included.
        
           | kijin wrote:
           | Even if GP's cooling setup was less than ideal, the chip
           | should have throttled itself to a stable frequency instead of
           | crashing left and right.
        
             | cesarb wrote:
             | It might not have throttled fast enough. Without sufficient
             | thermal mass (or with insufficient heat transfer to that
             | thermal mass, for instance if the thermal paste is
             | misapplied), it might heat up too fast for the sensors to
             | keep up.
        
           | Hikikomori wrote:
           | For different reasons though. AMD's chiplets produce heat in
           | a small area which makes it hard to transfer heat quickly.
           | Intel just use a shitload more power and thus more heat.
        
             | CooCooCaCha wrote:
             | That's not entirely it though. Modern AMD and Intel chips
             | are built to run at their thermal/frequency limits and will
             | jump to those limits at a moment's notice in order to
             | maximize performance.
             | 
             | So unless you have powerful cooling you will hit the
             | thermal limit.
        
               | Hikikomori wrote:
               | I didn't say that it was everything that matters, just
               | commenting on the difference between them.
        
         | JohnBooty wrote:
         | > enable settings that capped max watts and temperature for
         | throttling to 90c.
         | 
         | You were going _above_ 90C before???
         | 
         | My first thought is that seems insane, but is apparently normal
         | for that chip, according to Intel: "your processor supports up
         | to 100degC and any temperatures below it are normal and
         | expected"
         | 
         | https://community.intel.com/t5/Processors/i9-14900K-temperat...
         | 
         | That is just wild though. On one hand you should obviously get
         | the performance that was advertised and that you paid for. On
         | the other hand IMO operating a CPU at 90-100C is just insane.
         | It really feels like utter desperation on Intel's part.
         | 
         | I would be curious what kind of cooling setup you have.
        
           | legosexmagic wrote:
           | the amount of cooling you get is proportional to the
           | difference of component temperature to ambient temperature.
           | that's why modern chips are engineered to run much hotter.
        
             | jnxx wrote:
             | Until Dust Puppy kills 'em
        
           | dist-epoch wrote:
           | For both Intel/AMD 100C is now a target, not a limit.
        
           | bee_rider wrote:
           | Hey, is there a cooling solution that sprays water on some
           | sort of heat spreader and lets it evaporate? Kidding. Kinda.
           | But actually, is that possible?
        
             | Analemma_ wrote:
             | That's essentially what vapor chamber coolers are. It's a
             | sealed unit where the working fluid evaporates on the CPU
             | end, absorbing a ton of heat, and then condenses on the
             | other side, before going back to do the cycle again.
             | Because the heat of vaporization is so large, these can
             | move a lot more heat than ordinary heat sinks.
        
             | smolder wrote:
             | Heat pipes sort of do that naturally without any active
             | "spraying". They contain a fluid that phase changes and
             | carries heat away. Closed loop water coolers have the
             | active flow of water you want for maximum effect. I don't
             | think your idea would be an improvement on that.
        
           | jcalvinowens wrote:
           | > On the other hand IMO operating a CPU at 90-100C is just
           | insane.
           | 
           | No it isn't, the manufacturer literally says it's normal! I
           | think people who spend as much money on cooling setups as the
           | chip are the insane ones.
           | 
           | My favorite story: I once put Linux on a big machine that had
           | been running windows, and discovered dmesg was full of
           | thermal throttling alerts. Turns out, the heatsink _was not
           | in contact with the CPU die_ because it had a nub that needed
           | to occupy the same space as a little capacitor.
           | 
           | I'd been using that machine to play X-plane for over two
           | years, and I never noticed. It was not meaningfully slower:
           | the throttling events would only happen every ten or so
           | seconds. I'm still using it today, although with the heatsink
           | fixed :)
           | 
           | I have a garage machine with a ca. 2014 Haswell that's been
           | running full tilt at 90C+ for a good bit of its life. It just
           | won't die.
        
           | crote wrote:
           | Temperatures like that have been fairly normal for a few
           | generations now - both for Intel and AMD. It might look
           | insane compared to what you were used to seeing a decade ago,
           | but it's actually not that crazy.
           | 
           | First, the temperature sensors got a lot better. Previously
           | you only had one sensor per core/cpu, and it was placed
           | wherever there was space - nowadays it'll have dozens of
           | sensors placed in the most likely hotspots. A decade ago a
           | 70C temp meant that some parts of the CPU were closer to 90C,
           | whereas nowadays a 90C temp means the hottest part is
           | _actually_ 90C.
           | 
           | Second, the better sensors allow more accurate tuning. While
           | 100C might be totally fine, 120C is probably already going to
           | cause serious damage. The problem here is that you can't just
           | rely on a somewhat-distant sensor to always be a constant 20C
           | below the peak value: it's also going to be lagging a bit. It
           | might take a tenth of a second for that temp spike in the
           | hotspot to reach the sensor, but in the time between the
           | spike starting and the temp at the sensor rising enough to
           | trigger a downthrottle you could've already caused serious
           | damage. A decade ago that meant leaving some margin for
           | safety, but these days they can just keep going right up to
           | the limit.
           | 
           | It's also why overclocking simply isn't really a "thing"
           | anymore. Previous CPUs had plenty of safety margin left for
           | risk-takers to exploit, modern CPUs use up all that margin by
           | automatically overclocking until it hits either a temperature
           | limit or a power draw limit.
        
             | JoshTriplett wrote:
             | Exactly. Temperature measurements are a lot like available
             | memory measurements in that regard. People wonder why the
             | OS uses up all available memory, and it's because the OS
             | knows that empty memory is useless, while memory used to
             | cache disk is potentially useful (and can always be
             | discarded when that memory is needed for something else).
             | So, the persistent state of memory is always "full".
             | 
             | Similarly, processors convert thermal headroom to
             | performance, until they run out of thermal headroom. So if
             | you improve the cooling on a processor that has work to do
             | (rather than sleeping), it will use up that cooling and
             | perform better, rather than performing the same and running
             | cooler.
             | 
             | (Mobile processors operate differently, since they need to
             | maintain a much tighter thermal envelope to not burn the
             | user. And processors can also target power levels rather
             | than thermals. But when a processor is in its default "run
             | as fast as possible" mode, its normal operating temperature
             | will be close to the 100C limit.)
        
               | jtriangle wrote:
               | There's a side benefit to this as well, your cooling
               | solution is more effective at 90C than it is at 40C, you
               | know, high school physics, deltaT and all that.
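               | 
               | A minimal sketch of the deltaT point, assuming a simple
               | steady-state model where heat flow equals temperature
               | difference divided by a fixed thermal resistance. The
               | 0.25 K/W cooler and 25C ambient are hypothetical values:

```python
def heat_removed_watts(die_temp_c, ambient_c=25.0,
                       thermal_resistance_k_per_w=0.25):
    """Steady-state heat a cooler can move at a given die temperature.

    Assumes a constant (hypothetical) thermal resistance from die to
    ambient, which real coolers only approximate.
    """
    return (die_temp_c - ambient_c) / thermal_resistance_k_per_w


for temp in (40.0, 90.0):
    print(f"die at {temp:.0f} C -> {heat_removed_watts(temp):.0f} W removed")
```

               | With these assumed numbers the same cooler moves 60 W at
               | 40C but 260 W at 90C.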
        
         | BobbyTables2 wrote:
         | Seems scary that a 10% difference in clock frequency
         | makes/breaks stability.
         | 
         | How much margin is really there?
        
           | rygorous wrote:
           | Dynamic switching power (i.e. the fraction of the chip's
           | power consumption from actually switching transistors, as
           | opposed to just "being on") scales with V^2 * f, where
           | V=voltage and f=frequency, and V in turn depends on f, where
           | higher frequencies need higher voltage. Not really linearly
           | (It's Complicated(tm)), but it's not a terrible first-order
           | approximation, which makes the dynamic switching power have a
           | roughly cubic dependency on frequency.
           | 
           | Therefore, 1.1x the frequency at the high end (where
           | switching power dominates) is 1.33x the power draw.
           | 
           | Those final few hundred MHz really hurt. Conversely, that's
           | also why you see "Eco" power profiles with a major reduction
           | in power draw that cost you maybe 5-10% of your peak
           | performance.
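           | 
           | A minimal sketch of that arithmetic, assuming the linear
           | V-vs-f approximation above (real parts deviate from it) and
           | a hypothetical helper name:

```python
def relative_dynamic_power(freq_ratio, voltage_exponent=1.0):
    """Dynamic switching power relative to baseline, using P ~ V^2 * f.

    Assumes V scales as f**voltage_exponent; 1.0 is the rough linear
    first-order approximation, giving an overall cubic dependency.
    """
    voltage_ratio = freq_ratio ** voltage_exponent
    return voltage_ratio ** 2 * freq_ratio


for ratio in (0.90, 0.95, 1.00, 1.10):
    power = relative_dynamic_power(ratio)
    print(f"{ratio:.2f}x clock -> {power:.2f}x dynamic power")
```

           | Under this model 1.10x clock comes out at about 1.33x
           | dynamic power and 0.90x clock at about 0.73x. It only covers
           | the switching part of the power draw, not leakage.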
        
         | callalex wrote:
         | Did you try cleaning everything and re-mounting the cooler with
         | new paste? I've seen similar behavior when people mess up and
         | get bubbles in their paste. Do you see wildly different
         | temperature readouts for different cores?
        
         | phantomwhiskers wrote:
         | I also recently built a system with the 14900KF on an ASUS TUF
         | motherboard and NZXT Kraken 360 cooler, and so far I haven't
         | experienced any issues running everything at default BIOS
         | settings (defaulted to 5.7GHz). I haven't seen temps above 70C
         | yet, although granted I also haven't seen CPU load go above
         | 40%, and haven't tried running any benchmarking software.
         | 
         | I'm curious about what you are using for cooling, as 90C at
         | 5.4ghz seems way off compared to what I am seeing on my
         | processor, but it could just be that I'm not pushing my
         | processor quite as hard even with the higher clock rate.
        
       | jeffbee wrote:
       | Similar experience with an Asus motherboard. With their automatic
       | tuning, instability leading to compiler crashes. Had to manually
       | set the BIOS for sanity.
       | 
       | I believe the problems are compounded by the way their SuperIO
       | controls the cooler, because the crashes were associated with
       | temperature excursions to 100C. It's too slow to ramp up and too
       | quick to ramp down. It is possible to tune this from userspace
       | under Linux. But really the up ramp should be controlled by a
       | leading indicator like the voltage regulator instead of a lagging
       | indicator. Alternately the Linux p-state controller could
       | anticipate the power levels and program a higher fan speed.
        
       | PawBer wrote:
       | Reminds me of this Raymond Chen classic:
       | https://devblogs.microsoft.com/oldnewthing/20050412-47/?p=35...
        
       | Ochi wrote:
       | So ideally, we should disable hyper threading to mitigate
       | security issues and now also disable turbo mode to mitigate
       | memory corruption issues. Maybe we should also disable C states
       | to avoid side-channel attacks and disable efficiency cores to
       | avoid scheduler issues... and at some point we are back to a
       | feature set from 20+ years ago. :P
        
         | bee_rider wrote:
         | IMO it is worth noting that the "turbo mode," as you call it,
         | seems to be an overclock that some motherboards do by default.
         | Not the stock boost frequencies.
         | 
         | The hyperthread and c-state stuff, eh, if you want to run code
         | that might be a virus you will have to limit your system. I
         | dunno. It would be a shame if we lost the ability to ignore
         | that advice. Most desktops are single-user after all.
        
           | blibble wrote:
           | turbo boost is an advertised feature of the chip
           | 
           | these chips have been specially binned because they are
           | supposedly stable at those frequencies (within an envelope
           | set by intel)
           | 
           | if intel can't get it to work they shouldn't be selling these
           | chips at all
        
             | bee_rider wrote:
             | Unless I misread the blog post, there doesn't seem to be
             | any issue with the stock turbo behavior.
        
           | alwayslikethis wrote:
           | Provided enough cooling, a chip that can boost to its turbo
           | frequency for a few seconds should also run stably at that
           | frequency indefinitely. Nowadays these boost clocks are so
           | high that there is often not much gained by pushing any
           | further.
        
           | dist-epoch wrote:
           | Intel should police their own ecosystem.
        
             | ajross wrote:
             | They have, in the past. People (including posters here)
             | absolutely freaked out about clock-locked processors and
             | screamed about the needless product differentiation of
             | selling "K" CPUs at a premium.
             | 
             | People _want_ to overclock. Gamers want to see big numbers.
             | If gamers don't do it their motherboard vendors will. It's
             | not a market over which Intel is going to have much
             | control, really.
             | 
             | Note that you don't, in general, see this kind of silly
             | edgelord clocking in the laptop segments.
        
               | dist-epoch wrote:
               | Overclocking is ok.
               | 
               | Out of the box default overclocking is not, this aspect
               | should be policed.
        
               | ajross wrote:
               | FWIW, there's no evidence that this is an "out of the box
               | default" configuration on any of this hardware. Almost
               | certainly these are users who clicked on the "Mega Super
               | Optimizzzz!!!" button in their BIOS settings. And again,
               | overclocking support on gaming motherboards _is a
               | feature_ that consumers want, and will pay for. So of
               | course the vendors are going to provide it.
        
               | rygorous wrote:
               | Oodle maintainer here, we had two people that hit the
               | issue offer to run some experiments for us. Neither were
               | doing any overclocking before and both tried numerous
               | things including resetting to BIOS defaults and also
               | updating their BIOS (there was a known [to Intel] issue
               | affecting some ASUS boards that had been fixed in a BIOS
               | update in spring of 2023, and we were asked to rule it
               | out.)
               | 
               | This issue doesn't affect every such machine, but both
               | people affected by the issue that consented to run tests
               | for us still had the issue reproduce after flashing BIOS
               | to current and with BIOS default settings for absolutely
               | everything.
               | 
               | Among the settings enabled _by default_ on some boards:
               | current limit set to 511 amps (...wat), long duration
               | power limit set to 350W (Intel spec: 125W), short
               | duration power limit also set to 350W (Intel spec: 253W),
               | "MultiCore Enhancement" which is extra clock boosting
               | past what the CPUs do themselves set to "Auto" not "Off",
               | and some others.
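               | 
               | To put those two power limits side by side, a small
               | sketch using only the wattages quoted above. The dict
               | and function names are hypothetical and nothing is read
               | from real hardware or BIOS:

```python
# Values as quoted in the comment above, in watts.
INTEL_SPEC_WATTS = {
    "long_duration_power_limit": 125,
    "short_duration_power_limit": 253,
}
BOARD_DEFAULT_WATTS = {
    "long_duration_power_limit": 350,
    "short_duration_power_limit": 350,
}


def out_of_spec(board, spec):
    """Return {setting: board/spec ratio} for each setting above spec."""
    return {
        name: board[name] / limit
        for name, limit in spec.items()
        if board[name] > limit
    }


for name, ratio in out_of_spec(BOARD_DEFAULT_WATTS, INTEL_SPEC_WATTS).items():
    print(f"{name}: {ratio:.2f}x the Intel spec")
```

               | That works out to 2.80x the spec for the long duration
               | limit and about 1.38x for the short duration limit.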
        
             | jnxx wrote:
             | Why does this remind me of that big, extremely profitable
             | company that made something every American needs once in a
             | while, which seems to have abandoned all sanity in their
             | processes? Looks like Intel and Boeing are on a similar
             | path....
        
           | jnxx wrote:
           | > The hyperthread and c-state stuff, eh, if you want to run
           | code that might be a virus you will have to limit your
           | system.
           | 
           | So, you are trusting all web pages you view? Because these
           | are unknown code running on your box which probably has some
           | beefy private data.
        
             | Wowfunhappy wrote:
             | I know some people browse the web while gaming, but I
             | don't. For the gaming use case, I legit want a toggle that
             | says "yes, all the code I'm running is trusted, now please
             | prioritize maximum performance at all costs." For all I
             | care this mode can cut the network connection since I don't
             | do multiplayer.
             | 
             | I imagine people doing e.g. heavy number crunching might
             | want something similar.
        
             | bee_rider wrote:
             | I run noscript and try to be selective about which pages I
             | enable.
        
           | jrockway wrote:
           | Remember that you run a lot of untrusted code on your single-
           | user desktop through Javascript on websites. Javascript can
           | do all those side channel attacks like Spectre and Meltdown.
        
             | bee_rider wrote:
             | There are almost certainly unmitigated Spectre-style bugs
             | hiding in modern hardware. People who don't block
             | JavaScript by default are impossible to protect anyway.
        
             | VyseofArcadia wrote:
             | Maybe you do, but some of us use NoScript[0] and whitelist
             | sites we trust.
             | 
             | I'm not affiliated with NoScript. I just think it's insane
             | that we run oodles of code to display web pages.
             | 
             | [0] https://noscript.net/
        
               | Workaccount2 wrote:
               | Using no-script made me realize how unchained the
               | Internet has become. Sites with upwards of 15 different
               | domains all running whatever JS they want on your
               | machine. Totally insane.
        
         | secondcoming wrote:
         | Or just disable overclocking.
        
           | ajross wrote:
           | Why is this downvoted? That's exactly what's happening here.
           | The affected devices are being overclocked, and the
           | instructions at the end of the linked support document detail
           | how to find the correct limits for your CPU and set them in
           | your BIOS.
        
             | sandworm101 wrote:
             | I think it is because overclocking has become so normal
             | that it is an expected feature on most chips. Being told to
             | disable it is like being told to disable the supercharger
             | on your new Ferrari: you are no longer getting what you
             | thought you had paid for.
        
               | crazygringo wrote:
               | Is that true?
               | 
               | I thought the entire premise of overclocking was that
               | it's not officially supported and it may break things.
               | 
               | The whole point is that you're _not_ paying for it and
               | it's entirely at-risk.
               | 
               | Because if you do want a higher level of _guaranteed_
               | performance, you _do_ need to pay for a faster chip (if
               | it exists).
        
               | Kluggy wrote:
               | CPU manufacturers certainly hold the line you stated but
               | motherboard vendors have jumped over the line and now
               | sell motherboards that overclock for the end user entirely
               | transparently.
               | 
               | It's fair for the end user who bought a motherboard that
               | promises a higher clock speed to expect that clock speed.
        
               | crazygringo wrote:
               | Do these motherboards explicitly provide a warranty that
               | covers not just damage from overclocking but also CPU
               | errors?
               | 
               | If you can provide links, I'd be curious to see what
               | guarantees they make. "What's fair" depends very
               | specifically on what language they use.
        
               | Red_Leaves_Flyy wrote:
               | From my limited knowledge the motherboard manufacturers
               | hide behind disclaimers. Iirc even using fast ram at
               | their rated clock speed with a cpu that does not support
               | that speed is a warranty violation.
        
               | Kluggy wrote:
               | https://www.asus.com/motherboards-components/motherboards/tu...
               | has "ASUS Multicore Enhancement" bios setting which
               | defaulted to Auto which is documented as "This item
               | allows you to maximize the overclocking performance
               | optimized by ASUS core ratio settings."
               | 
               | They have now entered the AI bubble with
               | 
               | https://www.asus.com/microsite/motherboard/Intelligent-mothe...
               | 
               | MSI has a similar setting, although I don't know exactly
               | what models have it nor what it's called
        
               | sandworm101 wrote:
               | Well, Ferrari also tells people not to break speed
               | limits. But if their cars started breaking apart at 85mph
               | they would still be blamed. This might not be warranty
               | repair, intel is probably not liable legally, but this
               | should have impact on their reputation: intel put out a
               | chip that does not handle overclocking very well. Ok.
               | I'll remember that when I am shopping for my next chip.
        
               | simondotau wrote:
               | > The whole point is that you're not paying for it
               | 
               | Tell that to anyone who paid extra for a K-series Intel
               | chip.
        
               | bee_rider wrote:
               | I don't think there's a great car analogy because the
               | ecosystems and stakes are different.
               | 
               | These chips require motherboards to function, and these
               | unlocked chips get their configuration from the
               | motherboard. There's no analogous entity to Ferrari the
               | company here, it is like you bought an engine from one
               | company, a gearbox from another, and the gearbox had a
               | "responsiveness enhancement" setting that always redlined
               | your RPMs or something (I don't know cars).
        
               | ajross wrote:
               | Ferrari for sure warrants their cars as sold. But if you
               | take it to a mod shop and put in an aftermarket turbo
               | that damages your valves, you don't go whining to HN with
               | an article with "Ferrari Engine Instability" in the
               | title, do you?
               | 
               | I don't know what you want Intel to do here. They tell
               | you upfront what the power and clock limits are on the
               | parts. But the market has a three decade history of
               | people pushing the chips a little past their limit for
               | fun and profit, so they "allow" it even if they know it
               | won't work for everything.
        
               | kuschku wrote:
               | These Motherboards are Intel certified. If I get a mod
               | shop to install a ferrari certified part, I expect the
               | part to work.
        
               | ajross wrote:
               | Meh. So you're in the "Intel should take affirmative
               | action to prevent overclocking" camp. And as mentioned
               | the response to that is that they've tried that (on
               | multiple occasions, using multiple techniques) and people
               | freaked out about that too. They can't win, I guess.
        
               | xcv123 wrote:
               | Ferrari does not allow modifications of their cars. If
               | you take it to a mod shop, they will void the warranty
               | and you will be banned from purchasing a new Ferrari.
        
               | rygorous wrote:
               | (Oodle maintainer here.) This issue only occurs on some
               | small fraction of machines, but on those that we've had
               | access to, it reproduces with BIOS defaults and no user-
               | specified overclocking. It turns out several of these
               | mainboards will overclock and set other values out of
               | spec even at BIOS defaults.
               | 
               | I don't have a problem with end users experiencing
               | instability once they manually overclock (that's how it
               | goes), but CPU + mainboard combinations experiencing
               | typical OC symptoms with out-of-the-box settings is just
               | not OK.
               | 
               | This appears to be an arms race between mainboard vendors
               | all going further and further past spec by default
               | because it gives better benchmark and review scores and
               | their competition does it. Intel for their part are
               | themselves also dialing in their parts more aggressively
               | (and, presumably although I don't know for sure, with
               | smaller margins) over time, and they are for sure aware
               | that this is happening, because a) even had they not
               | known already (which they did) they would have learned
               | about this months ago when we first contacted them about
               | this issue, b) technically out of spec or not, as long as
               | it seems to work fine for users and makes their parts
               | look better in reviews, they're not going to complain.
               | 
               | However, it turns out, it does not work fine for at least
               | some small fraction of machines. I have no idea what that
               | percentage is, but it's high enough that googling for say
               | "Intel 13900K crash" yields plenty of relevant results.
               | Some of this will be actual intentional overclockers but,
               | given how boards default to some extent of out-of-spec
               | overclocking enabled, it's unlikely to be all of them.
               | 
               | Meanwhile we (and other SW vendors) are getting a
               | noticeable uptick in crash reports on, specifically,
               | recent K-series Intel CPUs, and it's not something we can
               | sanely work around because the issue manifests as code
               | randomly misbehaving and it's not even when doing
               | anything fancy. The Oodle issue in particular is during
               | LZ77-family decompression, which is to say, all integer
               | arithmetic (not even multiplies, just adds, shifts and
               | logic ops), loads/stores and branches. This is the bare
               | essentials. If it was an issue with say AVX2, we could
               | avoid AVX2 code paths on that family of machines (and
               | preferably figure out what exactly is going wrong so we
               | can come up with a more targeted workaround). But there
               | is no sane plan B for "integer ALU ops, load/stores and
               | branches don't work reliably under load". If we can't
               | rely on that working, there is not enough left for us to
               | work around bugs with!
               | 
               | I realize this all looks like finger-pointing, but this
               | is truly beyond our capacity to work around in a sane way
               | in SW, with what we know so far anyway. Maybe there is a
               | much more specific trigger involved that we could avoid,
               | but if so, we haven't found it yet.
               | 
               | Either way, when it's easy to find end user machines that
               | are crashing at stock settings, things have gone too far
               | and Intel needs to sit down with their HW partners and
               | get everyone (themselves included) to de-escalate.
        
               | johnklos wrote:
               | If you put your foot to the floor in a supercharged car,
               | you're going to eventually have to let it up lest you
               | melt things or you burn all your oil because your rings
               | aren't making contact with the cylinder walls any more.
               | It's an apt metaphor since the same is true of CPUs. You
               | can't run a CPU at 400 watts continuously for more than a
               | handful of seconds at a time.
               | 
               | The problem is that Intel has normalized it so much that
               | all their high end CPUs do this, and apparently do it
               | often. It's not unexpected that they might be too close
               | to the point where things are melting, so to speak.
               | 
               | I'd rather slower and more stable any day - I chose a
               | Ryzen 7900 over a 7900X intentionally - but that isn't
               | what all the marketing out there is trying to sell. The
               | fancy motherboards, the water coolers, the highly clocked
               | memory all account for lots of markup, so that's what's
               | marketed. I'm not a fan.
               | 
               | It is worth noting a distinction between the terms
               | "overclocking" and "turbo clocking". "Overclocking" has
               | traditionally meant running the clock "over" the rating.
               | "Turbo clocking" is now built in to almost every CPU out
               | there. One technically can void your warranty, whereas
               | the other doesn't.
               | 
               | Since we're mostly technical people here, we should use
               | the appropriate term where the context makes that choice
               | more accurate. It's like virus and Trojan - we SHOULD be
               | technically correct, but that doesn't mean highly
               | technical people aren't still calling Trojans viruses now
               | and then.
        
             | cduzz wrote:
             | I think "overclock" implies that the end-user is doing
             | something that's out-of-spec for the thing they're
             | operating.
             | 
             | This "I can run a core at a faster speed" is a documented
             | feature so not really overclocking.
        
               | TylerE wrote:
               | That's literally overclocking. You're clocking it at a
               | rate over the nameplate value. Just because the BIOS is
               | factory-unlocked doesn't really change anything.
        
               | LoganDark wrote:
               | Yeah. Intel advertises the ability to overclock, but that
               | doesn't mean overclocking is in spec. It just means Intel
               | allows you to run it out of spec if you so choose. The
               | spec says you can set the clock multiplier, it doesn't
               | say anything above the stock range will actually be
               | stable.
        
               | TylerE wrote:
               | Plus almost always people are tweaking voltages and such
               | also.
        
               | LoganDark wrote:
               | This article is about automatic / enabled-by-default
               | overclocking, which isn't actually specified by Intel but
               | is done by the motherboard manufacturers anyway. At least
               | the "GAMING GAMING GAMING" oriented ones like MSI and
               | friends.
               | 
               | As an example of motherboard manufacturers going outside
               | specifications, my MSI motherboard has a built-in option
               | to change BCLK, which is the clock reference for the
               | entire PCIe bus. Changing it not only overclocks the CPU,
               | but also the GPU's connection (not the GPU itself), as
               | well as the NVMe SSD.
               | 
               | This was so not-endorsed by Intel that they quickly
               | pushed microcode that shuts the CPU down if it detects
               | BCLK tampering.
               | 
               | In response, MSI added a dropdown that allows you to
               | downgrade the microcode of the CPU.
               | 
               | So yeah. Very not within specifications.
        
           | bee_rider wrote:
           | At least the built in "multicore enhancement" type overclocks
           | that are popular nowadays with motherboard manufacturers.
           | 
           | I wonder if the old style "bump it up and memtest" type
           | overclocking would catch this. Actually, what is the good
           | testing tool nowadays? Does memtest check AVX frequencies?
        
             | wmf wrote:
             | Intel Performance Maximizer is from Intel so I'd hope it
             | has good tests.
        
           | Kon-Peki wrote:
           | But isn't overclocking the entire point of buying the K
           | version of these chips?
        
             | secondcoming wrote:
             | Yes it is but there's more to overclocking than just the
             | CPU. You also need adequate cooling and fine-tuning of
             | parameters I'll never truly understand. There are so many
             | moving parts that you're not guaranteed anything. It seems
             | like the CPUs were actually running at their overclocked
             | speeds, but the rest of the system couldn't keep up.
        
               | szundi wrote:
               | Also might need to raise voltage etc
        
             | rygorous wrote:
             | Definitely not. These are supposed to be higher-quality
             | bins that also ship with higher stock clock rates (both
             | base and boost) and are rated for them.
             | 
             | I don't know how common this is across the whole population
             | of PC buyers, but personally, I have for sure bought
             | K-series parts then not clocked them past their stock
             | settings, trusting that they are rated for it and deeply
             | uninterested in any OCing past that. (I prefer my machines
             | stable, thank you very much.)
        
               | bbarnett wrote:
               | Decades ago I visited a fellow Amiga user's house. He had
               | an overclocked 68060 Apollo board.
               | 
               | He was so happy with the speed. Would not stop telling
               | everyone, and talking about it. Yet as I watched him demo
               | it, it rebooted every minute or so. Most unstable thing
               | ever.
               | 
               | Sure it booted in 2 seconds, and he just went about his
               | merry way, but.. what?! Guy could have still overclocked
               | a little less and had stability, but nope.
               | 
               | Some overclockers are weird.
        
           | zenonu wrote:
           | Intel is already running their CPUs at the red line. We're
           | seeing the margin breaking down as Intel tries to remain
           | competitive. The latest 14900KS can even pull > 400W. It's
           | utter insanity.
        
             | refulgentis wrote:
             | I wish I kept up with them better, I swear every 3 months I
             | see a headline that is "Intel says N nodes in {N-1 duration
             | - 3 months}". I think I just saw 5 nodes in 4 years? And
             | we've had 2 in the last 4? Sigh.
        
         | vondur wrote:
         | It seems like problems occur from different firmware from the
         | various motherboard manufacturers. I have a motherboard with a
         | Ryzen 7950x and it would randomly not boot. I'd have to remove
         | the battery from the system, let it fully reset, and then it
         | would work again. Finally an update to the firmware fixed that
         | bug.
        
         | whoisthemachine wrote:
         | Good, fast, cheap. Choose two.
        
         | paulmd wrote:
         | haha, knew it wouldn't take long for the AMD fanboys to get
         | winding up on how awful this is gonna be.
         | 
         | https://news.ycombinator.com/item?id=39479081
         | 
         | Somehow people think that it's a strawman, but people like
         | parent comment _actually think and post like this_ lol
        
         | rkagerer wrote:
         | Right, when they still knew how to make reliable hardware
         | instead of cramming in features that aren't fully thought out
         | and come with questionable tradeoffs to hit the bleeding edge.
        
       | perryizgr8 wrote:
       | If I ever encounter a CPU bug causing problems in my production
       | code, I will consider my life complete. I will be satisfied that
       | I've practiced my profession to a high degree of completeness.
        
         | dist-epoch wrote:
         | You should go work for Facebook. At their scale they are
         | encountering CPU bugs daily:
         | 
         | > This has resulted in hundreds of CPUs detected for these
         | errors
         | 
         | https://arxiv.org/abs/2102.11245
        
       | ryukoposting wrote:
       | I vaguely recall motherboard vendors ignoring Intel's power
       | recommendations a couple years ago, which was causing weird
       | thermal/performance issues (was it Asus?). I get the impression
       | that's what's happening here, again.
        
       | adamc wrote:
       | While I appreciate their point of view, from a consumer pov this
       | would definitely be a failure of their software, since an
       | implicit requirement is that it has to run on the customer
       | machines. People aren't going to throw away their CPU for this,
       | they are going to return the game (if possible), and certainly
       | express the bad user experience they had with it.
        
         | xcv123 wrote:
         | This is a hardware fault causing other software to fail. Intel
         | and mainboard manufacturers have recommended workarounds.
         | Customers are not stupid and they know who is at fault.
        
           | adamc wrote:
           | I predict you are wrong and there will be returns. It's not
           | really a question of stupidity, but of what options are
           | available to them.
        
             | xcv123 wrote:
             | No, they just follow the provided instructions, go into the
             | BIOS setup, and fix the settings. These are $1k CPUs.
             | Purchased by enthusiasts, not retards. If their CPU is
             | unstable they will want to fix it.
        
       | johnklos wrote:
       | For years I've had this impression that Intel CPUs were, to put
       | it simply, trying too hard. I administer servers for various
       | companies, and some use Intel even though I generally recommend
       | AMD or non-x86.
       | 
       | A pattern I've noticed is that some of the AMD systems I
       | administer have never crashed or panicked. Several are almost ten
       | years old and have had years of continuous uptime. Some have had
       | panics that've been related to failing hardware (bad memory,
       | storage, power supply), but none has become unstable without the
       | underlying cause eventually being discovered.
       | 
       | Intel systems, on the other hand, have had panics that just have
       | had no explanation, have had no underlying hardware failures, and
       | have had no discernible patterns. Multiple systems, running an OS
       | and software that was bit-for-bit identical to what has been
       | running on AMD systems, have panicked. Whereas some of the AMD
       | systems that had bad memory had consumer motherboards with non-
       | ECC memory, the Intel systems have typically been Supermicro or
       | Dell "server" systems with ECC.
       | 
       | In one case two identical Supermicro Xeon D systems with ECC were
       | paired with two identical Steamroller (pre-Ryzen) AMD systems.
       | All systems provided primary and backup NAT, routing,
       | firewalling, DNS, et cetera. The Xeon systems were put in place
       | after the AMD systems because certain people wanted "server
       | grade" hardware, which is understandable, and low power AMD
       | server systems weren't a thing in that time period. Over the
       | course of several years, the Xeon systems had random panics,
       | whereas one of the AMD systems had a failed SSD, but no unplanned
       | or unexplained panic or outage, and the other had never had a
       | panic or unplanned reboot in all the years it was in continuous
       | service.
       | 
       | Had I collected information more deliberately from the very
       | beginning of these side-by-side AMD and Intel installations, I'd
       | have something more than anecdotal, but I'm comfortable calling
       | the conclusion real: multiple generations of Intel systems, even
       | with server hardware and ECC, have issues with random crashes and
       | panics, on the order of perhaps one every year or two. I do not
       | see a similar instability on AMD, though.
       | 
       | With brand new Intel CPUs taking substantially more power than
       | similarly performing AMD CPUs, we have a more literal example of
       | what I think is the underlying cause: Intel is trying way too
       | hard to get every tiny bit of performance out of their CPUs,
       | often to the detriment of the overall balance of the system.
       | Between the not insignificantly higher number of CPU
       | vulnerabilities on Intel due to shortcuts illustrated by the
       | performance losses from enabling mitigations, and the rather
       | shocking power draw of stock Intel CPUs that have turbo boosting
       | enabled, I can't recommend any Intel system for any use where
       | stability matters.
        
         | crote wrote:
         | On the other hand, I'm currently dealing with an AMD system
         | which seems to randomly hard reboot every couple of days, as if
         | someone pressed the power button for 5 seconds.
         | 
         | It could keep running for 70 hours, or it could crash twice in
         | 4 hours. Stress-testing CPU, GPU, memory, and storage doesn't
         | invoke a crash, but it'll crash when all I'm running is a
         | single Firefox tab with HN open.
         | 
         | Maybe I got unlucky, or maybe you got lucky. Who knows, really.
        
           | xcv123 wrote:
           | If it's not the system hardware, could it be power
           | instability? Brownouts will trigger a reset. A UPS with power
           | conditioner will fix that.
        
       | terrelln wrote:
       | We also regularly run into hardware issues with Zstd. Often the
       | decompressor is the first thing to interact with data coming over
       | the network. Or like in this case the decompressor is generally
       | very sensitive to bit-flips, with or without checksumming
       | enabled, so notices other hardware problems more than other
       | processes running on the same host.
       | 
       | One decision that Zstd made was to include only a checksum of the
       | original data. This is sufficient to ensure data integrity. But,
       | it makes it harder to rule out the decompressor as the source of
       | the corruption, because you can't determine if the compressed
       | data is corrupt.
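       | 
       | A minimal Python sketch of that trade-off, not Zstd's actual
       | frame format: zlib from the standard library stands in for
       | Zstd, and pack/unpack and the record fields are hypothetical
       | names. Keeping a digest of the compressed bytes next to a
       | digest of the original lets the reader tell corrupt input
       | apart from a decompressor that misbehaved on intact input.

```python
import hashlib
import zlib


def pack(original: bytes) -> dict:
    """Compress and record digests of both the input and the compressed payload."""
    compressed = zlib.compress(original)
    return {
        "payload": compressed,
        "payload_sha256": hashlib.sha256(compressed).hexdigest(),
        "content_sha256": hashlib.sha256(original).hexdigest(),
    }


def unpack(record: dict) -> bytes:
    """Decompress, reporting which stage a mismatch points at."""
    payload = record["payload"]
    if hashlib.sha256(payload).hexdigest() != record["payload_sha256"]:
        # The compressed bytes changed in storage or in transit.
        raise ValueError("compressed data is corrupt")
    try:
        restored = zlib.decompress(payload)
    except zlib.error as exc:
        # The payload was verified above, so suspect the decompressing host.
        raise RuntimeError("decompressor failed on intact input") from exc
    if hashlib.sha256(restored).hexdigest() != record["content_sha256"]:
        raise RuntimeError("decompressor produced wrong output from intact input")
    return restored
```

       | If unpack() raises ValueError, the compressed bytes were
       | damaged before decompression; a RuntimeError points at the
       | machine doing the decompression instead.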
        
       | FileSorter wrote:
       | I recently had to RMA my i9-13900KS because it was faulty. I was
       | experiencing some of the weirdest behavior I have ever seen on a
       | PC. For example:
       | 
       | 1. Whenever I tried to install Nvidia drivers I would get
       | "7-zip: Data error"
       | 2. A fresh install of Windows would give me an SxS error when
       | trying to launch Edge
       | 3. I could not open the control panel
       | 4. BSOD loop on boot
        
       | dontupvoteme wrote:
       | Decompression failure immediately went to a far grimmer failure
       | mode in my mind..
        
       | colombiunpride2 wrote:
       | I wonder if this is partly related to the LGA1700 frame problems
       | that tend to bend the heat spreader.
       | 
       | There are two aftermarket contact frames that drop the
       | temperature by around 10 degrees Celsius and ensure flat
       | contact with the heat spreader. The stock frame causes the
       | center of the heat spreader to dip.
       | 
       | I wonder if the turbo boost is controlled by a
       | proportional-integral-derivative (PID) controller.
       | 
       | The idea is that the parameters are fine-tuned to slow down
       | processor speed as it heats up but before it overshoots its
       | maximum threshold.
       | 
       | If those PID values are tuned to assume flat heat spreader/heat
       | sink contact, I can see where a bent heat spreader could cause
       | the cpu to overshoot its safe limit and cause errors.
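       | 
       | A toy Python simulation of that speculation. Everything in it
       | is an assumption made for illustration: the one-node thermal
       | model, the gains, the power clamp, and the two thermal
       | resistances are invented, and nothing here reflects how
       | Intel's turbo logic actually works. It only shows that a PID
       | loop whose gains lean on the cooling headroom of good contact
       | can overshoot once the contact gets worse.

```python
def peak_temperature(r_thermal, seconds=30.0, dt=0.01):
    """Run a PID power limiter on a one-node thermal model; return the peak temp (deg C)."""
    ambient, target = 30.0, 90.0   # deg C, assumed
    heat_capacity = 5.0            # J/K, assumed
    p_min, p_max = 20.0, 250.0     # W, assumed power clamp
    kp, ki, kd = 5.0, 20.0, 1.0    # gains, imagined as tuned with flat contact

    temp = ambient
    integral = 0.0
    prev_error = target - temp
    peak = temp
    for _ in range(int(seconds / dt)):
        error = target - temp
        integral += error * dt                  # no anti-windup, on purpose
        derivative = (error - prev_error) / dt
        prev_error = error
        power = kp * error + ki * integral + kd * derivative
        power = max(p_min, min(p_max, power))   # clamp to the allowed power range
        # Heat in, minus heat carried away through the spreader and cooler.
        temp += dt * (power - (temp - ambient) / r_thermal) / heat_capacity
        peak = max(peak, temp)
    return peak


LIMIT = 100.0                              # deg C, assumed safe limit
flat = peak_temperature(r_thermal=0.25)    # K/W, good contact (assumed)
bent = peak_temperature(r_thermal=0.35)    # K/W, worse contact (assumed)
print(f"flat contact peak: {flat:.1f} C, over limit: {flat > LIMIT}")
print(f"bent contact peak: {bent:.1f} C, over limit: {bent > LIMIT}")
```

       | With the flat-contact resistance, full power cannot heat the
       | model past the limit, so the slow integral term is harmless;
       | with the higher resistance the same gains let the temperature
       | run past the limit before the loop backs off.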
        
       | ezekiel68 wrote:
       | Based on the article contents, this doesn't seem to be CPU
       | errata. We already know that overclocking above a certain point
       | will cause OS crashes. This seems to be system instability just
       | below the threshold of crashing. Aggressive power and clock
       | settings manifest as this instability without causing an actual
       | crash.
       | 
       | I don't find this situation much different than needing to dial
       | back BIOS settings when actual crashes are observed.
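       | 
       | One way to look for this kind of below-the-crash-threshold
       | instability is to repeat a deterministic workload and compare
       | results. The Python sketch below is only an illustration of
       | that idea: zlib stands in for the real decompressor, the
       | function names are hypothetical, and a single-threaded script
       | will not load a CPU the way a game or a proper stress test
       | does.

```python
import hashlib
import random
import zlib


def roundtrip_digest(seed: int, size: int = 1 << 14) -> str:
    """Compress and decompress seeded data; return a digest of the result."""
    rng = random.Random(seed)
    # Small alphabet so the data is compressible but not trivial.
    data = bytes(rng.choice(b"abcdefgh") for _ in range(size))
    restored = zlib.decompress(zlib.compress(data, 9))
    if restored != data:
        raise RuntimeError("round trip did not reproduce the input")
    return hashlib.sha256(restored).hexdigest()


def count_mismatches(iterations: int = 20, seed: int = 1234) -> int:
    """Repeat the same work; any digest that differs from the first is suspect."""
    reference = roundtrip_digest(seed)
    return sum(roundtrip_digest(seed) != reference for _ in range(iterations))
```

       | On a healthy machine count_mismatches() should return 0 every
       | time; a nonzero count or a raised error on identical input
       | would point at the hardware or its settings rather than the
       | code.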
        
         | sedatk wrote:
         | Exactly. I remember when I overclocked my 486DX4-100 to 120 MHz,
         | everything would work fine but the floppy drive. It just
         | wouldn't work for whatever reason. Never thought it was a CPU
         | issue, I'd just asked for it.
        
       ___________________________________________________________________
       (page generated 2024-02-23 23:00 UTC)