[HN Gopher] Intel Processor Instability Causing Oodle Decompress...
___________________________________________________________________
Intel Processor Instability Causing Oodle Decompression Failures
Author : firebaze
Score : 329 points
Date : 2024-02-23 09:14 UTC (13 hours ago)
(HTM) web link (www.radgametools.com)
(TXT) w3m dump (www.radgametools.com)
| eqvinox wrote:
| The article is a bit unclear on whether this happens with
| standard/default settings, though that's probably because they
| don't know themselves. The workarounds -- changing things from
| "Auto" to "disabled", or even increasing voltage settings --
| certainly suggest this also applies with defaults, and isn't
| some overclocking/tuning side effect.
|
| If that is the case... ouch.
| yetihehe wrote:
| It seems like it happens only on some select CPU specimens
| (apparently it works after replacing the CPU with another one
| of the same model), so probably a small binning failure?
| eqvinox wrote:
| I'm not sure I would call a binning failure "small" -- mostly
| because I can't remember this ever happening before. Binning
| is a core aspect of managing yields. And it seems that this
| is breaking for a sufficient number of people to have a game
| tooling vendor investigate. How many bug reports would it
| take to get them into action?
| rygorous wrote:
| Person who actually did the investigation here. It took
| exactly one bug report.
|
| RAD/Epic Games Tools is a small B2B company. Oodle has one
| person working full-time on it, namely me, and I do coding,
| build/release engineering, docs, tech support, the works.
| There's no multiple support tiers or anything like that,
| all issues go straight into my inbox. Oodle Data in
| particular is a lossless data compression API and many
| customers use two entry points total, "compress" and
| "decompress".
|
| I get a single-digit number of support requests in any
| given month, most of which is actually covered in the docs
| and takes me all of 5 minutes to resolve to the customer's
| satisfaction. The 3-4 actual bug reports I get in any given
| year, I will investigate.
| wmf wrote:
| This is the usual silicon lottery. Every chip will work at
| stock settings. Some will be stable when overclocked and some
| won't.
| scrlk wrote:
| This sounds like motherboard manufacturers pushing aggressive
| OOTB performance settings, likely in excess of the Intel spec.
| eqvinox wrote:
| That's... an assumption. At least 3 motherboard vendors are
| affected, and going by the Gigabyte/MSI workarounds at the
| end of the article, it looks like things need to be adjusted
| away from Intel defaults.
|
| ...it'll need a statement from Intel for some clarity on
| this...
| scrlk wrote:
| > "Intel's default maximum TDP for the 13900K is 253 watts,
| though it can easily consume 300 watts or more when given a
| higher power limit. In our testing, manually setting the
| power limit to 275-300 watts and the amperage limit to
| 350A, proved to be perfectly stable for our 13900K. That
| required going into the advanced CPU settings in the BIOS
| to change the PL1/PL2 limits -- called short and long
| duration power limits in our particular case. The
| motherboard's default "Auto" power and current limits
| meanwhile created instability issues -- which correspond to
| a power limit of 4,096 watts and 4,096 amps." [0]
|
| The motherboard manufacturers are setting default/auto
| power and current limits that are way outside of Intel's
| specs (253 W, 307 A) [1].
|
| [0] https://www.tomshardware.com/pc-components/cpus/is-
| your-inte...
|
| [1] https://www.intel.com/content/www/us/en/content-
| details/7438... (see pg. 98 and 184, the 13900K/14900K is
| 8P + 16E 125 W)
| eqvinox wrote:
| Your [0] also says:
|
| > It's not exactly clear why the 13900K suffers from
| these instability problems, and how exactly downclocking,
| lowering the power/current limits, and undervolting
| prevent further crashes. Clearly, something is going
| wrong with some CPUs. Are they "defective" or merely not
| capable of running the out of spec settings used by many
| motherboards?
| scrlk wrote:
| > Are they "defective" or merely not capable of running
| the out of spec settings used by many motherboards?
|
| I'd wager good money on the latter. Why would Intel
| validate their CPUs against power and current limits that
| are outside of spec? The users reporting issues probably
| have CPUs that just barely made it into the performance
| envelope to be binned as a 13900K, so running out-of-spec
| settings on these weaker chips results in instability.
|
| It's cases like this where I wish Intel didn't exit the
| motherboard space; their boards were known to be reliable,
| but typically at the cost of a more limited feature set.
| RetroTechie wrote:
| > I'd wager good money on the latter.
|
| Don't guess, measure! The proper action here would be to
| change BIOS settings from their default / "auto" values
| to per-Intel-spec safe ones (same for RAM), on systems
| with known-good power supplies, CPU cooling, software
| installs, etc. Then one of the following will happen:
|
| a) BIOS ignores user settings & problem persists.
|
| b) BIOS applies user settings & problem goes away.
|
| c) BIOS applies user settings but problem persists.
|
| Cases a & b count as "faulty BIOS" (motherboard
| manufacturer caused). Case c counts as "faulty CPU", and a
| replacement CPU may or may not fix that.
|
| No need to guess. Just do the legwork on systems where the
| problem occurs & power supply, RAM, CPU cooling & OS
| install can be ruled out. Sadly, there are no doubt many
| systems out there where that last condition doesn't hold.
| vdaea wrote:
| I have a 13900K. The default BIOS settings set a maximum
| wattage of 4096W (!!!) that makes Prime95 fail. If I
| change the settings back to 253W, what Intel says is the
| maximum wattage, Prime95 stops failing.
|
| Still, I don't know if I should RMA. I got the K version
| because I intended to overclock in the future. And all of
| this sounds like I won't be able to. I think increasing
| the voltage a little bit makes the system more stable. I
| have to play with it. (Really, if someone can say whether
| I should RMA or not, I would appreciate some input)
|
| Edit: decided to RMA. I have no patience for a CPU that
| cost me +600EUR
| eqvinox wrote:
| > I'd wager good money on the latter.
|
| I don't disagree, but I'm cautious about making a call
| with the current information available. For example: yes,
| a "4096W / 4096A" power limit _sounds_ odd, but it's not
| an automatic conclusion that this limit is intended to
| protect the CPU. Instead, it's a knob for sizing a system
| around a particular PSU -- it would be odd if that were
| also overloaded to protect the chip itself. Maybe it is,
| maybe it isn't.
|
| It's also very much possible that the M/B vendors altered
| other defaults, but... I don't see
| information/confirmation on that yet. It used to be that
| at least one of the settings was the original CPU vendor
| default, but the last time I looked at these things was >5
| years ago :(.
|
| > It's cases like this where I wish Intel didn't exit the
| motherboard space,
|
| Full ACK.
| sirn wrote:
| Modern CPUs have many limits that protect the CPU and
| regulate the clock behavior. For example: clock limit,
| current limit (IccMax, 307A for the 13900K), long power
| limit (PL1, 125W), short power limit (PL2, 253W),
| transient peak limit (PL3), overcurrent limit (PL4),
| thermal limit (TjMax, 100C), Fast Throttle threshold (aka
| Per-Core Thermal Limit, 107C), etc. There is also a
| Voltage/Frequency curve (V/F curve) mapping how much
| voltage is needed to drive a certain frequency.
|
| The Intel 13900K has a fused V/F curve up to its maximum
| Turbo Boost 2.0 frequency (5.5 GHz) on all cores, and two
| cores at its Thermal Velocity Boost (aka favored cores,
| 5.8 GHz). How much to boost depends on the By-Core Turbo
| Ratio. For a stock 13900K, this is 5.8 GHz for 2 cores and
| 5.5 GHz for up to 8 cores, with E-cores capped at 4.3 GHz.
|
| As you may have noticed, the CPU has a very coarse Turbo
| Ratio beyond the first 2 cores. This is to allow the
| clock to be regulated by one of the limits rather than a
| fixed number. In reality, 253W PL2 can sustain around 5.1
| GHz all P-cores, and after 56 seconds it will switch to
| 125W PL1 which should give it around 4.7 GHz-ish (IIRC).
|
| This is why, when a motherboard manufacturer decides to
| set PL1=PL2=4096 without touching other limits, it
| results in higher numbers in benchmarks. The CPU will
| consume as much power as it can to boost to 5.5 GHz,
| until it hits one of the other limits (usually the 100C
| TjMax). This is how we ended up in this mess in the
| consumer market.
|
| Xeon, on the other hand, has a very conservative and
| granular Turbo Ratio. My Xeon w9-3495X does have a fused
| All-Core Boost that does not exceed PL1 (56 cores at 2.9
| GHz at 350W), which means PL2 effectively only exists for
| AVX-512/AVX workloads.
|
| (Side note: I've always thought PL1=PL2=4096W is dumb
| since the performance gain is marginal at best, and I
| always set PL1=PL2=253W in all machines that I assemble. I
| think even PL1=PL2=125W makes sense for most usage. I do
| overclock my Xeon to sustain PL1=PL2=420W though; this is
| around 3.6 GHz, which is enough to make it faster than a
| 64-core Threadripper 5995WX.)
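|
| To make the PL1/PL2 numbers above concrete: on Linux, the
| firmware-applied package power limits are exposed through the
| intel_rapl powercap driver. A minimal sketch of reading them
| back, assuming that driver is loaded (zone layout varies per
| system, and reading may require root):
|
|     # print the PL1/PL2 package power limits as seen by the OS
|     from pathlib import Path
|
|     BASE = Path("/sys/class/powercap")
|
|     for pkg in sorted(BASE.glob("intel-rapl:*")):
|         if pkg.name.count(":") != 1:
|             continue  # skip sub-zones (core, uncore, dram)
|         zone = (pkg / "name").read_text().strip()
|         for i in (0, 1):
|             name_f = pkg / f"constraint_{i}_name"
|             limit_f = pkg / f"constraint_{i}_power_limit_uw"
|             if not name_f.exists():
|                 continue
|             # "long_term" corresponds to PL1, "short_term" to PL2
|             kind = name_f.read_text().strip()
|             watts = int(limit_f.read_text()) / 1e6
|             print(f"{zone} {kind}: {watts:.0f} W")
|
| On a board that ships the "unlimited" defaults described above,
| both constraints tend to read back as roughly 4096 W rather
| than the 125 W / 253 W figures from Intel's datasheet.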
| mschuster91 wrote:
| > amperage limit to 350A
|
| Jesus. By German electrical code, you need a 70 mm2
| cross-section of copper to transfer that kind of current
| without the cable heating up to a point that it endangers
| the insulation. How do mainboard manufacturers supply
| that kind of current without resistive loss from the
| traces frying everything?
| bonzini wrote:
| Those electrical code cross sections are for 350A at
| 230V, corresponding to about 80 kW (400V is the same as
| it's actually three 230V wires).
|
| Processors operate at about 1V. At 300W it's enough to
| use a much smaller cross section, which is split across
| many traces.
| coryrc wrote:
| That's not how I^2R losses work. Voltage is not relevant.
| planede wrote:
| I agree. However voltage is relevant for insulation,
| which also affects how heat can dissipate for the wire,
| and might also be relevant for a failing wire, when
| higher and higher voltage can build up at the point of
| failure (not sure if it's a common engineering
| consideration outside of fuses, which are designed to
| fail).
| coryrc wrote:
| That code is for round wire (minimal surface area per
| volume) that can be placed inside insulation in walls.
|
| This 350A is flat conductors (maximal surface area thus
| heat dissipation) and very short (not that much power to
| dissipate so the things it connects to have a significant
| effect on heat dissipation).
| michaelt wrote:
| The traces are extremely short. Look at a modern
| motherboard and you'll find a bank of capacitors and
| regulators about 2cm away from the CPU socket.
|
| If you've got 4 layers of 2oz copper, and you make the
| positive and negative traces 10mm wide, you'll only be
| dissipating 28 watts when the CPU is dissipating 300
| watts. And most motherboards have more than 4 layers and
| have space for more than 10mm of power trace width. And
| there's a bunch of forced air cooling, due to that 300
| watts of heat the CPU is producing.
|
| Electrical code doesn't let buildings use cables that
| dissipate 28 watts for 2cm of distance because it would
| be extremely problematic if your 3m long EV charge cable
| dissipated 4200 watts.
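|
| As a rough sanity check of those numbers, one back-of-the-
| envelope I^2*R estimate looks like this (assuming ~70 um per
| 2oz copper layer and copper resistivity of 1.68e-8 ohm*m; the
| geometry is the example above, not a real board layout):
|
|     rho = 1.68e-8       # resistivity of copper, ohm*m
|     thickness = 70e-6   # ~2oz copper per layer, m
|     width = 10e-3       # 10 mm wide trace, m
|     length = 0.02       # ~2 cm from the VRMs to the socket, m
|     layers = 4
|
|     r_layer = rho * length / (width * thickness)  # one layer
|     r_supply = r_layer / layers     # layers in parallel
|     r_total = 2 * r_supply          # supply + return path
|
|     for amps in (300, 350):
|         print(f"{amps} A -> ~{amps**2 * r_total:.0f} W in traces")
|
| That lands in the same ~20-30 W ballpark: tolerable over 2 cm
| with forced-air cooling, absurd over metres of round cable.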
| crote wrote:
| Bursty current spikes, short and fat traces, using the
| motherboard as a heat sink, active cooling, and allowing
| the temperature to rise quite a bit. If you look at
| thermal camera videos[0], it's pretty clear where all the
| heat is going (although a significant part of that is
| coming from the voltage regulators).
|
| On the other hand, your national electrical code is going
| to assume you're running that 350A cable at peak capacity
| 24/7, right next to other similarly-loaded cables,
| stuffed in an isolated wall, for very long runs - and it
| _still_ has to remain at acceptable temperatures during a
| hot summer day.
|
| [0]: https://www.youtube.com/watch?v=YyDMlXEZqb0
| michaelt wrote:
| _> The motherboard manufacturers are setting default
| /auto power and current limits that are way outside of
| Intel's specs_
|
| The CPU only draws as much power as it needs, though?
|
| I mean, if you plug a 20 watt phone into a 60 watt USB-C
| power supply, or a 60 watt laptop into a 100 watt USB-C
| power supply the device doesn't get overloaded with
| power. It draws no more current than it needs.
|
| The motherboard's power limits should state the amount of
| power the PCB traces and buck regulators are rated to
| provide to the socket - and if that's more than the
| processor needs that's good, as it avoids throttling.
| rfoo wrote:
| The problem is these processors are unstable if they are
| not properly throttled.
|
| Of course users, especially enthusiast motherboard
| consumers, hate throttling, hence the default.
| scrlk wrote:
| * User is running a demanding application (e.g. game).
|
| * CPU clock speed increases (turbo boost), as long as the
| CPU isn't hitting: 1) Tj_MAX (max temp before thermal
| throttling kicks in); 2) the power and current limits
| specified by the motherboard (in this case, effectively
| disabled by the out of spec settings).
|
| * Weaker chips will require more power to hit or maintain
| a given turbo clock speed: with the power and current
| limits disabled, the CPU will attempt to draw out of spec
| power & current, causing issues for the on die fully-
| integrated voltage regulator (noting that there's also
| performance/quality variance for the FIVR), resulting in
| the user experiencing instability.
| wongarsu wrote:
| Drawing peak power far in excess of the TDP is what all
| Intel processors have been designed to do for many years
| now.
|
| Some consider it cheating the benchmarks, but the
| justification is that TDP is the _Thermal_ Design Power.
| It's about the cooling system you need, not the power
| delivery. If you make reasonable assumptions about the
| thermal inertia of the cooling system you can Turbo Boost
| at higher power and hope the workload is over before you
| are forced to throttle down again.
|
| Any mainboard that sets power limits to the TDP would be
| considered wrong by both the community and Intel. This
| looks like a solid indication that the issue is with
| Intel.
| lupusreal wrote:
| > _At least 3 motherboard vendors are affected_
|
| Boss says, _" Do the thing."_ Engineer says, _" The thing
| is out of spec!"_ Boss says, _" Competitor is doing the
| thing already and it works."_ Engineer does the thing.
| eqvinox wrote:
| Yes, but also no. The motherboard market is "timing-
| competitive", your product needs to be ready when the CPU
| launches, especially for the kind of flagship CPU that
| this specific issue is about. You can't wait and see what
| the competitors are doing.
| lupusreal wrote:
| Fair point. Maybe _" This sort of thing worked fine in
| the past."_
| crote wrote:
| Or perhaps they're all copying from the same reference
| design?
| paulmd wrote:
| all three motherboard vendors enabling some out-of-spec
| defaults wouldn't actually be surprising, though?
|
| people forget, blowing up AM5 CPUs wasn't just an Asus
| thing... they were just the most ham-handed with the
| voltages. _Everyone_ was operating out of spec; there were
| chips that blew up on MSI and Gigabyte boards, and it
| wasn't just X3D either.
|
| Intel is no different - nobody enforces the power limit out
| of the box, and XMP will happily punch voltages up to
| levels that result in eventual degradation/electromigration
| of processors (on the order of years). Every enthusiast
| knows that CPU failures are "rare" and yet either has had
| some, or knows someone who's had some in their immediate
| circles. Because XMP actually has caused non-trivial
| degradation even on most DDR4 platforms.
|
| In fact it's entirely possible that this is an
| electromigration issue right here too - notice how this
| affects some 13700Ks and 13900Ks too? Those chips have been
| run for a year or two now. And if the processors were
| marginal to begin with, and operated at out-of-spec
| voltages (intentionally or not)... they could be starting
| to wear out a little bit under the heaviest loads. Or the
| memory controllers could be starting to lose stability at
| the highest clocks (under the heaviest loads). That's a
| thing that's not uncommon on 7nm and 5nm tier nodes.
| rygorous wrote:
| This is blowing up now, but the first report of this kind
| of issue that reached me (I'm the current Oodle
| maintainer) was in spring of last year. We've been trying
| to track it down (and been in contact with Intel) since
| then. The page linked in the OP has been up since
| December.
|
| Epic Games Tools is B2B and we don't generally get bug
| reports from end users (although later last year, we did
| have 2 end users write to us directly because of this
| problem - first time this has happened for Oodle that I
| can think of, and I've been working on this project since
| 2015). Point being, we're normally at least one level
| removed from end user bug reports, so add at least a few
| weeks while our customers get bug reports from end users
| but haven't seen enough of them yet to get in touch with
| us (this is a rare failure that only affects a small
| fraction of machines).
|
| 13900Ks have been out since late Oct 2022. It's possible
| that this doesn't show up on parts right out of the box
| and takes a few months. It's equally plausible that it's
| been happening for some people for as long as they've had
| those CPUs, and the first such customers just bought
| their new machines late 2022, maybe reported a bug around
| the holidays/EOY that nobody looked at until January, and
| then it took another 2-3 months for 3-4 other similar
| crashes to show up that ultimately resulted in this case
| getting escalated to us.
| rygorous wrote:
| (I'm Oodle maintainer and did most of this investigation.)
|
| For the majority of systems "in the wild", I don't know. We had
| two people with affected machines contact us and consent to do
| some testing for us, and in both cases the issue still
| reproduced after resetting the BIOS settings to defaults.
| yetihehe wrote:
| TL;DR: some motherboards by default overclock too much on some
| intel processors, causing instability.
| worewood wrote:
| It's MCE all over again
| eqvinox wrote:
| That's not my interpretation. Cf the following:
|
| > For MSI:
|
| > Solution A): In BIOS, select "OC", select "CPU Core Voltage
| Mode", select "Offset Mode", select "+(By PWM)", adjust the
| voltage until the system is stable, recommend not to exceed
| 0.025V for a single increase.
|
| This really sounds like the Intel defaults are broken too.
| yetihehe wrote:
| Yes, that or insufficient quality checks, meaning some units
| will fail, some will work. Apparently it was only a subset of
| each model failing.
| haunter wrote:
| >default overclock too much
|
| Per the article, MSI literally suggests OC'ing the CPU to fix the
| problem.
| xcv123 wrote:
| Article recommends disabling overclocking. The MSI
| recommendation is only to increase voltage.
| londons_explore wrote:
| If Oodle has control of this code, the logical thing for them to
| do when they detect a decompression checksum failure is to re-
| decompress the same data (perhaps single-threaded rather than
| multithreaded).
|
| Sure, the user has a broken CPU, but if you can work around it
| and still let the user play their games, you should.
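|
| For illustration, a minimal sketch of the kind of retry wrapper
| being suggested; decompress() and ChecksumError are hypothetical
| stand-ins, not Oodle's actual API (the replies below explain why
| a retry alone isn't a real fix):
|
|     def decompress_with_retry(blob, attempts=3):
|         # retry when the decoder reports that its output failed
|         # validation (hypothetical names, for illustration only)
|         last_err = None
|         for _ in range(attempts):
|             try:
|                 return decompress(blob)
|             except ChecksumError as err:
|                 last_err = err
|         raise last_err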
| yetihehe wrote:
| Yes, but then the processor will fail at another task during
| the game and corrupt some other memory. The only solution for
| an unstable processor is to make it stable or replace it.
| mike_hock wrote:
| Yes. Props to Oodle for not passing on the hot potato but
| trying to get the root cause fixed. This hack would have been
| the easy way out for them so their product doesn't get
| blamed.
| lifthrasiir wrote:
| As noted in the linked page, this issue would affect any heavy
| use of the CPU. Oodle happens to be well optimized, so it hits
| the issue earlier than most other applications, but nothing can
| really be trusted at that point. There is a reason they
| recommend disabling overclocking if possible: such issues are
| generally linked to instability caused by excessive
| overclocking.
| barrkel wrote:
| Retry is not an unusual response to unreliable hardware, and
| all hardware is ultimately unreliable.
|
| Software running at scale in the cloud is written to be
| resilient to errors of this nature; jobs are cattle, if jobs
| get stuck or fail they are retried, and sometimes duplicate
| jobs are started concurrently to finish the entire batch
| earlier.
| lifthrasiir wrote:
| Cloud machines have a way better guarantee about such
| errors though. You will eventually see some errors at
| scale, but that error rate can be reliably quantified and
| handled accordingly.
|
| Consumer machines are comparably wild. Remember that this
| issue was mainly spotted from Unreal error messages. Some
| users do too much overclocking without enough testing, which
| _will_ eventually harm the hardware anyway. Some happen to
| live in places where single-event upsets are more frequent
| (for example, high altitude or more radioactive bedrock).
| Some have an insufficient power supply that causes erratic
| behavior only under heavy load. All those conditions _can_ be
| handled in principle, but doing so is much harder in
| practice. So giving up is much more reasonable in this
| context.
| cesaref wrote:
| quote: However, this problem does not only affect Oodle, and
| machines that suffer from this instability will also exhibit
| failures in standard benchmark and stress test programs.
|
| It sounds like a hardware issue; I'm guessing over-aggressive
| memory/CPU tuning, an underpowered PSU triggering odd behaviour,
| etc. The fact that replacing the processor makes the problem go
| away does not in itself point to the processor as the issue -
| you may find that changing the memory also 'fixes' the problem.
| crest wrote:
| I disagree. This code is just lucky enough to be able to detect
| the data corruption, but without a very deep understanding of
| what happened, the whole system state afterward has to be
| assumed corrupted. You can retry until the code doesn't
| _detect_ data corruption, but you have to assume other state is
| also corrupted. The right thing would be to scream loudly at
| the user that their system is unstable and to expect (future)
| data corruption unless the hardware + firmware (or its
| configuration) is fixed.
|
| Sure it's unpleasant to be the messenger of bad news, but the
| alternatives are far worse unless the system is just a
| dedicated game console without any background processes (which
| isn't how those CPUs are used).
| BlueTemplar wrote:
| Yeah, I had a RAM issue that might or might not have involved
| turning on XMP, which eventually resulted in several RAM
| sticks with errors (BadRAM is amazing BTW, sadly *nix-only),
| and worse, corrupted storage partitions!
|
| Was a real pain to deal with the fallout...
| aChattuio wrote:
| Needs to be fixed by microcode, BIOS update or recall from
| Intel and partners
| flohofwoe wrote:
| That same hardware bug will also result in other corruption in
| other places where it's not detected, which may then spiral
| into much more catastrophic behaviour. Oodle is just special in
| that it detects the corruption and throws an error (which is
| the right thing to do in this situation IMHO).
|
| The ball is in Intel's court, such faulty CPUs should never
| have made it out into the wild.
| mnw21cam wrote:
| I'll add to the chorus of other responses that the whole
| computer is generally unreliable.
|
| However, the really useful thing that this software _could_ do
| is make the error message much better - explaining what is
| likely causing the decompression to fail, and advising that
| the computer should be fixed.
| zX41ZdbW wrote:
| We have the same check for silent data corruption in
| ClickHouse. After it detects a checksum mismatch, it also tries
| to check if it is caused by a single bit flip. If it is, we can
| provide a more precise diagnostic for a user about possible
| causes:
| https://github.com/ClickHouse/ClickHouse/issues?q=is%253Aiss...
|
| Then the natural question arises: if we detect that it is a
| single bit flip, should we "un-flip" that bit, fix the data,
| and continue? The answer is: no. These types of errors should
| be explicit. They successfully help to detect broken RAM and
| broken network devices, that have to be replaced. However, the
| error is fixed automatically anyway by downloading the data
| from a replica.
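|
| As an illustration of the single-bit-flip diagnosis described
| above, a brute-force sketch, using CRC32 as a stand-in for
| ClickHouse's actual checksum format:
|
|     import zlib
|
|     def single_bit_flip_position(data, expected_crc):
|         """Return the bit index whose flip makes the checksum
|         match, or None if no single flip explains the mismatch."""
|         buf = bytearray(data)
|         for byte_i in range(len(buf)):
|             orig = buf[byte_i]
|             for bit in range(8):
|                 buf[byte_i] = orig ^ (1 << bit)
|                 if zlib.crc32(buf) == expected_crc:
|                     return byte_i * 8 + bit
|             buf[byte_i] = orig
|         return None
|
| As the comment says, even when such a position is found, the
| right response is to report it and refetch from a replica, not
| to silently "un-flip" the bit and continue.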
| rygorous wrote:
| (I'm the person who did most of the investigation.)
|
| A relatively major realization during the investigation was
| that a different mystery bug that also seemed to be affecting
| many Unreal Engine games, namely a spurious "out of video
| memory" error reported by the graphics driver, seemed to be
| occurring not just on similar hardware, but in fact the exact
| same machines.
|
| For a public example, if you google for "gamerevolution the
| finals crash on launch" and "gamerevolution the finals out of
| video memory", you'll find a pair of articles describing
| different errors, one resulting from an Oodle decompression
| error, and one from the graphics driver spuriously reporting
| out-of-memory errors, both posted on the same day with the same
| suggested fix (lower P-core max clock multiplier).
|
| That's the problem right there in a nutshell. It's not just
| Oodle detecting spurious errors during its validation. Other
| code on the same machine is glitching too. And "just try
| repeating" is not a great fix because we can't trust the
| "should we repeat?" check any more on that machine than we can
| trust any of the other consistency checks that we already know
| are spuriously failing at a high rate.
|
| Many known HW issues you can work around in software just fine,
| but frequent spurious CPU errors don't fall into that category.
| lifthrasiir wrote:
| This page doesn't seem to be linked from any other public page,
| so I think it was a response to unwanted complaints from users
| who tried to track the "oodle" thing in the error log---like
| SQLite back in 2006 [1].
|
| [1] https://news.ycombinator.com/item?id=36302805
| eqvinox wrote:
| It's linked from https://www.radgametools.com/tech.htm (click
| "support" at the top, look next to "Oodle" logo - "Note: If you
| are having trouble with an Intel 13900K or 14900K CPU, please
| [[read this page]].")
| lifthrasiir wrote:
| Ooh, thank you! I looked so long at the Oodle section and
| skimmed other sections as well (even searched for the
| `oodleintel.htm` link in their source codes), but somehow
| missed that...
| atesti wrote:
| There are some pages that are not linked, wondered what
| happened to these products
|
| https://www.radgametools.com/granny.html
| https://www.radgametools.com/iggy.htm
| https://www.radgametools.com/milesperf.htm
| rygorous wrote:
| Granny, Iggy and Miles are all discontinued as stand-alone
| products. We're still providing support to existing customers
| but not selling any new licenses.
| pixelpoet wrote:
| While we've got you, any chance you'll attend another
| demoparty here in Germany? :)
|
| Big thanks for your awesome blog, learnt much from it over
| the years.
| rygorous wrote:
| Thanks!
|
| Chance, sure, it's just a matter of logistics. Revision
| is a bit tricky since it's usually shortly after GDC, a
| very busy time in the game engine/middleware space I work
| in, so not usually when I feel up to a pair of
| international flights. :) Best odds are for something
| between Christmas and New Year's Eve since that's when
| I'm usually in Germany visiting family and friends
| anyway.
| imdsm wrote:
| 1994 all over again!
| cwillu wrote:
| FDIV was a bug in the logical design; this is over-aggressive
| clock tuning. I fail to see any resemblance whatsoever beyond
| Intel being inside.
| crest wrote:
| It's no longer black and white like the FDIV bug, but if the
| default configuration leads to data corruption in heavy SIMD
| workloads... sure you can reduce clock speed or increase
| voltage until it works, but unless the mainboards violate the
| specs this is at least partly an Intel CPU flaw leading to
| data corruption.
| mhio wrote:
| This sounds familiar... ye olde pentium III 1.13 GHz
|
| https://www.tomshardware.com/reviews/intel-admits-problems-p...
| ManuelKiessling wrote:
| Unrelated to the actual topic, but kudos to the Tom's Hardware
| site for serving a 24-year-old web posting flawlessly.
| franzb wrote:
| Reminds me of this saga I went through as an early adopter of AMD
| Threadripper 3970X:
|
| https://forum.level1techs.com/t/amd-threadripper-3970x-under...
|
| HN discussion: https://news.ycombinator.com/item?id=22382946
|
| Ended up investigating the issue with AMD for several months, was
| generously compensated by AMD for all the troubles (sending
| motherboards and CPUs back and forth, a real PITA), but the
| outcome is that I've been running since then with a custom BIOS
| image provided by AMD. I think at the end the fault was on
| Gigabyte's side.
| rwmj wrote:
| Reminded me of the Intel Skylake bug found by the OCaml
| compiler developers: https://tech.ahrefs.com/skylake-bug-a-
| detective-story-ab1ad2...
| rkagerer wrote:
| Holy cow I had no idea CPU vendors would do this for you.
| devmor wrote:
| When you're not only helping them debug their own hardware
| but are also spending money on their ridiculously overpriced
| HEDT platform, it probably makes them want to keep you happy.
| zitterbewegung wrote:
| That is true and also lots of people use OCaml
| zare_st wrote:
| Supermicro gave us the same kind of assistance. Their new
| bifurcation feature did not work correctly. Without it, an
| enterprise telecommunications peripheral that costs 10x more
| than a 4-socket Xeon motherboard couldn't run at nominal
| speed, and it was run on real lines, not test data.
|
| They sent us custom BIOSes until it got stabilized and said
| they'd put the patch in the following BIOS releases.
|
| The thing is, neither Intel nor AMD nor Supermicro can test
| edge cases at max usage in niche environments without paying
| money, but they would really love to be able to credibly claim
| they can be integrated into such solutions. If Intel wants to
| test stuff in space for free they have to cooperate with NASA;
| the alternative is an in-house launch.
| deepsun wrote:
| NASA has super-elaborate testbeds and simulators. Maybe
| producers can provide some format/interfaces/simulators for
| users, users would write test-cases for it, and give back
| to providers to run in-house.
|
| If users pay seven figures+ it might make sense.
| vdaea wrote:
| I have a 13900K and I am affected. Out-of-the-box BIOS settings
| cause my CPU to fail Prime95, and it's always the same CPU cores
| failing. Lowering the power limit slightly makes it stable. I
| intended to improve the CPU cooling, change the power limit back
| to the default, and RMA the CPU if the problems continued, but
| now I'm not so sure that the BIOS isn't pushing it beyond the
| operating limits.
| Havoc wrote:
| Same CPUs as the Unity engine (or was it Unreal?) issues.
|
| Not a good look, but at least it's fixable with BIOS tweaks rather
| than being a permanent silicon flaw.
| zvmaz wrote:
| Does it mean that a formally verified piece of software like seL4
| can still fail because of a potential "bug" in the hardware?
| flumpcakes wrote:
| I would assume that _any_ software, formally verified or not,
| could fail due to a hardware problem. A cosmic ray could flip a
| bit in a CPU register. The chances of that happening, and of it
| affecting anything in any meaningful way, are probably
| astronomically low. We probably have thousands of hardware
| failures every day and don't notice them. This is why I think
| rust in a kernel is probably a bad idea if it doesn't change
| from the default 'panic on error'.
| hmottestad wrote:
| I would assume that software can always fail in the event of a
| bug in the hardware. That's why systems that are really
| redundant, for instance flight control computers, have several
| computers that have to form a consensus of sorts.
| eqvinox wrote:
| It doesn't even need a bug in the hardware; cosmic rays or
| alpha particles can also cause the same type of issue. For
| those, making systems redundant is indeed a good solution.
|
| For the situation of an actual (_consistent_) hardware
| _bug_, redundancy wouldn't help... the redundant system
| would have the same bug. Redundancy only helps for random-
| style issues. (Which, to be fair, the one we're talking about
| here seems to be.)
| davrosthedalek wrote:
| That's why some redundant systems use alternative
| implementations for the parallel paths. Less likely that a
| hardware bug will manifest the same way in all
| implementations.
| rygorous wrote:
| Absolutely, yes.
|
| It can also misbehave without any hardware bugs due to
| glitching. Rates of incidence of this must be quite low or that
| would be considered a HW bug, but it's never zero. Run code for
| enough hours on enough machines collecting stack traces or core
| dumps on crashes and you will notice that there's a low base
| rate of failures that make absolutely no sense. (E.g. a null
| pointer dereference literally right after a successful non-null
| pointer check 2 instructions above it in the disassembly.)
|
| You will also notice that many machines in a big fleet that log
| such errors do so exactly once and never again, but some
| reoccur several times and have a noticeably elevated failure
| rate even though they're running the exact same code as
| everyone else. This too is normal. These machines are, due to
| manufacturing variation on the CPU, RAM, or whatever, much
| glitchier than the baseline. Once you've identified such a
| machine, you will want to replace it before it causes any
| persistent data corruption, not just transient crashes or
| glitches.
| enraf wrote:
| I got one of the faulty 13900Ks; at least in my case I can confirm
| that the fault appeared using the default settings for PL1/PL2.
|
| I was doing reinforcement learning on that system and it was
| always crashing. I spent quite a bit of time trying to find the
| problem; swapping the CPU for a 13700KF I was using in another PC
| solved it.
|
| So I contacted Intel to start the RMA process. Intel said that the
| MSI motherboard I was using doesn't support Linux; I emailed them
| the official Intel GitHub repo with the microcode that enables
| the support, and they switched agents at that point. It was clear
| to me then that Intel was trying their best to avoid the RMA.
| Luckily I live in Europe, so I contacted my local consumer
| protection agency and did the RMA through them. In the meantime I
| saw a good offer for a 7950X + motherboard at an online retailer,
| bought it, and when the RMA'd 13900K arrived I sold it and my old
| motherboard on the second-hand market.
|
| Not buying Intel ever again. I was using Intel because they
| sponsor some projects in DS, but damn.
| hopfenspergerj wrote:
| I've had instability with my 7700k since I bought it, and 16
| months of bios updates haven't helped. Maybe this latest
| generation of processors just has more trouble than older,
| simpler designs.
| acdha wrote:
| Intel has been struggling with CPU performance for a decade,
| and has been trying to regain their position in absolute
| performance and performance/{price,watt} comparisons. I think
| that means they're being less conservative than they used to
| be on the hardware margins and also that their teams are
| likely demoralized, too.
| smolder wrote:
| Possibly. I would start swapping parts around at that point.
| Different memory, different CPU, or different motherboard.
| Just 1 more anecdote, but my r7-7700x has been a dream (won
| the silicon lottery). It runs at the maximum undervolt & RAM
| at 6000 with no stability problems.
| mattgreenrocks wrote:
| I can't say I'm surprised by this at all.
|
| I bought my 4790k's ASUS TUF board awhile back because I wanted
| something basic enough and wasn't interested in overclocking or
| tweaking. The BIOS had other ideas. I had to manually configure a
| lot more things just to avoid overclocking, including setting RAM
| timing and going through each BIOS setting to ensure it wasn't
| overclocking in some way. The "optimal" setting would turn on
| aggressive changes like playing with bus speed multipliers, etc.
| layer8 wrote:
| Few people buy a K processor who aren't interested in
| overclocking and tweaking. I wouldn't be surprised if the BIOS
| of a gaming mainboard sets the "optimal" defaults on that
| basis, since the gaming market is all about benchmarks.
| phil21 wrote:
| I'm pretty much the same as OP. I almost always buy the K
| version of the processor, but never intend to overclock. I
| just figure I want the theoretical ability to, and the more
| volume they have on those SKUs the less likely they are to
| take it away entirely.
|
| That or I'm just rewarding shitty corporate product
| segmentation behavior. I never can quite decide.
|
| I do agree over the recent years getting a "boring" higher-
| end configuration is getting more and more difficult.
| bonton89 wrote:
| K chips often came with higher default clocks and
| definitely have better resale value so they're often worth
| buying even if you don't overclock.
| smolder wrote:
| Yes, the overclockable chips are better-binned/faster
| chips even without enabling overclocking. (Unless you're
| talking about X3D chips, which have most overclocking
| features turned off due to thermal limitations of stacked
| cache.)
| Sweepi wrote:
| "Few people buy a K processor who aren't interested in
| overclocking and tweaking." The opposite is true: Most people
| who buy a 'K' CPU dont do any tweaking, I would bet a
| majority does not even activate things like XMP. The 'K' SKUs
| are 1) The highest SKU in the Linup in a given Class 2) They
| are faster then the non-'K' SKUs out of the box.
| crote wrote:
| > I would bet a majority does not even activate things like
| XMP
|
| I _highly_ doubt that. XMP is pretty much mandatory to get
| even remotely close to the intended performance. Without
| XMP your DDR4 memory isn't going beyond 2400MHz - but you
| almost have to _try_ to find a motherboard, memory, or CPU
| which can't run at 3200MHz or even higher. It has all been
| designed for speeds like that, it's just not part of the
| official DDR4 spec.
|
| It's less critical with DDR5, but you're still expected to
| enable it.
| paulmd wrote:
| nevertheless, both AMD and Intel refuse to warranty
| processors operated outside of the spec, including when
| done via XMP/Expo. AMD has gone so far as to add an
| e-fuse in recent generations that permanently marks
| processors that have been operated outside the official
| spec.
|
| https://www.extremetech.com/computing/amds-new-
| threadripper-...
|
| As much as enthusiasts would like this to be "normalized"
| - from the perspective of the vendor it is _not_; they
| are very clear that this is something they do not cover.
| And it will become more and more of a problem as
| generations go forward - electromigration is happening
| faster and faster (sometimes explosively, in the case of
| AMD).
|
| But it is quite difficult to get a gamer to understand
| something when their framerate depends on not
| understanding it.
|
| https://semiengineering.com/uneven-circuit-aging-
| becoming-a-...
|
| https://semiengineering.com/3d-ic-reliability-degrades-
| with-...
|
| https://semiengineering.com/mitigating-electromigration-
| in-c...
|
| > GD-106: Overclocking AMD processors, including without
| limitation, altering clock frequencies / multipliers or
| memory timing / voltage, to operate beyond their stock
| specifications will void any applicable AMD product
| warranty, even when such overclocking is enabled via AMD
| hardware and/or software. This may also void warranties
| offered by the system manufacturer or retailer. Users
| assume all risks and liabilities that may arise out of
| overclocking AMD processors, including, without
| limitation, failure of or damage to hardware, reduced
| system performance and/or data loss, corruption or
| vulnerability.
|
| > GD-112: Overclocking memory will void any applicable
| AMD product warranty, even if such overclocking is
| enabled via AMD hardware and/or software. This may also
| void warranties offered by the system manufacturer or
| retailer or motherboard vendor. Users assume all risks
| and liabilities that may arise out of overclocking
| memory, including, without limitation, failure of or
| damage to RAM/hardware, reduced system performance and/or
| data loss, corruption or vulnerability.
| bee_rider wrote:
| I wonder if XMP is typically enabled by reviewers or on
| marketing slides.
| dmvdoug wrote:
| > since the gaming market is all about benchmarks.
|
| Why is that? I'm not a gamer, so legit asking. It would seem
| to me that what matters most is whether the actual games that
| exist perform well, not some random, hypothetical maximum
| performance that benchmarks can game.
| Modified3019 wrote:
| My impression is that people looking at gaming benchmarks
| are looking at comparisons of FPS and frame times taken
| from just running recent high end games, which sometimes
| have settings to run through a repeatable demo for exactly
| this purpose.
| kllrnohj wrote:
| The K chips aren't just unlocked, they're also significantly
| faster out of the box. I'd guess very few K owners have any
| intention of overclocking, especially as the gains are _very_
| small, and instead just want the higher out of box
| performance
| ajross wrote:
| As others are pointing out, that "basic enough" CPU is in fact
| aimed directly at the overclocking market, as (likely) is the
| motherboard you put it on. This isn't basic at all, this is a
| high end tweaker rig. It's just a _decade old_ tweaker rig.
| mattgreenrocks wrote:
| Fair enough. I build it to last 7-10 years typically, so
| happy to spend a little more on a quality board.
|
| What's the go-to basic mobo brand/board for non-tweakers
| these days?
| ajross wrote:
| There's not a lot, honestly. Pretty much all discrete
| motherboards are gaming rigs of some form. The basic
| computer for general users is now a "laptop" (which tend to
| work quite well for general gaming, FWIW). But the low end
| choices from the regular suspects (Gigabyte, MSI, Asus) are
| generally fine in my experience. You do occasionally get a
| weird/flawed device, like you do with many product areas.
| Arrath wrote:
| Yeah it really seems the market has bifurcated into "DIY
| build-a-computer" targeted towards gamers, bedazzled with
| RGB and all that jazz, and "Buy a used/refurb Dell mini-
| atx office desktop computer", assuming they don't just
| default to 'buy a laptop' as you point out.
| rpcope1 wrote:
| Supermicro is usually a good bet.
| IYasha wrote:
| (c) 1991 - 2024 Epic Games Tools LLC
|
| Wow. RAD was bought by Epic? I kinda missed that. Feels old. :(
| mobilio wrote:
| Yup
|
| https://www.epicgames.com/site/en-US/news/epic-acquires-rad-...
| tibbydudeza wrote:
| This is why I chose an i9 13900 (non-K) variant instead - my PC
| earns me money as a freelance software dev, so I can't stand
| weird issues like this.
| svantana wrote:
| As another software dev, I would pay big money for the "worst
| possible computer" that exhibits all of the glitches and issues
| that end users see. It's so annoying to get bug reports that I
| can't reproduce.
| tibbydudeza wrote:
| I had my time during my embedded days - did a site visit 1000
| km away and discovered no wonder the serial port and
| scanner/printer is going wonky.
|
| No shielding, earth - using the crappiest/cheapest PC they
| could get instead of using the recommended kit as the sales
| droid wanted a bigger commission.
|
| Said call me when you replaced the h/w - I walked out and
| went to the airport. They never called me.
| fabianhjr wrote:
| If that was the case why not go for Ryzen + ECC memory?
| tibbydudeza wrote:
| Got ECC memory in my server - I am a value-for-money guy - my
| previous kit was an i7-6700 system with 48GB, so I really
| sweated it until JetBrains let me know "She canno go more
| Captain".
|
| DDR4/Intel motherboards are cheaper than AM5/DDR5 - also a
| Ryzen laptop foobarred on my daughter, so to me Intel kit was
| just more stable - no weird XMP issues or overclocking to the
| nines.
| codexon wrote:
| I got a 13900 non-K on a Linux server and it randomly locked up
| the system after a month.
| op00to wrote:
| This sounds a lot like the behavior I see when I have overclocked
| my processor too far and try to run AVX-heavy workloads!
| Cranking down the frequency during AVX seems to stabilize things.
| Arech wrote:
| Had the same experience overclocking an old AMD Phenom II a while
| ago. It worked flawlessly in all publicly available test software
| I tried, until I ran some custom heavily vectorized code, which
| eventually required shaving off almost all of the overclock :D
| op00to wrote:
| There's a way (at least on my Intel) to tell the processor to
| clock down a certain number of steps depending on whether AVX
| is being executed. So, for the majority of stuff that didn't
| use AVX I let 'er rip, but when AVX is running it clocks down
| a couple steps. I could use less voltage, and this CPU is
| fast enough. I think it's a 13900k.
| Arech wrote:
| Ah, that's an interesting feature of new CPUs, I didn't
| know about it! Thanks for telling!
| newsclues wrote:
| Remember when Intel had Intel-branded reference boards? I would
| like a comeback, please.
| mbrumlow wrote:
| I recently built a new system with an i9 14900KF and an Asus
| Formula motherboard, as a VFIO system so I could run Windows and
| play some games.
|
| It was a nightmare to get running stable. None of the default
| settings the motherboard used worked. Games crashed, kernel and
| emacs compiles failed.
|
| End result I had to cap turbo to 5.4ghz on a 6ghz chip, and
| enable settings that capped max watts and temperature for
| throttling to 90c.
|
| System seems stable now. Can get sustained 5.4ghz without
| throttling and enjoying games at 120fps with 4k resolution.
|
| Even though it is working, I do feel some kind of way about not
| being able to run the system at the advertised numbers I paid for.
| doubled112 wrote:
| What I'm not happy about is the marketing around turbo boost.
|
| You know how ISPs used to sell "up to X Mbps"? Same idea. Your
| chip will turbo boost "up to 6.00 GHz".
|
| It's basically automated overclocking, and as you learned,
| sometimes it can't even do it in a stable fashion. Some of
| those chips will never clock "up to 6.00 GHz" but they didn't
| lie. "up to"
| wtallis wrote:
| It's particularly bad when they stop telling you what clock
| speeds are achievable with more than one core active. At best
| these days you get a "base clock" spec that's very slow that
| doesn't correspond to any operating mode that occurs in real
| life. You used to get a table of x GHz for y active cores,
| but then the core counts got too large and the limits got
| fuzzier.
|
| And laptops have another layer of bullshit, because the
| theoretical boost clocks the chip is capable of will in
| practice be limited by the power delivery and cooling
| provided by that specific machine, and the OEMs never tell
| you what those limits are. So they'll happily take an extra
| $200 for another 100MHz that you'll never see for more than a
| few milliseconds while a different model with a slower-on-
| paper CPU with better cooling can easily be more than 20%
| faster.
| kilolima wrote:
| Yes, this is the situation for my dell laptop's i7-1165G7.
| Alleged turbo boost to 4.7ghz! In reality, it will hit that
| for a sec and then throttle to ~1ghz. I had to disable the
| turbo boost AND two cores in bios to let it even achieve
| ~2.0ghz speeds consistently. It's a total scam. Turns out
| my 8th-Gen i5 laptop is almost the same speed on
| benchmarks, just because it's a few mm thicker with better
| cooling.
| CooCooCaCha wrote:
| Do you think that happened because you had insufficient
| cooling?
|
| Its hard to cool these new chips. AMD included.
| kijin wrote:
| Even if GP's cooling setup was less than ideal, the chip
| should have throttled itself to a stable frequency instead of
| crashing left and right.
| cesarb wrote:
| It might not have throttled fast enough. Without sufficient
| thermal mass (or with insufficient heat transfer to that
| thermal mass, for instance if the thermal paste is
| misapplied), it might heat up too fast for the sensors to
| keep up.
| Hikikomori wrote:
| For different reasons though. AMD's chiplets produce heat in
| a small area which makes it hard to transfer heat quickly.
| Intel just use a shitload more power and thus more heat.
| CooCooCaCha wrote:
| That's not entirely it though. Modern AMD and Intel chips
| are built to run at their thermal/frequency limits and will
| jump to those limits at a moments notice in order to
| maximize performance.
|
| So unless you have powerful cooling you will hit the
| thermal limit.
| Hikikomori wrote:
| I didnt say that it was everything that matters, just
| commenting on the difference between them.
| JohnBooty wrote:
| enable settings that capped max watts and temperature for
| throttling to 90c.
|
| You were going _above_ 90C before???
|
| My first thought is that seems insane, but is apparently normal
| for that chip, according to Intel: "your processor supports up
| to 100degC and any temperatures below it are normal and
| expected"
|
| https://community.intel.com/t5/Processors/i9-14900K-temperat...
|
| That is just wild though. On one hand you should obviously get
| the performance that was advertised and that you paid for. On
| the other hand IMO operating a CPU at 90-100C is just insane.
| It really feels like utter desperation on Intel's part.
|
| I would be curious what kind of cooling setup you have.
| legosexmagic wrote:
| The amount of cooling you get is proportional to the
| difference between component temperature and ambient
| temperature. That's why modern chips are engineered to run
| much hotter.
| jnxx wrote:
| Until Dust Puppy kills 'em
| dist-epoch wrote:
| For both Intel/AMD 100C is now a target, not a limit.
| bee_rider wrote:
| Hey, is there a cooling solution that sprays water on some
| sort of heat spreader and lets it evaporate? Kidding. Kinda.
| But actually, is that possible?
| Analemma_ wrote:
| That's essentially what vapor chamber coolers are. It's a
| sealed unit where the working fluid evaporates on the CPU
| end, absorbing a ton of heat, and then condenses on the
| other side, before going back to do the cycle again.
| Because the heat of vaporization is so large, these can
| move a lot more heat than ordinary heat sinks.
| smolder wrote:
| Heat pipes sort of do that naturally without any active
| "spraying". They contain a fluid that phase changes and
| carries heat away. Closed loop water coolers have the
| active flow of water you want for maximum effect. I don't
| think your idea would be an improvement on that.
| jcalvinowens wrote:
| > On the other hand IMO operating a CPU at 90-100C is just
| insane.
|
| No it isn't, the manufacturer literally says it's normal! I
| think people who spend as much money on cooling setups as the
| chip are the insane ones.
|
| My favorite story: I once put Linux on a big machine that had
| been running windows, and discovered dmesg was full of
| thermal throttling alerts. Turns out, the heatsink _was not
| in contact with the CPU die_ because it had a nub that needed
| to occupy the same space as a little capacitor.
|
| I'd been using that machine to play X-plane for over two
| years, and I never noticed. It was not meaningfully slower:
| the throttling events would only happen every ten or so
| seconds. I'm still using it today, although with the heatsink
| fixed :)
|
| I have a garage machine with a ca. 2014 Haswell that's been
| running full tilt at 90C+ for a good bit of its life. It just
| won't die.
| crote wrote:
| Temperatures like that have been fairly normal for a few
| generations now - both for Intel and AMD. It might look
| insane compared to what you were used to seeing a decade ago,
| but it's actually not that crazy.
|
| First, the temperature sensors got a lot better. Previously
| you only had one sensor per core/cpu, and it was placed
| wherever there was space - nowadays it'll have dozens of
| sensors placed in the most likely hotspots. A decade ago a
| 70C temp meant that some parts of the CPU were closer to 90C,
| whereas nowadays a 90C temp means the hottest part is
| _actually_ 90C.
|
| Second, the better sensors allow more accurate tuning. While
| 100C might be totally fine, 120C is probably already going to
| cause serious damage. The problem here is that you can't just
| rely on a somewhat-distant sensor to always be a constant 20C
| below the peak value: it's also going to be lagging a bit. It
| might take a tenth of a second for that temp spike in the
| hotspot to reach the sensor, but in the time between the
| spike starting and the temp at the sensor rising enough to
| trigger a downthrottle you could've already caused serious
| damage. A decade ago that meant leaving some margin for
| safety, but these days they can just keep going right up to
| the limit.
|
| It's also why overclocking simply isn't really a "thing"
| anymore. Previous CPUs had plenty of safety margin left for
| risk-takers to exploit, modern CPUs use up all that margin by
| automatically overclocking until it hits either a temperature
| limit or a power draw limit.
| JoshTriplett wrote:
| Exactly. Temperature measurements are a lot like available
| memory measurements in that regard. People wonder why the
| OS uses up all available memory, and it's because the OS
| knows that empty memory is useless, while memory used to
| cache disk is potentially useful (and can always be
| discarded when that memory is needed for something else).
| So, the persistent state of memory is always "full".
|
| Similarly, processors convert thermal headroom to
| performance, until they run out of thermal headroom. So if
| you improve the cooling on a processor that has work to do
| (rather than sleeping), it will use up that cooling and
| perform better, rather than performing the same and running
| cooler.
|
| (Mobile processors operate differently, since they need to
| maintain a much tighter thermal envelope to not burn the
| user. And processors can also target power levels rather
| than thermals. But when a processor is in its default "run
| as fast as possible" mode, its normal operating temperature
| will be close to the 100C limit.)
| jtriangle wrote:
| There's a side benefit to this as well, your cooling
| solution is more effective at 90C than it is at 40C, you
| know, highschool physics, deltaT and all that.
| BobbyTables2 wrote:
| Seems scary that a 10% difference in clock frequency
| makes or breaks stability.
|
| How much margin is really there?
| rygorous wrote:
| Dynamic switching power (i.e. the fraction of the chip's
| power consumption from actually switching transistors, as
| opposed to just "being on") scales with V^2 * f, where
| V=voltage and f=frequency, and V in turn depends on f, where
| higher frequencies need higher voltage. Not really linearly
| (It's Complicated(tm)), but it's not a terrible first-order
| approximation, which makes the dynamic switching power have a
| roughly cubic dependency on frequency.
|
| Therefore, 1.1x the frequency at the high end (where
| switching power dominates) is 1.33x the power draw.
|
| Those final few hundred MHz really hurt. Conversely, that's
| also why you see "Eco" power profiles with a major reduction
| in power draw that cost you maybe 5-10% of your peak
| performance.
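|
| A first-order check of that scaling, treating voltage as
| roughly proportional to frequency near the top of the V/F
| curve (a crude approximation, not a model of any specific
| chip):
|
|     # P_dyn ~ C * V^2 * f, and with V ~ f near the top of the
|     # curve, dynamic power scales roughly with f^3
|     for scale in (0.90, 0.95, 1.00, 1.05, 1.10):
|         print(f"{scale:.2f}x freq -> ~{scale**3:.2f}x dyn. power")
|
| 1.10x frequency comes out to ~1.33x power, and 0.90x to ~0.73x,
| which is roughly the trade-off those "Eco" profiles exploit.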
| callalex wrote:
| Did you try cleaning everything and re-mounting the cooler with
| new paste? I've seen similar behavior when people mess up and
| get bubbles in their paste. Do you see wildly different
| temperature readouts for different cores?
| phantomwhiskers wrote:
| I also recently built a system with the 14900KF on an ASUS TUF
| motherboard and NZXT Kraken 360 cooler, and so far I haven't
| experienced any issues running everything at default BIOS
| settings (defaulted to 5.7GHz). I haven't seen temps above 70C
| yet, although granted I also haven't seen CPU load go above
| 40%, and haven't tried running any benchmarking software.
|
| I'm curious about what you are using for cooling, as 90C at
| 5.4GHz seems way off compared to what I am seeing on my
| processor, but it could just be that I'm not pushing my
| processor quite as hard even with the higher clock rate.
| jeffbee wrote:
| Similar experience with an Asus motherboard. With their automatic
| tuning, instability leading to compiler crashes. Had to manually
| set the BIOS for sanity.
|
| I believe the problems are compounded by the way their SuperIO
| controls the cooler, because the crashes were associated with
| temperature excursions to 100C. It's too slow to ramp up and too
| quick to ramp down. It is possible to tune this from userspace
| under Linux. But really the up ramp should be controlled by a
| leading indicator like the voltage regulator instead of a lagging
| indicator like temperature. Alternatively, the Linux p-state
| controller could anticipate the power levels and program a
| higher fan speed.
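|
| (A minimal sketch of doing that from userspace, assuming the
| standard Linux hwmon sysfs interface; the hwmon index and which
| pwm channel drives the CPU fan vary by board and Super I/O
| driver, so treat the paths below as placeholders:)
|
|     /* Force a fan to manual PWM control and set its duty cycle.
|        Check /sys/class/hwmon/*/name to find the right device;
|        "hwmon2" here is only an example. Needs root. */
|     #include <stdio.h>
|
|     static void write_attr(const char *path, const char *value) {
|         FILE *f = fopen(path, "w");
|         if (!f) { perror(path); return; }
|         fputs(value, f);
|         fclose(f);
|     }
|
|     int main(void) {
|         /* pwm1_enable: 1 = manual control; pwm1: duty cycle 0-255 */
|         write_attr("/sys/class/hwmon/hwmon2/pwm1_enable", "1");
|         write_attr("/sys/class/hwmon/hwmon2/pwm1", "255");
|         return 0;
|     }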
| PawBer wrote:
| Reminds me of this Raymond Chen classic:
| https://devblogs.microsoft.com/oldnewthing/20050412-47/?p=35...
| Ochi wrote:
| So ideally, we should disable hyper threading to mitigate
| security issues and now also disable turbo mode to mitigate
| memory corruption issues. Maybe we should also disable C states
| to avoid side-channel attacks and disable efficiency cores to
| avoid scheduler issues... and at some point we are back to a
| feature set from 20+ years ago. :P
| bee_rider wrote:
| IMO it is worth noting that the "turbo mode," as you call it,
| seems to be an overclock that some motherboards do by default.
| Not the stock boost frequencies.
|
| The hyperthread and c-state stuff, eh, if you want to run code
| that might be a virus you will have to limit your system. I
| dunno. It would be a shame if we lost the ability to ignore
| that advice. Most desktops are single-user after all.
| blibble wrote:
| turbo boost is an advertised feature of the chip
|
| these chips have been specially binned because they are
| supposedly stable at those frequencies (within an envelope
| set by intel)
|
| if intel can't get it to work they shouldn't be selling these
| chips at all
| bee_rider wrote:
| Unless I misread the blog post, there doesn't seem to be
| any issue with the stock turbo behavior.
| alwayslikethis wrote:
| Provided enough cooling, a chip that can boost to its turbo
| frequency for a few seconds should also run stably at that
| frequency indefinitely. Nowadays these boost clocks are so
| high that there is often not much gained by pushing any
| further.
| dist-epoch wrote:
| Intel should police their own ecosystem.
| ajross wrote:
| They have, in the past. People (including posters here)
| absolutely freaked out about clock-locked processors and
| screamed about the needless product differentiation of
| selling "K" CPUs at a premium.
|
| People _want_ to overclock. Gamers want to see big numbers.
| If gamers don't do it their motherboard vendors will. It's
| not a market over which Intel is going to have much
| control, really.
|
| Note that you don't, in general, see this kind of silly
| edgelord clocking in the laptop segments.
| dist-epoch wrote:
| Overclocking is ok.
|
| Out of the box default overclocking is not, this aspect
| should be policed.
| ajross wrote:
| FWIW, there's no evidence that this is an "out of the box
| default" configuration on any of this hardware. Almost
| certainly these are users who clicked on the "Mega Super
| Optimizzzz!!!" button in their BIOS settings. And again,
| overclocking support on gaming motherboards _is a
| feature_ that consumers want, and will pay for. So of
| course the vendors are going to provide it.
| rygorous wrote:
| Oodle maintainer here, we had two people that hit the
| issue offer to run some experiments for us. Neither was
| doing any overclocking before and both tried numerous
| things including resetting to BIOS defaults and also
| updating their BIOS (there was a known [to Intel] issue
| affecting some ASUS boards that had been fixed in a BIOS
| update in spring of 2023, and we were asked to rule it
| out.)
|
| This issue doesn't affect every such machine, but both
| people affected by the issue that consented to run tests
| for us still had the issue reproduce after flashing BIOS
| to current and with BIOS default settings for absolutely
| everything.
|
| Among the settings enabled _by default_ on some boards:
| current limit set to 511 amps (...wat), long duration
| power limit set to 350W (Intel spec: 125W), short
| duration power limit also set to 350W (Intel spec: 253W),
| "MultiCore Enhancement" (extra clock boosting past what
| the CPUs do themselves) set to "Auto" rather than "Off",
| and some others.
| jnxx wrote:
| Why does this remind me of that big, extremely profitable
| company that makes something every American needs once in a
| while, and which seems to have abandoned all sanity in its
| processes? Looks like Intel and Boeing are on a similar
| path....
| jnxx wrote:
| > The hyperthread and c-state stuff, eh, if you want to run
| code that might be a virus you will have to limit your
| system.
|
| So, you are trusting all web pages you view? Because they run
| unknown code on your box, which probably has some beefy
| private data.
| Wowfunhappy wrote:
| I know some people browse the web while gaming, but I
| don't. For the gaming use case, I legit want a toggle that
| says "yes, all the code I'm running is trusted, now please
| prioritize maximum performance at all costs." For all I
| care this mode can cut the network connection since I don't
| do multiplayer.
|
| I imagine people doing e.g. heavy number crunching might
| want something similar.
| bee_rider wrote:
| I run noscript and try to be selective about which pages I
| enable.
| jrockway wrote:
| Remember that you run a lot of untrusted code on your single-
| user desktop through Javascript on websites. Javascript can
| do all those side channel attacks like Spectre and Meltdown.
| bee_rider wrote:
| There are almost certainly unmitigated Spectre-style bugs
| hiding in modern hardware. People who don't block
| JavaScript by default are impossible to protect anyway.
| VyseofArcadia wrote:
| Maybe you do, but some of us use NoScript[0] and whitelist
| sites we trust.
|
| I'm not affiliated with NoScript. I just think it's insane
| that we run oodles of code to display web pages.
|
| [0] https://noscript.net/
| Workaccount2 wrote:
| Using no-script made me realize how unchained the
| Internet has become. Sites with upwards of 15 different
| domains all running whatever JS they want on your
| machine. Totally insane.
| secondcoming wrote:
| Or just disable overclocking.
| ajross wrote:
| Why is this downvoted? That's exactly what's happening here.
| The affected devices are being overclocked, and the
| instructions at the end of the linked support document detail
| how to find the correct limits for your CPU and set them in
| your BIOS.
| sandworm101 wrote:
| I think it is because overclocking has become so normal
| that it is an expected feature on most chips. Being told to
| disable it is like being told to disable the supercharger
| on your new Ferrari: you are no longer getting what you
| thought you had paid for.
| crazygringo wrote:
| Is that true?
|
| I thought the entire premise of overclocking was that
| it's not officially supported and it may break things.
|
| The whole point is that you're _not_ paying for it and it's
| entirely at-risk.
|
| Because if you do want a higher level of _guaranteed_
| performance, you _do_ need to pay for a faster chip (if
| it exists).
| Kluggy wrote:
| CPU manufacturers certainly hold the line you stated but
| motherboard vendors have jumped over the line and now
| sell motherboards that overclock for the end user entirely
| transparently.
|
| It's fair for the end user who bought a motherboard that
| promises a higher clock speed to expect that clock speed.
| crazygringo wrote:
| Do these motherboards explicitly provide a warranty that
| covers not just damage from overclocking but also CPU
| errors?
|
| If you can provide links, I'd be curious to see what
| guarantees they make. "What's fair" depends very
| specifically on what language they use.
| Red_Leaves_Flyy wrote:
| From my limited knowledge, the motherboard manufacturers
| hide behind disclaimers. IIRC even using fast RAM at its
| rated clock speed with a CPU that does not support that
| speed is a warranty violation.
| Kluggy wrote:
| https://www.asus.com/motherboards-
| components/motherboards/tu... has "ASUS Multicore
| Enhancement" bios setting which defaulted to Auto which
| is documented as "This item allows you to maximize the
| overclocking performance optimized by ASUS core ratio
| settings."
|
| They have now entered the AI bubble with
|
| https://www.asus.com/microsite/motherboard/Intelligent-
| mothe...
|
| MSI has a similar setting, although I don't know exactly
| what models have it nor what it's called
| sandworm101 wrote:
| Well, Ferrari also tells people not to break speed
| limits. But if their cars started breaking apart at 85mph
| they would still be blamed. This might not be warranty
| repair, intel is probably not liable legally, but this
| should have impact on their reputation: intel put out a
| chip that does not handle overclocking very well. Ok.
| I'll remember that when I am shopping for my next chip.
| simondotau wrote:
| > The whole point is that you're not paying for it
|
| Tell that to anyone who paid extra for a K-series Intel
| chip.
| bee_rider wrote:
| I don't think there's a great car analogy because the
| ecosystems and stakes are different.
|
| These chips require motherboards to function, and these
| unlocked chips get their configuration from the
| motherboard. There's no analogous entity to Ferrari the
| company here, it is like you bought an engine from one
| company, a gearbox from another, and the gearbox had a
| "responsiveness enhancement" setting that always redlined
| your RPMs or something (I don't know cars).
| ajross wrote:
| Ferrari for sure warrants their cars as sold. But if you
| take it to a mod shop and put in an aftermarket turbo
| that damages your valves, you don't go whining to HN with
| an article with "Ferrari Engine Instability" in the
| title, do you?
|
| I don't know what you want Intel to do here. They tell
| you upfront what the power and clock limits are on the
| parts. But the market has a three decade history of
| people pushing the chips a little past their limit for
| fun and profit, so they "allow" it even if they know it
| won't work for everything.
| kuschku wrote:
| These motherboards are Intel-certified. If I get a mod
| shop to install a Ferrari-certified part, I expect the
| part to work.
| ajross wrote:
| Meh. So you're in the "Intel should take affirmative
| action to prevent overclocking" camp. And as mentioned
| the response to that is that they've tried that (on
| multiple occasions, using multiple techniques) and people
| freaked out about that too. They can't win, I guess.
| xcv123 wrote:
| Ferrari does not allow modifications of their cars. If
| you take it to a mod shop, they will void the warranty
| and you will be banned from purchasing a new Ferrari.
| rygorous wrote:
| (Oodle maintainer here.) This issue only occurs on some
| small fraction of machines, but on those that we've had
| access to, it reproduces with BIOS defaults and no user-
| specified overclocking. It turns out several of these
| mainboards will overclock and set other values out of
| spec even at BIOS defaults.
|
| I don't have a problem with end users experiencing
| instability once they manually overclock (that's how it
| goes), but CPU + mainboard combinations experiencing
| typical OC symptoms with out-of-the-box settings is just
| not OK.
|
| This appears to be an arms race between mainboard vendors
| all going further and further past spec by default
| because it gives better benchmark and review scores and
| their competition does it. Intel for their part are
| themselves also dialing in their parts more aggressively
| (and, presumably although I don't know for sure, with
| smaller margins) over time, and they are for sure aware
| that this is happening, because a) even had they not
| known already (which they did) they would have learned
| about this months ago when we first contacted them about
| this issue, and b) technically out of spec or not, as long as
| it seems to work fine for users and makes their parts
| look better in reviews, they're not going to complain.
|
| However, it turns out, it does not work fine for at least
| some small fraction of machines. I have no idea what that
| percentage is, but it's high enough that googling for say
| "Intel 13900K crash" yields plenty of relevant results.
| Some of this will be actual intentional overclockers, but
| given how many boards ship with some degree of out-of-spec
| overclocking enabled by default, it's unlikely to be all of
| them.
|
| Meanwhile we (and other SW vendors) are getting a
| noticeable uptick in crash reports on, specifically,
| recent K-series Intel CPUs, and it's not something we can
| sanely work around because the issue manifests as code
| randomly misbehaving and it's not even when doing
| anything fancy. The Oodle issue in particular is during
| LZ77-family decompression, which is to say, all integer
| arithmetic (not even multiplies, just adds, shifts and
| logic ops), loads/stores and branches. This is the bare
| essentials. If it was an issue with say AVX2, we could
| avoid AVX2 code paths on that family of machines (and
| preferably figure out what exactly is going wrong so we
| can come up with a more targeted workaround). But there
| is no sane plan B for "integer ALU ops, load/stores and
| branches don't work reliably under load". If we can't
| rely on that working, there is not enough left for us to
| work around bugs with!
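|
| (Not Oodle's actual code, but to give a flavor of what "bare
| essentials" means here, a generic LZ77-style match copy is
| roughly:)
|
|     /* Generic LZ77-style back-reference copy: nothing but loads,
|        stores, adds, compares and branches. Illustrative only. */
|     #include <stddef.h>
|
|     static void copy_match(unsigned char *dst, size_t dst_pos,
|                            size_t offset, size_t length) {
|         unsigned char *out = dst + dst_pos;
|         const unsigned char *src = out - offset;  /* earlier output */
|         while (length--)
|             *out++ = *src++;  /* byte-by-byte so overlaps work */
|     }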
|
| I realize this all looks like finger-pointing, but this
| is truly beyond our capacity to work around in a sane way
| in SW, with what we know so far anyway. Maybe there is a
| much more specific trigger involved that we could avoid,
| but if so, we haven't found it yet.
|
| Either way, when it's easy to find end user machines that
| are crashing at stock settings, things have gone too far
| and Intel needs to sit down with their HW partners and
| get everyone (themselves included) to de-escalate.
| johnklos wrote:
| If you put your foot to the floor in a supercharged car,
| you're going to eventually have to let it up lest you
| melt things or you burn all your oil because your rings
| aren't making contact with the cylinder walls any more.
| It's an apt metaphor since the same is true of CPUs. You
| can't run a CPU at 400 watts for more than a handful of
| seconds at a time.
|
| The problem is that Intel has normalized it so much that
| all their high end CPUs do this, and apparently do it
| often. It's not unexpected that they might be too close
| to the point where things are melting, so to speak.
|
| I'd rather slower and more stable any day - I chose a
| Ryzen 7900 over a 7900X intentionally - but that isn't
| what all the marketing out there is trying to sell. The
| fancy motherboards, the water coolers, the highly clocked
| memory all account for lots of markup, so that's what's
| marketed. I'm not a fan.
|
| It is worth noting a distinction between the terms
| "overclocking" and "turbo clocking". "Overclocking" has
| traditionally meant running the clock "over" the rating.
| "Turbo clocking" is now built in to almost every CPU out
| there. One can technically void your warranty, whereas
| the other won't.
|
| Since we're mostly technical people here, we should use
| the appropriate term where the context makes that choice
| more accurate. It's like virus and Trojan - we SHOULD be
| technically correct, but that doesn't mean highly
| technical people aren't still calling Trojans viruses now
| and then.
| cduzz wrote:
| I think "overclock" implies that the end-user is doing
| something that's out-of-spec for the thing they're
| operating.
|
| This "I can run a core at a faster speed" is a documented
| feature so not really overclocking.
| TylerE wrote:
| That's literally overclocking. You're clocking it at a
| rate over the nameplate value. Just because the BIOS is
| factory-unlocked doesn't really change anything.
| LoganDark wrote:
| Yeah. Intel advertises the ability to overclock, but that
| doesn't mean overclocking is in spec. It just means Intel
| allows you to run it out of spec if you so choose. The
| spec says you can set the clock multiplier, it doesn't
| say anything above the stock range will actually be
| stable.
| TylerE wrote:
| Plus almost always people are tweaking voltages and such
| also.
| LoganDark wrote:
| This article is about automatic / enabled-by-default
| overclocking, which isn't actually specified by Intel but
| is done by the motherboard manufacturers anyway. At least
| the "GAMING GAMING GAMING" oriented ones like MSI and
| friends.
|
| As an example of motherboard manufacturers going outside
| specifications, my MSI motherboard has a built-in option
| to change BCLK, which is the clock reference for the
| entire PCIe bus. Changing it not only overclocks the CPU,
| but also the GPU's connection (not the GPU itself), as
| well as the NVMe SSD.
|
| This was so not-endorsed by Intel that they quickly
| pushed microcode that shuts the CPU down if it detects
| BCLK tampering.
|
| In response, MSI added a dropdown that allows you to
| downgrade the microcode of the CPU.
|
| So yeah. Very not within specifications.
| bee_rider wrote:
| At least the built in "multicore enhancement" type overclocks
| that are popular nowadays with motherboard manufacturers.
|
| I wonder if the old style "bump it up and memtest" type
| overclocking would catch this. Actually, what is the good
| testing tool nowadays? Does memtest check AVX frequencies?
| wmf wrote:
| Intel Performance Maximizer is from Intel so I'd hope it
| has good tests.
| Kon-Peki wrote:
| But isn't overclocking the entire point of buying the K
| version of these chips?
| secondcoming wrote:
| Yes it is but there's more to overclocking than just the
| CPU. You also need adequate cooling and fine-tuning of
| parameters I'll never truly understand. There are so many
| moving parts that you're not guaranteed anything. It seems
| like the CPUs were actually running at their overclocked
| speeds, but the rest of the system couldn't keep up.
| szundi wrote:
| Also might need to raise voltage etc
| rygorous wrote:
| Definitely not. These are supposed to be higher-quality
| bins that also ship with higher stock clock rates (both
| base and boost) and are rated for them.
|
| I don't know how common this is across the whole population
| of PC buyers, but personally, I have for sure bought
| K-series parts then not clocked them past their stock
| settings, trusting that they are rated for it and deeply
| uninterested in any OCing past that. (I prefer my machines
| stable, thank you very much.)
| bbarnett wrote:
| Decades ago I visited a fellow Amiga user's house. He had
| an overclocked 68060 Apollo board.
|
| He was so happy with the speed. Would not stop telling
| everyone, and talking about it. Yet as I watched him demo
| it, it rebooted every minute or so. Most unstable thing
| ever.
|
| Sure it booted in 2 seconds, and he just went about his
| merry way, but.. what?! Guy could have still overclocked
| a little less and had stability, but nope.
|
| Some overclockers are weird.
| zenonu wrote:
| Intel is already running their CPUs at the red line. We're
| seeing the margin breaking down as Intel tries to remain
| competitive. The latest 14900KS can even pull > 400W. It's
| utter insanity.
| refulgentis wrote:
| I wish I kept up with them better. I swear every 3 months I
| see a headline that is "Intel says N nodes in {N-1 duration
| - 3 months}". I think I just saw 5 nodes in 4 years? And
| we've had 2 in the last 4? Sigh.
| vondur wrote:
| It seems like the problems stem from differing firmware from the
| various motherboard manufacturers. I have a motherboard with a
| Ryzen 7950x and it would randomly not boot. I'd have to remove
| the battery from the system, let it fully reset, and then it
| would work again. Finally an update to the firmware fixed that
| bug.
| whoisthemachine wrote:
| Good, fast, cheap. Choose two.
| paulmd wrote:
| haha, knew it wouldn't take long for the AMD fanboys to get
| wound up about how awful this is gonna be.
|
| https://news.ycombinator.com/item?id=39479081
|
| Somehow people think that it's a strawman, but people like
| parent comment _actually think and post like this_ lol
| rkagerer wrote:
| Right, when they still knew how to make reliable hardware
| instead of cramming in features that aren't fully thought out
| and come with questionable tradeoffs to hit the bleeding edge.
| perryizgr8 wrote:
| If I ever encounter a CPU bug causing problems in my production
| code, I will consider my life complete. I will be satisfied that
| I've practiced my profession to a high degree of completeness.
| dist-epoch wrote:
| You should go work for Facebook. At their scale they are
| encountering CPU bugs daily:
|
| > This has resulted in hundreds of CPUs detected for these
| errors
|
| https://arxiv.org/abs/2102.11245
| ryukoposting wrote:
| I vaguely recall motherboard vendors ignoring Intel's power
| recommendations a couple years ago, which was causing weird
| thermal/performance issues (was it Asus?). I get the impression
| that's what's happening here, again.
| adamc wrote:
| While I appreciate their point of view, from a consumer pov this
| would definitely be a failure of their software, since an
| implicit requirement is that it has to run on customers'
| machines. People aren't going to throw away their CPU for this,
| they are going to return the game (if possible), and certainly
| express the bad user experience they had with it.
| xcv123 wrote:
| This is a hardware fault causing other software to fail. Intel
| and mainboard manufacturers have recommended workarounds.
| Customers are not stupid and they know who is at fault.
| adamc wrote:
| I predict you are wrong and there will be returns. It's not
| really a question of stupidity, but of what options are
| available to them.
| xcv123 wrote:
| No they just follow the provided instructions, go into the
| BIOS setup, and fix the settings. These are $1k CPUs.
| Purchased by enthusiasts, not retards. If their CPU is
| unstable they will want to fix it.
| johnklos wrote:
| For years I've had this impression that Intel CPUs were, to put
| it simply, trying too hard. I administer servers for various
| companies, and some use Intel even though I generally recommend
| AMD or non-x86.
|
| A pattern I've noticed is that some of the AMD systems I
| administer have never crashed or panicked. Several are almost ten
| years old and have had years of continuous uptime. Some have had
| panics that've been related to failing hardware (bad memory,
| storage, power supply), but none has become unstable without the
| underlying cause eventually being discovered.
|
| Intel systems, on the other hand, have had panics that just have
| had no explanation, have had no underlying hardware failures, and
| have had no discernible patterns. Multiple systems, running an OS
| and software that was bit-for-bit identical to what has been
| running on AMD systems, have panicked. Whereas some of the AMD
| systems that had bad memory had consumer motherboards with non-
| ECC memory, the Intel systems have typically been Supermicro or
| Dell "server" systems with ECC.
|
| In one case two identical Supermicro Xeon D systems with ECC were
| paired with two identical Steamroller (pre-Ryzen) AMD systems.
| All systems provided primary and backup NAT, routing,
| firewalling, DNS, et cetera. The Xeon systems were put in place
| after the AMD systems because certain people wanted "server
| grade" hardware, which is understandable, and low power AMD
| server systems weren't a thing in that time period. Over the
| course of several years, the Xeon systems had random panics,
| whereas one of the AMD systems had a failed SSD, but no unplanned
| or unexplained panic or outage, and the other had never had a
| panic or unplanned reboot in all the years it was in continuous
| service.
|
| Had I collected information more deliberately from the very
| beginning of these side-by-side AMD and Intel installations, I'd
| have something more than anecdotal, but I'm comfortable calling
| the conclusion real: multiple generations of Intel systems, even
| with server hardware and ECC, have issues with random crashes and
| panics, on the order of perhaps one every year or two. I do not
| see a similar instability on AMD, though.
|
| With brand new Intel CPUs taking substantially more power than
| similarly performing AMD CPUs, we have a more literal example of
| what I think is the underlying cause: Intel is trying way too
| hard to get every tiny bit of performance out of their CPUs,
| often to the detriment of the overall balance of the system.
| Between the not insignificantly higher number of CPU
| vulnerabilities on Intel due to shortcuts illustrated by the
| performance losses from enabling mitigations, and the rather
| shocking power draw of stock Intel CPUs that have turbo boosting
| enabled, I can't recommend any Intel system for any use where
| stability matters.
| crote wrote:
| On the other hand, I'm currently dealing with an AMD system
| which seems to randomly hard reboot every couple of days, as if
| someone pressed the power button for 5 seconds.
|
| It could keep running for 70 hours, or it could crash twice in
| 4 hours. Stress-testing CPU, GPU, memory, and storage doesn't
| invoke a crash, but it'll crash when all I'm running is a
| single Firefox tab with HN open.
|
| Maybe I got unlucky, or maybe you got lucky. Who knows, really.
| xcv123 wrote:
| If it's not the system hardware could it be power
| instability? Brownouts will trigger a reset. A UPS with power
| conditioner will fix that.
| terrelln wrote:
| We also regularly run into hardware issues with Zstd. Often the
| decompressor is the first thing to interact with data coming over
| the network. Or, as in this case, the decompressor is generally
| very sensitive to bit-flips, with or without checksumming
| enabled, so it notices hardware problems sooner than other
| processes running on the same host.
|
| One decision that Zstd made was to include only a checksum of the
| original data. This is sufficient to ensure data integrity. But,
| it makes it harder to rule out the decompressor as the source of
| the corruption, because you can't determine if the compressed
| data is corrupt.
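|
| (One hedge, sketched here with a plain FNV-1a hash rather than
| anything Zstd-specific: keep your own checksum of the compressed
| bytes next to the frame, so a decompression failure can be told
| apart from corrupted compressed input:)
|
|     /* Hash the *compressed* buffer yourself and store the result
|        alongside it; on failure, recompute and compare before
|        blaming the decompressor. FNV-1a 64-bit shown for brevity. */
|     #include <stdint.h>
|     #include <stddef.h>
|
|     static uint64_t fnv1a64(const void *buf, size_t len) {
|         const unsigned char *p = (const unsigned char *)buf;
|         uint64_t h = 0xcbf29ce484222325ULL;  /* FNV offset basis */
|         for (size_t i = 0; i < len; i++) {
|             h ^= p[i];
|             h *= 0x100000001b3ULL;           /* FNV prime */
|         }
|         return h;
|     }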
| FileSorter wrote:
| I recently had to RMA my i9-13900KS because it was faulty. I was
| experiencing some of the weirdest behavior I have ever seen on a
| PC. For example:
|
| 1. Whenever I tried to install Nvidia drivers I would get
| "7-zip: Data error"
|
| 2. A fresh install of Windows would give me an SxS error when
| trying to launch Edge
|
| 3. I could not open the Control Panel
|
| 4. BSOD loop on boot
| dontupvoteme wrote:
| Decompression failure immediately went to a far grimmer failure
| mode in my mind..
| colombiunpride2 wrote:
| I wonder if this is partly related to the LGA1700 frame problems
| that tend to bend the heat spreader.
|
| There are two aftermarket contact frames that drop the
| temperature by around 10 Celsius and ensure flat contact with
| the heat spreader. The stock frame causes the center of the
| heat spreader to dip.
|
| I wonder if the turbo boost is controlled by a proportional-
| integral-derivative (PID) controller.
|
| The idea being that the parameters are fine-tuned to slow the
| processor down as it heats up, but before it overshoots its
| maximum threshold.
|
| If those PID values are tuned to assume flat heat spreader/heat
| sink contact, I can see where a bent heat spreader could cause
| the CPU to overshoot its safe limit and cause errors.
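|
| (Purely to illustrate the speculation above -- this is a
| textbook PID step, not Intel's actual control loop:)
|
|     /* Textbook PID step: returns a clock adjustment driven by the
|        error between a temperature setpoint and the measured value.
|        Gains and structure are generic, not anything Intel does. */
|     struct pid {
|         double kp, ki, kd;   /* tuning gains */
|         double integral;
|         double prev_error;
|     };
|
|     static double pid_step(struct pid *c, double setpoint,
|                            double measured, double dt) {
|         double error = setpoint - measured;
|         c->integral += error * dt;
|         double derivative = (error - c->prev_error) / dt;
|         c->prev_error = error;
|         /* negative output means "back off the clock" */
|         return c->kp * error + c->ki * c->integral + c->kd * derivative;
|     }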
| ezekiel68 wrote:
| Based on the article contents, this doesn't seem to be CPU
| errata. We already know that overclocking above a certain point
| will cause OS crashes. This seems to be system instability just
| below the threshold of crashing. Aggressive power and clock
| settings manifest as this instability without causing an actual
| crash.
|
| I don't find this situation much different than needing to dial
| back BIOS settings when actual crashes are observed.
| sedatk wrote:
| Exactly. I remember when I overclocked my 486DX4-100 to 120MHz,
| everything would work fine except the floppy drive. It just
| wouldn't work for whatever reason. Never thought it was a CPU
| issue; I'd just asked for it.
___________________________________________________________________
(page generated 2024-02-23 23:00 UTC)