Post B2ecfhMjuhbgcFd1fM by slink@fosstodon.org
 (DIR) Post #B2Y3h0ir3I14HIYcrY by gabrielesvelto@mas.to
       2026-01-22T16:00:44Z
       
       1 likes, 2 repeats
       
       In the early days of personal computing CPU bugs were so rare as to be newsworthy. The infamous Pentium FDIV bug is remembered by many, and even earlier CPUs had their own issues (the 6502 comes to mind). Nowadays they've become so common that I encounter them routinely while triaging crash reports sent from Firefox users. Given the nature of CPUs you might wonder how these bugs arise, how they manifest and what can and can't be done about them. 🧵 1/31
       
 (DIR) Post #B2Y3h2eJt7ZqFoS24W by gabrielesvelto@mas.to
       2026-01-22T16:01:05Z
       
       0 likes, 0 repeats
       
       The root of all these issues is fundamentally the same: complexity. Modern cores have become so complex that it's impossible to demonstrate at design time that they will work reliably under all possible conditions, and thoroughly testing them is also infeasible. In addition to ever-increasing logic complexity, the conditions in which they operate have also changed: fixed voltages and frequencies are a thing of the past, complicating physical design. 2/31
       
 (DIR) Post #B2Y3h47QQOS8oN4n4q by gabrielesvelto@mas.to
       2026-01-22T16:01:17Z
       
       0 likes, 0 repeats
       
       Let's start with logic bugs. As you probably know a CPU has a certain amount of visible state: a set of registers that contain data and are manipulated by instructions, an instruction pointer holding the address of the currently executing instruction, and a set of special registers that alter the core's behavior, for example by changing how floating-point operations round their results. 3/31
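
       A minimal sketch of poking one of those special registers, assuming an x86-64 target: the MXCSR rounding mode can be flipped via intrinsics from core::arch. Compilers assume the default rounding mode is in effect, so treat this as an illustration rather than supported practice:

           // Sketch (assumes x86-64): flipping the SSE rounding mode, part of
           // the MXCSR "special register", changes how float results round.
           fn main() {
               use std::arch::x86_64::{
                   _MM_GET_ROUNDING_MODE, _MM_SET_ROUNDING_MODE,
                   _MM_ROUND_DOWN, _MM_ROUND_UP,
               };
               let (a, b) = (1.0f64, 3.0f64);
               unsafe {
                   let saved = _MM_GET_ROUNDING_MODE();
                   _MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);
                   // black_box keeps the divisions from being folded or merged.
                   let lo = std::hint::black_box(a) / std::hint::black_box(b);
                   _MM_SET_ROUNDING_MODE(_MM_ROUND_UP);
                   let hi = std::hint::black_box(a) / std::hint::black_box(b);
                   _MM_SET_ROUNDING_MODE(saved); // the mode is per-thread global state
                   println!("{lo:.20}");
                   println!("{hi:.20}"); // the last bit differs
               }
           }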
       
 (DIR) Post #B2Y3h55goKetpHGv0i by gabrielesvelto@mas.to
       2026-01-22T16:01:27Z
       
       0 likes, 0 repeats
       
       In the early days of integrated CPUs this state was not only visible to the user, but corresponded physically to what was in the core. The registers corresponded to entries in an actual bank of SRAMs inside the core, the instruction pointer was a physical register that would be read every cycle to fetch instructions from memory. In today's CPUs all these things are merely abstractions and the underlying physical reality is dramatically more complex. 4/31
       
 (DIR) Post #B2Y3h6PZtsA5vFaJeK by gabrielesvelto@mas.to
       2026-01-22T16:01:35Z
       
       0 likes, 0 repeats
       
       Modern CPUs contain a tremendous amount of state that they need to track. Hundreds of instructions can be in-flight at any given moment, each of them operates on physical registers which are assigned just-in-time via a mechanism that maps the registers in the ISA to spare physical slots in very large banks. Each instruction is associated with a set of data that is entirely speculative for a very long time, including the instruction itself. 5/31
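       A toy model of the renaming idea, for illustration only (real cores use dedicated hardware structures, and also recycle a physical slot once the instruction that replaced it retires, which is omitted here):

           // Toy register renaming: each of the 16 architectural registers an
           // x86-64 program names is remapped to a spare physical slot on
           // every write, so independent writes don't trample each other.
           struct Renamer {
               map: [usize; 16],  // architectural register -> physical slot
               free: Vec<usize>,  // spare physical slots
           }

           impl Renamer {
               fn new(physical: usize) -> Self {
                   Renamer { map: std::array::from_fn(|i| i), free: (16..physical).collect() }
               }
               // A write gets a fresh slot so older in-flight readers still
               // see the previous value; running out of slots means stalling.
               fn rename_write(&mut self, arch: usize) -> usize {
                   let slot = self.free.pop().expect("no free physical registers: stall");
                   self.map[arch] = slot;
                   slot
               }
               // A read simply follows the current mapping.
               fn rename_read(&self, arch: usize) -> usize {
                   self.map[arch]
               }
           }

           fn main() {
               let mut r = Renamer::new(64);
               let first = r.rename_write(3);   // e.g. mov rbx, 1
               let second = r.rename_write(3);  // mov rbx, 2 gets its own slot
               assert_ne!(first, second);
               assert_eq!(r.rename_read(3), second); // younger readers see the new slot
           }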
       
 (DIR) Post #B2Y3h7UvrS2TI966dM by gabrielesvelto@mas.to
       2026-01-22T16:01:48Z
       
       0 likes, 0 repeats
       
       Instruction addresses, operands and dependencies are tracked as the CPU fetches and executes a stream of instructions which might or might not have to be executed, depending on branch prediction. In the case of a misprediction the state of the "wrong" instructions needs to be discarded. Similarly, instruction timing is no longer predictable: memory accesses can take anywhere from a few cycles to hundreds, and fetch their data through different structures both inside and outside of the core. 6/31
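
       A sketch of what that discarding means, with a reorder-buffer-like queue standing in for the real tracking structures (the names and the Vec are made up for illustration):

           // In-flight instructions sit in program order; when a branch turns
           // out to be mispredicted, everything fetched after it is wrong-path
           // work and gets squashed, its resources reclaimed.
           #[derive(Debug)]
           struct InFlight { seq: u64, desc: &'static str }

           fn squash_after(rob: &mut Vec<InFlight>, branch_seq: u64) {
               // Keep the branch itself (it did execute, it just predicted the
               // wrong target) and everything older than it.
               rob.retain(|i| i.seq <= branch_seq);
           }

           fn main() {
               let mut rob = vec![
                   InFlight { seq: 10, desc: "add  (before the branch)" },
                   InFlight { seq: 11, desc: "jne  (mispredicted)" },
                   InFlight { seq: 12, desc: "load (wrong path)" },
                   InFlight { seq: 13, desc: "mul  (wrong path)" },
               ];
               squash_after(&mut rob, 11);
               println!("{rob:?}"); // only seq 10 and 11 remain
           }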
       
 (DIR) Post #B2Y3h8o6zcyVLv4wAS by gabrielesvelto@mas.to
       2026-01-22T16:02:00Z
       
       0 likes, 0 repeats
       
       Instruction faults cannot be predicted either: if a memory access fails because it accesses a protected memory area, the flow of instructions must be stopped, undoing any work from instructions that came after the faulting one and steering the core towards executing code provided by the operating system for such cases, giving the impression that execution stopped right at the fault in a perfectly sequential way. 7/31
       
 (DIR) Post #B2Y3h9ymdR6azJ4yRM by gabrielesvelto@mas.to
       2026-01-22T16:02:08Z
       
       0 likes, 0 repeats
       
       And that's without mentioning the large amount of hidden state carried by a CPU purely for performance reasons: virtual-to-physical address translations are done via tables stored in memory, but this data needs to be cached in a translation lookaside buffer (TLB) inside the core. Cache lines can be shared by different cores and must track their state: is a line owned exclusively by one core? Shared among several? Is the local copy dirty, so that other cores must fetch the up-to-date data from here? 8/31
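
       The textbook version of that cache-line bookkeeping is the MESI protocol; here is a sketch of its states and one transition, assuming plain MESI rather than the extended variants (MESIF, MOESI, ...) real CPUs use:

           // Textbook MESI cache-line states, sketching the bookkeeping above.
           #[derive(Clone, Copy, Debug, PartialEq)]
           enum LineState {
               Modified,  // owned here and dirty: ours is the only good copy
               Exclusive, // owned here, clean: we may write without asking
               Shared,    // clean copy that other cores may also hold
               Invalid,   // no usable data
           }

           // What happens to OUR copy when ANOTHER core wants to write the line.
           fn on_remote_write(state: LineState) -> LineState {
               match state {
                   // The only up-to-date data is here: it must be written back
                   // or forwarded before our copy can be dropped.
                   LineState::Modified => LineState::Invalid,
                   // Clean copies are simply thrown away.
                   LineState::Exclusive | LineState::Shared => LineState::Invalid,
                   LineState::Invalid => LineState::Invalid,
               }
           }

           fn main() {
               assert_eq!(on_remote_write(LineState::Modified), LineState::Invalid);
               assert_eq!(on_remote_write(LineState::Exclusive), LineState::Invalid);
               assert_eq!(on_remote_write(LineState::Shared), LineState::Invalid);
           }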
       
 (DIR) Post #B2Y3hB20ivHUFbb46q by gabrielesvelto@mas.to
       2026-01-22T16:02:16Z
       
       0 likes, 0 repeats
       
       The execution of every single instruction can alter a significant chunk of this enormous amount of state, and must do so reliably. But as I mentioned before it's impossible to test all possible combinations, and some sequences might lead to inconsistent or corrupted state, which in turn manifests itself just like a software bug would. 9/31
       
 (DIR) Post #B2Y3hCByPMqPqnGXHE by gabrielesvelto@mas.to
       2026-01-22T16:02:23Z
       
       0 likes, 0 repeats
       
       Here are a few examples I've encountered: the instruction pointer is fetched from the stack while returning from a function call, but it appears to be wrong, possibly because the wrong instruction pointer was sent to the instruction fetch pipeline: https://bugzilla.mozilla.org/show_bug.cgi?id=1746270 10/31
       
 (DIR) Post #B2Y3hDQXog5tgH5gcy by gabrielesvelto@mas.to
       2026-01-22T16:02:31Z
       
       0 likes, 0 repeats
       
       The code expected a piece of data to be loaded from memory, but the load/store unit returned stale data from a previous fetch: https://bugzilla.mozilla.org/show_bug.cgi?id=1687914 11/31
       
 (DIR) Post #B2Y3hEEWoq4oBITbXM by gabrielesvelto@mas.to
       2026-01-22T16:02:38Z
       
       0 likes, 0 repeats
       
       The instruction pointer associated with an instruction is corrupted, so the instruction it points to is not the one that was actually executing. You can tell because a load appears to cause a store exception, or a jump an access exception: https://bugzilla.mozilla.org/show_bug.cgi?id=1820832 12/31
       
 (DIR) Post #B2Y3hF8BTub0xuW3Hs by gabrielesvelto@mas.to
       2026-01-22T16:02:45Z
       
       0 likes, 0 repeats
       
       Other bugs could have even worse effects, such as AMD's infamous Barcelona TLB bug, which would put the core into a state from which recovery wasn't possible, effectively halting execution: https://arstechnica.com/gadgets/2007/12/linux-patch-sheds-light-on-amds-tlb-errata/ 13/31
       
 (DIR) Post #B2Y3hG4Jzl6HsDiTuC by gabrielesvelto@mas.to
       2026-01-22T16:02:53Z
       
       0 likes, 0 repeats
       
       In all these cases the likely culprit is a bug in the machinery that tracks the internal CPU state when an unlikely sequence of events happens: a rapid series of interrupts or context switches, certain instructions executing just as the processor leaves or enters a particular mode of execution. These are not unlike software bugs where you missed checking a particular condition at a specific point: most of the time it doesn't matter, except for that one time when it does. 14/31
       
 (DIR) Post #B2Y3hH5mBprH31P9oO by gabrielesvelto@mas.to
       2026-01-22T16:03:00Z
       
       0 likes, 0 repeats
       
       Reading the errata of any relatively recent CPU you will find the same wording applied to every known issue: "Under complex microarchitectural conditions...". That's hardwarese for "a state we had not anticipated we could end up in". Try looking it up yourself in an errata document such as this one: https://edc.intel.com/content/www/us/en/secure/design/confidential/products-and-solutions/processors-and-chipsets/tiger-lake/11th-generation-intel-core-processor-family-specification-update/errata-details/ 15/31
       
 (DIR) Post #B2Y3hIMTTEoEz6E0Tg by gabrielesvelto@mas.to
       2026-01-22T16:03:07Z
       
       0 likes, 0 repeats
       
       Now you might wonder if these kinds of bugs can be fixed after the fact. Well, sometimes they can, sometimes they can't. CPUs are not purely hard-coded beasts: they rely on microcode for part of their operation. Traditionally microcode was a set of internal instructions that the CPU ran to execute the external ones. That's mostly not the case anymore, and modern microcode ships not only with implementations of complex instructions but also with a significant amount of configuration. 16/31
       
 (DIR) Post #B2Y3hJGU6zc1moQjmS by gabrielesvelto@mas.to
       2026-01-22T16:03:14Z
       
       0 likes, 0 repeats
       
       As an example microcode can be used to disable certain circuits. Imagine something like a loop buffer, a structure that captures decoded instructions and re-executes them in a loop bypassing instruction fetches. If it turns out to be buggy a microcode update might disable it entirely, effectively sacrificing an optimization for stability. 17/31
       
 (DIR) Post #B2Y3hKLq4ZUP9hwWlU by gabrielesvelto@mas.to
       2026-01-22T16:03:21Z
       
       0 likes, 0 repeats
       
       When implementing a new core it is commonplace to implement new structures, and especially more aggressive performance features, in a way that makes it possible to disable them via microcode. This gives the design team the flexibility to ship a feature only if it's been proven to be reliable, or delay it for the next iteration. 18/31
       
 (DIR) Post #B2Y3hLFqiKIBxQ9G4G by gabrielesvelto@mas.to
       2026-01-22T16:03:28Z
       
       0 likes, 0 repeats
       
       Microcode can also be used to work around conditions caused by data races, by injecting bubbles in the pipeline under certain conditions. If the execution of two back-to-back operations is known to cause a problem it might be possible to avoid it by delaying the execution of the second operation by one cycle, again trading performance for stability. 19/31
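
       A sketch of that kind of workaround, with a made-up erratum (a multiply immediately followed by a divide misbehaves) standing in for a real one:

           // Hypothetical microcode-style workaround: never let a Div issue in
           // the cycle right after a Mul; inject a one-cycle bubble instead.
           #[derive(Clone, Copy, PartialEq, Debug)]
           enum Uop { Mul, Div, Bubble }

           fn schedule(input: &[Uop]) -> Vec<Uop> {
               let mut out = Vec::new();
               for &u in input {
                   if u == Uop::Div && out.last() == Some(&Uop::Mul) {
                       out.push(Uop::Bubble); // trade one cycle for correctness
                   }
                   out.push(u);
               }
               out
           }

           fn main() {
               assert_eq!(schedule(&[Uop::Mul, Uop::Div]),
                          [Uop::Mul, Uop::Bubble, Uop::Div]);
           }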
       
 (DIR) Post #B2Y3hMRaIBH1e6e8zw by gabrielesvelto@mas.to
       2026-01-22T16:03:34Z
       
       0 likes, 0 repeats
       
       However not all bugs can be fixed this way. Bugs within logic that sits on a critical path can rarely be fixed. Additionally, some microcode fixes only work if the microcode is loaded at boot time, right when the CPU is initialized. If the updated microcode is loaded later by the operating system it might be too late to reconfigure the core's operation: in those cases you'll need an updated UEFI firmware for the fix to work. 20/31
       
 (DIR) Post #B2Y3hNRGaqc6jPVP8q by gabrielesvelto@mas.to
       2026-01-22T16:03:41Z
       
       0 likes, 0 repeats
       
       But this is just logic bugs, and unfortunately nowadays there's a lot more than that. If you've followed the controversy around Intel's first-generation Raptor Lake CPUs you'll know that they had issues causing seemingly random failures. These were caused by too little voltage being provided to the core under certain conditions, which in turn would often cause a race condition within certain circuits, leading to wrong results being delivered. 21/31
       
 (DIR) Post #B2Y3hOiJqvqegaUXMO by gabrielesvelto@mas.to
       2026-01-22T16:03:49Z
       
       0 likes, 0 repeats
       
       To understand how this works keep this in mind: the maximum frequency at which a CPU can operate is dictated by the longest path through the circuits that make up a pipeline stage. Signals propagating via wires and turning transistors on and off take time, and because modern circuit design is strictly synchronous, all the signals must reach the end of the stage before the end of a clock cycle. 22/31
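
       In textbook static-timing terms (a general formulation, not specific to any particular CPU) the constraint every stage must meet is:

           T_clk >= t_clk_to_q + t_logic(longest path) + t_setup

       where t_clk_to_q is the delay from the clock edge to the previous pipeline register's output, t_logic is the propagation delay through the slowest chain of gates and wires in the stage, and t_setup is how early the next register needs its input to be stable. The maximum frequency is then f_max = 1 / T_clk, and the margin by which the inequality holds is the timing slack that comes up later in this thread.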
       
 (DIR) Post #B2Y3hPmbsSsI0BVTge by gabrielesvelto@mas.to
       2026-01-22T16:03:56Z
       
       0 likes, 0 repeats
       
       When a clock cycle ends, all the signals resulting from a pipeline stage are stored in a pipeline register: a storage element, invisible to the user, that separates pipeline stages. So if a stage adds two numbers, for example, the pipeline register will hold the result of this addition. On the next cycle this result will be fed to the circuits that make up the next pipeline stage. If the result of the addition is an address, for example, it might be used to access the cache. 23/31
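
       A toy two-stage pipeline to make this concrete, with structs and function calls standing in for flip-flops and combinational logic:

           // Stage 1 adds base and offset; the pipeline register holds the sum
           // across the cycle boundary; stage 2 uses it as an index, standing
           // in for the cache access mentioned above.
           struct PipelineReg { sum: usize }

           fn stage1_add(base: usize, offset: usize) -> PipelineReg {
               PipelineReg { sum: base + offset } // latched at the end of the cycle
           }

           fn stage2_access(reg: &PipelineReg, memory: &[u8]) -> u8 {
               memory[reg.sum] // the latched value feeds the next stage
           }

           fn main() {
               let memory = [10u8, 20, 30, 40];
               let latched = stage1_add(1, 2);               // cycle N
               let value = stage2_access(&latched, &memory); // cycle N+1
               assert_eq!(value, 40);
           }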
       
 (DIR) Post #B2Y3hQtjjSAZSZqgQy by gabrielesvelto@mas.to
       2026-01-22T16:04:05Z
       
       1 likes, 0 repeats
       
       The speed at which signals propagate in circuits is proportional to the voltage being applied. In older CPUs this voltage was fixed, but in modern ones it changes thousands of times per second to save power. Providing just as much voltage as a given clock frequency needs can dramatically reduce power consumption, but providing too little may cause a signal to arrive late, or the wrong signal to reach the pipeline register, causing in turn a cascade of failures. 24/31
       
 (DIR) Post #B2Y3hVfjyuweGcL4oS by gabrielesvelto@mas.to
       2026-01-22T16:04:12Z
       
       0 likes, 0 repeats
       
       In Raptor Lake's case a very common pattern that I and others have noticed is that sometimes the wrong 8-bit value is delivered. This happens when reading 8-bit registers such as AH or AL, which are just slices of larger integer registers and don't have dedicated physical storage. The operation that pulls the upper or lower 8 bits out of the bottom 16 bits of a full register is usually done via a multiplexer, or MUX. 25/31
       
 (DIR) Post #B2Y3hapqml0EHQ6ilU by gabrielesvelto@mas.to
       2026-01-22T16:04:20Z
       
       0 likes, 0 repeats
       
       This is a circuit with two sets of 8 wires that go into it, plus one wire to select which inputs will go to the output, and a single set of 8 wires going out. Depending on the value of the select signal you'll get one or the other set of inputs. Guess what happens if the select signal arrives too late, for example right after the end of the clock cycle? You get the wrong set of bits in the output. 26/31
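
       A software-level sketch of that MUX and of the AH/AL slicing it implements (functions, not circuits, so the timing failure can only be described in a comment):

           // A 2:1 byte multiplexer: one select wire decides which set of 8
           // input wires reaches the output.
           fn mux8(low: u8, high: u8, select_high: bool) -> u8 {
               if select_high { high } else { low }
           }

           // Reading AL or AH out of the bottom 16 bits of a 64-bit register.
           // If the select signal latched the wrong value because it arrived
           // after the clock edge, you would get the OTHER byte here.
           fn read_byte_reg(rax: u64, want_ah: bool) -> u8 {
               let al = (rax & 0xff) as u8;
               let ah = ((rax >> 8) & 0xff) as u8;
               mux8(al, ah, want_ah)
           }

           fn main() {
               let rax = 0x1234u64;
               assert_eq!(read_byte_reg(rax, false), 0x34); // AL
               assert_eq!(read_byte_reg(rax, true), 0x12);  // AH
           }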
       
 (DIR) Post #B2Y3hfmqNMZJdXjtGy by gabrielesvelto@mas.to
       2026-01-22T16:04:33Z
       
       0 likes, 0 repeats
       
       I can't be sure that this is exactly what's happening on Raptor Lake CPUs, it's just a theory. But a modern CPU core has millions upon millions of these circuits, and a timing issue in any of them can lead to this kind of problem. And that's to say nothing of voltage delivery across a core being an exquisitely analog problem, with fluctuations that can be caused by all sorts of events: the instructions being executed, temperature, etc... 27/31
       
 (DIR) Post #B2Y3hkuTKQdpWeLYMC by gabrielesvelto@mas.to
       2026-01-22T16:04:43Z
       
       0 likes, 0 repeats
       
       You might also remember that Raptor Lake CPU problems get worse over time. That's because circuits degrade, and applying the wrong voltage can make them degrade faster. Circuit degradation is a research field of its own, but its effects are broadly the same: resistance in wires goes up, the capacitance of trench capacitors goes down, etc… and the combined effect of these changes is that circuits get slower and need more voltage to operate at the same frequency. 28/31
       
 (DIR) Post #B2Y3hpm9EnxCcHUTZo by gabrielesvelto@mas.to
       2026-01-22T16:04:50Z
       
       0 likes, 0 repeats
       
       When CPUs ship, their most performance-critical circuits are supposed to come with a certain amount of timing slack that compensates for this effect. Over time this slack gets smaller. If a CPU is already operating near the edge, aging might cut the slack all the way down to zero, causing the core to fail consistently. 29/31
       
 (DIR) Post #B2Y3hv43ZeFZ1GuLgW by gabrielesvelto@mas.to
       2026-01-22T16:04:56Z
       
       0 likes, 0 repeats
       
       And remember, there are a lot of variables involved: timing broadly depends on transistor sizing and wire resistance. Higher voltages improve transistor performance but increase power dissipation and thus temperature. Temperature increases resistance, which decreases propagation speed in wires. It's a delicate dance to keep a dynamic equilibrium of optimal power consumption, adequate performance and reliability. 30/31
       
 (DIR) Post #B2Y3i129NeI3X63uXw by gabrielesvelto@mas.to
       2026-01-22T16:05:08Z
       
       0 likes, 0 repeats
       
       All in all, modern CPUs are beasts of tremendous complexity and bugs have become inevitable. I wish the industry would spend more resources on addressing them, improving design and testing before CPUs ship to users, but alas most of the tech sector seems more keen on playing with unreliable statistical toys than on ensuring that the hardware users pay good money for works correctly. 31/31
       
 (DIR) Post #B2Y3i7KnxCmF4gptSK by gabrielesvelto@mas.to
       2026-01-22T16:07:12Z
       
       0 likes, 0 repeats
       
       Bonus end-of-thread post: when you encounter these bugs try to cut the hardware designers some slack. They work on increasingly complex stuff, with increasingly pressing deadlines and under upper management who rarely understands what they're doing. Put the blame for these bugs where it's due: on executives that haven't allocated enough time, people and resources to make a quality product.
       
 (DIR) Post #B2ZlVC2XjpCx2qtyT2 by Suiseiseki@freesoftwareextremist.com
       2026-01-23T12:04:24.021019Z
       
       0 likes, 0 repeats
       
       @gabrielesvelto >1/31
       Is using Pleroma really that hard?
       
 (DIR) Post #B2ZlcLd2UCckoKNsMC by Zergling_man@sacred.harpy.faith
       2026-01-23T12:05:06.090424Z
       
       0 likes, 0 repeats
       
       @Suiseiseki @gabrielesvelto lmao mastodongs
       
 (DIR) Post #B2ecfhMjuhbgcFd1fM by slink@fosstodon.org
       2026-01-23T06:39:21Z
       
       0 likes, 0 repeats
       
       @gabrielesvelto thank you for this great, informative overview. numerous times, i had asked myself if a reported crash could be caused by a hardware bug, and so far i would think i never saw a real case - possibly due to the software i work on running in more controlled environments. but i would be curious how a crash from a real hardware bug could be classified automatically. do you have pointers to foss tools?
       
 (DIR) Post #B2ecfiY7VsIwHpxd2m by gabrielesvelto@mas.to
       2026-01-23T07:34:58Z
       
       0 likes, 0 repeats
       
       @slink oh yes, we have tools for that. First however I'd point you to my thread about memory errors because those are even more common when analyzing crashes: https://fosstodon.org/@gabrielesvelto/112407741329145666
       For crash analysis we have a rust crate to analyze minidumps, which we generate when Firefox crashes. The crate can be used both as a tool and as a library: https://github.com/rust-minidump/rust-minidump
       
 (DIR) Post #B2ecfjQ0HXPEyxAf20 by gabrielesvelto@mas.to
       2026-01-23T07:36:38Z
       
       0 likes, 0 repeats
       
       @slink this crate can detect patterns that suggest a memory error was encountered or that the crash was inconsistent and thus most likely due to a hardware bug. If you check out the output schema of the tool you'll find two fields called "possible_bit_flips" and "crash_inconsistencies" that capture this information: https://github.com/rust-minidump/rust-minidump/blob/main/minidump-processor/json-schema.md
       
 (DIR) Post #B2ecfjc3Yj2zaKoHom by jeffcliff@shitposter.world
       2026-01-25T20:18:58.237514Z
       
       0 likes, 1 repeats
       
       @gabrielesvelto @slink STOP USING GITHUB