[HN Gopher] Intel's $475M error: the silicon behind the Pentium ...
___________________________________________________________________
Intel's $475M error: the silicon behind the Pentium division bug
Author : gslin
Score : 330 points
Date   : 2024-12-28 21:48 UTC (1 day ago)
(HTM) web link (www.righto.com)
(TXT) w3m dump (www.righto.com)
| kens wrote:
| Author here if anyone has Pentium questions :-)
|
| My Mastodon thread about the bug was on HN a few weeks ago, so
| this might seem familiar, but now I've finished a detailed blog
| post. The previous HN post has a bunch of comments:
| https://news.ycombinator.com/item?id=42391079
| mras0 wrote:
| Great article and analysis as always, thanks! Somewhat crazy to
| remember that a (as you argue) minor CPU erratum made worldwide
| headlines. So many worse ones (like the ones you mention from
| Intel, but others as well) are out there, completely forgotten.
|
| For the Pentium, I'm curious about the FPU value stack (or
| whatever the correct term is) rework they did. It's been a long
| time, but didn't they do some kind of early "register renaming"
| thing that you had to manually manage with careful
| fxch's?
| lallysingh wrote:
| AFAIK, the FPU was a stack calculator. So you pushed things
| on and ran calculations on the stack.
| https://en.wikibooks.org/wiki/X86_Assembly/Floating_Point
| Sesse__ wrote:
| It's only a stack machine in front, really. Behind the
| scenes, it's probably just eight registers (the stack is a
| fixed size; it doesn't spill to memory or anything).
| Sesse__ wrote:
| Yes, internally fxch is a register rename--_and_ fxch can go
| in the V-pipe and takes only one cycle (Pentium has two
| pipes, U and V).
|
| IIRC fadd and fmul were both 3/1 (three cycles latency, one
| cycle throughput), so you'd start an operation, use the free
| fxch to get something else to the top, and then do two other
| operations while you were waiting for the operation to
| finish. That way, you could get long strings of FPU
| operations at effectively 1 op/cycle if you planned things
| well.
|
| IIRC, MSVC did a pretty good job of it, too. GCC didn't,
| really (and thus Pentium GCC was born).
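|
| To make the trick concrete, here's a minimal C sketch of the
| same idea (the function name and the three-accumulator split
| are mine, not from any Pentium compiler): three independent
| chains hide the 3-cycle FADD latency. On the x87 stack you'd
| rotate which chain is at the top with the free FXCH; in C,
| three variables stand in for that:
|
|     #include <stdio.h>
|
|     /* Three independent accumulator chains: with 3-cycle FADD
|        latency and 1/cycle issue, the adder can stay busy every
|        cycle. On the Pentium you'd keep all three sums on the
|        x87 stack and rotate them to the top with FXCH. */
|     double sum3(const double *a, long n) {
|         double s0 = 0.0, s1 = 0.0, s2 = 0.0;
|         long i = 0;
|         for (; i + 2 < n; i += 3) {
|             s0 += a[i];      /* chain 0 */
|             s1 += a[i + 1];  /* chain 1 */
|             s2 += a[i + 2];  /* chain 2 */
|         }
|         for (; i < n; i++)   /* leftover elements */
|             s0 += a[i];
|         return s0 + s1 + s2;
|     }
|
|     int main(void) {
|         double a[7] = {1, 2, 3, 4, 5, 6, 7};
|         printf("%g\n", sum3(a, 7));  /* prints 28 */
|         return 0;
|     }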
| ack_complete wrote:
| FMUL could only be issued every other cycle, which made
| scheduling even more annoying. Doing something like a
| matrix-vector multiplication was a messy game of
| FADD/FMUL/FXCH hot potato since for every operation one of
| the arguments had to be the top of the stack, so the TOS
| was constantly being replaced.
|
| Compilers got pretty good at optimizing straight line math
| but were not as good at cases where variables needed to be
| kept in the stack during a loop, like a running sum. You
| had to get the order of exchanges just right to preserve
| stack order across loop iterations. The compilers at the
| time often had to spill to memory or use multiple FXCHs at
| the end of the loop.
| Sesse__ wrote:
| > FMUL could only be issued every other cycle, which made
| scheduling even more annoying.
|
| Huh, are you sure? Do you have any documentation that
| clarifies the rules for this? I was under the impression
| that something like `FMUL st, st(2) ; FXCH st(1) ; FMUL
| st, st(2)` would kick off two muls in two cycles, with no
| stall.
| Tuna-Fish wrote:
| Agner Fog's manuals are clear on this. Only the last of
| FMUL's 3 cycles can overlap with another FMUL.
|
| You can immediately overlap with a FADD.
| icehawk wrote:
| I was about to ask if the explanation of floating point numbers
| was using Avogadro's number on purpose, but then I realized the
| other number was Planck's constant.
| kens wrote:
| Yes, I wanted to use meaningful floating point examples
| instead of random numbers. You get a gold star for noticing
| :-)
| skissane wrote:
| > The bug is presumably in the Pentium's voluminous microcode.
| The microcode is too complex for me to analyze, so don't expect
| a detailed blog post on this subject.
|
| How hard is it to "dump" the microcode into a bitstream? Could
| it be done programmatically from high-resolution die
| photographs? Of course, I appreciate that's probably the easy
| part in comparison to reverse engineering what the bitstream
| means.
|
| > By carefully examining the PLA under a microscope
|
| Do you do this stuff at home? What kind of equipment do you
| have in your lab? How did you develop the skills to do all
| this?
| kens wrote:
| Dumping the microcode into a bitstream can be done in an
| automated way if you have clear, high-resolution die photos.
| There are programs to generate ROM bitstreams from photos.
| Part of the problem is removing all the layers of metal to
| expose the transistors. My process isn't great, so the
| pictures aren't as clear as I'd like. But yes, the hard part
| is figuring out what the microcode bitstream means. Intel's
| patents explained a lot about the 8086 microcode structure,
| but Intel revealed much less about later processors.
|
| I do this stuff at home. I have an AmScope metallurgical
| microscope; a metallurgical microscope shines light down
| through the lens, rather than shining the light from
| underneath like a biological microscope. Thus, the
| metallurgical microscope works for opaque chips. The Pentium
| is reaching the limits of my microscope, since the feature
| size is about the wavelength of light. I don't have any
| training in this; I learned through reading and
| experimentation.
| dekhn wrote:
| One tidbit to add about scopes: some biological scopes do
| use "epi" illumination like metallurgical scopes. It's
| commonly used on high end scopes, in combination with laser
| illumination and fluorescence. They are much more
| complicated and require much better alignment than a
| regular trans illumination scope.
|
| I suppose you might be able to get slightly better
| resolution using a shorter wavelength, but at that point,
| it requires a lot of technical skill and environmental
| conditions and time and money. Just getting to the point
| you've reached (and knowing what the limitations are) can
| be satisfying in itself.
| ernst_mulder wrote:
| Thank you very much for this detailed article.
|
| I never realised this is how floating point division can be
| implemented. Actually funny how I didn't realise that multiple
| integer division steps are required to implement floating point
| division :-)
|
| In hindsight one could wonder why the unused parts of the
| lookup table were not filled with 2 and -2 in the first place.
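|
| For anyone else who hadn't realised it either, here's a toy C
| sketch of the idea (plain radix-2 restoring division, my
| simplification; the Pentium's SRT divider retires a radix-4
| digit from -2..2 per cycle, picked via the lookup table whose
| missing entries caused the bug):
|
|     #include <stdio.h>
|     #include <stdint.h>
|
|     /* floor(n * 2^frac_bits / d): one quotient bit per step,
|        just shift, compare, subtract; this iterative loop is
|        what a hardware divider runs on the mantissas. */
|     static uint32_t div_fixed(uint32_t n, uint32_t d, int frac_bits) {
|         uint64_t num = (uint64_t)n << frac_bits;  /* scaled dividend */
|         uint64_t rem = 0, q = 0;
|         for (int i = 63; i >= 0; i--) {  /* one bit per iteration */
|             rem = (rem << 1) | ((num >> i) & 1);
|             q <<= 1;
|             if (rem >= d) { rem -= d; q |= 1; }
|         }
|         return (uint32_t)q;
|     }
|
|     int main(void) {
|         /* 3/2 = 1.5 -> 0x18000 in Q16 fixed point */
|         printf("0x%x\n", div_fixed(3, 2, 16));
|         return 0;
|     }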
| ksec wrote:
| In my view, this $475M was perhaps the best marketing spend for
| Intel. Because of the bug and recall, everyone, including those
| not in tech, knew about Intel. Coming from the 486, when people
| were expecting a 586 or 686 but instead got "Pentium", this bug
| and recall built a reputation and goodwill that carried over
| later to the Pentium MMX.
| wmf wrote:
| Nah, Intel already did a big Pentium marketing blitz with the
| bunny people before this bug.
| xattt wrote:
| Bunny people were part of the MMX and PII marketing.
| fourseventy wrote:
| Didn't Intel have floating point division issues more recently as
| well?
| hinkley wrote:
| Inflation-adjusted, that's over $1 billion today. And they do
| more mitigations with microcode these days.
| xattt wrote:
| Some irony if internal estimates of the financial damage
| were under- or over-estimated because they were done on a
| defective chip.
| fuzztester wrote:
| yes.
|
| their "1 in a billion" (excuse) became $1 billion (cost to
| them).
|
| of course, the CEOs not only go scot-free, but get to bail
| out with their golden parachutes, while the shareholders and
| public take the hit.
|
| https://en.m.wikipedia.org/wiki/Golden_parachute
| bolognafairy wrote:
| Those evil CEOs pulling the wool over the eyes of the poor
| shareholders!
|
| This is the employment contract that was negotiated and
| agreed to by the board / shareholders.
| fuzztester wrote:
| Where yuh bin livin all yore laafe, pilgrim? Under a
| Boulder in Colorado, mebbe? Dontcha know dat contracts
| can be gamed, and hev bin fer yeers, if not deecades? Dis
| here ain't Aahffel, ya know?
|
| Come eat some chili widdus.
|
| Id'll shore put some hair on yore chest, and grey cells
| in yore coconut.
|
| # sorry, in a punny mood and too many spaghetti western
| movies
| KennyBlanken wrote:
| Oh, it gets even better. US taxpayers are giving them
| billions for "national security" reasons.
|
| Nothing like giving piles of cash to a grossly incompetent
| company (the Pentium math bug, Puma cablemodem issues,
| their shitty 4G cellular radios, extensive issues with
| gigabit and 2.5G network interfaces, and now the whole
| 13th/14th gen processor self-destruction mess.)
| kens wrote:
| There's an FSIN trig inaccuracy, but I don't know of other
| division issues:
| https://randomascii.wordpress.com/2014/10/09/intel-underesti...
| hinkley wrote:
| > Intel's whitepaper claimed that a typical user would encounter
| a problem once every 27,000 years, insignificant compared to
| other sources of error such as DRAM bit flips.
|
| > However, IBM performed their own analysis,[29] suggesting that
| the problem could hit customers every few days.
|
| I bet these aren't as far off as they seem. Intel seems to be
| considering a single user, while I suspect IBM is thinking in
| terms of support calls.
|
| This is a problem I've had at work. When you process 100
| million requests a day, the one-in-a-billion problem is hitting
| you a few times a month. If it's something a customer or, worse, a
| manager notices, they ignore the denominator and suspect you all
| of incompetence. Four times a month can translate into "all the
| time" in the way humans bias their experiences. If you get two
| statistical clusters of three in a week, someone will lose their
| shit.
| kens wrote:
| No, IBM's estimate is for a single user. IBM figures that a
| typical spreadsheet user does 5000 divides per second when
| recalculating and does 15 minutes of recalculating a day. IBM
| also figures that the numbers people use are 90 times as likely
| to cause an error as Intel's uniformly-distributed numbers. The
| result is one user will have an error every 24 days.
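|
| As a back-of-envelope check (assuming Intel's oft-quoted base
| rate of about one error per 9 billion random divides; the
| other numbers are from above):
|
|     #include <stdio.h>
|
|     /* IBM-style estimate: 5000 divides/sec, 15 min of
|        recalculation per day, operands 90x more likely than
|        random to hit a bad divisor pattern. */
|     int main(void) {
|         double per_divide      = 1.0 / 9e9;   /* Intel's base rate */
|         double divides_per_day = 5000.0 * 15 * 60;
|         double errors_per_day  = divides_per_day * 90.0 * per_divide;
|         printf("one error every %.0f days\n", 1.0 / errors_per_day);
|         return 0;  /* prints ~22: the same ballpark as IBM's 24 */
|     }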
| hinkley wrote:
| Ah.
|
| The other failure mode that occurred to me is that if a
| spreadsheet is involved, you could keep running the same calc
| on a bad input for months or even years when aggregating
| intermediate values over units of time. A problem that
| happens every time you run a calculation is very different
| from one that happens at random. Better in some ways and
| worse in others.
| jiggawatts wrote:
| That's also a clearly flawed analysis, because the numbers
| mostly don't change between re-computations of the
| spreadsheet cell values!
|
| E.g.: Adding a row doesn't invalidate calculations for
| previous rows in typical spreadsheet usage. The bug is
| deterministic, so repeating successful calculations over and
| over with the same numbers won't ever trigger the bug.
| kens wrote:
| Yes, the book "Inside Intel" makes the same argument about
| spreadsheets (p364). My opinion is that Intel's analysis is
| mostly objective, while IBM's analysis is kind of a scam.
| cornholio wrote:
| IBM's result is correct if we interpret "one user
| experiences the problem every few days" as "one in a
| million users will experience the problem 5000 times a
| second, for 15 minutes every day they use the spreadsheet
| with certain values". It's an average that makes no
| sense.
| wat10000 wrote:
| Spreadsheets Georg....
| dboreham wrote:
| Another great article from Ken. I remember this particularly
| because the first PC that I bought with my own money had an
| affected CPU. Prior to this era I hadn't been much interested in
| PCs because they couldn't run "real" software. But Windows NT
| changed that (thank you Mr. Cutler), and Taiwanese sourced low
| cost motherboards made it practical to build your own machine, as
| many people still do today. Ken touched on the fact that it was
| easy for users to check if their CPU was affected. I remember
| that this was as easy as typing a division expression with the
| magic numbers into Excel. If MS had released a version of Excel
| that worked around the bug, I suspect fewer users would have
| claimed their replacement device!
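|
| For reference, the magic numbers that circulated (Tim Coe's
| operand pair, if I remember right) still make a nice one-liner:
|
|     #include <stdio.h>
|
|     /* A correct FPU prints 1.333820449136241; a flawed Pentium
|        famously returned 1.333739068902037, wrong from the
|        fourth decimal place on. */
|     int main(void) {
|         printf("%.15f\n", 4195835.0 / 3145727.0);
|         return 0;
|     }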
| ryao wrote:
| Couldn't these PCs run 386BSD?
| wmf wrote:
| Yeah, there was BSD, Coherent, SCO, Xenix, etc. Arguably OS/2
| was also a "real" operating system.
| urbandw311er wrote:
| What an interesting and utterly dedicated analysis. Thank you so
| much for all your work analysing the silicon and sharing your
| findings. I particularly like how you're able to call out Intel
| on the actual root cause, which their PR made sound like
| something analogous to a trivial omission but which was, in fact,
| less forgivable and more blameworthy, i.e. they stuffed up their
| table generation algorithm.
| ThrowawayTestr wrote:
| >Smith posted the email on a Compuserve forum, a 1990s version of
| social media.
|
| I hate how this sentence makes me feel.
| lizzas wrote:
| My initial feeling is: that data is probably mostly unmined and
| lost. Lucky bastards!
| lotsofpulp wrote:
| I like to use the 1900s instead of the 1990s.
| lazide wrote:
| Does it help, or make it worse, if you say it as 'late
| 1900s'?
| tgma wrote:
| Intel $475B error: not building a decent GPU
| lizzas wrote:
| Lack of clairvoyance? Missing out on mobile was more obvious
| tho.
| phire wrote:
| More explicitly: in 2006, Apple asked Intel to make a SoC for
| their upcoming product... the iPhone.
|
| At the time, Intel was one of the leading ARM SoC providers;
| their custom XScale ARM cores were faster than anything from
| ARM Inc themselves. It was the perfect line of chips for
| smartphones.
|
| The MBA types at Intel ran some sales projections and decided
| that such a chip wasn't likely to be profitable. There was
| apparently debate within Intel: the engineering types wanted
| to develop the product line anyway, and others wanted to win
| goodwill from Apple. But the MBA types won. Not only did
| they reject Apple's request for an iPhone SoC, but they
| immediately sold off their entire XScale division to Marvell
| (who did nothing with it), so they wouldn't even be able to
| change their mind later if they wanted to.
|
| With hindsight, I think we can safely say Intel's projections
| for iPhone sales were very wrong. They would have easily made
| their money back on just the sales from the first-gen iPhone,
| and Apple would probably have gone back to Intel for at least
| a few generations. Even if Apple dumped them, Intel would have
| had a great product to sell to the rapidly growing market of
| Android smartphones in the early 2010s.
|
| -----------
|
| But I think it's actually far worse than just Intel missing
| out on the mobile market.
|
| In 2008, Apple acquired P.A. Semi and started work on their
| own custom ARM processors (and ARM SoCs): the processors
| which Apple eventually used to replace Intel as supplier in
| laptops and desktops too.
|
| Maybe Apple would have gone down that path anyway, but I
| really suspect Intel's reluctance to work with Apple to
| produce the chips Apple wanted (especially the iPhone chip)
| was a huge motivating factor that drove Apple down the path
| of developing their own CPUs.
|
| Remember, this is 2006. Apple had only just switched to Intel
| in January because IBM had continually failed to deliver
| the laptop-class PowerPC chips Apple needed _[1]_. And
| while at that time Intel had a good roadmap for laptop-class
| chips, it would have looked to Apple as if history was at
| risk of repeating itself, especially as they moved into the
| mobile market where low power consumption was even more
| important.
|
| _[1]_ _TBH, IBM was failing to provide desktop-class CPUs
| too, but the laptop CPUs were the more pressing issue. Fun
| fact: IBM actually tried to sell the PowerPC core they were
| developing for the Xbox 360 and PS3 to Apple as a low-power
| laptop core. It was sold to Microsoft/Sony as a low-power
| core too, but if you look at the launch versions of both
| consoles, they run extremely hot, even when paired with
| comically large (for the era) cooling solutions._
| scarface_74 wrote:
| > More explicitly. In 2006, Apple asked Intel to make a SoC
| for their upcoming product... the iPhone.
|
| This isn't strictly true. Tony Fadell - the creator of the
| iPod and considered co-creator of the iPhone - said in an
| interview with Ben Thompson (Stratechery) that Intel was
| never seriously in the running for iPhone chips.
|
| Jobs wanted it. But the technical people at Apple pushed
| back.
|
| Besides, especially in 2006, less than a year before the
| iPhone was introduced, chip decisions had already been
| made.
| cylemons wrote:
| Was it really? x86 is performance-oriented, not
| efficiency-oriented. Its variable-length encoding just makes it
| really hard to have a low-power CPU that isn't too slow.
| tgma wrote:
| I think the impact of ISA is way overblown. The instruction
| decode pipeline is worse but doesn't consume that many
| transistors in the end relative to the total size of the
| system. I think it has much more to do with the attitude of
| Intel defining the x86 market as desktop and servers and
| not focused on super low power parts; plus their monopoly
| which led to a long stagnation because they didn't have to
| innovate as much.
|
| You can see it today with modern Ryzen laptop chips, which
| aren't that much worse on perf/watt than ARM chips fabbed
| on the same node.
| adrian_b wrote:
| For applications where the performance is determined by
| array operations, which can leverage AVX-512
| instructions, an AMD Zen 5 core has better performance
| per area and per power than any ARM-based core, with the
| possible exception of the Fujitsu custom cores.
|
| The Apple cores themselves do not have great performance
| for array operations, but when considering the CPU cores
| together with the shared SME/AMX accelerator, the
| aggregate might have a good performance per area and per
| power consumption, but that cannot be known with
| certainty, because Apple does not provide information
| usable for comparison purposes.
|
| The comparison is easy only with the cores designed by
| Arm Holdings. For array operations, the best performance
| among the Arm-designed cores is obtained by Cortex-X4
| a.k.a. Neoverse V3. Cortex-A720 and Cortex-A725 have half
| of the number of SIMD pipelines but more than half of the
| area, while Cortex-X925 has only 50% more SIMD pipelines
| but double the area. Intel's Skymont a.k.a. Darkmont has
| the same area and the same number of SIMD pipelines as
| Cortex-X4, so like Cortex-X4 it is also more efficient
| than the much bigger Lion Cove core, which is faster on
| average for non-optimized programs but has the same
| maximum throughput for optimized programs.
|
| When compared with Cortex-X4/Neoverse V3, a Zen 5 compact
| core has a throughput for array operations that can be up
| to double, while the area of a Zen 5 compact core is less
| than double the area of an Arm Cortex-X4. A high-clock-
| frequency Zen 5 core has more than double the area of a
| Cortex-X4, but due to the high clock frequency it still
| has better performance per area, even if, unlike the Zen 5
| compact cores, it no longer also has better performance
| per power consumption.
|
| So the advantage in ISA of Aarch64, which results in a
| simpler and smaller CPU core frontend, is not enough to
| ensure better performance per area and per power
| consumption when the backend, i.e. the execution units,
| does not have itself a good enough performance per area
| and per power consumption.
|
| The area of Arm Cortex-X4 and of the very similar Intel
| Skymont core is about 1.7 square mm in a "3 nm" TSMC
| process (both including 1 MB of L2 cache memory). The
| area of a Zen 5 compact core in a "4 nm" TSMC process
| (with 1 MB of L2) is about 3 square mm (in Strix Point).
| The area of a Zen 5 compact core with full SIMD pipelines
| must be greater, but not by much, perhaps by 10%, and if
| it were done in the same "3 nm" process as Cortex-X4
| and Skymont, the area would shrink, perhaps by 20% to
| 25% (depending on the fraction of the area occupied by
| SRAM). In any case there is little doubt that the area in
| the same fabrication process of a Zen 5 compact with full
| 512-bit SIMD pipelines would be less than 3.4 square mm
| (= double Cortex-X4), leading to a better performance per
| area and per power consumption than for either Cortex-X4
| or Skymont (this considers only the maximum throughput
| for optimized programs, but for non-optimized programs
| the advantage could be even greater for Zen 5, which has
| a higher IPC on average).
|
| Cores like Arm Cortex-X4/Neoverse V3 (also Intel
| Skymont/Darkmont) are optimal from the POV of performance
| per area and power consumption only for applications that
| are dominated by irregular integer and pointer
| operations, which cannot be accelerated using array
| operations (e.g. for the compilation of software
| projects). Until now, with the exception of the Fujitsu
| custom cores, which are inaccessible for most computer
| users, no Arm-based CPU core has been suitable for
| scientific/technical computing, because none has had
| enough performance per area and per power consumption,
| when performing array operations. For a given socket,
| both the total die area inside the package and the total
| power consumption are limited, so the performance per
| area and per power consumption of a CPU core determines
| the performance per socket that can be achieved.
| scarface_74 wrote:
| Innovate on _what_ though? There was no market for
| performant very low power chips before the iPhone and
| then Android took off.
|
| I am sure if IBM had more of a market than the minuscule
| Mac market for laptop class PPC chips back in 2005, they
| could have poured money into making that work.
|
| Even today, I doubt it would be worth Apple's money to
| design and manufacture its own M-class desktop chips just
| for around 25 million Macs + iPads if they weren't
| reusing a lot of the R&D.
| tgma wrote:
| In 2010s, Intel pretty much sold the same Haswell design
| for more than half a decade and lipsticked the pig. It is
| not just low power that they missed. They had time to
| improve the performance/watt for server use, add core
| counts, do big-little, improve the iGPU, etc.
|
| They just sat on it, their marketing dept made fancy
| boxes for high end CPUs and their HR department innovated
| DEI strategies.
| JustExAWS wrote:
| Yes I'm sure that Intel fell behind because a for profit
| company was more concerned with hiring minorities than
| hiring the best employees they could find.
|
| It's amazing that the "take responsibility", "pull
| yourself up by your bootstraps crowd" has now become the
| "we can't get ahead because of minorities crowd"
| tgma wrote:
| Huh, it's not clear what you are suggesting. Who's "we"
| and who's not taking responsibility?
|
| The best people were clearly not staying at Intel and
| they have been winning hard at AMD, Tesla, NVIDIA, Apple,
| Qualcomm, and TSMC, in case you have not been paying
| attention. They could not stop winning and getting ahead
| in the past 5-10 years, in fact. So much semiconductor
| innovation happened.
|
| Yes, if you start promoting the wrong people, very
| quickly the best ones leave. No one likes to report to
| their stupid peer who just got promoted or the idiot they
| hire from the outside when there are more qualified
| people they could promote from within.
|
| --
|
| And re marketing boxes, just check out where Intel chose
| to innovate:
|
| https://www.reddit.com/r/intel/comments/15dx55m/which_i9_
| box...
| JustExAWS wrote:
| The problem with Intel wasn't the technical people. It
| started with the board laying off people, borrowing money
| to pay dividends to investors, bad strategy, not building
| relationships with customers who didn't want to work with
| them for fabs, etc., and then firing the CEO who had a
| strategy that they knew was going to take years to
| implement.
|
| It wasn't because of "DI&E" initiatives and a refusal to
| hire white people.
| phire wrote:
| Intel had a leading line of ARM SoCs from 2002-2006. Some
| of the best on the market for PDAs and smartphones. Their
| XScale SoCs were very popular.
|
| But Intel gave up and sold it off, right as smartphones
| were reaching mainstream.
| tgma wrote:
| They sold XScale to Marvell which ironically has a higher
| market cap than Intel.
| bee_rider wrote:
| Their iGPUs are good enough for day-to-day (non gaming)
| computer use and rock-solid in Linux.
| tgma wrote:
| Good enough? Maybe better today, but they have been god-awful
| compared to AMD and absolute garbage compared to something
| like the M1 iGPU. They are responsible for more than half of
| the pain inflicted on users in the Vista days.
|
| Ironically, they have lost the driver advantage in Linux with
| their latest Arc stuff.
|
| I trust they could have done a lot better, a lot earlier, if
| they had cared to invest in the iGPU. It feels like it was
| deliberately neglected.
| bronson wrote:
| The same way missing mobile feels so nuts that it's gotta
| be deliberate.
| DannyBee wrote:
| ???
|
| The Lunar Lake Xe (i.e. the generation before the current one)
| is not rock solid on Linux - I can get it to crash the GPU
| consistently just by loading enough things that use GL. Not
| like 100, like 5.
|
| If I start Chrome and Signal and something else, it often
| crashes the GPU after a few minutes.
|
| I've tried latest kernel and firmware and mesa and ....
|
| The GPU should not crash, period.
| Sniffnoy wrote:
| Given that the fixed table is a much simpler one (by letting out-
| of-bounds just return 2, rather than adding circuitry to make it
| return 0), I wonder why they didn't just do it that way in the
| first place?
| kens wrote:
| Returning 0 for undefined table entries is the obvious thing to
| do. Setting these entries to 2 is a bit of a conceptual leap,
| even though it would have prevented the FDIV error and it makes
| the PLA simpler. So I can't fault Intel for this.
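|
| A toy illustration of the don't-care point (a made-up 3-bit
| table, not the real PLA contents): rows 6-7 are unreachable,
| and what you fill them with changes how simply the "digit is
| 2" region can be described:
|
|     #include <stdio.h>
|
|     /* Reachable rows 0-5 map to digits 0,1,1,2,2,2. With the
|        unreachable rows 6-7 zero-filled, "digit 2" is exactly
|        rows 3..5; filled with 2, it collapses to "row >= 3",
|        one rectangle, i.e. fewer product terms in a PLA. */
|     static const int zero_fill[8] = {0, 1, 1, 2, 2, 2, 0, 0};
|     static const int two_fill[8]  = {0, 1, 1, 2, 2, 2, 2, 2};
|
|     int main(void) {
|         for (int row = 0; row < 8; row++)  /* agree on rows 0-5 */
|             printf("row %d: zero-fill=%d two-fill=%d\n",
|                    row, zero_fill[row], two_fill[row]);
|         return 0;
|     }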
| Sniffnoy wrote:
| It's not really a conceptual leap if you've ever had to work
| with "don't care" cases before...
| mjevans wrote:
| It's a NULL / 'do not care' issue. 0 isn't a reserved out of
| band value, it's payload data and anything beyond the bounds
| should have been DNC.
|
| It's possible some other result, likely aligned to an easy
| binary multiple would still produce a square block of 2, and
| that allowing the far edges to float to some other value
| could yield a slightly more compact logic array. Back-filling
| the entire side to the clamped upper value doesn't cost that
| much more though, and is known to solve the issue. As pointed
| out elsewhere, that sort of solution would also be faster for
| engineering time, fit within the planned space budget, and
| best of all reduces cognitive load. It's obviously correct
| when looking at the bug.
| ajross wrote:
| "Make it work first before you make it work fast".
| Fundamentally this is a software problem solved with software
| techniques. And like most software there's some optimization
| left on the table just because no one thought of it in time.
| And you can't patch a CPU of this era.
| lizzas wrote:
| That must have been such a satisfying fix for the engineers
| though!
| jandrese wrote:
| More engineering time resulted in a more efficient solution.
| phire wrote:
| It feels like the kind of optimization that gets missed because
| the task was split between multiple people, and nobody had
| complete knowledge of the problem.
|
| The person generating the table didn't realize that filling
| the out-of-bounds entries with 2 would make for a simpler
| PLA. And the person
| squishing the table into the PLA didn't realize the zeros were
| "don't care" and assumed they needed to be preserved.
|
| It's also possible they simply stopped optimizing as soon as
| they felt the PLA was small enough for their needs. If they had
| already done the floorplanning, making the PLA even smaller
| wasn't going to make the chip any smaller, and their
| engineering time would be better spent elsewhere.
| evanmoran wrote:
| The bug is super fun, but I also find the Intel response to be
| fascinating on its own. They apparently didn't replace the
| processor of everyone who wanted a non-faulty version, resulting
| in a ton of bad press.
|
| To contrast, I've been thinking a lot about the Amazon Colorsoft
| launch, which had a yellow-band graphics issue on some devices
| (mine included). Amazon waited a bit before acknowledging it
| (maybe a day or two, presumably to get the facts right). Then
| they simply quietly replaced all of them. No recall. They just
| send you a new one if you ask for it (my replacement comes
| Friday; hopefully it will fix it). My takeaway is that it's
| pretty clear that having an incredibly robust return/support
| apparatus has a lot of benefits when launches don't go quite
| right. Certainly more than you'd expect from analysis.
|
| Similarly, I haven't seen too many recent reports about the
| Apple AirPods Pro crackle issue that happened a couple of years
| ago (my AirPods had to be replaced twice), but Apple also just
| quietly replaced them, and that support competence really is
| something powerful that isn't always noticed.
|
| Colorsoft: https://www.tomsguide.com/tablets/e-readers/amazon-
| kindle-co...
|
| AirPods Pro: https://support.apple.com/airpods-pro-service-
| program-sound-...
| lizzas wrote:
| That is default Amazon - you can return stuff no hassle for
| almost any reason.
| WalterBright wrote:
| Only up to a point. If one is abusing it, expect getting
| locked out. I buy enough stuff from Amazon that they don't
| mind me returning something once in a while.
| dan-robertson wrote:
| I thought the response from Intel was to invest a lot in
| correctness for a while, and then, deciding that AMD was not
| being punished for its higher defect rate, to invest more
| recently in other things to try to compete with AMD on
| metrics other than how buggy the CPU is.
| ryao wrote:
| I read a claim that they had gutted their verification team
| several years ago in response to Zen since they claimed that
| they needed to develop faster and verification was slowing
| them down. Then not that long ago we started hearing about
| the Raptor Lake issues.
| userbinator wrote:
| This article? https://news.ycombinator.com/item?id=16058920
| donio wrote:
| The Kindle and AirPod cases are not really comparable since
| those are relatively minor products for the respective
| companies.
|
| On the Apple side the iPhone 4 antennagate is a better
| comparison since the equivalent fix there would have involved
| free replacements for a flagship and revenue-critical product
| which Apple did _not_ offer.
|
| Intel on the other hand _did_ eventually offer free
| replacements for anybody who asked and took a major financial
| hit.
| wruza wrote:
| Antennagate didn't affect _everyone_ though, only those 90s
| businessman nokia-in-fist style holders.
|
| Anecdata ofc, but everyone I know already held phones in
| fingers back then, rather than hugging it as a brick.
| donio wrote:
| Maybe but by that argument 99% of the affected Pentium
| users could have happily used their computers until they
| became obsolete. The bug went completely unnoticed for over
| a year with millions of units in use.
|
| The media coverage and the fact that "computer can't
| divide" is something that the public could wrap their heads
| around is what made the recall unavoidable.
|
| Intel's own marketing hype around the Pentium has played
| into it too. It would have been a smaller deal during the
| 486 era.
| scarface_74 wrote:
| There were even (bad) jokes about it in newspapers at the
| time.
|
| https://www.latimes.com/archives/la-
| xpm-1994-12-14-ls-8729-s...
|
| > Why didn't Intel call the Pentium the 586? Because they
| added 486 and 100 on the first Pentium and got
| 585.999983605 ."
| scarface_74 wrote:
| And Apple sold the same GSM iPhone 4 without making any
| changes to it for 3 years and the uproar died down.
|
| Before anyone well actually's me, yes they did come out
| with a separate CDMA iPhone 4 for Verizon where they
| changed the antenna design
| mikepurvis wrote:
| I had the first gen white MacBook with the magnetic closure
| that resulted in chipped, discoloured topcases. I had it
| replaced for free like three or four times over the lifespan of
| that computer, including past the three year AppleCare expiry.
|
| I really respected Apple's commitment to standing behind their
| product in that way.
| colechristensen wrote:
| I thought I remembered at least some of those replacements
| were class-action settlements and not Apple's good will.
| flomo wrote:
| For the most part, this wasn't an individual problem.
| Corporations purchased these pretty expensive Pentium computers
| through a distributor, and just got them replaced by the
| vendor, per their support contract.
|
| I've been in some consumer Apple "shadow warranty" situations,
| so I know what you are talking about, but IMO very different
| than the "IT crisis" that intel was facing. "IBM said so" had a
| ton of IT weight back then.
| coin wrote:
| > He called Intel tech support but was brushed off
|
| I laughed when I read this. It's hard enough to get support for
| basic issues, good luck explaining a hardware bug.
| WalterBright wrote:
| > It appears that only one person (Professor Nicely) noticed the
| bug in actual use.
|
| I recall a study done years ago where students were supplied
| calculators for their math class. The calculators had been
| doctored to produce incorrect results. The researchers wanted to
| know how wrong the calculators had to be before the students
| noticed something was amiss.
|
| It was a factor of 2.
|
| Noticing the error, and being affected by the error, are two
| entirely different things.
|
| I.e. how many people check to see if the computer's output is
| correct? I'd say very, very, very few. Not me, either, except in
| one case - when I was doing engineering computations at Boeing,
| I'd run the equations backwards to verify the outputs matched the
| inputs.
| kllrnohj wrote:
| > Noticing the error, and being affected by the error, are two
| entirely different things.
|
| Only somewhat true. Take any consumer usage here for example.
| If you're playing a game and it hits this incorrect output but
| you don't notice anything as a result, were you actually
| affected?
|
| How much usage of FDIV on a Pentium was for numerically
| significant output instead of just multimedia?
| WalterBright wrote:
| If your game has some artifacts in the display, nobody cares.
|
| But if you're doing financial work, scientific work, or
| engineering work, the results matter. An awful lot of people
| used Excel.
|
| BTW, telling a customer that a bug doesn't matter doesn't
| work out very well.
| wat10000 wrote:
| I used to tutor physics in college. My students would show a
| problem they worked and ask for feedback, and I'd tell them
| that they definitely went wrong _somewhere_ since they
| calculated that the rollercoaster was 23,000 miles tall.
|
| Which is to say, it will depend a lot on the context and the
| understanding of the person doing the calculation.
| WalterBright wrote:
| It is institute policy at Caltech (at least when I attended)
| that obviously wrong answers would get you zero credit, even
| if the result came from a minor error. However, if you
| concluded after solving the problem that the answer was
| absurd, but you didn't know where the calculation went wrong,
| you'd get partial credit.
| WalterBright wrote:
| I remember that bug. Because I could not control what CPU my
| customers were running on, I had to add special code in the
| library to detect the bad FPU and execute workaround code (this
| code was supplied by Intel).
|
| I.e. Intel's problem became my problem, grrrr
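|
| For flavor, a minimal sketch of that kind of shim (not Intel's
| actual code, which screened divisors first and was more careful
| about rounding): detect the flawed FPU once with a known-bad
| operand pair, then scale both operands by 15/16, which moves
| the divisor off the at-risk bit patterns without changing the
| true quotient:
|
|     #include <stdio.h>
|
|     static int fdiv_is_flawed(void) {
|         volatile double x = 4195835.0, y = 3145727.0;
|         return (x - (x / y) * y) != 0.0;  /* 256 on a flawed chip */
|     }
|
|     static double safe_div(double a, double b, int flawed) {
|         if (!flawed)
|             return a / b;
|         /* 15/16 is exact in binary; the scaling itself can still
|            round, which real workaround code accounted for */
|         return (a * (15.0 / 16.0)) / (b * (15.0 / 16.0));
|     }
|
|     int main(void) {
|         int flawed = fdiv_is_flawed();
|         printf("flawed=%d  %.15f\n", flawed,
|                safe_div(4195835.0, 3145727.0, flawed));
|         return 0;
|     }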
| stickfigure wrote:
| Reminds me of a joke floating around at the time that captures a
| couple different 90s themes:
|
|     I AM PENTIUM OF BORG. DIVISION IS FUTILE. YOU WILL BE APPROXIMATED.
| fortran77 wrote:
| "At Intel, Quality is job 0.9999999999999999762"
| keshavmr wrote:
| At the 2012 Turing Award conference in San Francisco, Prof.
| William Kahan mentioned that he had a newer test suite available
| in 1993 that would have caught Intel's bug. Still, Intel did not
| run it. Prof. Kahan was actively involved in the bug's analysis
| and further testing. (I'm stating this just from memory.)
| hyperman1 wrote:
| How did IDIV work on the Pentium? Was it also optimized, or
| somehow connected to FDIV, or just the old slow algorithm?
| pieterr wrote:
| Reminds me of part 2 of day 24. Some wrong wirings. ;-)
|
| https://adventofcode.com/2024/day/24
| Unearned5161 wrote:
| From someone who had to mentally let go once you started talking
| about planes crossing each other, thank you for such an amazingly
| detailed writeup. It's not every day that you learn a new cool way
| to divide numbers!
| ijustlovemath wrote:
| > Curiously, the adder is an 8-bit adder but only 7 bits are
| used; perhaps the 8-bit adder was a standard logic block at
| Intel.
|
| I believe this is because for any adder you always want 1 bit
| extra to detect overflow! This is why 9-bit adders are a common
| component in MCUs.
| kens wrote:
| The weird thing is that I traced out the circuitry and the
| bottom bit of the adder is discarded, not the top bit where
| overflow would happen. (Note that you won't get overflow for
| this addition because the partial remainder is in range, just
| split into the sum and carry parts.)
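|
| For readers wondering what "sum and carry parts" means: the
| partial remainder is kept in carry-save form, two words whose
| true value is their sum. A quick sketch (my example values):
|
|     #include <stdio.h>
|     #include <stdint.h>
|
|     /* Carry-save add: reduce three words to two with no carry
|        propagation; every bit position is an independent full
|        adder, which is what makes each divider step fast. A
|        short plain adder is only needed to peek at the top bits
|        for the quotient-digit table lookup. */
|     static void csa(uint32_t a, uint32_t b, uint32_t c,
|                     uint32_t *sum, uint32_t *carry) {
|         *sum   = a ^ b ^ c;
|         *carry = ((a & b) | (a & c) | (b & c)) << 1;
|     }
|
|     int main(void) {
|         uint32_t s, k;
|         csa(25, 17, 9, &s, &k);
|         printf("%u + %u = %u (true sum %u)\n",
|                s, k, s + k, 25u + 17u + 9u);
|         return 0;
|     }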
| chiph wrote:
| I'm surprised they took the risk of extending the lookup table to
| have all 2's in the undefined region. A safer route would have
| been to just fix the 5 entries. Someone was pretty confident!
| justsid wrote:
| It actually seems like it becomes much easier to reason about
| because you remove a ton of (literal, in the diagram) edge
| cases.
| CaliforniaKarl wrote:
| > The explanation is that Intel didn't just fill in the five
| missing table entries with the correct value of 2. Instead, Intel
| filled all the unused table entries with 2.
|
| I wonder why they didn't do this in the first place.
| Panzer04 wrote:
| Implementation detail. Someone overspecified it and didn't
| realise that it didn't matter.
|
| Looking at it again later, someone asks why not just fill
| everything in instead, and everyone feels a bit silly XD.
___________________________________________________________________
(page generated 2024-12-29 23:00 UTC)