[HN Gopher] Chip aging becomes design problem (2018)
___________________________________________________________________
Chip aging becomes design problem (2018)
Author : CTOSian
Score : 34 points
Date : 2022-01-11 11:01 UTC (1 day ago)
(HTM) web link (semiengineering.com)
(TXT) w3m dump (semiengineering.com)
| ohazi wrote:
| How long should I expect to be able to use a new 5nm CPU (at
| reasonable temperatures) before these issues are likely to make
| it fail?
|
| All of the desktop/laptop CPUs that I currently use are 14nm, and
| I think the oldest is around 7 years old and still working fine.
| In the past I've tended to use personal machines for around a
| decade, and I don't really have any desire to move to a shorter
| cycle. Better battery life is great, but most things are already
| plenty thin and fast.
| Robotbeat wrote:
| 2-3 years for a consumer device and 10 for a telecommunications
| device both seem massively too short, by a factor of 2 or more.
|
| Do we really want our devices to stop working physically, at the
| chip level, in less than a decade? Some people still play on game
| consoles decades old; these are most certainly consumer devices.
|
| I'd much rather we over-design stuff to last decades, at least at
| the chip level where over-designing is super cheap.
|
| Solid state electronics was supposed to mean longer life for
| everything. No tubes to burn out, no mechanical parts to wear
| out. It would suck so bad if we nickel and dimed what should be
| fundamentally physically robust devices to last for much less
| time than complicated mechanical devices from the past.
| zozbot234 wrote:
| Many of these effects are highly temperature- and voltage-
| dependent, so they're way more likely to show up in chips that
| are overvolted, overclocked and inadequately cooled. But we have
| yet to see overclocking enthusiasts run into these reliability
| concerns, so my best guess is that the safety factors embedded
| in current designs are enough to make them less of an issue.
| nashashmi wrote:
| They don't stop working after 2-3 years. They just don't work
| as fast. The electronics industry wants to promote product
| turnover, so they do this. Or they just don't work on long-term
| durability.
|
| My take: for as long as recycling performance is terrible, this
| should be a no no. Social movements should start demanding
| better products.
| nickff wrote:
| > "Do we really want our devices to stop working physically,
| at the chip level, in less than a decade? Some people still
| play on game consoles decades old; these are most certainly
| consumer devices."
|
| Many components in your consumer devices age; my greatest
| concern is electrolytic capacitors (specifically the tantalum
| ones). I think many game consoles and computers with switched-
| mode power supplies are unlikely to last decades. If you want a
| simple way to get your devices to last longer, I suggest that
| you pick ones with external (brick) power supplies, as the SMPS
| is likely to be one major cause of failures.
|
| > "I'd much rather we over-design stuff to last decades, at
| least at the chip level where over-designing is super cheap."
|
| I am not sure that people are willing to pay significantly more
| for these longer-lasting devices, and all that extra effort
| will be wasted if the devices are scrapped prematurely. It may
| be more (environmentally) efficient to simply replace the
| devices upon failure rather than over-designing them.
| Lind5 wrote:
| also relevant & more recent
| https://semiengineering.com/reliability-concerns-shift-left-...
| trasz wrote:
| So, how serious is it for current and upcoming hardware,
| realistically?
| monocasa wrote:
| A real problem for smaller nodes; 7nm and under starts really
| cutting into expected lifetime.
| Animats wrote:
| This is a big problem for automotive. The average age of cars
| in the US is over 12 years now. Which is a technical
| achievement. It was around 6 in the 1970s. Longer for trucks.
|
| Automotive electronics design lives used to be much longer. I
| happen to know that the design life for the Ford EEC IV
| engine control unit from the 1980s was 30 years. That's been
| exceeded in the field; many 1980s Ford trucks with that unit
| are still running.
|
| There are now cars running electronics that are way overkill in
| performance, probably at the cost of lifetime. Unreal Engine
| in the dashboard is a thing.
|
| Realistically, you need 20 years of life in automotive
| electronics.
| versteegen wrote:
| Honestly, I wouldn't buy a car if I knew its electronics
| were only designed to last 20 years. I wouldn't expect it
| to be reliable. Mind you, I drive a 1994 Toyota Corolla,
| which has needed about $50 of repairs in the last decade,
| so I may have high standards.
| tester756 wrote:
| Reminds me of
|
| [Cores that don't count] https://research.google/pubs/pub50337/
|
| >We are accustomed to thinking of computers as fail-stop,
| especially the cores that execute instructions, and most system
| software implicitly relies on that assumption. During most of the
| VLSI era, processors that passed manufacturing tests and were
| operated within specifications have insulated us from this
| fiction. As fabrication pushes towards smaller feature sizes and
| more elaborate computational structures, and as increasingly
| specialized instruction-silicon pairings are introduced to
| improve performance, we have observed ephemeral computational
| errors that were not detected during manufacturing tests. These
| defects cannot always be mitigated by techniques such as
| microcode updates, and may be correlated to specific components
| within the processor, allowing small code changes to effect large
| shifts in reliability. Worse, these failures are often "silent":
| the only symptom is an erroneous computation.
|
| >We refer to a core that develops such behavior as "mercurial."
| Mercurial cores are extremely rare, but in a large fleet of
| servers we can observe the correlated disruption they cause,
| often enough to see them as a distinct problem -- one that will
| require collaboration between hardware designers, processor
| vendors, and systems software architects.
|
| >We have observed various kinds of symptoms caused by mercurial
| cores.
|
| > Violations of lock semantics leading to application data
| corruption and crashes.
|
| > Data corruptions exhibited by various load, store, vector,
| and coherence operations.
|
| > A deterministic AES mis-computation, which was "self
| inverting": encrypting and decrypting on the same core yielded
| the identity function, but decryption elsewhere yielded
| gibberish.
|
| > Corruption affecting garbage collection, in a storage system,
| causing live data to be lost.
|
| > Database index corruption leading to some queries, depending on
| which replica (core) serves them, being non-deterministically
| corrupted.
|
| > Repeated bit-flips in strings, at a particular bit position
| (which stuck out as unlikely to be coding bugs).
|
| > Corruption of kernel state resulting in process and kernel
| crashes and application malfunctions.
|
| ___________
|
| >Not all mercurial-core screening can be done before CPUs are put
| into service - first, because some cores only become defective
| after considerable time has passed,
| ccbccccbbcccbb wrote:
| Planned obsolescence.
|
| This may well end up with CPU vendors adopting food storage cant.
|
| "Ignel i11-23017K, MFD: 2025.05.01, EXP: 2026.09.03, consume
| within 3 months of first power-on".
___________________________________________________________________
(page generated 2022-01-12 23:00 UTC)