[HN Gopher] Chip aging becomes design problem (2018)
       ___________________________________________________________________
        
       Chip aging becomes design problem (2018)
        
       Author : CTOSian
       Score  : 34 points
       Date   : 2022-01-11 11:01 UTC (1 days ago)
        
 (HTM) web link (semiengineering.com)
 (TXT) w3m dump (semiengineering.com)
        
       | ohazi wrote:
       | How long should I expect to be able to use a new 5nm CPU (at
       | reasonable temperatures) before these issues are likely to make
       | it fail?
       | 
       | All of the desktop/laptop CPUs that I currently use are 14nm, and
       | I think the oldest is around 7 years old and still working fine.
       | In the past I've tended to use personal machines for around a
       | decade, and I don't really have any desire to move to a shorter
       | cycle. Better battery life is great, but most things are already
       | plenty thin and fast.
        
       | Robotbeat wrote:
       | 2-3 years for a consumer device and 10 for a telecommunications
       | device both seem massively too short, by a factor of 2 or more.
       | 
       | Do we really want our devices to stop working physically, at the
       | chip level, in less than a decade? Some people still play on game
       | consoles decades old; these are most certainly consumer devices.
       | 
       | I'd much rather we over-design stuff to last decades, at least at
       | the chip level where over-designing is super cheap.
       | 
       | Solid state electronics was supposed to mean longer life for
       | everything. No tubes to burn out, no mechanical parts to wear
       | out. It would suck so bad if we nickel and dimed what should be
       | fundamentally physically robust devices to last for much less
       | time than complicated mechanical devices from the past.
        
         | zozbot234 wrote:
         | Many of these effects are highly temperature- and voltage-
         | dependent, so they're way more likely to show up in chips that
         | are overvolted, overclocked and inadequately cooled. But we're
         | yet to see overclocking enthusiasts run into these reliability
         | concerns, so my best guess is that the safety factors embedded
         | in current designs are enough to make them less of an issue.
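
The temperature dependence mentioned above is often approximated with an Arrhenius acceleration factor. The sketch below is only illustrative: the function name and the 0.7 eV activation energy are assumptions (real values depend on the specific failure mechanism), and it ignores the voltage dependence entirely.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(t_use_c, t_stress_c, activation_energy_ev=0.7):
    """Return how much faster a thermally activated wear-out mechanism
    proceeds at t_stress_c than at t_use_c (both in degrees Celsius).

    activation_energy_ev is an assumed, illustrative value; it is not
    taken from the article or from any specific process node.
    """
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / t_use_k - 1.0 / t_stress_k
    )
    return math.exp(exponent)


if __name__ == "__main__":
    # Compare a chip held at 90 C with the same chip held at 60 C.
    factor = arrhenius_acceleration(60.0, 90.0)
    print(f"Aging at 90 C is roughly {factor:.1f}x faster than at 60 C")
```

Under these assumptions, a few tens of degrees of extra temperature shortens the expected lifetime several times over, which is why cooling and voltage margins matter so much for the effects discussed here.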
        
         | nashashmi wrote:
         | They don't stop working after 2-3 years. They just don't work
         | as fast. The electronics industry wants to promote product
         | turnover, so they do this. Or they don't work on long-term
         | durability.
         | 
         | My take: for as long as recycling performance is terrible, this
         | should be a no no. Social movements should start demanding
         | better products.
        
         | nickff wrote:
         | > _"Do we really want our devices to stop working physically,
         | at the chip level, in less than a decade? Some people still
         | play on game consoles decades old; these are most certainly
         | consumer devices."_
         | 
         | Many components in your consumer devices age; my greatest
         | concern is electrolytic capacitors (specifically the tantalum
         | ones). I think many game consoles and computers with switched-
         | mode power supplies are unlikely to last decades. If you want a
         | simple way to get your devices to last longer, I suggest that
         | you pick ones with external (brick) power supplies, as the SMPS
         | is likely to be one major cause of failures.
         | 
         | > _"I'd much rather we over-design stuff to last decades, at
         | least at the chip level where over-designing is super cheap."_
         | 
         | I am not sure that people are willing to pay significantly more
         | for these longer-lasting devices, and all that extra effort
         | will be wasted if the devices are scrapped prematurely. It may
         | be more (environmentally) efficient to simply replace the
         | devices upon failure rather than over-designing them.
        
       | Lind5 wrote:
       | also relevant & more recent
       | https://semiengineering.com/reliability-concerns-shift-left-...
        
       | trasz wrote:
       | So, how serious is it for current and upcoming hardware,
       | realistically?
        
         | monocasa wrote:
         | It's a real problem for smaller nodes; 7nm and under starts
         | really cutting into expected lifetime.
        
           | Animats wrote:
           | This is a big problem for automotive. The average age of cars
           | in the US is over 12 years now. Which is a technical
           | achievement. It was around 6 in the 1970s. Longer for trucks.
           | 
           | Automotive electronics design lives used to be much longer. I
           | happen to know that the design life for the Ford EEC IV
           | engine control unit from the 1980s was 30 years. That's been
           | exceeded in the field; many 1980s Ford trucks with that unit
           | are still running.
           | 
           | There are now cars running electronics that's way overkill in
           | performance, probably at the cost of lifetime. Unreal Engine
           | in the dashboard is a thing.
           | 
           | Realistically, you need 20 years of life in automotive
           | electronics.
        
             | versteegen wrote:
             | Honestly, I wouldn't buy a car if I knew its electronics
             | were only designed to last 20 years. I wouldn't expect it
             | to be reliable. Mind you, I drive a 1994 Toyota Corolla,
             | which has needed about $50 of repairs in the last decade,
             | so I may have high standards.
        
       | tester756 wrote:
       | Reminds me of
       | 
       | "Cores that don't count": https://research.google/pubs/pub50337/
       | 
       | >We are accustomed to thinking of computers as fail-stop,
       | especially the cores that execute instructions, and most system
       | software implicitly relies on that assumption. During most of the
       | VLSI era, processors that passed manufacturing tests and were
       | operated within specifications have insulated us from this
       | fiction. As fabrication pushes towards smaller feature sizes and
       | more elaborate computational structures, and as increasingly
       | specialized instruction-silicon pairings are introduced to
       | improve performance, we have observed ephemeral computational
       | errors that were not detected during manufacturing tests. These
       | defects cannot always be mitigated by techniques such as
       | microcode updates, and may be correlated to specific components
       | within the processor, allowing small code changes to effect large
       | shifts in reliability. Worse, these failures are often "silent":
       | the only symptom is an erroneous computation.
       | 
       | >We refer to a core that develops such behavior as "mercurial."
       | Mercurial cores are extremely rare, but in a large fleet of
       | servers we can observe the correlated disruption they cause,
       | often enough to see them as a distinct problem -- one that will
       | require collaboration between hardware designers, processor
       | vendors, and systems software architects.
       | 
       | >We have observed various kinds of symptoms caused by mercurial
       | cores.
       | 
       | > Violations of lock semantics leading to application data
       | corruption and crashes.
       | 
       | > Data corruptions exhibited by various load, store, vector,
       | and coherence operations.
       | 
       | > A deterministic AES mis-computation, which was "self
       | inverting": encrypting and decrypting on the same core yielded
       | the identity function, but decryption elsewhere yielded
       | gibberish.
       | 
       | > Corruption affecting garbage collection, in a storage system,
       | causing live data to be lost.
       | 
       | > Database index corruption leading to some queries, depending on
       | which replica (core) serves them, being non deterministically
       | corrupted.
       | 
       | > Repeated bit-flips in strings, at a particular bit position
       | (which stuck out as unlikely to be coding bugs).
       | 
       | > Corruption of kernel state resulting in process and kernel
       | crashes and application malfunctions.
       | 
       | ___________
       | 
       | >Not all mercurial-core screening can be done before CPUs are put
       | into service - first, because some cores only become defective
       | after considerable time has passed,
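
One way to picture in-service screening for the "silent" errors quoted above is to run the same deterministic computation on every core and flag any core that disagrees with the majority. The following is a minimal sketch, not the method from the paper: the function names are hypothetical, core pinning relies on `os.sched_setaffinity`, which is Linux-only, and a single SHA-256 loop exercises only a small part of a core.

```python
import hashlib
import os
from collections import Counter


def workload(rounds=2000):
    """Deterministic computation: a chain of SHA-256 hashes."""
    digest = b"seed"
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def screen_cores(rounds=2000):
    """Run the workload once per allowed core and return {core: result}.

    Falls back to a single unpinned run where core pinning is unavailable.
    """
    if not hasattr(os, "sched_setaffinity"):
        return {None: workload(rounds)}
    original = os.sched_getaffinity(0)
    results = {}
    try:
        for core in sorted(original):
            os.sched_setaffinity(0, {core})  # pin this process to one core
            results[core] = workload(rounds)
    finally:
        os.sched_setaffinity(0, original)  # restore the original affinity
    return results


def suspicious_cores(results):
    """Return the cores whose result differs from the most common result."""
    if not results:
        return []
    reference, _count = Counter(results.values()).most_common(1)[0]
    return [core for core, value in results.items() if value != reference]


if __name__ == "__main__":
    print("Cores disagreeing with the majority:", suspicious_cores(screen_cores()))
```

Because the quoted text notes that some cores only become defective after considerable time has passed, a check like this would have to be repeated periodically rather than run once before deployment.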
        
       | ccbccccbbcccbb wrote:
       | Planned obsolescence.
       | 
       | This may well end up with CPU vendors adopting food storage cant.
       | 
       | "Ignel i11-23017K, MFD: 2025.05.01, EXP: 2026.09.03, consume
       | within 3 months of first power-on".
        
       ___________________________________________________________________
       (page generated 2022-01-12 23:00 UTC)