[HN Gopher] SSD will fail at 40k power-on hours (2021)
       ___________________________________________________________________
        
       SSD will fail at 40k power-on hours (2021)
        
       Author : dredmorbius
       Score  : 143 points
       Date   : 2022-07-10 19:36 UTC (3 hours ago)
        
 (HTM) web link (www.cisco.com)
 (TXT) w3m dump (www.cisco.com)
        
       | lucb1e wrote:
       | Check your power-on hours:
       | 
       |     $ sudo smartctl -a /dev/sda | grep -e Power_On_Hours -e ^ID
       |     ID# ATTRIBUTE_NAME  FLAG   VALUE WORST THRESH TYPE    UPDATED WHEN_FAILED RAW_VALUE
       |       9 Power_On_Hours  0x0032 098   098   000    Old_age Always  -           9743
       | 
       | Just looking at the raw value, it seems to be 9'743 hours in my
       | case
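
A minimal Python sketch of the same check, assuming the attribute-row layout in the smartctl output above (RAW_VALUE as the tenth column). The function name `power_on_hours` and the threshold constant are illustrative, not part of smartctl. Some drives report the raw value with extra text after the number, which this sketch handles only by taking the leading digits.

```python
import re
from typing import Optional

FAILURE_THRESHOLD_HOURS = 40_000  # from the advisory title; illustrative constant


def power_on_hours(smartctl_output: str) -> Optional[int]:
    """Return the raw Power_On_Hours value from `smartctl -a` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Rows: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] == "Power_On_Hours":
            match = re.match(r"\d+", fields[9])
            return int(match.group()) if match else None
    return None


sample = (
    "ID# ATTRIBUTE_NAME  FLAG   VALUE WORST THRESH TYPE    UPDATED WHEN_FAILED RAW_VALUE\n"
    "  9 Power_On_Hours  0x0032 098   098   000    Old_age Always  -           9743\n"
)
hours = power_on_hours(sample)
print(hours, "hours so far;", FAILURE_THRESHOLD_HOURS - hours, "hours below the 40k mark")
```

In practice the text would come from running smartctl (as root) and capturing its standard output instead of the hard-coded `sample`.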
        
         | borplk wrote:
         | Mine is above 53,000 hours ... time to check my backups!
        
           | lucb1e wrote:
           | Sounds like you're in the clear for this particular bug...
           | 
           | ...but always check your backups regularly for data that is
           | dear to you!
           | 
           | Protip of the day: that includes things on someone else's
           | server. I remember when Grooveshark went offline from one day
           | to the next and I lost nearly my whole library because I
           | remembered only some artists and had to go through thousands
           | of songs to find which ones I actually liked from them. My
           | browser's localStorage object contained a few playlists but I
           | didn't use those much. Or when 000webhost cancelled my
           | account because I was using the 100MB(?) to back up some
           | files that were most important to me, rather than for actual
           | webhosting (in my defense, I was 15 at the time), and so when
           | I returned from a holiday with my parents with an actual
           | crashed hard drive, that turned double sour. Backing up
           | things from what they now call the "cloud" is something I
           | learned early, as I have virtually no code I wrote before
           | that summer, only some of the music, only essays with WordArt
           | if they were printed, etc.
        
       | dredmorbius wrote:
       | Possibly related to recent HN issues, see:
       | https://news.ycombinator.com/item?id=32031243
        
         | solardev wrote:
         | Wow, thanks for sharing. I didn't realize how closely related
         | they were.
         | 
         | (TLDR For anyone wondering, "recent HN issues" means HN very
         | likely went down yesterday because of this same bug, when two
         | (edit: two pairs, four total) enterprise SSDs with old firmware
         | died after 40,000 hours close together. An admin of HN and its
         | host both like this theory. See details in that thread.)
         | 
         | Edit: If you want to discuss that theory, it's probably better
         | to do it in that other thread directly instead... dang and a
         | person from M5 Hosting (HN's previous host) are both
         | participating there.
        
           | mkl wrote:
           | Not two SSDs, _four_: two in the main server, and two in the
           | backup server.
        
             | solardev wrote:
             | Thanks for the correction!
        
           | [deleted]
        
           | kazen44 wrote:
           | The chance of two SSDs failing at the same time under normal
           | circumstances is extremely slim, so this bug is a plausible
           | cause of the incident.
        
             | MBCook wrote:
             | Especially since one pair was a nearly unused backup server
             | that had a totally different use profile.
        
       | DoneWithAllThat wrote:
       | What the hell is that ridiculous "bias-free language" claptrap at
       | the beginning? Man DEI is seriously out of control.
        
         | UkrainianJew wrote:
         | It's a bug in the human firmware. People seem to have an
         | instinctive and subconscious need for property - out of all
         | things occupying our attention span, being able to arbitrarily
         | change some on a whim.
         | 
         | I think, this instinct is responsible for humans figuring out
         | farming (as in developing the land near you to your liking) and
         | many cultural achievements.
         | 
         | Except, with the information society, our attention is being
         | constantly overwhelmed by the stream of information produced by
         | other people, so this instinct kicks in and makes some people
         | want to control what language others use. I don't think we will
         | see any studies of this soon, but my hunch is that there is an
         | inverse correlation between the amount of one's physical
         | property and one's sensitivity to the language and content of
         | others' speech.
         | 
         | Corporations happily abused it, since letting your employees
         | "own" pronouns and acknowledgements is cheaper than paying them
         | enough to own their houses (let alone start competing
         | companies). Now it has spun into a de-facto religion where many
         | people's weight in the society depends on perpetuating (and
         | intensifying) the dogmas. Kinda similar to late USSR where most
         | people didn't believe in communism anymore, but not having a
         | Lenin's room in your office would get you labeled as an
         | American spy.
         | 
         | From what we can learn from the history, it will intensify
         | until the movement splinters into competing factions, that will
         | heavily oppose each other, and will eventually settle on some
         | common ground to avoid continuous mutual damage.
        
         | ParetoOptimal wrote:
         | It's a device to identify certain kinds of people who would
         | have a problem with less loaded language without any loss in
         | clarity.
        
           | rajamaka wrote:
           | I would love to see some examples of Cisco documentation that
           | ever offended anyone.
        
             | mlyle wrote:
             | I miss old Cisco documentation, with IP addresses and
             | router names like SanJose3 and 408 phone numbers on PRIs
             | etc.
        
             | bombcar wrote:
             | It's a warning that the documentation may refer to
             | master/slave or something like that because Cisco cares
             | enough about DEI to update documentation but not enough to
             | actually update out-of-support firmware.
        
             | 13of40 wrote:
             | OK, here's one they need to update:
             | 
             | https://blogs.cisco.com/news/digital-transformation-
             | requires...
             | 
             | The offending text:
             | 
             | Act now by adding "equality, inclusion, and diversity" as
             | an agenda item for your next staff meeting, brownbag, or
             | employee gathering.
             | 
             | What, why?
             | 
             | https://www.upi.com/Odd_News/2013/08/02/In-Seattle-the-
             | terms...
             | 
             | "Brownbag" is offensive in one context, and that means it's
             | offensive in all contexts.
        
           | mancerayder wrote:
           | Other than DEI administrators, trainers and people in
           | positions with DEI in them, who is actually getting offended?
        
             | powerhour wrote:
             | People that have to move their mouse a bit to hit the x
             | button, apparently.
        
         | deigestapo wrote:
         | It's so they can track it, compute metrics, report, etc.
         | 
         | This gets fed into indicators that can be used to boost the ESG
         | (communism) score of publicly-traded companies.
        
         | GuB-42 wrote:
         | Goes well with the legal disclaimer that follows.
         | 
         | The legal or whatever-not-technical department wanted to leave
         | their mark.
        
         | TaylorAlexander wrote:
         | Having a statement like that shows people that they are open to
         | suggestions on improvements. Since a lot of people are not so
         | open to suggestions, it makes sense to me to include this
         | language. They added a little X button so you can close it
         | easily.
        
         | [deleted]
        
         | hn_throwaway_99 wrote:
         | My reaction was "If you want to write some documentation with
         | bias-free language, just write the documentation with bias-free
         | language." Why the need for a long paragraph explaining "Look
         | how great and sensitive we are!"
         | 
         | I understand, and agree with, the desire to use inclusive
         | language, but so much of this has just devolved into
         | performative nonsense.
        
           | mlyle wrote:
           | Else you get questions, like, "why don't you say master/slave
           | like everyone else?!@!!"
        
             | kwhitefoot wrote:
             | At this stage I think such questions can just be ignored.
        
               | MarcoZavala wrote:
        
           | alpb wrote:
           | Saying that and usernames like "DoneWithAllThat" and
           | "hn_throwaway", yeah, it checks out.
        
             | hn_throwaway_99 wrote:
             | Not sure exactly what point you're trying to make, but if
             | it's "the risk of saying anything even _remotely_ critical
             | of DEI tactics is a huge, gargantuan, giant career risk
             | these days", then I wholeheartedly agree.
        
         | 0xbadcafebee wrote:
         | The docs may include "master/slave", and they don't want to get
         | sued or bad PR, so this generic notice says "we don't like bad
         | words but sometimes the industry uses bad words and that's
         | unfortunate". If you click the _Learn More_ link in the
         | paragraph, you'll learn more.
        
           | redeeman wrote:
        
             | deigestapo wrote:
        
             | zorpner wrote:
             | There is -- it's using words other than those, which is
             | both easy and considerate.
        
       | civilized wrote:
       | It's been over two years since this was first identified... since
       | this apparently affected many makes and models of SSDs, it would
       | be nice to know if my laptop could be affected and if there's
       | anything I could do about it.
        
         | pmoriarty wrote:
         | One thing everyone could and should be doing is backups.
        
           | m0llusk wrote:
           | Two things: Test restores or you don't actually have backups.
           | Just saying.
        
             | chrischen wrote:
             | I got bit by this with iPhone backups. I did a phone
             | trade-in and followed the "back up before trading in"
             | instructions. The problem is that after the trade-in, the
             | backup failed to restore due to an unknown error. Apple's
             | whole workflow of manually syncing and backing up with a
             | cable is super fickle and riddled with bugs.
             | 
             | Luckily I had Time Machine backups of my iOS backups and I
             | managed to avoid losing too much data.
             | 
             | As a sidenote it seems like Apple has pretty much neglected
             | their offline backup and syncing workflow to drive more
             | people to just pay for iCloud storage. Half the time my
             | iPhone takes hours just to get detected by the mac when
             | _plugged in._
        
         | opencl wrote:
         | This will not affect your laptop, all of the models affected by
         | this are enterprise SAS SSDs.
         | 
         | Of course your SSD might have some _other_ firmware bug that
         | would eat your data, all you can do is search for the model
         | number and see if the manufacturer has issued any
         | notices/firmware updates.
        
           | robocat wrote:
           | > This will not affect your laptop
           | 
           | That's just your presumptive opinion, right?
        
             | Sakos wrote:
             | How likely is it that they're using an enterprise SAS SSD
             | in their laptop?
        
       | yomkippur wrote:
       | Crap, so it's certainly HP laptops. So which laptops are safe
       | from this?
        
         | mrkramer wrote:
         | My HP laptop has Toshiba SSD. I'm not sure about other models.
         | But I think only enterprise SSDs are affected.
        
       | mistrial9 wrote:
       | Related topic: leaving SMART control tests ON for a (non-SSD)
       | drive apparently interferes with sleep; the drive will wake up
       | to test itself. For some drives, I would prefer that not to
       | happen and just stay quiet. Yet, testing for this behavior seems
       | elusive -- querying the disk wakes it, and most linux disk tools
       | seem unaware of sleep state. I just listen for the disk spinning,
       | or notice a long pause before an operation.
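
For what it's worth, smartctl has a `-n standby` (`--nocheck=standby`) option intended for this situation: it skips the query when the drive reports being in standby or sleep instead of spinning it up. A hedged Python sketch follows. The helper names and the device path are made up for illustration, and whether a given drive or USB bridge reports its power mode correctly is not guaranteed.

```python
import subprocess
from typing import List


def smart_query_command(device: str) -> List[str]:
    """Build a smartctl call that should not wake a drive in standby."""
    # -n standby: skip the check if the drive is in standby or sleep mode
    # -H: only ask for the overall health assessment
    return ["smartctl", "-n", "standby", "-H", device]


def query_without_waking(device: str) -> subprocess.CompletedProcess:
    """Run the command; needs root and smartmontools installed."""
    return subprocess.run(
        smart_query_command(device), capture_output=True, text=True, check=False
    )


# Only prints the command line; call query_without_waking() to actually run it.
print(" ".join(smart_query_command("/dev/sda")))
```

Listening for the disk remains the ground truth; this just avoids the tool itself being the thing that wakes it.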
        
       | onion2k wrote:
       | Backblaze have a great blog about things they learn about hard
       | drives. It's been going for years, less about firmware issues and
       | more about general usage.
       | https://www.backblaze.com/blog/backblaze-drive-stats-for-q1-...
        
       | usr1106 wrote:
       | Cisco is not an SSD manufacturer. They call it an "industry-wide"
       | bug. Does that mean that more than one SSD manufacturer is
       | affected (because they partially share the same firmware)? Further
       | down they mention only SanDisk. Or is "industry-wide" just their
       | newspeak for any SanDisk drive of an affected model, regardless of
       | whether it is installed in a Cisco box or somewhere else?
        
         | dr_zoidberg wrote:
         | I'm interested here too. I've got a Crucial SSD from 2015
         | that's been on about:
         | 
         | * 100% of 2015-2017, let's add 2 years here
         | 
         | * Aboutish 50% of days since 2018 to 2020
         | 
         | * On and off again (5%?) since then until now.
         | 
         | So it's about 3 years of full use? I'm eyeballing the use here.
         | So it may be close to the numbers that were given, but I'm not
         | sure. Guess I could check the SMART stats to get a precise
         | number and from there decide what to do about it.
         | 
         | Searching a bit, it seems it's a well-known bug in "enterprise
         | SSDs"[0, 1] (which my drive certainly isn't), but there aren't
         | any real details about what causes it, other than "a firmware
         | bug".
         | 
         | [0] https://www.servethehome.com/hpe-issues-hpd7-fix-for-ssds-
         | th...
         | 
         | [1] https://www.anandtech.com/show/15673/dell-hpe-updates-
         | for-40...
        
         | dredmorbius wrote:
         | The problem seems to be widely experienced.
         | 
         | The Cisco report turned up in response to a post I'd made of
         | the HN issue on the Fediverse:
         | 
         | https://mastodon.infra.de/@galaxis/108622795822100862
        
       | userbinator wrote:
       | 40000 (or even 40960) seems an odd number to fail at. 64k or 32k
       | would make the cause pretty obvious, but 40000 doesn't seem all
       | that round in binary. Perhaps a 12-bit counter incrementing every
       | 10h? This is puzzling.
       | 
       | Of course, I am also entertaining the possibility that no one
       | thought they would be in use for this long, which would certainly
       | be evidence of planned obsolescence.
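
The arithmetic behind that guess, as a quick Python check. The 10-hour tick is the commenter's hypothesis, not a documented firmware detail, and the labels below are purely illustrative.

```python
# Hours at which some plausible counters would wrap around.
candidates = {
    "15-bit hour counter": 2**15,
    "16-bit hour counter": 2**16,
    "12-bit counter, one tick per 10 h": 2**12 * 10,
}

for name, hours in candidates.items():
    years = hours / (24 * 365)
    print(f"{name}: wraps at {hours} h (~{years:.1f} years powered on)")
```

Only the last candidate lands near 40,000, and even then it gives 40,960 rather than the advertised figure.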
        
         | twawaaay wrote:
         | Very strange understanding of the word "evidence".
         | 
         | No sane SSD manufacturer would do such a thing on purpose. You
         | do it and you lose business, that's it.
         | 
         | The simplest explanation is that somebody made an honest
         | engineering mistake.
        
           | bayindirh wrote:
           | When you purchase a server (fleet), you get a long warranty
           | with it. Generally 3 to 5 years. So you expect this fleet to
           | stay in service for <=5 years mostly.
           | 
           | Unless you burn through your SSDs, you're very unlikely to
           | hit this event.
           | 
           | When these servers continue to be used and the disks all
           | start to fail at the same time, this will obviously stink.
           | 
           | The bathtub curve is not like this. You can _feel_ that.
        
           | fartcannon wrote:
           | Given the power dynamic between a single customer and large
           | corporations, the smart thing to do is to assume malice until
           | proven otherwise. This puts the onus on the corporations and,
           | if we're lucky, creates an environment where they compete
           | with each other to be seen as the most honest. The worst
           | thing that happens is the single customer has to buy an SSD
           | from someone they don't trust.
           | 
           | If we do the opposite, as you say, and assume everything is
           | an honest mistake, that puts pressure on the single customer
           | to prove that the organization with a huge marketing budget
           | is doing something wrong. In this situation, the worst thing
           | that happens is we all get taken advantage of.
           | 
           | Our collective distrust is the only power we have against
           | massive marketing/PR budgets. It doesn't have to be angry, or
           | sour, or cranky, we just collectively need to not take their
           | word until we have a reason to do so.
        
             | charcircuit wrote:
             | Are you seriously saying that by default we should believe
             | they intentionally planned to cause their customers to lose
             | all of their data?
        
               | [deleted]
        
               | alliao wrote:
               | planned obsolescence is quite a thing...?
        
               | dtjb wrote:
               | In some cases, but a product must fulfill its core
               | purpose. If a SSD intentionally dumped data and self
               | destructed at a set time, that would be disastrous for
               | the brand. Same way a car doesn't adopt planned
               | obsolescence by blowing up after 200k miles.
        
               | bayindirh wrote:
               | If spinning rust can run for ~8 years without any
               | problems, a consumer SSD can hit beyond 40K hours
               | reliably, and everything is checked and tested tens of
               | times because of the complexity of flash storage, I'd get
               | suspicious too.
               | 
               | Also, enterprise drives get firmware updates (regardless
               | of spinning or not), and this firmware is automatically
               | applied via RAID controller, so it could be remedied
               | easily before it got this big if it's an actual error.
        
         | justinsaccount wrote:
         | Someone pointed out on the other thread that it could be 2^57
         | nanoseconds:
         | 
         |     >>> 2**57/10**9/3600
         |     40031.996687737745
        
           | AaronFriel wrote:
           | If it were 53, I'd wonder "are they storing the time in the
           | integer part of a double precision float?" That wouldn't go
           | negative, it'd just start absorbing increments without
           | changing the value.
           | 
           | Though that might cause a divide by zero?
           | 
           | What could cause unexpected behavior at 57 bits?
           | 
           | Perhaps storing fractions of an hour, like incrementing it
           | every 1/16th of an hour and calculating a relative rate of
           | change, causing a divide by zero?
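
The "absorbing increments" behaviour is easy to demonstrate in Python, whose `float` is an IEEE 754 double. This only illustrates the 53-bit hypothesis raised in the comment above; nothing in the thread confirms the firmware uses floating point at all.

```python
# A double has a 53-bit significand, so above 2**53 not every integer
# can be represented exactly.
limit = float(2**53)

absorbed = (limit + 1.0 == limit)    # adding 1 leaves the value unchanged
registered = (limit + 2.0 != limit)  # adding 2 is still visible
print(absorbed, registered)

# A rate computed from two consecutive readings would then see a zero delta,
# and any division by that delta would fail.
previous, current = limit, limit + 1.0
delta = current - previous
print(delta)
```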
        
             | mkl wrote:
             | Do embedded CPUs like the one in an SSD have floating point
             | units? It seems more likely to me that the upper bits in a
             | 64 bit integer counter were used for something else.
        
             | danielheath wrote:
             | Packing a type flag into the upper bits of a 64 bit value
             | is a reasonably common optimisation in dynamic language
             | implementations (because it lets you use unboxed number
             | arithmetic).
        
             | jonas21 wrote:
             | My overactive imagination thinks it went something like
             | this:
             | 
             | Engineer A: Gee, I need to store a few flags with each
             | block, but there's nowhere to put them. Ah! We're storing
             | timestamps as 64-bit _microseconds_. I can use a few of
             | those bits and there'll still be enough to go thousands of
             | years without overflowing.
             | 
             | Engineer B: Gee, our SSDs are getting so fast, soon we'll
             | be able to hit 1M writes/sec. But we're storing timestamps
             | as microseconds. How can we generate unique timestamps for
             | each write? Ah! I'll switch to nanoseconds. It's a good
             | thing we have plenty of space in this 64-bit int.
             | 
             | BOOM!
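
A toy model of that scenario in Python. The 57/7 bit split, the function names, and the layout are invented to match the 2^57 ns figure mentioned earlier in the thread; the actual firmware layout is not known here.

```python
MASK_64 = (1 << 64) - 1
TIME_BITS = 57                      # assumed: low 57 bits hold the timestamp
TIME_MASK = (1 << TIME_BITS) - 1    # the upper 7 bits are left for flags

NS_PER_HOUR = 3600 * 10**9
US_PER_HOUR = 3600 * 10**6


def pack(flags: int, timestamp: int) -> int:
    """Naively pack flags above the timestamp, with no range check."""
    return ((flags << TIME_BITS) | timestamp) & MASK_64


def unpack(word: int) -> tuple:
    """Split a packed word back into (flags, timestamp)."""
    return word >> TIME_BITS, word & TIME_MASK


# In microseconds, 57 bits last for tens of millions of hours...
print((1 << TIME_BITS) / US_PER_HOUR)
# ...but in nanoseconds they run out just past 40,000 hours.
print((1 << TIME_BITS) / NS_PER_HOUR)

before = pack(flags=0b0000010, timestamp=40_031 * NS_PER_HOUR)
after = pack(flags=0b0000010, timestamp=40_033 * NS_PER_HOUR)
# Once the timestamp spills into bit 57, the flags change and the
# stored time wraps back to a small value.
print(unpack(before), unpack(after))
```

A range check in `pack` (or a wider word) would turn the silent corruption into an explicit error, which is the kind of guard the imagined engineers skipped.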
        
       ___________________________________________________________________
       (page generated 2022-07-10 23:00 UTC)