[HN Gopher] Predicting Hard Drive Failure with Machine Learning
       ___________________________________________________________________
        
       Predicting Hard Drive Failure with Machine Learning
        
       Author : binwiederhier
       Score  : 187 points
       Date   : 2021-01-21 14:28 UTC (8 hours ago)
        
 (HTM) web link (datto.engineering)
 (TXT) w3m dump (datto.engineering)
        
       | MaxBarraclough wrote:
       | These guys offer something similar as a service:
       | https://www.senseye.io/how-it-works
        
       | gerdesj wrote:
       | "Only one drive manufacturer had enough failed drives in the
       | dataset to produce a large enough subset, which we'll refer to as
       | Manufacturer A. Because this data treatment was so effective, and
       | because most of the drives that had been in our fleet long enough
       | to begin failing were already from Manufacturer A, we decided to
       | reduce the scope of smarterCTL to only their drives."
       | 
        | I'd have dropped Manufacturer A from the supplier list and used
        | the A-only model for the remainder of their drives. Then I'd
        | have another go with smarterCTL for "not A".
        
         | dehrmann wrote:
         | There are really only 2.5 manufacturers, though.
        
         | rasz wrote:
          | Seagate, just call it what it is. You would not drop Seagate
          | because their drives are >20% cheaper than the competition.
        
       | jaas wrote:
       | It's great that they took the time to write this up even though
       | the conclusion is that it doesn't work.
       | 
       | Too much valuable information is lost because people don't value
       | negative results enough.
        
       | Abishek_Muthian wrote:
       | Neat ML exercise, though it wasn't the right solution for the
       | problem.
       | 
        | What is the right solution at the consumer level, though? I
        | currently run smartmontools at regular intervals, compare the
        | drive's TBW with the manufacturer's rated TBW, and send an MQTT
        | message using a script.
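        | 
        | Roughly like this (a minimal sketch: the device path, rated
        | TBW, topic and broker name are assumptions; smartctl's JSON
        | output needs smartmontools 7.0+):
        | 
        |   #!/usr/bin/env python3
        |   # Sketch: compare written TBW against the rated TBW, alert
        |   # over MQTT. Shown for an NVMe drive; ATA drives expose
        |   # different attributes.
        |   import json
        |   import subprocess
        | 
        |   import paho.mqtt.publish as publish
        | 
        |   DEVICE = "/dev/nvme0"   # assumption: adjust to your drive
        |   RATED_TBW = 600         # assumption: vendor rating, in TB
        | 
        |   out = subprocess.run(["smartctl", "-j", "-A", DEVICE],
        |                        capture_output=True, text=True).stdout
        |   log = json.loads(out)["nvme_smart_health_information_log"]
        |   # data_units_written counts 512,000-byte units (NVMe spec)
        |   tbw = log["data_units_written"] * 512_000 / 1e12
        | 
        |   if tbw > 0.9 * RATED_TBW:
        |       publish.single("disks/nvme0/tbw_warning",
        |                      payload=f"{tbw:.1f}/{RATED_TBW} TB written",
        |                      hostname="mqtt.local")  # assumed broker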
        
       | PaulHoule wrote:
        | My understanding is that SMART is not effective because it only
        | catches (some) mechanical problems, and about half of the time
        | it is the logic board that fails.
       | 
       | You can filter out some of the bad drives ahead of time, but you
       | will still get blindsided much of the time.
        
         | 8fingerlouie wrote:
          | Assuming you don't have an early-failure drive, SMART has one
          | usable stat, "Power On Hours". When that begins to creep
          | towards 40,000 you will have a failed drive sooner rather
          | than later.
          | 
          | As for the rest of the counters, raw read error rate,
          | reallocated sector count and friends are also good
          | indications that your drive might not be in the best shape.
          | They don't say anything about when it will fail, though. I've
          | had a drive rack up 20 reallocated sectors in a couple of
          | weeks, only to stay at 20 for 4 years after that.
          | 
          | And load cycle count is only good for telling whether you're
          | within manufacturer parameters. I frequently see old drives
          | with a Load Cycle Count 10-25 times the rated value, and they
          | work just "fine".
        
       | bryanculver wrote:
        | The article is fascinating, but one thing I find very common on
        | these engineering blogs is that there's no easy path back to
        | the company or service. Understandably, the logo at the top
        | goes back to the blog homepage, and the copyright at the bottom
        | is just static text.
        | 
        | I could search or visit one of the social pages linked in the
        | footer, but it seems like an obvious link one would want to
        | have.
        
         | binwiederhier wrote:
         | Thanks for bringing this up. We'll add a link to the main Datto
         | website shortly. :-)
        
         | whoisburbansky wrote:
         | Hitting the "Join The Team" button up top takes you back to the
         | main site, which seems reasonable since in most cases a
         | technical blog is meant to be a recruitment funnel more than a
         | general public-facing PR document, no?
        
           | bryanculver wrote:
            | I see that now. I wouldn't disagree, but if you're a CTO,
            | for example, you might be more interested in the services
            | than in joining the team.
        
       | jeffbee wrote:
       | This is consistent with prior work showing that SMART has no
       | predictive value.
       | 
       | Failure Trends in a Large Disk Drive Population, FAST '07
       | 
       | https://static.googleusercontent.com/media/research.google.c...
        
         | WrtCdEvrydy wrote:
          | I might disagree with this; I distinctly remember hearing
          | that some SMART stats only update when you perform a SMART
          | extended test.
          | 
          | If the stats that matter only update after a test, and no one
          | runs those tests, then is SMART the failure, or is the lack
          | of caring the failure?
        
         | rasz wrote:
          | Do we need actual prediction? I'll settle for a just-in-time
          | warning. This is what SMART parameter C4 gives you. You want
          | to stop using the drive at the first sign of trouble; that's
          | what C4 is good for.
         | 
         | >Only one drive manufacturer had enough failed drives in the
         | dataset to produce a large enough subset, which we'll refer to
         | as Manufacturer A
         | 
         | Seagate, we will refer to this manufacturer as SEAGATE.
        
           | jeffbee wrote:
            | I agree that the article fails to make an argument for why
            | this prediction would be useful. What would you do with it,
            | even if it were perfect?
        
             | toast0 wrote:
             | With perfect (or good enough) prediction, you can replace
             | drives before they fail, in controlled conditions in a
             | maintenance window.
        
               | jeffbee wrote:
               | Assuming your prediction is so good that it completely
               | eliminates surprise failures, OK. But if your software
               | can handle surprise failures then it can handle them, and
               | in that case I don't see the value of the prediction. The
               | only valuable thing I can see is if a system could
               | predict that you were going to have widespread,
               | coordinated failure of so many drives that your ability
               | to reconstruct a stripe was at elevated risk. But such a
               | thing is science fiction (as the article demonstrates)
               | and you can approximate it quite well by assuming that
               | all drives of the same make and model with the same in-
               | service date will fail at the same moment.
        
               | toast0 wrote:
                | The difference is urgency, and knock-on effects.
                | 
                | If your system is designed for hot swap of failed
                | disks, but 1% of the time you need to reboot the system
                | for the new disk to be detected, predictive replacement
                | lets you move traffic off the system at a convenient
                | time to do the swap (when traffic is low and operations
                | staff isn't super busy). Predictive replacement can be
                | deferred more easily, because the system is still
                | working; maybe you can batch it with some other
                | maintenance.
               | 
               | A surprise failure needs to be dealt with in a more
               | timely fashion.
               | 
                | In my experience (which I didn't write a blog post
                | about), monitoring the right SMART values and doing
                | predictive replacement eliminated most of the surprise
                | failures for spinning drives. SSDs had fewer failures,
                | but I wasn't able to find any predictive indicators;
                | they generally just disappeared from the OS
                | perspective.
        
               | jeffbee wrote:
               | You are trading off cost, though. According to that paper
               | I referenced, the SMART reallocated sector count is just
               | binary (the critical threshold is 1 sector). Drives with
               | non-zero counts are much more likely to fail but 85% of
               | them are still in service a year later. If you
               | proactively replace them you may indeed have avoided some
               | surprise failures but it's actually not a great deal more
               | effective than replacing drives randomly, and it costs
               | money.
               | 
               | Also we haven't even discussed the online disk repair
               | phenomenon, in which you take a "failed" disk, perform
               | whatever lowest-level format routine the manufacturer has
               | furnished, and get another 5 years of service out of it.
               | This is done without ever touching it.
        
             | rasz wrote:
              | If prediction were possible, you could do unethical
              | things, like selling off units bound to fail in the near
              | future without disclosing that information.
              | 
              | I wonder if there are any existing examples of such
              | practices. Supermarket chains sell products near their
              | expiration date at steep discounts; of course, the
              | difference is you know what you are buying, and you can
              | still eat a half-price Nutella jar that's 20 days from
              | expiry.
        
           | toast0 wrote:
           | Reallocated sectors is pretty good, but I'd also look at
           | pending and offline uncorrectables to get a potential earlier
           | warning.
           | 
            | I've seen some drives that were apparently working fine
            | with about 1000 reallocated sectors, and the SMART firmware
            | usually indicates a maximum quite a bit higher than that,
            | so if you replace at 100 problem sectors, I would consider
            | that "predictive" replacement. It's debatable, of course;
            | something that predicted failure based on changes in data
            | transfer (which presumably precede sectors being flagged)
            | would be a bigger predictive step, but I never got that far
            | in modeling; using flagged sector counts was enough to turn
            | unscheduled drive failures into scheduled drive
            | replacements for me.
        
         | toast0 wrote:
          | I didn't read your linked study, but the linked blog post
          | threw out the actual data. It's not surprising you get
          | nothing of use when you start with:
         | 
         | > Each attribute also has a feature named raw_value, but this
         | is discarded due to inconsistent reporting standards between
         | drive manufacturers.
         | 
         | Sure, raw_value is inconsistent between manufacturers and
         | sometimes models, but it's the most real data.
         | 
          | Edit to add: I reread the Google survey, which says, more or
          | less, that bad sector counts were indicative of failure in
          | about half the cases; and that was in 2007.
         | 
          | One thing to note is that drives made now may perform
          | differently than those in the 2007 survey: different
          | firmware, different recording, better soldering (the 200x era
          | is full of poor soldering because RoHS mandated lead-free
          | solder in most applications, and lack of experience with
          | lead-free solder led to failing connections).
         | 
          | I found, around 2013, with a couple thousand WD enterprise-
          | branded drives, that just summing all the sector counts and
          | thresholding that was enough to catch drives that would soon
          | fail. If your fleet is large enough, and you have things set
          | up so a failed disk is not a disaster (which you should!),
          | it's pretty easy to run the collection and figure out your
          | thresholds. IIRC, we would replace at 100+ damaged sectors,
          | or if growth was rapid. Some drives in service before we
          | started collecting SMART data did seem to be working fine at
          | around 1000 damaged sectors, though, so you could make a case
          | for a higher threshold.
         | 
          | SSDs, on the other hand, seemed to fail spectacularly (no
          | longer visible from the OS), with no prefailure symptoms that
          | I could see. Thankfully the failure rates were much lower
          | than for spinning disks, but failures were a lot more of a
          | pain.
        
           | heipei wrote:
           | Hmm, one could encode manufacturer and maybe even model as a
           | categorical or one-hot encoded variable. Then again, the
           | small number of failed drives in their dataset would probably
           | make this hard.
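            | 
            | Something like this with pandas (a sketch; the file and
            | column names are invented):
            | 
            |   import pandas as pd
            | 
            |   # One row per drive-day of SMART data (hypothetical)
            |   df = pd.read_csv("smart_daily.csv")
            | 
            |   # One-hot encode vendor and model so the model can learn
            |   # per-vendor behavior instead of pooling all drives.
            |   df = pd.get_dummies(df, columns=["manufacturer", "model"])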
        
       | CraigJPerry wrote:
        | For enriching the model, a few ideas for discriminating
        | failing drives:
        | 
        | Some hard drives have vibration sensor(s) in them - I wonder if
        | a daemon could sample throughout the day and dump a summary of
        | the data when the daily SMART data dump is being generated.
        | 
        | The drives report temperature via SMART - I wonder if anything
        | meaningful can be extracted from the temperature differential
        | between the HDD's sensor and a chassis temp sensor?
        | 
        | I wonder if more discrimination by manufacturing detail (date
        | of manufacture, which factory, etc.) could help?
        | 
        | I was going to suggest drive supply voltages, but those are
        | hard to compare across chassis - calibration of voltage-
        | measuring circuitry in PSUs isn't too hot for the millivolt
        | differences you'd want to measure, so the differences would be
        | so small as to be lost in the inaccuracies between boards.
        | Also, many PSUs lack a method to query their measurements.
        | 
        | Fascinating study and excellent writeup. I have zero ML
        | background but I really enjoyed this.
        
       | ploika wrote:
       | Surprised that survival analysis wasn't mentioned anywhere. Cox
       | regression would have been my go-to approach.
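        | 
        | For reference, a minimal Cox model with lifelines (a sketch;
        | the file and column names are placeholders):
        | 
        |   import pandas as pd
        |   from lifelines import CoxPHFitter
        | 
        |   # One row per drive: lifetime in days, whether it failed (1)
        |   # or was still alive when observation ended (0), plus SMART
        |   # covariates as the remaining columns.
        |   df = pd.read_csv("drive_lifetimes.csv")
        | 
        |   cph = CoxPHFitter()
        |   cph.fit(df, duration_col="age_days", event_col="failed")
        |   cph.print_summary()  # hazard ratio per SMART attribute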
        
         | conjectures wrote:
         | I respect the fact that you can be surprised by that. But even
         | rewinding a bit: where are the data plotted, e.g. time vs
         | failures?
        
         | wcrossbow wrote:
          | I'm currently using a non-linear version of it, DeepSurv [*],
          | implemented in pycox, for a predictive maintenance job.
          | Survival models are much more informative than a binary label
          | and give you room to make a business decision about how close
          | to EOL you want to take care of the asset.
          | 
          | [*] The underlying neural networks aren't deep at all.
        
       | whoisburbansky wrote:
        | Wow, this is a surprisingly sober look at using ML for this
        | particular problem. I really appreciate how the author walks
        | through every step and the rationale behind it, spending as
        | much time on explaining the dataset and cleaning it up as she
        | does on, say, the model she ends up using.
       | 
       | It's a bit of a breath of fresh air to see a post like this where
       | someone clearly puts the work in to try and solve a given problem
       | with ML and is able to articulate the failure modes of the
       | approach so well.
        
         | Karrot_Kream wrote:
         | What was the author's rationale for choosing XGBoost? Failure
         | analysis is an old field so I'm very surprised that there's not
         | even a cursory nod to the exponentially distributed failure
         | times that most failure models use. If anything, this post just
         | spends a lot more time than usual talking about data cleanup.
        
         | mlthoughts2018 wrote:
          | This is very standard in industry (as opposed to academia),
          | and it highlights a lot of the reasons why you need ML
          | specialists operating these models, and why you will get
          | garbage in, garbage out if you have non-specialist engineers
          | build, evaluate & operate these types of systems.
          | 
          | For example, in this post the engineer does not handle class
          | imbalance at all, and reaches for a hugely complicated model
          | framework (XGBoost) without even trying something like bias-
          | corrected logistic regression first.
          | 
          | I think this post is just an example of an infra engineer
          | thinking "ML is easy" and, when their quick attempt doesn't
          | work, just bailing and saying "ML is overhyped and doesn't
          | work for my problem."
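          | 
          | Even a class-weighted logistic baseline is only a few lines
          | (a sketch on synthetic data; this uses simple reweighting
          | rather than a Firth-style bias correction):
          | 
          |   from sklearn.datasets import make_classification
          |   from sklearn.linear_model import LogisticRegression
          |   from sklearn.metrics import classification_report
          |   from sklearn.model_selection import train_test_split
          | 
          |   # Stand-in for real SMART data: ~1% positive (failed) class
          |   X, y = make_classification(n_samples=20000, weights=[0.99],
          |                              random_state=0)
          |   X_tr, X_te, y_tr, y_te = train_test_split(
          |       X, y, stratify=y, test_size=0.2, random_state=0)
          | 
          |   # class_weight="balanced" upweights the rare failure class
          |   # inversely to its frequency
          |   clf = LogisticRegression(class_weight="balanced",
          |                            max_iter=1000)
          |   clf.fit(X_tr, y_tr)
          |   print(classification_report(y_te, clf.predict(X_te)))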
        
           | causalmodels wrote:
           | This is not the correct modeling approach. All hard drives
           | will fail given enough time, so labeling the failed hard
           | drives as the positive class will bias your results.
           | 
           | Stuff like this really should be handled using survival
           | analysis.
        
             | mlthoughts2018 wrote:
             | In fact it screams Bayesian survival analysis, since there
             | is so much prior knowledge both of general hard drive
             | failure rates and SMART stats.
        
               | astrophysician wrote:
               | I agree and have been doing some similar analysis on the
               | backblaze dataset. I suppose you can use this for
               | prediction but I'm personally just interested in post-hoc
               | analysis and (1) getting better AFR estimates when
               | failure counts are low + (2) exploring time-dependence of
               | hazard functions with different priors (GP priors, etc.).
               | This post and your comment have motivated me to make a
               | post this weekend! Thanks!
        
               | jgalentine007 wrote:
                | SMART is a liar sometimes. I have first-hand experience
                | with faulty Seagate firmware and Equallogic SANs, where
                | errant statistics caused disks to be ejected from the
                | volumes before you could finish a rebuild. Nothing like
                | watching 40TB of data disappear on multiple
                | installations over the course of a few weeks!
        
               | R0b0t1 wrote:
               | SMART seems to be extremely useless in practice.
               | Manufacturers don't seem to expose actual failure
               | statistics through it, likely for fear of making their
               | product look bad.
        
           | [deleted]
        
       | moonbug wrote:
       | Ugh. Statistics, people.
        
       | ryeguy_24 wrote:
       | It would be interesting to empirically identify the state space
       | that leads to increased probability of failure and then seek to
       | understand the rationale for this failure through physics.
       | Ultimately, the ML model is picking up on some combination of
       | factors that exposes physical weaknesses of the device.
        
       | washedup wrote:
       | Just curious: how come the total observations of the confusion
       | matrices are so different? Shouldn't the same data set be used in
       | each trial to fully evaluate the performance differences? Maybe I
       | missed some details in the article...
        
         | rachelwenzel wrote:
          | Hey there! Article author here.
          | 
          | This was written as an after-the-fact look back at a year-
          | long project, and the target testing set did change a little
          | along the way. The big change was splitting by manufacturer
          | (and eventually by model family), which changed the number of
          | applicable samples to compare performance on.
         | 
         | I cut some stuff from the article for length, so I totally get
         | how that could have been unclear.
        
           | washedup wrote:
           | Appreciate the explanation, thanks! Enjoyed the article.
        
       | NikolaeVarius wrote:
        | I'm not surprised by the conclusions. I've never seen anyone
        | claim SMART has been useful unless the HDD is already known to
        | have issues that are noticeable without SMART; SMART just makes
        | it possible to diagnose the actual problem. Though, of course,
        | this is limited by my own anecdotal research and personal
        | experience.
        | 
        | What I wonder, then, is whether sensitive isolated microphones
        | could be tried for this purpose. We already know that sounds
        | (i.e. yelling at the drive) can vibrate the platter enough to
        | cause performance degradation. If there were internal mics in
        | each HDD recording sound as the HDD spins, could that audio be
        | correlated with HDD activity and, from there, with failure
        | rate?
        
         | toast0 wrote:
         | If you have spinning drives, monitor the (raw) sector count
         | stats. Add together reallocated sectors, uncorrectable and
         | offline uncorrectable.
         | 
          | If you have good operations, replace when convenient after it
          | hits 100. If you have poor operations (as in home use, with
          | no backups and only occasional SMART checks), replace if it
          | hits 10, and try to run a full SMART surface scan before
          | using a new drive.
         | 
         | For SSDs, good luck, I haven't seen prefailure indicators.
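          | 
          | A sketch of that check against smartctl's JSON output
          | (attributes 5, 197 and 198; the thresholds are the judgment
          | calls above):
          | 
          |   #!/usr/bin/env python3
          |   # Sum reallocated (5), current pending (197) and offline
          |   # uncorrectable (198) raw counts; warn past a threshold.
          |   import json
          |   import subprocess
          |   import sys
          | 
          |   DEVICE = sys.argv[1] if len(sys.argv) > 1 else "/dev/sda"
          |   THRESHOLD = 100   # use ~10 if you can't react quickly
          | 
          |   out = subprocess.run(["smartctl", "-j", "-A", DEVICE],
          |                        capture_output=True, text=True).stdout
          |   table = json.loads(out)["ata_smart_attributes"]["table"]
          |   bad = sum(a["raw"]["value"] for a in table
          |             if a["id"] in (5, 197, 198))
          | 
          |   if bad >= THRESHOLD:
          |       print(f"{DEVICE}: {bad} problem sectors, "
          |             f"schedule a replacement")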
        
         | tinus_hn wrote:
         | I have had drives tell me something was wrong before they
         | developed bad sectors. It does take a lot of monitoring though
         | and probably leads one to replace drives earlier than might be
         | necessary.
        
         | dehrmann wrote:
          | I've also never found anything meaningful in SMART at home-
          | lab scale. I do know the way vendors report it is a shitshow,
          | so I wouldn't expect it to be great training data.
          | 
          | What I'd be really curious to try is running IO benchmarks on
          | the drive to see if there are performance issues that
          | indicate a drive is failing.
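          | 
          | E.g. sampling random-read latency straight off the block
          | device (a rough sketch; needs root, the device path is an
          | assumption, and the page cache will blur repeated reads):
          | 
          |   import os
          |   import random
          |   import statistics
          |   import time
          | 
          |   DEV = "/dev/sda"              # assumption: your device
          |   fd = os.open(DEV, os.O_RDONLY)
          |   size = os.lseek(fd, 0, os.SEEK_END)
          | 
          |   lat = []
          |   for _ in range(200):
          |       # 4 KiB-aligned random offset across the whole disk
          |       off = random.randrange(0, size - 4096) & ~4095
          |       os.lseek(fd, off, os.SEEK_SET)
          |       t0 = time.perf_counter()
          |       os.read(fd, 4096)
          |       lat.append((time.perf_counter() - t0) * 1000)
          |   os.close(fd)
          | 
          |   lat.sort()
          |   # Rising tail latency can hint at sectors being retried
          |   print(f"p50 {statistics.median(lat):.2f} ms, "
          |         f"p99 {lat[int(len(lat) * 0.99)]:.2f} ms")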
        
       | oli5679 wrote:
       | Fitting a binary classifier on failure doesn't seem quite right
       | to me.
       | 
        | There are some quite natural questions you can't answer unless
        | you can predict when failure will occur. And if you have any
        | drift in the dataset, you might just find features correlated
        | with age (and hence with having had more time to fail) that
        | don't inherently increase risk.
       | 
       | I would recommend:
       | 
       | (a) using survival analysis
       | 
       | https://lifelines.readthedocs.io/en/latest/
       | 
       | https://xgboost.readthedocs.io/en/latest/tutorials/aft_survi...
       | 
       | or (b) flattening the data to be one observation per time window,
       | and predicting failure over specific time windows.
       | 
       | This is a good talk also. https://dataorigami.net/blogs/napkin-
       | folding/18308867-video-...
       | 
        | Although I'm not sure about this case, in general the approach
        | in the article (a tree ensemble as a binary classifier) works
        | well for many business problems!
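        | 
        | Option (a) with XGBoost's AFT objective looks roughly like this
        | (a sketch on synthetic data, following the linked tutorial;
        | censored drives get an open upper bound):
        | 
        |   import numpy as np
        |   import xgboost as xgb
        | 
        |   rng = np.random.default_rng(0)
        |   X = rng.normal(size=(1000, 5))      # stand-in SMART features
        |   days = rng.exponential(1000, 1000)  # days in service
        |   failed = rng.random(1000) < 0.1     # ~10% observed failures
        | 
        |   # Failed drives: lifetime known exactly (lower == upper).
        |   # Censored drives: still alive, so upper bound is +inf.
        |   dtrain = xgb.DMatrix(X)
        |   dtrain.set_float_info("label_lower_bound", days)
        |   dtrain.set_float_info("label_upper_bound",
        |                         np.where(failed, days, np.inf))
        | 
        |   params = {"objective": "survival:aft",
        |             "eval_metric": "aft-nloglik",
        |             "aft_loss_distribution": "normal",
        |             "aft_loss_distribution_scale": 1.0}
        |   bst = xgb.train(params, dtrain, num_boost_round=200)
        |   pred_days = bst.predict(dtrain)  # predicted survival times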
        
       | heipei wrote:
        | Great post; glad to see something written from beginning to
        | end. One thing that surprised me, though, is that not a single
        | word was spent on feature selection (by looking at feature
        | importance for correct predictions) or on feature engineering,
        | i.e. combining features or otherwise applying domain knowledge
        | - for example, trying to figure out whether you can put
        | different attributes in relation to each other to get a better
        | feature (e.g. number-of-spinups / operating-time-in-hours).
        | Personally that would be the first thing I'd try, and
        | coincidentally it's also the most fun aspect of machine
        | learning for me ;)
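        | 
        | In pandas, that kind of ratio feature is a one-liner each (a
        | sketch; the file and column names are invented):
        | 
        |   import pandas as pd
        | 
        |   df = pd.read_csv("smart_daily.csv")  # hypothetical dump
        | 
        |   # Rates often carry more signal than raw counters
        |   hours = df["power_on_hours"].clip(lower=1)
        |   df["spinups_per_hour"] = df["start_stop_count"] / hours
        |   df["realloc_per_hour"] = df["reallocated_sectors"] / hours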
        
         | mlthoughts2018 wrote:
         | I think it's not actually a very good ML case study for this
         | reason. Feature engineering would be a huge part of a problem
         | like this. Additionally, jumping to start with XGBoost is a
         | pretty amateurish thing to do, and the very first problem to
         | attack is class imbalance.
         | 
         | In the bio blurb they self-describe as an infra engineer who
         | also enjoys data science. In some sense I really don't like to
         | see that. The quality of the data science / ML work in this is
         | actually quite bad, but people use these blog posts as resume
         | padders to try to jump into ML jobs without ever having any
         | real experience or training.
         | 
          | I think it's a bad thing because it devalues the importance
          | of real statistical computing skills, which take many years
          | of education and experience to develop - absolutely not the
          | sort of thing you can get by dabbling in some Python packages
          | on the weekend to do a little project like this.
         | 
          | The amount of waste I see from companies trying to avoid
          | paying higher wages, and avoiding team structures that
          | facilitate the productivity of statistics experts, is
          | staggering - all kinds of hacked-up scripts and notebooks
          | stitched together without proper statistical understanding
          | behind them, ML engineers made to manage their own devops,
          | and basic statistical questions simply ignored.
         | 
         | For this drive problem for example, I expect to see a
         | progression from simple models, each of which should address
         | class imbalance as a first order concern. I expect to see how
         | Bayesian modeling can help and how simple lifetime survivorship
         | models can help. I expect to see a lot of feature engineering.
         | 
         | Instead I see an infra engineer playing around with data and
         | trying one off the shelf framework, then claiming the whole
         | premise can't work in production.
        
       | jmpman wrote:
       | The author doesn't explicitly call out the classic reliability
       | engineer's concept of a bathtub failure model.
       | https://en.m.wikipedia.org/wiki/Bathtub_curve
       | 
        | Thermal abuse (running at a high temperature for many hours),
        | duty cycle, and IOP rate (high head acceleration versus head
        | dwell) all impact the lifetime.
       | 
        | Early-life media issues are predictive of failure and can be
        | detected through stress testing with various seek ranges. It's
        | not as simple as a SMART metric, but it did appear to have
        | predictive value. The one experiment I ran discovered a set of
        | drives for which the manufacturer ultimately admitted to a
        | since-corrected process problem.
       | 
       | The above factors are different on each drive family and can't
       | normally be generalized.
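        | 
        | A piecewise-exponential fit is a quick way to look for the
        | bathtub shape in your own fleet (a sketch with lifelines; the
        | breakpoints and column names are guesses):
        | 
        |   import pandas as pd
        |   from lifelines import PiecewiseExponentialFitter
        | 
        |   df = pd.read_csv("drive_lifetimes.csv")  # one row per drive
        | 
        |   # Separate hazard rates for infant mortality (<90 days),
        |   # steady state, and wear-out (>3 years); a bathtub shows up
        |   # as a high-low-high pattern across the segments.
        |   pf = PiecewiseExponentialFitter(breakpoints=[90, 1095])
        |   pf.fit(df["age_days"], event_observed=df["failed"])
        |   pf.print_summary()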
        
       ___________________________________________________________________
       (page generated 2021-01-21 23:01 UTC)