[HN Gopher] Predicting Hard Drive Failure with Machine Learning
___________________________________________________________________
Predicting Hard Drive Failure with Machine Learning
Author : binwiederhier
Score : 187 points
Date : 2021-01-21 14:28 UTC (8 hours ago)
(HTM) web link (datto.engineering)
(TXT) w3m dump (datto.engineering)
| MaxBarraclough wrote:
| These guys offer something similar as a service:
| https://www.senseye.io/how-it-works
| gerdesj wrote:
| "Only one drive manufacturer had enough failed drives in the
| dataset to produce a large enough subset, which we'll refer to as
| Manufacturer A. Because this data treatment was so effective, and
| because most of the drives that had been in our fleet long enough
| to begin failing were already from Manufacturer A, we decided to
| reduce the scope of smarterCTL to only their drives."
|
| I'd have dropped Manufacturer A from the supplier list and used
| the A-only model for the remainder of their drives. Then I'd
| have another go with smarterCTL for "not A".
| dehrmann wrote:
| There are really only 2.5 manufacturers, though.
| rasz wrote:
| Seagate, just call it what it is. You would not drop Seagate
| because their drives are >20% cheaper than the competition.
| jaas wrote:
| It's great that they took the time to write this up even though
| the conclusion is that it doesn't work.
|
| Too much valuable information is lost because people don't value
| negative results enough.
| Abishek_Muthian wrote:
| Neat ML exercise, though it wasn't the right solution for the
| problem.
|
| What is the right solution at consumer level though? I currently
| run smartmontools at regular intervals, compare the TBW with
| manufacturer's TBW and send a MQTT message using a script.
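A minimal sketch of that kind of check, assuming `smartctl -A` output and a drive that reports the `Total_LBAs_Written` attribute (attribute names and the 512-byte LBA size vary by manufacturer; the device path, rated TBW, and MQTT topic below are made up):

```python
import re
import subprocess

def tb_written(smartctl_output, lba_bytes=512):
    """Parse `smartctl -A` text and return terabytes written, or None.

    Assumes the drive reports Total_LBAs_Written and a 512-byte LBA;
    both assumptions vary by manufacturer and model.
    """
    m = re.search(r"Total_LBAs_Written.*?(\d+)\s*$", smartctl_output,
                  re.MULTILINE)
    if m is None:
        return None
    return int(m.group(1)) * lba_bytes / 1e12

def check_drive(device="/dev/sda", rated_tbw=300.0):
    """Compare terabytes written against the manufacturer's rated TBW."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    written = tb_written(out)
    if written is not None and written > 0.8 * rated_tbw:
        # Publishing is left to your setup; e.g. with paho-mqtt:
        # publish.single("disks/sda/tbw", str(written), hostname="broker")
        print(f"{device}: {written:.1f} TB written of {rated_tbw} rated")
```

Run from cron or a systemd timer at whatever interval suits you.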
| PaulHoule wrote:
| My understanding is that SMART is not effective because it only
| catches (some) mechanical problems and that about half of the
| time it is the logic board that went bad.
|
| You can filter out some of the bad drives ahead of time, but you
| will still get blindsided much of the time.
| 8fingerlouie wrote:
| Assuming you don't have an early-failure drive, SMART has one
| usable stat, "Power On Hours". When that begins to creep
| towards 40,000 you will have a failed drive sooner rather than
| later.
|
| As for the rest of the counters, raw read error rate,
| reallocated sector count and friends are also good indications
| that your drive might not be in the best shape. They don't say
| anything about when it will fail though. I've had a drive rack
| up 20 reallocated sectors in a couple of weeks, only to stay
| at 20 for 4 years after that.
|
| And load cycle count is only good for telling if you're within
| manufacturer parameters. I frequently see old drives with Load
| Cycle Count 10-25 times the rated value, and they work just
| "fine".
| bryanculver wrote:
| The article is fascinating, but one thing I find very common
| on these engineering blogs is that there is no easy path back
| to the company or service. Understandably the logo at the top
| goes back to the blog homepage, and the copyright at the
| bottom is just static text.
|
| I could search or visit one of the social pages linked in the
| footer, but it seems like an obvious link one would want to
| have.
| binwiederhier wrote:
| Thanks for bringing this up. We'll add a link to the main Datto
| website shortly. :-)
| whoisburbansky wrote:
| Hitting the "Join The Team" button up top takes you back to the
| main site, which seems reasonable since in most cases a
| technical blog is meant to be a recruitment funnel more than a
| general public-facing PR document, no?
| bryanculver wrote:
| I see that now. I wouldn't disagree, but if you're a CTO for
| example you might have interest in the services more than
| joining the team.
| jeffbee wrote:
| This is consistent with prior work showing that SMART has no
| predictive value.
|
| Failure Trends in a Large Disk Drive Population, FAST '07
|
| https://static.googleusercontent.com/media/research.google.c...
| WrtCdEvrydy wrote:
| I might disagree with this; I distinctly remember hearing
| that some SMART stats only update when you perform a SMART
| extended test.
|
| If the stats that matter only update after a test, and no one
| runs those tests, then is SMART the failure, or is the lack
| of caring the failure?
| rasz wrote:
| Do we need actual prediction? I'll settle for a just-in-time
| warning. This is what SMART parameter C4 gives you. You want
| to stop using the drive at the first sign of trouble; that's
| what C4 is good for.
|
| >Only one drive manufacturer had enough failed drives in the
| dataset to produce a large enough subset, which we'll refer to
| as Manufacturer A
|
| Seagate, we will refer to this manufacturer as SEAGATE.
| jeffbee wrote:
| I agree that the article fails to make an argument for why
| this prediction would be useful. What would you do with it,
| even if it was perfect?
| toast0 wrote:
| With perfect (or good enough) prediction, you can replace
| drives before they fail, in controlled conditions in a
| maintenance window.
| jeffbee wrote:
| Assuming your prediction is so good that it completely
| eliminates surprise failures, OK. But if your software
| can handle surprise failures then it can handle them, and
| in that case I don't see the value of the prediction. The
| only valuable thing I can see is if a system could
| predict that you were going to have widespread,
| coordinated failure of so many drives that your ability
| to reconstruct a stripe was at elevated risk. But such a
| thing is science fiction (as the article demonstrates)
| and you can approximate it quite well by assuming that
| all drives of the same make and model with the same in-
| service date will fail at the same moment.
| toast0 wrote:
| The difference is urgency, and knock on effects.
|
| If your system is designed for hot swap of failed disks,
| but 1% of the time, you need to reboot the system for the
| new disk to be detected; predictive replacement lets you
| move traffic off the system at a convenient time to do
| the swap (when traffic is low, and operations staff isn't
| super busy). Predictive replacement can be deferred more
| easily, because the system is still working; maybe you
| can batch it with some other maintenance.
|
| A surprise failure needs to be dealt with in a more
| timely fashion.
|
| In my experience (which I didn't write a blog post for),
| monitoring the right SMART values and doing predictive
| replacement eliminated most of the surprise failures for
| spinning drives. SSDs had fewer failures, but I wasn't
| able to find any predictive indicators; they generally
| just disappeared from the OS perspective.
| jeffbee wrote:
| You are trading off cost, though. According to that paper
| I referenced, the SMART reallocated sector count is just
| binary (the critical threshold is 1 sector). Drives with
| non-zero counts are much more likely to fail but 85% of
| them are still in service a year later. If you
| proactively replace them you may indeed have avoided some
| surprise failures but it's actually not a great deal more
| effective than replacing drives randomly, and it costs
| money.
|
| Also we haven't even discussed the online disk repair
| phenomenon, in which you take a "failed" disk, perform
| whatever lowest-level format routine the manufacturer has
| furnished, and get another 5 years of service out of it.
| This is done without ever touching it.
| rasz wrote:
| If prediction would be possible you could do unethical
| things, like selling off units bound to fail in near future
| without disclosing this information.
|
| I wonder if there are any existing examples of such
| practices. Supermarket chains sell products near their
| expiration date at steep discounts, of course the
| difference is you know what you are buying and can still
| eat a 20 days until expiry 50% price Nutella jar.
| toast0 wrote:
| Reallocated sectors is pretty good, but I'd also look at
| pending and offline uncorrectables to get a potential earlier
| warning.
|
| I've seen some drives that were apparently working fine with
| about 1000 reallocated sectors; and the smart firmware
| usually indicates a maximum of quite a few more than that, so
| if you replace at 100 problem sectors, I would consider that
| 'predictive' replacement. It's debatable of course; something
| that predicted failure based on changes in data transfer
| (that presumably precede sectors being flagged), would be a
| bigger predictive step, but I never got that far in modeling;
| using flagged sector counts was enough to turn unscheduled
| drive failures into scheduled drive replacements for me.
| toast0 wrote:
| I didn't read your linked study, but the linked blog post
| threw out the actual data. It's not surprising you get
| nothing of use when you start with:
|
| > Each attribute also has a feature named raw_value, but this
| is discarded due to inconsistent reporting standards between
| drive manufacturers.
|
| Sure, raw_value is inconsistent between manufacturers and
| sometimes models, but it's the most real data.
|
| Edit to add: I reread the Google survey, which says, more or
| less, that bad sector counts were indicative of failure in
| about half the cases - in 2007.
|
| One thing to note is drives made now may perform differently
| than those in the 2007 survey. Different firmware, different
| recording, better soldering (the 2000s were full of poor
| soldering because RoHS mandated lead-free solder in most
| applications, and lack of experience with lead-free solder
| led to failing connections).
|
| I found, in 2013ish, with a couple thousand WD enterprise
| branded drives that just looking at the sum of all the sector
| counts and thresholding that was enough to catch drives that
| would soon fail. If your fleet is large enough, and you have
| things set up so failed disks are not a disaster (which you
| should!), it's pretty easy to run the collection and figure out
| your thresholds. IIRC, we would replace at 100+ damaged
| sectors, or if growth was rapid. Some drives in service before
| collecting smart data did seem to be working fine at around
| 1000 damaged sectors though, so you could make a case for a
| higher threshold.
|
| SSDs, on the other hand, seemed to fail spectacularly (no
| longer visible from the OS), with no prefailure symptoms that
| I could see. Thankfully the failure rates were much lower than
| for spinning disks, but they were a lot more of a pain.
| heipei wrote:
| Hmm, one could encode manufacturer and maybe even model as a
| categorical or one-hot encoded variable. Then again, the
| small number of failed drives in their dataset would probably
| make this hard.
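That encoding is a one-liner with pandas; toy data, and the column names are made up:

```python
import pandas as pd

# toy per-drive rows; manufacturer and model are categoricals
df = pd.DataFrame({
    "manufacturer": ["A", "A", "B"],
    "model": ["X1", "X2", "Y1"],
    "reallocated_sectors": [0, 12, 3],
})

# one-hot encode the categorical columns alongside the numeric features
encoded = pd.get_dummies(df, columns=["manufacturer", "model"])
```

Whether the handful of failed drives per manufacturer supports the extra columns is the real question, as the parent comment notes.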
| CraigJPerry wrote:
| For enriching the model, a few ideas for discriminating failure
| drives:
|
| Some hard drives have vibration sensors in them - I wonder if
| a daemon could sample throughout the day and dump a summary
| of the data when the daily SMART data dump is being generated.
|
| The drives report temperature via SMART - I wonder if anything
| meaningful can be extracted from the temperature differential
| between the HDD's sensor and a chassis temp sensor?
|
| I wonder if more discrimination by manufacturing detail (date of
| mfr, which factory etc) could help?
|
| I was going to suggest drive supply voltages, but those are
| hard to compare across chassis - calibration of voltage
| measuring circuitry in PSUs isn't precise enough for the
| millivolt differences you'd want to measure - the differences
| would be so small as to be lost in the inaccuracies between
| boards. Also many PSUs lack a method to query their
| measurements.
|
| Fascinating study and excellent writeup. I have zero ML
| background but I really enjoyed this.
| ploika wrote:
| Surprised that survival analysis wasn't mentioned anywhere. Cox
| regression would have been my go-to approach.
| conjectures wrote:
| I respect the fact that you can be surprised by that. But even
| rewinding a bit: where are the data plotted, e.g. time vs
| failures?
| wcrossbow wrote:
| I'm currently using a non-linear version of it, DeepSURV [*],
| implemented in pycox, for a predictive maintenance job.
| Survival models are much more informative than a binary label
| and give you room to make a business decision as to how close
| to EOL you want to take care of the asset.
|
| [*] The underlying neural networks aren't deep at all.
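For readers without a survival-analysis background, the core idea is small enough to sketch in plain Python: a Kaplan-Meier estimate of the survival curve from (duration, failed?) pairs. Libraries like lifelines wrap this up with confidence intervals, Cox regression, and more; this is only the bare estimator:

```python
def kaplan_meier(durations, events):
    """Kaplan-Meier survival estimate.

    durations: time each drive was observed (hours, days, ...)
    events:    1 if the drive failed at that time, 0 if it was
               still alive when observation stopped (censored)
    Returns [(time, estimated survival probability)] at each
    observed failure time.
    """
    pairs = sorted(zip(durations, events))
    at_risk = len(pairs)
    surv, curve = 1.0, []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = removed = 0
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk  # conditional survival past t
            curve.append((t, surv))
        at_risk -= removed  # failures and censored drives both leave the risk set
    return curve
```

The key difference from a binary classifier is that still-alive drives contribute information (they survived this long) instead of being mislabeled as "healthy forever".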
| whoisburbansky wrote:
| Wow, this is a surprisingly sober look at using ML for this
| particular problem. I really appreciate how the author walks
| through every step and the rationale behind it, spending as much
| time on explaining the dataset and cleaning it up as she does on,
| say, the model they end up using.
|
| It's a bit of a breath of fresh air to see a post like this where
| someone clearly puts the work in to try and solve a given problem
| with ML and is able to articulate the failure modes of the
| approach so well.
| Karrot_Kream wrote:
| What was the author's rationale for choosing XGBoost? Failure
| analysis is an old field so I'm very surprised that there's not
| even a cursory nod to the exponentially distributed failure
| times that most failure models use. If anything, this post just
| spends a lot more time than usual talking about data cleanup.
| mlthoughts2018 wrote:
| This is very standard in industry (as opposed to academia) and
| highlights a lot of the reasons why you need ML specialists
| operating these models, and why you will get garbage in garbage
| out if you try having non-specialist engineers build, evaluate
| & operate these types of systems.
|
| For example, in this post the engineer does not adequately
| handle class imbalance at all, and reaches for a hugely
| complicated model framework (XGBoost) without even trying
| something like bias-corrected logistic regression first.
|
| I think this post is just an example of an infra engineer
| thinking "ML is easy" and when their quick attempt doesn't
| work, then just bailing and saying "ML is overhyped and doesn't
| work for my problem."
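A sketch of that kind of baseline with scikit-learn on synthetic imbalanced data; `class_weight="balanced"` reweights the loss by inverse class frequency (proper Firth-style bias correction needs a dedicated package, so this is only the simple reweighted variant):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# synthetic fleet: ~2% failures, like a failed-drive label
X_healthy = rng.normal(0.0, 1.0, size=(980, 3))
X_failed = rng.normal(3.0, 1.0, size=(20, 3))
X = np.vstack([X_healthy, X_failed])
y = np.array([0] * 980 + [1] * 20)

# without reweighting, a model can score ~98% accuracy by always
# predicting "healthy"; class_weight="balanced" counteracts that
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```

If this baseline already fails, a gradient-boosted ensemble is unlikely to rescue the problem; if it works, you have something interpretable to compare XGBoost against.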
| causalmodels wrote:
| This is not the correct modeling approach. All hard drives
| will fail given enough time, so labeling the failed hard
| drives as the positive class will bias your results.
|
| Stuff like this really should be handled using survival
| analysis.
| mlthoughts2018 wrote:
| In fact it screams Bayesian survival analysis, since there
| is so much prior knowledge both of general hard drive
| failure rates and SMART stats.
| astrophysician wrote:
| I agree and have been doing some similar analysis on the
| backblaze dataset. I suppose you can use this for
| prediction but I'm personally just interested in post-hoc
| analysis and (1) getting better AFR estimates when
| failure counts are low + (2) exploring time-dependence of
| hazard functions with different priors (GP priors, etc.).
| This post and your comment have motivated me to make a
| post this weekend! Thanks!
| jgalentine007 wrote:
| SMART is a liar sometimes. I have first hand experience
| with faulty Seagate firmware and Equallogic SANs - where
| errant statistics caused disks to be ejected from the
| volumes before you could finish a rebuild. Nothing like
| watching 40TB of data disappear on multiple installations
| over the course of a few weeks!
| R0b0t1 wrote:
| SMART seems to be extremely useless in practice.
| Manufacturers don't seem to expose actual failure
| statistics through it, likely for fear of making their
| product look bad.
| [deleted]
| moonbug wrote:
| Ugh. Statistics, people.
| ryeguy_24 wrote:
| It would be interesting to empirically identify the state space
| that leads to increased probability of failure and then seek to
| understand the rationale for this failure through physics.
| Ultimately, the ML model is picking up on some combination of
| factors that exposes physical weaknesses of the device.
| washedup wrote:
| Just curious: how come the total observations of the confusion
| matrices are so different? Shouldn't the same data set be used in
| each trial to fully evaluate the performance differences? Maybe I
| missed some details in the article...
| rachelwenzel wrote:
| Hey there! Article author here-
|
| This was written as an after-the-fact look back after a year
| long project and the target testing set did change a little
| along the way. The big change was splitting by manufacturer
| (and eventually by model family), which changed the number of
| applicable samples to compare performance on.
|
| I cut some stuff from the article for length, so I totally get
| how that could have been unclear.
| washedup wrote:
| Appreciate the explanation, thanks! Enjoyed the article.
| NikolaeVarius wrote:
| I'm not surprised by the conclusions. I've never seen anyone
| claim SMART has been useful unless the HDD is already known to
| have issues that are noticeable without SMART; SMART just
| makes it possible to diagnose the actual problem. Though, of
| course, this is limited by my own anecdotal research and
| personal experience.
|
| What I wonder, then, is whether sensitive isolated microphones
| could be tried for this purpose. We already know that sound
| (i.e. yelling at the drive) can vibrate the platter enough to
| cause performance degradation. If there were internal mics in
| each HDD recording sound as the HDD spins, could that sound be
| correlated with HDD activity and, in turn, with failure rate?
| toast0 wrote:
| If you have spinning drives, monitor the (raw) sector count
| stats. Add together reallocated sectors, uncorrectable and
| offline uncorrectable.
|
| If you have good operations, replace when convenient after it
| hits 100. If you have poor operations (like in home use, with
| no backups and only occasional SMART checks) replace if it hits
| 10 and try to run a full SMART surface scan before using a new
| drive.
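That rule fits in a few lines. SMART attribute IDs 5 (Reallocated_Sector_Ct), 197 (Current_Pending_Sector), and 198 (Offline_Uncorrectable) are the standard ones; the 100/10 thresholds are the comment's rules of thumb, not a specification:

```python
def should_replace(raw, good_ops=True):
    """raw: dict mapping SMART attribute ID -> raw value.

    Sums the sector-defect counters and applies the thresholds
    above: 100 with good operations, 10 for home use.
    """
    bad_sectors = raw.get(5, 0) + raw.get(197, 0) + raw.get(198, 0)
    return bad_sectors >= (100 if good_ops else 10)
```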
|
| For SSDs, good luck, I haven't seen prefailure indicators.
| tinus_hn wrote:
| I have had drives tell me something was wrong before they
| developed bad sectors. It does take a lot of monitoring though
| and probably leads one to replace drives earlier than might be
| necessary.
| dehrmann wrote:
| I've also never found anything meaningful out of SMART at home
| lab scale. I do know the way vendors report it is a shitshow,
| so I wouldn't suspect it'd be great training data.
|
| What I'd be really curious to try is run IO benchmarks on the
| drive and see if there are performance issues that indicate a
| drive is failing.
| oli5679 wrote:
| Fitting a binary classifier on failure doesn't seem quite right
| to me.
|
| There are some quite natural questions you can't answer unless
| you can predict some stats on when failure will occur, and if you
| have any drift in the dataset, you might just find features
| correlated with age (and hence having more time to fail) that
| don't inherently increase risk.
|
| I would recommend:
|
| (a) using survival analysis
|
| https://lifelines.readthedocs.io/en/latest/
|
| https://xgboost.readthedocs.io/en/latest/tutorials/aft_survi...
|
| or (b) flattening the data to be one observation per time window,
| and predicting failure over specific time windows.
|
| This is a good talk also. https://dataorigami.net/blogs/napkin-
| folding/18308867-video-...
|
| Although I'm not sure about this case, in general the approach
| in the article (a tree ensemble used as a binary classifier)
| works well for many business problems!
| heipei wrote:
| Great post, glad to see something written from beginning to end.
| One thing that surprised me though is that not a single word
| was spent on feature selection (by looking at feature
| importance for correct predictions) or feature engineering by
| combining features or otherwise applying domain knowledge. So
| for
| example, trying to figure out if you can put different attributes
| in relation to each other to get a better feature (e.g. number-
| of-spinups / operating-time-in-hours). Personally that would be
| the first thing I'd try, and coincidentally it's also the most
| fun aspect of machine learning for me ;)
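That ratio feature is a one-liner; the column names and values here are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "start_stop_count": [1200, 300],    # e.g. SMART attribute 4, raw value
    "power_on_hours":   [40000, 1000],  # e.g. SMART attribute 9, raw value
})

# hypothetical derived feature: spin-ups per operating hour, so that
# lightly-used and heavily-used drives become comparable
df["spinups_per_hour"] = df["start_stop_count"] / df["power_on_hours"]
```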
| mlthoughts2018 wrote:
| I think it's not actually a very good ML case study for this
| reason. Feature engineering would be a huge part of a problem
| like this. Additionally, jumping to start with XGBoost is a
| pretty amateurish thing to do, and the very first problem to
| attack is class imbalance.
|
| In the bio blurb they self-describe as an infra engineer who
| also enjoys data science. In some sense I really don't like to
| see that. The quality of the data science / ML work in this is
| actually quite bad, but people use these blog posts as resume
| padders to try to jump into ML jobs without ever having any
| real experience or training.
|
| I think it's a bad thing because it devalues the importance of
| real statistical computing skills, which take many years of
| education and experience to develop - absolutely not the sort
| of thing you can get by dabbling in some Python packages on
| the weekend to do a little project like this.
|
| The amount of waste I see from companies trying to avoid paying
| higher wages and avoid team structures that facilitate
| productivity of statistics experts is staggering - with all
| kinds of hacked up scripts and notebooks stitched together
| without proper backing statistical understanding, making ML
| engineers manage their own devops, and just ignoring base
| statistical questions.
|
| For this drive problem for example, I expect to see a
| progression from simple models, each of which should address
| class imbalance as a first order concern. I expect to see how
| Bayesian modeling can help and how simple lifetime survivorship
| models can help. I expect to see a lot of feature engineering.
|
| Instead I see an infra engineer playing around with data and
| trying one off the shelf framework, then claiming the whole
| premise can't work in production.
| jmpman wrote:
| The author doesn't explicitly call out the classic reliability
| engineer's concept of a bathtub failure model.
| https://en.m.wikipedia.org/wiki/Bathtub_curve
|
| Thermal abuse (running at a high temp for many hours), duty
| cycle, iop rate (high head acceleration versus head dwell) all
| impact the lifetime.
|
| Early life media issues are predictive of failure and can be
| detected through stress testing with various seek ranges. It's
| not as simple as a SMART metric, but did appear to have
| predictive value. The one experiment I ran discovered a set of
| drives for which the manufacturer ultimately admitted to a
| since-corrected process problem.
|
| The above factors are different on each drive family and can't
| normally be generalized.
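The bathtub shape is commonly modeled as a sum of Weibull hazards: one decreasing (infant mortality, shape < 1) and one increasing (wear-out, shape > 1). A sketch with invented parameters, not fitted to any drive data:

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (k/s) * (t/s)**(k-1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t):
    """Illustrative bathtub curve: infant mortality + wear-out.
    Parameters are made up for demonstration."""
    infant = weibull_hazard(t, shape=0.5, scale=2_000.0)    # falling with t
    wearout = weibull_hazard(t, shape=4.0, scale=40_000.0)  # rising with t
    return infant + wearout
```

Summed this way, the hazard is high for young drives, drops through mid-life, and climbs again at end of life, matching the curve the parent comment links.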
___________________________________________________________________
(page generated 2021-01-21 23:01 UTC)