Disks of higher capacity are being commissioned, and in larger numbers, to meet the growing demand for storage capacity. Disks are electromechanical devices and by nature have relatively high failure rates, and higher capacity disks tend to be more prone to errors and failures. Failure rates encountered in the field are likely far in excess of the Annual Failure Rate (“AFR”) reported by disk manufacturers (i.e., vendors). Additionally, a large number of failures occur within the vendor warranty period. Failure of a disk is a very serious event. Although disk failure may not be preventable, ideally it can be predicted so that action can be taken to prevent data loss or service disruption.
One predictive technique implemented by disk manufacturers uses Self-Monitoring, Analysis and Reporting Technology (“S.M.A.R.T.”) notifications. S.M.A.R.T. is implemented in disk firmware: a set of crucial parameters, including but not limited to operational temperature, head flying height, spin-up time, and reallocated sector count, is monitored with sensors and analyzed to predict disk failures based on logical deductions. A large number of vendors now implement this functionality. However, the exact technique followed and the attributes monitored vary from model to model. S.M.A.R.T. notifications cannot be fully relied upon for at least two reasons: the false alarm rate (“FAR”), and disk failures that occur without any S.M.A.R.T. notification.
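As a rough illustration of the threshold-style logic described above, the following Python sketch flags a disk when any monitored attribute crosses a fixed limit. The attribute names, the limit values, and the helper functions `tripped_attributes` and `predict_failure` are hypothetical and chosen only for illustration; actual firmware logic is vendor-specific and varies by model.

```python
from dataclasses import dataclass

# Hypothetical limits for illustration only; real firmware thresholds are
# vendor-specific and are not taken from any specification.
LIMITS = {
    "temperature_c": 60,        # flag above this operating temperature
    "spin_up_time_ms": 8000,    # flag above this spin-up time
    "reallocated_sectors": 50,  # flag above this reallocated sector count
}


@dataclass
class SmartReading:
    """One snapshot of a few monitored attributes (assumed names)."""
    temperature_c: float
    spin_up_time_ms: float
    reallocated_sectors: int


def tripped_attributes(reading):
    """Return the sorted names of attributes whose value exceeds its limit."""
    return sorted(
        name for name, limit in LIMITS.items()
        if getattr(reading, name) > limit
    )


def predict_failure(reading):
    """Predict failure if at least one attribute exceeds its limit."""
    return len(tripped_attributes(reading)) > 0
```

A fixed-limit rule of this kind shows both weaknesses noted above: a healthy disk that briefly runs hot trips the rule (a false alarm), while a disk that fails without any attribute crossing its limit is never flagged.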
Vendors have devoted resources to researching and developing improvements to failure prediction using S.M.A.R.T. data. One area that has received attention is improving the accuracy of failure predictions by drawing evidence-based conclusions instead of relying only on empirical logic. For example, various Bayesian methods have been developed, and some vendors have developed adaptive learning-based methodologies to enhance failure predictions. In these cases, vendors record all pre-failure disk events in a central repository and then analyze the events to learn the sequences of events that could result in disk failure. The goal of these learning-based methods is to predict disk failures where a more elementary logic-based prediction algorithm would fail. However complex the method, failure prediction remains an inexact science.
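One simple way to see how evidence can update a failure estimate is Bayes' rule applied to a single alarm. The Python sketch below is an assumption made for illustration and does not represent any vendor's method; the function name `posterior_failure_probability` and its parameters (a prior failure probability, a detection rate, and a false alarm rate) are hypothetical.

```python
def posterior_failure_probability(prior, detection_rate, false_alarm_rate):
    """Return P(failure | alarm) using Bayes' rule.

    prior            -- assumed P(failure) before any alarm is seen
    detection_rate   -- assumed P(alarm | failure)
    false_alarm_rate -- assumed P(alarm | no failure)
    """
    # Total probability of observing an alarm, from failing and healthy disks.
    p_alarm = detection_rate * prior + false_alarm_rate * (1.0 - prior)
    if p_alarm == 0.0:
        raise ValueError("an alarm has zero probability under these inputs")
    return detection_rate * prior / p_alarm
```

With purely illustrative inputs of a 0.02 prior, a 0.6 detection rate, and a 0.01 false alarm rate, the posterior is roughly 0.55, which shows how even a small false alarm rate keeps a single alarm well short of certainty when failures are rare.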