The present invention relates generally to digital processing systems. More specifically, the present invention relates to a method for preventing disk drive failures in high availability storage systems.
Typically, in computing applications, data storage systems include storage devices such as hard disk drives, floppy drives, tape drives, compact disks, and the like. An increase in the number and complexity of these applications has resulted in a proportional increase in the demand for larger storage capacities. Consequently, the production of high capacity storage devices has increased in the past few years. However, large storage capacities demand reliable storage devices with reasonably high data transfer rates. Moreover, the storage capacity of a single storage device cannot be increased beyond a certain limit. Hence, various data storage system configurations and topologies using multiple storage devices are commonly used to meet the growing demand for increased storage capacity.
One data storage system configuration that meets this growing demand uses multiple small disk drives. Such a configuration permits redundancy of stored data. Redundancy ensures data integrity in case of device failures. In many such data storage systems, recovery from common failures can be automated within the data storage system itself using data redundancy, such as parity, which is generated with the help of a central controller. However, such data redundancy schemes may impose overhead on the data storage system. These data storage systems are typically referred to as Redundant Array of Inexpensive/Independent Disks (RAID). The 1988 publication by David A. Patterson, et al., from the University of California at Berkeley, titled ‘A Case for Redundant Arrays of Inexpensive Disks (RAID)’, describes the fundamental concepts of the RAID technology.
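By way of illustration only, parity-based recovery of the kind mentioned above can be sketched as a byte-wise XOR across the blocks of a stripe. The function name `xor_blocks` and the three sample blocks are assumptions made for this sketch. They are not taken from any particular RAID implementation, and a real controller would also handle striping, block sizes, and parity placement.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length byte blocks."""
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical data blocks, one per data drive (illustrative values only).
data_blocks = [b"\x0f\xf0\xaa", b"\x33\x55\x01", b"\xff\x00\x10"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(data_blocks)

# Simulate the loss of the second drive: XOR of the surviving data blocks
# and the parity block reproduces the missing block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
```

Because XOR is its own inverse, a single parity block suffices to rebuild any one missing block, but not two. This corresponds to the single drive redundancy discussed in this background.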
RAID storage systems suffer from inherent drawbacks that reduce their availability. If one disk drive in the RAID storage system fails, data can be reconstructed with the help of redundant drives. The reconstructed data is then stored on a replacement disk drive. During reconstruction, the data on the failed drive is unavailable. Further, if more than one disk drive fails, the data on the failed drives cannot be reconstructed when there is only single drive redundancy, as is typical of most RAID storage systems. The probability of failure increases as the number of disk drives in a RAID storage system increases. Therefore, RAID storage systems with large numbers of disk drives are typically organized into several smaller RAID systems. This reduces the probability of failure of large RAID systems. Further, the use of smaller RAID systems also reduces the time it takes to reconstruct data on a spare disk drive in the event of a disk drive failure. When a RAID system loses a critical number of disk drives, there is a period of vulnerability from the time the disk drives fail until the time data reconstruction on the spare drives completes. During this time interval, the RAID system is exposed to the possibility of additional disk drives failing, which would cause a catastrophic failure. A catastrophic failure of a RAID system results in unrecoverable data loss. If the failure of one or more disk drives can be predicted with sufficient time to replace the drive or drives before the failure occurs, and the drive or drives can be replaced without sacrificing fault tolerance, data reliability and availability can be considerably enhanced.
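The dependence of failure probability on the number of drives can be illustrated with a simple calculation. The sketch below assumes that drives fail independently, each with the same probability p during some fixed interval, and that a group with single drive redundancy loses data when two or more of its drives fail. These assumptions, the function names, and the sample figures are illustrative only and do not come from measured drive data.

```python
def p_any_failure(n, p):
    """Probability that at least one of n drives fails."""
    return 1.0 - (1.0 - p) ** n

def p_data_loss(n, p):
    """Probability that two or more of n drives fail (single redundancy)."""
    no_failures = (1.0 - p) ** n
    exactly_one = n * p * (1.0 - p) ** (n - 1)
    return 1.0 - no_failures - exactly_one

def p_loss_partitioned(groups, group_size, p):
    """Probability that at least one of several equal groups loses data."""
    return 1.0 - (1.0 - p_data_loss(group_size, p)) ** groups

if __name__ == "__main__":
    p = 0.01  # assumed per-drive failure probability, illustrative only
    print("one group of 48 drives:", p_data_loss(48, p))
    print("six groups of 8 drives:", p_loss_partitioned(6, 8, p))
```

Under these assumptions, 48 drives organized as six groups of eight are less likely to suffer unrecoverable loss than the same 48 drives in a single group, at the cost of six redundant drives rather than one.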
There exist a number of methods for predicting impending failure of disk drives in storage systems. One such method is described in U.S. Pat. No. 5,727,144, titled ‘Failure Prediction for Disk Arrays’, assigned to International Business Machines Corporation, NY, and filed on Jul. 12, 1996. In this method, failure is predicted with the help of error analysis. This includes flyheight analysis and error log analysis. In flyheight analysis, failure is predicted if the flyheight of the read/write head above the disk surface is too low. In error log analysis, seek error rates, sector reassign rates, and the like, are compared with thresholds. If these factors exceed the thresholds, then failure is predicted. Data on a disk drive for which the monitored factors have exceeded the thresholds is copied onto a spare disk drive before the failure occurs. Further, if the disk drive fails before the data is completely copied, the contents of the failed disk drive are rebuilt.
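The threshold comparison used in error log analysis can be sketched as follows. This is a minimal illustration and not the method of the cited patent. The factor names, the threshold values, and the functions `exceeded_factors` and `predict_failure` are hypothetical. Flyheight analysis is omitted because it tests for a value that is too low rather than too high.

```python
def exceeded_factors(readings, thresholds):
    """Return the sorted names of factors whose reading exceeds its threshold."""
    return sorted(
        name
        for name, value in readings.items()
        if name in thresholds and value > thresholds[name]
    )

def predict_failure(readings, thresholds):
    """Predict failure when at least one monitored factor exceeds its threshold."""
    return bool(exceeded_factors(readings, thresholds))

# Hypothetical thresholds and readings (illustrative values only).
thresholds = {"seek_error_rate": 0.05, "sector_reassign_rate": 0.02}
readings = {"seek_error_rate": 0.09, "sector_reassign_rate": 0.01}

if predict_failure(readings, thresholds):
    # In the method described above, this is the point at which the data
    # would be copied to a spare disk drive.
    print("failure predicted:", exceeded_factors(readings, thresholds))
```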
Storageflex RAID systems, manufactured by Storageflex, Ontario, Canada, predict failure of disk drives with the help of Self-Monitoring, Analysis, and Reporting Technology (SMART) attributes. SMART is an interface between a disk drive and a drive controller. The drive controller receives information from disk drives, through the SMART interface, in the form of attributes. SMART attributes that are monitored in Storageflex RAID systems include head flying height, data throughput performance, spin-up time, reallocated sector count, seek error rate, seek time performance, spin retry count, and drive calibration retry count.
However, the methods and systems described above suffer from one or more of the following shortcomings. Disk drive manufacturers recommend some key factors for predicting disk drive failure. The manufacturers also recommend thresholds, which the factors should not exceed. The systems described above do not consider these factors. Further, the systems do not consider a sudden rise in these factors when predicting failure of disk drives.