1. Field of the Invention
The present invention relates to techniques for providing fault-tolerance in computer systems. More specifically, the present invention relates to a method and apparatus for using vibration signatures to detect the onset of hard disk drive failures.
2. Related Art
As computer systems become more powerful, they are increasingly used to manipulate larger volumes of data and to execute larger and more sophisticated computer programs. Today, computer systems often have a large number of hard disk drives. For example, a single server system can sometimes have as many as 15,000 hard disk drives.
An increasing number of businesses are using servers for mission-critical applications. Losing or corrupting data stored on disk drives can have a devastating effect on such businesses. For example, airlines rely on the integrity of data stored in their reservation systems for most of their day-to-day operations, and would essentially cease to function if this data became lost or corrupted. Note that if failing hard disk drives are identified before they fail, preventative measures can be taken to avoid such catastrophes. Hence, identifying hard disk drives that are likely to fail is critically important.
Present techniques for identifying hard disk drives that are likely to fail have many drawbacks. One technique relies on analysis of internal counter-type variables, such as read retries, write retries, seek errors, and dwell time (time between reads/writes). Unfortunately, this technique suffers from a high missed-alarm probability (MAP) of 50% and a false-alarm probability (FAP) of 1%. The high MAP causes an increased probability of massive data loss. The FAP causes a large number of No-Trouble-Found (NTF) drives to be returned, resulting in increased warranty costs.
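The effect of these two probabilities can be illustrated with a short calculation. The following Python sketch is illustrative only: the function name `expected_alarm_outcomes`, the count of 300 failing drives, and the reading of the FAP as a per-healthy-drive probability over one monitoring period are assumptions made for this example and are not taken from any measured data. Only the 50% MAP, the 1% FAP, and the 15,000-drive population come from the text above.

```python
def expected_alarm_outcomes(n_drives, n_failing, map_rate=0.50, fap_rate=0.01):
    """Return expected (missed_failures, false_alarms) for one monitoring period.

    map_rate: probability that a failing drive is NOT flagged (missed alarm).
    fap_rate: probability that a healthy drive IS flagged (false alarm).
    Both are treated as independent per-drive probabilities (an assumption).
    """
    n_healthy = n_drives - n_failing
    missed_failures = n_failing * map_rate   # failing drives that go undetected
    false_alarms = n_healthy * fap_rate      # healthy drives returned as NTF
    return missed_failures, false_alarms


# Hypothetical population: 15,000 drives, of which 300 are assumed to be failing.
missed, ntf = expected_alarm_outcomes(n_drives=15_000, n_failing=300)
print(f"expected missed failures: {missed:.0f}")
print(f"expected NTF returns:     {ntf:.0f}")
```

Under these hypothetical inputs, roughly 150 failing drives would go undetected and roughly 147 healthy drives would be flagged. Because healthy drives greatly outnumber failing ones, even a 1% FAP produces about as many NTF returns as there are correctly identified failures.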
Another technique relies on monitoring internal hard disk drive discrete performance metrics. This technique usually monitors internal diagnostic counter-type variables called “SMART variables.” However, hard disk drive manufacturers are reluctant to add extra diagnostics to monitor these variables, because doing so increases the cost of the commodity hard disk drives. Unfortunately, this technique also fails to identify approximately 50% of imminent hard disk drive failures.
To prevent catastrophic data loss due to hard disk drive failure, systems often use redundant arrays of inexpensive disks (RAID). Unfortunately, since the capacity of hard disk drives has increased dramatically in recent years, the time needed to rebuild a RAID disk after a failure of one of the disks has also increased dramatically. The rebuild process can take many hours to several days, during which the system is susceptible to a second hard disk drive failure, which would result in massive data loss. Hence, even the most advanced redundancy-based solutions are susceptible to data loss. Furthermore, note that a RAID array tends to contain hard disk drives from the same manufacturing lot. This lot might have an age-specific defect that was not caught during qualification tests of the lot. This can further increase the susceptibility of RAID arrays to data loss.
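The relationship between drive capacity and rebuild time can be sketched with simple arithmetic. The following Python example is a rough illustration only: the function name `rebuild_hours`, the 4 TB capacity, and the 50 MB/s sustained rebuild rate are hypothetical values chosen for the example, and the model assumes that the entire drive is rewritten at a constant rate with no competing workload.

```python
def rebuild_hours(capacity_tb, rebuild_mb_per_s):
    """Estimate the hours needed to rewrite one full drive during a RAID rebuild.

    Uses decimal units (1 TB = 1e12 bytes, 1 MB = 1e6 bytes) and assumes a
    constant sustained rebuild rate, which is a simplification.
    """
    capacity_bytes = capacity_tb * 1e12
    rate_bytes_per_s = rebuild_mb_per_s * 1e6
    seconds = capacity_bytes / rate_bytes_per_s
    return seconds / 3600.0


# Hypothetical drive: 4 TB rebuilt at a sustained 50 MB/s.
hours = rebuild_hours(capacity_tb=4, rebuild_mb_per_s=50)
print(f"estimated rebuild time: {hours:.1f} hours")
```

In this simplified model the rebuild time scales linearly with capacity, so doubling the capacity at the same rebuild rate doubles the window during which a second failure would cause data loss.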
Hence, what is needed is a method and an apparatus for detecting the onset of hard disk drive failure without the above-described problems.