1. Field of the Invention
The present invention relates to techniques for providing fault-tolerance for disk drives in computer systems. More specifically, the present invention relates to a method and an apparatus for proactively monitoring disk drives to identify impending disk drive failures using phase-sensitive detection.
2. Related Art
As computer systems grow increasingly powerful, they are able to process larger volumes of data and to execute larger and more sophisticated computer programs. To accommodate these larger volumes of data and larger programs, computer systems are using increasingly higher-capacity hard-disk drives (HDDs), as well as larger numbers of HDDs, typically organized into disk arrays. For example, some server systems currently support more than 15,000 disk drives. Meanwhile, the storage capacity of a single HDD is quickly approaching 1 terabyte.
Because storage arrays attached to computer systems have become ubiquitous, it is extremely advantageous to be able to monitor the health and performance of individual HDDs in a storage array and to perform remedial actions if necessary. Allowing data to be corrupted or lost can have a devastating effect on businesses that rely on the data. For example, airlines rely on the integrity of data stored in their reservation systems for most of their day-to-day operations, and would essentially cease to function if this data became lost or corrupted.
Currently, the standard HDD interfaces (SCSI, Fibre Channel, etc.) can report certain catastrophic malfunctions, such as a non-spinning disk or head misalignment. This information is processed by the operating system (typically in lower-end storage systems) or by a dedicated controller (typically in a service processor in higher-end storage arrays). However, in most cases, by the time the warning messages reach the user (through the console or a log file), the HDD has already failed.
Consequently, in more sophisticated storage systems, designers have developed techniques to mitigate the loss of data caused by disk drive failures. In particular, disk drives are often organized into a “Redundant Array of Independent Disks” or “RAID,” which employs two or more drives in combination to provide data redundancy. For example, in enterprise computer systems, most HDDs are organized into RAID configurations, so that data lost due to a HDD failure can be recovered from associated drives. Hence, a single HDD failure is not catastrophic for the customer's critical data. Note that even though these redundancy-based techniques can help prevent the loss of data, a failed disk drive must be replaced quickly to maintain system reliability.
Unfortunately, because drive capacities continue to climb exponentially, it can take as long as 10 to 12 hours for the RAID management software to migrate data following an unexpected drive failure in a storage array. If a redundant HDD fails during this time window (called a “partner pair” failure), all data on the failed HDDs can be lost. It has been observed that the number of partner pair failures has been climbing steadily as disk capacity increases exponentially.
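The relationship between drive capacity and the exposure window described above can be illustrated with a short calculation. The following is a minimal sketch only: the rebuild rate (25 MB/s), the annual failure rate (3%), and the constant-hazard failure model are illustrative assumptions that do not come from this document, and the function names are hypothetical.

```python
import math

def rebuild_hours(capacity_bytes, rebuild_rate_bytes_per_s):
    """Hours needed to migrate one drive's worth of data at a fixed rate."""
    return capacity_bytes / rebuild_rate_bytes_per_s / 3600.0

def partner_failure_probability(window_hours, annual_failure_rate):
    """Chance that one partner drive fails within the rebuild window.

    Assumes a constant hazard rate (exponential lifetime model), which is
    a simplification chosen for illustration only.
    """
    hours_per_year = 24 * 365
    hazard_per_hour = -math.log(1.0 - annual_failure_rate) / hours_per_year
    return 1.0 - math.exp(-hazard_per_hour * window_hours)

# Assumed values: 1 TB and 2 TB drives, 25 MB/s sustained rebuild rate,
# 3% annual failure rate for the partner drive.
ASSUMED_REBUILD_RATE = 25e6
ASSUMED_ANNUAL_FAILURE_RATE = 0.03

hours_1tb = rebuild_hours(1e12, ASSUMED_REBUILD_RATE)  # about 11.1 hours
hours_2tb = rebuild_hours(2e12, ASSUMED_REBUILD_RATE)
p_1tb = partner_failure_probability(hours_1tb, ASSUMED_ANNUAL_FAILURE_RATE)
p_2tb = partner_failure_probability(hours_2tb, ASSUMED_ANNUAL_FAILURE_RATE)

print(f"1 TB: window {hours_1tb:.1f} h, partner failure probability {p_1tb:.2e}")
print(f"2 TB: window {hours_2tb:.1f} h, partner failure probability {p_2tb:.2e}")
```

Under these assumptions, doubling the capacity at a fixed rebuild rate doubles the window, and the per-rebuild probability of a partner pair failure grows almost proportionally.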
Note that disk drives can fail in a number of ways. A failure in the electrical circuitry of a disk drive is typically instantaneous and catastrophic. On the other hand, more common mechanical failures often develop over an extended period of time. For example, one of the most common disk drive failures is a failure of the spindle in a HDD. Spindle failures typically take place over an extended period of time as frictional forces gradually wear away at the spindle bearing. In many cases, a spindle can change from being fully functional to completely failed over several hours, or even days. Hence, providing a proactive warning about an incipient problem with the spindle can allow the user to take preventive actions well before a failure actually occurs. In particular, for single-HDD systems such as low-end personal systems, such a proactive warning can enable the user to perform one more backup and then replace the HDD. On the other hand, for systems with HDD arrays, this proactive warning can allow migration software to begin migrating data well in advance of failure, thereby significantly reducing the likelihood of a catastrophic partner pair failure.
Some existing software techniques attempt to detect incipient failures by analyzing read/write errors and retry attempts. While these techniques can be effective in some situations, a disk drive needs to be very close to failure before the software can detect the impending failure. This leaves very little time to replace the failing disk drive.
Another existing technique uses acoustic resonance spectroscopy for high-sensitivity annunciation of disk drives with mechanical problems in advance of failure. More specifically, a microphone records the “sound” generated by each spindle in the HDD array, and a time series of Fourier transforms of these signals is acquired. Subsequent spectral analysis of these signals can detect the onset of failure for individual HDDs in the storage arrays. This technique is described in U.S. Pat. No. 6,782,324 B2 issued on Aug. 24, 2004, entitled, “Method and Apparatus for Using Acoustic Signals to Identify One or More Disk Drives That are Likely to Fail,” by inventors Kenny C. Gross and Wendy Lu.
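The general idea of acquiring a time series of Fourier transforms and tracking a spectral feature over time can be sketched as follows. This is an illustrative sketch only and is not the method of the cited patent: the windowing scheme, the band-energy feature, the function names, and the synthetic signal (a steady 120 Hz tone plus a growing 900 Hz component standing in for a degrading spindle) are all assumptions made for the example.

```python
import numpy as np

def spectral_time_series(signal, sample_rate, window_size):
    """Split a recording into consecutive windows and transform each one.

    Returns (freqs, spectra), where spectra[i, k] is the magnitude at
    freqs[k] during window i.
    """
    n_windows = len(signal) // window_size
    frames = np.reshape(signal[:n_windows * window_size],
                        (n_windows, window_size))
    taper = np.hanning(window_size)  # reduces spectral leakage
    spectra = np.abs(np.fft.rfft(frames * taper, axis=1))
    freqs = np.fft.rfftfreq(window_size, d=1.0 / sample_rate)
    return freqs, spectra

def band_energy_trend(freqs, spectra, low_hz, high_hz):
    """Energy inside one frequency band for each window, in time order."""
    band = (freqs >= low_hz) & (freqs <= high_hz)
    return np.sum(spectra[:, band] ** 2, axis=1)

# Synthetic stand-in for a microphone recording (assumed values throughout).
sample_rate = 8000
n_samples = 4 * sample_rate
t = np.arange(n_samples) / sample_rate
growth = np.linspace(0.0, 1.0, n_samples)
rng = np.random.default_rng(0)
signal = (np.sin(2 * np.pi * 120 * t)
          + growth * np.sin(2 * np.pi * 900 * t)
          + 0.05 * rng.standard_normal(n_samples))

freqs, spectra = spectral_time_series(signal, sample_rate, window_size=2000)
trend = band_energy_trend(freqs, spectra, 850.0, 950.0)
print("band energy, first window:", trend[0])
print("band energy, last window: ", trend[-1])
```

In this sketch, a steadily rising band energy across windows plays the role of the onset signature; a practical monitor would need a detection rule and a way to separate individual drives, neither of which is specified here.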
However, because of privacy concerns, businesses are increasingly reluctant to allow “open” microphones to be installed in their computer systems. Moreover, acoustic spectra in large storage arrays have been found to be contaminated with noise associated with the read/write head slider arm control mechanism, which diminishes the signal-to-noise ratio (SNR) for proactive fault monitoring based on acoustics.
Hence, what is needed is a method and an apparatus for providing proactive warning of an incipient problem with the spindle of a HDD without the above-described problems.