1. Field of the Invention
The present invention relates to systems for providing fault-tolerance for disk drives in computer systems. More specifically, the present invention relates to a method and an apparatus for using acoustic signals to identify disk drives that are likely to fail in a computer system.
2. Related Art
As computer systems grow more powerful, they can manipulate larger volumes of data and execute larger and more sophisticated computer programs. To accommodate these larger volumes of data and larger programs, computer systems are using larger amounts of disk storage. For example, some existing server systems support more than 15,000 disk drives.
Ensuring the reliability of disk storage in these systems is critically important for most applications. Allowing data to be corrupted or lost can have a devastating effect on businesses that rely on the data. For example, airlines rely on the integrity of data stored in their reservation systems for most of their day-to-day operations, and would essentially cease to function if this data became lost or corrupted.
About one percent of disk drives within a computer system fail each year. This has motivated system designers to develop techniques to mitigate the loss of data caused by disk drive failures. For example, disk drives are often organized into “RAID” arrays to ameliorate the effects of a drive failure by providing data redundancy.
Although these redundancy-based techniques can help prevent the loss of data, a failed disk drive must be replaced quickly to maintain system reliability. If a second disk drive fails before the first failed disk drive can be replaced, data can be lost.
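The scale of this exposure can be illustrated with a short calculation. The following Python sketch is illustrative only and is not part of the described system. It combines the figures given above (15,000 disk drives and an annual failure rate of about one percent) and additionally assumes, purely for illustration, that drives fail independently at a constant rate. The function names, the redundancy group of eight drives, and the 24-hour replacement window are hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365


def expected_failures_per_year(num_drives, annual_failure_rate):
    """Expected number of drive failures per year across the system."""
    return num_drives * annual_failure_rate


def prob_second_failure(other_drives, annual_failure_rate, window_hours):
    """Probability that at least one of the remaining drives fails
    before a failed drive is replaced.

    Assumption (not from the source): failures are independent and
    occur at a constant rate, so lifetimes are exponentially distributed.
    """
    # Convert the annual failure probability into an hourly failure rate.
    rate_per_hour = -math.log(1.0 - annual_failure_rate) / HOURS_PER_YEAR
    return 1.0 - math.exp(-rate_per_hour * other_drives * window_hours)


# Figures taken from the text: 15,000 drives, about 1% failing per year.
failures = expected_failures_per_year(15_000, 0.01)
days_between_failures = 365 / failures
print(f"expected failures per year: {failures:.0f}")
print(f"average days between failures: {days_between_failures:.1f}")

# Hypothetical redundancy group: 8 drives, one already failed,
# replaced within a 24-hour window.
risk = prob_second_failure(other_drives=7, annual_failure_rate=0.01,
                           window_hours=24)
print(f"chance of a second failure during replacement: {risk:.6f}")
```

Under these assumptions, a system of this size can expect about 150 drive failures per year, or roughly one every two to three days. The chance of a second failure grows with the length of the replacement window, which is why earlier warning of a failing drive is valuable.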
Note that disk drives can fail in a number of ways. A failure in the electrical circuitry of a disk drive is typically instantaneous and catastrophic. The more common mechanical failures, on the other hand, often develop over an extended period of time. For example, one of the most common disk drive failures is a failure of a spindle bearing, which typically progresses as frictional forces gradually wear away at the bearing. In many cases, a spindle bearing changes from fully functional to completely failed over several hours, or even days.
Some software solutions attempt to detect incipient failures by analyzing read/write errors and retry attempts. While this technique can be effective in some situations, a disk drive needs to be very close to failure before the software can detect the impending failure. This leaves very little time to replace the failing disk drive.
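The general shape of such an error-based detector can be sketched as follows. This Python example is a hypothetical illustration, not a description of any particular product. The class name, the window size, and the retry threshold are all assumptions. It flags a drive only after the fraction of recent operations that required a retry exceeds a threshold.

```python
from collections import deque


class RetryRateMonitor:
    """Hypothetical detector that flags a drive when the fraction of
    recent I/O operations needing a retry exceeds a threshold."""

    def __init__(self, window_size=1000, max_retry_fraction=0.05):
        # Sliding window over the most recent operations (1 = retried).
        self.window = deque(maxlen=window_size)
        self.max_retry_fraction = max_retry_fraction

    def record(self, needed_retry):
        """Record one I/O operation; return True if the drive is flagged."""
        self.window.append(1 if needed_retry else 0)
        if len(self.window) < self.window.maxlen:
            # Not enough history yet to judge the drive.
            return False
        retry_fraction = sum(self.window) / len(self.window)
        return retry_fraction > self.max_retry_fraction


# Small window so the behavior is easy to follow.
monitor = RetryRateMonitor(window_size=10, max_retry_fraction=0.2)
healthy = [monitor.record(False) for _ in range(10)]
degraded = [monitor.record(True) for _ in range(3)]
print(healthy)
print(degraded)
```

A detector of this kind raises no warning until retries are already frequent in normal operation. This illustrates the limitation noted above: mechanical wear that has not yet produced read/write errors is invisible to it.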
What is needed is a method and an apparatus for identifying disk drives that are likely to fail without the problems described above.