1. Field of the Invention
Embodiments of the present invention relate to techniques for performing reliability tests on computer systems. More specifically, embodiments of the present invention relate to a method and an apparatus for monitoring vibrations to facilitate reliability studies on a computer system.
2. Related Art
Enterprise computer systems often include a large number of hard disk drives. For example, a single server system can sometimes include as many as 15,000 hard disk drives. Losing the data stored on these disk drives can have a devastating effect on an organization. For example, airlines rely on the integrity of data stored in their reservation systems for most of their day-to-day operations, and would essentially cease to function if this data were lost or corrupted. If fault-prone hard disk drives can be identified before they fail, preventative measures can be taken to avoid such failures.
Existing techniques for identifying hard disk drives that are likely to fail have many drawbacks. One technique analyzes internal counter-type variables, such as read retries, write retries, seek errors, and dwell time (the time between reads/writes), to determine whether a disk drive is likely to fail. Unfortunately, in practice, this technique suffers from a high missed-alarm probability (MAP) of about 50%, and a false-alarm probability (FAP) of about 1%. The high MAP increases the probability of massive data loss, and the FAP causes a large number of drives to be returned and then diagnosed as No-Trouble-Found (NTF), resulting in increased warranty costs.
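To make the effect of these two error rates concrete, the following minimal sketch computes the expected number of missed failures and NTF returns. The MAP (50%), FAP (1%), and population size (15,000 drives) come from the text above; the function name and the figure of 300 drives with imminent failures are hypothetical and chosen only for illustration.

```python
def alarm_outcomes(n_drives, n_failing, map_rate, fap_rate):
    """Expected counts for a failure predictor with the given MAP and FAP.

    n_failing is an assumed number of drives that are actually about to
    fail; it is not a figure taken from the text.
    """
    healthy = n_drives - n_failing
    missed = n_failing * map_rate        # failing drives that are never flagged
    caught = n_failing - missed          # failing drives correctly flagged
    false_alarms = healthy * fap_rate    # healthy drives flagged (NTF returns)
    return {"missed": missed, "caught": caught, "false_alarms": false_alarms}


# Illustrative population: 15,000 drives, of which 300 are assumed to be failing.
outcome = alarm_outcomes(n_drives=15_000, n_failing=300,
                         map_rate=0.50, fap_rate=0.01)
for name, count in outcome.items():
    print(f"{name}: {count:.0f}")
```

Under these assumed numbers, the false alarms on healthy drives are about as numerous as the correctly flagged drives, which suggests how even a 1% FAP can generate a substantial number of NTF returns in a large population.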
Another technique monitors internal discrete performance metrics within disk drives, for example, by monitoring internal diagnostic counter-type variables called “SMART variables.” However, hard disk drive manufacturers are reluctant to add extra diagnostics to monitor these variables, because doing so increases costs. Furthermore, in practice, this technique fails to identify approximately 50% of imminent hard disk drive failures.
To prevent catastrophic data loss due to hard disk failures, systems often use redundant arrays of inexpensive disks (RAID) to provide fault tolerance. Unfortunately, because the capacity of hard disk drives has increased dramatically in recent years, the time required to rebuild a RAID array after one of its disks fails has also increased dramatically. Consequently, the rebuilding process can take anywhere from many hours to several days, during which time the system is susceptible to a second hard disk drive failure, which would cause massive data loss.
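The dependence of rebuild time on drive capacity can be illustrated with a rough calculation. This is only a sketch: it assumes the rebuild is limited by a constant sustained throughput to the replacement drive, and the capacities and rates below are hypothetical values, not figures from the text.

```python
def rebuild_hours(capacity_bytes, rebuild_bytes_per_sec):
    """Rough rebuild time in hours, assuming a constant sustained rebuild rate."""
    return capacity_bytes / rebuild_bytes_per_sec / 3600.0


TB = 1e12   # decimal terabyte, in bytes
MB = 1e6    # decimal megabyte, in bytes

# Hypothetical scenarios: an otherwise idle array versus one that is also
# serving production I/O, which leaves less bandwidth for the rebuild.
scenarios = [
    ("1 TB at 50 MB/s", 1 * TB, 50 * MB),
    ("1 TB at 10 MB/s", 1 * TB, 10 * MB),
    ("4 TB at 10 MB/s", 4 * TB, 10 * MB),
]
for label, capacity, rate in scenarios:
    print(f"{label}: about {rebuild_hours(capacity, rate):.1f} hours")
```

Because the estimate scales linearly with capacity, any growth in drive capacity that is not matched by a proportional increase in rebuild throughput lengthens the window during which a second failure can cause data loss.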
During operation, a disk drive produces vibrations (and/or acoustic signatures) which can contain important diagnostic information (e.g., frequency, amplitude, and phase) related to the health of the disk drive. For example, the vibration information for a hard disk drive can indicate whether its spindle assembly is failing. Furthermore, vibration signatures are typically unique for different failure modes. For example, ball bearing imperfections produce vibrations at a unique frequency that is related to the spindle rotational frequency. This vibration information is useful for predicting hard disk drive failures. Hence, accelerometers or microphones can be used to acquire vibration or acoustic signatures from hard disk drives. Unfortunately, accelerometers are too complicated and expensive to deploy across large systems. Microphones, on the other hand, are cheaper, but they pick up external sounds (e.g., human voices) which are not related to the hard disk drives being monitored, and recording these external sounds is highly undesirable for security and privacy reasons.
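As an illustration of how a bearing imperfection maps to a frequency tied to the spindle speed, the sketch below evaluates the standard kinematic relations for rolling-element bearings. The text above does not give these formulas; they are supplied here as an assumption, and they hold for a bearing whose outer race is stationary. The function name and the bearing geometry (8 balls, 1.0 mm ball diameter, 5.0 mm pitch diameter, zero contact angle) are hypothetical, and 7200 RPM is used only as an example spindle speed.

```python
import math


def bearing_defect_frequencies(shaft_hz, n_balls, ball_diameter,
                               pitch_diameter, contact_angle_deg=0.0):
    """Characteristic defect frequencies (Hz) of a rolling-element bearing.

    Assumes a stationary outer race and a rotating inner race. The two
    diameters must share the same unit; only their ratio is used.
    """
    ratio = (ball_diameter / pitch_diameter) * math.cos(
        math.radians(contact_angle_deg))
    return {
        # Cage (fundamental train) frequency
        "FTF": 0.5 * shaft_hz * (1.0 - ratio),
        # Ball-pass frequency for an outer-race defect
        "BPFO": 0.5 * n_balls * shaft_hz * (1.0 - ratio),
        # Ball-pass frequency for an inner-race defect
        "BPFI": 0.5 * n_balls * shaft_hz * (1.0 + ratio),
        # Ball spin frequency for a defect on a ball
        "BSF": (pitch_diameter / (2.0 * ball_diameter)) * shaft_hz
               * (1.0 - ratio ** 2),
    }


# Example spindle speed: 7200 RPM corresponds to 120 Hz.
spindle_hz = 7200 / 60.0
freqs = bearing_defect_frequencies(spindle_hz, n_balls=8,
                                   ball_diameter=1.0, pitch_diameter=5.0)
for name, hz in freqs.items():
    print(f"{name}: {hz:.1f} Hz")
```

Each defect type yields a different multiple of the spindle frequency, which is consistent with the observation that distinct failure modes leave distinct vibration signatures.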
Hence, what is needed is a method and an apparatus for detecting vibrations without the above-described problems.