1. Field of the Invention
The present invention relates to a method and apparatus for detecting degradation in a remote storage device.
2. Related Art
Enterprise computer systems often include a large number of hard disk drives. For example, a single server system can sometimes have as many as 15,000 hard disk drives. Losing data stored on these disk drives can have a devastating effect on an organization. For example, airlines rely on the integrity of data stored in their reservation systems for most of their day-to-day operations, and would essentially cease to function if this data became lost or corrupted. If fault-prone hard disk drives can be identified before they fail, preventative measures can be taken to avoid such failures.
Present techniques for identifying hard disk drives that are likely to fail have many shortcomings. One technique analyzes internal counter-type variables, such as read retries, write retries, seek errors, and dwell time (the time between reads/writes), to determine whether a disk drive is likely to fail. Unfortunately, in practice, this technique suffers from a high missed-alarm probability (MAP) of 50% and a false-alarm probability (FAP) of 1%. The high MAP increases the probability of massive data loss, and the FAP causes a large number of No-Trouble-Found (NTF) drives to be returned, resulting in increased warranty costs.
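To illustrate how these two probabilities translate into missed failures and NTF returns, the following minimal sketch may be helpful. It assumes that MAP is the fraction of truly failing drives that are not flagged and that FAP is the fraction of healthy drives that are flagged; the function names, the choice of Python, and the count of 100 failing drives are hypothetical and are used only for illustration. Under those assumptions, a 15,000-drive system with a MAP of 50% and a FAP of 1% would miss about 50 failing drives and flag about 149 healthy ones.

```python
def alarm_probabilities(missed, failing, false_alarms, healthy):
    """Return (MAP, FAP) as fractions.

    Assumed definitions (not taken from the source text):
      MAP = failing drives that were NOT flagged / all failing drives
      FAP = healthy drives that WERE flagged / all healthy drives
    """
    if failing <= 0 or healthy <= 0:
        raise ValueError("failing and healthy counts must be positive")
    return missed / failing, false_alarms / healthy


def expected_outcomes(n_drives, n_failing, map_prob, fap_prob):
    """Return (expected missed failures, expected NTF returns)."""
    if not 0 <= n_failing <= n_drives:
        raise ValueError("n_failing must be between 0 and n_drives")
    n_healthy = n_drives - n_failing
    expected_missed = n_failing * map_prob   # failing drives never flagged
    expected_ntf = n_healthy * fap_prob      # healthy drives flagged anyway
    return expected_missed, expected_ntf


if __name__ == "__main__":
    # Hypothetical population: 15,000 drives, 100 of which are truly failing.
    missed, ntf = expected_outcomes(15_000, 100, map_prob=0.50, fap_prob=0.01)
    print(f"expected missed failures: {missed:.0f}")
    print(f"expected NTF returns:     {ntf:.0f}")
```

Because healthy drives greatly outnumber failing drives in such a population, even a small FAP can produce more NTF returns than there are true failures.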
Another technique monitors internal discrete performance metrics within disk drives, for example, by monitoring internal diagnostic counter-type variables called “SMART variables.” However, hard disk drive manufacturers are reluctant to add extra diagnostics to monitor these variables, because doing so increases the cost of the commodity hard disk drives. Furthermore, in practice, this technique fails to identify approximately 50% of imminent hard disk drive failures.
To prevent catastrophic data loss due to hard disk failures, systems often use redundant arrays of inexpensive disks (RAID). Unfortunately, because the capacity of hard disk drives has increased dramatically in recent years, the time needed to rebuild a RAID disk after a failure of one of the disks has also increased dramatically. Consequently, the rebuild process can take many hours to several days, during which time the system is susceptible to a second hard disk drive failure, which would cause massive data loss. Furthermore, data loss can occur if a second disk fails before a first disk is replaced. Hence, even the most advanced redundancy-based solutions are susceptible to data loss.
Moreover, some computer systems store data to remote storage devices. Typically, information about the health of the remote storage device is not available to the computer system. As a result, the computer system cannot determine whether the remote storage device is at the onset of degradation.
Hence, what is needed is a method and an apparatus for detecting degradation of a remote storage device without the problems described above.