Computer systems typically include some form of storage system for storing data in non-volatile form. These storage systems may include a plurality of storage devices, such as magnetic hard disk drives (“disk drives”), arranged in an array such that increased storage capacity and data redundancy may be achieved. Periodically, these storage devices may experience errors of various origins. Disk drives, for example, are subject to a number of possible failures which can compromise data integrity. Certain tracks on a particular disk may be affected by defects in the magnetic recording media. Data errors can be produced by the non-uniform flying height of the read/write head over the magnetic disk. Power outages can also cause spindle-motor or servo-motor seizures. In some cases, the power supply or the controller board for a disk drive can fail completely, or a disk drive can lose functionality while data is being written to the disk. All of these potential failures pose a threat to the integrity of data or may result in performance degradation, as error recovery systems work to repair or reconstruct lost data.
In computing systems for large data processing and data storage applications, redundant storage devices are often provided to enhance the integrity of data maintained on the system in the event of a failure of a storage device. For example, RAID (“Redundant Array of Inexpensive Disks”) technology utilizes an array of disk drives which contain data and parity information distributed across the disk drives in the array. The parity information is additional information stored on the disks and can be used to reconstruct data contained on any of the disk drives in the array in the event of a single disk drive failure. In this manner, these RAID disk arrays can improve the data integrity of the storage system by providing for data recovery despite the failure of one disk drive. However, the use of a large number of inexpensive disks in an array can pose reliability issues because the predicted frequency of failure in an array is equal to the predicted failure rate for each disk drive multiplied by the number of disk drives in the array. As the total number of disk drives increases, the frequency of failure in the array increases accordingly.
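The parity reconstruction and the failure-rate scaling described above can be illustrated with a minimal sketch. It assumes simple byte-wise XOR parity over equally sized blocks, as used in single-parity RAID levels; the text does not specify a parity scheme, and the function names (`xor_blocks`, `reconstruct`, `array_failure_rate`) and the sample values are hypothetical, chosen only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a sequence of equally sized byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity):
    """Rebuild the single missing block from the surviving blocks and the parity.

    Assumes exactly one block of the stripe was lost (single drive failure).
    """
    return xor_blocks(list(surviving_blocks) + [parity])


def array_failure_rate(per_drive_rate, drive_count):
    """Predicted array failure frequency: per-drive rate times number of drives."""
    return per_drive_rate * drive_count


# One stripe spread across three hypothetical data drives, plus its parity block.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# Simulate losing the second drive and recovering its block from the others.
lost = data[1]
recovered = reconstruct([data[0], data[2]], parity)
assert recovered == lost
```

With these assumptions, any one block of the stripe can be rebuilt from the remaining blocks and the parity, while `array_failure_rate` shows why adding drives raises the expected frequency of failures: doubling `drive_count` doubles the predicted rate.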
Another difficulty which may be encountered when operating a storage system is determining when a particular storage device has “failed.” In the event of total storage device failure, the problem storage device may simply stop responding to commands. Such a complete inability to respond may be characterized as an absolute or permanent failure. However, not all problems with storage devices manifest themselves as absolute failures. Instead, the faulty storage device may produce less catastrophic errors, in which the storage device may respond to commands, but introduce some sort of error in its response, such as reading the wrong sector of the disk platter. While absolute failures typically result in immediate cessation of storage device operation, the presence of a lesser error may not have any noticeable effect on further operations of the storage device. It may not be immediately apparent that the storage device's response to the command was faulty, and erroneous data may be returned to the requesting system without warning.
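The contrast between an absolute failure and a lesser error can be made concrete with a small simulation. This is only a sketch under stated assumptions: the classes `DeadDisk` and `MisdirectedDisk`, the `"OK"` status string, and the sector contents are all hypothetical and do not model any real drive interface.

```python
class DeadDisk:
    """Hypothetical model of an absolute failure: the device never responds."""

    def read(self, sector):
        raise TimeoutError("device did not respond")


class MisdirectedDisk:
    """Hypothetical model of a lesser error: one sector is silently misread."""

    def __init__(self, sectors, faulty_sector, offset=1):
        self.sectors = sectors
        self.faulty_sector = faulty_sector
        self.offset = offset

    def read(self, sector):
        actual = sector
        if sector == self.faulty_sector:
            # The head lands on the wrong sector, but the drive reports success.
            actual = (sector + self.offset) % len(self.sectors)
        return "OK", self.sectors[actual]


sectors = [b"alpha", b"bravo", b"charlie", b"delta"]
disk = MisdirectedDisk(sectors, faulty_sector=2)

# The absolute failure is obvious to the caller.
try:
    DeadDisk().read(0)
    dead_disk_detected = False
except TimeoutError:
    dead_disk_detected = True

# The lesser error is not: the status looks normal, yet the data is wrong.
status, payload = disk.read(2)
silent_error = status == "OK" and payload != sectors[2]
```

In this sketch the requesting system sees the same `"OK"` status for a correct read and for the misdirected one, so nothing in the response itself reveals that the returned data came from the wrong sector.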