Storage systems that hold data in non-volatile form may include a plurality of storage devices, such as magnetic hard disk drives (“disk drives”), arranged in an array to provide increased storage capacity and data redundancy. From time to time, these storage devices may experience errors of various origins. Disk drives, for example, are subject to a number of possible failures that can compromise data integrity. Certain tracks on a particular disk may be affected by defects in the magnetic recording media. Data errors can be produced by a non-uniform flying height of the read/write head over the magnetic disk. Power outages can also cause spindle-motor or servo-motor seizures. In some cases, the power supply or the controller board for a disk drive can fail completely, or a disk drive can lose functionality while data is being written to the disk. All of these potential failures pose a threat to the integrity of data or may result in performance degradation, as error recovery systems work to repair or reconstruct lost data.
These types of errors may be “silent” because the drive does not always detect that an error has occurred. If left undetected, such errors may have detrimental consequences, such as long-term data corruption that cannot be repaired from backup.
In computing systems for large data processing and data storage applications, redundant storage devices are often provided to enhance the integrity of data maintained on the system in the event of a failure of a storage device. For example, RAID (“Redundant Array of Inexpensive Disks”) technology utilizes an array of disk drives in which data and parity information can be distributed across the disk drives in the array. The parity information is additional information stored on the disks which can be used to reconstruct the data contained on any one of the disk drives in the array in the event of a single disk drive failure. In this manner, these RAID disk arrays can improve the data integrity of the storage system by providing for data recovery despite the failure of a disk drive. However, the use of a large number of inexpensive disks in an array can pose reliability issues because the predicted frequency of failure in an array is equal to the predicted failure rate for each disk drive multiplied by the number of disk drives in the array. As the total number of disk drives increases, the frequency of failure in the array increases accordingly.
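The parity-based reconstruction described above can be illustrated with a minimal sketch. It assumes simple byte-wise XOR parity over a single stripe held in memory; the names `xor_blocks` and `reconstruct` and the sample data are hypothetical and chosen only for illustration, not taken from any particular RAID implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(stripe, failed_index):
    """Rebuild the block at failed_index by XOR-ing all surviving blocks.

    The index of the failed drive must be supplied by the caller; the
    parity does not reveal it.
    """
    survivors = [block for i, block in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


# Hypothetical stripe: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
stripe = data + [parity]

# Drive 1 is known to have failed; its block is rebuilt from the others.
rebuilt = reconstruct(stripe, failed_index=1)
```

Because XOR is its own inverse, the same routine rebuilds either a data block or the parity block, but only for one missing block per stripe.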
In addition, conventional RAID systems often do not provide sufficient mechanisms for diagnosing and repairing errors, particularly when the errors are silent or when there are multiple disk drive failures. RAID-style redundancy is typically intended to improve availability by enabling systems to recover from clearly identified failures. For instance, RAID 5 can recover the data on a disk drive when the disk drive is known to have failed (i.e., when the disk drive stops serving requests). The RAID 5 redundancy itself is not used to identify the failure. Therefore, silent errors can exist and propagate without warning.
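The way a silent error can escape notice may be sketched as follows. This is an illustrative model only, assuming byte-wise XOR parity and a read path that returns a stored block without consulting the parity; `read_block`, `parity_consistent`, and the sample data are hypothetical names introduced here, not part of any real RAID controller.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_block(stripe, index):
    """Assumed normal read path: return the stored block as-is."""
    return stripe[index]


def parity_consistent(stripe):
    """True when the XOR of all data and parity blocks is all zero bytes."""
    return not any(xor_blocks(stripe))


# Hypothetical stripe: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
stripe = data + [xor_blocks(data)]

# Silent corruption: the drive keeps serving requests, so no failure is flagged.
stripe[1] = b"BxBB"

corrupted_read = read_block(stripe, 1)       # wrong data is returned without warning
mismatch = not parity_consistent(stripe)     # only an explicit check notices a problem
```

In this sketch an explicit consistency check would reveal that the stripe is inconsistent, but a single parity block cannot by itself indicate which block is the corrupted one.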