In a non-RAID computer system, if a disk drive fails, all or part of the stored customer data may be permanently lost, or may be partially or fully recoverable only at some expense and effort. Although backup and archiving devices and procedures may preserve all but the most recently saved data, there are certain applications in which the risk of any data loss and the time required to restore data from a backup copy are unacceptable. Therefore, RAID (“redundant array of inexpensive disks”) storage subsystems are frequently used to provide improved data integrity and device fault tolerance. If a drive in a RAID system fails, all of the data may be quickly and inexpensively recovered.
There are numerous methods of implementing RAID systems. Such methods are commonly known in the industry and only a few will be described, and only generally, herein. A very basic RAID system, RAID level 1, employs simple mirroring of data on two parallel drives. If one drive fails, customer data may be read from the other. In RAID level 2, bits of a data word are written to separate drives, with ECC (error correction code) being written to additional drives. When data is read, the ECC verifies that the data is correct and may correct incorrect data caused by the failure of a single drive. In RAID 3, data blocks are divided and written across two or more drives. Parity information is written to another, dedicated drive. Similar to RAID 2, data is parity checked when read and may be corrected if one drive fails.
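By way of illustration only, the dedicated-parity recovery described for RAID 3 can be sketched as follows. The sketch assumes byte-wise XOR parity over equally sized blocks; the names `xor_blocks` and `reconstruct` and the sample block contents are hypothetical and are not taken from any particular RAID implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity):
    """Rebuild the one missing block from the survivors and the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Three hypothetical data drives, each holding one block of a stripe.
data_drives = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]

# The dedicated parity drive holds the XOR of all data blocks.
parity_drive = xor_blocks(data_drives)

# Simulate the failure of drive 1 and rebuild its block from the others.
lost = 1
survivors = [blk for i, blk in enumerate(data_drives) if i != lost]
recovered = reconstruct(survivors, parity_drive)
```

Because XOR is its own inverse, combining the parity block with every surviving data block yields the missing block. The same property limits this scheme to recovery from the failure of a single drive.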
In RAID level 5, data blocks are not split but are written block by block across two or more disks. Parity information is distributed across the same drives. Thus, again, customer data may be recovered in the event of the failure of a single drive. RAID 6 is an extension of RAID 5 and allows recovery from the simultaneous failure of multiple drives through the use of a second, independent, distributed parity scheme. Finally, RAID 10 (or 1-0) combines the mirroring of RAID 1 with data striping. Recovery from multiple simultaneous drive errors may be possible.
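The distribution of parity across the drives in RAID 5 can be illustrated with a small layout sketch. The rotation rule used here (parity starts on the last drive and moves one drive to the left with each stripe) is an assumption chosen for illustration; actual implementations may place parity differently. The function names are hypothetical.

```python
def parity_drive_for_stripe(stripe, num_drives):
    """Index of the drive holding parity for a stripe (assumed rotation)."""
    return (num_drives - 1 - stripe) % num_drives


def stripe_layout(stripe, num_drives):
    """Return labels for each drive in a stripe: 'P' for parity, 'Dn' for data."""
    parity_index = parity_drive_for_stripe(stripe, num_drives)
    layout = []
    block = stripe * (num_drives - 1)  # first data block number in this stripe
    for drive in range(num_drives):
        if drive == parity_index:
            layout.append("P")
        else:
            layout.append(f"D{block}")
            block += 1
    return layout


# Print the layout of the first four stripes on a four-drive array.
for s in range(4):
    print(s, stripe_layout(s, 4))
```

Under this assumed rotation every drive holds parity for some stripes and data for others, so no single drive carries all of the parity traffic.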
The types of errors from which traditionally implemented RAID systems may recover include only those which the RAID controller detects. One common error detectable by the controller is a media error. In certain systems developed and sold by International Business Machines (IBM®), another controller-detectable error is one revealed by block LRCs appended to each sector. (“LRC” refers to a longitudinal redundancy check word attached to a block of data and used to ensure that the block is delivered error-free.)
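A per-sector LRC check of the kind described above can be sketched as follows. The sketch assumes a one-byte check word formed by XORing every payload byte and appended to the end of the sector; the actual width and computation of the LRC in any given product are not specified here, and the function names are hypothetical.

```python
def lrc(payload: bytes) -> int:
    """Compute a one-byte longitudinal redundancy check (assumed: XOR of all bytes)."""
    check = 0
    for byte in payload:
        check ^= byte
    return check


def append_lrc(payload: bytes) -> bytes:
    """Return the payload with its LRC byte appended, as stored in a sector."""
    return payload + bytes([lrc(payload)])


def verify(sector: bytes) -> bool:
    """Recompute the LRC over the payload and compare it with the stored byte."""
    if len(sector) < 1:
        return False
    return lrc(sector[:-1]) == sector[-1]


sector = append_lrc(b"customer data")

# Flip one bit in the first payload byte to simulate corruption.
corrupted = bytes([sector[0] ^ 0x01]) + sector[1:]
```

With this assumed check word, `verify(sector)` succeeds and `verify(corrupted)` fails, since any single-bit change in the payload alters the XOR. Some multi-bit errors cancel out and pass unnoticed, which is one reason such a check cannot catch every error.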
However, other errors may not be detectable by a RAID controller. For example, when the LRCs are generated across multiple sectors, the RAID controller may not be able to detect certain errors. The controller may also not be able to detect errors in sequence numbers embedded in the data. Another example of an error which may not be detectable by the RAID controller can occur when data is not actually written to one of the drives but the RAID controller, not detecting the failure, directs that the correct parity be written.
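The last example above, in which a write never reaches a drive while the parity for the new data is written, can be modeled with a toy simulation. Everything here is a simplifying assumption made for illustration: the `Drive` class, its `drops_writes` flag, the block contents, and the use of byte-wise XOR parity are hypothetical and do not describe any particular controller.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


class Drive:
    """Toy drive holding one block; it may silently drop writes."""

    def __init__(self, block, drops_writes=False):
        self.block = block
        self.drops_writes = drops_writes

    def write(self, block):
        # A faulty drive keeps its old contents but still reports success.
        if not self.drops_writes:
            self.block = block
        return True

    def read(self):
        # Reads never raise an error in this model, even for stale data.
        return self.block


old = [b"AAAA", b"BBBB", b"CCCC"]
data = [Drive(old[0]), Drive(old[1], drops_writes=True), Drive(old[2])]
parity = Drive(xor_blocks(old))

# The controller updates block 1 and writes parity computed for the new data.
new_block = b"bbbb"
write_reported_ok = data[1].write(new_block)
parity.write(xor_blocks([old[0], new_block, old[2]]))

# A later read of drive 1 returns the stale block with no error indication.
returned = data[1].read()
stripe_consistent = xor_blocks([d.read() for d in data]) == parity.read()
```

In this model the write is reported as successful and the subsequent read completes without a media error, so a controller that relies on drive-reported errors has nothing to act on, even though `returned` is the old block and the stripe no longer agrees with its parity.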
While the host or client may be able to detect some errors which the RAID controller does not, there is currently no procedure available for recovering from such errors. Thus, a need exists to permit recovery from data errors which are not detectable by the RAID controller.