1. Field of the Invention
The present invention relates to a method, system, and article of manufacture for error checking addressable blocks in storage.
2. Description of the Related Art
In a Redundant Array of Independent Disks (RAID), a RAID controller stripes data for an addressable block, such as a logical block address (LBA), tracks, etc., to multiple disk drives, calculates checksum blocks for the data, and writes the checksum blocks to a separate disk. Data or checksum blocks written to each disk in a RAID rank are referred to as a stripe or stride, where a stripe comprises the consecutive sectors written to a single disk in the rank of storage devices across which data and checksum information are written. RAID schemes, such as RAID levels 1, 2, 3, 4, 5, 10 [0+1, 1+0], provide a single level of redundant protection and tolerate a single device failure before an additional failure exposes them to data loss. Single error correction codes, such as those used in RAID 3, RAID 4 and RAID 5, provide the capability to correct for an erasure when the location of the data error can be pinpointed by some independent means. For hard disk drives, the error may be pinpointed and corrected because the disk does not respond, or because other checkers (checksums, CRCs, LRCs, etc.) on the disk make it easy to locate the source of the data error independently of the RAID checksum. RAID 6 provides an additional checksum block, or RAID checksum code, that can be used to pinpoint the location of and correct for a single symbol error or multiple failures, such as double disk failures. RAID 6 may utilize Reed-Solomon (R-S) codes comprised of symbols calculated from polynomials.
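By way of illustration only, the erasure correction described above may be sketched for a single XOR parity block of the kind commonly used in RAID 5. The function names `xor_blocks` and `rebuild_missing`, the three-block layout, and the sample byte values are assumptions introduced for this sketch and are not taken from any particular controller implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a sequence of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity_block):
    """Rebuild the one missing data block from the survivors and the parity.

    This works only when the location of the lost block is already known
    (an erasure), for example because the drive did not respond.
    """
    return xor_blocks(list(surviving_blocks) + [parity_block])


# Hypothetical rank of three data drives plus one parity drive.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# The drive holding data[1] fails; its block is rebuilt from the others.
recovered = rebuild_missing([data[0], data[2]], parity)
```

In this sketch the rebuild succeeds only because the caller already knows which block is missing. A single parity block cannot, by itself, indicate which block in the stripe is wrong.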
Online RAID array repair, generally known as hot sparing, restores RAID array redundancy following a failure of a storage device. During the online array repair, the RAID array is in a rebuilding state and remains susceptible to additional failures that can result in unrecoverable data loss. Recent increases in the storage capacity of storage devices have increased the statistical probability that data in a single storage array may experience data loss events (either from media errors or device failures).
RAID storage algorithms may operate at a controller level and are dependent on the correct operation of the storage devices to properly store the data written in the correct location of the media. Storage devices have been observed to improperly report successful completion of a write operation. For instance, the storage device read/write head mechanism may not write the data for a data block in a stripe, resulting in a “dropped write”. Additionally, data may be written to a wrong location on the storage media, resulting in an “off track write”. These errors create data integrity issues (data loss) that may result in incorrect data being returned to the requestor and in corruption of the checksum protection data, which can prevent the successful recovery of lost data.
To limit the exposure related to these types of errors, RAID controller error checking operations may run as background tasks to verify that the data and checksum blocks in the stripes are consistent within a data increment within an array. In the case of mirrored RAID schemes, the two copies of the data are read and compared to verify consistency. The earlier this condition is detected, the better the error can be isolated and the propagation of the data integrity problem limited.
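By way of illustration only, a background consistency check of the kind described above may be sketched as follows, assuming a single XOR parity block per data increment. The names `scrub_parity`, `scrub_mirror`, and `scrub_array`, as well as the sample byte values, are hypothetical and are introduced solely for this sketch. The second data increment simulates a dropped write in which the parity was updated but the drive silently kept the old data block.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a sequence of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def scrub_parity(data_blocks, parity_block):
    """Return True when the stored parity matches the parity of the data."""
    return xor_blocks(data_blocks) == parity_block


def scrub_mirror(copy_a, copy_b):
    """Return True when the two mirrored copies are identical."""
    return copy_a == copy_b


def scrub_array(increments):
    """Return the indices of data increments that fail the parity check."""
    return [
        index
        for index, (data_blocks, parity_block) in enumerate(increments)
        if not scrub_parity(data_blocks, parity_block)
    ]


# Increment 0: data and parity are consistent.
good_data = [b"\x01\x01", b"\x02\x02"]
good = (good_data, xor_blocks(good_data))

# Increment 1: the controller intended to replace the first block and updated
# the parity accordingly, but the drive dropped the write and kept old data.
old_block, new_block, other_block = b"\x10\x10", b"\x11\x11", b"\x20\x20"
dropped = ([old_block, other_block], xor_blocks([new_block, other_block]))

inconsistent = scrub_array([good, dropped])
```

In this sketch the check reports that the second data increment is inconsistent, but it does not identify which block within that increment is in error.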
There is a need in the art for improved techniques for error correction in storage arrays.