This invention relates to mechanisms for ensuring data consistency in a data store. More specifically, the illustrative embodiments provide mechanisms for maintaining RAID consistency and performance through rate-adaptive parity checking.
RAID (Redundant Array of Independent Disks) is a storage technology that divides and replicates data across two or more hard disk drives. RAID systems, in general, provide increased data reliability and increased input/output performance. Several physical disks configured to use RAID technology are said to form a RAID array, which distributes data across the disks.
Some arrays are “redundant” in that extra data derived from the original data is organized across the array so that the failure of one (or sometimes more) disk in the array will not result in any loss of data. In this scenario, a failed disk is replaced by a new one, and the data on the new disk can be reconstructed from the data on the other disks together with the extra data. RAID arrays maintain access to data in the presence of disk errors either by means of redundant copies of data (for example, in RAID levels 1 and 10) or by means of parity data computed from the data stored on several disks (for example, in RAID levels 5 and 6). It is desirable that these views of the data be kept consistent. However, errors in the RAID software or drive firmware may cause these views of the data to become inconsistent. During normal operation, parity is not used until it is needed; thus, without proactively seeking these latent inconsistencies, they will not be found until it is too late and data has been lost.
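By way of illustration only (not part of any embodiment described herein), the parity used in RAID levels such as RAID 5 may be computed as the bitwise XOR of the data blocks in a stripe, so that any single lost block can be reconstructed from the surviving blocks and the parity. The following minimal sketch assumes a hypothetical stripe of three equal-length data blocks:

```python
# Illustrative sketch of RAID-5-style XOR parity (hypothetical example).

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

# Data blocks held on three disks; the parity block is stored on a fourth.
d0, d1, d2 = b"\x10\x20", b"\x0f\x0f", b"\x01\x02"
parity = xor_blocks([d0, d1, d2])

# If the disk holding d1 fails, its contents are reconstructed by
# XOR-ing the surviving data blocks with the parity block.
recovered = xor_blocks([d0, d2, parity])
assert recovered == d1
```

Because XOR is its own inverse, the same operation serves both to compute the parity and to rebuild a missing block; this is why an undetected inconsistency between data and parity silently corrupts any later reconstruction.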
Currently there are mechanisms to detect latent medium errors on disks. For example, in existing IBM products, background data scrubbing is performed in which SCSI verify-read operations are sent to the disks over a period of several days, causing the disks to verify that the data on the disk platter is still readable. This does not, however, detect whether the data is consistent with its redundant copies, or whether the parity is consistent with the data.
There are also mechanisms to scan an array to compare the data with its redundant copies or parity. For example, in Linux software RAID, such an operation can be scheduled to run when the array is expected to be under low utilization, e.g., in the middle of the night. However, this causes a degradation of performance while the check is underway, similar to that caused by an array rebuild operation.
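Such a scan can be sketched, for illustration only, as a pass over every stripe of the array that recomputes the parity from the data blocks and compares it with the stored parity block. The stripe layout and function names below are hypothetical and assume the XOR parity scheme common to RAID 5:

```python
# Hypothetical sketch of a parity-scrub pass: for each stripe, recompute
# parity from the data blocks and compare it with the stored parity block.

def scrub_stripe(data_blocks, stored_parity):
    """Return True if the stored parity matches the recomputed parity."""
    computed = bytearray(len(stored_parity))
    for block in data_blocks:
        for i, b in enumerate(block):
            computed[i] ^= b
    return bytes(computed) == stored_parity

def scrub_array(stripes):
    """Yield indices of stripes whose parity is inconsistent."""
    for idx, (data_blocks, parity) in enumerate(stripes):
        if not scrub_stripe(data_blocks, parity):
            yield idx

stripes = [
    ([b"\x01", b"\x02"], b"\x03"),  # consistent: 0x01 ^ 0x02 == 0x03
    ([b"\x01", b"\x02"], b"\x07"),  # a latent inconsistency
]
inconsistent = list(scrub_array(stripes))  # -> [1]
```

Every stripe visited by such a scan must be read in full, which is the source of the performance degradation noted above: the scrub competes with host input/output for disk bandwidth for the duration of the check.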
Improvements in ensuring data veracity have been proposed. For example, “Background Patrol Read”, Dell Power Solutions, February 2006, pages 73 to 75, discloses a system designed to proactively detect hard-drive media defects while an array is online and redundant, and then to recover the data; the function concerns data reconstruction and remapping. Background Patrol Read issues commands to each drive in the array to test all sectors. When a bad sector is found, the controller instructs the hard drive to reassign the bad sector and then reconstructs the data using the other drives. The affected hard drive then writes the data to the newly assigned good sector. These operations continue until all sectors of each configured drive, including hot spares, have been checked. As a result, bad sectors can be remapped before data loss occurs. The problem with systems such as these lies in the bandwidth consumed by such background tasks.