RAID (Redundant Array of Independent/Inexpensive Disks) is an organization of data on a plurality of disks to achieve varying levels of availability and performance. Performance is typically evaluated by balancing the three basic elements of I/O workloads, namely request rate, data rate and read/write ratio. The request rate is the number of I/O requests per second the system workload generates. Data rate is the amount of user data that can be transferred per second by the I/O subsystem. The read/write ratio is the ratio of read requests to write requests. One performance-enhancing feature of RAID is "striping," which spreads user data across the disks in the array. Each disk in the RAID array is referred to as a member of the array. Furthermore, while disks are referred to throughout, any equivalent storage media could be used as would be apparent to one of ordinary skill in the field. The user data is broken down into segments referred to as "chunks." A chunk is a group of consecutively numbered blocks that are placed consecutively on a single disk before placing the next blocks on a different disk. A block is the smallest unit of data that can be read from or written to a disk. Thus, a chunk is the unit of data interleaving for a RAID array. For example, in a four-member RAID array the first chunk is placed on the first disk, the second chunk is placed on the second disk, the third chunk is placed on the third disk, the fourth chunk is placed on the fourth disk, the fifth chunk is placed on the first disk and so on. This spreading of data increases performance through load balancing. In a standard data storage system, if all the frequently accessed files, referred to as hot files, are on one disk, access to that one disk creates a bottleneck. RAID striping naturally spreads data across multiple disks and reduces the contention caused by hot files being located on a single disk.
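The chunk placement described above can be illustrated with a minimal sketch. It assumes pure striping with no parity, zero-based numbering of blocks and members, and a hypothetical chunk size of eight blocks; the function name `locate_block` is introduced here for illustration only.

```python
def locate_block(block: int, blocks_per_chunk: int, members: int) -> tuple[int, int]:
    """Map a virtual block number to (member index, block offset on that member).

    Assumes plain striping: chunks are dealt to members in round-robin order.
    """
    chunk = block // blocks_per_chunk          # which chunk holds this block
    member = chunk % members                   # chunks rotate across the members
    stripe = chunk // members                  # how many full rotations came before
    offset = stripe * blocks_per_chunk + block % blocks_per_chunk
    return member, offset


# Members holding the first five chunks of a four-member array
# (hypothetical chunk size: 8 blocks). The fifth chunk wraps to member 0.
chunk_members = [locate_block(c * 8, 8, 4)[0] for c in range(5)]
```

With these assumptions the first five chunks land on members 0, 1, 2, 3 and 0, matching the four-disk example in the text.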
RAID enhances availability of data through data redundancy. In RAID, data redundancy is achieved by "shadowing" or "parity." Shadowing is simply having a duplicate for each disk which contains exactly the same data. Parity involves the use of error correction codes (ECC) such as Exclusive-OR or Reed-Solomon. Parity data is stored in the RAID array and is used to reconstruct the data if a disk fails or a data block otherwise becomes unavailable.
As is well known, there are several levels of RAID, each of which has different characteristics that affect performance and availability. RAID storage systems can be implemented in hardware or software. In the hardware implementation the RAID algorithms are built into a controller that connects to the computer I/O bus. In the software implementation the RAID algorithms are incorporated into software that runs on the main processor in conjunction with the operating system. In addition, the software implementation can be effected through software running on well-known RAID controllers. Both the hardware and software implementations of RAID are well known to those of ordinary skill in the field.
RAID level 4 (RAID-4) and RAID level 5 (RAID-5) are organizations of data for an array of n+1 disks that provide enhanced performance through the use of striping and enhanced data availability through the use of parity. A parity block is associated with every n data blocks. The data and parity information is distributed over the n+1 disks so that if any single disk fails, all of the data can be recovered. RAID-4 is a level of organization of data for a RAID array where data blocks are organized into chunks which are interleaved among the disks and protected by parity, and all of the parity is written on a single disk. RAID-5 is a level of organization of data for a RAID array where data blocks are organized into chunks which are interleaved among the disks and protected by parity, and the parity is distributed over all of the disks in the array. In both RAID-4 and RAID-5 the ensemble or array of n+1 disks appears to the user as a single, more highly available virtual disk.
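The difference in parity placement between the two levels can be sketched as follows. The text does not specify which disk holds RAID-4 parity or how RAID-5 rotates parity, so both choices below are assumptions made for illustration: the last member is taken as the dedicated parity disk, and the RAID-5 parity position moves back by one member per stripe.

```python
def parity_member_raid4(stripe: int, n: int) -> int:
    """RAID-4: every stripe keeps its parity on one dedicated member.

    Assumption: the dedicated parity disk is the last member (index n).
    """
    return n


def parity_member_raid5(stripe: int, n: int) -> int:
    """RAID-5: the parity position rotates over all n+1 members.

    Assumption: a simple rotation that steps back one member per stripe;
    real arrays may use a different rotation.
    """
    return (n - stripe) % (n + 1)


# Parity locations for the first five stripes of a 3+1 array.
raid4_layout = [parity_member_raid4(s, 3) for s in range(5)]
raid5_layout = [parity_member_raid5(s, 3) for s in range(5)]
```

Under these assumptions RAID-4 always reports member 3, while RAID-5 visits every member in turn before repeating.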
The content of each bit of the parity block is the Exclusive-OR of the corresponding bit in each of the n corresponding data blocks. In the event of the failure of a single disk in the array, the data from a given data block on the failed disk is regenerated by calculating the Exclusive-OR of the contents of the corresponding parity block and the n-1 data blocks remaining on the surviving disks that contributed to that parity block. The same procedure is followed if a single block or group of blocks is unavailable or unreadable. A block or set of blocks is repaired by writing the regenerated data. The regeneration and repair of data for a data block or set of data blocks on a disk in a RAID array is referred to as reconstruction.
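The parity calculation and the regeneration step can be shown in a short sketch. It assumes n = 3 data blocks of two bytes each, with arbitrary example values; the helper name `xor_blocks` is hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise Exclusive-OR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks (n = 3) and the parity block computed from them.
data = [b"\x0f\xf0", b"\x33\x55", b"\xaa\x01"]
parity = xor_blocks(data)

# Suppose the disk holding data[1] fails. Its block is regenerated from the
# parity block and the n-1 surviving data blocks.
regenerated = xor_blocks([data[0], data[2], parity])
```

Because Exclusive-OR is its own inverse, `regenerated` equals the lost block `data[1]`; writing it back to a replacement block completes the reconstruction.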
In a RAID array organized at RAID-4 or RAID-5, when a write operation is performed, at least two disks in the array must be updated. The disk containing the parity for the data block being updated must be changed to correspond to the new data and the disk containing the data block that is being updated must be written. These two write operations can occur in any sequence or order. Thus, at least two write operations are required to implement a single write operation to the virtual disk.
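The two-write update can be sketched as below. The sketch assumes the common read-modify-write approach, in which the new parity is the old parity XOR the old data XOR the new data; the text does not say how the new parity is obtained, so this is one possibility. Disks are modeled as in-memory lists of blocks, and the names `xor2` and `small_write` are hypothetical.

```python
def xor2(a: bytes, b: bytes) -> bytes:
    """Bytewise Exclusive-OR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write(disks, data_member, parity_member, offset, new_data):
    """Update one data block and its parity block (assumed read-modify-write)."""
    old_data = disks[data_member][offset]
    old_parity = disks[parity_member][offset]
    # Remove the old data's contribution from the parity, then add the new one.
    new_parity = xor2(xor2(old_parity, old_data), new_data)
    disks[data_member][offset] = new_data        # first physical write
    disks[parity_member][offset] = new_parity    # second physical write


# A 3+1 array with one block per member; member 3 holds parity (0x01^0x02^0x04).
disks = [[b"\x01"], [b"\x02"], [b"\x04"], [b"\x07"]]
small_write(disks, data_member=1, parity_member=3, offset=0, new_data=b"\x0a")
```

After the call the parity block again equals the Exclusive-OR of the three data blocks, but only because both physical writes completed.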
The typical disk storage system does not have any means to ensure that a pair of write operations to two separate disks either both happen or neither happens. Thus, there is a failure mode in which one, but not both, of a pair of write operations happens. Such a failure could occur in any number of ways, for example, when the controller implementing the array function fails. In the event of such a failure the write operation is not successful and there is an inconsistency between the data blocks and the corresponding parity block. If a subsequent failure occurs that renders a different one of the disks in the array unavailable, the RAID-4 or RAID-5 algorithms attempt to regenerate the data on the now unavailable disk by computing the Exclusive-OR of the data and parity on the remaining disks. But due to the prior failure occurring during the pair of write operations, the data and parity information being used to regenerate the data on the unavailable disk do not correspond, and the regenerated data will not be the data that was stored on the unavailable disk. The same procedure is followed if the subsequent failure involves a single data block or group of data blocks on a different one of the disks in the array that is unreadable. In either event, the regenerated data is written at the unavailable data block and sent to the requesting user application or client. Thus, undetected data corruption has occurred.
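The failure sequence described above can be reproduced in a small sketch. It assumes a single stripe of three one-byte data blocks with arbitrary values, and models the interrupted write by updating the data block without updating the parity block; `xor_blocks` is a hypothetical helper.

```python
def xor_blocks(blocks):
    """Return the bytewise Exclusive-OR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# A consistent stripe: three data blocks and their parity (0x07).
stripe = [b"\x01", b"\x02", b"\x04"]
parity = xor_blocks(stripe)

# First failure: the data write to member 1 happens, but the matching parity
# write does not (for example, the controller fails between the two writes).
stripe[1] = b"\x0a"

# Second failure: member 0 becomes unavailable and is regenerated from the
# surviving data blocks and the stale parity block.
original = b"\x01"
regenerated = xor_blocks([stripe[1], stripe[2], parity])
corrupted = regenerated != original
```

Here the regenerated block is 0x09 rather than the 0x01 originally stored on member 0, and nothing in the Exclusive-OR computation itself signals that the result is wrong.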
One known method of reducing the problem of undetected corrupt data as described above is to execute a "scrubber" operation following the failure during the pair of write operations and before any other disk fails in order to render all of the parity blocks consistent with the associated data blocks. The problem with the use of a "scrubber" operation is that the data remains vulnerable to corruption until the scrubber has completed its function. Furthermore, the scrubbing function is a resource intensive task that requires reading the equivalent of the entire contents of n disks and writing the equivalent of the entire contents of one disk. Thus, it is desirable to identify an inconsistency between parity and data to prevent the use of the inconsistent parity in the subsequent regeneration of unavailable data and to send an error signal to the client or user application.
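The cost of the scrubber operation can be seen in a minimal sketch. It assumes the array is modeled as a list of stripes, each a dictionary with hypothetical keys `"data"` and `"parity"`, and that the scrubber rewrites every parity block unconditionally, in line with the read and write volumes stated above.

```python
def scrub(stripes):
    """Recompute and rewrite the parity of every stripe.

    Returns (block reads, block writes) to show the I/O cost: n reads and
    one write per stripe.
    """
    reads = writes = 0
    for stripe in stripes:
        parity = bytearray(len(stripe["parity"]))
        for block in stripe["data"]:
            reads += 1                      # every data block must be read
            for i, byte in enumerate(block):
                parity[i] ^= byte
        stripe["parity"] = bytes(parity)    # parity is rewritten for every stripe
        writes += 1
    return reads, writes


# Two stripes of a 3+1 array; the first has stale (inconsistent) parity.
stripes = [
    {"data": [b"\x01", b"\x02", b"\x04"], "parity": b"\x00"},
    {"data": [b"\x10", b"\x20", b"\x40"], "parity": b"\x70"},
]
io_counts = scrub(stripes)
```

Until the loop reaches a given stripe, that stripe's parity may still be inconsistent, which is the window of vulnerability noted in the text. The sketch also makes parity consistent with whatever data is present; it cannot tell whether the data or the parity was the stale half of an interrupted write.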