For many years computer systems have included disk controllers capable of striping data across a group, or array, of multiple physical disks such that the controller presents a single logical disk to the computer operating system. To illustrate the notion of striping, assume a striped array of four physical disks, each having a capacity of 100 GB, with an array stripe size, or block size, of eight 512-byte sectors, or 4 KB. In this example, the controller stores the first, fifth, ninth, etc. 4 KB block of the logical disk on the first, second, third, etc. group of eight sectors, respectively, on the first physical disk; the controller stores the second, sixth, tenth, etc. 4 KB block of the logical disk on the first, second, third, etc. group of eight sectors, respectively, on the second physical disk; the controller stores the third, seventh, eleventh, etc. 4 KB block of the logical disk on the first, second, third, etc. group of eight sectors, respectively, on the third physical disk; and the controller stores the fourth, eighth, twelfth, etc. 4 KB block of the logical disk on the first, second, third, etc. group of eight sectors, respectively, on the fourth physical disk.
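The block placement in this example reduces to a modulo and an integer division. The following is a minimal sketch in Python, assuming 0-based indices internally; the function name locate_block and the constant SECTORS_PER_BLOCK are illustrative and do not come from any particular controller.

```python
from typing import Tuple

SECTORS_PER_BLOCK = 8  # eight sectors per 4 KB block, as in the example


def locate_block(logical_block: int, num_disks: int = 4) -> Tuple[int, int]:
    """Map a 0-based logical block number to (disk index, sector-group index), both 0-based."""
    disk = logical_block % num_disks    # blocks are dealt out to the disks in turn
    group = logical_block // num_disks  # each full pass over the disks fills one group per disk
    return disk, group


# The text counts blocks, disks, and groups from one, so convert when printing.
for n in (1, 5, 9, 2, 12):
    disk, group = locate_block(n - 1)
    first_sector = group * SECTORS_PER_BLOCK
    print(f"logical block {n}: disk {disk + 1}, group {group + 1}, first sector {first_sector}")
```

For instance, the twelfth logical block lands on the fourth disk in its third group of eight sectors, which agrees with the enumeration above.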
One advantage of striping is the ability to provide a logical disk with a larger storage capacity than is possible with the largest capacity individual physical disk. In the example above, the result is a logical disk having a storage capacity of 400 GB.
Perhaps a more important advantage of striping is the improved performance that a striped array can provide. In a random I/O environment, such as a multi-user fileserver, database, or transaction processing server, the performance improvement is mainly achieved by selecting a stripe size that causes a typical read I/O request from the server to statistically require only one disk in the array to be accessed. Consequently, the disks in the array may seek concurrently to different cylinders, each disk satisfying a different I/O request, thereby taking advantage of the multiple spindles of the array. In a throughput-intensive environment, such as a video-on-demand server, the performance improvement is mainly achieved by selecting a stripe size that causes a typical read I/O request to span all the disks in the array so that the controller can read the disks in parallel and keep them all seeking to the same cylinder. In this environment, the spindles of the various disks in the array are often synchronized.
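The effect of stripe size on how many disks a single request involves can be sketched as follows. This is an illustration only: the function disks_touched, the request sizes, and the stripe sizes are assumed values chosen for the example, not figures from any particular system.

```python
def disks_touched(offset_bytes: int, length_bytes: int, stripe_bytes: int, num_disks: int) -> int:
    """Count the distinct disks that a contiguous request on the logical disk must access."""
    first_block = offset_bytes // stripe_bytes
    last_block = (offset_bytes + length_bytes - 1) // stripe_bytes
    blocks = last_block - first_block + 1
    # Consecutive blocks fall on consecutive disks, so the count saturates at num_disks.
    return min(blocks, num_disks)


KB = 1024

# Hypothetical random I/O case: small request, large stripe.
small_request = disks_touched(0, 8 * KB, 64 * KB, 4)

# Hypothetical streaming case: large request spanning every disk.
large_request = disks_touched(0, 256 * KB, 64 * KB, 4)

print(small_request, large_request)
```

Under these assumptions the 8 KB request touches one disk, leaving the other three free to serve other requests, while the 256 KB request touches all four. Note that a small request that is not aligned to a stripe boundary may still straddle two disks.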
However, a problem with striped arrays of disks is that the reliability of the array taken as a whole is lower than the reliability of each of the single disks separately. This is because if the data stored on one disk becomes unavailable due to a failure of the disk, then from the computer's perspective all the data of the logical disk is unavailable, since it is unacceptable for the controller to return only part of the data. The reliability of disks is commonly measured in mean time between failures (MTBF). As the number of disks in a striped array increases, the MTBF of the array decreases, perhaps to a level that is unacceptable in many applications.
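A common first-order estimate, which assumes that the disks fail independently and at a constant rate, is that the MTBF of a striped array is the MTBF of a single disk divided by the number of disks. The sketch below applies that estimate; the function name and the 500,000-hour disk rating are hypothetical and serve only to show the trend.

```python
def striped_array_mtbf(disk_mtbf_hours: float, num_disks: int) -> float:
    """Estimate the MTBF of a non-redundant striped array.

    Assumes independent disk failures at a constant rate, so the array
    fails as soon as any one of its disks fails.
    """
    if num_disks < 1:
        raise ValueError("an array needs at least one disk")
    return disk_mtbf_hours / num_disks


DISK_MTBF_HOURS = 500_000  # hypothetical rating, not taken from any data sheet

for n in (1, 2, 4, 8):
    print(f"{n} disk(s): about {striped_array_mtbf(DISK_MTBF_HOURS, n):,.0f} hours")
```

Under these assumptions, a four-disk striped array is expected to fail roughly four times as often as a single disk.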
To solve this problem, the notion of redundancy was introduced into arrays of disks. In a redundant array of disks, an additional, or redundant, disk is added to the array. The additional disk does not increase the storage capacity of the logical disk, but instead enables redundant data to be stored on one or more of the disks of the array such that even if one of the disks in the array fails, the controller can still provide the requested data of the logical disk to the computer. For this reason, when an array is in a redundant state, i.e., when none of the disks of the array has failed, the array is said to be fault tolerant because it can tolerate one disk failure and still provide the user data. The predominant forms of redundant data are mirrored data and parity data. In many cases, the MTBF of a redundant array of disks may be greater than the MTBF of a single, non-redundant, disk.
RAID is an acronym for Redundant Arrays of Inexpensive Disks, which was coined in 1987 by Patterson, Gibson, and Katz of the University of California, Berkeley in their seminal paper entitled “A Case for Redundant Arrays of Inexpensive Disks (RAID).” The late 1980s witnessed the proliferation of RAID systems, which have become the predominant form of mass storage for server-class computing environments. The original RAID paper defined five different forms of redundant arrays of disks, referred to as RAID levels 1 through 5. Others have been developed since then, and striped arrays have come to be referred to as RAID level 0. The various RAID levels and their relative performance and reliability characteristics are well-known in the art, but will be discussed here briefly for ease of understanding of the problems solved by the present invention.
RAID level 1 employs disk mirroring. A RAID 1 array consists of a pair of disks. Each time the computer issues a write to a RAID controller for a RAID 1 logical disk, the RAID controller writes the data to both of the disks in the pair in order to maintain mirrored copies of the data on the pair of disks. Each time the computer issues a read to the RAID controller for a RAID 1 logical disk, the RAID controller reads only one of the disks. If one disk in a RAID 1 array fails, data may be read from the remaining disk in the array. An extension of RAID 1 is RAID 10, which comprises an array of striped mirrored pairs of disks. RAID 10 provides the reliability benefits of RAID 1 and the performance benefits of RAID 0.
RAID level 4 employs striping with parity. A RAID 4 array requires at least three physical disks. Assume, for example, a four disk RAID 4 array with a stripe size of 4 KB. Three of the disks are data disks and the fourth disk is a parity disk. In this example, the controller stores the first, fourth, seventh, etc. 4 KB block of the logical disk on the first, second, third, etc. group of eight sectors, respectively, on the first data disk; the controller stores the second, fifth, eighth, etc. 4 KB block of the logical disk on the first, second, third, etc. group of eight sectors, respectively, on the second data disk; and the controller stores the third, sixth, ninth, etc. 4 KB block of the logical disk on the first, second, third, etc. group of eight sectors, respectively, on the third data disk. The controller stores the parity (binary XOR, or exclusive-OR) of the first 4 KB block of the three data disks onto the first 4 KB block of the parity disk, the binary XOR of the second 4 KB block of the three data disks onto the second 4 KB block of the parity disk, the binary XOR of the third 4 KB block of the three data disks onto the third 4 KB block of the parity disk, etc. Thus, any time the controller writes one or more of the data disks, the controller must calculate the parity of all the data in the corresponding blocks of all the data disks and write the parity to the corresponding block of the parity disk. When the controller reads data, it only reads from the data disks, not the parity disk.
If one of the data disks in the RAID 4 array fails, the data on the failed data disk can be recreated by reading from the remaining data disks and from the parity disk and binary XORing the data together. This is a property of binary XOR used to advantage in parity-redundant arrays of disks. This enables the RAID controller to return the user data to the computer even when a data disk has failed.
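The parity calculation and the recovery of a lost block can be illustrated with a few lines of Python. This is a minimal sketch for one stripe row of the four-disk RAID 4 example; the helper name xor_blocks is illustrative, and the blocks are shortened to four bytes rather than 4 KB so the values are easy to inspect.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe row: the corresponding block from each of the three data disks.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]

# The controller writes this block to the parity disk.
parity = xor_blocks(data)

# Suppose the second data disk fails: XOR the surviving data blocks with the parity block.
survivors = [data[0], data[2], parity]
recovered = xor_blocks(survivors)

assert recovered == data[1]
```

Recovery works because XORing a value with itself yields zero, so the contributions of the surviving data blocks cancel out of the parity and only the missing block remains.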
RAID level 5 is similar to RAID level 4, except that there is no dedicated parity disk. Instead, the parity disk is a different disk for each stripe in the array such that the parity is distributed across all disks. In particular, the parity disk is rotated for each stripe along the array. RAID level 5 improves write performance in a random I/O environment by eliminating the write bottleneck of the parity drive.
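The rotation of the parity disk can be sketched as follows. The direction and starting position of the rotation vary between implementations; this sketch assumes, purely for illustration, that parity starts on the last disk and moves one disk toward the first with each successive stripe. The function name raid5_layout is hypothetical.

```python
def raid5_layout(stripe: int, num_disks: int):
    """Return (parity disk, list of data disks) for a 0-based stripe number.

    Assumed rotation: parity on the last disk for stripe 0, then one disk
    earlier for each following stripe, wrapping around.
    """
    parity_disk = (num_disks - 1 - stripe) % num_disks
    data_disks = [d for d in range(num_disks) if d != parity_disk]
    return parity_disk, data_disks


# Print the role of each disk (P = parity, D = data) for the first five stripes.
for stripe in range(5):
    parity_disk, _ = raid5_layout(stripe, 4)
    row = ["P" if d == parity_disk else "D" for d in range(4)]
    print(f"stripe {stripe}: {' '.join(row)}")
```

Because every disk holds parity for only one stripe in four in this layout, parity writes for independent stripes are spread across all of the disks instead of queuing on a single parity disk.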
As may be observed from the foregoing, when a disk in a redundant array fails, the array is no longer fault-tolerant, i.e., it cannot tolerate a failure of a second disk. An exception to this rule is a RAID level that provides multiple redundancy, such as RAID level 6, which is similar to RAID 5, but provides two-dimensional parity such that a RAID 6 array can tolerate two disk failures and continue to provide user data. That is, a RAID 6 array having one failed disk is still fault-tolerant, although not fully redundant. Once two disks in a RAID 6 array have failed, the array is no longer fault-tolerant.
In order to restore a redundant array of disks from a non-fault-tolerant (or non-fully redundant) state to its fault-tolerant (or fully redundant) state, the array must be reconstructed. In particular, the data on the failed disk must be recreated and written to a new disk to be included in the array. For a parity-redundant array, recreating the data of the failed disk comprises reading the data from the remaining disks and binary-XORing the data together. For a mirrored-redundant array, recreating the data of the failed disk comprises simply reading the data from the failed disk's mirror disk. Once the RAID controller recreates the data, writes it to the new disk, and logically replaces the failed disk with the new disk in the array, the array is restored to fault-tolerance (or full redundancy), i.e., is reconstructed.
When a disk failure occurs, most RAID controllers notify a system administrator in some manner so that the administrator can reconstruct the redundant array. This may require the administrator to physically swap out the failed disk with a new disk and instruct the RAID controller to perform the reconstruct. Some RAID controllers attempt to reduce the amount of time a redundant array of disks is non-fault-tolerant (or not fully redundant) by automatically performing a reconstruct of the array in response to a disk failure. Typically, when the administrator initially configures the redundant arrays of the system, the administrator configures one or more spare disks connected to the RAID controller that the RAID controller can automatically use as the new disk for an array in the event of a disk failure.
Other RAID controllers have attempted to anticipate that a disk in an array will fail by detecting non-fatal errors generated by the disk, i.e., errors that do not cause a disk failure. These RAID controllers notify the system administrator that a disk is generating errors so that the administrator can initiate a reconstruct of the array. However, because the reconstruct removes the error-generating disk from the array to perform the reconstruct, the array is non-fault-tolerant (or not fully redundant) during the reconstruct period, which might be fatal in the event of the failure of another disk of the array during the reconstruct period.
Therefore, what is needed is a RAID controller that can take action to prevent an array having a disk that is anticipated to fail from entering a non-fault-tolerant state by performing a reconstruct of the array, but in a manner that enables the array to remain fault-tolerant (or fully redundant) during the reconstruct period.