RAID is an acronym for Redundant Array of Independent Disks, and is a system for storing data on multiple disks in which redundancy of data storage between the disks ensures recovery of the data in the event of failure. This is achieved by combining multiple disk drive components into a logical unit, where data is distributed across the drives in one of several ways called RAID levels.
RAID is now used as an umbrella term for computer data storage schemes that can divide and replicate data among multiple physical disk drives. The terms disks and drives will be used interchangeably henceforth. The physical disks are said to be in a RAID array, which is accessed by the operating system as one single disk. The different schemes or architectures are named by the word RAID followed by a number (e.g., RAID-0, RAID-1). Each scheme provides a different balance between two key goals: increasing data reliability and increasing input/output performance.
The most basic form of RAID, a building block for the other levels but not used for data protection, is RAID-0, which has high performance but no redundancy. The data is spread evenly between N disks. RAID-0 gives maximum performance since data retrieval is carried out on all N disks in parallel. However, each data item is stored exactly once, so a disk failure always loses some data.
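As a minimal sketch of how striping spreads data evenly between N disks, the following maps a logical block number to a disk and an offset on that disk. The round-robin layout and the name `raid0_locate` are illustrative assumptions; no particular layout is prescribed here.

```python
from typing import List, Tuple


def raid0_locate(logical_block: int, n_disks: int) -> Tuple[int, int]:
    """Return (disk index, block offset on that disk) under round-robin striping."""
    if n_disks < 1 or logical_block < 0:
        raise ValueError("need n_disks >= 1 and logical_block >= 0")
    return logical_block % n_disks, logical_block // n_disks


# Consecutive logical blocks land on different disks, so they can be read in parallel.
layout: List[Tuple[int, int]] = [raid0_locate(b, 4) for b in range(8)]
```

With four disks, the first four logical blocks fall on four different disks, which is what allows parallel retrieval.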
RAID-1 requires mirroring of all the data. Capacity drops by 50% since all data is stored twice, but excellent performance is still achieved since the data is still spread between disks in the same way, allowing for parallel reads. RAID-1 can tolerate the failure of one disk in each mirrored pair, at the price of half of the capacity. Multiple disk failures can therefore be tolerated, but only if no mirrored pair loses both of its disks.
In greater detail, RAID-1 is mirroring. Mirroring comprises writing each block of data to two disks, D0 and D1, and reconstructing a disk by copying its mirror disk upon failure. This method requires performing two disk writes per user write, and consumes an overhead of 100% in capacity. Its rebuild requires performing reads and writes in proportion to the size of the failed disk, without additional computation penalties. Additionally, reading data which resided on the failed disk while in degraded mode requires a single disk read, just as under a normal system operation.
In general, RAID-1 protects from single disk failure. It may protect from more than one failure if no two failed disks are part of the same pair, known as a “RAID group”. RAID-1 may also be implemented in “n-way mirroring” mode to protect against any n−1 disk failures. An example is RAID 1.3, which introduced three-way mirroring, so that any two disks could fail and all the data could still be recovered. The cost, however, is that disk utilization is only 33%.
It thus became apparent that a system was needed which could recover all data after the failure of any disk at a more reasonable overhead, and as a result RAID-4 was developed.
RAID-4 uses a parity bit to allow data recovery following failure of a bit. In RAID-4, data is written over a series of N disks and then a parity bit is set on the (N+1)th disk. Thus if N is 9, then data is written to 9 disks, and on the tenth, a parity of the nine bits is written. If one disk fails, the parity allows for recovery of the lost bit. The failure problem is solved without any major loss of capacity: the utilization rate is 90%. However, the tenth disk has to be updated with every change of any bit on any of the nine disks, which causes a system bottleneck.
In greater detail, a RAID-4 group contains k data disks and a single parity disk. Each block i in the parity disk P contains the XOR of the blocks at location i in each of the data disks. Reconstructing a failed disk is done by computing the parity of the remaining k disks. The capacity overhead is 1/k. This method contains two types of user writes—full stripe writes known as “encode” and partial stripe modifications known as “update”. When encoding a full stripe, an additional disk write must be performed for every k user writes, and k−1 XORs must be performed to calculate the parity. When modifying a single block in the stripe, two disk reads and two disk writes must be performed, as well as two XORs to compute the new parity value. The rebuild of a failed block requires reading k blocks, performing k−1 XORs, and writing the computed value. Reading data which resided on the failed disk while in degraded mode also requires k disk reads and k−1 XOR computations. RAID-4, like RAID-1, protects from a single disk failure.
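The encode, update and rebuild operations just described can be sketched with byte-wise XOR. This is a minimal illustration only: blocks are modeled as equal-length `bytes` objects, and the helper names (`xor_blocks`, `encode_parity`, `update_parity`, `rebuild_block`) are hypothetical.

```python
from functools import reduce
from typing import List


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(data_blocks: List[bytes]) -> bytes:
    """Full-stripe write: parity is the XOR of the k data blocks (k-1 XORs)."""
    return reduce(xor_blocks, data_blocks)


def update_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """Single-block update: uses the old data and old parity, and two XORs."""
    delta = xor_blocks(old_data, new_data)
    return xor_blocks(old_parity, delta)


def rebuild_block(surviving_blocks: List[bytes]) -> bytes:
    """Rebuild a lost block from the k surviving blocks (data and parity)."""
    return reduce(xor_blocks, surviving_blocks)


stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]   # k = 3 data blocks
parity = encode_parity(stripe)

# Overwrite block 1 and update the parity without reading the other data blocks.
new_block = b"\xaa\xbb"
parity = update_parity(stripe[1], new_block, parity)
stripe[1] = new_block

# Simulate losing disk 0 and rebuilding it from the remaining data and parity.
recovered = rebuild_block([stripe[1], stripe[2], parity])
```

The update path touches only the modified data block and the parity block, which is why every small write lands on the single parity disk.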
RAID-5 solves the bottleneck problem of RAID-4 in that parity stripes are spread over all the disks. Thus, although some parity bit somewhere has to be changed with every single change in the data, the changes are spread over all the disks and no bottleneck develops.
However RAID-5 still only allows for a single disk failure.
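To illustrate how spreading the parity removes the dedicated parity disk, the sketch below rotates the parity position from stripe to stripe. The particular rotation used is an assumed layout chosen for illustration, and `raid5_parity_disk` is a hypothetical name; actual implementations differ.

```python
from collections import Counter


def raid5_parity_disk(stripe_index: int, n_disks: int) -> int:
    """Disk holding the parity block of a stripe, under an assumed rotating layout."""
    return (n_disks - 1 - stripe_index) % n_disks


# Over many stripes, every disk holds the same share of the parity blocks.
n_disks = 5
parity_load = Counter(raid5_parity_disk(s, n_disks) for s in range(100))
```

Because parity updates are shared equally among the disks, no single disk absorbs every parity write.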
In order to combine the multiple disk failure tolerance of RAID 1.3 with the high utilization rates of RAID-4 and RAID-5, and in addition to avoid system bottlenecks, RAID-6 was specified to use an N+2 parity scheme that allows failure of two disks. RAID-6 defines block-level striping with double distributed parity and provides fault tolerance of two drive failures, so that the array continues to operate with up to two failed drives, irrespective of which two drives fail. Larger RAID disk groups become more practical, especially for high-availability systems. This becomes increasingly important as large-capacity drives lengthen the time needed to recover from the failure of a single drive. Following loss of a drive, single-parity RAID levels are as vulnerable to data loss as a RAID-0 array until the failed drive is replaced and its data rebuilt, and the larger the drive, the longer the rebuild takes, which lengthens the vulnerability interval. The double parity provided by RAID-6 gives time to rebuild the array without the data being at risk if a single additional drive fails before the rebuild is complete.
Reference is now made to FIG. 1, which illustrates a general scheme for RAID-6. RAID-6 is similar to RAID-4 and RAID-5, and can be seen as an extension of these schemes. The main difference is that RAID-6 schemes can tolerate up to two disk failures. The implementation of RAID-6 is not well defined, and several coding schemes are known. RAID-6 is herein defined as any N+2 coding scheme which tolerates double disk failure, while user data is kept in the clear. This additional requirement assures that user reads are not affected by the RAID scheme under normal system operation. The different possible coding schemes vary in performance with respect to various parameters.
There are several main parameters used to measure such a RAID scheme. The first parameter is capacity overhead. The optimal scheme includes two redundancy disks (which may or may not be parity based) for every k data disks, thus reaching a capacity overhead of 2/k. It should be noted that, based on statistical considerations of double disk failure, under a RAID-6 scheme k can easily be set to be twice as large as under RAID-5, thus keeping the same capacity overhead ratio.
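The capacity overhead comparison can be checked with a few lines of exact arithmetic. The disk counts below (8 and 16 data disks) are illustrative values only, and `capacity_overhead` is a hypothetical helper.

```python
from fractions import Fraction


def capacity_overhead(data_disks: int, redundancy_disks: int) -> Fraction:
    """Redundancy disks per data disk: 1/k for single parity, 2/k for RAID-6."""
    return Fraction(redundancy_disks, data_disks)


raid5 = capacity_overhead(data_disks=8, redundancy_disks=1)
raid6 = capacity_overhead(data_disks=16, redundancy_disks=2)   # k doubled
```

Doubling k under RAID-6 gives the same ratio as the single-parity group with half as many data disks.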
When updating a certain block in a stripe, we are interested in the number of IOs required and the number of calculations that must be performed. The optimal is three reads, three writes and three XORs.
RAID-6 rebuild includes two different processes—rebuilding after one disk failure, and rebuilding after two disk failures. After a single disk failure, the optimal number of reads needed is k/2, as opposed to k reads in RAID-4. Such optimal performance requires codes which permit reading partial columns, by taking advantage of both redundancy blocks of the stripe, as described in greater detail herein. The minimal number of XORs required is k−1. After the second disk failure, rebuilding a failed block, on average, requires reading k/2 blocks, performing k−1 XORs, and writing the computed value. It should be noted that this does not imply that rebuilding a specific block can be done efficiently, since the rebuilding of one block may depend upon the rebuilding of a different block.
In order to prevent bottlenecks, RAID-6 may also be implemented in the manner of RAID-5, where redundancy information is spread on the various disks in a well-balanced manner.
The specification for RAID-6 does not specify how the data recovery is to be achieved and each storage manufacturer embodies RAID-6 in a different way.
Several RAID-6 schemes have been proposed and used in practice. One solution is to use the Reed-Solomon error correction code, which is expensive to calculate.
Another possibility is to use parity bits. N data disks are supported by two redundancy disks p1 and p2, each one holding a different parity bit. Again, if all the parity bits are on the same two disks, then the bottleneck becomes a problem. However, the problem can be solved by use of distributed parity stripes over N+2 disks, as was specified in RAID-5.
The following describes two such coding schemes which are based on parity calculations of rows and diagonals in a matrix of blocks. These two codes are known as Even/Odd and RDP. They both add a second parity disk, labeled Q, which contains blocks that hold the parity of diagonals of the data blocks. P, as before, contains blocks that hold the parities of rows of blocks. Note that in both schemes, it is advantageous to work with a block size which is smaller than the native page size. For the examples in this section we assume the native page size is 4 KB, and that the block size is 1 KB. Each stripe contains four rows, and thus the four blocks present on each disk form a single native page. It is assumed that pages are read and written using a single disk operation.
Reference is now made to FIG. 2, which illustrates a version of RAID-6 called Even/Odd, which again uses two parity disks P and Q. The P disk is set up exactly as in RAID-4 and RAID-5, to give a row parity, and Q is the parity of the diagonals. The system requires a prime number k of diagonals, and k−1 rows. The geometry of the situation gives one more diagonal than there are rows, and so the Even/Odd scheme adds the extra diagonal's parity to each of the other diagonal parity blocks. The resulting scheme works, but the update overhead is sub-optimal.
Under Even/Odd, each stripe contains k (k must be prime) data disks, and two parity disks P and Q. The stripe is composed of a matrix of blocks, which contains k−1 rows. Each of the k+2 disks is viewed as a column in the matrix. Disk P contains k−1 blocks, each consisting of the parity of the k data disk blocks in its row. The k by k−1 matrix made up by the blocks in the data disks contains k diagonals, each of size k−1. The k−1 first diagonals are considered “regular” diagonals, and the last diagonal is known as the “extra” diagonal. Each of the k−1 blocks in disk Q, holds the parity of one of the regular diagonals XORed with the parity of the extra diagonal.
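The construction of P and Q under Even/Odd can be sketched as follows. This is a minimal illustration under stated assumptions: blocks are modeled as integers, the block at row r and column c is taken to lie on diagonal (r + c) mod k, and diagonal k−1 is treated as the extra diagonal. The name `evenodd_encode` and this indexing convention are illustrative choices, not details given in the description.

```python
from typing import List, Tuple


def evenodd_encode(data: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Compute the P and Q columns for a (k-1) x k matrix of data blocks.

    Cell (r, c) is assumed to lie on diagonal (r + c) mod k, and
    diagonal k-1 is treated as the extra diagonal.
    """
    rows, k = len(data), len(data[0])
    if rows != k - 1:
        raise ValueError("expected k-1 rows for k data disks")

    p = [0] * rows          # row parities
    diag = [0] * k          # parity of each of the k diagonals
    for r in range(rows):
        for c in range(k):
            p[r] ^= data[r][c]
            diag[(r + c) % k] ^= data[r][c]

    extra = diag[k - 1]
    # Each Q block is a regular diagonal's parity XORed with the extra diagonal's parity.
    q = [diag[d] ^ extra for d in range(k - 1)]
    return p, q


# k = 3 data disks, so the stripe has 2 rows.
data = [[1, 2, 4],
        [8, 16, 32]]
p, q = evenodd_encode(data)
```

Every Q block depends on the extra diagonal, which is the source of the update cost analyzed for this code.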
It is not coincidental that there exist more diagonals than rows. It is this asymmetry that allows the recovery of two disk (column) failures. The asymmetry provides that for any two disks that fail, each of their respective columns contains at least one block which belongs to a diagonal not present in the second column. This allows the beginning of the recovery process, by reconstructing this block according to its diagonal information alone. The recovery process continues by reconstructing the block in the same row as the recovered block, using their row information. Performing these two steps iteratively yields a complete recovery. Of course, this entire process can begin only after the parity blocks of the diagonals are decoded. To achieve this, the parity of the extra diagonal is decoded by XORing all blocks in the stripe, and then XORing this value with the rest of the diagonals' parity blocks.
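The iterative recovery can be sketched as a loop that repeatedly finds a row or diagonal with exactly one missing block and fills it in. The sketch makes several assumptions not stated above: blocks are integers, cell (r, c) lies on diagonal (r + c) mod k with diagonal k−1 as the extra one, both failed disks are data disks, and k is an odd prime. In that case the extra diagonal's parity equals the XOR of all P and Q blocks, which is how it is decoded here. All function names are hypothetical.

```python
from typing import List, Optional, Tuple

Matrix = List[List[Optional[int]]]


def encode(data: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Even/Odd P and Q columns for a (k-1) x k matrix of integer blocks."""
    rows, k = len(data), len(data[0])
    p, diag = [0] * rows, [0] * k
    for r in range(rows):
        for c in range(k):
            p[r] ^= data[r][c]
            diag[(r + c) % k] ^= data[r][c]
    return p, [diag[d] ^ diag[k - 1] for d in range(k - 1)]


def recover_two_data_columns(data: Matrix, p: List[int], q: List[int]) -> Matrix:
    """Fill in blocks marked None, assuming P and Q are intact and k is an odd prime."""
    rows, k = len(data), len(data[0])
    # Decode the extra diagonal's parity from the intact P and Q blocks.
    s = 0
    for block in p + q:
        s ^= block
    diag_parity = [q[d] ^ s for d in range(k - 1)] + [s]

    # Each parity equation is (cells it covers, XOR of those cells).
    equations = [([(r, c) for c in range(k)], p[r]) for r in range(rows)]
    for d in range(k):
        cells = [(r, c) for r in range(rows) for c in range(k) if (r + c) % k == d]
        equations.append((cells, diag_parity[d]))

    grid = [row[:] for row in data]
    progress = True
    while progress:
        progress = False
        for cells, parity in equations:
            missing = [(r, c) for r, c in cells if grid[r][c] is None]
            if len(missing) == 1:          # exactly one unknown: solve for it
                value = parity
                for r, c in cells:
                    if grid[r][c] is not None:
                        value ^= grid[r][c]
                r, c = missing[0]
                grid[r][c] = value
                progress = True
    if any(block is None for row in grid for block in row):
        raise ValueError("could not recover all blocks")
    return grid


original = [[3, 5, 6, 9, 12],
            [7, 1, 8, 2, 11],
            [4, 10, 13, 14, 15],
            [21, 22, 23, 24, 25]]          # k = 5 data disks, 4 rows
p, q = encode(original)
damaged = [[None if c in (1, 3) else v for c, v in enumerate(row)] for row in original]
restored = recover_two_data_columns(damaged, p, q)
```

The loop does not hard-code the order of reconstruction; it relies on the asymmetry described above to guarantee that some diagonal always has a single missing block to start from.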
Let us now analyze the efficiency of Even/Odd. It is optimal in terms of capacity overhead, and also in terms of the I/O overhead imposed upon update operations. In terms of computation, however, it is not optimal. The average number of XORs needed when performing an update operation is almost 4. The reason for this is that updating the blocks of the “extra” diagonal requires many more XORs than updating the blocks of the “regular” diagonals. An updated block in a regular diagonal requires (the optimal) 3 XORs. An updated block in the extra diagonal requires k+1 XORs. Since there are k−1 blocks in the extra diagonal, and (k−1)² blocks in regular diagonals, the average number of XORs is (3(k−1)² + (k−1)(k+1)) / (k(k−1)), where k(k−1) is the total number of blocks. This equals (4k−2)/k, which approaches 4 as k grows. That is to say, a particularly high update overhead is encountered when updating the kth diagonal (the one that has no corresponding row) since it is spread over all the other diagonal parities. The overhead can be reduced by using data blocks of 1 KB, and then updating a whole column in one go. In this case just three reads and three writes are required. However, four XOR operations are still required per update.
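The averaging argument can be checked with exact arithmetic. The sketch below uses only the per-block XOR counts stated above (3 for a block on a regular diagonal, k+1 for a block on the extra diagonal); the function name is hypothetical and the sample values of k are arbitrary primes.

```python
from fractions import Fraction


def evenodd_avg_update_xors(k: int) -> Fraction:
    """Average XORs per single-block update, from the per-block counts stated above."""
    regular_blocks = (k - 1) ** 2      # 3 XORs each
    extra_blocks = k - 1               # k + 1 XORs each
    total_xors = 3 * regular_blocks + (k + 1) * extra_blocks
    return Fraction(total_xors, k * (k - 1))


averages = {k: evenodd_avg_update_xors(k) for k in (5, 7, 11, 101)}
```

For each of these values the result coincides with (4k−2)/k and moves toward 4 as k increases.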
Rebuild efficiency for first disk failure requires k reads and the optimal k−1 XORs. This operation is performed using row parity only, just as in RAID-4. Rebuild efficiency for two disk failure requires more XORs than optimal, due to extra XORs performed to decode the extra diagonal's parity information.
Reference is now made to FIG. 3, which is a simplified schematic diagram illustrating an alternative scheme to Even/Odd known as RDP or Row Diagonal Parity. RDP is the same as Even/Odd except that it deals with the extra parity data (the additional diagonal in the Even/Odd scheme) in a different way. RDP arranges the data in K rows and K data columns, where K+1 is prime. The row parity data P is then included in calculation of the diagonal parities. The data matrix is then one place short for the K diagonals, so that the Kth diagonal is not written. However, since the row parities are themselves included in calculating the remaining diagonal parities, the necessary information is present and full two-disk failure data recovery is possible.
In greater detail, RDP is very similar to Even/Odd. The main difference is in the handling of the extra diagonal. Instead of adding its parity to all of the blocks in Q, RDP simply does not keep parity information for the extra diagonal. This of course is not enough, since now the blocks in the extra diagonal are “represented” only in one parity block. To remedy this, RDP adds the blocks of the first parity column (P) to the diagonals. In this way, if a block in the extra diagonal is updated, it induces a change in two parity blocks. The first is its row parity block in P, and the second is its row parity block's diagonal parity block in Q.
Under RDP, each stripe contains k (k+1 must be prime) data disks, and two parity disks P and Q. The stripe is composed of a matrix of blocks, which contains k rows. Each of the k+2 disks is viewed as a column in the matrix. Disk P contains k blocks, each consisting of the parity of the k data disk blocks in its row. The k by k+1 matrix made up by the blocks in the data disks and P contains k regular diagonals and one extra diagonal, each of size k. Each of the k blocks in disk Q, holds the parity of one of the regular diagonals.
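The RDP construction can be sketched in the same style. Assumptions made for illustration: blocks are integers, the P column is treated as column k of the matrix, cell (r, c) lies on diagonal (r + c) mod (k+1), and diagonal k is the extra diagonal whose parity is not stored. The name `rdp_encode` is hypothetical.

```python
from typing import List, Tuple


def rdp_encode(data: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Compute P and Q for a k x k matrix of data blocks (k + 1 assumed prime)."""
    k = len(data)
    if any(len(row) != k for row in data):
        raise ValueError("expected a k x k matrix of data blocks")

    p = [0] * k
    for r in range(k):
        for c in range(k):
            p[r] ^= data[r][c]

    q = [0] * k                                        # one block per regular diagonal
    for r in range(k):
        for c, block in enumerate(data[r] + [p[r]]):   # column k is the P column
            d = (r + c) % (k + 1)
            if d != k:                                 # diagonal k is the extra one: not stored
                q[d] ^= block
    return p, q


data = [[1, 2],
        [4, 8]]          # k = 2 data disks, since k + 1 = 3 is prime
p, q = rdp_encode(data)

# Changing the block at (1, 1), which lies on the extra diagonal, still changes
# two parity blocks: P[1], and the Q block of P[1]'s diagonal.
p2, q2 = rdp_encode([[1, 2], [4, 9]])
```

In this small example the second encoding differs from the first only in P[1] and Q[0], matching the behavior described for blocks on the extra diagonal.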
The efficiency of RDP is similar to Even/Odd. Again, the average number of XORs needed when performing an update operation is almost 4 (in contrast to an optimal of 3), and the number of reads needed when reconstructing a block after a single disk failure is k (where the optimal is k/2). The reason for the extra XORs is that when updating a block, its row parity block in P must be updated as well as two diagonal parity blocks in Q: the block of its own diagonal and the block of its parity block's diagonal. In general, (k−1)² blocks require 4 XORs, and the remaining 2k−1 blocks require 3 XORs. Thus the average over all k² blocks is 3 + (k−1)²/k², which approaches 4 as k grows.
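The block counts can be reproduced by enumerating the geometry of the stripe. The counting model is an assumption consistent with the figures above: one XOR to compute the change in the block, one to apply it to P, and one for each affected Q block. Diagonals are indexed as (r + c) mod (k+1), with the P column as column k and diagonal k not stored, which is an illustrative convention.

```python
from fractions import Fraction


def rdp_update_xors(r: int, c: int, k: int) -> int:
    """XORs needed to update data block (r, c) under the assumed diagonal indexing."""
    xors = 2                                  # compute the delta, apply it to P[r]
    if (r + c) % (k + 1) != k:                # the block's own diagonal is stored in Q
        xors += 1
    if (r + k) % (k + 1) != k:                # the diagonal of its row parity block is stored in Q
        xors += 1
    return xors


def rdp_avg_update_xors(k: int) -> Fraction:
    """Average XORs per single-block update over all k*k data blocks."""
    total = sum(rdp_update_xors(r, c, k) for r in range(k) for c in range(k))
    return Fraction(total, k * k)


counts = [rdp_update_xors(r, c, 4) for r in range(4) for c in range(4)]   # k = 4
average = rdp_avg_update_xors(4)
```

For k = 4 the enumeration gives 9 blocks needing 4 XORs and 7 blocks needing 3, in agreement with (k−1)² and 2k−1.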
The extra XORs mandate that each column is mapped to a page instead of each block being mapped to a page. If each block was mapped to a page these extra XORs would cause additional read and write operations for each update, which is not acceptable. Thus, only optimal codes (in terms of update efficiency) have the ability to map blocks to pages without incurring an IO overhead.
The importance of mapping blocks to pages relates to efficient rebuild. In theory, RDP has a rebuild technique for first disk failure which requires reading only approximately three quarters of the blocks. This can be done by recovering half of the rows using the P parity, and then recovering the remaining rows using the diagonals. However, it offers little benefit in practice because these blocks reside in all of the columns, and each column is mapped to a page. Thus, in practice, no read is spared and k reads must be performed.
It is noted that while k, which dictates the number of columns in both codes, must be a prime number (or a prime number minus one), this does not diminish the flexibility of choosing any number of disks for the stripe size. This can be accomplished by using virtual disks for the remaining columns, whose content is permanently set to zero and thus does not affect any XOR calculations. In fact, the content may be permanently set to any predefined data which does not affect the XOR calculations. k must only be larger than the maximum number of disks in a stripe. There is a slight penalty for fixing a large k with these codes, because their update efficiency decreases as k grows.