In modern storage systems, RAID (Redundant Array of Independent Disks) techniques are the preferred way to achieve high performance and reliability. Among the well-known RAID techniques, RAID-6 (block-level striping with double distributed parity), which can tolerate two disk failures, offers the best balance between storage efficiency and reliability. Erasure-coding technologies can provide both high fault tolerance and high storage efficiency [4]. RAID-6 makes larger RAID groups more practical, especially for high-availability systems. This becomes increasingly important as large-capacity drives lengthen the time needed to recover from the failure of a single drive. Single-parity RAID levels are as vulnerable to data loss as a RAID 0 array until the failed drive is replaced and its data rebuilt; the larger the drive, the longer the rebuild will take. Double parity gives time to rebuild the array without the data being at risk if a single additional drive fails before the rebuild is complete.
While all of these erasure coding techniques are feasible in practice, coding schemes based on the Reed-Solomon (RS) code are the most popular because of their MDS (Maximum Distance Separable) property. The information dispersal algorithm [9] and Parchive, adopted in some schemes and systems [6], are in fact derived from RS codes. To date, several classes of horizontal MDS array codes have been designed to recover from the simultaneous failure of two storage nodes, including the EVENODD code [1], [2], X-code [11], the RDP (Row-Diagonal Parity) scheme [3], the Liberation code [7], and their derivative schemes [5], [10].
Although X-code [11] is an elegant two-erasure code, it is not a RAID-6 code: it is a vertical code and does not fit the RAID-6 specification of having coding devices P and Q, where P is a simple parity device [7]. In fact, neither non-MDS codes nor vertical codes can be implemented in RAID-6 systems [7]. A recent examination of RAID-6 coding performance, which used Classic Reed-Solomon codes and Cauchy Reed-Solomon codes from open-source erasure coding libraries, concluded that special-purpose RAID-6 codes vastly outperform their general-purpose counterparts and that RDP performs the best of these by a narrow margin [8].
RAID 5 uses block-level striping with parity data distributed across all member disks. A concurrent series of blocks (one on each of the disks in an array) is collectively called a stripe. If another block, or some portion thereof, is written on that same stripe, the parity block, or some portion thereof, is recalculated and rewritten. For small writes, this requires the following steps: (1) read the old data block; (2) read the old parity block; (3) compare the old data block with the write request and, for each bit that has flipped (changed from 0 to 1, or from 1 to 0) in the data block, flip the corresponding bit in the parity block; (4) write the new data block; and (5) write the new parity block. The disk used for the parity block is staggered from one stripe to the next, hence the term distributed parity blocks. RAID 5 writes are expensive in terms of disk operations and traffic between the disks and the controller. The parity blocks are not read on data reads, since this would add unnecessary overhead and diminish performance. The parity blocks are read, however, when a read of blocks in the stripe fails due to failure of any one of the disks, and the parity block in the stripe is used to reconstruct the errant sector. The CRC error is thus hidden from the main computer. Likewise, should a disk fail in the array, the parity blocks from the surviving disks are combined mathematically with the data blocks from the surviving disks to reconstruct the data from the failed drive on the fly.
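The small-write sequence and the on-the-fly rebuild described above can be sketched as follows. This is a minimal illustration, not a controller implementation: blocks are modeled as in-memory byte strings, and the helper names (`xor_blocks`, `small_write_parity`) and the three-disk stripe are assumptions made for the example.

```python
from functools import reduce

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """Return the new parity block for a read-modify-write update."""
    flipped = xor_blocks(old_data, new_data)  # bits that changed in the data block
    return xor_blocks(old_parity, flipped)    # flip the same bits in the parity block

# Hypothetical stripe: three data blocks of four bytes each, plus one parity block.
stripe = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = reduce(xor_blocks, stripe)

# Small write to block 1: only the old data block and old parity block are read.
new_block = b"\xff\x00\xff\x00"
new_parity = small_write_parity(stripe[1], parity, new_block)
stripe[1] = new_block

# Parity recomputed from the whole stripe, for comparison with the incremental update.
full_parity = reduce(xor_blocks, stripe)

# Rebuild block 1 as if its disk had failed: XOR the surviving blocks with the parity.
rebuilt = reduce(xor_blocks, [stripe[0], stripe[2], new_parity])
```

The incremental update touches two blocks regardless of stripe width, which is why the small-write path costs two reads and two writes.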
In the event of a system failure while there are active writes, the parity of a stripe may become inconsistent with the data. If this is not detected and repaired before a disk or block fails, data loss may ensue as incorrect parity will be used to reconstruct the missing block in that stripe. Battery-backed cache and similar techniques are commonly used to reduce the window of opportunity for this to occur. The same issue occurs for RAID-6.
RAID 5 implementations suffer from poor performance when faced with a workload which includes many writes which are smaller than the capacity of a single stripe. This is because parity must be updated on each write, requiring read-modify-write sequences for both the data block and the parity block. More complex implementations may include a non-volatile write back cache to reduce the performance impact of incremental parity updates. Large writes, spanning an entire stripe width, can be done without read-modify-write cycles for each data plus parity block, by simply overwriting the parity block with the computed parity since the new data for each data block in the stripe is known in its entirety at the time of the write.
RAID 6 extends RAID 5 by adding an additional parity block; thus it uses block-level striping with two parity blocks distributed across all member disks. RAID 6 does not have a performance penalty for read operations, but it does have a performance penalty on write operations because of the overhead associated with parity calculations. Performance varies greatly depending on how RAID 6 is implemented. RAID 6 is no more space inefficient than RAID 5 with a hot spare drive when used with a small number of drives, but as arrays become bigger and have more drives, the loss in storage capacity becomes less important and the probability of data loss is greater. RAID 6 protects against data loss during an array rebuild when a second drive is lost, a bad block read is encountered, or another drive failure occurs.
Two different syndromes need to be computed in order to allow the loss of any two drives. One of them, P, can be the simple XOR of the data across the stripes, as with RAID 5. A second, independent syndrome, Q, is computed using field theory. The Galois field $GF(m)$ is introduced with $m = 2^k$, where $GF(m) \cong F_2[x]/(p(x))$ for a suitable irreducible polynomial $p(x)$ of degree $k$. A chunk of data can be written as $d_{k-1}d_{k-2} \dots d_0$ in base 2, where each $d_i$ is either 0 or 1. This is chosen to correspond with the element $d_{k-1}x^{k-1} + d_{k-2}x^{k-2} + \dots + d_1 x + d_0$ in the Galois field.
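To make the correspondence between bit strings and field elements concrete, the sketch below works in GF(2^8). The choice of p(x) = x^8 + x^4 + x^3 + x^2 + 1 (0x11D) and of x (the byte 0x02) as generator is an assumption made for illustration; the text above only requires some suitable irreducible polynomial. The function names are likewise hypothetical.

```python
K = 8          # field is GF(2^K)
POLY = 0x11D   # assumed p(x) = x^8 + x^4 + x^3 + x^2 + 1

def to_poly(d: int) -> str:
    """Render the bits d_{K-1}...d_0 as the polynomial they stand for."""
    terms = []
    for i in range(K - 1, -1, -1):
        if (d >> i) & 1:
            terms.append("1" if i == 0 else "x" if i == 1 else "x^%d" % i)
    return " + ".join(terms) or "0"

def gf_add(a: int, b: int) -> int:
    """Addition in a field of characteristic two is bitwise XOR."""
    return a ^ b

def gf_mul(a: int, b: int) -> int:
    """Multiply two field elements: shift-and-XOR, reducing modulo p(x)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << K):   # degree reached K, so subtract (XOR) p(x)
            a ^= POLY
    return result

# Successive powers of the assumed generator x = 0x02.
powers = []
element = 1
for _ in range(2 ** K - 1):
    powers.append(element)
    element = gf_mul(element, 0x02)
```

For example, `to_poly(0b10000011)` renders the byte 0x83 as `x^7 + x + 1`, and multiplying x^7 by x wraps around through the reduction step instead of overflowing the byte.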
Let $D_0, \dots, D_{n-1} \in GF(m)$ correspond to the stripes of data across hard drives encoded as field elements in this manner (in practice they would probably be broken into byte-sized chunks). If $g$ is some generator of the field, $\oplus$ denotes addition in the field, and concatenation denotes multiplication, then P and Q may be computed as follows ($n$ denotes the number of data disks):
$$P = \bigoplus_i D_i = D_0 \oplus D_1 \oplus D_2 \oplus \dots \oplus D_{n-1}$$

$$Q = \bigoplus_i g^i D_i = g^0 D_0 \oplus g^1 D_1 \oplus g^2 D_2 \oplus \dots \oplus g^{n-1} D_{n-1}$$
Here $\oplus$ is a bitwise XOR operator and $g^i$ is the action of a linear feedback shift register on a chunk of data. Thus, the calculation of P is just the XOR of each stripe. This is because addition in any finite field of characteristic two reduces to the XOR operation.
The computation of Q is the XOR of a shifted version of each stripe. Mathematically, the generator is an element of the field such that $g^i$ is different for each nonnegative $i$ satisfying $i < n$.
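The two syndromes can be sketched in a few lines, treating each byte position of a stripe independently. The field GF(2^8) with polynomial 0x11D and generator g = 0x02 is an assumption chosen for illustration, and `compute_pq` is a hypothetical helper name.

```python
POLY = 0x11D   # assumed irreducible polynomial for GF(2^8)
G = 0x02       # assumed generator g

def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^8), reducing modulo POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= POLY
    return result

def compute_pq(data):
    """Return (P, Q) for a list of equal-length data blocks D_0 .. D_{n-1}."""
    size = len(data[0])
    p = bytearray(size)
    q = bytearray(size)
    coeff = 1                                 # g^0
    for block in data:
        for pos, byte in enumerate(block):
            p[pos] ^= byte                    # P: plain XOR of the data
            q[pos] ^= gf_mul(coeff, byte)     # Q: XOR of g^i * D_i
        coeff = gf_mul(coeff, G)              # advance to g^(i+1)
    return bytes(p), bytes(q)

# Hypothetical stripe of three one-byte data blocks.
data = [b"\x01", b"\x02", b"\x03"]
p, q = compute_pq(data)
```

Multiplying by g = 0x02 is a one-bit left shift followed by a conditional XOR with the polynomial, which is the shift-register behavior the text refers to.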
If one data drive is lost, the data can be recomputed solely from P, and therefore the implementation is similar to RAID 5. If two data drives are lost, or a data drive and the drive containing P are lost, the data can be recovered from P and Q, or from Q alone, respectively, using a more complex process.
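Suppose that $D_i$ and $D_j$ are the lost values with $i \ne j$. Using the other values of D, constants A and B may be found so that $D_i \oplus D_j = A$ and $g^i D_i \oplus g^j D_j = B$. Multiplying both sides of the latter equation by $g^{-i}$, the inverse of $g^i$ in the field, and adding the result to the former equation yields $(g^{j-i} \oplus 1) D_j = g^{-i} B \oplus A$. Because $g^{j-i} \ne 1$ when $i \ne j$, the factor $g^{j-i} \oplus 1$ is nonzero and can be inverted, giving a solution for $D_j$, which may then be used to compute $D_i = A \oplus D_j$. Computing Q is computationally intensive compared with computing P.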
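The recovery of two lost data values can be sketched as below, with one byte per drive for brevity. The sketch assumes GF(2^8) with polynomial 0x11D and generator 0x02, and it clears the coefficient of D_i by multiplying B by the field inverse of g^i; the function names are hypothetical.

```python
POLY = 0x11D   # assumed irreducible polynomial for GF(2^8)
G = 0x02       # assumed generator g

def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^8), reducing modulo POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= POLY
    return result

def gf_pow(a: int, e: int) -> int:
    """Raise a to the e-th power by repeated multiplication."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result

def gf_inv(a: int) -> int:
    """Inverse of a nonzero element: a^254, since a^255 = 1 in GF(2^8)."""
    return gf_pow(a, 254)

def encode(data):
    """Return (P, Q) for a list of one-byte data values."""
    p = q = 0
    for idx, d in enumerate(data):
        p ^= d
        q ^= gf_mul(gf_pow(G, idx), d)
    return p, q

def recover_two(data, p, q, i, j):
    """Recover data[i] and data[j]; those two entries of `data` are ignored."""
    a, b = p, q
    for idx, d in enumerate(data):
        if idx not in (i, j):
            a ^= d                              # A = D_i xor D_j
            b ^= gf_mul(gf_pow(G, idx), d)      # B = g^i D_i xor g^j D_j
    g_inv_i = gf_inv(gf_pow(G, i))
    g_ji = gf_mul(gf_pow(G, j), g_inv_i)        # g^(j-i)
    d_j = gf_mul(gf_inv(g_ji ^ 1), gf_mul(g_inv_i, b) ^ a)
    d_i = a ^ d_j
    return d_i, d_j

# Hypothetical four-drive stripe in which drives 1 and 3 are lost.
original = [0x12, 0x34, 0x56, 0x78]
p, q = encode(original)
damaged = [0x12, None, 0x56, None]
recovered = recover_two(damaged, p, q, 1, 3)
```

Practical implementations typically replace `gf_pow` and `gf_inv` with precomputed lookup tables; the loops here are kept only for readability.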
A number of algorithms for computing Q in a RAID 6 implementation are known. Reed-Solomon codes have a strip unit which is a w-bit word, where w must be large enough that $n \le 2^w + 1$ (in this and the following paragraphs, $n = k + m$ counts all disks, k holding data and m holding coding information). w is typically constrained so that words fall on machine word boundaries: $w \in \{8, 16, 32, 64\}$. However, as long as $n \le 2^w + 1$, the value of w may be chosen at the discretion of the user. Most implementations choose w = 8, since their systems contain fewer than 256 disks. Reed-Solomon codes treat each word as a number between 0 and $2^w - 1$, and operate on these numbers with Galois field arithmetic over $GF(2^w)$. The act of encoding with Reed-Solomon codes is simple linear algebra. A generator matrix $G^T$ is constructed from a Vandermonde matrix, and this matrix is multiplied by the k data words to create a codeword composed of the k data words and m coding words. When disks fail, the lost data may be reconstructed by deleting rows of $G^T$, inverting it, and multiplying the inverse by the surviving words. This process is equivalent to solving a set of independent linear equations. The construction of $G^T$ from the Vandermonde matrix ensures that the matrix inversion is always successful.
CRS codes modify RS codes in two ways. First, they employ a different construction of the generator matrix, using Cauchy matrices instead of Vandermonde matrices. Second, they eliminate the expensive multiplications of RS codes by converting them to extra XOR operations. Note that this second modification, which transforms $G^T$ from an $n \times k$ matrix of w-bit words to a $wn \times wk$ matrix of bits, can also be applied to Vandermonde-based RS codes. As with RS coding, w must be selected so that $n \le 2^w + 1$. Instead of operating on single words, CRS coding operates on entire strips. In particular, strips are partitioned into w packets, and these packets may be large. The act of coding now involves only XOR operations: a coding packet is constructed as the XOR of all data packets that have a one bit in the coding packet's row of $G^T$. To make XORs efficient, the packet size must be a multiple of the machine's word size. The strip size is therefore equal to w times the packet size. Since w no longer relates to the machine word size, w is not constrained to $\{8, 16, 32, 64\}$; instead, any value of w may be selected as long as $n \le 2^w$. Decoding in CRS is analogous to decoding in RS: all rows of $G^T$ corresponding to failed packets are deleted, and the matrix is inverted and employed to recalculate the lost data.
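The conversion of a field multiplication into XOR operations can be illustrated by expanding a single element of GF(2^w) into a w-by-w bit matrix. The sketch below assumes w = 4 and the polynomial x^4 + x + 1 (0x13), both picked only to keep the matrices small; `bit_matrix` and `apply_bits` are hypothetical names, and a real CRS encoder applies the same idea to whole packets, not single bits.

```python
W = 4          # assumed word size; deliberately not a machine word size
POLY = 0x13    # assumed irreducible polynomial x^4 + x + 1

def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^W), reducing modulo POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << W):
            a ^= POLY
    return result

def bit_matrix(e: int):
    """W x W bit matrix for multiplication by e: column c holds the bits of e * x^c."""
    cols = [gf_mul(e, 1 << c) for c in range(W)]
    return [[(cols[c] >> r) & 1 for c in range(W)] for r in range(W)]

def apply_bits(matrix, d: int) -> int:
    """Multiply by d using only AND and XOR on individual bits."""
    out = 0
    for r in range(W):
        bit = 0
        for c in range(W):
            bit ^= matrix[r][c] & ((d >> c) & 1)   # a one in row r selects input bit c
        out |= bit << r
    return out

# Expand the element 0x06 and use the bit matrix in place of a field multiply.
m6 = bit_matrix(0x06)
product = apply_bits(m6, 0x0B)
```

Each row of the bit matrix lists which input bits are XORed into one output bit; with packets in place of bits, this is the row-selection rule for building a coding packet.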
EVENODD and RDP are two codes developed for the special case of RAID-6, that is, $m = 2$. Although their original specifications use different terms, EVENODD and RDP fit the same paradigm as CRS coding, with strips being composed of w packets. In EVENODD, w is constrained such that $k + 1 \le w$ and $w + 1$ is a prime number. In RDP, $w + 1$ must be prime and $k \le w$. Both codes perform the best when $(w - k)$ is minimized. In particular, RDP achieves optimal encoding and decoding performance of $(k - 1)$ XOR operations per coding word when $k = w$ or $k + 1 = w$. Both codes' performance decreases as $(w - k)$ increases.
If we encode using a generator bit-matrix for RAID-6, the matrix is quite constrained. In particular, the first $kw$ rows of $G^T$ compose an identity matrix, and in order for the P drive to be straight parity, the next w rows must contain k identity matrices. The only flexibility in a RAID-6 specification is the composition of the last w rows. Blaum and Roth demonstrate that when $k \le w$, these remaining w rows must have at least $kw + k - 1$ ones for the code to be MDS. We term MDS matrices that achieve this lower bound Minimal Density codes.