The present invention relates, in general, to the field of systems and methods for calculating RAID 6 check codes. More particularly, the present invention relates to an efficient system and method for performing the check code calculations for RAID 6 computer mass storage systems in such a manner that it becomes computationally feasible to implement a RAID 6 system in software on a modern high-speed computer by careful matching of the characteristics of the Commutative Ring in which calculations take place with the capabilities of the computer.
RAID stands for Redundant Array of Independent Disks and is a taxonomy of redundant disk storage schemes which define a number of ways of configuring and using multiple computer disk drives to achieve varying levels of availability, performance, capacity and cost while appearing to the software application as a single large capacity drive. Various RAID levels have been defined from RAID 0 to RAID 6, each offering tradeoffs in the previously mentioned factors. RAID 0 is nothing more than traditional striping in which user data is broken into chunks which are stored onto the stripe set by being spread across multiple disks with no data redundancy. RAID 1 is equivalent to conventional "shadowing" or "mirroring" techniques and is the simplest method of achieving data redundancy by having, for each disk, another containing the same data and writing to both disks simultaneously. The combination of RAID 0 and RAID 1 is typically referred to as RAID 0+1 and is implemented by striping shadow sets, resulting in the relative performance advantages of both RAID levels. RAID 2, which utilizes Hamming Code written across the members of the RAID set, is not now considered to be of significant importance.
In RAID 3, data is striped across a set of disks with the addition of a separate dedicated drive to hold parity data. The parity data is calculated dynamically as user data is written to the other disks to allow reconstruction of the original user data if a drive fails without requiring replication of the data bit-for-bit. Error detection and correction codes ("ECC") such as exclusive OR ("XOR") or more sophisticated Reed-Solomon techniques may be used to perform the necessary mathematical calculations on the binary data to produce the parity information in RAID 3 and higher level implementations. While parity allows the reconstruction of the user data in the event of a drive failure, the speed of such reconstruction is a function of system workload and the particular algorithm used.
As with RAID 3, the RAID scheme known as RAID 4 consists of N data disks and one parity disk wherein the parity disk sectors contain the bitwise XOR of the corresponding sectors on each data disk. This allows the contents of the data in the RAID set to survive the failure of any one disk. RAID 5 is a modification of RAID 4 which stripes or "diagonalizes" the parity across all of the disks in the array in order to statistically equalize the load on the disks. Unlike RAID 3, RAID 4 and RAID 5 implementations do not receive enough data with each write to calculate parity solely from the incoming data. Therefore, the array controller or RAID software must combine new data with old data and existing parity data to produce the new parity data, requiring each RAID write to include a read from two drives (old data, old parity), the calculation of the difference between the new and old data, the application of that difference to the old parity to obtain the new parity, and the writing of the new data and parity back onto the same two drives.
The designation of RAID 6 has been used colloquially to describe RAID schemes which can withstand the failure of two disks without losing data through the use of two parity drives (commonly referred to as the "P" and "Q" drives) for redundancy and sophisticated ECC techniques. Data and ECC information are striped across all members of the RAID set and write performance is generally worse than with RAID 5 because three separate drives must each be accessed twice during writes. In the following description, RAID 6 will be treated as if it were an extension of RAID 4; the RAID 5 diagonalization process obviously can be applied as well.
With respect to the previously described levels, RAID storage subsystems can be implemented in either hardware or software. In the former instance, the RAID algorithms are packaged into separate controller hardware coupled to the computer input/output ("I/O") bus and, although adding little or no central processing unit ("CPU") overhead, the additional hardware required nevertheless adds to the overall system cost. On the other hand, software implementations incorporate the RAID algorithms into system software executed by the main processor together with the operating system, obviating the need and cost of a separate hardware controller, yet adding to CPU overhead.
All currently described RAID 6 schemes have been implemented using specialized hardware to calculate multiple orthogonal arithmetic check (parity) codes on the data disks; these check codes are stored on a set of check disks, and are used to reconstruct data in the event that one or more data disks (or check disks) fail. In order to fully appreciate the difficulties encountered, it is useful to briefly describe the type of arithmetic used in these calculations.
A Commutative Ring (hereinafter "Ring") consists of a set of elements plus a pair of operators on the members of the set which behave more-or-less like addition and multiplication under the integers, namely:
1. The set is closed under both operators (if $\alpha$ and $\beta$ are members, so are $\alpha+\beta$ and $\alpha*\beta$).
2. Both operators are commutative ($\alpha+\beta=\beta+\alpha$, $\alpha*\beta=\beta*\alpha$).
3. Both operators are associative ($\alpha+[\beta+\gamma]=[\alpha+\beta]+\gamma$, $\alpha*[\beta*\gamma]=[\alpha*\beta]*\gamma$).
4. Each operator has a unique identity element ($\alpha+0=\alpha$, $\alpha*1=\alpha$).
5. Multiplication distributes across addition ($\alpha*[\beta+\gamma]=\alpha*\beta+\alpha*\gamma$).
6. Every element $\alpha$ has a unique additive inverse $-\alpha$ such that $\alpha+(-\alpha)=0$.

1. Data disk $j$ and the Q check disk fail; restore data disk $j$ via the formula:
$$d_j = P + \sum d_k$$
2. Data disk $j$ and the P check disk fail; restore data disk $j$ via the formula:
$$d_j = \left(Q + \sum d_k*\alpha^k\right)*(\alpha^j)^{-1}$$
3. Data disks $i$ and $j$ fail; set up the following two simultaneous equations:
$$d_i + d_j = P + \sum d_k = A$$
$$d_i*\alpha^i + d_j*\alpha^j = Q + \sum d_k*\alpha^k = B$$
4. The P and Q check disks fail; reconstruct them from the data disks.

1. If the second "seed" is selected as $\alpha=2$, that is, the degree 1 polynomial "$X$", then multiplying by $\alpha$ in GF($2^{64}$) consists of shifting the multiplicand left 1 bit, and then XORing a constant into the result if there was a carry out of the shift. The constant to XOR is the generating polynomial $G$ less its $X^{64}$ term, i.e. $G+X^{64}$. A multiplication by $\alpha^j$ (i.e. the degree $j$ polynomial "$X^j$") is a shift left of $j$ bits followed by an XOR of a value from a table indexed by the $j$ bits that were shifted off the high end; this table is created in much the same way that a "fast software CRC" table is created. This reduces the size of the table needed for multiplication from $2^{64}$ to $2^M$, where $M$ is the number of data disks in the RAID 6 array.
2. If a good generating polynomial is also selected, then multiplying by $X^j$ can be performed efficiently without table lookups at all for "reasonable" values of $j$. The properties of a good generating polynomial $G$ are:
3. Computing general inverses in GF($2^{64}$) is extremely difficult. However, such general computations need not be done because, in observing the equations for reconstructing $d_i$, there are only two types of polynomials being inverted: $X^j$ and $(X^j+X^k)$. If the number of data disks in the RAID 6 set is limited to $M$ (where, from the simplification in (2)a, $M \le 64-D$) then there are a total of $M(M+1)/2$, i.e. less than 2100, inverses to store in a table. This reduces each solution of the simultaneous equations (in the worst case, when two data disks fail) to $M$ multiplies by powers of 2, $2M$ XORs, and one general multiply by a polynomial modulo $G$ (which consists of 128 shift-test-XOR cycles). In light of the relative infrequency of double disk failures, this should be an acceptable compute penalty, and there are ways to trade off extra memory space (for larger inverse tables) for reduced compute time in the general multiply, as well.
4. For a given failure recovery operation on the RAID 6 array the polynomial to be inverted ($X^j$ or $(X^j+X^k)$) will be the same for every set of symbols to be recovered. Advantage can be taken of this by pre-processing the inverse obtained from the above-mentioned inverse table into a multiplication accelerator table to reduce the number of instructions executed in each general multiplication, by a process similar to that used to generate a "fast software CRC" table. For instance, the number of Alpha AXP™ instructions needed to perform a general 64-bit polynomial multiply in GF($2^{64}$) is reduced from approximately 320 for the "bit-by-bit" method to approximately 40 with a 1024-entry multiplication accelerator table.
5. Because of the limited number of Field elements whose inverses are needed, one need not be restricted to Galois Fields of polynomials. Rings of polynomials modulo a reducible generator $G$, in which some non-zero elements do not have inverses, may be considered as long as all elements of the form $X^j$ or $(X^j+X^k)$ do have inverses. This relaxes the constraint on $G$ from "$G$ must be irreducible" to "$G$ must be relatively prime to all polynomials of the form $X^j$ and $(X^j+X^k)$ for $j<k<64-D$", where $D$ is the degree of $G+X^{64}$. The set of polynomials formed by this $G$ is not a Galois Field, but it is a Useful Ring of order 64 over GF(2), or "UR($2^{64}$)" for short. Generator polynomials for UR($2^{64}$) exist of the form $G=X^{64}+X^D+1$. For those polynomials, $G+X^{64}$ has only two non-zero terms, as opposed to a minimum of five non-zero terms for the generator polynomials in section (2)b. The smallest such $D$ is $D=4$. Corresponding generator polynomials exist for all UR($2^N$) where $N$ is a power of 2 and $N \ge 16$.
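The shift-and-XOR multiplication by $X$ and by $X^j$ described in the items above can be sketched as follows. This is only an illustration under stated assumptions: it takes the generator $G=X^{64}+X^4+1$ (the $D=4$ form mentioned in the text), the function names are hypothetical, and the table-free fold-back in `mul_by_x_pow` is one way to realize the idea for $j \le 64-D$, not necessarily the exact procedure intended.

```python
MASK64 = (1 << 64) - 1
D = 4                    # assumption: G = X^64 + X^D + 1 with D = 4
G_LOW = (1 << D) | 1     # G + X^64, the constant XORed in after a carry

def mul_by_x(v):
    """Multiply a 64-bit polynomial by X modulo G: shift left, XOR on carry."""
    carry = v >> 63
    v = (v << 1) & MASK64
    return v ^ G_LOW if carry else v

def mul_by_x_pow(v, j):
    """Multiply by X^j modulo G for 0 <= j <= 64 - D, without a table."""
    high = v >> (64 - j)          # the j bits shifted off the high end
    low = (v << j) & MASK64
    # X^64 is congruent to X^D + 1 modulo G, so high * X^64 folds back in
    # as (high << D) XOR high; j <= 64 - D keeps the result within 64 bits.
    return low ^ (high << D) ^ high
```

Because $G+X^{64}$ has only two non-zero terms here, the fold-back costs two shifts and two XORs regardless of $j$.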
If every non-zero element $\delta$ in the Ring has a multiplicative inverse $\delta^{-1}$, such that $\delta*\delta^{-1}=1$, then the Ring is called a Field.
The integers, for example, form an infinite Ring under normal addition and multiplication, but not a Field, since (for example) the integer 2 has no multiplicative inverse. The integers do form something called a Euclidean Ring, which is a Ring with a unique "division with remainder" that can be performed on any two elements in the Ring. This fact allows one to construct finite rings of integers based on the remainders the integers leave with respect to some fixed integer; this is known as the Ring of integers Modulo $M$, where $M$ is the Modulus of the Ring. The elements of this Ring are the integers from 0 to $M-1$, and the operations of addition and multiplication in such a Ring are all done Modulo $M$. If $0<x<M$ and $x$ and $M$ are relatively prime (meaning they have no common remainderless divisor other than 1) then $x$ has a multiplicative inverse in the Ring of integers Modulo $M$. Therefore, if $M$ is a prime number, every non-zero element of the Ring of integers Modulo $M$ has a multiplicative inverse, and so the integers form a Field Modulo $M$. This field is known as the Galois Field Modulo $M$, or GF($M$). The smallest such Field is GF(2), and consists of the integers 0 and 1; addition in this Field looks like Boolean XOR, and multiplication looks like Boolean AND.
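A short sketch may make the inverse condition concrete. The helper name `inverse_mod` is hypothetical, and the example assumes Python 3.8 or later, where the built-in `pow` accepts a negative exponent together with a modulus.

```python
from math import gcd

def inverse_mod(x, m):
    """Return the multiplicative inverse of x modulo m, or None if there is none."""
    if gcd(x, m) != 1:
        return None           # x and m share a divisor, so no inverse exists
    return pow(x, -1, m)      # modular inverse (Python 3.8+)

print(inverse_mod(3, 7))      # 5, since 3 * 5 = 15 = 2 * 7 + 1
print(inverse_mod(2, 6))      # None: 2 and 6 share the divisor 2

# With a prime modulus every non-zero element is invertible.
print(all(inverse_mod(x, 7) is not None for x in range(1, 7)))   # True
```

In GF(2) the same arithmetic reduces to bit operations: `(a + b) % 2` equals `a ^ b` and `(a * b) % 2` equals `a & b` for bits `a` and `b`.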
Polynomials are functions of a variable $X$ of the form:
$$C_0 + C_1*X + C_2*X^2 + C_3*X^3 + \cdots + C_N*X^N$$
where $C_i$ represents the coefficients of the polynomial and $N$, the highest exponent of $X$ with a non-zero coefficient, is called the degree of the polynomial. Given any Field, the set of polynomials with coefficients in that Field forms an infinite Ring under the operations of polynomial addition and polynomial multiplication with the coefficients combined according to the rules of the underlying Field. Furthermore, this Ring is Euclidean, that is, two polynomials can be divided to obtain a unique quotient and remainder. Therefore, the set of polynomials Modulo some polynomial $G$ forms a finite Ring, and $G$ is known as the Generator polynomial of that Ring. If $G$ is an irreducible polynomial of degree $N$ over an underlying Galois Field GF($M$) (i.e. it cannot be represented as a product of smaller polynomials with coefficients in that Field) then every non-zero element of the Ring of polynomials generated by $G$ has a multiplicative inverse, and those polynomials form a Galois Field known as GF($M^N$).
The arithmetic used in all current RAID 6 hardware implementations takes place in GF($2^N$). This is the field of polynomials with coefficients in GF(2), modulo some generator polynomial of degree $N$. All the polynomials in this field are of degree $N-1$ or less, and their coefficients are all either 0 or 1, which means they can be represented by a vector of $N$ coefficients all in {0,1}; that is, these polynomials "look" just like $N$-bit binary numbers. Polynomial addition in this Field is simply $N$-bit XOR, which has the property that every element of the Field is its own additive inverse, so addition and subtraction are the same operation. Polynomial multiplication in this Field, however, is more complicated as it involves performing an "ordinary" polynomial multiplication, then dividing the result by the Generator polynomial to get the remainder. Despite the computational difficulties involved, this is the simplest multiplication operation whose range and domain cover the entire set of $N$-bit binary numbers and which distributes with an $N$-bit addition operation. Software implementations generally find it easier to implement multiplication by a table lookup method analogous to logarithms.
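The two-step multiplication described above (an "ordinary" polynomial multiply followed by division by the Generator polynomial) can be sketched as below. The function name and the choice of $X^4+X+1$ as an example generator for GF($2^4$) are assumptions made for illustration only.

```python
def gf_mul(a, b, gen, n):
    """Multiply a and b in GF(2^n); gen is the generator including its X^n term."""
    # Step 1: "ordinary" polynomial multiplication over GF(2) by shift and XOR.
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        b >>= 1
    # Step 2: divide by the generator and keep the remainder.
    for bit in range(2 * n - 2, n - 1, -1):
        if product & (1 << bit):
            product ^= gen << (bit - n)
    return product

GEN4 = 0b10011                     # assumed example generator X^4 + X + 1
print(gf_mul(2, 8, GEN4, 4))       # 3: X * X^3 = X^4, which reduces to X + 1
print(gf_mul(3, 7, GEN4, 4))       # 9: (X + 1)(X^2 + X + 1) = X^3 + 1
```

Addition needs no helper, since it is plain XOR of the two $N$-bit values.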
Calculating multiplicative inverses in GF($2^N$) is extremely difficult, and in general cannot be expressed algorithmically; instead, table lookup techniques based on logarithms must be used.
RAID 6 check codes are presently generated in hardware by taking a set $\{d_k\}$ of $N$-bit data symbols, one from each data disk, and deriving a set of check symbols $\{s_i\}$ in GF($2^N$) of the form:
$$s_i = \sum d_k*(\sigma_i)^k \quad (k = 0 \text{ to Numdisks}-1)$$
where $\sigma_i$ is a unique "seed" that generates the $i$-th check code. One of the seeds is generally set to the value 1 (i.e. the degree 0 polynomial "1", the multiplicative identity of GF($2^N$)) so that its corresponding check code is equal to $\sum d_k$, or the simple XOR of the data disks; that is, RAID 5 "parity" is a special case of these calculations.
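As a rough illustration of the check symbol formula, the sketch below computes one check symbol over a handful of 4-bit data symbols. It assumes GF($2^4$) with generator $X^4+X+1$; the names `gf_mul` and `check_symbol` and the sample data are hypothetical.

```python
GEN, N = 0b10011, 4    # assumption: GF(2^4) with generator X^4 + X + 1

def gf_mul(a, b):
    """Multiply in GF(2^N) by shift and XOR, reducing modulo GEN as we go."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> N:
            a ^= GEN
    return r

def check_symbol(data, seed):
    """Return the sum over k of data[k] * seed^k, with XOR as addition."""
    s, power = 0, 1                  # power holds seed^k, starting at seed^0
    for d in data:
        s ^= gf_mul(d, power)
        power = gf_mul(power, seed)
    return s

data = [0x3, 0xA, 0x7, 0xC]          # one symbol from each of four data disks
p = check_symbol(data, 1)            # seed 1: the plain XOR of the data symbols
q = check_symbol(data, 2)            # seed 2, i.e. the polynomial X
```

With seed 1 every power is 1, so `check_symbol` degenerates to the XOR parity of RAID 5.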
Each RAID 6 check code expresses an invariant relationship, or equation, between the data on the data disks of the RAID 6 array and the data on one of the check disks. If there are $C$ check codes and a set of $F$ disks fail, $F \le C$, the failed disks can be reconstructed by selecting $F$ of these equations and solving them simultaneously in GF($2^N$) for the $F$ missing variables. In RAID 6 systems implemented or contemplated today there are only 2 check disks (check disk P, with "seed" 1, and check disk Q, with some "seed" $\alpha$) and four double-failure cases:
In case 1 (data disk $j$ and the Q check disk fail), the sum in the restoring formula does not include the $j$-th data disk; once $d_j$ is restored, Q is reconstructed. This is just like a RAID 5 reconstruction, plus reconstructing Q.
In case 2 (data disk $j$ and the P check disk fail), the sum likewise does not include the $j$-th data disk; once $d_j$ is restored, P is reconstructed.
In case 3 (data disks $i$ and $j$ fail), the sums in the two simultaneous equations do not include the $i$-th and $j$-th data disks; solving the equations for $d_i$ and $d_j$ gives:
$$d_i = (A*\alpha^j + B)*(\alpha^i + \alpha^j)^{-1}$$
$$d_j = (A*\alpha^i + B)*(\alpha^i + \alpha^j)^{-1} = A + d_i$$
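The two-data-disk recovery can be sketched as follows, again assuming GF($2^4$) with generator $X^4+X+1$ and $\alpha=2$. All function names are hypothetical, and the inverse is found by brute-force search, which is only reasonable for a field this small.

```python
GEN, N = 0b10011, 4    # assumption: GF(2^4) with generator X^4 + X + 1

def gf_mul(a, b):
    """Multiply in GF(2^N) by shift and XOR, reducing modulo GEN as we go."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> N:
            a ^= GEN
    return r

def gf_pow(a, k):
    r = 1
    for _ in range(k):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    """Brute-force inverse; acceptable only because the field has 16 elements."""
    return next(x for x in range(1, 1 << N) if gf_mul(a, x) == 1)

def recover_two(data, i, j, p, q, alpha=2):
    """Recover d_i and d_j; entries i and j of data are ignored."""
    a, b = p, q
    for k, d in enumerate(data):
        if k not in (i, j):
            a ^= d                              # A = P + sum of surviving d_k
            b ^= gf_mul(d, gf_pow(alpha, k))    # B = Q + sum of d_k * alpha^k
    ai, aj = gf_pow(alpha, i), gf_pow(alpha, j)
    d_i = gf_mul(gf_mul(a, aj) ^ b, gf_inv(ai ^ aj))
    return d_i, a ^ d_i                         # d_j = A + d_i

# Data symbols 0x3, 0xA, 0x7, 0xC give P = 0x2 and Q = 0x1; disks 1 and 3 are lost.
print(recover_two([0x3, None, 0x7, None], 1, 3, p=0x2, q=0x1))   # (10, 12)
```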
Once initialized, RAID check codes are generally computed incrementally; in the RAID 6 case discussed above, if data on one disk of the RAID 6 set is being updated then the difference $\Delta$ of the old data and new data on that disk is computed and the P and Q check codes are updated as follows:
$$P = P + \Delta$$
$$Q = Q + \Delta*\alpha^k$$
where $k$ is the index of the disk being updated, i.e. its position within the RAID 6 set.
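A minimal sketch of the incremental update, assuming GF($2^4$) with generator $X^4+X+1$ and $\alpha=2$; the function names are hypothetical.

```python
GEN, N = 0b10011, 4    # assumption: GF(2^4) with generator X^4 + X + 1

def gf_mul(a, b):
    """Multiply in GF(2^N) by shift and XOR, reducing modulo GEN as we go."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> N:
            a ^= GEN
    return r

def gf_pow(a, k):
    r = 1
    for _ in range(k):
        r = gf_mul(r, a)
    return r

def update_checks(p, q, old, new, k, alpha=2):
    """Apply the difference between old and new data on disk k to P and Q."""
    delta = old ^ new                    # subtraction is XOR in GF(2^N)
    return p ^ delta, q ^ gf_mul(delta, gf_pow(alpha, k))
```

Only the old data, the old P and the old Q are needed; the other data disks are not read, which is why each write touches three drives.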
Hardware RAID 6 implementations generally pick $N$, the degree of the generating polynomial of the Galois Field, as small as possible, in order to perform the difficult GF($2^N$) arithmetic operations (multiply and inverse) via table lookups. This also minimizes the width of the specialized hardware data paths that perform the arithmetic in GF($2^N$). The minimum value for $N$ is a function of the maximum number of data disks desired in the RAID 6 array; that is, $N$ must be large enough so that $\alpha^k$ has a distinct value for every data disk index $k$, else for some pair of data disks $i$ and $j$, $\alpha^i+\alpha^j=0$ and the simultaneous equations of the previous section won't be solvable when those two disks fail. With proper choices of the generating polynomial (a "primitive" polynomial) and of $\alpha$, a GF($2^N$) implementation can support a RAID 6 array with $2^N-1$ data disks.
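The distinctness requirement on $\alpha^k$ can be checked directly for small fields. The sketch below compares two degree-4 generators with $\alpha=2$: $X^4+X+1$, which is primitive, and $X^4+X^3+X^2+X+1$, which is irreducible but not primitive. Both generators and the function name are choices made for this illustration.

```python
def powers_of_x(gen, n, count):
    """Return X^0, X^1, ..., X^(count-1) modulo gen as n-bit integers."""
    out, v = [], 1
    for _ in range(count):
        out.append(v)
        v <<= 1                  # multiply by X
        if v >> n:
            v ^= gen             # reduce modulo the generator
    return out

primitive = powers_of_x(0b10011, 4, 15)       # X^4 + X + 1
non_primitive = powers_of_x(0b11111, 4, 15)   # X^4 + X^3 + X^2 + X + 1

print(len(set(primitive)))       # 15: every disk index gets a distinct alpha^k
print(len(set(non_primitive)))   # 5: alpha^5 equals alpha^0, so disks 0 and 5 collide
```

With the second generator, a failure of data disks 0 and 5 would leave $\alpha^0+\alpha^5=0$, which has no inverse.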
However, while a small value of $N$ is a good choice for hardware implementations of RAID 6, it can cause severe computational problems with software implementations of RAID 6. For example, a RAID 6 array with $N=4$ (a common hardware implementation allowing up to 15 data disks) would require software to perform 1024 multiplies and adds in GF($2^4$) plus the ancillary bit field unpacking and packing operations per 512-byte sector written, just to update the Q check disk. It is this operation of updating the Q check disk during normal writes that is by far the dominant computational cost of a software RAID 6 implementation, although it is also important to keep the computational cost of data reconstruction manageable.