This invention relates to computer systems and, more particularly, to computer systems with means to enhance reliability of the storage medium that the computer system uses.
RAID (Redundant Array of Independent Drives) is a technique for creating what appears to be a single logical storage device out of an array of physical hard drives, such as drives 11 in FIG. 1. The most common RAID technique is RAID level 5, in which a set of n separate physical hard drives is coalesced into a single memory array. The array is divided into stripes, as shown for example in FIG. 1, with a portion of each stripe (a “strip”) stored on each of the different physical drives in the set. Thus, in each stripe, n-1 of the strips hold data and the nth strip holds parity information. The simplest way to visualize this arrangement is to think of one of the n hard drives as holding all of the parity information. Many implementations, however, change the hard drive that holds the parity data from stripe to stripe.
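By way of illustration only, the rotation of the parity strip from stripe to stripe can be sketched as follows. The function names parity_drive and stripe_layout, and the particular rotation rule used, are assumptions made for this sketch; actual implementations may rotate parity in a different order.

```python
def parity_drive(stripe, n):
    """Index of the drive holding parity for a stripe (one hypothetical rotation rule)."""
    return (n - 1 - stripe) % n


def stripe_layout(stripe, n):
    """Return a list of n labels, one per drive: 'P' for parity, 'D0', 'D1', ... for data."""
    p = parity_drive(stripe, n)
    layout = []
    next_data = 0
    for drive in range(n):
        if drive == p:
            layout.append("P")
        else:
            layout.append(f"D{next_data}")
            next_data += 1
    return layout


# Print the layout of the first four stripes of a 4-drive array.
for stripe in range(4):
    print(stripe, stripe_layout(stripe, 4))
```

In this sketch each stripe still holds n-1 data strips and one parity strip; only the drive that holds the parity strip changes.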
When one of the hard drives that holds data is known to have failed to output proper data, the missing data can be reconstituted from the parity data. That is, when respective disk controllers 12 are able to report to array controller 13 that an error condition exists, controller 13 can recover the missing data. This allows the calling process to continue working, unaware that a problem was discovered, while maintenance takes place on the failed drive. When data of a known drive is known to be wrong (and typically missing) the error is said to be an erasure error. As is well known, however, a single parity allows only one error to be detected (under the assumption that the probability of any other odd number of errors occurring concurrently is essentially zero), and thus, when the location of the error is known (as in the case of erasure errors), the detected error can be corrected.
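The recovery of an erasure error from a single parity strip can be illustrated with a minimal sketch, assuming that the parity strip is the byte-wise XOR of the data strips. The helper names xor_strips and recover_erasure and the sample byte values are hypothetical and serve only as an example.

```python
from functools import reduce


def xor_strips(strips):
    """Byte-wise XOR of a list of equal-length strips (bytes objects)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


def recover_erasure(surviving_strips, parity):
    """Rebuild the one missing data strip from the surviving strips and the parity strip."""
    return xor_strips(list(surviving_strips) + [parity])


# Three data strips of two bytes each (sample values only).
data = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]
parity = xor_strips(data)

# Suppose the drive holding strip 1 is reported as failed (an erasure).
lost = 1
survivors = [s for i, s in enumerate(data) if i != lost]
recovered = recover_erasure(survivors, parity)
```

The recovery works only because the position of the failed strip is known; the parity alone does not indicate which strip is wrong.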
RAID level 6, which has recently been gaining in usage, employs two (or more) strips per stripe to hold redundant data, as illustrated in FIG. 2 for a system that employs exactly two strips per stripe to hold redundant data. The aim of such a RAID 6 system is to protect the array against two concurrent error conditions.
In both RAID 5 and RAID 6 systems the redundant data can be viewed as degenerate forms of a Reed-Solomon error-correcting code, based, for example, on Galois field $GF(2^8)$. The first redundant data strip (which applies to both RAID 5 and RAID 6) holds the syndrome
$$P = D_0 + D_1 + \cdots + D_{n-1} \qquad (1)$$
and the second redundant data strip (RAID 6) holds the syndrome
$$Q = g^0 \cdot D_0 + g^1 \cdot D_1 + \cdots + g^{n-1} \cdot D_{n-1} \qquad (2)$$
where the polynomial g is a generator of the field, the “·” is multiplication over the field (which is NOT the normal multiplication), and the “+” designates the XOR operation. In $GF(2^8)$ there are 256 polynomial ($g^i$) coefficients, running from 0 to 255 and, therefore, equation (2) can handle 256 $D_i$ elements. If each $D_i$ element corresponds to the data of a strip, then operating in $GF(2^8)$ allows use of 256 data strips. Adding a strip for the P syndrome and a strip for the Q syndrome results in a maximum array of 258 hard drives, each of which stores/outputs 8-bit bytes.
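As a non-limiting illustration of equations (1) and (2), the following sketch computes the P and Q syndromes for one byte position across the data strips. It assumes the field-reduction polynomial 0x11D and the generator g = 2, a pairing commonly used for RAID 6 but not specified in the text above; the function names gf_mul, gf_pow and syndromes are hypothetical.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two elements of GF(2^8): carry-less multiply, reduced modulo poly."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # field "addition" is XOR
        a <<= 1
        if a & 0x100:        # degree reached 8: reduce by the field polynomial
            a ^= poly
        b >>= 1
    return result


def gf_pow(base, exp):
    """Raise a field element to a non-negative integer power by repeated multiplication."""
    result = 1
    for _ in range(exp):
        result = gf_mul(result, base)
    return result


def syndromes(data_bytes, g=2):
    """Return (P, Q) for one byte taken from each data strip D_0 .. D_{n-1}."""
    p = 0
    q = 0
    for i, d in enumerate(data_bytes):
        p ^= d                          # equation (1)
        q ^= gf_mul(gf_pow(g, i), d)    # equation (2)
    return p, q


p, q = syndromes([0x01, 0x01, 0x01])
```

A practical implementation would typically replace gf_pow with precomputed logarithm and antilogarithm tables; repeated multiplication is used here only for clarity.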
There is another type of error for which current RAID techniques do not compensate, and that is the undetected read error. This occurs when, for a variety of reasons, controllers 12 fail to report a read error and thus, without any alert, provide the wrong value for a read request. Such events are exceedingly rare (a bit error rate of 1 in $10^{17}$ or less) and are thus usually ignored, because a typical consumer desktop hard drive may go several years without a single such error.
However, the situation for a large RAID array experiencing continual usage is quite different. An array of 20 drives that runs in a 24×7 environment can read as many as $3 \times 10^{17}$ bits/year, and can thus experience multiple undetected read errors per year. Each is potentially a catastrophic event, because it may result in the altering of a mission-critical value; for example, a bank account balance, a missile launch code, etc. The silent nature of the error means that it cannot be trapped, and thus no corrective action can be taken by software or manual means.
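The arithmetic behind this estimate can be checked with a short calculation, offered only as a sketch. It uses the figures given above (20 drives, 3 × 10^17 bits read per year, and an undetected bit error rate of 1 in 10^17, taken as the upper bound); the variable names are hypothetical, and the per-drive read rate is a derived figure that does not appear in the text.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # continuous 24x7 operation

drives = 20
bits_read_per_year = 3e17            # array-wide figure from the text
undetected_ber = 1e-17               # 1 undetected error per 10^17 bits (upper bound)

# Expected number of undetected read errors across the array in one year.
expected_errors_per_year = bits_read_per_year * undetected_ber

# Sustained read rate each drive must deliver to reach the yearly total.
per_drive_bytes_per_second = bits_read_per_year / drives / SECONDS_PER_YEAR / 8

print(f"expected undetected errors per year: {expected_errors_per_year:.1f}")
print(f"implied read rate per drive: {per_drive_bytes_per_second / 1e6:.1f} MB/s")
```

Under these assumptions the array would see on the order of three undetected read errors per year, at a sustained read rate of roughly 60 MB/s per drive.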
Clearly, at least in some applications, it is desirable to have a means for detecting and correcting unreported errors.