The importance of error correction coding of data in digital computer systems has increased greatly as the density of the data recorded on a mass storage medium, more particularly a magnetic disk or tape, has increased. With higher recording densities, a tiny imperfection in the medium can corrupt a large amount of data. In order to avoid losing that data, error correction codes ("ECC's") are employed to, as the name implies, correct the erroneous data.
Before a string of data symbols is recorded, it is mathematically encoded to form redundancy symbols. The redundancy symbols are then appended to the data string to form code words--data symbols plus redundancy symbols. The code words are then stored on the storage medium. When the stored data is to be accessed, the code words containing the data symbols are retrieved from the storage medium and mathematically decoded. During decoding any errors in the data are detected and, if possible, corrected through manipulation of the redundancy symbols [For a detailed description of decoding see Peterson and Weldon, Error-Correcting Codes, 2d Edition, MIT Press, 1972].
In order to further protect the stored data from loss due, for example, to the corruption of a relatively large section of the storage medium or a complete shutdown, or failure, of the storage device, some data storage systems store a copy of the data on a redundant data drive. The storage devices are herein referred to collectively as drives. If a particular data drive in the system becomes corrupted or fails, such that most or all of the data stored thereon cannot be retrieved, the system furnishes to a requesting user the copy of the data which is stored on the redundant drive. The system is thus extremely robust, as it loses data only when both a data drive and its associated redundant drive fail. In order to achieve such robustness, however, the system requires, for a given amount of data, twice the data storage capacity of a non-redundant system.
In order to reduce the storage overhead, from one redundant drive for every data drive to one redundant drive for every "n" data drives, a known system exclusive-OR's the corresponding symbols of the data code words stored on n data drives and records the resulting "XOR-symbols" on an associated redundant drive. If one of the data drives fails, the system regenerates the data symbols which were stored thereon by exclusive-OR'ing the symbols recorded on the other n-1 data drives with the XOR-symbols stored on the redundant drive. The system thus loses data only when two or more of the n+1 drives simultaneously fail. Such a system is relatively robust, but not as robust as the duplicate system described above.
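The XOR scheme described above can be sketched in a few lines. This is a minimal illustration, assuming three data drives that each hold a single three-symbol code word; the byte values and the names `xor_symbols` and `regenerate` are hypothetical and do not come from the system described.

```python
from functools import reduce

def xor_symbols(blocks):
    """XOR the corresponding symbols (bytes) of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical contents of n = 3 data drives, one code word each.
data_drives = [b"\x12\x34\x56", b"\xab\xcd\xef", b"\x0f\xf0\x55"]

# The associated redundant drive holds the resulting XOR-symbols.
redundant_drive = xor_symbols(data_drives)

def regenerate(failed_index, data_drives, redundant_drive):
    """Rebuild one failed data drive from the n-1 survivors plus the XOR-symbols."""
    survivors = [d for i, d in enumerate(data_drives) if i != failed_index]
    return xor_symbols(survivors + [redundant_drive])

# Pretend drive 1 failed and rebuild its contents.
rebuilt = regenerate(1, data_drives, redundant_drive)
```

Because every symbol XOR'ed with itself cancels to zero, XOR'ing the survivors with the XOR-symbols leaves only the missing drive's symbols. With two drives missing, a single set of XOR-symbols no longer determines either one.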
Further, the system trades storage overhead for speed. As part of every write operation to a designated drive, the system must (i) retrieve the corresponding symbols from the other data drives, (ii) exclusive-OR the retrieved data symbols with the corresponding new data symbols to be recorded, and (iii) record on the associated redundant drive the resulting XOR-symbols. The system must also record the new symbols on the designated data drive. Accordingly, each write operation involves three extra, relatively time-consuming steps.
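The cost of steps (i) through (iii) can be made concrete with a toy model that counts drive accesses. This is only a sketch under stated assumptions: the class `XorArray`, its in-memory "drives", and the read/write counters are hypothetical stand-ins for real drive I/O.

```python
class XorArray:
    """Toy model: n data drives plus one redundant drive, one code word per sector."""

    def __init__(self, n, sectors, word_len):
        self.data = [[bytes(word_len)] * sectors for _ in range(n)]
        self.redundant = [bytes(word_len)] * sectors
        self.reads = 0   # drive read operations performed
        self.writes = 0  # drive write operations performed

    def write(self, drive, sector, new_word):
        # (i) retrieve the corresponding code words from the other data drives
        others = []
        for d, contents in enumerate(self.data):
            if d != drive:
                others.append(contents[sector])
                self.reads += 1
        # (ii) exclusive-OR the retrieved symbols with the new symbols
        xor_word = bytearray(new_word)
        for word in others:
            for i, symbol in enumerate(word):
                xor_word[i] ^= symbol
        # (iii) record the resulting XOR-symbols on the redundant drive
        self.redundant[sector] = bytes(xor_word)
        self.writes += 1
        # finally, record the new symbols on the designated data drive
        self.data[drive][sector] = bytes(new_word)
        self.writes += 1

array = XorArray(n=4, sectors=2, word_len=3)
array.write(drive=2, sector=0, new_word=b"\x01\x02\x03")
array.write(drive=0, sector=0, new_word=b"\x10\x20\x30")
```

In this model each write to one data drive costs n-1 reads and two writes, so the two writes above perform six reads and four writes in total.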
A scheme for reducing the number of required redundant drives while retaining the robustness of the duplicate system has been proposed. The system encodes the stored data using an (n,k) distance "D" Reed-Solomon code, and thus, protects the data stored on k data drives using n-k redundant drives. Basically, before the system records a data code word in a designated sector on a particular data drive, the system first retrieves the data code words stored in the same sector location on each of the other data drives. Next, the system encodes, using the Reed-Solomon code, the corresponding symbols from each of the k-1 retrieved code words and the code word to be recorded and generates for each set of k symbols n-k redundancy symbols. The system thus encodes the first symbols from each of these code words and generates a first set of n-k redundancy symbols, and next, encodes the second symbols from each of the code words and generates a second set of n-k redundancy symbols, and so forth. Each set of n-k redundancy symbols and the associated k encoded symbols together form a code word in the Reed-Solomon code.
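The relationship between the code parameters and the storage overhead can be written out as follows. The identity D = n - k + 1 is not stated in the text above; it is a standard property of Reed-Solomon codes, which are maximum distance separable. The values k = 10 and n = 12 are illustrative assumptions only.

```latex
\begin{align*}
D &= n - k + 1
  && \text{(minimum distance of an $(n,k)$ Reed-Solomon code)} \\
n - k &= D - 1
  && \text{(number of redundant drives)} \\
\text{storage overhead} &= \frac{n-k}{k}
  && \text{(redundant drives per data drive)} \\[4pt]
k = 10,\ n = 12:\quad D &= 3, \qquad \frac{n-k}{k} = \frac{2}{10} = 20\%
\end{align*}
```

By comparison, the duplicate system has an overhead of 100%, and an XOR system protecting ten data drives with one redundant drive has an overhead of 10%.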
The system next records the various sets of n-k redundancy symbols on the n-k redundant drives, by recording one symbol from each set, that is, one symbol from each Reed-Solomon code word, on each drive. The first of the n-k redundant drives thus has stored on it the (k+1)st symbol from each Reed-Solomon code word, the second redundant drive has stored on it the (k+2)nd symbol from each Reed-Solomon code word, and so forth.
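The placement of symbols on drives amounts to a transposition: the Reed-Solomon code words run across the drives, while each drive holds one symbol position of every code word. The sketch below uses placeholder labels rather than real symbols; the values k = 3 and n = 5, the label format, and the name `stripe_across_drives` are assumptions made for illustration.

```python
def stripe_across_drives(rs_code_words):
    """Transpose: drive j receives symbol j of every Reed-Solomon code word."""
    return [list(column) for column in zip(*rs_code_words)]

k, n = 3, 5  # hypothetical: 3 data drives and n - k = 2 redundant drives

# Placeholder labels stand in for real symbols: d = data, r = redundancy.
# "d1[2]" means symbol position 2 of the code word stored on data drive 1.
rs_code_words = [
    [f"d{j}[{i}]" for j in range(k)] + [f"r{j}[{i}]" for j in range(n - k)]
    for i in range(4)  # four symbol positions within the sector
]

drives = stripe_across_drives(rs_code_words)
# drives[0..k-1] are the data drives, drives[k..n-1] the redundant drives.
```

Here `drives[k]`, the first redundant drive, ends up holding the (k+1)st symbol of each of the four Reed-Solomon code words.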
If a drive thereafter fails, the system can regenerate the data stored thereon using the associated Reed-Solomon code words. The system retrieves from each of the other drives the data and redundancy symbols which form the associated Reed-Solomon code words. The system then manipulates these code words, using conventional error correcting techniques, and basically fills in the otherwise lost symbols. Such a system can thus regenerate lost, or irretrievable, symbols even if up to D-1 drives fail simultaneously. Accordingly, the system is extremely robust.
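The regeneration of lost symbols can be illustrated with a small erasure-decoding sketch. It is not the decoder of the system described: it assumes symbols in GF(2^8) with the field polynomial 0x11D, builds the code words by polynomial evaluation at the drive indices, and fills in missing symbols by Lagrange interpolation instead of a conventional syndrome-based decoder. All function names and sample values are hypothetical.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11D)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result

def gf_inv(a):
    """Inverse computed as a^254, since a^255 = 1 for every nonzero a."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result

def interpolate(known, x):
    """Evaluate at x the polynomial passing through the (xi, yi) pairs in known."""
    total = 0
    for xi, yi in known:
        num, den = 1, 1
        for xj, _ in known:
            if xj != xi:
                num = gf_mul(num, x ^ xj)    # subtraction is XOR in GF(2^8)
                den = gf_mul(den, xi ^ xj)
        total ^= gf_mul(yi, gf_mul(num, gf_inv(den)))
    return total

def encode(data_symbols, n):
    """Systematic code word: k data symbols followed by n-k redundancy symbols."""
    known = list(enumerate(data_symbols))  # drive index doubles as evaluation point
    k = len(data_symbols)
    return list(data_symbols) + [interpolate(known, x) for x in range(k, n)]

def recover(code_word, failed, k):
    """Fill in the symbols of failed drives from any k surviving symbols."""
    survivors = [(x, y) for x, y in enumerate(code_word) if x not in failed][:k]
    if len(survivors) < k:
        raise ValueError("more than n-k drives failed")
    return [interpolate(survivors, x) if x in failed else y
            for x, y in enumerate(code_word)]

# One symbol position across k = 4 data drives, protected by n - k = 2 redundant drives.
k, n = 4, 6
code_word = encode([0x12, 0x34, 0x56, 0x78], n)

# Drives 1 and 4 fail; their symbols are unavailable.
damaged = [None if x in (1, 4) else y for x, y in enumerate(code_word)]
restored = recover(damaged, failed={1, 4}, k=k)
```

Any k surviving symbols determine the degree-below-k polynomial uniquely, which is why up to n-k = D-1 failed drives can be tolerated in this construction. The same routine would be run once per symbol position in the sector.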
This system, however, has the same drawback as the XOR-system described above, namely, each write operation includes the steps of (i) retrieving corresponding symbols from each of the other data drives, (ii) encoding the retrieved symbols and the symbols to be recorded to generate redundancy symbols, and (iii) recording the generated redundancy symbols on the various redundant drives. Each write operation thus includes the same three extra steps.
The length of time the system takes to write data to a drive limits the speed with which data can be transferred to the system. A system which is robust, has low storage overhead and can perform write operations relatively quickly is desirable.