In the field of data storage, data recovery can be accomplished by encoding stored data into parity data and using the parity data to recover the stored data should some of the stored data be lost. FIG. 1 shows an arrangement for an (n,k) MDS (maximum distance separable) code. A storage controller 100 encodes data disks 102 into parity disks 104. The storage controller 100 may implement any of a variety of (n,k) type MDS codes, where k is the number of data nodes (e.g., data disks 102), n is the total number of nodes (parity nodes and data nodes), and n−k is the number of parity nodes. In the example of FIG. 1, n=5 and k=2. When a page of data 106 is requested to be read, the storage controller 100 may obtain corresponding data from one or more data disks 102. When a page of data 106 is requested to be written, the storage controller 100 both writes data to one or more data disks 102 and computes parity data that is written to the parity disks 104. The parity data may be computed from both the page of data 106 and data already stored in the data disks 102. Significantly, if a data disk 102 fails, one or more of the remaining data disks 102 and parity disks 104 are used to recover the lost data that was on the failed data disk 102.
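The encode-then-recover cycle described above can be illustrated with a minimal sketch. It assumes the simplest possible case, a (3,2) single-parity code built on bytewise XOR, which is not necessarily the code of FIG. 1; the function names `encode_parity` and `recover_unit` are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: a (3,2) single-parity XOR code, assumed for
# simplicity. The names below are hypothetical, not taken from the source.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode_parity(data_units: list) -> bytes:
    """Compute one parity unit as the XOR of all data units."""
    parity = bytes(len(data_units[0]))  # all-zero starting value
    for unit in data_units:
        parity = xor_bytes(parity, unit)
    return parity

def recover_unit(surviving_units: list, parity: bytes) -> bytes:
    """Rebuild the single lost data unit from the survivors and the parity."""
    lost = parity
    for unit in surviving_units:
        lost = xor_bytes(lost, unit)
    return lost

disk_a = b"page-one"
disk_b = b"page-two"
parity = encode_parity([disk_a, disk_b])

# Simulate losing disk_a and rebuilding it from disk_b and the parity unit.
recovered = recover_unit([disk_b], parity)
```

With a single parity unit, only one failure can be tolerated; codes with n−k greater than one use more parity units to tolerate more failures.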
FIG. 2 shows a generic (4,2) coding arrangement. The same arrangement might be used whether the code is a Repetition code, an MDS code, or another type of code. Storage units are maintained, namely, data disk 102 units and parity disk 104 units. Depending on which coding scheme is used, one or more units are read to reconstruct a lost data unit. If a (4,2) Repetition code is used, only one other disk (a replica of the lost disk) needs to be read to reconstruct a lost disk. With a (4,2) MDS code, any two of the remaining units are read to reconstruct the lost data. Generally, for any (n,k) MDS code, k units need to be read to reconstruct a single lost disk. Many codes are designed to handle failure of multiple storage units. However, the overhead necessary to handle multiple concurrent failures, which may be rare, can both increase the total amount of parity data needed and add to the cost of reconstruction when only a single storage unit fails.
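The property that any k units suffice for reconstruction under an MDS code can be sketched for the (4,2) case. The example below is an assumption made for illustration: it uses arithmetic modulo the prime 257 and a hypothetical generator `G` in which every pair of rows is linearly independent. It is not the specific code of FIG. 2, and the names `encode`, `decode`, and `repair` are invented here.

```python
# Illustrative sketch only: a (4,2) MDS-style code over the integers mod 257.
# The field, the generator rows, and all function names are assumptions.

P = 257  # prime modulus, so every nonzero value has a multiplicative inverse

# Row i gives the coefficients (a, b) such that unit_i = a*d0 + b*d1 mod P.
# Rows 0 and 1 are the data units; rows 2 and 3 are the parity units.
G = [(1, 0), (0, 1), (1, 1), (1, 2)]

def encode(d0: int, d1: int) -> list:
    """Encode two data symbols into four stored units."""
    return [(a * d0 + b * d1) % P for a, b in G]

def decode(i: int, ui: int, j: int, uj: int) -> tuple:
    """Recover (d0, d1) from any two distinct units i and j (Cramer's rule)."""
    (a, b), (c, d) = G[i], G[j]
    det = (a * d - b * c) % P
    inv = pow(det, P - 2, P)  # modular inverse via Fermat's little theorem
    d0 = ((d * ui - b * uj) * inv) % P
    d1 = ((a * uj - c * ui) * inv) % P
    return d0, d1

def repair(lost: int, i: int, ui: int, j: int, uj: int) -> int:
    """Rebuild the lost unit by reading k = 2 surviving units."""
    d0, d1 = decode(i, ui, j, uj)
    return encode(d0, d1)[lost]

units = encode(42, 200)

# Simulate losing data unit 0 and rebuilding it from the two parity units.
rebuilt = repair(0, 2, units[2], 3, units[3])
```

In this sketch, repairing a single lost unit always reads two full units, which is the reconstruction cost that the passage above attributes to (n,k) MDS codes in general.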
Described below are coding techniques that are efficient when it is assumed that one data node or storage unit fails. Described also is a technique for finding an optimal repair strategy for any given code.