Erasure coding has long been used to add redundancy to stored data and to facilitate recovery of data after disk drive or other failures that could otherwise lead to data loss. In a typical erasure coding scheme, a set of data, such as a file, is stored in the form of N fragments. Owing to the redundancy built into the fragments, only K of the N fragments are needed to recover the original set of data completely and without errors. Up to N−K fragments can therefore be damaged and the set of data can still be recovered, as long as any K fragments remain. In some examples, K fragments store the original set of data and the remaining N−K fragments store parity information. In other examples, each fragment includes data and/or parity derived from at least one other fragment. Regardless of implementation, erasure coding schemes permit all of the original set of data to be recovered from any K of the N fragments originally stored.
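To illustrate the K-of-N property, the following minimal Python sketch implements a Reed-Solomon-style code by polynomial evaluation over the prime field GF(257). The field choice is an assumption made for simplicity; practical systems typically work in GF(2^8). Each group of K data bytes is treated as the coefficients of a polynomial, fragment x stores that polynomial evaluated at x = 1..N, and Lagrange interpolation over any K fragments recovers the coefficients. The names encode, decode, and interpolate are hypothetical, and this non-systematic layout (no fragment holds the raw data) is only one of the arrangements described above.

```python
from __future__ import annotations

P = 257  # prime field modulus (assumption): every byte value 0..255 is a field element


def _evaluate(coeffs: list[int], x: int) -> int:
    """Evaluate a polynomial (lowest-degree coefficient first) at x, mod P, via Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc


def _mul_linear(poly: list[int], root: int) -> list[int]:
    """Multiply poly by (x - root), mod P."""
    out = [0] * (len(poly) + 1)
    for d, c in enumerate(poly):
        out[d] = (out[d] - root * c) % P
        out[d + 1] = (out[d + 1] + c) % P
    return out


def interpolate(xs: list[int], ys: list[int]) -> list[int]:
    """Recover the unique polynomial of degree < len(xs) passing through (xs[i], ys[i])."""
    k = len(xs)
    coeffs = [0] * k
    for i in range(k):
        basis, denom = [1], 1
        for j in range(k):
            if j != i:
                basis = _mul_linear(basis, xs[j])      # numerator: prod (x - xs[j])
                denom = denom * (xs[i] - xs[j]) % P    # denominator: prod (xs[i] - xs[j])
        scale = ys[i] * pow(denom, P - 2, P) % P       # divide via Fermat inverse
        for d in range(k):
            coeffs[d] = (coeffs[d] + scale * basis[d]) % P
    return coeffs


def encode(data: bytes, n: int, k: int) -> dict[int, list[int]]:
    """Split data into N fragments, any K of which suffice to rebuild it."""
    if not 1 <= k <= n < P:
        raise ValueError("need 1 <= k <= n < 257")
    padded = data + bytes((-len(data)) % k)  # zero-pad to a multiple of k
    stripes = [list(padded[i:i + k]) for i in range(0, len(padded), k)]
    # Fragment x holds every stripe polynomial evaluated at the point x.
    return {x: [_evaluate(s, x) for s in stripes] for x in range(1, n + 1)}


def decode(fragments: dict[int, list[int]], k: int, length: int) -> bytes:
    """Rebuild the original bytes from any K surviving fragments."""
    if len(fragments) < k:
        raise ValueError("fewer than k fragments survive; data is unrecoverable")
    xs = sorted(fragments)[:k]
    out = bytearray()
    for s in range(len(fragments[xs[0]])):
        out.extend(interpolate(xs, [fragments[x][s] for x in xs]))
    return bytes(out[:length])  # drop the zero padding


if __name__ == "__main__":
    data = b"hello erasure coding"
    frags = encode(data, n=6, k=4)
    del frags[2], frags[5]  # lose any N-K = 2 fragments
    assert decode(frags, k=4, length=len(data)) == data
```

With N=6 and K=4, any two fragments can be discarded and the remaining four still reproduce the original bytes, which is the property stated above.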
Theoretical models have been developed to predict the reliability of erasure-coded data. See, for example, the PhD dissertation by Hakim Weatherspoon entitled “Design and Evaluation of Distributed Wide-Area On-Line Archival Storage Systems,” UC Berkeley, Technical Report No. UCB/EECS-2006-130, Oct. 13, 2006. See also “Notes on Reliability Models for Non-MDS Erasure Codes” by J. L. Hafner and K. Rao, IBM Report, 2006. These theoretical models employ continuous-time Markov chains to examine sequences of failures and repairs.
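To show the kind of calculation such models perform, the sketch below computes a mean time to data loss (MTTDL) for an N-fragment scheme that needs K fragments, as the expected absorption time of a simple birth-death continuous-time Markov chain. It assumes, for illustration only, that each surviving fragment fails independently at exponential rate lam, that a single repair proceeds at rate mu whenever at least one fragment is down, and that data is lost once more than N−K fragments have failed. This is a simplified sketch, not the specific model used in the works cited, and the function name mttdl is hypothetical.

```python
def _solve(a: list[list[float]], b: list[float]) -> list[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def mttdl(n: int, k: int, lam: float, mu: float) -> float:
    """Expected time from 'no failures' until more than n-k fragments are lost.

    State i = number of failed fragments (0..n-k are transient; n-k+1 is data loss).
    For each transient state, the expected time-to-loss T_i satisfies
        q_i * T_i - (n-i)*lam * T_{i+1} - mu * T_{i-1} = 1,
    where q_i is the total outflow rate and T_{n-k+1} = 0.
    """
    if not 1 <= k <= n or lam <= 0 or mu < 0:
        raise ValueError("need 1 <= k <= n, lam > 0, mu >= 0")
    size = n - k + 1
    a = [[0.0] * size for _ in range(size)]
    for i in range(size):
        fail = (n - i) * lam              # any surviving fragment may fail
        repair = mu if i > 0 else 0.0     # one repair at a time (assumption)
        a[i][i] = fail + repair
        if i + 1 < size:
            a[i][i + 1] = -fail
        if i > 0:
            a[i][i - 1] = -repair
    return _solve(a, [1.0] * size)[0]


if __name__ == "__main__":
    print(mttdl(6, 4, lam=0.01, mu=1.0))
```

For N=2 and K=1 (simple mirroring) this reduces to the familiar closed form (3λ+μ)/(2λ²), so mttdl(2, 1, 1.0, 10.0) returns approximately 6.5. Solving the linear system directly, rather than hard-coding a closed form, keeps the sketch usable for arbitrary N and K.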