Providers of massive online storage must balance the heavy demands of availability, performance, reliability, and cost. Distributed replication and erasure coding are used to allow data to be recovered in the event of storage device failure or other system failures. Erasure coding is a method of data protection in which data is broken into fragments, expanded and encoded with redundant data pieces, and stored across a set of different locations, e.g., storage devices in different geographic locations. Erasure coding creates a mathematical function (e.g., polynomial interpolation or oversampling) to describe a set of numbers so they can be checked for accuracy and recovered if one is lost. Erasure coding can be represented in simple form by the following equation: n = k + m. The variable “k” is the original number of portions of data. The variable “m” is the number of extra or redundant portions of data that are added to provide protection from failures. The variable “n” is the total number of portions of data created by the erasure coding process. For example, in a 10-of-16 configuration, 6 extra portions of data (m) are added to the 10 base portions (k). The 16 data portions (n) are distributed across 16 storage devices. In the event of data loss or a lost connection to one or more storage devices, the original data can be reconstructed using any 10 of the 16 portions, so the loss of up to 6 portions can be tolerated.
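The relationship n = k + m can be illustrated with a minimal sketch. It assumes a toy polynomial-interpolation code over the integers modulo 257, which is an illustrative choice and not the scheme of any particular storage system. The names `P`, `lagrange_eval`, `encode`, and `decode` are hypothetical. The k original values are treated as points on a polynomial of degree less than k, the m redundant portions are further points on the same polynomial, and any k surviving points determine the polynomial and therefore the original values.

```python
# Toy erasure code (assumption: arithmetic modulo the prime 257, chosen only
# because it exceeds every byte value; requires Python 3.8+ for pow(x, -1, P)).
P = 257


def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    that passes through the given (x, y) points, modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse of den.
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, n):
    """Return n portions as (x, y) pairs: the k originals at x = 1..k
    followed by m = n - k redundant portions at x = k+1..n."""
    k = len(data)
    base = [(x, data[x - 1]) for x in range(1, k + 1)]
    extra = [(x, lagrange_eval(base, x)) for x in range(k + 1, n + 1)]
    return base + extra


def decode(portions, k):
    """Rebuild the k original values from any k surviving portions."""
    if len(portions) < k:
        raise ValueError("fewer than k portions survive: data is unrecoverable")
    pts = portions[:k]
    return [lagrange_eval(pts, x) for x in range(1, k + 1)]


# 10-of-16 configuration: k = 10 base portions, m = 6 redundant portions.
data = list(b"erasure co")
portions = encode(data, 16)
survivors = portions[6:]  # simulate losing the first 6 storage devices
assert decode(survivors, 10) == data
```

In this sketch a redundant value can equal 256 and so may not fit in a single byte; a practical code would typically work in a finite field sized to the symbol width. Losing a seventh portion leaves only 9 survivors, and `decode` raises an error, which corresponds to an unrecoverable loss.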
Despite the use of these techniques, cases of catastrophic data loss have still occurred. Catastrophic data loss includes a loss of data that cannot be recovered despite the use of erasure coding. Such data loss can lead to liability costs and significant consequences to the brand of the online storage provider. As the amount of data stored by each device grows, so does the severity of such a loss. Catastrophic data loss typically involves coincident storage device failures. Coincident failures are often attributable to commonalities in storage device origin and usage and to overreliance upon the reported mean time to failure (MTTF) for each storage device. For example, a massive online storage provider may establish multiple data centers using storage devices purchased at the same time from the same manufacturer. Although these storage devices are operated in different geographic locations, they may be subject to similar manufacturing defects and/or common wear-out characteristics that lead to coincident failures. Even with improvements to manufacturing and general longevity of storage devices, it is safe to assume that every device will eventually fail and that the time of its failure cannot be known in advance.