1. Field
The present invention relates generally to the distributed storage of data over various nodes in a network.
2. Description of the Related Art
A central problem in storage is how to store data redundantly, so that even if a particular piece of storage fails, the data will be recoverable from other sources. One scheme is to simply store multiple copies of everything. While that works, it requires considerably more storage for a particular level of reliability (or, conversely, it provides considerably less reliability for a particular amount of storage).
To achieve better reliability, erasure codes can be used. An erasure code takes an original piece of data and generates what are called “shares” from it. Shares are designed so that the original data can be reconstructed from any set of shares whose combined size equals the size of the original data. In what is referred to as a k-of-n erasure coding scheme, n shares are generated and any k of them can be used to reconstruct the original data. Each share is of size 1/k times the size of the original data, so that any k shares together contain enough information for reconstruction. The value of n may be highly variable: storing more shares results in greater reliability, and the number of shares can scale from k to essentially infinity. The trivial scheme of simply storing multiple copies of the original data can be thought of as a 1-of-n scheme for n copies, and the highly unreliable but also simple scheme of chopping the original data into pieces and storing them all separately can be thought of as a k-of-k scheme for k pieces.
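The trade-off between replication and a k-of-n scheme can be illustrated with a short calculation. The following is a minimal sketch, not part of the described technique: it assumes that each share is stored on a separate node and that nodes fail independently with the same probability, and the function names `storage_factor` and `loss_probability` are hypothetical.

```python
from math import comb

def storage_factor(k, n):
    """Total storage used, as a multiple of the original data size.

    Each of the n shares is 1/k the size of the original data.
    """
    return n / k

def loss_probability(k, n, p_fail):
    """Probability that fewer than k of the n shares survive.

    Assumes each share fails independently with probability p_fail
    (an illustrative assumption, not a claim about real storage media).
    """
    p_ok = 1.0 - p_fail
    return sum(comb(n, i) * p_ok**i * p_fail**(n - i) for i in range(k))

# Three full copies (1-of-3) and a 10-of-30 scheme use the same storage,
# but the 10-of-30 scheme is far less likely to lose the data.
replication = loss_probability(1, 3, 0.1)
erasure = loss_probability(10, 30, 0.1)
```

Under these assumptions both configurations store three times the original data, while the k-of-k case loses the data as soon as any single piece fails.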
The technique of erasure coding first breaks up the original data into k pieces, then treats those pieces as vectors in a Galois field (GF) and generates shares by multiplying the pieces by random coefficients and adding them together. A share then comprises the result of that computation along with the random coefficients. Erasure coding can also be performed by treating the pieces as vectors modulo a prime number. For simplicity, the erasure coding described below uses Galois fields. The randomness of the coefficients causes there to be some chance of the original data being non-recoverable, essentially equal to the reciprocal of the number of elements of the Galois field being used. For storage purposes, a Galois field of size 2^32 or 2^64 (corresponding to treating sections of 4 and 8 bytes respectively as individual units) is a reasonable tradeoff between computational overhead and the likelihood of data being non-recoverable by chance; non-recoverability is extremely unlikely with 2^32 and essentially impossible with 2^64.
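The share generation and reconstruction described above can be sketched as follows. This is an illustrative sketch only: for brevity it uses arithmetic modulo a prime (the alternative mentioned above) rather than a Galois field of size 2^32 or 2^64, and the modulus `P` and the function names `make_share` and `reconstruct` are assumptions chosen for the example.

```python
import random

P = 2**31 - 1  # a prime modulus (assumption for this sketch)

def make_share(pieces, rng):
    """Build one share: random coefficients plus the combined vector.

    pieces is a list of k equal-length lists of integers in [0, P).
    """
    coeffs = [rng.randrange(P) for _ in pieces]
    length = len(pieces[0])
    combo = [sum(c * piece[i] for c, piece in zip(coeffs, pieces)) % P
             for i in range(length)]
    return coeffs, combo

def reconstruct(shares, k):
    """Recover the k pieces from k shares by Gauss-Jordan elimination mod P."""
    # Each row is the coefficient vector followed by the combined vector.
    rows = [list(c) + list(v) for c, v in shares[:k]]
    for col in range(k):
        pivot = next((r for r in range(col, k) if rows[r][col] != 0), None)
        if pivot is None:
            # The random coefficients happened to be linearly dependent.
            raise ValueError("shares are linearly dependent")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = pow(rows[col][col], P - 2, P)  # inverse via Fermat's little theorem
        rows[col] = [x * inv % P for x in rows[col]]
        for r in range(k):
            if r != col and rows[r][col]:
                f = rows[r][col]
                rows[r] = [(x - f * y) % P for x, y in zip(rows[r], rows[col])]
    return [row[k:] for row in rows]

rng = random.Random(0)
pieces = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]            # k = 3 pieces
shares = [make_share(pieces, rng) for _ in range(5)]  # n = 5 shares
recovered = reconstruct([shares[4], shares[1], shares[2]], 3)
```

Any three of the five shares suffice here, except in the rare case where their coefficient vectors are linearly dependent, which is the chance failure noted above.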
The above technique faces limitations when used for distributed storage over the Internet. For Internet storage, the scarce resource is bandwidth, and the storage capacity of the end nodes is essentially infinite (or at least cheap enough to not be a limiting factor), resulting in a situation where the limiting factor on any storage is the amount of bandwidth needed to send it. For initial storage, this results in a very similar model to one where the limiting factor is storage capacity; there is a one-to-one replacement of bandwidth for storage. But after initial storage, significant bandwidth resources can be consumed for replacing failed storage media (and all storage media eventually fail). Typically when redundant storage is being done locally, e.g., in a local area network, a full recovery of the original data for the shares held by failed media is performed and stored transiently, then new shares are created and stored on another piece of media, and the transient complete copy is then thrown away. To do the same thing across the Internet would require retrieving k shares from the various remaining storage media, reconstructing the original data, generating a new k shares, and sending the shares to the storage media (including the replacement storage media). This requires bandwidth over the Internet equal to 2k times the size of a share, which rapidly becomes unacceptable as k gets large.
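The repair cost described above can be made concrete with a small calculation. This is a rough sketch that follows only the accounting given in the text (k shares retrieved, k new shares sent); the function names `repair_traffic` and `amplification` are hypothetical.

```python
def repair_traffic(k, data_size):
    """Bytes moved over the network for one repair under the scheme above."""
    share_size = data_size / k
    fetched = k * share_size  # retrieve k shares to rebuild the original data
    sent = k * share_size     # send k newly generated shares back out
    return fetched + sent

def amplification(k, data_size):
    """Repair traffic expressed as a multiple of the size of one share."""
    return repair_traffic(k, data_size) / (data_size / k)

# For k = 10 and 1000 bytes of original data, each share is 100 bytes
# and a repair moves 2000 bytes, i.e. 2k = 20 times the share size.
traffic = repair_traffic(10, 1000)
ratio = amplification(10, 1000)
```

The ratio grows linearly with k, which is why the cost becomes prohibitive for large k.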
Simply storing multiple copies of the data does a better job of avoiding usage of bandwidth, but has much worse reliability properties and has wastage of its own on the scale of n, since as more copies of something need to be held, more bandwidth gets used as those copies go bad.