A promising direction in computer storage systems is to harness the collective storage capacity of a massive number of commodity computers to form a large distributed storage system. When designing such distributed storage systems, there are three aspects to consider, namely data reliability, storage cost, and access overhead. The first aspect is data reliability. Individual components of a massive distributed storage system may fail for a variety of reasons, including hard drive failures, computer motherboard failures, memory problems, network cable problems, loose connections (such as a loose hard drive cable, memory cable, or network cable), power supply problems, and so forth.
Many applications require the distributed storage system to ensure high data reliability. For example, an online banking application may require the account balance data to have a Mean Time Between Failure (MTBF) of 10^9 hours. In general, these data reliability requirements are beyond the capability of any single storage component (such as a computer or a hard drive). Therefore, for distributed storage systems to be useful in practice, proper redundancy schemes must be implemented to provide high reliability, availability, and survivability. One type of redundancy scheme is replication, whereby data is replicated two or three times to different computers in the system. As long as any one of the replicas is accessible, the data is available. Most distributed storage systems use replication for its simplified system design and low access overhead.
Another type of redundancy scheme that may be applied to ensure reliability is the use of Erasure Resilient Coding (ERC) techniques. Erasure-resilient codes enable lossless data recovery notwithstanding loss of information during storage or transmission. The basic idea of the ERC techniques is to use certain mathematical transforms to map k original data chunks into n total chunks (the k data chunks and n−k parity chunks). Note that chunks are of the same size and can be physically mapped to bytes, disk sectors, hard drives, computers, and so forth. When there are no more than n−k failures, all original data can be retrieved (using the inverse of the mathematical transforms). Such ERC techniques are called (n,k) ERC schemes.
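The (n,k) mapping described above can be illustrated with a minimal sketch. It is only one possible construction, not necessarily the transform any particular system uses. It assumes each chunk is a single integer symbol, that arithmetic is done modulo the prime 257, and that parity is produced by polynomial interpolation (a Reed-Solomon-style code). The names `encode`, `decode`, `_interpolate_at`, and `P` are hypothetical and chosen for illustration.

```python
P = 257  # assumed prime modulus; must exceed n and every data symbol


def _interpolate_at(points, x, p=P):
    """Evaluate at x the unique polynomial of degree < len(points)
    that passes through the given (xi, yi) points, modulo p."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % p
                den = den * (xi - xj) % p
        # pow(den, -1, p) is the modular inverse (Python 3.8+)
        total = (total + yi * num * pow(den, -1, p)) % p
    return total


def encode(data, n):
    """Map k data symbols to n symbols: the k originals at positions
    0..k-1, followed by n-k parity symbols at positions k..n-1."""
    k = len(data)
    points = list(enumerate(data))
    return list(data) + [_interpolate_at(points, x) for x in range(k, n)]


def decode(survivors, k):
    """Recover the k data symbols from any k surviving
    (position, symbol) pairs."""
    if len(survivors) < k:
        raise ValueError("more than n-k chunks lost; data is unrecoverable")
    points = survivors[:k]
    return [_interpolate_at(points, x) for x in range(k)]


if __name__ == "__main__":
    data = [10, 20, 30, 40]          # k = 4 data symbols
    coded = encode(data, 6)          # n = 6, so n - k = 2 parity symbols
    # Simulate the loss of chunks 0 and 3 (two failures, the maximum tolerated).
    survivors = [(i, s) for i, s in enumerate(coded) if i not in (0, 3)]
    print(decode(survivors, 4) == data)
```

In this sketch any k of the n symbols determine the same polynomial, which is why up to n−k losses can be tolerated. Note also that rebuilding even a single lost symbol requires reading k surviving symbols.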
Even if redundancy schemes achieve the same data reliability, they can differ significantly in terms of storage cost and access overhead. For example, in replication schemes data on a failed chunk can easily be accessed through its replica, and thus the access overhead is low. However, the storage cost is high because each data chunk is replicated a number of times. A large storage cost translates directly into a high cost in hardware (hard drives and associated machines), as well as in the cost to operate the storage system, which includes the power for the machines, cooling, and maintenance. It is desirable, therefore, to decrease the storage cost. On the other hand, (n,k) ERC schemes are efficient in terms of storage cost. However, accessing data on a failed data chunk requires the mathematical inverse and involves k other chunks (data+parity). In this sense, the access overhead is significant. In short, given the data reliability requirement, there exist trade-offs between the storage cost and the access overhead in the distributed storage system design.
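The trade-off described above can be made concrete with a small worked calculation. The sketch below assumes that storage cost is measured as stored chunks per original data chunk and that access overhead is measured as the number of other chunks read to serve data on a failed chunk. The function names and the example parameters (three-way replication and a (9,6) code) are hypothetical and are not taken from the text.

```python
from fractions import Fraction


def replication_costs(copies):
    """Costs when every data chunk is stored `copies` times."""
    return {
        "storage_per_data_chunk": Fraction(copies),  # each chunk stored `copies` times
        "chunks_read_on_failure": 1,                 # read one surviving replica
        "failures_tolerated": copies - 1,            # all but one replica may fail
    }


def erc_costs(n, k):
    """Costs for an (n,k) ERC scheme: k data chunks plus n-k parity chunks."""
    return {
        "storage_per_data_chunk": Fraction(n, k),    # n chunks stored per k data chunks
        "chunks_read_on_failure": k,                 # inverse transform needs k chunks
        "failures_tolerated": n - k,
    }


if __name__ == "__main__":
    for name, costs in [("3-way replication", replication_costs(3)),
                        ("(9,6) ERC", erc_costs(9, 6))]:
        print(name, {key: str(value) for key, value in costs.items()})
```

Under these assumptions, three-way replication stores 3 chunks per data chunk and reads 1 chunk on a failure, while the (9,6) code stores 1.5 chunks per data chunk but reads 6 chunks on a failure. The two schemes sit at opposite ends of the storage cost and access overhead trade-off.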
Existing redundancy schemes allow only very coarse exploration of these trade-offs. In particular, the replication schemes and (n,k) ERC schemes represent two extremes of such trade-offs. In contrast, using multiple protection groups to protect multiple data chunks allows free exploration of the trade-offs between the storage cost and the access overhead. Nevertheless, erasure-resilient coding techniques that use multiple protection groups to protect multiple data chunks are lacking. Note that some error-correction coding techniques do use the concept of different protection groups. However, the design goal of these techniques is to correct errors, which is radically different from the goal of coding techniques in distributed storage systems, namely to correct erasures. Thus, these techniques are not applicable to distributed storage system applications.