As the number of hard disks in large-scale storage systems has increased, techniques that employ redundancy in order to tolerate hardware faults without loss of data, and even without interruption of access to data, have become increasingly important. The most popular technique of this sort is called RAID5, a term introduced by David A. Patterson, Garth A. Gibson and Randy H. Katz in the paper, “A case for redundant arrays of inexpensive disks (RAID),” published in the Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, September 1988. RAID5 systems can provide both I/O performance improvements, by spreading the pieces of a data object across multiple disks, and data safety improvements, by storing redundant information that is sufficient to allow the data on a single failed disk to be reconstructed. Arrays of disks are coupled to form RAID5 groups and a simple parity code (where the data stored in a region of one disk is the bitwise XOR of data stored in corresponding regions of other disks in the group) is typically employed to provide redundancy with minimal storage space overhead. Other methods for coupling disks together to allow recovery after a single disk failure were also surveyed in the 1988 paper, including replication of each data block on two different disks (called RAID1 there). Advances on RAID5 that allow recovery after two simultaneous disk failures have come to be known as RAID6.
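The simple parity code mentioned above can be illustrated with a minimal sketch. The function name `xor_blocks` and the sample byte strings are hypothetical and chosen only for illustration; an actual RAID5 implementation also rotates the parity region among the disks, which is not shown here.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bitwise XOR of a list of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Corresponding regions on three data disks (illustrative contents only).
data = [b"disk0 data", b"disk1 data", b"disk2 data"]

# The parity region is the XOR of the corresponding data regions.
parity = xor_blocks(data)

# Simulate the loss of disk 1: XOR of the surviving regions and the parity
# region reproduces the missing region.
survivors = [data[0], data[2], parity]
recovered = xor_blocks(survivors)
```

Because XOR is its own inverse, the same function serves both to compute the parity and to rebuild any one missing region, which is why the scheme tolerates exactly one failure per group.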
One could imagine increasing the capacity of RAID-based storage systems by simply adding subsystems, each protected by its own internal RAID redundancy. In this case the overall system becomes less reliable as additional fallible subsystems are included in it. A more scalable alternative is to provide redundancy across subsystems that are well insulated from each other's failure, so that failure of entire subsystems can be tolerated. This kind of redundancy can be provided by RAID running across subsystems, as is described for example in “Multi-Level RAID for Very Large Disk Arrays,” by Alexander Thomasian, published in ACM SIGMETRICS Performance Evaluation Review, March 2006. This approach has the disadvantage that the rigid correspondence of data components between elements of the RAID group makes incremental scaling difficult. One could not, for example, increase total storage capacity by just increasing the capacity of one subsystem.
Alternative schemes have been proposed for spreading redundancy across subsystems, with storage responsibilities shifting incrementally as individual subsystems are added or removed. The management of storage assignments must also, of course, be fault tolerant. The Chord system introduced randomized algorithms for achieving these goals in the peer-to-peer world. Chord was described by Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan in the paper, “Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications,” published in the Proceedings of ACM SIGCOMM'01, San Diego, September 2001. It built upon work by D. Karger, E. Lehman, F. Leighton, M. Levine, D. Lewin, and R. Panigrahy, “Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web,” which was published in the Proceedings of the 29th Annual ACM Symposium on Theory of Computing (El Paso, Tex., May 1997). The consistent hashing work was also the subject of U.S. Pat. No. 6,553,420, Karger et al., “Method and Apparatus for Distributing Requests Among a Plurality of Resources,” filed June 1998.
Chord is a randomized mechanism that assigns data to storage servers. The Chord algorithm uses hash-based block names as permanent identifiers for blocks of data and divides the address space of all possible block names among the storage servers. The division is accomplished by pseudo-randomly assigning a number of points in the address space to each storage server. The collection of all assigned points is used to define a set of address ranges: each server is responsible for all blocks with names that fall into an address range for which it has been assigned the starting point. The address range extends to the next point assigned to a server. When a new server is added to the storage system, new points are pseudo-randomly assigned to it and responsibility for portions of the address space shifts correspondingly; data is moved between servers accordingly. The number of points assigned to a server is proportional to its storage capacity. The same set of address ranges is used to define responsibilities for both primary and redundant copies of a block: the primary copy falls in some address range, and redundant copies belong to the servers assigned succeeding ranges. When a server dies or is removed from the system its assigned points disappear. This causes some adjacent address ranges to be extended and storage responsibilities to shift. The Chord approach of randomly assigning storage responsibilities works well for very large numbers of servers, but it does not scale well to smaller numbers of servers. For example, the only guarantee that Chord makes that redundant copies of data are assigned to different servers is statistical, and this guarantee fails for small numbers of servers. If all copies of a block of data are stored on the same server, then the data is lost if that server fails.
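The point-based division of the address space described above can be sketched as follows. This is a simplified illustration and not the Chord protocol itself: the class `Ring`, the helper `point`, the 64-bit address space, the use of SHA-256, and the server and block names are all assumptions made for the example, and Chord's distributed lookup is omitted entirely.

```python
import bisect
import hashlib


def point(key: str) -> int:
    """Map a string to a pseudo-random point in a 64-bit address space."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class Ring:
    def __init__(self):
        self.points = []  # sorted (point, server) pairs

    def add_server(self, name: str, capacity: int) -> None:
        # The number of points is proportional to capacity (one per unit here).
        for i in range(capacity):
            bisect.insort(self.points, (point(f"{name}#{i}"), name))

    def remove_server(self, name: str) -> None:
        # The server's points disappear; adjacent ranges are thereby extended.
        self.points = [p for p in self.points if p[1] != name]

    def owners(self, block_name: str, copies: int = 2) -> list:
        # The primary range starts at the greatest point <= the block's
        # address (wrapping around); redundant copies use succeeding ranges.
        keys = [p for p, _ in self.points]
        idx = bisect.bisect_right(keys, point(block_name)) - 1
        return [self.points[(idx + k) % len(self.points)][1] for k in range(copies)]


ring = Ring()
ring.add_server("A", 4)
ring.add_server("B", 4)
blocks = [f"block-{i}" for i in range(1000)]
before = {b: ring.owners(b)[0] for b in blocks}

ring.add_server("C", 4)
after = {b: ring.owners(b)[0] for b in blocks}

# Blocks whose primary copy changed server when C was added.
moved = [b for b in blocks if before[b] != after[b]]

# Blocks whose two copies land on the same server; nothing in the scheme
# prevents this list from being non-empty when there are few servers.
colocated = [b for b in blocks if len(set(ring.owners(b))) == 1]
```

In this sketch, every block in `moved` has moved to the new server `C`, since adding points only splits existing ranges. The `colocated` list shows the weakness noted above: two succeeding ranges may belong to the same server, and in the extreme case of a single server every copy necessarily does.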
A randomized storage assignment method that doesn't suffer from this problem is described by R. Honicky and Ethan Miller in their paper, “Replication Under Scalable Hashing: A Family of Algorithms for Scalable Decentralized Data Distribution,” which appeared in the Proceedings of the 18th International Parallel & Distributed Processing Symposium (April, 2004). They provide algorithms for assigning replicas of blocks of data (or other redundant components derived from the blocks) to a set of storage devices, with each replica being placed on a different storage device. The RUSH algorithms involve grouping together storage devices that were added to the storage system at the same time and labeling each group with a unique cluster identifier. A deterministic function of block identifiers and cluster identifiers determines where each replica resides. As new clusters are added, the algorithm reassigns some fraction of all replicas to the new storage.
RUSH doesn't allow individual failed storage devices to be removed, only entire clusters of devices, and there are constraints on the minimum size of a cluster. These algorithms also have the drawback that the amount of work needed to determine where a replica resides increases as the number of clusters increases. All identifiers for blocks already stored need to be checked using the RUSH algorithm when new storage is added in order to determine which blocks have been reassigned to the new storage and need to be moved.
Redundancy schemes similar to those used in RAID5 systems have also been employed in storage systems that use randomized placement of redundant components. This class of redundancy schemes is sometimes referred to as “erasure resilient codes,” because they depend on knowing which redundant components have been “erased” in order to reconstruct the missing data. The use of parity blocks, as in RAID5, is an efficient way to protect against a single disk failure: corresponding bits on each disk are treated as bits of a codeword, protected by a single parity bit, allowing any single-bit erasure (i.e., any single disk failure) to be recovered. This approach can be extended to schemes that can recover from multiple hardware failures by protecting a longer codeword with a more sophisticated error correcting code. This is the basis of advances on the RAID5 technique, as is discussed for example by G. Feng et al. in “New Efficient MDS Array Codes for RAID, Part 1: Reed-Solomon-Like Codes for Tolerating Three Disk Failures,” published in IEEE Transactions on Computers, September 2005. The same distributed-codeword idea is also the basis of fault tolerant distributed storage methods, such as the one described by Michael Rabin in U.S. Pat. No. 5,485,474, “Scheme for Information Dispersal and Reconstruction,” filed in May 1991. This generic dependence of distributed-storage protection schemes on the idea of a distributed codeword has a drawback: error correcting codes are designed to protect collections of elements each of which is only a few bits long. There may be better codes available if advantage can be taken of the fact that the elementary units of storage being protected are actually hundreds or thousands of bytes long (or longer).
In summary, there is a need to protect storage systems comprising large collections of disks from faults in an incrementally scalable fashion. It is desirable that the method be able to scale down to relatively small collections of disks, since storage systems that grow large may not start off large. The ability to add and remove storage in small increments is useful not only for scaling, but also for non-disruptive migration to new hardware. Data assignment schemes based on randomized placement of data are attractive, but existing algorithms have distinct disadvantages in terms of incremental scalability and efficiency. Finally, existing storage schemes base their fault recovery on error correcting codes that are designed to protect very small data elements, and take no advantage of the relatively large size of the elementary units of storage being protected.