1. Technical Field
The present invention relates generally to techniques for highly available, reliable, and persistent data storage in a distributed computer network.
2. Description of the Related Art
A need has developed for the archival storage of “fixed content” in a highly available, reliable and persistent manner that replaces or supplements traditional tape and optical storage solutions. The term “fixed content” typically refers to any type of digital information that is expected to be retained without change for reference or other purposes. Examples of such fixed content include, among many others, e-mail, documents, diagnostic images, check images, voice recordings, film and video, and the like. The Redundant Array of Independent Nodes (RAIN) storage approach has emerged as the architecture of choice for creating large online archives for the storage of such fixed content information assets. By allowing nodes to join and exit from a cluster as needed, RAIN architectures insulate a storage cluster from the failure of one or more nodes. By replicating data on multiple nodes, RAIN-type archives can automatically compensate for node failure or removal. RAIN systems are typically delivered as hardware appliances designed from identical components within a closed system.
A representative archive comprises storage nodes that provide the long-term data storage, and access nodes that provide the interface through which data files enter the archive. To protect files, typically one of several possible schemes is used. These well-known file protection schemes include simple file mirroring, RAID-5 schemes that spread the file contents across multiple nodes using a recovery stripe to recreate any missing stripes, or variations on RAID-5 that use multiple recovery stripes to ensure that simultaneous node failures do not lead to overall system failure. One such variation is the Information Dispersal Algorithm (IDA), originally developed by Rabin and described in U.S. Pat. No. 5,485,474. Rabin IDA itself is a variant of a Reed-Solomon error correcting code, a type of linear block code used to ensure data integrity during transmission over a communications channel. Rabin IDA breaks apart a data file so that the pieces can be distributed to multiple sites for fault tolerance without compromising the integrity of the data. In particular, IDA uses matrix algebra over finite fields to disperse the information of a file F into n pieces that are transmitted or stored on n different machines (or disks) such that the contents of the original file F can be reconstructed from the contents of any m of its pieces, where m ≤ n. Because of the way in which the data is broken up, only a subset of the pieces is required to reassemble the original data. In IDA, an important objective is to ensure security of the dispersed data, and this is accomplished by ensuring that no fragment of the data is usable, in and of itself, to recover the original data. This requirement is undesirable, as it is preferred to have as much of the data as possible freely available (as there may be no loss during transmission or storage), so that the checksum pieces are only used to reconstruct any of the original data that may be unavailable.
Moreover, while Rabin IDA provides fault tolerance and data security, it is not computationally efficient, especially as the size of the file increases.
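The m-of-n property described above can be illustrated with a minimal sketch. It is not Rabin's implementation: the prime field of size 257, the Vandermonde-style evaluation points, and the function names `disperse`, `solve_mod` and `reconstruct` are assumptions chosen for readability. Each group of m bytes is treated as the coefficients of a polynomial, each of the n pieces stores one evaluation of it, and any m pieces suffice to solve for the coefficients again.

```python
P = 257  # small prime field; an illustrative assumption, not Rabin's choice


def disperse(data: bytes, n: int, m: int) -> list[list[int]]:
    """Split data into n pieces such that any m of them rebuild it."""
    padded = data + bytes(-len(data) % m)  # zero-pad to a multiple of m
    pieces = [[] for _ in range(n)]
    for start in range(0, len(padded), m):
        block = padded[start:start + m]
        for i in range(n):
            x = i + 1  # distinct nonzero evaluation point for piece i
            # Evaluate the polynomial whose coefficients are `block` at x.
            pieces[i].append(sum(c * pow(x, j, P) for j, c in enumerate(block)) % P)
    return pieces


def solve_mod(matrix, rhs):
    """Solve a square linear system modulo P by Gauss-Jordan elimination."""
    size = len(rhs)
    rows = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if rows[r][col] % P)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = pow(rows[col][col], -1, P)  # modular inverse (Python 3.8+)
        rows[col] = [v * inv % P for v in rows[col]]
        for r in range(size):
            if r != col and rows[r][col]:
                factor = rows[r][col]
                rows[r] = [(a - factor * b) % P for a, b in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]


def reconstruct(available: dict[int, list[int]], m: int, length: int) -> bytes:
    """Rebuild the original bytes from any m pieces, keyed by piece index."""
    indices = sorted(available)[:m]
    matrix = [[pow(i + 1, j, P) for j in range(m)] for i in indices]
    out = bytearray()
    for pos in range(len(available[indices[0]])):
        out.extend(solve_mod(matrix, [available[i][pos] for i in indices]))
    return bytes(out[:length])
```

For example, `disperse(b"fixed content", 5, 3)` yields five pieces, and passing any three of them to `reconstruct` together with the original length returns the input. Every piece is a mix of all the bytes in a block, so the original bytes are never available directly, and every read pays for a linear solve per block. The sketch therefore also shows why the computational cost grows with file size.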
To address this problem, other types of error correcting codes with smaller computational requirements were developed. Tornado codes are similar to Reed-Solomon codes in that an input file is represented by K input symbols and is used to determine N output symbols, where N is fixed before the encoding process begins. In this approach, after a file is partitioned into a set of equal size fragments (called data nodes), a set of check nodes that are equal in size and population is then created. The encoding of the file involves a series of specially designed bipartite graphs. Each check node is assigned two or more nodes to be its neighbors, and the contents of the check node are set to be the bit-wise XOR of the values of its neighbors. The nodes are sequentially numbered, and the encoded file is distributed in units that each contain one or more nodes. Decoding is symmetric to the encoding process, except that the check nodes are used to restore their neighbors. To restore a missing node, the contents of the check node are XORed with the contents of the remaining neighbor nodes, and the resulting value is assigned to the missing neighbor. Tornado codes provide certain advantages but also have limitations. Among other issues, a graph is specific to a file size, so a new graph needs to be generated for each file size used. Furthermore, the graphs needed by the Tornado codes are complicated to construct, and they require different custom settings of parameters for different sized files to obtain the best performance. These graphs are usually quite large and require a significant amount of memory for their storage.
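The check-node step described above can be sketched for a single check node. This is a simplified illustration only: the neighbor list stands in for one edge set of a bipartite graph, and the helper names `xor_bytes`, `make_check_node` and `restore_neighbor` are hypothetical. A real Tornado code uses a cascade of carefully constructed graphs, which this sketch does not attempt.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bit-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_check_node(data_nodes, neighbors):
    """Build a check node as the XOR of the data nodes listed in `neighbors`."""
    check = bytes(len(data_nodes[neighbors[0]]))  # all-zero starting value
    for idx in neighbors:
        check = xor_bytes(check, data_nodes[idx])
    return check


def restore_neighbor(check, data_nodes, neighbors, missing):
    """Recover one missing neighbor by XORing the check node with the others."""
    value = check
    for idx in neighbors:
        if idx != missing:
            value = xor_bytes(value, data_nodes[idx])
    return value
```

With data nodes `[b"abcd", b"efgh", b"ijkl"]` and neighbors `[0, 2]`, losing node 2 is repaired by calling `restore_neighbor(check, nodes, [0, 2], missing=2)`. Recovery works only when exactly one neighbor of the check node is missing, which is why the choice of graph matters so much in practice.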
Still another approach to the problem of protecting content in distributed storage is described in U.S. Pat. No. 6,614,366, to Luby et al., which also purports to address limitations and deficiencies in Tornado coding. In this patent, an encoder uses an input file of data and a key to produce an output symbol. An output symbol with key I is generated by determining a weight, W(I), for the output symbol to be generated, selecting W(I) of the input symbols associated with the output symbol according to a function of I, and generating the output symbol's value B(I) from a predetermined value function F(I) of the selected W(I) input symbols. An encoder can be called repeatedly to generate multiple output symbols. The output symbols are generally independent of each other, and an unbounded number (subject to the resolution of I) can be generated, if needed. A decoder receives some or all of the output symbols generated. The number of output symbols needed to decode an input file is equal to, or slightly greater than, the number of input symbols comprising the file, assuming that input symbols and output symbols represent the same number of bits of data. This approach is said to provide certain advantages over Tornado and Reed-Solomon based coding techniques.
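The key-driven generation of a single output symbol can be sketched as follows. This is a hedged illustration, not the method of the cited patent: the uniform weight distribution, the use of a seeded pseudo-random generator as the "function of I", the choice of XOR as the value function, and the name `output_symbol` are all assumptions made for the example.

```python
import random
from functools import reduce


def output_symbol(input_symbols, key, max_weight=3):
    """Return (chosen indices, value) for the output symbol with the given key.

    The key seeds a generator, so encoder and decoder derive the same weight
    and the same selection of input symbols from the key alone.
    """
    rng = random.Random(key)
    # W(I): hypothetical uniform weight; real codes use a tuned distribution.
    weight = rng.randint(1, min(max_weight, len(input_symbols)))
    # Select W(I) distinct input symbols as a function of the key.
    chosen = sorted(rng.sample(range(len(input_symbols)), weight))
    # B(I): XOR of the selected input symbols (assumed value function).
    value = reduce(lambda a, b: a ^ b, (input_symbols[i] for i in chosen))
    return chosen, value
```

Calling the function with successive keys produces as many output symbols as needed, and a decoder that knows a key can recompute which input symbols contributed to the corresponding value.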
While the approaches described above are representative of the prior art and can provide fault tolerant and secure storage, there remains a need to improve the state of the art as it relates to the problem of reliable and secure storage of fixed content, especially across heterogeneous RAIN archives.