Large-scale mass storage systems are driven by many emerging applications in research and industry. For instance, particle physics experiments generate petabytes of data per annum. Many commercial applications, for instance digital video or medical imaging, require highly reliable, distributed mass storage for on-line parallel access. Mass storage systems of petabyte scale have to be built in a modular fashion, as no single computer can deliver such scalability.
Large farms of standard PCs have become a commodity and replace traditional supercomputers due to their comparable compute power and their much lower prices. The maximum capacity of standard disk drives, such as those installed in commodity PCs, exceeds 1 terabyte per node. Thus, a cluster installation with 1000 commodity PCs and disks would provide a distributed mass storage capacity exceeding 1 petabyte at minimal cost. The reason why this type of distributed mass storage paradigm has not yet been adopted is its inherent unreliability.
Local disks, connected to a central server, can be protected against data loss by using RAID technology (RAID: “Redundant Array of Independent/Inexpensive Disks”). Proposed by Patterson et al. (D. A. Patterson, G. Gibson, and R. H. Katz: “A Case for Redundant Arrays of Inexpensive Disks”, SIGMOD International Conference on Data Management, Chicago, pp. 109-116, 1988), RAID aims at improving on the performance and reliability of single large disks by assembling several disks into one virtual device, while maintaining distributed parity information within this device. The cited paper introduces five RAID strategies, often quoted as RAID levels 1 through 5, which differ in terms of performance and reliability. In addition to these five levels, the RAID Advisory Board defined four more levels, referred to as levels 0, 6, 10 and 53. All these RAID schemes are defined for local disk arrays. They are widely used to enhance the data rate or to protect against data loss caused by a disk failure within one RAID ensemble.
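The parity idea behind the single-failure-tolerant RAID levels can be illustrated with a minimal sketch. It assumes a plain bytewise XOR parity over equally sized blocks, which is the usual textbook construction; the function names `make_parity` and `rebuild_missing` are hypothetical and do not refer to any particular RAID implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equally sized byte blocks."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute the parity block stored alongside the data blocks."""
    return xor_blocks(data_blocks)


def rebuild_missing(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors and parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    data = [b"disk", b"farm", b"node"]
    parity = make_parity(data)
    # Simulate the loss of the second block and rebuild it.
    print(rebuild_missing([data[0], data[2]], parity))
```

Because XOR is its own inverse, any one block can be recovered from the remaining blocks and the parity, but a second simultaneous loss cannot.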
A next step was to apply the RAID concept to a distributed computer farm. Distributed RAID on a block level (as opposed to a file system level) was first proposed by Stonebraker and Schloss (M. Stonebraker and G. A. Schloss: Distributed RAID - A new Multicopy Algorithm, Proceedings of the International Conference on Data Engineering, pp. 430-437, 1990) and patented, for instance, in JP 110 25 022 A. This approach often suffers from several drawbacks: reliability, space overhead, computational overhead and network load. Most of these systems can only tolerate a single disk failure. Simple calculations show, however, that larger systems must inevitably be able to cope with simultaneous errors of multiple components. This applies, in particular, to clusters of commodity components such as those mentioned above, since the quality of standard components may be worse than that of high-end products. However, given the potential scale of the discussed systems, no compute node is reliable enough to support scaling to thousands of nodes. In addition, the space overhead induced by these systems, defined as the ratio of the space required for redundant data to the space available for user data, is in most cases not optimal with respect to the Singleton bound (D. R. Hankerson: Coding Theory and Cryptography: The essentials, ISBN:0824704657). Codes that attain this bound are able to tolerate one disk failure for every redundancy region that is available within the system. It can easily be shown that tolerating N disk failures requires at least N redundancy regions. Distributed data mirroring, as for instance proposed by Hwang et al. (K. Hwang, H. Jin, and R. Ho: Orthogonal Striping and Mirroring in Distributed RAID for I/O-centric Cluster Computing, IEEE Transactions on Parallel and Distributed Systems, Vol. 13, no. 1, January 2002), is very inefficient, using only half of the total capacity for user data.
In addition, the whole system can only tolerate a single disk error. For larger installations, the probability of data loss scales linearly with the system size, approaching 1 over a period of a few years for the named systems, even if highly reliable components are used.
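The scaling of the data loss probability with system size can be illustrated with a rough model. The sketch below is not taken from the cited works. It assumes independent disk failures at a constant rate, redundancy groups of fixed size that each tolerate exactly one failure, and data loss whenever a second disk of a group fails while the first is still being repaired. The parameter names and all numeric values (annual failure rate, group size, repair time) are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760.0


def annual_loss_probability(n_disks, group_size, annual_failure_rate, repair_hours):
    """Approximate probability of at least one data-loss event within a year.

    Assumes n_disks / group_size independent groups, each tolerating a
    single disk failure, and a constant per-disk failure rate.
    """
    failure_rate = annual_failure_rate / HOURS_PER_YEAR  # per disk, per hour
    groups = n_disks / group_size
    # Rate of "first failure" in a group times the chance that one of the
    # remaining disks fails during the repair window.
    group_loss_rate = (group_size * failure_rate) * (
        (group_size - 1) * failure_rate * repair_hours
    )
    total_loss_rate = groups * group_loss_rate  # grows linearly with n_disks
    return 1.0 - math.exp(-total_loss_rate * HOURS_PER_YEAR)


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, annual_loss_probability(n, 8, 0.05, 24.0))
```

In this model the expected number of loss events per year is proportional to the number of disks, so the loss probability grows roughly linearly while it is small and saturates towards 1 for very large installations or long observation periods.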
All these systems have in common that they stripe logical data objects over several physical devices. For instance, logically adjacent blocks of a file are distributed over several disks, which in a distributed system reside on multiple nodes. For distributed systems, this distribution of data blocks has a major drawback: it requires network transactions for any read/write access to the logical data object. For example, in the case of a read access to a large file on an N-node distributed RAID system, a fraction 1-1/(N-P) of all read accesses has to be performed across the network from remote nodes, where P is the number of redundancy blocks in a stripe group (usually 1). This traffic increases both network and CPU overhead.
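The remote-access fraction 1-1/(N-P) stated above can be evaluated directly. The following short sketch simply tabulates that formula; the function name `remote_read_fraction` is hypothetical.

```python
def remote_read_fraction(n_nodes, p_redundancy=1):
    """Fraction of read accesses served by remote nodes: 1 - 1/(N - P)."""
    data_blocks_per_stripe = n_nodes - p_redundancy
    if data_blocks_per_stripe < 1:
        raise ValueError("need at least one data block per stripe group")
    return 1.0 - 1.0 / data_blocks_per_stripe


if __name__ == "__main__":
    for n in (3, 5, 11, 101):
        print(n, remote_read_fraction(n))
```

Already for a handful of nodes most reads are remote, and the fraction approaches 1 as the number of nodes grows.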
Other distributed systems use network-capable RAID controllers (e.g., N. Peleg: Apparatus and Method for a Distributed RAID, U.S. patent application Ser. No. 2003/0084397) or are meant for use in wide area networks (e.g., G. H. Moulton: System and Method for Data Protection with Multidimensional Parity, U.S. patent application Ser. No. 2002/0048284). Data striping also applies to systems that are able to tolerate multiple failures by using multidimensional parity (e.g., D. J. Stephenson: RAID architecture with two-drive fault-tolerance, U.S. Pat. No. 6,353,895).
PC clusters traditionally have centralized file servers and use the known RAID technology for their local devices. In addition, backup systems are provided to protect from data loss in the case of an unrecoverable server error. However, such backup systems may require substantial time for the recovery process. It is desirable to avoid the expensive installations of centralized file servers with their associated disadvantages of poor scalability, low network throughput and high cost by building a reliable mass storage system based on the unreliable components of the cluster.