1. Technical Field
The present invention relates to computer server data redundancy in general, and, in particular, to a method for providing fault tolerance to multiple computer servers.
2. Description of Related Art
Fault-tolerance is a major concern for distributed and parallel computer systems because such systems tend to have a higher number of possible points of failure. In addition, with more and more industries relying on electronic commerce, fault-tolerance for the computer servers on which electronic commerce depends has become increasingly important.
Replication is a widely used method to achieve fault-tolerance for a distributed computer system. In order to tolerate crashes of t servers, replication requires t+1 copies of identical processes. Such an approach is based on the assumption that all t+1 machines will have identical state under normal operating conditions. Because no more than t machines are assumed to fail, at least one machine will always remain available after a failure. The communications overhead of replication is minimal, but the storage space overhead can be prohibitively large.
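For illustration only, the t+1 replication scheme described above may be sketched as follows. The helper names replicate and recover, and the use of a dictionary as the server state, are assumptions made for this sketch and are not taken from any particular system.

```python
def replicate(state, t):
    """Return t + 1 independent copies of the state (hypothetical helper)."""
    return [dict(state) for _ in range(t + 1)]


def recover(replicas, crashed):
    """Return the state of any replica whose index is not in `crashed`."""
    for index, replica in enumerate(replicas):
        if index not in crashed:
            return replica
    # Only reachable when more than t replicas have crashed.
    raise RuntimeError("no surviving replica")


state = {"x": 1, "y": 2}
t = 2                                   # number of crashes to tolerate
replicas = replicate(state, t)          # t + 1 = 3 full copies
survivor = recover(replicas, crashed={0, 2})
storage_factor = len(replicas)          # storage grows by a factor of t + 1
```

In this sketch, tolerating t = 2 crashes requires three full copies of the state, which is the storage overhead noted above.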
In data storage and communication, coding theory is extensively used to recover from faults. For example, a redundant array of independent disks (RAID) uses disk striping and parity-based schemes (or erasure codes) to recover from disk faults. Network coding has been used to recover from packet loss or to reduce communication overhead for multicast. In such applications, the data is viewed as a set of data entities, such as disk blocks for storage applications and packets for network applications. By using coding theory techniques, one can get much better space utilization than with, for example, simple replication. However, since such techniques are oblivious to the structure of the data, the details of actual operations on the data are ignored and the coding techniques simply re-compute the entire block of data. This can result in a large communication and computational overhead that depends on the size of the data structure, making such an approach impractical for applications with huge data sets.
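For illustration only, a parity-based scheme of the kind mentioned above may be sketched with a single XOR parity block over equal-length data blocks. The function names and the sample blocks are assumptions made for this sketch; an actual RAID implementation differs in many details.

```python
from functools import reduce


def xor_blocks(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity(blocks):
    """Compute one parity block over all data blocks."""
    return reduce(xor_blocks, blocks)


def recover_block(blocks, parity_block, lost):
    """Rebuild the block at index `lost` from the other blocks and the parity."""
    survivors = [b for i, b in enumerate(blocks) if i != lost]
    return reduce(xor_blocks, survivors, parity_block)


blocks = [b"AAAA", b"BBBB", b"CCCC"]    # hypothetical data blocks
p = parity(blocks)                      # one extra block of storage
rebuilt = recover_block(blocks, p, lost=1)

# A structure-oblivious update: after any change to one block, the parity
# is recomputed over every block, regardless of how small the change was.
blocks[0] = b"AAAZ"
p = parity(blocks)
```

In this sketch, three data blocks plus one parity block tolerate the loss of any single block, whereas simple replication would need six blocks for the same data. The final two lines show the cost noted above: the recomputation touches all blocks, so its overhead grows with the size of the data.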
Consequently, it would be desirable to provide an improved method for providing fault tolerance to multiple computer servers.