Networked computer systems enable users to share resources or services. One computer can request and use resources or services provided by another computer. The computer that requests and uses the resources or services is typically known as a client, and the computer that provides them is typically known as a server. When a server fails in such a networked computer system, it is desirable for the system to recover from the failure in a way that is transparent to the clients. The ability of a system to detect and recover from the failure of a server with little or no impact on the clients is known as high availability.
One method for achieving high availability in networked computer systems is fault tolerance at the hardware level. A particular implementation of this method is known as triple modular redundancy or "TMR." With TMR, three instances of the same hardware module execute concurrently. By comparing the results of the three hardware modules and using the majority result, the failure of any one of the hardware modules can be detected. The primary disadvantage of TMR is that it does not detect and recover from the failure of software modules.
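The majority comparison described above can be illustrated with a minimal software sketch. In actual TMR the voting is performed in hardware; the function name `majority_vote` and the representation of module results as a Python list are assumptions made only for illustration.

```python
from collections import Counter


def majority_vote(outputs):
    """Compare three module results (hypothetical software model of a TMR voter).

    Returns the majority result and the indices of any modules that disagree with it.
    """
    if len(outputs) != 3:
        raise ValueError("TMR compares exactly three results")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        # All three results differ, so no majority exists and the fault cannot be masked.
        raise RuntimeError("no majority among the three results")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty
```

For example, `majority_vote([5, 5, 7])` yields the result `5` and identifies the third module (index 2) as the one that failed. The sketch also shows the limit of the approach: if the same faulty software runs on all three modules, all three results agree and nothing is detected.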
Another method for achieving high availability in networked computer systems is software replication. With software replication, a software module that provides a service to clients is replicated on at least two different nodes in the system. The software module on each node is referred to as a replica. If one replica fails, client requests for the service are routed to any remaining replicas. As long as at least one replica has not failed, the service provided by the software module remains available to the clients. Thus, software replication detects and recovers from the failure of both hardware modules and software modules.
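The routing behavior described above might be sketched as follows. This is a simplified, single-process illustration and not a particular replication system; the names `Replica`, `ReplicaFailure`, and `route` are hypothetical, and a failed replica is modeled simply as one that raises an exception when asked to handle a request.

```python
class ReplicaFailure(Exception):
    """Hypothetical signal that a replica has failed."""


class Replica:
    """One copy of the software module, running on one node (illustrative model)."""

    def __init__(self, name):
        self.name = name
        self.failed = False

    def handle(self, request):
        if self.failed:
            raise ReplicaFailure(self.name)
        return f"{self.name} handled {request}"


def route(request, replicas):
    """Send the request to the first replica that has not failed."""
    for replica in replicas:
        try:
            return replica.handle(request)
        except ReplicaFailure:
            continue  # this replica has failed; try a remaining replica
    raise RuntimeError("service unavailable: all replicas have failed")
```

With two replicas `a` and `b`, setting `a.failed = True` causes `route("r1", [a, b])` to be served by `b`, so the service remains available as long as at least one replica has not failed.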
While software replication overcomes the primary disadvantage of TMR (i.e., that TMR does not detect and recover from the failure of software modules), software replication has its own disadvantages. The primary disadvantage of software replication is that it requires complex software protocols. These protocols are necessary to ensure that all of the replicas have the same state. These protocols are also necessary to ensure that each client request is completely executed exactly once, even in the event of a failure of one replica. For example, it is undesirable for one replica to completely execute a request and then fail, and for another replica to then completely execute the same request a second time, because the effects of the request would be applied twice. Due to their complexity, some of these protocols are very inefficient and decrease the processing capacity of the system. Therefore, a need exists for a software replication protocol that is more efficient and increases the processing capacity of the system.
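The exactly-once requirement can be illustrated with a minimal sketch of one generic technique, request-identifier deduplication. This is not the protocol referred to above. The names `ReplicatedState` and `DepositReplica` are hypothetical, and the single shared in-memory object stands in for state that a real replication protocol would have to keep identical across nodes, which is precisely the complex part.

```python
class ReplicatedState:
    """State assumed to be identical on every replica (modeled here as one shared object)."""

    def __init__(self):
        self.balance = 0
        self.completed = {}  # request id -> result of the completed request


class DepositReplica:
    """Hypothetical replica of a service that applies deposits to a balance."""

    def __init__(self, state):
        self.state = state

    def execute(self, request_id, amount):
        # If the request was already completely executed, return the recorded
        # result instead of applying the deposit a second time.
        if request_id in self.state.completed:
            return self.state.completed[request_id]
        self.state.balance += amount
        self.state.completed[request_id] = self.state.balance
        return self.state.balance


state = ReplicatedState()
first, second = DepositReplica(state), DepositReplica(state)
first.execute("req-1", 100)             # first replica executes, then fails before replying
retry = second.execute("req-1", 100)    # client retries the same request on another replica
```

In this sketch the retry returns the recorded result and the balance remains 100 rather than becoming 200. Without the `completed` record, the second replica would apply the deposit again, which is the duplicate execution the protocols must prevent.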