1. Field of the Invention
This invention pertains generally to software-based fault-tolerant computer systems and, more particularly, to multithreaded application programs that are replicated using the egalitarian and competitive active replication strategy.
2. Description of Related Art
Fault-tolerant systems are based on entity redundancy (replication) to mask faults and, thus, to provide continuous service to their users. In software fault tolerance, the entities that are replicated are the application programs or parts thereof (processes, objects or components). A fundamental issue in the design and implementation of fault-tolerant systems is that of maintaining consistency of the states of the replicas.
Distributed systems offer the opportunity for fault tolerance by allowing replicas of the application programs to be hosted on different computers (i.e., in different fault containment regions). In the client-server model of distributed computing, a client invokes a method of a server, typically hosted on a different computer. To render an application fault-tolerant, the server is replicated but the client may also be replicated, particularly in multi-tier applications and in peer-to-peer applications, wherein a process, object or component acts as both a client and a server.
In an active replication strategy, the program code of the replicas is identical and the replicas execute their copies of the code concurrently; thus, the active replication strategy is an egalitarian strategy. Active replication depends on strong replica consistency: each of the replicas starts in the same initial state (values of their attributes or variables) and executes the same methods or operations. If there is no non-determinism in the execution of the replicas, they reach the same state at the end of the execution of each method invocation or operation. The present invention ensures that the replicas generate the same results, even if non-determinism caused by multithreading is present in the replicas. When the replicas take an action or produce a result that is externally visible, such as sending a message or issuing an input/output command, the first such action or result is the one that is used and the corresponding actions or results of the other replicas are either suppressed or discarded. Thus, the active replication strategy is also a competitive strategy.
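The competitive aspect described above, in which the first externally visible result is used and later copies are discarded, can be illustrated with a minimal sketch. The class name `DuplicateFilter`, the operation identifiers and the replica names are hypothetical and chosen only for illustration; the sketch assumes that each externally visible result carries an identifier that is the same at every replica.

```python
class DuplicateFilter:
    """Keeps the first copy of each externally visible result and discards the rest.

    Hypothetical helper: assumes every replica tags a result with the same op_id.
    """

    def __init__(self):
        self._seen = set()      # op_ids for which a result has already been used
        self.delivered = []     # results actually passed on, in arrival order

    def offer(self, op_id, replica_id, result):
        """Return True if this copy is used, False if it is discarded."""
        if op_id in self._seen:
            return False
        self._seen.add(op_id)
        self.delivered.append((op_id, replica_id, result))
        return True


filt = DuplicateFilter()
outcomes = [
    filt.offer(1, "replica-B", "ok"),    # first copy for operation 1: used
    filt.offer(1, "replica-A", "ok"),    # later copy for operation 1: discarded
    filt.offer(2, "replica-A", "done"),  # first copy for operation 2: used
    filt.offer(2, "replica-B", "done"),  # later copy for operation 2: discarded
]
```

Whichever replica happens to respond first for a given operation supplies the result, which is why no replica needs a distinguished role in this sketch.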
The most challenging aspect of replication is maintaining strong replica consistency as methods are invoked on the replicas, as the states of the replicas change dynamically, and as faults occur. Strong replica consistency means that, for each method invocation or operation, and for each data access within that method invocation or operation, the replicas obtain the same values for the data. Moreover, for each result, message sent or request made to other processes, objects or components, the replicas generate the same result, message or request.
Many application programs written in modern programming languages (such as C++, Java, etc.) involve multithreading, which is a source of non-determinism. Unless it is properly handled, non-determinism can lead to inconsistency in the states of the replicas. To maintain strong replica consistency, it is necessary to sanitize or mask such sources of non-determinism, i.e., to render the replicated application program virtually deterministic. A virtually deterministic replicated application program is an application program that exists as two or more replicas and that may involve non-deterministic decisions but, for those non-deterministic decisions that affect the state of the replicas at the end of each method invocation, the replicas make the same non-deterministic decisions.
Many fault-tolerant systems based on active replication employ a multicast group communication system. Examples of such a multicast group communication system are Isis (K. P. Birman and R. van Renesse, Reliable Distributed Computing Using the Isis Toolkit, IEEE Computer Society Press, 1994, incorporated herein by reference) and Totem (L. E. Moser, P. M. Melliar-Smith, D. A. Agarwal, R. K. Budhia and C. A. Lingley-Papadopoulos, Totem: A fault-tolerant multicast group communication system, Communications of the ACM, vol. 39, no. 4, April 1996, pp. 54-63, incorporated herein by reference). Such a multicast group communication system delivers messages reliably and in the same order (linear sequence) to all of the members of the group, i.e., to all of the replicas of the process, object or component.
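The delivery guarantee described above (every member of the group receives the same messages in the same linear sequence) can be sketched with an in-process stand-in. This is not how Isis or Totem are implemented; the classes `SequencedGroup` and `Replica`, and the message strings, are hypothetical names used only to show the property.

```python
class SequencedGroup:
    """In-process stand-in for a reliable, totally ordered multicast group."""

    def __init__(self):
        self._members = []
        self._next_seq = 0

    def join(self, member):
        self._members.append(member)

    def multicast(self, payload):
        # Assign one sequence number per message, then deliver that message
        # to every member before the next message is sequenced.
        seq = self._next_seq
        self._next_seq += 1
        for member in self._members:
            member.deliver(seq, payload)


class Replica:
    """Records the messages it is given, in delivery order."""

    def __init__(self):
        self.log = []

    def deliver(self, seq, payload):
        self.log.append((seq, payload))


group = SequencedGroup()
replicas = [Replica() for _ in range(3)]
for replica in replicas:
    group.join(replica)
for message in ["credit 5", "debit 2", "credit 1"]:
    group.multicast(message)
```

Because every replica sees an identical log, replicas that process the log deterministically reach identical states.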
For replicated unithreaded application programs where the replicas are distributed on multiple computers, a reliable ordered multicast group communication system can be used to maintain strong replica consistency, provided that the order in which messages are delivered is the only source of non-determinism. For multithreaded application programs, the problem of maintaining strong replica consistency is more difficult because two threads in a replica can access a shared resource, typically shared data, in an order different from the order in which the corresponding threads in another replica access their copies of the shared data; consequently, the states of the replicas can diverge and become inconsistent.
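The divergence described above can be shown with a small deterministic simulation. No real threads are used; each list simply stands for the order in which two threads happened to access the shared data in one replica. The function names and the starting value are hypothetical.

```python
def run_replica(initial, access_order):
    """Apply the operations to the shared data in the given order."""
    state = initial
    for operation in access_order:
        state = operation(state)
    return state


def add_ten(value):     # access performed by one thread
    return value + 10


def double(value):      # access performed by the other thread
    return value * 2


# Both replicas start in the same state and perform the same two accesses,
# but the threads reach the shared data in a different order in each replica.
replica_1_state = run_replica(1, [add_ten, double])   # (1 + 10) * 2
replica_2_state = run_replica(1, [double, add_ten])   # (1 * 2) + 10
```

The two replicas end in states 22 and 12 respectively, even though each executed exactly the same operations.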
For multithreaded application programs, if two threads within a process, object or component share data between them, only one of those threads may access that shared data at a time. Therefore, the shared data must be protected with a mutual exclusion construct, commonly referred to as a mutex. A thread must be granted the mutex before it enters the critical section of code within which it can access the shared data. When the thread has finished accessing the shared data, it must release the mutex and leave the critical section. To maintain strong replica consistency, the threads in the replicas must be granted the mutexes in the same order, so that they enter the critical sections and access the shared data within the critical sections in the same order.
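The requirement that mutexes be granted in the same order at every replica can be sketched as follows. The sketch assumes that a grant order is already known to every replica as a fixed list; how the replicas come to agree on that order is outside the scope of this passage and is not shown. `OrderedMutex`, the thread names and the operations are hypothetical.

```python
import threading


class OrderedMutex:
    """Mutex that grants itself to named threads in a prescribed order.

    Hypothetical helper: grant_order is assumed to be identical at every replica.
    """

    def __init__(self, grant_order):
        self._order = list(grant_order)
        self._next = 0                       # index of the next grant to make
        self._cond = threading.Condition()

    def acquire(self, name):
        with self._cond:
            # Block until the next grant in the prescribed order is for this thread.
            self._cond.wait_for(
                lambda: self._next < len(self._order)
                and self._order[self._next] == name
            )

    def release(self, name):
        with self._cond:
            self._next += 1                  # move on to the next prescribed grant
            self._cond.notify_all()


def run_replica(grant_order):
    """Run one replica with two threads that share a single value."""
    mutex = OrderedMutex(grant_order)
    shared = {"value": 1}

    def worker(name, operation, times):
        for _ in range(times):
            mutex.acquire(name)
            try:
                shared["value"] = operation(shared["value"])   # critical section
            finally:
                mutex.release(name)

    threads = [
        threading.Thread(target=worker, args=("adder", lambda v: v + 10, 2)),
        threading.Thread(target=worker, args=("doubler", lambda v: v * 2, 2)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return shared["value"]


order = ["adder", "doubler", "doubler", "adder"]
states = [run_replica(order) for _ in range(3)]   # three replicas, same grant order
```

With the same grant order, every replica computes ((1 + 10) * 2 * 2) + 10 = 54, regardless of how the operating system schedules the threads.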
There are several prior patents that address multithreaded application programs. In particular, U.S. Pat. Nos. 5,577,261 and 5,794,043, which are incorporated herein by reference, describe the implementation of process management functions, such as the claim(), release(), suspend() and signal() functions. Operations involving those functions are rendered consistent by having each processor claim a global mutex (called GLUPP) before performing any process management operation. Once it has acquired the global mutex, the processor performs the operation and then distributes the results to the other processors before relinquishing the global mutex.
The global mutex, used in those patents, is actually described in U.S. Pat. No. 4,718,002, which is incorporated herein by reference. That patent describes how a mutex can be granted to processors, processes, replicas or threads in a distributed system, but the mechanism requires that one processor should be designated as a distinguished control processor and that the granting of the mutex is determined by that control processor.
U.S. Pat. No. 5,621,885, which is incorporated herein by reference, describes a strategy based on a Primary/Backup approach, in which the Primary replica executes the required operations. When the Primary replica performs an I/O operation, the results of the I/O operation are communicated to the Backup replica, so that the Backup replica performs the same operation as the Primary replica. That strategy requires the replicas to be cast into specific roles of either Primary or Backup replica.
U.S. Pat. Nos. 5,802,265 and 5,968,185, which are incorporated herein by reference, likewise describe a strategy based on a Primary/Backup approach, in which the Primary replica executes the operations required of the computer system. When the Primary replica performs an asynchronous or non-deterministic interaction with the operating system, the results of the interaction with the operating system are communicated to the Backup replica, so that the Backup replica performs the same operation as the Primary replica. Object code editing is the primary mechanism by which the program code is modified and no provisions are made for active replication. U.S. Pat. Nos. 5,802,265 and 5,968,185 are related to the Transparent Fault Tolerance (TFT) system, described below.
The TARGON/32 system (A. Borg, W. Blau, W. Graetsch, F. Herrmann and W. Oberle, Fault tolerance under Unix, ACM Transactions on Computer Systems, vol. 7, no. 1, 1989, pp. 1-24, incorporated herein by reference) is a fault-tolerant version of the Unix operating system. It is based on special hardware that provides a reliable ordered multicast protocol, but is not applicable to distributed systems. Moreover, that strategy requires a distinguished control processor.
The Delta-4 system (M. Chereque, D. Powell, P. Reynier, J. L. Richier and J. Voiron, Active replication in Delta-4, Proceedings of the IEEE 22nd International Symposium on Fault Tolerant Computing, Boston, Mass., 1992, pp. 28-37 and also D. Powell (ed.), Delta-4: A Generic Architecture for Dependable Distributed Computing, Springer-Verlag, 1991, both of which are incorporated herein by reference) supports active, semi-active and passive replication for application programs, but it does not handle non-determinism (in particular, multithreading) for active replication.
The Hypervisor system (T. C. Bressoud and F. B. Schneider, Hypervisor-based fault tolerance, ACM Transactions on Computer Systems, vol. 14, no. 1, 1996, pp. 80-107, incorporated herein by reference) and the Transparent Fault Tolerance (TFT) system (T. C. Bressoud, TFT: A software system for application-transparent fault tolerance, Proceedings of the IEEE 28th Fault-Tolerant Computing Symposium, Munich, Germany, June 1998, pp. 128-137, incorporated herein by reference) both aim for transparency to the application and the operating system. However, the Hypervisor system uses hardware instruction counters to count the instructions executed between two hardware interrupts and the TFT system uses object code editing to modify the program code. Moreover, both of those systems employ a Primary/Backup approach.
Other researchers (J. H. Slye and E. N. Elnozahy, Supporting non-deterministic execution in fault-tolerant systems, Proceedings of the IEEE 26th Fault Tolerant Computing Symposium, Sendai, Japan, June 1996, pp. 250-259, incorporated herein by reference) have introduced a software instruction counter approach, analogous to the hardware instruction counter approach of the Hypervisor system, to count the number of instructions between non-deterministic events in log-based rollback-recovery systems.
P. Narasimhan, L. E. Moser and P. M. Melliar-Smith, Enforcing determinism for the consistent replication of multithreaded CORBA applications, Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems, Lausanne, Switzerland, October 1999, pp. 263-273, incorporated herein by reference, describes a non-preemptive deterministic scheduler strategy that imposes a single logical thread of control on the replicas to maintain strong replica consistency. That strategy, in effect, undoes the multithreading that was programmed into the application program.
Transactional Drago (S. Arevalo, R. Jimenez-Peris and M. Patino-Martinez, Deterministic scheduling for transactional multithreaded replicas, Proceedings of the IEEE 19th Symposium on Reliable Distributed Systems, Nurnberg, Germany, October 2000, pp. 164-173, incorporated herein by reference) also uses a non-preemptive deterministic scheduler but is aimed at transaction processing systems.
It will be appreciated that strategies such as those detailed in U.S. Pat. Nos. 4,718,002, 5,621,885, 5,802,265 and 5,968,185, described above, require casting replicas into Primary or Backup roles. Furthermore, U.S. Pat. Nos. 5,802,265 and 5,968,185 and the TFT system utilize object code editing to modify the program code, and they disclose no mechanisms for active replication.
Therefore, a need exists for a system and method of providing consistent replication of multithreaded applications that is egalitarian and may be transparently implemented. The present invention satisfies that need, as well as others, and overcomes the deficiencies of previously developed active replication strategies.