1. Field of the Invention
The invention relates to software-based fault-tolerant computer systems and, in particular, to multithreaded application programs that are replicated using the leader-follower semi-active and passive replication strategies.
2. Description of Related Art
Fault-tolerant systems are based on entity redundancy (replication) to mask faults and, thus, to provide continuous service to their users. In software fault tolerance, the entities that are replicated are the application programs or parts thereof (processes, objects or components). A fundamental issue in the design and implementation of fault-tolerant systems is that of maintaining the consistency of the states of the replicas.
Distributed systems offer the opportunity for fault tolerance by allowing replicas of the application programs to be hosted on different computers (i.e., in different fault containment regions). In the client-server model of distributed computing, a client invokes a method of a server, typically hosted on a different computer, by sending a request message containing that method invocation and by receiving a reply message from that server. To render an application fault-tolerant, the server is replicated, but the client may also be replicated, particularly in multi-tier and peer-to-peer applications, where a process, object or component acts as both a client and a server.
Fault-tolerant systems support several different replication strategies including semi-active and passive replication, and variations thereof. In both semi-active and passive replication, one of the replicas is distinguished as the Primary replica and the other replicas are called the Backup replicas.
In semi-active replication, all of the replicas of a process, object or component execute each method invoked on the replicas. The Primary replica determines the order in which the methods and other operations are executed and communicates that order to the Backup replicas, which execute the methods and other operations in the same order. If the Primary replica makes a decision regarding a non-deterministic operation (such as the order in which access to a shared resource is granted), it communicates that decision to the Backup replicas, which make the same decision. If the Primary replica fails, a Backup replica takes over as the Primary replica and starts making decisions that the other Backup replicas must follow.
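The leader-follower behavior described above can be illustrated with a minimal sketch. It assumes that the non-deterministic decision is the order in which a mutex is granted to threads. The class names PrimaryReplica and BackupReplica, and the methods grant, receive and try_claim, are hypothetical and are not taken from any cited system.

```python
from collections import defaultdict, deque


class PrimaryReplica:
    """Decides the mutex grant order and records each decision."""

    def __init__(self):
        self.grant_log = []  # ordered decisions to be communicated to Backups

    def grant(self, mutex_id, thread_id):
        # The Primary grants in whatever order its local scheduling produced.
        self.grant_log.append((mutex_id, thread_id))


class BackupReplica:
    """Grants a mutex only in the order dictated by the Primary."""

    def __init__(self):
        self.order = defaultdict(deque)  # per-mutex queue of expected threads
        self.granted = []

    def receive(self, mutex_id, thread_id):
        # A decision communicated by the Primary (hypothetical message).
        self.order[mutex_id].append(thread_id)

    def try_claim(self, mutex_id, thread_id):
        # The claim succeeds only if this thread is next in the Primary's order.
        queue = self.order[mutex_id]
        if queue and queue[0] == thread_id:
            queue.popleft()
            self.granted.append((mutex_id, thread_id))
            return True
        return False  # the caller must wait and retry


primary = PrimaryReplica()
backup = BackupReplica()
for decision in [("m", "T2"), ("m", "T1")]:
    primary.grant(*decision)
    backup.receive(*decision)

# At the Backup, T1 happens to ask first but must wait for T2.
first_try = backup.try_claim("m", "T1")
backup.try_claim("m", "T2")
backup.try_claim("m", "T1")
```

In this sketch the Backup refuses the early claim by T1 and ends with the same grant sequence as the Primary, even though its threads requested the mutex in a different order.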
In passive replication, only the Primary replica executes the methods invoked on the replicas. The state of the Primary replica (values of its variables or attributes) is checkpointed periodically or on demand, and the messages, methods and other operations after the checkpoint are logged. If the Primary replica fails, a Backup replica takes over as the Primary replica. The checkpoint is loaded into the Backup replica and the messages, methods and other operations after the checkpoint are replayed.
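The checkpoint-and-replay recovery described above can be sketched as follows. The sketch assumes a toy state consisting of named counters and operations that are deterministic. The names Counters, PassivePrimary and recover are hypothetical.

```python
import copy


class Counters:
    """Toy replicated state: a dictionary of named counters."""

    def __init__(self):
        self.state = {}

    def apply(self, op):
        name, delta = op
        self.state[name] = self.state.get(name, 0) + delta


class PassivePrimary(Counters):
    """Executes operations, logging those that follow the last checkpoint."""

    def __init__(self):
        super().__init__()
        self.checkpoint = {}
        self.log = []

    def invoke(self, op):
        self.apply(op)
        self.log.append(op)

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()  # only operations after the checkpoint are kept


def recover(checkpoint, log):
    """Load the checkpoint into a Backup and replay the logged operations."""
    backup = Counters()
    backup.state = copy.deepcopy(checkpoint)
    for op in log:
        backup.apply(op)
    return backup


primary = PassivePrimary()
primary.invoke(("x", 1))
primary.invoke(("y", 5))
primary.take_checkpoint()
primary.invoke(("x", 2))

# The Primary fails here; a Backup takes over from the checkpoint and the log.
new_primary = recover(primary.checkpoint, primary.log)
```

Replay reproduces the state of the failed Primary only because every logged operation in this sketch is deterministic. Operations whose outcome depends on thread scheduling would require the additional handling that is the subject of this background discussion.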
A challenging aspect of replication is to maintain strong replica consistency, as methods are invoked on the replicas and the states of the replicas change dynamically, and as faults occur. Strong replica consistency means that, for each method invocation or operation, for each data access within that method invocation or operation, the replicas obtain the same data values. Moreover, for each result, message sent or request made to other processes, objects or components, the replicas generate the same result, message or request.
Many application programs written in modern programming languages (such as C++, Java, etc.) involve multithreading, which is a source of non-determinism. Unless it is properly handled, non-determinism can lead to inconsistency in the states of the replicas. To maintain strong replica consistency, it is necessary to sanitize or mask such sources of non-determinism, i.e., to render a replicated application program virtually deterministic. A virtually deterministic replicated application program is an application program that exists as two or more replicas and that may involve non-deterministic decisions; however, for those non-deterministic decisions that affect the states of the replicas, the replicas must make the same non-deterministic decisions.
U.S. Pat. Nos. 5,577,261 and 5,794,034, which are incorporated herein by reference, describe the implementation of “process management” functions, such as the claim( ), release( ), suspend( ) and signal( ) functions, which are also used by the current invention. Operations involving those functions are rendered consistent by having each processor claim a global mutex (called GLUPP) before performing any “process management” operation. Once it has acquired the global mutex, the processor performs the operation and then distributes the results to the other processors before relinquishing the global mutex.
U.S. Pat. No. 4,718,002, which is incorporated herein by reference, describes how a mutex can be granted to processors, processes, replicas or threads in a distributed system. Each grant of a mutex requires three messages: two messages to claim and grant the mutex, and one message to release the mutex. It should be appreciated that this approach requires the communication of multiple additional messages for claiming, granting and releasing a mutex.
U.S. Pat. No. 5,621,885, which is incorporated herein by reference, describes a strategy based on Primary/Backup replication, in which the Primary replica executes the required operations. When the Primary replica performs an I/O operation, the results of the I/O operation are communicated to the Backup replica, so that the Backup replica performs the same operation as the Primary replica. This strategy is directed at maintaining consistency between Primary and Backup replicas only for I/O operations and does not address inconsistency that arises from multithreading.
U.S. Pat. Nos. 5,802,265 and 5,968,185, which are incorporated herein by reference, are related to the TFT system described below and describe a strategy based on the Primary/Backup approach, in which the Primary replica executes the required operations. When the Primary replica performs an asynchronous or non-deterministic operation, it communicates the results of that operation to the Backup replica, so that the Backup replica performs the same operation as the Primary replica. The teachings of these patents disclose no mechanism for guaranteeing that a Backup replica receives such communication before or concurrently with the communication of results by the Primary replica to an entity external to the system. As a result, the design is exposed to the risk that the Primary replica might perform actions and communicate results of those actions to clients, and subsequently fail without ensuring that the Backup replicas have received the communication from the Primary replica about the operating system interactions. It should be appreciated that such a fault can leave a Backup replica with the obligation of reproducing those actions; however, the Backup replica might lack the necessary information to do so.
The TARGON/32 system (A. Borg, W. Blau, W. Graetsch, F. Herrmann and W. Oberle, Fault tolerance under Unix, ACM Transactions on Computer Systems, vol. 7, no. 1, 1989, pp. 1-24, incorporated herein by reference) provides mechanisms for the Unix operating system that ensure consistent processing by multiple replicas of asynchronous operations and signals, such as the claim( ) and release( ) functions. A designated control processor (the Primary) records a checkpoint immediately before it processes an asynchronous operation. If the control processor fails, a Backup processor restarts from the checkpoint and then processes the asynchronous operation immediately thereafter, ensuring that the Backup processes the operation starting from the same state as the control processor.
The Delta-4 system (D. Powell (ed.), Delta-4: A Generic Architecture for Dependable Distributed Computing, Springer-Verlag, 1991, incorporated herein by reference) provides support for non-deterministic application programs that employ semi-active or passive replication. To provide such support, Delta-4 queues interrupts until the application program executes a polling routine in which the replicas synchronize and agree on the interrupts received and the order in which to process them.
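The idea of deferring interrupts until a polling point can be illustrated with a minimal sketch. The agreement rule used here (process only the interrupts that every replica has received, in ascending order of interrupt number) is an assumption chosen for simplicity and is not asserted to be the protocol used by Delta-4. The names Replica and poll are hypothetical.

```python
class Replica:
    """Queues interrupts on arrival instead of handling them immediately."""

    def __init__(self):
        self.queued = set()
        self.processed = []

    def interrupt(self, irq):
        self.queued.add(irq)  # deferred until the next polling point


def poll(replicas):
    """Polling routine: agree on a common batch and process it in one order."""
    # Assumed rule: only interrupts that every replica has already received.
    common = set.intersection(*(r.queued for r in replicas))
    agreed = sorted(common)  # assumed rule: ascending interrupt number
    for r in replicas:
        r.processed.extend(agreed)
        r.queued -= common
    return agreed


a, b = Replica(), Replica()
a.interrupt(3)
a.interrupt(1)
b.interrupt(1)
b.interrupt(3)
b.interrupt(7)  # not yet seen by replica a

batch = poll([a, b])
```

Although the two replicas receive the interrupts in different arrival orders, both process the same interrupts in the same order at the polling point. Interrupt 7 stays queued at replica b until a later poll.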
The Hypervisor system (T. C. Bressoud and F. B. Schneider, Hypervisor-based fault tolerance, ACM Transactions on Computer Systems, vol. 14, no. 1, 1996, pp. 80-107, incorporated herein by reference) and the Transparent Fault Tolerance (TFT) system (T. C. Bressoud, TFT: A software system for application-transparent fault tolerance, Proceedings of the IEEE 28th Fault-Tolerant Computing Symposium, Munich, Germany, June 1998, pp. 128-137, incorporated herein by reference) use a Primary/Backup approach and aim for transparency to the application and the operating system by utilizing hardware instruction counters to count the instructions executed between two hardware interrupts. The TFT system utilizes object code editing to modify the program code to provide fault tolerance.
Other researchers (J. H. Slye and E. N. Elnozahy, Supporting non-deterministic execution in fault-tolerant systems, Proceedings of the IEEE 26th Fault Tolerant Computing Symposium, Sendai, Japan, June 1996, pp. 250-259, incorporated herein by reference) have introduced a software instruction counter approach, analogous to the hardware instruction counter approach of the Hypervisor system, to count the number of instructions between non-deterministic events in log-based rollback-recovery systems. If a fault occurs, the instruction counts are used to replay the instructions and the non-deterministic events at the same execution points.
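The role of the instruction count in replay can be illustrated with a toy sketch. It assumes a program that is a fixed number of identical "instructions" acting on an accumulator, and an event handler whose effect depends on the accumulator value at the moment of delivery. The functions execute and replay are hypothetical.

```python
def execute(steps, incoming):
    """Run `steps` instructions; `incoming` maps an instruction count to the
    value of a non-deterministic event delivered at that count."""
    acc, log = 0, []
    for count in range(steps):
        if count in incoming:
            # Log the event together with the count at which it was delivered.
            log.append((count, incoming[count]))
            acc = acc * 2 + incoming[count]  # handler depends on current state
        acc += 1  # one deterministic instruction
    return acc, log


def replay(steps, log):
    """Re-run the program, delivering each logged event at its logged count."""
    result, _ = execute(steps, dict(log))
    return result


original, event_log = execute(5, {2: 10})
replayed = replay(5, event_log)

# Delivering the same event one instruction later gives a different state.
shifted, _ = execute(5, {3: 10})
```

In this sketch the replayed run reaches the same final value as the original run, whereas delivering the same event at a different instruction count does not. This is why the counts are logged along with the events.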
Non-preemptive deterministic scheduler strategies also exist that impose a single logical thread of control on the replicas to maintain strong replica consistency (P. Narasimhan, L. E. Moser and P. M. Melliar-Smith, Enforcing determinism for the consistent replication of multithreaded CORBA applications, Proceedings of the IEEE 18th Symposium on Reliable Distributed Systems, Lausanne, Switzerland, October 1999, pp. 263-273, incorporated herein by reference). The effect of this strategy is to undo the multithreading that was programmed into the application program.
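A non-preemptive deterministic scheduler of the general kind described above can be sketched with cooperative threads that run until they voluntarily yield. The sketch assumes round-robin scheduling in a fixed order. It illustrates the general idea and is not the mechanism of the cited paper; the names worker, run_deterministic and replica are hypothetical.

```python
from collections import deque


def worker(name, shared, rounds):
    """A cooperative thread: does one unit of work, then yields control."""
    for i in range(rounds):
        shared.append(f"{name}{i}")
        yield


def run_deterministic(threads):
    """Run threads one at a time in fixed round-robin order, never preempting."""
    ready = deque(threads)
    while ready:
        thread = ready.popleft()
        try:
            next(thread)          # runs until the thread yields
            ready.append(thread)  # back of the queue
        except StopIteration:
            pass                  # thread finished; drop it


def replica():
    shared = []
    run_deterministic([worker("A", shared, 2), worker("B", shared, 3)])
    return shared


first = replica()
second = replica()
```

Every replica produces the same interleaving of updates to the shared list, but only one thread ever runs at a time, which shows how this strategy gives up the concurrency that the application was written to exploit.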
Transactional Drago (S. Arevalo, R. Jimenez-Peris and M. Patino-Martinez, Deterministic scheduling for transactional multithreaded replicas, Proceedings of the IEEE 19th Symposium on Reliable Distributed Systems, Nurnberg, Germany, October 2000, pp. 164-173, incorporated herein by reference) also uses a non-preemptive deterministic scheduler but is configured for use in transaction processing systems.
Therefore, a need exists for systems, software mechanisms, methods, improvements and apparatus for providing strong replica consistency for multithreaded application programs based on semi-active and passive replication that maintain application transparency. The systems, software mechanisms, methods, improvements and apparatus in accordance with the present invention satisfy that need, as well as others, and overcome deficiencies in previously known techniques.