1. Field of the Invention
This invention relates generally to methods of maintaining replica consistency and more particularly to methods of maintaining a consistent view of time for a group of replicas in a fault-tolerant distributed system, wherein each processor has a physical hardware clock and the replicated application program contains clock-related operations.
2. Description of Related Art
One of the biggest challenges of replication-based fault tolerance is maintaining replica consistency in the presence of replica non-determinism (see, D. Powell, editor, “Delta-4: A Generic Architecture for Dependable Distributed Computing”, Springer-Verlag, 1991, incorporated herein by reference). For active replication, it has been recognized that the replicas must be deterministic, or rendered deterministic. Consequently, passive replication, based on the primary/backup approach, has been advocated if the potential for replica non-determinism exists; however, the same replica non-determinism problems that arise for active replication during normal operation arise for passive replication when the primary replica fails.
Clock-related operations, such as invoking the method gettimeofday( ), are one source of replica non-determinism. Clock-related operations are common not only in real-time applications but also in non-real-time applications, such as in the following two examples: (1) the physical hardware clock value is used as the seed of a random number generator to generate unique identifiers such as object identifiers or transaction identifiers; and (2) the physical hardware clock value is accessed when a timeout is required, for example, for timed remote method invocations to prevent extensive delays and by transaction processing systems in two-phase commit and transaction session management.
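The first example above can be illustrated with a minimal sketch. It assumes a hypothetical helper, make_identifier, and made-up clock readings; it is not taken from any particular application. Two replicas that seed a random number generator with clock readings taken at slightly different real times generate different identifiers for the same request.

```python
import random

def make_identifier(clock_value_us: int) -> int:
    """Hypothetical helper: derive a 64-bit identifier from an RNG seeded with a clock reading."""
    rng = random.Random(clock_value_us)  # the clock reading is the seed
    return rng.getrandbits(64)

# Assumed clock readings (microseconds) taken by two replicas while
# processing the same request at slightly different real times.
reading_r1 = 1_000_000_000_000
reading_r2 = 1_000_000_000_250

id_r1 = make_identifier(reading_r1)
id_r2 = make_identifier(reading_r2)

# Reusing the same seed reproduces the same identifier, which is why
# agreeing on a single clock value restores determinism.
id_r1_again = make_identifier(reading_r1)
```

Here id_r1 and id_r1_again are equal because they share a seed, whereas id_r2 is derived from a different seed and therefore does not match.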
Although the primary/backup approach solves the consensus problem for individual clock readings of replicas in a group of replicas, it does not guarantee that the clock readings will always advance forward. If the primary replica that determines the clock readings for the group of replicas crashes, the newly selected primary starts with its own physical hardware clock value for the next clock reading. Because of the differences in the two physical hardware clocks, and the gap in time of the computation for the two replicas, the next clock reading might be earlier than the previous clock reading of the primary replica before it crashed. Clock roll-back can break the causal relationships between events in the distributed system, and can lead to undesirable consequences for the replicated application.
It might also happen that two consecutive clock readings from two different replicas (due to the failure of the original replica) differ too much in the other direction; that is, the second clock reading is too far ahead of the first clock reading. The presence of this fast-forward behavior can lead to unnecessary time-outs in the replicated application.
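The roll-back and fast-forward cases described above can be made concrete with a small sketch. The function name, the tolerance parameter, and all numeric values are assumptions chosen for illustration only. The sketch compares the last reading supplied by the failed primary with the first reading supplied by the new primary.

```python
def classify_failover_reading(previous_reading_ms: int,
                              next_reading_ms: int,
                              elapsed_real_ms: int,
                              tolerance_ms: int) -> str:
    """Classify the first reading after failover relative to the last reading before it."""
    advance = next_reading_ms - previous_reading_ms
    if advance < 0:
        return "roll-back"        # group time appears to move backwards
    if advance > elapsed_real_ms + tolerance_ms:
        return "fast-forward"     # group time jumps ahead of real elapsed time
    return "ok"

last_reading = 100_000   # last reading from the old primary (assumed, ms)
elapsed = 10             # real time between the two readings (assumed, ms)
tolerance = 20           # acceptable excess advance (assumed, ms)

# New primary's hardware clock is 50 ms behind the old primary's clock.
behind = classify_failover_reading(last_reading, last_reading + elapsed - 50, elapsed, tolerance)
# New primary's hardware clock is 500 ms ahead.
ahead = classify_failover_reading(last_reading, last_reading + elapsed + 500, elapsed, tolerance)
# New primary's hardware clock is 5 ms ahead.
close = classify_failover_reading(last_reading, last_reading + elapsed + 5, elapsed, tolerance)
```

With these assumed values, the three cases are classified as roll-back, fast-forward, and ok, respectively.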
The clock roll-back and fast-forward problems associated with the primary/backup approach can be alleviated by closely synchronizing the physical hardware clocks. Clocks can be synchronized in a fairly accurate manner using software-based solutions such as the Network Time Protocol (NTP) or hardware-based solutions such as Global Positioning System (GPS) clocks. However, even exact clock synchronization does not solve the problem of maintaining consistent clocks at the replicas. Note that the fast-forward behavior rarely happens for semi-active replication (discussed herein) because the backup replicas lag behind the primary replica that determines the clock value, assuming that the clocks are synchronized closely enough (see, P. Verissimo, “Ordering and timeliness requirements of dependable real-time programs”, Journal of Real-Time Systems, 7(2):105-128, 1994, incorporated herein by reference).
For distributed applications that run on commercial-off-the-shelf general-purpose operating systems, such as Solaris, Linux or Windows, traditional physical hardware clock synchronization algorithms cannot solve the replica non-determinism problem for clock-related operations. Such traditional clock synchronization algorithms can be found in L. Lamport and P. M. Melliar-Smith, “Synchronizing clocks in the presence of faults”, Journal of the ACM, 32(1):52-78, 1985, incorporated herein by reference; L. Rodrigues, P. Verissimo, and A. Casimiro, “Using atomic broadcast to implement a posteriori agreement for clock synchronization”, in Proceedings of the IEEE 12th Symposium on Reliable Distributed Systems, pages 115-124, Princeton, N.J., October 1993, incorporated herein by reference; T. K. Srikanth and S. Toueg, “Optimal clock synchronization”, Journal of the ACM, 34(3):626-645, 1987, incorporated herein by reference; and P. Verissimo and L. Rodrigues, “A posteriori agreement for fault-tolerant clock synchronization on broadcast networks”, in Proceedings of the IEEE 22nd International Symposium on Fault-Tolerant Computing, pages 527-536, Boston, Mass., July 1992, incorporated herein by reference. One reason that traditional clock synchronization algorithms do not suffice is that such algorithms provide only approximate clock synchronization. Another reason is that the replicas in the group of replicas can read different clock values when they process the same request at different real times due to asynchrony in replica processing and/or scheduling, as shown in FIG. 1. This problem is intrinsic to event-triggered systems, no matter how accurately the clocks are synchronized.
To guarantee replica consistency in the presence of clock-related non-determinism, fault-tolerant systems, such as Mars (see, H. Kopetz, A. Damm, C. Koza, M. Mulazzani, W. Schwabl, C. Senft, and R. Zainlinger, “Distributed fault-tolerant real-time systems: The Mars approach”, IEEE Micro, pages 25-40, February 1989, incorporated herein by reference), have used a lock-step, time-triggered approach. However, the time-triggered approach is not applicable in all circumstances, due to its requirement of a priori scheduling of the operations of the replicated application. In particular, a program cannot read the clock time because no mechanism is provided to ensure precise consistency of the readings of the clocks.
In S. Mullender, editor, “Distributed Systems”, ACM Press, second edition, 1993, incorporated herein by reference, a pre-processing approach has been proposed to render deterministic the computations of the replicas. The pre-processing involves executing a distributed consensus protocol to harmonize the inputs from the environment. In particular, the primary/backup approach is used to cope with non-deterministic reading of clocks for a group of replicas. The physical hardware clock value of the primary replica is returned, and the result is conveyed to all of the backup replicas. The other replicas utilize that clock value, instead of their own physical hardware clock values.
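The primary/backup handling of clock readings described above can be sketched as follows. The Replica class, the group_clock_read function, and the clock values are hypothetical and stand in for whatever messaging the replication infrastructure would actually use.

```python
class Replica:
    """Hypothetical replica with its own physical hardware clock value (microseconds)."""
    def __init__(self, name: str, hardware_clock_us: int):
        self.name = name
        self.hardware_clock_us = hardware_clock_us
        self.returned_readings = []  # values handed back to the application

def group_clock_read(primary: Replica, backups: list) -> int:
    """The primary reads its own clock; the value is conveyed to every backup."""
    value = primary.hardware_clock_us
    primary.returned_readings.append(value)
    for backup in backups:
        # Backups use the primary's value instead of their own hardware clocks.
        backup.returned_readings.append(value)
    return value

primary = Replica("R1", 5_000_120)
backups = [Replica("R2", 5_000_095), Replica("R3", 5_000_180)]
agreed = group_clock_read(primary, backups)
```

Every replica returns the primary's value for this reading, even though the three hardware clocks disagree. The sketch does not address what happens when the primary fails, which is where the roll-back and fast-forward problems arise.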
U.S. Pat. No. 5,001,730, which is incorporated herein by reference, describes a distributed clock synchronization algorithm for address-independent networks. Synchronization is achieved by using the fastest clock in the network as the master clock against which all other clocks in the network are synchronized. Each node sends a message to all of the other nodes in the network when its timer times out. If a node receives a message with a higher clock time than its own before it sends a message, that node does not send its message. However, no mechanism is provided to ensure that all nodes receive the same message first and, thus, that patent does not ensure consistent readings of the clocks.
U.S. Pat. No. 5,041,966, which is incorporated herein by reference, defines three partially distributed methods for performing clock synchronization. The general concept is that M processors, randomly selected out of N processors, cooperate to adjust the clocks of all processors in the distributed system. In the first method, all processors randomly select M processors at different time instants, and each processor adjusts its clock to an average of the local times of the M processors. In the second method, each processor transmits its own local time to M randomly selected processors and adjusts its own clock to the average of the local times it receives. In the third method, all processors adjust their clocks to the average of the local times received from M randomly selected processors. The methods consider fault tolerance, but they make no attempt to ensure consistent readings of the clock.
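The averaging idea common to these methods can be illustrated with a simplified sketch. It is an assumption-laden illustration loosely following the first method, not the procedure of the cited patent: the function name, the sample values, and the fixed random seed are all hypothetical, and message exchange and timing are ignored.

```python
import random

def adjust_by_random_average(local_times: list, m: int, rng: random.Random) -> list:
    """For each processor, return the mean of m randomly sampled local times."""
    adjusted = []
    for _ in local_times:
        sample = rng.sample(local_times, m)  # each processor draws its own sample
        adjusted.append(sum(sample) / m)
    return adjusted

local_times = [1000, 1004, 998, 1010, 1001]  # assumed local clock values (ms)
rng = random.Random(0)                       # fixed seed so the sketch is repeatable
adjusted = adjust_by_random_average(local_times, m=3, rng=rng)
```

Each adjusted value lies between the smallest and largest local time, so the clocks are drawn together. Because each processor may draw a different sample, however, nothing forces the adjusted values to be identical, which is consistent with the observation that such methods do not ensure consistent readings.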
U.S. Pat. No. 5,530,846, which is incorporated herein by reference, describes a method for accommodating discrete clock synchronization adjustments, while maintaining a continuous logical clock that amortizes the adjustments at a predetermined rate. Two logical clocks are used to decouple clock synchronization from clock amortization. One logical clock is discretely synchronized to an external time reference, and a second logical clock is adjusted with amortization to provide a continuous monotonically non-decreasing logical clock. Again, the method makes no attempt to ensure consistent readings of the clock.
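The notion of amortizing a discrete adjustment can be sketched as follows. This is a generic illustration under stated assumptions, not the method of the cited patent: the AmortizedClock class, its rate parameter, and the sample values are hypothetical. A pending correction is absorbed gradually, at no more than a fixed fraction of elapsed time, so the reported value never decreases as long as that fraction is below one.

```python
class AmortizedClock:
    """Hypothetical clock that spreads discrete adjustments over later readings."""
    def __init__(self, rate: float):
        self.rate = rate        # max correction per unit of elapsed time (assumed < 1)
        self.pending = 0.0      # correction not yet applied
        self.last_raw = None
        self.value = None

    def adjust(self, delta: float) -> None:
        """Record a discrete synchronization adjustment to be amortized."""
        self.pending += delta

    def read(self, raw: float) -> float:
        """Return the amortized time for a raw (non-decreasing) clock reading."""
        if self.value is None:
            self.value, self.last_raw = raw, raw
            return self.value
        elapsed = raw - self.last_raw
        self.last_raw = raw
        limit = self.rate * elapsed
        step = max(-limit, min(limit, self.pending))  # bounded share of the correction
        self.pending -= step
        self.value += elapsed + step
        return self.value

clock = AmortizedClock(rate=0.5)
readings = [clock.read(0)]
clock.adjust(-8)  # the external reference says the clock is 8 units ahead
readings += [clock.read(10), clock.read(20), clock.read(30)]
```

With these assumed values, the readings are 0, 5, 12 and 22: the 8-unit backward correction is fully absorbed by the last reading without the clock ever moving backwards.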
U.S. Pat. No. 5,689,688, which is incorporated herein by reference, describes two methods for synchronizing local times, maintained at nodes within a network, with a reference time. The active method is a handshaking scheme in which synchronization is initiated by the node requiring synchronization and involves an exchange of messages between the node and the reference time source, producing a synchronized time and a maximum error. The passive method involves a reference time source that broadcasts a burst of reference-time synchronization messages; a node listens for the messages, updating its local time and maximum error. Individual nodes are synchronized independently and there is no mechanism to ensure consistent readings of the clock.
U.S. Pat. No. 6,157,957, which is incorporated herein by reference, describes a clock synchronization system and method for a communication network, consisting of multiple nodes that transfer data over communication links. The nodes exchange timing information with a master node that has a master clock against which the local clocks of the nodes are to be synchronized. At predefined moments in time, each node exchanges timing information with the master node, calculates timing data and stores the timing data in a sequence of timing data, called its history. After at least two exchanges, the method calculates parameters from the history, stores them and uses them to compute a continuous conversion function. The continuous conversion function converts the local time into the master time with a pre-specified and guaranteed precision that is nevertheless only approximate. No mechanism is provided to guarantee consistent readings of the clock.
FIG. 1A shows two replicas, R1 10 and R2 12, that both process the same messages and are required to maintain consistency between their states, their processing and their results. Each replica is supported by a replication infrastructure 14, 16, and each such infrastructure contains a queue of unprocessed messages 18, 20. Because of communication delays and differences in processing speeds, the two replicas do not perform the same operations at exactly the same real time. In FIG. 1A, replica R1 is processing 22 request message number 5 while replica R2 is still processing 24 request message number 3 when request message number 8 is received and queued at both replicas R1 and R2 26, 28. Even though request message number 8 is received simultaneously at both replicas, the message is likely to be processed at different real times by the two replicas.
In FIG. 1B, the processing of request message number 8 invokes the gettimeofday( ) method 34, 36 of the operating system to read the physical hardware clock. Because the two replicas R1 30 and R2 32 process the request message at different real times, the replicas can receive different values for the time from the gettimeofday( ) method, even if the two clocks are perfectly synchronized. If the two replicas process two different values for the time, their states and results can diverge, thus destroying replica consistency. It is essential that the gettimeofday( ) methods in the two replicas yield exactly the same values for the time, even if their corresponding physical hardware clock readings yield different real times.
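The divergence described for FIG. 1B can be sketched in a few lines. The clock model, the handle_request function, the deadline, and the times are assumptions made for illustration. Both replicas read the same, perfectly synchronized clock, but at different real times.

```python
def synchronized_clock(real_time_ms: int) -> int:
    """Idealized clock: every replica's clock equals real time exactly (assumption)."""
    return real_time_ms

def handle_request(clock_reading_ms: int, deadline_ms: int) -> str:
    """Hypothetical request handler whose outcome depends on the clock reading."""
    return "timeout" if clock_reading_ms > deadline_ms else "commit"

deadline_ms = 1020  # assumed deadline carried by request message number 8

# R1 reaches message 8 at real time 1000 ms; R2, still busy with earlier
# messages, reaches it at real time 1040 ms.
r1_reading = synchronized_clock(1000)
r2_reading = synchronized_clock(1040)

r1_result = handle_request(r1_reading, deadline_ms)
r2_result = handle_request(r2_reading, deadline_ms)
```

Replica R1 commits while replica R2 times out, so their states diverge even though the clocks themselves never disagree.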
Therefore, a need exists, as outlined above, for a method of providing a consistent time service for fault-tolerant distributed systems based on replication in order to maintain replica consistency. The present invention satisfies that need, as well as others, and overcomes clock-related sources of replica non-determinism and replica inconsistency.