A distributed system that uses the protocol known as virtual synchrony (i.e., one operating in a virtual synchrony environment) comprises a plurality of process groups, each of which comprises a plurality of processes. The processes of a group are typically distributed among two or more computers so that if one computer fails, the entire process group does not fail. Processes and process groups are configured for managing and executing application programs, and for transmitting messages between the process groups and processes.
Virtual synchrony ensures that a message transmitted to a plurality of destination processes is received by either all or none of the destination processes. Virtual synchrony furthermore ensures that messages delivered to a set of destination processes are delivered in a specified order at all destinations. In a system using virtual synchrony, that order is maintained even when messages destined for other processes are interspersed with the ordered messages. Several message orders may be specified, generally FIFO (First-In-First-Out), causal, or total order.
FIFO order means that messages from a given source are delivered in the order in which that source transmitted them, without any specified ordering between messages from different sources. So, if message source A transmits messages A1 and A2 in that order, and message source B transmits messages B1 and B2 in that order, each destination may deliver A1, A2, B1, and B2 to applications on the respective destinations in any order, so long as A1 is delivered before A2 and B1 is delivered before B2; for example, A1, A2, B1, B2 or B1, A1, B2, A2.
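The FIFO condition described above can be checked mechanically. The following is a minimal sketch only, written in Python for illustration; the function name `respects_fifo` and the dictionary layout are assumptions made here and are not part of any described system.

```python
def respects_fifo(delivery, sent_by_source):
    """Return True if `delivery` keeps each source's messages in send order.

    `delivery` is the sequence of message ids delivered at one destination.
    `sent_by_source` maps a source name to the ids it sent, in send order.
    (Both layouts are illustrative assumptions.)
    """
    position = {msg: i for i, msg in enumerate(delivery)}
    for sent in sent_by_source.values():
        # Positions, in the delivery sequence, of this source's messages.
        indices = [position[m] for m in sent if m in position]
        if indices != sorted(indices):
            return False  # some later message was delivered before an earlier one
    return True


sent = {"A": ["A1", "A2"], "B": ["B1", "B2"]}
ok_sequential = respects_fifo(["A1", "A2", "B1", "B2"], sent)
ok_interleaved = respects_fifo(["B1", "A1", "B2", "A2"], sent)
bad_swapped = respects_fifo(["A2", "A1", "B1", "B2"], sent)
```

Under these assumptions, the first two delivery sequences satisfy FIFO order, while the third does not because A2 precedes A1.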
Causal order means that a message may not be delivered before any cause of the message is delivered. For example, a process A may transmit a message A1 to both a process B and a process C. Message A1 causes process B to transmit message B1 to process C. If messages are delivered in causal order, then message A1 must be delivered to process C before message B1 because B1 was caused by A1. Without this guarantee, B1 could reach process C before A1. Such out-of-order arrival may happen in distributed systems due to transmission delays, loss of messages in the network causing retransmission, scheduling delays on processors, or many other network problems.
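One way to picture causal delivery is a receiver that holds back a message until every message that caused it has been delivered. The sketch below is illustrative only and assumes each message carries an explicit list of its causes; practical systems commonly track causality by other means (for example, vector timestamps). The class `CausalReceiver` and its methods are hypothetical names.

```python
class CausalReceiver:
    """Hold back each message until all of its declared causes are delivered."""

    def __init__(self):
        self.delivered = []      # message ids, in delivery order
        self._seen = set()       # ids already delivered
        self._held = []          # (msg_id, causes) pairs still waiting

    def receive(self, msg_id, causes=()):
        # A message arriving from the network is queued first, then we
        # deliver everything whose causes are now satisfied.
        self._held.append((msg_id, frozenset(causes)))
        self._drain()

    def _drain(self):
        progress = True
        while progress:
            progress = False
            for entry in list(self._held):
                msg_id, causes = entry
                if causes <= self._seen:
                    self._held.remove(entry)
                    self._seen.add(msg_id)
                    self.delivered.append(msg_id)
                    progress = True


# Process C receives B1 (caused by A1) before A1 itself arrives.
process_c = CausalReceiver()
process_c.receive("B1", causes=["A1"])
held_back = list(process_c.delivered)   # B1 is not delivered yet
process_c.receive("A1")
final_order = list(process_c.delivered)
```

In this sketch B1 is held until A1 arrives, so process C delivers A1 and then B1 even though the network presented them in the opposite order.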
Total order means that each destination process delivers its messages in exactly the same order as any other process delivering the same set, or any shared subset, of messages. For example, suppose there are processes A, B, and C, where message X1 is sent to A and B, message X2 is sent to B and C, and message X3 is sent to all three processes. Any order may be selected as long as A and B deliver X1 and X3 in the same order and B and C deliver X2 and X3 in the same order.
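The pairwise condition in the example above can be expressed as a short check: for every pair of processes, the messages they both deliver must appear in the same relative order. This is a sketch under assumed names (`respects_total_order`, a dictionary of per-process delivery lists) and is not drawn from any particular implementation.

```python
from itertools import combinations


def respects_total_order(deliveries):
    """Return True if every pair of processes agrees on the order of shared messages.

    `deliveries` maps a process name to the message ids it delivered, in order
    (an illustrative layout, not a prescribed one).
    """
    for (_, seq_p), (_, seq_q) in combinations(deliveries.items(), 2):
        shared = set(seq_p) & set(seq_q)
        # Restrict each sequence to the shared messages and compare.
        if [m for m in seq_p if m in shared] != [m for m in seq_q if m in shared]:
            return False
    return True


consistent = respects_total_order({
    "A": ["X1", "X3"],
    "B": ["X1", "X2", "X3"],
    "C": ["X2", "X3"],
})
inconsistent = respects_total_order({
    "A": ["X3", "X1"],          # A disagrees with B about X1 and X3
    "B": ["X1", "X2", "X3"],
    "C": ["X2", "X3"],
})
```

The first set of deliveries satisfies the condition; the second does not, because A and B deliver X1 and X3 in different orders.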
Theoretically, these message orders can be applied in a mutually exclusive manner. In practice, though, they are generally inclusive: causal order implies FIFO order, and total order implies both causal and FIFO order.
Virtual synchrony with total order has been demonstrated to work very well within local area networks (LANs) using systems such as Totem. Such networks can be extended to wide area networks (WANs) using the system described in U.S. patent application Ser. No. 09/213,682, filed Dec. 17, 1998, entitled “Method and Apparatus to Extend the Fault-Tolerant Abilities of a Node into a Network,” issued Apr. 9, 2002 in the name of Law, Jr., as U.S. Pat. No. 6,370,654, which is hereby incorporated in its entirety by reference herein. Local Totem networks can be made fault tolerant using redundant communication fabrics, as discussed in greater detail in U.S. patent application Ser. No. 09/477,784, filed Dec. 31, 1999, entitled “Redundant Communication Fabrics for Enhancing Fault Tolerance in Totem Networks,” issued Apr. 22, 2003 in the name of Minyard, as U.S. Pat. No. 6,553,508, which is hereby incorporated in its entirety by reference herein. However, the system of U.S. Pat. No. 6,370,654 is not tolerant of the failure of a router or of a point-to-point communication link.
Accordingly, there is a need for a system and a method which will enable virtual synchrony to be extended to wide area networks while maintaining fault-tolerant properties.