The invention relates generally to communication networks and, more particularly, to a system and method for extending virtual synchrony to wide area networks.
A distributed system utilizing a protocol referred to as virtual synchrony (i.e., operating in a virtual synchrony environment) comprises a plurality of process groups, each of which process groups comprises a plurality of processes. Processes are typically distributed among two or more computers so that if one computer fails, the entire process group does not fail. Processes and process groups are configured for managing and executing application programs, and for transmitting messages between the process groups and processes.
Virtual synchrony ensures that a message transmitted to a plurality of destination processes is received by either all or none of the destination processes. Virtual synchrony, furthermore, ensures that messages delivered to a set of destination processes are delivered in a specified order to all destinations. In a system using virtual synchrony, the message order is maintained even when messages destined for other processes are interspersed among them. Several message orders may be specified, the most common being FIFO (First-In-First-Out), causal, and total order.
FIFO order means that the messages will be delivered in the order they were transmitted but without any specified ordering between messages from different sources. So, if message source A transmits messages A1 and A2 in that order, and message source B transmits message B1 and B2 in that order, each destination may deliver A1, A2, B1, and B2 to applications on the respective destinations in any order, so long as A1 is delivered before A2 and B1 is delivered before B2, such as A1, A2, B1, B2; or B1, A1, B2, A2; etc.
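The FIFO rule in the example above can be checked mechanically. The following is a minimal sketch in Python, offered for illustration only; the function name respects_fifo and the dictionary layout are assumptions and do not appear in the referenced systems.

```python
def respects_fifo(delivered, sent_by_source):
    """Return True if every source's messages appear in `delivered`
    in the same order in which that source transmitted them.

    Hypothetical helper: `delivered` is the sequence one destination
    handed to its application; `sent_by_source` maps a source name to
    the list of messages it transmitted, in transmission order.
    """
    for sent in sent_by_source.values():
        wanted = set(sent)
        # Keep only this source's messages, preserving delivery order.
        seen = [m for m in delivered if m in wanted]
        # All of the source's messages must be present and in sent order.
        if seen != list(sent):
            return False
    return True


sent = {"A": ["A1", "A2"], "B": ["B1", "B2"]}
fifo_ok_1 = respects_fifo(["A1", "A2", "B1", "B2"], sent)
fifo_ok_2 = respects_fifo(["B1", "A1", "B2", "A2"], sent)
fifo_bad = respects_fifo(["A2", "A1", "B1", "B2"], sent)
```

The first two sequences are the ones named in the example and satisfy the rule; the third delivers A2 before A1 and does not. Nothing in the check constrains how messages from source A interleave with messages from source B.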
Causal order means that a message may not be delivered before any cause of the message is delivered. For example, a process A may transmit a message A1 to both a process B and a process C. Message A1 causes process B to transmit message B1 to process C. If messages are delivered in causal order, then message A1 must be delivered to process C before message B1 because B1 was caused by A1. Without causal ordering, message B1 could arrive at process C ahead of message A1. This type of problem may arise in distributed systems due to transmission delays, loss of messages in the network causing retransmission, scheduling delays on processors, or many other network problems.
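One common way to enforce causal delivery is for the receiver to hold back a message until its causes have been delivered. The sketch below illustrates that idea in Python under a simplifying assumption: each message explicitly lists the identifiers of the messages that caused it. The class name CausalReceiver and its methods are hypothetical, and real systems typically track causality with vector timestamps rather than explicit lists.

```python
class CausalReceiver:
    """Hold-back queue: deliver a message only after all of its
    declared causes have been delivered (illustrative sketch)."""

    def __init__(self):
        self.delivered = []          # messages handed to the application
        self._delivered_ids = set()
        self._held = []              # (msg_id, causes) awaiting delivery

    def receive(self, msg_id, depends_on=()):
        """Accept a message from the network, then deliver what is ready."""
        self._held.append((msg_id, frozenset(depends_on)))
        self._flush()

    def _flush(self):
        # Repeat until no held message becomes deliverable.
        progress = True
        while progress:
            progress = False
            for entry in list(self._held):
                msg_id, causes = entry
                if causes <= self._delivered_ids:
                    self._held.remove(entry)
                    self.delivered.append(msg_id)
                    self._delivered_ids.add(msg_id)
                    progress = True


# Process C receives B1 first because A1 was delayed in the network.
process_c = CausalReceiver()
process_c.receive("B1", depends_on={"A1"})
held_back = list(process_c.delivered)   # B1 is not yet delivered
process_c.receive("A1")
final_order = list(process_c.delivered)
```

In this run B1 arrives first but is held until A1 arrives, so the application at process C sees A1 and then B1, as the example requires.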
Total order means that each destination process may deliver all its messages in exactly the same order as any other process delivering the same set or any shared subset of messages. Suppose we have processes A, B, and C and message X1 comes to A and B, X2 comes to B and C, and X3 comes to all the processes. Any order may be selected as long as A and B deliver X1 and X3 in the same order and B and C deliver X2 and X3 in the same order.
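The condition in the example above can be expressed as a pairwise check: any two processes must deliver the messages they share in the same relative order. The Python sketch below is illustrative only; the function name respects_total_order and the data layout are assumptions. It tests exactly the pairwise condition stated above and does not search for ordering cycles that span three or more processes.

```python
from itertools import combinations


def respects_total_order(deliveries):
    """Return True if every pair of processes delivers its shared
    messages in the same relative order.

    Hypothetical helper: `deliveries` maps a process name to the
    sequence of messages that process delivered.
    """
    for (_, seq_p), (_, seq_q) in combinations(deliveries.items(), 2):
        shared = set(seq_p) & set(seq_q)
        order_p = [m for m in seq_p if m in shared]
        order_q = [m for m in seq_q if m in shared]
        if order_p != order_q:
            return False
    return True


# X1 goes to A and B, X2 goes to B and C, X3 goes to all processes.
consistent = respects_total_order({
    "A": ["X1", "X3"],
    "B": ["X1", "X2", "X3"],
    "C": ["X2", "X3"],
})
inconsistent = respects_total_order({
    "A": ["X3", "X1"],           # A disagrees with B on X1 versus X3
    "B": ["X1", "X2", "X3"],
    "C": ["X2", "X3"],
})
```

The first set of deliveries satisfies the condition. The second does not, because A and B deliver X1 and X3 in opposite orders.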
Theoretically, these message orders can be applied in a mutually exclusive manner. In practice, though, they are generally inclusive (causal implies FIFO, total implies causal and FIFO).
Virtual synchrony with total order has been demonstrated to work very well within local area networks (LANs) using systems such as Totem. Such networks can be extended to wide area networks (WANs), using U.S. patent application Ser. No. 09/213,682, filed Dec. 17, 1998, entitled “Method and Apparatus to Extend the Fault-Tolerant Abilities of a Node into a Network,” issued Apr. 9, 2002 in the name of Law, Jr., as U.S. Pat. No. 6,370,654, which is hereby incorporated in its entirety by reference herein. Local Totem networks can be made fault tolerant using redundant communication fabrics as discussed in greater detail in U.S. patent application Ser. No. 09/477,784, filed Dec. 31, 1999, and entitled “Redundant Communication Fabrics for Enhancing Fault Tolerance in Totem Networks”, issued Apr. 22, 2003, in the name of Minyard, as U.S. Pat. No. 6,553,508, which is hereby incorporated in its entirety by reference herein. However, the system of U.S. Pat. No. 6,370,654 is not tolerant of the failure of a router or point-to-point communication link.
Accordingly, there is a need for a system and a method which will enable virtual synchrony to be extended to wide area networks while maintaining fault-tolerant properties.
The present invention, accordingly, provides a system and a method which will enable virtual synchrony to be extended to wide area networks without a single point of failure in the system. In a preferred embodiment of the present invention, a virtual synchrony wide area network has a first local area network (LAN) and a second LAN. A first router and a second router are connected to the first LAN, and a third router and a fourth router are connected to the second LAN. Both LANs are virtual synchrony networks maintaining total order for all messages. A point-to-point link is connected between the first and third routers, between the first and fourth routers, between the second and third routers, and between the second and fourth routers. Each router is provided with computer program code for controlling the flow of messages through the routers.
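The absence of a single point of failure in the topology just described can be illustrated with a small connectivity check. The Python sketch below models only the two LANs, the four routers, and the four point-to-point links; the names LAN1, LAN2, and R1 through R4 are hypothetical labels for the first and second LANs and the first through fourth routers. It is not the computer program code of the routers, and it does not model failures within a LAN itself.

```python
from collections import deque

ROUTERS = ["R1", "R2", "R3", "R4"]
# Router attachments to the LANs.
ATTACHMENTS = [("LAN1", "R1"), ("LAN1", "R2"), ("LAN2", "R3"), ("LAN2", "R4")]
# Point-to-point links between routers on different LANs.
LINKS = [("R1", "R3"), ("R1", "R4"), ("R2", "R3"), ("R2", "R4")]


def lans_connected(failed_router=None, failed_link=None):
    """Return True if LAN1 can still reach LAN2 after the given failure."""
    edges = [
        e for e in ATTACHMENTS + LINKS
        if e != failed_link and failed_router not in e
    ]
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    # Breadth-first search outward from LAN1.
    seen, queue = {"LAN1"}, deque(["LAN1"])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return "LAN2" in seen


survives_any_router_failure = all(lans_connected(failed_router=r) for r in ROUTERS)
survives_any_link_failure = all(lans_connected(failed_link=l) for l in LINKS)
```

Under these assumptions, the two LANs remain connected after the loss of any one router or any one point-to-point link, because at least one router on each LAN and one link between the surviving routers is always left.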