Many networking applications involve large numbers of end-points, i.e., nodes. Such applications may require their components to synchronize states reliably in a highly distributed environment. Well-known examples of the problem include enforcing consistency in a distributed database and maintaining cache coherency in a distributed multiprocessor environment. Although the problem has existed for a long time, the recent exponential growth of the Internet and the proliferation of Internet-related applications have brought it to the foreground and underscored the need for more efficient solutions. Moreover, Internet-related applications can be distributed over thousands of end-points and often operate in real time, complicating straightforward extension of previously known methods. As one illustration, let us take a brief look at the architecture of a network router.
A typical router has a single router controller managing multiple forwarding devices. A single router controller can easily become a performance bottleneck; it is also a single point of failure. High-performance routers of the next generation will likely include multiple router controllers working in parallel. In such an architecture, consistency must be maintained over all router controllers' forwarding tables. This is a classical state synchronization requirement.
Such a synchronization requirement can be naturally supported by an atomic multicast service, which ensures both atomicity and total ordering over all messages sent to the multicast group. By atomicity we mean that any message sent to the group is delivered either to all or to none of the operational group members. Total ordering means that messages are delivered in the same order at all such group members. Note that here we use delivered rather than received. By delivered we mean that a message is passed to applications sitting on top of the atomic multicast service. The order in which messages get delivered can be different from the order in which they are received.
As a fundamental abstraction for building reliable distributed applications, atomic multicast has been widely studied in the field, and has actually been implemented in a number of working systems, such as Isis and Horus. Below we present a brief overview of the previous work.
Isis ABCAST Algorithm
The Isis system is one of the pioneering systems that support atomic multicast. Isis is described in K. P. Birman et al., Lightweight Causal and Atomic Group Multicast, ACM TRANSACTIONS ON COMPUTER SYS., August 1991, and in K. P. Birman & T. Joseph, Reliable Communication in the Presence of Failures, ACM TRANSACTIONS ON COMPUTER SYS., February 1987. Both above-mentioned articles are hereby incorporated by reference as if fully set forth herein.
The Isis ABCAST primitive achieves atomicity and total ordering based on a three-way commit protocol. To send a message from a client/sender, the following steps are performed:
1. A sender transmits the message to all of its destinations.
2. Upon receipt of the message, each recipient assigns it a priority number larger than the priority of any message received but not yet delivered; the recipient then informs the sender of the priority it assigned to the message.
3. The sender collects responses from the recipients that remain operational, computes the maximum value of all the priorities it has received, and sends this value back to all the recipients.
4. The recipients change priority of the message to the value received from the sender; they can then deliver messages in order of increasing priority.
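The four steps above can be sketched in a few lines of Python. This is only an illustrative, single-process simulation under simplifying assumptions (no failures, no real network); the names `Recipient`, `propose`, and `finalize` are hypothetical and do not come from Isis. Ties between equal priorities are broken by message identifier here, which is an assumption of this sketch rather than something stated in the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Recipient:
    """One group member (hypothetical name, illustration only)."""
    max_priority: int = 0                         # largest priority seen so far
    pending: dict = field(default_factory=dict)   # msg_id -> [priority, is_final]
    delivered: list = field(default_factory=list)

    def propose(self, msg_id):
        # Step 2: assign a priority larger than any priority seen so far
        # and report it back to the sender.
        self.max_priority += 1
        self.pending[msg_id] = [self.max_priority, False]
        return self.max_priority

    def finalize(self, msg_id, final_priority):
        # Step 4: adopt the sender's priority, then deliver what is safe.
        self.pending[msg_id] = [final_priority, True]
        self.max_priority = max(self.max_priority, final_priority)
        self._deliver_ready()

    def _deliver_ready(self):
        # Deliver in increasing (priority, msg_id) order, but stop at the
        # first message whose priority is still only a proposal.
        while self.pending:
            msg_id, (_prio, is_final) = min(
                self.pending.items(), key=lambda kv: (kv[1][0], kv[0])
            )
            if not is_final:
                break
            self.delivered.append(msg_id)
            del self.pending[msg_id]


# Two concurrent senders: m1 reaches `a` first, m2 reaches `b` first (step 1).
a, b = Recipient(), Recipient()
pa1, pb2 = a.propose("m1"), b.propose("m2")
pa2, pb1 = a.propose("m2"), b.propose("m1")

# Step 3: each sender takes the maximum of the proposals it collected.
final_m1 = max(pa1, pb1)
final_m2 = max(pa2, pb2)

# The final priority of m2 happens to arrive before that of m1.
for r in (a, b):
    r.finalize("m2", final_m2)
for r in (a, b):
    r.finalize("m1", final_m1)

print(a.delivered, b.delivered)  # both members deliver m1 before m2
```

Although the two recipients receive the messages in opposite orders, both hold back m2 until the priority of m1 is final, so the delivery order is the same at each member.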
A number of factors contribute to the poor scalability of Isis. First, to send a message, the sender has to block until the communication completes. During this period, no other message can be sent. This means that the performance of the entire multicast group is limited by the slowest receiver.
Second, explicit knowledge of group membership is required to ensure reliability. The management of group membership is expensive. Moreover, whenever the group membership changes, the entire group has to block until every member has installed the new view of the group membership. This is undesirable in many cases. For example, in the router context mentioned above, new router controllers are added when the system load is high. Blocking the entire controller group can easily cause disastrous network congestion in this case.
Finally, the overhead for sending a message is relatively high. For each multicast message, three communication steps are required to ensure the proper delivery of the multicast message, even if the communication channel is perfect and no group member fails. Furthermore, an overhead of 2n total messages is involved in the best case, where n is the group size.
For all these reasons, the ABCAST algorithm typically cannot scale to more than 100 members.
Sequencer-Site Algorithms
This class of algorithms is described in, inter alia, M. F. Kaashoek et al., An Efficient Reliable Broadcast Protocol, OPERATING SYS. REV., October 1989, hereby incorporated by reference as if fully set forth herein. Sequencer-site algorithms achieve total ordering by using an elected process, the sequencer, which is responsible for assigning sequence numbers to all multicast messages and then multicasting the messages to the entire group. This algorithm requires a single communication step in the optimal case where the sequencer is also the source of the message, and two steps in all other cases. Because of the high load on the sequencer, the algorithm is considered non-scalable even for medium-size systems.
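As a rough illustration of the sequencer idea, the following Python sketch stamps each message with a sequence number at a single process and lets every member hold back messages that arrive out of sequence. The class names `Sequencer` and `Member` are hypothetical, and the sketch assumes no message loss and no sequencer failure; it is not the protocol of the cited article.

```python
class Sequencer:
    """Elected process that stamps every multicast message (hypothetical)."""

    def __init__(self):
        self.next_seq = 0

    def stamp(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        return seq, payload


class Member:
    """Group member that delivers strictly in sequence-number order."""

    def __init__(self):
        self.expected = 0    # next sequence number to deliver
        self.held = {}       # received but not yet deliverable
        self.delivered = []

    def receive(self, seq, payload):
        self.held[seq] = payload
        # Deliver as long as the next expected message is available.
        while self.expected in self.held:
            self.delivered.append(self.held.pop(self.expected))
            self.expected += 1


sequencer = Sequencer()
stamped = [sequencer.stamp(p) for p in ("x", "y", "z")]

in_order, reordered = Member(), Member()
for seq, payload in stamped:
    in_order.receive(seq, payload)
for seq, payload in reversed(stamped):   # network reorders the messages
    reordered.receive(seq, payload)

print(in_order.delivered, reordered.delivered)  # same order at both members
```

Every message passes through `Sequencer.stamp`, which is the source of the load concentration noted above.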
Rotating-Token Algorithms
These algorithms are described in the following sources: (1) Y. Amir et al., The Totem Single-Ring Ordering and Membership Protocol, ACM TRANSACTIONS ON COMPUTER SYS., November 1995; (2) J. M. Chang and N. Maxemchuk, Reliable Broadcast Protocols, ACM TRANSACTIONS ON COMPUTER SYS., August 1984; (3) Robbert van Renesse et al., Horus: A Flexible Group Communications System, COMM. OF ACM, April 1996; and (4) L. E. Moser et al., Extended Virtual Synchrony, IEEE 14th Int'l Conf. on Distributed Computing Sys., June 1994. These articles are hereby incorporated by reference as if fully set forth herein.
The algorithms in this class are similar to the sequencer-site algorithms, but they rotate the role of the sequencer, i.e., pass the token, among several processes. Thus, before any message can be sent, the sender has to acquire a “token.” The token-holder then places a sequence number on each message it multicasts, and messages that arrive out of sequence are delayed until they can be delivered in order. The rotating-token algorithm alone cannot guarantee message atomicity. It is usually combined with knowledge of group membership to achieve atomic multicast.
Rotating-token algorithms provide load balancing and avoid network contention when shared links are used, as is the case, for instance, in Ethernet-based LANs. Unfortunately, token management usually involves substantial overhead. In addition, in the worst delay case, a client-sender may need to wait for a complete rotation of the token before it can send any messages. This can lead to excessive latency.
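The token mechanism and its latency cost can be illustrated with a small Python sketch. It models the ring as a queue whose first element holds the token, and it counts how many token hops a sender must wait before it may stamp a message. The class `TokenRing` and its `send` method are hypothetical names; real protocols such as those cited above also handle token loss and membership changes, which this sketch ignores.

```python
from collections import deque


class TokenRing:
    """Hypothetical ring in which the token carries the next sequence number."""

    def __init__(self, members):
        self.ring = deque(members)   # ring[0] currently holds the token
        self.next_seq = 0            # sequence counter travelling with the token

    def send(self, sender, payload):
        """Pass the token around until `sender` holds it, then stamp the message.

        Returns (sequence number, payload, token hops waited).
        """
        if sender not in self.ring:
            raise ValueError(f"{sender!r} is not a ring member")
        hops = 0
        while self.ring[0] != sender:
            self.ring.rotate(-1)     # token moves to the next process
            hops += 1
        seq = self.next_seq
        self.next_seq += 1
        return seq, payload, hops


ring = TokenRing(["p0", "p1", "p2", "p3"])
first = ring.send("p0", "a")    # p0 already holds the token: no wait
second = ring.send("p3", "b")   # token must travel p0 -> p1 -> p2 -> p3
third = ring.send("p2", "c")    # p2 just missed the token: nearly a full rotation

print(first, second, third)
```

In a ring of n processes the wait in this model is at most n−1 hops, which corresponds to the near-complete rotation described above as the worst case.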
Symmetric Algorithms
These algorithms are based on Lamport's total order algorithm, described in L. Lamport, Time, Clocks, and the Ordering of Events in a Distributed System, COMM. OF ACM, July 1978. See also L. Rodrigues et al., Totally Ordered Multicast in Large-Scale Systems, IEEE 16TH INT'L CONF. DISTRIBUTED COMPUTING SYS., May 1996. The Lamport and Rodrigues articles are hereby incorporated by reference as if fully set forth herein.
In this scheme, data messages are delivered according to the order defined by the timestamps assigned at multicast time. In order to be live, algorithms in this class require correct processes to multicast messages periodically. Alternatively, an additional communication step is required. Total order can be established in a single communication step when all processes broadcast simultaneously, and in two steps in all other cases. Unfortunately, in such symmetric algorithms, all group members are involved in the communication. This means that the entire system has to cater to the slowest member.
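The following Python sketch illustrates timestamp-based ordering and why liveness depends on every process multicasting. It is a simplified model, not the algorithm of either cited article: it assumes reliable FIFO channels and no failures, orders messages by the pair (timestamp, sender id), and delivers a message only once the process has heard from every other member at that timestamp or later. The class `Process` and its fields are hypothetical names.

```python
import heapq


class Process:
    """Group member using Lamport timestamps (illustrative sketch)."""

    def __init__(self, pid, group_size):
        self.pid = pid
        self.clock = 0
        self.latest = [0] * group_size   # highest timestamp heard per process
        self.queue = []                  # heap of (timestamp, sender, payload)
        self.delivered = []

    def multicast(self, payload):
        self.clock += 1
        msg = (self.clock, self.pid, payload)
        self.receive(msg)                # keep a local copy
        return msg                       # caller forwards it to the others

    def receive(self, msg):
        ts, sender, _payload = msg
        self.clock = max(self.clock, ts)
        self.latest[sender] = ts
        heapq.heappush(self.queue, msg)
        self._deliver()

    def _deliver(self):
        while self.queue:
            ts = self.queue[0][0]
            # Our own future messages will carry a timestamp above self.clock.
            heard = [self.clock if q == self.pid else t
                     for q, t in enumerate(self.latest)]
            if min(heard) < ts:
                break                    # some member is still silent
            self.delivered.append(heapq.heappop(self.queue)[2])


p0, p1, p2 = (Process(i, 3) for i in range(3))
ma = p0.multicast("a")
mb = p1.multicast("b")
p0.receive(mb)
p1.receive(ma)
p2.receive(mb)                           # p2 sees b before a
p2.receive(ma)
stalled = (list(p0.delivered), list(p1.delivered))  # nothing yet: p2 is silent

mc = p2.multicast("c")                   # p2 finally multicasts
p0.receive(mc)
p1.receive(mc)

print(p0.delivered, p1.delivered, p2.delivered)
```

In this run p0 and p1 cannot deliver anything until p2 multicasts, and message c in turn stays queued at every process until p0 and p1 send again, which mirrors the observation that the whole group advances at the pace of its slowest member.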
Chandra and Toueg's Algorithm
This algorithm requires two steps: (1) reliably broadcasting a message; followed by (2) execution of a consensus. See T. D. Chandra & S. Toueg, Unreliable Failure Detectors for Reliable Distributed Systems, J. OF ACM, March 1996. This article is hereby incorporated by reference as if fully set forth herein.
The consensus algorithm relies on a failure detector (⋄S) and requires three communication steps; thus, in the best case, a total of four communication steps are required to run the total order broadcast algorithm. The Chandra-Toueg algorithm requires (n−1)² messages for the first step (reliable broadcast), and 2(n−1)+(n−1)² messages for the second step (consensus execution), for a total of 2(n−1)²+2(n−1) messages, where n is the multicast group size. Clearly, this second-order dependence on group size scales poorly.
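To make the growth concrete, the message counts quoted above, with the exponents read as squares, can be evaluated for a few group sizes. The short Python sketch below does only this arithmetic; the function name `chandra_toueg_messages` is a hypothetical label, and the group sizes are arbitrary examples.

```python
def chandra_toueg_messages(n):
    """Best-case message count for a group of size n, per the formulas above."""
    reliable_broadcast = (n - 1) ** 2
    consensus = 2 * (n - 1) + (n - 1) ** 2
    return reliable_broadcast + consensus


for n in (10, 100, 1000):
    total = chandra_toueg_messages(n)
    # The total 2(n-1)^2 + 2(n-1) simplifies algebraically to 2n(n-1).
    assert total == 2 * n * (n - 1)
    print(n, total)
# prints:
# 10 180
# 100 19800
# 1000 1998000
```

A tenfold increase in group size thus raises the message count roughly a hundredfold.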