Networked computer systems are becoming increasingly popular as they permit different computers to share information. A node is generally a device that is connected as part of a computer network. Not to the exclusion of other devices, as used herein a node is generally understood and appreciated to be a computer.
Designing distributed infrastructure applications such as, for example, memory systems, file systems and group communication services is difficult for a variety of reasons. These include, for example, issues of consistency of the distributed state, the reliability and availability of data in the face of node, link or other component failure, and scalability.
Typically, a rather fundamental aspect of most networks, and especially of distributed infrastructure applications, is the ability of nodes to share in group communication. Informally speaking, group communication generally refers to two services: reliable totally ordered multicast and group membership agreement. These are typically implemented as a single service, as their properties are related.
To further understand the utility of group communication, consider a brokerage or trading system. Brokerage and trading systems typically involve a number of distributed applications and/or systems that must act in concert to execute each and every trade. For example, when conducting a trade it may be necessary to check prices of an equity from a rapidly changing database, check currency rates from another, draw funds from an account, and place an order at the trading floor. Each of these is an independent action, but the actions must be coordinated to successfully conclude the trade. Further, once initiated, these actions must occur reliably, even if the broker's computer or other element of the network system fails partway through the trade execution. A failure to act in such a reliable fashion could, for example, result in the trade occurring without drawing funds from the buyer's account. A key element in the reliability of such a system is ensuring that messages between the interacting applications are delivered reliably and in the proper order. This is a natural setting for a group communication system.
Such a need for coordinated group communication is not limited to complex systems. Consider a message board as may be used by students, colleagues, hobbyists, or other individuals desiring to share information. Generally in such settings it is desired that all users in the group see all the messages posted to the group (during the time they are members of the group), and that the messages are seen in the same order by all members of the group so that replies make sense and appear in proper context.
In theory, a group communication service operates to ensure that all correct members of a group (intended members and members that have not crashed or been disconnected) are aware of and in agreement with all membership changes that happen within the group. For a given group, a current agreed upon identity of the group membership may be called a view—i.e., a view of the group membership at that moment, which will exist until new members join or current members leave.
When a new member joins a group, and thus establishes a new view, it is desirable that the member receive each and every message sent to the group from the time of its joining. It is also highly desirable for all members of the group to receive the same messages in the same total order. That is, through the communication system, each member receives its messages in exactly the same order as every other member of the same group or subset of the group.
Total order among messages means that each message in a set of messages either comes before or after any other message in the set. For example, if group members X, A, and W each broadcast messages M(X), M(A), and M(W) respectively, then the group communication system may choose any total order in which to deliver this set of messages. One such order is M(X) before M(A) before M(W); thus, all members will receive the messages in that order.
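As an illustration only, the ordering property described above can be expressed as a check over per-member delivery logs. The following Python sketch is not part of any described system; the function name `consistent_order` and the log layout (a mapping from member name to the list of messages that member delivered) are assumptions made for this example. It checks ordering only, not whether every message was delivered.

```python
from itertools import combinations
from typing import Dict, List


def consistent_order(logs: Dict[str, List[str]]) -> bool:
    """Return True if any two members agree on the relative order
    of every message that both of them delivered."""
    for log_a, log_b in combinations(logs.values(), 2):
        common = set(log_a) & set(log_b)
        # Restrict each log to the shared messages, keeping delivery order.
        shared_a = [m for m in log_a if m in common]
        shared_b = [m for m in log_b if m in common]
        if shared_a != shared_b:
            return False
    return True


# All three members deliver M(X), then M(A), then M(W).
agreed = {
    "X": ["M(X)", "M(A)", "M(W)"],
    "A": ["M(X)", "M(A)", "M(W)"],
    "W": ["M(X)", "M(A)", "M(W)"],
}

# Member W delivers M(A) before M(X), so the total order is violated.
diverged = {
    "X": ["M(X)", "M(A)", "M(W)"],
    "A": ["M(X)", "M(A)", "M(W)"],
    "W": ["M(A)", "M(X)", "M(W)"],
}

agreed_ok = consistent_order(agreed)
diverged_ok = consistent_order(diverged)
```

Comparing only the messages two members have in common allows for a member that joined late and therefore holds a shorter log, while still requiring agreement on everything both members received.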
Various attempts have been undertaken to provide group communication systems and services, and the resulting systems are generally large and complex. Frequently these systems rely on one dedicated node as a gatekeeper, either to order the messages or through which all messages must pass. In other systems, the node sending a message is responsible for coordinating the delivery to every other node member of the group, which of course imposes additional overhead and tracking upon the sending node and may interrupt the activities of the receiving node. Gatekeepers and single access points impose significant constraints upon a group communication system in terms of scalability and reliability.
A significant aspect in attempting to implement a group communication system or method is to ensure that (A) group members receive the same messages, despite a lossy network, and (B) group members receive messages in the same order, despite concurrent sending of messages. Should some group members receive only some messages or messages in a different order, system instability, data corruption, and/or unintended system operations are likely to occur.
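For context, the gatekeeper arrangement described above can be sketched roughly as follows. This Python example is a simplified, hypothetical illustration of the general prior approach and not of any particular system; the class names `Sequencer` and `Receiver` and the hold-back buffer are assumptions introduced here. One node stamps every message with a sequence number, and each receiver holds back messages that arrive early until the gap before them is filled.

```python
from typing import Dict, List, Tuple


class Sequencer:
    """Hypothetical gatekeeper node: assigns the next sequence number to each message."""

    def __init__(self) -> None:
        self.next_seq = 0

    def stamp(self, msg: str) -> Tuple[int, str]:
        stamped = (self.next_seq, msg)
        self.next_seq += 1
        return stamped


class Receiver:
    """Hypothetical group member: delivers messages strictly in sequence order."""

    def __init__(self) -> None:
        self.expected = 0                 # next sequence number to deliver
        self.pending: Dict[int, str] = {}  # messages that arrived early
        self.delivered: List[str] = []

    def receive(self, stamped: Tuple[int, str]) -> None:
        seq, msg = stamped
        self.pending[seq] = msg
        # Deliver every message that is now contiguous with what was delivered.
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1


sequencer = Sequencer()
stamped = [sequencer.stamp(m) for m in ("M(X)", "M(A)", "M(W)")]

r1, r2 = Receiver(), Receiver()
for s in stamped:            # arrives in order
    r1.receive(s)
for s in reversed(stamped):  # arrives in reverse order
    r2.receive(s)
```

Both receivers end with the same delivery order even though the network delivered the messages differently. Every message must pass through the single `Sequencer`, however, which is the kind of single access point whose scalability and reliability limits are noted above.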
It is also generally desirable for members of the group to add and read only complete messages. Consider a group message such as “Sell XYZ stock and buy ABC stock.” Should only the first part of the message “Sell XYZ stock” be transmitted to the group, or one or more group members only read the first part of the message, the failure to buy ABC stock may well have negative consequences. It is therefore often extremely important to control write operations in such a way that other nodes do not inadvertently receive partial data or data believed to be current when in fact the write operation is still ongoing.
Hence, there is a need for a group communication system and method that overcomes one or more of the drawbacks identified above.