The approaches described in this section could be pursued, but are not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
Reliable Group Communication
Certain telecommunications systems comprise a distributed plurality of processing nodes, and offer reliable multicast group communication. In these systems, group communication is implemented over a multicast framework, and a transport layer guarantees reliable delivery of multicast messages. In this context, the term “distributed application” refers to applications that implement group communication. Therefore, a distributed application is an application that consists of the set of all associated programs running in different nodes of the system, communicating through group communication.
Conceptually, each node in the distributed system consists of applications that interface to a group messaging layer. The group messaging layer maintains information representing the concept of a group, and has a presence on every node of the system. The group messaging layer is responsible for the in-order delivery of messages to members of a group that are local to the group messaging layer, as well as delivering messages from its members to other members on the same or remote nodes in the system.
The group messaging layer logically interfaces to a transport layer that implements the reliable delivery of group messages. That is, the reliability of multicast messages is implemented by a transport subsystem that manages the acknowledgement of the reception of a message at a node, the retransmission of a message, and performs other operations. Each layer may be implemented as one or more software elements, firmware elements, hardware elements, or a combination thereof.
In order to guarantee reliable delivery, the transport layer can determine how many acknowledgements to expect for a message that is multicast. Therefore, in order to implement reliable delivery of multicast messages, the sender of a message may need to know the number of nodes to which the message is targeted, since this will be the number of nodes from which it can expect acknowledgements. That is, if the sender of a message wants to be guaranteed that a set of nodes containing members in a group actually received a message, then the number of nodes in the set is the number of acknowledgements it will expect to receive. However, in past approaches the transport layer makes no provision for tracking which nodes are in such a set, and tracks only the number of its elements.
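By way of illustration only, the count-based acknowledgement tracking described above might be sketched as follows. The class name `CountingAckTracker`, its methods, and the node identifiers are hypothetical and are not taken from any particular transport layer; the sketch assumes each targeted node sends exactly one acknowledgement.

```python
class CountingAckTracker:
    """Hypothetical tracker that retains only the number of expected acks."""

    def __init__(self, target_nodes):
        # Only the size of the target set is kept; the identities are discarded.
        self.expected = len(set(target_nodes))
        self.received = 0

    def on_ack(self, node_id):
        # node_id is ignored: the tracker cannot tell which node acknowledged.
        self.received += 1

    def delivered(self):
        # Delivery is considered complete once enough acks have been counted.
        return self.received >= self.expected


tracker = CountingAckTracker(["node-a", "node-b", "node-c"])
tracker.on_ack("node-a")
tracker.on_ack("node-b")
print(tracker.delivered())  # False
tracker.on_ack("node-c")
print(tracker.delivered())  # True
```

Because only a count is retained, such a tracker has no means of reporting which nodes have or have not acknowledged a message.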
Group Views
A member's view of a group reflects that member's understanding of which nodes, and how many members, are in the group. In some systems, the group view actually may extend to knowing which members are in the group. Thus, the group view reflects the knowledge of the membership of the group. In this context, members of the view are processing nodes.
Many distributed applications employing group communication comprise asynchronous distributed processing nodes. Each node is essentially independent from all other nodes, and performs processing according to its own schedule or based on its own clock. Thus, the nodes are decoupled.
In such systems, failures such as communications outages or process crashes can occur, and therefore the system is inherently unreliable or unstable. Further, delays in message delivery in such an asynchronous system do not allow the sender to conclude that there has been a communication failure or a process crash. The members of a group may change, or a group may undergo reconfiguration, during such failures. A fundamental problem involves how to provide reliable communication among nodes in such unstable group communication systems. To provide such communication, nodes in the system must communicate messages directed to determining whether a prior message was properly delivered. However, no synchronous channel is available in the system for such messages; indeed, the messaging relating to verifying message delivery is itself subject to unreliability.
Prior solutions to these problems have been attempted, but in general, such solutions all limit, to some extent, the degree of asynchronous behavior that is available in the system. For example, in a virtual synchrony approach, the underlying infrastructure for group communication can enforce stable group views. This means that the sender of a message to a group can be sure that the message will be delivered to the sender's view of the group and that the group membership will not change while the message is sent.
In group communication systems that use virtual synchrony, while a group is reconfiguring and its membership is changing, e.g., in response to an outage or crash, group messaging is suspended. A suspension is imposed because any group views during this period are inherently unstable. Accordingly, a group is not allowed to reconfigure or change membership while messages are being exchanged. This approach affords distributed applications the assurance that a known set of group members will receive a message multicast to the group. Furthermore, whenever a sender sends messages that require responses, the sender knows how many responses to expect and from whom they can be expected, since virtual synchrony provides reliable and stable group views.
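By way of illustration only, the suspension of group messaging during reconfiguration might be sketched as follows. This is a greatly simplified model and not an implementation of any actual virtual synchrony protocol: the class `VirtuallySynchronousGroup` and its methods are hypothetical, and the queuing of messages sent during a view change is an assumption made for the sketch.

```python
class VirtuallySynchronousGroup:
    """Hypothetical model: messaging and view changes are mutually exclusive."""

    def __init__(self, members):
        self.view = frozenset(members)
        self.reconfiguring = False
        self.pending = []    # messages held back during a view change
        self.delivered = []  # (message, view it was delivered to)

    def begin_view_change(self):
        # From this point on, group messaging is suspended.
        self.reconfiguring = True

    def install_view(self, members):
        # The new view is agreed upon; messaging resumes.
        self.view = frozenset(members)
        self.reconfiguring = False
        queued, self.pending = self.pending, []
        for msg in queued:
            self.send(msg)

    def send(self, msg):
        if self.reconfiguring:
            self.pending.append(msg)  # sender is delayed (assumed behavior)
            return None
        self.delivered.append((msg, self.view))
        return self.view


group = VirtuallySynchronousGroup(["node-a", "node-b", "node-c"])
group.begin_view_change()               # e.g., node-c has crashed
print(group.send("update-1"))           # None
group.install_view(["node-a", "node-b"])
print(sorted(group.delivered[0][1]))    # ['node-a', 'node-b']
```

In this sketch every message is delivered to exactly one stable view, at the cost of delaying the sender until the view change completes.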
Problem Areas
However, implementing stable group views for group communication with virtual synchrony is costly in terms of message overhead, measured as the number of messages that must be exchanged in order to enforce virtual synchrony, and as the delay to the application that cannot send messages while group views are updated. Specifically, several rounds of reliable message exchange are required while changes in group views are propagated and agreed upon. During this time, changes in group membership are suspended. This excessively limits or delays ongoing group communication. In particular, in certain systems, the overhead of maintaining stable group views is too high, and making changes to group membership mutually exclusive with group messaging is unacceptable.
For example, certain systems have high bandwidth requirements for group communication. An example is a high-speed data packet router comprising multiple processing nodes that are distributed over a high-speed mesh. As a result of these requirements, long delays in sending messages are unacceptable. Further, in certain systems, messaging must be timely enough to convey system information, and these systems cannot tolerate delays in group communication while a group view is stabilized. The latter condition applies to systems that employ group communication to exchange critical information in real time.
Still other systems must recover from outages in less time than is required for the exchange of multiple rounds of reliable group messages. Therefore, if any outages result in a change to a group membership or topology, the group views have to be updated and the system cannot afford this time. Thus, if the maximum time allowed for recovery from network outages, crashed nodes or crashed processes is smaller than the time required for a group to reach agreement upon its new configuration, then the system cannot promise stable group views to its distributed applications.
Based on these reasons or others, certain systems must tolerate unstable group views, and a greater burden is placed upon the distributed application in these systems. In such systems, the sender of a message might send a message to a set of group members as defined by its group view, and the group membership can change while the message is being delivered. This can have negative side effects for the sender of the message.
For example, if the sender is expecting responses from the members of the group to which it sent the message, then presumably, it must dimension response memory for the responses to match what it perceives as the group membership at that time. However, if some members departed the group while the message was being sent, the transport layer may give up on trying to reliably deliver the message to all nodes in the group, after one or many retransmissions, returning an error to the sender of the message.
In these types of failures, the sender is stalled while a maximum specified time is consumed waiting for responses from nodes that may no longer be in the group, because the group reconfigured while the message was sent. The sender's view of the group is now untrustworthy, and the sender may have to take extra measures to determine which members actually received a message. This scenario is termed the “uncertain known member” scenario, since the sender of the message cannot know whether a member left the group or there was an outage while a message was being delivered to the group. The sender must either wait for messages, or send more messages, in order to determine which of the two occurred.
Another hazard for a distributed application operating with unstable group views is that the application may inadvertently send the message to new members who joined the group while the message was being delivered. This can occur, for example, when a group is reconfiguring while the application is sending a message; since the sender's group view was not updated, any new members that arrived during delivery of the message may have received the message. This is termed the “unannounced new member” scenario, and may cause ambiguous results.
For example, if the message is delivered to the unannounced new members, then the transport layers of the new members' nodes will acknowledge its receipt. This would have the side effect that some known members might not receive the message, since the transport layer on the sending node will prematurely receive all the acknowledgements that it was expecting. The problem with the unannounced new member scenario is that the sender no longer knows if all the members received the message, and it may not even be aware that there is a problem.
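By way of illustration only, the premature completion described above might be sketched as follows. The function names and node identifiers are hypothetical; the sketch assumes a sender that compares only the number of acknowledgements received against the size of its group view.

```python
def sender_believes_delivered(group_view, acks):
    """Count-based check: compares the number of acks with the view size."""
    return len(acks) >= len(group_view)


def known_members_missing(group_view, acks):
    """Ground truth that a purely count-based sender cannot compute."""
    return sorted(set(group_view) - set(acks))


view = ["node-a", "node-b", "node-c"]
# node-d joined during delivery and acknowledged the message;
# the acknowledgement from node-c never arrived.
acks = ["node-a", "node-b", "node-d"]

print(sender_believes_delivered(view, acks))  # True
print(known_members_missing(view, acks))      # ['node-c']
```

The sender receives the number of acknowledgements that it expected and therefore stops retransmitting, even though one known member has not acknowledged the message.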
A sending node can detect the problem if it expects responses to a message that it sent to its group view. The unannounced member may respond to the message along with all members that are in the sender's group view. However, the sender probably will not have reserved memory for all known members and unannounced members, and therefore the sender will drop any responses that arrive after it receives the number it was expecting.
All of these effects introduce significant uncertainty in the distributed application. The basis of the uncertainty is that the group views are inherently unstable. A method is sought that can provide a more deterministic environment in systems that cannot enforce stable group views.
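By way of illustration only, the dropping of responses by a sender whose response memory is dimensioned to its group view might be sketched as follows. The class `ResponseBuffer`, its fields, and the node identifiers are hypothetical; the sketch assumes one response slot per node in the sender's group view.

```python
class ResponseBuffer:
    """Hypothetical response memory sized to the sender's group view."""

    def __init__(self, group_view):
        self.capacity = len(group_view)  # one slot per known member
        self.responses = []
        self.dropped = []

    def on_response(self, node_id, payload):
        if len(self.responses) < self.capacity:
            self.responses.append((node_id, payload))
        else:
            # No slot remains: the response is discarded.
            self.dropped.append((node_id, payload))


buffer = ResponseBuffer(["node-a", "node-b"])
buffer.on_response("node-a", "ok")
buffer.on_response("node-d", "ok")  # unannounced new member responds early
buffer.on_response("node-b", "ok")  # known member responds last

print([node for node, _ in buffer.responses])  # ['node-a', 'node-d']
print([node for node, _ in buffer.dropped])    # ['node-b']
```

In this ordering the response of a known member is the one that is dropped, which shows that the sender cannot assume that the retained responses come from the members of its group view.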
In past approaches, this problem is not addressed. The past approaches always assume that group communication is stable.
Based on the foregoing, there is a need for a way to provide reliable group communication, and timely updates to group views, in systems that cannot implement stable group communication.