1. Field of the Invention
The present invention relates to multiprocessing computer systems. More particularly, the invention concerns a method and apparatus to reliably deliver ordered machine-readable messages among processing members in a multiprocessing computer system, where failure conditions are addressed by invoking a membership protocol requiring only asymmetric safety.
2. Description of Related Art
Multiprocessing Systems
Multiprocessing computing systems perform a single task using a plurality of processing "elements", also called "nodes", "participants", or "members". The processing elements may comprise multiple individual processors linked in a network, or a plurality of software processes or threads operating concurrently in a coordinated environment. In a network configuration, the processors communicate with each other through a network that supports a network protocol. This protocol may be implemented using a combination of hardware and software components. In a coordinated software environment, the software processes are logically connected together through some communication medium such as an Ethernet network. Whether implemented in hardware, software, or a combination of both, the individual elements of the network are referred to individually as "members", and together as a "group".
The processing elements typically communicate with each other by sending and receiving messages or packets through a common interface. A processing element typically makes a collective call at the common interface, which coordinates the call among a subset of the processing elements. The elements involved in the call are referred to as participants. In many applications, frequently used collective calls are provided in a common communication library such as the industry standard Message Passing Interface. Further details on this standard are described in "MPI: A Message-Passing Interface Standard," published by the University of Tennessee, 1994. The communication library provides not only ease in parallel programming, debugging, and portability but also efficiency in the communication processes of the application programs.
Multicast
In multiprocessing computing systems, the individual nodes cooperate to perform an overall task of the system. A fundamental part of internodal cooperative task performance is the effective delivery of messages among the nodes. As a result, engineers have developed a number of different message delivery or "multicast" techniques. Although the various multicast approaches differ in some ways, each approach typically broadcasts messages to a predefined group called a "membership group".
It is difficult, however, for each node to learn the status of other nodes, as required to perform the overall system tasks cooperatively and effectively. For example, multiprocessing tasks are complicated when one or more nodes fail. Thus, each node cannot proceed without some knowledge of which other nodes have failed, and the other nodes' task completion status. This coordination problem is difficult because many activities are occurring simultaneously in different nodes, and the nodes' statuses change constantly.
Thus, one of the problems associated with multiprocessing systems is that it is impossible to guarantee sufficiently consistent failure detection by the nodes so that each correctly functioning node is guaranteed an accurate and consistent "view" of the current membership. A given member's "view" of the current membership includes all other members for which no notification of communications failure has been received. Because consistency cannot be guaranteed, failure or tardiness of some processing elements may not be consistently detected and reported by the other elements. Inconsistency in these reports makes recovery from system failures difficult.
The problem of keeping the views of the members in the membership accurate and consistent is known as the membership problem. Members join a group because of events external to the cooperative task assigned to the group. Members leave a group either because of such external events or because of a failure of the member or of computing resources on which the member depends. These external and failure events are called membership events. Ideally, within a short time after any membership event, all remaining members of the group would have the same accurate view of the group.
Membership Protocols
As mentioned above, it is often important to determine membership in a multiprocessing system. Specified parameters for establishing group membership collectively form a "membership protocol". An ideal membership protocol would have these features: (1) it is triggered by some membership event, (2) it requires at most a fixed amount of time to complete, (3) it results in complete consistency of views of the remaining members, and (4) each remaining member's view consists of exactly the set of remaining members. This last feature exhibits what is known as "symmetric safety", where each processing member that views another member must also be viewed by that member. In this application, a member "viewing" another member means the first member has no reason to consider the second member, or communications with the second member, as having failed. Under symmetric safety, if two members' views intersect in at least the name of one member with a view, then the two members' views must be identical. A member's view always includes itself. Unfortunately, this ideal membership protocol is impossible in the presence of crash failures and lost messages.
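The symmetric safety condition described above can be illustrated with a minimal sketch. It assumes that each member's view is represented as a set of member names and that only members appearing as keys of the mapping currently hold a view; the function name `is_symmetrically_safe` and this representation are hypothetical and are introduced here for illustration only.

```python
from itertools import combinations


def is_symmetrically_safe(views):
    """Check symmetric safety over a snapshot of views.

    `views` maps each member that holds a view to the set of member
    names it currently views (an assumed representation).
    """
    # A member's view always includes itself.
    if any(member not in view for member, view in views.items()):
        return False
    # A member that views another member must also be viewed by it.
    for a, view in views.items():
        for b in view:
            if b in views and a not in views[b]:
                return False
    # Views that share the name of a member holding a view must be identical.
    for a, b in combinations(views, 2):
        shared = views[a] & views[b]
        if any(m in views for m in shared) and views[a] != views[b]:
            return False
    return True
```

For example, the views {A: {A, B}, B: {A, B}, C: {C}} satisfy the check, whereas {A: {A, B}, B: {B}} do not, because A views B while B does not view A.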
One strategy for approximating an ideal membership protocol is to assume a high degree of synchrony in the computation and the transport layer of the processing elements. Protocols built on this assumption are referred to as synchronous agreement protocols, such as the one described by Dwork et al. in U.S. Pat. No. 5,513,354. This patent concerns a method for managing tasks in a network of processors in which the processors exchange views as to which processors have failed and update their views based on the views received from the other processors. After a number of synchronous rounds of exchange, the operational processors reach an eventual agreement as to the status of the processors in the system. A failure in the assumed synchrony would lead to either inconsistency or the problem of "blocking". Also, the agreement protocol is not designed to tolerate a communication partition.
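The general idea of exchanging failure views over synchronous rounds can be sketched as follows. This is a simplified illustration and not the method of the cited patent: it assumes perfect synchrony, no lost messages among operational processors, and a merge rule of plain set union. The function name `exchange_rounds` and its parameters are hypothetical.

```python
def exchange_rounds(suspects, crashed, rounds):
    """Simulate synchronous rounds of failure-view exchange.

    `suspects` maps each processor name to the set of processors it
    believes have failed; `crashed` is the set of processors that send
    nothing. Both the representation and the union merge rule are
    assumptions made for this sketch.
    """
    state = {p: set(s) for p, s in suspects.items()}
    alive = [p for p in state if p not in crashed]
    for _ in range(rounds):
        # Each operational processor sends its start-of-round view to all others.
        snapshot = {p: set(state[p]) for p in alive}
        for receiver in alive:
            for sender in alive:
                # The receiver merges every view it receives into its own.
                state[receiver] |= snapshot[sender]
    return {p: state[p] for p in alive}
```

Under these assumptions, a single round is enough for the operational processors to hold identical views. If a message were delayed beyond its round, the views could diverge, which corresponds to the inconsistency risk noted above.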
Another strategy for approximating the ideal membership protocol is to weaken the requirement that the protocol terminate in a fixed amount of time, instead requiring that the protocol merely terminate eventually. This weaker membership protocol is referred to as an asynchronous agreement or consensus protocol, similar to the one described by T. Chandra et al. in "The Weakest Failure Detector for Solving Consensus," Proceedings of the 11th Annual ACM Symposium on Principles of Distributed Computing, 1992, pp. 147-158. A disadvantage of such a consensus protocol is that there is no guarantee on how long the protocol requires to terminate. So, from a practical point of view, there is no guarantee of termination at all. Moreover, in the presence of communication failures (lost messages) that prevent one subgroup of participants from communicating with another subgroup, it is not even possible to guarantee eventual agreement.
Another approach, discussed in U.S. patent application Ser. No. 08/522,651, further weakens the membership conditions to require neither termination nor accuracy, and to reduce safety to merely require that, if the views of two members differ, then neither member is contained in the view of the other. Membership protocols satisfying these much weaker constraints are said to achieve interactive consistency, and are referred to as interactive (or collective) consistency protocols. The advantage of interactive consistency protocols is that they usually terminate quickly and achieve both consistency and accuracy in the sense that a member's view of the current membership usually consists of the set of members with whom it could communicate. Their disadvantage is that there are no termination guarantees (i.e., they cannot use a time-out), so a protocol might block forever waiting for a message that would never be sent from a member that has crashed.
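The reduced safety condition stated above can be checked with a short sketch. It assumes that views are represented as sets of member names keyed by the member holding the view; the function name `satisfies_reduced_safety` is hypothetical and does not come from the cited application.

```python
from itertools import combinations


def satisfies_reduced_safety(views):
    """Check that members with differing views exclude one another.

    `views` maps each member that holds a view to the set of member
    names it currently views (an assumed representation).
    """
    for a, b in combinations(views, 2):
        if views[a] != views[b]:
            # Differing views: neither member may appear in the other's view.
            if a in views[b] or b in views[a]:
                return False
    return True
```

Note that this condition is weaker than symmetric safety. For example, the views {A: {A, C}, B: {B, C}}, where C holds no view, pass this check even though the two views differ and overlap in C.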
Still another form of weakened membership protocols, referred to as dynamic uniformity, is described by D. Malki et al. in "Uniform Actions in Asynchronous Distributed Systems," Proceedings of the 13th Annual ACM Symposium on Principles of Distributed Computing, 1994, pp. 274-283. Generally, dynamic uniformity requires that each correctly functioning participant either reaches the same decision as reached by any other participant or is eventually viewed as disabled by others. The main disadvantages with dynamic uniformity protocols are their complexity and possible temporary inconsistency.
Implementation of Known Multicast Techniques
Most conventional multicast systems employ a membership protocol providing symmetric safety. Some examples include: K. P. Birman and R. van Renesse, Reliable Distributed Computing with the Isis Toolkit, IEEE Computer Society Press, Los Alamitos, Calif., 1994; D. Dolev, D. Malki, and R. Strong "A Framework for Partitionable Membership Service", Technical Report TR94-6, Department of Computer Science, Hebrew University; F. Jahanian, S. Fakhouri, and R. Rajkumar, "Processor Group Membership Protocols: Specification, Design and Implementation" in Proc. of 12th IEEE Symposium on Reliable Distributed Systems, pp. 2-11, 1993; and R. van Renesse, K. P. Birman, and S. Maffeis, "Horus: A Flexible Group Communication System", Comm. of the ACM, vol. 39, no. 4, pp. 76-83, 1996.
Although some of these known multicast systems may have achieved some scientific recognition and even commercial activity, they may not be entirely adequate for certain applications or users. For instance, since they require symmetric or so-called "strict" safety, these systems are effectively "blocking".
Additionally, symmetric safety can be expensive from a processing standpoint, requiring many different inter-nodal messages to implement. Moreover, for some applications, symmetric safety can be undesirably slow and complex. Although symmetric safety has these known limitations, approaches using anything less than symmetric safety are considered undesirable because of the perception that they cannot meet the needs of applications.
Consequently, known multicast systems are not completely adequate for some applications due to certain unsolved problems.