The present invention is generally directed to ensuring the continuation of consistent group formation events in a distributed topology liveness system, that is, in a multinode data processing system in which node and/or adapter liveness is communicated throughout the system via heartbeat messages, which are messages that are sent periodically and which indicate node and/or adapter liveness. More particularly, the present invention is directed to a method for detecting a situation in which a liveness daemon running on one of the nodes has been subject to a rapid restart. Even more particularly, the present invention is directed to a method for determining the existence of such quick restart events and for providing a proper indication thereof to other nodes within the network, with the particular objective of avoiding grouping inconsistencies, which are situations in which one node set sees another node set fail in some way without the other node set being aware of the fact that the first node set has also failed. In short, all of the nodes within a node set should have the same view as to the operating status of the other nodes in the node set.
A proper understanding of the present invention is best obtained from an appreciation of the environment in which it is intended to operate. The present invention is employed in multinode data processing systems. These systems include a plurality of nodes each of which incorporates a data processing element which is coupled locally to its own memory system which typically includes both a volatile random access memory and a nonvolatile random access memory. The volatile memory typically comprises an array of semiconductor memory chips. The nonvolatile memory typically comprises a rotating magnetic or optical storage device. The data processing element also typically comprises a central processing unit (CPU). Each node includes one or more data processing elements. The nodes also include adapters which are communications devices which permit messages to be sent from one node to another node or to a plurality of other nodes. Internodal communications typically take place through a switch device which routes transmitted messages to destination nodes within the system.
In order to carry out various data processing functions, the nodes within any given multinode network are organizable into sets of nodes. Nodes and/or their associated adapters sometimes experience problems, delays or failures. Accordingly, from time to time during the operation of individual nodes, system checks are undertaken to make sure that the nodes are still alive and functioning. This checking is performed via heartbeat message transmissions. Each node in the system is assigned one or more “downstream” nodes for the purpose of periodically sending a message indicating liveness status. In preferred embodiments, heartbeat signals are only sent to a single other node. However, it is quite easy to instead employ a predefined list of node destinations for receipt of heartbeat signals from any or all of the nodes in the network. These liveness message transmissions are handled by daemon programs running on the various nodes in the system.
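The assignment of “downstream” heartbeat destinations described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that nodes are ordered by a numeric identifier and arranged in a ring, so that each node heartbeats to the node or nodes that follow it. The function names, the `fanout` parameter, and the missed-heartbeat rule (a period multiplied by a tolerance count) are hypothetical and are not taken from the present description.

```python
from typing import Dict, List


def assign_downstream(node_ids: List[int], fanout: int = 1) -> Dict[int, List[int]]:
    """Map each node to the `fanout` nodes that follow it in a sorted ring.

    Assumption: ring ordering by node id. With fanout=1 each node sends
    heartbeats to a single other node; a larger fanout models a predefined
    list of destinations.
    """
    ring = sorted(node_ids)
    n = len(ring)
    if n < 2:
        # A lone node has nobody to send heartbeats to.
        return {node: [] for node in ring}
    k = min(fanout, n - 1)  # never list a node as its own destination
    return {
        ring[i]: [ring[(i + step) % n] for step in range(1, k + 1)]
        for i in range(n)
    }


def missed_heartbeat(last_seen: float, now: float, period: float, tolerance: int) -> bool:
    """Hypothetical rule: the sender is suspect once more than `tolerance`
    heartbeat periods have elapsed without a message."""
    return (now - last_seen) > period * tolerance
```

For example, `assign_downstream([3, 1, 2])` yields a ring in which node 1 sends to node 2, node 2 sends to node 3, and node 3 sends to node 1.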
Distributed multinode data processing systems of the kind contemplated herein employ heartbeat messaging protocols which are used to control group membership which, of course, shifts over time. It is control of the membership process to which the present invention is directed. This membership process typically includes the establishment of one of the nodes in a group as the so-called Group Leader (GL). The Group Leader acts as a coordinator for nodes coming into (joining) or for nodes exiting the group. Additionally, in the event that there is a problem with the Group Leader, there is preferably also a designated second node which is intended to act as a replacement for the Group Leader in the event that the Group Leader experiences a failure. This second, backup Group Leader is referred to as the Crown Prince. In the context of the present invention, the Group Leader and Crown Prince are employed in the “liveness” (heartbeating) layer. The present invention should not be confused with group membership services which are provided to “end user applications.” In accordance with the present invention, “group membership,” as referred to above, refers to the list of members in an Adapter Membership Group which occurs on each network being monitored. On the other hand, “node reachability” refers to the set of nodes that are considered to be alive, taking all of the adapter membership groups into consideration. In particular, it is noted that the notion of “node reachability” may include message hops through indirect paths that may cross network boundaries. This set of nodes is supplied from the “liveness layer” to the “group communications layer” which runs on top of the “liveness” layer.
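The distinction between “group membership” and “node reachability” can be made concrete with a short sketch. It assumes that each Adapter Membership Group is represented simply as a set of node names, and that a node belonging to groups on two networks can relay messages between them, which is one possible reading of the indirect paths mentioned above. The function name and the data representation are hypothetical.

```python
from collections import deque
from typing import Iterable, Set


def reachable_nodes(local: str, adapter_groups: Iterable[Set[str]]) -> Set[str]:
    """Return the nodes reachable from `local` across all adapter groups.

    Two nodes are directly connected when they share an adapter membership
    group. Reachability is the transitive closure of that relation, so a
    path may hop through a node that has adapters on more than one network.
    """
    groups = [set(g) for g in adapter_groups]
    reached = {local}
    frontier = deque([local])
    while frontier:
        node = frontier.popleft()
        for group in groups:
            if node in group:
                for peer in group:
                    if peer not in reached:
                        reached.add(peer)
                        frontier.append(peer)
    return reached
```

With groups `{"A", "B"}` on one network and `{"B", "C"}` on another, node A reaches node C indirectly through node B, even though A and C share no adapter membership group.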
More particularly, the present application is concerned with two different scenarios which present potential problems with respect to group membership consistency across the nodes of the system or network. Accordingly, there is provided a method for determination of adapter and node death in a distributed system whereby node events are made consistent, that is, when a first node sees another node as being “down,” the other node, if alive, is still able to see the first node as being “down” within a finite amount of time. When a node actually suffers a “permanent” crash, the heartbeat mechanism, together with the associated “join” protocol, is able to provide sufficient control and communications amongst the remaining nodes to assure maximum functionality. Accordingly, the present invention does not come into play when nodes crash, since the basic heartbeat mechanism is able to cope with this situation; nonetheless, the present invention becomes important when communication failures and process blockages result in temporary loss of contact amongst a set of distributed peers in the liveness determination subsystem. The present method addresses two possible scenarios which could lead to inconsistent node grouping situations: (1) a node where the liveness daemon is stopped and restarted quickly; and (2) a node whose communications with the rest of the nodes suffer a temporary interruption.
In situations in which the liveness daemon running on one of the nodes is stopped and restarted in a short period of time, certain consistency problems can be engendered. For example, it typically happens that when the liveness daemon restarts, a message is transmitted for each local adapter which “proclaims” the existence and the willingness of the sending node to become a group leader; it is, in generic terms, a request to know which other nodes are “out there.” These aspects are discussed in more detail below where the nature of the “PROCLAIM” message is considered. However, the other nodes in the group still consider the restarting node (and/or adapter) as being part of the previous group. Accordingly, group membership is no longer consistent in the sense that there is a lack of symmetry among the various nodes with regard to the “known” status of the other nodes. When this situation is caused by the “quick” restart of the liveness daemon, it is referred to herein as the “bouncing node” problem or scenario.
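The lack of symmetry that characterizes the “bouncing node” scenario can be expressed as a simple check over per-node membership views. The following is a sketch only: it assumes that each node's view is available as a set of node names, and the function name is hypothetical. It illustrates the inconsistency itself and is not the detection method of the present invention.

```python
from typing import Dict, List, Set, Tuple


def asymmetric_pairs(views: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (a, b) pairs where a lists b as a group member but b does not list a."""
    pairs = []
    for a in sorted(views):
        for b in sorted(views[a]):
            if b != a and a not in views.get(b, set()):
                pairs.append((a, b))
    return pairs


# After a quick restart, n1 believes it is alone, while n2 and n3 still
# hold the previous group in which n1 is a member.
views_after_bounce = {
    "n1": {"n1"},
    "n2": {"n1", "n2", "n3"},
    "n3": {"n1", "n2", "n3"},
}
print(asymmetric_pairs(views_after_bounce))  # [('n2', 'n1'), ('n3', 'n1')]
```

An empty result indicates that every node's view of group membership is mirrored by the nodes it names.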
Likewise, a problem can occur if a first node, say Node 1, has a temporary communication problem. If the problem lasts long enough for the other nodes to expel Node 1 from the group, but not long enough for the local adapter to be declared down, the other nodes can form a new Adapter Membership Group, G2, while the adapter at Node 1 is still considered as being part of the previous group, G1 (which contains all the adapters). The adapter at Node 1 then attempts to dissolve the group, since it will have gotten no answer to a liveness (“DEATH”) message that it sent when its old upstream neighbor stopped sending heartbeat signals to it. (For a discussion of a more specific and preferred characterization of the notion of dissolving a group, attention is directed below to Section 2.2.) Upon “dissolving” the group, the adapter at Node 1 reinitializes into a “group” with only a single node, which is referred to herein as a singleton group, and it resumes operation. Singleton groups are inherently unstable groups since they are typically destined to soon experience a change to inclusion in a larger group. If this all happens before the adapter on Node 1 is able to form a stable group, then Node 1 never sees any “node down” events, whereas the other nodes do see Node 1 as being “down,” especially if this is the only adapter group to which Node 1 belongs. Accordingly, the recognition of this problem brings along with it the notion that some groups are more stable (from time to time) than other groups, and that special handling is required to ensure group membership consistency across the network.
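The timing window that gives rise to this second scenario can be sketched as a classification of outage durations. The two thresholds used here, `expel_after` (the time after which peers expel a silent node) and `adapter_down_after` (the time after which the local adapter is declared down), are hypothetical parameters introduced for illustration; the present description does not specify their values or names.

```python
from enum import Enum


class Outcome(Enum):
    UNNOTICED = "unnoticed"                  # outage too short for peers to react
    INCONSISTENT_RISK = "inconsistent_risk"  # peers expel the node, adapter stays up
    ADAPTER_DOWN = "adapter_down"            # local adapter itself is declared down


def classify_outage(duration: float, expel_after: float, adapter_down_after: float) -> Outcome:
    """Classify a temporary communication outage by its duration.

    The risky window is expel_after <= duration < adapter_down_after:
    the other nodes form a new group without the affected node, while the
    affected node dissolves into a singleton group and sees no
    "node down" events of its own.
    """
    if duration >= adapter_down_after:
        return Outcome.ADAPTER_DOWN
    if duration >= expel_after:
        return Outcome.INCONSISTENT_RISK
    return Outcome.UNNOTICED
```

For instance, with `expel_after=5` and `adapter_down_after=20`, an outage of 10 time units falls in the window in which group membership can become inconsistent.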