The present invention is directed to a method for system recovery in a communications environment in which message packets are sent from one endpoint to another. More particularly, the present invention is directed to the use of a system of pairwise epoch numbers to maintain error-free communication and communication consistency in a distributed data processing system which includes a plurality of communication endpoints. The use of pairwise epoch numbers provides a mechanism which alleviates the communication constraints imposed by the use of global epoch number systems. Even more particularly, the present invention is directed to a method which provides automatic self-healing when employed in a communications environment in which endpoint failure is possible.
Before describing the present invention in detail, it is useful to provide some background for better understanding its preferred operational environment. The present invention operates in distributed data processing systems. An example of such systems is the pSeries of data processors (formerly referred to as the RS/6000) manufactured and sold by International Business Machines Corporation, the assignee of the present invention. These systems include a plurality of independent data processing nodes, each of which includes one or more central processing units and associated random access memory, and each of which is coupled to one or more nonvolatile storage devices with readable and writable media therein. These nodes communicate with each other through the exchange of messages transmitted through one or more communication adapters. These adapters are typically connected to a switch which is provided to direct messages to designated nodes in the distributed system. Communication in this system occurs via the interchange of messages, which typically have a data header embedded in each packet comprising the message. This data header allows the exchange of messages defined by a protocol such as MPI (Message Passing Interface). In the present invention this header includes an epoch number.
Having considered the environments in which the present invention is found and is most useful, it is now appropriate to consider problems that can occur in these environments and the advantages of the solution provided by the present invention. In particular, it is possible that an adapter might fail. If this is detected, the node affected by the failure typically has the option of seeking an alternate communication path through another adapter. In this case, or even in the case of a temporary adapter failure, the other nodes in the system can lose track of the message passing status. Such failures could be handled by a system of globally maintained consistency variables. However, since message passing is often just between a pair of nodes, the system of the present invention entails less overhead because it is based on pairwise sets of epoch numbers.
Additionally, it is noted that a node might also experience a failure of the variety in which the node undergoes a system reset (that is, it starts “from scratch”). The present invention also provides for recovery of communications even in the face of this more severe mode of failure. In particular, in such scenarios it is important that the node which has failed and which has subsequently recovered be provided with a mechanism by which it can communicate this fact to the other nodes with which it had been communicating.
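By way of illustration only, the contrast drawn above between pairwise and global epoch numbers may be sketched as follows. The sketch assumes that each endpoint holds one epoch number per peer and that the packet header carries the sender's epoch number for the destination. The names `Packet`, `Endpoint`, `send`, `accept` and `bump` are hypothetical and do not appear in the description above, and the acceptance rule shown (exact match) is an assumption made for the example, not a statement of the claimed method.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Packet:
    src: int        # sending node
    dst: int        # destination node
    epoch: int      # hypothetical header field: sender's epoch for this pair
    payload: bytes


@dataclass
class Endpoint:
    node_id: int
    # One epoch number per peer, rather than a single system-wide value.
    epochs: dict = field(default_factory=dict)

    def send(self, dst: int, payload: bytes) -> Packet:
        # Stamp the header with the epoch held for this particular peer.
        return Packet(self.node_id, dst, self.epochs.get(dst, 0), payload)

    def accept(self, pkt: Packet) -> bool:
        # Assumed rule: accept only if the header epoch matches the
        # epoch this endpoint holds for the packet's sender.
        return pkt.epoch == self.epochs.get(pkt.src, 0)

    def bump(self, peer: int) -> None:
        # Start a new epoch with one peer only; other pairs are unaffected.
        self.epochs[peer] = self.epochs.get(peer, 0) + 1


a, b, c = Endpoint(1), Endpoint(2), Endpoint(3)
a.bump(2)                          # node 1 starts a new epoch with node 2 only
stale = a.accept(b.send(1, b"x"))  # node 2 still uses the old epoch
other = a.accept(c.send(1, b"y"))  # the pair (1, 3) was never disturbed
```

In this sketch, advancing the epoch for one pair of nodes causes stale packets from that peer to be rejected while traffic with every other node continues, which is the reduction in overhead relative to a globally maintained value that the passage above describes.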