The present invention is generally directed to error handling in multinode data processing networks. More particularly, the present invention is directed to a cooperative arrangement of nonlocal operations and local (that is, within-node) error handling in which distributed and/or parallel structured applications are maintained in a running, albeit suspended, state to accommodate more finely grained local error handling. Even more particularly, the present invention provides coordination of the improved local error handling capabilities with global operations to prevent the unnecessary termination of running applications and/or other programs.
Multinode data processing systems are employed to run user programs (that is, applications) using both distributed and parallel processing modalities. Operating system level software running on the various nodes of such systems handles communications between the nodes. The user applications communicate from node to node via messages sent through a switch. In the pSeries of SP products marketed and sold by International Business Machines Corporation, the assignee of the present invention, these messages are transmitted using the publicly available Message Passing Interface (MPI) as well as Internet Protocols.
Utility programs, such as the publicly available Group Services interface provided with such systems, permit users to form groups of nodes for the purposes of accomplishing specific user application tasks such as searching, sorting or numerical processing. The switch receives a message from a sending node and is capable of directing that message to one or more of the other nodes within the established node group (that is, to the receiving nodes). In these multinode systems, communication takes place through adapters which provide a communication channel to the switch from memory and data processing elements within each node. To the extent needed by various application programs, this communication is coordinated through a primary node in the group.
Accordingly, it is seen that the adapters provide a key link in the communication process that permits distributed and parallel operations to take place. These operations have both a local and a global aspect. Adapters perform their own data processing functions, which include interrupt generation at the local node, typically in response to the receipt of information packet messages, so that incoming messages are directed to appropriate memory locations within the memory units of the various processor nodes. However, because of their location in the communication path, an error occurring in an adapter unit can exhibit both local and global effects. The present inventors have recognized that there is a previously unappreciated spectrum of severity levels in adapter errors, that some errors are more predictable than others, that some errors have a greater likelihood of recoverability and that recovery times can become so large that global operations are adversely, albeit unnecessarily, affected. In the past, serious adapter errors have caused entire nodes to become nonfunctional solely because of adapter problems. As a result, the node was fenced off and intervention by a human systems administrator was required to remedy the problem. Such errors can result in the termination of applications and jobs that are in running states. However, some errors, even though characterizable as “serious,” produce adapter states from which recovery is possible. Such recovery operations typically involve adapter reset operations and the early capture of fault data for hardware debugging purposes. Together, or individually, these operations can take longer to complete than is desirable or tolerable. In this regard, it is noted that recovery operations, by their very nature, are unpredictable in their outcomes and in their duration.
Adapter errors are addressed in several ways. The simplest approach is to count the number of “retry” events and, if the number exceeds a predetermined threshold, to fence the node off from communication with the rest of the system, which often means that running jobs are terminated and eventually have to be restarted from a much earlier stage in their progress. Adapter error recovery systems that are based solely on a time-out applied after error recovery operations have been initiated fail to mesh conditions present within a local node with more global considerations. Most recovery algorithms involve retry operations at the problem node, that is, at the node with the problem adapter. Some approaches involve retry operations carried out from the perspective of the global node group. However, these methods often result in an unnecessary disconnection of the affected node from the rest of the system. While many of these error recovery schemes are capable of handling relatively transient errors, they fail when the errors become more serious and/or are of longer duration.
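By way of illustration only, the simple retry-counting approach described above may be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from any actual adapter software: the class name `AdapterMonitor`, its fields and the threshold value of 3 are all hypothetical. It shows why the approach is coarse: once the count exceeds the threshold the node is fenced, with no regard to the severity of the error or to whether recovery is already under way.

```python
from dataclasses import dataclass


@dataclass
class AdapterMonitor:
    """Hypothetical per-node retry counter (illustrative names only)."""

    retry_threshold: int      # assumed predetermined threshold
    retry_count: int = 0      # retry events seen so far
    fenced: bool = False      # True once the node is fenced off

    def record_retry(self) -> bool:
        """Count one retry event and report whether the node is fenced."""
        if self.fenced:
            # A fenced node stays fenced until an administrator intervenes.
            return True
        self.retry_count += 1
        if self.retry_count > self.retry_threshold:
            # Threshold exceeded: cut the node off from the rest of the system.
            self.fenced = True
        return self.fenced


# Five consecutive retry events against an assumed threshold of 3.
monitor = AdapterMonitor(retry_threshold=3)
results = [monitor.record_retry() for _ in range(5)]
print(results)
```

In this sketch the fourth retry event fences the node, and the decision rests on the count alone, which is the limitation noted above.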