In modern computer data processing, improved efficiency in the execution of an application program is often achieved by separating the program into several cooperating parts and running each part on a different CPU within a computer network. Each part of the application program runs as a detached process on a specific CPU. The parts may be active one at a time, with the inactive parts held in a "standby" mode, or they may all be active at the same time as cooperating parts of the overall data processing operation.
For reliable execution of the entire application program, each CPU running one of the parts of the application must function properly for as long as that part is being processed. If one part of the application program fails due to a CPU crash, it is imperative that notification of the failure be provided so that a network manager can implement appropriate corrective actions. For example, the network manager can transfer the failed part of the application program to another CPU on the network for execution.
The monitoring of proper network operation is an important yet time-consuming function typically carried out by network managers and operators. Upon a CPU crash, the network manager must ascertain whether the CPU was executing a cooperating part of an application program, identify the application program and the part running on the failed CPU, and then take appropriate steps to restart the application program. Accordingly, there is a need for an automatic, reliable mechanism to detect a failure among network components running cooperating parts of an application program and to provide efficient, informative notification of the failure.
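
The manual steps described above (determining whether a failed CPU was running a part of an application program, identifying the application and the part, and moving the part to another CPU) can be illustrated with a minimal sketch. It is not the mechanism of any particular system. The `Part` record, the in-memory `registry` list, the CPU and application names, and the functions `failure_notices` and `reassign` are all hypothetical, introduced only for illustration, and the sketch assumes that the failure of the CPU has already been detected by some other means.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Part:
    """One cooperating part of an application program (hypothetical record)."""
    application: str  # name of the application program
    name: str         # name of the part
    cpu: str          # CPU on which the part runs as a detached process


def failure_notices(registry, failed_cpu):
    """Build one notice per part that was running on the failed CPU."""
    return [
        f"CPU {failed_cpu} failed: application '{p.application}', "
        f"part '{p.name}' must be restarted"
        for p in registry
        if p.cpu == failed_cpu
    ]


def reassign(registry, failed_cpu, spare_cpu):
    """Return a new registry with the affected parts moved to spare_cpu."""
    return [
        replace(p, cpu=spare_cpu) if p.cpu == failed_cpu else p
        for p in registry
    ]


# Assumed example data: two applications spread over two CPUs.
registry = [
    Part("payroll", "reader", "cpu-a"),
    Part("payroll", "writer", "cpu-b"),
    Part("billing", "sorter", "cpu-b"),
]

notices = failure_notices(registry, "cpu-b")
updated = reassign(registry, "cpu-b", "cpu-c")
```

In this sketch, a failure of `cpu-b` yields two notices, one for each affected part, while a failure of a CPU that runs no part of any application yields none, which corresponds to the first determination the network manager must make.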