1. Field of the Invention
The invention relates to the field of multi-process systems, e.g., computer networks, and in particular, to system failure detection and notification in an asynchronous processing environment.
2. Background Information
Multi-processing systems, such as computer networks, are known. Application programs may utilize resources which are distributed among such systems, e.g., a database located on a remote computer on the network may be accessed by an application program started through an end user interface on a personal computer (PC). Such an application program will be referred to herein generically as a network application.
System failures may result from any number of causes, for example, a computer or process abending (abnormal ending of a task, e.g., crashing), a loss of communications, or a reboot.
In network applications, it is important to detect any system failure in a timely fashion in order to provide feedback to a user at the end user interface. In particular, if a user at an end user interface has commanded an operation that is destined to fail because of such a system failure, it is important to update the end user interface with that information as soon as possible so as not to waste the time of the user.
It is further important to detect the failure in a timely fashion in order to take corrective action within the network application. If corrective action is possible, it should be taken without a long delay so as not to delay the processing of the application.
It is also important to detect the failure in a timely fashion in order to clean up resources on other systems in the network that are dependent upon the failed system. Failed operations continue to consume resources until the failure is detected and the resources are released. In a network application, these resources often exist on systems other than the failed system, namely those systems which are involved in the processing of the operation that has failed.
In a synchronous processing environment, system failure is typically detected when an operation is initiated on that system and the system fails to respond. Detection of the failure is thus delayed until such an operation is attempted.
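By way of illustration only, the synchronous case described above can be sketched as follows. The class RemoteSystem, the function synchronous_call, and the logical time values are hypothetical names and numbers chosen for this sketch, and a missing response is modeled by a None return value rather than by an actual network timeout.

```python
class RemoteSystem:
    """Hypothetical stand-in for a remote system on the network."""

    def __init__(self, name):
        self.name = name
        self.failed_at = None  # logical time of failure; None while healthy

    def fail(self, now):
        self.failed_at = now

    def handle(self, request):
        # A failed system never answers; None models "no response before timeout".
        if self.failed_at is not None:
            return None
        return f"{self.name} handled {request}"


def synchronous_call(system, request):
    """Issue a request and wait; only a missing response reveals the failure."""
    response = system.handle(request)
    if response is None:
        raise TimeoutError(f"{system.name} did not respond")
    return response


db = RemoteSystem("db")
first = synchronous_call(db, "query-1")  # system is healthy, call succeeds

db.fail(now=5)  # the system fails at logical time 5; nothing observes this

# The next operation happens to be attempted at logical time 3600 (assumed).
try:
    synchronous_call(db, "query-2")
    detected_at = None
except TimeoutError:
    detected_at = 3600

# The failure went unnoticed for the whole interval between the two events.
detection_delay = detected_at - db.failed_at
```

In this sketch the failure is detected reliably, but only at the moment the second operation is attempted, so the detection delay equals the idle time between the failure and that attempt.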
However, in an asynchronous processing environment, detection is not as simple as in the synchronous environment. An operation could result in system A and system B sending multiple messages to one another in an asynchronous fashion. At any point in time, it may be just as correct for one of the systems to send a message to the other as it is for one of the systems to never have to send a message to the other. The lack of messages flowing between the systems is therefore not necessarily a valid indicator of failure. The messages may be sporadic, or they may never have to occur. So in the asynchronous case, a long-running operation may continue to appear normal, even though a system has already failed.
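The ambiguity of silence in the asynchronous case may be illustrated with a minimal sketch. The Peer record, the silent_too_long check, and the threshold and time values are assumptions made for this example only; the sketch shows the problem, not a solution to it.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    """Hypothetical record of what an observer knows about a remote system."""
    name: str
    failed: bool            # ground truth, NOT visible to the observer
    last_message_at: float  # logical time the last message was received


def silent_too_long(peer, now, threshold):
    """Naive check: treat prolonged silence as evidence of failure."""
    return now - peer.last_message_at > threshold


# System A is healthy but simply has nothing further to send.
# System B failed right after sending its last message.
system_a = Peer("A", failed=False, last_message_at=10.0)
system_b = Peer("B", failed=True, last_message_at=10.0)

now = 500.0
suspected = {
    p.name: silent_too_long(p, now, threshold=60.0)
    for p in (system_a, system_b)
}

# The observer sees identical evidence for both systems, so any threshold
# either flags the healthy system A or misses the failed system B.
indistinguishable = suspected["A"] == suspected["B"]
```

Because both systems present the same message history to the observer, a silence threshold alone cannot separate the healthy but idle system from the failed one.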
While system failure could be detected when the next operation involving the failed system is initiated, that operation might not be initiated until minutes, hours or even days after the system failure has occurred.
A need therefore exists for system failure detection in the asynchronous processing environment which is virtually immediate, thus solving the problems related to not having notification of a system failure in a timely fashion.