In large data processing complexes, tasks are commonly performed by a number of related systems in the complex communicating with each other. These related systems are individual operating systems running on data processing machines which are linked by a data network to form a single complex.
In order that the related systems may operate within a complex, there must be some means for communicating data items such as variables and instructions between the systems. Communication is initiated by the acquisition of one or more real or virtual communications links between the local system and a remote system. Such virtual communications links are provided by, for example, the IBM Virtual Telecommunications Access Method (VTAM), which is a network management system. Normally the acquisition and relinquishing of the links is an ongoing process which does not cause a bottleneck of link acquisition requests to build up, assuming that sufficient links are provided. The local system performs those steps in a task which it can complete without access to remote resources, such as input devices, output devices or storage containing programs or data. When a step in a task requires access to resources remote from the system then it will issue a request to link to those resources in the remote system.
Although operation using multiple linked systems should not result in the failure of a remote system causing failure of a local system, there are circumstances where this may occur. If there is a major problem in a remote system, such that the remote system slows down or stalls without terminating operation, the result can be a build-up of queued or waiting requests for links to the remote system at the local system. The local system will then have a large number of link acquisition requests (hereinafter called requests) waiting to be serviced. Tasks dependent on access to remote resources will thus not complete, and the local system will take up additional storage and other resources, with the result that it slows down or stops operation. This may in turn cause similar problems in other systems. In this way, the failure of one system can cause a spread of sympathetic failures in other systems throughout the complex.
One prior art solution is to have a first program running on each system which, at predetermined intervals (for example, after the receipt of ten requests), sends a communication over the link to a second program running at the other end of the link, informing the second program that the first program is processing requests normally. The second program takes no part in the communication, apart from receiving the confirmation. Such a solution is commonly called "pacing" and an example of such an implementation is IBM's Virtual Telecommunications Access Method (VTAM). This solution requires extra processing by the first system at all times, whether or not there is a potential problem with the link, thus reducing the performance of the link. In addition, the link itself must be used to establish whether the remote system is operating normally.
In another solution, a first program requests a second program to confirm that it is operating normally. The second program either sends a communication back to the first confirming normal operation or does not respond. This is commonly called "polling". Extra processing by both systems is required with this solution as is use of the communications link itself.
Yet another solution involves detecting timeouts on any communication link and using these as an indication that the system at the receiving end of the link is not responding. This solution is unable to distinguish between a highly loaded system, which is processing requests at a normal rate but is simply overloaded, and a system which has stalled and is not processing any requests.
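The limitation of timeout detection can be illustrated with a minimal sketch. The function name `timed_out`, the delay figures and the timeout value are assumptions chosen purely for illustration and do not come from any cited implementation.

```python
def timed_out(reply_delay, timeout):
    """Return True if the observer would declare the remote system unresponsive.

    reply_delay: seconds until the reply arrives, or None if it never arrives
                 (hypothetical figures, for illustration only).
    timeout:     seconds the local system is prepared to wait.
    """
    return reply_delay is None or reply_delay > timeout


TIMEOUT = 5.0  # assumed value

healthy = timed_out(reply_delay=1.0, timeout=TIMEOUT)     # replies promptly
overloaded = timed_out(reply_delay=8.0, timeout=TIMEOUT)  # still working, but replies late
stalled = timed_out(reply_delay=None, timeout=TIMEOUT)    # never replies
```

In this sketch the overloaded system and the stalled system yield the same result, which is the ambiguity described above.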
A further prior art solution to the problem is to count the number of requests queued waiting for a link to a particular remote system. When the number of queued requests reaches a threshold value, remedial action is taken. An example of this solution is described in European Patent Application EP A 0 539 130 A2 where the action taken is to reject the request, reroute the request or accept the request. This solution has the advantage that it can detect the status of a remote system without any additional usage of the communications link to that remote system. However, it is unable to distinguish between a highly loaded system, which is processing requests at a normal rate but is simply overloaded, and a system which has stalled and is not processing any requests.
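The queue-counting approach and its limitation can be sketched as a small simulation. This is not the method of EP A 0 539 130 A2; the function `simulate`, its parameters and the arrival and service rates are hypothetical, and rejection is used as the only remedial action for simplicity.

```python
from collections import deque


def simulate(arrivals_per_tick, served_per_tick, ticks, threshold):
    """Queue link requests locally and reject new ones once the threshold is reached."""
    pending = deque()
    rejected = 0
    served = 0
    for tick in range(ticks):
        # New requests arrive; those beyond the threshold are rejected.
        for i in range(arrivals_per_tick):
            if len(pending) >= threshold:
                rejected += 1
            else:
                pending.append((tick, i))
        # The remote system services some of the queued requests (possibly none).
        for _ in range(min(served_per_tick, len(pending))):
            pending.popleft()
            served += 1
    return {"threshold_hit": rejected > 0, "rejected": rejected, "served": served}


# Remote system is overloaded but still servicing requests.
overloaded = simulate(arrivals_per_tick=5, served_per_tick=3, ticks=20, threshold=10)
# Remote system has stalled and services nothing.
stalled = simulate(arrivals_per_tick=5, served_per_tick=0, ticks=20, threshold=10)
```

Both runs reach the threshold, so a test on queue length alone reports the same condition for each. Only the `served` count, which the threshold test does not consult, differs between the two cases.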
U.S. Pat. No. 5,031,089 discloses a distributed computer system in which each node within the system has a queue. All jobs to be performed are placed on one of the queues. A workload value is periodically calculated by each node as a function of the number of jobs on its queue. Workload values are exchanged between the nodes. The jobs on any of the queues may be re-allocated from one node to another node based on the workload values.
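The general idea of workload-based re-allocation can be sketched as follows. This is an illustrative assumption and not the algorithm of U.S. Pat. No. 5,031,089: the workload value is taken here to be simply the queue length, and the names `workload` and `rebalance` and the stopping rule are hypothetical.

```python
def workload(queue):
    """Assumed workload function: the number of jobs on the node's queue."""
    return len(queue)


def rebalance(queues):
    """Move jobs from the busiest node to the idlest until loads differ by at most one."""
    while True:
        loads = {node: workload(q) for node, q in queues.items()}
        busiest = max(loads, key=loads.get)
        idlest = min(loads, key=loads.get)
        if loads[busiest] - loads[idlest] <= 1:
            return queues
        # Re-allocate one job from the busiest node to the idlest node.
        queues[idlest].append(queues[busiest].pop())


queues = {
    "A": ["j1", "j2", "j3", "j4", "j5", "j6"],
    "B": [],
    "C": ["j7", "j8"],
}
rebalance(queues)
lengths = sorted(len(q) for q in queues.values())
```

A scheme of this kind evens out queue lengths, but the workload value by itself carries no indication of whether a node is actually completing the jobs on its queue.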
Thus the prior art is unable to discriminate between an overloaded system which is still processing items in a queue and a system which is failing.