The present invention relates to data processing systems.
In large data processing systems, tasks are commonly performed by one of a number of related operating elements. When an operating system such as the IBM MVS system is used, these elements may be separate virtual address spaces under a single operating system. Alternatively, the elements may correspond to individual operating systems running on data processing machines which are linked by a data network to form a single data processing system.
Operation using multiple linked elements gives several advantages. Each operating element can be made effectively autonomous, so that failure of one element should not affect the others. The virtual or real storage addressed by one element can be made inaccessible to other elements to prevent accidental erasure or overwriting. If each operating element is restricted in the amount of virtual storage which can be addressed, then subdividing a large program between a number of operating elements can increase the storage available to that program. Large data files can be held in a single element with selective access being given to other elements, so avoiding the need for multiple copies of the files. Finally, in a multiprocessor environment, system restrictions may mean that a single operating element can only use one central processing unit (CPU). The use of a number of operating elements would then allow the full potential of the multiple CPUs to be realized.
The accompanying FIGS. 1 and 2 show how data processing operations may be subdivided between a number of elements as described above. In FIG. 1 the operating elements correspond to separate operating systems residing on separate, but linked, data processing machines. This arrangement will be referred to as an `intersystem communication` (ISC) configuration. In the example shown, two data processing machines are linked by data link 30. Each machine has corresponding terminals 40, 50 and storage apparatus 60, 70.
Similarly, FIG. 2 shows `multiregion operation` (MRO), in which a number of computer programs run on the same data processing machine and under the same operating system, but in different address spaces or regions. Here two regions 110, 120 are shown running on the single machine 100. The regions may run production and test versions of a single computer program, or different versions of the same program for use by different departments in a company, or different parts of a larger overall program. Communication between the regions is possible, and the programs can share the same terminals 130, 140 and storage apparatus 150. Alternatively, each region 110, 120 can have associated dedicated terminals (130, 140 respectively) or other peripheral apparatus.
An example of a computer program which is designed to operate in an ISC or MRO configuration is the IBM CICS/MVS computer program operating under the IBM MVS operating system. (IBM, CICS/MVS and MVS are trademarks of the International Business Machines Corporation.) Such operation is described in IBM manual number SC33-0519, entitled `CICS/MVS Version 2.1 Intercommunication Guide` (first edition, April 1988).
FIG. 3 shows another typical use of MRO or ISC operation, in which each region or machine provides a different type of function as part of an overall system 200. User commands and output data are handled by a terminal owning region (TOR) 210, application programs are processed in a number of application owning regions (AORs) 220, 230, and file handling and data storage are performed by a file owning region (FOR) 240. This arrangement simplifies terminal and file handling arrangements and allows higher priority applications to be run in faster or higher priority regions.
In order that a system may be subdivided as shown in FIG. 3, there must be some means for communicating data items such as variables and instructions between the regions. Some communication paths may be forbidden or simply not required, such as direct communication between the TOR and FOR. Communication is initiated by the acquisition of one or more real or virtual links between the two regions or elements. Normally the acquisition and relinquishing of the links is an ongoing process which does not cause a bottleneck of link acquisition requests to build up, assuming that sufficient links are provided. However, if for some reason the flow of work is held up such that link acquisition requests are being made more often than links are being relinquished, the excess acquisition requests are held by the requesting element in a stack of waiting requests. Again, in most cases any slowdown in the relinquishing of links is only temporary, so the requests held in the stack can be serviced and operation returned to normal.
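The acquisition and relinquishing of links described above can be illustrated by a minimal sketch. The class name `LinkPool`, its methods, and the request identifiers are hypothetical and are not taken from CICS/MVS or any other product. The order in which waiting requests are serviced is not specified above, so oldest-first servicing is assumed here.

```python
from collections import deque


class LinkPool:
    """Hypothetical model of the links available from one region to another."""

    def __init__(self, total_links):
        self.free_links = total_links
        # Requests that could not be granted a link are held by the requester.
        self.waiting = deque()

    def acquire(self, request_id):
        """Grant a link if one is free; otherwise hold the request."""
        if self.free_links > 0:
            self.free_links -= 1
            return True
        self.waiting.append(request_id)
        return False

    def release(self):
        """Relinquish a link, passing it to the oldest waiting request if any.

        Returns the identifier of the request that received the link, or None
        if no request was waiting (assumed oldest-first servicing).
        """
        if self.waiting:
            return self.waiting.popleft()
        self.free_links += 1
        return None


# Two links, three requests: the third request must wait.
pool = LinkPool(total_links=2)
granted = [pool.acquire(r) for r in ("r1", "r2", "r3")]
# Relinquishing one link lets the waiting request proceed.
served = pool.release()
```

In this sketch the waiting list drains as soon as links are relinquished, which corresponds to the normal case in which a hold-up is only temporary.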
A more realistic scenario is shown in FIG. 4, in which there are a number of interconnected TORs (300, 310, 320), AORs (330, 340, 350, 360) and FORs (370, 380, 390). A typical installation could have several hundred different regions or elements.
Although it was stated above that MRO or ISC operation tends to prevent the failure of one region or element from causing the failure of another, circumstances will now be described in which this can occur. Consider a major problem occurring in one of the FORs (for example FOR 370 in FIG. 4), such that the FOR slows down or halts without actually terminating. This will cause a build-up of stacked or waiting link acquisition requests at the AORs. Sooner or later one of these AORs will have so many link acquisition requests waiting to be serviced, and therefore using up storage and other resources, that it will itself slow down or stop operation. This in turn will cause similar problems at other FORs, other AORs, and of course the TORs. In this way the failure of one region can cause a spread of sympathetic failure to other regions throughout the overall system.
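The build-up at a single AOR can be illustrated with a simple discrete-time sketch. The function name, the rates, and the storage threshold `max_waiting` are all assumed values chosen for illustration; a real region would degrade in a less clear-cut way.

```python
def ticks_until_aor_stalls(arrivals_per_tick, for_service_rate, max_waiting,
                           horizon=1000):
    """Return the tick at which waiting requests at an AOR exceed max_waiting.

    Returns None if the threshold is not exceeded within `horizon` ticks.
    All parameters are hypothetical modelling quantities.
    """
    waiting = 0
    for tick in range(1, horizon + 1):
        # New link acquisition requests directed at the FOR.
        waiting += arrivals_per_tick
        # The FOR services at most for_service_rate requests per tick.
        waiting -= min(waiting, for_service_rate)
        if waiting > max_waiting:
            return tick
    return None


# A healthy FOR keeps pace with the AOR, so no build-up occurs.
healthy = ticks_until_aor_stalls(arrivals_per_tick=5, for_service_rate=5,
                                 max_waiting=100)
# A halted (but not terminated) FOR services nothing; requests accumulate.
stalled = ticks_until_aor_stalls(arrivals_per_tick=5, for_service_rate=0,
                                 max_waiting=100)
```

Under these assumed figures the AOR never reaches its threshold while the FOR is healthy, but reaches it after 21 ticks once the FOR halts. From that point the AOR would itself behave like a halted region toward the TORs and other regions that send it work.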
One prior art solution to the problem described above is to run a continuous program in each region to check the status of all other regions to which communication might be addressed. This is wasteful of processor resources and could exacerbate the situation because of the additional work and link traffic it would generate.
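The cost of such status checking can be estimated with a short calculation. The sketch assumes the worst case, in which every region may communicate with every other region; the function name is hypothetical.

```python
def status_checks_per_cycle(region_count):
    """Worst-case number of status checks if each region polls every other."""
    return region_count * (region_count - 1)


checks_small = status_checks_per_cycle(10)    # ten regions
checks_large = status_checks_per_cycle(300)   # several hundred regions
```

On this assumption the number of checks grows roughly with the square of the number of regions: 90 checks per polling cycle for ten regions, against 89,700 for three hundred.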
A further prior art solution is to set a limit on the amount of work sent by one region to another, in order to limit the total workload of the recipient region. However, this does not take into account the fact that other regions may also be sending work to the same recipient. Also, in a typical data processing installation the number of regions may well increase as the system is expanded over the course of perhaps a few days or weeks. It would be very inconvenient to have to reset the workload limits whenever a new region is added. In order to prevent sympathetic failure the limits would have to be conservative, so this solution can also restrict the total throughput of the system.
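The drawbacks of fixed per-sender limits can be shown with a small worked example. The capacity figure, the sender counts, and the function name are assumptions made purely for illustration.

```python
def safe_per_sender_limit(recipient_capacity, sender_count):
    """Largest equal per-sender limit that cannot overload the recipient."""
    return recipient_capacity // sender_count


capacity = 120  # assumed workload the recipient region can handle

# With four sending regions, each may be allowed 30 units of work.
limit_4 = safe_per_sender_limit(capacity, 4)

# Adding a fifth sender without resetting the limits permits overload.
worst_case_old_limit = 5 * limit_4

# Resetting the limits for five senders restores safety...
limit_5 = safe_per_sender_limit(capacity, 5)

# ...but a sender working alone can then use only part of the capacity.
lone_sender_fraction = limit_5 / capacity
```

With these figures, leaving the old limit in place allows up to 150 units of work to reach a recipient that can handle 120, while the reset limit of 24 confines a lone active sender to one fifth of the recipient's capacity.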