1. Field of the Invention
The present invention generally relates to data processing systems and more particularly relates to data processing system architectures which are arranged in a cluster/lock processing configuration having efficient techniques to recover from system component failures.
2. Description of the Prior Art
It is known in the prior art to increase the computational capacity of a data processing system through enhancements to an instruction processor. It is also known that enhancements to instruction processors become extremely costly to design and implement. Because such enhancements tend to render the resulting system special purpose in nature, the quantity of such enhanced processors needed within the marketplace is quite small, thus tending to further increase per unit costs.
An early approach to solving this problem was the “super-computer” architecture of the 1960s, 1970s, and 1980s. Using this technique, a single very large capacity instruction processor (or a small number of them) is surrounded by a relatively large number of peripheral processors. The large capacity instruction processor is more fully utilized through the work of the peripheral processors, which queue tasks and data and prepare needed output. In this way, the large capacity instruction processor does not waste its time doing the more mundane input/output and conversion tasks.
This approach was found to have numerous problems. Reliability tended to rest solely on the reliability of the large capacity instruction processor, because the peripheral processors could not provide efficient processing without it. On the other hand, at least some of the peripheral processors were needed to provide the large capacity instruction processor with its only input/output interfaces. The super-computer approach is also very costly, because performance rests on the ability to design and build the uniquely large capacity instruction processor.
An alternative approach to increasing computational capacity is the employment of a plurality of instruction processors within the same operational system. This approach has the advantage of generally increasing the number of instruction processors in the marketplace, thereby increasing utilization volumes. It is further advantageous that such an approach tends to utilize redundant components, so that greater reliability can be achieved through appropriate coupling of components.
However, it is extremely difficult to create architectures which employ a relatively large number of instruction processors. Typical problems involve: non-parallel problems which cannot be divided amongst multiple instruction processors; severe management problems which can actually slow throughput because of excessive contention for commonly used system resources; and system viability issues arising because of the large number of system components which can contribute to failures that may be propagated throughout the system. Thus, it can be seen that such a system can decrease system performance while simultaneously increasing system cost.
An effective solution is the technique known as the “cluster/lock processing system”, such as the XPC (Extended Processing Complex) available from Unisys Corporation and described in U.S. Pat. No. 5,940,826, entitled “Dual XPCs for Disaster Recovery in Multi-Host Environments”, which is incorporated herein by reference. This technique utilizes the XPC with a relatively large number of instruction processors which are “clustered” about various shared resources. Tasking and management tend to be decentralized, with the clustered processors having shared responsibilities. Maximal redundancy is utilized to enhance reliability.
Though a substantial advance, cluster/lock systems tend to solve the reliability problems but remain relatively costly to implement, because virtually all of the hardware and firmware are specifically designed and manufactured for the cluster/lock architecture. This is necessary to enable each of the system components to effectively contribute to system reliability, system management, and system viability. As a result, demand volumes remain relatively low. Furthermore, the logic necessary to provide component failure recovery tends to be implemented within special purpose hardware and firmware, thereby further exacerbating the problems associated with low volume production. Also, recovery times become highly important in real-time and near-real-time applications.