1. Field of the Invention
The present invention generally relates to data processing systems and more particularly relates to fail recovery redundancy provisions for data processing system architectures which employ commodity hardware within a cluster/lock operating environment.
2. Description of the Prior Art
It is known in the prior art to increase the computational capacity of a data processing system through enhancements to an instruction processor. It is also known that enhancements to instruction processors become extremely costly to design and implement. Because such enhancements tend to render the resulting system special purpose in nature, the quantity of such enhanced processors needed within the market place is quite small, thus tending to further increase per unit costs.
An early approach to solving this problem was the “super-computer” architecture of the 1960s, 1970s, and 1980s. Using this technique, a single (or small number of) very large capacity instruction processor(s) is surrounded by a relatively large number of peripheral processors. The large capacity instruction processor is more fully utilized through the work of the peripheral processors, which queue tasks and data and prepare needed output. In this way, the large capacity instruction processor does not waste its time doing the more mundane input/output and conversion tasks.
This approach was found to have numerous problems. Reliability tended to rest solely on the reliability of the large capacity instruction processor, because the peripheral processors could not efficiently process anything without it. On the other hand, at least some of the peripheral processors are needed to provide the large capacity instruction processor with its only input/output interfaces. The super-computer approach is also very costly, because performance rests on the ability to design and build the uniquely large capacity instruction processor.
An alternative approach to increasing computational capacity is the employment of a plurality of instruction processors within the same operational system. This approach has the advantage of generally increasing the number of instruction processors in the market place, thereby increasing utilization volumes. It is further advantageous that such an approach tends to utilize redundant components, so that greater reliability can be achieved through appropriate coupling of components.
However, it is extremely difficult to create architectures which employ a relatively large number of instruction processors. Typical problems involve: non-parallel problems which cannot be divided amongst multiple instruction processors; horrendous management problems which can actually slow throughput because of excessive contention for commonly used system resources; and system viability issues arising because of the large number of system components which can contribute to failures that may be propagated throughout the system. Thus, it can be seen that such a system can decrease system performance while simultaneously increasing system cost.
An effective solution is the technique known as the “cluster/lock” processing system, such as the XPC (Extended Processing Complex) available from Unisys Corporation and described in U.S. Pat. No. 5,940,826, entitled “Dual XPCs for Disaster Recovery in Multi-Host Environments”, which is incorporated herein by reference. This technique utilizes a relatively large number of instruction processors which are “clustered” about various shared resources. Tasking and management tend to be decentralized, with the cluster processors having shared responsibilities. Maximal redundancy is utilized to enhance reliability.
Though a substantial advance, cluster/lock systems tend to solve the reliability problems but remain relatively costly to implement, because virtually all of the hardware and firmware are specifically designed and manufactured for the cluster/lock architecture. This is necessary to enable each of the system components to effectively contribute to system reliability, system management, and system viability. As a result, demand volumes remain relatively low.
In implementing prior art modular cluster/lock systems, it is normal to separate the locking, caching, and mass storage accessing functions. This is logical because it provides maximum scalability. However, with this approach, because the cluster/lock processor cannot directly connect to the mass storage devices upon which the data base resides, the acceleration of data into the cache and deceleration back to mass storage is very time consuming, complex to design, and cumbersome to manage. As a result of this separation of the functions of I/O and cluster locking into different platforms, the architecture becomes more costly in two ways. First, each of the different kinds of platforms is required to have a full set of capabilities, because both platforms must have some I/O capability, and each must have some processing capacity. Second, and perhaps most important, the connectivity becomes almost unmanageable, because each of the devices must communicate with each of the other devices. Furthermore, any solution to the connectivity problem is likely to increase system overhead, because of the need to accommodate all of the inter-platform interfaces.
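The scale of the connectivity problem described above can be illustrated with a short calculation. The following is only a sketch under a simplifying assumption that is not stated in the text: every device is taken to need one dedicated point-to-point interface to every other device (a full mesh). The function name `full_mesh_links` is hypothetical and is used for illustration only.

```python
def full_mesh_links(device_count: int) -> int:
    """Count point-to-point links when every device talks to every other.

    Assumption (illustrative only): one dedicated link per pair of devices.
    """
    if device_count < 0:
        raise ValueError("device_count must be non-negative")
    # Each of the n devices pairs with the other n - 1; each pair is counted once.
    return device_count * (device_count - 1) // 2


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(f"{n} devices -> {full_mesh_links(n)} links")
```

Under this assumption the number of interfaces grows roughly with the square of the number of devices, which is one way to read the statement that the connectivity becomes almost unmanageable as platforms are added.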
To increase system availability, it is important for cluster/lock processing systems to employ certain redundancies to permit continued operation even through failures of individual system components. This is typically done by establishing architectures which are not vulnerable to single failures within the system. However, it is assumed that to accomplish system recovery, a resource or group of resources having identical structure must be substituted for the failing component or subsystem.
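The notion of an architecture that is not vulnerable to single failures can be sketched as a simple inventory check. The example below is a hypothetical illustration, not a description of any particular system: the component names, the role labels, and the function `single_points_of_failure` are all assumptions introduced here. It simply flags any role that is served by exactly one component.

```python
from collections import Counter


def single_points_of_failure(components):
    """Return the roles served by exactly one component.

    `components` is a list of (name, role) pairs; the names and roles are
    hypothetical labels chosen for this illustration.
    """
    counts = Counter(role for _name, role in components)
    return sorted(role for role, count in counts.items() if count == 1)


# Hypothetical inventory: two hosts, two storage devices, one lock processor.
inventory = [
    ("host-a", "host"),
    ("host-b", "host"),
    ("lock-1", "cluster_lock"),
    ("disk-1", "mass_storage"),
    ("disk-2", "mass_storage"),
]

if __name__ == "__main__":
    print(single_points_of_failure(inventory))
```

In this hypothetical inventory only the cluster/lock role has no second instance. Note that the check counts components by identical role, which mirrors the prior art assumption that the substitute resource must have the same structure as the failing one.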