1. Field of the Invention
The present invention relates to a computer system, a fault tolerant system using the same, an operation control method, and a program thereof, and particularly to an improvement of a fault tolerant computer system.
2. Description of the Prior Art
Recently, the performance of widely used general purpose CPUs has increased significantly, and installing a general purpose operating system (OS) on a workstation or server that uses such a general purpose CPU provides a high-performance, inexpensive system. As a result, systems using high-performance, inexpensive general purpose CPUs are now used even in applications where very expensive large-scale computers were conventionally used.
On the other hand, mission-critical applications in which a system must run continuously, 24 hours a day, are also increasing. In these applications, it is important to construct the system so that system downtime is prevented.
However, a general purpose CPU lacks a fault detection function of its own, and a general purpose OS lacks both a means of notifying hardware failures and a defined fault processing for responding to them. Consequently, with such a general purpose CPU and general purpose OS, a hardware failure causes system breakdown. Therefore, in order to provide a high reliability system, it is necessary to add a special peripheral circuit or to develop a dedicated OS, which makes it difficult to develop high reliability systems at the pace at which normal general purpose CPU systems are developed. For this reason, the cost-performance gap between normal systems and high reliability systems tends to widen.
Therefore, in order to provide a high reliability computer system, such as a fault tolerant computer system, having commonality with a computer system using a general purpose CPU, Japanese Patent Laid-Open No. 09-034809, for example, proposes CPUs that perform the same processing synchronously with the same clock, a device that detects a failure of a CPU and disconnects the faulty CPU, and a system that disconnects a faulty IO path by CPU instructions in response to an IO failure. However, because widely used general purpose OSs provide neither a notification method for hardware failures nor a fault processing function, there is a problem that system breakdown occurs.
Therefore, in order to use a general purpose OS, a configuration is required in which hardware failures are completely isolated from the OS. For example, in National Publication of International Patent Application No. 2001-523855, a calculation element (CE) and an IO control part (IOP) are each configured as one computer system, and multiple instances of each element are connected so that redundancy is achieved. The elements communicate with one another, and a CE or IOP in which a failure is detected is disconnected.
For example, the hardware of the IO control part is virtualized from the viewpoint of the OS. Although an IO control part in which a failure occurs may be paused, the failure does not affect the OS directly; the occurrence of the failure can be concealed by detecting it and having the remaining redundant IO control parts return a normal response.
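The concealment of an IO failure described above can be illustrated with a minimal sketch. The class names `IOPart` and `VirtualIO` and the `read` method are hypothetical and are not taken from the cited publication; the sketch only assumes that a failed part can be detected and paused while a remaining redundant part returns the normal response.

```python
class IOPart:
    """Hypothetical stand-in for one redundant IO control part."""

    def __init__(self, name):
        self.name = name
        self.failed = False

    def read(self, block):
        if self.failed:
            raise OSError(f"{self.name} failed")
        return f"data[{block}]"


class VirtualIO:
    """The single virtualized device that the OS sees."""

    def __init__(self, parts):
        self.active = list(parts)

    def read(self, block):
        # Try each remaining part in turn; a part that fails is paused
        # (removed from the active set) and the next part answers instead.
        for part in list(self.active):
            try:
                return part.read(block)
            except OSError:
                self.active.remove(part)
        raise OSError("all redundant IO control parts have failed")


iop0, iop1 = IOPart("iop0"), IOPart("iop1")
virtual_io = VirtualIO([iop0, iop1])
iop0.failed = True              # simulate a hardware failure in one part
response = virtual_io.read(7)   # the caller still receives a normal response
```

In this sketch the caller never observes the failure of `iop0`; it only observes an error once every redundant part has failed.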
However, in this system, the CPU and the IO each require one computer, and a redundant configuration requires correspondingly more computers. Further, since each computer operates asynchronously or performs different processing, extra OS licenses need to be installed and the system becomes expensive.
There are systems that allow the connection configuration of an IO device or an IO bridge on a bus to be changed dynamically by using standard PnP (Plug & Play) software on a general purpose OS. It is therefore conceivable to construct a high reliability fault tolerant computer system by adopting a redundant configuration that uses such PnP software.
However, PnP processing for an IO bridge does not support completely dynamic configuration changes. For example, when an IO bridge is connected, neither the amount of memory space requested by the devices connected to the IO bridge nor the number of such devices has been determined. Because of limitations in the PnP control software or the OS, memory space cannot be allocated freely to the connected IO bridge (the allocation is typically defined by a fixed value), and the number and types of devices are limited as a result.
For example, when an IO bridge is connected by PnP, the OS cannot determine how many resources the IO bridge requires, so a certain amount of memory space is allocated and cannot be changed thereafter. For this reason, when a further IO bridge or device is connected under the connected IO bridge, an IO bridge or device that requires more memory space than was allocated to the first IO bridge cannot be connected.
Also, if a plurality of devices are connected, the total memory space required may exceed the memory space allocated to the IO bridge, so some devices may fail to obtain memory resources depending on the number of connected devices. This limitation on memory resource allocation becomes a serious problem in a multistage IO bridge configuration in which a plurality of IO bridges are connected under a PnP-connected IO bridge.
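The allocation limitation described above can be illustrated with a simplified model. The `Bridge` class, the 1 MiB `FIXED_WINDOW` value, and the request sizes below are illustrative assumptions rather than values from any actual OS or PnP software, and address alignment is ignored.

```python
class Bridge:
    """Hypothetical PnP-connected IO bridge with a fixed memory window."""

    def __init__(self, window):
        self.window = window  # bytes assigned at connection time; never grows
        self.used = 0

    def attach(self, size):
        """Try to allocate `size` bytes for a device under the bridge."""
        if self.used + size > self.window:
            return False      # request exceeds what remains of the window
        self.used += size
        return True


FIXED_WINDOW = 0x100000  # 1 MiB, an illustrative fixed value

# Several devices whose combined request exceeds the window:
bridge = Bridge(FIXED_WINDOW)
requests = [0x80000, 0x40000, 0x80000]  # 512 KiB, 256 KiB, 512 KiB
results = [bridge.attach(size) for size in requests]

# One device that by itself needs more than the window:
too_large = Bridge(FIXED_WINDOW).attach(0x200000)  # 2 MiB
```

Here the first two devices fit within the window, the third fails because only 256 KiB remains, and the single 2 MiB device can never be connected under a bridge whose window was fixed at 1 MiB.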
Conventionally, because of the above limitation, constructing a fault tolerant computer system with a system using PnP software has required modifying the OS-standard PnP control software and the OS itself so that sufficient resources are allocated to the IO bridge.
It is an object of the invention to provide a high reliability, high availability computer system in which a redundant configuration can be constructed without the need to modify existing general purpose OS functions, as well as a fault tolerant system using the same, an operation control method, and a program thereof.