1. Field of the Invention
The present invention generally relates to a duplex system of controllers. More specifically, the present invention is directed to a fault tolerant computer system in which interrupt controls are duplexed.
2. Description of the Related Art
A fault tolerant computer system is known as a computer system with high reliability. In the fault tolerant computer system, all of the hardware modules of the computer system are duplexed or multiplexed. All of these hardware modules operate in synchronization with each other. Even if a failure occurs in a portion of a hardware module, the failed hardware module is disconnected from the fault tolerant computer system, and operation is continued by the remaining normal hardware modules. As a result, the fault resistance of the system is improved.
FIG. 1 shows an example of a configuration of the fault tolerant computer system. The fault tolerant computer system of this example is provided with a fault tolerant (FT) control section 10, and hardware modules such as CPUs, memories, and I/O devices are duplexed. The FT control section 10 is connected to the hardware modules and carries out synchronization processing and switching control when a failure has occurred.
In the fault tolerant computer system shown in FIG. 1, a CPU (or CPU group) 2A, a main memory 3A and a part of the FT control section 10 constitute one CPU sub-system 1A, and another CPU sub-system 1B is provided to have completely the same configuration as the CPU sub-system 1A. Thus, the two CPU sub-systems 1A and 1B are duplexed. Similarly, I/O devices (I/O device groups) 5A and 5B having the same configuration are duplexed and constitute an I/O sub-system. The FT control section 10 is located at the center of these hardware modules, and controls each of the hardware modules, such as the CPU sub-systems 1A and 1B and the I/O device groups 5A and 5B, to keep the CPU sub-systems operating synchronously and to detect a failure. The FT control section 10 also carries out control to disconnect a failed hardware module from the fault tolerant computer system. Although the two CPU sub-systems 1A and 1B are present in the computer system of FIG. 1, when a failure occurs in one of them, the failed sub-system is logically disconnected by the FT control section 10, and processing is continued by the remaining CPU sub-system and the I/O sub-system.
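The disconnect-and-continue behavior described above can be illustrated with a minimal sketch. It assumes a toy model in which a failure is simply reported to the FT control section by name. The class `FTControlSection` and the method `report_failure` are hypothetical names introduced for illustration only, and the text does not state how a failure is actually detected.

```python
class FTControlSection:
    """Toy model of an FT control section tracking duplexed CPU sub-systems.

    All names here are illustrative assumptions, not taken from the source.
    """

    def __init__(self, subsystem_names):
        # Every sub-system starts out connected and operating in synchronization.
        self.connected = set(subsystem_names)

    def report_failure(self, name):
        """Logically disconnect a failed sub-system and return the survivors."""
        if name not in self.connected:
            raise ValueError(f"{name} is not connected")
        if len(self.connected) == 1:
            raise RuntimeError("no redundant sub-system left to continue processing")
        self.connected.discard(name)
        return sorted(self.connected)


ft = FTControlSection(["1A", "1B"])
survivors = ft.report_failure("1A")  # processing continues on the remaining sub-system
```

The sketch models only the logical bookkeeping. Keeping the two sub-systems in synchronization is a hardware function and is not represented here.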
Generally speaking, the fault tolerant computer system is divided into a portion which is duplexed in hardware and a portion which is duplexed in software. For example, the CPU sub-systems 1A and 1B are the platforms on which software is executed, and therefore must be duplexed in hardware. When a failure has occurred in one CPU sub-system, the FT control section 10 disconnects the CPU or the memory of the failed CPU sub-system from the computer system, and carries out control such that the failure does not adversely affect the CPU and the memory that are operating normally. On the other hand, when a failure has occurred in an I/O device, the FT control section 10 detects the failure and notifies the occurrence of the failure to the software for controlling the I/O device (to be referred to as an “I/O device driver”, hereinafter). Thus, it is possible to switch the I/O devices in software. In this case, the I/O device driver stops driving the I/O device in which the failure has occurred, and drives the other of the duplexed I/O devices. This is realized as switching of the I/O devices used in the I/O sub-system.
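The software-side switching described above can be sketched as follows. This is a minimal illustration under stated assumptions: the classes `IoDevice` and `DuplexedIoDriver` and the callback `on_failure_notified` are hypothetical names, and the notification from the FT control section is modeled as a plain method call.

```python
class IoDevice:
    """Stand-in for one of the duplexed I/O devices (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False


class DuplexedIoDriver:
    """Hypothetical I/O device driver that drives one device of a duplexed pair."""

    def __init__(self, primary, standby):
        self.active = primary
        self.standby = standby
        self.active.start()

    def on_failure_notified(self, failed):
        """Assumed entry point called when the FT control section reports a failure."""
        if failed is self.standby:
            # The idle device failed: keep driving the active one.
            self.standby = None
            return
        if failed is not self.active:
            return  # the failed device is not managed by this driver
        if self.standby is None:
            raise RuntimeError("no healthy I/O device left to switch to")
        # Stop driving the failed device and drive the other duplexed device.
        self.active.stop()
        self.active, self.standby = self.standby, None
        self.active.start()


dev_a, dev_b = IoDevice("5A"), IoDevice("5B")
driver = DuplexedIoDriver(dev_a, dev_b)
driver.on_failure_notified(dev_a)  # the driver switches from 5A to 5B
```

The switch works because the driver holds no state that exists only inside the failed device.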
However, some I/O devices cannot be duplexed in software. An interrupt controller is one such device. The interrupt controller receives an interrupt request issued from each of the I/O devices or the like, and notifies the CPU of the interrupt request. Each interrupt request is allocated an interrupt number called “IRQ” by an operating system (OS). In some cases, a plurality of I/O devices are allocated to a single interrupt number. The interrupt controller converts the interrupt request issued from each of the devices into the predetermined interrupt number, and then notifies the CPU of the interrupt number. While the CPU is executing an interrupt process corresponding to a certain interrupt number, the interrupt controller withholds notification of any further interrupt request having the same interrupt number, and manages the interrupt requests issued from the plurality of devices such that none of them is lost. For this purpose, the interrupt controller internally holds a status corresponding to each interrupt request being processed. Therefore, if a failure occurs in the interrupt controller, all of the data for the interrupt requests is lost. As a result, the interrupt controller cannot be restored to its original status by software.
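The internal status that makes the interrupt controller hard to duplex in software can be illustrated with a small model. This is a sketch under assumptions and not a description of any real interrupt controller: the class `InterruptController`, its methods, the device names, and the IRQ value 5 are all hypothetical.

```python
class InterruptController:
    """Toy interrupt controller that holds in-service and pending status."""

    def __init__(self, irq_map):
        self.irq_map = dict(irq_map)  # device name -> IRQ number assigned by the OS
        self.in_service = set()       # IRQ numbers the CPU is currently handling
        self.pending = {}             # IRQ number -> devices whose requests are held

    def request(self, device):
        """Return the IRQ to notify to the CPU, or None if the request is held."""
        irq = self.irq_map[device]
        if irq in self.in_service:
            # Same IRQ is already being handled: hold the request so it is not lost.
            self.pending.setdefault(irq, []).append(device)
            return None
        self.in_service.add(irq)
        return irq

    def end_of_interrupt(self, irq):
        """CPU finished handling irq; re-notify it if a request was held."""
        self.in_service.discard(irq)
        waiting = self.pending.get(irq)
        if not waiting:
            return None
        waiting.pop(0)
        if not waiting:
            del self.pending[irq]
        self.in_service.add(irq)
        return irq


pic = InterruptController({"disk": 5, "nic": 5})  # two devices share one IRQ
first = pic.request("disk")    # notified to the CPU immediately
second = pic.request("nic")    # held, because IRQ 5 is in service
held = {irq: list(devs) for irq, devs in pic.pending.items()}

# A replacement controller built only from the static IRQ map has no record
# of the held request, which is the status software cannot reconstruct.
replacement = InterruptController(pic.irq_map)
lost_state = replacement.pending

redelivered = pic.end_of_interrupt(5)  # the original controller re-notifies IRQ 5
```

The static device-to-IRQ mapping can be recreated by software, but the in-service and pending status exists only inside the controller, so it disappears together with a failed controller.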
Further, although present operating systems (OSs) such as “Windows” (registered trademark) and “Linux” allow a plurality of interrupt controllers to exist, these operating systems cannot cope with interrupt controllers being added or removed during operation. Therefore, the interrupt controllers which were present when the computer system was started must remain present until the operating system is shut down, and must continue to operate in the normal state.
Incidentally, present PC servers are directed toward open systems, and when a PC server is to be manufactured at low cost, an Intel-compatible (Intel is a registered trademark) CPU and electronic components which are commercially available at low prices are necessarily selected. Also, Windows and Linux are the major operating systems in present PC servers and have been designed based upon the Intel-compatible architecture. However, there are many problems in configuring a fault tolerant computer system at low cost from open-system PC servers.
For instance, most I/O devices and most operating systems such as “Windows” are not designed with the fault tolerant computer system in mind. Therefore, even if the devices are duplexed, the PC server cannot completely cope with a fail-over process when a failure occurs. In the Intel-compatible PC server, the interrupt control depends on a special I/O device called the “south bridge”, on which the legacy functions are concentrated. In particular, since the interrupt control is a central function of the system operation, the operating system directly accesses the south bridge to control its operation. For this reason, once a failure has occurred in the south bridge, the function of the operating system is completely lost, and the system goes down. Also, it is practically impossible to modify an operating system such as Windows, which is mainly used in open-system PC servers, to adapt it to the fault tolerant computer system.
In conjunction with the above description, Japanese Laid-open Patent Application (JP-A-Heisei 9-251443) discloses a processor fault recovering method for an information processing system. In this conventional example, the information processing system has a plurality of processors, at least one of which operates as a system supporting processor. The remaining processors operate as instruction processors. In such an information processing system, when a failure has occurred in one instruction processor, an interrupt is issued to an operating system (OS) which is running on at least one instruction processor. The operating system recognizes that the failure has occurred in the instruction processor, stops the application program that was being executed on that instruction processor when the interrupt was issued, and then replaces the failed instruction processor with the system supporting processor.