1. Field of the Invention
The present invention relates to a method and a system for resetting a fault tolerant computer system equipped with a plurality of modules.
2. Description of the Related Art
Fault tolerant computer systems have conventionally been available as computers that provide high reliability. A fault tolerant computer duplexes or multiplexes the hardware modules constituting the system and operates all the modules in synchronization. When a fault occurs in one module, that module is cut off and processing continues on the normal modules, thereby enhancing fault tolerance.
The fault tolerant computer basically includes hardware modules such as a CPU, a memory and an I/O device, which are duplexed or triplexed, and a fault tolerance control section (“FT control section” hereinafter) connected to the modules to execute synchronous operation processing, switching control at the time of a fault, and the like. FIG. 1 shows an example of a system in which a CPU, a memory and an I/O device are duplexed. In the drawing, a CPU (group) 901 and a main memory 902 constitute one CPU subsystem 903-1, which is duplexed with another CPU subsystem 903-2 of a completely identical configuration. Similarly, I/O devices (groups) of identical configurations are duplexed to constitute an I/O subsystem 904.
The FT control section is positioned at the center of the system and controls the modules (CPU subsystems 903-1, 903-2, and I/O subsystem 904). It maintains the synchronous operation of both CPU subsystems 903-1 and 903-2, detects faults, and cuts off a faulty module.
Generally, the fault tolerant computer is divided into a section in which the modules are duplexed and controlled by hardware and a section in which they are duplexed and controlled by software.
For example, the CPU subsystem constituted of the CPU and the memory is itself the board on which software runs, and must therefore be duplexed and controlled by hardware. Accordingly, when an error occurs in the CPU subsystem, the hardware (FT control section) cuts off the CPU or the memory from the system and executes control so that the normally operating CPU and memory are not affected.
In FIG. 1, there are two CPU subsystems 903-1 and 903-2. The faulty side is logically cut off by the FT control section, and operation continues with the remaining CPU subsystem 903-1 (or 903-2) and the I/O subsystem 904.
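The cut-off behavior described above can be illustrated with a minimal software sketch. In the system described here this control is performed by hardware, so the Python model below is only an illustration under stated assumptions: the class name `FTControlSection` and its methods are hypothetical and do not appear in the source, and the subsystem identifiers simply reuse the reference numerals of FIG. 1.

```python
class FTControlSection:
    """Toy software model of the hardware FT control section (hypothetical names)."""

    def __init__(self, cpu_subsystem_ids):
        # All duplexed CPU subsystems start online.
        self.online = {sid: True for sid in cpu_subsystem_ids}

    def report_fault(self, subsystem_id):
        # Logically cut off the faulty side; the normal side is left untouched.
        if subsystem_id not in self.online:
            raise KeyError(subsystem_id)
        self.online[subsystem_id] = False

    def active_subsystems(self):
        # Subsystems that have not been cut off, in their original order.
        return [sid for sid, up in self.online.items() if up]

    def can_continue(self):
        # Operation continues as long as at least one CPU subsystem remains.
        return len(self.active_subsystems()) >= 1


ft = FTControlSection(["903-1", "903-2"])
ft.report_fault("903-2")           # a fault is detected on one side
remaining = ft.active_subsystems() # the other side keeps running
still_running = ft.can_continue()
```

After the fault report, only CPU subsystem 903-1 remains in `remaining`, which mirrors the statement that operation continues with one CPU subsystem and the I/O subsystem.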
On the other hand, when a fault occurs in the I/O device, the FT control section that has detected the fault reports the error to software (“I/O device driver” hereinafter) for controlling the I/O device, whereby I/O device switching can be executed by the software. In this case, the I/O device driver stops using the faulty I/O device, and uses the other duplexed I/O device instead.
That is, the I/O device 905 in use within the I/O subsystem 904 is switched.
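The software-side switching can be sketched as follows. This is a minimal illustration, not the driver of any actual system: the class `DuplexedIODriver`, the method `on_error`, and the device labels "905-a" and "905-b" are hypothetical names introduced only for this example.

```python
class DuplexedIODriver:
    """Toy model of an I/O device driver that switches duplexed devices (hypothetical)."""

    def __init__(self, devices):
        if not devices:
            raise ValueError("at least one I/O device is required")
        self.devices = list(devices)
        self.failed = set()
        self.current = self.devices[0]

    def on_error(self, device):
        # Called when the FT control section reports an error on `device`.
        self.failed.add(device)
        if device != self.current:
            return  # the device in use is unaffected
        for candidate in self.devices:
            if candidate not in self.failed:
                self.current = candidate  # switch to the other duplexed device
                return
        raise RuntimeError("no healthy I/O device left")


driver = DuplexedIODriver(["905-a", "905-b"])
driver.on_error("905-a")     # fault reported on the device in use
in_use = driver.current      # the driver now uses the other device
```

Because the switching decision is made in the driver, the hardware only has to detect and report the fault, which is the division of roles described above.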
The CPU subsystems 903-1, 903-2 of the fault tolerant computer must operate on completely identical clocks, and it is important that reset be released at the same timing in both subsystems so that the CPUs start operating simultaneously.
According to a conventional method, e.g., JP-A-9-128258 “Resynchronous Reset Processing Method of Computer System”, an intersystem synchronization section connected to both processors simultaneously issues resets to the CPUs.
According to the system described in JP-A-9-128258, it is easy to issue resets simultaneously to a plurality of CPUs because a single intersystem synchronization section issues them. However, the presence of only one intersystem synchronization section creates a risk that the system will not start when a fault occurs in that section. In particular, the document does not mention the case in which the intersystem synchronization section is duplexed, and therefore does not describe how resets would be issued simultaneously to the CPUs in that case.
Few other documents specifically address synchronous reset control for a plurality of CPUs. One reason is that CPU synchronization methods typically use interruption synchronization, not a reset, as the starting point, as described for example in JP-A-7-073059. In a method frequently used conventionally, the operating system or system software running on a CPU stops at a certain check point, and synchronous operation is started upon reception of an interruption input from a synchronous control section.
According to this method, however, the internal state of the CPU must be completely understood in order to guarantee that the internal states of the CPUs are exactly the same when they stop at the check point. Otherwise, even when interruptions are applied to the CPUs simultaneously, the enormous internal logic of the CPUs is not necessarily held in the same state, and consequently synchronism of subsequent operations cannot be guaranteed.
That is, while the CPU executes a loop in the operating system or the system software to wait for an interruption, it appears stopped when seen from the outside, yet much of its internal logic is still operating, for example to process the loop instructions of the operating system or the system software, or to monitor the system bus for interruptions. The CPU also carries out prediction processing to achieve high speed, and the prediction contents may vary from CPU to CPU. Furthermore, even a difference in refresh timing or refresh address of the main memory between the CPU subsystems may cause the internal states of the CPUs to differ.
In older CPUs, the synchronization method that uses an interruption as a starting point may be effective. However, because the internal logic of recent CPUs has grown in size and complexity, it is virtually impossible to bring CPUs that have already started operating into a completely identical state by software. The only way to solve this problem is therefore to input completely synchronized reset signals to the CPUs so that all of their internal logic is reset.
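The difference between the two starting points can be illustrated with a deliberately simplified model. The sketch below is an assumption-laden toy, not a description of any real CPU: `ToyCPU`, its four-entry `predictor` table, and the method names are all hypothetical, and the predictor merely stands in for the internal logic that software cannot directly control.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCPU:
    """Toy CPU with software-visible state and hidden internal state (hypothetical)."""
    program_counter: int = 0
    predictor: list = field(default_factory=lambda: [0, 0, 0, 0])  # hidden state

    def run(self, branch_outcomes):
        # Executing code updates hidden prediction state as a side effect.
        for i, taken in enumerate(branch_outcomes):
            self.predictor[i % 4] = 1 if taken else 0
            self.program_counter += 1

    def wait_at_checkpoint(self):
        # Software can only bring the visible state to a known value.
        self.program_counter = 0

    def reset(self):
        # A hardware reset clears visible and hidden state alike.
        self.program_counter = 0
        self.predictor = [0, 0, 0, 0]

    def state(self):
        return (self.program_counter, tuple(self.predictor))


cpu_a, cpu_b = ToyCPU(), ToyCPU()
cpu_a.run([True, False, True])   # the two CPUs see slightly different histories
cpu_b.run([True, True, True])

cpu_a.wait_at_checkpoint()
cpu_b.wait_at_checkpoint()
same_after_checkpoint = cpu_a.state() == cpu_b.state()

cpu_a.reset()
cpu_b.reset()
same_after_reset = cpu_a.state() == cpu_b.state()
```

In this toy model the two CPUs still differ after stopping at the check point, because the hidden predictor contents differ, and become identical only after the reset. A real CPU holds vastly more internal state, which is why the text above treats synchronized reset as the only practical starting point.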