1. Field of the Invention
The present invention relates to a control method for an information processing apparatus, an information processing apparatus, a control program for an information processing system, and a redundant comprisal control apparatus, and in particular to a technique that is effective when applied to an information processing system that improves failure resistance by having redundantly configured hardware perform equivalent information processing.
2. Description of the Related Art
In server systems that require high reliability, information processing systems with a mirroring function are known. Such systems duplicate hardware such as processors and have both copies execute the same information processing in order to improve failure resistance; if an abnormality occurs in one processor, the information processing continues using the input and output signals of the normally operating mirror processor.
The processors detect abnormalities such as a parity error of input and output data, an ECC error, or a timeout error in the processing operation of each processor. If the output signals from the two processors do not match each other while no such abnormality has been detected in either of the duplicated processors, no criterion exists for judging which of the processors is abnormal. Consequently, the system must be stopped if emphasis is placed on reliability, but doing so decreases the availability of the system. This has been a technical challenge.
Meanwhile, detecting a power source abnormality generally takes a long time compared with the operating speed of a logic circuit such as a processor. For this reason, an attempt to identify the abnormal system by using a detection signal for a power source abnormality, that is, a power source voltage drop that causes a processor to become inoperative, cannot remedy the above-described failure in which the output signals do not match while both systems appear to be normal.
For example, patent document 1 discloses a high-reliability computer comprising first and second CPUs of the same configuration, a clock unit for supplying clock and reset signals of the same frequency and phase to these CPUs, a dual system adapter (DSBA) for connecting the two CPUs with an input/output apparatus, and an inter-block communication unit for exchanging CPU statuses, et cetera, between the two CPUs. The clock unit accomplishes synchronous execution of programs by the two CPUs, and the dual system adapter detaches one CPU if it fails so that the other, non-failing CPU continues the processing.
That is, the DSBA monitors and compares the two CPUs and accesses system resources such as memory and I/O by using the signal coming from the normally operating CPU of the two.
The DSBA confirms the normality of the CPUs by performing an ECC check, a parity check, et cetera, on the signals transmitted from each of the dualized CPUs, and by monitoring for an error signal with which a CPU reports an abnormality it has detected internally. Upon detecting an abnormality, the DSBA shuts off the system judged to be abnormal and continues the processing with the normal CPU only.
If the signals transmitted from CPU 0 and CPU 1 do not match while no abnormality is detected for either CPU, the only choices are to stop the system because processing cannot continue, or to continue the processing by using just one of the two CPUs.
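The decision logic described above can be illustrated with a minimal sketch. It is not taken from patent document 1; the names `CpuOutput`, `Action`, and `arbitrate`, the use of a single even-parity bit as the only check, and the choice to stop when both CPUs report errors are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    USE_BOTH = "outputs match; continue mirrored operation"
    USE_CPU0 = "detach CPU 1; continue on CPU 0"
    USE_CPU1 = "detach CPU 0; continue on CPU 1"
    UNDECIDABLE = "mismatch with no detected error; stop or pick one CPU"
    STOP = "both CPUs report errors; stop"  # assumption, not stated in the source


@dataclass
class CpuOutput:
    data: int           # value driven toward the adapter (hypothetical width)
    parity: int         # even-parity bit accompanying the data
    error_signal: bool  # error notification raised by the CPU itself


def parity_ok(out: CpuOutput) -> bool:
    # Even parity: the count of 1 bits in data plus the parity bit must be even.
    return (bin(out.data).count("1") + out.parity) % 2 == 0


def is_abnormal(out: CpuOutput) -> bool:
    # A CPU is judged abnormal if it raised its error signal or fails the check.
    return out.error_signal or not parity_ok(out)


def arbitrate(cpu0: CpuOutput, cpu1: CpuOutput) -> Action:
    bad0, bad1 = is_abnormal(cpu0), is_abnormal(cpu1)
    if bad0 and bad1:
        return Action.STOP
    if bad0:
        return Action.USE_CPU1
    if bad1:
        return Action.USE_CPU0
    if cpu0.data == cpu1.data:
        return Action.USE_BOTH
    # Both CPUs look healthy yet disagree: no criterion identifies the bad one.
    return Action.UNDECIDABLE
```

The last branch is the case at issue: both outputs pass every check but differ, so the sketch can only report that the faulty CPU cannot be identified.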
In the system disclosed by patent document 1, if a discrepancy occurs in an internal circuit of a CPU because of a failure of its internal semiconductor or a software error, a built-in error detection circuit can detect it by a parity check, et cetera. Likewise, if an error occurs on the bus between the CPU and the DSBA, the error can be detected by a parity check or ECC check of the bus at the input to the DSBA or the CPU.
If, however, there is an abnormality in the power source supplying the operating power to a CPU, the entire CPU is affected, so that the failure detection circuit, et cetera, within the CPU cannot function properly and hence cannot output an error signal. In that case, the CPU may output data to the controller as if it were operating normally.
In the case of a power source failure, once the supply voltage to the CPU falls below the lowest voltage for normal operation because of a drastic voltage drop, the CPU is considered to fall into a critically abnormal condition just a few milliseconds thereafter, and it is therefore possible to judge which CPU has become abnormal if sufficient time is allowed. However, if mirroring (i.e., dualizing the CPUs) is performed by hardware, it is necessary to judge the error immediately when the CPUs in the two systems output different signals, not a few milliseconds thereafter. It is therefore necessary to detect a power source abnormality before the power source failure causes a malfunction of the CPU circuit.
Meanwhile, CPUs have come to consume large amounts of power in recent years and require a power supply dedicated to each CPU; hence it has become necessary to take the effect of a power supply failure into consideration when practicing mirroring.
Incidentally, patent document 2 discloses a technique for a data processing system that includes a plurality of processing apparatuses and a monitoring apparatus for monitoring them. Each processing apparatus is equipped with a latch for retaining the output of a voltage abnormality detector that monitors the power source voltage at that processing apparatus. When an abnormality is detected in a processing apparatus, the monitoring apparatus refers to the latch to confirm whether a power source voltage abnormality has occurred. This makes it possible to confirm whether the detection of the voltage abnormality coincides with the malfunction of the processing apparatus, and thereby to clarify the cause-and-effect relationship between the voltage abnormality and the abnormality of the processing apparatus.
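The latch arrangement attributed to patent document 2 can be pictured with the following minimal sketch. The class `VoltageLatch`, the function `diagnose`, the threshold parameter `min_voltage`, and the returned strings are hypothetical names chosen for illustration and do not come from that document.

```python
class VoltageLatch:
    """Retains a voltage-abnormality indication until explicitly cleared."""

    def __init__(self, min_voltage: float):
        self.min_voltage = min_voltage  # assumed lowest voltage for normal operation
        self.latched = False

    def sample(self, voltage: float) -> None:
        # Once set, the latch stays set even if the voltage later recovers.
        if voltage < self.min_voltage:
            self.latched = True

    def clear(self) -> None:
        self.latched = False


def diagnose(apparatus_abnormal: bool, latch: VoltageLatch) -> str:
    # The monitoring apparatus consults the latch only after an abnormality
    # has already been detected in the processing apparatus.
    if not apparatus_abnormal:
        return "no abnormality detected"
    if latch.latched:
        return "abnormality coincides with a latched voltage abnormality"
    return "abnormality without a latched voltage abnormality"
```

In this sketch, a brief dip such as the sample sequence 1.2, 0.8, 1.2 against a threshold of 1.0 leaves the latch set, so a later call to `diagnose(True, latch)` can still associate the malfunction with the voltage event. Because the lookup is triggered by a detected abnormality, it yields nothing when no abnormality has been detected.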
While the technique disclosed by patent document 2 makes it possible to determine a cause-and-effect relationship between a voltage abnormality and a malfunction of a processing apparatus, it does not disclose a technique for identifying the failed CPU when the processing results of a plurality of CPUs do not match while no fault has been detected for any of the CPUs, as described above.
Likewise, patent document 3 discloses a multiplex system in which each of the multiplexed processing apparatuses comprises a power source state retention unit and a control unit. The power source state retention unit monitors the input power of its own apparatus and stores whether power has been reapplied following an instantaneous power outage. In response to detecting non-response in another apparatus, the control unit refers to the power source state retention unit of that other processing apparatus, judges whether the non-response was caused by a system restart following the instantaneous power outage, and resets the state of the power source state retention unit of that other processing apparatus.
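The interaction between the control unit and the other apparatus's retention unit might be sketched as follows. This is an illustrative reading of the description above, not code from patent document 3; `PowerStateRetention`, `handle_non_response`, and the returned strings are hypothetical, and resetting the flag unconditionally after the judgment is an assumption.

```python
class PowerStateRetention:
    """Stores whether power was reapplied after an instantaneous outage."""

    def __init__(self) -> None:
        self.power_reapplied = False

    def on_power_reapplied(self) -> None:
        # Set by the apparatus's own input-power monitoring.
        self.power_reapplied = True

    def reset(self) -> None:
        self.power_reapplied = False


def handle_non_response(peer: PowerStateRetention) -> str:
    """Run by one apparatus when the other apparatus stops responding."""
    if peer.power_reapplied:
        cause = "restart after instantaneous power outage"
    else:
        cause = "other cause"
    # Assumption: the peer's retained state is reset once it has been judged.
    peer.reset()
    return cause
```

Here the judgment is again triggered by an observable symptom, namely the non-response of the other apparatus.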
Patent document 3, like patent document 2, also does not disclose a technique for identifying the failed CPU when the processing results of a plurality of CPUs do not match while no fault has been detected for any of the CPUs, as described above.
Additionally, patent document 4 discloses a computer system comprising circuits that make up the computer system and a fault management system for detecting a fault state of each circuit independently and correlating each circuit with its fault state. Patent document 4, however, also does not disclose a technique for identifying the failed CPU when the processing results of a plurality of CPUs do not match while no fault has been detected for any of the CPUs, as described above.
[Patent document 1] Laid-open Japanese patent application publication No. 8-190494
[Patent document 2] Laid-open Japanese patent application publication No. Sho 57-141731
[Patent document 3] Laid-open Japanese patent application publication No. 3-266131
[Patent document 4] Laid-open Japanese patent application publication No. 10-143387 (U.S. Pat. No. 6,000,040)