The present invention relates to a timeout preventing device, a timeout preventing method and a program thereof. More particularly, it relates to a method for preventing a timeout which may occur during reinitialization of a CPU (Central Processing Unit).
Generally, as a technique for enhancing the reliability of a calculation result, it is known to have a plurality of modules, each having an arithmetic processing unit, execute the same arithmetic processing and to use the resulting calculation results.
As one example of such a technique, Patent Document 1 discloses a technique using a fault-tolerant computer system. In the technique disclosed in Patent Document 1, the fault-tolerant computer system is configured to include a plurality of computing modules 100 and 200. The computing modules 100 and 200 process the same instruction sequence in synchronism with clocks, and the processing results of the respective computing modules are compared with each other so that processing can continue on the other computing module even if one computing module fails.
Also, as one of the techniques using a plurality of computation results, a technique called CLL (core level lockstep) is known.
The CLL is a technique in which two cores execute the same processing and compare the calculation results with each other to enhance the reliability. A state in which two cores execute the same processing and compare the calculation results with each other is called “lockstep mode”. When the calculation results do not correspond with each other during the lockstep mode, a failure of the CPU is indicated as a non-recoverable failure. A failure other than such a mismatch of the calculation results may also occur in each core during the lockstep mode. When a recoverable failure occurs in a core, the lockstep mode is temporarily canceled (this cancellation is also called “LoL (loss of lockstep)”), resulting in “non-lockstep mode”, which is a state in which the calculation results cannot be compared with each other.
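The distinction drawn above between the two kinds of failure can be illustrated by the following minimal sketch. It is not taken from any CPU or BIOS implementation; the names `Outcome` and `classify_step` are hypothetical, and it assumes that a mismatch of the calculation results is the non-recoverable case and that a recoverable core failure takes priority and leads to LoL.

```python
from enum import Enum


class Outcome(Enum):
    OK = "results match, lockstep mode continues"
    NON_RECOVERABLE = "results mismatch, non-recoverable failure"
    LOSS_OF_LOCKSTEP = "recoverable core failure, non-lockstep mode"


def classify_step(result_a, result_b, recoverable_core_failure=False):
    """Classify one compared step of a core pair (hypothetical helper).

    Assumption: a recoverable failure reported by a core is checked first,
    because the results can no longer be compared once lockstep is lost.
    """
    if recoverable_core_failure:
        return Outcome.LOSS_OF_LOCKSTEP
    if result_a != result_b:
        return Outcome.NON_RECOVERABLE
    return Outcome.OK


if __name__ == "__main__":
    print(classify_step(42, 42).name)
    print(classify_step(42, 41).name)
    print(classify_step(42, 42, recoverable_core_failure=True).name)
```

Only the third case leads to the non-lockstep mode, from which the lockstep mode has to be re-entered.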
When the state enters the non-lockstep mode, a BIOS has to bring the processor back into the lockstep mode. This operation is called “LoL recovery”. The CPU must be reset again for the LoL recovery. During this reset, the BIOS is required to temporarily stop transactions between the CPU and IO (hereinafter, a transaction is represented as “Txn” where appropriate, and a transaction between the CPU and the IO is represented as “IO Txn”). This is to prevent an IO interrupt directed to a processor in the LoL condition from being lost when the CPU is reset again.
In this way, stopping the transactions between the CPU and the IO makes it possible to prevent the IO interrupt from being lost.
For example, Patent Document 2 discloses a technique regarding the IO interrupt from page 2, lower left column, line 7 to page 3, lower right column, line 9.
[Patent Document 1] JP-A-2004-046599
[Patent Document 2] JP-A-HEI3-102430
However, the above-mentioned technique in which transactions between the CPU and the IO are stopped suffers from the following problem.
The problem is that when the IO Txn stop time is long, a completion timeout occurs in a PCI Express Card, which may cause the system to fail. This problem will be described in detail below.
As described above, the CLL is a technique in which two cores execute the same processing and compare the calculation results with each other to enhance the reliability. When a failure occurs in a core, it is necessary to reset the CPU in order to re-enter the lockstep mode. In the general CPU reset procedure, the IO Txn is stopped in order to prevent an IO interrupt from being lost while the CPU is being reset. However, when the IO Txn stop time becomes long, a completion timeout occurs in the PCI Express Card. The PCI Express specification defines a default completion timeout of 50 ms. In most PCI Express Cards of the GEN1 generation, the completion timeout time cannot be changed. For that reason, in order to prevent the completion timeout, the Txn stop time must not exceed 50 ms.
FIG. 4 shows a general LoL recovery flow. Before the CPU is reset (B603), context saving (B602) is performed in order to save the present condition, and after the CPU has been reset, context restoration (B604) is executed in order to restore the condition that existed before the CPU was reset. If an interrupt from the IO is generated between B602 and B604, the interrupt is lost. Therefore, the IO transaction stop (B601) and the IO transaction start (B605) are conducted. The problem is that the processing time from B601 to B605 can exceed 50 ms, resulting in the default completion timeout of the PCI Express Card.
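The timing constraint on the flow of FIG. 4 can be illustrated by the following minimal sketch, which sums the time spent from the IO transaction stop (B601) to the IO transaction start (B605) and compares it with the 50 ms default cited above. The function names and, in particular, the per-step durations in `example_step_ms` are hypothetical values chosen only for illustration; they are not measurements from any actual system.

```python
COMPLETION_TIMEOUT_MS = 50.0  # default completion timeout cited in the text

# Steps of the general LoL recovery flow, in the order given for FIG. 4.
LOL_RECOVERY_STEPS = ["B601", "B602", "B603", "B604", "B605"]


def io_stop_window_ms(step_ms):
    """Total time during which IO Txn is stopped (B601 through B605)."""
    return sum(step_ms[step] for step in LOL_RECOVERY_STEPS)


def exceeds_completion_timeout(step_ms, limit_ms=COMPLETION_TIMEOUT_MS):
    """True if the IO Txn stop window is longer than the timeout limit."""
    return io_stop_window_ms(step_ms) > limit_ms


# Hypothetical per-step durations in milliseconds (illustrative only).
example_step_ms = {
    "B601": 1.0,   # IO transaction stop
    "B602": 10.0,  # context saving
    "B603": 35.0,  # CPU reset
    "B604": 10.0,  # context restoration
    "B605": 1.0,   # IO transaction start
}

if __name__ == "__main__":
    window = io_stop_window_ms(example_step_ms)
    print(f"IO Txn stop window: {window} ms")
    print("completion timeout expected:", exceeds_completion_timeout(example_step_ms))
```

With these assumed durations the stop window is 57 ms, which is longer than the 50 ms limit, so a card whose completion timeout cannot be changed would time out.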
Therefore, an object of the present invention is to provide a timeout preventing device, a timeout preventing method and a timeout preventing program which are capable of preventing the timeout that may occur when the CPU is reset again for reinitialization.