The present invention relates to a system, a method, and a program for recovering from an error occurring due to the influence of, for example, cosmic rays in a computer system.
In computer systems, a fault called a transient fault is known. A transient fault is a malfunction of a circuit that temporarily occurs due to the influence of, for example, cosmic rays. As the packing density of transistors increases, the probability of occurrence of a transient fault also increases. Thus, a processor is also required to include a mechanism for detecting and recovering from a transient fault. This requirement is especially strong in computer systems used for mission-critical purposes and in computer systems exposed to a high level of cosmic rays, such as aircraft or spacecraft control systems.
In this regard, Japanese Unexamined Patent Application Publication No. 55-125598 discloses a technique for, in a redundant system in which the same program is caused to run on two processors, when one of the processors has detected an error in the processor's main memory, performing a recovery operation by reading a correct value from the other processor's main memory.
Moreover, Japanese Unexamined Patent Application Publication No. 3-269745 discloses a technique for, in a system in which two processors are put in an operating state and a wait state, respectively, and the respective contents of main memories of the processors are always equalized with each other by memory equalizing means, when memory diagnosis means of one of the processors has detected an error in the processor's main memory, performing a recovery operation by reading a correct value from the other processor's main memory.
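The memory-level recovery common to the two publications above can be illustrated with a minimal sketch. It is not taken from either publication: the class `MirroredMemory`, the function `read_with_recovery`, and the use of a single parity bit as the error detector are all assumptions made for illustration, since the publications' actual detection means are not described here.

```python
from typing import List


def parity(value: int) -> int:
    """Return the parity (0 or 1) of the number of set bits in value."""
    return bin(value).count("1") % 2


class MirroredMemory:
    """Hypothetical main memory of one processor, with one parity bit per cell."""

    def __init__(self, size: int) -> None:
        self.cells: List[int] = [0] * size
        self.parity_bits: List[int] = [0] * size

    def write(self, addr: int, value: int) -> None:
        self.cells[addr] = value
        self.parity_bits[addr] = parity(value)

    def is_corrupt(self, addr: int) -> bool:
        # Assumed detector: a stored parity bit that no longer matches the cell.
        return parity(self.cells[addr]) != self.parity_bits[addr]


def read_with_recovery(own: MirroredMemory, peer: MirroredMemory, addr: int) -> int:
    """Read a cell; on a detected error, restore it from the peer's memory."""
    if own.is_corrupt(addr):
        if peer.is_corrupt(addr):
            raise RuntimeError("both copies are corrupt; no correct value available")
        own.write(addr, peer.cells[addr])  # recovery: copy the correct value
    return own.cells[addr]
```

A check of this kind covers only values already stored in memory. A value that a processor computed incorrectly is written with valid parity to both memories, so it passes undetected.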
However, in these known techniques, since the respective calculation results of the processors are not compared while a program is running, a transient fault having occurred in the processors cannot be detected, and thus recovery from the transient fault cannot be made.
On the other hand, Cheng Wang, Ho-seop Kim, Youfeng Wu, and Victor Ying, “Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection”, in the Proceedings of the International Symposium on Code Generation and Optimization, pp. 244-258 (2007), discloses a mechanism for detecting a transient fault in the manner described below. In this mechanism, source code is compiled into two versions, and the two versions are executed as threads on respective CPU cores. For convenience, the two threads are called a leading thread and a trailing thread. The leading thread and the trailing thread redundantly perform the same calculation and detect a transient fault by comparing their values whenever a read operation or a write operation is performed on a shared memory. However, in this method, even when a transient fault is detected, recovery from the transient fault cannot be made. This is because no means exists for restoring the state that existed before the mismatched calculations were performed, and thus, when a mismatch between the calculation results is detected, the program is forcibly terminated in its current state.
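The detection scheme described above, in which a leading thread and a trailing thread compare values at shared-memory writes, can be illustrated with a simplified sketch. It is not the cited authors' implementation: the two versions run one after the other rather than concurrently on separate cores, and the names `run_version`, `checked_commit`, `TransientFaultDetected`, and the `flip_at` fault injector are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


class TransientFaultDetected(Exception):
    """Raised when the two redundant executions disagree on a store."""


def run_version(inputs: List[int], flip_at: Optional[int] = None) -> List[Tuple[int, int]]:
    """Return the (address, value) stores that one compiled version would issue.

    flip_at is a hypothetical fault injector: it flips the lowest bit of the
    value stored at that index, imitating a transient fault.
    """
    stores: List[Tuple[int, int]] = []
    acc = 0
    for i, x in enumerate(inputs):
        acc += x * x  # the redundant calculation: a running sum of squares
        value = acc ^ 1 if i == flip_at else acc
        stores.append((i, value))
    return stores


def checked_commit(
    leading: List[Tuple[int, int]],
    trailing: List[Tuple[int, int]],
    shared: Dict[int, int],
) -> None:
    """Commit each store to shared memory only if both versions agree on it."""
    for lead_store, trail_store in zip(leading, trailing):
        if lead_store != trail_store:
            # Detection only: earlier stores stay committed, nothing is rolled back.
            raise TransientFaultDetected(
                f"leading {lead_store} != trailing {trail_store}"
            )
        addr, value = lead_store
        shared[addr] = value
```

In this sketch a mismatch can only raise an exception. Stores committed before the mismatch remain in `shared`, and no earlier state is kept to return to, which corresponds to the limitation noted above.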