When a two-bit error occurs on a bus of a processor board in data read from a memory while a computer device is in operation, the method described below has been used to deal with the error.
A two-bit error cannot be corrected; that is, it is an uncorrectable failure. Therefore, when a two-bit error occurs, or when data in a cache left indeterminate by a two-bit error is referenced, the error is detected, the computer device is stopped, the failed processor board is automatically disconnected, and operation is then restarted using another processor board. After that, the computer device is stopped again and the failed processor board is replaced.
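To illustrate why a two-bit error can be detected but not corrected, the following is a minimal sketch that assumes a single-error-correcting, double-error-detecting (SECDED) code, here an extended Hamming(8,4) code. The source does not state which error-correcting code the memory or bus uses; the code choice and the names `encode` and `decode` are assumptions made only for illustration.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1..7 hold a Hamming(7,4)
    codeword with check bits at positions 1, 2 and 4.
    """
    code = [0] * 8
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the parity of all 8 bits even
    return code


def decode(code):
    """Return (status, data); data is None when the error is uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = sum(code) % 2

    if overall == 1:
        # Odd overall parity: one bit flipped; the syndrome is its position
        # (0 means the overall parity bit itself).
        code = code.copy()
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even overall parity but non-zero syndrome: two bits flipped.
        # The error is detected, but its location cannot be determined.
        return "uncorrectable", None
    else:
        status = "ok"

    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, data
```

Under this assumed code, a single flipped bit yields odd overall parity and a syndrome that points at the bad bit, whereas two flipped bits yield even parity with a non-zero syndrome, so the decoder can only report the failure.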
For this reason, when an uncorrectable failure occurs on the processor board, the computer device must detect the failure reliably and restart operation in a short period of time.
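The conventional sequence described above can be sketched as a small model. This is only an illustration under stated assumptions: the class and method names (`Board`, `Device`, `handle_uncorrectable_error`, `replace_failed_board`) are hypothetical, and the restart after board replacement is assumed rather than stated in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Board:
    name: str
    connected: bool = True


@dataclass
class Device:
    active: Board
    spare: Board
    running: bool = True
    log: list = field(default_factory=list)

    def handle_uncorrectable_error(self):
        """Stop, disconnect the failed board, restart on another board."""
        failed = self.active
        self.running = False
        self.log.append("stop")
        failed.connected = False
        self.log.append(f"disconnect {failed.name}")
        self.active = self.spare
        self.running = True
        self.log.append(f"restart on {self.active.name}")
        return failed

    def replace_failed_board(self, failed, new_board):
        """Stop the device a second time to swap out the failed board."""
        self.running = False
        self.log.append("stop")
        self.spare = new_board
        self.log.append(f"replace {failed.name} with {new_board.name}")
        self.running = True  # assumption: operation resumes after replacement
        self.log.append("restart")
```

The model makes the cost of the conventional method visible: the log contains two "stop" entries, one for the failure itself and one for the later replacement.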
However, as computer devices have become widely used and play an important role as social infrastructure, improving their continuous operation performance and fault tolerance has become important. A new method is therefore required that allows the computer device to continue operating, and the failed processor board to be replaced, without stopping the computer device even when an uncorrectable failure occurs.
For example, Japanese Patent Application Laid-Open No. 2003-256396 discloses a technology that allows a computer device to continue operating, without being stopped, when a failure occurs in the computer device.
In the computer device described in Japanese Patent Application Laid-Open No. 2003-256396, when a correctable failure such as an intermittent failure occurs, the data stored in the memory and the internal information of the processor on the processor board in which the failure occurred are copied to the memory and the processor on a replacement processor board. Operation is then switched from the processor board in which the failure occurred to the replacement processor board. As a result, operation continues without stopping the computer device.
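The copy-and-switch behavior described above can be sketched as follows. This is a simplified model, not the mechanism of the cited publication: memory contents and processor internal information are represented as plain dictionaries, and the names `ProcessorBoard` and `switch_over` are hypothetical.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ProcessorBoard:
    name: str
    memory: dict = field(default_factory=dict)     # address -> value
    cpu_state: dict = field(default_factory=dict)  # register name -> value


def switch_over(failing, replacement):
    """Copy memory and processor state to the replacement board, then
    return the replacement as the board that continues operation."""
    # This is possible only while the data is still readable, i.e. for a
    # correctable failure such as an intermittent one.
    replacement.memory = copy.deepcopy(failing.memory)
    replacement.cpu_state = copy.deepcopy(failing.cpu_state)
    return replacement
```

Because the copy relies on reading valid data from the failing board, this approach applies to correctable failures; it does not by itself address the uncorrectable two-bit error discussed earlier.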