First, a general CPU (Central Processing Unit) synchronization operation is explained. A computer system that requires high reliability adopts a scheme in which the CPUs are duplexed and operate in synchronization with each other so that, when a failure occurs in one CPU, processing continues in the other CPU.
When the CPUs operate in synchronization with each other, a synchronization shift of the CPUs sometimes occurs because of a failure of the CPUs or a failure of a device external to the CPUs. A synchronization shift of the CPUs means that the commands issued by the CPUs performing the synchronization operation differ from each other. For example, the addresses of read commands issued by the CPUs differ between the CPUs performing the synchronization operation, or the timings at which the read commands are issued shift by one clock. These cases correspond to the synchronization shift of the CPUs.
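The definition above can be illustrated with a minimal sketch that compares the command streams of two CPUs performing the synchronization operation. The `Command` record, its fields, and the function `find_sync_shift` are hypothetical names introduced only for illustration; an actual CPU controller performs this comparison in hardware.

```python
from typing import NamedTuple, Optional, Sequence


class Command(NamedTuple):
    clock: int    # clock cycle at which the command is issued
    kind: str     # e.g. "read"
    address: int  # address the command targets


def find_sync_shift(a: Sequence[Command], b: Sequence[Command]) -> Optional[int]:
    """Return the index of the first differing command, or None if the streams match."""
    for i, (cmd_a, cmd_b) in enumerate(zip(a, b)):
        if cmd_a != cmd_b:
            return i
    if len(a) != len(b):
        # One CPU issued a command that the other did not issue at all.
        return min(len(a), len(b))
    return None


# Hypothetical command streams: same kind, differing in address or in timing.
cpu_a = [Command(0, "read", 0x1000), Command(4, "read", 0x1008)]
addr_shift = [Command(0, "read", 0x1000), Command(4, "read", 0x2008)]
clock_shift = [Command(0, "read", 0x1000), Command(5, "read", 0x1008)]
```

Under this model, both a differing read address and a one-clock difference in issue timing are reported as a shift at the second command.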
Two cases, a case 1 and a case 2 described below, are conceivable in which the CPUs included in the computer system detect errors and the synchronization shift of the CPUs occurs.
Case 1: One CPU of the CPUs performing the synchronization operation detects an error
Case 2: Both of the CPUs performing the synchronization operation simultaneously detect errors
In general, the following operations are desired for the case 1 and the case 2. In the case 1, the computer system degenerates the CPU that detects the error. Alternatively, until the number of error occurrences reaches a predetermined number, the computer system synchronizes the CPUs again and continues the synchronization operation. In the case 2, the computer system records the error content and stops. However, when the error content is recoverable, the computer system recovers from the errors and then continues operating in the CPU synchronization operation.
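The desired operations for the case 1 and the case 2 can be summarized as a small decision function. This is a sketch under stated assumptions: the names `Action` and `decide` and the parameters `error_count` and `threshold` are hypothetical, and the two alternatives given for the case 1 are combined here into a single threshold rule.

```python
from enum import Enum


class Action(Enum):
    RESYNC_AND_CONTINUE = 1   # case 1, error count still below the threshold
    DEGENERATE_CPU = 2        # case 1, threshold reached
    RECOVER_AND_CONTINUE = 3  # case 2, error content is recoverable
    LOG_AND_STOP = 4          # case 2, error content is not recoverable


def decide(both_cpus_erred: bool, error_count: int, threshold: int,
           recoverable: bool) -> Action:
    """Pick the desired operation for case 1 (one CPU) or case 2 (both CPUs)."""
    if not both_cpus_erred:
        # Case 1: re-synchronize until the error count reaches the threshold.
        if error_count < threshold:
            return Action.RESYNC_AND_CONTINUE
        return Action.DEGENERATE_CPU
    # Case 2: continue only when the error content is recoverable.
    if recoverable:
        return Action.RECOVER_AND_CONTINUE
    return Action.LOG_AND_STOP
```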
Next, a general operation of the CPU controller performed when a synchronization shift occurs in the CPUs is explained for a computer system that adopts a configuration including a CPU controller connected to the CPUs. The CPU controller is, for example, a chip set called a north bridge.
When the CPU synchronization operation is performed, the CPU controller generally controls, for example, detection of a synchronization shift and degeneration of the CPU on the side on which an error occurs. However, in some CPUs of an INTEL (registered trademark) architecture, for example, the CPUs can be degenerated only in units of the CPUs connected to the CPU controller by a common bus. Specifically, this applies when the computer system adopts a configuration in which plural CPUs are connected on the common bus, a configuration in which plural CPU cores are mounted in one CPU, or a configuration in which plural logical CPUs operate in one CPU core. In these configurations, when a synchronization shift occurs in a certain one CPU, the CPU controller degenerates all the CPUs connected by the common bus to the CPU in which the synchronization shift occurs. In that case, all the CPUs connected on the common bus to the CPU in which the synchronization shift occurs need to execute re-synchronization processing.
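The bus-unit granularity of degeneration can be sketched as follows. The mapping `bus_of`, the CPU and bus names, and the function `cpus_to_degenerate` are hypothetical and serve only to illustrate that every CPU sharing a bus with the shifted CPU is affected.

```python
from typing import Dict, Set


def cpus_to_degenerate(bus_of: Dict[str, str], shifted_cpu: str) -> Set[str]:
    """Return every CPU on the same common bus as the CPU with the shift."""
    bus = bus_of[shifted_cpu]
    return {cpu for cpu, cpu_bus in bus_of.items() if cpu_bus == bus}


# Hypothetical topology: two CPUs on each of two common buses.
topology = {"cpu0": "busA", "cpu1": "busA", "cpu2": "busB", "cpu3": "busB"}
affected = cpus_to_degenerate(topology, "cpu2")
```

With this topology, a shift in `cpu2` also takes `cpu3` out of operation, although `cpu3` itself did not cause the shift.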
A case in which the re-synchronization processing for the CPUs is performed in a computer system that uses a general-purpose OS (Operating System) such as Windows (registered trademark) is explained. For a general-purpose OS such as Windows, the manufacturer of the OS and the manufacturer of the computer system are different. Since the general-purpose OS has to operate in computer systems of various specifications, it is difficult for the general-purpose OS to perform processing specialized for the specifications of a particular computer system. Therefore, when processing for coping with the synchronization shift of the CPUs is performed under the general-purpose OS, the processing must be performed without the support of the OS and in a manner that does not affect the normal operation of the OS. Specifically, it is necessary to perform the processing explained in (1) and (2) below.
(1) When a synchronization shift of the CPUs is detected, an interrupt for passing control to processing by firmware such as a BIOS (Basic Input/Output System) of the computer system (hereinafter referred to as a synchronization shift interrupt) is issued and the re-synchronization processing is performed in the processing by the firmware such as the BIOS.
(2) To prevent the processing of the OS from being affected, the re-synchronization processing is performed in a minimum time, for example, by masking interrupts during the re-synchronization processing.
A general operation of the computer system performed when a fault occurs in a device other than the CPUs is explained. When a fault occurs in a device other than the CPUs, both the CPUs performing the synchronization operation sometimes detect errors simultaneously. For example, a hardware fault occurs in an IO (Input Output) device such as a video controller or a LAN controller while the CPUs are reading from the IO device. When the hardware fault occurs in the IO device, an IO controller detects the fault and returns data for informing the CPUs of the occurrence of the fault (e.g., data generally called Poison data). As a result, both the CPUs performing the synchronization operation simultaneously read the data for informing of the occurrence of the fault and detect errors.
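The following minimal sketch illustrates why a fault in an IO device leads both CPUs to detect the same error at the same time. The marker `POISON` and the functions `io_read` and `lockstep_read` are hypothetical names; the real encoding of Poison data depends on the bus and the IO controller.

```python
from typing import Dict

POISON = object()  # hypothetical marker standing in for Poison data


def io_read(device_faulty: bool, data: int):
    """Model the IO controller: return Poison data when the device has a fault."""
    return POISON if device_faulty else data


def lockstep_read(device_faulty: bool, data: int) -> Dict[str, bool]:
    """Both CPUs receive the same response, so both detect (or do not detect) an error."""
    response = io_read(device_faulty, data)
    return {cpu: response is POISON for cpu in ("cpu_a", "cpu_b")}
```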
When the CPUs performing the synchronization operation simultaneously detect errors caused by a fault of a device other than the CPUs, the CPUs can often continue the synchronization operation. This is because the CPUs performing the synchronization operation detect errors of exactly the same content, so the error processing they perform is also often exactly the same. On the other hand, when both the CPUs performing the synchronization operation simultaneously detect errors, the CPU controller cannot determine whether the errors are caused by the CPUs themselves or by a device other than the CPUs, nor whether the CPUs can perform fault analysis processing while continuing the synchronization operation. Therefore, the CPU controller needs to determine in advance whether, when both the CPUs simultaneously detect errors, the CPU controller processes the errors as a synchronization shift or processes the errors while the CPUs continue the synchronization operation. However, when the CPU controller is set in advance to process the errors while the CPUs continue the synchronization operation and a synchronization shift occurs thereafter, it is difficult to determine whether the synchronization shift occurred during the fault analysis processing or after its completion, and whether the cause of the synchronization shift is the original CPU error or a new cause. When a synchronization shift is caused by the original CPU error during the fault analysis processing, the CPU controller only has to treat the situation as one error occurrence. However, when a synchronization shift is caused by another cause after the completion of the fault analysis processing, the CPU controller needs to treat the situation as two error occurrences.
In other words, when the CPU controller determines that the CPU controller processes the errors while the CPUs keep performing the synchronization operation, it is difficult to correctly perform processing when another error occurs after that. Therefore, when both the CPUs performing the synchronization operation simultaneously detect errors, in general, the CPU controller processes the errors as a synchronization shift and the firmware such as the BIOS determines, for example, necessity of re-synchronization.
As an example, the operations of a computer system are explained in which the CPU controller processes errors as a synchronization shift when both CPUs performing a synchronization operation simultaneously detect the errors.
Firmware such as a BIOS needs to perform operations described in (a) and (b) below as fault analysis processing performed when the CPUs performing the synchronization operation simultaneously detect errors.
(a) The firmware applies the fault analysis processing to the errors detected by the CPUs.
(b) When the errors are recoverable, after recovering the errors, the firmware re-synchronizes the CPUs and continues the synchronization operation.
Processing for realizing the operations described in (a) and (b) above and performed when the CPUs performing the synchronization operation simultaneously detect errors is explained with reference to FIGS. 23 to 26.
FIGS. 23 and 24 are examples of a flow of operation processing performed when the CPUs detect errors. First, (hardware of) the CPUs detect errors (step S1 in FIG. 23). The CPUs set an interrupt mask and start error processing by firmware (e.g., a BIOS) (step S2). Subsequently, the CPUs investigate a cause of the errors according to an instruction of the firmware and log the errors (step S3). Subsequently, the CPUs determine, according to an instruction of the firmware, whether a fault is recoverable (step S4). When the CPUs determine in step S4 that the fault is recoverable, the CPUs execute fault recovery processing according to an instruction of the firmware (step S10) and proceed to step S12 in FIG. 24. When the CPUs determine that the fault is not recoverable, the CPUs call error processing of an OS according to an instruction of the firmware (step S5 in FIG. 23). The error processing of the OS is error processing instructed by the OS. The CPUs start the error processing of the OS (step S6).
Subsequently, the CPUs issue an interrupt in the error processing of the OS to another CPU connected by a common bus (step S7 in FIG. 24). The interrupt in the error processing of the OS is an interrupt for causing the other CPU to execute the error processing of the OS. In step S7, the CPUs further execute the error processing in synchronization with the other CPU.
Subsequently, the OS determines whether the fault is recoverable (step S8). When the OS determines that the fault is unrecoverable, the OS stops the system (step S9). When the OS determines that the fault is recoverable, the OS performs the fault recovery processing and performs processing for return to the error processing of the firmware (step S11). Subsequently, the CPUs return from the error processing according to an instruction of the firmware (step S12). The CPUs release the interrupt mask (step S13) and return to normal processing (step S14).
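The branching of steps S1 to S14 in FIGS. 23 and 24 can be traced with a short sketch. The function `error_flow` and its two boolean parameters are hypothetical; they stand in for the recoverability decisions made by the firmware in step S4 and by the OS in step S8.

```python
from typing import List


def error_flow(fw_recoverable: bool, os_recoverable: bool = False) -> List[str]:
    """Return the sequence of steps executed for one detected error."""
    steps = ["S1", "S2", "S3", "S4"]  # detect, mask, investigate and log, decide
    if fw_recoverable:
        steps.append("S10")           # fault recovery by the firmware
    else:
        steps += ["S5", "S6", "S7", "S8"]  # hand over to OS error processing
        if not os_recoverable:
            steps.append("S9")        # the OS stops the system
            return steps
        steps.append("S11")           # recovery by the OS, return to the firmware
    steps += ["S12", "S13", "S14"]    # return, release the mask, normal processing
    return steps
```

For example, `error_flow(True)` skips the OS error processing entirely, while `error_flow(False, False)` ends at step S9 without ever releasing the interrupt mask.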
FIG. 25 is an example of a flow of operation processing performed when the CPU controller issues a re-synchronization interrupt. First, both the CPUs performing the synchronization operation simultaneously detect errors (step S21). Subsequently, both the CPUs simultaneously perform error notification to the CPU controller (step S22). Subsequently, the CPU controller degenerates one CPU (the CPU on one side) and issues a re-synchronization interrupt to the CPU not degenerated and another CPU connected to a bus common to the CPU (step S23). The re-synchronization interrupt is an interrupt for notifying that a synchronization shift occurs. Subsequently, the CPUs set an interrupt mask and start re-synchronization processing of the firmware, i.e., re-synchronization processing conforming to an instruction of the firmware (step S24). The CPUs determine, according to an instruction of the firmware, whether all CPUs that perform the re-synchronization processing are ready (step S25). When the CPUs determine that CPUs that perform the re-synchronization processing are not ready, the CPUs return to step S25. When the CPUs determine that all CPUs that perform the re-synchronization processing are ready, the CPUs execute the re-synchronization processing according to an instruction of the firmware (step S26). Subsequently, the CPUs return from the re-synchronization processing according to an instruction of the firmware (step S27). Subsequently, the CPUs release the interrupt mask (step S28) and return to the normal processing (the CPU synchronization operation) (step S29).
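The waiting loop of step S25 in FIG. 25 can be sketched as a simple polling simulation. The function `resync_flow` and the mapping `ready_at`, which gives the hypothetical tick at which each CPU becomes ready, are assumptions made for illustration; actual firmware would poll a hardware or shared-memory flag instead.

```python
from typing import Dict, List


def resync_flow(ready_at: Dict[str, int]) -> List[str]:
    """Trace steps S24 to S29, repeating S25 until every CPU is ready."""
    trace = ["S24"]                  # set the interrupt mask, start re-synchronization
    tick = 0
    while any(tick < t for t in ready_at.values()):
        trace.append("S25")          # some CPU is not ready yet: check again
        tick += 1
    trace.append("S25")              # final check: all CPUs are ready
    trace += ["S26", "S27", "S28", "S29"]  # re-synchronize, return, unmask, resume
    return trace
```

Because the interrupt mask set in step S24 is held for the whole loop, the time spent in step S25 depends on the slowest CPU taking part in the re-synchronization processing.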
FIG. 26 is an example of a flow of operation processing of the CPU that receives the error processing interrupt of the OS issued in step S7 in FIG. 24. When the CPU receives the error processing interrupt of the OS (step S31), the CPU sets an interrupt mask and starts error processing of the OS (step S32). In other words, the CPU executes the error processing of the OS according to an instruction of the OS (step S33). When the CPU returns from the error processing of the OS (step S34), the CPU releases the interrupt mask (step S35) and returns to the normal processing (step S36).
FIGS. 27 to 34 are diagrams illustrating detailed examples of processing performed when the CPUs performing the synchronization operation simultaneously detect errors. FIG. 27 illustrates an example of a state in which the CPUs are performing the synchronization operation and do not detect errors. In FIG. 27, a CPU 100 and a CPU 102 are performing the synchronization operation. A CPU 101 and a CPU 103 are performing the synchronization operation. The CPU 100 and the CPU 101 are connected to a CPU controller 104 via a common bus 105. The CPU 102 and the CPU 103 are connected to the CPU controller 104 via a common bus 106. The respective CPUs are executing the normal processing (the normal processing of the OS) (see #1 to #4 in FIG. 27).
FIG. 28 illustrates an example of a state in which both the CPUs performing the synchronization operation detect errors. In this example, the CPU 100 and the CPU 102 simultaneously detect errors (see #5 and #6 in FIG. 28). The CPU 100 and the CPU 102 that detect errors notify the CPU controller 104 that the errors are detected (see #7 in FIG. 28). Subsequently, the CPU 100 and the CPU 102 start error processing of the firmware (see #8 and #9 in FIG. 28).
FIG. 29 illustrates an example of a state in which the CPU controller degenerates one of the CPUs that detect errors and a CPU connected to a bus common to that CPU. In this example, the CPU controller 104 degenerates the CPU 102 and the CPU 103 connected to the CPU 102 by the bus 106 (see #10 in FIG. 29). When timings of the error detection are different, the CPU controller 104 may degenerate the CPU that detects the error earlier. Subsequently, the CPU controller 104 issues a re-synchronization interrupt to the CPU 100 and the CPU 101 that are CPUs not degenerated (see #11 in FIG. 29).
FIG. 30 illustrates an example of a state in which the CPUs not degenerated receive a re-synchronization interrupt. In this example, the CPU 100 that detects an error and the CPU 101 that does not detect an error execute the operations explained below. The CPU 100 continues executing the error processing of the firmware, and the re-synchronization interrupt is held pending (see #12 in FIG. 30). The CPU 101 sets an interrupt mask and then starts the re-synchronization processing of the CPU. The re-synchronization processing of the CPU needs to be completed in a short time. Therefore, the CPU 101 executes the re-synchronization processing of the CPU while keeping the interrupt mask state (see #13 in FIG. 30).
FIG. 31 illustrates an example of a state in which the CPU 100 that executes the error processing of the OS issues an error processing interrupt of the OS to the CPU 101. The CPU 100 executes the error processing of the OS (see #14 in FIG. 31) and issues an error processing interrupt of the OS to the CPU 101 (see #15 in FIG. 31). On the other hand, the CPU 101 executes the re-synchronization processing regardless of the fact that the CPU 101 receives the error processing interrupt of the OS from the CPU 100. In other words, since the CPU 101 is in the interrupt mask state, the CPU 101 cannot execute the error processing of the OS (see #16 in FIG. 31).
FIG. 32 illustrates an example of a state in which the CPU 100 performs the recovery processing for an error. The CPU 100 executes the error processing of the OS. When fault recovery is possible, fault recovery processing is performed (see #17 in FIG. 32).
FIG. 33 illustrates an example of a state in which the CPU 100 returns from the error processing. The CPU 100 returns from the error processing to the normal processing, receives the pending re-synchronization interrupt, and starts the re-synchronization processing (see #18 in FIG. 33).
FIG. 34 illustrates an example of a state after the CPU 100 and the CPU 101 complete the re-synchronization processing. When the CPU 100 and the CPU 101 complete the re-synchronization processing, the CPU 100 and the CPU 101 return to the normal processing of the OS (see #19 and #20 in FIG. 34). The CPU 100 and the CPU 101 respectively operate in synchronization with the CPU 102 and the CPU 103. As a result, the CPU 102 and the CPU 103 execute the normal processing (see #21 and #22 in FIG. 34).
The processing explained above with reference to FIGS. 27 to 34, which is performed when the CPUs performing the synchronization operation simultaneously detect errors, has the problem explained below. In a computer system that uses a general-purpose OS such as Windows, when error processing is performed, not only the CPU that detects an error but also the CPU that does not detect an error needs to perform the error processing. For example, when memory dump information at the time of error occurrence is acquired in the error processing, an interrupt is issued to all CPUs of the computer system, cache information is copied to a memory, and the memory information is then stored in a hard disk or the like so that it can be used later, for example, to investigate the cause of the error.
However, as explained above with reference to FIG. 31, since the CPU 101, which is the CPU that does not detect an error, is in the interrupt mask state, the CPU 101 cannot execute the error processing of the OS regardless of the fact that the CPU 101 receives the error processing interrupt of the OS from the CPU 100. Therefore, in the processing explained above with reference to FIGS. 27 to 34, a problem occurs in that the CPU that does not detect an error cannot receive the error processing interrupt of the OS issued in the error processing of the general-purpose OS such as Windows and cannot execute the error processing.
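The problem can be reproduced with a toy model of interrupt masking for the CPU 100 and the CPU 101. The `Cpu` class, its pending queue, and the interrupt names `"resync"` and `"os_error"` are hypothetical simplifications. In particular, queuing the OS error interrupt on the masked CPU 101 is a modeling assumption; the description above states only that the CPU 101 cannot execute the error processing of the OS.

```python
from collections import deque


class Cpu:
    def __init__(self, name: str) -> None:
        self.name = name
        self.masked = False
        self.pending = deque()  # interrupts held while the mask is set
        self.handled = []       # interrupts actually processed

    def deliver(self, interrupt: str) -> None:
        if self.masked:
            self.pending.append(interrupt)
        else:
            self.handled.append(interrupt)

    def unmask(self) -> None:
        self.masked = False
        while self.pending:
            self.handled.append(self.pending.popleft())


cpu100, cpu101 = Cpu("CPU 100"), Cpu("CPU 101")
cpu100.masked = True              # firmware error processing has started (FIG. 28)
for cpu in (cpu100, cpu101):      # re-synchronization interrupt (FIG. 29)
    cpu.deliver("resync")
cpu101.masked = True              # re-synchronization runs with the mask set (FIG. 30)
cpu101.deliver("os_error")        # issued by the CPU 100 (FIG. 31)
os_error_ran_on_101 = "os_error" in cpu101.handled
cpu100.unmask()                   # return from the error processing (FIG. 33)
```

In this model, `os_error_ran_on_101` stays false for as long as the CPU 101 keeps its interrupt mask, whereas the CPU 100 processes the pending re-synchronization interrupt as soon as it releases its own mask.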
There is proposed an FT (Fault Tolerant) computer system that performs duplexing control for an I/O device without altering the existing OS or I/O device driver.
Patent Document 1: Japanese Patent Application Laid-Open Publication No. 2006-172220