1. Field of the Invention
This invention relates to a multi-CPU computer having a plurality of CPUs installed therein for operating on a common operating system, and a method of restarting a system for the multi-CPU computer, and more particularly to a multi-CPU computer for performing an emergency shut-down of the system when a hardware error has occurred, and a method of restarting the system for the multi-CPU computer.
2. Description of the Related Art
Recently, UNIX® servers and IA servers (server machines having a microprocessor available from Intel Corporation installed therein) have been introduced into main systems. Many of the UNIX servers and the IA servers are configured as multi-processor servers for enhanced performance. It should be noted that for servers used in the main systems, much importance is placed not only on the enhancement of performance but also on the enhancement of reliability. Particularly when a fatal hardware error has occurred, it is necessary to prevent the system from running out of control and data from being destroyed. To this end, when a fatal hardware error has occurred, the system is subjected to an emergency shut-down.
In the case of the main system, however, it is not permitted to stop the system for a long time period. Therefore, in the multi-processor server, even when the system is subjected to an emergency shut-down due to a hardware error, it is a critical requirement to restart the system quickly and automatically by isolating only a portion of the system where the error has occurred. Hardware errors which occur in the server include continuous occurrence of a multi-bit error in a CPU (Central Processing Unit) cache, for example.
In the prior art, when a fatal CPU error, such as the multi-bit error in the CPU cache, has occurred, the following error handling is carried out: First, the CPU in which the error has occurred sets error information in a register. Then, the CPU generates a trap to notify the operating system of the occurrence of the error.
The operating system executes trapping at the CPU in which the error has occurred. In the trapping, the operating system refers to the register of the hardware to obtain the error information.
After that, the CPU in which the error has occurred carries out panic handling. The panic handling is to subject the system to an emergency shut-down and restart the same so as to prevent the system from running out of control and data from being destroyed. In the panic handling, the CPU displays and records the error information, performs synchronization of a file system to obtain a memory dump, and then restarts the system.
FIG. 8 is a diagram showing a conventional method of restarting the system when an error has occurred. A server 900 has a CPU 910 and a CPU 920. The CPU number of the CPU 910 is “CPU #0”, and the CPU number of the CPU 920 is “CPU #1”. Processing executed by the CPU 910 and the CPU 920 includes processing executed in a hardware logic circuit 901 and processing executed according to an operating system 902.
The CPU 910 and the CPU 920 have error notification circuits 911 and 921, respectively, as processing functions executed by the hardware logic circuit 901. The error notification circuits 911 and 921 notify the operating system of information indicative of a hardware error which has occurred in the respective CPU. It should be noted that notification of error information to the operating system is intended to mean processing for passing the error information to a process of the operating system that performs error handling. More specifically, each of the CPU 910 and the CPU 920 sets error information in a predetermined register when a hardware error has occurred, and generates a trap. The error information includes an error type, a CPU number, and an address of data where the error has occurred. A process based on the operating system refers to the register storing the error information, whereby the error information is notified.
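The register-based notification described above can be sketched as follows. This is only a minimal illustration in Python, not the hardware logic circuit itself. The names `ErrorInfo`, `ErrorRegister`, `raise_hardware_error`, and `read_error_info`, as well as the example error type string and address, are assumptions introduced for illustration and do not appear in the conventional system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ErrorInfo:
    """Error information as listed in the text: type, CPU number, address."""
    error_type: str
    cpu_number: int
    address: int


class ErrorRegister:
    """Hypothetical stand-in for the predetermined register of one CPU."""

    def __init__(self) -> None:
        self.value: Optional[ErrorInfo] = None
        self.trap_pending = False


def raise_hardware_error(reg: ErrorRegister, info: ErrorInfo) -> None:
    """Hardware side: set the error information, then generate a trap."""
    reg.value = info
    reg.trap_pending = True


def read_error_info(reg: ErrorRegister) -> Optional[ErrorInfo]:
    """Operating system side: refer to the register to obtain the information."""
    if not reg.trap_pending:
        return None
    reg.trap_pending = False
    return reg.value


# Example values are assumed; CPU #0 corresponds to the CPU 910.
register_cpu0 = ErrorRegister()
raise_hardware_error(
    register_cpu0,
    ErrorInfo(error_type="cache multi-bit error", cpu_number=0, address=0x1000),
)
notified = read_error_info(register_cpu0)
```

In this sketch the operating system side receives nothing unless a trap is pending, which mirrors the order given in the text: the register is set first, and the trap then prompts the operating system to refer to it.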
Processing functions executed according to the operating system 902 include trapping functions 912 and 922, and panic handling functions 913 and 923. The trapping functions 912 and 922 are each provided for obtaining an error type, a CPU number, and an address with reference to the register storing error information. The panic handling functions 913 and 923 are each provided for displaying and recording the error information, synchronizing the file system, obtaining a memory dump, and restarting the system.
The example illustrated in FIG. 8 shows a case in which an error has occurred in the CPU 910. In this case, the error notification circuit 911 of the CPU 910 notifies the operating system 902 of error information. Then, the trapping function 912 executed by the CPU 910 according to the operating system 902 performs trapping, whereby the operating system 902 obtains the information including the error type, the CPU number, and the address. Subsequently, the panic handling function 913 displays and records the error information, synchronizes a file system, obtains a memory dump, and then restarts the system.
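The conventional sequence of FIG. 8 can be summarized in a short sketch. It is an assumption-laden illustration in Python rather than an actual operating system implementation: the functions `trapping` and `panic_handling`, the dictionary standing in for the register, and the example error values are all hypothetical. The point it shows is that every step runs on the CPU in which the error occurred.

```python
# Hypothetical register contents for the CPU 910 (CPU #0); values are assumed.
error_register = {
    "error_type": "cache multi-bit error",
    "cpu_number": 0,
    "address": 0x1000,
}


def trapping(register):
    """Obtain the error type, CPU number, and address from the register."""
    return {key: register[key] for key in ("error_type", "cpu_number", "address")}


def panic_handling(info, executing_cpu):
    """Run the panic handling steps in the order given in the text."""
    prefix = f"CPU #{executing_cpu}: "
    return [
        prefix + f"display and record error information ({info['error_type']})",
        prefix + "synchronize the file system",
        prefix + "obtain a memory dump",
        prefix + "restart the system",
    ]


# In the conventional method, the faulty CPU itself executes both functions.
info = trapping(error_register)
steps = panic_handling(info, executing_cpu=info["cpu_number"])
for line in steps:
    print(line)
```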
As described above, when the error handling (trapping and panic handling) is executed by the operating system, it is also possible to notify a high-order computer of fault information obtained by the operating system of a low-order computer (see e.g. Japanese Unexamined Patent Publication No. 04-340631).
Further, a technique is also known in which a diagnosis processor having obtained fault information from a faulty processor notifies a high-order processor of occurrence of a fault, and the high-order processor initializes and restarts the faulty processor, to thereby restore the faulty processor to an operative state (see e.g. Japanese Unexamined Patent Publication No. 02-71336).
Further, there has been proposed a technique for obtaining fault information when a fault has occurred in the multi-CPU computer, in which a plurality of processors obtain the fault information in parallel with each other to thereby shorten a time period required for obtaining the fault information. In this technique, the processor in which the occurrence of the fault is detected instructs other processors to obtain the fault information, and the instructed processors obtain the fault information (see e.g. Japanese Unexamined Patent Publication No. 11-338838).
However, in the procedure shown in FIG. 8, the trapping and the panic handling are carried out, and the system is restarted, by the CPU in which the error has occurred. When a multi-bit error has continuously occurred in the CPU cache, that CPU sometimes cannot operate normally. Even if a CPU incapable of operating normally attempts to execute the trapping and the panic handling of the system, accurate processing cannot always be performed. Therefore, there is a fear that when a hardware error has occurred, the system in operation hangs or the restart thereof fails. This can cause a serious problem, for example, that the operation of the system is stopped for a long time period.
When the system cannot carry out error handling accurately, the following problems, for example, are expected to occur.
When error information is not displayed, it is impossible to identify a CPU to be replaced for prevention of recurrence of the error. In this case, even when the system is manually restarted after occurrence of the problem, there is a possibility that the same problem is caused again by the same error.
When the synchronization of the file system cannot be executed, data existing on a file cache (memory) cannot be written on a disk. This can cause destruction of files and data.
When a memory dump fails to be obtained, it is impossible to analyze details of information based on the memory dump, e.g. as to when the error occurred, which process was being executed then, and what error occurred. For example, there is a case where the same CPU error occurs again during the panic handling in the error handling of a CPU error, which prevents accurate error information from being displayed. In this case, if the memory dump also fails to be obtained, there is no means for identifying the primary factor.
When the restart of the system fails, there can be caused a serious problem that the operation of the system is stopped for a long time period. If the time period of stoppage of the system for key business operations becomes longer, it can also become an object of public concern.
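The problems listed above share one cause: the panic handling runs on the faulty CPU, so a recurrence of the error partway through leaves the remaining steps undone. The following Python sketch illustrates this under stated assumptions. The exception `CpuFault`, the function `run_panic_handling`, and the parameter `fault_recurs_at` are hypothetical names, and the fault is injected deterministically for illustration only.

```python
PANIC_STEPS = [
    "display and record error information",
    "synchronize the file system",
    "obtain a memory dump",
    "restart the system",
]


class CpuFault(Exception):
    """Hypothetical signal that the executing CPU failed again."""


def run_panic_handling(fault_recurs_at=None):
    """Run the steps in order; stop at the step where the fault recurs.

    Returns a pair (completed steps, steps that were never executed).
    """
    completed = []
    try:
        for index, step in enumerate(PANIC_STEPS):
            if index == fault_recurs_at:
                raise CpuFault(step)
            completed.append(step)
    except CpuFault:
        pass  # the faulty CPU cannot continue, so handling ends here
    not_done = PANIC_STEPS[len(completed):]
    return completed, not_done


# Assumed scenario: the error recurs while the file system is being synchronized.
completed, not_done = run_panic_handling(fault_recurs_at=1)
print("completed:", completed)
print("not done:", not_done)
```

In the assumed scenario, only the display and recording of error information completes; the synchronization of the file system, the memory dump, and the restart are all lost, which corresponds to the destruction of data, the inability to identify the primary factor, and the prolonged stoppage described above.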
It should be noted that in the technique disclosed in Japanese Unexamined Patent Publication No. 04-340631, the operating system of the low-order computer notifies the fault information. However, there is no guarantee that the operating system operates normally on the faulty low-order computer. This can cause a state where the fault information cannot be notified to the high-order computer.
Further, in the technique disclosed in Japanese Unexamined Patent Publication No. 02-71336, the diagnosis processor obtains fault information from another faulty processor, and the high-order processor initializes and restarts the faulty processor. More specifically, it is assumed that the processors operate individually, and can be restarted independently. In many multi-CPU computers, however, a plurality of CPUs operate on a common operating system. In such multi-CPU computers, data shared by the CPUs exists, and to restart one of the CPUs, processing for ensuring consistency of data is required. Therefore, it is difficult to apply the technique disclosed in Japanese Unexamined Patent Publication No. 02-71336 to a multi-CPU computer in which a plurality of CPUs operate on a common operating system.
Furthermore, in the technique disclosed in Japanese Unexamined Patent Publication No. 11-338838, processors other than the processor in which a fault has occurred obtain fault information, and hence it is possible to obtain fault information by the normal processors. However, the system is restarted by the faulty processor. As a result, since the process for restarting the system is executed by a processor which does not operate normally, there is a possibility that the system cannot be started properly. If the system fails to be restarted, the time period of stoppage of the system becomes longer, which results in degraded operating efficiency of the system.