1. Field of the Invention
The present invention relates to a program, a method, and a mechanism for taking a panic dump in the event of failure.
2. Description of the Related Art
With the widespread use of information and communications technology, information processing devices, especially server systems operated in a basic system, require high reliability. Therefore, when a failure occurs during the operation of the system, it is indispensable to collect information immediately and to continue the operation.
Generally, when a system cannot continue its operation because of a fatal failure, the function of dumping memory data, that is, the panic dump facility, is used.
The panic dump facility stores the contents of memory at the moment when an abnormality that prevents the system from continuing its operation is detected. Normally, the OS (operating system) or a program operating in the kernel performs the dumping process.
For example, when the CPU receives an abnormality detection interrupt signal while executing a program in the kernel, the CPU passes control to the memory dump program in the kernel to take a memory dump.
Since a memory dump program is incorporated into a kernel, necessary information for a failure analysis can be dumped in the optimum size.
However, if the cause of the abnormality is, for example, a defective program operating in the kernel, an inconsistency in control data, the destruction of the memory storing the kernel (program), or abnormal hardware, the dumping process may need to use the very resource (for example, the destroyed memory) that caused the abnormality. In that case, the abnormality occurs again during the dumping process, and the dumping process can fail.
Furthermore, depending on the type of abnormality, the system can hang up and control cannot be passed to the panic dumping process. As a result, the memory data cannot be successfully dumped.
To solve the above-mentioned problems, there is a stand-alone dump, which resets the system while retaining the memory data, resets the hardware resources other than the memory data to be dumped, activates a dumping process program again, and dumps the memory data in the environment after the reset.
For example, if the memory dump program incorporated into the kernel as described above cannot take a memory dump and the system hangs up, the system is reset with the data in memory left as is, a stand-alone dump program, which is different from the memory dump program incorporated into the kernel, is activated, and the memory dump is taken.
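The fallback described above can be sketched as a small simulation. This is only an illustration under stated assumptions: the function names, the `kernel_state_ok` flag, and the dictionary model of hardware resources are hypothetical and do not come from any real kernel or firmware interface.

```python
class DumpError(Exception):
    """Raised when a dump routine needs a resource it cannot use."""


def kernel_panic_dump(memory, kernel_state_ok):
    """In-kernel dump: depends on kernel control data being intact."""
    if not kernel_state_ok:
        # The dump routine needs the very resource that caused the failure.
        raise DumpError("kernel control data is corrupted")
    return bytes(memory)


def reset_preserving_memory(hardware):
    """Reset every modeled hardware resource; main memory is not touched."""
    return {name: "reset" for name in hardware}


def standalone_dump(memory):
    """Stand-alone dump: runs after the reset and ignores kernel state."""
    return bytes(memory)


def take_dump(memory, hardware, kernel_state_ok):
    """Try the in-kernel dump first, then fall back to the stand-alone dump."""
    try:
        return "kernel", kernel_panic_dump(memory, kernel_state_ok)
    except DumpError:
        # Memory contents are kept as is; only the other resources are reset.
        hardware.update(reset_preserving_memory(hardware))
        return "stand-alone", standalone_dump(memory)
```

In this model the memory contents reach the dump unchanged on either path; the only difference is whether the hardware resources were reset first.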
With the stand-alone dump, a dump can be taken regardless of the environment (inconsistency of kernel control data, destruction of memory, etc.) in which the abnormality occurred. When a temporary abnormality occurs in the hardware, there is a strong possibility that the reset returns the hardware to normal operation. When a permanent hardware abnormality occurs, there is a strong possibility that the abnormality is detected by the POST (power-on self-test) diagnostics performed when the system is reactivated by the reset, and in the process of initializing the hardware.
However, the conventional stand-alone dump has the following problems.
1) To reset the system and load a stand-alone dump program, it is necessary to save in advance the data of the memory area that the stand-alone dump program will overwrite. That is, the boot firmware that boots the OS has to save the data before the stand-alone dump program is loaded (stored).
To attain this, the boot firmware needs either hardware resources in which memory data can be temporarily saved, or a dedicated partition reserved on a disk, etc., under the control of the boot firmware, in which the data can be stored as a file.
Dedicated hardware resources are disadvantageous in cost. When a dedicated partition is used, it has to be guaranteed that a dedicated partition for temporarily saving memory data exists on a connected disk. However, since the boot firmware does not control whether or not a dedicated partition is reserved on a connected disk, partition management of the disk inevitably becomes complicated.
2) Since a stand-alone dump program is booted with the memory data at the time of the failure retained, the in-memory data of the boot firmware and the OS loader (the program that loads the OS into memory) is completely overwritten during the booting process of the system. Therefore, when an abnormal condition occurs between the kernel and the above-mentioned boot firmware and OS loader, the data of the firmware cannot be taken as dump data, which complicates the necessary investigation.
3) Since a server system is normally equipped with several GB or several tens of GB of main memory, it is not practical to take the data of all memory in a panic dump. Therefore, it is normal to take only the areas necessary for investigation, such as the kernel text and kernel data of the operating system. To obtain information about these areas, it is necessary to search and analyze a table in the kernel, but that information depends on the version number of the kernel. Therefore, when a stand-alone dump program, which is a different program from the operating system, is used, a stand-alone dump program corresponding to the version number of the operating system has to be prepared; that is, the version number of the stand-alone dump program has to match the version number of the corresponding operating system. If they do not match, the table in the kernel cannot be searched. As a result, the dumping process fails or all implemented memory data has to be dumped.
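The version dependence in problem 3) can be illustrated with a short sketch. The version strings, the region names, and the address ranges in `KNOWN_LAYOUTS` are invented for illustration only, and the sketch models just one of the two outcomes named above (dumping all memory when the versions do not match) rather than a failed dump.

```python
# Hypothetical table layouts that a stand-alone dump program was built for.
# Each entry maps a region name to a (start, end) address range.
KNOWN_LAYOUTS = {
    "5.1": {"kernel_text": (0, 4), "kernel_data": (4, 10)},
    "5.2": {"kernel_text": (0, 6), "kernel_data": (6, 12)},
}


def select_dump_ranges(kernel_version, memory_size, layouts=KNOWN_LAYOUTS):
    """Return the address ranges to dump for the given kernel version."""
    layout = layouts.get(kernel_version)
    if layout is None:
        # Version mismatch: the kernel table cannot be interpreted,
        # so the only safe choice in this model is to dump everything.
        return [(0, memory_size)]
    # Version match: dump only the areas needed for the investigation.
    return sorted(layout.values())
```

For a kernel version the dump program knows, only the kernel text and data ranges are returned; for an unknown version the whole of memory is returned, which is the impractical case for a server with tens of GB of main memory.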
Japanese Patent Laid-open Publication No. Hei 08-095834 discloses a system for solving the above-mentioned problems 1) and 2). In that system, a system dump producing program use area, which is not used during normal operation, is provided in the main storage memory area of the system separately from the operating system use area used by the operating system. When a system dump cannot be taken due to a hang-up, etc., the computer system is reset and, before the operating system is reloaded into the operating system use area, the system dump producing program is loaded from an external storage device into the system dump producing program use area and executed, thereby taking a system dump of the operating system use area.
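The idea of a separate dump program use area can be sketched as follows. This is a minimal model, not the implementation of the cited publication: main memory is a byte array, the two areas are (start, end) ranges chosen arbitrarily, and the function names are hypothetical.

```python
def load_dump_program(memory, program, reserved):
    """Load the dump program into the reserved area only."""
    start, end = reserved
    if len(program) > end - start:
        raise ValueError("dump program does not fit in the reserved area")
    # Writes stay inside the reserved area, so the OS area is not overwritten.
    memory[start:start + len(program)] = program


def dump_os_area(memory, os_area):
    """Copy out the operating system use area as it was at the failure."""
    start, end = os_area
    return bytes(memory[start:end])
```

Because the dump program is loaded outside the operating system use area, nothing in that area has to be saved beforehand, which is how such a layout avoids problem 1).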
However, in a stand-alone dump system separate from the operating system, the problem pointed out in 3) above cannot be solved. That is, in the system disclosed by Japanese Patent Laid-open Publication No. Hei 08-095834, a system dump can be taken for an area indicated in a list of areas for which a system dump is to be obtained from the table information for management of the area on the main storage device allocated statically or dynamically. Therefore, a list of target areas has to be prepared in advance.
Since the information used to prepare the list largely depends on the version number of the operating system, the problem that version control of the operating system and version control of the list inevitably become complicated cannot be solved. Furthermore, since the list of areas for which a system dump has to be taken must be prepared in advance, the areas dynamically allocated during the operation of the operating system cannot be anticipated in detail, and it is difficult to efficiently collect the necessary information in a large server system having a complicated system configuration. If the version numbers do not match each other, a system dump cannot be taken for a necessary area.