1. Field of the Invention
The present invention generally relates to automatically producing an accurate diagnostic report and, possibly, automatically reviving a crashed or hung operating system instance. More specifically, a healthy running operating system (OS) can register a recovery/repair kernel with the firmware, so that when an OS crash or hang is detected, the firmware copies the system kernel memory to a reserved location and then copies the repair kernel into low memory to attempt an automatic repair. If the repair is successful, the firmware swaps back to running the original kernel without a reboot.
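For illustration only, the sequence just described can be modeled as a toy simulation. This is a minimal sketch under stated assumptions, not the firmware implementation: the class `Firmware`, its method names, the `repair` callback, and the use of byte strings to stand in for low memory and the reserved location are all hypothetical and do not come from the specification.

```python
from typing import Callable, Optional


class Firmware:
    """Toy model of the register / save / repair / swap-back flow (hypothetical names)."""

    def __init__(self) -> None:
        self.low_memory: bytes = b""                   # image of the currently running kernel
        self.reserved: Optional[bytes] = None          # reserved location for the saved kernel memory
        self.repair_kernel: Optional[bytes] = None     # registered by a healthy OS

    def boot(self, system_kernel: bytes) -> None:
        self.low_memory = system_kernel

    def register_repair_kernel(self, image: bytes) -> None:
        # A healthy running OS registers its recovery/repair kernel in advance.
        self.repair_kernel = image

    def on_failure_detected(self, repair: Callable[[bytes], Optional[bytes]]) -> bool:
        """Run the recovery flow; return True if the original kernel resumed."""
        if self.repair_kernel is None:
            return False  # nothing was registered, so no automatic repair is possible
        # 1. Copy the system kernel memory to the reserved location.
        self.reserved = self.low_memory
        # 2. Copy the repair kernel into low memory.
        self.low_memory = self.repair_kernel
        # 3. The repair kernel attempts a repair on the saved image (assumed callback).
        repaired = repair(self.reserved)
        if repaired is None:
            return False  # repair failed; the saved image stays in the reserved location
        # 4. Swap back to the original kernel without a reboot.
        self.low_memory = repaired
        self.reserved = None
        return True
```

In this model, a successful repair leaves the repaired original kernel image back in low memory, while a failed repair leaves the saved image in the reserved location where it remains available for diagnosis.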
2. Description of the Related Art
FIG. 1 shows an exemplary block diagram 100 of an SMP (Symmetric Multi-Processor) server with one or more LPARs (Logical PARtitions), including hypervisor firmware 104 that oversees the LPAR instances 101-103. Each LPAR 101-103 runs an OS instance, such as an AIX OS instance.
Currently, when an OS instance fails (crashes or hangs), as demonstrated by LPAR2 102 in FIG. 1, the customer has to collect the system dump(s) and send them to the OS vendor's technical support team, who will then diagnose the problem using the dump(s). There are a few problems with this approach:
1) This process is time consuming, particularly when the dump file is huge, which is becoming more prevalent as system memory continues to increase in size.
2) The OS vendor's support team may not have access to all of the OS instance's information, in which case they will have to go through multiple iterations of system dump collection and analyses.
3) The OS instance may be too damaged to be able to dump its contents to the disk. That is, the system dump component may itself fail, leaving the system in a non-diagnosable state.
Hence, it would be beneficial to both customers and OS vendors if an online analysis of the failing OS instance could be done, preferably automatically. Currently, there are two approaches known to the present inventors that address parts of the above problems:
1. FirmWare Assisted Dump (FWAD):
The publication “Firmware Assisted Dump in a Partitioned Environment using Reserved Partition Memory” (IP.com# IPCOM000166859D) describes a mechanism that can be used to dump an OS instance which cannot dump its own contents to disk (the third problem listed above). FWAD works by pre-registering the OS kernel's data regions with the firmware, so that those regions can be copied to safe memory regions which are preserved across the next reboot. The firmware and the rebooting OS instance can then dump the failing OS kernel's data to the dump device.
However, FWAD does not eliminate the requirement for off-line processing of the dump data. The customer still has to collect the dump data from a device and send it to the OS vendor's technical support team, who will analyze the dump. Moreover, this solution requires an OS reboot, which can take from several minutes to a few hours to complete.
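The FWAD flow summarized above (pre-register regions, preserve them on failure, dump them after the reboot) can be sketched as a small simulation. This is an illustrative model only and does not reproduce the cited publication's interfaces: the class `FwadFirmware`, its methods, the `(start, length)` region format, and the list standing in for the dump device are assumptions made for the example.

```python
class FwadFirmware:
    """Toy model of firmware-assisted dump (all names are hypothetical)."""

    def __init__(self) -> None:
        self.registered = []   # (start, length) regions pre-registered by the OS
        self.preserved = {}    # safe memory that survives the next reboot

    def register_region(self, start: int, length: int) -> None:
        # Done while the OS is still healthy.
        self.registered.append((start, length))

    def on_os_failure(self, memory: bytes) -> None:
        # Copy each registered kernel data region into preserved memory.
        for start, length in self.registered:
            self.preserved[(start, length)] = memory[start:start + length]

    def dump_after_reboot(self, dump_device: list) -> None:
        # After the reboot, write the preserved regions to the dump device.
        for region in sorted(self.preserved):
            dump_device.append((region, self.preserved[region]))
        self.preserved.clear()
```

Note that even in this simplified model the dump only becomes available after the reboot step, which reflects the limitation discussed above: the data still has to be collected from the dump device and analyzed off-line.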
2. An Ambulance LPAR:
This is a service partition (LPAR) in the same hardware system that contains the LPAR with the failed OS instance. The OS in this ambulance LPAR can do an online diagnosis of the failed OS, and is described in the above-identified co-pending application.
The main problem with this approach is the security concern, because all the memory belonging to the failed LPAR, including application data, is exposed to the ambulance LPAR.
Another problem with the ambulance LPAR approach is that the layout of the data structures can vary among different OS versions. So, each OS version running in the hardware system needs an ambulance LPAR that runs the same OS version, making the ambulance LPAR an expensive and hard-to-manage proposition.
Therefore, a need continues to exist for improving the servicing of failed OS instances. Particularly, it would be useful to have a mechanism that can do automated and/or online analysis of a failed OS instance, but without the drawbacks associated with the FWAD or the Ambulance LPAR as described briefly above and in the above-identified co-pending application.