Operating System (OS) failure can occur owing to either hardware failure or software failure. When the system fails, it is important to collect the memory dump for diagnosis or problem analysis. This process of collecting the failed system memory into a secondary store is called “dumping”. Typically, existing dumping techniques save the memory dump to a physical secondary storage device (termed the “dump device”) before the system reboots.
Background art dumping techniques copy either all or selected portions of the system memory to a physical device. The copy is usually performed by a single-threaded application, typically with limited resources and limited support from the OS. As system memory configurations grow, the traditional methods of performing a memory dump require (and will continue to require) more time to complete. Several solutions exist that reduce dump time and thereby increase system availability.
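The relationship between memory size and dump time can be illustrated with a rough calculation. The following is a minimal sketch that assumes a constant sustained throughput to the dump device; the function name and the figures used are hypothetical and are not taken from any particular system.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def estimate_dump_seconds(memory_bytes, throughput_bytes_per_s):
    """Estimate the time to copy memory_bytes to the dump device.

    Assumes a constant sustained throughput (a simplification); real
    dump drivers may be slower because of limited OS support.
    """
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return memory_bytes / throughput_bytes_per_s


if __name__ == "__main__":
    # Hypothetical figures: the same device throughput, growing memory sizes.
    for size_gib in (64, 256, 1024):
        seconds = estimate_dump_seconds(size_gib * GIB, 256 * MIB)
        print(f"{size_gib} GiB at 256 MiB/s: {seconds:.0f} s")
```

Under this simple model the dump time grows linearly with the amount of memory dumped, which is why a full dump becomes progressively more expensive as memory configurations grow.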
Parallel Dump: After an OS crash, this technique utilizes all the CPUs in the system to improve the dumping speed. Since the dump driver code executes with minimal OS support (being a firmware driver), dumping cannot be made fully parallel, as synchronization between the dumping threads becomes complicated. This technique is faster than a single-threaded dump, but incurs considerable computing overhead in dumping the physical memory to the dump device. The benefit actually realizable with this technique is limited by the throughput capability of the firmware driver, which is usually single-threaded.
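The bottleneck described above can be sketched in user-space code. This is an illustration only, not an actual dump driver: several worker threads prepare pages concurrently, while all writes pass through one sequential writer standing in for the single-threaded firmware driver. The per-page work (zlib compression here), the page size, and all names are assumptions made for the example.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 4096  # assumed page size for the illustration


def parallel_dump(memory, workers=4):
    """Prepare pages on several threads, write them through one writer.

    Returns the list of records "written" to a simulated dump device.
    """
    pages = [memory[i:i + PAGE_SIZE] for i in range(0, len(memory), PAGE_SIZE)]
    dump_device = []  # stand-in for the physical dump device
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Page preparation runs in parallel; map() yields results in order.
        for record in pool.map(zlib.compress, pages):
            # Writes are serialized here, as with a single-threaded driver.
            dump_device.append(record)
    return dump_device


def restore(dump_device):
    """Rebuild the original memory image from the dumped records."""
    return b"".join(zlib.decompress(record) for record in dump_device)


if __name__ == "__main__":
    image = bytes(range(256)) * 64  # 16 KiB of sample "memory"
    records = parallel_dump(image)
    print(len(records), restore(records) == image)
```

However many workers are used, every record still passes through the single write loop, so in this sketch the overall rate cannot exceed what that one writer sustains.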
Dump to Memory (D2M): This approach copies the memory to be dumped to another part of the physical memory instead of to a secondary memory device. It is fast, as only a memory-to-memory copy is involved. However, the next instance of the OS must boot with less memory (at least until the D2M memory is returned to the OS after a dump analysis or after saving the D2M memory to disk), which can affect overall system performance. Further, D2M incurs a “dump time”, viz. the time spent moving all relevant dump-worthy memory pages to a contiguous physical memory region, and is not able to handle a complete memory dump, as no room remains to load the next kernel.
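The D2M copy and its two limitations can be sketched as follows. This is a simplified model, assuming that physical memory is a flat byte array and that the reserved region occupies the highest pages; the function name, the page size, and the layout are hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size for the illustration


def dump_to_memory(memory, dump_pages, reserved_start_page):
    """Copy the selected pages into a contiguous reserved region.

    memory is a bytearray modelling physical memory. The reserved region
    runs from reserved_start_page to the end of memory. Returns the number
    of pages left for the next OS instance to boot with.
    """
    total_pages = len(memory) // PAGE_SIZE
    reserved_pages = total_pages - reserved_start_page
    if any(page >= reserved_start_page for page in dump_pages):
        raise ValueError("dump pages must lie outside the reserved region")
    if len(dump_pages) > reserved_pages:
        # A complete dump cannot fit: no room would remain for the next kernel.
        raise ValueError("reserved region too small for the requested dump")
    for slot, page in enumerate(dump_pages):
        src = page * PAGE_SIZE
        dst = (reserved_start_page + slot) * PAGE_SIZE
        memory[dst:dst + PAGE_SIZE] = memory[src:src + PAGE_SIZE]
    return reserved_start_page


if __name__ == "__main__":
    # Eight pages, each filled with its own page number.
    ram = bytearray(b"".join(bytes([n]) * PAGE_SIZE for n in range(8)))
    usable = dump_to_memory(ram, dump_pages=[1, 3], reserved_start_page=6)
    print(usable, ram[6 * PAGE_SIZE], ram[7 * PAGE_SIZE])
```

In this model the next OS instance sees only the pages below the reserved region, and a request to dump every page outside that region is rejected because the reserved region cannot hold it.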
Dump While ReBooting (DWRB): The DWRB technique addresses the deficiencies of dump driver performance and improves system availability by saving a minimal amount of memory (termed “golden memory”) before starting the reboot process. However, some time is still required to save the golden memory, even if the best post-panic dump technique (such as a concurrent dump) is employed.
In all the techniques discussed above, certain amounts of time are required to dump the memory to either a secondary store or to another part of memory, before the system can reboot.