Operating systems are typically configured to perform memory dumps when a system crash occurs or when a serious failure involving hung user processes or services is detected. A memory dump consists of copying the contents of main memory to secondary storage, for example as a file stored on a hard disk or other secondary storage medium. In the case of a system crash, a memory dump typically must be followed by a reboot of the system. Full memory dumps are indispensable resources for the analysis and correction of problems related to crashes and for the development of reliable systems.
Writing data from main memory to a hard disk is a relatively slow operation. In the case of a full memory dump, the system must scan the entire contents of memory and write them to secondary storage. The principal drawback of generating a full memory dump is therefore the length of “down time” it entails, during which the system is effectively unusable for other purposes. This down time is a function of the onboard memory size and, where a system reboot is required, the speed of the boot storage device. Writing sixteen gigabytes of memory to disk, for example, can take more than an hour to complete. For a computer system with 64 gigabytes of memory, generating a full memory dump may take as long as six hours.
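The relationship between memory size and down time described above can be illustrated with a rough calculation. The following is only a sketch under stated assumptions: it treats the dump as a single sequential write at a constant sustained rate, uses binary units (1 gigabyte taken as 1024 megabytes), and adds an optional fixed reboot time. The function names, the 4.0 megabytes-per-second rate, and the reboot parameter are hypothetical and are not taken from this specification. Read backwards, the figures quoted above (sixteen gigabytes in about an hour, 64 gigabytes in about six hours) correspond to an effective dump rate of only a few megabytes per second.

```python
def dump_seconds(memory_gib: float, write_mib_per_s: float) -> float:
    """Estimate the time to write `memory_gib` of RAM to secondary storage.

    Assumes one sequential write at a constant sustained rate
    (an illustrative simplification, not a measured model).
    """
    if write_mib_per_s <= 0:
        raise ValueError("write rate must be positive")
    return memory_gib * 1024 / write_mib_per_s


def downtime_hours(memory_gib: float, write_mib_per_s: float,
                   reboot_s: float = 0.0) -> float:
    """Total down time in hours: dump time plus an optional reboot time."""
    return (dump_seconds(memory_gib, write_mib_per_s) + reboot_s) / 3600


def implied_rate_mib_per_s(memory_gib: float, hours: float) -> float:
    """Effective write rate implied by dumping `memory_gib` in `hours`."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return memory_gib * 1024 / (hours * 3600)


if __name__ == "__main__":
    # Hypothetical sustained rate, chosen only for illustration.
    assumed_rate = 4.0
    for size_gib in (16, 64, 128):
        hours = downtime_hours(size_gib, assumed_rate)
        print(f"{size_gib:>4} GiB at {assumed_rate} MiB/s: {hours:.2f} h")
```

Because the estimate is linear in memory size, doubling the installed memory doubles the dump time at a fixed write rate, which is why growth in installed RAM translates directly into longer down time.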
FIG. 1 of the drawings accompanying this specification provides a simplified illustration of the process by which a full memory dump is performed following a system crash, in accordance with the prior art. This illustration helps clarify the detailed description of the invention provided below. FIG. 1 represents events in time, with time increasing from left to right. Initially, the system runs with its full amount of memory 101, here n gigabytes. A system crash event 103 occurs. A full memory dump of the n gigabytes of memory 101 is then performed (signified by the arrow 105), following which the computer system undergoes a reboot 107.
The amount of physical memory included in conventional computers has been steadily increasing. This increase is due to regular capacity improvements in random access memory (RAM) technology, to the availability of 64-bit processor technology, and to growth in memory usage by typical computer programs. As a result, the average down time associated with the performance of full memory dumps has increased. Currently, enterprise server machines are equipped to use as much as 32 to 64 gigabytes of onboard RAM. It is expected that by 2007 low-end and mid-range servers will be capable of using up to 128 gigabytes of RAM. In such cases, the performance of a full memory dump after a crash, followed by a reboot of the system, will usually be impractical.
Alternatives to performing a full memory dump exist. For example, a memory dump of a portion of main memory, such as the operating system kernel space or the space allocated to a specific process, can be performed. However, it is generally not possible to know in advance whether a crash is due to a kernel-mode process or to a specific user process. Moreover, in some cases involving a “freeze” of the system it is not possible to generate a process-specific memory dump.