An operating system is a software program or collection of software programs that execute on a computer to provide a platform on which application programs can execute. Well-known examples of operating systems include the UNIX operating system (including variants such as the HP-UX® operating system), the Microsoft® Windows® family of operating systems, and the Apple® Mac OS® operating system. Because the operating system of a computer performs essential functions required for the proper operation of other software executing on the computer, the operating system must be as robust as possible. Despite the best efforts of operating system designers, however, an operating system may “crash” or experience other errors due to a defect in the operating system or some other problem. An operating system crash typically causes all other software on the computer to cease executing and requires the computer to be rebooted. Lesser errors may not require a reboot but may nonetheless cause other problems, such as the temporary inability to access a peripheral device.
It is desirable to identify the causes of operating system crashes and other errors so that any bugs in the operating system may be fixed, thereby reducing or eliminating the possibility of future errors. Typically, the first step in identifying the cause of an operating system error is for the computer to “dump” (copy) the contents of the computer's random access memory (RAM) to a storage device (such as a hard disk drive). For example, referring to FIG. 1, a block diagram is shown of a conventional computer 100 including an operating system 102 and a hardware layer 108. The hardware layer 108 includes one or more processors 110, a memory 112, and one or more input/output (I/O) devices 114. The operating system 102 includes a dump module 106 for dumping the contents of memory 112 to one of the I/O devices 114 (such as a hard disk drive) upon detection of an operating system error. The contents of the memory dump may then be examined (automatically by a software program, manually by a programmer, or both) in an attempt to identify the cause of the error.
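The dump step described above can be illustrated with a minimal user-space sketch. It is not the dump module 106 itself: the byte buffer stands in for memory 112, an ordinary file stands in for the storage device, and the names `dump_memory`, `simulate_error_and_dump`, and `CHUNK_SIZE` are hypothetical and chosen only for illustration.

```python
import os
import tempfile

CHUNK_SIZE = 4096  # hypothetical transfer size per write (assumption)


def dump_memory(memory: bytes, path: str, chunk_size: int = CHUNK_SIZE) -> int:
    """Copy `memory` to the file at `path` in fixed-size chunks.

    Returns the total number of bytes written.
    """
    written = 0
    view = memoryview(memory)  # slice the buffer without copying it
    with open(path, "wb") as out:
        for offset in range(0, len(view), chunk_size):
            written += out.write(view[offset:offset + chunk_size])
        out.flush()
        os.fsync(out.fileno())  # ask the OS to push the data to storage
    return written


def simulate_error_and_dump(path: str) -> int:
    """Simulate an error and dump a stand-in memory image on detection."""
    memory = bytearray(range(256)) * 64  # 16384-byte stand-in for memory 112
    try:
        raise RuntimeError("simulated operating system error")
    except RuntimeError:
        # On detecting the error, copy the memory contents to storage.
        return dump_memory(memory, path)


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "dump.bin")
    print(simulate_error_and_dump(target), "bytes dumped to", target)
```

The resulting file can then be examined offline, by a program or by a programmer, which corresponds to the analysis step described above.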
In conventional computer 100, the operating system 102 itself is responsible for performing a memory dump upon detection of an error. Because the operating system 102 has experienced an error, however, it may be in an unknown state which makes it unable to perform the memory dump correctly. Furthermore, the operating system 102, like many conventional operating systems, includes both conventional device drivers 104a and special “dump” device drivers 104b, which the dump module 106 must use during memory dumps. Programming such special device drivers 104b can be tedious and time-consuming because they operate under constraints imposed by the unknown state of the operating system 102, which also makes them difficult to maintain. For example, the dump module 106 typically executes in a single thread. Furthermore, driver spinlocks may have been held at the time of the error, so the device drivers 104b cannot assume that the I/O devices 114 are in a known state. The device drivers 104b which are used to perform memory dumps must be capable of operating under such restricted conditions.
Furthermore, conventional techniques for dumping memory impose constraints on the time at which memory dumping may be performed. For example, existing techniques cannot perform a memory dump very early in the process of loading the operating system 102 because the necessary dump device drivers 104b have not yet been loaded. Such techniques cannot, therefore, be used to perform a memory dump if an error occurs, or if a dump is desired for some other reason, before the necessary dump device drivers 104b have been loaded.
As the amount of memory in a single computer continues to increase, the amount of time required to dump such memory to a storage device continues to increase. Furthermore, the techniques that currently are used to perform memory dumps impose constraints that limit the speed at which the contents of memory may be dumped.
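The relationship between memory size and dump time can be made concrete with a rough estimate. The sketch below assumes that the dump proceeds at a constant sustained throughput and ignores compression, page filtering, and device latency. The function name `estimate_dump_seconds` and the figures in the usage example are hypothetical and are not taken from any particular system.

```python
def estimate_dump_seconds(memory_bytes: int, throughput_bytes_per_s: float) -> float:
    """Estimate dump time as memory size divided by sustained throughput.

    Assumes a constant transfer rate, which is a simplification.
    """
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return memory_bytes / throughput_bytes_per_s


if __name__ == "__main__":
    GIB = 2 ** 30
    MIB = 2 ** 20
    # Hypothetical figures: the same assumed 100 MiB/s path, growing memory sizes.
    for size_gib in (4, 64, 1024):
        seconds = estimate_dump_seconds(size_gib * GIB, 100 * MIB)
        print(f"{size_gib} GiB: about {seconds / 60:.1f} minutes")
```

Under this simplified model the dump time grows linearly with memory size, so any constraint that caps the achievable throughput becomes more costly as memory sizes grow.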