An operating system kernel is the part of the operating system that lies between the physical hardware layer and the software layer. FIG. 1 shows a computing system 100 in which a kernel 102 manages the interaction between a software layer 104 and a hardware layer 106. The kernel 102 provides a hardware abstraction layer that is used by the software layer 104 to communicate with and utilise the hardware layer 106. The software layer 104 may include system software, including operating system software, and/or application software, including applications that perform tasks required of the computing system 100.
A kernel panic occurs when a process in the kernel encounters an unrecoverable error. Examples of unrecoverable errors include hardware failures and bugs in the kernel. The operating system may include panic routines that are executed when a kernel panic occurs. The panic routines may create a “crash dump”, in which the contents of some or all of the physical memory of the computing system are stored as a file on permanent storage such as a hard disk. The panic routines may display a message on a display device, such as a monitor, indicating that a kernel panic has occurred, and may provide information relating to the kernel panic. The panic routines may then restart the computing system on which the kernel was running when the kernel panic occurred. Alternatively, the panic routines may wait for another computing system running a debugger to connect to the kernel and debug the kernel panic.
When a kernel panic occurs, any computation performed by an application running on the computing system may be lost. To avoid this problem, the application and/or the operating system may store application checkpoints at periodic intervals, each checkpoint being a snapshot of the state of the processes associated with the application. In the event of a kernel panic, the computing system on which the kernel is running may be reset, and the application may be restored to the most recently stored state, so that only computation performed since that checkpoint is lost. However, storing application checkpoints imposes an overhead on the computing system, which may reduce the efficiency of the computing system and/or of any applications running on it. Furthermore, restoring the application typically requires action by a system administrator, which may not take place until an unspecified length of time after the kernel panic. Also, if the operating system does not store application checkpoints, the application itself must be programmed to store them.
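By way of illustration only, the periodic checkpointing described above can be sketched at application level as follows. This is a minimal sketch under stated assumptions, not a description of any particular prior-art system: the function names `save_checkpoint`, `load_checkpoint` and `run`, the use of a pickled dictionary as the application state, and the `crash_at` parameter that stands in for a kernel panic are all hypothetical and chosen for the example.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write a snapshot of `state` to `path` (hypothetical helper).

    The snapshot is written to a temporary file and then renamed, so a
    failure part-way through a write does not corrupt the previous checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the data to permanent storage
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last stored state, or `default` if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(total_steps, path, interval, crash_at=None):
    """Sum the integers below `total_steps`, checkpointing every `interval` steps.

    `crash_at` is an assumption of this sketch: it simulates the loss of
    the running application (for example, through a kernel panic).
    """
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(state, path)  # overhead paid on every interval
    return state["total"]
```

In this sketch, a run that fails between checkpoints resumes from the last stored state when `run` is called again with the same path, so only the steps performed since that checkpoint are repeated. A shorter interval reduces the computation lost but increases the storage overhead noted above.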
It is an object of embodiments of the invention to at least mitigate one or more of the problems of the prior art.