As software grows in complexity, the opportunities for errors or faults in the software increase. Such errors or faults are generally handled by exception handlers, but not all of them can be handled this way, and those that are not can lead to system crashes. In most cases, a dump file is generated upon a system crash so that the problem can be debugged. This dump file, often called a kernel dump file, contains the entire memory image of the program, including the central processing unit (CPU) state.
If an error occurs in the kernel of the computer system, the entire system crashes, and all applications running on the computer system crash with it. Most such crashes are caused by bugs in the kernel. When such a crash occurs, the kernel dump file is saved, the computer system is rebooted, and the applications can be restarted. Restarting an application from the beginning, however, can lead to loss of data and computation time. Current backup techniques employ checkpoint-restore for fault tolerance.
In a checkpoint-restore technique, the application is checkpointed at regular intervals of time. If the kernel crashes, the computer system is rebooted and the application can be restored from the last checkpoint. The checkpoint-restore technique has a few drawbacks. First, the computations performed by the application between the last checkpoint and the time of the system crash are lost. Second, checkpointing itself may take a long time if the application has a large memory buffer. Third, checkpointing may freeze the application while a consistent snapshot of the application is generated, and because this is repeated at regular intervals, it affects application performance.
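The first drawback can be illustrated with a minimal sketch. This is a toy model and not an actual checkpointing system: the class `CheckpointedCounter`, its `interval` parameter, and the in-memory snapshot are all hypothetical names and simplifications introduced for illustration. A real checkpoint would be written to persistent storage so that it survives a reboot.

```python
import copy


class CheckpointedCounter:
    """Toy application (hypothetical) whose state is checkpointed every `interval` steps."""

    def __init__(self, interval):
        self.interval = interval
        self.state = {"step": 0, "total": 0}
        # Assumption: the snapshot is held in memory here; a real system
        # would persist it so that it survives a kernel crash and reboot.
        self._checkpoint = copy.deepcopy(self.state)

    def run_step(self):
        # One unit of application work: add the step number to a running total.
        self.state["step"] += 1
        self.state["total"] += self.state["step"]
        if self.state["step"] % self.interval == 0:
            # The application does no useful work while the snapshot is copied,
            # and the copy grows with the size of the state.
            self._checkpoint = copy.deepcopy(self.state)

    def restore(self):
        # Roll back to the last checkpoint and report how many steps were lost.
        lost = self.state["step"] - self._checkpoint["step"]
        self.state = copy.deepcopy(self._checkpoint)
        return lost


app = CheckpointedCounter(interval=10)
for _ in range(27):  # simulate a crash after step 27
    app.run_step()

lost_steps = app.restore()
print(lost_steps, app.state)
```

In this sketch the last checkpoint is taken at step 20, so the seven steps performed between that checkpoint and the simulated crash must be recomputed after the restore. Shortening the interval reduces the lost work but increases how often the application pauses to take a snapshot.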