Most faults encountered in a computing device are transient or intermittent in nature, exhibiting themselves as momentary glitches. However, since transient and intermittent faults can, like permanent faults, corrupt data that is being manipulated at the time of the fault, it is necessary to have on record a recent state of the computing device to which the computing device can be returned following the fault.
Checkpointing is one option for realizing fault tolerance in a computing device. Checkpointing involves periodically recording the state of the computing device, in its entirety, at time intervals designated as checkpoints. If a fault is detected at the computing device, recovery may then be had by diagnosing and circumventing a malfunctioning unit, returning the state of the computing device to the last checkpointed state, and resuming normal operations from that state.
Advantageously, if the state of the computing device is checkpointed several times each second, the computing device may be recovered (or rolled back) to its last checkpointed state in a fashion that is generally transparent to a user. Moreover, if the recovery process is handled properly, all applications can be resumed from their last checkpointed state with no loss of continuity and no contamination of data.
Nevertheless, despite the existence of current checkpointing protocols, improved systems and methods for checkpointing the state of a computing device, and/or its component parts, are still needed.