Fault tolerance generally requires replication, either in space or in time. For replication in space (which may be referred to as duplication), two sets of processors should exhibit the same sequence of events given the same starting point and the same input stream. On a failure, the failing set of processors is removed from the configuration and processing continues.
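As a rough illustration of duplication, the sketch below runs two replicas on the same input stream and drops a replica from the active configuration when it signals a fault. The `Replica` class, the `run_duplicated` function, and the `fail_at` fault-injection parameter are all hypothetical names introduced for this example, and a running total stands in for real processor state.

```python
class Replica:
    """Hypothetical stand-in for one set of processors."""

    def __init__(self, name):
        self.name = name
        self.total = 0
        self.failed = False

    def step(self, value):
        # A failed replica signals its fault instead of producing output.
        if self.failed:
            raise RuntimeError(f"{self.name} has failed")
        self.total += value
        return self.total


def run_duplicated(replicas, inputs, fail_at=None):
    """Feed the same inputs to every active replica.

    fail_at is an assumed (input_index, replica) pair used only to
    inject a fault for demonstration purposes.
    """
    active = list(replicas)
    outputs = []
    for index, value in enumerate(inputs):
        if fail_at is not None and fail_at[0] == index:
            fail_at[1].failed = True
        survivors, results = [], []
        for replica in active:
            try:
                results.append(replica.step(value))
                survivors.append(replica)
            except RuntimeError:
                # Remove the failing replica; processing continues.
                pass
        if not results:
            raise RuntimeError("all replicas have failed")
        active = survivors
        outputs.append(results[0])
    return outputs, [replica.name for replica in active]


a, b = Replica("a"), Replica("b")
outputs, remaining = run_duplicated([a, b], [1, 2, 3], fail_at=(1, a))
print(outputs, remaining)
```

The surviving replica continues without any replay step because, under the determinism assumption, it already holds the same state the failed replica would have held.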
For replication in time (which may be referred to as replay), there are two general options: checkpoint/restart and continuous replay. A checkpoint/restart system creates a checkpoint or snapshot of the current state of the system and a journal file of all inputs since the checkpoint. On a failure, the checkpoint is loaded on another set of processors and the journal file is applied. In some implementations or under some conditions, the original sequence of events may not be important, depending, for example, on the level of coordination between the operating system (OS), the application, and the checkpoint facilities. As an example, if no work has been committed, any permitted sequence of events is acceptable.
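The checkpoint/restart flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any particular system: the `CheckpointedSystem` class and the `recover` function are hypothetical names, system state is modeled as a dictionary of named totals, and each input is an assumed `(key, amount)` pair.

```python
import copy


class CheckpointedSystem:
    """Hypothetical system whose state is a dict of named totals."""

    def __init__(self):
        self.state = {}
        self.checkpoint = {}
        self.journal = []  # inputs received since the last checkpoint

    def apply(self, event):
        key, amount = event
        self.state[key] = self.state.get(key, 0) + amount

    def handle_input(self, event):
        # Journal the input before applying it so it can be replayed.
        self.journal.append(event)
        self.apply(event)

    def take_checkpoint(self):
        # Snapshot the current state and start a fresh journal.
        self.checkpoint = copy.deepcopy(self.state)
        self.journal = []


def recover(checkpoint, journal):
    """Load the checkpoint on a replacement system and apply the journal."""
    replacement = CheckpointedSystem()
    replacement.state = copy.deepcopy(checkpoint)
    for event in journal:
        replacement.apply(event)
    return replacement


primary = CheckpointedSystem()
primary.handle_input(("a", 1))
primary.take_checkpoint()
primary.handle_input(("a", 2))
primary.handle_input(("b", 5))

# Simulated failure: only the checkpoint and the journal survive.
recovered = recover(primary.checkpoint, primary.journal)
print(recovered.state)
```

In this sketch the recovery work grows with the number of journal entries, which is why the interval between checkpoints affects recovery time.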
Replication also may be accomplished by continuous replay, which uses two sets of processors (like duplication) and a journal stream (similar to a checkpoint/restart system). The first set of processors records into the journal the sequence of events that it observes. The second set of processors uses the journal to reproduce that sequence of events during the replay.
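One way to picture continuous replay is shown in the sketch below. It assumes, purely for illustration, that the only nondeterminism is the order in which inputs from different sources are serviced, so the journal needs to record only which source was serviced at each step. The function names `run_primary` and `run_secondary` and the source labels are hypothetical.

```python
from collections import deque


def run_primary(arrivals):
    """Process inputs in the order observed and journal that order.

    arrivals is an assumed list of (source, payload) pairs in the
    order the primary happened to observe them.
    """
    journal = []
    history = []
    for source, payload in arrivals:
        journal.append(source)  # direction for the secondary to follow
        history.append((source, payload))
    return history, journal


def run_secondary(inputs_by_source, journal):
    """Reproduce the primary's sequence by following the journal."""
    queues = {source: deque(items) for source, items in inputs_by_source.items()}
    history = []
    for source in journal:
        # Service the same source the primary serviced at this step.
        history.append((source, queues[source].popleft()))
    return history


arrivals = [("disk", "d0"), ("net", "n0"), ("net", "n1"), ("disk", "d1")]
primary_history, journal = run_primary(arrivals)

secondary_history = run_secondary(
    {"net": ["n0", "n1"], "disk": ["d0", "d1"]}, journal
)
print(secondary_history == primary_history)
```

The secondary receives the same per-source inputs as the primary but would not otherwise know how they were interleaved; the journal supplies exactly that missing ordering.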
Duplication generally requires a high level of determinism in the sequence of events. An advantage of duplication is that fault tolerance generally can be made application independent and operating system independent. A disadvantage is that duplication generally requires dedicated duplicate hardware in addition to that high level of determinism.
A checkpoint/restart system does not necessarily require determinism in the sequence of events. A checkpoint/restart system also does not require dedicated duplicate hardware resources. A checkpoint/restart system does, however, generally require application and operating system modifications to make the system work. Recovery in a checkpoint/restart system also can be fairly lengthy, because the recovery time depends on the frequency of the checkpoints and the length of the journal file that must be applied.
Continuous replay is application and operating system independent, like duplication, but continuous replay has a reduced level of required determinism. Like duplication, continuous replay requires dedicated duplicate hardware. Continuous replay needs a journal stream similar to checkpoint/restart, but it does not need checkpoints or operating system support, and it does not generally have a lengthy recovery time. The journal stream is a sequence of directions that flows from the primary set of resources to the secondary set of resources and indicates the sequence of events that was observed.