A multiprocessing computer system has multiple processes executing on the system. Each process performs a particular task, and the processes, taken as a whole, perform some larger task, called an application. These processes may be executing on a single central computer or they may be running on separate computers which are connected to each other via some type of communication link, i.e. a distributed computer system. As used herein, the term computer includes any device or machine capable of accepting data, applying prescribed processes to the data, and supplying the results of the processes.
In order for the application task to function correctly, the separate processes must coordinate their functions. This coordination is often accomplished through inter-process communication. In a message passing system, inter-process communication is accomplished by processes sending messages to, and receiving messages from, other processes.
If a failure occurs in one of the processes, often the entire application must be reinitialized, because each of the processes is dependent on the successful operation of the other processes. In such a case, each of the processes must be rolled back to the beginning of execution. Such reinitialization is very costly, because all the work performed by the application up until the failure is lost. The cost of restarting a computer system in the event of a process failure is reduced by a technique known as checkpointing.
A checkpoint is a snapshot of the state of a process which is saved on non-volatile storage, and which survives process failure. Upon recovery, the checkpoint can be reloaded into volatile memory, and the process can resume execution from the checkpointed state. Checkpointing reduces the amount of lost work in the event of a process failure because the checkpointed processes only need to be rolled back to the last checkpoint. When the processes in a system are rolled back to previous checkpoints, the set of checkpoints to which the processes are rolled back is called the recovery line.
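By way of illustration only, the following Python sketch shows one way a single process might save a checkpoint to non-volatile storage and later resume from it. The function names save_checkpoint, load_checkpoint, and demo, the use of a dictionary as the process state, and the choice of a file on disk as the non-volatile storage are all assumptions made for this example and are not taken from the description above.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write a snapshot of `state` to `path` (a file stands in for non-volatile storage)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first, so that a failure during the write
    # cannot damage the previously saved checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Atomically replace the old checkpoint with the new one.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Reload the most recently saved snapshot into memory."""
    with open(path, "rb") as f:
        return pickle.load(f)


def demo():
    """Run a toy task, checkpoint at step 5, simulate a failure after step 7."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "process.ckpt")
        state = {"step": 0, "total": 0}
        for step in range(1, 8):
            state["step"] = step
            state["total"] += step
            if step == 5:
                save_checkpoint(state, path)
        # Simulated process failure: the volatile state is lost.
        state = None
        # Recovery: only the work done after step 5 must be repeated.
        return load_checkpoint(path)
```

In this sketch the process resumes from step 5 rather than from the beginning of execution, so only the work of steps 6 and 7 is lost.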
In message passing systems, process rollback is complicated by the need to maintain a consistent state of the system. In a message passing system, the rollback of one process may require the rollbacks of other processes in order to guarantee consistency of the state of the system as a whole. For example, if the sender of a message m rolls back to a checkpointed state before message m was sent (i.e. unsends m), then the receiver of m must also roll back to a checkpointed state before m was received (i.e. unreceives m). If such a rollback procedure is not followed, the states of the two processes together will show that message m has been received but not yet sent, which is inconsistent.
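The rollback dependency described above can be illustrated with the following Python sketch, which is a simplified model under stated assumptions and not a definitive recovery procedure. It assumes that each checkpoint records the set of messages the process had sent and the set it had received when the checkpoint was taken, and that checkpoint 0 of every process is its initial state with both sets empty. The function name recovery_line and the example data are hypothetical.

```python
def recovery_line(checkpoints, start):
    """
    checkpoints: dict mapping a process name to its list of checkpoints,
                 oldest first; each checkpoint is (sent_set, received_set).
                 Assumption: index 0 is the initial state (both sets empty).
    start:       dict mapping a process name to the checkpoint index at
                 which it begins (e.g. where the failed process restarts).
    Returns a dict of checkpoint indices in which no message is recorded
    as received without also being recorded as sent.
    """
    line = dict(start)
    changed = True
    while changed:
        changed = False
        # Messages that count as sent under the current candidate line.
        sent = set()
        for proc, idx in line.items():
            sent |= checkpoints[proc][idx][0]
        for proc in list(line):
            idx = line[proc]
            # Roll the receiver back until every message it has received
            # is also sent (i.e. "unreceive" any unsent message).
            while idx > 0 and not checkpoints[proc][idx][1] <= sent:
                idx -= 1
            if idx != line[proc]:
                line[proc] = idx
                changed = True
    return line


# Hypothetical example: P sends m1 and then m2; Q receives m1 and then m2.
example = {
    "P": [(set(), set()), ({"m1"}, set()), ({"m1", "m2"}, set())],
    "Q": [(set(), set()), (set(), {"m1"}), (set(), {"m1", "m2"})],
}

# P fails and restarts from its checkpoint taken before m2 was sent,
# while Q initially stays at its latest checkpoint.
result = recovery_line(example, {"P": 1, "Q": 2})
```

In this example Q's latest checkpoint shows m2 as received while P's restart checkpoint does not show m2 as sent, so Q is also rolled back by one checkpoint.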
In a system with N processes, a global checkpoint is a set of N checkpoints, one from each process. A consistent global checkpoint is a global checkpoint which results in a consistent system state. It is noted that a system may have more than one consistent global checkpoint.
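As a minimal sketch of these definitions, the following Python code enumerates the global checkpoints of a two-process system and keeps the consistent ones. It assumes, for illustration only, that each checkpoint is represented as a pair of sets (messages sent, messages received), and that a message which has been sent but not yet received is merely in transit and does not make the state inconsistent. The names is_consistent and consistent_global_checkpoints are hypothetical.

```python
from itertools import product


def is_consistent(global_checkpoint):
    """
    global_checkpoint: a sequence of N checkpoints, one from each process,
    each given as (sent_set, received_set).
    The state is treated as inconsistent only if some message is recorded
    as received but not recorded as sent.
    """
    sent = set().union(*(s for s, _ in global_checkpoint))
    received = set().union(*(r for _, r in global_checkpoint))
    return received <= sent


def consistent_global_checkpoints(per_process):
    """
    per_process: a list of N lists of checkpoints, one list per process.
    Returns the index tuples of every consistent global checkpoint.
    """
    index_ranges = [range(len(cps)) for cps in per_process]
    return [
        combo
        for combo in product(*index_ranges)
        if is_consistent([per_process[p][i] for p, i in enumerate(combo)])
    ]


# Hypothetical two-process system: process 0 sends m, process 1 receives m.
# Each process has one checkpoint before and one after the message event.
p0 = [(set(), set()), ({"m"}, set())]
p1 = [(set(), set()), (set(), {"m"})]

found = consistent_global_checkpoints([p0, p1])
```

Of the four global checkpoints in this example, only the combination of process 0 before sending m with process 1 after receiving m is rejected, which illustrates that a system may have more than one consistent global checkpoint.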