A typical distributed system includes several interconnected nodes. The nodes may communicate using messages, shared memory, etc. By communicating, the nodes in a distributed system are able to provide greater functionality than any single node. For example, the distributed system may be used for communication between users, for solving computationally hard problems, for dividing tasks between nodes to provide greater throughput (e.g., a web server accessing a database server), etc.
Occasionally, one or more nodes in the distributed system may fail. The failure of a node may be attributed to a hardware failure or a software failure. For example, a hardware failure may occur when the processor crashes, when data cannot be transmitted (e.g., messages, or data to and from storage), etc. Likewise, a software failure may occur when a module of an application or of the operating system fails to execute properly (or as expected). For example, an application or operating system thread may execute an infinite loop, several threads waiting for resources may cause deadlock, a thread may crash, etc.
Managing a distributed system when failure occurs involves detecting that a failure has occurred and recovering from the failure before the entire distributed system crashes. Often, rather than restarting the distributed system from the beginning in order to recover, the distributed system restarts execution from a checkpoint. A checkpoint is a point in the execution of a distributed system at which the memory information is stored in order to allow recovery. Because the nodes communicate and may therefore depend on one another, one part of the distributed system cannot be restarted from the checkpoint on its own if that part depends on another part of the distributed system. Accordingly, when an error is detected, the entire distributed system is restarted from the checkpoint. Thus, in order to perform a checkpoint, the entire distributed system must perform the checkpoint at the same point in execution.
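By way of illustration only, the following Python sketch simulates a system-wide checkpoint and restart within a single process. The names Node, System, checkpoint, and restart_from_checkpoint are hypothetical and are not taken from any particular implementation. Each simulated node consumes a value produced by its neighbor, which stands in for the inter-node dependencies described above. For that reason the sketch saves and restores the state of every node together.

```python
import copy


class Node:
    """A simulated node whose memory information is a small dictionary."""

    def __init__(self, name):
        self.name = name
        self.state = {"step": 0, "total": 0}

    def run_step(self, incoming):
        # Advance one step using a value received from another node.
        self.state["step"] += 1
        self.state["total"] += incoming


class System:
    """A simulated distributed system (hypothetical, single process)."""

    def __init__(self, names):
        self.nodes = [Node(n) for n in names]
        self.saved = None  # memory information stored at the last checkpoint

    def step(self):
        # Each node depends on the previous total of its neighbor.
        totals = [n.state["total"] for n in self.nodes]
        for i, node in enumerate(self.nodes):
            node.run_step(totals[i - 1] + 1)

    def checkpoint(self):
        # Every node is saved at the same point in execution.
        self.saved = [copy.deepcopy(n.state) for n in self.nodes]

    def restart_from_checkpoint(self):
        # The entire system is restored, not only the node that failed.
        if self.saved is None:
            raise RuntimeError("no checkpoint has been performed")
        for node, state in zip(self.nodes, self.saved):
            node.state = copy.deepcopy(state)
```

In this sketch, calling checkpoint() after some number of steps and later calling restart_from_checkpoint() returns every node to the step at which the checkpoint was taken, so that no node holds a value derived from work that the other nodes have rolled back.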
One method for performing a checkpoint is for a master node to send a child node a stop thread call together with a command to respond. When the child node receives the stop thread call, the child node immediately stops whatever thread is executing, regardless of whether the thread is in a critical section, and performs a checkpoint. The child node may then respond to the master node that a checkpoint has been performed.
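The exchange described above may be sketched as follows, again as a simplified single-process simulation and not as an actual implementation. The message names STOP_THREAD and CHECKPOINT_DONE, the classes ChildNode and MasterNode, and the in_critical_section flag are all assumptions introduced for illustration. Real threads are not used. A boolean field stands in for the executing thread so that the behavior is deterministic.

```python
import copy
from dataclasses import dataclass, field
from typing import Optional

STOP_THREAD = "stop_thread"          # hypothetical message name
CHECKPOINT_DONE = "checkpoint_done"  # hypothetical reply name


@dataclass
class ChildNode:
    name: str
    memory: dict = field(default_factory=dict)
    in_critical_section: bool = False  # whether the thread holds a critical section
    running: bool = True               # stands in for the executing thread
    checkpoint: Optional[dict] = None

    def receive(self, message):
        if message != STOP_THREAD:
            raise ValueError(f"unexpected message: {message!r}")
        # Stop immediately, without checking in_critical_section first.
        self.running = False
        # Perform the checkpoint by storing the memory information as it is now.
        self.checkpoint = {
            "memory": copy.deepcopy(self.memory),
            "in_critical_section": self.in_critical_section,
        }
        # Respond to the master node, as the stop thread call commands.
        return (self.name, CHECKPOINT_DONE)


class MasterNode:
    def __init__(self, children):
        self.children = list(children)

    def perform_checkpoint(self):
        # Send the stop thread call to each child node and collect the replies.
        replies = [child.receive(STOP_THREAD) for child in self.children]
        return all(reply == CHECKPOINT_DONE for _, reply in replies)
```

In this sketch, a child node that is stopped while in_critical_section is true records that fact in its checkpoint, which makes visible that the stored memory information was captured partway through the critical section.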