1. Field of the Invention
This invention is related to providing checkpoint services for applications in a cluster of servers.
2. Description of the Related Art
Clusters of computer systems (“nodes”) are often used to provide scalability and high availability for various applications. For high availability, an application may be “failed over” from a first node on which it was executing to a second node. The failover may occur due to a hardware or software problem on the first node, for example. The first node may “crash”, or the application may crash on the first node. Additionally, mechanisms similar to failover may be used to move an executing application from a first node to a second node for performance reasons. For example, the second node may have more resources than the first node, may have resources that can execute the application with higher performance, or may simply be less loaded than the first node.
In either a failover or load-balancing scenario, the state of the application is typically checkpointed. The checkpointed state can be reloaded into the application (e.g. on the second node, when a failover or load balance move occurs) to resume execution with the checkpointed state. In this manner, the state of the application is resilient up to the most recent checkpoint.
Typically, one of two mechanisms is used to checkpoint applications executing on a cluster. In one mechanism (frequently used in cluster server software used to manage the cluster), shared storage that all the nodes can access is provided in the cluster. Checkpoints are written to and read from the shared storage.
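The shared-storage mechanism can be illustrated with a minimal sketch. It is an assumption-laden illustration, not the implementation of any particular cluster server: a temporary directory stands in for the shared storage, and the names `write_checkpoint`, `read_checkpoint`, and `Counter` are hypothetical.

```python
import json
import os
import tempfile


def write_checkpoint(shared_dir, app_name, state):
    """Write the application's state to shared storage (hypothetical helper)."""
    final_path = os.path.join(shared_dir, app_name + ".ckpt")
    # Write to a temporary file first so a reader never sees a partial checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=shared_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, final_path)  # atomic when both paths are on one filesystem
    return final_path


def read_checkpoint(shared_dir, app_name):
    """Read the most recent checkpoint back from shared storage."""
    with open(os.path.join(shared_dir, app_name + ".ckpt")) as f:
        return json.load(f)


class Counter:
    """Toy application whose entire state is a single counter (assumption)."""

    def __init__(self, state=None):
        self.state = state if state is not None else {"count": 0}

    def step(self):
        self.state["count"] += 1


def demo():
    # The temporary directory plays the role of storage visible to every node.
    with tempfile.TemporaryDirectory() as shared_dir:
        first = Counter()            # application running on the first node
        for _ in range(3):
            first.step()
        write_checkpoint(shared_dir, "counter", first.state)
        first.step()                 # work done after the checkpoint
        # Failover: the second node reloads the checkpointed state.
        second = Counter(read_checkpoint(shared_dir, "counter"))
        return first.state["count"], second.state["count"]
```

In this sketch the step taken after the checkpoint is not recovered on the second node, which reflects the statement that the state is resilient only up to the most recent checkpoint.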
Other clusters do not include shared storage (often referred to as “shared nothing” architectures). Such clusters typically implement a second mechanism for checkpointing, in which the applications themselves write checkpoints to another node in the cluster. Thus, the application must be aware of the topology of the cluster (e.g. how many nodes are in the cluster, which nodes have the resources needed to execute the application, etc.). Typically, the application statically selects a node to store the checkpoint data, and thus there is a lack of flexibility in storing the checkpoints. The application must also execute on the node that stores the checkpoints. The application may be more susceptible to cascading failover, in which more than one node in the cluster crashes during a short period of time. If both the node executing the application and the node statically selected by the application for checkpoint storage crash, then the application becomes unavailable. Additionally, since the application typically executes at the user level, heavy-weight TCP communication is used to communicate the checkpoint data to the statically selected node.
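The cascading-failover exposure described above can be made concrete with a small sketch. It assumes a simplified model in which the application runs on one node, its checkpoints live on exactly one statically selected node, and a failed-over application can restart only on that checkpoint node. The function name `is_available` and the node names are hypothetical.

```python
def is_available(app_node, checkpoint_node, crashed):
    """Return True if the application can keep running or be restarted.

    app_node:        node currently executing the application
    checkpoint_node: node statically selected to hold the checkpoints
    crashed:         set of nodes that have crashed
    """
    if app_node not in crashed:
        return True  # the application is still executing where it was
    # The application's node is gone; recovery requires the checkpoint node.
    return checkpoint_node not in crashed


# Hypothetical four-node cluster; the application runs on n1 and
# statically checkpoints to n2.
single_failure = is_available("n1", "n2", {"n1"})
cascading_failure = is_available("n1", "n2", {"n1", "n2"})
unrelated_failures = is_available("n1", "n2", {"n3", "n4"})
```

Under these assumptions, the loss of n1 alone is survivable, while the loss of both n1 and n2 leaves the application unavailable even though n3 and n4 are still running.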