As computers have become faster and cheaper, they have become ubiquitous in academic, corporate, and government organizations. At the same time, the widespread use of computers has given rise to enormous management complexity and security hazards, and the total cost of owning and maintaining them is becoming unmanageable. The fact that computers are increasingly networked further complicates the management problem.
One difficult problem relates to managing systems in which applications (e.g., resource-intensive scientific applications) are distributed to run on multiple nodes in a computer cluster. In these systems, when a cluster node goes down for maintenance or because of a fault condition, it is desirable that the distributed applications continue to run in the cluster as long as at least one cluster node remains operational. This calls for an application checkpoint-restart function, which is the ability to save a running application at a given point in time such that it can be restored at a later time in the same state in which it was saved. Application checkpoint-restart can facilitate fault resilience by migrating applications off faulty cluster nodes, and fault recovery by restarting applications from a previously saved state. However, conventional checkpoint-restart mechanisms cannot provide this functionality transparently on clusters running commodity operating systems and hardware while also ensuring the global consistency of the network state of the cluster nodes.