In a parallel computing environment, such as a High Performance Computing (HPC) environment, large parallel applications run on thousands of nodes. In such an environment, it may be necessary to checkpoint or restart one or more of these applications, or at least to migrate a portion of an application from one node to another. Checkpointing is saving the state of a running application to a file such that the complete state can be restored and the application continued at a future time. Restarting is restoring the state from a checkpoint file and resuming execution such that the application continues to run as if it had not been interrupted (though possibly on a different set of compute nodes). Migration includes moving portions of a running application from one or more compute nodes to one or more other compute nodes using, for instance, a checkpoint/restart mechanism, such that the user of the application does not perceive any interruption in its operation, except possibly a slowdown or temporary non-responsiveness.
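The checkpoint and restart definitions above can be illustrated with a minimal single-process sketch. It is not the mechanism of any particular HPC system: the toy application, the state dictionary, and the names `checkpoint`, `restart`, `run`, and `demo` are all assumptions made for illustration. The sketch only shows the property that matters, namely that a run resumed from a checkpoint file ends in the same state as an uninterrupted run.

```python
import os
import pickle
import tempfile


def checkpoint(state, path):
    """Save the complete application state to a file (hypothetical helper).

    The state is written to a temporary file first and then renamed, so a
    crash during the write does not leave a truncated checkpoint behind.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def restart(path):
    """Restore the application state from a checkpoint file."""
    with open(path, "rb") as f:
        return pickle.load(f)


def run(state, steps):
    """Toy application: add the current step number to a running total."""
    for _ in range(steps):
        state["total"] += state["step"]
        state["step"] += 1
    return state


def demo():
    """Compare an uninterrupted run with a checkpointed and restarted run."""
    path = os.path.join(tempfile.mkdtemp(), "app.ckpt")

    # Reference: ten steps with no interruption.
    uninterrupted = run({"step": 0, "total": 0}, 10)

    # Interrupted: four steps, checkpoint, discard the in-memory state,
    # then restore from the file and run the remaining six steps.
    state = run({"step": 0, "total": 0}, 4)
    checkpoint(state, path)
    del state
    resumed = run(restart(path), 6)

    return uninterrupted, resumed


if __name__ == "__main__":
    uninterrupted, resumed = demo()
    print(uninterrupted == resumed)
```

In a real parallel application the saved state would also have to cover every process, open files, and in-flight messages, and the restore could take place on a different set of compute nodes; the sketch leaves all of that out.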
Currently, to perform checkpoint, restart, and/or migration operations, a single master node is used to control all of the compute nodes. This one node manages all of the thousands of connections to those compute nodes.