The disclosure relates generally to methods and apparatus that provide checkpointing operations in multi-node systems.
Fault tolerance is a feature that helps a multi-node system (e.g., a network of servers in a data center) recover quickly from unexpected failures. Short of eliminating all failures, the goal of fault tolerance is to minimize the time required to bring the system back online after a failure event, and to ensure that the failure does not cause any crucial data to be lost. One technique for achieving fault tolerance is known as checkpointing. In this scheme, the state of an application executing on a node in the system is periodically backed up as a series of checkpoints. Thus, if and when the application is interrupted by a fault (e.g., a software crash, a hardware failure, scheduled maintenance, etc.), the state of the application is rolled back to a checkpoint taken just prior to the fault, so that the application can be safely resumed or recovered from that checkpoint without loss of data or continuity. The state data may include, for example, data from registers, databases, processor pipelines, and any other data representing a state of a process operation in a computing node.
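By way of illustration only, the checkpoint and rollback cycle described above may be sketched as follows. The sketch is a minimal example under stated assumptions and is not the disclosed method: the class `CheckpointedCounter`, the function `run_with_recovery`, the fixed checkpoint interval, and the use of an in-memory list as the checkpoint store are all hypothetical, and the fault is simulated by discarding the live state at a chosen step.

```python
import copy


class CheckpointedCounter:
    """Toy application whose state is a step index and a running total."""

    def __init__(self, checkpoint_interval):
        self.state = {"step": 0, "total": 0}
        self.checkpoint_interval = checkpoint_interval
        self.checkpoints = []  # series of saved states, oldest first (assumed in-memory store)

    def save_checkpoint(self):
        # Deep-copy so later changes to the live state cannot alter the checkpoint.
        self.checkpoints.append(copy.deepcopy(self.state))

    def rollback(self):
        # Restore the live state from the most recent checkpoint.
        self.state = copy.deepcopy(self.checkpoints[-1])

    def run_step(self, value):
        self.state["total"] += value
        self.state["step"] += 1
        # Periodically back up the state as a new checkpoint.
        if self.state["step"] % self.checkpoint_interval == 0:
            self.save_checkpoint()


def run_with_recovery(values, checkpoint_interval, fault_at_step):
    """Process all values, surviving one simulated fault at fault_at_step."""
    app = CheckpointedCounter(checkpoint_interval)
    app.save_checkpoint()  # initial checkpoint, so a rollback is always possible
    fault_pending = True
    while True:
        step = app.state["step"]
        if step >= len(values):
            break
        if fault_pending and step == fault_at_step:
            fault_pending = False
            app.state = None  # simulated fault: the live state is lost
            app.rollback()    # recover from the last checkpoint
            continue          # steps after that checkpoint are re-executed
        app.run_step(values[step])
    return app.state, len(app.checkpoints)


final_state, checkpoint_count = run_with_recovery(
    values=[1, 2, 3, 4, 5, 6, 7], checkpoint_interval=3, fault_at_step=5
)
print(final_state, checkpoint_count)
```

In this sketch the fault occurs after five steps, the application rolls back to the checkpoint taken after step three, and the two intervening steps are repeated, so the final total equals that of an uninterrupted run. The checkpoint interval therefore trades the cost of saving state against the amount of work repeated after a fault.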
Conventional approaches to checkpointing in a multi-node system rely on separate or off-node disk-based input and output (I/O) storage units. As such, checkpointing typically requires writing data from nodes in the system to the disk-based I/O storage units. This can incur high latency due to the amount of I/O traffic involved and/or limits on available bandwidth. To reduce latency, burst buffers, which sit between the nodes and the disk-based I/O storage units, have been employed to achieve faster caching of data. However, the use of burst buffers introduces an additional layer of nodes into the system, which increases the complexity of the system and may impact scalability. Therefore, an opportunity exists to develop more efficient checkpointing methods that can improve one or more of system performance, scalability, and throughput.