Various types of special-purpose processors, such as graphics processing units (GPUs) for general-purpose computing and other types of hardware accelerators, have been developed for accelerated processing of specific types of workloads. The processing capabilities of GPU devices and other types of hardware accelerators are currently being utilized in various applications to accelerate the processing of highly parallelized computational workloads in various technical fields. In particular, general-purpose computing on GPU (GPGPU) is utilized for high-throughput, accelerated processing of compute kernels for workloads (e.g., vector-based computations, matrix-based computations, etc.) that exhibit data parallelism. For example, GPUs are used to accelerate data processing in high-performance computing (HPC) and embedded computing systems, for various applications such as financial modeling, scientific research, machine learning (ML), deep learning (DL), data mining, video data transcoding, image analysis, image recognition, virus pattern matching, augmented reality, encryption/decryption, weather forecasting, big data analytics and comparisons, and other applications with computational workloads that have an inherently parallel nature.
A distributed computing environment, which comprises a large pool of shared computing resources spread over a cluster of computing nodes, is typically utilized to support emerging applications such as big data analytics and DL applications. Indeed, DL applications, for example, require the collection, storage, and processing of a very large amount of data, wherein the data includes training data to build and optimize DL models, as well as model parameters of the DL models which are utilized for inference processing. Implementing an efficient distributed computing environment for these types of applications is not trivial, as the intensive computational workloads and the massive volume of data that must be stored, streamed, prefetched, and coordinated between the shared computing resources of the distributed computing platform present a significant challenge and a practical limit on system performance and scalability.
Furthermore, in an HPC domain, long-running, compute-intensive tasks (e.g., a DL training process) dominate the workloads of GPU resources, and such intensive GPU processing tasks can take hours, days, or even weeks to complete (e.g., to train DL models) and deliver results. It is common for a GPU server to experience some error at some point during the execution of a relatively long GPU processing task, or otherwise to have the GPU processing task preempted at some point in the execution in order to execute a higher priority task. Such errors can range from software errors, memory failures, and power failures to natural disasters. Recovering a GPU computing result by re-executing the task from the beginning up to the point of interruption is generally not a good solution, due to the long running time of the GPU processing task and the heavy computing power requirement. Therefore, checkpointing the calculation result by saving a current program state in non-volatile storage is a better solution for making the system robust and fault tolerant, since an interrupted task can then resume from the most recently saved state rather than from the beginning.
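By way of illustration only, the checkpoint-and-resume idea described above can be sketched in a few lines of Python. The function names (`save_checkpoint`, `load_checkpoint`, `run_task`), the pickle-based file format, and the simulated failure are all hypothetical choices made for this sketch; it runs on the CPU and does not represent any particular GPU checkpointing system.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write `state` to `path` atomically so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to non-volatile storage
        os.replace(tmp_path, path)  # atomically replace the previous checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_task(total_steps, path, checkpoint_every=100, fail_at=None):
    """Hypothetical long-running task: accumulates a running sum over the steps.

    `fail_at` simulates an error or preemption at the given step.
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state["acc"]
```

In this sketch, a run that fails at step 550 with a checkpoint interval of 100 steps loses only the 50 steps computed since the checkpoint at step 500; calling `run_task` again with the same path resumes from step 500 instead of step 0.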
Checkpointing in a cloud or distributed environment faces many challenges. Such challenges include, but are not limited to, high synchronization overhead, large data movement over a communications network, significant use of system resources such as system memory and storage bandwidth, etc. For example, checkpoint images of DL models can be 500 MB or greater, and transferring such images from GPU device memory to host memory (e.g., system memory) for checkpoint operations requires memory copy operations that consume a significant amount of bandwidth and networking resources. In addition, in conventional systems, DL training is temporarily suspended during a DL model checkpoint operation to maintain a consistent state of the intermediate DL model. The longer a checkpoint operation takes, the greater the impact on the DL training process. Further, a large checkpoint image of a DL model can consume a large amount of memory and disk space.
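The relationship between checkpoint duration and training impact can be made concrete with a rough estimate. The following Python sketch assumes a simplified model in which training is fully paused while the checkpoint image is first copied from device memory to host memory and then written to storage. The function names and the bandwidth and interval figures in the example are hypothetical placeholders, not measurements; only the 500 MB image size comes from the discussion above.

```python
def checkpoint_stall_seconds(image_bytes, copy_bytes_per_s, write_bytes_per_s):
    """Time training is paused: device-to-host copy followed by a storage write."""
    return image_bytes / copy_bytes_per_s + image_bytes / write_bytes_per_s


def overhead_fraction(stall_s, interval_s):
    """Share of wall-clock time spent paused when a checkpoint is taken
    after every `interval_s` seconds of training."""
    return stall_s / (interval_s + stall_s)


# Example with assumed figures: a 500 MB image, a 10 GB/s device-to-host
# copy, a 500 MB/s storage write, and one checkpoint per 60 s of training.
stall = checkpoint_stall_seconds(500e6, 10e9, 500e6)
fraction = overhead_fraction(stall, 60.0)
print(f"stall per checkpoint: {stall:.2f} s, overhead: {fraction:.1%}")
```

Under this model, the overhead grows with the image size and with the checkpoint frequency, which is why both the size of the checkpoint image and the duration of each checkpoint operation matter.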