The present invention relates generally to general-purpose graphics processing units and, more particularly, to a method and system for dynamically binding and unbinding applications on a general-purpose graphics processing unit.
Many-core devices such as graphics processing units (GPUs) are designed to be used by one application at a time. More often than not, however, a single application under-utilizes the resources available in a GPU. Techniques that allow a GPU to be shared among multiple applications (multiprogramming) can improve the utilization of the GPU. Such sharing requires dynamic preemption of running applications and dynamic remapping of applications to GPUs.
In some situations it is desirable to checkpoint applications running on a GPU and restart them on a different GPU, for example when the GPU, or the host to which the GPU is attached, experiences a failure. This requires that the execution of an application be stopped on one GPU and restarted on another GPU.
The following references are referred to in the further background discussion.
[1] J. Nickolls, I. Buck, M. Garland, and K. Skadron. 2008. Scalable Parallel Programming with CUDA. In Queue 6, 2 (March 2008), 40-53.
[2] H. Takizawa, K. Sato, K. Komatsu, and H. Kobayashi. 2009. CheCUDA: A Checkpoint/Restart Tool for CUDA Applications. In Proceedings of the 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT '09), IEEE Computer Society, Washington, D.C., USA, 408-413.
[3] A. Nukada, H. Takizawa, and S. Matsuoka. 2011. NVCR: A Transparent Checkpoint-Restart Library for NVIDIA CUDA. In 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 16-20 May 2011, 104-113.
In NVIDIA's CUDA runtime [1], the binding of an application to a GPU is made explicitly by the programmer (using the cudaSetDevice function) and is statically defined for the whole lifetime of the application. There is no mechanism to dynamically stop the application and resume its execution on a different GPU. Similarly, there are no advanced mechanisms for scheduling concurrent applications onto the available GPUs.
Two solutions (CheCUDA [2] and NVCR [3]) have been proposed for checkpointing and restarting applications on GPUs. Both suffer from a major limitation: the checkpointed state is such that the application can be restarted only on the GPU that was used for checkpointing. This is because the application code holds GPU pointers, which are part of the saved state. Therefore, these solutions do not fully solve the problem of preempting an application and resuming its execution on a different GPU.
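The pointer limitation described above can be illustrated with a small host-only sketch. It is a toy model, not the mechanism of CheCUDA or NVCR: ToyDevice, Checkpoint, take_checkpoint, restart_on and run_demo are hypothetical names, no CUDA runtime call is made, and the sketch assumes purely for illustration that a restart replays the original allocation and that two GPUs hand out addresses from different ranges.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Toy stand-in for a GPU (hypothetical; no CUDA runtime involved).
// It hands out "device addresses" starting from a device-specific base.
struct ToyDevice {
    std::uintptr_t next_addr;                            // next free address
    std::map<std::uintptr_t, std::vector<int>> memory;   // address -> buffer

    explicit ToyDevice(std::uintptr_t base) : next_addr(base) {}

    // Allocate n ints and return the raw device address of the buffer.
    std::uintptr_t alloc(std::size_t n) {
        std::uintptr_t addr = next_addr;
        next_addr += n * sizeof(int);
        memory[addr] = std::vector<int>(n, 0);
        return addr;
    }

    // Look up a buffer; nullptr if the address is unknown to this device.
    std::vector<int>* deref(std::uintptr_t addr) {
        auto it = memory.find(addr);
        return it == memory.end() ? nullptr : &it->second;
    }
};

// Saved state: the raw pointer the application holds, plus the buffer data.
struct Checkpoint {
    std::uintptr_t raw_ptr;
    std::vector<int> contents;
};

Checkpoint take_checkpoint(ToyDevice& dev, std::uintptr_t ptr) {
    std::vector<int>* buf = dev.deref(ptr);
    assert(buf != nullptr && "pointer must belong to this device");
    return Checkpoint{ptr, *buf};
}

// Restart by replaying the allocation and copying the data back.
// The application still uses cp.raw_ptr, so the restart only works
// if that saved address refers to the restored buffer on this device.
bool restart_on(ToyDevice& dev, const Checkpoint& cp) {
    std::uintptr_t fresh = dev.alloc(cp.contents.size());
    *dev.deref(fresh) = cp.contents;
    std::vector<int>* seen = dev.deref(cp.raw_ptr);
    return seen != nullptr && *seen == cp.contents;
}

struct DemoResult {
    bool same_gpu_ok;
    bool other_gpu_ok;
};

DemoResult run_demo() {
    ToyDevice gpu0(0x1000);
    std::uintptr_t p = gpu0.alloc(4);
    (*gpu0.deref(p))[0] = 42;
    Checkpoint cp = take_checkpoint(gpu0, p);

    ToyDevice gpu0_fresh(0x1000);  // same GPU after a reset (same base, assumed)
    ToyDevice gpu1(0x8000);        // different GPU (different base, assumed)
    return DemoResult{restart_on(gpu0_fresh, cp), restart_on(gpu1, cp)};
}
```

In this model the restart succeeds on the device that took the checkpoint and fails on the other one, because the saved raw address has no meaning there. A scheme that is to restart on a different GPU would therefore have to avoid exposing raw device addresses to the application, or translate them on restart.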
Accordingly, there is a need for dynamic binding and unbinding of GPU applications that overcomes the failings of the prior art. No solution is known that preempts or stops an application executing on a GPU and subsequently restarts it on a different GPU in a manner transparent to the user.