The present invention relates generally to multiprocessing systems and, more particularly, to a memory-aware runtime to support multitenancy in heterogeneous clusters.
Many-core processors are increasingly becoming part of high performance computing (HPC) clusters. Within the last two to three years, graphics processing units (GPUs) have emerged as a means to achieve extreme-scale, cost-effective, and power-efficient high performance computing. The peak single-precision performance of the latest GPU from NVIDIA, the Tesla C2050/C2070 card, is more than 1 Teraflop, resulting in a price-to-performance ratio of $2-4 per Gigaflop. GPUs can offer up to 20 times better performance per watt than multi-core CPUs. Meanwhile, Intel has announced the upcoming release of the Many Integrated Core processor (Intel® MIC), with peak performance of 1.2 Teraflops. Early benchmarking results on molecular dynamics and linear algebra applications were demonstrated at the International Supercomputing Conference in Hamburg, Germany, in June 2011.
The following references are referred to in the further background discussion.
[1] J. Nickolls, I. Buck, M. Garland, and K. Skadron. 2008. Scalable Parallel Programming with CUDA. In Queue 6, 2 (March 2008), 40-53.
[2] V. Gupta, A. Gavrilovska, K. Schwan, H. Kharche, N. Tolia, V. Talwar, and P. Ranganathan. 2009. GViM: GPU-accelerated virtual machines. In Proceedings of HPCVirt '09. ACM, New York, N.Y., USA, 17-24.
[3] L. Shi, H. Chen, and J. Sun. 2009. vCUDA: GPU accelerated high performance computing in virtual machines. In Proceedings of IPDPS '09, Washington, D.C., USA, 1-11.
[4] J. Duato, A. J. Peña, F. Silla, R. Mayo, and E. S. Quintana-Ortí. 2010. rCUDA: Reducing the number of GPU-based accelerators in high performance clusters. In Proceedings of HPCS '10, June-July 2010, 224-231.
[5] G. Giunta, R. Montella, G. Agrillo, and G. Coviello. 2010. A GPGPU transparent virtualization component for high performance computing clouds. In Proceedings of Euro-Par 2010, Heidelberg, 2010.
[6] V. Ravi, M. Becchi, G. Agrawal, and S. Chakradhar. 2011. Supporting GPU sharing in cloud environments with a transparent runtime consolidation framework. In Proceedings of HPDC '11. ACM, New York, N.Y., USA, 217-228.
NVIDIA's CUDA runtime [1] provides only very basic mechanisms for applications to time-share a GPU. In particular, by associating CUDA contexts with applications and serving CUDA calls from different applications in the order in which they arrive, the CUDA runtime allows concurrent applications to time-share a GPU. However, since the CUDA runtime pre-allocates a certain amount of GPU memory to each CUDA context and does not offer memory swapping capabilities between CPU and GPU, the described time-sharing mechanism works only: (i) in the absence of conflicting memory requirements among concurrent applications, and (ii) for a restricted number of concurrent applications. Further, the CUDA runtime forces applications to procure GPU devices explicitly (that is, there is no transparency in the application-to-GPU mapping and scheduling process).
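The memory limitation described above can be illustrated with a toy model. The following is a minimal sketch and does not reproduce the CUDA runtime's actual allocation logic: the names ToyGpu, create_context and admit_in_arrival_order are hypothetical, and the capacity and request sizes are arbitrary illustrative values. It assumes only what the paragraph states, namely that each context reserves device memory up front, that requests are served in arrival order, and that nothing can be swapped out to host memory.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGpu:
    """Hypothetical model of a GPU with fixed memory and no swapping to the host."""
    capacity_mb: int
    reserved_mb: dict = field(default_factory=dict)  # application name -> reserved MB

    def free_mb(self) -> int:
        # Memory not yet reserved by any context.
        return self.capacity_mb - sum(self.reserved_mb.values())

    def create_context(self, app: str, needed_mb: int) -> bool:
        """Reserve memory for an application's context; fail if it does not fit."""
        if needed_mb > self.free_mb():
            return False  # no swapping available, so the request is rejected
        self.reserved_mb[app] = needed_mb
        return True


def admit_in_arrival_order(gpu, requests):
    """Serve context requests first-come, first-served; return (admitted, rejected)."""
    admitted, rejected = [], []
    for app, needed_mb in requests:
        if gpu.create_context(app, needed_mb):
            admitted.append(app)
        else:
            rejected.append(app)
    return admitted, rejected


# Illustrative values only: a 3072 MB device and three concurrent applications.
gpu = ToyGpu(capacity_mb=3072)
requests = [("app_a", 2048), ("app_b", 1536), ("app_c", 512)]
admitted, rejected = admit_in_arrival_order(gpu, requests)
print(admitted, rejected)
```

In this sketch the second application is turned away even though the device is idle in terms of computation, because the combined memory reservations exceed the device capacity. This is the kind of conflicting memory requirement under which plain time-sharing stops working.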
GViM [2], vCUDA [3], rCUDA [4] and gVirtuS [5] are runtime systems that use the split-driver model to make GPUs visible from within virtual machines. In addition, all of these proposals except gVirtuS abstract the underlying GPUs from the end users (thus avoiding explicit procurement of GPU resources by applications). However, none of these proposals offers GPU sharing or dynamic binding/unbinding of applications to/from GPUs.
The proposal in [6] explores kernel consolidation across applications as a means to time-share and space-share GPUs. However, that work assumes that concurrent applications fit within the memory capacity of the GPU. Further, it does not allow dynamic binding/unbinding of applications to/from GPUs.
Accordingly, there is a need to provide sharing in the presence of conflicting memory requirements, and dynamic binding/unbinding of applications to/from many-core devices.