Field of the Invention
The present invention generally relates to multi-threaded computer architectures and, more specifically, to efficient memory virtualization in multi-threaded processing units.
Description of the Related Art
In conventional computing systems having both a central processing unit (CPU) and a graphics processing unit (GPU), the CPU performs a portion of application computations, allocates resources, and manages overall application execution, while the GPU performs high-throughput computations determined by the CPU. In certain application spaces, such as high performance computing (HPC) applications, the GPU typically performs a majority of computations associated with a given application. As a consequence, overall application performance is directly related to GPU utilization. In such applications, high application performance is achieved with high GPU utilization, a condition characterized by a relatively large portion of GPU processing units concurrently executing useful work. The work is organized into thread programs, which execute in parallel on processing units.
A typical thread program executes as highly parallel, highly similar operations across a parallel dataset, such as an image or set of images, residing within a single virtual address space. If an application needs to execute multiple, different thread programs, then the GPU conventionally executes one of the different thread programs at a time, each within a corresponding virtual address space, until the different thread programs have all completed their assigned work. Each thread program is loaded into a corresponding context for execution within the GPU. The context includes virtual address space state that is loaded into page tables residing within the GPU. Because each different thread program conventionally requires a private virtual address space, only one thread program may execute on the GPU at any one time.
HPC applications are typically executed on an HPC cluster, which conventionally includes a set of nodes, each comprising a CPU and a GPU. Each node is typically assigned a set of tasks that may communicate with other tasks executing on other nodes via a message passing interface (MPI) task. A typical GPU computation task executes efficiently with high GPU utilization as a set of parallel thread program instances within a common virtual memory space. However, given conventional GPU execution models, only one MPI task may execute on a given GPU at a time. Each MPI task may comprise a range of workloads for the GPU, giving rise to a corresponding range of GPU utilization. In one scenario, only one thread or a small number of threads is executed on the GPU as an MPI task, resulting in poor GPU utilization and poor overall application performance. As a consequence, certain HPC applications perform inefficiently on GPU-based HPC processing clusters. In general, applications that require the GPU to sequentially execute tasks, each comprising a small number of thread instances and each requiring an independent virtual address space, will perform poorly.
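The utilization penalty described above can be illustrated with a small arithmetic model. The following is a minimal sketch only, not part of any GPU or MPI interface: the `Task` class, the `sequential_utilization` function, the count of sixteen processing units, and the thread counts are all hypothetical values chosen for illustration. The model assumes that each thread instance occupies one processing unit and that tasks execute strictly one at a time, as in the conventional execution model.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """One GPU task with a private virtual address space (hypothetical model)."""
    name: str
    threads: int      # thread instances the task runs in parallel
    duration: float   # time the task occupies the GPU, in arbitrary units


def sequential_utilization(tasks, processing_units):
    """Time-weighted fraction of processing units doing useful work
    when the tasks execute one at a time (assumed conventional model)."""
    total_time = sum(t.duration for t in tasks)
    if total_time == 0:
        return 0.0
    # A task cannot occupy more processing units than the GPU provides.
    busy = sum(min(t.threads, processing_units) * t.duration for t in tasks)
    return busy / (processing_units * total_time)


# Illustrative values only: two small tasks on a GPU with 16 processing units.
tasks = [Task("task_a", threads=4, duration=1.0),
         Task("task_b", threads=8, duration=1.0)]
print(sequential_utilization(tasks, processing_units=16))  # prints 0.375
```

In this illustrative case, neither task alone fills the sixteen processing units, so executing the tasks sequentially leaves most of the units idle for the full duration of both tasks.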
As the foregoing illustrates, what is needed in the art is a technique that enables concurrent GPU execution of tasks having different virtual address spaces.