1. Field of the Invention
The present invention generally relates to computer processing, and, more specifically, to enabling local generation of work within a graphics processing unit (GPU).
2. Description of the Related Art
Graphics processing units (GPUs) are designed to process a variety of intensive tasks within a computing system, such as graphics processing work and compute application work. In a typical configuration, a central processing unit (CPU) generates GPU-based work and loads the GPU-based work into a global memory that is accessible to both the CPU and the GPU. The CPU then accesses a work queue of the GPU—often referred to as a “channel”—through which the CPU is able to cause the GPU to process the GPU-based work stored in the global memory.
In one configuration, the processing activity of the GPU is controlled by manipulating two separate pointers, referred to herein as the GP_GET pointer and the GP_PUT pointer, each of which refers to an entry in the work queue. The GP_GET pointer points to a particular entry in the work queue and indicates to the CPU how far along the GPU is in executing the work stored in the work queue. The GP_PUT pointer, in contrast, points to the entry in the work queue immediately after the last entry written by the CPU. When the GPU completes execution of the GPU-based work pointed to by a given work queue entry, the GPU increments GP_GET. Notably, because the work queue is circular, GP_GET is reset to zero when it reaches the entry count of the work queue. If, after being incremented, GP_GET is equal to GP_PUT, then no entries in the work queue remain to be processed. Otherwise, the GPU executes the work pointed to by GP_GET. Also, if GP_GET is equal to “(GP_PUT+1) modulo ‘number of entries in the work queue’”, then the work queue is considered full. As long as the work queue is not full, the CPU can increment the GP_PUT pointer to submit new entries written in the work queue for GPU processing. The GPU monitors changes to GP_PUT made by the CPU, so that the CPU-submitted work queue entries are processed in a timely manner.
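The pointer arithmetic described above can be illustrated with a minimal software sketch. The sketch is a toy model only, not the GPU's hardware interface: the class name WorkQueue, its method names, and the executed list are hypothetical and are introduced purely for illustration, and "executing" an entry is modeled as recording it.

```python
class WorkQueue:
    """Toy model of a circular work queue driven by GP_GET and GP_PUT.

    All names here are illustrative assumptions, not a real GPU interface.
    """

    def __init__(self, num_entries):
        self.entries = [None] * num_entries
        self.gp_get = 0     # entry the consumer (GPU) will execute next
        self.gp_put = 0     # entry right after the last one the producer (CPU) wrote
        self.executed = []  # stand-in for work the GPU has completed

    def is_empty(self):
        # GP_GET == GP_PUT: no entries remain to be processed.
        return self.gp_get == self.gp_put

    def is_full(self):
        # GP_GET == (GP_PUT + 1) modulo the number of entries: queue is full.
        return self.gp_get == (self.gp_put + 1) % len(self.entries)

    def submit(self, work):
        """Producer side: write an entry, then advance GP_PUT."""
        if self.is_full():
            return False
        self.entries[self.gp_put] = work
        self.gp_put = (self.gp_put + 1) % len(self.entries)
        return True

    def process_one(self):
        """Consumer side: execute the entry at GP_GET, then advance GP_GET."""
        if self.is_empty():
            return False
        self.executed.append(self.entries[self.gp_get])
        # Wrap to zero when GP_GET reaches the entry count.
        self.gp_get = (self.gp_get + 1) % len(self.entries)
        return True
```

Under these conditions, a queue of N entries holds at most N-1 pending entries, because one entry is always left unused so that the full condition can be distinguished from the empty condition (GP_GET equal to GP_PUT).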
In many cases, it is desirable to enable the GPU to generate additional (i.e., nested) work that can be loaded into the work queue by the GPU and processed by the GPU. Unfortunately, specific hardware limitations in popular CPU-to-GPU communication channels, such as Peripheral Component Interconnect Express (PCI-E), prevent the GPU from modifying the GP_PUT pointer, which, as described above, must be incremented after new work is inserted into the work queue. As a result, the GPU must rely on the CPU to generate and execute nested work, which is inefficient in comparison to locally generating and executing nested work within the GPU.
Accordingly, what is needed in the art is a technique for enabling a GPU to locally generate work in the presence of CPU-to-GPU communication channel hardware limitations.