1. Field of the Invention
The present invention generally relates to parallel computation systems and, more specifically, to a technique for computational nested parallelism.
2. Description of the Related Art
In conventional computing systems having both a central processing unit (CPU) and a graphics processing unit (GPU), the CPU determines which specific computational tasks are performed by the GPU and in what order. A GPU computational task typically comprises highly parallel, highly similar operations across a parallel dataset, such as an image or set of images. In a conventional GPU execution model, the CPU initiates a particular computational task by selecting a corresponding thread program and instructing the GPU to execute a set of parallel instances of the thread program. In the conventional GPU execution model, only the CPU may initiate execution of a thread program on the GPU. After all thread instances complete execution, the GPU must notify the CPU and wait for another computational task to be issued by the CPU. Notifying the CPU and waiting for the next computational task is typically a blocking, serialized operation that leaves certain resources within the GPU temporarily idle, thereby reducing overall system performance.
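The CPU-driven launch sequence described above can be illustrated with a small sketch. It is only a model under stated assumptions: `ToyGPU`, `brighten`, `invert`, and `cpu_driver` are hypothetical names introduced here, and Python worker threads stand in for GPU thread instances; no real GPU driver API is used.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyGPU:
    """Hypothetical stand-in for a GPU (not a real driver or CUDA API)."""

    def __init__(self, workers=4):
        self.workers = workers
        self.launches = 0  # number of tasks the CPU has had to initiate

    def launch(self, thread_program, dataset):
        """Run one instance of thread_program per element, then return.

        Returning models the GPU notifying the CPU that all instances are done.
        """
        self.launches += 1
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            # list() blocks until every thread instance has completed.
            return list(pool.map(thread_program, dataset))


def brighten(pixel):
    return min(pixel + 10, 255)


def invert(pixel):
    return 255 - pixel


def cpu_driver(gpu, image):
    """The CPU selects each thread program and fixes the order of execution."""
    brightened = gpu.launch(brighten, image)  # CPU blocks until the GPU finishes
    return gpu.launch(invert, brightened)     # only now can the next task be issued
```

In this model, every call to `launch` originates from `cpu_driver`, and the modeled GPU does nothing between the return of one call and the start of the next, which corresponds to the idle period described above.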
Performance may be improved in certain scenarios by queuing up sequential computational tasks in a pushbuffer, from which the GPU may pull and perform work without waiting for the CPU. Computational tasks comprising fixed data-flow processing pipelines benefit from this pushbuffer model when the CPU is able to generate work for the GPU quickly enough to have work pending within the pushbuffer whenever the GPU is able to start a new task. However, data-dependent computational tasks are still left with a sequential dependence among GPU results, CPU task management, and subsequent GPU task execution, which must be launched by the CPU. Such data-dependent computational tasks inherently involve conditional execution and therefore require CPU involvement in flow control decisions, because only the CPU may initiate execution of conditionally determined tasks. For example, algorithms that involve complex conditional execution of parallel library functions cannot be performed entirely by the GPU. For such algorithms, the CPU must be involved at every flow control decision point where a parallel library function may conditionally execute. Thus, the conventional GPU execution model is of limited help in implementing data-dependent algorithms: determining which computational tasks need to run next depends on the results of previous computational tasks, and those results must be transmitted back to the CPU before subsequent tasks can be determined and issued to the GPU for execution.
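The contrast between a fixed pipeline and a data-dependent task can be sketched as follows. This is an illustrative model only: `launch`, `run_fixed_pipeline`, `run_data_dependent`, and the halving workload are hypothetical and chosen for brevity, and a `deque` stands in for the pushbuffer.

```python
from collections import deque


def launch(thread_program, dataset):
    """Stand-in for the GPU running one thread instance per element."""
    return [thread_program(x) for x in dataset]


def run_fixed_pipeline(stages, data):
    """Fixed data flow: the CPU queues every stage up front."""
    pushbuffer = deque(stages)
    cpu_round_trips = 0  # no results need to return to the CPU between stages
    while pushbuffer:
        # The GPU pulls the next task without waiting for the CPU.
        data = launch(pushbuffer.popleft(), data)
    return data, cpu_round_trips


def halve(x):
    return x // 2


def run_data_dependent(data, threshold):
    """Data-dependent flow: the next task depends on the previous results."""
    cpu_round_trips = 0
    while True:
        data = launch(halve, data)
        cpu_round_trips += 1        # results are transmitted back to the CPU
        if max(data) <= threshold:  # the CPU makes the flow control decision
            return data, cpu_round_trips


print(run_fixed_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]))
# prints ([4, 6, 8], 0)
print(run_data_dependent([64, 10, 3], threshold=4))
# prints ([4, 0, 0], 4)
```

In the second function the number of launches is not known in advance, so the work cannot be queued ahead of time, and each iteration in this model costs one round trip through the CPU.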
Conditional execution is inherent to a significant portion of known algorithms. Such algorithms do not fully benefit from the potential efficiencies of GPU processing because of fundamental and long-standing limitations on conditional execution in conventional GPU execution models.
Accordingly, what is needed in the art is a technique for enhanced GPU computational generality and performance.