1. Field of the Invention
The present invention generally relates to computer architectures and, more specifically, to a method and system for managing nested execution streams.
2. Description of the Related Art
In conventional computing systems having both a central processing unit (CPU) and a graphics processing unit (GPU), the CPU determines which specific computational tasks the GPU performs and in what order. A GPU computational task typically comprises highly parallel, highly similar operations across a parallel dataset, such as an image or set of images. In a conventional GPU execution model, the CPU initiates a particular computational task by selecting a corresponding thread program and instructing the GPU to execute a set of parallel instances of that thread program. In this conventional model, the CPU is typically the only entity that can initiate execution of a thread program on the GPU. After all thread instances complete execution, the GPU must notify the CPU and wait for the CPU to issue another computational task. Notifying the CPU and waiting for the next computational task is typically a blocking, serialized operation that leaves certain resources within the GPU temporarily idle, thereby reducing overall system performance.
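The serialized launch-and-wait pattern described above can be sketched abstractly as follows. The function names, task values, and idle counter are hypothetical illustrations only, not part of any actual GPU driver interface:

```python
# Hypothetical sketch of the conventional CPU-driven execution model:
# the CPU launches one task at a time and blocks until the GPU reports
# completion, so the GPU idles between consecutive tasks.

def conventional_execution(tasks, gpu_execute):
    """Run tasks strictly one at a time; return (results, idle_transitions)."""
    results = []
    idle_transitions = 0  # each gap where the GPU waits on the CPU
    for task in tasks:
        # CPU selects a thread program and instructs the GPU to run
        # a set of parallel instances of it.
        results.append(gpu_execute(task))
        # GPU notifies the CPU and idles until the next task is issued.
        idle_transitions += 1
    return results, idle_transitions

# Usage: squaring each value stands in for executing a parallel kernel.
results, idle = conventional_execution([1, 2, 3], lambda t: t * t)
```

Every task incurs one notify-and-wait transition, which is the serialization overhead the pushbuffer model discussed below is intended to reduce.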
Performance may be improved in certain scenarios by queuing sequential computational tasks in a pushbuffer, from which the GPU may pull work for execution without waiting for the CPU. Computational tasks that include fixed data-flow processing pipelines benefit from this pushbuffer model when the CPU is able to generate work for the GPU quickly enough to have work pending within the pushbuffer whenever the GPU is able to start a new task. However, data-dependent computational tasks are still left with a sequential dependence between GPU results, CPU task management, and subsequent GPU task execution, which must be launched by the CPU. One solution to this problem is to provide a mechanism for GPU thread programs to queue additional computational tasks, and wait for their completion, without requiring intervention from the CPU. However, such an approach has several drawbacks. First, CPUs conventionally have a means to dynamically allocate memory, but GPUs do not. When the GPU launches a new computational task, that task must be allocated memory to store the context and parameter information accessed during its execution. In such cases, the GPU engages the CPU to allocate memory for the new computational task and must wait for the CPU to complete the allocation before queuing the task, thereby reducing performance.
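As a rough illustration of the pushbuffer model described above, the CPU can enqueue work ahead of time and the GPU can dequeue it without a round trip to the CPU, so long as the buffer is non-empty. The class and method names below are hypothetical, not an actual driver interface:

```python
from collections import deque

# Hypothetical pushbuffer sketch: the CPU pushes task descriptors and the
# GPU pulls them without waiting on the CPU, provided work is pending.
class Pushbuffer:
    def __init__(self):
        self._queue = deque()

    def cpu_push(self, task):
        # CPU queues a task descriptor ahead of time.
        self._queue.append(task)

    def gpu_pull(self):
        # Returns the next task, or None when the GPU must stall and
        # wait on the CPU -- exactly the case the model tries to avoid.
        return self._queue.popleft() if self._queue else None

pb = Pushbuffer()
for t in ("taskA", "taskB"):
    pb.cpu_push(t)
assert pb.gpu_pull() == "taskA"   # no CPU round trip needed
assert pb.gpu_pull() == "taskB"
assert pb.gpu_pull() is None      # buffer drained: GPU would stall here
```

The sketch also shows the limitation noted above: once the buffer drains, a data-dependent task still waits on the CPU to produce and queue the next item of work.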
Second, where both the CPU and GPU are able to launch new computational tasks into the pushbuffer, deadlock conditions may occur. The CPU may occupy all communication channels to the GPU in order to queue new computational tasks. The GPU may then queue a new computational task that must access the CPU in order to complete. In such cases, the CPU waits for a GPU task to complete before releasing any of the communication channels, while the GPU task cannot complete until it is granted access to the CPU via one of the occupied channels, resulting in deadlock.
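The circular wait described above can be modeled abstractly. The entities, channel counts, and predicate below are hypothetical, chosen only to make the deadlock condition explicit:

```python
# Hypothetical model of the circular wait: the CPU holds communication
# channels while blocked on a GPU task, and that same GPU task needs a
# free channel to reach the CPU and complete.
def detect_deadlock(num_channels, channels_held_by_cpu,
                    cpu_waiting_on_gpu_task, gpu_task_needs_channel):
    free_channels = num_channels - channels_held_by_cpu
    # Deadlock: the GPU task needs a channel, none are free, and the CPU
    # will not release one until that same GPU task completes.
    return (gpu_task_needs_channel
            and free_channels == 0
            and cpu_waiting_on_gpu_task)

assert detect_deadlock(4, 4, True, True) is True    # circular wait
assert detect_deadlock(4, 3, True, True) is False   # one free channel breaks the cycle
```

Keeping at least one channel free, or never blocking the CPU while it holds all channels, breaks the cycle; the sketch simply encodes the three conditions that must hold simultaneously for the deadlock to occur.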
Finally, queuing new computational tasks and pulling tasks from the pushbuffer for execution typically requires locking operations to ensure that tasks are executed sequentially and that the information in the pushbuffer is properly preserved and managed. Although a GPU is able to perform such locking operations, locking is inherently slow. Consequently, if the GPU employed locking operations while queuing new tasks, system performance would be negatively impacted.
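A lock-based task queue of the kind described above might look like the following sketch (the structure is hypothetical, not any particular GPU's implementation). Every enqueue and dequeue serializes on a single lock, which is the overhead at issue:

```python
import threading
from collections import deque

# Hypothetical lock-protected task queue: a single lock preserves the
# queue's integrity and sequential ordering, but every operation must
# serialize on that lock, so producers and consumers cannot proceed
# concurrently -- the performance cost identified above.
class LockedTaskQueue:
    def __init__(self):
        self._lock = threading.Lock()
        self._tasks = deque()

    def enqueue(self, task):
        with self._lock:      # serializes all producers
            self._tasks.append(task)

    def dequeue(self):
        with self._lock:      # serializes all consumers
            return self._tasks.popleft() if self._tasks else None

q = LockedTaskQueue()
q.enqueue("task0")
assert q.dequeue() == "task0"
assert q.dequeue() is None
```

On a massively parallel processor, thousands of threads contending for one such lock would queue work essentially one thread at a time, which is why a technique that avoids locking during task queuing is desirable.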
As the foregoing illustrates, what is needed in the art is a technique that allows GPUs to more efficiently queue work for execution.