Field of the Invention
The present invention generally relates to multi-threaded computer architectures and, more specifically, to a method and system for processing nested stream events.
Description of the Related Art
In conventional computing systems having both a central processing unit (CPU) and a graphics processing unit (GPU), the CPU determines which specific computational tasks are performed by the GPU and in what order. A GPU computational task typically comprises highly parallel, highly similar operations across a parallel dataset, such as an image or set of images. In a conventional GPU execution model, the CPU initiates a particular computational task by selecting a corresponding thread program and instructing the GPU to execute a set of parallel instances of the thread program. In the conventional GPU execution model, only the CPU may initiate execution of a thread program on the GPU. After all thread instances complete execution, the GPU must notify the CPU and wait for another computational task to be issued by the CPU. Notifying the CPU and waiting for the next computational task is typically a blocking, serialized operation that leaves certain resources within the GPU temporarily idle, thereby reducing overall system performance.
Performance may be improved in certain scenarios by queuing sequential computational tasks in a pushbuffer, from which the GPU may pull work for execution without waiting for the CPU. Computational tasks that include fixed data-flow processing pipelines benefit from this pushbuffer model when the CPU is able to generate work for the GPU quickly enough to have work pending within the pushbuffer whenever the GPU is able to start a new task. However, data-dependent computational tasks are still left with a sequential dependence between GPU results, CPU task management, and subsequent GPU task execution, which must be launched by the CPU.
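By way of illustration only, the difference between a fixed data-flow pipeline and a data-dependent task sequence may be sketched with a toy host-side model. The sketch below is an assumption-laden simulation and not actual GPU driver code: the pushbuffer is modeled as a simple FIFO queue, and the function names (run_fixed_pipeline, run_data_dependent, choose_next) are hypothetical names introduced solely for this example.

```python
from collections import deque


def run_fixed_pipeline(stages, data):
    """CPU queues every stage up front; the modeled GPU drains the queue unaided."""
    pushbuffer = deque(stages)       # all work is known before execution starts
    cpu_round_trips = 1              # a single submission fills the pushbuffer
    while pushbuffer:
        task = pushbuffer.popleft()  # GPU pulls the next task as soon as it is free
        data = task(data)
    return data, cpu_round_trips


def run_data_dependent(first_task, choose_next, data):
    """CPU must inspect each GPU result before it can queue the next task."""
    pushbuffer = deque([first_task])
    cpu_round_trips = 1
    while pushbuffer:
        task = pushbuffer.popleft()
        data = task(data)
        next_task = choose_next(data)  # CPU-side decision based on the GPU result
        if next_task is not None:
            pushbuffer.append(next_task)
            cpu_round_trips += 1       # modeled GPU idles until this submission arrives
    return data, cpu_round_trips


def double(x):
    return x * 2


def halve(x):
    return x // 2


# Fixed pipeline: three stages queued in one submission.
fixed_result, fixed_trips = run_fixed_pipeline([double, double, double], 1)

# Data-dependent loop: keep halving while the result is still above 1.
dep_result, dep_trips = run_data_dependent(
    halve, lambda x: halve if x > 1 else None, 8
)
```

In this model the fixed pipeline requires one CPU submission regardless of its length, whereas the data-dependent sequence requires one CPU submission per task, which corresponds to the sequential dependence described above.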
Multi-threaded computation models conventionally organize work into ordered streams of tasks that must complete in a defined order. In such computation models, execution semantics dictate that a given task must complete before a dependent task may execute. In a simple scenario, an arbitrary sequence of serially dependent tasks may be queued within a pushbuffer for efficient execution by the GPU. However, certain computation models allow for cross-stream dependencies, whereby a task in one stream depends on two or more different tasks completing, potentially across two or more different streams. In such scenarios, the CPU must schedule tasks so as to avoid deadlock. Waiting for certain tasks to complete before scheduling other tasks creates additional serial dependencies between CPU and GPU task execution, reducing overall efficiency.
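By way of illustration only, host-side scheduling under dependencies that span streams may be sketched as follows. The sketch is a simplified model under stated assumptions, not an actual driver implementation: each stream is an ordered list of task names, a task is treated as complete as soon as it is launched (modeling a host that waits for completion before proceeding), and the names host_schedule, streams, and deps are hypothetical names introduced solely for this example.

```python
def host_schedule(streams, deps):
    """Return a launch order honoring in-stream order and dependencies across streams.

    streams: dict mapping stream name -> ordered list of task names
    deps:    dict mapping task name -> set of task names that must complete first
    """
    next_index = {name: 0 for name in streams}  # head position of each stream
    completed = set()
    order = []
    total = sum(len(tasks) for tasks in streams.values())
    while len(order) < total:
        progressed = False
        for name in sorted(streams):
            i = next_index[name]
            if i == len(streams[name]):
                continue                        # this stream is fully drained
            task = streams[name][i]             # only the head of a stream may launch
            if deps.get(task, set()) <= completed:
                order.append(task)
                completed.add(task)             # assumption: host waits for completion
                next_index[name] = i + 1
                progressed = True
        if not progressed:
            # Every remaining stream head waits on a task that can never run.
            raise RuntimeError("deadlock: no stream head has its dependencies met")
    return order


# Task c0 in stream s2 depends on tasks in two other streams.
streams = {"s0": ["a0", "a1"], "s1": ["b0", "b1"], "s2": ["c0"]}
deps = {"c0": {"a1", "b0"}}
order = host_schedule(streams, deps)
```

In this model, stream s2 cannot make progress until the host has observed completion of both a1 and b0, so the host is placed on the critical path of GPU execution. A circular pair of dependencies, such as two tasks in different streams each waiting on the other, causes the sketch to raise an error rather than to block indefinitely.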
As the foregoing illustrates, what is needed in the art is a technique to enable more efficient and semantically complete GPU execution.