1. Field of the Invention
Embodiments of the present invention relate generally to parallel processing and more specifically to a method for synchronizing independent cooperative thread arrays running on a graphics processing unit.
2. Description of the Related Art
A typical computer system includes, without limitation, a central processing unit (CPU), a graphics processing unit (GPU), a display device, and one or more input devices. A software application may execute on the CPU, or the software application may be distributed between the CPU and the GPU. A user may interact with the software application executing within the computer system by operating at least one input device and observing the results on the display device. The CPU usually executes the overall structure of the software application and configures the GPU to perform specific tasks. In current technology, the CPU tends to offer more general functionality using a relatively small number of large execution threads, while the GPU is capable of very high performance using a relatively large number of small, parallel execution threads on dedicated hardware processing units. The execution model for threads within the GPU may include blocks of related threads, called cooperative thread arrays (CTAs), generally executing under a single-instruction, multiple-data (SIMD) regime. The GPU thread execution model may also allow for multiple, independently executing CTAs, providing a very high computational throughput.
In conventional computer systems, the CPU assigns specific computational work to CTAs within the GPU. When each CTA completes its assigned work, the GPU generates an interrupt to the CPU. Highly parallel algorithms may advantageously assign work to many simultaneously executing CTAs within the GPU. As the CTAs complete their assigned work, the CTAs may need to be synchronized. That is, certain CTAs may need to wait for other CTAs to finish before starting on subsequent computations. Some complex functions may require multiple CTA synchronization checkpoints prior to completion, each requiring an interrupt to the CPU. However, each interrupt incurs service overhead that may be detrimental to system performance, and a large number of CTAs generating repeated interrupts may cripple the performance of certain complex algorithms.
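The growth in interrupt traffic described above may be illustrated with a minimal counting model. The sketch below is not an implementation of any GPU or driver interface. The names HostCpu and run_with_host_synchronization are hypothetical, and the model assumes that every CTA raises exactly one interrupt at every synchronization checkpoint.

```python
from dataclasses import dataclass


@dataclass
class HostCpu:
    """Hypothetical stand-in for the CPU that services GPU interrupts."""

    interrupts_serviced: int = 0

    def on_interrupt(self) -> None:
        # Each call models one interrupt serviced by the host operating system.
        self.interrupts_serviced += 1


def run_with_host_synchronization(num_ctas: int, num_checkpoints: int) -> int:
    """Count interrupts when the CPU mediates every CTA synchronization.

    Assumption: each CTA raises one interrupt per checkpoint, and the next
    phase of work starts only after all CTAs have reported completion.
    """
    host = HostCpu()
    for _checkpoint in range(num_checkpoints):
        pending = set(range(num_ctas))  # CTAs still working in this phase
        for cta_id in range(num_ctas):
            host.on_interrupt()         # CTA finished; GPU interrupts the CPU
            pending.discard(cta_id)
        # The CPU releases the next phase only once no CTA is pending.
        assert not pending
    return host.interrupts_serviced


if __name__ == "__main__":
    # 64 CTAs and 10 checkpoints yield 64 * 10 interrupts under this model.
    print(run_with_host_synchronization(num_ctas=64, num_checkpoints=10))
```

Under these assumptions the interrupt count is the product of the number of CTAs and the number of checkpoints, so doubling either quantity doubles the load placed on the host interrupt path.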
One approach to reducing the impact of interrupt service time is to improve the interrupt performance of the host operating system. However, interrupt performance is determined by the specific operating system design, which offers little opportunity for modification or improvement beyond existing design requirements. Furthermore, CTA synchronization may generate a much larger volume of interrupt traffic than a standard operating system is customarily designed to handle. As GPU performance increases with successive product generations, the interrupt performance of a given operating system is therefore likely to constrain the maximum possible throughput of advanced CTA-based algorithms.
As the foregoing illustrates, what is needed in the art is a technique for performing efficient cooperative thread array synchronization.