Field of the Invention
Embodiments of the present invention relate generally to parallel processing systems and, more specifically, to techniques for comprehensively synchronizing execution threads.
Description of the Related Art
Graphics processing units (GPUs) are capable of very high performance using a relatively large number of small, parallel execution threads on dedicated programmable hardware processing units. In GPUs, a “thread group” or “warp” refers to a group of threads that, in general, concurrently execute the same instructions on different input data. However, developers may write code that, when executing on the GPU, causes only a portion of the threads in the warp to execute an instruction. The threads in the warp are referred to as “diverged” during the execution of this type of instruction. An example of code that causes such a divergence is code written in the C programming language that includes an “if” statement that results in two or more sequences of instructions, where a different set of threads in a warp follows each of the sequences.
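By way of illustration only, the divergence described above can be modeled on a host processor in the C programming language. The sketch below assumes a warp of 32 threads and represents the threads that follow the "if" sequence as a bitmask. The names WARP_SIZE, taken_mask, and is_diverged are hypothetical and do not correspond to any GPU instruction or vendor interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WARP_SIZE 32 /* assumed warp width, for illustration only */

/* Hypothetical helper: returns a bitmask in which bit i is set when
 * thread (lane) i would follow the "if" sequence, modeled here as the
 * condition data[i] < 0. */
uint32_t taken_mask(const int data[WARP_SIZE])
{
    assert(data != NULL);
    uint32_t mask = 0;
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        if (data[lane] < 0) {
            mask |= (uint32_t)1 << lane;
        }
    }
    return mask;
}

/* The modeled warp is diverged at the branch when some, but not all,
 * of its threads follow the "if" sequence. */
int is_diverged(uint32_t mask)
{
    return mask != 0 && mask != UINT32_MAX;
}
```

In this model, input data in which only the even-numbered threads hold negative values yields a mask in which every other bit is set, and the warp is reported as diverged. Input data that is uniformly negative or uniformly non-negative yields a converged warp.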
One limitation of GPUs is that the proper behavior of some instructions presupposes that the threads in each warp are converged. For example, a GPU may implement a shuffle instruction that allows direct register-to-register data exchange between threads in a warp. If a GPU attempts to execute a shuffle instruction on a warp when the threads are diverged, then the results are unpredictable. For instance, the code that is executing on the GPU may produce incorrect results or terminate unexpectedly.
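The dependence of a shuffle instruction on convergence can likewise be sketched with a simplified host-side model written in C. The model below is an assumption made for illustration and does not describe the behavior of any particular GPU: each active thread reads the register of a source thread, and a read is flagged as invalid whenever the source thread is not active. On actual hardware the result of such a read is unpredictable. The names shuffle_model, regs, src, and active are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define WARP_SIZE 32 /* assumed warp width, for illustration only */

/* Hypothetical model of a warp-wide shuffle. Each active lane reads
 * regs[src[lane]]. Returns a bitmask of lanes whose read is well
 * defined in this model, i.e. lanes that are active and whose source
 * lane is also active. out[] is written only for those lanes. */
uint32_t shuffle_model(const int regs[WARP_SIZE],
                       const int src[WARP_SIZE],
                       uint32_t active,
                       int out[WARP_SIZE])
{
    uint32_t valid = 0;
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        if (((active >> lane) & 1u) == 0) {
            continue; /* inactive lanes do not execute the shuffle */
        }
        int s = src[lane];
        assert(s >= 0 && s < WARP_SIZE);
        if ((active >> s) & 1u) {
            out[lane] = regs[s];
            valid |= (uint32_t)1 << lane;
        }
        /* Otherwise the source lane is diverged away; the model leaves
         * out[lane] untouched to mark the result as undefined. */
    }
    return valid;
}
```

When all 32 threads are active, every read in the model is valid. When only threads 0 through 15 are active and each thread reads from its higher-numbered neighbor, thread 15 reads from inactive thread 16, so the model reports that read as invalid.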
Although some compilers and GPUs implement some level of synchronization functionality, that functionality is limited and does not guarantee convergence in all situations. For example, many GPUs implement a barrier instruction that is intended to synchronize warps. However, the barrier instruction presupposes that the threads in each of the warps have converged and, consequently, is unreliable when the threads are diverged. In another example, some compilers analyze the code to detect relatively simple divergence patterns. Upon detecting a divergence pattern, the compilers bracket the divergent instructions between two instructions that, respectively, indicate a re-convergence point and continue execution at the re-convergence point. However, the compilers are unable to analyze certain types of complicated control flows and, consequently, the compilers are not always able to ensure that threads within a warp are converged when required for proper execution of the code.
As a general matter, the implementation of certain program instructions may require a level of convergence across the different threads in a warp that cannot be maintained by the compiler and hardware mechanisms included in the GPU that are normally tasked with ensuring such thread convergence. Accordingly, the only way to ensure proper execution of code is for the programmer to write code in programming languages that do not support complex control flows or to write code using only limited subsets of the instructions and operations defined in richer programming languages. Restricting code in either of these ways would dramatically reduce the ability of programmers to efficiently configure GPUs, which is undesirable.
As the foregoing illustrates, what is needed in the art are more effective techniques for synchronizing execution threads within a thread group or warp.