Early graphics processing units (“GPUs”) had dedicated execution units for various 3D graphics functions of a graphics pipeline. These graphics functions include a vertex shader, a geometry shader, and a pixel shader. Over time, the dedicated execution units were replaced with general-purpose execution units that could be programmed to perform each of these graphics functions. To increase performance, the GPUs employed a single-instruction, multiple-data (“SIMD”) or vector model. A SIMD model allows a single issued instruction to operate on multiple elements of a vector. Thus, a GPU that employed a SIMD model could efficiently process large vectors. This efficiency, however, can be adversely affected by “branch divergence.” Branch divergence occurs when elements of a vector are processed differently based on a condition being satisfied. When elements are processed differently, a single instruction cannot be issued for the entire vector. Rather, separate instructions need to be issued to process the different portions of the vector; in the worst case, a separate instruction needs to be issued for each element, effectively reducing the processing to scalar processing.
The NVIDIA TESLA GPU architecture employs a single-instruction, multiple-thread (“SIMT”) model to increase parallelism and reduce the adverse effects of branch divergence. One version of the Tesla GPU has 14 streaming multiprocessors (“SMs”). FIG. 1 is a block diagram that illustrates a streaming multiprocessor. Each SM 100 has 8 streaming processors (“SPs”) 101. Each SP includes an SP core with scalar integer and floating point arithmetic units. Each SP core is pipelined and multithreaded. An SP core executes one instruction of a thread per thread clock. The SM issues the same instruction to each of the SPs, and each SP executes that instruction as part of four separate threads. Thus, the Tesla GPU effectively has 32 threads that are executed as a parallel unit, referred to as a warp. Each SM supports 24 warps simultaneously. Thus, the Tesla GPU supports over 10,000 threads simultaneously. Each of the 32 threads of a parallel unit has its own instruction pointer and state and operates on its own data. If all 32 threads take the same path of execution, they all continue in lockstep until complete. However, if the paths diverge, then some of the threads become inactive while the remaining threads continue to execute. At some point, the inactive threads become active to continue their execution, and the remaining threads become inactive. As a result, branch divergence can significantly reduce the parallelism within a GPU that employs a SIMT model with parallel units of threads.
To increase performance, many computer architectures employ predicated instructions to help reduce the effects of branches in an instruction pipeline. For example, with an if-then-else statement, the then-instructions (i.e., the instructions implementing the then-portion) are to be executed only when the condition is true, and the else-instructions are to be executed only when the condition is false. With a conventional architecture, the condition would need to be fully evaluated before the then-instructions or the else-instructions could be issued to the instruction pipeline. In such a case, the instruction pipeline may be completely empty when the condition is finally evaluated. When a program has many branches, the benefits of overlapped execution of the instruction pipeline are greatly reduced because of the need to wait until the condition is evaluated to determine whether the then-instructions or the else-instructions should be issued. In contrast, predicated instructions are issued but their results are not committed unless and until their predicate (e.g., a condition) is true. With an if-then-else statement, the predicate is the condition. After the instructions to set the predicate are issued, the then-instructions can be immediately issued predicated on the predicate being true, and the else-instructions can also be immediately issued predicated on the predicate being false. Once the predicate is eventually set, either the then-instructions or the else-instructions, whose execution may be mostly complete, can be committed depending on whether the predicate was set to true or false. In this way, the instruction pipeline can remain full, albeit issuing some instructions that will never be committed. The NVIDIA TESLA GPU architecture supports predicated instructions.
The Tesla GPU architecture is designed to support not only graphics processing but also general-purpose computing. Unfortunately, programs written in a high-level language (e.g., C++) may perform poorly on the Tesla GPU. This poor performance may be due to the inability of a compiler to generate code that is fully optimized for the SIMT model. In addition, even if a program is written in a low-level language (e.g., NVIDIA's PTX), when branch divergence occurs within the threads of a warp, the program can still perform poorly. It would be desirable to have an automated way to translate a program written in a high-level language into a program in a low-level language that reduces the negative effects of branch divergence within a SIMT model with parallel units of threads.