1. Field of the Invention
The present invention relates generally to the field of parallel graphics processing and, more specifically, to branch instructions in a parallel thread processor.
2. Description of the Related Art
Current parallel graphics data processing includes systems and methods developed to perform specific operations on graphics data such as, for example, linear interpolation, tessellation, rasterization, texture mapping, depth testing, etc. Traditionally, graphics processors used fixed function computational units to process graphics data; however, more recently, portions of graphics processors have been made programmable, enabling such processors to support a wider variety of operations for processing vertex and fragment data.
To further increase performance, graphics processors typically implement processing techniques such as pipelining that attempt to process, in parallel, as much graphics data as possible throughout the different parts of the graphics pipeline. Parallel graphics processors with single instruction, multiple thread (SIMT) architectures are designed to maximize the amount of parallel processing in the graphics pipeline. In a SIMT architecture, groups of parallel threads attempt to execute program instructions synchronously together as often as possible to increase processing efficiency.
A problem typically arises, however, when the program includes predicated (conditional) branch instructions, and some threads execute (take) the branch to a target instruction address, but others do not and fall through to the next instruction. In some prior art systems, predicated branch instructions are inserted into a code sequence by a compiler that is compiling conditional code or by a programmer. A predicated branch instruction is associated with a predicate and/or a condition code test, and each thread in the thread group executes the branch instruction only when the predicate value is true and the condition code test is true. Predicated branch instructions use predicate guard registers and/or condition code (CC) tests to implement conditional branches and have the following three forms:
    @Pg BRA target;          // if (Pg) goto target;
    BRA CC.LT target;        // if (CC.LT) goto target;
    @Pg BRA CC.LT target;    // if (Pg && CC.LT) goto target;
A “not” predicate guard @!Pg uses the Boolean complement of the predicate register value. A branch is unconditional if the predicate guard @Pg is omitted and the condition code test is omitted.
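As an illustration only, the per-thread decision described above can be modeled in a few lines of Python. This is a minimal sketch, not part of any instruction set: the function name takes_branch and its parameters are hypothetical, and an omitted guard or an omitted condition code test is represented by None.

```python
from typing import Optional


def takes_branch(pg: Optional[bool] = None,
                 negate_pg: bool = False,
                 cc_test: Optional[bool] = None) -> bool:
    """Return True if a single thread takes the branch (hypothetical model).

    pg        -- value of the predicate guard register, or None if no guard
    negate_pg -- True models the @!Pg form (Boolean complement of Pg)
    cc_test   -- result of the condition code test (e.g. CC.LT), or None if omitted
    """
    # An omitted guard never blocks the branch; @!Pg flips the register value.
    guard_ok = True if pg is None else (pg != negate_pg)
    # An omitted condition code test never blocks the branch either.
    cc_ok = True if cc_test is None else cc_test
    return guard_ok and cc_ok


if __name__ == "__main__":
    print(takes_branch(pg=True))                  # @Pg BRA target
    print(takes_branch(cc_test=False))            # BRA CC.LT target
    print(takes_branch(pg=True, cc_test=True))    # @Pg BRA CC.LT target
    print(takes_branch(pg=False, negate_pg=True)) # @!Pg BRA target
    print(takes_branch())                         # unconditional BRA
```

With neither a guard nor a condition code test supplied, the function always returns True, which corresponds to the unconditional branch.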
With P threads in a thread group, a predicated or conditional instruction sequence can use a predicated branch instruction or can use predicated instructions without branch instructions. Depending on the run-time values of the predicate registers and condition code tests for the P threads, two different run-time cases can arise in code using a predicated branch, and a third case arises in predicated code without a branch. In a first case, all P threads have the same predicate guard register Pg value and the same condition code test result, thus all P threads branch to the target, or all P threads fall through and execute the immediately following instruction. The thread group is converged, and all P threads in the thread group follow the same execution path. In a second case, some threads have a true Pg value while other threads have a false Pg value. In this scenario, the threads having the true Pg value branch to the target, while the remaining threads fall through and execute the immediately following instruction. The thread group diverges as some threads branch while the others do not. The thread group executes both code paths with different sets of active threads while it is diverged, and some prior art systems use a stack of synchronization tokens to manage diverging and synchronizing thread groups. At some point in the execution sequence, thread group synchronization is performed to re-converge the divergent thread group. This synchronization operation adds extra instructions and synchronization stack operations, thus reducing execution efficiency and increasing overhead.
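The first two run-time cases can be distinguished by inspecting the per-thread branch decisions of the thread group. The following Python sketch is illustrative only: the function name split_thread_group and the state labels are hypothetical, and the synchronization token stack used by some prior art systems is not modeled.

```python
from typing import List, Tuple


def split_thread_group(taken: List[bool]) -> Tuple[str, List[int], List[int]]:
    """Classify a thread group from per-thread branch decisions.

    taken[i] is True if thread i takes the branch. Returns a state label
    plus the thread indices active on the branch-target path and on the
    fall-through path.
    """
    branch_threads = [i for i, t in enumerate(taken) if t]
    fall_through_threads = [i for i, t in enumerate(taken) if not t]

    if not fall_through_threads:
        state = "converged: all branch"        # first case, branch taken
    elif not branch_threads:
        state = "converged: all fall through"  # first case, branch not taken
    else:
        state = "diverged"                     # second case, both paths run
    return state, branch_threads, fall_through_threads


if __name__ == "__main__":
    print(split_thread_group([True, True, True, True]))
    print(split_thread_group([False, False, False, False]))
    print(split_thread_group([True, False, True, False]))
```

In the diverged case the two index lists are the sets of active threads for the two code paths, which is why both paths must be executed before the group can re-converge.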
In a third case, rather than using predicated branch instructions to implement a conditional code sequence which can diverge the thread group, only predicated instructions are used. The instructions are predicated on complementary Pg and !Pg predicates (or on complementary condition code tests), thus executing both code paths with different sets of active threads, without diverging the thread group. The execution of the predicated code sequences requires all threads in the thread group to be dragged through each part of the conditional code regardless of whether any threads execute that code or not. Given that a SIMT processor may execute upwards of 800 threads, such a design is inefficient since hundreds of threads may be needlessly dragged through a code path they do not execute. At program design time or compile time, it is difficult for the programmer or compiler to predict which run-time cases will arise, and therefore difficult to choose which instruction sequence to use to obtain efficient performance on conditional code sequences.
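The trade-off between the three cases can be made concrete with a toy cost model. The Python sketch below is a simplification under stated assumptions, not a measurement of any real processor: the function name issue_slots is hypothetical, each instruction is assumed to cost one issue slot for the whole thread group, the branch itself costs one slot, and sync_overhead is an arbitrary placeholder for the extra synchronization instructions and stack operations.

```python
from typing import List


def issue_slots(strategy: str, taken: List[bool],
                n_target: int, n_fall: int, sync_overhead: int = 2) -> int:
    """Count issue slots for one conditional code sequence (toy model).

    strategy      -- "branch" (predicated branch) or "predicated" (no branch)
    taken         -- per-thread branch decision
    n_target      -- instructions on the branch-target path
    n_fall        -- instructions on the fall-through path
    sync_overhead -- assumed extra slots to re-converge a diverged group
    """
    if strategy == "predicated":
        # Third case: every thread is dragged through both paths.
        return n_target + n_fall
    if strategy != "branch":
        raise ValueError("strategy must be 'branch' or 'predicated'")
    if all(taken):
        return 1 + n_target                      # converged, branch taken
    if not any(taken):
        return 1 + n_fall                        # converged, fall through
    return 1 + n_target + n_fall + sync_overhead # diverged, both paths + sync


if __name__ == "__main__":
    for decisions in ([True] * 4, [False] * 4, [True, False, True, False]):
        print(decisions,
              issue_slots("branch", decisions, 5, 3),
              issue_slots("predicated", decisions, 5, 3))
```

Under these assumptions the branch form is cheaper when the group stays converged and more expensive when it diverges, while the predicated form costs the same in every case. Because the decisions are only known at run time, neither form can be chosen as the better one in advance.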
Accordingly, what is needed in the art is a more efficient branching mechanism for conditional code sequences in systems with SIMT architectures that does not cause a thread group to diverge or execute needless instructions.