Field of the Invention
The present invention relates generally to the field of parallel processing and, more specifically, to efficient predicated execution for parallel processors.
Description of the Related Art
Predicated execution is a mechanism for conditionally executing individual instruction operations, typically by conditionally committing or ignoring the results of executing an instruction, and thereby provides an alternative to conditional branching. In parallel processors, such as single-instruction multiple-thread (SIMT) and single-instruction multiple-data (SIMD) parallel processors, where groups of parallel threads or data lanes execute a common instruction stream, predicated execution in each thread or data lane can greatly improve performance over divergent branching code, in which each thread of a thread group can independently take a different execution path.
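By way of illustration only, the following Python sketch models predicated execution across the lanes of a thread group. Every lane executes the same instruction, and a per-lane predicate decides whether the result is committed or ignored. The names predicated_sub and abs_diff_predicated, and the use of Python lists to stand for lanes, are assumptions made for this sketch and do not describe any particular hardware.

```python
def predicated_sub(dst, a, b, pred):
    """Model one predicated instruction over all lanes.

    Each lane computes a - b, but only lanes whose predicate is True
    commit the result; the other lanes keep their old destination value.
    """
    return [x - y if p else d for d, x, y, p in zip(dst, a, b, pred)]


def abs_diff_predicated(a, b):
    """Compute |a - b| per lane without a divergent branch.

    Both sides of the conditional are issued to every lane under
    complementary predicates, so all lanes follow one instruction stream.
    """
    pred = [x >= y for x, y in zip(a, b)]      # per-lane condition
    dst = [0] * len(a)
    dst = predicated_sub(dst, a, b, pred)      # "then" side: a - b
    dst = predicated_sub(dst, b, a, [not p for p in pred])  # "else" side: b - a
    return dst
```

In this sketch, calling abs_diff_predicated([5, 1, 7, 3], [2, 4, 7, 9]) issues two instructions to all four lanes, and each lane commits exactly one of the two results.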
In prior parallel processor designs, predicated execution within each thread or data lane makes use of a set of 4-bit condition code (CC) registers for each thread or lane instance, and instructions have a guard comprising several instruction bits to select one of the CC registers and additional bits to encode the comparison condition; a guarded instruction commits its result(s) for a thread or lane only if the condition for that thread or lane evaluates to True and is nullified otherwise. Additionally, many instructions optionally write to a CC register for each thread or data lane, requiring several instruction bits to encode the destination CC register plus one bit to enable/disable the register write operation.
As an example, a prior SIMT parallel thread processor has four 4-bit CC registers per thread, so instruction guards comprise seven bits: two bits to select one of four CC registers and five bits to encode the comparison test. There are 24 possible tests of the CC register, which exceeds the 16 values that four bits could encode. For instructions that optionally write a CC register, three bits are needed: two bits to encode the destination CC register and one bit for write-enable.
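The seven-bit guard described above can be sketched as a simple bit-field. The field widths (two select bits, five test bits) and the count of 24 tests come from the example above; the placement of the select bits above the test bits, the numbering of the tests, and the function names are assumptions made solely for illustration.

```python
CC_SELECT_BITS = 2   # selects one of four CC registers
TEST_BITS = 5        # encodes the comparison test
NUM_TESTS = 24       # number of possible tests of a CC register
GUARD_BITS = CC_SELECT_BITS + TEST_BITS  # 7 bits in total


def encode_guard(cc_index, test):
    """Pack a guard field; the bit layout here is hypothetical."""
    if not 0 <= cc_index < (1 << CC_SELECT_BITS):
        raise ValueError("cc_index must select one of four CC registers")
    if not 0 <= test < NUM_TESTS:
        raise ValueError("test must be one of the 24 defined tests")
    return (cc_index << TEST_BITS) | test


def decode_guard(guard):
    """Unpack a guard field into (cc_index, test)."""
    return guard >> TEST_BITS, guard & ((1 << TEST_BITS) - 1)
```

Because 24 tests do not fit in four bits, the five-bit test field leaves eight of its 32 code points unused in this sketch.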
One problem with the prior approach is cost, both in terms of per-thread state (16 bits per thread for four CC registers) and instruction encoding space (7 bits per instruction for the guarding condition, plus 3 bits per instruction for any instruction that writes a CC register). Note that nearly every instruction must have a guard field, so reducing the encoding cost is a major concern. The 16-bit per-thread cost of the CC registers is multiplied by the number of parallel threads or data lane instances, typically hundreds per SIMT or SIMD parallel processor, and is further multiplied by the number of parallel processors, which can number in the tens per chip. This per-thread register state costs chip area and power.
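The multiplication described above can be made concrete with a short calculation. The register and field sizes are taken from the example above; the thread and processor counts passed to the function are hypothetical values chosen only to fall within the stated ranges of "hundreds" and "tens", and are not figures for any actual chip.

```python
CC_REGS_PER_THREAD = 4
BITS_PER_CC_REG = 4
GUARD_BITS = 7       # guard field carried by nearly every instruction
CC_WRITE_BITS = 3    # extra field on instructions that write a CC register


def cc_state_bits(threads_per_processor, processors_per_chip):
    """Total CC register state per chip, in bits."""
    bits_per_thread = CC_REGS_PER_THREAD * BITS_PER_CC_REG  # 16 bits
    return bits_per_thread * threads_per_processor * processors_per_chip


def encoding_bits(writes_cc):
    """Instruction bits spent on predication for a single instruction."""
    return GUARD_BITS + (CC_WRITE_BITS if writes_cc else 0)


# Hypothetical configuration: 512 threads per processor, 16 processors per chip.
total_bits = cc_state_bits(512, 16)
total_bytes = total_bits // 8
```

Under these assumed counts, the CC registers alone account for 131,072 bits (16,384 bytes) of state per chip, and an instruction that both carries a guard and writes a CC register spends 10 bits of its encoding on predication.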
As the foregoing illustrates, what is needed in the art is a mechanism for minimizing per-thread state associated with predicated execution, minimizing instruction encoding bits required for predicated execution, and minimizing the number of instructions and cycles required to implement predicated execution.