Single instruction, multiple threads (SIMT) is a parallel execution model adopted by some modern graphics processing units (GPUs). Such a GPU can execute a single instruction on multiple threads concurrently in lock-step by utilizing its parallel data paths. Single-program multiple-data (SPMD) accelerator languages such as CUDA® and OpenCL® have been developed to enhance the computing performance of GPUs that have the SIMT architecture.
Some modern GPUs can execute a single instruction on more threads than they have parallel data paths. For example, a processor with 32 parallel data paths may execute one instruction on 128 threads in 4 sequential cycles. These 128 threads are hereinafter referred to as a thread block. All of the threads in a thread block share one program counter and one instruction fetch, and are executed in lock-step, e.g., 32 threads in each of the 4 sequential cycles.
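The 32-data-path, 128-thread example above can be illustrated with a minimal software sketch. It is a simulation only, not a description of any particular GPU; the names `NUM_LANES`, `BLOCK_SIZE`, and `issue_instruction` are hypothetical and introduced here for illustration.

```python
# Illustrative simulation only; all names below are assumptions, not from the source.
NUM_LANES = 32    # parallel data paths of the processor
BLOCK_SIZE = 128  # threads in one thread block


def issue_instruction(op, registers):
    """Apply one instruction `op` to every thread's register value.

    The instruction is conceptually fetched once (one shared program counter),
    then executed on NUM_LANES threads per cycle until all threads are covered.
    Returns the number of sequential cycles used.
    """
    cycles = 0
    for base in range(0, len(registers), NUM_LANES):
        # One cycle: up to NUM_LANES threads execute the same instruction.
        for tid in range(base, min(base + NUM_LANES, len(registers))):
            registers[tid] = op(registers[tid])
        cycles += 1
    return cycles


registers = list(range(BLOCK_SIZE))               # one value per thread
cycles = issue_instruction(lambda x: x + 1, registers)
print(cycles)                                     # 128 threads / 32 lanes = 4 cycles
```

In this sketch the single call to `issue_instruction` stands in for the single instruction fetch, and the outer loop stands in for the 4 sequential cycles.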
SIMT reduces the number of program counters and the instruction-fetch overhead, but in some scenarios it suffers from poor utilization of computing resources due to the lock-step execution model. For example, to handle an if-else block where the threads of a processor follow different control-flow paths, the threads that follow the “else” path are disabled (waiting) while the threads that follow the “if” path execute, and vice versa. That is, one control-flow path is executed at a time, even though that execution does no useful work for the threads on the other path. Furthermore, poor utilization also comes from redundant bookkeeping across the threads. For example, in a while-loop, all threads of a processor execute the loop count increment in lock-step even though the increment is uniform (i.e., the same) across all threads. In addition to redundant loop count calculations, threads often calculate the same branch conditions, replicate the same base addresses, and perform similar address calculations to retrieve data from data arrays. Therefore, there is a need for reducing the redundancy in SIMT computing to improve system performance.
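The two sources of poor utilization described above can be made concrete with a small simulation. This is a simplified sketch under stated assumptions: each control-flow path is assumed to occupy one execution slot per thread, and the functions `run_if_else` and `loop_counter_increments` are hypothetical names that do not correspond to any GPU interface. The "once per group" count is shown only as a baseline for comparison, not as a mechanism taken from the source.

```python
# Illustrative simulation only; all names below are assumptions, not from the source.

def run_if_else(conditions):
    """Count execution slots for an if-else block executed in lock-step.

    `conditions[i]` is the branch condition of thread i. Each path that at
    least one thread takes costs one slot per thread, whether or not that
    thread is enabled on the path. Returns (useful_slots, total_slots).
    """
    masks = []
    if any(conditions):
        masks.append(list(conditions))            # "if" path: enabled threads
    if not all(conditions):
        masks.append([not c for c in conditions]) # "else" path: enabled threads
    total_slots = len(masks) * len(conditions)
    useful_slots = sum(sum(mask) for mask in masks)
    return useful_slots, total_slots


def loop_counter_increments(num_threads, trip_count):
    """Compare loop-count increments done per thread vs. once per group."""
    per_thread = 0
    for _ in range(trip_count):
        per_thread += num_threads   # every thread updates its own identical copy
    once_per_group = trip_count     # hypothetical baseline: one shared update
    return per_thread, once_per_group


# 32 threads, even-numbered threads take the "if" path, odd ones the "else" path.
divergent = run_if_else([tid % 2 == 0 for tid in range(32)])
uniform = run_if_else([True] * 32)
increments = loop_counter_increments(num_threads=128, trip_count=10)
print(divergent, uniform, increments)
```

Under these assumptions, the divergent case uses only half of its execution slots for useful work, while the loop example performs 128 identical increments per iteration where a single one would carry the same information.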