Current graphics processing units (GPUs) issue and execute groups of threads called “wavefronts.” GPU architectures issue wavefronts of a constant, fixed size that depends on the GPU hardware's microarchitecture. In some implementations, a wavefront is a group of 64 threads, which are issued in groups of 16 threads through a 16-thread-wide single instruction, multiple data (SIMD) unit over four cycles. In many cases, all 64 threads are executing. But some of these threads may be predicated off at various times, meaning that they execute but the results of the executed instructions are discarded. Predicating the threads is done to simplify the microarchitecture, yielding a smaller area and better chip-wide performance. But predicating the threads is also a source of inefficiency in the pipeline, as the predicated instructions take up space and power in the vector pipeline of the GPU.
FIG. 1 shows how an eight-thread-wide wavefront can be executed over two cycles on a four-thread-wide GPU microarchitecture. Threads 1-4 are issued on the first cycle, and threads 5-8 are issued on the second cycle. Some of these threads may be predicated off (for example, threads 3, 4, and 6-8); these are shown in FIG. 1 as empty boxes, which illustrate the inefficiency in the GPU pipeline.
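As an illustrative sketch only, the issue pattern described for FIG. 1 can be modeled in a few lines of Python. The function name `issue_stats` and its parameters are hypothetical and are not part of any GPU interface. The model assumes, as described above, that a predicated-off thread still occupies its issue slot.

```python
from math import ceil

def issue_stats(active_mask, simd_width):
    """Return (cycles, utilization) for one wavefront instruction.

    active_mask: list of bools, one per thread; False means predicated off.
    simd_width:  number of threads issued per cycle.
    """
    wavefront_size = len(active_mask)
    cycles = ceil(wavefront_size / simd_width)  # issue groups needed
    slots = cycles * simd_width                 # issue slots occupied
    useful = sum(active_mask)                   # slots producing kept results
    return cycles, useful / slots

# FIG. 1 example: threads 3, 4, and 6-8 are predicated off.
predicated_off = {3, 4, 6, 7, 8}
fig1_mask = [t not in predicated_off for t in range(1, 9)]

print(issue_stats(fig1_mask, simd_width=4))      # (2, 0.375)
print(issue_stats([True] * 64, simd_width=16))   # (4, 1.0)
```

Under these assumptions, only three of the eight issue slots in the FIG. 1 example produce results that are kept, whereas a fully active 64-thread wavefront on a 16-thread-wide SIMD unit uses every slot over its four cycles.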
Many GPU workloads are non-uniform and have numerous wavefronts with predicated-off threads. Unfortunately, the predicated instructions still take up space in the pipeline, waste power, and produce heat, while producing no useful output.
Modern GPU microarchitectures have vector, scalar, and other functional units within the GPU cores. The type of instruction to be performed determines which unit of the pipeline will execute that particular instruction. For instance, scalar instructions (which are used for control flow) execute on the scalar units, while vector math instructions are combined into wavefronts and executed in parallel on vector pipelines. This approach allows the compiler/finalizer to make certain tradeoffs that are knowable at compile time (e.g., that an operation is replicated across all lanes of the vector, and thus can be executed once on a scalar unit and have its single result shared with all threads).
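The compile-time tradeoff mentioned above can be sketched as follows. This is a simplified model, not an actual compiler or finalizer: the function `run_instruction` and the flag `known_uniform` are hypothetical, with the flag standing in for a fact that the compiler has already established statically.

```python
def run_instruction(op, operands, known_uniform):
    """Return (per-lane results, number of times op was evaluated).

    operands:      one input value per lane of the vector.
    known_uniform: True if the compiler proved that every lane holds the
                   same operand (hypothetical compile-time fact).
    """
    if known_uniform:
        # Scalar path: evaluate once and share the result with all lanes.
        result = op(operands[0])
        return [result] * len(operands), 1
    # Vector path: evaluate once per lane.
    return [op(x) for x in operands], len(operands)

def double(x):
    return x * 2

print(run_instruction(double, [7, 7, 7, 7], known_uniform=True))
# ([14, 14, 14, 14], 1)
print(run_instruction(double, [1, 2, 3, 4], known_uniform=False))
# ([2, 4, 6, 8], 4)
```

In this sketch, the replicated operation is evaluated once rather than once per lane, which corresponds to executing it on a scalar unit and sharing the single result with all threads.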
The current approaches do not address dynamic runtime behavior that is difficult or impossible to know at compile time. For example, there may be instances where all threads but one are waiting at a barrier for the remaining thread to complete. Unfortunately, at compile time, it is often impossible to know which thread will be the laggard because of data-dependent loop trip-counts, memory latency, whims of the scheduler, etc.
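The barrier scenario can be illustrated with a small model. The function `barrier_idle_cycles` and the fixed cost per loop iteration are assumptions made for illustration; in practice the trip counts depend on input data, and memory latency and scheduling add further variation.

```python
def barrier_idle_cycles(trip_counts, cycles_per_iteration=1):
    """Return (laggard index, barrier release time, idle cycles per thread).

    trip_counts: loop iterations each thread runs before reaching the
                 barrier; known only at run time because they are
                 data-dependent.
    """
    finish = [n * cycles_per_iteration for n in trip_counts]
    release = max(finish)               # barrier opens when the last thread arrives
    laggard = finish.index(release)     # first thread with the latest arrival
    idle = [release - f for f in finish]
    return laggard, release, idle

laggard, release, idle = barrier_idle_cycles([3, 3, 3, 10])
print(laggard, release, idle, sum(idle))   # 3 10 [7, 7, 7, 0] 21
```

Here three threads each wait seven cycles for the fourth. Which thread is the laggard is determined entirely by the trip counts supplied at run time, so a static analysis of the program text alone cannot identify it.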
Similarly, static techniques cannot know when the vector units will run inefficiently due to issues like wavefront imbalance, where many threads will be predicated off.