High-performance processing systems include multiple processing units and memory systems. Multi-threaded processing units, such as graphics processing units (GPUs), typically implement multiple processing cores (referred to as “compute units”) that process multiple operations and request access to memory systems concurrently through multiple memory channels. In many applications, such as graphics processing in a GPU, a sequence of work-items (which can also be referred to as threads) is processed in order to output a final result.
During processing, each compute unit is able to execute a thread concurrently with execution of other threads by the other compute units, e.g., according to the single instruction, multiple data (SIMD) execution model. Processing systems cluster threads into wavefronts, or warps, that concurrently execute the same instruction on different data. As the wavefronts execute, instructions or data are retrieved from memory to be used by the processing elements. Execution of a wavefront terminates when all threads within the wavefront complete processing.
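The wavefront behavior described above can be illustrated with a minimal software sketch. It is not a model of any particular hardware. The wavefront size of 4, the function names `group_into_wavefronts` and `run_wavefront`, and the "add 1" instruction are all assumptions chosen for illustration. The sketch shows threads being clustered into wavefronts, every active lane applying the same instruction to its own data, and the wavefront terminating only when its last thread completes.

```python
from typing import List, Tuple

# Hypothetical wavefront size chosen for readability; not taken from the source.
WAVEFRONT_SIZE = 4


def group_into_wavefronts(thread_ids: List[int],
                          size: int = WAVEFRONT_SIZE) -> List[List[int]]:
    """Cluster consecutive thread IDs into wavefronts of at most `size` threads."""
    return [thread_ids[i:i + size] for i in range(0, len(thread_ids), size)]


def run_wavefront(data: List[int],
                  trip_counts: List[int]) -> Tuple[List[int], int]:
    """Apply the same instruction ("add 1") in lockstep to every active lane.

    Lane i needs trip_counts[i] iterations to finish. The wavefront keeps
    issuing the instruction until no lane is active, so the number of steps
    is determined by the slowest thread.
    """
    results = list(data)
    remaining = list(trip_counts)
    steps = 0
    while any(r > 0 for r in remaining):
        for lane in range(len(results)):
            if remaining[lane] > 0:      # lane is still active
                results[lane] += 1       # same instruction, different data
                remaining[lane] -= 1
        steps += 1
    return results, steps


if __name__ == "__main__":
    print(group_into_wavefronts(list(range(10))))
    print(run_wavefront([10, 20, 30, 40], [1, 3, 2, 0]))
```

In this sketch a lane that finishes early simply stays idle until the remaining lanes are done, which is one way to picture why a wavefront's execution time follows its longest-running thread.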