In the graphics processing domain, many tasks involve applying the same algorithm to multiple pixels, such as those of an image line or a macroblock. Graphics processors were therefore originally designed to apply a single instruction to multiple pixels at the same time in order to increase throughput. Such graphics processors thus include multiple hardware execution units, generally pipelined, in which a single instruction is applied simultaneously to the data present in each execution unit.
A given execution unit is usually assigned to a given program thread, and the multiple threads processed in parallel lockstep in the execution units are sometimes referred to as a “warp”. The data in the same state of processing in the units will be referred to as a data “wave”.
Such architectures have proven successful in general-purpose parallel computing, in particular because of their ability to manage tens of warps and to switch warps at each cycle. However, they only benefit applications whose control flow patterns and memory access patterns present enough regularity. Prior work has shown that the performance potential of GPU architectures is vastly underutilized by many irregular applications, for example:
    V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey. Debunking the 100× GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. In ISCA '10: Proceedings of the 37th annual international symposium on Computer architecture, pages 451-460, 2010,
    W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware. ACM Trans. Archit. Code Optim., 6:7:1-7:37, July 2009, and
    G. Dasika, A. Sethia, T. Mudge, and S. Mahlke. PEPSC: A power-efficient processor for scientific computing. In PACT, 2011.
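The lockstep model described above, and its sensitivity to irregular control flow, can be illustrated by a minimal software sketch. This is not a description of any particular hardware. It assumes a common masking scheme in which both paths of a branch are issued to the whole warp in turn and a per-lane mask decides which lanes commit a result. The function name `run_warp`, the even/odd branch, and the utilization metric are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def run_warp(data: List[int]) -> Tuple[List[int], float]:
    """Run `x // 2 if x is even else 3 * x + 1` on every lane in lockstep.

    Assumption: each branch path taken by at least one lane is issued to
    the whole warp, and lanes on the other path sit idle for that issue.
    Returns the new lane data and the fraction of lane slots that did
    useful work.
    """
    mask_even = [x % 2 == 0 for x in data]
    paths = [
        (mask_even, lambda x: x // 2),                      # "then" path
        ([not m for m in mask_even], lambda x: 3 * x + 1),  # "else" path
    ]

    out = list(data)
    issued_slots = 0  # lane slots occupied by issued instructions
    active_slots = 0  # lane slots that actually committed a result

    for mask, op in paths:
        if not any(mask):
            continue  # no lane takes this path, so it is never issued
        issued_slots += len(data)
        for lane, enabled in enumerate(mask):
            if enabled:
                out[lane] = op(data[lane])
                active_slots += 1

    return out, active_slots / issued_slots


# Regular input: every lane takes the same path.
regular_out, regular_util = run_warp([2, 4, 6, 8, 10, 12, 14, 16])

# Irregular input: lanes alternate between the two paths.
irregular_out, irregular_util = run_warp([1, 2, 3, 4, 5, 6, 7, 8])
```

Under these assumptions, the regular input keeps every lane busy on a single issue, whereas the irregular input requires two issues with half of the lanes idle in each, which is the kind of underutilization the cited work reports for irregular applications.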