Data processing apparatuses that are programmed in a single program multiple data (SPMD) fashion and then execute multiple threads to perform the required data processing, in particular graphics processing units (GPUs), have grown in capability and have emerged as a popular platform for non-graphics computations. Through programming models such as OpenCL, RenderScript and CUDA, programmers can use these massively parallel architectures for computational domains such as linear algebra, image processing and molecular science. The increased popularity of such devices has made programmability, maintainability and portability issues of major importance.
SPMD programming models such as those mentioned above are advantageously flexible in that they allow the programmer to specify that threads are independent, which removes ordering constraints between them. However, this very flexibility also places a burden on the programmer, because the particular manner in which the threads are programmed can affect the performance of the architecture. As such, programmers are still required to manually perform data-locality optimizations such as memory coalescing or loop tiling, and to take into account GPU architectural characteristics such as the core count and the on-chip memory hierarchy. Furthermore, the programmer's freedom to determine the allocation of the threads to workgroups (in OpenCL) or blocks (in CUDA) may be desirable from a pure programming point of view but can negatively affect the GPU's scheduling freedom and cache performance.
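By way of illustration only, the effect that the mapping of threads to data can have on memory behaviour may be sketched with a simple model. The sketch below is not taken from any particular device or programming model: the cache-line size, element size, group size and the function names are all assumptions chosen for the example. It counts how many distinct cache lines are touched when a group of threads that issue their accesses together each read one element.

```python
# Illustrative model only; all constants and names below are assumptions.
LINE_BYTES = 64   # assumed cache-line size in bytes
ELEM_BYTES = 4    # assumed element size (32-bit values)
GROUP_SIZE = 32   # assumed number of threads that access memory together


def lines_touched(element_indices):
    """Count the distinct cache lines covered by the given element indices."""
    return len({(i * ELEM_BYTES) // LINE_BYTES for i in element_indices})


def consecutive_indices(group):
    """Thread t of the group reads element (group * GROUP_SIZE + t)."""
    start = group * GROUP_SIZE
    return [start + t for t in range(GROUP_SIZE)]


def strided_indices(group, stride):
    """Thread t of the group reads element (t * stride + group)."""
    return [t * stride + group for t in range(GROUP_SIZE)]


# Neighbouring threads read neighbouring elements (a coalesced pattern).
coalesced = lines_touched(consecutive_indices(0))

# Neighbouring threads read elements 1024 apart, e.g. walking down a column
# of a row-major array that is 1024 elements wide.
strided = lines_touched(strided_indices(0, 1024))

print(coalesced, strided)
```

Under these assumptions the consecutive pattern touches 2 cache lines for the whole group, whereas the strided pattern touches 32, one per thread. The same computation can therefore generate very different memory traffic depending only on how the programmer has assigned data to threads.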
There remains the potential to improve the performance of a multi-threaded data processing apparatus (such as a GPU).