It is known for a contemporary data processing apparatus such as a graphics processing unit (GPU) to be programmed in a single program multiple data (SPMD) fashion in which the programmer provides a single program (or kernel) and a set of multiple data items on which that single program should operate. The data processing apparatus then creates and executes a thread for each data item within that data set. For example, in the context of image processing on a GPU, a thread can be created and executed for each pixel in a frame, with many threads being executed simultaneously in order to perform the same data processing operation on each pixel within the frame.
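By way of illustration only, the SPMD model described above can be sketched as follows. This is a sequential stand-in, not an actual GPU implementation: the names `launch` and `invert_kernel` and the sample pixel values are hypothetical, and the loop merely plays the role of the hardware that would create and run one thread per data item in parallel.

```python
def invert_kernel(tid, src, dst):
    """The single program: invert the one pixel identified by this thread's id."""
    dst[tid] = 255 - src[tid]


def launch(kernel, src):
    """Run the kernel once per data item.

    Assumption: a plain loop stands in for the apparatus creating and
    executing one thread per data item; a real GPU would run many at once.
    """
    dst = [0] * len(src)
    for tid in range(len(src)):
        kernel(tid, src, dst)
    return dst


# Hypothetical 8-bit pixel values for a tiny "frame".
frame = [0, 64, 128, 255]
inverted = launch(invert_kernel, frame)
```

The programmer writes only `invert_kernel` and supplies `frame`; the creation of one execution per pixel is left to the launching mechanism.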
It is further known that a data processing apparatus such as a GPU may benefit, in terms of data processing efficiency, if the order in which data values are processed is adapted to suit the configuration of the data processing apparatus. For example, advantages related to cache locality can result if the data values are processed in an order which is modified in dependence on the requirements of the image processing to be performed. In particular, it is known that where an image processing operation requires the data values related to one image pixel to be compared with those of its immediate neighbours, a block-wise approach to processing the pixels in the image is likely to result in improved cache performance compared with an approach in which the pixels are handled in row-by-row raster order.
Programming models such as CUDA and OpenCL allow such a deviation from raster-scan processing of pixel data by grouping threads together, e.g. in blocks, in order to more efficiently make use of architectural features such as local memory and barriers.
In order to adapt the order in which the pixels in an image frame are processed, it is known for the kernel running in each thread to adapt its address calculation from a simple linear increment to a calculation which results in, for example, the above-mentioned block-wise treatment of the pixels.
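As a minimal sketch of such an adapted address calculation, the two mappings below convert a linear thread identifier into pixel coordinates, first in raster order and then block by block. The function names and parameters are hypothetical, and the block-wise version assumes that the frame width is an exact multiple of the block width (and the height of the block height).

```python
def raster_coords(tid, width):
    """Simple linear mapping: thread id -> (x, y) in row-by-row raster order."""
    return tid % width, tid // width


def blockwise_coords(tid, width, block_w, block_h):
    """Block-wise mapping: consecutive thread ids fill one block of pixels
    before moving on to the next block.

    Assumption: width is a multiple of block_w, and the frame height is a
    multiple of block_h.
    """
    block_size = block_w * block_h
    blocks_per_row = width // block_w
    # Which block this thread falls in, and its position inside that block.
    block_id, within = divmod(tid, block_size)
    block_y, block_x = divmod(block_id, blocks_per_row)
    local_y, local_x = divmod(within, block_w)
    return block_x * block_w + local_x, block_y * block_h + local_y


# Hypothetical 4x4 frame processed in 2x2 blocks.
order = [blockwise_coords(t, 4, 2, 2) for t in range(16)]
```

With these assumed dimensions, the first four threads cover the top-left 2x2 block before any thread touches the next block. The block-wise mapping also costs each thread several additional integer divisions relative to the raster mapping.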
In a massively parallel data processing apparatus such as a GPU, it is however advantageous to reduce as far as possible the processing that must be performed by each kernel running within each thread, in order to allow the data processing apparatus as a whole to process large volumes of input data more effectively.
Accordingly, it would be advantageous to enable a data processing apparatus to improve its data processing performance by adapting the order in which a set of data items is processed, without increasing the processing burden on each kernel running within each thread of that parallel data processing apparatus.