1. Field
The present invention relates to a data processing apparatus. In particular, the present invention has relevance to data processing apparatuses that execute an access instruction for a plurality of threads.
2. Description
In a Single Instruction Multiple Thread (SIMT) system, a group of threads is said to execute in parallel. Each thread within the group may have its own registers and program counter, and may execute the same program. If each thread has its own registers, then at least a subset of those threads may execute the same instruction on different data values. Although the execution of the threads in a group is said to take place in parallel, the number of threads that can actually execute that instruction at the same time may be smaller than the number of threads in the group. Accordingly, the threads in a particular group may have to be divided up into a number of sets, with each set of threads executing the instruction substantially simultaneously.
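The division of a thread group into sets can be illustrated with a minimal Python sketch. The function name `split_into_sets`, the group size of 16 and the figure of 4 threads executing at once are assumptions chosen purely for illustration and do not come from any particular apparatus.

```python
def split_into_sets(thread_ids, lanes):
    """Divide a thread group into consecutive sets of at most `lanes` threads.

    `lanes` is a hypothetical name for the number of threads that can
    execute the same instruction at the same time.
    """
    return [thread_ids[i:i + lanes] for i in range(0, len(thread_ids), lanes)]


group = list(range(16))            # assumed group of 16 threads
sets = split_into_sets(group, 4)   # assumed: 4 threads execute at once

for index, thread_set in enumerate(sets):
    print(f"set {index}: threads {thread_set}")
```

Under these assumptions the group is processed as four sets of four threads, one set after another.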
U.S. Pat. No. 8,392,669 describes a method (herein referred to as ‘peeling’) of improving the efficiency of simultaneous memory access requests. Ordinarily, a memory access request will be issued in order to service a single thread. However, by configuring the memory circuitry to return more data than a particular thread requests, it is possible to use a single memory access request to additionally return data that is requested by other threads. For example, if a memory access request from a first thread is directed towards a data value at a memory base address, the memory circuitry may be configured to return data values stored at the base address as well as data values stored at the addresses following the base address if those data values will service the memory access requests of other threads.
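The general idea of peeling, as summarised above, may be sketched as follows. This is a simplified model and not the method of the cited patent: the function name `peel`, the `window` parameter (the number of bytes assumed to be returned starting at the base address) and the example addresses are all hypothetical.

```python
def peel(requests, window):
    """Group per-thread requests into memory accesses.

    requests: dict mapping thread id -> requested address.
    window:   assumed number of bytes returned from the base address onward.
    Returns a list of (base_address, [thread ids serviced]) tuples.
    """
    pending = dict(requests)
    issued = []
    while pending:
        # The first outstanding thread supplies the base address.
        first = next(iter(pending))
        base = pending[first]
        # Every pending thread whose address falls in the returned
        # window is serviced by the same access.
        served = [t for t, addr in pending.items() if base <= addr < base + window]
        for t in served:
            del pending[t]
        issued.append((base, served))
    return issued


# Assumed workload: 8 threads reading consecutive 4-byte words from 0x100.
requests = {t: 0x100 + 4 * t for t in range(8)}
accesses = peel(requests, window=16)
```

In this assumed workload, two accesses service all eight threads, rather than one access per thread.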
Data values may be stored in groups defined by storage boundaries. For example, a cache stores data values in fixed-size groups known as cache lines. In such a situation, if the base address for a group of threads is not aligned with the storage boundary, then the process of peeling, in combination with the previously mentioned division of threads in a group into sets, may cause multiple memory access requests to be issued unnecessarily. For example, a single cache line may be accessed more than once, because the addresses requested by two different sets of threads can fall within the same cache line.
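The effect of misalignment can be made concrete with a small counting sketch. It assumes that each set of threads is serviced separately, that one request is issued per cache line touched by a set, and that threads access consecutive words. The names `line_requests` and `distinct_lines` and all numeric parameters are hypothetical.

```python
def line_requests(base, num_threads, word, lanes, line_size):
    """Count cache-line requests when each set of `lanes` threads is
    serviced separately (assumed: one request per line touched per set)."""
    total = 0
    for start in range(0, num_threads, lanes):
        end = min(start + lanes, num_threads)
        lines = {(base + word * t) // line_size for t in range(start, end)}
        total += len(lines)
    return total


def distinct_lines(base, num_threads, word, line_size):
    """Count the distinct cache lines touched by the whole group."""
    return len({(base + word * t) // line_size for t in range(num_threads)})


# Assumed: 16 threads, 4-byte words, sets of 4 threads, 16-byte cache lines.
aligned = line_requests(0x100, 16, 4, 4, 16)
misaligned = line_requests(0x108, 16, 4, 4, 16)
needed = distinct_lines(0x108, 16, 4, 16)
```

With the aligned base address each set touches exactly one line, giving 4 requests. With the base address offset by 8 bytes each set straddles a line boundary, giving 8 requests even though only 5 distinct lines are touched, so three lines are each requested twice.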