1. Field of the Invention
The present invention relates to a data processing apparatus and method for moving data between registers and memory.
2. Description of the Prior Art
When it is necessary to perform a particular data processing operation on a number of separate data elements, one known approach for accelerating the performance of such an operation is to employ a SIMD (Single Instruction Multiple Data) approach. In accordance with the SIMD approach, multiple data elements are placed side-by-side within a register, and then the operation is performed in parallel on those data elements.
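By way of illustration only, the SIMD approach described above can be sketched in Python by modelling a register as a short list of lanes and applying one operation to every lane in a single step. The lane count of four, the saturating 8-bit addition and the names `LANES` and `simd_add_sat_u8` are assumptions made for this sketch; real hardware would process the lanes in parallel rather than in a loop.

```python
LANES = 4  # assumed number of data elements held side-by-side in one register

def simd_add_sat_u8(reg_a, reg_b):
    """Model one SIMD instruction: add corresponding lanes, saturating at 255."""
    assert len(reg_a) == len(reg_b) == LANES
    return [min(a + b, 255) for a, b in zip(reg_a, reg_b)]

reg_a = [10, 200, 30, 250]
reg_b = [5, 100, 7, 10]
result = simd_add_sat_u8(reg_a, reg_b)  # one operation covers all four lanes
```

A single call here stands in for a single instruction; the scalar equivalent would need one addition per data element.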
However, the real performance benefits of the SIMD approach can only be fully realised if the data elements can be arranged in the appropriate order within the register without significant overhead. Typically, prior to performing the SIMD operation, the relevant data elements will need to be loaded from memory into the register(s), and it is often the case that the required data elements for the SIMD operation are not located consecutively within the memory address space. As an example, the data in memory may represent red, green and blue components for pixel values (i.e. be RGB values), and it may be desired to perform a particular SIMD operation on the red, green and blue components of certain pixels which are not located in a continuous block of memory.
This hence requires the data to be retrieved from memory into certain registers, and for the data to then be rearranged so that red, green and blue data elements occupy different registers. Multiple accesses will be required to retrieve the required data, and rearranging of the data will typically then be required to ensure the data is correctly ordered in the registers in order to allow the SIMD operations to be performed.
One way in which to access the required data elements would be to issue a separate instruction for each data element, with that data element then being placed within a specified destination register. As an example, considering the red data elements for pixel values discussed above, this would result in each red data element occupying a separate register. Then, rearrangement operations could be performed in order to gather the individual red data elements into one or more registers, whereafter SIMD processing could be performed on those data elements. A similar process would also be required for the green and blue data elements if SIMD processing were to be applied to those elements. It will be appreciated that such an approach involves a large number of accesses to the memory, and also requires a significant number of registers to be available to receive the data elements prior to them being rearranged. Further, in addition to the adverse impact on performance caused by the multiple accesses, there is also a performance hit due to the time taken to rearrange the data before it is in an order where it can be subjected to the SIMD processing, and this adversely impacts the potential performance benefit that can be realised through use of the SIMD operation.
If the architecture allowed it, one possible enhancement to the above known technique would be to retrieve the red, green and blue data elements of one pixel at the same time and place those three data elements into a particular register. Whilst this would reduce the number of accesses required, it would still require rearrangement operations to be performed in order to move the data elements of different components into different registers prior to the SIMD operation being able to take place. In addition, a significant number of registers are still required in order to store the retrieved data elements prior to them being rearranged in preparation for SIMD processing.
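The two known approaches discussed above, namely one access per data element and one access per pixel, can be compared with the following Python sketch. It is a simplified model under stated assumptions: memory is a flat list holding four interleaved RGB pixels (shown contiguously for brevity, although the pixels of interest need not be contiguous), each register is a small list, and the helper names `load_per_element`, `load_per_pixel` and `rearrange` are hypothetical.

```python
# Interleaved pixel data in memory: R0 G0 B0 R1 G1 B1 ... (assumed layout)
memory = [10, 11, 12, 20, 21, 22, 30, 31, 32, 40, 41, 42]
COMPONENTS = 3  # red, green, blue
PIXELS = 4

def load_per_element(mem):
    """One memory access per data element, each landing in its own register."""
    scratch = [[mem[addr]] for addr in range(PIXELS * COMPONENTS)]
    return scratch, len(scratch)  # accesses equal scratch registers used

def load_per_pixel(mem):
    """One memory access per pixel, placing R, G and B in a single register."""
    scratch = [mem[p * COMPONENTS:(p + 1) * COMPONENTS] for p in range(PIXELS)]
    return scratch, len(scratch)

def rearrange(scratch):
    """Gather each component into its own register, ready for SIMD processing."""
    flat = [element for reg in scratch for element in reg]
    return [flat[c::COMPONENTS] for c in range(COMPONENTS)]

element_regs, element_accesses = load_per_element(memory)
pixel_regs, pixel_accesses = load_per_pixel(memory)
red, green, blue = rearrange(pixel_regs)
```

In this model the per-pixel variant cuts the accesses from twelve to four, but both variants still occupy scratch registers and both still pass through `rearrange` before any SIMD operation can begin.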
In the different technical field of vector processing, it is known to provide load instructions which can collect individual data elements from non-consecutive locations in memory by specifying a starting address and a stride. This can for example enable every third data element starting from a particular address to be loaded into a register. Similar store instructions may also be provided.
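A strided load and store of the kind described above can be modelled as follows. This is a minimal sketch rather than a description of any particular vector architecture: memory is assumed to be a flat Python list indexed by element, and the function names `strided_load` and `strided_store` and their parameters are hypothetical.

```python
def strided_load(mem, start, stride, count):
    """Collect `count` elements beginning at `start`, stepping by `stride`."""
    return [mem[start + i * stride] for i in range(count)]

def strided_store(mem, start, stride, reg):
    """Write the register's elements back to strided locations in memory."""
    for i, value in enumerate(reg):
        mem[start + i * stride] = value

memory = list(range(100, 112))  # element addresses 0..11 hold 100..111
# Every third data element, starting from address 1.
reg = strided_load(memory, start=1, stride=3, count=4)
strided_store(memory, start=1, stride=3, reg=[0, 0, 0, 0])
```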
However, this approach typically involves an increase in complexity in the load/store hardware required to implement such striding functions, which, whilst being considered worthwhile in vector processing systems, is not desirable in most other data processing systems.
Further, this approach is quite restrictive in that the data elements that are to be gathered into a register for subsequent SIMD processing will need to be separated by a particular stride, and this is often not the case. For example, the data elements might be related via a linked list, where the separation in memory between one data element and the next will vary from data element to data element.
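The linked list case mentioned above can be illustrated with a short Python sketch. The node addresses and values are invented for illustration, memory is modelled as a dictionary from address to a (value, next address) pair, and the helper name `walk` is hypothetical.

```python
# Each node sits at an address and holds (value, address of next node).
# The addresses are arbitrary, as might result from dynamic allocation (assumed).
memory = {4: (7, 19), 19: (3, 8), 8: (9, 40), 40: (1, None)}

def walk(mem, head):
    """Follow the next pointers, returning visited addresses and values."""
    addresses, values = [], []
    addr = head
    while addr is not None:
        value, next_addr = mem[addr]
        addresses.append(addr)
        values.append(value)
        addr = next_addr
    return addresses, values

addresses, values = walk(memory, head=4)
# Separation in memory between each data element and the next.
gaps = [b - a for a, b in zip(addresses, addresses[1:])]
has_fixed_stride = len(set(gaps)) == 1
```

Because the gaps differ from element to element, no single start address and stride can describe these data elements to a strided load.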
Further, considering the earlier example where the data represents red, green and blue components of pixels, then if all three components are required, separate stride instructions will be needed for each component. Hence, whilst increasing the complexity of the load/store unit to support striding functionality might enable the red components to be gathered into a particular register, the blue components to be gathered into another register, and the green components to be gathered into another register (in the restrictive situation where the required data elements are separated by a fixed stride), this would require separate instructions to be issued for each component.
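The need for a separate stride instruction per component can be seen in the following Python sketch. It assumes four interleaved RGB pixels stored contiguously in a flat list, models each call to the hypothetical `strided_load` helper as one issued instruction, and records the calls in an `issued` list purely for illustration.

```python
memory = [10, 11, 12, 20, 21, 22, 30, 31, 32, 40, 41, 42]  # R G B R G B ...

issued = []  # one entry per modelled strided load instruction

def strided_load(mem, start, stride, count):
    """Model one strided load instruction filling one register."""
    issued.append((start, stride))
    return [mem[start + i * stride] for i in range(count)]

# One instruction per component, each with its own starting address.
red = strided_load(memory, start=0, stride=3, count=4)
green = strided_load(memory, start=1, stride=3, count=4)
blue = strided_load(memory, start=2, stride=3, count=4)
```

Three instructions are issued here to fill three registers, and the model only works because the components happen to be separated by a fixed stride of three.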
Accordingly, it would be desirable to provide an improved technique for moving data elements between specified registers and memory in order to support efficient SIMD processing operations.