1. Field of the Invention
The present invention relates to a data processing apparatus and method for moving data between registers and memory.
2. Background of the Invention
When it is necessary to perform a particular data processing operation on a number of separate data elements, one known approach for accelerating the performance of such an operation is to employ a SIMD (Single Instruction Multiple Data) approach. In accordance with the SIMD approach, a plurality of the data elements are placed side-by-side within a register, and then the operation is performed in parallel on those data elements.
However, the real performance benefits of the SIMD approach can only be fully realised if the data elements can be arranged in the appropriate order within the register without significant overhead. Typically, prior to performing the SIMD operation, the relevant data elements will need to be loaded from memory into the register(s), and it is often the case that the data elements required for the SIMD operation are not located consecutively within the memory address space. As an example, the data in memory may represent red, green and blue components of pixel values (i.e. RGB data), and accordingly when data is loaded from a continuous block of memory into a register, the data elements within the register will also represent red, green and blue components, repeated for each pixel. It may be desired to perform a particular operation on all of the red components retrieved, and accordingly a problem that arises is how to arrange the red components such that a SIMD operation can then be applied to them.
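By way of illustration only, the layout problem described above can be modelled with a minimal sketch. The eight-lane register capacity, the helper name `load_contiguous` and the symbolic element labels are assumptions made for the purposes of the sketch and are not taken from any particular apparatus.

```python
# Illustrative model only: memory is a Python list and a "register" is a
# fixed-width slice of it. LANES = 8 is an assumed register capacity.
LANES = 8

def load_contiguous(memory, start, lanes=LANES):
    """Copy `lanes` consecutive data elements from memory into a register."""
    return list(memory[start:start + lanes])

# Interleaved RGB data for four pixels: R0 G0 B0 R1 G1 B1 ...
memory = [f"{c}{p}" for p in range(4) for c in "RGB"]

reg = load_contiguous(memory, 0)

# Lanes of the register that hold red components.
red_lanes = [i for i, element in enumerate(reg) if element.startswith("R")]

print(reg)        # red, green and blue components repeat across the lanes
print(red_lanes)  # the red components do not occupy adjacent lanes
```

In this model the red components land in every third lane, so a SIMD operation applied across the whole register would also act on green and blue components.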
In accordance with one known technique, data from a continuous block of memory (incorporating the required data elements to be subjected to SIMD processing) is loaded from memory into one or more registers. If each data element within a register is then considered as occupying a different lane of processing, the processor can be arranged to operate on different lanes in different ways in order to perform the required SIMD processing of particular data elements. Alternatively, certain customised instructions can be developed for particular operations. Whilst use of these techniques can avoid the need to reorder the data before the SIMD operation is performed, both of these approaches are relatively complex, and significantly increase code size by requiring different instructions and/or processes to be defined for different operations. Accordingly, such approaches do not represent a generic solution.
In addition, it can be seen that such approaches present a large overhead in terms of wasted resource bandwidth. For example, if a particular register has the capacity to store eight data elements, but some of the locations within the register contain data elements which are not going to be subjected to the SIMD operation, then it is not possible to get the maximum potential benefit from the use of the SIMD operation. As a particular example, if only four of the data elements within a particular register are to be subjected to the SIMD operation, then only half of the potential bandwidth supported by the register is being utilised.
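The bandwidth figure given in the above example follows from simple arithmetic, which may be sketched as below. The function name `lane_utilisation` is hypothetical and is introduced only for illustration.

```python
from fractions import Fraction

def lane_utilisation(used_lanes, register_lanes):
    """Fraction of a register's lanes holding data elements that the
    SIMD operation actually needs."""
    if register_lanes <= 0 or not 0 <= used_lanes <= register_lanes:
        raise ValueError("require 0 <= used_lanes <= register_lanes, register_lanes > 0")
    return Fraction(used_lanes, register_lanes)

# The example from the text: four useful elements in an eight-element register.
print(lane_utilisation(4, 8))  # 1/2
```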
An alternative prior art approach is to load the required data from the memory into one or more registers in the same manner as described above, but then to employ certain rearrangement operations specified by additional instructions in order to rearrange the data so that the data elements to be subjected to a SIMD operation are placed side by side within one or more registers. Whilst this then enables the subsequent SIMD operation to make maximum use of the available bandwidth of the register, there is a significant performance impact due to the requirement to execute one or more further instructions prior to execution of the SIMD operation in order to rearrange the data as required. This can significantly adversely affect the potential benefit to be realised from use of the SIMD operation.
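The load-then-rearrange approach described above may be sketched as follows. This is a simplified software model, not a description of any particular instruction set: the eight-lane register capacity, the helper names `load_contiguous` and `gather_component`, and the use of a single rearrangement helper in place of one or more real rearrangement instructions are all assumptions.

```python
LANES = 8  # assumed register capacity, in data elements

def load_contiguous(memory, start, lanes=LANES):
    """Ordinary load of `lanes` consecutive data elements into a register."""
    return list(memory[start:start + lanes])

def gather_component(registers, offset, period=3):
    """Rearrangement step: take every `period`-th element, starting at
    `offset`, from the registers considered end to end."""
    flat = [element for reg in registers for element in reg]
    return flat[offset::period]

# Interleaved RGB data for eight pixels (24 data elements).
memory = [f"{c}{p}" for p in range(8) for c in "RGB"]

# Step 1: three ordinary loads from a continuous block of memory.
regs = [load_contiguous(memory, i * LANES) for i in range(3)]

# Step 2: additional rearrangement work, needed before any SIMD operation.
red = gather_component(regs, 0)
green = gather_component(regs, 1)
blue = gather_component(regs, 2)

print(red)  # all eight lanes now hold red components
```

After the second step each register is fully populated with one component, but that step represents work which must complete before the SIMD operation itself can begin.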
In the different technical field of vector processing, it is known to provide load instructions which can collect individual data elements from non-consecutive locations in memory by specifying a starting address and a stride. This can for example enable every third data element starting from a particular address to be loaded into a register. Similar store instructions may also be provided.
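A strided load and the corresponding store may be modelled, purely as a sketch, in the following manner. The function names and the element-indexed addressing are assumptions made for illustration; real vector processors define their own instruction forms.

```python
def load_strided(memory, start, stride, count):
    """Collect `count` data elements from addresses start, start + stride,
    start + 2 * stride, and so on."""
    return [memory[start + i * stride] for i in range(count)]

def store_strided(memory, start, stride, values):
    """Write `values` back to the same pattern of non-consecutive addresses."""
    for i, value in enumerate(values):
        memory[start + i * stride] = value

memory = list(range(100, 124))  # 24 data elements with distinct values

# Every third data element starting from address 0.
every_third = load_strided(memory, 0, 3, 8)
print(every_third)

# Store modified values back to the same locations.
store_strided(memory, 0, 3, [v + 1000 for v in every_third])
```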
Such an approach can be advantageous in vector processing systems, since such systems do not generally employ caches within the memory system, and typically are not seeking to access continuous blocks of memory. Accordingly, the increased complexity in the load/store hardware required to implement such striding functions is deemed worthwhile.
However, data processing systems that may be used to perform the earlier described SIMD operations on data elements placed side-by-side within particular registers typically do wish to access continuous blocks of memory, and accordingly it would not be desirable to increase the complexity of the basic load/store hardware in order to support such striding functions. As an example, considering the earlier example where the data represents red, green and blue components of pixels, it may be desired to access the red, green and blue data elements for a particular sequence of pixels, and these data elements will typically be stored within a continuous block of memory. Whilst increasing the complexity of the load/store unit to support striding functionality might enable the red components to be gathered into a particular register, the green components to be gathered into another register, and the blue components to be gathered into a further register, this would require separate instructions to be issued for each component, and further would significantly increase the number of memory accesses required in order to retrieve the data. In particular, it can be seen that every data element would in that instance be accessed from a non-consecutive location in memory, and that hence potentially a separate access would be required for every data element, whereas in fact the data required does occupy a continuous block of memory. Accordingly, it will be appreciated that employing such an approach would not only increase the complexity of the load/store hardware, but would also have a very significant adverse impact on the speed with which the data can be accessed.
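The difference in the number of memory accesses may be illustrated with a deliberately simplified counting model. The model assumes the worst case noted above, in which each strided data element requires its own access, and further assumes that a block access transfers a fixed number of data elements; the parameter `elements_per_access` and both function names are hypothetical.

```python
import math

def strided_cost(pixels, components=3):
    """One strided instruction per component and, in the worst case,
    one memory access per data element."""
    return {"instructions": components, "accesses": pixels * components}

def contiguous_cost(pixels, components=3, elements_per_access=8):
    """Accesses needed to read the same data as a continuous block, assuming
    each access transfers `elements_per_access` data elements."""
    total_elements = pixels * components
    return {"accesses": math.ceil(total_elements / elements_per_access)}

# Eight RGB pixels, i.e. 24 data elements in a continuous block.
print(strided_cost(8))     # {'instructions': 3, 'accesses': 24}
print(contiguous_cost(8))  # {'accesses': 3}
```

Under these assumptions the strided approach issues eight times as many accesses for the same 24 data elements; the actual ratio would depend on the access width of the memory system concerned.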
Accordingly, it would be desirable to provide an improved technique for moving data elements between specified registers and a continuous block of memory in order to support efficient SIMD processing operations.