1. Field of the Invention
The present invention relates to an apparatus and method for performing re-arrangement operations on data.
2. Description of the Prior Art
When it is necessary to perform a particular data processing operation on a number of separate data elements, one known approach for accelerating the performance of such an operation is to employ a SIMD (Single Instruction Multiple Data) approach. In accordance with the SIMD approach, multiple data elements are placed side-by-side within a register and then the operation is performed in parallel on those data elements.
Whilst the above approach works well for certain types of data processing operations, and allows a significant performance improvement to be realised, there are certain data processing operations where it is impractical to arrange the required data elements in such a way that the above SIMD approach can be used. For example, if a data processing operation required four input data elements, then the prior art SIMD approach would involve placing one set of values for those four input data elements within corresponding locations of four source registers, and then packing into the other locations of those source registers further sets of values for those four input data elements, so that the data processing operation can be performed in parallel on those packed data elements. However, in some situations it may not be possible to perform the required data processing operation with the data packed in that manner, or alternatively it may require such significant re-ordering of the data that the potential benefit of employing the SIMD approach would be negated by the extra time taken to re-arrange the data in the required manner.
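By way of illustration only, the packing described above can be sketched as a transposition: register k receives input data element k from every set of values, so that one operation can then be applied across all locations. The names `pack_operands`, `example_operation` and `simd_apply` are hypothetical, the arithmetic is an arbitrary placeholder, and the Python loop merely models what SIMD hardware would perform in parallel.

```python
def pack_operands(sets):
    """Transpose sets of inputs so that register k holds input k of every set."""
    width = len(sets[0])
    if any(len(s) != width for s in sets):
        raise ValueError("every set must supply the same number of inputs")
    return [list(column) for column in zip(*sets)]


def example_operation(a, b, c, d):
    """Placeholder four-input operation (an assumption, not from the text)."""
    return a * b + c * d


def simd_apply(registers):
    """Apply the operation at each location across the source registers."""
    return [example_operation(*location) for location in zip(*registers)]


# Four sets of four input values, one set per register location.
sets = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12), (13, 14, 15, 16)]
registers = pack_operands(sets)
results = simd_apply(registers)
```

The cost that the paragraph above refers to is the `pack_operands` step: when the data does not already arrive in this layout, the re-arrangement may outweigh the gain from the parallel operation.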
To assist in re-ordering data such that it is amenable to being processed by SIMD based arithmetic, various permutation networks have been developed. For example, the Article “Fast Subword Permutation Instructions Based on Butterfly Networks” by Xiao Yang et al, Proceedings of SPIE, Media Processor 2000, Jan. 27-28, 2000, San Jose, Calif., Pages 80 to 86, describes the use of a butterfly network to provide a generalised technique for performing arbitrary subword permutations. Subword permutation instructions are proposed, and it is indicated that the maximum number of instructions used in accordance with the described technique for permuting n subwords is log n, or 2 log n if the loads for the configuration registers are taken into account.
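As a rough sketch of the kind of network referred to above, a butterfly network over n subwords can be modelled as log2 n stages, each of which conditionally swaps pairs of elements a fixed distance apart under the control of configuration bits. The function names and the encoding of the configuration bits below are assumptions made for illustration; the sketch does not reproduce the instructions proposed in the cited article.

```python
def butterfly_stage(elems, distance, swap_bits):
    """Apply one stage: element i is paired with element i + distance.

    swap_bits holds one control bit per pair (hypothetical encoding). A set
    bit exchanges the two elements of the pair; a clear bit passes them through.
    """
    out = list(elems)
    pair = 0
    for i in range(len(elems)):
        if i & distance:
            continue  # upper element of a pair, handled with its partner
        if swap_bits[pair]:
            out[i], out[i + distance] = out[i + distance], out[i]
        pair += 1
    return out


def butterfly_network(elems, config):
    """Run one pass of stages with distances n/2, n/4, ..., 1.

    config supplies one list of swap bits per stage, so log2(n) lists for
    n elements (n is assumed to be a power of two).
    """
    distance = len(elems) // 2
    for swap_bits in config:
        elems = butterfly_stage(elems, distance, swap_bits)
        distance //= 2
    return elems


# With every swap bit set, an 8-element pass reverses the element order.
reversed_order = butterfly_network(list(range(8)), [[1, 1, 1, 1]] * 3)
```

In this model the configuration bits play the role of the configuration registers mentioned above, which is why loading them adds to the instruction count.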
The Article “SODA: A Low-power Architecture For Software Radio” by Yuan Lin et al, University of Michigan, ISCA Jun. 2006, describes a programmable architecture for high-end signal processing for software defined radio. This architecture makes use of a SIMD pipeline, and discloses the use of a SIMD shuffle network to support intra-processor data movements. The SIMD shuffle network is an iterative partially connected network consisting of a shuffle exchange network, an inverse shuffle exchange network and a feedback path. It is indicated that by using such a network any permutation of size N can be done with 2 log₂ N - 1 iterations of either the shuffle exchange or inverse shuffle exchange network, where N is the SIMD width.
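For illustration, one iteration of a shuffle exchange network can be modelled as a perfect shuffle of the elements followed by a conditional exchange of adjacent pairs, with the inverse network undoing those two steps in the opposite order. The function names, the per-pair control bits and the ordering of the steps within the inverse network are assumptions; the cited article should be consulted for the actual SODA network.

```python
def shuffle_exchange(elems, swap_bits):
    """Perfect shuffle of the two halves, then conditional adjacent swaps."""
    half = len(elems) // 2
    out = [None] * len(elems)
    for i in range(half):
        out[2 * i] = elems[i]             # element from the lower half
        out[2 * i + 1] = elems[half + i]  # element from the upper half
    for pair in range(half):
        if swap_bits[pair]:
            out[2 * pair], out[2 * pair + 1] = out[2 * pair + 1], out[2 * pair]
    return out


def inverse_shuffle_exchange(elems, swap_bits):
    """Conditional adjacent swaps, then the inverse of the perfect shuffle."""
    out = list(elems)
    for pair in range(len(elems) // 2):
        if swap_bits[pair]:
            out[2 * pair], out[2 * pair + 1] = out[2 * pair + 1], out[2 * pair]
    # Even positions return to the lower half, odd positions to the upper half.
    return out[0::2] + out[1::2]


data = list(range(8))
bits = [1, 0, 0, 1]  # hypothetical control bits, one per adjacent pair
once = shuffle_exchange(data, bits)
restored = inverse_shuffle_exchange(once, bits)
```

Feeding the output of one iteration back in as the input of the next models the feedback path, so a longer permutation is built up over repeated iterations with different control bits.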
Whilst the above-described permutation networks enable a wide variety of different re-orderings to be achieved, a significant number of instructions are required to perform any particular re-ordering, and accordingly this has an adverse effect on the size of the SIMD code.
GB-A-2,409,064 describes the provision of various permutation instructions, for example an interleave instruction, where the instruction itself can identify a lane size and a data element size, with the lane size being a multiple of the data element size. When such an instruction is decoded, control signals are sent to the processing logic to define based on the lane size a number of lanes of parallel processing in at least one of the registers, and then to perform in parallel a data processing operation on the data elements within each of those lanes. By such an approach, the same basic instruction can be used to perform different data manipulations, dependent on the lane size or data element size specified.
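The effect of specifying both a lane size and a data element size can be sketched as follows. This is a hypothetical model only and is not the instruction defined in GB-A-2,409,064: the register is represented as a list of byte-sized units, both sizes are given in those units, and within each lane the elements of the lower half are interleaved with those of the upper half.

```python
def interleave(reg, lane_size, elem_size):
    """Interleave the two halves of every lane, moving whole data elements.

    reg is a list of byte-sized units; lane_size and elem_size are counts of
    those units (a hypothetical encoding chosen for this sketch).
    """
    if lane_size % elem_size != 0:
        raise ValueError("lane size must be a multiple of the element size")
    if len(reg) % lane_size != 0:
        raise ValueError("register must hold a whole number of lanes")
    per_lane = lane_size // elem_size
    if per_lane % 2 != 0:
        raise ValueError("each lane must hold an even number of elements")
    half = per_lane // 2
    out = []
    for base in range(0, len(reg), lane_size):
        lane = reg[base:base + lane_size]
        elems = [lane[i:i + elem_size] for i in range(0, lane_size, elem_size)]
        for low, high in zip(elems[:half], elems[half:]):
            out.extend(low)
            out.extend(high)
    return out


reg = list(range(8))
one_lane = interleave(reg, lane_size=8, elem_size=1)
two_lanes = interleave(reg, lane_size=4, elem_size=1)
wide_elems = interleave(reg, lane_size=8, elem_size=2)
```

The same function yields three different data manipulations purely from its size parameters, which is the sense in which one basic instruction can serve several purposes.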
Whilst this can lead to some efficiencies in performing certain permutations, any particular instantiation of the instruction within the code will only perform one particular type of permutation. Accordingly, when performing complex operations where sequences of permutations may need to be performed at certain points, it is still necessary to provide separate instructions in the code to perform each particular permutation, and as a result there is an adverse impact on the code density achieved. For example, when performing a Fast Fourier Transform (FFT) process, each stage of the FFT process will typically require a particular data manipulation to be performed, followed by a particular FFT computation, and accordingly separate pieces of code will be required for each stage of the FFT process to define the required manipulations and subsequent FFT computations.
Freescale's AltiVec and Intel's Wireless MMX (WMMX) instruction sets provide for performing an unaligned load, which from a programmer's perspective has the effect of loading two SIMD registers with aligned data, and then applying a rotation to the data of the two registers joined together based on the low order bits of an address stored in scalar logic, such that the rotation produces the required data at the unaligned address. The input of the address bits from the scalar side of the machine can hence in this limited situation be used to control the operation on the SIMD side of the machine. Irrespective of the address bits provided, the individual data elements manipulated are always of the same size.
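The programmer's view of such an unaligned load can be sketched as below. The 16-byte register width, the function name and the use of a Python `bytes` object as memory are assumptions for illustration; no actual AltiVec or WMMX instruction is modelled.

```python
WIDTH = 16  # assumed SIMD register width in bytes


def unaligned_load(memory, addr):
    """Return WIDTH bytes starting at addr using two aligned loads and a rotation."""
    base = addr & ~(WIDTH - 1)   # aligned address at or below addr
    shift = addr & (WIDTH - 1)   # low order address bits select the rotation
    low = memory[base:base + WIDTH]
    high = memory[base + WIDTH:base + 2 * WIDTH]
    joined = low + high          # the two registers joined together
    rotated = joined[shift:] + joined[:shift]
    return rotated[:WIDTH]


memory = bytes(range(64))
loaded = unaligned_load(memory, 5)
```

Only the rotation amount depends on the scalar address bits; the size of the units being moved is fixed, which corresponds to the limitation noted at the end of the paragraph above.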
It would be desirable to provide a mechanism for significantly improving SIMD code density for certain algorithms within a SIMD data processing system.