1. Field of the Invention
The present invention relates to a data processing apparatus and method. Embodiments of the present invention relate to a data processing apparatus and method operable to perform permute operations.
2. Description of the Prior Art
Permute operations are known. A permute operation typically takes a sequence of data elements and reorders, or permutes, the data elements to create a new sequence.
For example, as shown in FIG. 1, a sequence of consecutive data elements 0 to 15 is provided. A permute unit 10 is provided which performs a permute operation on the data elements in response to a permute instruction. Such an instruction is typically supported by a vector or single instruction multiple data (SIMD) data processing apparatus in order to support transformation between arrays of structures (AoS) and structures of arrays (SoA). UK patent application 2,409,063 filed on 9 Dec. 2003 by ARM Limited describes examples of such permute instructions.
The sequence of data elements is split into a number of groups of data elements A0 to A3. Each group contains a fixed number of data elements. In this example, each group contains four 16-bit data elements. As illustrated in FIG. 1, group A0 contains data elements 0 to 3; group A1 contains data elements 4 to 7; group A2 contains data elements 8 to 11; and group A3 contains data elements 12 to 15. Each group of data elements is provided to the permute unit 10 and is buffered therein. Once all four groups of data elements have been provided to the permute unit 10, the permute unit 10 will output the buffered data elements as permuted groups in which the data elements are provided in a revised order. It will be appreciated that the permuted groups cannot be output until all groups of data elements have been received by the permute unit 10. In this example, four permuted groups B0 to B3 are output, each of which contains one data element from each of the groups A0 to A3 provided to the permute unit 10. Accordingly, permuted group B0 contains data elements 0, 4, 8 and 12; permuted group B1 contains data elements 1, 5, 9 and 13; permuted group B2 contains data elements 2, 6, 10 and 14; and permuted group B3 contains data elements 3, 7, 11 and 15.
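The reordering of FIG. 1 can be illustrated in software as a transposition of the four groups. The following is a minimal sketch only; the function name permute_groups and the list-based representation are assumptions made for illustration and do not describe the hardware of the permute unit 10.

```python
def permute_groups(groups):
    """Return permuted groups: output j collects element j of every input group.

    Hypothetical helper for illustration; it models only the reordering,
    not the buffering or timing of the permute unit.
    """
    width = len(groups[0])
    if any(len(g) != width for g in groups):
        raise ValueError("all groups must hold the same number of data elements")
    return [[g[j] for g in groups] for j in range(width)]


# Data elements 0 to 15, split into groups A0 to A3 of four elements each.
elements = list(range(16))
A = [elements[i:i + 4] for i in range(0, 16, 4)]

# Permuted groups B0 to B3.
B = permute_groups(A)
```

In this sketch B[0] holds data elements 0, 4, 8 and 12, matching permuted group B0 as described above. Note that every input group must be available before any output group can be formed, which mirrors the buffering requirement of the permute unit 10.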
FIG. 2 illustrates in more detail a known arrangement of the permute unit 10. The permute unit 10 comprises a register R1 which receives each group of data elements A0 to A3. Each group of data elements is provided to permute logic 12, which distributes the data elements within each group across the registers A to D. Once all the groups of data elements have been distributed across the registers A to D, a multiplexer 14 reads any one of the registers A to D and provides its content to the register E1.
Hence, in response to the instruction VLD 4.16, in a first clock cycle t0, the register R1 will contain the first group of data elements A0. Also in that cycle, any data previously held in one of the registers A to D may be output to the register E1.
In the next cycle, t1, the group of data elements A0 is distributed from the register R1 across the registers A to D. A similar process continues in the next three clock cycles, t2 to t4, until registers A to D are full. Hence, in clock cycles t1 to t4 no data elements are provided to the register E1 because the registers A to D are being filled.
However, in clock cycle t5, the contents of the register A (which contains the permuted group of data elements B0) are provided via the multiplexer 14 to the register E1. Thereafter, in clock cycle t6, the contents of the register B (which contains the permuted group of data elements B1) are provided to the register E1. In clock cycle t7, the contents of the register C (which contains the permuted group of data elements B2) are provided to the register E1. In clock cycle t8, the contents of the register D (which contains the permuted group of data elements B3) are provided to the register E1.
Hence, in clock cycles t4 to t7 no data elements can be received by the permute logic 12 because the contents of the registers A to D are being emptied.
However, in clock cycle t8, the first group of data elements A0′ associated with a following permute instruction can be provided from the register R1 to the permute logic 12.
Thereafter, the operation during clock cycles t9 to t13 will be analogous to that during clock cycles t1 to t5.
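The cycle-by-cycle behaviour described above can be tabulated with a rough software model. The sketch below assumes the cycle labels used in this description (first group in the register R1 at t0, distribution in t1 to t4, output to the register E1 in t5 to t8, next instruction starting at t8); the function name timeline and the derived figure output_share are hypothetical and are introduced only for illustration.

```python
GROUPS = 4  # groups per permute instruction, as in the VLD 4.16 example


def timeline(instr):
    """Cycle numbers for the instr-th back-to-back permute instruction.

    Assumption: a single bank of registers A to D is filled and then
    emptied, so each instruction occupies a period of 2 * GROUPS cycles.
    """
    base = instr * 2 * GROUPS
    return {
        # Cycle in which R1 first holds group A0 of this instruction.
        "r1_first_group": base,
        # Cycles in which groups are distributed across registers A to D.
        "fill": list(range(base + 1, base + 1 + GROUPS)),
        # Cycles in which registers A to D are output to register E1.
        "drain": list(range(base + 1 + GROUPS, base + 1 + 2 * GROUPS)),
    }


first, second = timeline(0), timeline(1)

# Fraction of cycles in which E1 receives a permuted group (inferred).
output_share = GROUPS / (second["r1_first_group"] - first["r1_first_group"])
```

Under these assumptions the register E1 receives a permuted group in only four of every eight cycles. That figure is an inference from the timing described above rather than a value stated for the known arrangement.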
Whilst the described permute unit enables a permute operation to be performed, the performance of that permute unit is less than optimal.
Accordingly, it is desired to provide an improved technique for performing a permute operation.