Data processing applications, such as signal processing applications, typically require data rearrangement to be performed at high data rates. When the data processing itself is sufficiently accelerated, for example by using a single instruction multiple data (SIMD) data processing unit, the data rearrangements can become a bottleneck in performing the computations.
In SIMD processors, a data processing operation performed upon one or more specified registers results in parallel operations being performed upon multiple data elements stored within the data registers. The SIMD approach hence allows the data processing operation to be performed in a number of parallel lanes of processing on respective data elements from a source register so as to generate respective data elements for storing within a destination register.
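By way of illustration only, the lane-based behaviour described above can be modelled in a few lines of Python. This is a minimal sketch in which a register is assumed to be a plain list of data elements; the name simd_add is hypothetical and does not correspond to any instruction of a real instruction set.

```python
def simd_add(src_a, src_b):
    """Model one SIMD add: lane i of the result depends only on lane i of each source."""
    if len(src_a) != len(src_b):
        raise ValueError("source registers must have the same number of lanes")
    # Each lane is processed independently; hardware would do this in parallel.
    return [a + b for a, b in zip(src_a, src_b)]


# Two four-lane source registers (illustrative values only).
reg_a = [1, 2, 3, 4]
reg_b = [10, 20, 30, 40]
dest = simd_add(reg_a, reg_b)
```

In this model a single call stands in for a single instruction, and the result list stands in for the destination register.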
An example of a signal processing application where the use of such SIMD processing is beneficial is in the processing of radio signals. In signal processing applications such as those used in radio standards, it is commonly required to perform rearrangement operations in the form of de-interleave operations and interleave operations. These operations are performed in order to separate two or more channels in the case of de-interleaving or to combine two or more channels in the case of interleaving. One example is the separation of data channels from pilot channels (i.e. control channels). Another area where such interleave and de-interleave operations are required is in the area of graphics processing, and again it is common for such graphics processing applications to be run within a SIMD processor.
In applications such as radio and graphics, there is often a need to perform a three way interleave or de-interleave operation. For example, in graphics, such operations may be used to interleave or de-interleave x, y and z components. However, processors typically execute instructions which operate on two source registers, and hence interleave and de-interleave instructions typically specify interleaving and de-interleaving operations that are two way, or a multiple thereof such as four way or eight way. For example, the NEON instruction set developed by ARM Limited, Cambridge, United Kingdom, provides zip and unzip instructions for performing two way interleave and de-interleave operations (ZIP is a name of an interleave instruction and UNZIP is a name of a de-interleave instruction). However, it is very difficult to synthesise a three way interleave or de-interleave using such operations. In particular, whilst it is possible to construct code sequences that involve multiple uses of such instructions along with extract/shift instructions in order to perform a three way interleave or de-interleave in respect of data elements stored in registers, such an approach takes many clock cycles, having a significant impact on performance.
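The difference between two way and three way rearrangements can be sketched as follows. This is an illustrative Python model only: registers are assumed to be lists, the function names are hypothetical, and the functions describe the data movement rather than the encoding or register usage of any actual zip or unzip instruction.

```python
def zip2(a, b):
    """Two way interleave: a0 b0 a1 b1 ..."""
    out = []
    for x, y in zip(a, b):
        out.extend((x, y))
    return out


def unzip2(v):
    """Two way de-interleave: even-indexed elements, then odd-indexed elements."""
    return v[0::2], v[1::2]


def interleave4(a, b, c, d):
    """Four way interleave built purely from two way interleaves."""
    return zip2(zip2(a, c), zip2(b, d))


def interleave3(x, y, z):
    """Three way interleave: x0 y0 z0 x1 y1 z1 ..."""
    out = []
    for triple in zip(x, y, z):
        out.extend(triple)
    return out


def deinterleave3(v):
    """Three way de-interleave: every third element, starting at offsets 0, 1 and 2."""
    return v[0::3], v[1::3], v[2::3]


xs, ys, zs = ["x0", "x1"], ["y0", "y1"], ["z0", "z1"]
packed = interleave3(xs, ys, zs)
```

The four way case composes directly from two way operations, whereas the stride of three in the three way case has no such decomposition into power-of-two steps, which is why the sequences mentioned above need additional extract/shift operations.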
Some data processing systems provide generically programmable hardware forming a programmable permute network to enable arbitrary permutations to be performed. One or more control instructions are then executed to set up programmable masks whereafter selected permutations can be performed under the control of further instructions. Commonly owned, co-pending, U.S. patent application Ser. No. 12/314,760, the entire contents of which are hereby incorporated by reference, describes one arrangement of such a permute network.
The following papers describe general bit permutation instructions:
1. Dimitrakopoulos et al., “Sorter based permutation units for media-enhanced microprocessors”, IEEE Tran on VLSI systems, Vol 15, No 6, Jun. 2007, pp 711-715.
2. Y. Hilewitz, Z. J. Shi and R. B. Lee, “Comparing fast implementations of bit permutation instructions”, Proc. Asilomar Conf. Signals, Syst. Comput., 2004, pp 1856-1863.
3. Z. J. Shi and R. B. Lee, “Implementation complexity of bit permutation instructions”, Proc. Asilomar Conf. Signals, Syst. Comput., 2003, pp 879-886.
4. X. Yang, M. Vachharajani, R. B. Lee, “Fast Subword Permutation Instructions Based on Butterfly Networks”, Proc SPIE, Media Proc 2000, Jan. 27-28, 2000, San Jose, pp 80-86.
These articles generally concern the programming of permute networks to perform general permutations.
Whilst appropriate programming of a generic permute network would allow three way interleave and de-interleave operations to be performed, such an approach requires a significant amount of hardware not only to provide the programmable permute network itself, but also in respect of the registers required to store masks, etc. Several instructions are needed to program the permute network for particular operations, and accordingly there is significant overhead in using such an approach to perform the required three way interleaving and de-interleaving operations that signal processing applications such as radio and graphics require.
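The overhead of the generic approach can be pictured with a simple software model. In the sketch below, which is an assumption made for illustration and not a description of any particular permute network, a table of source indices stands in for the programmed masks, and building that table stands in for the set-up instructions that must run before the permutation itself.

```python
def deinterleave3_indices(n):
    """Build the control table for a three way de-interleave of n elements.

    Stands in for the mask set-up step; n is assumed to be a multiple of 3.
    """
    if n % 3 != 0:
        raise ValueError("element count must be a multiple of 3")
    return [i for phase in range(3) for i in range(phase, n, 3)]


def permute(src, indices):
    """Apply an arbitrary permutation: output position k takes src[indices[k]]."""
    return [src[i] for i in indices]


src = ["x0", "y0", "z0", "x1", "y1", "z1"]
table = deinterleave3_indices(len(src))   # set-up cost paid before any data moves
result = permute(src, table)
```

The table must be held somewhere for as long as the permutation is in use, which corresponds to the mask registers referred to above.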
The NEON instruction set developed by ARM Limited, Cambridge, United Kingdom, provides load and store instructions which can perform rearrangement operations as the data is loaded from memory into a register, or stored from a register back to memory. One particular instantiation of those load and store instructions can provide three way interleaving and de-interleaving. Execution of these instructions causes the data elements within multiple registers to be transferred between memory and the register bank in a sequential manner. These instructions can be used to achieve a three way interleave or de-interleave on register contents, but only with high latency, due to the latency of the transfers to and from memory. For example, a rearranging store instruction could be executed to cause a three way de-interleave of the data elements in multiple registers as they are stored from those registers out to memory. A normal load instruction could then be used to retrieve the rearranged data elements back into the registers, thereby resulting in a three way de-interleave having been performed on the content of those registers. However, where a rearrangement within registers is needed in the course of an algorithm, this approach incurs considerable power and cycle cost, because the data must make a round trip via memory.
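The round trip via memory described in this example can be modelled as below. This is a hedged sketch only: memory is assumed to be a Python list, the functions rearranging_store3 and plain_load are hypothetical names, and the model follows the sequence of steps in the example rather than the exact semantics of any NEON load or store instruction.

```python
def rearranging_store3(memory, base, regs):
    """Hypothetical store: de-interleave the concatenated registers while writing to memory."""
    flat = [e for reg in regs for e in reg]           # register contents in order
    out = flat[0::3] + flat[1::3] + flat[2::3]        # all x, then all y, then all z
    memory[base:base + len(out)] = out


def plain_load(memory, base, lanes, count):
    """Ordinary load: fill `count` registers of `lanes` elements from consecutive memory."""
    return [memory[base + k * lanes: base + (k + 1) * lanes] for k in range(count)]


# Three four-lane registers holding interleaved x, y, z components.
regs = [
    ["x0", "y0", "z0", "x1"],
    ["y1", "z1", "x2", "y2"],
    ["z2", "x3", "y3", "z3"],
]
memory = [None] * 12
rearranging_store3(memory, 0, regs)                   # first trip: registers to memory
regs_out = plain_load(memory, 0, lanes=4, count=3)    # second trip: memory to registers
```

Every element crosses the memory interface twice in this model, which is the source of the latency and energy cost noted above.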
Accordingly, it would be desirable to provide an improved technique for performing three way interleave and de-interleave operations on register contents, or more generally n-way interleave and de-interleave operations, where n is a plural odd number.