To improve the efficiency of multimedia applications, as well as other applications with similar characteristics, a Single Instruction, Multiple Data (SIMD) architecture has been implemented in computer systems to enable one instruction to operate on several operands simultaneously, rather than on a single operand. In particular, SIMD architectures take advantage of packing many data elements within one register or memory location. With parallel hardware execution, multiple operations can be performed on separate data elements with one instruction, resulting in significant performance improvement.
One set of SIMD instructions was defined for the Pentium® Processor with MMX™ Technology by Intel® Corporation and described in “IA-32 Intel Architecture Software Developer's Manual Volume 2: Instruction Set Reference,” which is available from Intel Corporation, Santa Clara, Calif. on the world-wide-web (www) at intel.com/design/litcentr.
Currently, the SIMD addition operation only performs “vertical” or inter-register addition, where pairs of data elements, for example, a first element Xn (where n is an integer) from one operand, and a second element Yn from a second operand, are added together. An example of such a vertical addition operation is shown in FIG. 1, where the instruction is performed on the sets of data elements (X3, X2, X1, and X0) and (Y3, Y2, Y1, and Y0), accessed as Source 1 and Source 2, respectively, to obtain the result (X3+Y3, X2+Y2, X1+Y1, and X0+Y0).
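By way of illustration only, the vertical addition of FIG. 1 may be modeled in scalar C as follows. This is a minimal sketch and not an implementation taken from the cited manual: the function name vertical_add, the choice of four lanes, and the use of 32-bit unsigned elements (so that each lane wraps modulo 2^32 on overflow) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4 /* assumed: four data elements per operand, as in FIG. 1 */

/* Scalar model of a packed "vertical" add (hypothetical helper).
 * Lane n of the result is src1[n] + src2[n]; lane 0 stands for X0/Y0.
 * Unsigned arithmetic is used so that each lane wraps on overflow
 * instead of invoking undefined behavior. */
void vertical_add(uint32_t dst[LANES],
                  const uint32_t src1[LANES],
                  const uint32_t src2[LANES])
{
    assert(dst != NULL && src1 != NULL && src2 != NULL);
    for (size_t n = 0; n < LANES; ++n) {
        dst[n] = src1[n] + src2[n]; /* each lane is independent of the others */
    }
}
```

Each result lane depends only on the same-numbered lane of each source, which is why elements that sit in the same register cannot be added to one another without first being rearranged.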
Although many applications currently in use can take advantage of such a vertical add operation, there are a number of important applications that require the data elements to be rearranged before the vertical add operation can be applied.
For example, an 8-point decimation in time operation of a Walsh-Hadamard transform and of a Fast-Fourier Transform (FFT) is shown in FIG. 2b. The larger 8-point transforms may be performed in stages through successive doubling. That is to say, an 8-point transform can be computed from two 4-point transforms, each of which can in turn be computed from two 2-point transforms, for a total of four 2-point transforms. The computations at each stage are called butterflies.
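The successive doubling described above may be sketched, for the Walsh-Hadamard case, as a short recursive routine in C. The sketch is illustrative only: the name wht_doubling, the 8-point limit, the 32-bit signed element type, and the natural (Hadamard) ordering of the outputs are assumptions, and the ordering drawn in FIG. 2b may differ. Inputs are assumed small enough that the sums do not overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_POINTS 8 /* assumed upper bound for this sketch */

/* Hypothetical helper: n-point Walsh-Hadamard transform of
 * in[0], in[stride], in[2*stride], ... written to out[0..n-1].
 * An n-point transform is built from two n/2-point transforms,
 * one over the even-position inputs and one over the odd-position inputs. */
void wht_doubling(int32_t *out, const int32_t *in, size_t n, size_t stride)
{
    assert(n >= 1 && n <= MAX_POINTS && (n & (n - 1)) == 0);

    if (n == 1) {
        out[0] = in[0]; /* a 1-point transform is the identity */
        return;
    }

    int32_t even[MAX_POINTS / 2];
    int32_t odd[MAX_POINTS / 2];
    wht_doubling(even, in, n / 2, 2 * stride);          /* even positions */
    wht_doubling(odd, in + stride, n / 2, 2 * stride);  /* odd positions  */

    /* Butterflies: one sum and one difference per even/odd pair. */
    for (size_t j = 0; j < n / 2; ++j) {
        out[2 * j]     = even[j] + odd[j];
        out[2 * j + 1] = even[j] - odd[j];
    }
}
```

Calling wht_doubling(out, in, 8, 1) computes the 8-point transform from two 4-point transforms, each of which is computed from two 2-point transforms. Every combining step pairs an element derived from even positions with one derived from odd positions.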
A butterfly for the staged computations of FIG. 2b is shown in FIG. 2a. At each successive stage, data elements at even positions are combined with the data elements at odd positions to generate data elements of the next successive stage. In order to perform these staged computations using prior-art SIMD vertical additions and vertical subtractions, instructions substantially similar to the instruction sequence example of Table 1 may be used to shuffle and rearrange data elements for each stage.
TABLE 1
Exemplary Code To Prepare Data for Vertical-Add/Vertical-Subtract Operations:

movdqa     xmm7, [esi]   // shuffle pattern to put even elements in low half, odd in high half
pshufb     xmm0, xmm7    // shuffle data in xmm0
pshufb     xmm1, xmm7    // shuffle data in xmm1
movdqa     xmm2, xmm0    // copy xmm0 data
punpcklqdq xmm0, xmm1    // combine even elements of xmm0 and xmm1. xmm1 in high half.
punpckhqdq xmm2, xmm1    // combine odd elements of xmm2 (equal to xmm0) and xmm1.
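The data movement performed by the sequence of Table 1, followed by the vertical add and vertical subtract that it prepares for, may be modeled in scalar C as shown below. This is a sketch under stated assumptions rather than a translation of the listing: registers are modeled as arrays of four 32-bit signed elements (the element width is not given by Table 1), and the names even_low_odd_high and butterfly_stage_vertical are hypothetical. Inputs are assumed small enough that the sums and differences do not overflow.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 4          /* assumed: four 32-bit elements per register */
#define HALF (LANES / 2)

/* Models the shuffle step: even-position elements move to the low half
 * of the register, odd-position elements to the high half. */
static void even_low_odd_high(int32_t r[LANES])
{
    int32_t t[LANES];
    for (size_t i = 0; i < HALF; ++i) {
        t[i]        = r[2 * i];      /* even positions */
        t[HALF + i] = r[2 * i + 1];  /* odd positions  */
    }
    for (size_t i = 0; i < LANES; ++i) {
        r[i] = t[i];
    }
}

/* Models the rearrangement of Table 1 followed by a vertical add and a
 * vertical subtract over the rearranged registers (hypothetical helper). */
void butterfly_stage_vertical(int32_t sum[LANES], int32_t diff[LANES],
                              const int32_t a[LANES], const int32_t b[LANES])
{
    int32_t x0[LANES], x1[LANES], evens[LANES], odds[LANES];

    for (size_t i = 0; i < LANES; ++i) {
        x0[i] = a[i];
        x1[i] = b[i];
    }
    even_low_odd_high(x0); /* shuffle data in the first register  */
    even_low_odd_high(x1); /* shuffle data in the second register */

    for (size_t i = 0; i < HALF; ++i) {
        evens[i]        = x0[i];         /* low halves combined:  */
        evens[HALF + i] = x1[i];         /* evens of both inputs  */
        odds[i]         = x0[HALF + i];  /* high halves combined: */
        odds[HALF + i]  = x1[HALF + i];  /* odds of both inputs   */
    }

    for (size_t n = 0; n < LANES; ++n) {
        sum[n]  = evens[n] + odds[n];  /* vertical add      */
        diff[n] = evens[n] - odds[n];  /* vertical subtract */
    }
}
```

In this model, everything before the final loop exists only to line up even-position and odd-position elements in matching lanes. Only the final loop performs the arithmetic of the butterflies.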
One drawback of this approach is that it requires additional processing time to perform the operations that shuffle and recombine the data elements between stages. Another drawback is that an additional register is used to hold a shuffle pattern for sorting the even elements into the low half of the register and the odd elements into the high half of the register. A third drawback is that the extra instructions required to rearrange data between stages reduce the code density and require more storage in memory and in cache.
Accordingly, there is a need in the technology for providing an apparatus and method which more efficiently performs butterfly computations, such as those used in 2-point or 4-point transforms for example, without requiring additional time to perform operations that shuffle and recombine data elements. There is also a need in the technology for a method and operation for increasing code density by eliminating the necessity for the rearrangement of data elements and thereby eliminating the corresponding rearrangement operations from the code. By eliminating the necessity for the rearrangement of data elements, an additional register could also be made available that might otherwise have been used to store patterns for shuffling the odd and even data elements.