A wide range of applications today, from audio and video signal processing and multimedia compression to automotive collision detection, use discrete transforms of a signal in their algorithms. Such discrete transforms, including, for example, the discrete cosine transform and the discrete Fourier transform, often need to be performed in real time at data rates in excess of tens of megabits per second, which demands not only high clock rates and fast processors, but also efficiency in the transform computations and in the data handling by such processors. Discrete transform operations can often be computed efficiently by using the Fast Fourier Transform (FFT), which comes in two basic “flavors”, namely decimation-in-time (Cooley-Tukey) and decimation-in-frequency (Sande-Tukey). Both flavors of the FFT include a so-called “butterfly” computation as a basic computational element. Butterfly computations are also used in other transforms (e.g., Walsh-Hadamard) and in Viterbi encoding/decoding algorithms. Hence, efficient execution of butterfly computations in the processing hardware has considerable value in numerous applications.
A basic butterfly computation involves both addition and subtraction of the real and imaginary components of complex operands. For example, in the decimation-in-time FFT variant, representative pseudo-code for performing one butterfly operation with complex values a, b, ci, A and B is given as follows, where Re( ) and Im( ) represent the respective real and imaginary components of a complex value:
    Re(tmp) := Re(b)Re(ci) − Im(b)Im(ci);
    Im(tmp) := Re(b)Im(ci) + Im(b)Re(ci);
    Re(A) := Re(a) + Re(tmp);
    Re(B) := Re(a) − Re(tmp);
    Im(A) := Im(a) + Im(tmp);
    Im(B) := Im(a) − Im(tmp);
From this computation we can see that there are two occurrences in which both an addition and a subtraction are performed upon the same pair of input operands: once upon Re(a) and Re(tmp), and once upon Im(a) and Im(tmp).
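As an illustration only, the butterfly pseudo-code above can be transcribed into a short Python sketch. The function name butterfly_dit is hypothetical and does not appear in the source. The sketch uses floating-point complex values for readability, whereas the hardware discussed here operates on fixed-point half-words.

```python
def butterfly_dit(a, b, ci):
    """One decimation-in-time butterfly on complex inputs a, b and ci.

    Mirrors the pseudo-code term by term; the name and the use of
    Python complex numbers are illustrative assumptions.
    """
    # tmp = b * ci, written out in real and imaginary components
    tmp_re = b.real * ci.real - b.imag * ci.imag
    tmp_im = b.real * ci.imag + b.imag * ci.real

    # The same operand pair (a, tmp) is both added and subtracted,
    # once for the real parts and once for the imaginary parts.
    A = complex(a.real + tmp_re, a.imag + tmp_im)
    B = complex(a.real - tmp_re, a.imag - tmp_im)
    return A, B
```

For example, with a = 1+2j, b = 3+4j and ci = 1j, the product tmp is −4+3j, so the sketch returns A = −3+5j and B = 5−1j.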
If the precision of the fixed-point operands used in a computation is half the microprocessor's word length, and if the microprocessor's ALU supports single-instruction, multiple-data (SIMD) instructions for operating upon packed half-words, then the microprocessor might be used to perform both addition and subtraction in one operation. For example, the ARM11 processor, provided by ARM Limited (incorporated in the United Kingdom), has instructions that can perform half-word addition and subtraction at the same time upon packed data. Thus, the instructions SADDSUBX Rd, Rn, Rm and UADDSUBX Rd, Rn, Rm carry out respective signed and unsigned versions of:
    Rd[31:16] := Rn[31:16] + Rm[15:0] and
    Rd[15:0] := Rn[15:0] − Rm[31:16].
Likewise, the instructions SSUBADDX Rd, Rn, Rm and USUBADDX Rd, Rn, Rm carry out respective signed and unsigned versions of:
    Rd[31:16] := Rn[31:16] − Rm[15:0] and
    Rd[15:0] := Rn[15:0] + Rm[31:16].
However, these instructions cannot perform the add-subtract operation of a butterfly operation unless both of the half-word operands are packed in the same register, which requires extra processing.
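The half-word data flow of these instructions can be modeled with a minimal Python sketch. The helper names pack, unpack_signed, addsubx and subaddx are hypothetical. The sketch assumes 32-bit registers held as non-negative Python integers and models only the 16-bit wrapped results, not any status flags, so it does not distinguish the signed and unsigned variants.

```python
MASK16 = 0xFFFF

def pack(hi, lo):
    """Pack two 16-bit half-words into one 32-bit word (hi in bits 31:16)."""
    return ((hi & MASK16) << 16) | (lo & MASK16)

def unpack_signed(word):
    """Split a 32-bit word into two signed 16-bit half-words (hi, lo)."""
    def s16(v):
        return v - 0x10000 if v & 0x8000 else v
    return s16((word >> 16) & MASK16), s16(word & MASK16)

def addsubx(rn, rm):
    """Data flow of SADDSUBX/UADDSUBX: add in the top half, subtract in the bottom."""
    hi = (((rn >> 16) & MASK16) + (rm & MASK16)) & MASK16          # Rn[31:16] + Rm[15:0]
    lo = ((rn & MASK16) - ((rm >> 16) & MASK16)) & MASK16          # Rn[15:0] - Rm[31:16]
    return (hi << 16) | lo

def subaddx(rn, rm):
    """Data flow of SSUBADDX/USUBADDX: subtract in the top half, add in the bottom."""
    hi = (((rn >> 16) & MASK16) - (rm & MASK16)) & MASK16          # Rn[31:16] - Rm[15:0]
    lo = ((rn & MASK16) + ((rm >> 16) & MASK16)) & MASK16          # Rn[15:0] + Rm[31:16]
    return (hi << 16) | lo
```

Under this model, addsubx(pack(10, 20), pack(3, 4)) unpacks to (14, 17): the top half of Rn is combined with the bottom half of Rm, and vice versa. Obtaining both a+t and a−t for a single pair of half-words therefore needs the operands arranged beforehand, for instance addsubx(pack(10, 10), pack(3, 3)), which unpacks to (13, 7). This is one reading of the extra processing mentioned above.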
U.S. Patent Application Publication No. 2004/0078404 (Macy et al.) describes a processor that can perform, among a number of operations, a horizontal or intra-add-subtract operation on four packed data elements (x3, x2, x1, x0) of a first operand and four packed data elements (y3, y2, y1, y0) of a second operand to produce a result comprising the four packed data elements (y2+y3, y1−y0, x2+x3, x1−x0), or alternatively, (y2−y3, y1+y0, x2−x3, x1+x0), in order that the 8-point decimation-in-time Walsh-Hadamard transform may be efficiently computed. Computation of fast Fourier transforms is also suggested in combination with a SIMD multiplication operation.
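The element-wise effect of the horizontal add-subtract operation described above can be sketched in Python as follows. The function names are hypothetical. Each operand is passed as a tuple in the order (e3, e2, e1, e0), and element width and overflow behavior are ignored, since the description above does not specify them.

```python
def horizontal_addsub(x, y):
    """Return (y2+y3, y1-y0, x2+x3, x1-x0) for x = (x3, x2, x1, x0), y = (y3, y2, y1, y0)."""
    x3, x2, x1, x0 = x
    y3, y2, y1, y0 = y
    return (y2 + y3, y1 - y0, x2 + x3, x1 - x0)

def horizontal_subadd(x, y):
    """Return the alternative form (y2-y3, y1+y0, x2-x3, x1+x0)."""
    x3, x2, x1, x0 = x
    y3, y2, y1, y0 = y
    return (y2 - y3, y1 + y0, x2 - x3, x1 + x0)
```

With x = (4, 3, 2, 1) and y = (8, 7, 6, 5), the first form yields (15, 1, 7, 1) and the alternative yields (−1, 11, −1, 3). Both combine neighboring elements within a single operand rather than corresponding elements of the two operands.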
U.S. Pat. No. 6,754,687 (Kurak, Jr. et al.) describes a processing system for efficiently computing inverse discrete cosine transforms upon two-dimensional data matrices. The computation includes performing butterfly (BFLYS) instructions, which comprise separate add and subtract operations upon either quad half-word data (four packed 16-bit operands) or dual word data (two 32-bit operands).