It is known to provide data processing systems incorporating both a main processor and a coprocessor. In some systems, one or more different coprocessors may be provided with a main processor. In this case, the different coprocessors can be distinguished by different coprocessor numbers.
A coprocessor instruction encountered in the instruction data stream of the main processor is issued on a bus coupled to the coprocessor. The one or more coprocessors (that each have an associated hardwired coprocessor number) attached to the bus examine the coprocessor number field of the instruction to determine whether or not they are the target coprocessor for that instruction. If they are the target coprocessor, then they issue an accept signal to the main processor. If the main processor does not receive an accept signal, then it can enter an exception state to deal with the undefined instruction.
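The accept-signal handshake described above can be sketched in software as follows. This is only an illustrative model, not a description of any particular hardware: the field position and width (`CP_NUM_SHIFT`, `CP_NUM_MASK`), the `Coprocessor` class, and the `issue` function are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical encoding: a 4-bit coprocessor number field at bits 11:8.
CP_NUM_SHIFT = 8
CP_NUM_MASK = 0xF


@dataclass
class Coprocessor:
    number: int  # hardwired coprocessor number

    def accepts(self, instruction: int) -> bool:
        """Model the accept signal: True if the instruction targets this coprocessor."""
        field = (instruction >> CP_NUM_SHIFT) & CP_NUM_MASK
        return field == self.number


class UndefinedInstruction(Exception):
    """Stands in for the main processor's undefined-instruction exception state."""


def issue(instruction: int, bus: List[Coprocessor]) -> Coprocessor:
    """Issue an instruction on the bus and return the coprocessor that accepts it."""
    for cp in bus:
        if cp.accepts(instruction):
            return cp
    # No accept signal was received from any attached coprocessor.
    raise UndefinedInstruction(hex(instruction))
```

In this model, an instruction whose coprocessor number field matches no attached coprocessor raises `UndefinedInstruction`, mirroring the exception state entered by the main processor.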
One type of instruction may perform operations on packed data. Such instructions may be referred to as Single-Instruction-Multiple-Data (SIMD) instructions. One set of SIMD instructions was defined for the Pentium® Processor with MMX™ Technology by Intel® Corporation and is described in “IA-32 Intel Architecture Software Developer's Manual Volume 2: Instruction Set Reference,” which is available online from Intel Corporation, Santa Clara, Calif. at www.intel.com/design/litcentr.
Currently, SIMD addition and subtraction operations act only on corresponding pairs of data elements: a first element Xn (where n is an integer) from one operand and a second element Yn from a second operand are added together or subtracted. For example, such an addition operation may be performed on sets of data elements (X3, X2, X1 and X0) and (Y3, Y2, Y1 and Y0), accessed as Source1 and Source2, respectively, to obtain the result (X3+Y3, X2+Y2, X1+Y1 and X0+Y0).
Although many applications currently in use can take advantage of such an operation, a number of important applications require the data elements to be rearranged before the above addition operation can be applied.
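As a minimal sketch of this element-wise behavior, the following models a packed addition and subtraction on tuples of unsigned integers. The function names, the 16-bit lane width, and the wraparound behavior on overflow are assumptions made for illustration; they are not taken from any specific instruction set.

```python
from typing import Sequence, Tuple


def packed_add(src1: Sequence[int], src2: Sequence[int], lane_bits: int = 16) -> Tuple[int, ...]:
    """Add corresponding lanes of src1 and src2, wrapping each lane to lane_bits."""
    if len(src1) != len(src2):
        raise ValueError("operands must have the same number of data elements")
    mask = (1 << lane_bits) - 1
    # Each result lane depends only on the matching lanes Xn and Yn.
    return tuple((x + y) & mask for x, y in zip(src1, src2))


def packed_sub(src1: Sequence[int], src2: Sequence[int], lane_bits: int = 16) -> Tuple[int, ...]:
    """Subtract corresponding lanes of src2 from src1, wrapping each lane to lane_bits."""
    if len(src1) != len(src2):
        raise ValueError("operands must have the same number of data elements")
    mask = (1 << lane_bits) - 1
    return tuple((x - y) & mask for x, y in zip(src1, src2))


# Elements are listed as (X3, X2, X1, X0), matching the ordering used in the text.
result = packed_add((4, 3, 2, 1), (40, 30, 20, 10))
```

Note that no lane ever combines Xn with Ym for n different from m; any such cross-lane pairing requires a separate rearrangement step before the operation.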
For example, a complex radix-4 decimation-in-time operation of a Fast Fourier Transform (FFT) is shown in FIG. 19a. The computations at each stage are called butterflies. In general, a radix-4 butterfly involves 3 complex multiplications and 12 complex additions.
The complex radix-4 butterfly is equivalent to the matrix operations shown in FIG. 19b. Product 1950 represents the multiplication of the complex inputs by the complex twiddle factors as seen on the left hand side of the radix-4 butterfly illustrated in FIG. 19a. Transformation matrix 1920 selectively reorders and negates the complex product components to produce the output vector 1910 for a particular butterfly stage.
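The matrix form of the butterfly can be illustrated with a short sketch. FIG. 19b is not reproduced here, so the code below assumes that the transformation matrix corresponds to the standard 4-point DFT matrix, whose entries are 1, -1, j and -j; the names `DFT4` and `radix4_butterfly` are hypothetical.

```python
from typing import List, Sequence

# Assumed transformation matrix: the standard 4-point DFT matrix.
# Multiplying by 1, -1, j or -j only reorders and negates real and
# imaginary components, so no general multiplication is needed here.
DFT4 = (
    (1, 1, 1, 1),
    (1, -1j, -1, 1j),
    (1, -1, 1, -1),
    (1, 1j, -1, -1j),
)


def radix4_butterfly(inputs: Sequence[complex], twiddles: Sequence[complex]) -> List[complex]:
    """Compute one radix-4 decimation-in-time butterfly.

    inputs:   four complex input samples
    twiddles: three complex twiddle factors (the first input is taken unscaled)
    """
    # Three complex multiplications form the product vector.
    products = [inputs[0]] + [x * w for x, w in zip(inputs[1:], twiddles)]
    # Each output row needs three additions: 12 complex additions in total.
    return [sum(c * p for c, p in zip(row, products)) for row in DFT4]
```

For instance, four equal inputs with unit twiddle factors yield a single non-zero output in the first position, as expected of a 4-point DFT of a constant sequence.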
Selective reordering and negation of complex SIMD components represent a significant computational overhead in complex multiplications and transformations such as those performed in the radix-4 FFT butterfly.
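To make this overhead concrete, the following sketch multiplies two complex numbers stored as packed pairs, using only lane-wise multiply and add plus explicit shuffle and negate steps. The lane layout (index 0 real, index 1 imaginary), the helper names, and the shuffle and negation patterns are assumptions for illustration only.

```python
from typing import List, Sequence


def lanewise_mul(a: Sequence[float], b: Sequence[float]) -> List[float]:
    return [x * y for x, y in zip(a, b)]


def lanewise_add(a: Sequence[float], b: Sequence[float]) -> List[float]:
    return [x + y for x, y in zip(a, b)]


def shuffle(v: Sequence[float], pattern: Sequence[int]) -> List[float]:
    """Rearrange lanes: result lane i takes source lane pattern[i]."""
    return [v[i] for i in pattern]


# Patterns like these typically occupy registers of their own.
SWAP = (1, 0)           # [re, im] -> [im, re]
DUP_RE = (0, 0)         # [re, im] -> [re, re]
DUP_IM = (1, 1)         # [re, im] -> [im, im]
NEGATE_LANE0 = (-1, 1)  # negate the real lane only


def complex_mul_packed(x: Sequence[float], w: Sequence[float]) -> List[float]:
    """Compute (a + bi)(c + di) for x = [a, b] and w = [c, d]."""
    t1 = lanewise_mul(x, shuffle(w, DUP_RE))                 # [a*c, b*c]
    t2 = lanewise_mul(shuffle(x, SWAP), shuffle(w, DUP_IM))  # [b*d, a*d]
    t2 = lanewise_mul(t2, NEGATE_LANE0)                      # [-b*d, a*d]
    return lanewise_add(t1, t2)                              # [a*c - b*d, b*c + a*d]
```

In this sketch, three shuffles and one selective negation accompany the two multiplications and one addition that do the arithmetic, which illustrates why the rearrangement steps weigh heavily on the total cost.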
Accordingly, there is a need in the technology for providing an apparatus and method which more efficiently performs complex multiplication and butterfly computations, such as those used in FFTs for example, without requiring additional time to perform operations that negate, shuffle and recombine data elements. There is also a need in the technology for a method and operation for increasing code density by eliminating the necessity for the rearrangement of data elements and thereby eliminating the corresponding rearrangement operations from the code. By eliminating the necessity for the rearrangement and selective negation of data elements, additional registers could also be made available that might otherwise have been used to store patterns for shuffling and/or negating data elements.