1. Field of the Invention
The present invention relates in general to the field of computer systems, and in particular, to an apparatus and method for performing multi-dimensional computations based on an intra-add operation.
2. Description of the Related Art
To improve the efficiency of multimedia applications, as well as other applications with similar characteristics, a Single Instruction, Multiple Data (SIMD) architecture has been implemented in computer systems to enable one instruction to operate on several operands simultaneously, rather than on a single operand. In particular, SIMD architectures take advantage of packing many data elements within one register or memory location. With parallel hardware execution, multiple operations can be performed on separate data elements with one instruction, resulting in significant performance improvement.
Currently, the SIMD addition operation only performs "vertical" or inter-register addition, where pairs of data elements, for example, a first element Xn (where n is an integer) from one operand and a second element Yn from a second operand, are added together. An example of such a vertical addition operation is shown in FIG. 1, where the instruction is performed on the sets of data elements (X3, X2, X1, and X0) and (Y3, Y2, Y1, and Y0), accessed as Source1 and Source2, respectively, to obtain the result (X3 + Y3, X2 + Y2, X1 + Y1, and X0 + Y0).
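The vertical addition described above can be illustrated with a minimal software sketch. It is an emulation only, not the hardware operation of FIG. 1; the function name `vertical_add` and the sample values are assumptions introduced for illustration, and a Python list stands in for a packed operand.

```python
def vertical_add(source1, source2):
    """Emulate a vertical (inter-register) add: lane n of the result is
    source1[n] + source2[n]. Each list stands in for one packed operand."""
    if len(source1) != len(source2):
        raise ValueError("packed operands must have the same number of elements")
    return [x + y for x, y in zip(source1, source2)]


# Illustrative values, listed high element first as in the text:
# (X3, X2, X1, X0) and (Y3, Y2, Y1, Y0).
source1 = [4.0, 3.0, 2.0, 1.0]
source2 = [40.0, 30.0, 20.0, 10.0]

result = vertical_add(source1, source2)
print(result)  # (X3 + Y3, X2 + Y2, X1 + Y1, X0 + Y0)
```

Each result element depends only on the two elements in the same position, which is why the operation adds across operands but never within one.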
Although many applications currently in use can take advantage of such a vertical add operation, a number of important applications require the data elements to be rearranged before the vertical add operation can be applied.
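For example, a matrix multiplication operation is shown below.

\[
\begin{bmatrix}
A_{11} & A_{12} & A_{13} & A_{14} \\
A_{21} & A_{22} & A_{23} & A_{24} \\
A_{31} & A_{32} & A_{33} & A_{34} \\
A_{41} & A_{42} & A_{43} & A_{44}
\end{bmatrix}
\begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{bmatrix}
=
\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \end{bmatrix}
\]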
To multiply the matrix A by a vector X and obtain the resulting vector Y, instructions are used to: 1) store the columns of the matrix A as packed operands (this typically requires rearrangement of data because the coefficients of the matrix A are stored so that its rows, not its columns, are accessed as packed data operands); 2) store a set of operands that each have a different one of the vector X coefficients in every data element; and 3) use vertical multiplication, where each data element of the vector X (i.e., X4, X3, X2, X1) is first multiplied with the data elements of the corresponding column of the matrix A (for example, X4 with [A14, A24, A34, A44]). The results of the multiplication operations are then added together through three vertical add operations, such as that shown in FIG. 1, to obtain the final result. Such a matrix multiplication operation based on the use of vertical add operations typically requires 20 instructions to implement, an example of which is shown below in Table 1.
TABLE 1
Exemplary Code Based on Vertical-Add Operations

Assumptions:
  1/ X stored with X1 first, X4 last
  2/ transpose of A stored with A11 first, A21 second, A31 third, etc.
  3/ availability of:
       DUPLS: duplicate once
       DUPLD: duplicate twice

  MOVD   mm0, <mem_X>       // [0,0,0,X1]
  DUPLS  mm0, mm0           // [0,0,X1,X1]
  DUPLD  mm0, mm0           // [X1,X1,X1,X1]
  PFMUL  mm0, <mem_A>       // [A41*X1,A31*X1,A21*X1,A11*X1]
  MOVD   mm1, <mem_X+4>     // [0,0,0,X2]
  DUPLS  mm1, mm1           // [0,0,X2,X2]
  DUPLD  mm1, mm1           // [X2,X2,X2,X2]
  PFMUL  mm1, <mem_A+16>    // [A42*X2,A32*X2,A22*X2,A12*X2]
  MOVD   mm2, <mem_X+8>     // [0,0,0,X3]
  DUPLS  mm2, mm2           // [0,0,X3,X3]
  DUPLD  mm2, mm2           // [X3,X3,X3,X3]
  PFMUL  mm2, <mem_A+32>    // [A43*X3,A33*X3,A23*X3,A13*X3]
  MOVD   mm3, <mem_X+12>    // [0,0,0,X4]
  DUPLS  mm3, mm3           // [0,0,X4,X4]
  DUPLD  mm3, mm3           // [X4,X4,X4,X4]
  PFMUL  mm3, <mem_A+48>    // [A44*X4,A34*X4,A24*X4,A14*X4]
  PFADD  mm0, mm1           // [A42*X2+A41*X1,A32*X2+A31*X1,
                            //  A22*X2+A21*X1,A12*X2+A11*X1]
  PFADD  mm2, mm3           // [A44*X4+A43*X3,A34*X4+A33*X3,
                            //  A24*X4+A23*X3,A14*X4+A13*X3]
  PFADD  mm0, mm2           // [A44*X4+A43*X3+A42*X2+A41*X1,
                            //  A34*X4+A33*X3+A32*X2+A31*X1,
                            //  A24*X4+A23*X3+A22*X2+A21*X1,
                            //  A14*X4+A13*X3+A12*X2+A11*X1]
  MOVDQ  <mem_Y>, mm0       // store [Y4,Y3,Y2,Y1]
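The data flow of Table 1 can be followed in a short software emulation. This is a sketch under stated assumptions, not the listed instructions themselves: the helper names `broadcast`, `vertical_mul`, `vertical_add`, and `matvec_vertical` are hypothetical, Python lists stand in for packed registers, and lanes are written low element first (memory order), whereas the comments in Table 1 list the high element first.

```python
def broadcast(value, width=4):
    """Stand-in for the MOVD + DUPLS + DUPLD sequence: one scalar in every lane."""
    return [value] * width


def vertical_mul(a, b):
    """Stand-in for PFMUL: lane-by-lane multiply of two packed operands."""
    return [p * q for p, q in zip(a, b)]


def vertical_add(a, b):
    """Stand-in for PFADD: lane-by-lane add of two packed operands."""
    return [p + q for p, q in zip(a, b)]


def matvec_vertical(a_columns, x):
    """Compute Y = A * X using only vertical operations.

    a_columns[j] holds column j+1 of A, i.e. [A1j, A2j, A3j, A4j], which is
    the transposed layout that Table 1 assumes. x holds [X1, X2, X3, X4].
    """
    # Four broadcast-and-multiply groups (mm0..mm3 in Table 1).
    products = [vertical_mul(broadcast(x[j]), a_columns[j]) for j in range(4)]
    # Three vertical adds combine the four partial products.
    low = vertical_add(products[0], products[1])    # PFADD mm0, mm1
    high = vertical_add(products[2], products[3])   # PFADD mm2, mm3
    return vertical_add(low, high)                  # PFADD mm0, mm2


# Illustrative values. A is written by rows, then transposed into columns,
# which is the rearrangement step the text describes.
a_rows = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
a_columns = [list(col) for col in zip(*a_rows)]
y = matvec_vertical(a_columns, [1, 0, 2, 1])
print(y)  # [Y1, Y2, Y3, Y4]
```

The transpose and the four broadcasts do no arithmetic on the result; they exist only to line the data up for vertical operations, and they account for most of the instruction count.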
Accordingly, there is a need in the technology for an apparatus and method that efficiently perform multi-dimensional computations based on a "horizontal" or intra-add operation. There is also a need in the technology for a method and operation that increase code density by eliminating the rearrangement of data elements and the corresponding rearrangement operations.
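To show why a horizontal add would remove the rearrangement, the sketch below uses one possible reading of such an operation: summing the data elements inside a single packed operand. This section does not define the intra-add operation, so the function `intra_add`, its full-width reduction, and the helper `matvec_horizontal` are assumptions for illustration only; the actual operation may combine elements differently.

```python
def intra_add(packed):
    """Hypothetical horizontal add: sum the elements inside one packed operand.
    (Assumed behavior; the actual intra-add operation may differ.)"""
    total = 0
    for element in packed:
        total += element
    return total


def matvec_horizontal(a_rows, x):
    """Compute Y = A * X with A kept in its row layout.

    Each row of A is multiplied lane by lane with X, and the products are
    then summed within that single operand to give one element of Y.
    """
    y = []
    for row in a_rows:
        products = [a * b for a, b in zip(row, x)]  # vertical multiply
        y.append(intra_add(products))               # horizontal (intra) add
    return y


# Illustrative values: A is used by rows, and X is loaded once, as stored.
a_rows = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
y = matvec_horizontal(a_rows, [1, 0, 2, 1])
print(y)  # [Y1, Y2, Y3, Y4]
```

Under this reading, neither the transpose of A nor the duplication of each X coefficient is needed, which is the source of the code density improvement the text calls for.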