1. Field of the Invention
The present invention relates in general to the field of computer systems, and in particular, to an apparatus and method for performing multi-dimensional computations based on an intra-add operation.
2. Description of the Related Art
To improve the efficiency of multimedia applications, as well as other applications with similar characteristics, a Single Instruction, Multiple Data (SIMD) architecture has been implemented in computer systems to enable one instruction to operate on several operands simultaneously, rather than on a single operand. In particular, SIMD architectures take advantage of packing many data elements within one register or memory location. With parallel hardware execution, multiple operations can be performed on separate data elements with one instruction, resulting in significant performance improvement.
Currently, the SIMD addition operation performs only “vertical” or inter-register addition, where pairs of data elements, for example, a first element Xn (where n is an integer) from one operand and a second element Yn from a second operand, are added together. An example of such a vertical addition operation is shown in FIG. 1, where the instruction is performed on the sets of data elements (X3, X2, X1, and X0) and (Y3, Y2, Y1, and Y0), accessed as Source1 and Source2, respectively, to obtain the result (X3+Y3, X2+Y2, X1+Y1, and X0+Y0).
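The vertical addition described above can be illustrated with a minimal sketch, in which a packed operand is modeled as a Python list and each list position stands for one data element. The function name `vertical_add` and the sample values are assumptions chosen for illustration only; they do not correspond to any particular instruction set.

```python
def vertical_add(source1, source2):
    """Lane-wise (inter-register) add of two packed operands of equal width."""
    if len(source1) != len(source2):
        raise ValueError("packed operands must have the same number of elements")
    # Element n of the result depends only on element n of each source.
    return [x + y for x, y in zip(source1, source2)]


# Elements are listed high to low, matching the (X3, X2, X1, X0) order in the text.
source1 = [30, 20, 10, 0]   # X3, X2, X1, X0 (illustrative values)
source2 = [4, 3, 2, 1]      # Y3, Y2, Y1, Y0 (illustrative values)
result = vertical_add(source1, source2)
print(result)               # [34, 23, 12, 1]
```

Note that no element of the result combines two elements of the same source operand, which is the limitation discussed in the paragraphs that follow.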
Although many applications currently in use can take advantage of such a vertical add operation, a number of important applications require the data elements to be rearranged before the vertical add operation can be applied.
For example, a matrix multiplication operation is shown below.

MATRIX A * VECTOR X = VECTOR Y

\[
\begin{bmatrix}
A_{14} & A_{13} & A_{12} & A_{11} \\
A_{24} & A_{23} & A_{22} & A_{21} \\
A_{34} & A_{33} & A_{32} & A_{31} \\
A_{44} & A_{43} & A_{42} & A_{41}
\end{bmatrix}
\times
\begin{bmatrix}
X_{4} \\ X_{3} \\ X_{2} \\ X_{1}
\end{bmatrix}
=
\begin{bmatrix}
A_{14}X_{4} + A_{13}X_{3} + A_{12}X_{2} + A_{11}X_{1} \\
A_{24}X_{4} + A_{23}X_{3} + A_{22}X_{2} + A_{21}X_{1} \\
A_{34}X_{4} + A_{33}X_{3} + A_{32}X_{2} + A_{31}X_{1} \\
A_{44}X_{4} + A_{43}X_{3} + A_{42}X_{2} + A_{41}X_{1}
\end{bmatrix}
\]
To compute the product of the matrix A and a vector X, yielding the resulting vector Y, instructions are used to: 1) store the columns of the matrix A as packed operands (this typically requires rearrangement of data, because the matrix A coefficients are stored so that the rows, not the columns, are accessed as packed data operands); 2) store a set of operands that each have a different one of the vector X coefficients in every data element; and 3) use vertical multiplication, where each data element of the vector X (i.e., X4, X3, X2, X1) is first multiplied with the data elements in the corresponding column of the matrix A (for example, X4 with [A14, A24, A34, A44]). The results of the multiplication operations are then added together through three vertical add operations, such as that shown in FIG. 1, to obtain the final result. Such a matrix multiplication operation based on the use of vertical add operations typically requires 20 instructions to implement, an example of which is shown below in Table 1.
Exemplary Code Based on Vertical-Add Operations:
TABLE 1
Assumptions:
1/ X stored with X1 first, X4 last
2/ transpose of A stored with A11 first, A21 second, A31 third, etc.
3/ availability of:
   - DUPLS: duplicate once
   - DUPLD: duplicate twice

MOVD   mm0, <mem_X>        // [0, 0, 0, X1]
DUPLS  mm0, mm0            // [0, 0, X1, X1]
DUPLD  mm0, mm0            // [X1, X1, X1, X1]
PFMUL  mm0, <mem_A>        // [A41*X1, A31*X1, A21*X1, A11*X1]
MOVD   mm1, <mem_X + 4>    // [0, 0, 0, X2]
DUPLS  mm1, mm1            // [0, 0, X2, X2]
DUPLD  mm1, mm1            // [X2, X2, X2, X2]
PFMUL  mm1, <mem_A + 16>   // [A42*X2, A32*X2, A22*X2, A12*X2]
MOVD   mm2, <mem_X + 8>    // [0, 0, 0, X3]
DUPLS  mm2, mm2            // [0, 0, X3, X3]
DUPLD  mm2, mm2            // [X3, X3, X3, X3]
PFMUL  mm2, <mem_A + 32>   // [A43*X3, A33*X3, A23*X3, A13*X3]
MOVD   mm3, <mem_X + 12>   // [0, 0, 0, X4]
DUPLS  mm3, mm3            // [0, 0, X4, X4]
DUPLD  mm3, mm3            // [X4, X4, X4, X4]
PFMUL  mm3, <mem_A + 48>   // [A44*X4, A34*X4, A24*X4, A14*X4]
PFADD  mm0, mm1            // [A42*X2 + A41*X1, A32*X2 + A31*X1,
                           //  A22*X2 + A21*X1, A12*X2 + A11*X1]
PFADD  mm2, mm3            // [A44*X4 + A43*X3, A34*X4 + A33*X3,
                           //  A24*X4 + A23*X3, A14*X4 + A13*X3]
PFADD  mm0, mm2            // [A44*X4 + A43*X3 + A42*X2 + A41*X1,
                           //  A34*X4 + A33*X3 + A32*X2 + A31*X1,
                           //  A24*X4 + A23*X3 + A22*X2 + A21*X1,
                           //  A14*X4 + A13*X3 + A12*X2 + A11*X1]
MOVDQ  <mem_Y>, mm0        // store [Y4, Y3, Y2, Y1]
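The data flow of Table 1 can also be followed in a short sketch that models each packed register as a Python list. The helper names `broadcast`, `packed_mul`, `packed_add`, and `matvec_vertical` are hypothetical stand-ins for the MOVD/DUPLS/DUPLD sequence, PFMUL, PFADD, and the overall routine, respectively. The lists are ordered lowest element first (X1 first, as in the stated assumptions), the matrix is assumed to be 4 by 4, and the sample values are illustrative only.

```python
def broadcast(value, width=4):
    """Model MOVD + DUPLS + DUPLD: copy one scalar into every element."""
    return [value] * width


def packed_mul(a, b):
    """Model PFMUL: element-wise multiply of two packed operands."""
    return [p * q for p, q in zip(a, b)]


def packed_add(a, b):
    """Model PFADD: element-wise (vertical) add of two packed operands."""
    return [p + q for p, q in zip(a, b)]


def matvec_vertical(a_rows, x):
    """Compute Y = A * X for a 4x4 matrix using only element-wise operations.

    a_rows[i][j] holds A(i+1)(j+1); x[j] holds X(j+1).
    """
    # Rearrangement step: the columns of A, not the rows, become packed operands.
    columns = [list(col) for col in zip(*a_rows)]
    # One broadcast and one packed multiply per column of A.
    products = [packed_mul(broadcast(xj, len(col)), col)
                for xj, col in zip(x, columns)]
    # Three vertical adds combine the four partial-product operands.
    low = packed_add(products[0], products[1])
    high = packed_add(products[2], products[3])
    return packed_add(low, high)


a = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
x = [1, 0, 2, 1]                # X1, X2, X3, X4
y = matvec_vertical(a, x)
print(y)                        # [11, 27, 43, 59], i.e. Y1, Y2, Y3, Y4
```

In this sketch the transposition of A and the four broadcasts exist only to line the operands up for element-wise arithmetic, which mirrors where most of the 20 instructions in Table 1 are spent.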
Accordingly, there is a need in the technology for providing an apparatus and method which efficiently performs multi-dimensional computations based on a “horizontal” or intra-add operation. There is also a need in the technology for a method and operation for increasing code density by eliminating the need for the rearrangement of data elements and the corresponding rearrangement operations.