1. Field of the Invention
The invention relates generally to the field of processor chips and specifically to the field of Single-Instruction Multiple-Data (SIMD) processors. More particularly, the present invention relates to matrix operations in a SIMD processing system.
2. Background
Matrix multiplication is used in many digital signal-processing operations, because linear operations can be expressed as matrix operations. For example, a 4×4 transformation matrix can be used to transform any color space into another color space. The four color components could be Red, Green, Blue, and Alpha, or any three color components and an Alpha channel, which is typically used for keying applications. Color space transformation is used often in video, image, and graphics applications. Another application of matrix multiplication is the implementation of the Discrete Cosine Transform (DCT) for the H.264 video compression standard (also known as AVC and MPEG-4 Part 10), which requires multiplication of multiple 4×4 matrices.
Existing SIMD processor architectures try to use SIMD instructions to speed up the implementation of these matrix operations by parallelizing multiple operations. SIMD instructions work on vectors consisting of multiple elements at the same time. An 8-wide SIMD could operate on eight 16-bit elements concurrently, for example. Even though the SIMD architecture and its instructions can be very efficient in implementing Finite Impulse Response (FIR) filters and other operations, the implementation of matrix operations is not efficient, because matrix operations require cross mapping of vector elements for arithmetic operations. SIMD processors that are on the market at the time of this writing work with intra-vector elements, which means that a one-to-one fixed mapping of vector elements is used for vector operations. For example, a vector-add instruction adds element-0 of the source-1 vector register to element-0 of the source-2 vector register and stores the result to element-0 of the output vector register; adds element-1 of source-1 to element-1 of source-2 and stores the result to element-1 of the output vector register; and so forth. In contrast, inter-mapping of source vector elements refers to any cross mapping of vector elements. Most SIMD processors provide little or no support for inter-element arithmetic operations. Motorola's AltiVec provides only two inter-element vector instructions for arithmetic operations: sum of products and sum across (vector sum of elements). Other SIMD processors, such as MMX/SSE from Intel or ICE from SGI, do not provide support for inter-element operations other than vector sum operations. Silicon Graphics' ICE provides a broadcast mode, whereby one selected element of a source vector could operate across all elements of a second vector register. The intra-element operation and broadcast modes of SIMD operation of the prior art are shown in FIG. 1.
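By way of illustration only, the three element mappings discussed above can be sketched in scalar Python, with lists standing in for vector registers. The function names `vadd_intra`, `vmul_broadcast`, and `vmul_mapped` are hypothetical and do not correspond to instructions of any actual processor.

```python
def vadd_intra(a, b):
    """Intra-element (one-to-one) mapping: out[i] = a[i] + b[i]."""
    return [x + y for x, y in zip(a, b)]


def vmul_broadcast(a, b, k):
    """Broadcast mode: element k of b operates across all elements of a."""
    return [x * b[k] for x in a]


def vmul_mapped(a, b, map_a, map_b):
    """General inter-element mapping: out[i] = a[map_a[i]] * b[map_b[i]]."""
    return [a[i] * b[j] for i, j in zip(map_a, map_b)]


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

intra = vadd_intra(a, b)                                # [11, 22, 33, 44]
bcast = vmul_broadcast(a, b, 2)                         # [30, 60, 90, 120]
cross = vmul_mapped(a, b, [0, 0, 1, 1], [3, 2, 1, 0])   # [40, 30, 40, 20]
```

Only the last form pairs arbitrary elements of the two sources, which is the kind of cross mapping that matrix operations call for.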
Some processors, such as the AltiVec, provide non-arithmetic vector-permute instructions, whereby vector elements could be permuted prior to arithmetic operations. However, this requires two vector-permute instructions followed by the vector arithmetic instruction, which reduces the efficiency significantly for core operations such as DCT or color space conversion. It is conceivable that a SIMD processor has multiple vector units, where two vector permute units and a vector arithmetic unit could operate concurrently in a superscalar processor executing two or more vector instructions in a pipelined fashion. However, this requires the results of one instruction to be immediately available to the other unit by bypassing intermediate vector results before they are written to the vector register file. Such bypassing of intermediate results becomes increasingly costly in terms of die area or number of gates as the number of SIMD elements is increased beyond eight elements or 128-bit-wide data paths.
The following describes how the above two example applications are handled by the prior art.
Color Space Conversion
A transformation between color spaces requires multiplication of a 4×4 matrix with a 4×1 input color matrix, converting the input vector {I0, I1, I2, I3} to the output vector {Q3, Q2, Q1, Q0}, as shown in FIG. 8. The values of the 4×4 matrix, m0 to m15, depend on the color space conversion being performed and are constants. Let us first assume that we have a 4-wide SIMD processor and that we are trying to calculate this matrix product. We could pre-arrange the constant values of the 4×4 matrix so that they are stored in column-sequential order, as we have done in this example. One could implement the matrix operation in two different ways. The classic approach is to multiply the first row of the matrix with the first column of the second matrix (the only column in this case) and vector-sum the results to obtain the first output value. Similarly, the second value is obtained by multiplying the second row of the 4×4 matrix with the input vector and vector-summing the intermediate results. The second approach, which is more appropriate for a SIMD implementation, is to calculate the partial sums of the four output values at the same time. In other words, we would first multiply I0 with all the values of the first column of the 4×4 matrix and store this vector of four values in a vector accumulator register. Next, we would multiply I1 with all the second-column values of the 4×4 matrix and add this vector of four values to the vector accumulator values, and so forth for the remaining two columns of the 4×4 matrix. This partial-product approach minimizes inter-element operations and avoids the use of vector-sum operations. Column-sequential ordering of the 4×4 matrix allows us to load the vectors {m0, m1, m2, m3}, {m4, m5, m6, m7}, and so on as vector loads of sequential data elements into a vector register without the need to reshuffle them. The arithmetic operations to be performed are shown in the table of FIG. 9.
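The two approaches described above can be compared in a minimal scalar Python sketch, given purely for illustration. It assumes the sixteen matrix constants are held in a flat list `m` in column-sequential order, so that row r, column c of the matrix is `m[4 * c + r]`. The function names and the numeric values are made up for the example.

```python
def matvec_row_dot(m, v):
    """Classic approach: each output is a row-by-vector multiply followed by a vector sum."""
    return [sum(m[4 * c + r] * v[c] for c in range(4)) for r in range(4)]


def matvec_partial_products(m, v):
    """Partial-product approach: four 4-wide multiply-accumulate steps, one per column."""
    acc = [0, 0, 0, 0]
    for c in range(4):
        column = m[4 * c: 4 * c + 4]   # sequential vector load of one column
        scalar = v[c]                  # one input element applied to all four lanes
        acc = [a + x * scalar for a, x in zip(acc, column)]
    return acc


m = list(range(1, 17))   # stand-in constants m0..m15, column-sequential
i_vec = [1, 2, 3, 4]     # stand-in input color components I0..I3

q_rows = matvec_row_dot(m, i_vec)
q_partial = matvec_partial_products(m, i_vec)   # [90, 100, 110, 120]
```

Both functions return the same four outputs; the second needs no vector-sum step, only a way to apply one input element across all lanes.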
A 4-wide SIMD could calculate such matrix operations very efficiently, assuming there is a broadcast mode where one element could operate across all elements of a second vector. A somewhat less efficient approach used by SIMD processors is to first use a “splat” instruction, which copies any element from one vector register into all elements of another register. This would reduce the performance by 50 percent in comparison to a SIMD with broadcast mode, since every multiply or multiply-accumulate operation has to be preceded by a splat vector instruction. MMX from Intel does not have such an instruction, and therefore it takes four instructions, whereas SSE from Intel has a packed-shuffle instruction, which could implement the splat operation. In summary, given the 50 percent performance reduction due to the splat operation, 4-wide SIMD processors could perform matrix operations about twice as fast as their scalar counterparts. However, the performance of these processors is not close to meeting the high performance requirements of video, graphics, and other digital signal processing applications, because their level of parallelism is limited by the small number of parallel processing elements. Furthermore, AltiVec and Pentium class processors do not meet the low cost and low power requirements of embedded consumer applications.
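The cost of the splat step can be made concrete with a small Python sketch, offered only as an illustration. It assumes column-sequential constants in a flat list `m`, counts only splat and multiply-accumulate operations (loads and stores are ignored), and uses hypothetical function names.

```python
def splat(v, k, width=4):
    """Copy element k of v into every lane of a new vector."""
    return [v[k]] * width


def matvec_with_splat(m, v):
    """4x4 by 4x1 product where each multiply-accumulate is preceded by a splat."""
    acc = [0, 0, 0, 0]
    issued = 0
    for c in range(4):
        replicated = splat(v, c)       # extra non-arithmetic instruction
        issued += 1
        column = m[4 * c: 4 * c + 4]
        acc = [a + x * y for a, x, y in zip(acc, column, replicated)]
        issued += 1                    # the multiply-accumulate itself
    return acc, issued


m = list(range(1, 17))   # stand-in constants, column-sequential
i_vec = [1, 2, 3, 4]

result, issued = matvec_with_splat(m, i_vec)   # issued == 8
```

Under these counting assumptions, eight instructions are issued where a broadcast mode would need four, which corresponds to the 50 percent reduction stated above.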
Let us look at an 8-wide SIMD to exemplify what happens when a SIMD processor has higher potential “data crunching” power. In this case, we could process two input pixels at the same time, i.e., we would load eight values {I0, I1, I2, I3, I4, I5, I6, I7}, where I0-3 are the color components of a pixel and I4-7 are the color components of the adjacent pixel. We could preconfigure the 4×4 matrix columns as {m0, m1, m2, m3, m0, m1, m2, m3} and so forth, and prestore these in memory so that we could access any column as a vector of eight values with a single vector load instruction. This is equivalent to multiplying the same 4×4 matrix with a 4×2 input matrix, resulting in a 4×2 matrix. FIG. 10 illustrates the operations in four steps of partial-product accumulation. The problem, however, is how to form a vector of {I0, I0, I0, I0, I4, I4, I4, I4} from the input vector {I0, I1, I2, I3, I4, I5, I6, I7}. Intel's MMX and SSE do not offer any direct help to resolve this, not to mention that they are 4-wide SIMD. Motorola's AltiVec vector-permute instruction could be used, but again this reduces performance by 50 percent, because the arithmetic unit remains idle during the permute step; otherwise, we face complicated and costly bypassing of intermediate variables between concurrent vector computational units.
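For illustration, the two-pixel case can be sketched in scalar Python as follows. The `permute` helper is a hypothetical stand-in for a vector-permute step, the matrix constants are assumed to be column-sequential in a flat list `m`, and all values are made up.

```python
def permute(v, idx):
    """Stand-in for a vector permute: out[i] = v[idx[i]]."""
    return [v[i] for i in idx]


def convert_two_pixels(m, pixels):
    """Multiply one 4x4 matrix with two 4-component pixels using 8-wide steps."""
    acc = [0] * 8
    for c in range(4):
        column = m[4 * c: 4 * c + 4] * 2    # {m0..m3, m0..m3}, prestored in memory
        idx = [c] * 4 + [4 + c] * 4         # e.g. {0,0,0,0,4,4,4,4} for the first step
        mapped = permute(pixels, idx)       # the cross mapping that costs an extra step
        acc = [a + x * y for a, x, y in zip(acc, column, mapped)]
    return acc


m = list(range(1, 17))               # stand-in constants, column-sequential
pixels = [1, 2, 3, 4, 4, 3, 2, 1]    # I0..I3 for one pixel, I4..I7 for the next

first_step = permute(pixels, [0, 0, 0, 0, 4, 4, 4, 4])   # [1, 1, 1, 1, 4, 4, 4, 4]
outputs = convert_two_pixels(m, pixels)   # [90, 100, 110, 120, 50, 60, 70, 80]
```

The matrix columns can be replicated ahead of time because they are constants, but the input elements must be remapped on every step, which is where the extra permute arises.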
As shown in FIG. 11, the product of matrix A 1110 and matrix B 1120 may be calculated by first decomposing matrix A into its columns and matrix B into its rows, performing a series of matrix multiplications of the columns of A with the respective rows of B, and summing the partial results. The columns of matrix A 1110 are shown as A1 1130, A2 1131, and An 1132. Similarly, the rows of B 1120 are decomposed as B1 1160, B2 1161, and Bn 1162. The matrix multiplication of A and B can then be expressed as the summation of A1·B1 1170, A2·B2 1171, and An·Bn 1172, where the symbol “·” denotes matrix multiplication. Matrix multiplication of a column matrix Ax with a row matrix Bx, where “x” denotes the column or row number before decomposition, generates an m-by-p matrix 1180. For the above example of a 4-by-4 A matrix 1110 and a 4-by-2 B matrix, the output matrix has 8 elements. Calculating these 8 elements in parallel requires an 8-wide SIMD, which could iterate over the four columns to calculate the result in four clock cycles. However, such a calculation is not possible with that efficiency, because pairing of the input vectors requires the following mapping of input vectors:
{1, 2, ..., 4, 1, 2, ..., 4} and {1, 1, ..., 1, 2, 2, ..., 2}. The requirement for mapping is further compounded for wider SIMD architectures and for larger input matrices.
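The column-by-row decomposition and the element mapping it requires can be sketched in Python for the 4-by-4 times 4-by-2 case. This is an illustrative sketch only: the function names are hypothetical, matrices are lists of rows, indices are zero-based, and the eight result lanes are assumed to be laid out column by column.

```python
def matmul_reference(a, b):
    """Plain row-by-column product, used only to check the sketch."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matmul_outer_8wide(a, b):
    """Sum of column-by-row partial products for a 4x4 'a' and a 4x2 'b'."""
    map_a = [0, 1, 2, 3, 0, 1, 2, 3]   # {1, 2, ..., 4, 1, 2, ..., 4}, zero-based
    map_b = [0, 0, 0, 0, 1, 1, 1, 1]   # {1, 1, ..., 1, 2, 2, ..., 2}, zero-based
    acc = [0] * 8
    for k in range(4):
        col_k = [a[i][k] for i in range(4)]   # k-th column of A
        row_k = b[k]                          # k-th row of B
        acc = [s + col_k[i] * row_k[j] for s, i, j in zip(acc, map_a, map_b)]
    return acc


a = [[1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15], [4, 8, 12, 16]]
b = [[1, 4], [2, 3], [3, 2], [4, 1]]

lanes = matmul_outer_8wide(a, b)   # [90, 100, 110, 120, 50, 60, 70, 80]
ref = matmul_reference(a, b)
```

Each of the four iterations is one 8-wide multiply-accumulate, provided the two index patterns can be applied to the sources without separate instructions.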
DCT Calculation
The H.264 video compression standard frequently uses a DCT-based transform for 4×4 blocks of residual data. This transformation requires the following calculation:
Y=A·X·B, where A and B are constant 4×4 matrices and X is the input 4×4 data matrix. We could first calculate either A·X or X·B, because matrix multiplication is associative. Since the X matrix is the input data, it is typically in row-sequential order. This means that the mapping of elements required for multiplication or multiply-accumulation cannot be pre-arranged, as in the previous example, and requires a cross mapping of elements. Let us take the case of Q=X·B, and let us assume that all input and output matrices are stored in row-sequential order.
This matrix multiplication could be done in four steps, as illustrated in FIG. 12, using the multiply-accumulate feature of a SIMD processor to accumulate partial results. The first step is a multiply operation of two vectors:
{X0, X0, X0, X0, X4, X4, X4, X4, X8, X8, X8, X8, X12, X12, X12, X12} and
{B0, B1, B2, B3, B0, B1, B2, B3, B0, B1, B2, B3, B0, B1, B2, B3}.
This step would require a single vector-multiply instruction if we had the input vectors mapped as required, but this mapping would require two additional vector-permute instructions. The second step is a vector multiply-accumulate operation of the following two vectors:
{X1, X1, X1, X1, X5, X5, X5, X5, X9, X9, X9, X9, X13, X13, X13, X13} and
{B4, B5, B6, B7, B4, B5, B6, B7, B4, B5, B6, B7, B4, B5, B6, B7}.
This step could also be done with a single vector multiply-accumulate instruction if we had the input matrix elements already mapped as shown above. To get this mapping, prior art such as AltiVec from Motorola requires two additional vector-permute instructions.
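The four-step accumulation of Q=X·B can be sketched in Python as shown here, for illustration only. The sketch assumes that X, B, and Q are flat 16-element lists in row-sequential order, so that row r, column c is element 4r+c. The names `step_indices` and `matmul_4x4_rowseq` are hypothetical.

```python
def step_indices(k):
    """Element mappings for step k: X pattern {Xk x4, X(4+k) x4, ...} and B row k repeated."""
    x_idx = [4 * r + k for r in range(4) for _ in range(4)]
    b_idx = [4 * k + j for _ in range(4) for j in range(4)]
    return x_idx, b_idx


def matmul_4x4_rowseq(x, b):
    """Q = X.B as one 16-wide multiply followed by three multiply-accumulates."""
    acc = [0] * 16
    for k in range(4):
        x_idx, b_idx = step_indices(k)
        acc = [s + x[i] * b[j] for s, i, j in zip(acc, x_idx, b_idx)]
    return acc


x = list(range(16))                                       # stand-in input block
identity = [1 if i // 4 == i % 4 else 0 for i in range(16)]
b = [2, 0, 1, 0, 0, 1, 0, 3, 1, 1, 0, 0, 0, 2, 0, 1]      # made-up constants

q_identity = matmul_4x4_rowseq(x, identity)   # equals x
q = matmul_4x4_rowseq(x, b)
```

`step_indices(0)` and `step_indices(1)` reproduce the index patterns of the first and second steps listed above; on hardware without inter-element mapping, each pattern would have to be produced by a separate permute.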
The requirement of two additional vector-permute instructions reduces the performance to about one third, because the vector arithmetic unit is in use for only one of every three instructions. A superscalar implementation would require two concurrent vector permute units in parallel with the vector arithmetic unit, which would also complicate the bypassing of intermediate results due to the wide data widths, for example a 512-bit-wide data path for a 32-wide SIMD. A scalar processor typically bypasses results so that the result of one instruction becomes immediately usable by the following instruction, even though each instruction takes several cycles to execute in a pipelined fashion, such as the Instruction Read, Instruction-Decode/Register Read, Execute-1, Execute-2, and Write-Back stages of a five-stage pipelined RISC processor. Such bypassing becomes costly with regard to gate count in a wide data path processor such as a SIMD processor. That is why SIMD processors and some VLIW processors from Texas Instruments (TMS320C6000 series) do not employ bypass, and it is left to the user to insert enough instruction padding, by interleaving instructions or NOPs, to take care of this.