Matrix computations are widely used for wireless communication, wired communication, and image processing. Recently, to cope with an increase in wireless or wired communication speed, attention has been paid to a systolic array with which parallel computations can be performed effectively.
A systolic array is configured to have a plurality of processing cells arranged one-dimensionally or two-dimensionally, in which data exchange between processing cells is performed only between adjacent processing cells. Because of its regularity and ease of wiring, a systolic array is suitable for integration into such as a VLSI and so forth. The optimal topology of a systolic array differs according to the type of matrix computations (matrix multiplication, QR factorization, least square method, etc.). For example, a triangular configuration is optimal for performing QR factorization, while a square configuration is optimal for computing a matrix multiplication. It is suggested that, in case of implementing multiple types of matrix computations, a systolic array which implements a single algorithm be prepared and each matrix computation be mapped to a single algorithm. This method, though not optimal for each matrix computation, has an advantage in that it is versatile.
As an algorithm for carrying out multiple matrix computations, the Modified Faddeeva Algorithm (abbreviated to “MFA”) is known. In the MFA, two steps of processing are performed as shown in Expression (1).
                              [                                                    A                                            B                                                                                      -                  C                                                            D                                              ]                →                              [                                                            R                                                                                            Q                      t                                        ⁢                    B                                                                                                                    -                    C                                                                    D                                                      ]                    →                      [                                                            R                                                                                            Q                      t                                        ⁢                    B                                                                                                0                                                  E                                                      ]                                              (        1        )            
In the first step, QR factorization is performed (that is, A=QR where Q is a unitary matrix) on the matrix A using the Givens rotation. Similarly, the Givens rotation is applied to matrix B as well. This processing is equivalent to the multiplication of matrices A and B by QT from left. That is, when [A B]=[QR B] is multiplied by QT (where T denotes Hermite transpose) from left, [QT QR QT B]=[R QT B] is obtained because of QTQ=I (a unit matrix) and, therefore, the expression in the middle of Expression (1) is obtained. In the second step, C is eliminated by the Gaussian elimination method with diagonal elements of the triangular matrix R as pivot elements. In this case, the matrix E is given by Expression (2).E=D−(−R−tCt)tQtB=D+CA−1B  (2)
By changing the matrices to be substituted into A, B, C, and D, the MFA can implement various matrix computations that will be shown below. When only the first step is performed, the QR factorization can be implemented.
Linear system solution (AX=B):
            [                                    A                                B                                                              -              I                                            0                              ]        ⇒    E    =            A              -        1              ⁢    B  Matrix multiplication:
            [                                    I                                B                                                              -              C                                            0                              ]        ⇒    E    =  CBInverse matrix:
            [                                    A                                I                                                              -              I                                            0                              ]        ⇒    E    =      A          -      1      
FIG. 8A and FIG. 8B are diagrams showing signal flow graphs when real MFA computation is carried out using systolic arrays.
In the triangular systolic array in FIG. 8A, the upper triangular matrix R is obtained in step 1. As shown in FIG. 8A, the rotation parameter of the Givens rotation propagates to the right and, in the square systolic array, QTB is computed.
In step 2, C and D are input to the triangular systolic array and the square systolic array, respectively, as shown in FIG. 8B. In this case,−R−TCT propagates in the horizontal direction, and from the lower side of the square systolic array,E=D+CA−1B is output.
FIG. 9 is a diagram showing the configuration of a real MFA systolic array 4001 where the matrix sizes are A(2×2), B(2×2), C(2×2), and D(2×2) and all elements of the matrices are real numbers. The detail of this related art is disclosed in Non-Patent Document 1. Note that, in this specification, “A(m×n)” indicates that the matrix A is a m×n matrix. The same holds true in other matrices as well. In FIG. 9, aij represents the element in the ith row and in the jth column of the matrix A. The same holds true in other matrices as well.
Referring to FIG. 9, the real MFA systolic array 4001, a trapezoid-shaped systolic array created by combining a triangular systolic array 1000 and a square systolic array 2000, has boundary cells 101 and internal cells 201.
The boundary cells 101 are arranged diagonally in the triangular systolic array 1000.
In FIG. 9, xin and xout are an input from the upper cell, and an output to the lower cell, of each cell, respectively.
s, c, and d are parameters propagated horizontally from the boundary cells 101 to the internal cells 201.
The lower subscript of the boundary cell 101, internal cell 201, xin, and xout denotes each the position index of each cell, and (ij) indicates the cell in the ith row and in the jth column; for example, xin11 indicates an input to the cell in the first row and in the first column, xin12 indicates an input to the cell in the first row and in the second column, and so on. Because this systolic array is a trapezoid array, there is no cell with the lower subscript (2,1).
The parameters si, ci, and di are parameters output from the boundary cell 101 in the ith row.
The real MFA systolic array 4001 performs the MFA processing in two steps.
In step 1, the matrices A and B are input to the top of the triangular systolic array 1000 and the square systolic array 2000.
In step 2, the matrices C and D are input to the top of the triangular systolic array 1000 and the square systolic array 2000.
In step 2, the matrix E is output from the bottom of the square systolic array 2000. Note that, as shown in FIG. 9, a delay (skew) in the input/output data must be adjusted. The symbol (▪) in the input xin12, xin13, xin14 and in the output xout23 and xout24 in FIG. 9 represents a delay (unit delay), respectively.
FIG. 10 is a diagram showing the computation processing in steps 1 and 2 in the boundary cell 101 and the internal cell 201. In FIG. 10, r is a variable saved in a cell, and its initial value is 0.
In the real MFA systolic array 4001, the processing differs according to the type of cell and according to steps.
This means that, if each cell is implemented by a general-purpose processor and the processing shown in FIG. 10 is performed exactly as shown, the processing delay (processing load) differs according to the type of cell and according to steps. This makes the inter-cell synchronizing control circuit complex and increases the circuit size.
Conversely, if the processing delay is made uniform among the cells (processing delays are adjusted to the maximum processing delay) for simplifying the synchronization control circuit, the operation rate of the processors in the cells is lowered and so the efficiency is deteriorated.
In addition, if special computation circuits such as a multiplier, a divider, and a square rooter are combined to realize a cell, some special computation circuit performs operation only in one of the steps. In this case, too, the operation rate of the computation circuits is decreased and so the efficiency is reduced.
To solve this problem, an MFA systolic array, which realizes the processing of each cell with one CORDIC circuit, is disclosed in Non-Patent Document 2.
FIG. 11 is a diagram showing the configuration of the real MFA systolic array disclosed in Non-Patent Document 2 where the matrix sizes are A(2×2), B(2×2), C(2×2), and D(2×2) and all elements of the matrices are real numbers.
The real MFA systolic array 4001 is similar to the real MFA systolic array disclosed in Non-Patent Document 1 except that, instead of the rotation parameters s and c, the angle θ is propagated horizontally.
FIG. 12 is a diagram showing the computation processing of the boundary cell 101 and the internal cell 201 in the real MFA systolic array 4001 in steps 1 and 2 in FIG. 11.
First, the operation in step 1 will be described. The boundary cell 101 receives the input xin from the upper side and computes the norm t and the vector angle θ of the vector [r xin]t using the CORDIC algorithm.
The internal variable r is updated by the norm t, and the vector angle θ is propagated to the right and is supplied to the internal cell 201 in the same row.
The internal cell 201 receives the input xin from the upper side, receives the vector angle θ from the left and, using the CORDIC algorithm, performs the vector rotation processing represented byxout=cos θ·xin−sin θ·r r=sin θ·xin+cos θ·r  (3)and computes the output xout and the internal variable r.
The internal cell 201 supplies the output xout to the cell in the lower row.
In the internal cell 201, the vector rotation angle θ supplied from the left cell is propagated directly to the internal cell 201 on the right side.
Next, the operation in step 2 will be described.
The boundary cell 101 receives the input xin from the upper side and performs the following division using the CORDIC algorithm.d=xin/r  (4)
The division result d is propagated horizontally and is supplied to the internal cell 201 in the same row.
In addition, the internal cell 201 performs the following multiply-and-accumulate using the CORDIC algorithm.xout=xin−d·r  (5)
In the internal cell 201, the output xout is supplied to the cell in the lower row. Note that, the output xout of the internal cell 201 at the bottom is output outside the MFA systolic array 4001 as an element of the MFA computation result matrix E.
The processing of the boundary cell 101 and the internal cell 201 in step 1 corresponds respectively to the vector angle computation processing and the vector rotation processing.
The processing of the boundary cell 101 and the internal cell 201 in step 2 corresponds respectively to the division processing and the sum-of-product processing.
It is known that the processing described above can be performed with the same delay by the CORDIC algorithm.
Therefore, because the processing delay of the cells is not affected by the cell type or the step but is constant and the inter-cell connection relation is fixed in the real MFA systolic array 4001 disclosed in the Non-Patent Document 2, there is no need for an inter-cell synchronization control circuit.
In addition, because the CORDIC circuit that is the computation circuit in the cells is always in operation, efficiency is increased.
[Non-Patent Document 1] J. G. Nash, “Modified Faddeeva algorithm for concurrent execution of linear algebraic operations”, IEEE Trans. Computers, vol. 37, No 2, pp 129-137 (1988)
[Non-Patent Document 2] M. Otte, J. Gotze, M. Bucker, “Matrix based signal processing on a reconfigurable hardware accelerator”, Digital Signal Processing Workshop, 2002 and the 2nd Signal Processing Education Workshop. Proceedings of 2002 IEEE, 13-16 Oct. 2002 Page(s): 350-355