Matrix calculations are extensively used in wireless communication, wired communication, and image processing. In recent years, in order to accommodate an improvement in wired and wireless communication speeds, the systolic array capable of efficiently performing parallel computations receives attention.
In the systolic array, a plurality of processing elements (referred to as “PEs”) are one-dimensionally or two-dimensionally arranged, and data exchange between the PEs is performed only by adjacent two of the PEs.
Due to regularity and simplicity of wiring of the systolic array, the systolic array is suitable for being integrated into a VLSI or the like.
According to the type of the matrix computation (such as matrix-matrix multiplication, QR factorization, least square solution), an optimal topology of the systolic array differs. For QR factorization, for example, a triangular configuration is optimal, while for matrix multiplication, a square configuration is optimal.
It is suggested that when plural types of matrix calculations are implemented, the systolic array which implements a single algorithm be prepared, and each of the matrix calculations be mapped to the single algorithm. Though this approach is not always optimal means for implementing each of the matrix calculations, there is an advantage in versatility of this approach.
As an algorithm that can perform a plurality of matrix calculations, a Modified Faddeeva Algorithm (abbreviated as the “MFA”) is known. In the MFA, two steps of processing are performed as shown in Expression (1).
                              [                                                    A                                            B                                                                                      -                  C                                                            D                                              ]                ->                              [                                                            R                                                                                            Q                      T                                        ⁢                    B                                                                                                                    -                    C                                                                    D                                                      ]                    ->                      [                                                            R                                                                                            Q                      T                                        ⁢                    B                                                                                                0                                                  E                                                      ]                                              (        1        )            
In a first step, QR factorization (in which A=QR, and Q is a unitary matrix) is performed on a matrix A, using a Givens rotation. The Givens rotation is likewise applied to a matrix B as well. This processing corresponds to multiplication of QT from left. That is, when [A B]=[QR B] is multiplied by QT (where T indicates a transposition) from left, [QTQR QTB] becomes equal to [R QTB] due to QTQ=I (which is a unit matrix). An expression in the middle of Expression (1) is thereby obtained.
In a second step, a matrix C is eliminated by Gaussian elimination method, using diagonal elements of a triangular matrix R as pivot elements. In this case, a matrix E is given by Expression (2).E=D−(−R−TCT)TQTB=D+CA−1B  (2)
By changing matrices assigned as the matrices A, B, C, and D, respectively, the MFA can implement various matrix calculations that will be shown below. When only the first step is implemented, the QR factorization can be implemented.
Linear System Solution (AX=B)
            [                                    A                                B                                                              -              I                                            0                              ]        ⇒    E    =            A              -        1              ⁢    B  
Matrix-Matrix Multiplication
            [                                    I                                B                                                              -              C                                            0                              ]        ⇒    E    =  CB
Matrix Inversion
            [                                    A                                I                                                              -              I                                            0                              ]        ⇒    E    =      A          -      1      
FIGS. 9A and 9B show signal flow graphs, respectively, when MFA computations are implemented by systolic arrays. In a triangular systolic array in FIG. 9A, an upper triangular matrix R is obtained in Step 1. As shown in FIG. 9A, a rotation parameter of the Givens rotation propagates to right. Then, in a square systolic array, GTB is computed.
In Step 2, the matrices C and D are supplied to the triangular systolic array and the square systolic array, respectively, as shown in FIG. 9B. In this case, −R−TCT propagates in a horizontal direction, and from a lower side of the square systolic array, E=D+CA−1B is output.
When the unitary matrix Q is obtained, −A is substituted into C in the second step (Step 2). In this case, since C=−A, an output of the square systolic array in the horizontal direction becomes as follows:−R−TCT=R−TAT=(AR−1)T=QT 
FIGS. 10A, 10B, and 10C show detailed operations of a two-dimensional systolic array that implements the MFA when sizes of the matrices are A(m×4), B(m×4), C(n×4), and D(n×4) (where m and n are arbitrary numbers of rows, respectively) (refer to Non-Patent Document 1). In this specification, “A (m×4)”, for example, indicates the matrix A has m rows×4 columns. The same holds true in other matrices as well. FIG. 10A and FIG. 10B correspond to the Step 1 in FIG. 9A and the Step 2 in FIG. 9B, respectively. FIG. 10C shows computation processing in the Steps 1 and 2 in a boundary cell and an internal cell.
As shown in FIGS. 10A, 10B, and 10C, it is necessary to adjust a delay (skew) for input and output data. The boundary cell indicated by a circle in FIG. 10A outputs C=1, and s=0 in the Step 1 when an input xin is zero. Otherwise, it is set as follows:t=(r2+xin2)1/2 c=r/t s=xin/t 
A vector angle is then obtained, and then r is updated to be equal to t. r in the circle of the boundary cell in the drawing indicates the updated r.
In the internal cell indicated by a square in FIG. 10A performs vector rotation in the Step 1,using xout=c·xin−s·r, r=s·xin+c·r 
Referring to FIG. 10B, the boundary cell indicated by the circle obtains a division s=xin/r with respect to the input xin, in the Step 2. Further, the internal cell indicated by the square in FIG. 10B carries out a multiply-and-add calculation of xout=xin−s·r.
FIG. 11 shows an overall configuration of a matrix calculator using a two-dimensional MFA systolic array. The matrix calculator includes a two-dimensional MFA systolic array 301 of a trapezoidal shape (formed of a triangular systolic array and a square systolic array), a memory 302 that stores input data, a memory 303 that stores an output from a lower side of the square systolic array of the two-dimensional MFA systolic array 301, and a memory 304 that stores an output from a side of a side face of the square systolic array of the two-dimensional MFA systolic array 301.
There is a document that has disclosed a configuration in which projection of a two-dimensional MFA systolic array onto a one-dimensional array in a horizontal direction or a vertical direction is performed (refer to Non-Patent Document 2). However, this document never discloses a configuration in which projection of a two-dimensional MFA systolic array using the MFA algorithm onto a one-dimensional array is performed.
[Non-patent Document 1]
J. G. Nash, “Modified Faddeeva Algorithm for Concurrent Execution of Linear Algebraic Operations”, IEEE Trans. Computers, vol. 37, No 2, pp 129-137 (1988)
[Non-patent Document 2]
R. Walke, R. Smith, “Architecture for Adaptive Weight Calculation on ASIC and FPGA”, Signals, Systems, and Computers, 1999. Conference Record of the Thirty-Third Asilomar Conference on, Volume 2, 24-27 Oct. 1999 Page(s): 1375-1380, vol. 2