The present invention relates generally to processor elements and techniques, and more particularly to pipelined processing devices comprising arrays of processor elements for performing matrix computations.
As advances in technology enable manufacturers to put more wires and transistors on a single integrated circuit, new opportunities arise for performing digital signal processing operations that previously would have been impractical.
An example of one such emerging application is the use of adaptive beamforming for arrays of antennas in a base station of a wireless communication system. The objective in this case is to allow more mobile users and higher data rates in a given cell. This is achieved by adding together the signals received by multiple antennas with time varying complex weights, in order to form a radiation pattern which points toward the mobile of interest and includes nulls on sources of interference. Various solutions for optimizing this type of problem are known and involve performing computations with matrices and vectors of complex numbers. See, e.g., S. Haykin, xe2x80x9cAdaptive Filter Theory,xe2x80x9d 3rd ed., Prentice Hall, 1996. Instead of only a single complex multiplication being called for at a given time, there is the need for massive numbers of complex multiplications.
Another emerging application of this type is multi-user detection for wireless system base stations using Direct Sequence Code Division Multiple Access (DS-CDMA) coding. In these algorithms, as described in, e.g., S. Moshavi, xe2x80x9cMulti-User Detection for DS-CDMA Communications,xe2x80x9d p. 124, IEEE Communications Magazine, October 1996, collective knowledge gained in real time about the various mobiles that are interfering with each other is used to un-entangle that interference and improve the detection of each individual mobile. These algorithms also involve intensive calculations with matrices of complex numbers.
The above-noted emerging applications for wireless base stations often use variations of well-known algorithms which make use of fundamental matrix computations. The four operations typically of most interest are:
1. Solution of a set of n linear equations with n unknowns. One possible approach is to triangulate the set of equations using QR decomposition via Givens rotations, and then solve the set of triangulated equations by back substitution.
2. Matrix inversion. One possible approach is to triangulate a set of equations using QR decomposition via Givens rotations, and then solve the set of triangulated equations for the inverse by multiple back substitutions.
3. Matrix-matrix multiplication.
4. Covariance and cross correlation. One often needs to compute such statistics on vectors of input signals in order to form matrices and vectors that are further processed.
Conventional techniques for performing each of these operations will be described in greater detail below.
In the case of determining the solution of a set of linear equations, the problem in matrix notation may be expressed as:
Ax=yxe2x80x83xe2x80x83(1)
where A is a known square matrix of complex values, y is a known vector of complex values, and x is an unknown complex vector. There are numerous techniques for the numerical solution of such equations. However, some of these techniques suffer from numerical instabilities. A numerically robust technique for solving Equation (1) is known as QR decomposition or QR factorization via Givens rotations, and is described in, e.g., G. H. Golub and C. F. Van Loan, xe2x80x9cMatrix Computations,xe2x80x9d 3rd ed., Johns Hopkins Univ. Pr., 1996. This approach involves factoring the matrix A into the product:
Axe2x89xa1QRxe2x80x83xe2x80x83(2)
where Q is a unitary matrix and R is a right upper triangular matrix. The meaning of xe2x80x9cright upper triangularxe2x80x9d is that all of the components of R below the main diagonal are zeroes. Then, both sides of Equation (1) are multiplied by the conjugate transpose (i.e., the Hermitian transpose) of Q, denoted QH.
QHAx=QHyxe2x80x83xe2x80x83(3)
Substituting Equation (2) into Equation (3) gives:
QHQRx=QHyxe2x80x83xe2x80x83(4)
Since QHQ equals the Identity matrix, Equation (4) reduces to:
Rx=QHyxe2x80x83xe2x80x83(5)
Equation (5) is in a form that can be solved easily for each component of x using a technique known as xe2x80x9cback substitution.xe2x80x9d The term xe2x80x9cback substitutionxe2x80x9d refers to an iterative method where first the bottom row of the matrices in Equation (5), which has the most zeroes and only one unknown, is solved. Then the solution is substituted into the next row above, and that row is then solved. This process continues until all the rows of the matrices in Equation (5) are solved.
An effective and numerically stable method for finding the unitary matrix QH is as a product of individual so-called xe2x80x9cGivens rotations,xe2x80x9d each one of which zeroes out one element of the preceding matrix without disturbing the previously placed zeroes. See, e.g., W. Givens, xe2x80x9cComputation of Plane Unitary Rotations Transforming a General Matrix To Triangular Form,xe2x80x9d J. Soc. Indust. Appl. Math., Vol. 6, No. 1, p. 26, March 1958. Givens rotations are also known in the art as being useful for performing Recursive Least Squares adaptive algorithms.
An example illustrating the triangulation of a small matrix equation using Givens rotations will now be described, with reference to FIGS. 1 through 5. Equation (6) below is an expansion of Equation (1) showing the individual matrix and vector elements for the example.                                           (                          xe2x80x83                        ⁢                                                                                a                    11                                                                                        a                    12                                                                                        a                    13                                                                                        a                    14                                                                                                                    a                    21                                                                                        a                    22                                                                                        a                    23                                                                                        a                    24                                                                                                                    a                    31                                                                                        a                    32                                                                                        a                    33                                                                                        a                    34                                                                                                                    a                    41                                                                                        a                    42                                                                                        a                    43                                                                                        a                    44                                                                        ⁢                          xe2x80x83                        )                    ⁢                      (                                                                                x                    1                                                                                                                    x                    2                                                                                                                    x                    3                                                                                                                    x                    4                                                                        )                          =                  (                                                                      y                  1                                                                                                      y                  2                                                                                                      y                  3                                                                                                      y                  4                                                              )                                    (        6        )            
FIG. 1 shows the stages of the corresponding Givens rotations. In the six stages shown in FIG. 1, zeroes are successively placed in a five by four array. The letters at the head of each array indicate that four of the columns are elements of the matrix A being transformed and the fifth column has elements of the vector y being transformed. The zeroes in bold type indicate the zeroes that are being placed in each stage. Equation (7) below shows in more detail how one does the first unitary transformation in stage one of FIG. 1. More particularly, one applies a two by two complex unitary matrix to the first two rows to force one element to a zero, the other elements all being transformed to new values. The xe2x80x9c*xe2x80x9d symbol indicates the complex conjugate.                               (                                                                      c                  *                                                                              s                  *                                                                                                      -                  s                                                            c                                              )                ⁢                  
                ⁢                              (                                                                                a                    11                                                                                        a                    12                                                                                        a                    13                                                                                        a                    14                                                                                        y                    1                                                                                                                    a                    21                                                                                        a                    22                                                                                        a                    23                                                                                        a                    24                                                                                        y                    2                                                                        )                    =                      (                                                                                a                    11                    xe2x80x2                                                                                        a                    12                    xe2x80x2                                                                                        a                    13                    xe2x80x2                                                                                        a                    14                    xe2x80x2                                                                                        y                    1                    xe2x80x2                                                                                                0                                                                      a                    22                    xe2x80x2                                                                                        a                    23                    xe2x80x2                                                                                        a                    24                    xe2x80x2                                                                                        y                    2                    xe2x80x2                                                                        )                                              (        7        )            
By multiplying out the matrices to find the element which is being forced to zero, one arrives at the equation:
xe2x80x83xe2x88x92sa11+ca21=0xe2x80x83xe2x80x83(8)
together with the unitary constraint:
|s|2+|c|2=1xe2x80x83xe2x80x83(9)
These equations may be satisfied by:                     c        =                                            a              11                                                                                            "LeftBracketingBar"                                          a                      11                                        "RightBracketingBar"                                    2                                +                                                      "LeftBracketingBar"                                          a                      21                                        "RightBracketingBar"                                    2                                                              ⁢                      xe2x80x83                    ⁢          and                                    (        10        )                                s        =                              a            21                                                                                "LeftBracketingBar"                                      a                    11                                    "RightBracketingBar"                                2                            +                                                "LeftBracketingBar"                                      a                    21                                    "RightBracketingBar"                                2                                                                        (        11        )            
The value of the non-zero transformed element a11xe2x80x2 is:                                           a            11            xe2x80x2                    =                                                                      "LeftBracketingBar"                                      a                    11                                    "RightBracketingBar"                                2                            +                                                "LeftBracketingBar"                                      a                    21                                    "RightBracketingBar"                                2                                                                                                          "LeftBracketingBar"                                          a                      11                                        "RightBracketingBar"                                    2                                +                                                      "LeftBracketingBar"                                          a                      21                                        "RightBracketingBar"                                    2                                                                    ,                            (        12        )            
a real number. The matrix also xe2x80x9crotatesxe2x80x9d all other elements on the two rows.
It is well known that QR decomposition using Givens rotations, as described above, can be solved rapidly in a triangular shaped two dimensional systolic array. See, e.g., W. M. Gentleman and H. T. Kung, xe2x80x9cMatrix triangularization by systolic arrays,xe2x80x9d SPIE V. 298, p. 19, 1981.
FIGS. 2-5 show the beginning of the procedure for triangulating a five by four matrix equation on a triangular array of processor elements (PEs). The round boxes in FIGS. 2 through 5 are special xe2x80x9cheadxe2x80x9d elements that compute the c, s, and updated element as described in Equations (10)-(12) above. The square boxes are xe2x80x9cregularxe2x80x9d elements that rotate two elements using c and s. They also propagate c and s to the right. In Step 1 of the Givens rotation, shown in FIG. 2, the first c1 and s1 are calculated. In Step 2, shown in FIG. 3, the second c2 and s2 are calculated and the first square box in the uppermost row of elements uses c1 and s1 to do a rotation. In Step 3, shown in FIG. 4, the third c3 and s3 are calculated, the first square box uses c2 and s2 to do a rotation, and the second square box uses c1 and s1 to do a rotation. In Step 4, shown in FIG. 5, the head element of the second row can recursively begin to calculate the fourth c4 and s4. The square boxes above the head element of the second row are still doing rotations. The calculation proceeds systolically through the triangular array.
Note that there exists some opportunity in this problem to reduce the latency of the computation by a recursive procedure. After two zeroes have been placed in a given column, it is possible to begin placing zeroes in the next column to the right simultaneous to continuing to place further zeroes in the original column. The degree of parallelism available increases with the size of the matrix equation being processed.
There is a variation of the above-described architecture in which computations are performed in parallel instead of systolically. In this variation, once c1 and s1 are known, they can be transmitted simultaneously to all the square boxes to the right of the head element and all the rotations using those values of c1 and s1 can be performed in parallel. It should be noted that the computation of c and s in the head element is computationally intensive because absolute square and reciprocal square root are required.
FIG. 6 illustrates generally how data is processed over time in the two architectures for QR decomposition using Givens rotations. In the parallel case, an approximately horizontal band of processor elements is busy at any one time. Over time, the band sweeps downward through the triangular array as shown. In the systolic case, an angled band of active processor elements sweeps downward through the triangular array, maintaining its angle downward through the array. It is apparent that in both cases there is poor utilization of the processor elements. In the QR decomposition of a larger matrix, most of the processor elements are idle at any given time. For example, with a matrix equation of size 17 columnsxc3x9716 rows at the peak of the computation, up to five rows worth of processor elements in the triangle can be active at once, assuming the optimum latency reduction. Nonetheless, on the average, only a small percentage of processor elements can be kept busy.
It is therefore desirable to find an architecture for performing QR decomposition and other types of matrix computations that is more efficient than the two dimensional triangular array described above. One approach is to consider a one dimensional processor array and sequentially map the operations of the virtual two dimensional array onto the physical one dimensional array. Such an approach is described in G. Lightbody, R. Walke, R. Woods, and J. McCanny, xe2x80x9cNovel Mapping of a Linear QR Architecture,xe2x80x9d ICASSP99: IEEE Int. Conf. Acoust., Speech, Signal Proc., p. 1933, 1999. This approach uses a remapping and scheduling scheme to map from a two dimensional triangular array to a linear array which achieves high processor element utilization. However, the total latency of the algorithm is much longer than that for the replaced triangular array. Also, in order for this approach to achieve 100% hardware utilization, it is necessary to interleave the steps of two independent triangulation problems, the second beginning when the first is half finished.
Another approach, described in J. H. Moreno and T. Lang, xe2x80x9cMatrix Computations on Systolic-Type Arrays,xe2x80x9d pp. 206-210, Kluwer Academic Publishers, Boston Mass., 1992, uses a linear array of pipelined PEs to perform matrix algorithms of this type with high utilization.
There exists another technique for dealing with the adaptive antenna array problem described above. A less computationally complex algorithm known as Recursive Least Squares (RLS), as described in, e.g., J. G. McWhirter, xe2x80x9cRecursive least-squares minimization using a systolic array,xe2x80x9d Proc. SPIE V. 431, p. 105, 1983, may be applied to adaptive antenna problems. This approach achieves less than optimum results, but has the advantage of reduced computational complexity. The RLS algorithm avoids the need for the back substitution described above. It is known how to perform the RLS algorithm rapidly on a two dimensional triangular systolic array. Hardware solutions based on this two dimensional array are described in, e.g., B. Haller, Algorithms and VLSI Architectures for RLS-Based Time Reference Beamforming in Mobile Communications,xe2x80x9d IEEE 1998 Intl. Zurich Seminar on Broadband Comm., p. 29, 1998.
The above-noted conventional back substitution process will now be described in greater detail. As described above, after QR decomposition has been completed, the triangulated set of equations will be in the form of Equation (5), which expanded for the current example is:                                                                                                               r                    11                                    ⁢                                      x                    1                                                  +                                                      r                    12                                    ⁢                                      x                    2                                                  +                                                      r                    13                                    ⁢                                      x                    3                                                  +                                                      r                    14                                    ⁢                                      x                    4                                                              =                              y                1                                                                                                                                              r                    22                                    ⁢                                      x                    2                                                  +                                                      r                    23                                    ⁢                                      x                    3                                                  +                                                      r                    24                                    ⁢                                      x                    4                                                              =                              y                2                                                                                                                                              r                    33                                    ⁢                                      x                    3                                                  +                                                      r                    34                                    ⁢                                      x                    4                                                              =                              y                3                                                                                                                          r                  44                                ⁢                                  x                  4                                            =                              y                4                                                                        (        13        )            
It should be noted that following the Givens procedure described previously, all of the diagonal elements of the R matrix except the last, in this example r11, r22, and r33, are real, and the last diagonal element, in this example r44, is complex. One may solve the set of Equations (13) by the back substitution process. More particularly, the last of the four equations, which has only one unknown, x4, may be solved first, e.g., x4 can be obtained by dividing y4 by r44. Alternately, the reciprocal,       1          r      44        ,
may be computed and then multiplied by y4 to find x4. Then x4 may be substituted into the third of the four equations. The reciprocal of r33 may be computed. Then x3 may be found as       1          r      33        ⁢            (                        y          3                -                              r            34                    ⁢                      x            4                              )        .  
This process then continues backwards through the equations until all the unknowns are found.
The inverse of a matrix may be found by a procedure involving a series of back substitutions as described above. Assume that one wishes to find the inverse Axe2x88x921 of a matrix A. By definition:
AAxe2x88x921=Ixe2x80x83xe2x80x83(14)
where I is the identity matrix. Similar to the discussion above, there is a unitary matrix Q such that:
A=QRxe2x80x83xe2x80x83(15)
where R is right triangular. Substituting this into Equation (14) gives
QRAxe2x88x921=Ixe2x80x83xe2x80x83(16)
One can then proceed to find the Hermitian conjugate of Q, QH, as the product of a series of Givens rotations as described above. Multiplying both sides of Equation 16 by QH gives:
RAxe2x88x921=QHxe2x80x83xe2x80x83(17)
Remember that at this time R and QH are both known and Axe2x88x921 is the unknown. Since each column of the left side of Equation (17) must equal the corresponding column on the right side, Equation (17) may be broken up into a series of equations:
Rxj=yjxe2x80x83xe2x80x83(18)
where xj is a column of Axe2x88x921 and yj is the corresponding column of QH. Each of the equations in the series of equations (18) for different values of j looks like the back substitution problem of Equation (5). All of these back substitution problems share a common right triangular matrix, R. Hence the inverse of a matrix can be found by performing QR decomposition via Givens rotations followed by a number of back substitutions.
Another common operation that is performed in advanced processing of arrays of signals is the multiplication of one matrix with another matrix, where the elements of each matrix have complex values. Consider the formation of matrix C by the multiplication of matrices A and B, where in this example each of the three matrices is 4xc3x974 with complex elements. Equation (19) expands this example, showing the matrix elements:                                           (                          xe2x80x83                        ⁢                                                                                c                    11                                                                                        c                    12                                                                                        c                    13                                                                                        c                    14                                                                                                                    c                    21                                                                                        c                    22                                                                                        c                    23                                                                                        c                    24                                                                                                                    c                    31                                                                                        c                    32                                                                                        c                    33                                                                                        c                    34                                                                                                                    c                    41                                                                                        c                    42                                                                                        c                    43                                                                                        c                    44                                                                        ⁢                          xe2x80x83                        )                    ≡                      (                          xe2x80x83                        ⁢                                                                                a                    11                                                                                        a                    12                                                                                        a                    13                                                                                        a                    14                                                                                                                    a                    21                                                                                        a                    22                                                                                        a                    23                                                                                        a                    24                                                                                                                    a                    31                                                                                        a                    32                                                                                        a                    33                                                                                        a                    34                                                                                                                    a                    41                                                                                        a                    42                                                                                        a                    43                                                                                        a                    44                                                                        ⁢                          xe2x80x83                        )                          ⁢                  
                ⁢                  (                      xe2x80x83                    ⁢                                                                      b                  11                                                                              b                  12                                                                              b                  13                                                                              b                  14                                                                                                      b                  21                                                                              b                  22                                                                              b                  23                                                                              b                  24                                                                                                      b                  31                                                                              b                  32                                                                              b                  33                                                                              b                  34                                                                                                      b                  41                                                                              b                  42                                                                              b                  43                                                                              b                  44                                                              ⁢                      xe2x80x83                    )                                    (        19        )            
Each of the sixteen complex elements of C results from four complex multiplications of elements from a row of A and elements from a column of B. For example:
c23=a21b13+a22b23+a23b33+a24b43xe2x80x83xe2x80x83(20)
Thus there are a total of 4xc3x974xc3x974=64 complex multiplications required. The matrix-matrix multiplication of Equation (19) may be computed in many ways on many different hardware architectures. On an ordinary programmable digital computer, there is hardware for multiplying a real number by another real number, but there is normally no hardware assistance beyond that. A program normally performs each complex multiplication as a sum of four real multiplications:
a21b13=(a21realb13realxe2x88x92a21imaginaryb13imaginary) +i(a21realb13imaginary+a21imaginaryb13real)xe2x80x83xe2x80x83(21)
On many current digital computers, the arithmetic hardware for multiply and accumulate has been pipelined such that a new term can begin to be multiplied before the previous term has completed its multiplication and accumulation, thereby improving the throughput of computations such as those shown in Equations (20) and (21).
On existing vector supercomputers, hardware under program control assists in loading vector registers with operands (a21, a22, etc. in the present example) from memory so they can rapidly be fed to the input of the pipelined multiplication hardware.
Some recent programmable digital signal processors (DSPs) contain more than one multiplier. For example the StarCore SC140 DSP, from Lucent Technologies Inc. Microelectronics Group of Allentown, Pa., contains four arithmetic units each capable of multiplying two real numbers together and accumulating the result, every cycle. If programmed properly, the SC140 can achieve a peak rate of one complex multiply accumulate per cycle in performing the operation of Equation (20).
Two dimensional systolic arrays of PEs have been proposed in U.S. Pat. No. 4,493,048, issued in 1985 in the name of inventors H. T. Kung and C. E. Leiserson and entitled xe2x80x9cSystolic Array Apparatuses for Matrix Computations,xe2x80x9d for performing operations such as matrix-matrix multiplication. Each PE is described as capable of performing a xe2x80x9cstepxe2x80x9d operation consisting of the multiplication of two numbers followed by an accumulate per cycle. In these arrays, the elements of A and B are fed into PEs on the periphery of the array with a particular pattern. The elements of C emerge from PEs on the periphery. Other two dimensional array examples have been described, e.g., in S. Y. Kung, xe2x80x9cVLSI Array Processors,xe2x80x9d p. 213, Prentice Hall, 1988. If such an array were built with each PE capable of performing a complex multiply-accumulate, then the array could have very high performance. However, such an array would be quite costly and it would be difficult to feed the data in and get the data out.
In the above-noted adaptive antenna application, a number of spatially separated antennas all receive the signal from the same set of desired mobile users and undesired sources of interference. By adding together the signals received from the antennas with a particular set of complex weighting values, the adaptive antenna array can be made to have a high gain (directionality) in the direction of one mobile and have nulls in the direction of sources of interference. Let x be a vector of signals received by the antenna array. Ideally for a stationary set of mobiles and interferers and practically for a non-stationary set that is not changing too rapidly, the vector of optimum weights, wopt, for detecting a desired signal, yd, is given by the Wiener-Hopf Equation:
xe2x80x83Rxwopt=rxdxe2x80x83xe2x80x83(22)
where
Rx=E{xxH}xe2x80x83xe2x80x83(23)
is the complex covariance matrix and
rxd=E{xyd}xe2x80x83xe2x80x83(24)
is the complex cross correlation vector. The { } operation indicates taking a statistical average over time. In practical situations, that average can be weighted more heavily for recently received data and with decreasing weight for older data. In some practical systems, the transmitter periodically sends a known signal and it is this known signal that is used for yd to train the system and discover wopt.
It is therefore necessary to compute a covariance matrix as in Equation (23) and a cross correlation vector as in Equation (24) fairly frequently. Although both of these procedures are computationally less complex than the matrix-matrix multiplication discussed above, conventional techniques are unable to provide the desired computational efficiency.
In view of the foregoing, it is apparent that a need exists for a more efficient processor element architecture, suitable for use in performing computational operations such as solution of a set of equations via QR decomposition followed by back substitution, as well as other matrix-related operations such as matrix inversion, matrix-matrix multiplication, and computation of covariance and cross correlation.
The present invention provides an architecture for a pipelined programmable processor element (PE) and a linear array of such PEs that are particularly efficient at performing a variety of matrix operations. In accordance with the invention, each PE includes arithmetic circuitry operative: (i) to multiply real and imaginary parts of at least two complex numbers by real and imaginary parts of at least another two complex numbers, thereby forming at least sixteen partial products, and (ii) to form one or more additive combinations of the partial products, each of the additive combinations representing a real or imaginary number. A register file in each PE includes at least a first port and a second port, each of the first and second ports being capable of reading two complex words or writing two complex words to or from the register file. The ports are coupled to the arithmetic circuitry for supplying the complex numbers thereto and receiving the real or imaginary numbers therefrom.
In accordance with another aspect of the invention, a linear array of PEs suitable for performing matrix computations includes a head PE and a set of regular PEs, the head PE being a functional superset of the regular PE, with interconnections between nearest neighbor PEs in the array and a feedback path from a non-neighbor regular PE back to the head PE. The head PE further includes a non-linear function generator. Each PE is pipelined such that the latency for an arithmetic operation to complete is a multiple of the period with which new operations can be initiated. A Very Large Instruction Word (VLIW) program or other type of program may be used to control the array. The array is particularly efficient at performing complex matrix operations, such as, e.g., the solution of a set of linear equations, matrix inversion, matrix-matrix multiplication, and computation of covariance and cross correlation.
Advantageously, the invention allows matrix computation previously implemented using two dimensional triangular arrays to be implemented in an efficient manner using pipelined linear arrays of processing elements. The invention thus allows matrix computations to be implemented with a substantial reduction in hardware complexity, while providing only a slight increase in latency as a result of the pipelining.