Certain matrix operations require that a matrix be factored. For example, factoring a matrix may be necessary when a matrix is to be inverted. The result may be a “triangulated” matrix—i.e., a matrix with zero values above the diagonal. The consequence is that only the values on the diagonal, and in the columns below those values, need to be calculated.
In Cholesky decomposition, to factor an input matrix A, an element $L_{i,i}$ on the diagonal of the resultant triangulated matrix M may be calculated as:
$$L_{i,i} = \sqrt{a_{i,i} - \sum_{k=1}^{i-1} L_{i,k} \cdot L_{i,k}}$$
where $a_{i,i}$ is the $(i,i)$th element of the original input matrix A, and $L_{i,k}$ is the $(i,k)$th element in the resultant triangulated matrix M. The subsequent elements in the jth column of M may be calculated as:
$$L_{i,j} = \frac{1}{L_{j,j}} \left( a_{i,j} - \sum_{k=1}^{j-1} L_{i,k} \cdot L_{j,k} \right), \quad \text{for } i > j$$
where $a_{i,j}$ is the $(i,j)$th element of the original input matrix A, and $L_{i,k}$ and $L_{j,k}$ are the $(i,k)$th and $(j,k)$th elements, respectively, in the resultant triangulated matrix M. To perform this calculation, the $L_{j,j}$ term needs to be calculated before any of the $L_{i,j}$ ($i > j$) elements can be calculated. The inner product in each term (i.e., $\sum_{k=1}^{i-1} L_{i,k} \cdot L_{i,k}$ or $\sum_{k=1}^{j-1} L_{i,k} \cdot L_{j,k}$), which in the case of all real values is the same as a dot product but in the case of complex values requires computing complex conjugates, may require dozens of clock cycles. Similarly, the square root calculation in the computation of the diagonal element $L_{i,i}$ can also impose noticeable latency.
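As a purely illustrative software sketch of the two formulas and their ordering constraint, the following Python function computes the factor one column at a time for a real, symmetric, positive-definite input. The function name `cholesky_lower`, the list-of-rows representation, and the 0-based indexing are assumptions made for this example; the complex-valued case, which would require conjugates in the inner products, is not handled.

```python
import math


def cholesky_lower(a):
    """Return the lower-triangular factor of a real symmetric positive-definite matrix.

    `a` is a list of rows. Indices here are 0-based, whereas the text uses 1-based indices.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal element: inner product of row j with itself, then a square root.
        s = sum(L[j][k] * L[j][k] for k in range(j))
        d = a[j][j] - s
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        # Elements below the diagonal in column j; each one needs L[j][j] first.
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (a[i][j] - s) / L[j][j]
    return L


A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
print(cholesky_lower(A))
# → [[2.0, 0.0, 0.0], [6.0, 1.0, 0.0], [-8.0, 5.0, 3.0]]
```

The inner `sum` calls correspond to the inner products discussed above, and the division by `L[j][j]` makes the dependency on the diagonal element explicit: no element of column j below the diagonal can be completed until the square root for that column has finished.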
In standard implementations, partial computations (e.g., each product in the aforementioned inner product) are carried out in order and in adjacent cycles. A delay line is used to assemble the results of the partial computations, and an adder tree is used to combine the assembled results. Such implementations have various limitations. First, they use important resources such as the adder tree for only a fraction of the operation time, which is inefficient. Second, they suffer from significant latency, especially when manipulating floating-point data types. For example, typical latencies for multipliers and adders are on the order of 10 clock cycles, so a dot product operator with a few tens of inputs may exceed a latency of 100 clock cycles. Such long latencies may cause significant routing congestion and render dataflow management intractable. Third, the maximum number of items to be combined must be known at creation time, which reduces run-time flexibility and increases the state-machine complexity of the design. Because of these limitations, standard implementations may result in systems with poor performance, especially for floating-point operations.
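The latency figure above can be illustrated with a back-of-envelope estimate. The model below is an assumption made for this example, not a description of any particular design: one product is issued per cycle, in order; every multiplier and adder stage takes 10 cycles; and the adder tree is a balanced binary tree. The function name `dot_product_latency` and its parameters are hypothetical.

```python
def dot_product_latency(n_inputs, mult_latency=10, add_latency=10):
    """Estimate cycles until the final sum is available, under the assumed model."""
    # The last product is issued n_inputs - 1 cycles after the first one.
    issue_cycles = n_inputs - 1
    # Depth of a balanced binary adder tree: ceil(log2(n_inputs)), computed exactly.
    tree_depth = (n_inputs - 1).bit_length()
    return issue_cycles + mult_latency + tree_depth * add_latency


print(dot_product_latency(32))  # → 91
print(dot_product_latency(48))  # → 117
```

Under these assumptions, an operator with 48 inputs already exceeds 100 clock cycles, and most of that time is spent waiting on stages that sit idle for the rest of the operation.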
Different Cholesky decomposition implementations may need to accommodate different matrix sizes or satisfy different speed grades, target frequencies, or throughput requirements. This may particularly be the case in programmable devices, where different users may require resources for matrix operations of different sizes or at different speeds.