Linear Algebra PACKage (LAPACK) is a very efficient and robust world-wide-used linear algebra function library jointly developed by Oak Ridge National Lab, Davis branch of California University and Illinois University, for solving numerical linear algebra problems highly effectively in various high performance computing environments. It has served the HPC (High Performance Computing) and Computational Science community remarkably well for twenty years. See http://netlib.amss.ac.cn/lapack/index.html for the detail of the LAPACK.
As a professional linear algebra library, LAPACK provides various linear algebra subroutines, including the routine for implementing the QR factorization of matrix.
The meaning of QR factorization of matrix is: for a given M×N matrix A, seeking the factorization:A=Q*R, 
where Q is an M×M orthogonal matrix, and R is an M×N upper triangular matrix.
The existing QR factorization routine in LAPACK is implemented according to a panel QR factorization solution, which is a blocked factorization solution.
FIG. 1 is an illustration of the existing panel QR factorization solution, wherein FIGS. 1(a) and (b) are the overall and stepped illustrations of kth iteration computation in the existing panel QR factorization solution, respectively, and FIG. 1(c) is a description of the algorithm of the existing panel QR factorization solution. FIG. 2 is a flowchart of the existing panel QR factorization solution.
Generally, as shown in FIG. 1(a), the idea of the existing panel QR factorization solution is that, for a given M×N matrix A, iteratively, factorization operation is performed on one panel of the matrix at one time, to finally factorize the matrix A into the product of an M×M matrix Q and an M×N upper triangular matrix R. For simplicity in the present invention, the matrix A is illustrated as a square matrix in the figures. In fact, the matrix A in the figures may be an arbitrary M×N matrix instead of the square matrix, wherein M and N are unequal positive integers. By taking an iteration therein as an example, as shown on the left side of FIG. 1(a), the matrix parts V and R in light grey are ones factorized through the 1th˜(k−1)th iteration operations, while the matrix part in dark grey combined by the matrix parts A1(k) and A2(k) is not factorized and is also the object of the kth (k=1, 2, 3 . . . ) iteration operation. Further, in the kth iteration operation, the matrix part in dark grey is partitioned into two panels A1(k) and A2(k), where A1(k) is the current working panel; then an QR factorization computation is performed on the current working panel A1(k), and A2(k) is updated by using the result of the factorization computation, thus the matrix on the right side of FIG. 1(a) is obtained. Therein, in the matrix on the right side of FIG. 1(a), the matrix part Ã2(k) in dark grey becomes the factorization object for the (k+1)th iteration operation.
Specifically, as shown in FIGS. 1(b), (c) and FIG. 2, in the existing panel QR factorization solution, for a given M×N matrix A, partition is performed first to partition it into m×n blocks, where each block is Nb×Nb such as 32×32 in size, then the following steps 1˜3 will be performed in the kth (k=1, 2, 3 . . . ) iteration operation according to:
      A          (      k      )        =            (                                                  A              1                              (                k                )                                                                        A              2                              (                k                )                                                        )        =                  (                                                            A                11                                                                    A                12                                                                                        A                21                                                                    A                22                                                    )            =              Q        ·                  (                                                                      R                  11                                                                              R                  12                                                                                    0                                                              R                  22                                                              )                    
At step 1, a panel A1(k) composed of m×nb blocks is partitioned out from the object matrix A(k) of the iteration operation this time as the current working panel, and the QR factorization computation is performed on the current working panel A1(k) to factor it into a V part and an R part;
at step 2, the triangular factor T of the current working panel A1(k) is calculated based on the computation result of step 1; and
at step 3, the current working panel A1(k) and the triangular factor T of A1(k) are applied to the rest matrix part A2(k) of A(k) to update its data. LAPACK only outputs matrixes V and R, and the user can compute on the matrix V to obtain matrix Q, thus completing the QR factorization.
FIG. 3 shows a process of QR-factorizing a matrix partitioned into 3×3 blocks with the above existing panel QR factorization solution (a case of only one time iteration). Therein as shown in FIG. 3(a), at step 1, QR factorization is performed on the current working panel of 3×1 blocks on the left of the matrix; as shown in FIG. 3(b), at step 2, the triangular factor Tk of the current working panel is calculated; as shown in FIG. 3(c), the rest matrix part of 3×2 blocks is updated by using the current working panel and the triangular factor Tk.
There will be a lot of matrix-multiply operations in the QR factorization routine designed based on the above existing panel QR factorization solution, for such routine, performance is very critical.
The Cell Broadband Engine (CBE) is a single-chip multiprocessor system. As shown in FIG. 4, the CBE system has 9 processors operating on a shared, coherent memory, including a Power Processing Unit (PPU) and 8 Synergistic Processing units (SPU). Under such system architecture, the CBE can provide outstanding computing capability. Specifically, the Cell processor is capable of achieving 204 Gflops/sec when clocked at 3.2 GHz. Having such a high computing capability, CBE is obviously an ideal running platform for matrix QR factorization having a large amount of computation tasks.
However the above existing panel QR factorization solution is designed for a single-processor system. If it is directly applied to such a multiprocessor system as CBE, there will be a memory bandwidth limitation problem. The reason is as follows. In CBE, the capacity of the local memory of each SPU is 256K, thus in the case of a large data size exceeding 256K, it is needed to execute read in/read out operations repetitively between a main memory and the local memory of the SPU by way of DMA. For example, in the case that the matrix is partitioned into a plurality of blocks each in the size of 32×32, if the above existing panel QR factorization solution is implemented for the matrix on 8 SPUs of the CBE, then the maximum memory requirement will be 20.6 GB/second. However, QS20 and QS21 blade in the CBE is only capable of sustaining roughly a memory bandwidth of 20.5 GB/second. So the memory bandwidth becomes a bottle neck for the above existing panel QR factorization solution to be applied to a multiprocessor system like CBE to improve the performance of QR factorization. Therefore, there is a need for designing a QR factorization solution suitable for a multiprocessor system like CBE.