1. Field of the Invention
The present invention relates to a parallel matrix processing method adopted in a shared-memory scalar parallel-processing computer.
2. Description of the Related Art
A known method of using a computer to find solutions to simultaneous linear equations expresses the equations in terms of matrices, which are then processed to find the solutions. That is, in accordance with this method, the equations are first converted into a form that allows the solutions to be found with ease.
To put it in more detail, the coefficients of the simultaneous linear equations are expressed as a coefficient matrix, and the variables of the equations are expressed as a variable vector. The problem is to find a variable vector such that the product of the coefficient matrix and the variable vector is equal to a predetermined column vector. In accordance with a technique called LU factorization, the coefficient matrix is factorized into the product of a lower-triangular matrix and an upper-triangular matrix. This factorization is an important step in finding solutions to simultaneous linear equations. A special version of the LU factorization, applicable to symmetric positive-definite matrices, is known as the Cholesky factorization.
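For illustration only, the following sketch shows how an LU factorization can be used to solve a small system of simultaneous linear equations. The NumPy-based helpers `lu_factorize` and `lu_solve` are hypothetical names introduced here, not part of the described method, and pivoting is omitted for brevity.

```python
import numpy as np

def lu_factorize(a):
    """In-place LU factorization without pivoting (hypothetical helper).

    After the call, the upper triangle of `a` holds U and the strict
    lower triangle holds the multipliers of L (unit diagonal implied).
    """
    n = a.shape[0]
    for k in range(n):
        a[k+1:, k] /= a[k, k]                               # column of L
        a[k+1:, k+1:] -= np.outer(a[k+1:, k], a[k, k+1:])   # rank-1 update
    return a

def lu_solve(a, b):
    """Solve A x = b given the packed LU factors produced above."""
    n = a.shape[0]
    y = b.astype(float).copy()
    for i in range(n):                   # forward substitution with L
        y[i] -= a[i, :i] @ y[:i]
    x = y
    for i in range(n - 1, -1, -1):       # back substitution with U
        x[i] = (x[i] - a[i, i+1:] @ x[i+1:]) / a[i, i]
    return x

# Solve 4x + 3y = 10, 6x + 3y = 12 via the LU factors.
A = np.array([[4.0, 3.0], [6.0, 3.0]])
b = np.array([10.0, 12.0])
x = lu_solve(lu_factorize(A.copy()), b)   # x satisfies A @ x == b
```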
a) Technique for Solving Real-matrix Simultaneous Linear Equations
In accordance with a technique for solving simultaneous linear equations expressed in terms of a real matrix, the equations are solved by using a vector parallel-processing computer, in which the parallel processing is based on blocked outer-product LU factorization. Concretely, this technique comprises the steps of:
1. Applying the LU factorization to a block formed by bundling a number of column vectors;
2. Updating the block formed by bundling the corresponding row vectors; and
3. Updating the remaining rectangular submatrix.
The technique is implemented by executing the sequence of the steps repeatedly.
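The repeated sequence of steps 1 to 3 can be sketched as follows. This is a minimal, unpivoted NumPy illustration of blocked outer-product LU factorization; the function name `blocked_lu` and the block width `nb` are assumptions made for this example only (the text's block width of about 12 is reduced here for clarity).

```python
import numpy as np

def blocked_lu(a, nb=2):
    """Blocked outer-product LU factorization, no pivoting (a sketch).

    nb is the block width: the number of column vectors bundled
    into each block.
    """
    n = a.shape[0]
    for j in range(0, n, nb):
        je = min(j + nb, n)
        # Step 1: LU-factorize the column block (panel) sequentially.
        for k in range(j, je):
            a[k+1:, k] /= a[k, k]
            a[k+1:, k+1:je] -= np.outer(a[k+1:, k], a[k, k+1:je])
        if je < n:
            # Step 2: update the corresponding row block by solving
            # L11 * U12 = A12 with the unit-lower-triangular L11.
            L11 = np.tril(a[j:je, j:je], -1) + np.eye(je - j)
            a[j:je, je:] = np.linalg.solve(L11, a[j:je, je:])
            # Step 3: rank-nb update of the trailing rectangular
            # submatrix -- a matrix product, and the costliest step.
            a[je:, je:] -= a[je:, j:je] @ a[j:je, je:]
    return a
```

The packed result holds L in the strict lower triangle (unit diagonal implied) and U in the upper triangle.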
Conventionally, the processing of step 1 is carried out sequentially by one processor. In order to improve the parallel-processing efficiency, the block width is set at a relatively small value of about 12. Here, the block width of a column block is the number of column vectors bundled into the block; likewise, the block width of a row block is the number of row vectors bundled into the block. Thus, the pieces of processing carried out at steps 2 and 3 each update a matrix with a width of about 12.
The most costly computation is the processing of step 3. An efficient technique exists for this processing even when the matrix width is as small as about 12. A shared-memory scalar parallel-processing (SMP) computer, however, is not capable of delivering its full performance when processing a matrix with such a small width, for the following reason.
The processing of step 3 is an operation to find a product of matrices. In this operation, the elements of a matrix with a small width are loaded from the memory, and the updated results are stored back into the memory. The cost incurred in making these memory accesses is high in comparison with the cost of the computation that updates the matrix, so the full performance cannot be delivered.
For this reason, it is necessary to increase the block size. If the block size is increased, however, the cost of the LU factorization of the block also rises, so that the efficiency of the parallel processing decreases.
b) Technique for Solving Positive-definite-symmetric-matrix Simultaneous Linear Equations
In accordance with a technique for solving simultaneous linear equations expressed in terms of a positive-definite symmetric matrix, the Cholesky factorization is applied only to the lower-triangular part of the matrix. In this case, the load of processing small matrix blocks is distributed cyclically among the processors of a distributed-memory parallel-processing computer, so that the load is shared uniformly among the processors in solving the simultaneous linear equations. Much like the technique for solving simultaneous linear equations expressed in terms of a real matrix, the block width used in blocking can be set at a relatively small value to increase the efficiency of the parallel processing. Since an SMP computer displays high performance for a matrix product only when the block width of the update processing corresponding to step 3 described above is large, however, it is necessary to increase the block size.
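As a rough illustration of a Cholesky factorization that operates only on the lower triangle, the following sketch may be helpful. It is a sequential NumPy version under assumed names (`blocked_cholesky`, block width `nb`); the cyclic distribution of the block updates among processors described above is indicated only in a comment, not implemented.

```python
import numpy as np

def blocked_cholesky(a, nb=2):
    """Blocked Cholesky factorization of a symmetric positive-definite
    matrix, operating only on the lower triangle (a sketch)."""
    n = a.shape[0]
    for j in range(0, n, nb):
        je = min(j + nb, n)
        # Factorize the diagonal block: A11 = L11 * L11^T.
        a[j:je, j:je] = np.linalg.cholesky(a[j:je, j:je])
        if je < n:
            # Update the panel below it: L21 = A21 * L11^{-T}.
            a[je:, j:je] = np.linalg.solve(a[j:je, j:je],
                                           a[je:, j:je].T).T
            # Rank-nb update of the trailing submatrix. In a parallel
            # code, these small block updates are what would be
            # distributed cyclically among the processors.
            a[je:, je:] -= a[je:, j:je] @ a[je:, j:je].T
    return np.tril(a)
```

The returned lower-triangular factor L satisfies A = L L^T.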
As described above, in the shared-memory scalar parallel-processing computer, the cost incurred in making an access to the shared memory employed in the computer is high in comparison with the cost entailed in updating a matrix by means of a matrix product in the LU factorization or the Cholesky factorization. Thus, if a method adopted in the conventional vector parallel-processing computer is applied to the shared-memory scalar parallel-processing computer as it is, the shared-memory scalar parallel-processing computer cannot deliver sufficient performance.