Large-scale numerical calculations such as scientific calculations may be carried out using supercomputers or other high performance computers. Such large-scale numerical calculations often involve high-dimensional matrix operations. For example, in the flow analysis field, structural analysis field, and other fields, large-scale simultaneous equations may be solved using a coefficient matrix that represents the coefficients of the simultaneous equations. Another example is that, in the circuit analysis field, vibration analysis field, and other fields, the eigenvalues of a large-scale matrix may be computed. In a matrix operation using a computer, an approximate solution may be computed by iterations of matrix-vector multiplication. For example, an approximate solution of a differential equation, which is analytically difficult to solve, may be computed by iterations of multiplying a coefficient matrix by a vector with the finite element method.
A matrix to be used in a large-scale numerical calculation may be a sparse matrix with a high percentage of zero-valued elements (zero elements) and a small percentage of non-zero-valued elements (non-zero elements). Sparse matrix representation including zero elements based on the matrix structure produces a large amount of data and therefore is inefficient. To deal with this, there are compressed storage formats for representing a sparse matrix as compressed data without zero elements. Such compressed storage formats include Compressed Column Storage (CCS) format and Compressed Row Storage (CRS) format.
In the compressed column storage format, the elements included in an N×M matrix are searched in column-major order (i.e., in the order of row 1 and column 1, row 2 and column 1, . . . , row N and column 1, row 1 and column 2, row 2 and column 2, . . . , row 1 and column M, . . . , and row N and column M), and only the non-zero elements are extracted from the matrix. Then, a first array that holds the values of the non-zero elements in the above order, a second array that stores the row numbers of the non-zero elements, and a third array that indicates the locations of the elements that start new columns in the first array are generated. With regard to the compressed row storage format, the elements included in an N×M matrix are searched in row-major order (i.e., in the order of row 1 and column 1, row 1 and column 2, . . . , row 1 and column M, row 2 and column 1, row 2 and column 2, . . . , row N and column 1, . . . , and row N and column M), and only the non-zero elements are extracted from the matrix. Then, a first array that holds the values of the non-zero elements, a second array that stores the column numbers of the non-zero elements, and a third array that indicates the locations of the elements that start new rows in the first array are generated.
By the way, in a large-scale matrix operation, a matrix may be divided so as to be assigned to a plurality of threads which are then executed in parallel by a plurality of processors. This technique achieves high-speed operation. In this case, a plurality of threads may perform operations for the same element of a final result, depending on the way of dividing the matrix. Therefore, it is considered that a storage area for storing an intermediate result may be allocated to each thread.
For example, there has been proposed a parallel processing method for multiplying a sparse matrix represented in the compressed column storage format by a column vector with a plurality of processors. In this parallel processing method, the columns of the matrix are equally divided and assigned to a plurality of threads, and a storage area having the same size as a column vector that is the final result is allocated to each thread. Then, the column vectors that are the intermediate results obtained by the respective threads are added to thereby obtain the final result.
By the way, in a coefficient matrix that represents the coefficients of simultaneous equations, non-zero elements tend to concentrate on the vicinity of a diagonal line and some square areas. Using these characteristics, there has been proposed a high-speed operation method in which a coefficient matrix is divided into a plurality of blocks and calculations for the blocks are carried out in parallel with a plurality of processors.
See, for example, Japanese Laid-open Patent Publication No. 2009-199430 and International Publication Pamphlet No. WO2008/026261.
However, the approach of allocating each of a plurality of threads a storage area for storing a vector that is an intermediate result needs more memory, and this is a problem. For example, consider the case of executing 1000 threads in parallel to multiply a sparse matrix with one million rows and one million columns by a column vector. The data of the sparse matrix may be compressed with a compressed storage format, but for the aforementioned storage areas as a whole, space for storing 1000 column vectors with one million rows needs to be prepared. Therefore, this method needs more memory with an increase in the number of threads.
Another method is that one shared storage area is prepared for a plurality of threads, and the threads sequentially add values to the elements of a resulting vector. However, if a shared storage area is used simply, some threads may possibly add values to the same element (for example, in the same row of the resulting column vector) at the same time, which is an access conflict. If exclusive control is exercised between the threads so as not to cause access conflict, the overhead for memory access may increase and the efficiency of the parallel processing may decrease.