The LINPACK benchmark measures the computational performance of a computer system when it solves a set of simultaneous linear equations. Because the LINPACK benchmark is used for ranking on the TOP500 list, techniques for solving a set of simultaneous linear equations at high speed on a computer system have drawn attention. LINPACK itself is a software library for performing numerical computations. In particular, High-Performance LINPACK (HPL) is a library for solving, in parallel, a set of simultaneous linear equations with a dense coefficient matrix using the nodes (for example, processes or processor cores) of a parallel computer system.
In solving a set of simultaneous linear equations Ax=b, the matrix A is first factorized into an upper triangular matrix and a lower triangular matrix (this is called LU factorization), and x is then obtained from the triangular factors. In HPL, the matrix A is divided into blocks of width NB, and the LU factorization proceeds block by block. One or more blocks are allocated to each of the nodes.
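The two stages described above (factorize A, then obtain x) can be illustrated with a minimal, unblocked, single-node sketch. It is not HPL's blocked parallel implementation; the function names `lu_factor` and `lu_solve` and the sample matrix are chosen here for illustration only, and row pivoting is included as an assumption to keep the arithmetic stable.

```python
def lu_factor(a):
    """Factorize a square matrix with partial pivoting.

    Returns (lu, perm): lu holds U on and above the diagonal and the
    multipliers of the unit lower triangular L below it; perm[k] is the
    original row index now stored in row k.
    """
    n = len(a)
    lu = [row[:] for row in a]          # work on a copy
    perm = list(range(n))
    for k in range(n):
        # Choose the row with the largest absolute value in column k.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]        # multiplier, stored in place of L
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm


def lu_solve(lu, perm, b):
    """Solve Ax=b given the factors produced by lu_factor."""
    n = len(lu)
    # Forward substitution with the unit lower triangular factor.
    y = [0.0] * n
    for i in range(n):
        y[i] = b[perm[i]] - sum(lu[i][j] * y[j] for j in range(i))
    # Back substitution with the upper triangular factor.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(lu[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / lu[i][i]
    return x


# Illustrative system whose exact solution is x = [1, 2, 3].
A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
b = [7.0, 19.0, 49.0]
lu, perm = lu_factor(A)
x = lu_solve(lu, perm, b)
```

Once the factors are available, solving for x costs only two triangular substitutions, so the factorization itself dominates the processing time.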
The LU factorization is described with reference to FIG. 1. In the example of FIG. 1, the matrix A is divided into 10×10=100 blocks. The number of elements belonging to each of the blocks is assumed to be 100×100=10000. Hence, NB=100, and the matrix A has (100×10)×(100×10)=1000000 elements. A block denoted with a circle contains diagonal elements of the matrix; the part above the circled blocks corresponds to the upper triangle, and the part below them corresponds to the lower triangle.
In the example of FIG. 1, the blocks of the matrix A are allocated to six nodes, and blocks allocated to the same node have the same color. The allocation of the blocks is described with reference to FIG. 2. In the example of FIG. 2, the blocks of the matrix A are allocated to the nodes (0, 0), (0, 1), (1, 0), (1, 1), (2, 0), and (2, 1), and the part of the matrix A allocated to each node is stored as a local array in a memory or other storage device. Here, the number of blocks allocated is non-uniform among the nodes. Specifically, the number of blocks allocated to each of the nodes (0, 0) and (0, 1) is 20, whereas the number of blocks allocated to each of the nodes (1, 0), (1, 1), (2, 0), and (2, 1) is 15.
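The block counts above can be reproduced with a short sketch. It assumes a two-dimensional block-cyclic allocation over a 3×2 node grid, in which block (i, j) goes to node (i mod 3, j mod 2); this rule is an assumption that matches the counts stated for FIG. 2, and the function names are hypothetical.

```python
from collections import Counter


def block_owner(i, j, p_rows, q_cols):
    """Node (p, q) that owns block (i, j) under the assumed cyclic rule."""
    return (i % p_rows, j % q_cols)


def blocks_per_node(n_block_rows, n_block_cols, p_rows, q_cols):
    """Count how many blocks each node of the grid receives."""
    counts = Counter()
    for i in range(n_block_rows):
        for j in range(n_block_cols):
            counts[block_owner(i, j, p_rows, q_cols)] += 1
    return dict(counts)


# 10 x 10 blocks distributed over a 3 x 2 node grid, as in the example.
counts = blocks_per_node(10, 10, 3, 2)
for node in sorted(counts):
    print(node, counts[node])
```

Because 10 block rows do not divide evenly among 3 node rows, node row 0 receives four block rows and the other two node rows receive three each, which gives 4×5=20 blocks versus 3×5=15 blocks.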
In LU factorization, the computational efficiency of matrix products increases as the width of the submatrices used in the products increases (that is, as the block size increases), and the processing time is reduced accordingly. However, increasing the block size makes the number of blocks allocated to each node non-uniform, as illustrated in FIG. 2, and degrades the load balance. Hence, it is not possible to simply increase the block size. In the related art, this problem is not fully considered.
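The trade-off can be made concrete with a rough sketch that compares the most heavily loaded node with the average load for several block sizes. It assumes the same block-cyclic allocation over a 3×2 node grid and a matrix of 1000×1000 elements as in the example, counts only the initial number of blocks per node (not the load at each step of the factorization), and treats every block as equal work; the function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def max_blocks_per_node(n, nb, p_rows, q_cols):
    """Largest number of blocks held by any node under cyclic allocation."""
    blocks = ceil_div(n, nb)            # blocks per matrix dimension
    return ceil_div(blocks, p_rows) * ceil_div(blocks, q_cols)


def imbalance(n, nb, p_rows, q_cols):
    """Ratio of the maximum block count to the mean block count per node."""
    blocks = ceil_div(n, nb)
    mean = blocks * blocks / (p_rows * q_cols)
    return max_blocks_per_node(n, nb, p_rows, q_cols) / mean


# Compare several block widths for a 1000 x 1000 matrix on a 3 x 2 grid.
for nb in (10, 50, 100, 200):
    print(nb, round(imbalance(1000, nb, 3, 2), 3))
```

Under these assumptions, the ratio for NB=100 is 20/(100/6)=1.2, and it moves closer to 1 as the block width shrinks, which is the load-balance cost of a larger block size.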
Examples of the related art are disclosed in International Publication Pamphlet No. WO2008/136045 and Japanese Laid-open Patent Publication Nos. 2008-176738, 2000-339295, and 2006-85619.
Hence, in one aspect, an object of the present disclosure is to provide a technique for reducing the processing time for solving a set of simultaneous linear equations using a parallel computer system.