A recent scalar computer has the characteristic of high-speed performance of a CPU but the tendency to degrade the performance when memory access becomes locally concentrated. Especially when a solution of simultaneous linear equations of a sparse symmetric positive definite matrix is obtained by performing a Cholesky decomposition (LDLT decomposition), the reference and update of data often concentrate locally on a memory storage area, thereby possibly failing in improving the efficiency of parallel processing.
Described below is a practical example of the problem.
Assume a sparse symmetric positive definite matrix L as illustrated in FIG. 13. A diagonal element is a node number, “●” indicates a non-zero element, “∘” indicates a fill-in generated in the LDLT (LLT) decomposition.
The LDLT decomposition on the basis of a left-looking method can be obtained as described below.
A set of nodes having non-zero elements except i in the i-th row in the matrix L is calculated. The result is referred to as rowstruct(i). For example, rowstruct(4)={1,2,3} and rowstruct(8)={5,7}.
The rowstruct(i) configures a puruned row subtree having a node i of an elimination tree as a root node, and can be calculated therefrom. Practically, {4}+{1,2,3}, and {8}+{5,7} are subtrees. The details of the process is described in the non-patent document 1 below.
Assume that the column vector of L of node i is defined as li, and the diagonal element of the diagonal matrix of the LDLT decomposition is defined as dii. Additionally, the matrix for performing the LDLT decomposition is defined as A, and the i-th column vector in the lower triangle matrix is defined as ai.li←ai−Σdjj×lij×lj  (1)                j ε rowstruct(i)li←li/dii, where dii=lii  (2)        
The li is calculated by repeating the processes indicated by the equations (1) and (2) above from i=1 to n.
If lj is calculated in the node j of the element of the rowstruct(i), the calculation about the j can be performed with respect to the li using them. That is, li=ai is set as a result of the initialization, and the calculation for updating the li can be performed using the lj already calculated at this time point.
The size of the memory storage area required for storing the li can also be calculated using the puruned row subtree of the elimination tree. That is, by counting the number of puruned row subtrees including the node i, the size of the memory storage area can be calculated.
The set of indexes (row number) of a non-zero element of li can be calculated by obtaining the sum of the indexes of the node i and rowstruct (i). The calculating processes can be obtained by analyzing the input array A, and is referred to as a symbolic decomposition. The details are described in the non-patent document 1 below.
According to the information above, the li in each column can be stored in memory after compressing a non-zero element as compressed column storage.
In updating the node i, it is known by the equation (1) above that any index, which is larger than i, of a node as a descendant of the elimination tree is a subset of the index of the node i.
Therefore, the updating calculation of each li can be performed by executing the calculation with compressed data in a temporarily area, searching for an index, and adding the result to the corresponding position of the li.
In performing the parallel calculation of li, the update of the node i is allocated to a thread, and the li can be calculated in parallel using lj that can be referenced. When the update of the node i is completed, the update result can be referenced. Therefore, the thread that has waited for the node can be calculated. The calculation of a node located in the task chain (task queue) and to be next updated is newly assigned to the thread for which the update has been completed. The update about a node j that can be referenced is performed, and a node for which a calculation has not been completed yet is processed later. If the calculation of an available node is completed, and there is a node j required for the calculation, then the completion of the calculation of the node j is awaited. Thus, the calculation of the LDLT decomposition can be realized in parallel in a pipeline system.
The elimination tree can be calculated from the parentage indicated by the equation below.parent(i)=min{j|lji≠0, j>i}  (3)                :row number that is a first non-zero number in the i-th column of L        
For example, in the matrix illustrated in FIG. 13, the parent of node 1 is node 2, the parent of the node 2 is node 4, the parent of node 3 is a node 4, etc. A practical elimination tree can be calculated by searching the lower triangle ({aij|j≧j, j=1, . . . , n}) of the array A processed by the LDLT decomposition. The details of the process are described in the non-patent document 1 below.
FIG. 14 is an example of an elimination tree calculated for the matrix in FIG. 13 by the equation (3) above. For example, the following findings for a parallel calculation can be obtained from the example of the elimination tree.
First, since there is no common node to be referenced between the node 4 and node 7, the nodes 4 and 7 can be independently calculated.
Generally, by the equation (1) above, when {i}+rowstruct(i) and {j}+rowstruct(j) have no common portion, the calculation of li belonging to each of them can be independently performed. In the elimination tree, the subtrees having no common portion can be independently calculated. Since a node required in calculating the li of each node is a puruned subtree having the node i as a root, it is included in the subtree. Therefore, since the subtrees having no common portion do not depend on each other, an independent calculation can be performed.
In FIGS. 13 and 14, the number of non-zero elements including the fill-in in the result of the LDLT decomposition and the index (row number) of the non-zero element in each column can be obtained by he symbolic decomposition prior to the decomposition as described above.
Each node is numbered in a post order by searching for the elimination tree from the root node (node 21 in FIG. 14) according to the depth first. The meanings of the depth first and the post order are described later in detail when the embodiments are described. In the examples in FIGS. 13 and 14, the post order matches the original number.
The subtree {1,2,3,4} and the subtree {5,6,7} are decomposed by the equations (1) and (2) above. The calculation is performed in the post order from the dependency in the decomposition. That is, the calculation is performed in the order of node 1→2→3→4, and node 5→6→7. The calculations of these two sequences can be performed in parallel.
When the calculations above are completed, calculations are performed in the order of node 8→9→ . . . →21. Since both nodes 8 and 9 refer to the column of 7, the portions updated using the column 7 can be calculated in parallel, and the calculation can be performed if the calculation of the node 7 is completed. Each time a blank thread occurs in 8, 9, and 10, a node is assigned from a task chain, and update of a column is assigned. The calculations of the nodes 11 through 20 have the dependency as in the case above. For example, the column 13 can be calculated, and the nodes 14 and 15 can be updated in parallel. When the update of the node 14 is completed, the node 16 is assigned to the thread. Then, the update of the node 15 and the update of the node 16 are performed in parallel.
Thus, the update can be performed in the pipeline system.
As the conventional technology relating to the technology disclosed by the present application, the following documents of the conventional technology are disclosed.
[Document of Prior Art] T. DAVIS, Direct Methods for Sparse Linear Systems, SIAM 2006
In the above-mentioned conventional parallel calculating system, a memory storage area of a size obtained by adding up the column counts of the columns from 1 to n is assigned as a calculation resultant memory storage area. In this case, the matrix L is stored in the compressed column storage system. The li is stored in the area as a result of adding up 1 through n.
For example, assume that memory is divided into three portions, and a memory storage area is also provided as about three equal portions. Also assume that the memory storage area is simply divided equally into three portions depending on the size of the index i of each li. If the subtree {1,2,3,4} and the subtree {5,6,7} are calculated in parallel, only the first memory storage area (corresponding to i=1˜7) is accessed as illustrated in FIG. 13. Then, the nodes 11 through 20 are processed in parallel in the pipeline system in the post order. In this case, only the second (corresponding to i=8˜14) and third (corresponding to i=15˜21) areas are locally accessed. That is, access is concentrated on a local memory storage area.
When the parallelism is enhanced from 2 to 3, there are less than three subtrees that can be performed in parallel in the examples in FIGS. 13 and 14. Therefore, it is desired to remove the calculation of the subtrees from the target of parallel calculation to maintain balance by reducing the grading of parallel processing. In this case, the calculating process is performed in the post order, and the parallelism in the pipeline process is derived. In this process, a task chain is generated in the post order, and the update process of the li of a node is assigned to each thread as pipeline processing. However, when the system is adopted, the access to the memory is locally concentrated.
That is, if the parallelism is enhanced in the conventional technology, the performance is to be logically improved in the range of performing a parallel operation in the pipeline. However, since the locality of the access to the memory is intensified, there occurs the problem that there is a strong probability that the calculation efficiency is reduced.