In a solution for linear simultaneous equations expressed by a sparse symmetric positive definite matrix (Ax=b where A indicates sparse symmetric positive definite matrix, x indicates a variable column vector, and b indicates a constant column vector), the position of a nonzero element in a corresponding matrix can be changed by changing the arrangement of the variables in a variable column vector. The rearrangement is referred to as ordering. It is known that the fill-in portion of LDL ^T decomposition (a matrix element which is zero in the original matrix, but nonzero as a result of the decomposition) can be reduced by performing appropriate ordering. In the LDL ^T decomposition, a sparse matrix A is decomposed into A=LDL^T, thereby solving the linear simultaneous equations Ax=b. In this example, L indicates a lower triangular matrix, and ^T specifies matrix transposition.
The dependence of the data update between matrix elements in the LDL ^T decomposition about the column of the lower triangular matrix portion L including the diagonal element of a sparse symmetric positive definite matrix is expressed by a tree structure referred to as an elimination tree. For the elimination tree, refer to the non-patent document 1. Based on the dependence, the decomposition is performed in the direction from the leaf (a node having no child node in the component nodes configuring an elimination tree) to the root (a node having no parent node).
The calculation of decomposition can be independently performed on a subtree having no common portion between elimination trees, and can be processed in parallel.
A nested dissection is typical ordering. A nested dissection refers to a method of sectioning a space into two sectioned areas and a sectional face between the sectioned areas, and numbering them. For the nested dissection, refer to the non-patent document 1. Furthermore, a sparse matrix is expressed by a graph, the graph is sectioned into two subgraphs having the substantially equal number of configuration nodes so that the smallest possible number of nodes can be removed for the sectioning process. The sectioning process is recursively repeated, and is repeated until it cannot be performed any more. The thus obtained ordering is referred to as the ordering by standardized nested dissection.
The sectioning divides nodes into nodes forming a sectional face and nodes of sectioned areas. Assuming that two sets of nodes are sets A and B sectioned by a sectional face relating to the sectioning at a certain level, the component nodes belonging to the set A are assigned sequential numbers, then the component nodes belonging to the set B are assigned sequential numbers, and finally component nodes configuring the sectional face are assigned sequential number continuously not to be further sectioned. The numbering rule can also be applied to a recursively sectioned portion.
It is known that the ordering of the nested dissection generated as a result refers to ordering with a small fill-in portion.
Although the LDL ^T decomposition is described above, the decomposition is referred to a modified Cholesky decomposition to which substantially the same algorithm as the Cholesky decomposition (LL ^T decomposition) can be applied
When the linear simultaneous equations for a sparse symmetric positive definite matrix are solved by the Cholesky decomposition, a fill-in portion is commonly reduced by rearranging the rows or columns of a matrix. The rearrangement is referred to as ordering. The Cholesky decomposition is performed on the rearranged matrix.
The Cholesky decomposition is conducted by two phases, that is, a symbolic decomposition and a numeric decomposition. The symbolic decomposition analyzes the dependence of each column, and expresses the result on the elimination tree. Using the information, a generation pattern of nonzero elements in the sparse matrix is analyzed.
The generation pattern of nonzero elements in the Cholesky decomposition is analyzed by generating an elimination tree. When the simultaneous equations of Ax=b (A=(aij)) are solved by the Cholesky decomposition, it is decomposed into A=LL^T where L indicates a lower triangular matrix. An elimination tree is generated by coupling after retrieving a child configuration node from the first configuration node using the following theorem.
Theorem
In the Cholesky decomposition LL^T=A, excluding a numerical cancellation, the node i is a child node of the node k when aki≠0 and k>i (refer to the non-patent document 1).
The node k is referred to as a parent n of the node i, and the data corresponding to the node i is the data of the i-th column of the matrix A. The component node of the elimination tree corresponds to the lattice point of the corresponding number. Therefore, the component node i corresponds to the lattice point i of a discrete space, and the data of the component node i is the data of the column i of the matrix A.
In deriving performance by processing in parallel the Cholesky decomposition on a large sparse symmetric positive definite matrix, the subtrees not crossing one another in an elimination tree can be independently calculated. Therefore, the calculation can be performed in parallel on independent subtrees. Otherwise, the parallel performance can be derived by sequentially performing pipeline processing for the process of updating a column according to the propagation of the influence of the update of the elements of a sparse matrix indicated by an elimination tree. In addition, it is important to configuring a block by bundling columns and increasing the computational complexity in enhancing the parallel effect. When a large amount of data is read in one data read, the frequency of low speed data reading/writing operations can be reduced, thereby enhancing the parallel processing effect. Therefore, the nonzero element generation pattern for the same column configures a super node. When a super node is configured, only the data of rows having nonzero values are held when the columns configuring the super node are combined. Therefore, since it is not necessary to hold data of the rows of zero, the data can be compressed, stored, and processed.
Furthermore, to improve the performance of the Cholesky decomposition, a super node is configured by moderating the conditions and combining the columns similar in nonzero pattern as a result of the decomposition. As a result of the combination, in the updating process of the Cholesky decomposition, a row including excess zero in the data of the super node is generated, thereby increasing the computational complexity. However the memory access per unit computational complexity can be reduced. Therefore, the improvement of the performance of the Cholesky decomposition can be expected. The parallel processing of the Cholesky decomposition can be logically performed in parallel in the calculation for each column. However, since the frequency of memory access and the computational complexity is low, the overhead of the parallel processing is not ignorable, and the parallel processing effect cannot be easily derived.
That is, to perform efficient parallel processing, it is effective to increase the granularity to reduce the overhead of the parallel processing.
The fill-in portion can be logically minimized by rearranging a matrix using the ordering by a complete nested dissection (a numbering method by performing spatial sectioning up to utmost step), and performing the calculation of the Cholesky decomposition column by column without combining the columns. A super node is generated by combining columns as data read to one reading operation to cache to derive performance, and the process of combining columns is performed from the column as a leaf of the elimination tree. When columns are combined in the Cholesky decomposition using the ordering of the nested dissection, it is necessary to combine nodes having a number of branches of the elimination tree to increase the width of a column-combined block.
A branch node has a number of nonzero elements common with a branch source node, but there can be a number of nonzero elements not common between branch nodes. Therefore, when a number of nodes are combined, and a number of branch nodes are included, the number of zeros suddenly increases. Accordingly, there has been the problem that it is difficult to find an appropriate block width of a super node while gradually increasing the number of zeros.
When nested dissection ordering is used while expecting the effect of reducing fill-in portions, computational complexity occurs on one column at a portion corresponding to the leaf of the binary tree structure of the elimination tree generated by the nested dissection unless columns are combined. Although the fill-in portion is reduced, the expected performance may not be acquired because the overhead becomes outstanding by a reduced ratio of memory access to the computation and a low granularity of parallel processing.
Refer to non-patent document 1 for the details of the basic technique directly used for the solution of a sparse matrix.
<Ordering Generated from Nested Dissection and its Elimination Tree>
The problem in a two-dimensional square area is considered. The case in which a sparse symmetric positive definite matrix is acquired by discrete partial differential equations representing the problem in this area is considered. In this case, the discrete lattice points are combined with adjacent lattice points in the area.
The x-y plane is assumed as illustrated in FIG. 1. By removing the set a of the lattice points configuring the straight lines parallel to the y axis from the lattice points configuring the area, the set of the lattice points configuring the area is halved so that the number of lattice points configuring each portion is substantially equal to each other.
Each of the two remaining areas is similarly halved by removing the set b of the lattice points horizontally parallel to the x axis so that the number of lattice points of each portion can be substantially equal to each other. To halve the set to have substantially equal number of lattice points, the area can be sectioned to allow each section to have the lattice points of the number obtained by dividing the total number of lattice points in the area to be sectioned by 2. The procedure is recursively continued until the area which has been generated and become smaller by the sectioning operation cannot be further sectioned. Not to be further sectioned refers to having the number of lattice points of 3 included in the small area obtained by the sectioning operation, and not allowing the next operation of sectioning the area into four portions. Finally, the area is sectioned into 2**n small areas.
The process corresponds to the selection for a smaller number of lattice points configuring the boundary line to be removed by equally sectioning the area.
The numbers are sequentially assigned to a set of lattice points configuring one area in two areas generated by the sectioning operation. Next, numbers are continuously assigned to the lattice points removed by the halving operation.
Numbers are assigned to the set generated by the above-mentioned sectioning operation by recursively applying the rule. The elimination tree of the matrix arranged by the numbering process is a binary tree.
That is, for example, assume that a Laplace's equation (Δφ=0) is to be solved in a discrete space. In the case of a one-dimensional problem, a Laplacian is a second order partial differential equation of coordinates. Therefore, if φ(i) is a function value in the lattice point i, the discrete Laplace's equation is expressed as follows.{φ(i+1)+φ(i−1)−2φ(i)}/h2=0
where h indicates the distance between adjacent lattice points. It is represented by the following matrix.
                                          1                          h              2                                ⁢                      (                                                            ⋱                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          1                                                                      -                    2                                                                    1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      1                                                                      -                    2                                                                    1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      1                                                                      -                    2                                                                    1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ⁢                    0                                                                                                                                                                                                                                                                                                                                            ⋱                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          )                    ⁢                      (                                                            ⋮                                                                                                  ϕ                    ⁡                                          (                                              i                        -                        1                                            )                                                                                                                                        ϕ                    ⁡                                          (                      i                      )                                                                                                                                        ϕ                    ⁡                                          (                                              i                        +                        1                                            )                                                                                                                                        ϕ                    ⁡                                          (                                              i                        +                        2                                            )                                                                                                                                        ϕ                    ⁡                                          (                                              i                        +                        3                                            )                                                                                                                                                                                                                                        ⋮                                                      )                          =                  (                                                    ⋮                                                                    0                                                                    0                                                                    0                                                                    0                                                                    0                                                                                                                                                                          ⋮                                              )                                    [                  math          ⁢                                          ⁢          1                ]            
As clearly indicated above, when adjacent lattice points are continuously numbered, coefficients are generated adjacently in a sparse matrix. However, when lattice points are sectioned as described above, sequential numbers appear on adjacent lattice points in the sectioned areas. Therefore, coefficients are concentrated. However, the coefficients corresponding to the lattice points separately sectioned and numbered are generated in places apart from other coefficients. Therefore, when an elimination tree of such a matrix is generated, the trees corresponding to the columns in which coefficients are concentrated are combined with each other, but separate columns are independent of each other. Therefore, when an area is sectioned into areas A and B, and C as a boundary area for the areas A and B, the component node of the elimination tree corresponding to the lattice point in the area A and the component node of the elimination tree corresponding to the lattice point in the area B are branched into two portions in and after the component node of the elimination tree corresponding to the lattice point of the boundary area C.
The binary tree is configured by branch nodes arranged in a linear array.
The numbering method in the recursive calling procedure is described below.
In the program code below, the function partition refers to the procedure of sectioning an area, and it is determined whether or not an area can be sectioned. If the area can be sectioned, the two sectioned areas are returned to seta and setb, and a sectional plane if returned to cutsuface. The “set” refers to a set of lattice points to be sectioned.    call numbering (set)    recursive procedure numbering (set)    call partition (set, seta, setb, cutsurface, icon)    if (partition is done) then ! Numbers are sequentially assigned in the numbering call order.    call numbering (seta) ! A number is assigned to seta.    call numbering (setb) ! A number is continuously assigned to setb.    call numbering (cutsurface) ! A number is assigned to cutsurface.    else    call numbering (set)    endif    return    end
Relating to the algorithm of the partition, refer to non-patent document 2.
A three-dimensional problem can be similarly assumed. For example, for a cube, an area is substantially equally sectioned excluding the set of lattice points configuring a surface parallel to the y-z plane. Continuously, each of the two areas generated by the sectioning operation is similarly halved on the surface parallel to the x-z plane. As with the case in the two-dimensional problem, the sectioning operation is repeated until no further sectioning can be performed.
The same numbering rule is applied. The elimination tree of the rearranged matrix in the numbering operation is also a binary tree.
To apply them to the commonly coupled areas, the combination of the discrete lattice points is represented by a graph, the sectioning operation is repeated so that the number of nodes configuring the surfaces which section the graph can be minimized and the number of component nodes configuring the sectioned subgraphs can be leveled, thereby configuring a common nested dissection ordering. Also in this case, the elimination tree is a binary tree.
<Coupled Node (Super Node)>
Relating to the component nodes of subsequent node numbers in the columns of the matrix corresponding to the component node of the elimination tree, the nonzero pattern of a larger node pattern couples a column of a smaller node number to a column of an equal node number. Furthermore, the component nodes configuring a branch are checked, and relatively similar nonzero patterns which do not largely increase the ratio of zero elements are coupled. As a result, if the Cholesky decomposition is performed, the number of rows having nonzero elements increased in the rows of the coupled column vectors. That is, zero is included in the necessary area in compressing and storing the rows. Thus, the block generated by coupling the columns based on similar nonzero patterns is referred to as a panel.
FIG. 2 illustrates a panel and an index list.
A panel is provided with an index list correspondingly. An index list holds data about where in the row of the columns of the sparse matrix the data of the panel is located. Since the panel holds only the data of the nonzero elements other than the elements as common zero in the column held in the panel, the index list indicates wherein the row of the columns the held nonzero element is located in the columns. Since the panel does not hold the element as common zero in the held column, the actual value can be zero although it is held as a nonzero element.
The process of the Cholesky decomposition for each panel relatively reduces the load store cost of the memory because of increasing computational complexity with a larger block width. However, an unnecessary wasteful computation is performed on a zero element portion. Therefore, it is necessary to determine the block width by adjusting the block width by coupling with the number of included zero taken into account. It is important in deriving the performance.
A super node is a configuration node continuously detected in the elimination tree in which same nonzero element patterns are coupled, and only the lows having nonzero elements are collected, compressed, and stored. That is, data is held for only nonzero element rows, and data in the nonzero element rows is not held, thereby compressing the amount of data to be held.
FIG. 3 illustrates the state in which unnecessary zero is included in the panel.
In the component nodes of the parent of the elimination tree, those having matching nonzero element patterns are coupled, and only nonzero elements in each column is compressed and stored. It is defined as a panel. The nonzero pattern does not match that of the next parent node, but there is an inclusion relationship. That is, the row corresponding to the nonzero element in the next parent node is updated in the updating process in the decomposing process, and becomes a nonzero element. Therefore, the nonzero pattern of the next parent node is reflected in the panel. Such a relationship is referred to as an inclusion relationship. If the next parent node of this type is coupled to the panel, an unnecessary 0 is included. In the example illustrated in FIG. 3, an unnecessary zero is generated below the panel. In this case, the index and the coupled parent node have a number of uncommon portions with each other.
<When Branched Nodes are Coupled, the Number of Zeros Greatly Increases.>
The node connected to the end of the branch from a branched node corresponds to the node configuring a sectioned area in the nested dissection.
As a result of the Cholesky decomposition, a first node adjacent to a second node appears as a nonzero element of the column corresponding to the second node. In the process of the Cholesky decomposition, when a path is traced to the root of the elimination tree, the influence of the nonzero element is propagated as an element having an influence when update to another column is performed. That is, in the two areas sectioned on a surface, the node of the surface adjacent on another surface is propagated to a branch node as a nonzero element. Since the halved node group has no common element on the boundary surface other than the sectional plane, the nonzero element corresponding to the node outside the adjacent area is taken over by a branch node. That is, in the branch node, it is necessary to process a zero element as a nonzero element through a coupling process. That is, when branch nodes are coupled, zero greatly increases in panel width as a result of coupling the nodes configuring the branched of branch nodes.
In the nested dissection, it is difficult to adjust the number of zeros and the width of a panel because there are a number of branch nodes encountered while coupling the component nodes of the elimination tree.
Assume that a rectangular parallelepiped as illustrated in FIG. 4 has been generated as a result of some sectioning operations. The central portion is a set of nodes configuring a surface to be removed after halving the rectangular parallelepiped. Before the sectioning operation, these three portions are not separated. It is assumed that the surface is adjacent to another area before the sectioning operation.
When a node (lattice point) in a sectioned area is numbered, an appropriate number is assigned not to section the area. In this case, in the elimination tree, they are represented as a portion in the form of an inverted Y in the tree.
Considering the portion branched in the branch node, the elimination tree is halved below the branch node. The structure of the lower triangular matrix L as a result of the Cholesky decomposition is included in the structure of the nonzero pattern of the parent node. Therefore, the nonzero data of the nodes existing in other areas adjacent to the blocks other than the blocks A and C propagates from the position where it appears.
The description below is made using the column data. That is, the nonzero pattern of the two branch nodes z and w are exemplified in FIG. 5. That is, the node z is configured by a diagonal element, the nonzero element CCN generated from the configuration node of the block C, and the nonzero element AN generated from the external node of the block A adjacent to a block other than the block C. The node w is configured by a diagonal element, a nonzero element CCN generated from the configuration node of the block C, and the nonzero element BN generated from the external node of the block B adjacent to a block other than the client C. On the other hand, the branch node y is configured by a diagonal element, a nonzero element CCN generated from the configuration node of the block C, a nonzero element AN generated from the external node of the block A adjacent to a block other than block C, a nonzero element BN generated from the external node of block B adjacent a block other than the block C, and a nonzero element CN generated from an external node of the client C adjacent to a block other than the blocks A and B.
It is assumed that the sectioning operation is performed by sectioning an area so that the sectioned areas can be equal in size and the sectional plane can be minimized. That is, if AN can be substantially equal to BN, and there are a number of surfaces adjacent to external portions, the expression of AN˜BN>CCN>CN holds.
As a result, as illustrated on the right in FIG. 5, a considerably large number of zeros suddenly increases.
When the sectioning operation is further performed until it cannot be performed any further, the structure is repeated.