1. Field of the Invention
The present invention relates to computer systems for performing sparse matrix computations. More particularly, the present invention relates to a method and an apparatus that uses a hybrid one-dimensional trapezoidal and two-dimensional rectangular representation for a factor in solving a system of linear algebraic equations involving a sparse coefficient matrix.
2. Related Art
The solution of large sparse symmetric positive-definite systems of equations constitutes the primary computational cost in numerous applications, such as finite-element design, linear programming, circuit simulation and semiconductor device modeling. Efficient solution of such systems has long been the subject of research, and considerable progress has been made in developing efficient algorithms to this end. A direct solution technique known as xe2x80x9cCholesky Factorizationxe2x80x9d is the most widely used approach to solve such a system. Under Cholesky factorization, the complete solution sequence requires many stages, including matrix reordering, symbolic factorization, numerical factorization and triangular solution. Of these stages, numerical factorization is typically the most computationally expensive.
One method of performing numerical factorization is based on a right-looking supemode-supemode method described in xe2x80x9cParallel Algorithms for Sparse Linear Systemsxe2x80x9d by Michael T. Heath, Esmond Ng and Barry W. Peyton, in xe2x80x9cParallel Algorithms for Matrix Computationsxe2x80x9d by Gallivan, et al. (Editors), SIAM (1994) (referred to as HNP). In a sparse matrix, a supernode is a set of contiguous columns that have essentially the same sparsity structure. Supernodes can be used to organize the numerical factorization stage around matrix-vector (supemode-column) and matrix-matrix (supernode-supernede) primitive operations leading to a substantial performance improvement arising from more efficient use of the processor caches and pipelining units.
The numerical factorization step involves two fundamental sub-tasks:
(1) cdiv(j): division of column j of the factor by a scalar; and
(2) cmod(j,k): modification of column j by column k, k less than j.
These sub-tasks can be organized around supernodes. For example, DIV(j) can be organized as an internal factorization/update of supernode j, and
CMOD(j,k) can be organized as a modification of supernode j by supernode k, k less than j.
Typically, the second sub-task is where the bulk of the computational cost is incurred. In order to increase computational efficiency, the CMOD(j,k) operation can be divided into three steps (see HNP):
(a) computation of the update and accumulation into a temporary array;
(b) carrying out the non-zero index matching between the first columns of the source and destination supernodes and computing relative indices; and
(c) scattering updates from the temporary vector into the target destination supernode.
By dividing the CMOD(j,k) operation in this way, it is possible to apply techniques used in dense matrix operations in the step (a). Note that step (a) is where the dominant amount of time is spent. In the discussion that follows, we refer the step (a) as the xe2x80x9clocal dense CMOD operationxe2x80x9d. The local dense CMOD operation involves computing a trapezoidal-shaped dense update that can be represented as a combination of a dense rank-k update and a matrix multiplication.
Library routines can be used to speed up the CMOD computation. These library routines are typically written in assembly language and are hand-tuned for a specific machine architecture. For example, on the current generation of ULTRASPARC-II(trademark)-based machines, the SUN PERFORMANCE LIBRARY(trademark) (see http://www.sun.com/workshop/performance/wp-perflib/) provides SPARC(trademark) assembly-language implementations of BLAS1, BLAS2 and BLAS3 routines. These hand-tuned assembly language implementations can yield performance close to the theoretical peak of the underlying processor.
For example, portions of the CMOD operation can be efficiently performed by invoking matrix multiplication code from the Sun Performance Library (specifically, the BLAS3 dgemm code). Unfortunately, invoking the BLAS3 dgemm matrix multiplication code requires supernodes to be copied into and out of temporary storage because of incompatibility between data-structures used by a typical sparse solver and those expected by the dgemm API. This copying can add a significant overhead. Consequently, using the BLAS3 xe2x80x9cdgemmxe2x80x9d matrix multiplication code only makes sense for supernodes above a certain size. Otherwise, the performance gains from using the BLAS3 xe2x80x9cdgemmxe2x80x9d library code are cancelled out by the additional overhead involved in copying.
Hence, what is needed is a system that performs the CMOD operation using library routines in cases where the performance gains from using the library routines exceed the computational overhead required to use the library routines.
Another difficulty in attaining high performance in numerical factorization is due to the fact that supernodes can have varying shapes, sizes, and sparsity patterns. These varying shapes, sizes and sparsity patterns can greatly influence computational performance. In order to optimize computational performance for the CMOD operation, the supernodes of varying shapes and sizes must be divided into smaller sub-units for computation so as to balance computational operations with memory references in a way that is tuned for the particular machine architecture on which the computation is to be run.
Yet another difficulty in attaining high performance in sparse factorization is due to the fact that direct solution methods have high memory storage requirements that necessitate using a compact storage scheme, such as a one-dimensional trapezoidal representation, that makes integration of library routines (such as BLAS3) directly into the CMOD operation difficult. Other storage schemes, such as two-dimensional rectangular block representations can be used, which are easier to integrate with library routines, but these other storage schemes architecture a storage penalty.
What is needed is a system that performs the CMOD operation by taking advantage of a two-dimensional rectangular representation for supernodes when it is advantageous to do so, but otherwise uses a compact one-dimensional trapezoidal representation for supernodes.
One embodiment of the present invention provides a system that efficiently performs a CMOD operation in solving a system of linear algebraic equations involving a sparse coefficient matrix. The system operates by identifying supernodes in the sparse matrix, wherein each supernode comprises a set of contiguous columns having a substantially similar pattern of non-zero elements. Next, the system performs a CMOD operation on the supernodes. This involves for each supernode, determining a structure for the supernode, and computing a function of the structure. The system uses a one-dimensional trapezoidal representation for the supernode during the CMOD operation, if the result of the function is lower than a threshold value, and uses a two-dimensional rectangular representation for the supernode during the CMOD operation if the result of the function is greater than the threshold value.
In one embodiment of the present invention, the function on the structure of the supemode is a function of a number of computational operations involved in computing a lower-triangular sub-block portion of the supernode and a number of computational operations involved in computing a rectangular sub-block portion of the supernode.
In one embodiment of the present invention, the system splits each supernode into vertical panels that fit into cache lines within a cache of a processor performing the computation prior to performing the CMOD operation.
In one embodiment of the present invention, the function on the structure of the supemode is a function of the memory requirements for storing the supernode in the two-dimensional rectangular representation and the memory requirements for storing the supernode in the one-dimensional trapezoidal representation.
In one embodiment of the present invention, the system additionally performs a CDIV operation on the plurality of supernodes prior to performing the CMOD operation. This CDIV operation divides each element in each column by a scalar value for normalization purposes (see HNP).
In one embodiment of the present invention, the system uses the two-dimensional rectangular representation for the supernode during the CMOD operation if the supernode is a root supernode. The xe2x80x9crootxe2x80x9d supernode is the rightmost supernode, which is typically in the rightmost corner of the triangular matrix (see HNP).
In one embodiment of the present invention, in performing the CMOD operation, the system additionally stores a result of the CMOD operation in a temporary vector, and scatters a contents of the temporary vector into a destination representation of the sparse matrix.