The finite element method (FEM) is a general and powerful technique for solving partial differential equations. It is used in numerous engineering fields: in structural mechanics, to model the stress-strain relationships of a structure under an applied force; in biomedical engineering, to model bone deformations caused by joint forces arising from activities such as running; and in electrical engineering, to model electromagnetic field problems or to extract capacitance from geometric integrated-circuit layout data.
The computationally intensive step of this technique is the solution of a linear system of equations. The system matrices resulting from three-dimensional finite element analyses are very large and very sparse (i.e., they have very few nonzero entries). It is not uncommon to have a 10^6 x 10^6 matrix in which only 0.0027% of the entries are nonzero (e.g., a 100 x 100 x 100 node mesh with 8-node brick elements). Sparse matrices arise more frequently than dense matrices: a point or node in a system is usually coupled only with its neighboring points, not with the entire system, and this limited coupling produces sparse matrices.
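The sparsity figure above can be reproduced with a quick calculation. This sketch assumes each node couples only to itself and its 26 nearest mesh neighbors, the coupling pattern implied by 8-node brick elements:

```python
# Estimate the nonzero fraction of a FEM system matrix for a
# 100 x 100 x 100 node mesh with 8-node brick elements.
nodes = 100 ** 3                    # 10^6 nodes -> a 10^6 x 10^6 matrix
neighbors = 27                      # a node couples to itself and its 26 neighbors
nonzeros = nodes * neighbors        # ~2.7 * 10^7 nonzero entries
fraction = nonzeros / nodes ** 2    # fraction of all 10^12 entries
print(f"{fraction:.6%}")            # -> 0.002700%
```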
These matrices are too large to be stored as full matrices. Consequently, compact formats have been developed to store only the nonzero entries. Ideally, the format remains compact throughout execution. Execution is generally performed by a floating point unit (FPU). As is known in the art, the datapath of most large computer architectures consists of two major sections: the integer section and the floating point section. Addresses are generated in the integer section; floating point computations are executed in the floating point section. Optimal processing requires efficient use of memory and of the FPU.
Iterative solutions for processing sparse matrices are known in the art. Iterative methods are attractive for solving sparse linear systems because they maintain sparsity: the system matrix is used only as an operator to generate new iterates, so no matrix modifications or increase in storage occur. In contrast, direct methods can introduce substantial fill-in, which limits their utility to average-size 2-D and 3-D applications. However, implementing iterative methods is not easy. To maintain sparsity, iterative methods incur the overhead of accessing matrices stored in a compact format, which can cause considerable performance loss on conventional architectures.
Typically, compact formats have at least two parts: an auxiliary data structure describing the positions of the nonzero matrix entries, and the values of the nonzero entries themselves. While this reduces storage requirements to O(N) rather than O(N^2) for an N x N matrix, three types of overhead arise. First, there is high memory bandwidth, since the auxiliary data structure must be accessed every time a nonzero entry is accessed. Second, indirect access is generally used, wherein the auxiliary data structure is used to index another data structure used in the computation; this indirection is sometimes implemented using gather-scatter operations, which are expensive and add to the memory bandwidth requirement. Finally, short vector lengths are involved: typically, each row or column has few nonzero values, independent of the matrix size.
For sparse matrix-vector multiplication, which constitutes the computational core of iterative linear solvers, these overheads substantially affect performance. Empirical data indicate that both the Cray Y-MP and the Intel 80860 processor operate at less than 33% of their peak floating-point rates. Attempts to address these shortcomings can be divided into three categories: use of direct banded solvers, development of data structures for sparse matrices for use with conventional supercomputers, and special-purpose architectures.
Direct banded solvers are widely used by commercial finite element method (FEM) packages. These methods suffice for average-size 2-D problems in the range of 10^4 to 10^5 equations, but are inefficient, in terms of memory use and computations, for 3-D analyses.
Direct, banded linear solvers are commonly used to solve 2-D FEM problems on supercomputers such as the Cray computers. This is because, in addition to being very sparse, FEM matrices are often banded. Computations are executed on all the entries in the band, so no auxiliary data structure is required. For a 3-D problem with N nodes, the width of the band is estimated to be approximately N^(2/3), so storing the band requires approximately N^(2/3) x N = N^(5/3) entries. In contrast, the original matrix has only approximately 24N nonzero entries (using 3-D brick elements). For a 10^5-node problem, the band-storage requirement is about 215 Mwords, or roughly 90 times that required for the nonzero entries in the original matrix.
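These storage estimates can be reproduced with a short calculation. This is a sketch using only the approximations stated above (an N^(2/3) bandwidth and roughly 24 nonzeros per row); under those figures the ratio works out to about 90:

```python
# Compare band storage with nonzero-only storage for a 3-D FEM
# problem with N = 10^5 nodes, using the text's approximations.
N = 10 ** 5
band_words = N ** (5 / 3)      # band storage: N^(2/3) bandwidth x N rows
nonzero_words = 24 * N         # ~24 nonzeros per row with 3-D brick elements
print(f"band storage : {band_words / 1e6:.0f} Mwords")   # ~215 Mwords
print(f"nonzeros only: {nonzero_words / 1e6:.1f} Mwords")
print(f"ratio        : {band_words / nonzero_words:.0f}x")
```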
The storage requirement for direct banded solvers can exceed the available memory in many supercomputers, causing the analyst to compromise accuracy by down scaling the problem. Moreover, the number of computations required for this class of solvers is proportional to N^(7/3) for 3-D applications, in contrast to N^(3/2) for iterative solvers. Thus, banded solvers are inadequate for 3-D problems.
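The computational gap stated above can be checked numerically. This sketch uses only the stated proportionalities and ignores constant factors:

```python
# Compare operation counts for direct banded vs. iterative solvers
# on a 3-D problem with N = 10^5 equations (proportionalities only;
# constant factors are ignored).
N = 10 ** 5
banded_ops = N ** (7 / 3)      # direct banded solver: ~N^(7/3) operations
iterative_ops = N ** (3 / 2)   # iterative solver:     ~N^(3/2) operations
print(f"banded / iterative = {banded_ops / iterative_ops:.0f}x")
```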
Many data structures for the compact representation of sparse matrices have been developed for use on supercomputers. These structures reflect the dichotomy inherent in efficient storage of sparse matrices: store only the nonzero entries at the expense of short vector lengths, or zero-pad the data structure to achieve long vector lengths. The schemes below are illustrated with the following general sparse matrix K:

        [ k_11   0     k_13   0     0    ]
        [ 0     k_22   0      0    k_25  ]
    K = [ k_31   0     k_33   0    k_35  ]                    (1)
        [ 0      0     0     k_44   0    ]
        [ 0     k_52  k_53    0    k_55  ]

Methods for performing the matrix-vector multiplication, v = Kp, are provided below. As used herein, N denotes the order of matrix K, and N_nz denotes the number of nonzero values in the matrix K.
One conventional data structure known in the art is the Column-Major Nonzero Storage (CMNS) scheme, which consists of three one-dimensional vectors. The nonzero values are stored in column-major order in the vector K_v; the corresponding row indices are stored in the vector R; and the number of nonzeros per column is stored in the vector L. The matrix of equation (1) is represented in the following manner:

    K_v^T = [k_11, k_31, k_22, k_52, k_13, k_33, k_53, k_44, k_25, k_35, k_55]
    R^T   = [1, 3, 2, 5, 1, 3, 5, 4, 2, 3, 5]
    L^T   = [2, 2, 3, 1, 3]
This data structure is fairly compact, requiring storage for N_nz floating-point values and N_nz + N integer values.
Matrix-vector multiplication of matrix K with vector p can be implemented using N vector operations, one per column of K_v, as follows:

    v = 0
    offset = 0
    for j = 1 to N
        for i = 1 to L[j]
            v[R[offset + i]] = v[R[offset + i]] + K_v[offset + i] * p[j]
        offset = offset + L[j]

With this approach, floating point unit (FPU) efficiency is compromised because of vector startup overhead and excessive memory bandwidth. The hardware does not exploit any locality of reference between the updates required for successive vector computations, resulting in excessive input and output operations.
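The CMNS product can be written as a short runnable sketch. The numeric values below are hypothetical entries for the matrix of equation (1), filled column-major with 1 through 11, and 0-based indices are used rather than the 1-based indices of the text:

```python
# CMNS (Column-Major Nonzero Storage) matrix-vector multiply, v = K p.
# Kv: nonzero values in column-major order; R: their row indices;
# L: number of nonzeros per column. Indices here are 0-based.
Kv = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
R  = [0, 2, 1, 4, 0, 2, 4, 3, 1, 2, 4]
L  = [2, 2, 3, 1, 3]

def cmns_matvec(Kv, R, L, p):
    v = [0.0] * len(L)
    offset = 0
    for j, count in enumerate(L):          # one "vector operation" per column
        for i in range(offset, offset + count):
            v[R[i]] += Kv[i] * p[j]        # indirect (scattered) update of v
        offset += count
    return v

p = [1.0, 1.0, 1.0, 1.0, 1.0]
print(cmns_matvec(Kv, R, L, p))            # -> [6.0, 12.0, 18.0, 8.0, 22.0]
```

Note the scattered write `v[R[i]] += ...`: every floating-point update requires a read of the auxiliary vector R, which is the indirect-access overhead discussed above.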
Another storage scheme known in the art, ITPACK, uses two rectangular arrays, K_v and C. The rows of K_v contain the nonzero entries from the rows of K, padded with zeros to ensure equal row length. The length of each row is M = max{length(k_i)} over i = 1 ... N, where length(k_i) represents the number of nonzero entries in row i of K. C contains the corresponding column indices. For the matrix K of equation (1), K_v and C are defined as:

    K_v = [ k_11  k_13  0    ]        C = [ 1  3  D ]
          [ k_22  k_25  0    ]            [ 2  5  D ]
          [ k_31  k_33  k_35 ]            [ 1  3  5 ]
          [ k_44  0     0    ]            [ 4  D  D ]
          [ k_52  k_53  k_55 ]            [ 2  3  5 ]

The D values in C correspond to the zero padding in K_v, and can be set to any value between 1 and N. This data structure requires storage for NM floating-point values and NM integer values.
Pseudocode for implementation of matrix-vector multiplication is:

    for j = 1 to M
        for i = 1 to N
            v[i] = v[i] + K_v[i, j] * p[C[i, j]]

This code entails M vector operations, where each vector has a length equal to N.
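A corresponding runnable sketch of the ITPACK product follows, using hypothetical numeric entries for the matrix of equation (1), here stored row by row. Indices are 0-based, and the padding indices (the D values) are set to 0 since they multiply a stored zero:

```python
# ITPACK matrix-vector multiply, v = K p.
# Kv rows hold the nonzeros of each matrix row, zero-padded to width M;
# C holds the corresponding column indices (padding indices are
# arbitrary, here 0, because they multiply a stored zero).
Kv = [[1.0, 5.0, 0.0],
      [3.0, 9.0, 0.0],
      [2.0, 6.0, 10.0],
      [8.0, 0.0, 0.0],
      [4.0, 7.0, 11.0]]
C  = [[0, 2, 0],
      [1, 4, 0],
      [0, 2, 4],
      [3, 0, 0],
      [1, 2, 4]]

def itpack_matvec(Kv, C, p):
    N, M = len(Kv), len(Kv[0])
    v = [0.0] * N
    for j in range(M):                     # M vector operations ...
        for i in range(N):                 # ... each of length N
            v[i] += Kv[i][j] * p[C[i][j]]  # gather p via the column indices
    return v

p = [1.0, 1.0, 1.0, 1.0, 1.0]
print(itpack_matvec(Kv, C, p))             # -> [6.0, 12.0, 18.0, 8.0, 22.0]
```

The long, uniform vectors come at the price of wasted work: the zero-padded slots are multiplied and accumulated like any other entry.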
FPU operation with this scheme is only marginally efficient. The relatively low efficiency is attributable to the large number of computations involving zero padding (typically, approximately 12-48% of the stored data consists of zero padding). In addition, memory reads are required for K_v, C, v, and the gathered values of p, and memory writes for v. This large memory bandwidth requirement generally exceeds the available memory bandwidth.
Special purpose architectures have also been utilized to improve the processing of sparse matrices. In particular, an application-specific architecture called the White Dwarf has been developed at Carnegie Mellon University. This architecture exploits matrix symmetry by storing only the upper triangular portion of the matrix; consequently, both the matrix data and the multiplier data are accessed indirectly. Good speedup is obtained by using five memory ports and running problems for which all data fit in main memory.
In sum, while a number of proposals have been made to improve the processing of sparse matrices, there are still a number of shortcomings associated with the prior art. In particular, a high memory bandwidth is involved, expensive indirect accessing techniques are employed, and floating point processing potential is not exploited.
Ideally, solutions to prior art problems will utilize standard components which may be used with existing and future architectures. In particular, the solution should be applicable to general purpose architectures.