Field of the Invention
The present invention generally relates to a non-uniform memory access (NUMA) computer system. More specifically, the present invention relates to computing eigenvalues and eigenvectors of a very large matrix in a NUMA computer system.
Description of the Related Art
Some data modeling applications use matrices that describe transformations of physical parameters. For example, a cosmological model of a galaxy might use a matrix to describe the motion of stars in space. A finite element model of a material may use a matrix to model stresses in a material at a number of different locations. These matrices transform initial property vectors of the model into final property vectors by standard matrix multiplication
For any transformation by matrix multiplication, there may be certain vectors for which the transformation merely acts to lengthen or shorten the vector. These vectors are called “eigenvectors” of the transformation. Eigenvectors provide “preferred directions”; vectors parallel to eigenvectors are not rotated by the transformation. The corresponding scaling factor of the lengthening or shortening for a given direction is called the “eigenvalue” for a the eigenvector.
Different eigenvectors may have different corresponding eigenvalues, and eigenvectors with an eigenvalue of 1 are not lengthened or shortened by the transformation; for these vectors, the transformation preserves length. Eigenvectors and eigenvalues provide a useful mathematical tool to analyze matrix transformations. It is, therefore, desirable to be able to compute eigenvectors and eigenvalues (collectively, “eigenpairs”) for any given matrix.
Several techniques are known to calculate eigenpairs of a matrix. One family of “eigensolver” techniques first reduces the matrix to a tridiagonal form. A tridiagonal form is a form in which the main diagonal of the matrix and the diagonals just above and below may contain non-zero numbers; all other entries are zero. Such an eigensolver computes the eigenpairs of the tridiagonal matrix then converts the computed eigenvectors back to the original reference system.
In order to improve scalability, the ACM TOMS 807 algorithm, or successive band reduction (“SBR”), is often employed for the tridiagonal reduction phase. In SBR, the initial, densely-populated matrix is reduced to a multiple band intermediate form having many non-zero diagonals in a first stage. The matrix is later reduced from the multiple band form to the tridiagonal (three band) form in the second stage.
After calculating the eigenpairs in a third stage, the eigenvector back-transformation also requires two stages: from the tridiagonal to the multiple band reference system in stage four and to the original dense reference system in stage five. The multistage SBR approach allows highly scalable BLAS-3 computing kernels to be used, but the two stage eigenvector back-transformation introduces additional floating point operations that influence scalability.
LAPACK (Linear Algebra Package) is a standard software library for numerical linear algebra. LAPACK provides routines for solving systems of linear equations and linear least squares, eigenvalue problems, and singular value decomposition. LAPACK also includes routines to implement associated matrix factorizations. The ScaLAPACK (or Scalable LAPACK) library includes a subset of LAPACK routines redesigned for distributed memory MIMD parallel computers. ScaLAPACK allows for interprocessor communication and assumes matrices are laid out in a two-dimensional block cyclic decomposition.
LAPACK and ScaLAPACK both have underlying deficiencies. For example, neither LAPACK or ScaLAPACK employ SBR calculations. Nor is LAPACK designed for scalability. And the ScaLAPACK library communicates using the aforementioned message passing, which is slower than communication using shared memory for tridiagonal eigensolvers.
The Parallel Linear Algebra Software for Multicore Architectures (“PLASMA”) is a mathematical library for performing conversion of a dense symmetric array to and from tridiagonal form on shared memory systems with multi-core processors. An example of such systems are shown in FIG. 1-3 as described below. PLASMA includes a tiled storage of the data arrays and a DAG (directed acyclic graph) scheduling of the computational subtasks. PLASMA improves over older LAPACK and ScaLAPACK libraries in terms of memory usage pressure, process synchronization requirements, task granularity, and load balance. PLASMA results in increased performance and scalability for sufficiently large problems.
PLASMA nevertheless suffers from a number of shortcomings, especially with respect to solving very large problems. For example, PLASMA is limited to 32-bit architectures. Because matrix entries are addressed using 32-bit signed integers, the largest matrix on which PLASMA may operate has a dimension N limited to the square root of the largest 32-bit signed integer (N=sqrt(231)=46340). This value of N is too low to operate on matrices that have hundreds of thousands or millions of rows and columns, as is required by some computations.
Because PLASMA is limited to smaller matrices, PLASMA does not use a tridiagonal eigenpair solver that effectively scales for larger matrices. While algorithms such as the multiple relatively robust representations (“MRRR”) algorithm exist to achieve such scaling, PLASMA is not designed to use them. PLASMA also operates on a fixed pool of threads. This is disadvantageous because different subtasks that occur in the five stages of the eigensolver may occur in parallel and have different computational complexities. A static thread allocation for all five stages is inefficient.
There is a need in the art for shared memory systems that use a combination of SBR and MRRR techniques to calculate eigenpairs for dense matrices having very large numbers of rows and columns. There is a further need for shared memory systems that are not merely “scaled up” versions of PLASMA that allow for increased scalability but that allow for the use of a highly scalable tridiagonal eigensolver. There is a still further need for a shared memory system that is capable of allocating a different number of threads to each of the different computational stages of the eigensolver.