1. Field of the Invention
The present invention generally relates to improving the efficiency of linear algebra subroutines. More specifically, a novel method both saves computer memory and improves performance by forming the transpose of a matrix A as follows: the matrix A is subdivided into one or more square submatrices Aij along a row or column direction; an in-place square matrix transposition is performed on each square submatrix Aij to form transposed square submatrices Aij′; and the transpose B′ of any remaining leftover rectangular piece B of A is formed using a standard out-of-place transpose algorithm. The final transpose of matrix A is formed by connecting the transposed square submatrices Aij′ with B′.
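The steps just described can be illustrated with a minimal sketch. It assumes, for illustration only, that A is an m×n matrix with m ≥ n held as a Python list of rows, and that the subdivision runs along the row direction. The function names are hypothetical, and the sketch does not model how the pieces are laid out in memory.

```python
def transpose_square_in_place(block):
    """Transpose a square block by swapping elements across its diagonal."""
    n = len(block)
    for i in range(n):
        for j in range(i + 1, n):
            block[i][j], block[j][i] = block[j][i], block[i][j]


def transpose_out_of_place(b):
    """Standard out-of-place transpose: build a new matrix with rows and columns swapped."""
    if not b:
        return []
    return [[b[i][j] for i in range(len(b))] for j in range(len(b[0]))]


def transpose_by_square_blocks(a):
    """Transpose an m x n matrix (m >= n assumed) via square blocks plus a leftover piece."""
    m, n = len(a), len(a[0])
    q, r = divmod(m, n)  # q square n x n blocks, r leftover rows

    # View A as q square blocks stacked along the row direction.
    # Each slice shares its row objects with a, so the swaps below modify a itself.
    blocks = [a[k * n:(k + 1) * n] for k in range(q)]
    for block in blocks:
        transpose_square_in_place(block)  # no second copy of the block is made

    # Only the leftover r x n piece B needs a separate output buffer.
    b_t = transpose_out_of_place(a[q * n:])  # n x r, or empty when r == 0

    # Connect the transposed squares and B' side by side to form the n x m result.
    result = []
    for i in range(n):
        row = []
        for block in blocks:
            row.extend(block[i])
        if r:
            row.extend(b_t[i])
        result.append(row)
    return result
```

For a 5×2 input, the sketch transposes two 2×2 blocks in place and one 1×2 leftover out of place. The final assembly loop copies rows only to present the result as a single list; it stands in for the "connecting" step and is not meant to suggest how an actual implementation would join the pieces in storage.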
2. Description of the Related Art
Scientific computing relies heavily on linear algebra. In fact, the whole field of engineering and scientific computing takes advantage of linear algebra for computations. Linear algebra routines are also used in games and graphics rendering. Typically, these linear algebra routines reside in a math library of a computer system that utilizes one or more linear algebra routines as a part of its processing. Linear algebra is also heavily used in analytic methods that include applications such as supply chain management.
A number of methods have been used to improve the performance of linear algebra routines on new or existing computer architectures. However, because linear algebra permeates so many calculations and applications, a need continues to exist to optimize, in any way possible, the performance of matrix processing on a computer. Also, each new generation of computers may require redoing or re-optimizing previously optimized programs, or at least “tweaking” them.
Linear Algebra Subroutines
The present invention concerns the matrix operation of transposition and is applicable in any environment in which transposition is performed. An exemplary intended environment of the present invention is the computing standard called LAPACK (Linear Algebra PACKage) and the various subroutines contained therein. Information on LAPACK is readily available on the Internet (e.g., reference the website at netlib.org).
For purposes of discussion only, the Level 3 BLAS (Basic Linear Algebra Subprograms) used by LAPACK are referenced, but it is intended to be understood that the concepts discussed herein are easily extended to other linear algebra mathematical standards and math library modules. Indeed, any routine or subroutine that involves matrix transposition would benefit from the method discussed herein.
When LAPACK is executed, the Basic Linear Algebra Subprograms (BLAS), unique for each computer architecture and provided by the computer vendor, are invoked. LAPACK comprises a number of factorization algorithms for linear algebra processing.
For example, Dense Linear Algebra Factorization Algorithms (DLAFAs) include matrix multiply subroutine calls, such as Double-precision Generalized Matrix Multiply (DGEMM). At the core of level 3 Basic Linear Algebra Subprograms (BLAS) are “L1 kernel” routines which are constructed to operate at near the peak rate of the machine when all data operands are streamed through or reside in the L1 cache.
The most heavily used type of level 3 L1 DGEMM kernel is Double-precision A Transpose multiplied by B (DATB), that is, C = C − A^T*B, where A, B, and C are generic matrices or submatrices, and the symbology A^T means the transpose of matrix A. It is noted that DATB is usually the only such kernel employed by today's state of the art codes, although DATB is only one of six possible kernels.
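The arithmetic of the DATB update can be sketched as follows. This is only an illustration of the formula C = C − A^T*B, assuming matrices held as Python lists of rows; the function name is hypothetical, and a production kernel would be tuned for the L1 cache of a specific machine rather than written as plain loops.

```python
def datb_update(c, a, b):
    """Compute C = C - A^T * B in place.

    a is K x M, b is K x N, and c is M x N, each a list of rows.
    """
    k_dim, m_dim, n_dim = len(a), len(a[0]), len(b[0])
    for i in range(m_dim):
        for j in range(n_dim):
            acc = 0.0
            for k in range(k_dim):
                # Element (i, k) of A^T is a[k][i]; no transposed copy of A is formed.
                acc += a[k][i] * b[k][j]
            c[i][j] -= acc
    return c
```

Reading a[k][i] in place of an explicit transpose reflects the point that the kernel consumes A in its transposed orientation.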
The DATB kernel operates so as to keep the A operand matrix or submatrix resident in the L1 cache. Since A is transposed in this kernel, its dimensions are K1 by M1, where K1×M1 is roughly equal to the size of the L1 cache. Matrix A can be viewed as being stored by row, since in Fortran, a non-transposed matrix is stored in column-major order and a transposed matrix is equivalent to a matrix stored in row-major order. Because of asymmetry (C is both read and written), K1 is usually made to be greater than M1, as this choice leads to superior performance.
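The storage equivalence noted above can be checked with a short sketch. It assumes a flat buffer holding a K1×M1 matrix in column-major order, as in Fortran; the helper names are hypothetical. Reading the same buffer in row-major order as an M1×K1 matrix yields the transpose, with no data movement.

```python
def at_col_major(buf, rows, i, j):
    """Element (i, j) of a matrix with the given row count, stored column by column."""
    return buf[i + j * rows]


def at_row_major(buf, cols, i, j):
    """Element (i, j) of a matrix with the given column count, stored row by row."""
    return buf[i * cols + j]


K1, M1 = 3, 2
buf = [1, 2, 3, 4, 5, 6]  # column-major A: columns are (1, 2, 3) and (4, 5, 6)

# A(k, m) read column-major equals A^T(m, k) read row-major from the same buffer.
same = all(
    at_col_major(buf, K1, k, m) == at_row_major(buf, K1, m, k)
    for k in range(K1)
    for m in range(M1)
)
```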
Matrix transposition is an important sub-operation appearing in many matrix algorithms, such as solving dense linear systems, matrix multiply, and eigenvalue computations.
The conventional solution for matrix transposition, exemplarily represented in FIG. 1 as process 100, is to double the space and explicitly produce a copy of the transpose matrix 102, leaving the original matrix 101 intact. It is noted that the symbols “[T]” and “′” each represent the transposition operation.
The drawback to this conventional method is that both space and computer time (performance) are wasted, since, if A is a matrix of size m×n, then the transposition process 100 requires that two copies of the matrix data (e.g., 2mn elements) be stored in memory. Therefore, any process that reduces the space or the number of operations relative to this conventional method would improve performance of the matrix transposition operation.
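The conventional out-of-place approach can be sketched as follows, assuming a flat column-major buffer for an m×n matrix; the function name is hypothetical. The second buffer of mn elements is what brings the total storage to 2mn.

```python
def transpose_copy(a_flat, m, n):
    """Return the n x m transpose of an m x n matrix held in a flat column-major list."""
    t_flat = [0] * (m * n)  # second buffer of mn elements; the original stays intact
    for j in range(n):
        for i in range(m):
            # A(i, j) at offset i + j*m becomes A'(j, i) at offset j + i*n.
            t_flat[j + i * n] = a_flat[i + j * m]
    return t_flat
```

Because every element is written to a location in a separate buffer, the index arithmetic is simple, but both the original and the copy are resident at the same time.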
This space and operational efficiency is of particular concern as dimensions m or n increase in size so that the matrix is too large to be brought into a cache memory in its entirety for processing of the entire matrix. That is, because dimensions m or n are now typically of the order of hundreds or thousands, linear algebra processing is done in a piecemeal manner, and even the simple process of matrix transposition becomes burdensome in both time and memory space for large matrices.
Therefore, because of the importance of matrix transposition in linear algebra, a need exists to reduce the memory space and computer time required to execute matrix transposition.