This invention relates to methods for multiplying matrices using computers with hierarchical memory.
The multiplication of matrices is an important operation for computers. It is useful for a wide variety of operations, one of them being the solution of simultaneous equations. In the interest of efficiency, it is highly desirable to perform these operations quickly. For example, a simulation for the purpose of research and/or development that can run more quickly will enhance the productivity of the scientist or engineer without added hardware costs.
To solve large matrix multiplications in large or real-time problems, scientists and engineers have turned to supercomputers that offer high operating speeds, pipelined architectures and/or parallel processing. A series of subroutine libraries has been developed for matrix multiplication, among other operations, on these machines. These libraries provide the high-level function of matrix multiplication and are available on a variety of supercomputers. Each subroutine is usually written for a specific machine, taking that machine's architecture into consideration. In this way, the programmer may call a simple matrix algebra subroutine and expect rapid performance from the system. Commonly used libraries of this kind are the BLAS (Basic Linear Algebra Subprograms) subroutine libraries. The BLAS3 subroutine library contains matrix operation subroutines, in particular a subroutine called DGEMM, which performs matrix multiplications. For more information on BLAS, see C. Lawson, R. Hanson, D. Kincaid and F. Krogh, "Basic Linear Algebra Subprograms for FORTRAN Usage", ACM Trans. on Math. Soft., 5 (1979), 308-325, or J. J. Dongarra, J. DuCroz, S. Hammarling and R. Hanson, "An Extended Set of Linear Algebra Subprograms", ACM Trans. on Math. Soft., 14,1 (1988) 1-32. For more information on BLAS3, see J. J. Dongarra, J. DuCroz, I. Duff, and S. Hammarling, "Set of Level 3 Basic Linear Algebra Subprograms", ACM Trans. on Math. Soft. (Dec. 1989).
In designing these subroutines, speed is important. For any given architecture, there are several limiting parameters that affect matrix multiplication speed. A first limiting parameter is the number of computations needed to perform the operation. Upon first consideration, it would seem that the number of operations needed to perform a matrix multiplication would be of order n^3, where n^2 is the number of elements in each of the term matrices.
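The operation count can be illustrated with a minimal sketch, assuming square n-by-n matrices stored as Python lists of rows. The function name `naive_matmul` and the multiplication counter are illustrative and do not come from any library discussed here.

```python
def naive_matmul(a, b):
    """Multiply two n-by-n matrices given as lists of rows.

    Returns the product and the number of scalar multiplications used.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    multiplications = 0
    for i in range(n):          # each row of the result
        for j in range(n):      # each column of the result
            total = 0
            for k in range(n):  # inner product of row i and column j
                total += a[i][k] * b[k][j]
                multiplications += 1
            c[i][j] = total
    return c, multiplications


product, count = naive_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, count)  # [[19, 22], [43, 50]] 8
```

Each of the n^2 result elements requires n multiplications, so the counter reaches n^3 (here 2^3 = 8).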
It has been shown, however, that this order can be reduced to n^2.49 by using asymptotic complexity reduction methods of the type of Strassen's. This can lead to greater computational efficiency for large matrices, although there may be some accuracy tradeoffs. Strassen's method, and an implementation thereof, are discussed in "Gaussian Elimination is Not Optimal", Numerische Mathematik, Vol. 13, 1969, pp. 354-356 by V. Strassen, and "Extra High Speed Matrix Multiplication on the Cray-2", SIAM J. Sci. Stat. Comput., Vol. 9, No. 3, May 1988, by D. H. Bailey. Other asymptotic complexity reduction methods are discussed in V. Pan, "New Fast Algorithms for Matrix Operations", SIAM J. Comp., 9 (1980), 321-342.
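As an illustration of this family of methods, the following is a minimal sketch of Strassen's original recursion, assuming square matrices whose size is a power of two and which are stored as lists of rows. The helper names (`add`, `sub`, `split`, `strassen`) are hypothetical. Strassen's recursion replaces the eight half-size products of the conventional approach with seven, which gives an exponent of log2(7), about 2.81; the lower exponents cited above come from later methods of the same type and are not shown here.

```python
def add(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def sub(x, y):
    """Element-wise difference of two equally sized matrices."""
    return [[p - q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def split(m):
    """Split a matrix into four equally sized quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen(a, b):
    """Return (a times b, number of scalar multiplications used)."""
    if len(a) == 1:
        return [[a[0][0] * b[0][0]]], 1
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    # Seven recursive products instead of the conventional eight.
    m1, k1 = strassen(add(a11, a22), add(b11, b22))
    m2, k2 = strassen(add(a21, a22), b11)
    m3, k3 = strassen(a11, sub(b12, b22))
    m4, k4 = strassen(a22, sub(b21, b11))
    m5, k5 = strassen(add(a11, a12), b22)
    m6, k6 = strassen(sub(a21, a11), add(b11, b12))
    m7, k7 = strassen(sub(a12, a22), add(b21, b22))
    # Recombine the seven products into the four result quadrants.
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom, k1 + k2 + k3 + k4 + k5 + k6 + k7


product, count = strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, count)  # [[19, 22], [43, 50]] 7
```

The additional additions and subtractions are the source of the accuracy tradeoffs mentioned above, and a practical implementation would likely switch to a conventional method below some matrix size rather than recurse down to single elements.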
A second parameter in speeding up matrix multiplication is system power. The clock speed, word size and number of processors limit the number of operations that may be performed in any given amount of time, and thus the speed at which matrix multiplications may be performed. Improvements along this line have included using faster processors with larger word sizes and using parallel processors. The use of parallel processors has allowed programmers to break the problem up into separate sub-problems and allocate these sub-problems to different processors. These so-called "blocked" methods allow speed improvements by performing more than one operation in parallel.
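The division of a multiplication into independent sub-problems can be sketched as follows, assuming matrices stored as lists of rows. The function names and the `workers` parameter are hypothetical. Python threads are used only to show the decomposition; in the standard CPython interpreter they would not be expected to deliver a real speedup for this kind of arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_row_block(a_rows, b):
    """Multiply a block of rows of A by the whole of B."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns]
            for row in a_rows]


def parallel_matmul(a, b, workers=2):
    """Split A into row blocks and hand each block to a separate worker."""
    n = len(a)
    step = max(1, -(-n // workers))  # ceiling division: rows per worker
    chunks = [a[i:i + step] for i in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each sub-problem is independent, so the blocks need no coordination.
        partial = list(pool.map(lambda rows: multiply_row_block(rows, b), chunks))
    # Stitch the row blocks back together in their original order.
    return [row for block in partial for row in block]


print(parallel_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Partitioning by rows of the first matrix is only one possible choice; each worker still reads all of the second matrix, which is where the memory considerations discussed next come into play.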
Another, more elusive, limiting parameter is memory performance. Two components of memory performance of interest are memory bandwidth and memory latency. Memory bandwidth is indicative of the overall available throughput rate of data from the memory to the processor. Memory latency is a measure of the time taken to retrieve the contents of a memory location, measured from the time at which it is requested to the time at which it is delivered to the requesting processor. Both latency and bandwidth may be degraded from the point of view of a processor if it must share its memory resources with other processors or peripheral devices. Computational methods may be affected by one or both of these limiting parameters, as the processor may be forced to stall while waiting for data from memory.
These problems have been addressed by computer designers in different ways. Memory bandwidth has been increased by using memories that cycle faster, and by using larger word sizes. Latency has been addressed by using memories with faster access times and by making computers more hierarchical. This involves adding small areas of expensive high speed memory that are local to a processor. Examples of hierarchical memory include cache memories, virtual memory, and large register sets.
A method of matrix multiplication that ignores latency and bandwidth issues may be theoretically efficient, in that it reduces the order of the number of operations to be performed or splits the operations into blocked operations that may be performed by several processors, and yet still fall short of good performance. This may happen because the processor spends much of its time waiting for slow memory fetches, or because it is competing for access to a shared memory resource, so that the processing rate is slower than optimal.
This problem has become more pronounced as computers tend to include more hierarchical elements, as memory elements tend to be shared more often, and as processors become faster and more complex, while large semiconductor memories, disks and other mechanical storage systems have not increased in speed at the same pace.
Blocked methods for matrix multiplication have been used for computers with hierarchical memories (K. Gallivan, W. Jalby, U. Meier and A. Sameh, "Impact of Hierarchical Memory Systems on Linear Algebra Algorithm Design", The International Journal of Supercomputer Applications, 2,1 Spring 1988, 12-48).
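A blocked method of this general kind can be sketched as follows. This is a generic loop-tiling illustration, not the specific method of the cited reference; the function name `blocked_matmul` and the `block` parameter are hypothetical, and the block size would in practice be chosen so that the sub-blocks in use fit in the fast level of the memory hierarchy.

```python
def blocked_matmul(a, b, block=2):
    """Multiply two n-by-n matrices by iterating over square sub-blocks."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, block):          # block row of C
        for jj in range(0, n, block):      # block column of C
            for kk in range(0, n, block):  # block along the shared dimension
                # Work on one small block at a time so that the data
                # touched by the inner loops is reused while it is nearby.
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, n)):
                        total = c[i][j]
                        for k in range(kk, min(kk + block, n)):
                            total += a[i][k] * b[k][j]
                        c[i][j] = total
    return c


print(blocked_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], block=1))
# [[19, 22], [43, 50]]
```

The arithmetic is the same as in the conventional triple loop; only the order in which elements are visited changes, which is what allows each fetched sub-block to be reused several times before it is displaced.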