1. Field of the Invention
Embodiments of the present invention relate generally to the field of computing devices and more specifically to a technique for performing efficient matrix multiplication operations on a parallel processing device.
2. Description of the Related Art
Modern computing applications often require matrix operations, such as linearly scaling a matrix, transposing a matrix or computing the product of two matrices, to be performed when carrying out certain tasks such as solving complex sets of simultaneous linear equations. Commonly, these individual matrix operations are combined into larger processing steps and executed from a scientific computing library such as the Basic Linear Algebra Subprograms (“BLAS”) library. The BLAS library includes a function that computes the matrix product of matrices “A” and “B” in memory, scales the resulting product matrix by a linear scaling factor alpha (“α”), scales a matrix “C” in memory by a linear scaling factor beta (“β”), adds the scaled product of “A” and “B” to the scaled “C” matrix and stores the result in matrix “C” (“C=αA·B+βC”). Additionally, one or more of matrices A, B and C may be transposed before performing the aforementioned operation.
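The operation C=αA·B+βC described above can be sketched as follows. This is a minimal illustration only, assuming matrices stored as Python lists of rows; the function name gemm_in_place and its parameter order are hypothetical and do not reproduce the calling convention of any BLAS implementation. Transposition is omitted for brevity.

```python
def gemm_in_place(alpha, a, b, beta, c):
    """Overwrite c with alpha * (a times b) + beta * c.

    a is m-by-k, b is k-by-n and c is m-by-n, each a list of rows.
    a and b are assumed not to alias c.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("c must have shape m-by-n")

    for i in range(m):
        for j in range(n):
            # Accumulate row i of a against column j of b.
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            # Scale the product term and the existing c term, then store.
            c[i][j] = alpha * acc + beta * c[i][j]
    return c


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[2.0, 4.0], [6.0, 8.0]]
gemm_in_place(2.0, a, b, 0.5, c)
```

Each element of C is read once and written once, while each element of A and B is read many times; this reuse pattern is what makes the memory behavior of the operation important.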
As is well known, matrix operations are computationally expensive, and the performance of an application may be limited by the processing time required for the matrix operations within the application. Further, as the size of the referenced matrices increases, the approximate computational cost of matrix multiplication increases with the cube of one dimension (i.e., where “n” is the number of elements in one dimension of a square matrix, the computational cost is proportional to n³).
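The cubic growth can be checked by counting the multiply-add steps of the straightforward three-loop algorithm. The sketch below is illustrative only; the function name multiply_add_count is hypothetical, and the count applies to the conventional algorithm for square n-by-n matrices.

```python
def multiply_add_count(n):
    """Count the multiply-add steps in a three-loop n-by-n matrix product."""
    count = 0
    for i in range(n):          # each row of the result
        for j in range(n):      # each column of the result
            for p in range(n):  # each term of the inner sum
                count += 1
    return count


small = multiply_add_count(8)
large = multiply_add_count(16)
```

Doubling the dimension from 8 to 16 multiplies the number of steps by eight, as expected for a cost proportional to n³.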
One solution to the matrix operation problem involves using the microprocessor in a personal computer to perform the matrix operations. One drawback of this approach is that such microprocessors typically have a limited amount of arithmetic and memory access logic, thereby limiting the number of concurrent arithmetic and memory operations that the microprocessor can perform as well as the overall performance of the matrix operation. Another solution involves using a multiprocessing computing device to perform the matrix operations. These devices typically have far more arithmetic and memory logic than personal computers, enabling multiprocessing computing devices to perform more concurrent arithmetic and memory operations, thereby increasing the performance of the matrix operations relative to personal computers. Such multiprocessing computing devices, however, are far more expensive than personal computers and therefore are not a cost-effective solution to the matrix operation problem.
Yet another solution to the matrix operation problem involves using a graphics processing unit within a graphics adapter to perform matrix operations. Graphics processing units are configured to rapidly execute sophisticated graphics algorithms on large video data sets and are thus capable of delivering high computational bandwidth and high memory bandwidth. Although such capabilities seem attractive for performing complex matrix operations, typical graphics processing units impose a streaming or serialized computational model, which requires a large memory bandwidth to efficiently transmit matrix data between the memory and the individual processing units. In short, the memory bandwidth requirements for efficient matrix operations typically outstrip the actual memory bandwidth provided in conventional graphics processor designs, and such limitations decrease the performance of conventional graphics processing units when executing matrix operations.
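To make the bandwidth concern concrete, the following rough sketch counts operand fetches under a deliberately simplified streaming model, assuming every multiply-add fetches both of its operands from memory and nothing is reused on chip. The model and the function names are illustrative assumptions and do not describe any particular graphics processing unit.

```python
def streamed_operand_reads(n):
    """Operand fetches when each of the n**3 multiply-adds reads both operands."""
    return 2 * n ** 3


def unique_operand_elements(n):
    """Distinct input elements held in two n-by-n matrices."""
    return 2 * n ** 2


def fetches_per_element(n):
    """How many times each input element is fetched in the no-reuse model."""
    return streamed_operand_reads(n) // unique_operand_elements(n)


ratio = fetches_per_element(1024)
```

Under these assumptions each input element is fetched n times, so memory traffic grows with n³ while the underlying data grows only with n².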
As the foregoing illustrates, what is needed in the art is a computing device that performs matrix operations in a more efficient and cost-effective manner.