1. Field of the Invention
The present invention generally relates to data processing techniques and, in particular, to a system and method for performing matrix operations.
2. Related Art
In some computer applications, it is desirable to perform various matrix operations, such as matrix addition and/or matrix multiplication. Matrices can sometimes comprise a large amount of data, thereby causing matrix operations to consume significant processing time. Thus, various techniques have been developed for enabling computer processors to efficiently process matrix operations.
Unfortunately, a limiting factor in matrix operation efficiency is the cache size of the processor executing the matrix operation. Indeed, when the matrices being mathematically combined become so large that they cannot fit into the processor's cache, the matrices are typically partitioned into smaller matrix portions. Corresponding matrix portions of the matrices are then sequentially combined by the processor.
In this regard, the processor typically retrieves a set of corresponding matrix portions from each of the matrices and then mathematically combines the corresponding matrix portions to yield a combined matrix portion. This combined matrix portion is then returned to memory, and the next set of corresponding matrix portions is then retrieved and combined. This process is repeated for different portions of the matrices until the matrix operation is complete.
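By way of illustration only, the retrieve, combine, and return cycle described above can be sketched for matrix addition as follows. The function name `blocked_add` and the `tile` parameter are hypothetical names chosen for this sketch, and plain Python lists stand in for memory; the sketch is not intended to represent any particular processor or prior implementation.

```python
def blocked_add(a, b, tile=2):
    """Add two equally sized matrices one portion (tile) at a time.

    `tile` is an assumed portion size; in practice it would be chosen
    so that the portions being combined fit in the processor's cache.
    """
    rows, cols = len(a), len(a[0])
    result = [[0] * cols for _ in range(rows)]
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            r1 = min(r0 + tile, rows)
            c1 = min(c0 + tile, cols)
            # Retrieve the corresponding portions from each matrix.
            part_a = [row[c0:c1] for row in a[r0:r1]]
            part_b = [row[c0:c1] for row in b[r0:r1]]
            # Mathematically combine them to yield a combined portion.
            combined = [
                [x + y for x, y in zip(row_a, row_b)]
                for row_a, row_b in zip(part_a, part_b)
            ]
            # Return the combined portion to memory.
            for i, row in enumerate(combined):
                result[r0 + i][c0:c1] = row
    return result
```

Each pass of the inner loop handles one set of corresponding portions, and the loops repeat until every portion of the matrices has been processed.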
By sequentially operating on small portions of the matrices, the number of cache misses that occur within the processor for the matrix operation can be reduced, thereby helping to optimize the performance of the processor in performing the matrix operation. However, partitioning and operating on the matrices, as described above, can introduce various delays that adversely impact the performance of the processor.
For example, it is sometimes necessary and/or desirable to copy the partitioned matrix portions into a temporary contiguous memory workspace while the processor is operating on the partitioned matrix portions. This copying can help to further reduce the occurrence of cache misses, thereby helping to reduce the bandwidth requirements for the bus between the processor and memory. However, the copying of these partitioned matrix portions into the memory workspace can introduce significant delays, particularly for large matrices that are partitioned into a large number of matrix portions. Such copying delays typically occur during the time periods when a new set of matrix portions is copied into the memory workspace in order to replace a previously processed set of matrix portions.
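As a minimal sketch of such copying, assume a matrix stored as a flat, row-major list, so that the rows of a matrix portion are separated in memory by the full row length. The function name `pack_tile` and all of its parameters are hypothetical and are used here only for illustration.

```python
def pack_tile(matrix, n_cols, r0, c0, tile_rows, tile_cols, workspace):
    """Copy one matrix portion into a contiguous workspace.

    `matrix` is assumed to be a flat row-major list with `n_cols`
    elements per row. The portion starts at row `r0`, column `c0`.
    `workspace` must hold at least tile_rows * tile_cols elements.
    """
    k = 0
    for r in range(r0, r0 + tile_rows):
        start = r * n_cols + c0
        # Rows of the portion are n_cols apart in the source matrix
        # but are written back to back in the workspace.
        workspace[k:k + tile_cols] = matrix[start:start + tile_cols]
        k += tile_cols
    return workspace
```

In this sketch, every call performs tile_rows * tile_cols element copies before any arithmetic on the portion can begin, which is the kind of delay described above when one set of portions replaces another in the workspace.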