Numerical linear algebra is fundamental to scientific computing, financial engineering, image and signal processing, data mining, bioinformatics, and many other applications. The performance-critical portions of such scientific and other computationally intensive applications can include a set of fundamental linear algebra operations involving vectors and matrices. These operations can be either memory bandwidth-bound or computation-bound, depending on the number of memory operations performed compared to the number of arithmetic operations. A general principle in designing parallel linear algebra algorithms is divide and conquer: the matrices are divided into sub-matrices, and sequential algorithms process these sub-matrices in parallel, a technique that can be termed blocking.
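The distinction between memory bandwidth-bound and computation-bound operations can be illustrated by comparing arithmetic operations to memory operations. The following is a minimal sketch; the function names and the simple operation counts (every operand read once, every result written once, no cache reuse modeled) are illustrative assumptions, not part of the original text.

```python
def arithmetic_intensity(flops, memory_ops):
    """Ratio of arithmetic operations to memory operations (element reads + writes)."""
    return flops / memory_ops


def axpy_intensity(n):
    # y = a*x + y on vectors of length n (assumed counts):
    # 2n arithmetic operations, 2n element reads (x and y), n element writes (y).
    return arithmetic_intensity(2 * n, 3 * n)


def gemm_intensity(n):
    # C = A*B + C on n-by-n matrices (assumed counts):
    # 2n^3 arithmetic operations, 3n^2 element reads (A, B, C), n^2 element writes (C).
    return arithmetic_intensity(2 * n**3, 4 * n**2)


if __name__ == "__main__":
    for n in (100, 1000):
        print(n, axpy_intensity(n), gemm_intensity(n))
```

Under these assumed counts the vector update has a constant ratio below one regardless of size, which suggests a bandwidth-bound operation, while the ratio for the matrix product grows with n, which suggests a computation-bound one.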
The size of the blocks of the vectors and matrices is decided based on several factors, such as the memory hierarchy architecture, the size of memory at each level of the hierarchy, the number of vectors/matrices, etc. Usually the blocks are small compared to the vector/matrix dimensions. As a result, in the case of matrices, the adjacent columns/rows of a block are non-contiguous in memory, and multiple memory accesses are required to fetch them. Moreover, the starting addresses of the columns/rows of the block may not have the same memory alignment. Thus, the efficiency of the memory accesses depends on the alignment of the starting addresses and the size of the columns/rows in the blocks.
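The effect described above can be seen by computing where each column of a block starts in memory. The following is a minimal sketch assuming a column-major matrix; the function names, the leading dimension `lda`, the 8-byte element size, and the 128-byte alignment boundary are illustrative assumptions.

```python
def block_column_addresses(base, lda, elem_size, row0, col0, block_cols):
    """Byte address of the first element of each column of a block.

    Assumes column-major storage: element (i, j) lives at
    base + (j * lda + i) * elem_size, where lda is the leading dimension.
    """
    return [base + ((col0 + j) * lda + row0) * elem_size
            for j in range(block_cols)]


def alignment_offsets(addresses, alignment=128):
    """Offset of each address from the previous alignment boundary (0 means aligned)."""
    return [a % alignment for a in addresses]


if __name__ == "__main__":
    # Hypothetical 100-row matrix of 8-byte elements whose base address
    # is 8 bytes past a 128-byte boundary.
    starts = block_column_addresses(base=4104, lda=100, elem_size=8,
                                    row0=0, col0=0, block_cols=4)
    print(starts)                     # [4104, 4904, 5704, 6504]
    print(alignment_offsets(starts))  # [8, 40, 72, 104]
```

In this hypothetical case the column starts are 800 bytes apart, so the columns are non-contiguous for any block narrower than the matrix, and each column begins at a different offset from a 128-byte boundary.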
Modern processors have a hierarchical memory architecture (that is, main memory, cache or scratchpad memory, and registers). The access time to read/write data decreases from main memory to the registers, whereas the size of the available memory increases in the reverse order. Data transfers between different levels of memory take place at aligned address boundaries only. Unaligned memory accesses are broken up by the processor and turned into one or more aligned accesses. As a result, unaligned memory accesses can lead to a significant drop in performance due to wasted memory bandwidth and inefficient memory utilization.
For example, in the case of cache-based processors, data is fetched from the main memory into the caches before it is processed, and the data is written out from the caches to the main memory. While reading, the data is always read from cache-line-aligned addresses. If the data being read fits entirely within one cache line, then a single cache line is fetched irrespective of whether the access is aligned or unaligned. However, if the access is unaligned and the data crosses a cache line boundary, then two cache lines have to be fetched, and the data occupies the space of two cache lines in the cache.
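The read behavior described above can be sketched by counting how many cache lines a single access touches. This is a minimal illustration; the function name and the 64-byte default line size are assumptions, and real hardware behavior depends on the processor.

```python
def cache_lines_touched(addr, size, line_size=64):
    """Number of cache lines covered by an access of `size` bytes at byte address `addr`.

    line_size is an assumed value; size must be at least 1.
    """
    first_line = addr // line_size
    last_line = (addr + size - 1) // line_size
    return last_line - first_line + 1


if __name__ == "__main__":
    print(cache_lines_touched(0, 64))   # aligned, exactly one line
    print(cache_lines_touched(8, 16))   # unaligned, but stays inside one line
    print(cache_lines_touched(8, 64))   # unaligned, crosses a line boundary
```

An access that is one line long but starts 8 bytes past a boundary touches two lines, so twice the data is transferred and twice the cache space is occupied for the same useful payload.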
In the case of write memory accesses, the processor loads the cache line into the cache, updates the required data, and then writes it back at an appropriate time. Thus, for unaligned data, write accesses encounter issues similar to those of read accesses. As such, unaligned memory accesses lead not only to slower memory accesses (the total time required for accessing the required data is equal to the transfer time of two cache lines instead of one) but also to poor memory utilization (the memory space of two cache lines is used for storing the required data).
By way of example, in the case of direct memory access (DMA) based processors such as Cell BE, DMA transfers are used to move data between the local store and main memory. Memory alignment is a critical factor that can impact DMA performance. DMA performance is optimal when both source and destination buffers are 128-byte (one cache line) aligned and the size of the transfer is a multiple of 128 bytes. This involves transfer of full cache lines between main memory and local store. If the source and destination are not 128-byte aligned, then DMA performance is optimal when both have the same quadword offset within a cache line. Transfer of unaligned data may result in the use of DMA lists. Also, DMA performance for unaligned data can be poor compared to aligned data due to the loss of memory bandwidth and the overhead of creating and using DMA lists.
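The alignment conditions stated above can be expressed as a simple classification of a transfer. This is a minimal sketch and not a Cell BE SDK interface: the function names and category labels are hypothetical, a quadword is taken to be 16 bytes, and the "same quadword offset" rule is read as both addresses falling in the same 16-byte slot of their 128-byte lines.

```python
LINE_BYTES = 128      # cache line size stated in the text
QUADWORD_BYTES = 16   # assumed quadword size


def quadword_offset(addr):
    """Index (0..7) of the quadword within its 128-byte line that contains addr."""
    return (addr % LINE_BYTES) // QUADWORD_BYTES


def classify_dma(src, dst, size):
    """Classify a transfer by the alignment rules described in the text (labels are hypothetical)."""
    if src % LINE_BYTES == 0 and dst % LINE_BYTES == 0 and size % LINE_BYTES == 0:
        return "full-lines"          # whole cache lines are transferred
    if quadword_offset(src) == quadword_offset(dst):
        return "matched-offset"      # same quadword offset within a line
    return "mismatched-offset"       # least favorable case for alignment


if __name__ == "__main__":
    print(classify_dma(0x1000, 0x2000, 256))
    print(classify_dma(0x1010, 0x2010, 256))
    print(classify_dma(0x1010, 0x2020, 256))
```

A check of this kind could be applied to each column or row of a block before issuing transfers, to see which ones fall outside the favorable cases.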
Existing blocking techniques lead to significant wastage of memory bandwidth and resources when the matrices are unaligned, thereby disadvantageously impacting the overall performance of memory bandwidth-bound linear algebra operations. For example, in existing blocking approaches, where the adjacent columns/rows of a matrix block are non-contiguous in memory, if memory alignment is not taken into consideration, each column/row in a block becomes unaligned whenever the matrix is unaligned. As unaligned memory accesses are highly inefficient, the memory access performance for such matrices is poor. Also, in most applications, it is difficult to enforce memory alignment restrictions on the input/output matrices. As such, it would be desirable to perform the linear algebra operations in a manner such that the memory transfers of the matrices are performed efficiently even in the unaligned case.