Modern computers have long been utilized to perform large, complex calculations. Recently, the parallel processing capabilities of graphics processing units (GPUs) have been increasingly leveraged to perform computations for extremely large linear systems in a wide variety of computational science applications, such as applications that perform motion calculation, collision detection, 3D graphical rendering, structural analysis, and discretized physics in general, etc. As technology continues to develop, the ability to solve such systems faster, and/or solve increasingly larger systems efficiently is a persistent objective.
There are several computational approaches (e.g., algorithms) which can be applied to solve large linear systems of equations. Generally, these approaches can be classified as either direct or iterative methods. Typically, the time to solution for iterative methods scales at a lower rate than direct methods, so as the problem size increases, iterative methods become increasingly advantageous. The Sparse Matrix-Vector multiplication (SpMV) operation is a central component of iterative methods, often occupying the bulk of the computational cost of the algorithm and determining the overall time to solution.
SpMV is the operation y=Ax, where y and x are dense vectors of length n, and A is a sparse matrix. The SpMV operation is typically computationally light, achieving, for the case of double-precision computation, about 1 flop (floating-point operation) for every 4-6 bytes of data. Consequently, the SpMV operation is what is considered “bandwidth bound”. The performance of the operation is not determined by the speed of the arithmetic, but rather, by the speed at which data can be delivered from main memory to the processing unit(s). Optimizations for bandwidth-bound operations involve caching as much data as possible and caching data as close to the processors as possible to minimize overall communication time. When implemented on an accelerator (e.g., a GPU), with each thread computing a single element (or contribution to a single element) of y, the result y can be efficiently stored in registers, and the input x can be stored in a cache (e.g., due to its frequent re-use in the system, commonly texture cache or similar). However, the size of the data in the sparse matrix A is typically much larger than x or y, depending on the average number of non-zeroes per row in A, and is used only once during the SpMV operation. As such, there is little or no opportunity for caching in these cases.
The performance of SpMV on the GPU-equipped system depends on the bandwidth and size of the ‘device’ memory (also known as ‘GMEM’ or ‘global’ memory, the largest memory space on the GPU, independent of other memory spaces: shared/constant/texture/L1/etc.). The bandwidth of the GPU memory determines how fast elements of the sparse matrix A can be communicated to the computational cores, and strongly influences the overall speed of the SpMV operation. The size of the device memory determines how much of the matrix A can be stored on the GPU. If matrix A is too large to fit entirely on the GPU, then a portion of matrix A remains in the central processing unit (CPU) memory and does not benefit from GPU acceleration, which reduces the speed of the overall SpMV. Even when sparse matrix A can be fully stored in GPU memory, a small portion of sparse matrix A can be stored in host memory such that the total SpMV operation can occur on both the CPU and the GPU simultaneously, and leverage the combined performance of the CPU and GPU.
Linear systems arising from computational physics—used for example, in collision detection, motion, and other applications of discretized physics that are common in graphical processing—typically result in block sparse matrices. A block sparse matrix is a matrix that is defined using smaller matrices, called blocks. Block sparse refers to sparse matrices where the non-zero entries occur, at least partially, as small dense matrices (e.g., typically dense matrices having a size of 3×3 or 6×6), within the sparse matrix (typically of size thousands to millions or larger).
In many systems of interest, the matrix A is either symmetric or partially symmetric such that the numerical entries are mirrored across the diagonal (i.e. a_i,j=a_j,i). For these matrices, it is possible to transfer or store half of the matrix data and still perform the full SpMV operation. The other half of the matrix data may be calculated as the transpose of the half that is stored. While this technique allows significantly larger matrices to be stored, leveraging the symmetry results in very slow performance in all currently known implementations of SpMV on the GPU. This is, at least in part, because a single element of A now contributes (via its transpose) to two output elements of y, and the thread handling that element of A must now atomically update both values. Atomic updates are necessary in this case because a different threads may be updating the same y values simultaneously and atomic operations are required to avoid a race condition. Atomic operations are generally much less performant than their non-atomic counterparts which hurts performance.
When an SpMV is performed on a GPU, a set of threads multiply a dense column segment of x by a portion of the matrix A. Typically threads are assigned to a row of A. For the non-transpose portion of this computation, the calculation is straightforward: each thread updates its component of y. However, when computing the contribution of the transpose of the column segment, it becomes a row segment, and all threads must update the same element of y, hence the need to use atomics, or other strategies.
Consecutive entries in a dense column segment (for a ‘column-major’ matrix storage format) must be read simultaneously by a group of threads in a thread block to maximize device memory performance (which limits SpMV performance) by achieving ‘coalesced’ memory access, and similarly for ‘row-major’ or other standard storage formats. It is very inefficient for threads to randomly access individual words. Consequently, an efficient SpMV must maintain ‘coalesced’ memory access.