Sparse matrices are matrices in which a majority of elements are zero. Operations using such matrices have a variety of applications and are usually the most computationally intensive part of such applications. For example, sparse matrix-vector multiplication (SpMV) and sparse matrix transpose vector multiplication (SpMTV), basic operations in sparse linear algebra (SLA), are used for performing ranking algorithms, such as the PageRank® algorithm used by Google®, Inc. to rank webpages when providing search results. SpMV and SpMTV are the most computationally intensive part of such applications, and the speed with which the matrices can be used is limited by the speed of SpMV and SpMTV.
While attempts have been made to improve the speed of sparse matrix processing, such efforts still leave significant room for improvement. For example, to increase speed, matrices have been encoded in compressed formats, which include multiple arrays of information about the values and positions in the matrix of the non-zero entries and omit information about the zero entries. For instance, a compressed sparse row format includes an array with the values of the non-zero entries, an array with the columns in which the non-zero entries are located, and an array holding the index in the first array of the first non-zero entry in each row. A compressed sparse column format includes similar arrays. Such arrays are best stored in a cache of the processor performing the computations to allow fast access to the array data. However, in the case of larger matrices, even the compressed format arrays may not fit into the cache, requiring the processor to access different arrays representing the matrix in main memory to perform a single step of the computation. In such an arrangement, modern computer processors, including central processing units (CPUs) and graphics processing units (GPUs), are likely to experience cache misses during the computation, that is, failures by the processor to retrieve required data from the cache. To finish the computation after a cache miss, the processor must retrieve the missing data from the main memory, which can be much slower.
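As an illustration of the compressed sparse row encoding described above, the following is a minimal Python sketch rather than any particular implementation. The function names (dense_to_csr, spmv, spmtv) and array names (values, col_idx, row_ptr) are hypothetical, and the sketch assumes the common convention of appending one trailing entry to the row-index array so that the end of the last row can be found without a special case.

```python
def dense_to_csr(dense):
    """Encode a dense matrix (list of equal-length rows) as three CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)    # value of the non-zero entry
                col_idx.append(j)   # column in which the entry is located
        # Assumed convention: row_ptr[i] is the index in `values` of the first
        # non-zero entry of row i, and a trailing entry marks the end.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def spmv(values, col_idx, row_ptr, x):
    """Compute y = A x for a CSR-encoded matrix A."""
    n_rows = len(row_ptr) - 1
    y = [0] * n_rows
    for i in range(n_rows):
        # Each step reads from values, col_idx, and x: three separate arrays.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y


def spmtv(values, col_idx, row_ptr, x, n_cols):
    """Compute y = A^T x for a CSR-encoded matrix A with n_cols columns."""
    y = [0] * n_cols
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[col_idx[k]] += values[k] * x[i]
    return y


A = [[5, 0, 0, 0],
     [0, 8, 0, 6],
     [0, 0, 3, 0]]
values, col_idx, row_ptr = dense_to_csr(A)
print(values, col_idx, row_ptr)   # [5, 8, 6, 3] [0, 1, 3, 2] [0, 1, 3, 4]
print(spmv(values, col_idx, row_ptr, [1, 2, 3, 4]))      # [5, 40, 9]
print(spmtv(values, col_idx, row_ptr, [1, 2, 3], 4))     # [5, 16, 9, 12]
```

The inner loop of spmv touches values, col_idx, and the input vector on every step, which is why a matrix whose arrays do not fit in the cache can force repeated trips to main memory.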
Further, additional concerns are present when GPUs are used to perform matrix computations such as SpMV and SpMTV. GPUs are better designed and optimized for dense computations, such as the processing of dense matrices, that is, matrices in which most elements are non-zero entries. Such hardware commonly runs a single kernel function for processing matrix data. As a result, the hardware cannot respond to the wide variation in the number of non-zero entries in different portions of the matrix, such as in different rows or columns. For example, kernels that assign a single thread to process a single row or column of the matrix can suffer from load imbalance, with the total processing time depending on the thread assigned to process the densest row or column. On the other hand, kernels that assign multiple threads to process a single row or column suffer from a waste of hardware resources when the number of non-zero entries in the row or column is less than the number of assigned threads, with some of the assigned threads not being involved in the processing.
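The two failure modes of a single fixed kernel can be made concrete with a rough cost model. The Python sketch below is not a GPU simulation: it assumes each thread handles one non-zero entry per step, ignores memory effects entirely, and uses hypothetical function names. Its input is a row-index array of the kind used in compressed sparse row storage, with one trailing entry marking the end of the last row.

```python
def row_lengths(row_ptr):
    """Number of non-zero entries in each row, from a CSR row-index array."""
    return [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]


def steps_one_thread_per_row(row_ptr):
    """Steps until the slowest thread finishes when each row gets one thread."""
    return max(row_lengths(row_ptr), default=0)


def idle_slots_fixed_threads_per_row(row_ptr, threads_per_row):
    """Thread slots that do no work when every row gets threads_per_row threads."""
    idle = 0
    for n in row_lengths(row_ptr):
        # A row needs ceil(n / threads_per_row) passes; assume the threads are
        # launched for at least one pass even if the row is empty.
        passes = max(1, -(-n // threads_per_row))
        idle += passes * threads_per_row - n
    return idle


# Hypothetical matrix with three short rows and one dense row of 32 entries.
row_ptr = [0, 1, 3, 4, 36]
print(row_lengths(row_ptr))                            # [1, 2, 1, 32]
print(steps_one_thread_per_row(row_ptr))               # 32
print(idle_slots_fixed_threads_per_row(row_ptr, 32))   # 92
print(idle_slots_fixed_threads_per_row(row_ptr, 4))    # 8
```

Under these assumptions, one thread per row takes 32 steps although the average row holds only 9 entries, while 32 threads per row leave 92 of 128 thread slots idle. Neither fixed assignment suits both the short rows and the dense row.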
Therefore, there is a need to represent a sparse matrix in a way that decreases the likelihood of cache misses and allows for responding to the variation in the number of non-zero entries in different parts of the matrix.