High performance computing (HPC) on sparse data structures such as graphs and sparse matrices is becoming increasingly important in a wide array of fields including, for example, machine learning, computational science, physical model simulation, web searching, and knowledge discovery. Traditional high performance computing applications generally involve regular and dense data structures; however, sparse computation has some unique challenges. For example, sparse computation typically has considerably lower compute intensity than dense computation and, therefore, its performance is often limited by memory bandwidth. Additionally, memory access patterns and the amount of parallelism vary widely depending, for example, on the specific sparsity pattern of the input data, which complicates optimization as certain optimization information is often unknown a priori.
To address these challenges, a system may modify the input data set to improve data locality. For example, a system may employ reordering, which permutes rows and/or columns of a matrix in order to cluster non-zero entries near one another. As shown in FIGS. 1A-B, the system may reorder a sparse matrix 100 to generate a banded matrix 102 in which the non-zero entries 104 are clustered near one another. By doing so, the system increases the chance that a particular memory read brings in more non-zero entries (i.e., spatial locality) and may obtain more reuse out of cache (i.e., temporal locality) than without reordering. Various reordering algorithms have been developed and implemented including, for example, Breadth First Search (BFS), Reverse Cuthill-McKee (RCM), Self-Avoiding Walk (SAW), METIS Partitioner, and King's algorithms. In particular, BFS and its more refined version, RCM, are frequently used to optimize for cache locality in sparse matrix vector multiplication (SpMV) due to their lower complexity and greater efficiency.
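By way of illustration only, the effect of a BFS-based reordering on a small symmetric sparsity pattern can be sketched as follows. The function names (`bandwidth`, `bfs_order`, `permute`), the eight-row example pattern, and the choice to visit low-degree vertices first are assumptions made for this sketch and are not taken from any particular embodiment; reversing the returned order would give an RCM-style ordering.

```python
from collections import deque

def bandwidth(entries):
    """Largest |i - j| over the non-zero positions (i, j)."""
    return max((abs(i - j) for i, j in entries), default=0)

def bfs_order(n, entries):
    """Return a BFS visiting order of the n rows, treating the
    sparsity pattern as an undirected graph (hypothetical helper)."""
    adj = [set() for _ in range(n)]
    for i, j in entries:
        if i != j:
            adj[i].add(j)
            adj[j].add(i)
    visited = [False] * n
    order = []
    # Assumption: start each connected component at a low-degree vertex
    # and visit neighbors in order of increasing degree.
    for start in sorted(range(n), key=lambda v: (len(adj[v]), v)):
        if visited[start]:
            continue
        visited[start] = True
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in sorted(adj[v], key=lambda u: (len(adj[u]), u)):
                if not visited[w]:
                    visited[w] = True
                    queue.append(w)
    return order

def permute(entries, order):
    """Apply the same permutation to rows and columns."""
    new_index = {old: new for new, old in enumerate(order)}
    return [(new_index[i], new_index[j]) for i, j in entries]

# Example pattern: a chain of 8 rows whose labels are scrambled, so the
# non-zero entries are scattered far from the diagonal.
chain = [0, 5, 2, 7, 1, 6, 3, 4]
entries = [(v, v) for v in range(8)]
for a, b in zip(chain, chain[1:]):
    entries += [(a, b), (b, a)]

order = bfs_order(8, entries)
reordered = permute(entries, order)
print(bandwidth(entries))    # 6
print(bandwidth(reordered))  # 1
```

In this example the reordering moves every off-diagonal non-zero entry next to the diagonal, which is the clustering effect depicted for the banded matrix 102. Real sparsity patterns generally will not reduce this cleanly, and the achievable bandwidth depends on the structure of the input.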