The present disclosure relates generally to computer architecture and more particularly, to computer architecture optimized for sparse matrix vector multiplication processing.
Sparse Matrix Vector Multiplication (SpMV) is the computation of y=Ax, where y and x are vectors and A is a large matrix populated mostly with zero entries. SpMV is frequently employed in scientific and engineering applications and is the kernel for iterative linear system solvers such as the conjugate gradient method.
Due to the sparseness of matrices used in many scientific and engineering applications, it is often neither practical nor feasible to store every entry of the matrix in a traditional dense representation, so compressed sparse representations such as compressed sparse row (CSR) format are often used to represent the matrices in memory. The CSR format stores the non-zero elements in an array called val, the corresponding column numbers in an array called col, and the array indices of the first entry of each row in an array called ptr.
For example, the second non-zero value in row 4 is val[ptr[4]+1] and its corresponding column number is col[ptr[4]+1]. Multiplying a CSR matrix by a vector stored in an array called vec requires a row-wise multiply accumulate (MAC) operation for each matrix row, sum=sum+val[i]*vec[col[i]], where i iterates over the non-zero entries of the row. As these examples show, CSR computations fundamentally require indirect addressing, which cannot be expressed in an affine loop and is therefore difficult to automatically optimize for SIMD and vector processors.
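The CSR layout and the row-wise MAC described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name csr_spmv and the small 3x4 matrix are hypothetical, and the sketch assumes that ptr carries one extra trailing entry equal to the number of non-zero values, a common convention that is not stated above.

```python
def csr_spmv(val, col, ptr, vec):
    """Multiply a CSR matrix by a dense vector, one row at a time."""
    n_rows = len(ptr) - 1  # assumption: ptr ends with a sentinel equal to len(val)
    y = [0.0] * n_rows
    for row in range(n_rows):
        total = 0.0
        # i walks over the non-zero entries that belong to this row
        for i in range(ptr[row], ptr[row + 1]):
            # vec is indexed through col[i]: this is the indirect addressing
            total = total + val[i] * vec[col[i]]
        y[row] = total
    return y


# Hypothetical 3x4 matrix:
#   [[5, 0, 0, 2],
#    [0, 0, 3, 0],
#    [0, 4, 0, 1]]
val = [5.0, 2.0, 3.0, 4.0, 1.0]
col = [0, 3, 2, 1, 3]
ptr = [0, 2, 3, 5]
vec = [1.0, 2.0, 3.0, 4.0]

y = csr_spmv(val, col, ptr, vec)
second_in_row_2 = (val[ptr[2] + 1], col[ptr[2] + 1])  # second non-zero of row 2
```

The inner loop bound depends on ptr and the vector index depends on col, which is why the loop is not affine.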
In addition, SpMV architectures need to perform only two floating-point operations for each matrix value, yielding a computation/communication ratio of, at best, only two floating-point operations (FLOPs) per 12 bytes read, assuming a 64-bit value and a 32-bit column number, and this does not include input vector data. As such, performance is highly dependent on memory bandwidth. Achieving high memory bandwidth for long-latency DRAM-based memories often requires that consecutive input values be read in overlapping outstanding requests from consecutive locations in order to take advantage of banked memory. This access pattern is often referred to as streaming or coalesced memory access.
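The bandwidth dependence follows directly from that ratio. The short sketch below turns it into an upper bound on throughput; the function name and the 12 GB/s bandwidth figure are hypothetical values chosen only to make the arithmetic easy to follow, and input vector traffic is ignored, as in the estimate above.

```python
def spmv_flops_upper_bound(bandwidth_bytes_per_s, value_bytes=8, index_bytes=4):
    """Upper bound on FLOP/s when every non-zero must be streamed from memory."""
    flops_per_nonzero = 2                         # one multiply and one add
    bytes_per_nonzero = value_bytes + index_bytes  # 64-bit value + 32-bit column number
    return bandwidth_bytes_per_s * flops_per_nonzero / bytes_per_nonzero


# Hypothetical memory system delivering 12 GB/s of matrix data
bound = spmv_flops_upper_bound(12e9)  # 2e9 FLOP/s, i.e. one FLOP per 6 bytes
```

Under these assumptions no amount of additional arithmetic hardware raises the bound; only more delivered bandwidth does.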
Since CSR stores values in row-major order in consecutive memory locations, a third challenge for achieving high performance for SpMV comes from the need to accumulate values that are delivered to a deeply pipelined adder in consecutive clock cycles. This “streaming reduction operation” is often a design challenge in SpMV due to the deeply pipelined nature of floating-point adders. In other words, a data hazard exists because subsequent additions on serialized products cannot be performed until the previous addition has completed. In order to overcome this data hazard, either data scheduling or architectural methods must be employed.
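The cost of this hazard can be illustrated with a simple cycle-count model of a feedback adder that has to wait for each partial sum before issuing the next addition. The model and the 14-cycle latency are hypothetical and stand in for whatever pipeline depth a particular floating-point adder has.

```python
def naive_accumulate_cycles(n_values, adder_latency):
    """Cycles needed to sum n_values of one row when every addition
    must wait for the previous result to leave the adder pipeline."""
    cycle = 0     # earliest cycle at which the next input is available
    ready_at = 0  # cycle at which the running sum becomes available
    for _ in range(n_values):
        issue = max(cycle, ready_at)      # stall until the running sum is ready
        ready_at = issue + adder_latency  # result emerges adder_latency cycles later
        cycle = issue + 1
    return ready_at


# Hypothetical case: 8 products from one row, 14-stage adder
stalled = naive_accumulate_cycles(8, 14)  # 112 cycles
utilization = 8 / stalled                 # the adder is busy 1 cycle in 14
```

In this model the adder accepts a new input only once per pipeline latency, which is the utilization loss that scheduling or architectural techniques aim to recover.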
Due to the indirect addressing and streaming reduction challenges, previous implementations of SpMV, both in special-purpose hardware and software, often suffer from low hardware utilization.
Prior approaches to designing efficient SpMV architectures either assume that a copy of the entire input vector can be stored on chip for each multiplier or use blocking techniques to perform the SpMV over multiple passes of the input matrix. However, in most cases, a critical aspect of each specific SpMV implementation is the approach taken in designing the floating-point accumulator.
Historically, there have been two basic approaches for designing high-performance double precision accumulators. The first approach is to statically schedule the input data in order to interleave values and partial sums from different rows such that consecutive values belonging to each row are delivered to the accumulator—which is designed as a simple feedback adder—at a period corresponding to the pipeline latency of the adder. This still allows the adder to accept a new value every clock cycle while avoiding the accumulation data hazard between values in the same accumulation set (matrix row). Unfortunately, this method requires a large up-front cost in scheduling input data and is not practical for large data sets.
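The interleaving idea can be sketched as follows. This is a hypothetical scheduling scheme written for illustration, not the scheduler of any cited work: rows are taken in groups of adder_latency, and values are issued round-robin so that consecutive values of the same row are exactly adder_latency slots apart. A slot with no value available is marked None.

```python
def interleave_rows(rows, adder_latency):
    """Return a list of (row_id, value) slots; None marks an idle slot."""
    schedule = []
    for start in range(0, len(rows), adder_latency):
        group = rows[start:start + adder_latency]
        longest = max(len(r) for r in group)
        for k in range(longest):
            for lane in range(adder_latency):
                if lane < len(group) and k < len(group[lane]):
                    schedule.append((start + lane, group[lane][k]))
                else:
                    schedule.append(None)  # bubble: this lane has run out of values
    return schedule


# Hypothetical rows of products with uneven lengths, 3-stage adder
schedule = interleave_rows([[1, 2, 3], [4], [5, 6]], 3)
busy = sum(slot is not None for slot in schedule)  # 6 useful slots out of 9
```

Rows of unequal length leave idle slots in the schedule, which gives an intuition for why the performance of statically scheduled designs depends on the structure of the matrix.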
An early example of the first approach was the work of deLorimier and DeHon, described in M. deLorimier, A. DeHon, “Floating-point sparse matrix multiply for FPGAs,” Proc. 13th ACM/SIGDA Symposium on Field-Programmable Gate Arrays (FPGA 2005). Their scheduling technique makes the architecture's performance highly dependent on the structure of the matrix, although on average they were able to achieve 66% of the peak performance in their simulation-based studies.
The second approach is to use a dynamic reduction technique in which a controller selects each input or partial sum to send into the adder, thereby managing the progress of each active accumulation set at run time. Dynamic reduction approaches can be divided into two types depending on whether they use a single adder or multiple adders.
An early example using the dynamic reduction technique was from Prasanna's group at the University of Southern California, as described in L. Zhuo, V. K. Prasanna, “Sparse Matrix-Vector Multiplication on FPGAs,” Proc. 13th ACM/SIGDA Symposium on Field-Programmable Gate Arrays (FPGA 2005). In early variations of this technique, a linear array of adders was used to create a flattened binary adder tree, where each adder in the array was utilized at half the rate of the previous adder in the array. This required multiple adders with exponentially decreasing utilization, had a fixed maximum set size, and needed to be flushed between matrix rows.
The implementation from UT-Knoxville and Oak Ridge National Laboratory, described in J. Sun, G. Peterson, O. Storaasli, “Sparse Matrix-Vector Multiplication Design for FPGAs,” Proc. 15th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2007), used a similar approach but with a parallel, as opposed to a linear, array of n adders, where n was the adder depth. This implementation striped each consecutive input across each adder in turn, achieving a fixed utilization of 1/n for each adder.
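The striping scheme can be modeled in software as n independent partial sums, with input i routed to adder i mod n. The sketch below is a functional model only: the function name is hypothetical, and the final step that combines the n partial sums is an assumption, since the description above does not say how that combination is performed.

```python
def striped_accumulate(values, n_adders):
    """Stripe consecutive inputs across n_adders partial sums."""
    partial = [0.0] * n_adders
    for i, v in enumerate(values):
        # adder (i mod n) receives one input every n cycles, so each adder
        # has n cycles to finish before its next input arrives
        partial[i % n_adders] += v
    # assumption: the partial sums are combined in a separate final step
    return sum(partial), partial


total, partial = striped_accumulate([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], 4)
# partial holds one running sum per adder; total is their combined sum
```

Because each adder sees only every n-th input, the data hazard disappears, at the cost of n adders that are each busy 1/n of the time.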
Prasanna's group later developed two improved reduction circuits, called the double-strided adder (DSA) and single-strided adder (SSA), that solved many of the problems of Prasanna's earlier accumulator design. L. Zhuo, V. K. Prasanna, “High-Performance Reduction Circuits Using Deeply Pipelined Operators on FPGAs,” IEEE Trans. Parallel and Dist. Sys., Vol. 18, No. 10, October 2007. These new architectures required only two adders and one adder, respectively. In addition, they did not limit the maximum number of values that can be accumulated and did not need to be flushed between data sets. However, these designs did require a relatively large amount of buffer memory and extremely complex control logic.
An improved single-adder streaming reduction architecture was later developed at the University of Twente. M. Gerards, “Streaming Reduction Circuit for Sparse Matrix Vector Multiplication in FPGAs,” Master Thesis, University of Twente, The Netherlands, Aug. 15, 2008. This design requires less memory and less complex control than Prasanna's SSA design.
Finally, in each of the above-discussed works, pre-made adders (usually generated with Xilinx Core Generator) have been used as the core of the accumulator. Another approach is to modify the adder itself such that the de-normalization and significand addition steps have a single-cycle latency, which makes it possible to use the adder as a feedback accumulator without scheduling. To minimize the latency of the de-normalization step, which includes an exponent comparison and a shift of one of the significands, both inputs are base-converted to reduce the width of the exponent while increasing the width of the mantissa. This reduces the latency of the de-normalization while increasing the adder width. Since wide adders can be achieved cheaply with DSP48 components, these steps can sometimes be performed in one cycle. This technique is best suited for single-precision operands but can be extended to double precision as well. However, in general this approach requires an unacceptably long clock period.
Thus, a need exists for an SpMV architecture that overcomes the above-mentioned disadvantages. A new streaming reduction technique that requires substantially less memory and simpler control logic would be particularly useful.