There is a fundamental problem in the design of computing systems, namely that of minimising the time cost of memory accesses.
This is a fundamental limitation on the design of computer systems because, no matter what memory technology is used to support computation and no matter what technology is used to connect that memory to the processor, there is a maximum limit on how much information can be transferred between processor and memory in a given time. This limit is the available memory bandwidth, and the limitation of computing power by available memory bandwidth is often referred to as the “memory wall”.
It is known to employ data compression to reduce the effects of the “memory wall”. However, a problem for programmers using compressed memory sub-systems is that data has to be decompressed before it can be operated upon, as shown in the system of FIG. 1. This usually involves reading the compressed data from one part of memory into the register files 14 of the processor 16, decompressing it using program code retrieved from program memory 18, and storing the decompressed data in another, uncompressed portion of memory 12.
However, this solution has the disadvantage that additional memory bandwidth is required to read the compressed data, store it in uncompressed form, and read it back into the processor to be operated upon. Additional memory capacity is also required to hold the uncompressed data, and the decompression process increases pressure on the processor's register files. Clearly this is a sub-optimal solution which, it is suggested, explains why such compressed memory sub-systems have remained an academic curiosity rather than entering the mainstream microprocessor industry.
Register-blocking is a useful technique for accelerating matrix algebra (particularly Finite-Element). However, it has the disadvantage that for many matrices (e.g. those used in search engines such as GOOGLE™) zero fill has to be added, decreasing effective FLOPS and increasing memory bandwidth requirements. Both compute throughput and memory bandwidth are commodities in short supply in modern computing systems.
In fact, the growing gap between processing capabilities and memory bandwidth, which are increasing at highly disparate rates of 50% and 7% per annum respectively, is referred to, as mentioned above, as the “memory wall”. There have been many claims of “breaking” the memory wall, and they usually consist of using a cache to reduce the probability of having to go off-chip and/or using multi-threading so that the latency and penalties associated with going off-chip can be mitigated.
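The effect of the two growth rates quoted above can be illustrated with a short calculation. This is only a sketch: it assumes both rates compound annually and stay constant, and the names `PROCESSING_GROWTH`, `BANDWIDTH_GROWTH` and `gap_after` are hypothetical.

```python
# Annual growth rates quoted in the text; constant compounding is an assumption.
PROCESSING_GROWTH = 0.50  # processing capability, per annum
BANDWIDTH_GROWTH = 0.07   # memory bandwidth, per annum


def gap_after(years):
    """Ratio of processing capability to memory bandwidth, relative to year 0."""
    return ((1 + PROCESSING_GROWTH) / (1 + BANDWIDTH_GROWTH)) ** years


for years in (1, 5, 10):
    print(f"after {years:2d} years the gap has grown by a factor of {gap_after(years):.1f}")
```

Under these assumptions the gap widens by a factor of roughly 1.4 every year, so a balance between compute and bandwidth that holds today is lost within a few processor generations.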
These approaches merely hide the problem of limited external memory bandwidth rather than solving it, and generally rely on the data-set exhibiting sufficient data locality and the program exhibiting sufficient Thread-Level Parallelism (TLP) in order to be effective at all. In fact, many of the larger and more interesting problems exhibit neither sufficient data locality nor sufficient TLP, and the throughput of the whole system degenerates to the point where it is limited by external memory bandwidth and the extra hardware which has been added on-chip is of no use. For this reason it is not uncommon to see large engineering applications pulling down processor performance to 1% or less of the manufacturer's quoted peak performance specification.
State of the art methods for computing Sparse-Matrix Vector Products (SMVM) have improved little over the past few decades, and performance improvements have been driven largely by advances in processor and semiconductor process technology. In general, SMVM has had little if any influence on the design of mainstream microprocessors, despite the obvious problems in terms of scaling I/O bandwidth performance, particularly where Chip Multi-Processors (CMPs) exacerbate problems by contending for increasingly scarce I/O bandwidth. A sizeable number of the entries in typical blocked sparse matrices consist of zero fill. These values, even if they do not contribute to the result of an SMVM, are nonetheless fetched from memory and multiplied, with all of the attendant problems in terms of power dissipation and system throughput.
FIG. 2 is an exemplary illustration of a state of the art Block Compressed Sparse Row (BCSR) data-structure, which consists of three arrays: a row (row_start) array holding the row entries containing non-zero tiles, a col (col_idx) array containing the column addresses of the non-zero tiles, and a val (value) array containing the actual non-zero entries (with fill) for all of the non-zeroes in the sparse matrix, arranged in tile-by-tile order. If an A-matrix entry is zero, a processor will unnecessarily perform computations using zero values, leading to unnecessary consumption of bandwidth and power.
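A minimal sketch of an SMVM over such a three-array structure may make the cost of zero fill concrete. It is not taken from FIG. 2. It assumes the conventional BCSR layout in which `row_start[i]` to `row_start[i + 1]` index the tiles of block-row `i` and each tile is stored row-major in `val`. The function name `bcsr_matvec`, the tile dimensions `r` and `c`, and the 4x4 example matrix are all hypothetical.

```python
def bcsr_matvec(row_start, col_idx, val, x, r, c):
    """Multiply a BCSR matrix by a dense vector x.

    Assumed layout: tiles of block-row i are numbered row_start[i] up to
    row_start[i + 1]; col_idx[k] is the block-column of tile k; val holds
    the r*c entries of each tile in row-major order, zero fill included.
    Returns the result vector and the number of multiplications performed.
    """
    n_block_rows = len(row_start) - 1
    y = [0.0] * (n_block_rows * r)
    multiplies = 0
    for bi in range(n_block_rows):
        for k in range(row_start[bi], row_start[bi + 1]):
            bj = col_idx[k]
            tile = val[k * r * c:(k + 1) * r * c]
            for i in range(r):
                for j in range(c):
                    # Fill zeros are fetched and multiplied like any other entry.
                    y[bi * r + i] += tile[i * c + j] * x[bj * c + j]
                    multiplies += 1
    return y, multiplies


# Hypothetical 4x4 matrix stored as 2x2 tiles (the all-zero tile is not stored):
#   [1 0 0 0]
#   [0 2 0 0]
#   [0 0 3 4]
#   [5 0 0 6]
row_start = [0, 1, 3]
col_idx = [0, 0, 1]
val = [1, 0, 0, 2,   0, 0, 5, 0,   3, 4, 0, 6]
x = [1, 2, 3, 4]

y, multiplies = bcsr_matvec(row_start, col_idx, val, x, 2, 2)
nonzeros = sum(1 for v in val if v != 0)
print(y, multiplies, nonzeros)
```

In this small example only 6 of the 12 stored values are non-zero, so half of the values fetched and half of the multiplications performed contribute nothing to the result.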
Many of the computations performed by processors consist of a large number of simple operations. As a result, a multiplication operation may take a significant number of clock cycles to complete. Whilst this operation is justified for complex calculations, the same cannot be said of trivial operations, for example multiplication of one number by 0, +1, or −1, where the answer may be obtained in a much simpler fashion.
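The idea of obtaining a trivial product without a full multiplication can be sketched as follows. This is an illustration only, not a description of any of the cited arrangements. The function name `multiply` and the returned flag are hypothetical, and the sketch ignores IEEE 754 corner cases (an infinite or NaN operand multiplied by zero, and signed zeros).

```python
def multiply(a, b):
    """Return (a * b, used_multiplier).

    When b is 0, +1 or -1 the product is produced by a constant, a copy or
    a sign flip, and the (notionally expensive) multiplier is bypassed.
    Assumption: a is finite, so a * 0 can safely be replaced by 0.
    """
    if b == 0:
        return 0.0, False
    if b == 1:
        return a, False
    if b == -1:
        return -a, False
    return a * b, True


operands = [0.0, 1.0, -1.0, 2.5, 0.0, 1.0]
results = [multiply(3.0, b) for b in operands]
bypassed = sum(1 for _, used in results if not used)
print(f"{bypassed} of {len(operands)} multiplications were trivial")
```

Note that the check is applied after the operand has already been loaded, so a scheme of this kind saves multiplier cycles but not the memory traffic needed to fetch the trivial value.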
JP 60247782 discloses an arrangement in which a sparse matrix is loaded and then examined to identify trivial values within the matrix. This approach, however, does not address the limitation of having to load the complete matrix from memory. JP 61025275 discloses a processor which interrogates values within a matrix to reduce the time required for a matrix operation. Similarly, JP 58022446 discloses a processor in which arithmetic operations are avoided depending on values contained within a register. JP 58109971 examines values within a register to reduce the overall computation time within a pipeline processor architecture when an intermediate value generated during a computation is a trivial value. Similarly, GB 1479404 discloses an arrangement in which data values within a matrix are examined to determine if they contain trivial values, and where this determination is used in the performance of a computation. All of these approaches still involve loading the complete matrices from memory.
In certain applications involving sparse matrices, the number of trivial operations carried out can be very significant owing to the presence of a large number of zeroes. The number of zeroes in a sparse matrix can be reduced or eliminated by storing the matrix in a sparse format such as Compressed Row Storage (CRS) format. However, due to the overheads in terms of address generation, such storage formats often result in very poor performance on commercial computer systems.
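The address-generation overhead of CRS can be seen in a minimal sketch of a matrix-vector product over that format. The array names `row_ptr`, `col_idx` and `val` follow common usage and are assumptions, as are the function name `crs_matvec` and the 4x4 example matrix.

```python
def crs_matvec(row_ptr, col_idx, val, x):
    """Multiply a CRS matrix by a dense vector x.

    Assumed layout: the non-zeroes of row i are val[row_ptr[i]:row_ptr[i + 1]],
    and col_idx[k] is the column of val[k]. No zero entries are stored.
    """
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Every multiply needs an extra index load (col_idx[k]) followed by
            # a data-dependent, potentially non-contiguous load of x.
            y[i] += val[k] * x[col_idx[k]]
    return y


# Hypothetical 4x4 matrix with 6 non-zeroes:
#   [1 0 0 0]
#   [0 2 0 0]
#   [0 0 3 4]
#   [5 0 0 6]
row_ptr = [0, 1, 2, 4, 6]
col_idx = [0, 1, 2, 3, 0, 3]
val = [1, 2, 3, 4, 5, 6]
x = [1, 2, 3, 4]

print(crs_matvec(row_ptr, col_idx, val, x))
```

No zero is fetched or multiplied here, but each useful multiplication is accompanied by an index fetch and an indirect access into `x`, which is the kind of overhead the paragraph above refers to.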
There is therefore a need for a solution which addresses at least some of the drawbacks of the prior art.