There is a fundamental problem in the design of computing systems, namely that of minimizing the time cost of memory accesses.
This is a fundamental limitation because, whatever memory technology is used to support computation and whatever technology is used to connect that memory to the processor, there is a maximum amount of information that can be transferred between processor and memory in a given time. This maximum is the available memory bandwidth, and the limitation of computing power by available memory bandwidth is often referred to as the “memory wall”.
The present application seeks to increase the effective memory bandwidth, and thus to reduce the limitation imposed by the “memory wall”, through the use of data compression.
It is known to employ data compression to reduce the effects of the “memory wall”. However, a problem for programmers using compressed memory sub-systems is that data has to be decompressed before it can be operated upon, as shown in the system of FIG. 1. This usually involves reading the compressed data from one part of memory 10 into the register files of the processor 16, decompressing it using program code retrieved from program memory 18, and storing the decompressed data in another, uncompressed portion of memory 12.
However, this solution has the disadvantage that additional memory bandwidth is required to read the compressed data, store it in uncompressed form, and read it back into the processor to be operated upon. Additional memory capacity is also required to hold the uncompressed data, and the decompression process increases pressure on the processor's register files. Clearly this is a sub-optimal solution, which, it is suggested, explains why such compressed memory sub-systems have remained an academic curiosity rather than entering the mainstream microprocessor industry.
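The bandwidth overhead described above can be illustrated with a short tally. This is a minimal sketch only: the function name and the byte counts are hypothetical, and it simply counts the three transfers named in the preceding paragraph.

```python
def staged_decompression_traffic(compressed_bytes, uncompressed_bytes):
    """Bytes moved between processor and memory when data is decompressed
    into a separate uncompressed region before being operated upon."""
    read_compressed = compressed_bytes        # compressed data into the register files
    write_uncompressed = uncompressed_bytes   # decompressed data out to uncompressed memory
    read_uncompressed = uncompressed_bytes    # uncompressed data back in to be operated upon
    return read_compressed + write_uncompressed + read_uncompressed


# Hypothetical sizes: 1000 bytes of data held in 500 bytes of compressed storage.
staged = staged_decompression_traffic(500, 1000)
direct = 1000  # traffic if the data were stored uncompressed and read once
print(staged, direct)
```

Under these assumed sizes the staged scheme moves more data than storing the data uncompressed in the first place, which is the disadvantage noted above.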
EP-0240032-A2 discloses a vector processor comprising a memory for storing and retrieving vector data. The vector processor comprises a plurality of vector registers, each capable of reading or writing plural (m) vector elements in parallel; at least one mask vector register capable of handling m mask bits in parallel; and a transfer portion connected to the memory, the plurality of vector registers and the mask vector register. The transfer portion is responsive to an instruction for transferring vector elements from regularly spaced address locations within the memory to selected storage locations of a selected vector register corresponding to valid mask bits. Whilst this approach is useful, it is limited in that the storage/retrieval of vector data is limited to an entire register.
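The masked transfer described above can be sketched in software as follows. This is an illustrative model only and is not taken from EP-0240032-A2: the function name, the list-based memory, and the assumption that successive regularly spaced elements fill the valid mask positions in order are all hypothetical.

```python
def masked_strided_load(memory, base, stride, mask, register):
    """Copy elements from regularly spaced addresses into the register
    positions whose mask bit is valid; other positions are left unchanged.

    Assumption: each valid mask bit consumes the next strided address.
    """
    if len(mask) != len(register):
        raise ValueError("mask and register must have the same length")
    result = list(register)
    address = base
    for position, bit in enumerate(mask):
        if bit:
            result[position] = memory[address]
            address += stride  # advance to the next regularly spaced location
    return result


memory = [10, 11, 12, 13, 14, 15, 16, 17]
loaded = masked_strided_load(memory, base=1, stride=2,
                             mask=[1, 0, 1, 1], register=[0, 0, 0, 0])
print(loaded)  # [11, 0, 13, 15]
```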
Register-blocking is a useful technique for accelerating matrix algebra (particularly Finite-Element). However, it has the disadvantage that for many matrices (e.g. those used in search engines such as GOOGLE®) zero fill has to be added, decreasing effective FLOPS (Floating Point Operations Per Second) and increasing memory bandwidth requirements, both of which are commodities in short supply in modern computing systems. In fact, the growing gap between processing capabilities and memory bandwidth, which are increasing at the highly disparate rates of 50% and 7% per annum respectively, is what is referred to, as mentioned above, as the “memory wall”. There have been many claims of “breaking” the memory wall. They usually consist of using a cache to reduce the probability of having to go off-chip, and/or using multi-threading so that the latency and penalties associated with going off-chip can be mitigated.
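The cost of zero fill under register-blocking can be illustrated by counting how many values a blocked format has to store for a given sparsity pattern. This is a minimal sketch under stated assumptions: the function name, the 2×2 block size, and the example coordinates are hypothetical, and every block containing at least one non-zero is assumed to be stored in full.

```python
def zero_fill_stats(nonzeros, block_rows, block_cols):
    """Return (values stored, zeros added as fill) when every block that
    contains at least one non-zero entry is stored as a dense block."""
    # Identify each occupied block by its block-row and block-column index.
    occupied = {(r // block_rows, c // block_cols) for r, c in nonzeros}
    stored = len(occupied) * block_rows * block_cols
    useful = len(set(nonzeros))
    return stored, stored - useful


# Hypothetical pattern: four non-zeros scattered over an 8x8 matrix.
coords = [(0, 0), (1, 1), (2, 5), (7, 7)]
stored, fill = zero_fill_stats(coords, 2, 2)
print(stored, fill)  # 12 8
```

In this assumed pattern two thirds of the stored values are fill, so both the arithmetic performed on them and the bandwidth used to fetch them are wasted.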
These approaches merely hide the problem of limited external memory bandwidth rather than solving it. To be effective at all, they generally rely on the data-set exhibiting sufficient data locality and/or the program exhibiting sufficient Thread-Level Parallelism (TLP). This may not be true of all problems, and is certainly not always known a priori. In fact, many of the larger and more interesting problems exhibit neither sufficient data locality nor sufficient TLP, and the throughput of the whole system degenerates to the point where it is limited by external memory bandwidth and the extra hardware which has been added on-chip is of no use. For this reason it is not uncommon to see large engineering applications pulling down processor performance to 1% or less of the manufacturer's quoted peak performance specification.