In existing systems, the performance of high-speed computers is limited by memory latency, as memory access time may be much greater than processor cycle time. Memory and processor latencies are diverging at an exponential rate, creating the infamous memory-processor gap. This widening gap may become a performance bottleneck, severely limiting the maximum theoretical performance of current and future computer systems. Moreover, current mainstream multi-core processors increase memory contention, further worsening the effective per-core memory latency.
In particular, memory-intensive applications such as multimedia and streaming applications tend to be strongly affected by the growing memory-processor performance gap. Such applications may often require real-time or high-speed computation.
Memory and processing bandwidths, by contrast, are comparable and relatively easy to scale together. For example, existing multi-core processors achieve high memory bandwidth by using multiple channels and banks. Known solutions to the latency gap rely on using memory bandwidth to ‘hide’ memory latency, as well as on exploiting locality of reference.
Some of these solutions may include using caches, or more generally a memory hierarchy, to exploit the locality of reference inherent in instruction and data accesses. Caches filter out a significant number of memory accesses, resulting in a lower effective memory latency.
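The filtering effect of a cache may be illustrated with a minimal sketch of a direct-mapped cache. The function name, cache geometry, and cycle counts below are illustrative assumptions and are not taken from any particular processor; a miss is assumed to cost a fixed number of cycles in total.

```python
def effective_latency(addresses, num_lines=64, line_size=64,
                      hit_cycles=2, miss_cycles=200):
    """Return (hit_ratio, average cycles per access) for a direct-mapped cache.

    All parameters are hypothetical values chosen for illustration.
    """
    tags = [None] * num_lines          # one stored tag per cache line
    hits = 0
    for addr in addresses:
        block = addr // line_size      # memory block holding this byte
        index = block % num_lines      # cache line the block maps to
        tag = block // num_lines       # distinguishes blocks sharing a line
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag          # miss: fetch the block, replace the line
    hit_ratio = hits / len(addresses)
    average = hit_ratio * hit_cycles + (1 - hit_ratio) * miss_cycles
    return hit_ratio, average


# Sequential 4-byte reads over 4 KiB: each 64-byte line misses once,
# then serves the next 15 reads from the cache.
sequential = [4 * i for i in range(1024)]
ratio, average = effective_latency(sequential)
print(ratio, average)
```

Under these assumed numbers, 15 of every 16 sequential accesses hit in the cache, so the average cost per access falls far below the 200-cycle miss cost.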
Other solutions may involve using hardware multithreading to hide memory latency by context switching to another thread while the current thread waits on memory, thereby overlapping execution with memory access. This may significantly reduce the effect of memory latency on overall system performance.
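The overlap obtained from hardware multithreading may be estimated with a simple back-of-the-envelope model. The sketch below assumes that every thread alternates a fixed number of compute cycles with a fixed memory stall, and that switching threads costs nothing; the function name and the cycle counts are hypothetical.

```python
def core_utilization(num_threads, compute_cycles, stall_cycles):
    """Fraction of cycles the core spends on useful work.

    Simplified model (an assumption, not a measured result): each thread
    computes for compute_cycles, then waits stall_cycles on memory, and
    context switches are free.
    """
    busy = num_threads * compute_cycles
    period = compute_cycles + stall_cycles
    return min(1.0, busy / period)


# With 20 cycles of work per 180-cycle stall, one thread keeps the core
# busy 10% of the time; ten threads are enough to hide the stall fully.
for threads in (1, 4, 10):
    print(threads, core_utilization(threads, 20, 180))
```

In this model the stall is fully hidden once the other threads together supply at least as many compute cycles as one stall lasts.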
Another solution may involve using data prefetching. In this case, data may be prefetched well before it is demanded. Some existing prefetching techniques may include hardware methods that rely on prefetching streams of data (using fixed strides). Examples include ‘one-block-ahead’ for prefetching cache lines, and stream buffers [Zucker, D. F.; Lee, R. B.; Flynn, M. J., “Hardware and software cache prefetching techniques for MPEG benchmarks,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 10, no. 5, pp. 782-796, August 2000]. Other prefetching techniques may include the use of a stride prediction table. Generally, this refers to a hardware method in which an instruction and the data address it last referenced are stored in a table; a subsequent access by the same instruction triggers prefetching, and the current and last data access addresses together predict the prefetch address [Zucker, D. F.; Lee, R. B.; Flynn, M. J., “Hardware and software cache prefetching techniques for MPEG benchmarks,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 10, no. 5, pp. 782-796, August 2000] [HOOGERBRUGGE JAN [US]; VAN DE WAERDT JAN-WILLEM [US], “COUNTER BASED STRIDE PREDICTION FOR DATA PREFETCH”, EP1576465 (A1)]. Another prefetching technique may involve pattern-driven prefetching. Here, the hardware may detect a stride pattern in a data reference sequence, and the detected pattern may direct the next prefetch addresses [Sbeyti, H., Niar, S., and Eeckhout, L. 2006. Pattern-driven prefetching for multimedia applications on embedded processors. J. Syst. Archit. 52, 4 (April 2006), 199-212. DOI=http://dx.doi.org/10.1016/j.sysarc.2005.05.002]. Additional prefetching techniques may include the use of explicit memory management. In this technique, the programmer manages memory in software, scheduling the filling and spilling of data buffers (such as DMA transfers in the Cell BE processor).
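The stride prediction table described above may be sketched in software as follows. This is a minimal illustration of the general idea only, not the design of any cited reference: the class and method names are hypothetical, and the table is modeled as an unbounded dictionary, whereas a hardware table would have a fixed number of entries.

```python
class StridePredictionTable:
    """Toy model: maps an instruction address to its last data address."""

    def __init__(self):
        # instruction address (PC) -> last data address it referenced
        self.last_addr = {}

    def access(self, pc, addr):
        """Record one data access; return an address to prefetch, or None."""
        prev = self.last_addr.get(pc)
        self.last_addr[pc] = addr
        if prev is None:
            return None           # first access by this instruction
        stride = addr - prev      # current and last address give the stride
        if stride == 0:
            return None           # same address again, nothing new to fetch
        return addr + stride      # predicted next address


# A load at (hypothetical) PC 0x400 walking an array of 8-byte elements.
table = StridePredictionTable()
for address in (1000, 1008, 1016):
    print(address, table.access(0x400, address))
```

The first access only fills the table; from the second access onward, the predictor proposes the address one stride ahead of the current one.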
However, none of the prefetching techniques described above allows for generic pattern matching. Some techniques only allow for simple stride prediction, which is useful only for a limited set of applications. Other techniques are only capable of matching very short sequences (e.g., of length up to 3). Further work is therefore needed to improve existing prefetching techniques, to make them applicable to a wider class of applications, and to use the abundant bandwidth available (especially in multi-core processors) to hide memory latency.