The present disclosure relates to improving performance of a stride-based prefetcher. In particular, it relates to improving performance of a stride-based prefetcher on an out-of-order central processing unit (CPU).
Over the past decade, the increase in processor frequency has not been matched by a corresponding reduction in memory access latency. This mismatch has led to processors frequently stalling while they wait for data to arrive from memory. These stalls limit or negate the improvement achieved from the increase in processor frequency. To deal with this problem, processors have incorporated multiple levels of cache. The multi-level caches allow frequently accessed data to be fetched quickly by the processors. However, the processors still incur a large latency penalty the first time they reference data that is not present in any of their caches.
Current processor systems address this problem by incorporating prefetch units in the processor pipeline. These prefetch units exploit the spatial and temporal locality of the processor accesses to predict which addresses are likely to be accessed next. The prefetch units generate their address predictions by examining the addresses that were accessed in the recent past. A common prefetcher implementation tracks the difference between successive addresses (i.e., the stride) that were accessed in the recent past. If the stride is constant, then the prefetcher issues prefetches for multiple addresses spaced at successive multiples of the stride, starting from the last address. For example, if the past virtual address (VA) accesses were in the order: VA-3*Stride, VA-2*Stride, VA-Stride, VA; then the prefetcher will prefetch the following addresses: VA+Stride, VA+2*Stride, VA+3*Stride. This type of prefetcher is very effective for situations where large data structures (e.g., data structures in the form of an array or a matrix) are being accessed in regular loops.
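By way of illustration only, the conventional behavior described above can be sketched in software as follows. This is a minimal model rather than a description of any particular hardware prefetcher; the function name `predict_prefetches`, the `degree` parameter (the number of addresses to prefetch), and the 0x40-byte stride in the usage note are assumptions introduced for the example.

```python
def predict_prefetches(history, degree=3):
    """Return prefetch addresses if the recent accesses share one constant stride.

    history: recently accessed addresses, oldest first (hypothetical input format).
    degree:  how many addresses ahead to prefetch (assumed parameter).
    """
    if len(history) < 3:
        # Too few accesses to confirm that the stride is constant.
        return []

    # Difference between each pair of successive addresses.
    strides = [later - earlier for earlier, later in zip(history, history[1:])]

    stride = strides[-1]
    if stride == 0 or any(s != stride for s in strides):
        # No usable constant stride was observed, so nothing is prefetched.
        return []

    last = history[-1]
    # Successive multiples of the stride, starting from the last address.
    return [last + k * stride for k in range(1, degree + 1)]
```

With an assumed stride of 0x40 and past accesses 0x1000, 0x1040, 0x1080, 0x10C0, the sketch returns 0x1100, 0x1140 and 0x1180, which corresponds to VA+Stride, VA+2*Stride and VA+3*Stride in the example above.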
However, conventional stride-based prefetchers do not function well in situations where the processor accesses do not arrive in a strict numeric sequence. This situation arises frequently in modern out-of-order processors. It occurs especially often when an out-of-order processor executes an application in which a data structure is accessed in a tight loop with very little computation performed before the next access is issued, such as a block-transfer application.
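The effect of reordering on stride detection can be illustrated with a short sketch. The addresses, the 0x40-byte stride, and the particular issue order below (adjacent accesses swapped) are hypothetical values chosen for the example; an actual out-of-order processor may reorder accesses differently.

```python
def successive_strides(addresses):
    """Difference between each pair of successive addresses, in observed order."""
    return [later - earlier for earlier, later in zip(addresses, addresses[1:])]


def has_constant_stride(addresses):
    """True only if every successive difference is the same non-zero value."""
    strides = successive_strides(addresses)
    return bool(strides) and strides[0] != 0 and len(set(strides)) == 1


# Accesses as the program specifies them: a regular walk with stride 0x40.
program_order = [0x1000 + 0x40 * i for i in range(6)]

# Assumed issue order in which each adjacent pair of accesses is swapped.
issue_order = [program_order[i] for i in (1, 0, 3, 2, 5, 4)]

print(successive_strides(program_order))   # [64, 64, 64, 64, 64]
print(successive_strides(issue_order))     # [-64, 192, -64, 192, -64]
print(has_constant_stride(program_order))  # True
print(has_constant_stride(issue_order))    # False
```

Although the underlying stride is 0x40 in both cases, the observed differences in the reordered stream alternate between -0x40 and 0xC0, so a prefetcher that requires a constant difference between successive accesses does not detect a stride.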
Accordingly, there is a need for a system that generates an estimate of the correct access stride from out-of-order accesses.