As out-of-order cores become wider and deeper, microarchitectural performance tends to become increasingly limited by two bottlenecks: cache misses and branch mispredictions. Data prefetching can improve performance for many applications. By fetching data before it is actually required, through a combination of hardware and software, prefetching can reduce the latency of memory accesses.
The impact of a cache miss can be mitigated in multiple ways, including: 1) hiding the latency of the miss through out-of-order execution, 2) tailoring the cache replacement policy to better match applications' needs, and 3) prefetching the memory location before the actual demand occurs.
Load instructions can be classified into several categories, including: a) constant loads, whose virtual addresses remain constant over multiple dynamic instances, b) strided loads, whose successive virtual addresses are mainly in arithmetic progression, and c) irregular loads, which are neither constant loads nor strided loads.
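The three categories above can be illustrated with a minimal sketch that classifies a single static load from the virtual addresses of its dynamic instances. The function name `classify_load` and the `majority` parameter are hypothetical and do not come from the description; in particular, the 0.9 cutoff is only one assumed reading of "mainly in arithmetic progression."

```python
from collections import Counter


def classify_load(addresses, majority=0.9):
    """Classify one static load as 'constant', 'strided', or 'irregular'.

    addresses: virtual addresses of successive dynamic instances of the load.
    majority:  assumed fraction of address deltas that must share one stride
               for the load to count as strided (hypothetical cutoff).
    """
    if len(addresses) < 2:
        return "unknown"  # too few dynamic instances to judge

    # Differences between successive virtual addresses.
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]

    # Constant load: the address never changes.
    if all(d == 0 for d in deltas):
        return "constant"

    # Strided load: one nonzero delta dominates the sequence.
    stride, count = Counter(deltas).most_common(1)[0]
    if stride != 0 and count / len(deltas) >= majority:
        return "strided"

    # Everything else is neither constant nor strided.
    return "irregular"


if __name__ == "__main__":
    print(classify_load([0x1000] * 4))
    print(classify_load([0x1000 + 8 * i for i in range(20)]))
    print(classify_load([0x1000, 0x5008, 0x2010, 0x9FF8, 0x3000]))
```

Under this sketch, a pointer-chasing load whose addresses show no dominant delta falls into the irregular category, which is the case a simple stride-based prefetcher cannot cover.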
Furthermore, as described herein, loads that miss in the cache frequently (i.e., more than a threshold number of times, such as 100, 1,000, 10,000, etc.) are called delinquent loads.
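As an illustration of the threshold test, the following sketch identifies delinquent loads from a trace of cache-miss events. It assumes that each miss is recorded as the program counter (PC) of the load that caused it; the function name `find_delinquent_loads`, the trace format, and the example PC values are hypothetical.

```python
from collections import Counter


def find_delinquent_loads(miss_pcs, threshold=100):
    """Return the set of load PCs that missed more than `threshold` times.

    miss_pcs:  iterable of load PCs, one entry per cache miss (assumed format).
    threshold: miss count a load must exceed to be considered delinquent.
    """
    miss_counts = Counter(miss_pcs)
    return {pc for pc, misses in miss_counts.items() if misses > threshold}


if __name__ == "__main__":
    # Hypothetical trace: one load misses 150 times, one exactly 100, one 3.
    trace = [0x400A10] * 150 + [0x400B20] * 100 + [0x400C30] * 3
    for pc in sorted(find_delinquent_loads(trace, threshold=100)):
        print(hex(pc))
```

The comparison is strict, matching "more than a threshold number of times," so a load with exactly 100 misses is not delinquent at a threshold of 100.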
Prefetching delinquent irregular loads remains an open challenge.