The gap between processor and memory speed continues to widen. As a result, computer performance is increasingly determined by the effectiveness of the cache hierarchy. However, processor workloads typically incur significant cache misses.
Prefetching is a well-known and effective technique for improving the performance of the cache hierarchy. One technique compilers use to improve the accuracy of prefetching is to statically discover memory access instructions (e.g., loads and stores) with a constant “stride.” For example, a load instruction that loads every sixteenth byte is easy to prefetch, because the compiler knows ahead of time which bytes will be needed. However, many memory access instructions with a constant stride cannot be statically discovered by the compiler, because they involve pointer dereferences and indirect array references that are not resolved until run-time.
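The contrast between a statically visible stride and one hidden behind an indirect array reference can be sketched in C as follows. This is an illustrative sketch only: the function names, the 64-byte prefetch distance, and the use of the GCC/Clang `__builtin_prefetch` intrinsic are assumptions made for the example and do not come from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefetch distance in bytes, chosen only for illustration. */
#define PREFETCH_DISTANCE 64

/* Constant stride visible at compile time: each access is exactly
 * 16 bytes past the previous one, so a compiler can see the pattern. */
long sum_every_sixteenth_byte(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    long total = 0;
    for (size_t i = 0; i < len; i += 16) {
#if defined(__GNUC__)
        /* Request data that will be needed a few iterations from now,
         * staying inside the buffer. */
        if (i + PREFETCH_DISTANCE < len)
            __builtin_prefetch(buf + i + PREFETCH_DISTANCE);
#endif
        total += buf[i];
    }
    return total;
}

/* Indirect array reference: the address of each access depends on the
 * run-time contents of idx, so the stride (if any) is not visible to
 * the compiler, even when idx happens to hold 0, 16, 32, ... */
long sum_indirect(const unsigned char *buf, const size_t *idx, size_t n)
{
    assert(buf != NULL && idx != NULL);
    long total = 0;
    for (size_t k = 0; k < n; k++)
        total += buf[idx[k]];
    return total;
}
```

If `idx` holds the values 0, 16, 32, and 48, both functions touch the same bytes, yet only the first loop exposes its constant stride in the source code.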
To address this problem, “instrumentation” code (i.e., test code) may be added to a software application to directly monitor the actual data addresses accessed by one or more memory access instructions. However, instrumentation code adds significant overhead that slows the application down (e.g., by a factor of 10).
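A minimal sketch of what such instrumentation might look like in C is shown below. The `stride_profile` structure and the `profile_access` and `sum_indirect_instrumented` functions are hypothetical names introduced for this illustration, not part of any particular tool. The sketch records the address of each access made by one load and counts how often consecutive accesses repeat the same stride.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-instruction profile: remembers the last address and
 * stride seen, and how often the stride repeated. */
typedef struct {
    uintptr_t     last_addr;
    ptrdiff_t     last_stride;
    unsigned long samples;      /* accesses observed so far */
    unsigned long same_stride;  /* times the stride matched the previous one */
} stride_profile;

/* Instrumentation hook: called with the data address of every access. */
static void profile_access(stride_profile *p, const void *addr)
{
    assert(p != NULL);
    uintptr_t a = (uintptr_t)addr;
    if (p->samples > 0) {
        ptrdiff_t stride = (ptrdiff_t)(a - p->last_addr);
        /* A stride can only be compared once a previous stride exists. */
        if (p->samples > 1 && stride == p->last_stride)
            p->same_stride++;
        p->last_stride = stride;
    }
    p->last_addr = a;
    p->samples++;
}

/* The monitored loop: one extra call per memory access, which is the
 * source of the run-time overhead. */
long sum_indirect_instrumented(const unsigned char *buf, const size_t *idx,
                               size_t n, stride_profile *p)
{
    long total = 0;
    for (size_t k = 0; k < n; k++) {
        profile_access(p, &buf[idx[k]]);
        total += buf[idx[k]];
    }
    return total;
}
```

Starting from a zero-initialized `stride_profile`, a run in which `same_stride` is close to `samples - 2` suggests the load has a constant stride at run-time. Every monitored access pays for the extra call and bookkeeping, which illustrates where the slowdown comes from.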