Processor performance is an important metric in computing systems. The current state of the art has reached a point where raising the processor clock rate has minimal effect on actual performance. The gating factor is the processor-cache miss rate. For example, at a processor clock rate of 3 GHz, a cache miss may cost about 300-450 clock cycles. Assuming that 25% of the instructions are LOAD instructions and taking a miss penalty of 400 cycles, a 2% cache miss rate increases the average number of cycles per instruction (CPI) from 1 to 1 + (25%)(2%)(400) = 3, making the processor three times slower.
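The CPI arithmetic above can be reproduced with a minimal sketch. The function name `effective_cpi` and its parameter names are illustrative assumptions, not terms from the text, and the model is the same simplified one used in the example: a base CPI of 1, with only LOAD instructions incurring cache misses.

```python
import math


def effective_cpi(base_cpi, load_fraction, miss_rate, miss_penalty_cycles):
    """Average cycles per instruction under a simple cache-miss model.

    Hypothetical helper: every instruction costs base_cpi cycles, and a
    fraction load_fraction * miss_rate of instructions additionally pays
    miss_penalty_cycles for a cache miss.
    """
    return base_cpi + load_fraction * miss_rate * miss_penalty_cycles


# Figures from the example: 25% LOADs, 2% miss rate, 400-cycle penalty.
cpi = effective_cpi(base_cpi=1.0, load_fraction=0.25,
                    miss_rate=0.02, miss_penalty_cycles=400)
slowdown = cpi / 1.0  # relative to an ideal CPI of 1

print(f"CPI = {cpi:.2f}, slowdown = {slowdown:.2f}x")
assert math.isclose(cpi, 3.0)
```

Under this model the miss term dominates: the 400-cycle penalty contributes two of the three cycles per instruction even though only 0.5% of instructions actually miss.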
Furthermore, servers today execute pointer-rich application environments (such as Java or .NET), which generally exhibit even lower processor-cache performance, with cache miss rates of 4% in some instances. A number of years ago, this led to suggestions that processor data caches be eliminated altogether in pointer-rich execution environments such as Artificial Intelligence systems.
Note that halving the cache miss rate on a 3 GHz processor with a 2% level-two processor-cache (L2 cache) miss rate yields performance equivalent to raising the processor clock rate to 10 GHz (holding other factors the same). In other words, it amounts to “virtual over-clocking” with no side effects.
Conventional processor-cache prefetching algorithms require the compiler to emit specific prefetch instructions, or to set bits in the generated assembly code, at compile time. Accordingly, there is a need for a processor-cache prefetching algorithm that requires no extra work from the compiler or the processor, is transparent to both, and requires no advance knowledge from the programmer.