The latency of first-level (L1) cache misses is a significant cause of pipeline stalls in processors. The latency of a load instruction depends on whether the requested data is present in the L1 cache. However, at the time the load instruction is issued, it is not yet known whether the data will be present in the L1 cache. This uncertainty complicates the issuing of instructions that depend on the load instruction, and leads to pipeline stalls and performance degradation.
In most of today's processors, the scheduler assumes that the data accessed by a load instruction will be available in the L1 cache, and issues dependent instructions based on this assumption. When the load misses the cache, a reject-and-replay mechanism is triggered: the dependent instructions that were issued speculatively are rejected and must be re-issued. A cache miss therefore incurs not only a performance penalty, but also a power overhead.
The Compaq Alpha 21264 used the most significant bit of a 4-bit saturating counter as the hit or miss prediction for a load instruction. The counter was incremented by one every time a load hit, and decremented by two every time a load missed. Typically, however, only a small percentage of static load instructions are responsible for a large fraction of the cache misses. The counter is therefore easily saturated by the load instructions that usually hit in the cache, so the loads that do miss are still predicted as hits, which leads to mispredictions.
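The saturation behavior described above can be illustrated with a minimal sketch. It assumes a single counter shared by all loads and initialized to its maximum value; the class name `HitMissCounter`, the helper `count_mispredictions`, and the example access pattern are hypothetical and chosen for illustration only, not taken from the Alpha 21264 design documents.

```python
class HitMissCounter:
    """Toy model of a saturating hit/miss counter (assumed shared by all loads)."""

    def __init__(self, bits=4, initial=None):
        self.max_value = (1 << bits) - 1      # 15 for a 4-bit counter
        self.msb_mask = 1 << (bits - 1)       # most significant bit (value 8)
        # Assumption: start fully saturated toward "hit".
        self.value = self.max_value if initial is None else initial

    def predict_hit(self):
        # The prediction is the most significant bit of the counter.
        return bool(self.value & self.msb_mask)

    def update(self, hit):
        if hit:
            # Increment by one on a hit, saturating at the maximum.
            self.value = min(self.value + 1, self.max_value)
        else:
            # Decrement by two on a miss, saturating at zero.
            self.value = max(self.value - 2, 0)


def count_mispredictions(outcomes, counter=None):
    """Count wrong predictions over a sequence of outcomes (True = hit)."""
    if counter is None:
        counter = HitMissCounter()
    wrong = 0
    for hit in outcomes:
        if counter.predict_hit() != hit:
            wrong += 1
        counter.update(hit)
    return wrong


# Hypothetical pattern: nine hits followed by one miss, repeated ten times.
pattern = ([True] * 9 + [False]) * 10
mispredicted = count_mispredictions(pattern)
```

In this pattern the counter never drops below 13, so its most significant bit stays set and every one of the ten misses is predicted as a hit. A counter kept per static load instruction would behave differently, but that is a separate design from the one modeled here.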
Yoaz et al., in “Speculation techniques for improving load related instruction scheduling,” Computer Architecture News, vol. 27, no. 2, pp. 42-53, 1999, used 2-level local predictors, 2-level global predictors, and hybrid predictors for cache hit or miss prediction. Here again, the predictors predominantly predict hits, so loads that actually miss still result in the re-issuing of instructions.