1. Field of the Invention
The present invention relates generally to cache performance, and in particular to methods and mechanisms for prefetching data in processors with multiple levels of caches.
2. Description of the Related Art
Memory latency is frequently a large factor in determining the performance (e.g., instructions executed per second) of a processor in a given system. Over time, the operating frequencies of processors have increased dramatically, while the latency for access to dynamic random access memory (DRAM) in the typical system has not decreased at the same rate. Accordingly, the number of processor clocks required to access the external memory has increased. Therefore, techniques for compensating for the relatively low speed of memory devices have been developed. One technique is caching data in one or more caches located close to the processor. Caches are relatively small, low-latency memories incorporated into the processor or coupled nearby.
Processors typically use caches to combat the effects of memory latency on processor performance. One way to mitigate the increasing latency of memory accesses is to prefetch data into a cache. The term “prefetch” may generally refer to the fetching of data from memory before that data is actually needed for computation by instructions in the program. One way that the memory bandwidth may be effectively utilized is to predict the information that will be accessed soon and then prefetch that information from the memory system into the cache. If the prediction is correct, the information may be a cache hit at the time of the actual request and thus the effective memory latency for actual requests may be decreased. On the other hand, if the prediction is incorrect, the prefetched information may replace useful information in the cache, causing more cache misses to be experienced than if prefetching were not employed and thus increasing the effective memory latency.
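The trade-off described above may be illustrated with the conventional average-latency relation. This is a simplified sketch: the symbols h (the fraction of requests that hit in the cache), t_cache (the cache access latency), and t_mem (the latency of a request serviced from memory) are introduced here for illustration only and do not appear elsewhere in this description.

```latex
t_{\mathrm{eff}} = h \, t_{\mathrm{cache}} + (1 - h) \, t_{\mathrm{mem}},
\qquad 0 \le h \le 1, \quad t_{\mathrm{cache}} < t_{\mathrm{mem}}
```

Under these assumptions, a correct prefetch raises h and therefore lowers the effective latency, while an incorrect prefetch that displaces useful data lowers h and raises it.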
Certain types of computer programs process a long sequence of data where each element in the sequence is accessed only once. This type of access pattern usually results in cache misses since the required data is not in the cache at the time it is needed. This type of access may be referred to as a “data stream” or “stream”, which is prevalent in certain multimedia applications. Prefetching data based on a prediction of the stream may help prevent cache misses and improve processor efficiency.
The simplest type of prefetch prediction is a unit stride prediction. For example, a training mechanism may detect accesses to cache lines L and L+1. From these accesses, the training mechanism may determine that the stride is 1, and so a prefetch unit may start prefetching cache lines L+2, L+3, etc. In other embodiments, non-unit strides may be detected, and strides may be toward descending addresses as well as ascending addresses.
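The stride prediction described above may be sketched in software as follows. This is a minimal illustration only, not an implementation of any particular training mechanism or prefetch unit; the function names `detect_stride` and `prefetch_candidates`, and the `confirmations` and `depth` parameters, are hypothetical and chosen for this example.

```python
from typing import List, Optional


def detect_stride(lines: List[int], confirmations: int = 1) -> Optional[int]:
    """Return the stride if the most recent `confirmations` deltas between
    consecutive cache-line numbers are identical and nonzero, else None.

    `confirmations` is an illustrative tuning knob (an assumption of this
    sketch): 1 means two accesses (L, L+1) are enough to train.
    """
    if confirmations < 1 or len(lines) < confirmations + 1:
        return None
    recent = lines[-(confirmations + 1):]
    deltas = [b - a for a, b in zip(recent, recent[1:])]
    first = deltas[0]
    if first != 0 and all(d == first for d in deltas):
        return first
    return None


def prefetch_candidates(lines: List[int], depth: int = 2,
                        confirmations: int = 1) -> List[int]:
    """Return the next `depth` cache lines along the detected stride,
    or an empty list if no stride has been detected."""
    stride = detect_stride(lines, confirmations)
    if stride is None:
        return []
    last = lines[-1]
    return [last + stride * i for i in range(1, depth + 1)]
```

For example, accesses to lines 100 and 101 would yield candidates 102 and 103, while accesses to lines 50 and 48 would yield a descending stride of -2 and candidates 46 and 44.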
Modern superscalar processors use very aggressive speculation techniques that involve reordering of memory accesses in order to achieve higher performance. The further an operation progresses from the front-end of the machine, the more this reordering obscures any discernible pattern in the memory stream. In addition, lower-level caches have to contend with simultaneous request streams from multiple cores, which further increases the entropy of these access patterns. Some authors use the term “lower-level cache” to refer to caches closer to the core, while others use the term to refer to caches further from the core. As used herein, the term “lower-level caches” refers to caches further away from the core (e.g., L2 cache, L3 cache), while the term “upper-level cache” may refer to caches closer to the core (e.g., an L1 cache).
The closer the memory accesses get to memory, the more garbled the memory accesses become in relation to their original order. As a result, it becomes harder to detect a common stride between consecutive memory accesses at lower-level caches since memory accesses get reordered at each level of the machine. Prefetch units at the lower-level caches thus have to contend with garbled memory streams, and are often unable to identify a common pattern across the stream. This reduces the effectiveness of prefetching at the lower levels of the cache hierarchy.
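The effect described above may be illustrated with a small software model. This is a hedged sketch under simplifying assumptions: reordering is modeled as swapping adjacent accesses, and contention from multiple cores is modeled as strict interleaving of two streams. Neither model is taken from any actual processor, and the function names are hypothetical.

```python
from itertools import chain
from typing import List


def stride_match_fraction(stream: List[int], stride: int = 1) -> float:
    """Fraction of consecutive access pairs whose delta equals `stride`."""
    deltas = [b - a for a, b in zip(stream, stream[1:])]
    if not deltas:
        return 0.0
    return sum(d == stride for d in deltas) / len(deltas)


def swap_adjacent(stream: List[int]) -> List[int]:
    """Toy model of reordering: swap each pair of adjacent accesses."""
    out = list(stream)
    for i in range(0, len(out) - 1, 2):
        out[i], out[i + 1] = out[i + 1], out[i]
    return out


def interleave(a: List[int], b: List[int]) -> List[int]:
    """Toy model of a shared cache seeing two cores' requests alternately."""
    return list(chain.from_iterable(zip(a, b)))


# Two unit-stride streams, as each core would issue them in program order.
core_a = list(range(100, 108))
core_b = list(range(500, 508))

in_order = stride_match_fraction(core_a)                       # every delta is +1
reordered = stride_match_fraction(swap_adjacent(core_a))       # deltas alternate -1, +3
shared = stride_match_fraction(interleave(core_a, core_b))     # deltas alternate +400, -399
```

In this model, every consecutive pair in the in-order stream exhibits the unit stride, whereas neither the reordered stream nor the interleaved stream contains a single consecutive pair with a delta of +1, so a detector that compares only consecutive accesses would fail to train on either.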