1. Field of the Invention
The present invention relates generally to cache performance, and in particular to methods and mechanisms for prefetching data in processors with multiple levels of caches.
2. Description of the Related Art
Memory latency is frequently a large factor in determining the performance (e.g., instructions executed per second) of a processor in a given system. While the operating frequencies of processors have increased dramatically over time, the latency for access to dynamic random access memory (DRAM) in a typical system has not decreased at the same rate. Consequently, the number of processor clock cycles required to access external memory has increased. In order to compensate for the relatively low speed of memory devices, various techniques have been developed.
One commonly used technique for reducing data access latency is caching data in one or more caches located close to the processor. Caches are typically relatively small, low-latency memories incorporated into the processor or coupled nearby. Another approach used to mitigate the impact of memory access latencies on processor performance is to prefetch data into a cache. The term “prefetch” may generally refer to the fetching of data from memory before that data is actually needed for computation by instructions in the program. One way that the memory bandwidth may be effectively utilized is to predict the information that will be accessed soon and then prefetch that information from the memory system into the cache. If the prediction is correct, the information may be a cache hit at the time of the actual request and thus the effective memory latency for actual requests may be decreased. On the other hand, if the prediction is incorrect, the prefetched information may replace useful information in the cache, causing more cache misses to be experienced than if prefetching were not employed and thus increasing the effective memory latency.
Certain types of computer programs process long sequences of data where each element in the sequence is accessed only once. This type of access pattern usually results in cache misses since the required data is not in the cache at the time it is needed. This type of access may be referred to as a “data stream” or “stream”, which is prevalent in certain multimedia applications. Prefetching data based on a prediction of the stream may help prevent cache misses and improve processor efficiency.
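By way of illustration only, the effect of stream prefetching on cache misses may be sketched with a small model. The cache organization (fully associative, least recently used replacement), the 64-byte line size, the cache capacity, and the names TinyCache and count_demand_misses are assumptions made for this sketch and are not taken from any particular processor. The model also assumes that a prefetched line arrives instantly, so it shows only the reduction in miss count, not the timing of prefetches.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line (assumed for this sketch)


class TinyCache:
    """Fully associative LRU cache that tracks line numbers only (hypothetical model)."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()

    def contains(self, line):
        return line in self.lines

    def fill(self, line):
        # Insert the line (or refresh it) and mark it most recently used.
        self.lines[line] = True
        self.lines.move_to_end(line)
        while len(self.lines) > self.num_lines:
            self.lines.popitem(last=False)  # evict the least recently used line


def count_demand_misses(addresses, num_lines=8, prefetch_degree=0):
    """Count demand misses; after each access, prefetch the next
    `prefetch_degree` sequential lines (instant arrival is assumed)."""
    cache = TinyCache(num_lines)
    misses = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        if not cache.contains(line):
            misses += 1
        cache.fill(line)
        for k in range(1, prefetch_degree + 1):
            cache.fill(line + k)
    return misses


# A stream: 64 consecutive cache lines, each element accessed only once.
stream = list(range(0, 64 * LINE_SIZE, LINE_SIZE))
misses_without = count_demand_misses(stream, prefetch_degree=0)
misses_with = count_demand_misses(stream, prefetch_degree=2)
print(misses_without, misses_with)
```

Under these assumptions, every access of the stream misses when no prefetching is performed, whereas with sequential prefetching only the first access misses.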
Prefetching is commonly employed for hiding memory latency in applications that have large data footprints. Many state-of-the-art microprocessors use multiple prefetch units, one or more at each cache level, in order to get fine-grained control over data placement. These prefetch units are typically tuned to be more aggressive as their proximity to the core decreases, such that the lower-level cache prefetch units run significantly ahead of the upper-level prefetch units closest to the core. As used herein, the term “lower-level prefetch units” refers to prefetch units further away from the core (e.g., level two (L2) prefetch unit, level three (L3) prefetch unit), while the term “upper-level prefetch units” refers to prefetch units closer to the core (e.g., level one (L1) prefetch unit).
Processors often utilize a virtual address space for data stored in the system memory, and software processes may divide the virtual address space into pages of a fixed size. Mappings (or translations) of virtual addresses to physical addresses may keep track of where virtual pages are located in the physical memory. These mappings may be stored in a page table and this page table may be stored in memory. A translation lookaside buffer (TLB) may be utilized to store a subset of the mappings/translations.
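The translation described above may be sketched as follows. This is a minimal illustration under stated assumptions: the 4096-byte page size, the contents of page_table, and the function name translate are hypothetical, and the TLB is modeled as an unbounded dictionary, whereas an actual TLB holds only a limited subset of the mappings.

```python
PAGE_SIZE = 4096  # bytes per page (assumed for this sketch)

# Hypothetical mappings: virtual page number -> physical page number.
page_table = {0: 7, 1: 3, 2: 12}
tlb = {}  # caches recently used mappings (unbounded here for simplicity)


def translate(vaddr):
    """Return (physical address, where the mapping was found)."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:
        ppn, source = tlb[vpn], "tlb"
    else:
        ppn = page_table[vpn]  # page table lookup (slower, resides in memory)
        tlb[vpn] = ppn         # remember the mapping for later accesses
        source = "page_table"
    return ppn * PAGE_SIZE + offset, source


first = translate(4100)   # virtual page 1, offset 4; mapping comes from the page table
second = translate(4100)  # same page again; mapping now comes from the TLB
print(first, second)
```

In this example, virtual pages 0 and 1 are adjacent in the virtual address space but map to physical pages 7 and 3, which are not adjacent in physical memory.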
Since the lower-level prefetch units require knowledge of physical addresses but do not have a means of obtaining the translation for the next virtual page accessed by the stream, the lower-level prefetch units typically have to stall at page boundaries. Therefore, such processors may not reap the full benefits of prefetching for streams that cross page boundaries.
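The page-boundary limitation may be illustrated with a short sketch. The page size, the line size, and the function name next_prefetch_paddr are assumptions made for illustration. The function models a prefetch unit that operates only on physical addresses and therefore cannot compute a valid next address once the following cache line would fall in a different physical page.

```python
PAGE_SIZE = 4096  # bytes per page (assumed)
LINE_SIZE = 64    # bytes per cache line (assumed)


def next_prefetch_paddr(paddr):
    """Return the next physical line address to prefetch, or None when the
    next line lies in another page whose translation is not available."""
    nxt = paddr + LINE_SIZE
    if nxt // PAGE_SIZE != paddr // PAGE_SIZE:
        return None  # page boundary reached: the prefetch unit must stall
    return nxt


within_page = next_prefetch_paddr(0)                       # next line, same page
at_boundary = next_prefetch_paddr(PAGE_SIZE - LINE_SIZE)   # last line of the page
print(within_page, at_boundary)
```

Simply adding the line size to the physical address at the boundary would not be correct in general, because the next virtual page of the stream may reside in any physical page.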
FIG. 1 illustrates the problem faced by the prior art. The relative locations of a demand stream, an L1 prefetch stream, and an L2 prefetch stream at a given time within a physical page in memory are shown in FIG. 1. The dashed vertical lines labeled with “Byte 0” and “Byte 4096” indicate the page boundaries. The demand stream fetches data from a location in the page where data is needed for immediate consumption by the program. The L1 prefetch stream tries to stay ahead of the demand stream by a certain number of bytes so that the demand stream will hit in the L1 cache. Similarly, the L2 prefetch stream tries to stay ahead of the L1 prefetch stream by a certain number of bytes so that the L1 prefetch stream will hit in the L2 cache for its prefetches. In one embodiment, data may be fetched from the page on a cache line basis, and the L1 prefetch stream may be a fixed number of cache lines ahead of the demand stream and the L2 prefetch stream may be a fixed number of cache lines ahead of the L1 prefetch stream.
When the L2 prefetch stream reaches the end of the page, as shown in FIG. 1, the L2 prefetch stream will stall while it waits for the physical page number of the next page in the given stream. The L2 prefetch stream may have to wait until the L1 prefetch stream reaches the end of the page, at which point the L1 prefetch unit will request the translation of the next virtual page into a corresponding physical page number. The dashed horizontal line shows the time spent idling by the L2 prefetch stream after the L2 prefetch stream hits the page boundary and waits for the L1 prefetch stream to catch up and reach the page boundary and provide next-page information. This results in “wasted cycles” with the L2 prefetch unit stalling for a certain amount of time while it could be doing useful work. As a result, when the L1 prefetch unit starts fetching from the next page in physical memory, these fetches will miss in the L2 cache because the L2 prefetch unit will not have already prefetched the data from the next page. Therefore, the L1 prefetch unit will incur increased latency as the data referenced by its prefetches needs to be fetched from memory. The L1 prefetch unit will continue to experience increased latency until the L2 prefetch unit is able to catch up with its prefetches in advance of the L1 prefetch unit.
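The stall and the subsequent catch-up period may be illustrated with a simplified step-based model. All numeric parameters (prefetch distances, memory latency, catch-up rate) and the function name simulate are assumptions chosen for this sketch and do not describe any particular processor. The model starts in steady state, advances the demand stream by one cache line per step, models a single page boundary, and treats an L1 prefetch as missing in the L2 cache whenever the requested line has not yet arrived there.

```python
LINES_PER_PAGE = 64  # 4096-byte page / 64-byte lines (assumed sizes)
L1_DISTANCE = 4      # lines the L1 prefetch stream runs ahead of demand (assumed)
L2_DISTANCE = 8      # lines the L2 prefetch stream runs ahead of L1 (assumed)
MEM_LATENCY = 6      # steps for a line to arrive in the L2 cache from memory (assumed)
L2_RATE = 2          # max L2 prefetches issued per step while catching up (assumed)


def simulate(steps=80, translation_known=False):
    """Return (steps the L2 prefetch unit idles at the page boundary,
    number of L1 prefetches that miss in the L2 cache)."""
    warm = L1_DISTANCE + L2_DISTANCE
    ready_at = {line: 0 for line in range(warm)}  # lines already present in L2
    l2_next = warm                                # next line the L2 unit will prefetch
    stall_steps = 0
    l1_misses_in_l2 = 0
    for t in range(steps):                        # demand stream is at line t
        l1_line = t + L1_DISTANCE
        if l1_line >= LINES_PER_PAGE:
            # The L1 unit has reached the next page and requested its translation.
            translation_known = True
        arrival = ready_at.get(l1_line)
        if arrival is None or arrival > t:
            l1_misses_in_l2 += 1                  # line absent or still in flight
            fetched = t + MEM_LATENCY
            ready_at[l1_line] = fetched if arrival is None else min(arrival, fetched)
        l2_next = max(l2_next, l1_line + 1)
        target = l1_line + L2_DISTANCE
        issued = 0
        while issued < L2_RATE and l2_next <= target:
            if l2_next >= LINES_PER_PAGE and not translation_known:
                stall_steps += 1                  # wasted step at the page boundary
                break
            ready_at[l2_next] = t + MEM_LATENCY
            l2_next += 1
            issued += 1
    return stall_steps, l1_misses_in_l2


stalled = simulate()                           # translation arrives only via the L1 unit
ideal = simulate(translation_known=True)       # hypothetical case: translation known early
print(stalled, ideal)                          # prints (8, 10) (0, 0)
```

With these assumed parameters, the L2 prefetch unit idles for a number of steps equal to its lead over the L1 prefetch stream, and the L1 prefetches continue to miss in the L2 cache for several steps after the boundary until the L2 prefetch unit has regained enough lead to cover the memory latency. The second call shows the hypothetical case in which the next-page translation is available in advance, in which neither stalls nor misses occur in this model.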