1. Field of the Invention
This invention is related to the field of processors and, more particularly, to prefetch mechanisms in processors.
2. Description of the Related Art
Memory latency is frequently a large factor in determining the performance (e.g. instructions executed per second) of a processor in a given computer system. Over time, the operating frequencies of processors have increased dramatically, while the latency for access to dynamic random access memory (DRAM) in the typical computer system has not decreased as dramatically. Additionally, transmitting memory requests from the processor to the memory controller coupled to the memory system also requires time, which increases the memory latency. Accordingly, the number of processor clocks required to access the DRAM memory has increased, from latencies (as measured in processor clocks) of a few processor clocks, through tens of processor clocks, to over a hundred processor clocks in modern computer systems.
Processors have implemented caches to combat the effects of memory latency on processor performance. Caches are relatively small, low latency memories incorporated into the processor or coupled nearby. The caches store recently used instructions and/or data under the assumption that the recently used information may be accessed by the processor again. The caches may thus reduce the effective memory latency experienced by a processor by providing frequently accessed information more rapidly than if the information had to be retrieved from the memory system in response to each access.
If processor memory requests (e.g. instruction fetches and load and store memory operations) are cache hits (the requested information is stored in the processor's cache), then the memory requests are not transmitted to the memory system. Accordingly, memory bandwidth may be freed for other uses. However, the first time a particular memory location is accessed, a cache miss occurs (since the requested information is stored in the cache only after it has been accessed for the first time) and the information is transferred from the memory system to the processor (and may be stored in the cache). Additionally, since the caches are finite in size, information stored therein may be replaced by more recently accessed information. If the replaced information is accessed again, a cache miss will occur. The cache misses then experience the memory latency before the requested information arrives.
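By way of illustration only, the two sources of cache misses described above (a first access to a location, and an access to information that has since been replaced) may be sketched as follows. The class name TinyCache, the fully associative organization, and the least-recently-used replacement policy are assumptions chosen for brevity; the text does not specify any particular cache organization or replacement policy.

```python
from collections import OrderedDict


class TinyCache:
    """Hypothetical fully associative cache with LRU replacement (an assumption)."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # line address -> True, least recently used first

    def access(self, line_address):
        """Return True on a cache hit, False on a miss (the line is then filled)."""
        if line_address in self.lines:
            self.lines.move_to_end(line_address)  # mark as most recently used
            return True
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # replace the least recently used line
        self.lines[line_address] = True
        return False


cache = TinyCache(num_lines=2)
# Line 0 misses on first access, hits on the second access, is replaced
# when lines 1 and 2 are brought in, and therefore misses again.
results = [cache.access(address) for address in (0, 0, 1, 2, 0)]
```

In this sketch, only the second access to line 0 is a hit; each of the other accesses would experience the memory latency.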
One way that the memory bandwidth may be effectively utilized is to predict the information that is to be accessed soon and to prefetch that information from the memory system into the cache. If the prediction is correct, the information may be a cache hit at the time of the actual request and thus the effective memory latency for actual requests may be decreased. Alternatively, the prefetch may be in progress at the time of the actual request, and thus the latency for the actual request may still be less than the memory latency even though a cache hit does not occur for the actual request. On the other hand, if the prediction is incorrect, the prefetched information may replace useful information in the cache, causing more cache misses to be experienced than if prefetching were not employed and thus increasing the effective memory latency.
A processor is described which includes a stride detect table. The stride detect table includes one or more entries, each entry used to track a potential stride pattern. Additionally, each entry includes a confidence counter. The confidence counter may be incremented each time another address in the pattern is detected, and thus may be indicative of the strength of the pattern (e.g., the likelihood of the pattern repeating). At a first threshold of the confidence counter, prefetching of the next address in the pattern (the most recent address plus the stride) may be initiated. At a second, greater threshold, a more aggressive prefetching may be initiated (e.g. the most recent address plus twice the stride). Since the aggressiveness of the prefetch is related to the number of times the pattern has repeated, aggressive prefetching may be performed for patterns which may be more likely to repeat. Thus, prefetching of data which is not subsequently used may be low.
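The behavior of a single stride detect table entry may be sketched as follows, purely as an illustration. The threshold values, the name StrideEntry, and the choice to reset the confidence counter when the observed stride changes are assumptions that do not come from the text, which requires only that the second threshold be greater than the first.

```python
FIRST_THRESHOLD = 2   # assumed value
SECOND_THRESHOLD = 4  # assumed value; must exceed FIRST_THRESHOLD


class StrideEntry:
    """Hypothetical table entry tracking one potential stride pattern."""

    def __init__(self, address):
        self.last_address = address
        self.stride = None
        self.confidence = 0

    def observe(self, address):
        """Record a new address; return a prefetch address, or None."""
        stride = address - self.last_address
        if self.stride is not None and stride == self.stride:
            self.confidence += 1  # another address in the pattern was detected
        else:
            self.stride = stride  # assumption: begin tracking the new stride
            self.confidence = 0
        self.last_address = address
        if self.confidence >= SECOND_THRESHOLD:
            return address + 2 * self.stride  # more aggressive prefetch
        if self.confidence >= FIRST_THRESHOLD:
            return address + self.stride  # next address in the pattern
        return None


entry = StrideEntry(0x1000)
prefetches = [entry.observe(0x1000 + 64 * i) for i in range(1, 7)]
```

With these assumed thresholds, the first two addresses produce no prefetch, the next two produce a prefetch one stride ahead, and later addresses produce a prefetch two strides ahead.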
In one implementation, prefetched cache lines may be stored in the L2 cache. Cache pollution may have a more limited effect in such implementations. Additionally, an implementation may track the patterns of addresses which miss the L1 cache, thereby potentially reducing the number of patterns to be tracked and thus the size of the stride detect table. Some implementations may detect collisions between prefetch addresses and subsequent miss addresses to cause the more aggressive prefetching, in addition to the second threshold of the confidence counter. In some embodiments, the implementation of prefetch in the processor and buffering of the prefetch data in a cache (such as the L2 cache) may allow for elimination of prefetching and a prefetch buffer from the memory controller in the system including the processor.
Implementing prefetch as described above may lead to more accurate prefetching in some implementations. For example, since the actual stream of misses from the cache in one processor is observed by the prefetch mechanism described herein, the patterns detected may be more likely to correspond to miss patterns in code being executed. When prefetch is implemented in the memory controller, observability is generally limited to the miss stream on the interface to the memory controller, which may include misses from two or more processors (in multiprocessor systems). Thus, patterns may be detected among misses from different processors. Such patterns may be less likely to repeat than patterns detected in a miss stream from one processor (or one cache).
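One way in which a combined miss stream may differ from a per-processor miss stream can be sketched as follows. The addresses, the strides, and the strict alternation between the two processors are hypothetical and are chosen only to make the effect visible.

```python
from itertools import chain


def strides(addresses):
    """Return the differences between consecutive addresses."""
    return [b - a for a, b in zip(addresses, addresses[1:])]


# Hypothetical miss streams: each processor walks memory with a fixed stride.
cpu0_misses = [0x1000 + 64 * i for i in range(4)]
cpu1_misses = [0x8000 + 128 * i for i in range(4)]

# Assumption: the memory controller sees the two streams strictly alternating.
interleaved = list(chain.from_iterable(zip(cpu0_misses, cpu1_misses)))

per_cpu_strides = strides(cpu0_misses)    # a single repeated stride
interleaved_strides = strides(interleaved)  # no repeated stride
```

In this sketch, the stream observed within one processor shows a constant stride, whereas consecutive addresses in the interleaved stream show no constant stride, so any pattern found there would relate addresses from different processors.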
Broadly speaking, a processor is contemplated, comprising a table and a control circuit coupled thereto. The table includes at least a first entry configured to store at least a portion of a first address and a corresponding first count. The control circuit is configured to select a second address equal to a sum of the first address and a first value as a prefetch address responsive to the first count being greater than or equal to a first threshold. Furthermore, the control circuit is configured to select a third address equal to a sum of the first address and a second value as the prefetch address responsive to the first count being greater than or equal to a second threshold. The second value is greater than the first value, and the second threshold is greater than the first threshold.
Additionally, a method is contemplated. At least a portion of a first address and a corresponding first count are stored. A second address equal to a sum of the first address and a first value is selected as a prefetch address responsive to the first count being greater than or equal to a first threshold. A third address equal to a sum of the first address and a second value is selected as the prefetch address responsive to the first count being greater than or equal to a second threshold. The second value is greater than the first value, and the second threshold is greater than the first threshold.
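The selection recited above may be sketched, under stated assumptions, as a single function. The function name, the parameter names, the use of None to indicate that no prefetch address is selected, and the example values are illustrative only and are not taken from the text.

```python
def select_prefetch_address(first_address, first_count,
                            first_value, second_value,
                            first_threshold, second_threshold):
    """Return the selected prefetch address, or None if no threshold is met."""
    if not (second_value > first_value and second_threshold > first_threshold):
        raise ValueError("second value and threshold must exceed the first")
    # The greater threshold is checked first, since a count that meets the
    # second threshold also meets the first.
    if first_count >= second_threshold:
        return first_address + second_value  # the third address
    if first_count >= first_threshold:
        return first_address + first_value  # the second address
    return None


# Hypothetical values: values of 64 and 128 bytes, thresholds of 2 and 4.
below = select_prefetch_address(0x2000, 1, 64, 128, 2, 4)
at_first = select_prefetch_address(0x2000, 2, 64, 128, 2, 4)
at_second = select_prefetch_address(0x2000, 4, 64, 128, 2, 4)
```

Here the first value and the second value might correspond to the stride and twice the stride, respectively, although the sketch does not depend on that relationship.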