1. Field of the Invention
This invention is related to the field of microprocessors and, more particularly, to prefetch mechanisms within microprocessors.
2. Description of the Related Art
Modern microprocessors are demanding increasing memory bandwidth (and particularly decreased memory latency) to support the increased performance achievable by the microprocessors. Increasing clock frequencies (i.e. shortening clock cycles) employed by the microprocessors allow for more data and instructions to be processed per second, thereby increasing bandwidth requirements. Furthermore, modern microprocessor microarchitectures are improving the efficiency at which the microprocessor can process data and instructions. Bandwidth requirements are increased even further due to the improved processing efficiency.
Computer systems typically have a relatively large, relatively slow main memory. Typically, multiple dynamic random access memory (DRAM) modules comprise the main memory system. The large main memory provides storage for a large number of instructions and/or a large amount of data for use by the microprocessor, providing faster access to the instructions and/or data then may be achieved from a disk storage, for example. However, the access times of modern DRAMs are significantly longer than the clock cycle length of modern microprocessors. The memory access time for each set of bytes being transferred to the microprocessor is therefore long. Accordingly, the main memory system is not a low latency system. Microprocessor performance may suffer due to the latency of the memory system.
In order to increase performance, microprocessors may employ prefetching to "guess" which data will be requested in the future by the program being executed. If the guess is correct, the delay of fetching the data from memory has already occurred when the data is requested (i.e. the requested data may be available within the microprocessor). In other words, the effective latency of the data is reduced. The microprocessor may employ a cache, for example, and the data may be prefetched from memory into the cache. The term prefetch, as used herein, refers to transferring data into a microprocessor (or cache memory attached to the microprocessor) prior to a request for the data being generated via execution of an instruction within the microprocessor. Generally, prefetch algorithms (i.e. the methods used to generate prefetch addresses) are based upon the pattern of accesses which have been performed in response to the program being executed. For example, one data prefetch algorithm is the stride-based prefetch algorithm in which the difference between the addresses of consecutive accesses (the "stride") is added to subsequent access addresses to generate a prefetch address. Another type of prefetch algorithm is the stream prefetch algorithm in which consecutive data words (i.e. data words which are contiguous to one another) are fetched.
The type of prefetch algorithm which is most effective for a given set of instructions within a program depends upon the type of memory access pattern exhibited by the set of instructions (or instruction stream). Stride-based prefetch algorithms often work well with regular memory reference patterns (i.e. references separated in memory by a fixed finite amount). An array, for example, may be traversed by reading memory locations which are separated from each other by a regular interval. After just a few memory fetches, the stride-based prefetch algorithm may have learned the regular interval and may correctly predict subsequent memory fetches. On the other hand, the stream prefetch algorithm may work well with memory access patterns in which a set of contiguous data is accessed once and then not returned to for a relatively long period of time. For example, searching a string for a particular character or for comparing to another string may exhibit a stream reference pattern. If the stream can be identified, the data can be prefetched, used once, and discarded. Yet another type of reference pattern is a loop reference pattern, in which data may be accessed a fixed number of times (i.e. the number of times the loop is executed) and then may not be accessed for a relatively long period of time.
None of the prefetch algorithms described above is most effective for all memory access patterns. In order to maximize prefetching and caching efficiency, it is therefore desirable to identify early in a microprocessor pipeline which type of memory access pattern is to be performed by an instruction stream being fetched in order to employ the appropriate prefetch algorithm for that instruction stream. The earlier in the pipeline that the prefetch algorithm may be determined, the earlier the prefetch may be initiated and consequently the lower the effective latency of the accessed data may be.