1. Field of the Invention
This invention relates to microprocessors and, more particularly, to an efficient method for achieving speculative pre-fetching of data from system memory.
2. Description of Related Art
In modern microprocessors, one or more processor cores, or processors, may be included in the microprocessor, wherein each processor is capable of executing instructions of a software application. Modern processors are typically pipelined wherein the processors are comprised of one or more data processing stages connected in series with storage elements placed between the stages. The output of one stage is made the input of the next stage during each transition of a clock signal. Ideally, every clock cycle produces useful execution of an instruction for each stage of the pipeline. In the event of a stall, which may be caused by a branch misprediction, i-cache miss or d-cache miss, data dependency, or other reason, no useful work may be performed for that particular instruction during the clock cycle. For example, a d-cache miss may require several clock cycles to service and, thus, decrease the performance of the system as no useful work is being performed during those clock cycles. The overall performance hit may be reduced by overlapping the d-cache miss service with out-of-order execution of multiple instructions per clock cycle. However, a stall of several clock cycles still reduces the performance of the processor due to in-order retirement that may prevent complete overlap of the stall cycles with useful work.
Further, system memory may comprise two or three levels of cache hierarchy for a processor core or for multiple cores on a microprocessor. Later levels in the hierarchy of the system memory may include access via a memory controller to dynamic random-access memory (DRAM), dual in-line memory modules (DIMMs), and a hard disk. Access to these later levels of memory requires a significant number of clock cycles. The multiple levels of caches that may be shared among multiple cores on a microprocessor help to alleviate this latency when there is a cache hit. However, as cache sizes increase and later levels of the cache hierarchy are placed farther away from the processor core, the latency to determine whether a requested memory line exists in a cache also increases. This latency becomes more problematic for processor cores that access each level of cache in a serial manner. Should a processor core have a memory request followed by a serial access of each level of cache where there is no hit, followed by a DRAM access, the overall latency to service the memory request may become a substantial penalty.
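By way of illustration only, the serial-access penalty described above may be sketched as a simple additive model. The function names and the cycle counts used in the example are hypothetical and are chosen for arithmetic convenience; they do not describe any particular processor.

```python
def serial_hit_latency(lookup_cycles, hit_level):
    """Cycles until a hit is found when cache levels are probed one after another.

    lookup_cycles: per-level lookup cost, nearest cache first (assumed values).
    hit_level: zero-based index of the first level that holds the line.
    """
    return sum(lookup_cycles[: hit_level + 1])


def serial_miss_latency(lookup_cycles, dram_cycles):
    """Cycles to service a request that misses every cache level and only
    then starts its DRAM access."""
    return sum(lookup_cycles) + dram_cycles


# Hypothetical three-level hierarchy and DRAM access cost, in clock cycles.
LOOKUPS = [3, 12, 40]
DRAM = 200

print(serial_hit_latency(LOOKUPS, 1))      # hit in the second level
print(serial_miss_latency(LOOKUPS, DRAM))  # miss everywhere, then DRAM
```

Under these assumed numbers, a request that misses all three levels pays the full lookup cost of every level before the DRAM access even begins.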
One solution for reducing the access time for a memory request is to use a speculative request to the cache hierarchy and to DRAM. However, each access to DRAM may inadvertently close a DRAM page to other processor cores. If the requested memory line is in one of the caches, then the access to DRAM, and the inadvertent closing of a DRAM page, were unnecessary. Also, the memory controller and data bus are used for unnecessary accesses. If the cache hit rate is high, resources external to the processor core and needed by other processor cores may be made unavailable by the unnecessary requests sent to DRAM.
To remove unnecessary requests to DRAM and subsequent unnecessary resource consumption, a cancellation scheme may be employed that uses the hit status of all the caches. However, the hit status of all the caches may not be known until the speculative request has already been sent to DRAM. Alternatively, the speculative request may be delayed, but then the benefit is reduced or removed altogether.
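The timing tension between speculation and cancellation may be sketched, under stated assumptions, as follows. The model assumes that a speculative DRAM request leaves the processor a fixed number of cycles (`issue_delay`, a hypothetical parameter) after the memory request begins, and that it can be cancelled only if the cache hit status is known by that time. All names and cycle counts are illustrative and are not taken from any particular design.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Outcome:
    latency: int        # cycles until the requested line is available
    dram_accessed: bool  # whether a request actually reached DRAM
    wasted: bool         # DRAM was accessed although a cache held the line


def speculative_request(lookup_cycles: List[int], dram_cycles: int,
                        hit_level: Optional[int], issue_delay: int) -> Outcome:
    """Model one memory request with a speculative DRAM access.

    hit_level is the zero-based cache level that hits, or None on a miss in
    every level. Cancellation succeeds only when the hit is known no later
    than the cycle at which the speculative request is sent.
    """
    if hit_level is None:
        resolved_at = sum(lookup_cycles)
        # DRAM access overlaps the cache lookups instead of following them.
        latency = max(resolved_at, issue_delay + dram_cycles)
        return Outcome(latency, dram_accessed=True, wasted=False)

    resolved_at = sum(lookup_cycles[: hit_level + 1])
    cancelled = resolved_at <= issue_delay
    return Outcome(resolved_at, dram_accessed=not cancelled, wasted=not cancelled)


LOOKUPS, DRAM = [3, 12, 40], 200  # hypothetical cycle counts

# Immediate speculation: a full miss is faster, but a last-level hit wastes DRAM.
print(speculative_request(LOOKUPS, DRAM, None, issue_delay=0))
print(speculative_request(LOOKUPS, DRAM, 2, issue_delay=0))

# Speculation delayed until all hit statuses are known: no waste, no speed-up.
print(speculative_request(LOOKUPS, DRAM, None, issue_delay=55))
print(speculative_request(LOOKUPS, DRAM, 2, issue_delay=55))
```

With these assumed values, issuing the speculative request immediately shortens a full miss from 255 to 200 cycles but sends an unnecessary request to DRAM on a last-level hit, whereas delaying the request by 55 cycles avoids the unnecessary access and returns the full-miss latency to 255 cycles.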
In view of the above, an efficient method for achieving speculative pre-fetching of data from system memory is desired.