1. Technical Field
Embodiments of the present invention are directed to prefetching from a memory device, such as a dynamic random access memory (DRAM) device. More particularly, embodiments of the present invention are directed to apparatuses and methods of dynamically controlling prefetch injection in a manner that exploits the memory device state and workload stride to reduce page misses without introducing new ones.
2. Discussion of the Related Art
DRAM page misses often result in poor performance of the overall memory system. FIG. 1 illustrates an example of access patterns observed in the Standard Performance Evaluation Corporation (SPEC) floating point (SPECfp) “Swim” benchmark application. The SPECfp benchmark applications measure a processor's floating point performance and the interaction of the central processing unit (CPU) with main memory and cache. Bank 0 has streams 0xa96eXXX and 0xa9a3XXX (each being the address of a 4K page). Bank 1 has streams 0x808dXXX and 0x8ceXXX (each being the address of a 4K page). Within each of these streams, successive addresses are one cache line apart (the cache line size being 64 bytes), that is, each stream has a stride of +1. The read transactions of these streams therefore have a predictable, unit cache line stride. In both banks 0 and 1, prefetching future accesses in each stream would reduce page misses. The access patterns in FIG. 1 illustrate an example of a high-bandwidth (BW) application with a high page miss rate and a regular stride (such as Swim from SPECfp), in which multiple streams collide with each other in the same bank and cause page misses. Prefetching in this scenario is beneficial if: (1) prefetching occurs when a page is already open, so that no page misses are introduced by the prefetching; and (2) prefetching occurs far enough ahead that the page miss is entirely eliminated, as shown in the example of FIG. 1.
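The collision described above can be illustrated with a toy model. The following Python sketch is not taken from FIG. 1 or from any embodiment; it assumes a single bank that holds one open page at a time, two unit-stride streams whose pages map to that bank, and an idealized, unbounded prefetch buffer. The function name `count_page_misses`, the `prefetch_depth` parameter, and the use of `000` for the unspecified low address digits are all hypothetical choices made for illustration.

```python
LINE = 64    # cache line size in bytes (from the text)
PAGE = 4096  # 4K page size (from the text)

def count_page_misses(streams, lines_per_stream, prefetch_depth=0):
    """Interleave unit-stride read streams on one bank and count page misses.

    Assumptions (not from the source): one open page per bank, the first
    access to an idle bank counts as a miss, and prefetched lines sit in an
    unbounded buffer until they are consumed.
    """
    open_page = None   # page currently open in the bank
    buffered = set()   # hypothetical prefetch buffer of line addresses
    misses = 0
    for i in range(lines_per_stream):
        for base in streams:              # streams take turns, one line each
            addr = base + i * LINE
            if addr in buffered:
                buffered.discard(addr)    # served from the buffer, no DRAM access
                continue
            page = addr // PAGE
            if page != open_page:         # another stream's page is open
                misses += 1
                open_page = page
            # Condition (1): prefetch only while this page is open,
            # and only lines that lie within the same page.
            for k in range(1, prefetch_depth + 1):
                nxt = addr + k * LINE
                if nxt // PAGE == page:
                    buffered.add(nxt)
    return misses

bank0 = [0xa96e000, 0xa9a3000]  # low digits assumed to be 000 for illustration
print(count_page_misses(bank0, 64))                    # no prefetching
print(count_page_misses(bank0, 64, prefetch_depth=7))  # prefetch 7 lines ahead
```

In this model the two interleaved streams miss on every access without prefetching, while prefetching seven lines ahead on each page opening leaves one miss per eight lines per stream. The model only reduces misses; eliminating them entirely, as in condition (2), would require prefetching further ahead than this sketch does.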
In the paper by W. Lin, S. Reinhardt, and D. Burger, “Reducing DRAM Latencies with an Integrated Memory Hierarchy Design”, Proc., 7th International Symposium on High-Performance Computer Architecture, January 2001, a scheme was proposed in which prefetches are sent to a Level 2 (L2) cache. This scheme prefetches blocks that are spatially near the address of recent demand misses into the L2 cache only when the memory channel is idle and a DRAM page is open. That is, a central processing unit (CPU) prefetcher attempts to prefetch a wide range of addresses around a demand miss when the memory channel (and system) is idle and a page is open. Prefetching into an L2 cache may pollute the L2 cache with prefetched data, and may degrade performance if the prefetched data goes unused while the line evicted to make room for it is needed in the future.
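The gating condition of such a scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of the cited paper: the function name `prefetch_candidates`, the `radius` parameter, and the restriction of candidates to the currently open page are hypothetical, as are the 64-byte line and 4K page sizes used here.

```python
from typing import List, Optional

LINE = 64    # assumed cache line size in bytes
PAGE = 4096  # assumed DRAM page size in bytes

def prefetch_candidates(miss_addr: int, channel_idle: bool,
                        open_page: Optional[int], radius: int = 2) -> List[int]:
    """Return line addresses near a demand miss that may be prefetched.

    Prefetching is gated: nothing is returned unless the memory channel is
    idle and a page is open. Limiting candidates to the open page is an
    assumption of this sketch.
    """
    if not channel_idle or open_page is None:
        return []
    line = miss_addr - miss_addr % LINE      # align to a cache line boundary
    candidates = []
    for k in range(-radius, radius + 1):
        addr = line + k * LINE
        if k != 0 and addr >= 0 and addr // PAGE == open_page:
            candidates.append(addr)
    return candidates
```

For example, `prefetch_candidates(0xa96e040, True, 0xa96e)` yields the three neighboring lines that fall inside page 0xa96e, while the same call with `channel_idle=False` yields an empty list.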
In traditional PC platform partitioning, the prefetcher in the CPU is not aware of memory system details (specifically, the DRAM state). Additionally, the memory controller in the chipset is aware of neither the current CPU state nor the CPU prefetcher algorithm or state. Previous attempts to fix these problems have introduced a prefetcher in the chipset with latency reduction aims similar to those of the CPU prefetcher, but which is less sophisticated because chipset price points do not permit as many gates for the prefetcher. Previous chipset prefetchers have therefore not been as efficient, because they do not take the DRAM state into account and are constrained to use fewer gates by the chipset price points.
Moreover, chipset prefetchers alleviate the cache pollution problems of the CPU prefetcher to some extent by prefetching into a separate chipset prefetch buffer. A simplistic chipset prefetcher may inject prefetches on reads subject to certain conditions, without considering how well the memory system reacted to the previously injected prefetches. These chipset prefetchers function essentially as open-loop control systems. The prefetches are injected by the chipset prefetchers with the goal of reducing read latency. While such a technique provides performance gains for latency-sensitive applications, a performance loss is observed in other scenarios.
Accordingly, what is needed is a chipset prefetch system (i.e., chipset prefetcher) and method that does not hurt the performance of applications that are not able to efficiently utilize prefetching, while maximizing the performance of applications that benefit from prefetch operations.