1. Field of the Invention
The invention relates generally to chipset prefetchers and, more particularly, to an apparatus and a method of prefetching from a memory device having interleaved channels that detects workload strides and intelligently injects prefetches without incurring additional page misses.
2. Description of the Related Art
A PC platform typically consists of one or more central processing units (CPUs) connected to a chipset via a high-speed interconnect (typically a Front Side Bus, or FSB). The chipset contains a memory controller for interfacing to main memory composed of dynamic random access memory (DRAM). If a page miss occurs during DRAM operations, the open page must first be written back to memory (precharge). Then, the new memory page must be opened (activate) before a READ/WRITE operation can proceed. Precharge and activate operations take time and slow down memory accesses, resulting in inefficient use of the memory bus (reduced bandwidth) and a loss in performance of the device employing DRAM.
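As an illustration of the cost of a page miss, the following minimal sketch models a single DRAM bank under an open-page policy. The class name `Bank` and the cycle counts `T_CAS`, `T_RCD`, and `T_RP` are assumptions chosen for the example; they are not taken from any particular device.

```python
from typing import Optional

# Illustrative timing values in cycles (assumptions, not from any datasheet).
T_CAS = 5  # column access to a page that is already open
T_RCD = 5  # activate: open a new page
T_RP = 5   # precharge: write back and close the currently open page


class Bank:
    """Hypothetical model of one DRAM bank that keeps its last page open."""

    def __init__(self) -> None:
        self.open_page: Optional[int] = None  # no page is open initially

    def access(self, page: int) -> int:
        """Return the cost in cycles of a READ/WRITE to `page`."""
        if self.open_page == page:
            return T_CAS  # page hit: column access only
        cost = T_RCD + T_CAS  # the new page must be activated first
        if self.open_page is not None:
            cost += T_RP  # page miss: precharge the open page before activating
        self.open_page = page
        return cost
```

In this model, a page miss costs three times as much as a page hit, which is why prefetches that hit an already open page are attractive.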
System performance may be enhanced and effective memory access latency may be reduced by anticipating the needs of a processor. If data needed by a processor in the near future is predicted, then this data can be fetched in advance or “prefetched”, such that the data is buffered/cached and available to the processor with low latency. A prefetcher that accurately predicts a READ request and issues it in advance of an actual READ can thus significantly improve system performance.
A prefetcher can be implemented in the CPU or in the chipset, and prefetching schemes have been routinely used on the CPU side. However, the typical interconnect between the CPU(s) and the chipset has no provision either for the CPU to share workload information with the chipset or for the chipset to share DRAM state (open/closed pages) information with the CPU. Thus, a prefetcher implemented on the CPU side is limited in its ability to intelligently issue prefetches based on accurate DRAM state. This can be a significant performance constraint. On the other hand, with some additional hardware, the chipset can detect the workload strides that are important from a performance perspective. As a result, by issuing prefetch requests in a DRAM state aware manner, the chipset prefetcher can provide a significant performance gain for applications that have high bandwidth, a high page miss rate, and a predictable READ transaction access pattern. The performance gain results from reduced DRAM page misses and lower latency of loads that hit the prefetch buffer.
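As an illustration of stride detection in general, the following minimal sketch tracks the difference between consecutive READ addresses and proposes a prefetch address once the same nonzero stride has repeated. The class `StrideDetector`, its `threshold` parameter, and the single-stream design are assumptions made for the example, not a description of any particular chipset.

```python
from typing import Optional


class StrideDetector:
    """Hypothetical single-stream stride detector."""

    def __init__(self, threshold: int = 2) -> None:
        self.last_addr: Optional[int] = None
        self.last_stride: Optional[int] = None
        self.count = 0              # consecutive repeats of the current stride
        self.threshold = threshold  # repeats required before predicting

    def observe(self, addr: int) -> Optional[int]:
        """Record a READ address; return a predicted next address or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                self.count += 1
            else:
                # A new nonzero stride starts a fresh run; zero resets it.
                self.count = 1 if stride != 0 else 0
            self.last_stride = stride
            if self.count >= self.threshold:
                prediction = addr + stride
        self.last_addr = addr
        return prediction
```

A DRAM state aware prefetcher would additionally check that the predicted address falls in a currently open page before issuing the prefetch; that check is omitted here.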
W. Lin, S. Reinhardt, and D. Burger, in the paper “Reducing DRAM Latencies with an Integrated Memory Hierarchy Design,” Proc., 7th International Symposium on High-Performance Computer Architecture, January 2001, offered a proposal where CPU prefetches are sent to a Level 2 (L2) cache. This scheme prefetches blocks of data that are spatially near the address of recent demand misses into the L2 cache only when the memory channel is idle and a DRAM page is opened. That is, a CPU prefetcher attempts to prefetch a wide range of addresses around a demand miss when the memory channel (and system) is idle and a page is opened. However, prefetching into an L2 cache may pollute the L2 cache with prefetched data, and performance may suffer if the prefetched data is not used and the line evicted to make room for it is required in the future.
Prefetch performance may vary depending on whether prefetching is implemented in a lock step or in an interleaved channel memory system. Some key features of a lock step channel memory system include: (1) each channel works on the same cache line address; (2) each channel works on part of the cache line data for the address; (3) channels progress on READ/WRITE commands in lock step; and (4) the page table state is the same for each channel. On the other hand, some key features of an interleaved channel memory system include: (1) the access address is mapped to the different channels by a mapping function; (2) each channel works on a different address; (3) each channel may process its accesses at a different rate; and (4) the page table state may be different for each channel. A prefetcher for a lock step memory system has been described in U.S. Patent Application Publication 2004/0123043 A1.
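The difference between the two organizations can be sketched as follows. The constants `LINE_BYTES` and `NUM_CHANNELS` and the simple modulo mapping are assumptions chosen for illustration; an actual mapping function may differ.

```python
from typing import List

LINE_BYTES = 64    # assumed cache line size in bytes
NUM_CHANNELS = 2   # assumed number of memory channels


def lock_step_channels(addr: int) -> List[int]:
    """Lock step: every channel works on the same cache line address."""
    return list(range(NUM_CHANNELS))


def interleaved_channel(addr: int) -> int:
    """Interleaved: a mapping function selects one channel per address.

    Here the mapping is the cache line index modulo the channel count.
    """
    return (addr // LINE_BYTES) % NUM_CHANNELS
```

With this mapping, consecutive cache lines alternate between channels, so each channel sees a different address and may hold a different open page.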
Interleaved channels have performance advantages over lock step channels. For example, interleaved channels can process back-to-back DRAM page misses in parallel and overlap page miss latency over a longer burst length. However, prefetching in an interleaved channel memory system presents some challenges. First, since access addresses are distributed across channels by a mapping function that may be non-linear, stride detection becomes more difficult. Second, since DRAM state aware prefetching typically issues only prefetches that are page hits and the page table state may be different for each channel, some prefetches may be rejected in some (but not all) interleaved channels. Finally, since each channel schedules its DRAM commands for demands and prefetches independently, based on its scheduling policy, prefetches within each channel may progress at different rates.
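The effect of a non-linear mapping on stride detection can be illustrated with a hypothetical XOR-based channel hash. The function `xor_channel` and the choice of address bits are assumptions for this example only. A stream with a constant stride of one cache line no longer appears as a constant stride when viewed from a single channel.

```python
from typing import List, Set

LINE_BYTES = 64  # assumed cache line size in bytes


def xor_channel(addr: int) -> int:
    """Hypothetical two-channel hash: XOR of line-index bits 0 and 6."""
    line = addr // LINE_BYTES
    return (line & 1) ^ ((line >> 6) & 1)


def channel_deltas(addrs: List[int], channel: int) -> Set[int]:
    """Distinct address deltas between successive accesses on one channel."""
    seen = [a for a in addrs if xor_channel(a) == channel]
    return {b - a for a, b in zip(seen, seen[1:])}


# A sequential stream: 128 consecutive cache lines, constant 64-byte stride.
stream = [i * LINE_BYTES for i in range(128)]
```

For this stream, `channel_deltas(stream, 0)` contains both 128 and 192, and `channel_deltas(stream, 1)` contains both 64 and 128, so a detector that observes only one channel's addresses sees the stride change whenever the hash flips.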
In light of the above, what is needed is a chipset prefetcher and method that works with a memory controller having interleaved memory channels, detects workload strides, and intelligently injects prefetches without incurring additional page misses.