1. Field of the Invention
The present invention generally relates to microprocessors and to multiprocessor architectures and, more particularly, to architectures with data caches implementing data prefetching.
2. Description of the Prior Art
Future microprocessor designs will require new design trade-offs to address new constraints on architectures. The increasing computer power available per chip with the use of chip multiprocessors is not matched by a commensurate increase in signal I/O bandwidth to satisfy the inherent memory bandwidth requirements, leading to a potentially unbalanced and inefficient design.
At the same time, the use of SRAM as on-chip memory to provide a significant reduction in bandwidth requirements is limited by its comparatively low density and high power dissipation. SRAM memories also suffer from manufacturability constraints that limit future access speeds.
A promising solution to these multiple constraints is the adoption of embedded DRAM techniques for high-capacity, high-density on-chip caches. Embedded DRAM (eDRAM) uses logic fabrication technology to build the familiar 1T DRAM cell in a logic process, offering a significant increase in capacity per unit area. While eDRAM offers attractive density and capacity per unit area, it has higher latency than SRAM-based on-chip solutions.
By using eDRAM in conjunction with chip-multiprocessor solutions, it is possible to deliver increased performance with reduced memory bandwidth requirements, and to offer attractive system solutions transcending the constraints of current technology. Integrating large capacity eDRAM caches on chip makes high bandwidth access to high capacity on-chip storage a reality by offering both wide data paths, and higher on-chip transfer speeds.
To deliver on the promise of this new memory hierarchy paradigm, prefetching may be used to decouple application access latency from the technology, and the available bandwidth may be used for latency hiding. To avoid the area cost and constraints imposed on SRAM in future technologies, an efficient prefetch cache architecture may be used. As an example, the BlueGene/L system, which uses eDRAM for on-chip cache and prefetch caches, is described in “Blue Gene/L compute chip: Memory and Ethernet”, published in IBM Journal of Research and Development, 49(2/3), 2005 by M. Ohmacht, R. A. Bergamaschi, S. Bhattacharya, A. Gara, M. E. Giampapa, B. Gopalsamy, R. A. Haring, D. Hoenicke, D. J. Krolak, J. A. Marcella, B. J. Nathanson, V. Salapura, and M. E. Wazlowski.
In the BlueGene/L compute chip, instead of a standard SRAM-based L2 cache, a small private prefetch cache is integrated in the memory hierarchy between a private first-level 32 KB data cache and a third-level on-chip 4 MB eDRAM cache shared between the two processor cores on the chip. The prefetch cache's size is only 2 KB per processor.
The idea of prefetching data to improve the data cache hit rate, by fetching data from memory before the processor actually needs them, is widely employed and explored. The underlying idea is to overlap memory access time with computation, and thus to improve processor performance by reducing the number of stall cycles. Ideally, only data which are needed are prefetched ahead of time so they are ready to use when the processor needs them. Prefetching too much unneeded data into the prefetch cache reduces the memory bandwidth available to other participants on the memory bus and pollutes the prefetch cache, since prefetched data can displace useful data from the cache. These problems are even more pronounced for smaller prefetch data caches.
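The benefit and the pollution risk of prefetching can be illustrated with a minimal sketch. It assumes a hypothetical next-line prefetcher and a small buffer managed first-in first-out; the function name `demand_misses` and its parameters are illustrative only and do not describe any particular hardware. Prefetched lines are treated as arriving instantly, so memory latency and bandwidth are not modeled.

```python
from collections import deque


def demand_misses(trace, capacity, prefetch_depth):
    """Count demand misses for a small buffer holding `capacity` lines.

    Assumption for illustration: on every demand access to line n, the
    lines n+1 .. n+prefetch_depth are also brought into the buffer.
    The buffer is FIFO: the oldest line is dropped when it overflows.
    """
    buffer = deque(maxlen=capacity)
    misses = 0
    for line in trace:
        if line not in buffer:
            misses += 1          # processor would stall on this access
            buffer.append(line)
        for ahead in range(1, prefetch_depth + 1):
            if line + ahead not in buffer:
                buffer.append(line + ahead)  # may push out a useful line
    return misses


# Sequential stream: the prefetched line is always the next one needed.
sequential = list(range(8))
# Reused, non-sequential lines: the prefetched lines are never needed.
reused = [0, 10, 20, 30] * 2

print(demand_misses(sequential, capacity=4, prefetch_depth=0))
print(demand_misses(sequential, capacity=4, prefetch_depth=1))
print(demand_misses(reused, capacity=4, prefetch_depth=0))
print(demand_misses(reused, capacity=4, prefetch_depth=1))
```

In this sketch, prefetching removes almost all misses for the sequential stream, while for the reused lines the unneeded prefetches displace useful data from the four-line buffer and the miss count rises.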
To ensure efficient use of a small prefetch cache, the careful management of the prefetch cache is extremely important. For optimal performance, the prefetched cache lines placed in the L2 prefetch cache should not displace some other cache line which is still in use by the processor. Which line to displace from the prefetch cache is determined by the prefetch cache replacement policy.
The replacement policy captures reference behavior and helps to determine how data streams are aged out of the prefetch cache to make room for new data lines. There are a number of different standard replacement policies available, such as random, round-robin, some variants of round-robin, and least recently used (LRU), to name a few.
Whereas some of the replacement policies, such as random or round-robin, are simple to implement in hardware, these approaches have the disadvantage that they can displace lines from the prefetch cache that are still in use by the processor. Similarly, these policies could displace lines allocated in the prefetch cache for recently issued prefetch requests to the L3 cache for which data are still in flight from the L3 cache.
Alternatively, some of the replacement policies, such as LRU, deliver good performance but result in a complex hardware implementation that requires many resources and, as a consequence, increases the power consumption of the circuitry.
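The trade-off between these two families of policies can be illustrated with a minimal software sketch. The classes `RoundRobinCache` and `LRUCache` and the helper `count_hits` are hypothetical names introduced here for illustration; they model a tiny fully associative cache and are not a description of any hardware implementation.

```python
from collections import OrderedDict


class RoundRobinCache:
    """Fully associative cache; the victim pointer advances on every fill."""

    def __init__(self, num_lines):
        self.lines = [None] * num_lines
        self.victim = 0  # next slot to overwrite, whether in use or not

    def access(self, tag):
        if tag in self.lines:
            return True  # hit: round-robin keeps no usage state
        self.lines[self.victim] = tag
        self.victim = (self.victim + 1) % len(self.lines)
        return False


class LRUCache:
    """Fully associative cache; evicts the least recently used line."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # least recently used first

    def access(self, tag):
        if tag in self.lines:
            self.lines.move_to_end(tag)  # hit: refresh recency
            return True
        if len(self.lines) == self.num_lines:
            self.lines.popitem(last=False)  # evict least recently used
        self.lines[tag] = None
        return False


def count_hits(cache, trace):
    """Replay a trace of line tags and return the number of hits."""
    return sum(cache.access(tag) for tag in trace)


# Line "A" is reused throughout while other lines stream past it.
trace = ["A", "B", "A", "C", "A", "D", "A"]
print(count_hits(RoundRobinCache(2), trace))
print(count_hits(LRUCache(2), trace))
```

With this trace and two cache lines, the round-robin sketch overwrites line "A" while it is still being reused, whereas the LRU sketch retains it. The price is that LRU must update ordering state on every access, which is the bookkeeping that makes a hardware implementation more complex.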
The replacement policy employed for efficient management of a prefetch data cache determines both the prefetch cache performance and the complexity of the prefetch cache design. It would be highly desirable to provide a replacement policy for a prefetch cache system that enables high performance of the prefetch cache, expressed as a high cache hit rate, and that at the same time can be implemented by a replacement mechanism that is not complex.