Die-stacking technology enables multiple layers of Dynamic Random Access Memory (DRAM) to be integrated with single-core or multicore processors. Die-stacking technologies provide a way to tightly integrate multiple disparate silicon die with high-bandwidth, low-latency interconnects. The implementation may involve vertical stacking, as illustrated in FIG. 1A, in which a plurality of DRAM layers 100 are stacked above a multicore processor 102. Alternatively, as illustrated in FIG. 1B, horizontal stacking of the DRAM 100 and the processor 102 can be achieved on an interposer 104. In either case, the processor 102 (or each core thereof) is provided with a high-bandwidth, low-latency path to the stacked memory 100.
Computer systems typically include a processing unit, a main memory, and one or more cache memories. A cache memory is a high-speed memory that acts as a buffer between the processor and the main memory. Although smaller than the main memory, the cache memory typically has an appreciably faster access time than the main memory. Memory subsystem performance can be increased by storing the most commonly used data in smaller but faster cache memories.
Generally, the main memory of a computer system has a memory organization at the page level of granularity. Typically, a page may be a four kilobyte (KB) page, although any other page size may be defined for a particular implementation. Cache memory organization is generally at a cacheline level of granularity. A cacheline is a set of bytes stored in contiguous memory locations, which are treated as a unit for coherency purposes. As used herein, the terms “cacheline” and “cache block” are used interchangeably. The number of bytes in a cacheline may be varied according to design choice, and may be of any size.
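By way of illustration only, the relationship between the two granularities can be sketched as follows. The sketch assumes the four KB page mentioned above and a 64-byte cacheline; the 64-byte size, the function names, and the sample address are hypothetical choices made for the example and are not taken from any particular implementation.

```python
PAGE_SIZE = 4 * 1024     # four KB page, as in the example above
CACHELINE_SIZE = 64      # assumed for illustration; may be any size

def page_number(addr: int) -> int:
    """Index of the page that contains the byte address."""
    return addr // PAGE_SIZE

def cacheline_number(addr: int) -> int:
    """Index of the cacheline that contains the byte address."""
    return addr // CACHELINE_SIZE

def cacheline_within_page(addr: int) -> int:
    """Position of the containing cacheline inside its page."""
    return (addr % PAGE_SIZE) // CACHELINE_SIZE

def cachelines_per_page() -> int:
    """Number of cachelines that make up one page."""
    return PAGE_SIZE // CACHELINE_SIZE

if __name__ == "__main__":
    addr = 0x12345  # hypothetical byte address
    print(page_number(addr), cacheline_within_page(addr), cachelines_per_page())
```

Under these assumed sizes, a single page spans 64 cachelines, so fetching at page granularity moves 64 times as much data per access as fetching at cacheline granularity.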
Various workloads executed by a processor have memory access patterns that function more efficiently with either page level data or cacheline level data. In a conventional cache, only a single granularity is supported. This results in some workloads functioning more efficiently than other workloads. Alternatively, when a workload is launched (begins execution), the computer system could implement a partition of the cache memory so that a portion of the cache memory can be organized at a page level. However, partitioning the cache reduces the available size of the cache for storing data at cacheline granularity, which may lead to an increase in cache misses due to the reduced storage capacity for these cachelines.
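The capacity cost of such a partition can be illustrated with a short calculation. The sketch below is not a description of any particular cache; the function name, the 1 MB cache size, the 25 percent split, and the 64-byte cacheline are all assumptions chosen for the example.

```python
def partition_cache(total_bytes: int, page_percent: int,
                    page_size: int = 4096, line_size: int = 64):
    """Split a cache into a page-level region and a cacheline-level region.

    Returns (page_entries, cacheline_entries). The page region is rounded
    down to a whole number of pages; all sizes are hypothetical.
    """
    if not 0 <= page_percent <= 100:
        raise ValueError("page_percent must be between 0 and 100")
    # Bytes reserved for page-granularity storage, in whole pages.
    page_bytes = (total_bytes * page_percent // 100) // page_size * page_size
    # Whatever remains is available for cacheline-granularity storage.
    line_bytes = total_bytes - page_bytes
    return page_bytes // page_size, line_bytes // line_size

if __name__ == "__main__":
    total = 1 << 20  # assumed 1 MB cache
    print(partition_cache(total, 0))   # no partition
    print(partition_cache(total, 25))  # 25 percent organized at page level
```

With these assumed numbers, reserving a quarter of the cache for pages leaves 12,288 cacheline entries instead of 16,384, which is the reduced cacheline capacity that may increase cache misses.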