1. Field of the Invention
The present invention generally relates to memory devices for computer systems, and more particularly to a method of managing writebacks from a cache memory.
2. Description of the Related Art
The basic structure of a conventional computer system includes one or more processing units which are connected to various peripheral devices (including input/output devices such as a display monitor, keyboard, and permanent storage device), a memory device such as random access memory (RAM) that is used by the processing units to carry out program instructions and store operand data, and firmware which seeks out and loads an operating system from one of the peripherals (usually the permanent storage device) whenever the computer is first turned on. The processing units typically communicate with the peripheral devices by means of a generalized interconnect or bus. A computer system may have many additional components such as various adapters or controllers, and serial, parallel and universal bus ports for connection to, e.g., modems, printers or network interfaces.
In a symmetric multi-processor (SMP) computer, all of the processing units are generally identical, that is, they all use a common set or subset of instructions and protocols to operate, and generally have the same architecture. A typical architecture includes a processor core having a plurality of registers and execution units, which carry out program instructions in order to operate the computer. The processing unit can also have one or more caches, such as an instruction cache and a data cache, which are implemented using high speed memory devices. Caches are commonly used to temporarily store values that might be repeatedly accessed by a processor, in order to speed up performance by avoiding the longer step of loading the values from a main memory device. These caches are referred to as “on-board” when they are integrally packaged with the processor core on a single integrated chip. A processing unit can include additional caches, such as a level 2 (L2) cache which may support on-board (level 1) instruction and data caches. An L2 cache acts as an intermediary between the main (system) memory and the on-board caches, and can store a much larger amount of information than the on-board caches, but at a longer access penalty. Additional cache levels may be provided, e.g., L3, etc.
A cache has many blocks which individually store the various instruction or data values. The blocks in any cache can be divided into groups of blocks called sets or congruence classes. A set is the collection of cache blocks that a given memory block can reside in. For any given memory block, there is a unique set in the cache that the block can be mapped into, according to preset mapping functions. The number of blocks in a set is referred to as the associativity of the cache, e.g., 2-way set associative means that for any given memory block there are two blocks in the cache that the memory block can be mapped into; however, several different blocks in main memory can be mapped to any given set. A 1-way set associative cache is direct mapped, that is, there is only one cache block that can contain a particular memory block. A cache is said to be fully associative if a memory block can occupy any cache block, i.e., there is one congruence class, and the address tag is the full address of the memory block.
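The mapping of a memory block to a congruence class can be illustrated with a minimal sketch. The modulo mapping function, the 64-byte block size, the 128-set geometry, and the name `split_address` are assumptions chosen for illustration; the text above only requires that some preset mapping function exist.

```python
def split_address(addr, block_size=64, num_sets=128):
    """Split a byte address into (tag, set_index, block_offset).

    Assumes a simple modulo mapping; real caches may use other functions.
    """
    block_number = addr // block_size      # which memory block the byte lies in
    offset = addr % block_size             # position of the byte inside the block
    set_index = block_number % num_sets    # the one congruence class for this block
    tag = block_number // num_sets         # remaining address bits kept as the tag
    return tag, set_index, offset


# Two memory blocks that are num_sets blocks apart map to the same set
# but carry different tags, so they compete for the same ways.
a = 0x12345
b = a + 128 * 64
print(split_address(a))
print(split_address(b))

# With a single congruence class (fully associative), the tag is the
# entire block address.
print(split_address(a, num_sets=1))
```

In this sketch the associativity does not appear in the mapping at all: it only determines how many blocks with the same set index can be resident at once.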
An exemplary cache line (block) includes an address tag field, a state bit field, an inclusivity bit field, and a value field for storing the actual instruction or data. The state bit and inclusivity bit fields are used to maintain cache coherency in a multiprocessor computer system (to indicate the validity of the value stored in the cache, i.e., consistency with the overall system memory architecture). The address tag is usually a subset of the full address of the corresponding memory block. A compare match of an incoming address with one of the tags within the address tag field indicates a cache “hit.” The collection of all of the address tags in a cache (and sometimes the state bit and inclusivity bit fields) is referred to as a directory, and the collection of all of the value fields is the cache entry array.
When all of the blocks in a congruence class for a given cache are full and that cache receives a request, whether a “read” or “write”, to a memory location that maps into the full congruence class, the cache must make one of the blocks in that class available for the new operation. The cache chooses a block by one of a number of means known to those skilled in the art (least recently used (LRU), random, pseudo-LRU, etc.). If the data in the chosen block has been modified, that data is written (cast out) to the next lower level in the memory hierarchy, which may be another cache (in the case of the L1 or on-board cache) or main memory (in the case of an L2 or higher cache). By the principle of inclusion, the lower level of the hierarchy will already have a block available to hold the written modified data. If the data in the chosen block has not been modified, the value in that block can simply be abandoned and not written to the next lower level in the hierarchy. This process of freeing up a block from one level of the cache hierarchy is known as an eviction. At the end of this process, the cache no longer holds a copy of the evicted block. When a device such as the CPU or system bus needs to know if a particular cache line is located in a given cache, it can perform a “snoop” request to see if the address is in the directory for that cache.
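The eviction behavior described above can be sketched for a single congruence class. This is an illustrative model only, assuming true LRU replacement and a one-bit modified flag per block; the class name `CacheSet` and its methods are hypothetical and do not correspond to any particular hardware design.

```python
from collections import OrderedDict


class CacheSet:
    """One congruence class with LRU replacement (illustrative only)."""

    def __init__(self, ways):
        self.ways = ways
        # Maps tag -> modified flag, ordered from least to most recently used.
        self.blocks = OrderedDict()

    def snoop(self, tag):
        """Directory lookup that does not disturb the LRU order."""
        return tag in self.blocks

    def access(self, tag, is_write):
        """Read or write a block; return (hit, cast_out_tag or None)."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)                    # now most recently used
            self.blocks[tag] = self.blocks[tag] or is_write
            return True, None
        cast_out = None
        if len(self.blocks) >= self.ways:                   # set is full: evict
            victim, modified = self.blocks.popitem(last=False)
            if modified:
                cast_out = victim    # modified data is written to the next lower level
            # An unmodified victim is simply abandoned.
        self.blocks[tag] = is_write
        return False, cast_out


s = CacheSet(ways=2)
s.access(1, is_write=True)           # miss, block 1 installed as modified
s.access(2, is_write=False)          # miss, block 2 installed unmodified
print(s.access(3, is_write=False))   # evicts block 1, which must be cast out
print(s.access(4, is_write=False))   # evicts block 2, which is abandoned
print(s.snoop(1), s.snoop(3))
```

Only the modified victim produces a castout; the unmodified one disappears without any traffic to the lower level, which is the distinction the paragraph above draws.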
Today's multi-core designs present memory controllers with increasing challenges to keep pace in regard to bandwidth and latency. Because many processor cores target a single memory controller, locality is especially difficult to maintain, which makes it harder to schedule sequential accesses to main memory that exhibit spatial locality. In addition, even though memory I/O frequencies are constantly increasing, critical DRAM timing parameters are not improving at the same rate. All of these factors exacerbate a number of issues facing memory controllers. In particular, with respect to memory writes, they aggravate problems with bus turnaround penalties (especially write-to-read or vice versa), page mode options, and bursty behavior of reads and writes.
Modern processors can force modified data to be cast out of their lowest-level caches into memory due to an LRU eviction policy. For example, in U.S. Patent Application Publication Nos. 2011/0276762 (now U.S. Pat. No. 8,838,901) and 2011/0276763 (now U.S. Pat. No. 8,683,128), a method is described to intelligently schedule writebacks of modified data to memory by utilizing the backing of the lowest-level cache to identify castouts that can be scheduled to memory before they become forced writebacks. This approach addresses the problems experienced in current memory controllers (as described above) by leveraging the lowest-level cache to virtually expand the visibility of the memory controller.
FIG. 1 illustrates an exemplary data processing system 100 according to the aforementioned applications. Data processing system 100 includes one or more processor complexes 102, which may be implemented as a chip multiprocessor (CMP) or a multi-chip module (MCM). Processor complex 102 includes at least one processor core 104, which includes logic for processing data under the direction of instructions. Each processor core 104 is capable of simultaneously executing multiple independent hardware threads of execution. Each processor core 104 is supported by a cache hierarchy including one or more upper level caches 106 and a lowest level cache 108, providing processor cores 104 with low latency access to instructions and data retrieved from system memory. While it is typical for at least the highest level cache (i.e., that with the lowest access latency) to be on-chip with the associated core 104, the lower levels of cache memory (including lowest level cache 108) may be implemented as on-chip or off-chip, in-line or lookaside cache, which may be fully inclusive, partially inclusive, or non-inclusive of the contents of the upper levels of cache. The lowest-level cache 108 can be shared by multiple processor cores 104, and further can optionally be configured as a victim cache.
Processor complex 102 additionally includes one or more memory controllers (MCs) 110 each controlling read and write access to system (or main) memory, which is the lowest level of storage addressable by the real address space of processor complex(es) 102. Each memory controller 110 is coupled by a memory bus 112 to at least one respective memory channel 120, each of which includes one or more ranks 122 of system memory. A rank 122 can include multiple memory chips 124, which may in turn each contain multiple banks 130 for storing data. The system is not constrained to a particular memory technology but may employ dynamic random access memory (DRAM) for the system memory because of its low cost and high bit density. Each memory channel 120 is connected to one or more dual inline memory modules, each containing numerous DRAM memory chips. These DRAM memory chips are arranged logically into one or more independently accessible banks, and the banks are partitioned into pages. A given memory controller includes a physical read queue that buffers data read from the system memory via the memory bus, and a physical write queue that buffers data to be written to the system memory via the memory bus. The memory controller grants priority to write operations over read operations on the memory bus based upon a number of dirty cache lines in the lowest level cache memory.
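One way to picture a priority decision based on the number of dirty cache lines is the following sketch. It is not the method of the cited applications, whose details are not given here; the fixed threshold, the function name `pick_next`, and the representation of the two queues as simple address queues are all assumptions made for illustration.

```python
from collections import deque


def pick_next(read_queue, write_queue, dirty_lines, dirty_threshold=1024):
    """Choose the next memory bus operation.

    Returns ("read" | "write", address), or None if both queues are empty.
    Writes are preferred once the lowest level cache holds at least
    dirty_threshold dirty lines (a hypothetical policy parameter).
    """
    order = [("read", read_queue), ("write", write_queue)]
    if dirty_lines >= dirty_threshold:
        order.reverse()                  # many dirty lines: drain writes first
    for kind, queue in order:
        if queue:
            return kind, queue.popleft()
    return None


reads = deque([0x100])
writes = deque([0x200])
print(pick_next(reads, writes, dirty_lines=10))    # few dirty lines: read goes first
print(pick_next(reads, writes, dirty_lines=10))    # no reads left: write is serviced
```

A single threshold is the simplest possible policy; a real controller would also have to weigh the bus turnaround penalty of switching between reads and writes.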