1. Field of the Invention
The present invention is generally directed toward a processor and, more specifically, to a method for reducing off-chip bandwidth requirements for a processor.
2. Description of the Related Art
Memory systems within a computer system have typically implemented multiple levels of cache memory or cache, e.g., a level 1 (L1) cache, a level 2 (L2) cache and a level 3 (L3) cache, in addition to main memory. Usually, one or more cache memory levels are implemented on-chip within a processor. In a typical case, both reads from main memory and writes to main memory are cached. To reduce the overhead of information transfer between cache and main memory, information has usually been transferred in a group, e.g., a cache line or multiple cache lines. A cache line size is architecturally dependent and usually expressed in bytes, e.g., a cache line may be between 32 and 128 bytes. Cache memories usually implement one of two write policies, i.e., a write-back policy or a write-through policy. In caches that implement a write-back policy, newly cached information is not actually written to main memory until a cache line that stores the information is needed for a new address. The cache memory may implement any number of different cache replacement policies, e.g., a least recently used (LRU) policy, when deciding which cache line(s) to boot from the cache. In a memory system implementing write-through cache, every time the processor writes to a cache location, the corresponding main memory location is also updated.
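The least recently used (LRU) replacement policy mentioned above can be illustrated with a minimal sketch. The class name `LRUCache`, the method `touch`, and the fully associative organization are assumptions made for illustration only; actual hardware typically approximates LRU within each set of a set-associative cache.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical fully associative cache that tracks line addresses only."""

    def __init__(self, capacity):
        self.capacity = capacity
        # Keys are line addresses, ordered from least to most recently used.
        self.lines = OrderedDict()

    def touch(self, address):
        """Access a line; return the address booted from the cache, if any."""
        evicted = None
        if address in self.lines:
            # Hit: mark the line as most recently used.
            self.lines.move_to_end(address)
        else:
            if len(self.lines) >= self.capacity:
                # Miss with a full cache: boot the least recently used line.
                evicted, _ = self.lines.popitem(last=False)
            self.lines[address] = None
        return evicted
```

In this sketch, a line that is accessed again moves to the most recently used position, so it outlives lines that were brought in later but not reused.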
Usually, write-back cache provides better performance at a slightly higher risk to memory system integrity. That is, write-back cache may save a memory system from performing many unnecessary write cycles to main memory, which can lead to measurable processor execution improvements. However, when write-back cache is implemented, writes to cache locations are only placed in cache and the main memory is not actually updated until the cache line is booted out of the cache to make room for another address in the cache. As a result, at any given time there can be a mismatch of information between one or more cache lines and corresponding addresses in main memory. When this occurs, the main memory is said to be stale, as the main memory does not contain the new information that has only been written to the cache. On the other hand, in memory systems that implement write-through cache, the main memory is never stale as the main memory is written at substantially the same time that the cache is written.
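The difference between the two write policies, and the resulting staleness of main memory, can be sketched as follows. The class `TinyCache`, its direct-mapped organization, and its one-word lines are simplifying assumptions for illustration. The sketch models stores only, so every resident line has been written and is flushed when it is booted.

```python
class TinyCache:
    """Hypothetical direct-mapped cache with one-word lines (stores only)."""

    def __init__(self, num_lines, memory, write_back):
        self.num_lines = num_lines
        self.memory = memory          # dict: address -> value (main memory)
        self.write_back = write_back  # True: write-back, False: write-through
        self.lines = {}               # index -> (address, value)
        self.memory_writes = 0        # write cycles issued to main memory

    def _write_memory(self, address, value):
        self.memory[address] = value
        self.memory_writes += 1

    def store(self, address, value):
        index = address % self.num_lines
        resident = self.lines.get(index)
        if self.write_back and resident is not None and resident[0] != address:
            # Write-back: main memory is updated only when the line is booted.
            self._write_memory(*resident)
        self.lines[index] = (address, value)
        if not self.write_back:
            # Write-through: main memory is updated on every store.
            self._write_memory(address, value)

    def is_stale(self, address):
        """True if main memory lacks the value currently cached for address."""
        resident = self.lines.get(address % self.num_lines)
        return (resident is not None and resident[0] == address
                and self.memory.get(address) != resident[1])
```

Ten stores to one address cost ten memory writes under write-through but none under write-back until the line is booted, at which point a single write carries the final value to main memory.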
Normally, stale memory is not a problem as a cache controller, implemented in conjunction with the cache, keeps track of which locations in the cache have been changed and, therefore, which locations in main memory may be stale. This has typically been accomplished by implementing an extra bit of memory, usually one bit per cache line, called a “dirty bit”. Whenever a write is cached, the “dirty bit” is set to provide an indication to the cache controller that when the cache line is reused for a different address, the information needs to be written to the corresponding address in main memory. In a typical memory system, the “dirty bit” has been implemented by adding an extra bit to a tag random access memory (RAM), as opposed to adding a dedicated separate memory. In various computer systems, it may be desirable for a cache controller to read old information from a cache line before storing new information to the cache line. For example, reading the old information before storing the new information may be done to detect errors using an error correction code (ECC) with an error correcting circuit and to update the ECC to take into account bits that change as a result of the new information.
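The role of the dirty bit can be sketched as follows. The names `Line` and `DirtyBitCache` are hypothetical, the cache is assumed to be direct-mapped with one-word lines, and the full address stands in for the tag. ECC handling is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Line:
    tag: int             # full address used as the tag (simplification)
    data: int
    dirty: bool = False  # set when the cached copy has been written


class DirtyBitCache:
    """Hypothetical direct-mapped write-back cache with one dirty bit per line."""

    def __init__(self, num_lines, memory):
        self.num_lines = num_lines
        self.memory = memory  # dict: address -> value (main memory)
        self.lines = [None] * num_lines
        self.write_backs = 0

    def _fill(self, address):
        index = address % self.num_lines
        line = self.lines[index]
        if line is None or line.tag != address:
            # The line is being reused for a different address.
            if line is not None and line.dirty:
                # Only a dirty line is written to main memory.
                self.memory[line.tag] = line.data
                self.write_backs += 1
            line = Line(tag=address, data=self.memory.get(address, 0))
            self.lines[index] = line
        return line

    def load(self, address):
        return self._fill(address).data

    def store(self, address, value):
        line = self._fill(address)
        line.data = value
        line.dirty = True  # main memory may now be stale for this address
```

A line that has only been read is discarded without any memory traffic when it is booted, whereas a line that has been written triggers exactly one write to main memory.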
As processor designs become increasingly advanced, management of limited off-chip processor bandwidth has become increasingly important. Limited off-chip processor bandwidth can be even more problematic in chip multiprocessor (CMP) designs. As is well known, a CMP is essentially a symmetric multi-processor (SMP) implemented on a single integrated circuit. In a typical case, multiple processor cores of the CMP share the main memory of a memory hierarchy and various interconnects. In general, a computer system that implements one or more CMPs allows for increased thread-level parallelism (TLP). Unfortunately, limited off-chip bandwidth is increasingly difficult to manage in CMP designs and/or other designs that employ aggressively speculative architectures. As has been noted by various academic researchers, writes from cache to off-chip memory, e.g., main memory, frequently write information that is identical to that already stored in the off-chip memory. Thus, when a cache line that contains information identical to the information already stored in off-chip memory is booted from cache, limited off-chip bandwidth is needlessly consumed.
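The bandwidth consumed by such redundant writes can be estimated with a sketch like the following. The function name `tally_write_backs`, the trace format, and the 64-byte line size are assumptions for illustration. The sketch only measures the problem and does not represent any particular technique for avoiding it.

```python
def tally_write_backs(memory, evictions, line_bytes=64):
    """Return (total_bytes, redundant_bytes) written off-chip.

    memory:    dict mapping line address -> contents held in off-chip memory
    evictions: iterable of (address, data, dirty) tuples, one per booted line
    """
    total_bytes = 0
    redundant_bytes = 0
    for address, data, dirty in evictions:
        if not dirty:
            continue  # clean lines are dropped without a write
        total_bytes += line_bytes
        if memory.get(address) == data:
            # The write carries information that off-chip memory already holds.
            redundant_bytes += line_bytes
        memory[address] = data
    return total_bytes, redundant_bytes
```

For example, a line that is written and later restored to its original contents remains marked dirty, so booting it transfers a full cache line even though off-chip memory already holds identical information.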
What is needed is a technique for reducing the use of limited off-chip bandwidth for transferring redundant information.