Most computer systems employ a multilevel hierarchy of memory systems, with relatively fast, expensive, limited-capacity memory at the highest level of the hierarchy and proceeding to relatively slower, lower cost, higher-capacity memory at the lowest level of the hierarchy. Typically, the hierarchy includes a small fast memory called a cache, either physically integrated within a processor integrated circuit, or mounted physically close to the processor for speed. There may be separate instruction caches and data caches. There may be multiple levels of caches.
Caches are commonly organized around an amount of memory called a line, or block, or page. The present patent document uses the term “line,” but the invention is equally applicable to systems employing blocks or pages.
Many computer systems employ multiple processors, each of which may have multiple levels of caches. Some caches may be shared by multiple processors. All processors and caches may share a common main memory. A particular line may simultaneously exist in memory and in the cache hierarchies for multiple processors. All copies of a line in the caches must be identical, a property called coherency. The protocols for maintaining coherence for multiple processors are called cache coherence protocols.
A cache “owns” a line if the cache has permission to modify the line without issuing any further coherency transactions. There can be only one “owner” of a line. For any cache coherence protocol, the most current copy of a cache line must be retrieved from the current owner, if any, and a copy of the data must be delivered to the requestor. If the line is to be modified, ownership must be acquired by the requestor, and any shared copies must be invalidated.
There are three common approaches to determine the location of the owner of a line, with many variations and hybrids. In one approach, called a snooping protocol, or snoop-based protocol, the owner is unknown, and all caches must be interrogated (snooped) to determine the location of the most current copy of the requested line. All requests for access to a cache line, by any device in the system, are forwarded to all caches in the system. Eventually, the most current copy of a line is located and a copy is provided to the requestor. In a single-bus system, coherence (snooping) traffic, addresses, and often data all share a common bus.
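By way of illustration only, the snooping approach described above may be sketched in Python as follows. The names `Cache` and `snoop_read`, and the use of a dictionary to hold lines, are assumptions made for this sketch and do not correspond to any particular protocol. The sketch shows that every cache is interrogated on every request, whether or not it holds the line.

```python
from dataclasses import dataclass, field


@dataclass
class Cache:
    """Hypothetical cache: maps a line address to (data, is_owner)."""
    name: str
    lines: dict = field(default_factory=dict)


def snoop_read(address, caches, memory):
    """Broadcast a read to every cache.

    Returns (data, source, caches_interrogated). The owner's copy, if any,
    is the most current one; otherwise the copy in main memory is used.
    """
    interrogated = 0
    data, source = memory[address], "memory"
    for cache in caches:  # every cache is snooped, owner or not
        interrogated += 1
        entry = cache.lines.get(address)
        if entry is not None and entry[1]:  # this cache owns the line
            data, source = entry[0], cache.name
    return data, source, interrogated


# Example: memory holds an out-of-date copy, cache c1 owns the current one.
memory = {0x40: "stale"}
caches = [Cache("c0"), Cache("c1", {0x40: ("fresh", True)}), Cache("c2")]
result = snoop_read(0x40, caches, memory)
```

In this sketch the count of interrogated caches always equals the number of caches in the system, which is the source of the snoop traffic discussed in this document.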
In a second approach, called a directory-based protocol, memory is provided to maintain information about the state of every line in the memory system. For example, for every line in memory, a directory may include a bit for each cache hierarchy to indicate whether that cache hierarchy has a copy of the line, and a bit to indicate whether that cache hierarchy has ownership. For every request for access to a cache line, the directory must be consulted to determine the owner, and then the most current copy of the line is retrieved and delivered to the requestor. Typically, tags and status bits for a directory are stored in main memory, so that a request for state information cycles main memory and has the latency of main memory. In a multiple-bus system, directory traffic may be on a separate bus.
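By way of illustration only, the example directory described above, with one presence bit and one ownership bit per cache hierarchy for each line, may be sketched in Python as follows. The class name `Directory` and its method names are assumptions made for this sketch.

```python
class Directory:
    """Hypothetical directory: one entry per line in memory.

    Each entry holds a presence bit and an ownership bit per cache hierarchy.
    """

    def __init__(self, num_caches):
        self.num_caches = num_caches
        self.present = {}  # line address -> list of presence bits
        self.owner = {}    # line address -> list of ownership bits

    def _entry(self, address):
        # Create an all-clear entry the first time a line is referenced.
        self.present.setdefault(address, [False] * self.num_caches)
        self.owner.setdefault(address, [False] * self.num_caches)

    def record_shared(self, address, cache_id):
        """Note that a cache hierarchy holds a shared copy of the line."""
        self._entry(address)
        self.present[address][cache_id] = True

    def record_owner(self, address, cache_id):
        """Grant ownership to one cache and invalidate all other copies."""
        self._entry(address)
        self.present[address] = [i == cache_id for i in range(self.num_caches)]
        self.owner[address] = [i == cache_id for i in range(self.num_caches)]

    def find_owner(self, address):
        """Return the owning cache id, or None if no cache owns the line."""
        bits = self.owner.get(address, [])
        return bits.index(True) if True in bits else None


# Example: two sharers, then cache 3 acquires ownership.
directory = Directory(num_caches=4)
directory.record_shared(0x80, 0)
directory.record_shared(0x80, 2)
owner_before = directory.find_owner(0x80)
directory.record_owner(0x80, 3)
owner_after = directory.find_owner(0x80)
```

A single lookup identifies the owner without interrogating any cache. The cost, as noted above, is that the lookup itself typically has the latency of main memory.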
A third approach is a global coherency filter, which has a tag for every valid line in the cache system. A coherency filter is a snoop system with a second set of tags, stored centrally, for all caches in the system. A request for a cache line is forwarded to the central filter, rather than to all the caches. The tags for a coherency filter are typically stored in a small high-speed memory. Some coherency filters may only track owned lines, and may not be inclusive of all shared lines in the system. In a multiple-bus system, coherency filter traffic may be on a separate bus.
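By way of illustration only, a coherency filter that is inclusive of all valid lines may be sketched in Python as follows. The class name `CoherencyFilter` and its method names are assumptions made for this sketch, and a filter that tracks only owned lines would differ.

```python
class CoherencyFilter:
    """Hypothetical central tag store: line address -> ids of caches holding it."""

    def __init__(self):
        self.tags = {}

    def fill(self, address, cache_id):
        """Record that a cache has brought the line in."""
        self.tags.setdefault(address, set()).add(cache_id)

    def evict(self, address, cache_id):
        """Record that a cache has dropped the line."""
        holders = self.tags.get(address)
        if holders is None:
            return
        holders.discard(cache_id)
        if not holders:
            del self.tags[address]  # only valid lines keep a tag

    def caches_to_snoop(self, address):
        """Return only the caches that can hold the line, in sorted order."""
        return sorted(self.tags.get(address, ()))


# Example: two caches hold line 0x100; no cache holds line 0x200.
coherency_filter = CoherencyFilter()
coherency_filter.fill(0x100, 1)
coherency_filter.fill(0x100, 5)
targets_hit = coherency_filter.caches_to_snoop(0x100)
targets_miss = coherency_filter.caches_to_snoop(0x200)
```

A request for a line that no cache holds produces an empty list, so no cache needs to be interrogated for that request.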
For relatively small systems, with one bus or with only a few buses, snoop-based protocols provide the best performance. However, snoop-based systems with one bus increase bus traffic, and for large systems with one bus or with only a few buses, snoop traffic can limit overall performance. Directory-based systems increase the time required to retrieve a line (latency) relative to snooping on a single bus, but in a multiple-bus system a directory requires less coherency traffic on the system buses than snoop-based systems. For large multiple-bus systems, where bus traffic may be more important than latency, directory-based systems typically provide the best overall performance. Many computer systems use some sort of hybrid of snoop-based and directory-based protocols. For example, for a multiple-bus system, snoop-based protocols may be used for coherency on each local bus, and directory-based protocols may be used for coherency across buses.
If a processor requests a line, the overall time required to retrieve the line (overall latency) includes (1) the time required to acquire access rights using a cache coherency protocol, (2) the time required to process an address, and (3) the time required to retrieve and transfer the data. As discussed above, bus traffic for coherency requests can limit overall performance.
One way to decrease bus traffic for coherency requests is to increase the line size. For example, if contiguous lines are requested, each line requires a separate coherency request. If line size is doubled, twice as much data is read for each coherency request. In addition, a substantial part of overall latency is the time required to route a memory request to the various memory components and to get the data from those components. Larger lines provide more data for each request. However, as lines become even larger, much of the data transferred is not needed, and much of the cache space is filled with data that is not needed. This increases the bus traffic for data transfer, and increases the cache miss rate, both of which negatively impact overall performance. In addition, some fraction of a line may be needed exclusively by more than one processor or node. This can cause excessive cache-to-cache copy activity as the two processors or nodes fight for ownership, and the resulting number of coherency requests may increase.
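By way of illustration only, the trade-off described above may be worked through with simple arithmetic in Python. The function names and the byte counts used below are assumptions chosen for this sketch, and the region requested is assumed to start on a line boundary.

```python
def coherency_requests(bytes_needed, line_size):
    """Number of line-granular requests needed to cover a contiguous region."""
    return -(-bytes_needed // line_size)  # ceiling division


def wasted_bytes(bytes_needed, line_size):
    """Bytes transferred that the requestor did not need."""
    return coherency_requests(bytes_needed, line_size) * line_size - bytes_needed


# Doubling the line size halves the requests for a large contiguous region.
requests_64 = coherency_requests(1024, 64)
requests_128 = coherency_requests(1024, 128)

# For a small access, a larger line mostly transfers data that is not needed.
waste_64 = wasted_bytes(40, 64)
waste_512 = wasted_bytes(40, 512)
```

Under these assumptions, a 1024-byte contiguous region needs 16 requests with 64-byte lines and 8 requests with 128-byte lines, while a 40-byte access wastes 24 bytes with 64-byte lines and 472 bytes with 512-byte lines.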
As an alternative, it is known to permit partial line (or partial block) invalidation. It is also known to prefetch extra sub-lines. For example, see C. K. Liu and T. C. King, A Performance Study on Bounteous Transfer in Multiprocessor Sectored Caches, The Journal of Supercomputing, 11, 405-420 (1997). Liu and King describe a coherence protocol for invalidating sub-lines, and for prefetching of multiple sub-lines.
There is an ongoing need to reduce overall latency while maintaining coherency, particularly for large multiple-bus systems.