1. Field of the Invention
The present invention relates to a computer-based memory system, and, more particularly, to cache coherence operations.
2. Description of the Related Art
A symmetric multiprocessor (“SMP”) system generally employs a snoopy mechanism to ensure cache coherence. When a cache miss occurs, the requesting cache may send a cache request to the memory and all its peer caches. When a peer cache receives the cache request, the peer cache snoops its cache directory and produces a cache snoop response indicating whether the requested data is found and the state of the corresponding cache line. If the requested data is found in a peer cache, the peer cache can source the data to the requesting cache via a cache intervention (i.e., cache-to-cache transfer). The memory is responsible for supplying the requested data if the data cannot be supplied by any peer cache.
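By way of illustration only, the snoopy miss handling described above might be sketched as follows. The names `Cache`, `SnoopResponse` and `service_miss` are hypothetical and do not appear in the disclosure. The sketch assumes that a cache stores only valid lines and that any peer cache holding the data can source it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SnoopResponse:
    found: bool            # whether the requested data is in the peer cache
    state: Optional[str]   # state of the corresponding cache line, if found


@dataclass
class Cache:
    name: str
    # address -> (state, data); assumption: only valid lines are stored
    lines: Dict[int, Tuple[str, int]] = field(default_factory=dict)

    def snoop(self, address: int) -> SnoopResponse:
        """Look up the cache directory and report what was found."""
        entry = self.lines.get(address)
        if entry is None:
            return SnoopResponse(found=False, state=None)
        return SnoopResponse(found=True, state=entry[0])


def service_miss(address: int, peers: List[Cache],
                 memory: Dict[int, int]) -> Tuple[int, str]:
    """Return (data, supplier name) for a cache miss on `address`."""
    # The request goes to every peer cache; each produces a snoop response.
    responses = [(peer, peer.snoop(address)) for peer in peers]
    for peer, response in responses:
        if response.found:
            # Cache intervention: a peer sources the data directly.
            return peer.lines[address][1], peer.name
    # No peer cache can supply the data, so the memory supplies it.
    return memory[address], "memory"
```

In hardware the peer caches are snooped in parallel; the sequential loop here is only a modeling convenience.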
Referring now to FIG. 1, an exemplary SMP system is shown that includes multiple processing units interconnected via an interconnect network. Each processing unit includes a processor core and a cache. Also connected to the interconnect network are a memory and some I/O devices. The memory can be physically distributed into multiple memory portions, such that each memory portion is operatively associated with a processing unit. The interconnect network serves at least two purposes: (1) sending cache coherence requests to the caches and the memory; and (2) transferring data among the caches and the memory. Although four processing units are depicted, it is understood that any number of processing units can be included in the system. Furthermore, although only one cache is shown in each processing unit, it is understood that each processing unit may comprise a cache hierarchy with multiple caches, as contemplated by those skilled in the art.
There are many techniques for achieving cache coherence that are known to those skilled in the art. A number of snoopy cache coherence protocols have been proposed. The MESI coherence protocol and its variations have been widely used in SMP systems. As the name suggests, MESI has four cache states: modified (M), exclusive (E), shared (S) and invalid (I). If a cache line is in an invalid state, the data in the cache is not valid. If a cache line is in a shared state, the data in the cache is valid and can also be valid in other caches. The shared state is entered when the data is retrieved from memory or another cache, and the corresponding snoop responses indicate that the data is valid in at least one of the other caches. If a cache line is in an exclusive state, the data in the cache is valid, and cannot be valid in another cache. Furthermore, the data in the cache has not been modified with respect to the data maintained at memory. The exclusive state is entered when the data is retrieved from memory or another cache, and the corresponding snoop responses indicate that the data is not valid in another cache. If a cache line is in a modified state, the data in the cache is valid and cannot be valid in another cache. Furthermore, the data has been modified as a result of a store operation.
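The rules above for entering the shared and exclusive states can be sketched as follows. This is a minimal illustration, not a complete MESI implementation: the names `State`, `state_after_read_fill` and `may_be_valid_elsewhere` are hypothetical, and store operations and invalidations are deliberately left out.

```python
from enum import Enum
from typing import Iterable


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def state_after_read_fill(peer_states: Iterable[State]) -> State:
    """State entered by the requesting cache once a read miss is filled.

    `peer_states` models the snoop responses: the state of the requested
    line in each of the other caches.
    """
    valid_elsewhere = any(s is not State.INVALID for s in peer_states)
    # Valid in at least one other cache -> shared; otherwise -> exclusive.
    return State.SHARED if valid_elsewhere else State.EXCLUSIVE


def may_be_valid_elsewhere(state: State) -> bool:
    """Only a shared line can also be valid in other caches."""
    return state is State.SHARED
```

For example, `state_after_read_fill([State.INVALID, State.SHARED])` yields `State.SHARED`, while a fill for which every peer reports an invalid line yields `State.EXCLUSIVE`.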
When a cache miss occurs, if the requested data is found in both memory and another cache, supplying the data via a cache intervention is often preferred because cache-to-cache transfer latency is usually smaller than memory access latency. The IBM® Power 4 system, for example, enhances the MESI protocol to allow more cache interventions. Such an enhanced coherence protocol allows data of a shared cache line to be supplied to another cache via a cache intervention. In addition, if data of a modified cache line is supplied to another cache, the modified data is not necessarily written back to the memory immediately. Instead, a cache with the most up-to-date data can be held responsible for updating the memory when the data is eventually replaced.
For the purposes of the present disclosure, a cache that generates a cache request is referred to as the “requesting cache” of the cache request. A cache request can be sent to one or more caches and the memory. Given a cache request, a cache is referred to as a “sourcing cache” if the corresponding cache state shows that the cache can source the requested data to the requesting cache via a cache intervention. A cache is referred to as a “non-sourcing cache” if the corresponding cache state shows that the cache does not contain the requested data or cannot source the requested data to the requesting cache.
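The sourcing/non-sourcing distinction can be sketched as a simple classification over the cache states seen for one request. The function name `classify` and the choice of which states permit a cache intervention are assumptions made for illustration; the set of sourcing states depends on the particular coherence protocol and is therefore passed as a parameter.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumption for illustration: by default only modified and exclusive
# lines may be sourced. An enhanced protocol that also sources shared
# lines would pass frozenset({"M", "E", "S"}) instead.
DEFAULT_SOURCING_STATES: FrozenSet[str] = frozenset({"M", "E"})


def classify(cache_states: Dict[str, str],
             sourcing_states: FrozenSet[str] = DEFAULT_SOURCING_STATES
             ) -> Tuple[List[str], List[str]]:
    """Split the caches receiving a request into sourcing and non-sourcing.

    `cache_states` maps a cache name to the state of the requested line
    in that cache ("M", "E", "S" or "I").
    """
    sourcing = sorted(name for name, state in cache_states.items()
                      if state in sourcing_states)
    non_sourcing = sorted(name for name in cache_states
                          if name not in sourcing)
    return sourcing, non_sourcing
```

With the states `{"c1": "I", "c2": "S", "c3": "S"}`, every cache is non-sourcing under the default set, whereas `c2` and `c3` become sourcing caches when shared lines are allowed to source data.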
A major drawback of snoopy cache coherence protocols is that a cache request is usually broadcast to all caches in the system. This can cause serious problems for overall performance, system scalability and power consumption, especially in large SMP systems. Further, broadcasting cache requests indiscriminately may consume enormous network bandwidth, while snooping peer caches unnecessarily may require excessive cache snoop ports. Servicing a cache request may also take more time than necessary when far-away caches are snooped unnecessarily.
Directory-based cache coherence protocols have been proposed to overcome the scalability limitation of snoop-based cache coherence protocols. Typical directory-based protocols maintain directory information as a directory entry for each memory block to record the caches in which the memory block is currently cached. With a full-map directory structure, for example, each directory entry comprises one bit for each cache in the system, indicating whether the cache has a data copy of the memory block. A dirty bit can be used to indicate whether the data has been modified in a cache without the memory having been updated to reflect the modification. Given a memory address, its directory entry is usually maintained in the node in which the corresponding physical memory resides. This node is often referred to as the “home” of the memory address. When a cache miss occurs, the requesting cache sends a cache request to the home, which generates appropriate point-to-point coherence messages according to the directory information.
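A full-map directory entry of the kind described above might be modeled as follows. This is a sketch under stated assumptions: the names `DirectoryEntry`, `record_copy`, `holders` and `forward_target` are hypothetical, and recording a clean copy is assumed to mean that the memory has been brought up to date.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DirectoryEntry:
    """Full-map directory entry for one memory block."""
    num_caches: int
    presence: int = 0      # bit i is set when cache i holds a copy
    dirty: bool = False    # a cache holds modified data not yet in memory

    def record_copy(self, cache_id: int, modified: bool = False) -> None:
        if not 0 <= cache_id < self.num_caches:
            raise ValueError("unknown cache id")
        if modified:
            # A modified copy cannot be valid in another cache.
            self.presence = 0
        self.presence |= 1 << cache_id
        # Simplification: a clean copy implies memory is up to date.
        self.dirty = modified

    def holders(self) -> List[int]:
        """Identifiers of the caches whose presence bit is set."""
        return [i for i in range(self.num_caches)
                if (self.presence >> i) & 1]


def forward_target(entry: DirectoryEntry) -> Optional[int]:
    """Cache to which the home forwards a read request.

    Returns None when the memory copy is current, in which case the
    home can supply the data itself.
    """
    if entry.dirty:
        return entry.holders()[0]
    return None
```

The presence bits are packed into a single integer so that the entry size grows by one bit per cache, mirroring the storage layout of a full-map directory.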
However, directory-based coherence protocols have various shortcomings. First, maintaining a directory entry for each memory block usually results in significant storage overhead. Alternative directory structures such as a limited directory or a chained directory can reduce the storage overhead, but at some cost in performance. Second, accessing the directory can be time-consuming because directory information is usually stored in dynamic random access memories (DRAMs). Caching recently-used directory entries can potentially reduce directory access latencies, but at the cost of increased implementation complexity. Third, going through the directory requires three or four message passing hops to service a cache request, compared with two message passing hops with snoopy coherence protocols.
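The storage overhead of a full-map directory can be estimated with a short calculation. The function name `full_map_overhead` and the system sizes used below are illustrative assumptions, not figures from any particular system; the entry is taken to hold one presence bit per cache plus one dirty bit.

```python
def full_map_overhead(num_caches: int, block_bytes: int) -> float:
    """Directory bits per data bit for a full-map directory.

    Assumes each entry holds one presence bit per cache plus a single
    dirty bit, and that there is one entry per memory block.
    """
    directory_bits = num_caches + 1
    data_bits = block_bytes * 8
    return directory_bits / data_bits


# Illustrative sizes only: 64-byte blocks with 64 and with 1024 caches.
small = full_map_overhead(64, 64)      # 65 / 512, roughly 12.7%
large = full_map_overhead(1024, 64)    # 1025 / 512, more than 200%
```

Because the entry grows linearly with the number of caches, the directory in the larger configuration would occupy more storage than the data it tracks.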
Consider a scenario in which a cache miss occurs in a requesting cache, while the requested data is modified in another cache. To service the cache miss, the requesting cache sends a cache request to the corresponding home. When the home receives the cache request, it forwards the cache request to the cache that contains the modified data. When the cache with the modified data receives the forwarded cache request, it sends the requested data to the requesting cache (an alternative is to send the requested data to the home, which will forward the requested data to the requesting cache). Servicing the cache request thus takes three message passing hops, or four if the data is routed through the home.
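The message sequence in this scenario can be written out explicitly and compared with the snoopy case. The function names and the node labels (`requester`, `home`, `owner`, `peers`, `sourcing cache`) are hypothetical and are used only to count hops.

```python
from typing import List, Tuple

Hop = Tuple[str, str]  # (sender, receiver)


def directory_hops(data_via_home: bool = False) -> List[Hop]:
    """Messages needed when the requested data is modified in another cache."""
    hops = [
        ("requester", "home"),   # cache request sent to the home
        ("home", "owner"),       # home forwards it to the cache with the data
    ]
    if data_via_home:
        # Alternative: the data returns through the home.
        hops += [("owner", "home"), ("home", "requester")]
    else:
        hops.append(("owner", "requester"))
    return hops


def snoopy_hops() -> List[Hop]:
    """Messages needed with a snoopy protocol and a cache intervention."""
    return [
        ("requester", "peers"),           # broadcast cache request
        ("sourcing cache", "requester"),  # cache-to-cache transfer
    ]
```

Here `len(directory_hops())` is 3 and `len(directory_hops(data_via_home=True))` is 4, against 2 for `len(snoopy_hops())`.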
Thus, it is generally desirable to have a scalable and efficient cache coherence protocol that combines the advantages of both snoop-based and directory-based cache coherence approaches without the disadvantages found individually in each approach.