1. Field of the Invention
The present invention relates to cache coherence mechanisms, and, more particularly, to adaptive snoop-and-forward mechanisms for multiprocessor systems.
2. Description of the Related Art
A symmetric multiprocessor (“SMP”) system employs a cache coherence mechanism to keep the copies of data held in its caches consistent. When a read cache miss occurs, the requesting cache broadcasts a cache request to its peer caches and to the memory. When a peer cache receives the cache request, the peer cache performs a cache snoop operation and produces a cache snoop response indicating whether the requested data is found in the peer cache and the state of the corresponding cache line. If the requested data is found in a peer cache, the peer cache may source the data to the requesting cache via a cache intervention. The memory is responsible for supplying the requested data if the requested data cannot be supplied by any peer cache.
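By way of illustration only, the read-miss flow described above can be sketched as follows. The names PeerCache, SnoopResponse and service_read_miss are hypothetical and do not come from any particular system. The sketch assumes that the first peer cache reporting a hit always performs the cache intervention, whereas in practice a peer cache may or may not source the data.

```python
from dataclasses import dataclass


@dataclass
class SnoopResponse:
    cache_id: int
    found: bool   # whether the requested data is in the peer cache
    state: str    # state of the cache line, e.g. "M", "E", "S" or "I"


class PeerCache:
    """Hypothetical peer cache holding address -> (state, data)."""

    def __init__(self, cache_id, lines=None):
        self.cache_id = cache_id
        self.lines = dict(lines or {})

    def snoop(self, address):
        # Cache snoop operation: look up the line and report its state.
        state, _ = self.lines.get(address, ("I", None))
        return SnoopResponse(self.cache_id, state != "I", state)


def service_read_miss(address, peers, memory):
    """Broadcast a read request; return (data, supplier)."""
    responses = [peer.snoop(address) for peer in peers]
    for peer, response in zip(peers, responses):
        if response.found:
            # Assumption: the first hit sources the data (cache intervention).
            return peer.lines[address][1], f"cache {peer.cache_id}"
    # No peer cache can supply the data, so the memory supplies it.
    return memory[address], "memory"


peers = [PeerCache(1), PeerCache(2, {0x40: ("S", "abc")})]
memory = {0x40: "abc", 0x80: "xyz"}
print(service_read_miss(0x40, peers, memory))
print(service_read_miss(0x80, peers, memory))
```

The sketch omits the state updates that a real coherence protocol performs in the requesting and sourcing caches.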
Referring now to FIG. 1, an exemplary cache-coherent multiprocessor system 100 is shown that comprises multiple nodes interconnected via an interconnect network, wherein each node comprises a central processing unit (“CPU”) and a cache. The interconnect network can be a shared bus or a message-passing network such as a torus network. Also connected to the interconnect network are a memory and some input/output (“I/O”) devices. Although the memory is depicted as one component, the memory can be physically distributed into multiple memory portions, wherein each memory portion is operatively associated with a node.
Referring now to FIG. 2, another exemplary cache-coherent multiprocessor system 200 is shown that comprises multiple nodes interconnected via an inter-node interconnect, wherein each node comprises a chip multiprocessor (“CMP”) subsystem. The inter-node interconnect can be a shared bus or a message-passing network such as a torus network. Each CMP subsystem comprises one or more caches that can communicate with each other via an intra-node interconnect (also referred to as intra-node fabric). A memory portion, as well as some input/output devices, can also be connected to the intra-node fabric.
For the purposes of the present invention, a cache is referred to as a requesting cache of a cache request, if the cache request is originally generated from the cache. Likewise, a node is referred to as a requesting node of a cache request, if the cache request is originally generated from a cache in the node. A cache request can be a read request that intends to obtain a shared copy of requested data, a read-with-intent-to-modify request that intends to obtain an exclusive copy of requested data, or an invalidate request that intends to invalidate shared copies of requested data in other caches.
A number of techniques for achieving cache coherence in multiprocessor systems are known to those skilled in the art, such as snoopy cache coherence protocols. For example, the MESI snoopy cache coherence protocol and its variants have been widely used in SMP systems. As the name suggests, MESI has four cache states: modified (M), exclusive (E), shared (S) and invalid (I). If a cache line is in an invalid state in a cache, the data is not valid in the cache. If a cache line is in a shared state in a cache, the data is valid in the cache and can also be valid in other caches. This state is entered, for example, when the data is retrieved from the memory or another cache, and the corresponding snoop responses indicate that the data is valid in at least one of the other caches. If a cache line is in an exclusive state in a cache, the data is valid in the cache, and cannot be valid in any other cache. Furthermore, the data has not been modified with respect to the data maintained in the memory. This state is entered, for example, when the data is retrieved from the memory or another cache, and the corresponding snoop responses indicate that the data is not valid in any other cache. If a cache line is in a modified state in a cache, the data is valid in the cache and cannot be valid in any other cache. Furthermore, the data has been modified as a result of a memory store operation, and the modified data has not been written to the memory.
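By way of illustration only, the four MESI states and the entry conditions described above can be sketched as follows. The function names are hypothetical. The sketch covers only the state chosen when data is retrieved on a read miss, together with the properties each state implies, and is not a complete protocol.

```python
from enum import Enum


class MESI(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def state_after_read_fill(valid_in_other_cache: bool) -> MESI:
    """State entered when data is retrieved on a read miss.

    The snoop responses decide between shared and exclusive.
    """
    return MESI.SHARED if valid_in_other_cache else MESI.EXCLUSIVE


def is_valid(state: MESI) -> bool:
    # Data is valid in this cache in every state except invalid.
    return state is not MESI.INVALID


def excludes_other_copies(state: MESI) -> bool:
    # In modified and exclusive, no other cache can hold valid data.
    return state in (MESI.MODIFIED, MESI.EXCLUSIVE)


def differs_from_memory(state: MESI) -> bool:
    # Only a modified line holds data not yet written to the memory.
    return state is MESI.MODIFIED


for s in MESI:
    print(s.value, is_valid(s), excludes_other_copies(s), differs_from_memory(s))
```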
When a cache miss occurs, if the requested data is found in both memory and another cache, supplying the data via a cache intervention is often preferred because cache-to-cache transfer latency is usually smaller than memory access latency. For example, in the IBM® Power 4 system, when data of an address is shared in at least one cache in a multi-chip module, the cache with the last shared copy can supply the data to another cache in the same module via a cache intervention.
In a modern SMP system, caches generally communicate with each other via a message-passing network instead of a shared bus to improve system scalability and performance. In a bus-based SMP system, the bus behaves as a central arbiter that serializes all bus transactions to ensure a total order of bus transactions. In a network-based SMP system, in contrast, messages can be received in different orders at different receiving caches. One skilled in the art will appreciate that appropriate ordering of coherence messages is generally needed for efficient cache coherence support.
To support cache coherence in SMP systems in which caches are interconnected via a message-passing network, one promising approach is to rely on a particular network topology that can guarantee certain desirable message-passing ordering. For example, consider an SMP system in which caches communicate with each other via a unidirectional ring. When a first cache intends to broadcast a message, the first cache sends the message to a second cache, which is the subsequent cache to the first cache in the unidirectional ring. The second cache receives the message and then forwards the message to a third cache, which is the subsequent cache to the second cache in the unidirectional ring. The process continues in this manner with each further subsequent cache in the unidirectional ring until the message has been delivered to all the caches.
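By way of illustration only, the broadcast over a unidirectional ring described above can be sketched as follows. The function name ring_broadcast is hypothetical, and the sketch assumes the caches are numbered 0 to N-1 in ring order, so that the subsequent cache of cache i is cache (i + 1) mod N.

```python
def ring_broadcast(sender: int, num_caches: int) -> list:
    """Return the order in which the other caches receive a broadcast."""
    delivery_order = []
    current = (sender + 1) % num_caches   # the sender's subsequent cache
    while current != sender:              # stop once the ring is traversed
        delivery_order.append(current)
        current = (current + 1) % num_caches
    return delivery_order


print(ring_broadcast(2, 5))  # [3, 4, 0, 1]
```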
It becomes apparent that the unidirectional ring topology ensures the so-called triangle ordering, assuming in-order message passing from a cache to its subsequent cache in the unidirectional ring. With triangle ordering, if cache A sends a first message to caches B and C, and cache B receives the first message from cache A and then sends a second message to cache C, it is guaranteed that cache C receives the first message from cache A before receiving the second message from cache B. It can be shown that triangle ordering provides effective support for cache coherence implementation.
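By way of illustration only, triangle ordering on a unidirectional ring can be checked with a small simulation. The function name simulate is hypothetical. The sketch assumes that each link between a cache and its subsequent cache is a first-in, first-out queue, and that cache B forwards the first message before issuing its own second message.

```python
from collections import deque


def simulate(ring, sender_a, sender_b, observer):
    """Return the messages seen by `observer`, in arrival order.

    `sender_a` broadcasts "m1"; `sender_b` broadcasts "m2" only after
    it has received "m1". `ring` lists the caches in ring order.
    """
    n = len(ring)
    # links[x] is the in-order link from cache x to its subsequent cache.
    links = {name: deque() for name in ring}
    received = {name: [] for name in ring}
    links[sender_a].append(("m1", sender_a))
    while any(links.values()):
        for i, name in enumerate(ring):
            if not links[name]:
                continue
            msg, origin = links[name].popleft()
            nxt = ring[(i + 1) % n]
            if nxt == origin:
                continue                      # message went all the way round
            received[nxt].append(msg)
            links[nxt].append((msg, origin))  # forward to the next cache
            if msg == "m1" and nxt == sender_b:
                links[nxt].append(("m2", sender_b))
    return received[observer]


# Cache C sees m1 before m2 whether B precedes or follows C in the ring.
print(simulate(["A", "B", "C"], "A", "B", "C"))
print(simulate(["A", "C", "B"], "A", "B", "C"))
```

In the first ring, both messages reach cache C over the same in-order link from cache B. In the second ring, the first message reaches cache C before it reaches cache B, so the second message can only arrive later.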
The approach of relying on the message-passing ordering guarantee of a unidirectional ring can be extended to a hierarchical cache-coherent multiprocessor system. For example, consider an SMP system that includes multiple chips, wherein each chip includes multiple processors and caches. Within the chip boundary, a chip can use a central arbiter for intra-chip cache coherence. The central arbiter behaves as a bus that serializes outgoing cache requests issued from the chip. Beyond the chip boundary, a unidirectional ring is used to pass inter-chip cache requests and cache snoop responses.
In such a hierarchical system, when a cache miss occurs in a cache, the cache sends a request to the on-chip central arbiter. The central arbiter sends a coherence message to the other caches on the same chip. The central arbiter determines that a cache request cannot be serviced locally if the requested data is not found in any on-chip cache for a read cache miss, or if exclusive ownership is not found in any on-chip cache for a write cache miss. In this case, the central arbiter issues an appropriate inter-chip cache request that is passed to all other chips via the unidirectional ring. The central arbiter can ensure that a chip has at most one outstanding cache request for any given address.
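By way of illustration only, the decision made by the on-chip central arbiter can be sketched as follows. The names Request, can_service_locally and CentralArbiter are hypothetical. The sketch assumes that exclusive ownership corresponds to a line held in the modified or exclusive state, and that a second miss to an address with an outstanding inter-chip request is simply deferred; neither detail is specified above.

```python
from enum import Enum


class Request(Enum):
    READ = "read"
    READ_WITH_INTENT_TO_MODIFY = "rwitm"   # issued on a write cache miss


def can_service_locally(request, on_chip_states):
    """on_chip_states: line states ("M", "E", "S", "I") in on-chip caches."""
    if request is Request.READ:
        # A read miss needs the requested data in some on-chip cache.
        return any(state != "I" for state in on_chip_states)
    # Assumption: exclusive ownership means the M or E state.
    return any(state in ("M", "E") for state in on_chip_states)


class CentralArbiter:
    def __init__(self):
        self.outstanding = set()   # addresses with an inter-chip request

    def handle_miss(self, address, request, on_chip_states):
        if can_service_locally(request, on_chip_states):
            return "serviced on chip"
        if address in self.outstanding:
            # Assumption: defer, so at most one request per address is out.
            return "deferred"
        self.outstanding.add(address)
        return "inter-chip request issued"

    def complete(self, address):
        self.outstanding.discard(address)


arbiter = CentralArbiter()
print(arbiter.handle_miss(0x40, Request.READ, ["I", "S"]))
print(arbiter.handle_miss(0x40, Request.READ_WITH_INTENT_TO_MODIFY, ["S", "I"]))
print(arbiter.handle_miss(0x40, Request.READ, ["I", "I"]))
```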
One potential drawback of using a unidirectional ring is the overall latency to service a read request, especially when the sourcing cache that services the read request is far away in the unidirectional ring from the requesting cache. Therefore, it is generally desirable to develop a mechanism that can effectively reduce the overall latency of servicing a cache request, with reasonable bandwidth consumption.
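By way of illustration only, the dependence of request latency on ring position can be sketched as follows. The function name request_hops is hypothetical, and the sketch assumes the caches are numbered 0 to N-1 in ring order. It counts only the hops a read request travels before reaching the sourcing cache, not the return of the data.

```python
def request_hops(requester: int, source: int, num_caches: int) -> int:
    """Hops a request travels from requester to source on the ring."""
    return (source - requester) % num_caches


# On an 8-cache ring, the subsequent cache is 1 hop away, while the
# cache immediately preceding the requester is 7 hops away.
print(request_hops(0, 1, 8))  # 1
print(request_hops(0, 7, 8))  # 7
```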