1. Field of the Invention
This invention relates to caches. In particular, this invention relates to a cache coherency scheme for multiple caches in a multiprocessor system.
2. Description of the Related Art
With the shift of computing technology to the "network is the computer" paradigm, the need for a shared global memory address space and a coherent caching system in a networked computing system becomes increasingly important. FIG. 1A is a block diagram showing one such networked computer system 100 with a conventional non-uniform memory architecture (NUMA). System 100 includes a plurality of subsystems 110, 120, . . . 180, coupled to each other via a global interconnect 190. Each subsystem is assigned a unique network node address. Each subsystem includes one or more processors, a corresponding number of memory management units (MMUs) and caches, a main memory assigned with a portion of a global memory address space, a global interface and a local subsystem interconnect. For example, subsystem 110 includes processors 111a, 111b . . . 111i, MMUs 112a, 112b, . . . 112i, caches 113a, 113b, . . . 113i, main memory 114, global interface 115 and subsystem interconnect 119.
Data from main memories 114, 124, . . . 184 may be stored in one or more of caches 113a . . . 113i, 123a . . . 123i, and 183a . . . 183i. Thus, cache coherency among caches 113a . . . 113i, 123a . . . 123i, and 183a . . . 183i must be maintained in order for system 100 to execute shared-memory programs correctly.
In order to support a conventional directory-based cache coherency scheme, subsystems 110, 120, . . . 180 also include directories 116, 126, . . . 186 coupled to global interfaces 115, 125, . . . 185, respectively. Referring now to FIG. 1B, each global interface, e.g., interface 115, includes a slave agent ("SA"), a request agent ("RA") and a directory agent ("DA"), e.g., SA 115a, RA 115b and DA 115c. Each DA is responsible for updating its associated directory with the status of all cached copies of its (home) main memory, including copies cached in other subsystems.
The status of cached copies in each node is recorded in directories 116, 126, . . . 186 as one of four states per node. An invalid ("I") state indicates that the node, i.e., subsystem, does not have a copy of the data line of interest. A shared ("S") state indicates that the node has an S copy, and that other nodes may also have S copies. An owned ("O") state indicates that the node has an O copy, and that other nodes may have S copies. Note that the node with the O copy is required to perform a write-back upon replacement. Finally, a modified ("M") state indicates that the node is the sole owner of the data line, i.e., there are no S copies in the other nodes.
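The four per-node states described above can be sketched as a small data structure. The following is an illustrative sketch only, not part of the described system: the names State and entry_is_consistent are hypothetical, and the rule that at most one node holds an O or M copy of a line is an assumption inferred from the state descriptions rather than stated in them.

```python
from enum import Enum


class State(Enum):
    I = "invalid"   # node has no copy of the data line
    S = "shared"    # node has an S copy; other nodes may have S copies
    O = "owned"     # node has an O copy; other nodes may have S copies
    M = "modified"  # node is the sole owner; no S copies elsewhere


def entry_is_consistent(entry):
    """Check one directory entry, given as {node_id: State} for a data line."""
    states = list(entry.values())
    owners = sum(s in (State.O, State.M) for s in states)
    if owners > 1:
        # Assumption: a line has at most one O or M copy system-wide.
        return False
    if State.M in states:
        # M means sole owner, so every other node must be in state I.
        return all(s in (State.M, State.I) for s in states)
    return True
```

For example, an entry with one O copy and one S copy passes the check, while an entry with an M copy alongside an S copy does not.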
An RA provides a subsystem with a mechanism for sending read and write requests to the other subsystems. A DA provides access to and is responsible for updating its associated home directory. An SA is responsible for responding to requests from the DA of another subsystem.
Requests for data and responses are exchanged by the respective agents between subsystems 110, 120, . . . 180 in the form of data/control packets, thereby enabling subsystems to keep track of the states of their caches 113a . . . 113i, 123a . . . 123i, and 183a . . . 183i in directories 116, 126, . . . 186, respectively. These data/control packets are transported between subsystems via global interconnect 190. Unfortunately, since global interconnect 190 may be based on any one of a number of conventional networking protocols, e.g., a carrier sense multiple access (CSMA) protocol, subsystems 110, 120, . . . 180 may, from the timing viewpoint, be only loosely coupled to each other at the network layer of the protocol. As such, while the end-to-end arrival of packets is guaranteed, the order of arrival of the packets is not necessarily guaranteed. The out-of-order arrival of packets at subsystems 110, 120, . . . 180 is problematic because it can result in "corner cases" which, if not detected and resolved, can disrupt cache coherency.
One such corner case is illustrated by FIGS. 2A-2D, in which a data packet associated with an earlier-in-time read-to-share request (RTS_req) arrives after the cache line is prematurely invalidated as a result of the arrival of a later-in-time read-to-own request (RTO_req) initiated by another subsystem. In this example, initially, subsystem 110, subsystem 120 and a fourth subsystem (not shown in FIG. 1A) have shared ("S") copies of a data line from the memory space of subsystem 180.
Referring first to FIG. 2A, RA1 of global interface 115 of subsystem 110 sends an RTS_req packet to DA8 of global interface 185 of subsystem 180. As shown in FIG. 2B, DA8 responds by initiating the transfer of a data packet to the requesting RA1.
Next, as shown in FIG. 2C, before the data packet arrives at RA1, RA2 of global interface 125 of subsystem 120 sends a read-to-own request (RTO_req) packet to DA8.
FIG. 2D shows DA8 responding by initiating the transfer of a data packet to RA2. In addition, DA8 sends invalidate (Invld) packets to SA1 and SA4, the slave agents of subsystem 110 and the fourth subsystem, respectively.
Unfortunately, the later-in-time Invld packet arrives at SA1 before the earlier-in-time data packet arrives at RA1. As a result, SA1 receives the Invld packet first and proceeds to invalidate the old S copy of the data line of interest. Subsequently, RA1 receives the data packet, but is unable to update the value of its S copy because it has been erroneously and prematurely marked Invld.
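The effect of arrival order in the corner case of FIGS. 2A-2D can be illustrated with a short simulation. This is a minimal sketch under simplifying assumptions: the function name deliver and the string packet labels are hypothetical, and subsystem 110 is modeled as a single state variable for the data line of interest.

```python
def deliver(arrival_order):
    """Apply packets to subsystem 110 in the order they arrive.

    Returns (final_state, data_applied) for the data line of interest.
    """
    state = "S"           # subsystem 110 starts with a shared copy
    data_applied = False  # whether the RTS data packet updated the copy
    for packet in arrival_order:
        if packet == "Invld":
            # SA1 invalidates the local copy on behalf of DA8.
            state = "I"
        elif packet == "data":
            # RA1 can update the copy only if it is still valid.
            if state == "S":
                data_applied = True
    return state, data_applied


in_order = deliver(["data", "Invld"])      # order in which DA8 sent them
out_of_order = deliver(["Invld", "data"])  # order in the corner case
```

In both orderings the line ends in state I, but only in the in-order case does the data requested by RA1 reach its copy before the invalidation takes effect.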
Several conventional brute-force handshaking protocols for resolving corner cases do exist. FIGS. 3A-3F illustrate one prior art solution to the corner case described above. Again, using the same starting conditions as the example illustrated by FIGS. 2A-2D, subsystem 110, subsystem 120 and the fourth subsystem have S copies of a data line from the memory space of subsystem 180.
Referring first to FIG. 3A, RA1 of subsystem 110 sends an RTS_req packet to DA8 of subsystem 180.
As shown in FIG. 3B, DA8 responds by initiating the transfer of a data packet to the requesting RA1. DA8 then idles while waiting for a read-acknowledgment (RTS_ack) packet from RA1.
Next, as shown in FIG. 3C, RA2 sends an RTO_req packet to DA8. However, DA8 is idle because it is waiting for the RTS_ack packet from RA1 to arrive, and hence is unresponsive.
As shown in FIG. 3D, after receiving the RTS_ack packet from RA1, DA8 is no longer idle and is now able to respond to the RTO_req packet from RA2.
Accordingly, as shown in FIG. 3E, DA8 sends Invld packet(s) to any SAs of subsystems with S copies of the data line of interest. In this example, DA8 sends Invld packets to SA1 and SA4. DA8 is also responsible for sending a data packet together with the number of Invld packets sent (#_Invld) to RA2.
Subsequently, as shown in FIG. 3F, RA2 counts the incoming Invld_ack packets from SA1 and SA4, thereby avoiding the corner case illustrated by FIGS. 2A-2D.
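The blocking behavior of DA8 in FIGS. 3A-3E can be sketched as follows. This is an illustrative model only: the class BlockingDirectoryAgent, its method names, and the tuple encoding of packets are hypothetical, and queuing the RTO_req while the DA is idle is an assumption about how the deferred request is held.

```python
from collections import deque


class BlockingDirectoryAgent:
    """Toy model of a DA that idles until each RTS_req is acknowledged."""

    def __init__(self):
        self.waiting_for_ack = False  # True while idling for an RTS_ack
        self.pending = deque()        # requests received while idle
        self.sent = []                # packets sent, as (kind, destination)

    def on_request(self, kind, requester, sharers=()):
        if self.waiting_for_ack:
            # Assumption: a request arriving while idle is held for later.
            self.pending.append((kind, requester, tuple(sharers)))
        else:
            self._serve(kind, requester, tuple(sharers))

    def _serve(self, kind, requester, sharers):
        if kind == "RTS_req":
            self.sent.append(("data", requester))
            self.waiting_for_ack = True
        elif kind == "RTO_req":
            for sa in sharers:
                self.sent.append(("Invld", sa))
            # The data packet carries the number of Invld packets sent.
            self.sent.append(("data+#_Invld=%d" % len(sharers), requester))

    def on_rts_ack(self):
        self.waiting_for_ack = False
        while self.pending and not self.waiting_for_ack:
            self._serve(*self.pending.popleft())


da8 = BlockingDirectoryAgent()
da8.on_request("RTS_req", "RA1")
da8.on_request("RTO_req", "RA2", ["SA1", "SA4"])  # held: DA8 is idle
sent_before_ack = list(da8.sent)
da8.on_rts_ack()                                  # DA8 resumes
```

In this model the RTO_req from RA2 produces no packets until the RTS_ack arrives, which mirrors the ordering guarantee of the handshake and also shows where the extra control packets come from.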
Unfortunately, the above-described brute-force handshaking solution for handling and/or reducing corner cases is inefficient because of the excessive number of handshaking control packets. These extra control packets substantially increase the network traffic. In other words, the "cure" for the infrequent but disastrous corner cases substantially degrades the efficiency of the network.
Hence, there is a need for a simple and streamlined cache coherency protocol which handles and/or reduces corner cases without substantially increasing network traffic. Advantages of the present invention include reduction of complicated race conditions resulting from the corner cases, ease of formal verification of the protocol due to the reduction of the race conditions, and increased reliability of the resulting cache coherent computer system.