1. Field of the Invention
The present invention relates to the field of data processing, and more particularly to a protocol for snooping cache memory.
2. Description of the Related Art
Caches are used in various forms to reduce the effective time required by a processor to access instructions or data that are stored in main memory. The theory of a cache is that a system attains a higher speed by using a small portion of very fast memory as a cache along with a larger amount of slower main memory. The cache memory is usually placed operationally between the data processing unit or units and the main memory. When the processor needs to access main memory, it looks first to the cache memory to see if the information required is available in the cache. When data and/or instructions are first called from main memory, the information is stored in cache as part of a block of information (known as a cache line) that is taken from consecutive locations of main memory. During subsequent memory accesses to the same addresses, the processor interacts with the fast cache memory rather than main memory. Statistically, when information is accessed from a particular block in main memory, subsequent accesses most likely will call for information from within the same block. This locality of reference property results in a substantial decrease in average memory access time.
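By way of illustration only, the effect of locality on average memory access time can be expressed with the commonly used relation in which every access pays the cache hit time and only the missing fraction pays the main memory penalty. The function name and the cycle counts below are assumptions chosen for the example and do not come from the description above.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average memory access time: every access pays the cache hit time,
    and the fraction of accesses that miss also pays the memory penalty."""
    return hit_time + miss_rate * miss_penalty


if __name__ == "__main__":
    # Assumed, illustrative figures: 1-cycle cache, 100-cycle miss penalty.
    for miss_rate in (1.0, 0.10, 0.02):
        print(miss_rate, average_access_time(1, miss_rate, 100))
```

The lower the miss rate produced by locality of reference, the closer the average access time comes to the access time of the cache itself.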
FIG. 1 is a simplified block diagram of a cache 100. The cache 100 includes a set of cache lines 102. Each cache line 102 is capable of storing a block of data 104 from consecutive addresses in main memory. Each cache line 102 is associated with a tag 106, which represents the block address of the line. A set of MESI (Modified Exclusive Shared Invalid) bits 110 is used to maintain cache consistency. The reading and writing of data in the cache is controlled by a cache access logic circuit 112.
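The organization of FIG. 1 may be sketched in software as follows. This is a minimal model and not the circuit itself: the line size, the number of lines, the fully associative search, and the names `CacheLine`, `Cache`, `lookup` and `fill` are all assumptions made for the example.

```python
from dataclasses import dataclass

LINE_SIZE = 32  # bytes per cache line (assumed for illustration)


@dataclass
class CacheLine:
    tag: int = 0                    # block address of the line (tag 106)
    state: str = "I"                # "M", "E", "S" or "I" (MESI bits 110)
    data: bytes = bytes(LINE_SIZE)  # block of data 104


class Cache:
    """Toy stand-in for cache 100 and its access logic 112."""

    def __init__(self, num_lines=4):
        self.lines = [CacheLine() for _ in range(num_lines)]

    def lookup(self, address):
        """Return the valid line holding `address`, or None on a miss."""
        block_address = address // LINE_SIZE
        for line in self.lines:
            if line.state != "I" and line.tag == block_address:
                return line
        return None

    def fill(self, index, address, block, state):
        """Store a block taken from consecutive main-memory addresses."""
        self.lines[index] = CacheLine(address // LINE_SIZE, state, block)
```

In this sketch any address within the same block produces the same tag, so a single fill serves later accesses to neighboring addresses.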
The use of cache memory in the context of various computer systems is illustrated in FIGS. 2, 3 and 4. FIG. 2 shows cache memory used in a uniprocessor system. A CPU 200 includes an internal L1 cache 202 and is coupled to a second level cache L2 204. The second level cache 204 may reside on its own chip or on the same chip as the CPU 200. The CPU 200 is coupled to a memory bus 206, which allows the CPU 200 to conduct transactions with main memory (DRAM) 208 through a memory controller 210, and with various input/output devices 212 over an I/O bus 214 through an I/O controller 216.
The processors and their corresponding caches (not shown) may be combined into a multiprocessor configuration such as that shown in FIG. 3. Processors CPU1 300, CPU2 302, CPU3 304 and CPU4 306 are each coupled to a multiprocessor bus 308. Through the multiprocessor bus 308, the individual processors may communicate with each other and with I/O (not shown) and memory (not shown). One skilled in the art would understand that any number of individual processors may be coupled to the multiprocessor bus.
FIG. 4 illustrates a more sophisticated system in which individual multiprocessor systems or "clusters" communicate with each other over a common bus. As shown in FIG. 4, a first cluster 400 includes a first multiprocessor MP1 402 coupled through a memory bus 404 to I/O 406, a third level cache L3 408, a memory unit 410 and a cluster controller 412.
A second cluster 414 includes a second multiprocessor MP2 416 coupled through a second cluster bus 418 to I/O 420, memory 422, a third level L3 cache 424 and a second cluster controller 426. The clusters 400 and 414 communicate with each other through their respective cluster controllers 412 and 426 over a cluster interconnect 428. One skilled in the art would understand that the multiprocessor clusters each include processors, an optional I/O controller and an optional memory controller. One skilled in the art would also understand that a large number of clusters may be connected in a multicluster system. Optionally, the cluster interconnect may include a global memory controller 430 and a global I/O controller 432.
In each of these systems, consistency must be maintained among the caches and memory distributed throughout the system. For example, a computer system may implement a write through policy to update main memory at the same time a write operation from a processor changes the contents of its cache. Alternatively, under a write back policy, the data in main memory is updated only when the cache line containing the data is forced out of the cache or when another agent in the system, such as another processor or another cluster, needs to access the data. A cache line may be forced out of the cache, for example, if it is the least recently used (LRU) cache line. By its very nature, the write back policy results in less traffic on the memory bus between cache and memory because it avoids the unnecessary writing of data to memory when the line may not be needed by another agent on the bus.
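The difference in memory bus traffic between the two policies can be illustrated with a simple count. The sketch below assumes a single cached line that receives a number of processor writes and is then evicted once; the function name and policy labels are hypothetical.

```python
def memory_writes(policy, num_writes):
    """Count main-memory writes for `num_writes` processor writes to one
    cached line, followed by a single eviction of that line."""
    writes_to_memory = 0
    dirty = False
    for _ in range(num_writes):
        if policy == "write_through":
            writes_to_memory += 1   # memory is updated on every write
        elif policy == "write_back":
            dirty = True            # only the cache is updated for now
        else:
            raise ValueError(policy)
    if dirty:
        writes_to_memory += 1       # line is written back once, on eviction
    return writes_to_memory
```

Under these assumptions, ten writes to the line cost ten memory writes with write through and a single memory write with write back.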
Table 1 illustrates some of the state transitions experienced by caches associated with a requesting agent and a snooping agent in response to a memory or I/O access request from the requesting agent. The term "requesting agent" is used to refer to a processor or other device, such as an I/O or cluster controller, initiating the access request. The term "snooping agent" refers to caches that snoop their buses for the access request to determine how to change the state of their associated cache lines to maintain cache consistency. For the sake of simplicity, the table only illustrates transitions in which the requesting agent cache line starts in the invalid state. One skilled in the art would understand how to extend the state transition table of Table 1 to describe state transitions beginning from the modified, exclusive and shared states.
TABLE 1
WRITE BACK
______________________________________________________________
                                Snoop Signal
Request    Requesting Agent    HIT#    HITM#    Snooping Agent
______________________________________________________________
Read       I → E               0       0        I → I
           I → S               1       0        S → S
           I → S               1       0        E → S
           I → S               1       1        M → S
           I → E               0       1        M → I
Write      I → M               0       0        I → I
           I → M               1       0        S → I
           I → M               1       0        E → I
           I → M               1       1        M → I
______________________________________________________________
Table 1 also includes the snoop results provided by a snooping agent in the form of active low HIT# and HITM# signals. In the table, a 0 indicates that the signal is inactive (not asserted), while a 1 indicates that the signal is active (asserted).
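For illustration, the rows of Table 1 can be encoded as a lookup structure. The sketch below is one possible encoding and not part of any described hardware; the names `TABLE_1`, `requester_next_state` and `snoop_responses` are hypothetical, and only the transitions listed in Table 1 (requesting agent starting in the invalid state) are covered.

```python
# Each row: (request, requester_from, requester_to, HIT#, HITM#,
#            snooper_from, snooper_to). 1 = asserted, 0 = not asserted.
TABLE_1 = [
    ("Read",  "I", "E", 0, 0, "I", "I"),
    ("Read",  "I", "S", 1, 0, "S", "S"),
    ("Read",  "I", "S", 1, 0, "E", "S"),
    ("Read",  "I", "S", 1, 1, "M", "S"),
    ("Read",  "I", "E", 0, 1, "M", "I"),
    ("Write", "I", "M", 0, 0, "I", "I"),
    ("Write", "I", "M", 1, 0, "S", "I"),
    ("Write", "I", "M", 1, 0, "E", "I"),
    ("Write", "I", "M", 1, 1, "M", "I"),
]


def requester_next_state(request, hit, hitm):
    """Next state of an invalid requesting-agent line for a snoop result."""
    for req, _, req_to, h, hm, _, _ in TABLE_1:
        if (req, h, hm) == (request, hit, hitm):
            return req_to
    raise LookupError(
        f"no Table 1 row for {request} with HIT#={hit}, HITM#={hitm}")


def snoop_responses(request, snooper_state):
    """All (HIT#, HITM#, next_state) rows for a snooper in `snooper_state`."""
    return [(h, hm, snp_to)
            for req, _, _, h, hm, snp_from, snp_to in TABLE_1
            if (req, snp_from) == (request, snooper_state)]
```

For example, `requester_next_state("Read", 1, 0)` returns `"S"`, and a snooper holding the line in the modified state has two listed responses to a read, one asserting both signals and one asserting HITM# alone.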
Starting from the invalid state, in response to a memory access read request, if no snooping agent asserts the HIT# or HITM# signal, then the requesting agent cache line will go from the invalid to the exclusive state. The inactive snoop signals indicate that no other cache holds the cache line retrieved from memory in response to the memory access request, and that the line is thus exclusive to the requesting agent cache and not shared with any other caches.
If, however, in response to a read request, a snooping agent asserts the HIT# signal because it caches the requested line, then the requesting agent cache line will make a state transition from the invalid state to the shared state to indicate that the line is shared with another cache. If the line was previously in the exclusive state in the snooping agent cache, then it will also make a transition to the shared state to maintain consistency with the requesting agent cache, which now caches the same line.
If the line requested by the requesting agent is in the modified state in a snooping agent, then the requesting agent cache line will make a transition from the invalid state to either the shared state or the exclusive state, depending on whether the HIT# and HITM# signals are asserted together or HITM# alone is asserted, respectively. Modified cache lines are described immediately below.
When carrying out a write operation, write back caches typically assume a write allocate policy. Under this policy, to write the data into the cache, the requesting cache must first perform a "read for ownership" in which the cache first reads the line specified by the request address and then merges the write data into the request address location within the cache line. During the read for ownership phase, the requesting agent cache line makes a transition from the invalid state to the exclusive state. The snooping agents all make a transition to the invalid state to remain consistent with the requesting agent cache, which now "owns" the cache line. To complete the write operation, the requesting agent merges the write data into the cache line and sets its MESI state to modified (M) to indicate that the line is modified and thus inconsistent with main memory and all other caches.
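The write allocate sequence described above may be sketched as follows. This is a simplified model under stated assumptions: caches and memory are represented as dictionaries keyed by block address, the line size is arbitrary, the function name is hypothetical, and no snooping agent is assumed to hold the line in the modified state (that case requires the write back described below).

```python
LINE_SIZE = 8  # bytes per cache line (assumed for illustration)


def write_with_allocate(requester, snoopers, memory, address, value):
    """Write one byte using a read for ownership followed by a merge.

    Assumes no snooper holds the line in the modified state.
    """
    block, offset = divmod(address, LINE_SIZE)

    # Read for ownership: fetch the whole line, entering the exclusive state.
    requester[block] = {"state": "E", "data": bytearray(memory[block])}

    # Every snooping cache invalidates its copy of the line.
    for cache in snoopers:
        if block in cache:
            cache[block]["state"] = "I"

    # Merge the write data into the line and mark it modified.
    line = requester[block]
    line["data"][offset] = value
    line["state"] = "M"
    return line
```

Main memory is not updated by this sequence, which is why the resulting line is inconsistent with memory until it is written back.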
Referring back to the read operation, if the snoop result indicates that the cache line requested by the requesting agent is in a modified (M) state in another snooping agent cache, then that snooping agent must intervene before memory can supply the data. The operation is a three-step process: first, the requesting agent aborts the request; second, the snooping agent writes the modified line back to main memory; third, the requesting agent retries the operation. Accordingly, the snooping agent will change the state of its transferred line from modified (M) to shared (S) for a read operation. However, if the read operation is a read for ownership, such as that performed as an interim step during a write operation, then the state of the line in the snooping agent is changed from modified (M) to invalid (I).
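The intervention sequence can be modeled with a short sketch. It assumes a single snooping cache, dictionaries keyed by block address for the caches and memory, and a hypothetical function name; for an ordinary read it follows the modified (M) to shared (S) transition described above.

```python
def read_line(requester, snooper, memory, block, for_ownership=False):
    """Read `block` into the requester, letting a snooper that holds the
    line in the modified state write it back first. Returns the extra
    steps taken because of the intervention (empty if none)."""
    steps = []
    line = snooper.get(block)

    if line is not None and line["state"] == "M":
        steps.append("requester aborts request")
        memory[block] = bytes(line["data"])   # write back to main memory
        steps.append("snooper writes back to memory")
        steps.append("requester retries request")

    # The request (or its retry) is now served from up-to-date memory.
    shared = False
    if line is not None and line["state"] != "I":
        line["state"] = "I" if for_ownership else "S"
        shared = not for_ownership

    requester[block] = {"state": "S" if shared else "E",
                        "data": bytearray(memory[block])}
    return steps
```

With `for_ownership=True` the same three steps occur, but the snooper's line ends in the invalid state and the requester holds the line exclusively.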
The MESI protocol exhibits a number of advantages. When a processor attempts to write a cache line and the line is in a modified or exclusive state in its cache, then it is known that the line is in an invalid state in all other caches. For that reason, the requesting agent need not perform operations on the memory bus to conduct the write operation, thus minimizing bus traffic. Moreover, conducting operations on the bus incurs a bus access latency penalty, which the MESI protocol avoids in this case.
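This advantage can be expressed as a small decision function. The sketch is illustrative only; the function name and the returned tuple are assumptions.

```python
def write_hit(state):
    """Return (bus_transaction_needed, next_state) for a processor write
    to a line currently in MESI `state` in the processor's own cache."""
    if state in ("M", "E"):
        # No other cache holds a valid copy, so the write stays local.
        return (False, "M")
    if state in ("S", "I"):
        # Other copies must be invalidated, or the line must be fetched.
        return (True, "M")
    raise ValueError(f"unknown MESI state: {state!r}")
```

Only writes to lines in the shared or invalid state generate bus traffic in this model.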
From the above description, it is apparent that a requesting agent must monitor its bus for the snoop result to return from other snooping agents before it can complete the requested operation and correctly modify the state of the affected cache lines. However, under a number of circumstances, the transmission of the snoop results to the requesting agent may be delayed. First, and most commonly, the snooping agent may be a slow cache that takes a relatively long time to match the requested address against the tags in the cache. Second, the snooping agent may experience an internal block. This may occur when the local bus between the snooping agent's processor core and its local cache is occupied with a transaction between those two units. In that case, the local cache cannot be snooped to provide snoop results to the external bus. Third, a delay in receiving snoop results can occur due to an external deadlock in which the system is unable to determine whether a transaction is guaranteed to complete. This can happen when, for example, multiprocessor bus traffic or cluster interconnect traffic delays the placement of snoop results onto the respective multiprocessor bus or interconnect.
The delay in providing snoop results requires that all bus agents extend or "stretch" their snoop phases until the snoop results are available. Conventional systems support equivalent functions using multiple additional signals. In particular, a separate busy signal is used in such systems to indicate a snoop phase stretch. It is desirable to minimize the number of bus signals used to indicate cache state and the availability of cache results. In addition, it is desirable to use a minimal number of such bus signals to indicate the delay of signals other than those indicating cache state.
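The conventional approach, in which a separate busy signal indicates a snoop phase stretch, may be modeled as follows. The sketch assumes that the requesting agent samples the busy, HIT# and HITM# signals once per bus clock and that 1 means asserted; the function name and the tuple layout are hypothetical.

```python
def wait_for_snoop_result(samples):
    """Consume per-clock (busy, hit, hitm) samples until busy is not
    asserted. Returns (stretch_clocks, hit, hitm)."""
    stretch_clocks = 0
    for busy, hit, hitm in samples:
        if busy:
            stretch_clocks += 1   # snoop phase is stretched by one clock
            continue
        return (stretch_clocks, hit, hitm)
    raise RuntimeError("snoop result never became available")
```

In this model the third signal exists only to report the stretch, which is the kind of additional bus signal that it is desirable to avoid.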