1. Technical Field
The present invention relates in general to the field of computers, and, in particular, to cache memory in a multi-processor computer system. Still more particularly, the present invention relates to an improved method and system for avoiding a live lock condition that may occur when a processor's cache memory in the multi-processor computer system attempts to invalidate a cache line stored in other processors' cache memories in the multi-processor computer while the other processors' cache memories are using the cache line.
2. Description of the Related Art
The basic structure of a conventional symmetrical multi-processor (SMP) computer system 100 is shown in FIG. 1. Computer system 100 has multiple processing units 102. Each processing unit 102 communicates, via a system bus 104, with other processing units 102, input/output (I/O) devices 106 (such as a display monitor, keyboard, graphical pointer (mouse), and a permanent storage device or hard disk) and a system memory 108 (such as random access memory or RAM) that is used to dynamically store data and program instructions used by processing units 102. Computer system 100 may have many additional components which are not shown, such as serial, parallel, and universal serial bus (USB) ports for connection to modems, printers or scanners. Other components not shown that might be used in conjunction with those shown in the block diagram of FIG. 1 may include a display adapter used to control a video display monitor, a memory controller to access system memory 108, etc.
Typically, each processing unit 102 includes a processor 110, which includes a processor core 112, a level-one (L1) instruction cache 114 and a level-one (L1) data cache 116. When packaged in the same chipset as the processor 110, the L1 cache memory (L1 cache) is often referred to as “on-board” cache memory. The L1 (on-board) cache communicates with system bus 104 via a level-two (L2) cache memory 118. L2 cache memory 118 includes a read/claim (R/C) queue 120 that interfaces with the L1 cache, and a snoop queue 122 that interfaces with system bus 104. Snoop queue 122 monitors (“snoops”) system bus 104 for instructions to processing unit 102.
In the symmetric multi-processor (SMP) computer system shown in FIG. 1, all of the processing units directly share the same system memory. Another conventional multi-processor computer is a non-uniform memory access (NUMA) multi-processor, such as depicted in FIG. 2. A conventional NUMA computer system 200 includes a number of nodes 202 connected by a switch 204. Each node 202, which can have the architecture of an SMP system, includes a local interconnect 206 to which a number of processing units 208 are coupled. Processing units 208 each contain a central processing unit (CPU) 210 and associated cache hierarchy 212. At the lowest level of the volatile memory hierarchy, nodes 202 further contain a system memory 214, which may be centralized within each node 202 or distributed among processing units 208 as shown. CPUs 210 access memory 214 through a memory controller (MC) 216.
Each node 202 further includes a respective node controller 218, which maintains data coherency and facilitates the communication of requests and responses (both graphically depicted as 219) between nodes 202 via switch 204. Each node controller 218 has an associated local memory directory (LMD) 220 that identifies the data from local system memory 214 that are cached in other nodes 202, a remote memory cache (RMC) 222 that temporarily caches data retrieved from remote system memories, and a remote memory directory (RMD) 224 providing a directory of the contents of RMC 222.
In any multi-processor computer system, including those depicted in FIGS. 1 and 2, it is important to provide a coherent cache memory system. Each cache line (section of cache memory), also called a cache block, in a cache memory correlates to a section of system memory. If one of the processing units 102 modifies data in that unit's cache memory (L1, L2 or other), then all of the other processors' caches in the multi-processor computer system should either receive an updated copy of the cache line reflecting the new data (write-update), or the other processors' caches should be notified that they no longer have a valid copy of the data (write-invalidate). The most common protocol for changing data in a cache is the write-invalidate method.
There are a number of protocols and techniques for achieving cache coherence when using a write-invalidate system that are known to those skilled in the art. All of these mechanisms for maintaining coherency require that the protocols allow only one processor to have “permission” allowing a write operation to a given memory location (cache line) at any given point in time. As a consequence of this requirement, whenever a processing element attempts to write to a memory location, it must first inform all other processing elements of its desire to write the location and receive permission from all other processing elements to carry out the write.
To implement cache coherency in a system, the processors communicate over a common generalized interconnect, such as system bus 104 shown in FIG. 1 or switch 204 shown in FIG. 2. The processors pass messages over the interconnect indicating their desire to read from or write to memory locations. When an operation is placed on the interconnect, all of the other processors “snoop” (monitor) this operation and decide if the state of their caches can allow the requested operation to proceed and, if so, under what conditions. There are several bus transactions that require snooping and follow-up action to honor the bus transactions and maintain memory coherency. The snooping operation is triggered by the receipt of a qualified snoop request, generated by the assertion of certain bus signals.
This communication is necessary because, in systems with caches, the most recent valid copy of a cache line may have moved from the system memory to one or more of the caches in the system (as discussed above). If a processor attempts to access a memory location not present within its cache hierarchy, the correct version of the cache line, which contains the actual (current) value for the memory location, may either be in the system memory or in one or more of the caches in another processing unit. If the correct version is in one or more of the other caches in the system, it is necessary to obtain the correct value from the cache(s) in the system instead of system memory.
For example, consider a processor attempting to read a location in memory. It first polls its own L1 cache. If the cache line is not present in the L1 cache, the request is forwarded to the processor's own L2 cache. If the cache line is not present in the L2 cache, the request is then presented on the generalized interconnect to be serviced. Once an operation has been placed on the generalized interconnect, all other processing units snoop the operation and determine if the cache line is present in their caches. If a given processing unit has the cache line requested in its local L2 cache, then assuming that the data is valid (has not been modified in the L1 cache of that processor), the processor sends the requested cache line to the requesting processor.
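The lookup order just described can be sketched as follows. This is a minimal illustration only: the names `ProcessingUnit` and `read` are hypothetical, caches are modeled as plain dictionaries, and the validity check on the snooped copy is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessingUnit:
    """Hypothetical model of one processing unit's cache hierarchy."""
    name: str
    l1: dict = field(default_factory=dict)  # address -> data
    l2: dict = field(default_factory=dict)  # address -> data


def read(requester, address, all_units, system_memory):
    """Return (data, source), following the L1 -> L2 -> interconnect order."""
    if address in requester.l1:
        return requester.l1[address], "L1"
    if address in requester.l2:
        return requester.l2[address], "L2"
    # Local miss: the request goes onto the generalized interconnect,
    # where every other processing unit snoops it.
    for unit in all_units:
        if unit is not requester and address in unit.l2:
            return unit.l2[address], f"snooped from {unit.name}"
    # No other cache holds the line, so system memory services the read.
    return system_memory[address], "system memory"
```

In this sketch a line found in another unit's L2 cache takes precedence over system memory, which may hold an older value.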
Thus, when a processor wishes to read or write a cache line, it must communicate that desire with the other processing units in the system in order to maintain cache coherence. To achieve this, the cache coherence protocol associates with each cache line in each level of the cache hierarchy, a status indicator indicating the current “state” of the cache line. The state information is used to allow certain optimizations in the coherency protocol that reduce message traffic on the generalized interconnect and the inter-cache connections.
Examples of the state information include those described in a specific protocol referred to as “MESI.” In this protocol, a cache line can be in one of four states: “M” (Modified), “E” (Exclusive), “S” (Shared) or “I” (Invalid). Under the MESI protocol, each cache entry (e.g., a 32-byte sector) has two additional bits which indicate the state of the entry, out of the four possible states. Depending upon the initial state of the entry and the type of access sought by the requesting processor, the state may be changed, and a particular state is set for the entry in the requesting processor's cache. For example, when a cache line is in the Modified state, the addressed cache line is valid only in the cache memory having the modified cache line, and the modified value has not been written back to system memory. When a cache line is Exclusive, it is present only in the noted cache memory, and is consistent with system memory. If a sector is Shared, it is valid in that cache memory and in at least one other cache memory, all of the shared cache lines being consistent with system memory. Finally, when a sector is Invalid, it indicates that the addressed cache line is not resident in the cache memory. A cache line in any of the Modified, Shared or Invalid states can move between the states depending upon the particular bus transaction. While a cache line in an Exclusive state can move to any other state, a cache line can only become Exclusive if it is first Invalid.
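As a rough illustration of the two state bits, the sketch below encodes the four MESI states and shows how a snooping cache might update a line's state. The bit encodings, the function name `on_snoop`, and the operation names are assumptions made for this example, and the transition rules are the simplified textbook ones (write-back of Modified data is not modeled).

```python
from enum import Enum


class State(Enum):
    # Two bits are enough for four states; these encodings are assumed.
    INVALID = 0b00
    SHARED = 0b01
    EXCLUSIVE = 0b10
    MODIFIED = 0b11


def on_snoop(state, remote_op):
    """Return the new state of a local line after snooping a remote operation.

    remote_op is "read" (another cache wants a copy) or "kill"
    (another cache intends to modify the line).
    """
    if remote_op not in ("read", "kill"):
        raise ValueError(f"unknown operation: {remote_op}")
    if state is State.INVALID:
        return State.INVALID  # nothing cached, nothing to change
    # A remote read leaves a valid line Shared; a kill invalidates it.
    return State.SHARED if remote_op == "read" else State.INVALID
```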
A further improvement in accessing cache lines can be achieved using the cache coherency protocol. This improvement, referred to as “intervention,” allows a cache having control over a cache line to provide the data in that cache line directly to another cache requesting the value (for a read-type operation), in other words, bypassing the need to write the data to system memory and then have the requesting processor read it back again from memory. Intervention can generally be performed only by a cache having the value in a cache line whose state is Modified or Exclusive. In both of these states, there is only one cache that has a valid copy of the value, so it is a simple matter to source (write) the value over the bus without the necessity of first writing it to system memory. The intervention procedure thus speeds up processing by avoiding the longer process of writing to and reading from system memory (which actually involves three bus operations and two memory operations). This procedure not only results in better latency, but also increased bus bandwidth.
There are many variations of the MESI protocol. One variation of the MESI protocol is the R-MESI protocol, typically used in SMP computers. Under the R-MESI protocol, the last processor to receive a shared cache line designates the cache line as “R” (for Recently shared) instead of “S” (for Shared). This denotes that the processor with the R cache line has the exclusive right to share the line with other processors.
Another variation of the MESI protocol, used with NUMA computers, uses the notation SL and SR. As shown in FIG. 2, each node 202 has independent memory 214. Data in one cache 212, such as cache 212a in node 202a, may be shared with another cache 212, such as cache 212b found in node 202b. If the shared data in the cache 212a is also in memory 214a, then that data (cache line) is noted as SL in cache 212a and SR in cache 212b. Thus node 202a knows that the data is relatively close by for management purposes.
A cache transaction may require any cache memories (caches) which currently contain a value to invalidate the corresponding cache lines. For example, when a processor or I/O device issues a store operation for a particular cache line, any caches in other processors which have earlier copies of the cache line must invalidate, or “kill,” those cache lines. The processing unit having the cache memory that wants to modify the cache line sends a bus transaction, called a “kill” command, to the cache memories, including L1 cache memories and L2 cache memories, in all other processing units in the system. This kill command tells the other cache memories to invalidate (kill) the cache line being modified by the first processing unit. The two main types of kill commands are called read-with-intent-to-modify (RWITM) and data claim (DClaim). The RWITM command is sent when the cache line modifying processing unit does not initially have the cache line to be modified in its cache memory, and thus must first “read” the cache line before modifying it. The DClaim command is similar to the RWITM command, except the modifying processing unit already has a copy (either a Shared copy or an Exclusive copy) of the cache line to be modified.
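The choice between the two kill commands can be sketched as a function of the modifying unit's own state for the line. The function name `choose_kill_command` and the single-letter state codes are hypothetical; returning `None` for a line that is already Modified is an assumption of this sketch, since the text above does not cover that case.

```python
def choose_kill_command(local_state):
    """Pick the kill command for a store, given the local MESI state.

    local_state is one of "M", "E", "S" or "I".
    """
    if local_state == "I":
        # No local copy: the line must be read before it can be modified.
        return "RWITM"
    if local_state in ("S", "E"):
        # A copy is already held, so only ownership must be claimed.
        return "DClaim"
    if local_state == "M":
        # Assumption: a Modified line needs no further kill command.
        return None
    raise ValueError(f"unknown state: {local_state}")
```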
Typically, there are two possible responses to a kill command. One is an “ACK” (acknowledge) response, indicating that the other processing unit's cache memory has received the kill command, and has killed (invalidated) the cache line described. The other possible response is a “Retry” response. The retry response instructs the processing unit that sent the kill command to send the command again (“retry”) because the kill command was not complied with. The reason for the non-compliance with the kill command may be 1) the receiving processor cache is delivering a copy of the subject cache line to the processor (R/C queue is active); 2) the receiving processor cache is delivering a shared copy of the subject cache line to another processor's cache (snoop queue is active); or 3) the receiving processor's cache snoop queue is temporarily full, and cannot receive any new commands.
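The two responses and the three retry reasons can be summarized in a short sketch. The `SnooperStatus` fields and the function `respond_to_kill` are hypothetical names, and the order in which the conditions are checked is an arbitrary choice made for this example.

```python
from dataclasses import dataclass


@dataclass
class SnooperStatus:
    """Hypothetical snapshot of a receiving cache when a kill arrives."""
    rc_queue_active: bool     # delivering the line to its own processor
    snoop_queue_active: bool  # delivering a shared copy to another cache
    snoop_queue_full: bool    # cannot accept any new command


def respond_to_kill(status):
    """Return (response, reason) for a snooped kill command."""
    if status.rc_queue_active:
        return "Retry", "R/C queue is active"
    if status.snoop_queue_active:
        return "Retry", "snoop queue is active"
    if status.snoop_queue_full:
        return "Retry", "snoop queue is full"
    # Nothing blocks the kill: the line is invalidated and acknowledged.
    return "ACK", None
```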
One solution to the third condition described above (full snoop queue) is disclosed in U.S. Pat. No. 6,336,169 (Arimilli et al., Jan. 1, 2002), which is herein incorporated by reference in its entirety.
The present invention, however, addresses the first two reasons for a retry response (busy R/C queue or busy snoop queue).
For purposes of illustration and clarity, suppose that processing unit P0 shown in FIG. 1 issues a kill command for cache line “A.” Now suppose that at the same time processing unit P30 is sending processing unit P31 a copy of cache line “A.” Then processing unit P30 and/or P31 will send a retry response back to processing unit P0, since the cache line is “in flight” and under the temporary control of processing units P30 and/or P31. This condition is shown in a timing chart depicted in FIG. 3. At time 302, processing unit P30 receives an address read instruction from processing unit P31. At time 304, processing unit P31 receives an acknowledgment from P30 for the address read instruction, and awaits the delivery of cache line “A” from (typically the L2 cache in) processing unit P30. During combined response (C.R.) period 306, both processing unit P30 and processing unit P31 have combined control over cache line “A.”
At time 308, processing unit P31 completes the transfer of cache line “A.” The time period from time 302 to time 308 is the total time during which cache line “A” is in flight from processing unit P30 to processing unit P31. This total time is depicted as cache line busy time (CLB) 310. If during CLB 310 a kill command 312 is sent from processing unit P0, then processing unit P30 or P31 (depicted) sends a “retry” response 314 back to processing unit P0, as described above. Processing unit P0 then resends the kill command, as shown at 316. Assuming the re-sent kill command arrives without further incident, then all copies of cache line “A” held in the other caches are killed (invalidated).
With reference now to FIG. 4, there is depicted a “live lock” scenario resulting from multiple overlapping cache line busy signals (CLB) 410. For example, assume an L2 cache in processing unit X1 is using cache line “A.” If so, processing unit X1 will issue CLB 410-1. If an L2 cache from a processing unit “0” issues a kill command 412-1, processing unit X1 will respond with a retry response 414-1. Before processing unit X1 completes the operation with cache line “A,” a second processing unit X2 may take control of cache line “A.” When processing unit “0” resends a kill command 412-2, processing unit X2 issues a new retry response 414-2. Likewise, a third processing unit X3 may take control of cache line “A” before processing unit X2 has completed the CLB 410-2, and so on. As depicted, a processing unit may never be able to get the other processing units in the system to acknowledge the kill command 412, due to the live lock described. A live lock can also occur under conditions in which CLBs 410 do not overlap, but are sufficiently close together to effectively block the kill command from processing unit “0.”
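The retry behavior and the resulting live lock can be illustrated with a toy timeline model. Everything here is a simplifying assumption for illustration: time is an integer, each cache line busy period is a half-open interval, the sender retries after a fixed delay, and `send_kill` is a hypothetical name.

```python
def send_kill(busy_windows, start, retry_delay, max_attempts):
    """Simulate resending a kill command until it is acknowledged.

    busy_windows: list of (begin, end) intervals during which some
    processing unit holds the cache line and answers "retry".
    Returns the attempt number that was acknowledged, or None if
    every attempt up to max_attempts was answered with a retry.
    """
    t = start
    for attempt in range(1, max_attempts + 1):
        line_busy = any(begin <= t < end for begin, end in busy_windows)
        if not line_busy:
            return attempt  # every snooper acknowledges the kill
        t += retry_delay    # retry response received: send again later
    return None


# A single busy period: the first kill is retried, the second succeeds.
single = send_kill([(0, 10)], start=5, retry_delay=10, max_attempts=4)

# Overlapping busy periods: every attempt lands inside some busy period.
overlapping = send_kill([(0, 10), (8, 18), (16, 26), (24, 34)],
                        start=5, retry_delay=8, max_attempts=4)
```

In this model `single` is 2 and `overlapping` is `None`. Busy periods that do not overlap but sit close together, such as `[(0, 10), (12, 22), (24, 34)]` with a retry delay of 12, block the kill in the same way.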
Thus, there is a need for a method and system that avoids a live lock that requires a kill command to be re-sent an indefinite number of times.