A multiprocessor system may comprise multiple processors coupled to a common shared system memory. Each processor may comprise one or more levels of cache memory. The multiprocessor system may further comprise a system bus coupling the processors to each other and to the system memory. A cache memory may refer to a relatively small, high-speed memory that contains a copy of information from one or more portions of the system memory. Frequently, the cache memory is physically distinct from the system memory. Such a cache memory may be integral with a processor in the system, commonly referred to as a Level-1 (L1) or primary cache, or may be non-integral with a processor in the system, commonly referred to as a Level-2 (L2) or secondary cache.
When a processor generates a read request and the requested data resides in its cache memory, e.g., L1 cache, then a cache read hit takes place. The processor may then obtain the data from the cache memory without having to access the system memory. If the data is not in the cache memory, then a cache read miss occurs. The memory request may be forwarded to the system memory and the data may subsequently be retrieved from the system memory as would normally be done if the cache did not exist. On a cache miss, the data that is retrieved from the system memory may be provided to the processor and may also be written into the cache memory due to the statistical likelihood that this data will be requested again by that processor. Likewise, if a processor generates a write request, the write data may be written to the cache memory without having to access the system memory over the system bus.
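The hit and miss behavior described above can be illustrated with a minimal sketch. The class name `SimpleCache`, its fields, and the use of a dictionary to stand in for system memory are assumptions made for illustration only; capacity limits and line replacement are deliberately left out.

```python
class SimpleCache:
    """Toy cache in front of a dict that stands in for system memory (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory      # hypothetical backing "system memory"
        self.lines = {}           # address -> cached value
        self.memory_reads = 0     # number of times system memory was read

    def read(self, address):
        if address in self.lines:          # cache read hit: no memory access
            return self.lines[address]
        self.memory_reads += 1             # cache read miss: go to system memory
        value = self.memory[address]
        self.lines[address] = value        # keep a copy for likely later requests
        return value

    def write(self, address, value):
        # The write data lands in the cache only; system memory is not accessed.
        self.lines[address] = value


memory = {0x100: 7}
cache = SimpleCache(memory)
first = cache.read(0x100)    # miss: fetched from system memory
second = cache.read(0x100)   # hit: served from the cache
cache.write(0x100, 9)        # system memory still holds the old value
```

After the final write, the cache and the system memory disagree about the contents of address `0x100`, which is the situation the following paragraph addresses.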
Hence, data may be stored in multiple locations, e.g., the L1 cache of a particular processor and system memory. If a processor alters the cached copy of a system memory location, the cache memory may be said to hold “modified” data, and the system memory may be said to hold “stale” or invalid data. Problems may result if another processor or bus agent, e.g., a Direct Memory Access (DMA) controller, inadvertently obtains this “stale” or invalid data from system memory. Consequently, processors and other bus agents must be provided the most recent copy of the data, whether it resides in the system memory or in a cache memory. This may commonly be referred to as “maintaining cache coherency.” In order to maintain cache coherency, it may be necessary to monitor the system bus to see if another processor or bus agent accesses cacheable system memory. This method of monitoring the system bus is referred to in the art as “snooping.”
Each cache may be associated with logic circuitry commonly referred to as a “snoop controller” configured to monitor the system bus for the snoopable addresses requested by a processor or other bus agent. Snoopable addresses may refer to the addresses requested by the processor or bus agent that are to be snooped by snoop controllers on the system bus. Snoop controllers may snoop these snoopable addresses to determine if copies of the snoopable addresses requested by the processor or bus agent are within their associated cache memories using a protocol commonly referred to as Modified, Exclusive, Shared and Invalid (MESI).
In the MESI protocol, an indication of a coherency state is stored in association with each unit of storage in the cache memory. This unit of storage may commonly be referred to as a “coherency granule.” A “cache line” may be the size of one or more coherency granules. In the MESI protocol, the indication of the coherency state for each coherency granule in the cache memory may be stored in a cache state directory in the cache subsystem. Each coherency granule may have one of four coherency states: modified (M), exclusive (E), shared (S), or invalid (I), which may be indicated by two or more bits in the cache state directory. The modified state indicates that a coherency granule is valid only in the cache memory containing the modified or updated coherency granule and that the value of the updated coherency granule has not been written to system memory. When a coherency granule is indicated as exclusive, the coherency granule is resident in only the cache memory having the coherency granule in the exclusive state. However, the data in the exclusive state is consistent with system memory. If a coherency granule is marked as shared, the coherency granule is resident in the associated cache memory and may be in one or more cache memories in addition to the system memory. If the coherency granule is marked as shared, all of the copies of the coherency granule in all the cache memories so marked are consistent with the system memory. Finally, the invalid state may indicate that the data and the address tag associated with the coherency granule are both invalid and thus are not contained within that cache memory.
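The four coherency states and the properties attributed to each above may be summarized in a short sketch. The names `Mesi`, `is_resident`, `consistent_with_memory`, `may_be_in_other_caches`, and `lookup`, as well as the dictionary used as a cache state directory, are hypothetical and chosen only for illustration; a real directory would encode the state in two or more bits per coherency granule.

```python
from enum import Enum


class Mesi(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def is_resident(state):
    """Invalid means the granule is not contained in this cache."""
    return state is not Mesi.INVALID


def consistent_with_memory(state):
    """Exclusive and shared copies match system memory; modified copies do not."""
    return state in (Mesi.EXCLUSIVE, Mesi.SHARED)


def may_be_in_other_caches(state):
    """Only a shared granule may also reside in other cache memories."""
    return state is Mesi.SHARED


# Hypothetical cache state directory: granule address -> coherency state.
directory = {0x40: Mesi.MODIFIED, 0x80: Mesi.SHARED}


def lookup(directory, address):
    """Granules with no directory entry are treated as invalid in this sketch."""
    return directory.get(address, Mesi.INVALID)
```

For instance, `lookup(directory, 0x40)` yields the modified state, for which `consistent_with_memory` is false, while an address absent from the directory is reported as invalid.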
A processor or other bus agent may generate a “transfer request” to be received by a unit commonly referred to as a “bus macro”. A “transfer request” may refer to either a request to read an address not within the processor's or bus agent's associated cache memory(ies), a request to write to an address not owned by the processor's or bus agent's associated cache memory(ies), synchronization commands, address only requests, e.g., updating the state of a coherency granule, or translation lookaside buffer invalidation requests. The bus macro may be configured to determine if the received transfer request is snoopable. That is, the bus macro may be configured to determine if the received transfer request is to be broadcasted to the other snoop controllers not associated with the requesting processor or bus agent in order to determine if a copy of the requested snoopable address, i.e., a copy of the requested coherency granule, is within their associated cache memories. The broadcasted transfer request may commonly be referred to as a “snoop request.”
In some multiprocessor systems, the performance of snooping may be enhanced through “snoop pipelining.” Snoop pipelining may refer to the bus macro broadcasting multiple snoop requests prior to the completion of a previously issued snoop request. Hence, a higher snoop bus bandwidth (busses between the bus macro and snoop controllers) and lower overall snoop latency (duration of time for snoop requests to be completed) may be achieved. A snoop request may be said to be “completed” when the bus macro services that snoop request. The snoop request may typically be serviced by the bus macro after the bus macro receives a response to the snoop request from each of the snoop controllers. Servicing may include reading from or writing to an address in system memory as requested in the transfer request. Upon servicing the oldest in a series of pipelined snoop requests broadcasted, the bus macro may broadcast the next pipelined snoop request. That is, if the snoop pipeline is full, then the bus macro may broadcast the next pipelined snoop request upon servicing the oldest snoop request in the snoop pipeline. The bus macro may not broadcast the next pipelined snoop request until the oldest in a series of pipelined snoop requests is completed in order to maintain sequential consistency. Sequential consistency may refer to ensuring that a request, e.g., read from an address or write to an address, is completed in the proper order to ensure that the appropriate data is read from or written to memory, as well as to ensure that the program execution is correct.
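The rule that a full snoop pipeline admits the next snoop request only once the oldest request has been serviced can be sketched as follows. The class `SnoopPipeline`, its `depth` parameter, and the request labels are assumptions for illustration; the sketch models only the ordering of broadcasts, not bus timing or snoop responses.

```python
from collections import deque


class SnoopPipeline:
    """Toy model: a pipeline slot frees only when the oldest snoop request is serviced."""

    def __init__(self, depth, requests):
        self.depth = depth                 # hypothetical maximum requests in flight
        self.pending = deque(requests)     # snoopable requests not yet broadcast
        self.in_flight = deque()           # broadcast requests, oldest first
        self.serviced = set()              # serviced but not yet retired
        self.broadcast_log = []            # order in which requests were broadcast
        self._fill()

    def _fill(self):
        # Broadcast pending requests while the pipeline has free slots.
        while self.pending and len(self.in_flight) < self.depth:
            request = self.pending.popleft()
            self.in_flight.append(request)
            self.broadcast_log.append(request)

    def service(self, request):
        self.serviced.add(request)
        # Retire from the head only: a serviced younger request frees no slot.
        while self.in_flight and self.in_flight[0] in self.serviced:
            self.serviced.discard(self.in_flight.popleft())
        self._fill()


pipe = SnoopPipeline(depth=2, requests=["s0", "s1", "s2", "s3"])
pipe.service("s1")                       # younger request serviced first
log_after_s1 = list(pipe.broadcast_log)  # nothing new has been broadcast
pipe.service("s0")                       # oldest serviced: s2 and s3 go out
```

In this sketch, servicing `s1` before `s0` leaves the pipeline full, so `s2` and `s3` are broadcast only after `s0` is serviced.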
If the bus macro received multiple transfer requests from one or more processors or bus agents during a single clock cycle and these multiple transfer requests are snoopable, then the bus macro may broadcast each transfer request, one at a time, based on an arbitration algorithm. Snoop controllers may monitor the system bus for these snoop requests (broadcasted transfer requests). If one or more of these snoop controllers detect a hit to a modified coherency granule in an associated cache, i.e., one or more snoop controllers detect that the requested coherency granule is in the modified state, then these snoop controllers may issue a request, commonly referred to as a “snoop castout request,” to the bus macro. The snoop castout request is a request to write the modified data in the cache associated with the requesting snoop controller to the system memory to maintain cache coherency. However, for a variety of reasons, the bus macro may receive these snoop castout requests out of order with respect to the order the snoop requests were broadcasted. For example, snoop castout requests may be received out of order due to the different response latencies among the different snoop controllers. The different response latencies may be caused by a variety of reasons such as slower clock cycles or caches in use. In another example, snoop castout requests may be received out of order due to what may be referred to as a “replacement castout.” A replacement castout may refer to replacing a valid cache line in a cache with a new cache line, where the replaced cache line is stored in a castout buffer to be cast out. If there are snoop castout requests in the castout buffer along with a replacement castout, then the issuance of the snoop castout requests may be delayed if the replacement castout is issued prior to the issuance of the snoop castout requests.
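The effect of a replacement castout on ordering may be illustrated with a small sketch. It assumes, purely for illustration, that each castout buffer is drained first-in, first-out at one entry per cycle; the function `issue_cycles` and the entry labels are hypothetical.

```python
from collections import deque


def issue_cycles(buffer_entries):
    """Return {entry: cycle issued} for a buffer drained FIFO, one entry per cycle."""
    queue = deque(buffer_entries)
    cycles = {}
    cycle = 0
    while queue:
        cycles[queue.popleft()] = cycle
        cycle += 1
    return cycles


# Controller A holds a replacement castout ahead of its snoop castout request.
controller_a = issue_cycles(["replacement castout", "castout for snoop 1"])
# Controller B has only a snoop castout request, for a younger snoop request.
controller_b = issue_cycles(["castout for snoop 2"])
```

Under these assumptions, the castout for snoop request 2 is issued in cycle 0 while the castout for the older snoop request 1 waits until cycle 1, so the bus macro receives them in the reverse of the broadcast order.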
Since not all snoop requests may result in a snoop castout request, the bus macro may be unable to determine the order of the snoop castout requests with respect to the order the snoop requests were broadcasted. Hence, the bus macro grants the snoop castout requests in the order they were received and not necessarily in the order the snoop requests were broadcasted. As a result, a younger snoop request may be completed before the oldest in a series of pipelined snoop requests, thereby delaying the issuance of the next pipelined snoop request.
For example, if the bus macro received four transfer requests from four processors or bus agents (designated as masters 0–3) in the same clock cycle, then the bus macro may broadcast these four transfer requests (designated as snoop requests 0–3) in an order based on an arbitration algorithm. For example, the bus macro may first broadcast the transfer request (snoop request 0) received from master 0. The bus macro may subsequently broadcast the transfer request (snoop request 1) received from master 1, followed by the transfer request (snoop request 2) received from master 2, followed by the transfer request (snoop request 3) received from master 3. If one or more snoop controllers detect a hit to a modified coherency granule from the multiple snoop requests, e.g., snoop requests 2 and 3, then these one or more snoop controllers issue snoop castout requests to the bus macro. However, these snoop castout requests may be received out of order with respect to the order the snoop requests were broadcasted for one or more reasons as previously mentioned, e.g., a snoop castout request in response to snoop request 3 is received prior to receiving all the responses to snoop request 1. Since the bus macro grants the snoop castout requests in the order they were received and not necessarily in the order the snoop requests were broadcasted, snoop castout requests may be serviced out of the order the snoop requests were broadcasted. That is, the oldest in a series of pipelined snoop requests may not be serviced prior to a younger snoop request. Until the oldest in a series of pipelined snoop requests is serviced, the bus macro may not issue the next pipelined snoop request. Hence, by servicing a younger snoop request prior to servicing the oldest in a series of pipelined snoop requests, the issuance of the next pipelined snoop request is delayed.
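The example above may be reduced to a short sketch of arrival-order granting. The functions `grant_slots` and `out_of_order_pairs` are hypothetical, and the sketch assumes that snoop requests are numbered in broadcast order (as in the example) and that one castout is granted per slot.

```python
def grant_slots(castout_arrivals):
    """Grant castouts strictly in arrival order; return {snoop id: grant slot}."""
    return {snoop_id: slot for slot, snoop_id in enumerate(castout_arrivals)}


def out_of_order_pairs(slots):
    """List (older, younger) pairs in which the younger snoop request was granted first."""
    ids = sorted(slots)   # lower id = broadcast earlier (assumption of this sketch)
    return [
        (older, younger)
        for i, older in enumerate(ids)
        for younger in ids[i + 1:]
        if slots[younger] < slots[older]
    ]


# The castout for snoop request 3 reaches the bus macro before the one for 2.
slots = grant_slots([3, 2])
inversions = out_of_order_pairs(slots)
```

Here the castout for snoop request 3 is granted in slot 0 and the castout for the older snoop request 2 in slot 1, so `inversions` contains the pair `(2, 3)`: a younger snoop request is serviced ahead of an older one.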
Therefore, there is a need in the art to ensure orderly forward progress in granting snoop castout requests.