A multiprocessor system may comprise multiple processors coupled to a common shared system memory. The multiprocessor system may further include one or more levels of cache associated with each processor. A cache includes a relatively small, high speed memory (“cache memory”) that contains a copy of information from one or more portions of the system memory. A Level-1 (L1) cache or primary cache may be built into the integrated circuit of the processor. The processor may be associated with additional levels of cache, such as a Level-2 (L2) cache and a Level-3 (L3) cache. These lower level caches, e.g., L2, L3, may be employed to stage data to the L1 cache and typically have progressively larger storage capacities but longer access latencies.
The cache memory may be organized as a collection of spatially mapped, fixed size storage region pools commonly referred to as “congruence classes.” Each of these storage region pools typically comprises one or more storage regions of fixed granularity. These storage regions may be freely associated with any equally granular storage region in the system as long as the storage region spatially maps to a congruence class. The position of the storage region within the pool may be referred to as the “set.” The intersection of each congruence class and set contains a cache line. The size of the storage granule may be referred to as the “cache line size.” A unique tag may be derived from an address of a given storage granule to indicate its residency in a given congruence class and set.
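By way of illustration only, the mapping of an address to a congruence class and a tag may be sketched as follows. The cache line size, the number of congruence classes, and the function name `split_address` are assumptions chosen for the example and are not taken from any particular implementation.

```python
# Hypothetical geometry, chosen only for illustration.
LINE_SIZE = 128      # bytes per cache line (assumed)
NUM_CLASSES = 512    # number of congruence classes (assumed)

def split_address(addr):
    """Split an address into (tag, congruence_class, offset)."""
    offset = addr % LINE_SIZE               # byte position within the cache line
    line_number = addr // LINE_SIZE         # which storage granule in system memory
    congruence_class = line_number % NUM_CLASSES  # pool the granule spatially maps to
    tag = line_number // NUM_CLASSES        # distinguishes granules sharing that pool
    return tag, congruence_class, offset

tag, cls, offset = split_address(0x12345)
```

Two addresses that differ by `LINE_SIZE * NUM_CLASSES` map to the same congruence class and are distinguished only by their tags, which is why the tag is stored alongside each cache line.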
When a processor generates a read request and the requested data resides in its cache memory (e.g., cache memory of L1 cache), then a cache read hit takes place. The processor may then obtain the data from the cache memory without having to access the system memory. If the data is not in the cache memory, then a cache read miss occurs. The memory request may be forwarded to the system memory and the data may subsequently be retrieved from the system memory as would normally be done if the cache did not exist. On a cache miss, the data that is retrieved from the system memory may be provided to the processor and may also be written into the cache memory due to the statistical likelihood that this data will be requested again by that processor. Likewise, if a processor generates a write request, the write data may be written to the cache memory without having to access the system memory over the system bus.
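The read hit, read miss, and write behavior described above may be sketched, under simplifying assumptions, as follows. The class `SimpleCache` and its members are hypothetical; the sketch ignores capacity limits, replacement, and coherency, and models system memory as a dictionary.

```python
class SimpleCache:
    """Illustrative single-level cache in front of a dict-backed system memory."""

    def __init__(self, system_memory):
        self.memory = system_memory   # address -> value
        self.lines = {}               # cached copies: address -> value
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        if addr in self.lines:        # cache read hit: no system memory access
            self.hits += 1
            return self.lines[addr]
        self.misses += 1              # cache read miss: forward to system memory
        value = self.memory[addr]
        self.lines[addr] = value      # keep a copy, anticipating reuse
        return value

    def write(self, addr, value):
        # The write is absorbed by the cache; system memory is not updated here.
        self.lines[addr] = value

memory = {0x10: 7}
cache = SimpleCache(memory)
first = cache.read(0x10)    # miss, data fetched from system memory
second = cache.read(0x10)   # hit, data served from the cache
cache.write(0x10, 9)        # the cached copy now differs from system memory
```

After the write, the copy in system memory no longer matches the cached copy, which is the situation that gives rise to the coherency concerns discussed next.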
Hence, data may be stored in multiple locations. For example, data may be stored in a cache of a particular processor as well as in system memory. If a processor alters the contents of a system memory location that is duplicated in its cache memory (e.g., cache memory of L1 cache), the cache memory may be said to hold “modified” data. The system memory may be said to hold “stale” or invalid data. Problems may result if another processor (other than the processor whose cache memory is said to hold “modified” data) or bus agent, e.g., Direct Memory Access (DMA) controller, inadvertently obtains this “stale” or invalid data from system memory. Consequently, it is required that the other processors or other bus agents are provided the most recent copy of data from either the system memory or the cache memory where the data resides. This may commonly be referred to as “maintaining cache coherency.” In order to maintain cache coherency, therefore, it may be necessary to monitor the system bus to see if another processor or bus agent accesses cacheable system memory. This method of monitoring the system bus is referred to in the art as “snooping.”
Each cache may be associated with logic circuitry commonly referred to as a “snoop controller” configured to monitor the system bus for the snoopable addresses requested by a different processor or other bus agent. Snoopable addresses may refer to the addresses requested by the other processor or bus agent that are to be snooped by snoop controllers on the system bus. Snoop controllers may snoop these snoopable addresses to determine if copies of the snoopable addresses requested by the other processor or bus agent are within their associated cache memories using a protocol commonly referred to as Modified, Exclusive, Shared and Invalid (MESI). In the MESI protocol, an indication of a coherency state is stored in association with each unit of storage in the cache memory. This unit of storage may commonly be referred to as a “coherency granule”. A “cache line” may be the size of one or more coherency granules. In the MESI protocol, the indication of the coherency state for each coherency granule in the cache memory may be stored in a cache state directory in the cache subsystem. Each coherency granule may have one of four coherency states: modified (M), exclusive (E), shared (S), or invalid (I), which may be indicated by two or more bits in the cache state directory. The modified state indicates that a coherency granule is valid only in the cache memory containing the modified or updated coherency granule and that the value of the updated coherency granule has not been written to system memory. When a coherency granule is indicated as exclusive, the coherency granule is resident in only the cache memory having the coherency granule in the exclusive state. However, the data in the exclusive state is consistent with system memory. If a coherency granule is marked as shared, the coherency granule is resident in the associated cache memory and may be in one or more cache memories in addition to the system memory. 
If the coherency granule is marked as shared, all of the copies of the coherency granule in all the cache memories so marked are consistent with the system memory. Finally, the invalid state may indicate that the data and the address tag associated with the coherency granule are both invalid and thus are not contained within that cache memory.
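The four coherency states, and the manner in which a cache may react when it snoops an access by another processor, may be sketched as follows. The two-bit encodings and the function `on_snoop` are assumptions; the transitions shown are those of a conventional textbook MESI protocol and are not asserted to be those of any particular design.

```python
from enum import Enum

class State(Enum):
    # Two-bit encodings are assumed; any distinct encodings would serve.
    MODIFIED = 0b00
    EXCLUSIVE = 0b01
    SHARED = 0b10
    INVALID = 0b11

def on_snoop(state, other_writes):
    """Return (next_state, writeback_needed) for a snooped access.

    other_writes is True when the other processor or bus agent intends
    to modify the coherency granule, False when it only reads it.
    """
    if state is State.INVALID:
        return State.INVALID, False          # nothing held, nothing to do
    writeback = state is State.MODIFIED      # only modified data is newer than memory
    if other_writes:
        return State.INVALID, writeback      # give up the local copy
    return State.SHARED, writeback           # keep a copy that is consistent with memory

next_state, writeback = on_snoop(State.MODIFIED, other_writes=False)
```

In this sketch a granule held in the modified state is written back before another agent may use it, so that the other agent does not obtain stale data from system memory.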
To determine whether a “cache hit” or a “cache miss” occurred from an address requested by the processor or whether a copy of a snoopable address requested by another processor or bus agent is within the cache memory, there may be logic in the cache to search what is referred to as a “cache directory”. The cache directory may be searched using a portion of the bits in the snoopable address or the address requested by the processor. The cache directory, as mentioned above, stores the coherency state for each coherency granule in the cache memory. The cache directory also stores a unique tag used to indicate whether data from a particular address is stored in the cache memory. This unique tag may be compared with particular bits from the snoopable address and the address requested by the processor. If there is a match, then the data contained at the requested address lies within the cache memory. Hence, the cache directory may be searched to determine if the data contained at the requested or snoopable address lies within the cache memory.
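A cache directory search of the kind described above may be sketched as follows. The directory geometry and the names `DirectoryEntry`, `make_directory`, and `lookup` are hypothetical, and the coherency state is represented by a single letter for brevity.

```python
from dataclasses import dataclass

# Hypothetical geometry, kept small for illustration.
LINE_SIZE = 128     # bytes per cache line (assumed)
NUM_CLASSES = 4     # congruence classes (assumed)
NUM_SETS = 2        # sets per congruence class (assumed)

@dataclass
class DirectoryEntry:
    tag: int = 0
    state: str = "I"    # one of "M", "E", "S", "I"

def make_directory():
    """Build an empty directory: every entry starts in the invalid state."""
    return [[DirectoryEntry() for _ in range(NUM_SETS)] for _ in range(NUM_CLASSES)]

def lookup(directory, addr):
    """Return the set holding addr, or None when the address is not cached."""
    line_number = addr // LINE_SIZE
    congruence_class = line_number % NUM_CLASSES   # address bits that index the directory
    tag = line_number // NUM_CLASSES               # address bits compared with stored tags
    for set_index, entry in enumerate(directory[congruence_class]):
        if entry.state != "I" and entry.tag == tag:
            return set_index                       # tag match on a valid entry: hit
    return None                                    # no match: miss

directory = make_directory()
directory[2][1] = DirectoryEntry(tag=5, state="S")
hit_set = lookup(directory, LINE_SIZE * (NUM_CLASSES * 5 + 2))
```

The same search serves both an address requested by the processor and a snoopable address; only the source of the address differs.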
An example of a processor associated with multiple levels of caches incorporating the above-mentioned concepts is described below in association with FIG. 1. FIG. 1 illustrates a processor 101 coupled to an L2 cache 102 which is coupled to an L3 cache 103. Processor 101, L2 cache 102 and L3 cache 103 may be implemented on an integrated circuit 104. L3 cache 103 may include a multiplexer 105 configured to receive requests from processor 101, such as a read or write request described above, as well as snoop requests, each carrying a snoopable address, via an interconnect 106. Interconnect 106 is connected to a system bus (not shown) which is connected to other processors (not shown) or bus agents (not shown). An arbitration mechanism 107 (referred to below as arbiter 107) may determine which of the two requests (the request from interconnect 106 or the request from processor 101) gets serviced. The selected request is dispatched into a dispatch pipeline 108. If the snoop request is not selected, it may be sent on a bypass pipeline 113. Bypass pipeline 113 may be configured to indicate to interconnect 106 to retry the snoop request that was denied.
Dispatch pipeline 108 is coupled to a cache directory 109. Dispatch pipeline 108 may contain logic configured to determine if the data at the requested address lies within a cache memory 114 of L3 cache 103. Dispatch pipeline 108 may determine if the data at the requested address lies within cache memory 114 by comparing the tag values in cache directory 109 with the value stored in particular bits in the requested address. As mentioned above, if there is a match, then the data contained at the requested address lies within cache memory 114. Otherwise, cache memory 114 does not store the data at the requested address. The result may be transmitted to response pipeline 110 configured to transmit an indication as to whether the data at the requested address lies within cache memory 114. The result may be transmitted to either processor 101 or to another processor (not shown) or bus agent (not shown) via interconnect 106.
Referring to FIG. 1, response pipeline 110 and bypass pipeline 113 may be coupled to a multiplexer 115. Multiplexer 115 may be configured to select, using particular bit values from arbiter 107, either the result from response pipeline 110 or the retry indication for the denied snoop request from bypass pipeline 113. That is, arbiter 107 may be configured to send particular bit values to the select input of multiplexer 115, and those bit values determine which of the two is sent.
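The selection between a processor request and a snoop request, and the later selection made at multiplexer 115, may be sketched as follows. The functions `arbitrate` and `response_mux` are hypothetical, and the policy of always favoring the processor request is an assumption made only for the example; the actual selection policy of arbiter 107 is not specified here.

```python
def arbitrate(processor_request, snoop_request):
    """Return (dispatched, bypassed) for one cycle.

    Assumed policy: a processor request, when present, always wins, and a
    competing snoop request is diverted to the bypass pipeline.
    """
    if processor_request is not None:
        return processor_request, snoop_request   # snoop (if any) is bypassed
    return snoop_request, None                    # snoop (if any) is dispatched

def response_mux(select_bypass, response_result, bypass_retry):
    """Model of the multiplexer choosing what is returned to the interconnect."""
    return bypass_retry if select_bypass else response_result

dispatched, bypassed = arbitrate("M", "B")        # processor request M beats snoop B
answer = response_mux(bypassed is not None, "result", "retry")
```

Under the assumed policy, every cycle in which both requesters are active costs the snoop requester a retry.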
Referring again to FIG. 1, dispatch pipeline 108 may further be configured to dispatch the result, e.g., cache hit, for requests from processor 101 to read/write machines 112A-N, where N is any number. Read/write machines 112A-N may collectively or individually be referred to as read/write machines 112 or read/write machine 112, respectively. Read/write machines 112 may be configured to execute these requests, e.g., read request, for processor 101.
Dispatch pipeline 108 may further be configured to dispatch the result for requests from interconnect 106 to snooping logic, referred to herein as “snoop machines” 111A-N, where N is any number. Snoop machines 111A-N may collectively or individually be referred to as snoop machines 111 or snoop machine 111, respectively. Snoop machines 111 may be configured to respond to the requests from other processors or bus agents. Snoop machines 111 may further be configured to write modified data in cache memory 114 of L3 cache 103 to the system memory (not shown) to maintain cache coherency.
Referring to FIG. 1, interconnect 106 may transfer a received snoop request to multiplexer 105 every cycle. The response to the snoop request may be transmitted at a given fixed number of cycles after interconnect 106 transmits the snoop request to L3 cache 103. For example, interconnect 106 may transmit the snoop request to multiplexer 105 on a given cycle followed by a determination by arbiter 107 as to whether the snoop request is selected to be dispatched to dispatch pipeline 108 or is to be transmitted on bypass pipeline 113 to response pipeline 110. If the snoop request is selected, it enters dispatch pipeline 108 and response pipeline 110 some cycle(s) later. A search in cache directory 109 is made some cycle(s) later by dispatch pipeline 108. The result as to whether data at the snoop address lies within cache memory 114 is transmitted to response pipeline 110. The response may be generated and transmitted to interconnect 106 some cycle(s) later by response pipeline 110. All these actions occur on a fixed schedule as illustrated in FIG. 2.
FIG. 2 is a timing diagram illustrating the actions described above occurring on a fixed schedule. Referring to FIG. 2, in conjunction with FIG. 1, interconnect 106 sends snoop requests A, B, C, and D to multiplexer 105 during the indicated clock cycles. Processor 101 (labeled “processor” in FIG. 2) sends requests M and N to multiplexer 105 during the indicated clock cycles. As illustrated in FIG. 2, snoop requests B and C are transmitted during the same cycles as requests M and N, respectively. In each cycle, one request (either the snoop request or the request sent by processor 101) is selected by arbiter 107 and dispatched to dispatch pipeline 108 (labeled “dispatch pipeline” in FIG. 2). As illustrated in FIG. 2, arbiter 107 selects snoop request A followed by selecting requests M and N instead of snoop requests B and C, respectively, followed by selecting snoop request D. These selected requests are dispatched to dispatch pipeline 108 in the clock cycles indicated in FIG. 2.
FIG. 2 further illustrates which clock cycle the result as to whether data at the addresses requested by snoop requests A and D was found within cache memory 114 is inputted to response pipeline 110. Snoop requests B and C are inputted into bypass pipeline 113 (indicated by “bypass pipeline” in FIG. 2) at the illustrated clock cycle since they were not selected by arbiter 107. At the end of response pipeline 110 for snoop request A (corresponds to the time to respond to snoop request A as labeled in FIG. 2), the result is transmitted to interconnect 106 at that given cycle. At the end of bypass pipeline 113 for snoop request B (corresponds to the time to respond to snoop request B as labeled in FIG. 2), the result (request to retry resending snoop request B) is transmitted to interconnect 106 at the cycle following the transmission of the result for snoop request A and so forth. As illustrated in FIG. 2, the time to respond to each snoop request occurs on a fixed schedule.
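The fixed response schedule may be sketched as follows. The cycle numbers, the response latency of four cycles, and the function `schedule` are assumptions made for the example and are not read from FIG. 2; only the ordering of requests A, B, C, D, M, and N follows the description above.

```python
RESPONSE_LATENCY = 4   # cycles from snoop arrival to response (assumed value)

def schedule(snoops, processor_requests):
    """Return a list of (response_cycle, snoop_name, kind) tuples.

    snoops and processor_requests map an arrival cycle to a request name.
    A snoop arriving in the same cycle as a processor request is assumed
    to lose arbitration and to receive a retry through the bypass pipeline.
    """
    responses = []
    for cycle in sorted(snoops):
        kind = "retry" if cycle in processor_requests else "result"
        # Either way, the response leaves a fixed number of cycles after arrival.
        responses.append((cycle + RESPONSE_LATENCY, snoops[cycle], kind))
    return responses

timeline = schedule({0: "A", 1: "B", 2: "C", 3: "D"}, {1: "M", 2: "N"})
```

In this sketch the response cycle depends only on the arrival cycle, whether the snoop request was dispatched or bypassed, which is the sense in which the schedule is fixed.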
As stated above, if the snoop request is not selected by arbiter 107 (i.e., arbiter 107 selected the request from processor 101 instead of the snoop request), then the snoop request, e.g., snoop requests B and C, is sent to bypass pipeline 113 some cycle(s) later. The response indicating to retry sending the snoop request is generated and transmitted to interconnect 106 by bypass pipeline 113 some cycle(s) later. Consequently, a snoop request from interconnect 106 may have to be denied and retried, which may result in hundreds of additional clock cycles of delay. If the number of rejected snoop requests could be reduced, then the performance could be improved.
Therefore, there is a need in the art to improve the performance by reducing the number of snoop requests denied.