A multiprocessor system may comprise multiple processors coupled to a common shared system memory. The multiprocessor system may further include one or more levels of cache associated with each processor. A cache includes a relatively small, high speed memory (“cache memory”) that contains a copy of information from one or more portions of the system memory. A Level-1 (L1) cache or primary cache may be built into the integrated circuit of the processor. The processor may be associated with additional levels of cache, such as a Level-2 (L2) cache and a Level-3 (L3) cache. These lower level caches, e.g., L2, L3, may be employed to stage data to the L1 cache and typically have progressively larger storage capacities but longer access latencies.
The cache memory may be organized as a collection of spatially mapped, fixed size storage region pools commonly referred to as “congruence classes.” Each of these storage region pools typically comprises one or more storage regions of fixed granularity. These storage regions may be freely associated with any equally granular storage region in the system as long as the storage region spatially maps to a congruence class. The position of the storage region within the pool may be referred to as the “set.” The intersection of each congruence class and set contains a cache line. The size of the storage granule may be referred to as the “cache line size.” A unique tag may be derived from an address of a given storage granule to indicate its residency in a given congruence class and set.
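As an illustration of how a tag and a congruence class may be derived from an address, the following is a minimal sketch and not a description of any particular cache. The function name `split_address`, the 128-byte cache line size, and the 1024 congruence classes are assumptions chosen for the example; both sizes are assumed to be powers of two so that each field is a simple bit slice of the address.

```python
def split_address(address, line_size=128, num_classes=1024):
    """Split an address into (tag, congruence class index, byte offset).

    line_size and num_classes are illustrative values, not taken from the text.
    """
    # Both sizes must be powers of two so the fields are plain bit slices.
    assert line_size & (line_size - 1) == 0
    assert num_classes & (num_classes - 1) == 0
    offset_bits = line_size.bit_length() - 1
    index_bits = num_classes.bit_length() - 1
    offset = address & (line_size - 1)                    # byte within the cache line
    index = (address >> offset_bits) & (num_classes - 1)  # congruence class
    tag = address >> (offset_bits + index_bits)           # remaining high-order bits
    return tag, index, offset
```

Under these assumptions, two addresses with the same index but different tags map to the same congruence class and compete for its sets, while the tag distinguishes which storage granule currently occupies a given set.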
When a processor generates a read request and the requested data resides in its cache memory (e.g., cache memory of L1 cache), then a cache read hit takes place. The processor may then obtain the data from the cache memory without having to access the system memory. If the data is not in the cache memory, then a cache read miss occurs. The memory request may be forwarded to the system memory and the data may subsequently be retrieved from the system memory as would normally be done if the cache did not exist. On a cache miss, the data that is retrieved from the system memory may be provided to the processor and may also be written into the cache memory due to the statistical likelihood that this data will be requested again by that processor. Likewise, if a processor generates a write request, the write data may be written to the cache memory without having to access the system memory over the system bus.
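The read hit, read miss, and write behavior described above may be sketched as follows. This is a simplified model under stated assumptions: the class name `SimpleCache` is hypothetical, system memory is modeled as a dictionary keyed by address, and capacity limits, eviction, and cache line granularity are ignored.

```python
class SimpleCache:
    """Toy model of one processor's cache in front of system memory."""

    def __init__(self, system_memory):
        self.memory = system_memory  # dict: address -> value (assumed model)
        self.lines = {}              # cached copies: address -> value
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.lines:
            # Cache read hit: no access to system memory is needed.
            self.hits += 1
            return self.lines[address]
        # Cache read miss: fetch from system memory and keep a copy,
        # since the same data is likely to be requested again.
        self.misses += 1
        value = self.memory[address]
        self.lines[address] = value
        return value

    def write(self, address, value):
        # The write is absorbed by the cache; system memory is not updated here.
        self.lines[address] = value
```

In this sketch, a write leaves system memory holding the old value, so the cached copy and the system memory copy of the same address can differ after the write.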
Hence, data may be stored in multiple locations. For example, data may be stored in a cache of a particular processor as well as in system memory. If a processor alters the contents of a system memory location that is duplicated in its cache memory (e.g., cache memory of L1 cache), the cache memory may be said to hold “modified” data, and the system memory may be said to hold “stale” or invalid data. Problems may result if another processor (other than the processor whose cache memory holds the “modified” data) or bus agent, e.g., Direct Memory Access (DMA) controller, inadvertently obtains this “stale” or invalid data from system memory. Consequently, the other processors or bus agents must be provided the most recent copy of the data from either the system memory or the cache memory where the data resides. This may commonly be referred to as “maintaining cache coherency.” In order to maintain cache coherency, it may be necessary to monitor the system bus to see if another processor or bus agent accesses cacheable system memory. This method of monitoring the system bus is referred to in the art as “snooping.”
Each cache may be associated with logic circuitry commonly referred to as a “snoop controller” configured to monitor the system bus for the snoopable addresses requested by a different processor or other bus agent. Snoopable addresses may refer to the addresses requested by the other processor or bus agent that are to be snooped by snoop controllers on the system bus. Snoop controllers may snoop these snoopable addresses to determine if copies of the snoopable addresses requested by the other processor or bus agent are within their associated cache memories using a protocol commonly referred to as Modified, Exclusive, Shared and Invalid (MESI). In the MESI protocol, an indication of a coherency state is stored in association with each unit of storage in the cache memory. This unit of storage may commonly be referred to as a “coherency granule”. A “cache line” may be the size of one or more coherency granules. In the MESI protocol, the indication of the coherency state for each coherency granule in the cache memory may be stored in a cache state directory in the cache subsystem. Each coherency granule may have one of four coherency states: modified (M), exclusive (E), shared (S), or invalid (I), which may be indicated by two or more bits in the cache state directory. The modified state indicates that a coherency granule is valid only in the cache memory containing the modified or updated coherency granule and that the value of the updated coherency granule has not been written to system memory. When a coherency granule is indicated as exclusive, the coherency granule is resident in only the cache memory having the coherency granule in the exclusive state. However, the data in the exclusive state is consistent with system memory. If a coherency granule is marked as shared, the coherency granule is resident in the associated cache memory and may be in one or more cache memories in addition to the system memory. 
If the coherency granule is marked as shared, all of the copies of the coherency granule in all the cache memories so marked are consistent with the system memory. Finally, the invalid state may indicate that the data and the address tag associated with the coherency granule are both invalid and thus are not contained within that cache memory.
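The four coherency states may be sketched as below. The `consistent_with_memory` helper follows the state descriptions given above. The `on_snoop` transitions, by contrast, are the commonly taught textbook MESI responses to a snooped request and are an assumption of this example; they are not taken from the description above, and real implementations may differ. All names are hypothetical.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def consistent_with_memory(state):
    """True when a valid cached copy matches system memory (E or S)."""
    return state in (State.EXCLUSIVE, State.SHARED)


def on_snoop(state, other_agent_writes):
    """Return (next_state, write_back) for a snooped request hitting this granule.

    Assumed textbook behavior: a snooped read demotes a valid copy to shared,
    a snooped write invalidates it, and a modified copy is written back first.
    """
    if state is State.INVALID:
        return State.INVALID, False
    write_back = state is State.MODIFIED  # only M holds data newer than memory
    next_state = State.INVALID if other_agent_writes else State.SHARED
    return next_state, write_back
```

Four states fit in two bits, which is consistent with the statement that the state may be indicated by two or more bits in the cache state directory.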
To determine whether a “cache hit” or a “cache miss” occurred for an address requested by the processor, or whether a copy of the data at a snoopable address requested by another processor or bus agent is within the cache memory, there may be logic in the cache to search what is referred to as a “cache directory”. The cache directory may be searched using a portion of the bits in the snoopable address or the address requested by the processor. The cache directory, as mentioned above, stores the coherency state for each coherency granule in the cache memory. The cache directory also stores a unique tag used to indicate whether data from a particular address is stored in the cache memory. This unique tag may be compared with particular bits from the snoopable address or the address requested by the processor. If there is a match, then the data contained at the requested address lies within the cache memory. Hence, the cache directory may be searched to determine if the data contained at the requested or snoopable address lies within the cache memory.
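A cache directory search of the kind described above may be sketched as follows. The sizes, the list-based directory layout, and the function names are assumptions made for illustration; a hardware directory would compare the tags of all sets in a congruence class in parallel rather than in a loop. The same `lookup` serves both a processor request and a snooped address.

```python
LINE_SIZE = 128   # bytes per cache line (assumed)
NUM_CLASSES = 4   # congruence classes (assumed, kept tiny for illustration)
NUM_SETS = 2      # sets per congruence class (assumed)


def make_directory():
    # One entry per (congruence class, set): [tag, state]; "I" means invalid.
    return [[[None, "I"] for _ in range(NUM_SETS)] for _ in range(NUM_CLASSES)]


def index_and_tag(address):
    line_number = address // LINE_SIZE
    return line_number % NUM_CLASSES, line_number // NUM_CLASSES


def install(directory, address, state, set_number):
    """Record that the line holding `address` now occupies the given set."""
    index, tag = index_and_tag(address)
    directory[index][set_number] = [tag, state]


def lookup(directory, address):
    """Return (hit, set_number, state) for a processor or snooped address."""
    index, tag = index_and_tag(address)
    for set_number, (entry_tag, state) in enumerate(directory[index]):
        # A match counts only if the entry is valid and the tags are equal.
        if state != "I" and entry_tag == tag:
            return True, set_number, state
    return False, None, "I"
```

For example, after `install(d, 0x1000, "E", 1)` on a fresh directory `d`, a lookup of any address in the same 128-byte line hits in set 1, while an address that maps to the same congruence class with a different tag misses.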
An example of a processor associated with multiple levels of caches incorporating the above-mentioned concepts is described below in association with FIG. 1. FIG. 1 illustrates a processor 101 coupled to an L2 cache 102 which is coupled to an L3 cache 103. Processor 101, L2 cache 102 and L3 cache 103 may be implemented on an integrated circuit 104. L3 cache 103 may include a multiplexer 105 configured to receive requests from processor 101, such as a read or write request described above, as well as the snoopable address via an interconnect 106. Interconnect 106 is connected to a system bus (not shown) which is connected to other processors (not shown) or bus agents (not shown). An arbitration mechanism 107 may determine which of the two requests (the request from interconnect 106 or the request from processor 101) gets serviced. The selected request is dispatched into a dispatch pipeline 108. Dispatch pipeline 108 is coupled to a cache directory 109. Dispatch pipeline 108 may contain logic configured to determine if the data at the requested address lies within the cache memory (not shown) of L3 cache 103. Dispatch pipeline 108 may make this determination by comparing the tag values in cache directory 109 with the value stored in particular bits in the requested address. As mentioned above, if there is a match, then the data contained at the requested address lies within the cache memory. Otherwise, the cache memory does not store the data at the requested address. The result may be transmitted to a response pipeline 110 configured to transmit an indication as to whether the data at the requested address lies within the cache memory. The result may be transmitted to either processor 101 or to another processor (not shown) or bus agent (not shown) via interconnect 106.
Dispatch pipeline 108 may further dispatch the result, e.g., cache hit, for requests from processor 101 to read/write machines 112A-N, where N is any number. Read/write machines 112A-N may collectively or individually be referred to as read/write machines 112 or read/write machine 112, respectively. Read/write machines 112 may be configured to execute these requests, e.g., read request, for processor 101.
Dispatch pipeline 108 may further dispatch the result for requests from interconnect 106 to snooping logic, referred to herein as “snoop machines” 111A-N, where N is any number. Snoop machines 111A-N may collectively or individually be referred to as snoop machines 111 or snoop machine 111, respectively. Snoop machines 111 may be configured to respond to the requests from other processors or bus agents. Snoop machines 111 may further be configured to write modified data in the cache memory of L3 cache 103 to the system memory (not shown) to maintain cache coherency.
As stated above, the lower cache levels, e.g., L2, L3, may be employed to stage data to the L1 cache and typically have progressively larger storage capacities. As semiconductor technology advances, there continues to be an exponential growth in the number of transistors per integrated circuit as predicted by Gordon Moore (commonly referred to as Moore's Law). As the number of transistors per integrated circuit continues to increase, the amount of information that may be stored in a cache memory increases, thereby resulting in an increase in the number of entries in cache directories.
However, the speed or frequency of operation in computers continues to increase as well. As the frequency of operation increases, the rate at which requests, such as the requests from interconnect 106, must be serviced increases as well. The speed at which a cache directory may be accessed, however, is not keeping up with the increased speed of operation. Consequently, a request from interconnect 106 may have to be denied and retried, which may result in hundreds of additional clock cycles of delay. If the snoop bandwidth or throughput of the cache directory could be improved, then fewer requests from interconnect 106 would be denied.
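The relationship between directory access time and denied requests may be illustrated with a toy model. The function `count_retries` and its parameters are hypothetical, and the model rests on simplifying assumptions: the directory services one snoop request at a time, stays busy for a fixed number of cycles per access, and denies any request that arrives while it is busy. Re-issue of denied requests and competing processor requests are not modeled.

```python
def count_retries(arrival_cycles, directory_busy_cycles):
    """Count snoop requests denied because the directory is still busy.

    arrival_cycles: cycle numbers at which snoop requests arrive.
    directory_busy_cycles: cycles one directory access occupies (assumed fixed).
    """
    free_at = 0   # first cycle at which the directory can accept a request
    retries = 0
    for cycle in sorted(arrival_cycles):
        if cycle < free_at:
            retries += 1  # directory busy: the request is denied
        else:
            free_at = cycle + directory_busy_cycles
    return retries
```

With one request arriving per cycle, `count_retries([0, 1, 2, 3], 2)` gives 2, while `count_retries([0, 1, 2, 3], 1)` gives 0. In this model, shortening the directory access time relative to the request rate directly reduces the number of denied requests.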
Therefore, there is a need in the art to improve the snoop bandwidth or throughput of the cache directory thereby reducing the number of requests denied.