Many modern computer systems are multi-processor systems. That is, they include multiple processors, coupled together on a common bus, that share the computing load of the system. In addition, the multiple processors typically share a common system memory. Still further, each of the processors includes a cache memory or, more typically, a hierarchy of cache memories.
A cache memory, or cache, is a memory internal to the processor that stores a subset of the data in the system memory and is typically much smaller than the system memory. Transfers of data between the processor and its cache are much faster than transfers of data between the processor and system memory. When a processor reads data from the system memory, the processor also stores the data in its cache so that the next time the processor needs to read the data it can read it more quickly from the cache rather than from the system memory. Similarly, the next time the processor needs to write data to a system memory address whose data is stored in the cache, the processor can simply write to the cache rather than having to write the data immediately to memory, which is commonly referred to as write-back caching. This ability to access data in the cache, thereby avoiding the need to access memory, greatly improves system performance by reducing the overall data access time.
Caches store data in cache lines. A common cache line size is 32 bytes. A cache line is the smallest unit of data that can be transferred between the cache and the system memory. That is, when a processor wants to read a cacheable piece of data from memory, it reads all the data in the cache line containing the data and stores the entire cache line in the cache. Similarly, when writing a new cache line into the cache causes a modified cache line to be replaced, the processor writes the entire replaced line to memory.
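The line granularity described above can be illustrated with a little address arithmetic. The following is a minimal sketch, assuming the 32-byte line size mentioned in the text and a power-of-two line size; the names `LINE_SIZE`, `line_base`, and `line_offset` are illustrative and do not come from the source.

```python
LINE_SIZE = 32  # bytes; the common size named in the text (assumed power of two)

def line_base(addr: int) -> int:
    """Address of the first byte of the cache line containing addr."""
    return addr & ~(LINE_SIZE - 1)

def line_offset(addr: int) -> int:
    """Byte position of addr within its cache line."""
    return addr & (LINE_SIZE - 1)

# A read of any byte in the range [base, base + LINE_SIZE) fetches the whole line.
base = line_base(0x1234)
offset = line_offset(0x1234)
```

Under these assumptions, every address that shares the same `line_base` value is transferred together as one unit.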
The presence of multiple processors each having its own cache that caches data from a shared memory introduces a problem of cache coherence. That is, the view of memory that one processor sees through its cache may be different from the view another processor sees through its cache. For example, assume a location in memory denoted X contains a value of 1. Processor A reads from memory at address X and caches the value of 1 into its cache. Next, processor B reads from memory at address X and caches the value of 1 into its cache. Then processor A writes a value of 0 into its cache and also updates memory at address X to a value of 0. Now if processor A reads address X it will receive a 0 from its cache; but if processor B reads address X it will receive a 1 from its cache.
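The scenario above can be reproduced in a few lines. This is a minimal sketch, assuming each cache is a plain dictionary with no coherence mechanism at all; `read`, `write_through`, `cache_a`, and `cache_b` are hypothetical names chosen for illustration.

```python
memory = {"X": 1}   # shared system memory; location X holds 1
cache_a = {}        # processor A's private cache
cache_b = {}        # processor B's private cache

def read(cache, addr):
    """Return the cached value, filling the cache from memory on a miss."""
    if addr not in cache:
        cache[addr] = memory[addr]
    return cache[addr]

def write_through(cache, addr, value):
    """Update the writer's own cache and memory; other caches are untouched."""
    cache[addr] = value
    memory[addr] = value

read(cache_a, "X")              # A caches the value 1
read(cache_b, "X")              # B caches the value 1
write_through(cache_a, "X", 0)  # A writes 0 to its cache and to memory

a_view = read(cache_a, "X")     # A sees the new value
b_view = read(cache_b, "X")     # B still sees its stale cached copy
```

Even though memory has been updated, processor B keeps hitting in its own cache, so nothing ever prompts it to refetch the line.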
The example above illustrates the need to keep track of the state of any cache lines that are shared by more than one cache in the system. One common scheme for enforcing cache coherence is referred to as snooping. With snooping, each cache maintains a copy of the sharing status for every cache line it holds. Each cache monitors, or snoops, every transaction on the bus shared by the other processors to determine whether or not the cache has a copy of the cache line implicated by the bus transaction initiated by another processor. The cache performs different actions depending upon the type of transaction snooped and the status of the cache line implicated. A common cache coherency status protocol is the MESI protocol. MESI stands for Modified, Exclusive, Shared, Invalid, which are the four possible states or status values of a cache line in a cache.
One method of maintaining cache coherence commonly used with snooping is to ensure that a processor has exclusive access to a cache line before writing data to it. This method is commonly referred to as a write invalidate protocol because on a write it invalidates any copies of the implicated cache line in the other caches. Requiring exclusive access ensures that no other readable or writable copies of a cache line exist when the writing processor writes the data.
To invalidate the other copies of the cache line in the other caches, the invalidating processor gains access to the bus and provides on the bus the address of the cache line to be invalidated. The other caches are snooping the bus and check to see if they are currently caching the address. If so, the other caches change the state of the cache line to Invalid.
In addition, each cache also snoops the bus to determine if it has a modified cache line that is being read by another processor. If so, the cache provides the modified cache line, either by writing the modified cache line to memory or providing the modified cache line to the requesting processor, or both. The transaction reading the cache line may allow the cache line to be shared or it may require the other caches to invalidate the line.
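The two snoop responses described above, invalidating a copy and supplying a modified line, can be sketched as follows. This is a simplified model under stated assumptions, not a complete MESI implementation: the class `SnoopingCache` and its methods are hypothetical, the bus is reduced to direct method calls, and a snooped read of a modified line is assumed to both write the line to memory and return it to the requester.

```python
from enum import Enum

class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"

class SnoopingCache:
    """Hypothetical cache model: maps a line address to [state, data]."""

    def __init__(self):
        self.lines = {}

    def state(self, addr):
        entry = self.lines.get(addr)
        return entry[0] if entry else State.INVALID

    def snoop_invalidate(self, addr):
        """Another processor wants exclusive access: mark our copy Invalid.
        (Only the status change from the text is modeled here.)"""
        if addr in self.lines:
            self.lines[addr][0] = State.INVALID

    def snoop_read(self, addr, memory, allow_sharing=True):
        """Another processor reads addr. Returns the data we supply, or None."""
        entry = self.lines.get(addr)
        if entry is None or entry[0] is State.INVALID:
            return None
        supplied = None
        if entry[0] is State.MODIFIED:
            memory[addr] = entry[1]  # write the modified line back to memory
            supplied = entry[1]      # and also provide it to the requester
        # The transaction either permits sharing or requires invalidation.
        entry[0] = State.SHARED if allow_sharing else State.INVALID
        return supplied

memory = {0x40: 1}
cache_b = SnoopingCache()
cache_b.lines[0x40] = [State.MODIFIED, 7]    # B holds a modified copy

supplied = cache_b.snoop_read(0x40, memory)  # A's read is snooped by B
state_after_read = cache_b.state(0x40)
cache_b.snoop_invalidate(0x40)               # A later requests exclusive access
state_after_invalidate = cache_b.state(0x40)
```

In this sketch the cache holding the modified line is the only agent that can answer the read correctly, since memory is stale until the write-back occurs.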
Processor caches typically include a hierarchy of caches. For example, a processor may have a level-one (L1) and level-two (L2) cache. The L1 cache is closer to the computation elements of the processor than the L2 cache, and is capable of providing data to the computation elements faster than the L2 cache. Furthermore, the caches may be further divided into separate instruction caches and data caches for caching instructions and data, respectively.
The various caches within the cache hierarchy of the processor transfer cache lines between one another. For example, if a cache address misses in an L1 cache, the L1 might load the missing cache line from an L2 cache in the processor if it is present in the L2. Also, if an L1 cache needs to replace a valid cache line with a newer cache line, the L1 cache may cast out the replaced cache line to the L2 cache rather than writing the cache line to system memory. This is particularly common for write-back cache configurations.
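The line movement between levels described above can be sketched as follows. This is a minimal illustration under several assumptions that are not in the source: the L1 holds only two lines, replacement is least-recently-used, the L2 is unbounded, and the names `l1_read` and `L1_CAPACITY` are hypothetical.

```python
from collections import OrderedDict

L1_CAPACITY = 2  # assumed tiny capacity so that a replacement is easy to trigger

def l1_read(addr, l1, l2, memory):
    """Read a line through the L1, returning (data, where it was found)."""
    if addr in l1:
        l1.move_to_end(addr)          # mark as most recently used
        return l1[addr], "L1 hit"
    if addr in l2:
        data, source = l2[addr], "L2 hit"   # L1 miss filled from the L2
    else:
        data, source = memory[addr], "memory"
    if len(l1) >= L1_CAPACITY:
        victim_addr, victim_data = l1.popitem(last=False)  # oldest line
        l2[victim_addr] = victim_data  # cast out to the L2, not to memory
    l1[addr] = data
    return data, source

memory = {0: "a", 32: "b", 64: "c"}
l1, l2 = OrderedDict(), {}

l1_read(0, l1, l2, memory)
l1_read(32, l1, l2, memory)
l1_read(64, l1, l2, memory)           # L1 is full: line 0 is cast out to the L2
result = l1_read(0, l1, l2, memory)   # line 0 now comes back from the L2
```

The final read misses in the L1 but is satisfied by the L2, which holds the line only because it was cast out there earlier.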
The transfer of a cache line between two caches in a processor may require several processor clock cycles. This may be true for several reasons. One reason is that caches typically comprise a pipeline of multiple stages, wherein each stage processes a portion of an operation during a clock cycle, implying that multiple clock cycles are required to read or write the cache. Additionally, caches are often multi-pass caches, meaning that a first pass through the pipeline, typically referred to as a query pass, is required to obtain the status of the implicated cache line. One or more subsequent passes are required to update the cache based on the status obtained or to read additional data that was not obtained during the query pass. Still further, the caches may be located a relatively large distance away from one another on the processor integrated circuit, requiring additional clock cycles for signals that travel long paths and/or that must propagate through many logic gates to be generated.
For example, assume the processor stores a new cache line to its L1 cache, forcing the L1 to replace a modified cache line. The L1 may cast out the modified cache line that was chosen for replacement to an L2 cache on the processor. The L1 reads the castout line from its pipeline and stores the line into a buffer between the two caches. The L1 informs the L2 of the castout and subsequently overwrites the castout line with the new cache line. The L2 reads the castout line from the castout buffer and writes the line into itself.
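The castout sequence above can be traced step by step. The following is a minimal sketch, assuming the two caches and the buffer between them are plain dictionaries and that each step completes atomically; the function `castout_and_replace` and its parameters are hypothetical.

```python
def castout_and_replace(l1, l2, castout_buffer, victim_addr, new_addr, new_data):
    """Replace a modified L1 line with a new line, casting the old one out to L2."""
    steps = []

    # 1. The L1 reads the castout line and stores it into the buffer.
    castout_buffer[victim_addr] = l1[victim_addr]
    steps.append("L1 copies castout line into buffer")

    # 2. The L1 informs the L2 of the castout (modeled only as a log entry).
    steps.append("L1 informs L2 of the castout")

    # 3. The L1 overwrites the castout line with the new line.
    del l1[victim_addr]
    l1[new_addr] = new_data
    steps.append("L1 overwrites castout line with new line")

    # 4. The L2 reads the line from the buffer and writes it into itself.
    l2[victim_addr] = castout_buffer.pop(victim_addr)
    steps.append("L2 writes castout line into itself")

    return steps

l1 = {0x100: "modified line"}
l2 = {}
castout_buffer = {}
steps = castout_and_replace(l1, l2, castout_buffer, 0x100, 0x200, "new line")
```

In this sketch, between the third and fourth steps the castout line exists only in the buffer, in neither cache. In hardware these steps span several clock cycles rather than completing instantly.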
This works well as long as the caches do not snoop a transaction on the bus that collides with the address of the castout line during the castout, i.e., a transaction that has the same address as the castout line. A colliding snoop while the castout is in flight introduces significant design problems that must be addressed. For example, if the snooped transaction is a read and the cache line that is in flight is a cache line with modified data that has not been written to memory, which of the two caches will supply the cache line data to the snooped transaction on the bus? Which of the two caches will own the castout line in order to update its status?
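Detecting such a collision amounts to an address comparison. The following is a minimal sketch, assuming a 32-byte line size, a comparison at cache line granularity, and that `None` denotes "no castout in flight"; the name `collides` is hypothetical, and the sketch only detects the collision without resolving it.

```python
LINE_SIZE = 32  # bytes; assumed power-of-two cache line size

def collides(snoop_addr, inflight_castout_addr):
    """True if a snooped address falls in the line currently being cast out."""
    if inflight_castout_addr is None:  # no castout is in flight
        return False
    mask = ~(LINE_SIZE - 1)            # clears the byte-offset bits
    return (snoop_addr & mask) == (inflight_castout_addr & mask)

same_line = collides(0x1234, 0x1220)   # both addresses lie in line 0x1220
other_line = collides(0x1240, 0x1220)  # 0x1240 begins the next line
idle = collides(0x1234, None)          # nothing in flight, nothing to collide with
```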
The conventional approach to the problem has been to cancel or kill the in-flight operation. However, this approach has negative side effects. It complicates the timing and the cache control logic needed to handle the cancelled in-flight operation. For instance, in the example above, the L1 cache must delay overwriting the castout line with the new line until it is informed by the L2 that it is safe to do so. The longer the L1 must wait to overwrite the castout line, the more complicated the process to back out and/or retry the operation. Also, the added delay may adversely affect performance. Furthermore, the added communication between the caches in the form of cancellation and handshaking may take place on signals between the two caches that are relatively long and have significant propagation delay if the two cache blocks are a relatively great distance from one another, which may consequently create critical timing paths.
Therefore, what is needed is a cache that internally handles the effects of an external snoop that collides with an in-flight operation rather than killing it.