A significant portion of the operations performed by a microprocessor consists of reading data from or writing data to memory. Reading data from memory is commonly referred to as a load, and writing data to memory is commonly referred to as a store. Typically, a microprocessor generates load and store operations in response to an instruction that accesses memory. The microprocessor may also generate load and store operations for other reasons internal to its operation, such as loading page table information or evicting a cache line to memory.
Because accesses to memory are relatively slow compared to other operations within the microprocessor, modern microprocessors employ cache memories. A cache memory, or cache, is a memory in the microprocessor that stores a subset of the data in the system memory and is typically much smaller than the system memory. Transfers of data between the microprocessor and its cache are much faster than transfers of data between the microprocessor and system memory. When a microprocessor reads data from the system memory, it also stores the data in its cache, so that the next time it needs the data it can read it more quickly from the cache rather than from the system memory. Similarly, the next time the microprocessor needs to write data to a system memory address whose data is stored in the cache, it can simply write to the cache rather than immediately writing the data to memory; this is commonly referred to as write-back caching. The ability to access data in the cache, thereby avoiding or deferring accesses to memory, greatly improves system performance by reducing the overall data access time.
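The read-caching and write-back behavior just described can be illustrated with a toy model. This is a minimal sketch, assuming a dict-based memory and cache at single-address granularity (cache lines are ignored here); the class name `WriteBackCache` and its counters are hypothetical. It counts memory accesses to show that repeated reads hit the cache and that a write hit defers the memory update until the data is written back.

```python
class WriteBackCache:
    """Toy write-back cache in front of a dict-based memory (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory        # address -> value ("system memory")
        self.lines = {}             # cached address -> value
        self.dirty = set()          # addresses modified in the cache but not yet in memory
        self.memory_reads = 0
        self.memory_writes = 0

    def read(self, addr):
        if addr not in self.lines:              # miss: fetch from memory and keep a copy
            self.memory_reads += 1
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]                 # hit: served from the cache

    def write(self, addr, value):
        if addr in self.lines:                  # hit: update the cache only, defer the memory write
            self.lines[addr] = value
            self.dirty.add(addr)
        else:                                   # miss: this toy model writes straight to memory
            self.memory_writes += 1
            self.memory[addr] = value

    def flush(self):
        for addr in sorted(self.dirty):         # write back all deferred updates
            self.memory[addr] = self.lines[addr]
            self.memory_writes += 1
        self.dirty.clear()
```

For example, two reads of the same address cost one memory read, and a write hit leaves memory unchanged until `flush()` is called.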
Caches store data in cache lines. A common cache line size is 32 bytes. A cache line is the smallest unit of data that can be transferred between the cache and the system memory. That is, when a microprocessor wants to read a cacheable piece of data from memory, it reads all the data in the cache line containing that piece of data and stores the entire cache line in the cache. Similarly, when bringing a new cache line into the cache causes a modified cache line to be replaced, the microprocessor writes the entire replaced line to memory.
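Because the cache line is the unit of transfer, an address is split into the base address of its line and a byte offset within that line. The sketch below assumes the 32-byte line size mentioned above (any power of two works) and a flat `bytes` object standing in for system memory; the function names are hypothetical.

```python
LINE_SIZE = 32  # bytes; assumed to be a power of two


def line_base(addr: int) -> int:
    """Address of the first byte of the cache line containing addr."""
    return addr & ~(LINE_SIZE - 1)


def line_offset(addr: int) -> int:
    """Position of addr within its cache line."""
    return addr & (LINE_SIZE - 1)


def fill_line(memory: bytes, addr: int) -> bytes:
    """Read the whole line containing addr, as a cache does on a cacheable read miss."""
    base = line_base(addr)
    return memory[base:base + LINE_SIZE]
```

A read of the single byte at address `0x47` therefore transfers the entire line starting at `0x40`, and the requested byte sits at offset 7 within it.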
Modern caches are typically pipelined. That is, the caches are composed of multiple stages coupled together to form a pipeline. Performing a store operation to a cache typically requires two or more passes through the cache pipeline. During the first pass, the memory address of the store operation is provided to the cache to determine whether the address is present, i.e., cached, in the cache and, if so, the status of the cache line associated with the store address. During the second pass, the data of the store operation is written into the cache.
To accommodate the two-pass nature of a cache, microprocessors typically employ store buffers to hold the store data and store address for use in the second pass. In addition, the cache is typically a resource accessed by multiple functional blocks within the microprocessor; consequently, the functional blocks must arbitrate for access to the cache. The store buffers also serve the purpose of holding the store data and address until the store operation wins arbitration for the cache to perform the store.
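The two-pass store and the role of the store buffer can be sketched as follows. This is a simplified model under stated assumptions: addresses are already line-aligned, each line holds one value, and MESI status is a single letter. `StoreBufferEntry`, `PipelinedCache`, `lookup_pass`, and `write_pass` are hypothetical names. The entry carries the address, data, and first-pass results, so it can wait in the buffer for as many cycles as arbitration takes before the second pass.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoreBufferEntry:
    """Hypothetical store-buffer entry: holds a store between the two cache passes."""
    address: int                   # line-aligned store address (assumption)
    data: int
    hit: Optional[bool] = None     # recorded by the first pass
    status: Optional[str] = None   # MESI status of the line, recorded by the first pass


class PipelinedCache:
    def __init__(self):
        self.tags = {}   # line address -> MESI status letter
        self.data = {}   # line address -> stored value (one value per line, for brevity)

    def lookup_pass(self, entry: StoreBufferEntry) -> None:
        """Pass 1: present the address; record hit/miss and line status in the entry."""
        entry.status = self.tags.get(entry.address, "I")
        entry.hit = entry.status != "I"

    def write_pass(self, entry: StoreBufferEntry) -> bool:
        """Pass 2 (after winning arbitration): write the buffered data.

        Only a hit to a Modified or Exclusive line is written here; a Shared line
        would first need ownership and a miss needs the line, neither of which
        is modeled.
        """
        if entry.hit and entry.status in ("M", "E"):
            self.data[entry.address] = entry.data
            self.tags[entry.address] = "M"
            return True
        return False
```

A store to an Exclusive line completes on the second pass and leaves the line Modified, while a store that misses stays in the buffer for other handling.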
When a store operation address hits in the cache, the data can typically be written to the cache line immediately. However, if the store address misses the cache, the data cannot be immediately written to the cache. This is because the store data is almost always less than a full cache line and, as mentioned above, a cache line is the smallest unit of data that can be transferred between the cache and the system memory. If the store data were written to the cache without the remaining bytes of the cache line present in the cache, the cache would later have to write a partial cache line to memory, which is not permitted.
One solution is simply to write the store data to memory and not to the cache, i.e., to cache only loads. Another solution is to first read the cache line implicated by the store address from memory, merge the store data with the cache line, and then store the updated cache line to the cache. This type of cache is commonly referred to as a write-allocate cache, since space for a cache line is allocated in the cache on write operations that miss in the cache.
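The two policies can be contrasted in a short sketch. It assumes a 32-byte line, a `bytearray` standing in for system memory, a dict mapping line base addresses to full-line `bytearray`s as the cache, and stores that do not cross a line boundary; the function names are hypothetical. On a miss, the write-allocate version fetches the full line before merging, so the cache never holds a partial line; the other version sends miss data straight to memory.

```python
LINE_SIZE = 32  # bytes; assumed to be a power of two


def write_allocate_store(cache: dict, memory: bytearray, addr: int, data: bytes) -> None:
    """Store data at addr; on a miss, allocate the line in the cache first."""
    base = addr & ~(LINE_SIZE - 1)
    offset = addr - base
    if offset + len(data) > LINE_SIZE:
        raise ValueError("store crosses a cache-line boundary")
    if base not in cache:
        # Miss: read the whole line from memory (the data a response buffer would receive).
        cache[base] = bytearray(memory[base:base + LINE_SIZE])
    # Merge the store bytes into the complete line; memory is updated only on a later write-back.
    cache[base][offset:offset + len(data)] = data


def no_write_allocate_store(cache: dict, memory: bytearray, addr: int, data: bytes) -> None:
    """Store data at addr; on a miss, bypass the cache and write memory directly."""
    base = addr & ~(LINE_SIZE - 1)
    offset = addr - base
    if base in cache:
        cache[base][offset:offset + len(data)] = data   # hit: update the cached line
    else:
        memory[addr:addr + len(data)] = data            # miss: cache is left untouched
```

With write-allocate, a 2-byte store that misses leaves a full 32-byte line in the cache, with the 2 stored bytes merged among 30 bytes read from memory.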
Modern microprocessors commonly include buffers to receive data read from memory, such as cache lines read from memory for a write-allocate operation. These buffers are commonly referred to as response buffers.
Modern computer systems commonly employ multiple microprocessors and/or multiple levels of cache. For example, each microprocessor may include level-one (L1) and level-two (L2) caches. Furthermore, the caches at each level may be separated into distinct instruction caches and data caches. The presence of multiple microprocessors and/or caches that cache data from a shared memory introduces a problem of cache coherence. That is, the view of memory that one microprocessor sees through its cache may be different from the view another microprocessor sees through its cache. For example, assume a location in memory denoted X contains a value of 1. Microprocessor A reads from memory at address X and caches the value of 1 into its cache. Next, microprocessor B reads from memory at address X and caches the value of 1 into its cache. Then microprocessor A writes a value of 0 into its cache and also updates memory at address X to a value of 0. Now if microprocessor A reads address X it will receive a 0 from its cache; but if microprocessor B reads address X it will receive a 1 from its cache.
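The stale-data scenario above can be replayed with two private caches and no coherence protocol. This sketch models memory and each cache as a Python dict; the names `memory`, `cache_a`, `cache_b`, and `read` are illustrative only.

```python
# Shared memory and two private caches, with no coherence protocol (hypothetical model).
memory = {"X": 1}
cache_a = {}   # microprocessor A's cache
cache_b = {}   # microprocessor B's cache


def read(cache, addr):
    if addr not in cache:        # miss: fill from memory
        cache[addr] = memory[addr]
    return cache[addr]           # hit: served locally, so other processors' writes go unseen


read(cache_a, "X")               # A caches the value 1
read(cache_b, "X")               # B caches the value 1
cache_a["X"] = 0                 # A writes 0 into its cache...
memory["X"] = 0                  # ...and also updates memory

print(read(cache_a, "X"), read(cache_b, "X"))  # prints "0 1": B still sees the stale 1
```

Even though memory itself is up to date, B keeps returning its cached 1 because nothing tells B's cache that its copy is no longer valid.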
In order to maintain a coherent view of memory through the different caches, microprocessors commonly employ a cache coherency protocol, in which a cache line status value is maintained for each cache line. An example of a popular and well-documented cache coherency status protocol is the MESI protocol. MESI stands for Modified, Exclusive, Shared, Invalid, which are the four possible states or status values of a cache line.
Both store buffers and response buffers keep a cache line status for the cache lines they hold, just as caches keep a cache line status for each of the cache lines they hold. While a cache line resides in a cache, store buffer, and/or response buffer, events may occur that require the status of the cache line to be updated. One example of a cache line status-altering event is an eviction of a cache line from a cache. For example, a cache line may be evicted to make room in the cache for a new cache line because it is the least-recently-used cache line. The status of an evicted cache line is updated to Invalid.
Another example of an event requiring update of cache line status is a snoop operation. Many systems are designed to perform snoop operations as part of the cache coherency protocol. Each cache monitors, or snoops, every transaction on the microprocessor bus to determine whether or not the cache has a copy of the cache line implicated by the bus transaction initiated by another microprocessor or by another cache within the microprocessor. The cache performs different actions depending upon the type of transaction snooped and the status of the cache line implicated. The snoop operation may be an invalidate snoop, in which case the MESI state must be updated to Invalid. Or, the snoop may be a shared snoop, in which case the MESI state may be updated to Shared.
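The status-altering events described above can be written as a small MESI transition function. This is a simplified sketch covering only the three events named here (eviction, invalidate snoop, shared snoop); the `Event` enum and `next_status` are hypothetical names, and real protocols also move data, for instance writing back a Modified line when it is snooped, which is not modeled.

```python
from enum import Enum


class MESI(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class Event(Enum):
    EVICTION = "eviction"
    INVALIDATE_SNOOP = "invalidate snoop"
    SHARED_SNOOP = "shared snoop"


def next_status(current: MESI, event: Event) -> MESI:
    """Status of a cache line after a status-altering event (simplified sketch)."""
    if event in (Event.EVICTION, Event.INVALIDATE_SNOOP):
        return MESI.INVALID
    if event is Event.SHARED_SNOOP:
        # An Invalid line has nothing to share; any valid copy is downgraded to Shared.
        # (A Modified line would also need to supply or write back its data; not modeled.)
        return MESI.INVALID if current is MESI.INVALID else MESI.SHARED
    raise ValueError(f"unhandled event: {event}")
```

For example, `next_status(MESI.EXCLUSIVE, Event.SHARED_SNOOP)` yields `MESI.SHARED`, and an eviction yields `MESI.INVALID` from any state.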
As discussed above, under some conditions, such as when a store misses in a write-allocate cache, a response buffer will be allocated to receive the implicated cache line, either from system memory or from a lower-level cache. If the store caused a response buffer to be allocated, then the cache line status must be kept coherent between the store buffer and the response buffer as status-altering events occur. Prior processors have included logic to update both the store buffer cache line status and the response buffer cache line status when one of these events occurs.
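The prior approach of updating both copies of the status can be sketched as follows. This is an illustrative software model, not a hardware design: `BufferEntry`, `apply_event`, and `update_status_everywhere` are hypothetical names, statuses are MESI letters, and the event names are assumptions. The two separate loops mirror the duplicated address-compare and update logic that each buffer array needs, which grows with the number of buffers.

```python
from dataclasses import dataclass


@dataclass
class BufferEntry:
    """Hypothetical store-buffer or response-buffer entry."""
    line_address: int
    status: str  # MESI letter: "M", "E", "S" or "I"


def apply_event(status: str, event: str) -> str:
    """Simplified MESI update for the events discussed above."""
    if event in ("eviction", "invalidate_snoop"):
        return "I"
    if event == "shared_snoop":
        return "I" if status == "I" else "S"
    raise ValueError(f"unhandled event: {event}")


def update_status_everywhere(store_buffers, response_buffers, line_address, event):
    """Prior approach: every buffer holding the line updates its own copy of the status."""
    for entry in store_buffers:       # compare-and-update logic for the store buffers...
        if entry.line_address == line_address:
            entry.status = apply_event(entry.status, event)
    for entry in response_buffers:    # ...duplicated for the response buffers
        if entry.line_address == line_address:
            entry.status = apply_event(entry.status, event)
```

An invalidate snoop to a line held by both a store buffer and a response buffer must update both entries, and every new buffer adds another comparator and update path.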
The main disadvantage of the prior method is that the logic to update the MESI state in both the store buffer and response buffer is complex. This is particularly true as the number of buffers increases. The complexity is disadvantageous in at least three aspects.
First, the complex design is error-prone. That is, it is easy to design bugs into the update logic and difficult to test all the possible combinations of conditions and events that might occur in order to find those bugs. Second, the complexity implies larger control circuitry, which consumes additional chip area and may in turn reduce chip yields. Third, the complexity may lengthen critical timing paths that limit the clock speed at which the microprocessor can run. Therefore, what is needed is a means of reducing the complexity of maintaining cache line status coherency associated with store operations.