1. Field of the Invention
The present invention relates to computer systems and, more specifically, to cache coherent computer systems.
2. Background Information
A computer system typically comprises one or more processors linked to a main memory by a bus or other interconnect. In most computer systems, main memory organizes the instructions and data being stored into units typically referred to as “blocks” each of which is separately addressable and may be of a fixed size. Instructions and data are typically moved about the computer system in terms of one or more blocks.
Ordinarily, a processor will retrieve data, e.g., one or more blocks, from main memory, perform some operation on it, and eventually return the results to main memory. Retrieving data from main memory and providing it to a processor can take significant time, especially relative to the high operating speeds of today's processors. To reduce such latencies as well as to reduce the number of times a processor must access main memory, modern processors and/or processor chipsets include one or more cache memories or caches. A cache is a small, fast memory module located in close proximity to the processor. Many caches are static random access memories (SRAMs), which are faster, but more expensive, than dynamic random access memories (DRAMs), which are often used for main memory. The cache is used to store information, e.g., data or instructions, which the processor is currently using or is likely to use in the near future.
Most caches are organized as a series of lines, and each cache line is typically sized to hold one memory block. The particular cache line at which a received memory block is to be placed is determined by the manner in which the cache is organized. There are basically three different categories of cache organization. If a received memory block can be stored at any line of the cache, the cache is said to be “fully associative”. If each memory block can only be placed in a single, pre-defined cache line, the cache is said to be “direct mapped”. If a received memory block can only be placed within a restricted set of cache lines, the cache is said to be “set associative”.
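The three categories of cache organization described above can be illustrated with a short sketch. The function name `candidate_lines`, its parameters, and the use of modulo arithmetic to select a set are assumptions made for illustration only; actual caches may derive the index from the address in other ways.

```python
def candidate_lines(block_addr, num_lines, ways):
    """Return the cache line indices at which a memory block may be placed.

    ways == 1          -> direct mapped (a single, pre-defined line)
    ways == num_lines  -> fully associative (any line of the cache)
    otherwise          -> set associative (a restricted set of lines)
    """
    if ways < 1 or num_lines % ways != 0:
        raise ValueError("num_lines must be a positive multiple of ways")
    num_sets = num_lines // ways
    # Assumption: the set is chosen by the block address modulo the set count.
    set_index = block_addr % num_sets
    first = set_index * ways
    return list(range(first, first + ways))


direct = candidate_lines(13, num_lines=8, ways=1)
two_way = candidate_lines(13, num_lines=8, ways=2)
fully = candidate_lines(13, num_lines=8, ways=8)
```

Under these assumptions, the direct mapped case yields exactly one candidate line, the two-way case yields two, and the fully associative case yields every line of the cache.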
For each cache line, a tag is provided that contains the memory address of the block stored at that cache line. The tag also stores the state of the cache line typically through one or more flags or state bits. In particular, a valid bit indicates whether the entry contains a valid address, while a dirty bit indicates whether the block is dirty, i.e., modified while in the cache, or clean, i.e., not modified.
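As a minimal sketch of the tag and state bits just described, a cache line might be modeled as follows. The class `CacheLine` and the helpers `is_hit`, `fill`, and `write` are hypothetical names introduced here for illustration; they do not correspond to any particular hardware interface.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    tag: int = 0          # memory address of the block stored at this line
    valid: bool = False   # whether the entry contains a valid address
    dirty: bool = False   # whether the block was modified while in the cache
    data: bytes = b""


def is_hit(line, block_addr):
    """A lookup hits only if the entry is valid and the tag matches."""
    return line.valid and line.tag == block_addr


def fill(line, block_addr, data):
    """Place a block received from memory into the line; it starts out clean."""
    line.tag = block_addr
    line.data = data
    line.valid = True
    line.dirty = False


def write(line, data):
    """Modify the cached block, which marks the line dirty."""
    line.data = data
    line.dirty = True


line = CacheLine()
miss_before_fill = is_hit(line, 0x40)
fill(line, 0x40, b"abc")
hit_after_fill = is_hit(line, 0x40)
clean_after_fill = not line.dirty
write(line, b"xyz")
```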
In addition, there are two basic types of caches: “write-through” and “write-back”. With a write-through cache, whenever a processor modifies or updates a piece of data in the processor's cache, main memory's copy of that data is automatically updated. This is accomplished by having the processor write the data back to memory whenever the data is modified or updated. A write-back cache, in contrast, does not automatically send modified or updated data to main memory. Instead, the updated data remains in the cache until some more convenient time, e.g., when the processor is idle, at which point the modified data is written back to memory. The utilization of write-back caches typically improves system performance. In some systems, a write-back or victim buffer is provided in addition to the cache. “Victim data” refers to modified data that is being removed from the processor's cache in order to make room for new data received at the processor. Typically, the data selected for removal from the cache is data the processor is no longer using. The victim buffer stores this modified data which is waiting to be written back to main memory. Modified data in the victim buffer is eventually “victimized”, i.e., written back to main memory, typically at some convenient time.
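The difference between the two cache types can be sketched in a few lines. The classes below are a simplified illustration under stated assumptions: main memory is modeled as a dictionary, and the hypothetical `flush` method stands in for whatever "convenient time" a real system would choose to write modified data back.

```python
class WriteThroughCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value      # main memory is updated on every write


class WriteBackCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}
        self.dirty = set()

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)           # main memory is left stale for now

    def flush(self):
        """Write all modified data back to main memory."""
        for addr in sorted(self.dirty):
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()


wt_memory = {0: 1}
WriteThroughCache(wt_memory).write(0, 2)

wb_memory = {0: 1}
wb = WriteBackCache(wb_memory)
wb.write(0, 2)
stale_value = wb_memory[0]             # memory still holds the old value
wb.flush()
```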
Symmetrical Multiprocessor (SMP) Systems
Multiprocessor computing systems, such as symmetrical multiprocessor (SMP) systems, provide a computer environment in which software applications may run on a plurality of processors using a single address space or shared memory abstraction. In a shared memory system, each processor can access any data item without a programmer having to worry about where the data is or how to obtain its value. This frees the programmer to focus on program development rather than on managing partitioned data sets and communicating values.
Cache Coherency
Because more than one processor of the SMP system may request a copy of the same memory block from main memory, cache coherency protocols have been developed to ensure that no processor relies on a memory block that has become stale, typically due to a modification or update performed to the block by some other processor. Many cache coherency protocols associate a state with each cache line. A given memory block, for example, may be in a shared state in which copies of the block may be present in the caches associated with multiple processors. When a memory block is in the shared state, a processor may read from, but not write to, the respective block. To support write operations, a memory block may be in an exclusive state. In this case, the block is owned by a single processor which may write to the cache line. When the processor updates or modifies the block, its copy becomes the most up-to-date version, while corresponding copies of the block at main memory and/or other processor caches become stale.
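The shared and exclusive states described above can be sketched as a small permission model. The `INVALID` state and the `acquire_exclusive` helper, which invalidates all other copies, are assumptions reflecting one common invalidate-based approach; the passage above does not prescribe how stale copies are handled.

```python
from enum import Enum


class State(Enum):
    INVALID = "invalid"      # assumption: no usable copy in this cache
    SHARED = "shared"        # copies may exist in multiple caches; read only
    EXCLUSIVE = "exclusive"  # owned by a single processor; may be written


def can_read(state):
    return state in (State.SHARED, State.EXCLUSIVE)


def can_write(state):
    return state is State.EXCLUSIVE


def acquire_exclusive(states, writer):
    """Give `writer` ownership of the block and invalidate every other copy."""
    for cpu in states:
        states[cpu] = State.EXCLUSIVE if cpu == writer else State.INVALID


states = {"P0": State.SHARED, "P1": State.SHARED}
shared_read_ok = can_read(states["P0"])
shared_write_ok = can_write(states["P0"])
acquire_exclusive(states, "P0")
```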
There are two classes of cache coherency protocols: snooping and directory based. With snooping, the caches monitor or snoop all transactions traversing the shared memory bus, looking for transactions that reference a memory block stored at the cache. If such a transaction is detected, the cache updates the status information for its copy of the memory block based on the snooped transaction. With a directory based protocol, the state of each block is kept in a single, centralized location in the system, called a directory. The directory filters each request so that only those caches that are interested in the specified memory block, i.e., those caches having a copy of the block, need respond. A directory also maintains state for every coherent memory block in the system even though in most cases the actual number of blocks that are cached is quite small compared to the total size of memory.
In some computer systems, a duplicate copy of the cache tag information that is being maintained at each processor is utilized in place of the directory. The Duplicate Tag store (DTAG) has a section for each processor. The coherence information that must be maintained by the DTAG is bounded by the total cache size of all processors. The overhead required by a DTAG can thus be smaller than that required by a directory which, as mentioned above, maintains coherence for every memory block in the system. All sections of the DTAG are accessed for each memory reference operation issued in the computer system. In other words, the DTAG section for each processor is searched to determine whether one or more processors have a copy of the memory block specified in the memory reference operation. The results from these accesses to the DTAG are used to determine the appropriate response to the memory reference operation, including a next state of the DTAG. The responses are then disseminated to the appropriate system components.
For example, if the DTAG reveals that the block targeted by the memory reference operation is held by a processor in the dirty state, the memory reference operation is forwarded to the identified processor which, in turn, satisfies the operation by sending a copy of the specified block from its cache to the component that issued the memory reference operation. If no processor has a copy of the specified block in the dirty state, then the version of the block at main memory is considered up-to-date, and memory satisfies the memory reference operation by sending a copy of the block directly from memory.
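The decision just described can be sketched as a search over every DTAG section. The data layout (a dictionary of per-processor sections mapping block addresses to state strings) and the function name `resolve_supplier` are hypothetical simplifications chosen for illustration.

```python
def resolve_supplier(dtag, block):
    """Return the processor holding `block` dirty, or "memory" if none does."""
    for cpu, section in dtag.items():      # every section is searched
        if section.get(block) == "dirty":
            return cpu                     # forward the operation to this processor
    return "memory"                        # memory's copy is considered up-to-date


dtag = {
    "P0": {0x100: "dirty", 0x200: "clean"},
    "P1": {0x200: "clean"},
}
dirty_case = resolve_supplier(dtag, 0x100)
clean_case = resolve_supplier(dtag, 0x200)
uncached_case = resolve_supplier(dtag, 0x300)
```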
When a processor is finished with a memory block that is in the dirty state, the processor writes the modified block from its cache back to main memory. To write back data, a processor typically performs an atomic read-modify-write operation. More specifically, the processor first reads the contents of the DTAG to confirm that the respective DTAG entry also reflects that the processor has a dirty copy of the memory block. If so, the processor writes the modified data back to memory and invalidates the DTAG entry.
After issuing the write-back, the processor will typically want to re-use the cache line to store a different memory block. In this case, the processor will issue a memory reference operation specifying the new block. The computer system, however, must prevent the memory reference operation from reaching (and modifying the state of) the DTAG ahead of the write-back. If the memory reference operation is processed at the DTAG first, the DTAG entry for the memory block being written back will be replaced with the tag and state information corresponding to the new block. Should another processor request a copy of the memory block being written back, a search of the DTAG would reveal no processor having a dirty copy of the block. Main memory would erroneously conclude that its copy of the memory block is current and send a copy to the processor issuing the request when, in fact, the write-back containing the most up-to-date copy is in flight.
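The ordering hazard described above can be reproduced with a toy model. Everything in this sketch is an illustrative assumption: each DTAG section is indexed by cache line and holds a `(block, state)` pair, the new block is recorded as "clean", and the names `write_back`, `request_block`, `read_by_other`, and `scenario` are hypothetical.

```python
def supplier(dtag, block):
    """Return the processor whose DTAG section shows `block` dirty, else "memory"."""
    for cpu, section in dtag.items():
        if (block, "dirty") in section.values():
            return cpu
    return "memory"


def write_back(dtag, memory, cpu, line, block, data):
    """Read-modify-write: confirm the dirty entry, update memory, invalidate."""
    if dtag[cpu].get(line) != (block, "dirty"):
        return False                       # DTAG no longer shows the dirty block
    memory[block] = data
    del dtag[cpu][line]
    return True


def request_block(dtag, cpu, line, new_block):
    """A memory reference that reuses `line`: its DTAG entry is replaced."""
    dtag[cpu][line] = (new_block, "clean")


def read_by_other(dtag, memory, dirty_data, block):
    """Another processor's read, satisfied by the dirty owner or by memory."""
    owner = supplier(dtag, block)
    return memory[block] if owner == "memory" else dirty_data[(owner, block)]


def scenario(request_first):
    memory = {0xA: "old A", 0xB: "B"}
    dirty_data = {("P0", 0xA): "new A"}
    dtag = {"P0": {0: (0xA, "dirty")}, "P1": {}}
    if request_first:
        # The request for block 0xB reaches the DTAG while the write-back
        # of block 0xA is still in flight.
        request_block(dtag, "P0", 0, 0xB)
        return read_by_other(dtag, memory, dirty_data, 0xA)
    write_back(dtag, memory, "P0", 0, 0xA, "new A")
    request_block(dtag, "P0", 0, 0xB)
    return read_by_other(dtag, memory, dirty_data, 0xA)


in_order = scenario(request_first=False)
out_of_order = scenario(request_first=True)
```

In this model, processing the write-back first lets the other processor observe the updated block, while processing the new request first causes memory to supply its stale copy.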
Several approaches have been developed to avoid this problem. First, system designers have imposed a requirement that a processor, upon writing a memory block back to main memory, wait to receive an acknowledgement from memory that the write-back completed before issuing a new request that would reuse the cache line victimized by the write-back. By delaying the subsequent memory reference operation until the write-back is acknowledged, the DTAG is kept up-to-date. This solution, however, delays the processor's acquisition of the new memory block while it waits for the acknowledgement. Delays such as these can reduce the computer system's performance. To minimize the performance penalty, some systems employ associative caches and victimize non-dirty memory blocks first to make room for new blocks. Associative caches, however, are more expensive and typically smaller than non-associative, direct mapped caches. Furthermore, a policy that victimizes non-dirty blocks more often than dirty blocks reduces the effectiveness of the cache.
Another solution is to design the processors to combine the memory reference operation for the new memory block and the write-back into a single operation or command. By combining the two operations into a single command, the system ensures that the request for the new block is never received ahead of the write-back. This solution, however, imposes requirements and complexities on the processor. Not all processors, moreover, support such pairing of memory reference operations with write-backs. Yet another approach is to impose ordering on the communication channel(s) between the processors and the main memory. Ordering constraints, however, increase the complexity of the computer system and, in some cases, may not be feasible. Accordingly, a need exists for an efficient mechanism to issue write-backs in a computer system.