1. Field of the Invention
This invention generally relates to multiprocessor computer systems that employ cache subsystems, and more specifically, to maintaining cache coherence for pending transactions.
2. Description of the Relevant Art
A processor is a device that is configured to perform an operation upon one or more operands to produce a result. The operation is performed in response to an instruction executed by the processor. A computer system includes a processor and other components such as system memory, buses, caches, and input/output (I/O) devices. Multiprocessing computer systems include two or more processors, which may be employed to perform computing tasks. A particular computing task may be performed upon one processor while other processors perform unrelated computing tasks. Alternatively, components of a particular computing task may be distributed among multiple processors to decrease the time required to perform the computing task as a whole.
A cache memory is a high-speed memory unit interposed in the memory hierarchy of a computer system between a slower system memory and a processor to improve effective memory transfer rates and accordingly improve system performance. The cache is usually implemented by semiconductor memory devices having speeds that are comparable to the speed of the processor, while the system memory utilizes a less costly, lower speed technology. The cache memory typically includes a plurality of memory locations that stores a block or a "line" of two or more words. The words may be of a variable number of bytes and may be parts of data or instruction code. Each line in the cache has associated with it an address tag that uniquely identifies the address of the line. The address tags are typically included within a tag array device. The address tag may include a part of the address bits and additional bits that are used for identifying the state of the line. Accordingly, a processor may read from or write directly into one or more lines in the cache if the lines are present in the cache and if they are valid. This event in which the processor deals directly with the cache is referred to as cache "hit". For example, when a read request originates in the processor for a new word, whether data or instruction, an address tag comparison is made to determine whether a valid copy of the requested word resides in a line of the cache memory. If present, the data is used directly from the cache, i.e. a cache read hit is completed. If not present, a line containing the requested word is retrieved from the system memory and stored in the cache memory. The requested line is simultaneously supplied to the processor. This event is referred to as a cache read miss.
Similarly, the processor may also write directly into the cache memory instead of the system memory. For example, when a write request is generated, an address tag comparison is made to determine whether the line into which data is to be written resides in the cache. If the line is present (and is valid), the data is written directly into the line. This event is referred to cache write hit. In many systems, a "dirty" bit for the line is then set. The dirty bit indicates the data stored within the line has been modified, and thus, before the line is deleted from the cache memory, overwritten, or replaced the modified data must be written into the system memory. If the line into which the data is to be written does not exist in the cache memory, the line is either fetched into the cache from the system memory to allow the data to be written into the cache, or the data is written directly into the system memory. This event is referred to as cache write miss.
In a multiprocessor system, copies of the same line of memory can be present in the caches of multiple processors. Cache coherence is the requirement that ensures that writes and reads by processors are observed by other processors in a well-defined order, in spite of the fact that each processor directly writes or reads its copy of the line in its cache. When a processor writes to a copy of the line in its cache, all other copies of the same line must either be invalidated or updated with the new value, so that subsequent reads by those other processors will observe the newly written value of the cache line. Similarly, when a processor incurs a read miss, it must be given the most recent value of the line, which could be a dirty line in another cache instead of in main memory.
Data coherence in a multiprocessor shared-memory system is typically maintained through employment of a snooping protocol or the use of a directory-based protocol. In a directory-based protocol, a directory of which processors have copies of each cache line is maintained. This directory is used to limit the number of processors that must snoop a given request for a cache line. The use of directories reduces the snoop traffic and thus allows larger systems to be built. However, use of directories increases the system's latency (which is caused by the directory lookup), hardware cost and complexity.
In a snooping protocol, each processor broadcasts all of its requests for cache lines, typically on snooping buses, to all other processors which then look up in their cache tags ("snoop") to determine what action must be taken. For example, when a processor presents a write cycle on the bus to write data into the system memory, a cache controller device of another processor determines whether a corresponding line of data is contained within the cache. If a corresponding line is not contained within the cache, the cache controller takes no additional action and the write cycle is allowed to complete. If a corresponding line of data is contained within the cache, the cache controller determines whether the line of data is modified or not, typically by looking up the entries in the tag array. If the corresponding line is not dirty (e.g., the line is in a "shared" state), the line is marked invalid and the write cycle is allowed to complete. In some systems, if the line is dirty, the processor with a dirty line must respond with a copy of the line and simultaneously invalidate its own copy (referred to as a "copyback-invalidate"). If the line is dirty, and the request is read, the processor with a dirty line must respond with a copy of the line and mark the line shared.
Accordingly, snooping reads the cache tags, and sometimes modifies the cache tags when the snooped transaction is completed. However, there are several reasons why it is often desirable to de-couple snooping from the completion of the transaction when the processors cache tags are modified.
Generally speaking, a read or write transaction may take a relatively long time to complete. Instead of completing each individual transaction before starting the next one, performance (transaction throughput) may be improved by allowing subsequent transactions to start on the broadcast snooping bus before the previous transactions have completed. Therefore, each processor may have multiple transactions (it's own read or write bus requests as well as broadcast requests from other processors) pending completion. To maintain coherence, it may be necessary to keep these pending transactions in order and therefore a Pending Transaction Queue (PTQ) may be used between the snooping bus and the processor cache. Transactions are placed in this PTQ after they have appeared on the bus and before they have been completed (completion means that they have had their desired effect on the cache).
On a large multiprocessor system, the traffic on the bus may be very heavy. Generally, only a small portion of such heavy traffic is intended for any given processor in the system, as determined by snooping. For example, requests from other processors for lines that are determined by snooping to be not present in the processor's cache are irrelevant to the processor and need not be queued in the PTQ. Accordingly, de-coupled snooping may be performed in a duplicate set of tags. In some systems, the duplicate set may be called Duplicate Tags or DTags.
The use of the DTags serves two purposes. First, since all broadcast requests need to access the snoop tags, but only the few transactions that are relevant to a processor need to access the processor's copy of the tags, the DTags help to reduce the bandwidth consumed by snooping on the processor's copy of its tags. Second, the DTags may be modified after the snoop to reflect the action of the snooped transaction, even though the transaction actually completes later when it leaves the PTQ. This allows subsequent snoops to see the modified state of the line.
Having a second tag array (DTags) for snooping is desirable if the processor cache tags are external to the processor chip and are made with the same technology as the cache data memory, since there are substantial savings in cache tag bandwidth. However, if the cache tags are on-chip in much faster memory than the off-chip cache data memory, then the cache tags may have spare bandwidth available for snooping. By performing all bus snooping to the second tag array de-coupled from the processor (first) tag array, the processor regains bandwidth that would have been lost if snooping was performed at the processor tag array. Unfortunately, the second tag array doubles the amount of memory needed for the tags. Furthermore, the second tag array increases the cost of the system since it has to be made from high-speed memory devices as the first tag array.