1. Technical Field
The invention relates to a multiprocessor computer architecture. More particularly, the invention relates to a multiprocessor computer architecture containing processor caches that are kept coherent.
2. Description of the Prior Art
The flow of information between processors in a computer architecture should avoid the use of stale data, i.e. data that is older than and/or inconsistent with related data stored at another cache location or in main memory. Thus, cache coherency should be maintained, while minimizing interference with processor operation. In computer system architectures, cache coherency can be handled by hardware and/or software. In some such architectures, coherency between the processors and memory is maintained by hardware, while software guarantees coherency between the cache, memory, and input/output devices.
Caches can be classified either as write-through or write-back. A write-through cache is always written to along with system memory, such that system memory and cache each maintain a current copy of the information that is written, and the possibility of stale data is avoided. Information input in a computer architecture that uses a write-through cache nevertheless requires significant system overhead. For example, the system must guarantee that none of the blocks of the I/O buffer that have been designated for input are in the cache.
Write-back caches keep modified data until the data are cast out and written back to memory to make room for new data. Because a write-back cache may have the only copy of modified data, special care must be taken in the cache coherency protocol between caches and I/O systems such that the cache data can be quickly accessed and never lost.
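The difference between the two cache classes can be illustrated with a minimal sketch. The class and method names (Memory, WriteThroughCache, WriteBackCache, cast_out) are hypothetical and chosen only for illustration; the sketch models one value per address and ignores block size, tags, and replacement policy.

```python
class Memory:
    """Illustrative main memory that counts the writes reaching it."""

    def __init__(self):
        self.data = {}   # address -> value
        self.writes = 0  # number of writes that reached memory

    def write(self, addr, value):
        self.data[addr] = value
        self.writes += 1

    def read(self, addr):
        return self.data.get(addr, 0)


class WriteThroughCache:
    """Every write updates the cache and memory together."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}  # address -> value

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory.write(addr, value)  # memory is never stale


class WriteBackCache:
    """Writes stay in the cache until the line is cast out."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}     # address -> value
        self.dirty = set()  # addresses whose only current copy is here

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)  # memory is now stale for this address

    def cast_out(self, addr):
        # Modified data must be written back before the line is dropped.
        if addr in self.dirty:
            self.memory.write(addr, self.lines[addr])
            self.dirty.discard(addr)
        self.lines.pop(addr, None)
```

In this sketch, two writes to the same address reach memory twice through the write-through cache, but only once, at cast-out time, through the write-back cache. Until the cast-out, the write-back cache holds the only current copy.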
One software solution to the stale data problem marks a buffer page as non-cacheable, and the operating system is configured to input information only to this non-cacheable page. Another software solution to this problem uses the operating system to flush the buffer addresses from the cache after an information input occurs, effectively clearing the caches. A hardware solution to the stale data problem checks the I/O addresses during information input to determine if they are in the cache. If so, the cache entries are invalidated to avoid stale data.
The protocols that are used to maintain coherency for multiple processors are referred to as cache coherency protocols. There are two classes of cache coherency protocols:
1. Directory based: The information about one block of physical memory is kept in just one location. This information usually includes which cache(s) has (have) a copy of the block and whether that copy is marked exclusive for future modification. An access to a particular block first queries the directory to see if the memory data is stale and the current data resides in some other cache. If so, the cache containing the modified block is forced to return its data to memory. The memory then forwards the data to the new requester, and the directory is updated with the new location of that block. This protocol minimizes inter-bus module (or inter-cache) disturbance, but typically suffers from high latency and is expensive to build due to the large directory size required.
2. Snooping: Every cache that has a copy of the data from a block of physical memory also has a copy of the information about the data block. Each cache is typically located on a shared memory bus, and all cache controllers monitor or snoop on the bus to determine whether or not they have a copy of the requested block.
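The directory-based flow of item 1 can be sketched as follows. This is a simplified illustration, not a definitive implementation: the Cache and Directory classes, their fields, and the choice to let the previous owner keep a clean copy after the write-back are all assumptions made for the example.

```python
class Cache:
    """Illustrative cache: block address -> value."""

    def __init__(self):
        self.lines = {}


class Directory:
    """Single location holding the coherency information for every block."""

    def __init__(self):
        self.memory = {}       # block address -> value in main memory
        self.holders = {}      # block address -> set of caches with a copy
        self.modified_by = {}  # block address -> cache with the only current copy

    def write(self, addr, value, writer):
        # Grant the writer exclusive ownership: drop all other copies.
        for cache in self.holders.get(addr, set()) - {writer}:
            cache.lines.pop(addr, None)
        writer.lines[addr] = value
        self.holders[addr] = {writer}
        self.modified_by[addr] = writer  # memory is stale from now on

    def read(self, addr, requester):
        owner = self.modified_by.get(addr)
        if owner is requester:
            return requester.lines[addr]  # requester already has current data
        if owner is not None:
            # Memory is stale: force the owner to return its data to memory.
            self.memory[addr] = owner.lines[addr]
            del self.modified_by[addr]
        # Memory forwards the data and the directory records the new holder.
        value = self.memory.get(addr, 0)
        requester.lines[addr] = value
        self.holders.setdefault(addr, set()).add(requester)
        return value
```

Only the directory and the one cache it names are disturbed on each access. The cost is an extra lookup and a table with an entry for every block of memory.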
Snooping protocols are well suited for multiprocessor system architectures that use caches and shared memory because they operate in the context of the preexisting physical connection usually provided between the bus and the memory. Snooping is preferred over directory protocols because the amount of coherency information is proportional to the number of blocks in a cache, rather than the number of blocks in main memory.
The coherency problem arises in a multiprocessor architecture when a processor must have exclusive access to write a block of memory or object, and/or must have the most recent copy when reading an object. A snooping protocol must locate all caches that share the object to be written. The consequences of a write to shared data are either to invalidate all other copies of the data, or to broadcast the write to all of the shared copies. Because of the use of write-back caches, coherency protocols must also cause checks on all caches during memory reads to determine which processor has the most up-to-date copy of the information.
Status bits are provided in a cache block to implement snooping protocols. This information is used when monitoring bus activities. On a read miss, all caches check to see if they have a copy of the requested block of information and take the appropriate action, such as supplying the information to the cache that missed. Similarly, on a write, all caches check to see if they have a copy of the data, and then act, for example by invalidating their copy of the data, or by changing their copy of the data to the most recent value.
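The use of per-block status bits during snooping can be sketched as follows. The three states (INVALID, SHARED, MODIFIED) and the SnoopingCache class are assumptions for the example; actual protocols may define more states, and the write response shown here is the invalidating variant only.

```python
from enum import Enum


class State(Enum):
    """Illustrative status bits kept with each cache block."""
    INVALID = 0
    SHARED = 1
    MODIFIED = 2


class SnoopingCache:
    def __init__(self):
        self.lines = {}  # block address -> (State, value)

    def snoop_read(self, addr):
        """Another cache missed on a read of addr.

        Returns the data if this cache holds the modified copy,
        otherwise None (memory supplies the data).
        """
        state, value = self.lines.get(addr, (State.INVALID, None))
        if state is State.MODIFIED:
            # Supply the data and keep a copy that is no longer exclusive.
            self.lines[addr] = (State.SHARED, value)
            return value
        return None

    def snoop_write(self, addr):
        """Another cache is writing addr.

        Invalidates the local copy and returns True if one existed.
        """
        state, _ = self.lines.get(addr, (State.INVALID, None))
        if state is State.INVALID:
            return False
        self.lines[addr] = (State.INVALID, None)
        return True
```

For example, a cache holding address 0x40 in the MODIFIED state answers a snooped read with its data and moves to SHARED. A later snooped write to the same address marks the block INVALID.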
Because every coherent bus transaction causes the caches to check their address tags, snooping interferes with the CPU's access to its cache regardless of the snoop result. For example, even when snooping returns a miss, the CPU is prevented from cache access because the cache is unavailable, i.e. the cache is busy checking tags to match against the snoop address. Thus, the CPU stalls or locks if it needs to access the cache while the cache is busy with a coherency check.
Snooping protocols are of two types:
Write invalidate: The writing processor causes all copies in other caches to be invalidated before changing its local copy. The processor is then free to update the data until such time as another processor asks for the data. The writing processor issues an invalidation signal over the bus, and all caches check to see if they have a copy of the data. If so, they must invalidate the block containing the data, and provide the data if the status indicates that the block has been modified. This scheme allows multiple readers but only a single writer.
Write broadcast: Rather than invalidate every block that is shared, the writing processor broadcasts the new data over the bus. All copies are then updated with the new value. This scheme continuously broadcasts writes to shared data, while the write invalidate scheme discussed above deletes all other copies so that there is only one local copy for subsequent writes. Write broadcast protocols usually allow data to be tagged as shared (broadcast), or the data may be tagged as private (local). For further information on coherency, see J. Hennessy, D. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, Inc. (1990).
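The trade-off between the two types can be illustrated by counting bus transactions for repeated writes to one shared address. This is a minimal sketch under stated assumptions: the Cache class and both functions are hypothetical, each write costs at most one bus transaction, and the scan of the other caches in write_broadcast stands in for the shared tag that real hardware would consult.

```python
class Cache:
    def __init__(self):
        self.data = {}          # address -> value
        self.exclusive = set()  # addresses writable without using the bus


def write_invalidate(writer, others, addr, value):
    """Write under write invalidate; returns bus transactions used (0 or 1)."""
    used = 0
    if addr not in writer.exclusive:
        # One invalidation on the bus removes every other copy.
        for cache in others:
            cache.data.pop(addr, None)
            cache.exclusive.discard(addr)
        writer.exclusive.add(addr)
        used = 1
    writer.data[addr] = value  # later writes stay local
    return used


def write_broadcast(writer, others, addr, value):
    """Write under write broadcast; returns bus transactions used (0 or 1)."""
    writer.data[addr] = value
    sharers = [cache for cache in others if addr in cache.data]
    for cache in sharers:
        cache.data[addr] = value  # every shared copy receives the new value
    return 1 if sharers else 0
```

With one sharer and three consecutive writes to the same address, the write invalidate path uses the bus once and leaves the sharer without a copy. The write broadcast path uses the bus three times and leaves the sharer holding the latest value.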
In a snoopy coherence multiprocessor system architecture, each coherent transaction on the system bus is forwarded to each processor's cache subsystem to perform a coherency check. This check usually disturbs the processor's pipeline because the cache cannot be accessed by the processor while the coherency check is taking place.
In a traditional, single-ported cache without duplicate cache tags, the processor pipeline is stalled on cache access instructions when the cache controller is busy processing cache coherency checks for other processors. For each snoop, the cache controller must first check the cache tags for the snoop address and, if there is a hit, then modify the cache state, and provide data if the status indicates that the block has been modified. Allocating cache bandwidth for an atomic (inseparable) tag read and write (for possible modification) locks the cache from the processor longer than needed if the snoop does not require a tag write. For example, 80% to 90% of the cache queries are misses, i.e. a tag write is not required.
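The cost of the atomic tag read and write can be estimated with a short calculation. The function below is a sketch only: its name, its parameters, and the one-cycle costs assumed for a tag read and a tag write are illustrative assumptions, not figures from any particular design.

```python
def cache_busy_cycles(snoops, hits, atomic, read_cycles=1, write_cycles=1):
    """Cycles during which snoops keep the cache unavailable to the processor.

    snoops: number of coherency checks received
    hits:   number of those checks that match a tag and need a tag write
    atomic: True if every snoop reserves both the tag read and the tag write
    """
    if atomic:
        # The write slot is reserved whether or not it is needed.
        return snoops * (read_cycles + write_cycles)
    # Otherwise the write slot is spent only on snoops that hit.
    return snoops * read_cycles + hits * write_cycles


# 1000 snoops with a 90% miss rate, i.e. 100 hits (assumed workload).
atomic_cost = cache_busy_cycles(1000, 100, atomic=True)
split_cost = cache_busy_cycles(1000, 100, atomic=False)
```

Under these assumptions the atomic scheme occupies the cache for 2000 cycles and the separated scheme for 1100 cycles, so nearly half of the time the cache is locked serves no purpose.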
It is possible to reduce contention between the processor pipeline and the bus snoops by implementing a dual-ported cache. However, this solution requires additional hardware and interconnect, and is therefore difficult and expensive to implement.
In multi-processor systems, duplicate tags (which are also referred to as "tag caches") may be used to minimize the number of coherence checks performed on a processor. By performing fewer coherence checks on the cache, the cache can be more fully used to execute instructions, and thereby improve system performance.
In prior art implementations, the duplicate tags are an exact copy of the tags of the actual cache. As caches continue to increase in size, either the portion of the cache integrated circuit surface area devoted to maintaining duplicate tags also increases, or the device pin-count (e.g. for off-chip duplicate tags) required to maintain the duplicate tags becomes costly.
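The filtering role of duplicate tags can be sketched as follows. The CacheWithDuplicateTags class and its counter are hypothetical. The sketch models the prior art arrangement in which the duplicate tags are an exact copy of the cache tags, and it represents each tag by the full block address for simplicity.

```python
class CacheWithDuplicateTags:
    def __init__(self):
        self.tags = set()      # tags of the actual cache
        self.dup_tags = set()  # exact copy, consulted by snoops only
        self.cache_interruptions = 0  # snoops that had to disturb the cache

    def fill(self, addr):
        # Both tag stores are updated together so the copy stays exact.
        self.tags.add(addr)
        self.dup_tags.add(addr)

    def evict(self, addr):
        self.tags.discard(addr)
        self.dup_tags.discard(addr)

    def snoop(self, addr):
        """Check a bus address; returns True only if the cache holds it."""
        if addr not in self.dup_tags:
            return False  # filtered: the processor's cache is not disturbed
        self.cache_interruptions += 1  # a hit must be forwarded to the cache
        return True
```

In this model a snoop that misses in the duplicate tags never reaches the actual cache, so only snoop hits take cache bandwidth away from the processor. The cost is a second tag store as large as the first.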
A system that maintains duplicate cache tags in a simple and inexpensive way, and that minimizes the integrated circuit surface area used by, or the device pin-out associated with, such duplicate cache tags, would be a significant advance in uniprocessor and multiprocessor architecture design.