1. Field of the Invention
This invention relates to the field of multiprocessor computer systems and, more particularly, to coherency protocols employed within multiprocessor computer systems having shared memory architectures.
2. Description of the Related Art
Multiprocessing computer systems include two or more processors that may be employed to perform computing tasks. A particular computing task may be performed upon one processor while other processors perform unrelated computing tasks. Alternatively, components of a particular computing task may be distributed among multiple processors to decrease the time required to perform the computing task as a whole.
A popular architecture in commercial multiprocessing computer systems is a shared memory architecture in which multiple processors share a common memory. In shared memory multiprocessing systems, a cache hierarchy is typically implemented between the processors and the shared memory. In order to maintain the shared memory model, in which a particular address stores exactly one data value at any given time, shared memory multiprocessing systems employ cache coherency. Generally speaking, an operation is coherent if the effects of the operation upon data stored at a particular memory address are reflected in each copy of the data within the cache hierarchy. For example, when data stored at a particular memory address is updated, the update may be supplied to the caches that are storing copies of the previous data. Alternatively, the copies of the previous data may be invalidated in the caches such that a subsequent access to the particular memory address causes the updated copy to be transferred from main memory.
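By way of illustration only, the two alternatives described above (supplying the update to the caches holding copies, or invalidating those copies) may be sketched as follows. The sketch is not drawn from any particular system. The names `Cache`, `write_update`, and `write_invalidate` are hypothetical, and main memory is assumed to be updated on every write so that a later miss can refetch the current value from it.

```python
# Illustrative sketch only: all names here are hypothetical.
# Assumption: every write also updates main memory (write-through).

class Cache:
    def __init__(self):
        self.lines = {}  # address -> cached value; a missing key means "invalid"

    def read(self, memory, address):
        # A miss transfers the current value from main memory.
        if address not in self.lines:
            self.lines[address] = memory[address]
        return self.lines[address]


def write_update(memory, caches, address, value):
    """Update-based coherence: push the new value to every cache holding a copy."""
    memory[address] = value
    for cache in caches:
        if address in cache.lines:
            cache.lines[address] = value


def write_invalidate(memory, caches, writer, address, value):
    """Invalidate-based coherence: drop the other copies so later reads miss."""
    memory[address] = value
    writer.lines[address] = value
    for cache in caches:
        if cache is not writer:
            cache.lines.pop(address, None)
```

In either case a reader never observes the previous data after the write completes; the two approaches differ only in whether the new value is pushed to the other caches or fetched again on demand.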
Shared memory multiprocessing systems generally employ either a snooping cache coherency protocol or a directory-based cache coherency protocol. In a system employing a snooping protocol, coherence requests are broadcast to all processors (or cache subsystems) and memory through a totally ordered address network. Each processor “snoops” the requests from other processors and responds accordingly by updating its cache tags and/or providing the data to another processor. For example, when a subsystem having a shared copy of data observes a coherence request for exclusive access to the block, its copy is typically invalidated. Likewise, when a subsystem that currently owns a block of data observes a coherence request to that block, the owning subsystem typically responds by providing the data to the requester and invalidating its copy, if necessary. By delivering coherence requests in a total order, correct coherence protocol behavior is maintained since all processors and memories observe requests in the same order.
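The snooping behavior described above may be sketched, purely for illustration, as follows. The three-state model (`INVALID`, `SHARED`, `OWNED_EXCLUSIVE`), the class `SnoopingCache`, and the function `broadcast` are hypothetical simplifications and do not represent any specific protocol. The total order of the address network is modeled as a list that every cache processes in the same sequence.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    OWNED_EXCLUSIVE = "M"


class SnoopingCache:
    """Hypothetical cache subsystem that snoops every broadcast request."""

    def __init__(self, name):
        self.name = name
        self.state = {}  # block -> State; a missing key means INVALID

    def snoop(self, requester, kind, block):
        """React to one broadcast request; return True if this cache supplies data."""
        current = self.state.get(block, State.INVALID)
        if requester is self:
            # The requester gains access rights when it observes its own request.
            self.state[block] = (
                State.OWNED_EXCLUSIVE if kind == "exclusive" else State.SHARED
            )
            return False
        supplies = current is State.OWNED_EXCLUSIVE  # the owner provides the data
        if kind == "exclusive" and current is not State.INVALID:
            self.state[block] = State.INVALID  # shared or owned copy is invalidated
        elif kind == "shared" and supplies:
            self.state[block] = State.SHARED  # the owner is downgraded
        return supplies


def broadcast(caches, ordered_requests):
    """Deliver each request to every cache in the same total order."""
    log = []
    for requester, kind, block in ordered_requests:
        suppliers = [c.name for c in caches if c.snoop(requester, kind, block)]
        # Assumption: memory supplies the data when no cache owns the block.
        log.append((requester.name, kind, block, suppliers or ["memory"]))
    return log
```

Because every cache processes the same sequence of requests, each reaches the same conclusion about which subsystem owns a given block without exchanging any acknowledgments.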
In a standard snooping protocol, requests arrive at all devices in the same order, and the access rights of the processors are modified in the order in which requests are received. Data transfers occur between caches and memories using a data network, which may be a point-to-point switched network separate from the address network, a broadcast network separate from the address network, or a logical broadcast network which shares the same hardware with the address network. Typically, changes in ownership of a given cache block occur concurrently with changes in access rights to the block.
Unfortunately, the standard snooping protocol suffers from a significant performance drawback. In particular, the requirement that access rights of processors change in the order in which snoops are received may limit performance. For example, a processor may have issued requests for cache blocks A and B, in that order, and it may receive the data for cache block B (or already have it) before receiving the data for cache block A. In this case the processor must typically wait until it receives the data for cache block A before using the data for cache block B, thus increasing latency. The impact associated with this requirement is particularly high in processors that support out-of-order execution, prefetching, multiple cores per processor, and/or multi-threading, since such processors are likely to be able to use data in the order it is received, even if it differs from the order in which it was requested.
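The latency penalty in the example of cache blocks A and B may be illustrated with a short sketch. The function name `usable_times_in_order` and the arrival times used with it are hypothetical and chosen only for illustration. The sketch assumes that a block becomes usable no earlier than every block requested before it.

```python
def usable_times_in_order(request_order, data_arrival):
    """Return the time at which each block becomes usable when access rights
    must change strictly in the order in which requests were issued.

    request_order: block names in the order requested.
    data_arrival:  block name -> time at which its data arrives (arbitrary units).
    """
    usable = {}
    latest = 0
    for block in request_order:
        # A block cannot be used before all earlier requests have completed.
        latest = max(latest, data_arrival[block])
        usable[block] = latest
    return usable
```

For instance, with hypothetical arrival times of 100 units for block A and 20 units for block B, `usable_times_in_order(["A", "B"], {"A": 100, "B": 20})` yields 100 for both blocks, so block B sits idle for 80 units even though its data is already present.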
The other standard approach to cache coherency uses a directory-based protocol. In systems that implement a directory-based protocol, both the address network and the data network are typically point-to-point, switched networks. When a processor requests a cache block, the request is sent to a directory which maintains information regarding the processors that have copies of the cache block and their access rights. The directory then forwards the request to those processors which must change their access rights and/or provide data for the request (or if needed, the directory will access the copy of the cache block in memory and provide the data to the requester). Since there is no way of knowing when the request arrives at each processor to which it is sent, all processors that receive the request must typically acknowledge reception by providing data or sending an acknowledge (ACK) message to either the requester or the directory, depending on the protocol.
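The directory bookkeeping described above may be sketched, for illustration only, as follows. The class `Directory`, its methods, and the helper `acks_outstanding` are hypothetical; message transport is omitted, and the sketch assumes that a requester is not already the owner of the block it requests.

```python
class Directory:
    """Hypothetical directory tracking which processors hold each cache block."""

    def __init__(self):
        self.sharers = {}  # block -> set of processor ids holding a shared copy
        self.owner = {}    # block -> processor id holding exclusive access

    def request_shared(self, requester, block):
        """Record a shared copy; return the data source ('memory' or the owner)."""
        source = self.owner.pop(block, "memory")
        sharers = self.sharers.setdefault(block, set())
        if source != "memory":
            sharers.add(source)  # the previous owner is downgraded to a sharer
        sharers.add(requester)
        return source

    def request_exclusive(self, requester, block):
        """Grant exclusive access; return the processors the request is
        forwarded to, each of which must supply data or send an ACK."""
        targets = set(self.sharers.get(block, set()))
        if block in self.owner:
            targets.add(self.owner[block])
        targets.discard(requester)
        self.sharers[block] = set()
        self.owner[block] = requester
        return targets


def acks_outstanding(targets, received):
    """Processors that have not yet acknowledged the forwarded request."""
    return set(targets) - set(received)
```

The set returned by `request_exclusive` grows with the number of sharers, and the request cannot be treated as complete until `acks_outstanding` is empty.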
Typical systems that implement a directory-based protocol may be associated with various drawbacks. For example, such systems may suffer from high latency due to the requirement that requests go first to a directory and then to the relevant processors, and/or from the need to wait for acknowledgment messages. In addition, when a large number of processors must receive the request (such as when a cache block transitions from a widely shared state to an exclusive state), all of the processors must typically send ACKs to the same destination, thus causing congestion in the network near the destination of the ACKs and requiring complex logic to handle reception of the ACKs. Finally, the directory itself may add cost and complexity to the system.