As processor performance continues to increase, the latency of a processor's access to a remote memory location continues to grow relative to the latency of its access to local memory.
Processor caches reduce the average memory access time by keeping copies of the data at frequently used memory addresses in processor-resident storage locations. The locality of memory references over time (temporal locality) and space (spatial locality) allows the cache to satisfy most memory accesses, so that the longer latencies associated with remote memory are effectively reduced.
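The effect of the hit rate on average latency can be illustrated with a minimal sketch. It assumes a single cache level in which a miss pays the cache lookup plus the full remote access; the function name and the latency figures are hypothetical and chosen only for illustration.

```python
def average_access_time(hit_rate, cache_latency_ns, remote_latency_ns):
    """Average access time for one cache level (illustrative model only).

    Assumes every access pays the cache lookup and each miss additionally
    pays the remote-memory latency.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_rate = 1.0 - hit_rate
    return cache_latency_ns + miss_rate * remote_latency_ns


# Hypothetical figures: 2 ns cache, 200 ns remote memory.
for rate in (0.0, 0.90, 0.95, 0.99):
    print(f"hit rate {rate:.2f}: {average_access_time(rate, 2.0, 200.0):.1f} ns")
```

Under these assumed figures, a 95% hit rate brings the average from roughly 200 ns down to roughly 12 ns, which is the sense in which the remote latency is "effectively reduced".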
Within a multiprocessor system, each processor may have its own cache. Multiple copies of the same data may be cached concurrently within several or all of these processors. To maintain a consistent view of addressable memory, all such copies must be equal. Maintaining equal copies of cached data values within multiprocessor caches is the responsibility of cache coherence protocols.
For simplicity, cache coherence protocols track the state and presence of equal-sized collections of data bytes, called lines, rather than individual data bytes. A larger line amortizes the overhead of the header (containing command and address information) that accompanies every data packet transferred on the system interconnect, as well as any other tracking information associated with each line. Each line typically holds 64 data bytes, although smaller (32-byte) and larger (128-byte or 256-byte) lines are possible.
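The relationship between line size and header overhead can be sketched as follows. The 16-byte header is an assumed value used only for illustration (the text does not give a header size), and both function names are hypothetical.

```python
def line_base(byte_address, line_size=64):
    """Return the address of the first byte of the line containing byte_address."""
    if line_size <= 0 or line_size & (line_size - 1):
        raise ValueError("line_size must be a power of two")
    return byte_address & ~(line_size - 1)


def header_overhead(line_size, header_bytes=16):
    """Fraction of each data packet consumed by the header.

    header_bytes=16 is an assumption, not a figure from any specification.
    """
    return header_bytes / (header_bytes + line_size)


for size in (32, 64, 128, 256):
    print(f"{size:3d}-byte line: {header_overhead(size):.1%} of packet is header")
```

With the assumed header, all byte addresses from `0x1200` through `0x123F` map to the same 64-byte line, and the header's share of each packet shrinks as the line grows.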
The most common way to maintain cache coherence among the processors of a multiprocessor system relies on broadcasting writes (or intents to write), wherein a write to a line address by one processor (called the owner) is broadcast to the others for the purpose of invalidating other cached copies of the same line address. On large multiple-bus or mesh-connected multiprocessor systems, such broadcasts are inefficient in that their propagation consumes interconnect bandwidth to each potential processor cache, regardless of the number of copies actually resident in processor caches.
The intent of these broadcasts is to distribute invalidate messages to other caches, which then have the opportunity to check their contents for a matching-address line, a process called snooping. Other caches with shared clean copies are responsible for invalidating their copies, so that no stale copy remains after the owner's write is performed. Another cache with a dirty copy is responsible for providing that data to the new owner, so that the write affects the most recently modified data, rather than a stale copy obtained from memory.
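The snooping behavior described above can be sketched as a small simulation. This is a simplified illustration, not a specific protocol: the two-state model (shared clean, modified dirty), the class and function names, and the representation of a line as a short list of byte values are all assumptions.

```python
from enum import Enum


class State(Enum):
    SHARED = "shared"      # clean copy, identical to memory
    MODIFIED = "modified"  # dirty copy, newer than memory


class Cache:
    def __init__(self, name):
        self.name = name
        self.lines = {}  # line address -> (State, list of data bytes)

    def snoop_invalidate(self, addr):
        """Drop any local copy of addr; return its data only if it was dirty."""
        entry = self.lines.pop(addr, None)
        if entry is None:
            return None
        state, data = entry
        return data if state is State.MODIFIED else None


def broadcast_write(writer, caches, memory, addr, offset, value):
    """Write one byte after broadcasting an invalidate to every other cache."""
    own = writer.lines.get(addr)
    latest = list(own[1]) if own else list(memory[addr])
    probes = 0
    for cache in caches:
        if cache is writer:
            continue
        probes += 1  # sent whether or not this cache holds a copy
        dirty = cache.snoop_invalidate(addr)
        if dirty is not None:
            latest = list(dirty)  # prefer the dirty copy over stale memory
    latest[offset] = value
    writer.lines[addr] = (State.MODIFIED, latest)
    return probes
```

The returned probe count equals the number of other caches, independent of how many of them actually hold the line, which is the bandwidth cost that the broadcast approach incurs.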
Snooping-based cache protocols are inefficient on multiple-bus or multiple-link interconnects, since broadcasting of snoop information reduces the performance of the interconnect to that of a single shared bus. A preferred alternative is to restrict the distribution of snoop information to only those caches that have a copy of the line address. This requires retaining information that identifies which caches have (or are likely to have) shared copies, on an addressable-line basis. Cache coherence protocols that retain such copy-location information are called directory-based cache coherence protocols.
Central-directory cache-coherence protocols rely on bits within the memory controller (or a cached version of memory) to identify each of the possible shared-copy locations. Such simple protocols are sufficient to support small multiprocessor systems, since the overhead of these bits is small compared to the memory line size. Special or complex adaptations (such as reassociating bits with processor-cache clusters) are required to support larger multiprocessor systems, or even small multiprocessor systems with large numbers of possible cache addresses.
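A minimal sketch of a central directory follows, assuming one presence bit per cache for each line (a common textbook arrangement, not necessarily the one any particular system uses). The class, method, and function names are hypothetical.

```python
class CentralDirectory:
    """Per-line presence bit vector, one bit per cache (illustrative only)."""

    def __init__(self, num_caches):
        self.num_caches = num_caches
        self.presence = {}  # line address -> integer used as a bit vector

    def add_sharer(self, addr, cache_id):
        if not 0 <= cache_id < self.num_caches:
            raise ValueError("cache_id out of range")
        self.presence[addr] = self.presence.get(addr, 0) | (1 << cache_id)

    def invalidation_targets(self, addr, writer_id):
        """Return the caches that must be invalidated, then record the writer
        as the only holder of the line."""
        bits = self.presence.get(addr, 0)
        targets = [i for i in range(self.num_caches)
                   if (bits >> i) & 1 and i != writer_id]
        self.presence[addr] = 1 << writer_id
        return targets


def directory_overhead(num_caches, line_size=64):
    """Presence bits per line as a fraction of the line's data bits."""
    return num_caches / (line_size * 8)
```

Under this assumption, 8 caches cost 8 bits per 512-bit line (about 1.6%), while 1024 caches would need 1024 bits per line, twice the size of the data itself. That growth is why larger systems need adaptations such as per-cluster bits.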
Distributed-directory cache-coherence protocols rely on a pointer within the memory controller (or a cached shared cache) to identify the first of many possible shared-copy locations. Each shared-copy location has state to identify additional shared copies, typically through singly-linked lists, doubly-linked lists, or binary-tree structures.
One example of a distributed-directory cache-coherence protocol is the cache coherence protocol specified by IEEE Std 1596, Scalable Coherent Interface. This specification assumes the presence of memory tags that identify the first cache location (the head of the list); shared copies are found by walking the doubly-linked list from the head. A distinct list is maintained for each possible memory-line address.
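The sharing-list idea can be sketched with a per-line doubly-linked list whose head is recorded in a memory tag. This is a simplified illustration and does not reproduce the states or transactions defined by IEEE Std 1596; the class and method names are hypothetical, and attaching new sharers at the head is a modeling choice made here.

```python
class SharingNode:
    """One cache's entry in a line's sharing list."""

    def __init__(self, cache_id):
        self.cache_id = cache_id
        self.forward = None   # next entry, toward the tail
        self.backward = None  # previous entry, toward the head


class MemoryTags:
    """Memory-side tags: one head pointer per line address."""

    def __init__(self):
        self.head = {}  # line address -> head SharingNode

    def attach(self, addr, cache_id):
        """Insert a new sharer at the head of the line's list."""
        node = SharingNode(cache_id)
        old_head = self.head.get(addr)
        if old_head is not None:
            node.forward = old_head
            old_head.backward = node
        self.head[addr] = node
        return node

    def detach(self, addr, node):
        """Unlink a sharer; its neighbors are patched around it."""
        if node.backward is not None:
            node.backward.forward = node.forward
        elif self.head.get(addr) is node:
            if node.forward is None:
                del self.head[addr]
            else:
                self.head[addr] = node.forward
        if node.forward is not None:
            node.forward.backward = node.backward
        node.forward = node.backward = None

    def sharers(self, addr):
        """Walk the list from the head, collecting cache identifiers."""
        ids = []
        node = self.head.get(addr)
        while node is not None:
            ids.append(node.cache_id)
            node = node.forward
        return ids
```

In this sketch the memory-side cost per line is a single head pointer regardless of the number of sharers, and the backward pointers let a cache remove itself without walking the list from the head.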