In a shared-memory multiprocessor computer system with per-processor caches, it is desirable for hardware to keep the caches coherent. This means that when one or more processors have a copy of a block or line of memory and one of the processors performs a write to that line, the cached copies must either be updated or invalidated.
There are two basic mechanisms for doing this. One is to have all processors “snoop” on all write operations in the system and invalidate or update their own copies when appropriate. This is called “snoopy” coherence or “snooping caches”, and is generally done in shared-bus multiprocessors.
The other alternative is to use directories that are associated with main memory. In large systems, memory, and thus the directories, may be physically distributed among the machine. The directory associated with a region of memory keeps track of which processors, if any, have copies of memory lines in that region. When a write occurs to a line, the directory is consulted, and updates or invalidates are sent to any processors that have a copy of the line. In this way, processors are relieved from having to snoop on all memory write traffic.
The basic problem with directories is that the required storage does not scale well. The canonical directory structure associates a directory entry with every line of main memory. Each directory entry consists of some state, along with a bit vector (one bit per processor in the system) indicating which processor(s) currently have a copy of the line. As the number of processors is increased, the number of lines for which directory information is needed grows linearly. In addition, the size of the directory entries also grows linearly. The net result is that the total amount of storage needed for the directories grows as the square of the number of processors, becoming prohibitively expensive for large systems.
There are a number of ways to reduce directory overhead. One way is to reduce the number of directory entries by noting that, since caches are much smaller than main memory, most memory lines will not be cached by any processor at any given time. Thus, the size of the directories are reduced, and the directory is managed as a cache. Any line for which a directory entry cannot be found is assumed to be in the uncached state.
Another way to reduce directory overhead is to limit the size of the directory entries such that an entry cannot represent any arbitrary set of processors in the system. The system is then either prohibited from allowing non-representable sets of processors to cache a line concurrently, (by, say, invalidating the copies of certain processors when other processors obtain a copy of the line), or, more preferably, when a non-representable set of sharing processors occurs, the directory entry is set to represent some superset of the sharing processors. Then when the line is written, an invalidation or update message is sent to the superset of processors caching the line.
The two approaches are orthogonal. That is, use of both approaches in tandem should result in a greater reduction in directory overhead than can be achieved with either one by itself.
In the following discussion and throughout this patent specification, for convenience an invalidate-based coherence protocol is assumed, although the approach is equally applicable to an update-based protocol. A goal of a directory structure for a large multiprocessor system is to use a modest number of bits in a directory entry, yet minimize the number of “spurious” invalidation messages that must be sent when a line is written. That is, keep the superset as close to the size of the actual set of sharers as possible.
At one end of the spectrum is a full broadcast mechanism. In this scheme, as soon as a line becomes cached by any processor (or perhaps only when it is cached by more than one processor), the state of the corresponding directory is set to indicate that a broadcast is necessary. When the line is written, invalidations are sent to all processors in the system. This mechanism minimizes the number of bits needed in the directory entry, but maximizes spurious invalidations. The performance of this scheme is very poor for large systems.
At the other end of the spectrum is the full bit-vector mechanism described above, in which a directory entry includes a bit for each processor in the system. This maximizes directory storage overhead, but eliminates spurious invalidations. The storage overhead for this scheme is unacceptable for large systems.
A reasonable middle ground is a “coarse-vector” directory structure like the one used in the Origin 2000 manufactured by Silicon Graphics In. of Mountain View, Calif. The directory structure in the Origin 2000 includes a bit vector of size v in the directory entries (where v=32). Each bit represents one or more processor nodes in the system. For systems with thirty-two or fewer nodes, this size bit vector acts like a full bit vector. For larger numbers of nodes, however, the vector can become “coarse”. When the set of processor nodes sharing a line is contained within an aligned block of consecutive processor nodes, then the bit vector can still be used as a full bit vector, with another small field in the directory entry specifying the block of processor nodes the vector represents. Processor nodes will typically contain one or more processors. In the Origin 2000, each processor node includes two processors.
When the set of processor nodes expands beyond an aligned block of v processor nodes, however, the meaning of the bits in the vector is changed (this is recorded in the state information in the entry). For N-processor-node systems, each bit in the vector now represents N/v processor nodes. For example, in a 512-processor node system with a 32-bit vector, each bit in the vector represents sixteen processor nodes. For every processor node caching the line, the bit representing the set of processor nodes containing that processor node would be set. When the line is written, for each bit that is set in the coarse vector, invalidation messages are sent to the corresponding set of N/v processor nodes. In most cases, this will cause invalidations to be sent to some processor nodes that do not have a copy of the line (spurious invalidates). However, the number of spurious invalidates will generally be much fewer than would be sent with the full broadcast mechanism.
Such an approach is effective in reducing the size of each directory entry. Such an approach does, however, continue to scale linearly as the number of processors in the system increases.
A better way of compacting directory entries, in contrast to the single bit vector approach discussed above, is a multi-dimensional directory structure approach which uses two or more bit vectors to track each processor node. Such an approach is described in U.S. patent application Ser. No. 08/971,184, filed Nov. 17, 1997, now pending, entitled Multi-Dimensional Cache Coherence Directory Structure, the disclosure of which is incorporated herein by reference. In the multi-dimensional approach, each processor node is represented by a bit in each of the two or more bit vectors.
For example, a two-dimensional directory structure includes two or more entries. Each entry includes a row vector such as an r-bit vector (where r is an integer greater than one) and a column vector such as a c-bit vector (where c is an integer greater than one). Each intersection of a row and a column represents a node that may have one or more processors. More generally, for a system with n vectors, as each processor obtains a copy of a shared line, the bits corresponding to that processor are set in each of the n vectors of the associated directory entry. If n-1 of the vectors have only a single bit set, then the pointer information remains exact. If more than one bit is set in at least two of the fields, however, then the pointer information can become inexact, or approximate.
As a result, in addition to invalidation messages being sent to shared processor nodes needing the messages, spurious invalidation messages are sent to alias processor nodes not needing the messages.