1. Field of the Invention
The present invention relates to memory access control in multiprocessors, and more particularly to a directory structure for maintaining cache coherency.
2. Background Information
In a shared-memory multiprocessor system with per-processor caches, it is desirable for hardware to keep the caches coherent. This means that when one or more processors have a copy of a block or line of memory and one of the processors performs a write to that line, the cached copies must either be updated or invalidated.
There are two basic mechanisms for doing this. One is to have all processors xe2x80x9csnoopxe2x80x9d on all write operations in the system and invalidate or update their own copies when appropriate. This is called xe2x80x9csnoopyxe2x80x9d coherence or xe2x80x9csnooping cachesxe2x80x9d, and is generally done in shared-bus multiprocessors.
The other alternative is to use directories that are associated with main memory. In large systems, memory, and thus the directories, may be physically distributed among the machine. The directory associated with a region of memory keeps track of which processors, if any, have copies of memory lines in that region. When a write occurs to a line, the directory is consulted, and updates or invalidates are sent to any processors that have a copy of the line. In this way, processors are relieved from having to snoop on all memory write traffic.
The basic problem with directories is that the required storage does not scale well. The canonical directory structure associates a directory entry with every line of main memory. Each directory entry consists of some state, along with a bit vector (one bit per processor in the system) indicating which processor(s) currently have a copy of the line. As the number of processors is increased, the number of lines for which directory information is needed grows linearly. In addition, the size of the directory entries also grows linearly. The net result is that the total amount of storage needed for the directories grows as the square of the number of processors, becoming prohibitively expensive for large systems.
There are a number of ways to reduce directory overhead. One way is to reduce the number of directory entries by noting that, since caches are much smaller than main memory, most memory lines will not be cached by any processor at any given time. Thus, the size of the directories are reduced, and the directory is managed as a cache. Any line for which a directory entry cannot be found is assumed to be in the uncached state.
Another way to reduce directory overhead is to limit the size of the directory entries such that an entry cannot represent any arbitrary set of processors in the system. The system is then either prohibited from allowing non-representable sets of processors to cache a line concurrently (by, say, invalidating the copies of certain processors when other processors obtain a copy of the line), or, more preferably, when a non-representable set of sharing processors occurs, the directory entry is set to represent some superset of the sharing processors. Then when the line is written, an invalidation or update message is sent to the superset of processors caching the line.
(The two approaches are orthogonal. That is, use of both approaches in tandem should result in a greater reduction in directory overhead than can be achieved with either one by itself.)
A mechanism for limiting the size of the directory entries is discussed next. In the following discussion, we will for convenience assume an invalidate-based coherence protocol, although the approach is equally applicable to an update-based protocol. A goal of a directory structure for a large multiprocessor system is to use a modest number of bits in a directory entry, yet minimize the number of xe2x80x9cspuriousxe2x80x9d invalidation messages that must be sent when a line is written. That is, keep the superset as close to the size of the actual set of sharers as possible.
At one end of the spectrum is a full broadcast mechanism. In this scheme, as soon as a line becomes cached by any processor (or perhaps only when it is cached by more than one processor), the state of the corresponding directory indicate is set to indicate that a broadcast is necessary. When the line is written, invalidations are sent to all processors in the system. This mechanism minimizes the number of bits needed in the directory entry, but maximizes spurious invalidations. The performance of this scheme is very poor for large systems.
At the other end of the spectrum is the full bit-vector mechanism described above, in which a directory entry includes a bit for each processor in the system. This maximizes directory storage overhead, but eliminates spurious invalidations. The storage overhead for this scheme is unacceptable for large systems.
A reasonable middle ground is a xe2x80x9ccoarse-vectorxe2x80x9d directory structure like the one used in the Origin xe2x80x9c(trademark)xe2x80x9d 2000 manufactured by Silicon Graphics Inc (SGI) of Mountain View, Calif. The directory structure in the Origin xe2x80x9c(trademark)xe2x80x9d 2000 includes a bit vector of size v in the directory entries (where v=32). Each bit represents one or more processor nodes in the system; for systems with thirty-two or fewer nodes, this size bit vector acts like a full bit vector. For larger numbers of nodes, however, the vector can become xe2x80x9ccoarsexe2x80x9d. When the set of processor nodes sharing a line is contained within an aligned block of consecutive processor nodes, then the bit vector can still be used as a full bit vector, with another small field in the directory entry specifying the block of processor nodes the vector represents. Processor nodes will typically contain one or more processors; in the Origin xe2x80x9c(trademark)xe2x80x9d 2000, each processor node includes two processors.
When the set of processor nodes expands beyond an aligned block of v processor nodes, however, the meaning of the bits in the vector is changed (this is recorded in the state information in the entry). For N-processor-node systems, each bit in the vector now represents N/v processor nodes. For example, in a 512-processor node system with a 32-bit vector, each bit in the vector represents sixteen processor nodes. For every processor node caching the line, the bit representing the set of processor nodes containing that processor node would be set. When the line is written, for each bit that is set in the coarse vector, invalidation messages are sent to the corresponding set of N/v processor nodes. In most cases, this will cause invalidations to be sent to some processor nodes that do not have a copy of the line (spurious invalidates). However, the number of spurious invalidates will generally be much fewer than would be sent with the full broadcast mechanism.
Such an approach is effective in reducing the size of each directory entry. Such an approach does, however, continue to scale linearly as the number of processors in the system increase. What is needed is a better way of compacting directory entries.
The present invention is a cache coherence system and method for use in a multiprocessor computer system having a plurality of processor nodes, a memory and an interconnect network connecting the plurality of processor nodes to the memory. Each processor node includes one or more processors. The memory includes a plurality of lines and a cache coherence directory structure having a plurality of directory structure entries. Each of the directory structure entries is associated with one of the plurality of lines and each directory structure entry includes processor pointer information, expressed as a set of bit vectors, indicating the processor nodes that have cached copies of lines in memory.
According to another aspect of the present invention, a method is described for maintaining cache coherency across a computer system having a plurality of processor nodes, including a first and a second processor node, wherein each of the plurality of processor nodes includes a cache and at least one processor. The method comprises the steps of assigning a processor number to each of the plurality of processor nodes, defining a plurality of bit vectors, wherein each processor node is represented by a bit in each of the plurality of bit vectors, determining a first location in memory, reading a line from the first location into the cache of said first processor node, wherein the step of reading includes the step of setting a bit in each of the plurality of bit vectors as a function of the processor number assigned to the first processor node, reading a line from the first location into the cache of the second processor node, wherein the step of reading includes the step of setting a bit in each of the plurality of bit vectors as a function of the processor number assigned to the second processor node, and writing to the first location. The step of writing includes the steps of a) determining, as a function of bits set in the bit vectors, processor nodes holding a copy of the line; and b) sending a message to the processor nodes determined in step a) indicating that the line read from the first location is no longer valid. In one embodiment the step of sending a message to the processor nodes includes the step of updating the line.
According to yet another aspect of the present invention, a computer readable medium is described having program code for maintaining cache coherency across a computer system having a plurality of processor nodes, including a first and a second processor node, wherein each of the plurality of processor nodes has an assigned processor number and wherein each of the plurality of processor nodes includes a cache and at least one processor. The program code comprises program code for defining a plurality of bit vectors, wherein each processor node is represented by a bit in each of the plurality of bit vectors, program code for determining a first location in memory, program code for reading a line from the first location into the cache of the said first processor node, wherein the program code for reading includes program code for setting a bit in each of the plurality of bit vectors as a function of the processor number assigned to the first processor node, program code for reading a line from the first location into the cache of the second processor node, wherein the program code for reading includes program code for setting a bit in each of the plurality of bit vectors as a function of the processor number assigned to the second processor node and program code for writing to the first location. The program code for writing includes program code for determining, as a function of bits set in the bit vectors, processor nodes holding a copy of the line and program code for sending a message indicating that the line read from the first location is no longer valid. In one embodiment the program code for sending a message to the processor nodes includes program code for updating the line.