The present invention relates, in general, to the field of multi-processor computer systems. In particular, the present invention relates to a split directory-based cache coherency technique for a multi-processor computer system.
The advent of low-cost high-performance microprocessors has made large-scale multiprocessor computers feasible. In general, these microprocessors are cache-oriented; that is, they maintain a subset of the contents of main memory in high-speed storage close to the processor to improve the access latency and bandwidth of frequently used memory data. These local copies can become inconsistent if one processor changes an element of memory by modifying its local cache and the change is not propagated to all processors that share that memory. The precise structure of such caches varies greatly depending on the system design.
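The inconsistency described above can be illustrated with a minimal sketch. The model is an assumption made for illustration only: main memory and each processor's cache are plain dictionaries, and the names `load` and `store_local` are hypothetical.

```python
# Hypothetical model: one main memory and one private cache per processor.
memory = {0x100: 1}
cache_a = {}
cache_b = {}

def load(cache, addr):
    """Read through the cache, filling it from main memory on a miss."""
    if addr not in cache:
        cache[addr] = memory[addr]
    return cache[addr]

def store_local(cache, addr, value):
    """Modify only the local cached copy; nothing is propagated."""
    cache[addr] = value

load(cache_a, 0x100)             # processor A caches the element
load(cache_b, 0x100)             # processor B caches the same element
store_local(cache_a, 0x100, 2)   # A changes its local copy only

fresh = load(cache_a, 0x100)     # A sees its own change
stale = load(cache_b, 0x100)     # B still reads the old value
```

Because the change made by processor A never reaches processor B or main memory, the two processors now hold different values for the same address.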
This caching problem has led to two basic architectures sometimes known as "shared memory" and "partitioned memory". In a shared memory system, algorithms are used to maintain the consistency of the shared data. Typically, in commercially successful systems, the consistency is implemented by hardware and is invisible to the software. Such systems are called "cache-consistent" and form the basis of almost all multiprocessor computer systems produced. On the other hand, the partitioned memory approach disallows sharing of memory altogether or allows sharing by only a small number of processors, thereby simplifying the problem greatly. In such computer systems, larger configurations are created by connecting groups of computer systems with a network and using a message-passing paradigm that is most often made visible to the application software running on the system.
The development of cache coherent systems has led to some fundamental design problems. For large-scale systems, limits on data transmission bandwidth and speed make cache coherency difficult to achieve. Coherency operations transmitted across the communications channel have traditionally been limited by low bandwidths, thus reducing overall system speed. Large-scale systems containing a large number of processors require accurate and high-speed cache coherency implementations.
With this in mind, some fundamental issues must be resolved in order to maintain a consistent view of memory across processors. First, processors must follow an arbitration protocol that grants permission to a processor to read or modify memory contents. To perform this function, coherency protocols divide memory into fixed "lines" (subsections of memory, typically 32, 64, or 128 bytes in size) that are treated as an atomic unit. Typically, each line is in one of three states: allocated to a single processor in "exclusive mode", which allows writing; allocated to one or more processors in "read-only mode"; or not currently cached. A processor is required to request a line in exclusive or read-only mode when loading it from memory. In order to support this, the cache must allow the memory subsystem to delay completion of a request while the state of the line is analyzed, and must allow operations to be performed on the processor cache while the system is waiting for an operation to complete.
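The line states and the request rule described above can be sketched as follows. This is an illustrative model under stated assumptions, not a description of any particular hardware: the 64-byte line size is one of the typical sizes mentioned, and the names `LineState`, `line_of`, and `Line.request` are hypothetical.

```python
from enum import Enum

LINE_SIZE = 64  # bytes; assumed here, typical sizes are 32, 64, or 128

class LineState(Enum):
    UNCACHED = "uncached"
    READ_ONLY = "read-only"
    EXCLUSIVE = "exclusive"

def line_of(addr, line_size=LINE_SIZE):
    """Map a byte address to the index of the line that contains it."""
    return addr // line_size

class Line:
    def __init__(self):
        self.state = LineState.UNCACHED
        self.holders = set()

    def request(self, cpu, exclusive):
        """Grant the line if the request is compatible with its state.

        Returns False when the request would have to be delayed until
        coherency operations on other caches have completed.
        """
        if self.state is LineState.UNCACHED:
            granted = True
        elif self.state is LineState.READ_ONLY:
            # More readers are always fine; a writer must be the sole holder.
            granted = (not exclusive) or self.holders <= {cpu}
        else:
            # Exclusive: only the current owner may proceed.
            granted = self.holders == {cpu}
        if not granted:
            return False
        if exclusive:
            self.state = LineState.EXCLUSIVE
            self.holders = {cpu}
        elif self.state is not LineState.EXCLUSIVE:
            self.state = LineState.READ_ONLY
            self.holders.add(cpu)
        return True
```

A request that returns False corresponds to the case in which the memory subsystem must delay completion while the state of the line is analyzed and other caches are contacted.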
The process of moving a line from one processor to another, when that is required, can be done in many ways. One of these approaches is termed "invalidation based" and is the technique most frequently used in existing multi-processor computer systems. In such systems, lines are removed from other processors' caches when the contents of a line are to be changed. Another approach allows for updating all caches containing the line when that line is changed.
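The difference between the two approaches can be shown with a small sketch. It assumes, purely for illustration, that each cache is a dictionary keyed by line number; the function names `write_invalidate` and `write_update` are hypothetical.

```python
def write_invalidate(caches, writer, line, value):
    """Invalidation based: remove the line from every other cache,
    then change the writer's copy."""
    for cpu, cache in caches.items():
        if cpu != writer:
            cache.pop(line, None)
    caches[writer][line] = value

def write_update(caches, writer, line, value):
    """Update based: change the writer's copy and refresh every
    other cache that currently contains the line."""
    for cpu, cache in caches.items():
        if cpu == writer or line in cache:
            cache[line] = value

# Processors 0 and 1 hold line 7; processor 2 does not.
inv = {0: {7: "old"}, 1: {7: "old"}, 2: {}}
write_invalidate(inv, 0, 7, "new")

upd = {0: {7: "old"}, 1: {7: "old"}, 2: {}}
write_update(upd, 0, 7, "new")
```

In the first case processor 1 loses its copy and must reload the line on its next access; in the second case it keeps a copy that already contains the new contents.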
The most common method of providing cache coherence is by using a "snoopy bus" approach. In such systems, all processors can monitor all memory transactions because they are all performed over a small number of buses, usually one or two. This approach cannot be used for large-scale systems because buses cannot supply the required data bandwidth from memory to the processors.
In such cases, a "directory" approach is most commonly used. Such systems use a database to record the processors to which lines are allocated. Transactions on memory require that the directory be examined to determine what coherency operations are required to allocate the line in question. The method of keeping the directory varies.
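A directory lookup of this kind might be sketched as follows. The representation is an assumption chosen for brevity (a dictionary from line number to a mode and a set of processors), and the operation names "invalidate" and "recall" are hypothetical labels rather than terms taken from any specific system.

```python
class Directory:
    def __init__(self):
        # line number -> (mode, set of processors holding the line)
        self.entries = {}

    def operations_for(self, line, cpu, exclusive):
        """Examine the directory and list the coherency operations that
        must complete before the line can be allocated to cpu."""
        mode, holders = self.entries.get(line, ("uncached", set()))
        others = holders - {cpu}
        if mode == "exclusive":
            # The current owner must give the line up first.
            return [("recall", c) for c in sorted(others)]
        if exclusive:
            # Read-only copies elsewhere must be removed before writing.
            return [("invalidate", c) for c in sorted(others)]
        return []

    def record(self, line, cpu, exclusive):
        """Record the allocation once the required operations are done."""
        if exclusive:
            self.entries[line] = ("exclusive", {cpu})
        else:
            mode, holders = self.entries.get(line, ("uncached", set()))
            kept = holders if mode == "read-only" else set()
            self.entries[line] = ("read-only", kept | {cpu})

d = Directory()
d.record(5, 0, exclusive=False)
d.record(5, 1, exclusive=False)
read_ops = d.operations_for(5, 2, exclusive=False)
write_ops = d.operations_for(5, 2, exclusive=True)
d.record(5, 2, exclusive=True)
later_ops = d.operations_for(5, 0, exclusive=False)
```

A read-only request for a line that is only shared requires no operations, whereas an exclusive request requires one operation per other holder.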
Many schemes have been proposed to record the contents of the directory. Most either require time-expensive searches when a directory inquiry is made or use broadcasting when the precise set of caches containing the line is too large to be recorded in the directory hardware. "Broadcasting", in this context, means sending a message to all processors in the system, often by the use of special hardware features to support this style of communication. The difficulty with broadcasting is that switch-based networks do not easily support such operations, and the cost of interrupting processors with requests that do not involve their cache contents can be high.
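The broadcast fallback can be sketched with a directory entry that records only a limited number of sharers. The capacity `MAX_POINTERS` and the class name `LimitedEntry` are assumptions made for illustration; actual directory hardware differs in how it detects and records overflow.

```python
MAX_POINTERS = 2  # assumed capacity: sharers one entry can name precisely

class LimitedEntry:
    def __init__(self):
        self.sharers = set()
        self.overflow = False  # set once the precise sharer set no longer fits

    def add_sharer(self, cpu):
        if self.overflow:
            return
        if cpu in self.sharers or len(self.sharers) < MAX_POINTERS:
            self.sharers.add(cpu)
        else:
            # Too many sharers to record; fall back to broadcasting.
            self.overflow = True
            self.sharers.clear()

    def invalidation_targets(self, all_cpus, writer):
        """Return the precise sharer list if it is known, otherwise
        every other processor in the system (a broadcast)."""
        pool = all_cpus if self.overflow else self.sharers
        return sorted(c for c in pool if c != writer)

entry = LimitedEntry()
entry.add_sharer(0)
entry.add_sharer(1)
precise = entry.invalidation_targets(range(8), writer=0)
entry.add_sharer(2)  # exceeds the assumed capacity
broadcast = entry.invalidation_targets(range(8), writer=0)
```

Once the entry overflows, every other processor in the eight-processor example is contacted, including those whose caches never contained the line, which is the cost noted above.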
In order to invalidate a line that is to be updated, all caches that contain the line must be contacted, which requires a decision as to which processors to contact. Once a list of processors that have allocated the line has been made from the directory, each processor must be sent a message instructing it to remove the line from its cache and to send any changes to memory. This operation must be supported by the microprocessor cache hardware.
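The invalidation sequence described above can be sketched as follows. The sketch assumes a directory that maps each line to a set of processors and a cache that tracks a modified ("dirty") flag per line; the names `Cache`, `handle_invalidate`, and `invalidate_line` are hypothetical, and message delivery is modeled as a direct method call.

```python
class Cache:
    def __init__(self):
        self.lines = {}  # line number -> (data, dirty flag)

    def handle_invalidate(self, line):
        """Remove the line; return its data if it was modified, else None."""
        data, dirty = self.lines.pop(line, (None, False))
        return data if dirty else None

def invalidate_line(directory, caches, memory, line, requester):
    """Contact every processor the directory lists for the line,
    except the requester, and write any changes back to memory."""
    targets = sorted(directory.get(line, set()) - {requester})
    for cpu in targets:
        changed = caches[cpu].handle_invalidate(line)
        if changed is not None:
            memory[line] = changed
    # The requester becomes the only recorded holder of the line.
    directory[line] = {requester}
    return targets

memory = {3: "v0"}
caches = {0: Cache(), 1: Cache(), 2: Cache()}
caches[1].lines[3] = ("v1", True)   # processor 1 holds a modified copy
directory = {3: {1}}

contacted = invalidate_line(directory, caches, memory, 3, requester=0)
```

Only processor 1 is contacted, its modified data reaches memory before the line is handed to processor 0, and processor 2 is never interrupted.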