1. Field of the Invention
The present invention generally relates to a computer system with multiple processors. More particularly, the present invention relates to the sharing of data among processors in a Distributed Shared Memory (“DSM”) computer system. Still more particularly, the invention relates to a high-performance, directory-based cache coherence protocol that permits processors in a DSM system to make more efficient use of their cache memory resources.
2. Background of the Invention
Distributed computer systems typically comprise multiple processors connected to each other by a communications network. In some distributed computer systems, the processors can access shared data. Such systems are sometimes referred to as parallel computers. If a large number of processors are networked, the distributed system is considered to be a “massively” parallel system. One advantage of a massively parallel computer system is that it can solve complex computational problems in a relatively short time period.
In parallel and massively parallel computer systems, it has become common to distribute memory throughout the computer system, and to permit most or all of the processors to access data stored in the distributed memory, regardless of where the memory is located. Such a shared memory architecture is conventionally known as a Distributed Shared Memory (“DSM”) system. One of the challenges that faces a designer of a DSM system is to ensure that the data is stored and retrieved in a coherent manner. In particular, the system designer must implement a protocol for handling data that ensures that two different processors do not concurrently modify the same piece of data and then try to save that data back to the same memory location. Thus, to guarantee the coherency of the data in memory, steps must be taken to permit only one processor to modify any particular memory location at any one time.
Recently, DSM systems have been built as a cluster of Symmetric Multiprocessors (“SMP”). In SMP systems, shared memory can be implemented efficiently in hardware since the processors are symmetric (e.g., identical in construction and in operation). It has become desirable to construct large-scale DSM systems in which processors efficiently share memory resources. Thus, the system preferably supports having a program executing on a first processor fetch data stored in a section of system memory that operates under the control of a second processor.
It has become commonplace to provide memory caches (or cache memory) with processors. The memory cache typically is located on the same semiconductor device as the processor (an L1 cache), or is located immediately adjacent the processor, and connected to the processor by a high-speed data bus (an L2 cache). Memory devices used for cache memory usually comprise very high-speed devices, thus permitting much faster access times than standard system memory. The drawback, however, is that cache memory is relatively expensive, and relatively large in physical size. Thus, cache memory typically has a small data capacity relative to the system memory. The processor uses these high-speed cache memory devices either to store data that is being used repeatedly by the processor, or to store data expected to be used by the processor. Because of the small capacity of the cache, data is displaced from the cache if it is not used. The displaced data is then re-stored in the system memory. Usually, data is read from the system memory into cache memory, where it is modified by the processor. When the processor has completed modification of data, the data is displaced, and written back to the system memory.
During normal operation, when the processor needs a particular piece of data, it first looks in its associated cache memory. If the desired data is in the cache memory, that is referred to as a “cache hit”, and the processor then performs any necessary operations on the version of data in the cache. If the desired data is not in the associated memory cache(s), that is referred to as a “cache miss”, and the processor then retrieves the data from system memory and stores a copy in the cache for further operations. The fact that each processor in a multi-processor system has its own cache greatly complicates the problem of data coherency. In particular, a problem can arise if multiple processors have made a cache copy of the same data from system memory. The processors then may proceed to modify, in different ways, the copy of the data stored in the processors' associated caches. The various modified copies of data stored in each cache memory raise the specter that multiple inconsistent versions may exist of what was originally the same data obtained from the same memory location. A coherency protocol thus must be implemented that prevents inconsistent versions of the same data from being created, or more particularly, from being indeterminately written back to system memory.
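By way of illustration only, the hit/miss behavior and the resulting coherency hazard can be sketched in a few lines of Python. The `Cache` class, the address `0x10`, and the values used are hypothetical and are not taken from any actual system; the sketch merely assumes two private write-back caches in front of one shared memory, with no coherency protocol in place.

```python
class Cache:
    """Toy private cache in front of a shared memory (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory      # shared system memory: address -> value
        self.lines = {}           # locally cached copies: address -> value
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        if addr in self.lines:    # cache hit: use the local copy
            self.hits += 1
        else:                     # cache miss: fetch from system memory
            self.misses += 1
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.read(addr)           # ensure a copy is cached first
        self.lines[addr] = value  # modify only the local copy

    def write_back(self, addr):
        # Displace the line and store it back to system memory.
        self.memory[addr] = self.lines.pop(addr)


memory = {0x10: 5}
cache_a, cache_b = Cache(memory), Cache(memory)

cache_a.write(0x10, cache_a.read(0x10) + 1)   # A's copy becomes 6
cache_b.write(0x10, cache_b.read(0x10) * 10)  # B's copy becomes 50

# Two inconsistent versions of the same location now exist.
versions = (cache_a.lines[0x10], cache_b.lines[0x10])

# Without a coherency protocol, whichever cache writes back last wins,
# and the other processor's update is silently lost.
cache_a.write_back(0x10)
cache_b.write_back(0x10)
final_value = memory[0x10]
```

In this sketch the final contents of the memory location depend solely on the order of the two write-backs, which is the indeterminate outcome a coherency protocol is intended to exclude.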
Because the size of computer systems continues to increase at a rapid rate, and because massively parallel computer systems are being developed with an ever-increasing number of processors, it is desirable that the coherency protocol be scalable to processor configurations of various sizes. One technique that has been adopted in large multi-processor systems is to implement a directory-based cache coherence system. In a directory-based solution, a directory is maintained for all of the memory in the system. Typically, like the system memory, the directory is distributed throughout the computer system. According to conventions developed by the assignee of the present invention, each processor is assigned the responsibility of controlling some portion of the distributed system memory. With respect to that portion of system memory under the control of a given processor, that processor maintains a directory that identifies which portions of that memory have been copied by a processor. In this fashion, it is possible to determine if a particular block of data has already been copied. Various instructions are then used to determine which data is valid in the system, and what data must not be used or must be flushed to make sure that data is maintained coherently.
Most reduced instruction set computer (“RISC”) processors use “Load Lock” and “Store Conditional” instructions to ensure synchronization in multiprocessor distributed shared memory architectures. A processor uses the “Load Lock” and “Store Conditional” instructions when it seeks exclusive access to a block of data (which is located in a particular memory address range), so that it is the only processor that can manipulate that data during the period that “Load Lock” is asserted. Thus, the processor issues a “Load Lock” instruction for a block of data coincidentally with reading that data block and storing a copy in the cache memory of that processor. After the data is manipulated, the processor then issues the “Store Conditional” instruction, which causes the data to be re-written to the original memory location where the data resided, at which time the memory block is released. If that section of memory has been written to in the interval between when “Load Lock” and “Store Conditional” were asserted, the “Store Conditional” instruction fails. Thus, the “Store Conditional” instruction is a conditional instruction that only executes (i.e., that only stores) if the data in that memory location has not been modified since the “Load Lock” instruction was issued.
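The semantics described above may be sketched, purely for illustration, as follows. The `Memory` class, its method names, and the `atomic_increment` helper are hypothetical; they model the instructions in software and do not correspond to any particular processor's instruction set. The sketch assumes that any write to a block clears every outstanding lock flag for that block.

```python
class Memory:
    """Toy model of Load Lock / Store Conditional (illustrative only)."""

    def __init__(self):
        self.data = {}
        self.locks = {}  # address -> set of processors with a live lock flag

    def load_lock(self, cpu, addr):
        # Read the block and record that this processor is watching it.
        self.locks.setdefault(addr, set()).add(cpu)
        return self.data.get(addr, 0)

    def store(self, cpu, addr, value):
        # Assumption: any write to the block clears all lock flags on it.
        self.data[addr] = value
        self.locks[addr] = set()

    def store_conditional(self, cpu, addr, value):
        # Store only if nothing has written the block since Load Lock.
        if cpu not in self.locks.get(addr, set()):
            return False
        self.store(cpu, addr, value)
        return True


def atomic_increment(mem, cpu, addr):
    """Typical retry loop: repeat until the conditional store succeeds."""
    while True:
        old = mem.load_lock(cpu, addr)
        if mem.store_conditional(cpu, addr, old + 1):
            return old + 1


mem = Memory()
mem.data["X"] = 10

old = mem.load_lock("A", "X")         # processor A reads 10 under Load Lock
mem.store("B", "X", 99)               # processor B writes in the interval
failed = mem.store_conditional("A", "X", old + 1)  # A's store must fail

result = atomic_increment(mem, "A", "X")  # an undisturbed retry succeeds
```

The retry loop reflects the usual programming idiom: a failed “Store Conditional” is not an error but a signal that the read-modify-write sequence must be repeated on fresh data.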
As an example and referring to FIG. 1, assume four processors A, B, C and D have shared access to memory locations M, N, O and P. Memory M operates under control of processor A, memory N operates under control of processor B, and so on. If processor A seeks exclusive access to memory block X in memory O, it will issue a “Load Lock” instruction for that block of memory to the processor responsible for block X (which happens to be processor C), and that processor records in its directory (shown as “D”) that processor A has a copy of that memory block. If, while the “Load Lock” is active, processor B writes to that same memory block X, processor A must be informed of that event, so that processor A can determine that the “Load Lock” has failed. In operation, processor A learns of the write operation during “Load Lock” when “Store Conditional” is executed.
A “Load Lock” instruction may be used in situations where a programmer desires serialized access to a location that is shared in a distributed memory, multi-processor system. Thus, “Load Lock” may be used in situations where multiple processors are instructed to modify a block of data in a specific order (processor A modifies the data, followed by processor B, followed by processor D, . . . ).
As mentioned above, DSM systems typically include a directory that records which memory locations have had data read out and stored in the local cache of a processor. The directory may be located anywhere in the computer system, and essentially comprises a table with entries corresponding to each block of memory. Typically, the directory is fragmented, and multiple processors are responsible for certain portions of the directory. Typically, control of the memory is distributed among various processors, and each processor maintains the directory for the portion of memory for which it is responsible. In the example, processor C is responsible for controlling memory block X in memory O, and processor C includes a directory for memory O, with an entry that indicates that processor A has made a copy of block X. When processor B writes to that same memory block X, the directory causes an “Invalidate” or “Shared Invalidate” instruction to be sent to circuitry within processor A, informing processor A that the block of data copied from memory block X is now invalid due to the write operation by processor B. In response, processor A will flush that data from its cache memory, since it is no longer valid.
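The directory bookkeeping in this example can be sketched as shown below. The `Processor` and `Directory` classes are hypothetical simplifications: the directory is modeled as a table of sharer sets, the write is applied directly to memory, and the “Invalidate” message is modeled as a method call rather than a network transaction.

```python
class Processor:
    """Toy processor holding cached copies of memory blocks."""

    def __init__(self, name):
        self.name = name
        self.cache = {}  # block -> cached value

    def invalidate(self, block):
        # Flush the stale copy in response to an Invalidate message.
        self.cache.pop(block, None)


class Directory:
    """Directory kept by the home processor (hypothetical layout)."""

    def __init__(self, memory):
        self.memory = memory                              # block -> value
        self.sharers = {block: set() for block in memory}

    def read(self, requester, block):
        # Record that the requester now holds a copy of the block.
        self.sharers[block].add(requester)
        requester.cache[block] = self.memory[block]
        return requester.cache[block]

    def write(self, writer, block, value):
        # Invalidate every other recorded copy before applying the write.
        for holder in self.sharers[block] - {writer}:
            holder.invalidate(block)
        self.sharers[block] = {writer}
        self.memory[block] = value
        writer.cache[block] = value


proc_a, proc_b = Processor("A"), Processor("B")
dir_o = Directory({"X": 1})   # directory for memory O, kept by processor C

dir_o.read(proc_a, "X")       # directory records that A has a copy of X
dir_o.write(proc_b, "X", 2)   # B writes X, so A is sent an Invalidate

a_still_has_copy = "X" in proc_a.cache
holders = sorted(p.name for p in dir_o.sharers["X"])
```

After the write, processor A no longer holds block X and the directory entry lists only processor B, which mirrors the sequence of events described in the example.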
A block of memory can be requested as a writeable copy (i.e., a copy of the memory block with authority to modify the block), or a read-only copy (which can be read, but not modified). If processor A requests a read-only copy of memory block Y in memory P, and does not issue a “Load Lock” instruction, the directory for block Y in processor D records that processor A has a read-only copy of memory block Y, which is commonly referred to as a “shared” state. If another processor were to write to memory block Y, then the directory for processor D would issue an instruction to processor A invalidating the read-only copy of memory block Y.
If processor A instead requests a writeable copy of block Y, that is referred to as an “exclusive” state, because only processor A has access to memory block Y. If, however, processor A begins working on other data, and the writeable copy of block Y is displaced from the cache memory of processor A, the data is returned to memory block Y, together with a “victim” instruction. The “victim” instruction clears the exclusive state of that memory block Y, and the directory in processor D for memory P no longer shows any association with processor A. If another processor then obtains a writeable copy of block Y, and modifies block Y, then the directory in processor D will not notify processor A because there is no longer any association of block Y with processor A. If processor A then subsequently executes a “Store Conditional” on block Y, it will be unaware that block Y has been modified, and the directory will not detect this event. Thus, between the assertion of “Load Lock” and “Store Conditional” by processor A, a different processor may cause the data that was the subject of these instructions to be modified. As a result, data incoherency exists because the data was not handled in the proper order specified by the programmer.
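The failure sequence just described can be reproduced with a small model. The `Cpu` and `HomeDirectory` classes and the `run` function below are hypothetical; the sketch assumes that a processor's lock flag is cleared only by an “Invalidate” from the directory, and that the “victim” instruction removes the processor from the directory entry without touching that flag.

```python
class Cpu:
    """Toy processor with a single Load Lock flag."""

    def __init__(self, name):
        self.name = name
        self.lock_flag = False  # set by Load Lock, cleared by Invalidate

    def invalidate(self):
        self.lock_flag = False


class HomeDirectory:
    """Directory entry for one memory block (illustrative only)."""

    def __init__(self, value):
        self.value = value
        self.owner = None  # processor recorded as holding the block

    def _take_ownership(self, cpu):
        # Notify the previously recorded owner, if there is one.
        if self.owner is not None and self.owner is not cpu:
            self.owner.invalidate()
        self.owner = cpu

    def load_lock_exclusive(self, cpu):
        self._take_ownership(cpu)
        cpu.lock_flag = True
        return self.value

    def victim(self, cpu, value):
        # Write the displaced data back and clear the exclusive state.
        self.value = value
        self.owner = None

    def write(self, cpu, value):
        self._take_ownership(cpu)
        self.value = value

    def store_conditional(self, cpu, value):
        if not cpu.lock_flag:
            return False
        cpu.lock_flag = False
        self.write(cpu, value)
        return True


def run(evict):
    """Return (store_conditional_succeeded, final_value) for block Y."""
    cpu_a, cpu_b = Cpu("A"), Cpu("B")
    block_y = HomeDirectory(0)
    old = block_y.load_lock_exclusive(cpu_a)
    if evict:
        block_y.victim(cpu_a, old)  # block Y is displaced from A's cache
    block_y.write(cpu_b, 42)        # another processor modifies block Y
    ok = block_y.store_conditional(cpu_a, old + 1)
    return ok, block_y.value


without_eviction = run(evict=False)  # A is invalidated; its store fails
with_eviction = run(evict=True)      # no notice reaches A; its store succeeds
```

In the run with eviction, the “Store Conditional” succeeds and overwrites the value written by the other processor, even though the block was modified after the “Load Lock” was issued.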
To prevent this situation from occurring, most conventional designs do not allow the writeable copy to be displaced or evicted from the cache until the “Store Conditional” instruction executes. Instead, conventional designs require that writeable copies of memory be locked in the cache until operation on that copy is completed. This prevents subsequent writes to the same block from going undetected. This approach of locking the cache, however, ties up cache resources. In a multithreaded system where multiple programs can execute simultaneously on the same processor, this can severely impact performance of the system. As an example, block Y may be copied by processor A for program 1, but program 2 begins executing on processor A. Unfortunately, the cache of processor A may become full, because instructions have also been fetched for program 1. It would be desirable if processor A could free its cache to operate expeditiously on program 2, if necessary. Under conventional implementations, block Y could not be evicted from the cache associated with processor A until operations are complete on block Y and the data has been conditionally stored. Despite the apparent inefficiencies such an approach produces, to date no one has developed an efficient solution to this problem.
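The cost of locking a block in the cache can be illustrated with a deliberately small model. The `PinningCache` class, the two-line capacity, the least-recently-used replacement policy, and the block names P and Q are all assumptions chosen for the illustration; actual cache organizations differ considerably.

```python
from collections import OrderedDict


class PinningCache:
    """Toy LRU cache in which locked (pinned) lines cannot be evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # block -> pinned flag, oldest first
        self.misses = 0

    def access(self, block, pin=False):
        if block in self.lines:
            # Hit: mark as most recently used, keeping any existing pin.
            self.lines.move_to_end(block)
            self.lines[block] = self.lines[block] or pin
            return
        self.misses += 1
        if len(self.lines) >= self.capacity:
            self._evict()
        self.lines[block] = pin

    def _evict(self):
        # Choose the least recently used line that is not pinned.
        victim = next((b for b, pinned in self.lines.items() if not pinned),
                      None)
        if victim is None:
            raise RuntimeError("every line is pinned; nothing can be evicted")
        del self.lines[victim]


def program2_misses(pin_y):
    """Count the misses program 2 suffers after program 1 loads block Y."""
    cache = PinningCache(capacity=2)
    cache.access("Y", pin=pin_y)      # program 1 loads block Y
    before = cache.misses
    for block in ["P", "Q"] * 3:      # program 2 alternates between P and Q
        cache.access(block)
    return cache.misses - before


misses_when_evictable = program2_misses(pin_y=False)
misses_when_locked = program2_misses(pin_y=True)
```

Under these assumptions, program 2 misses twice when block Y may be evicted and six times when block Y is locked in the cache, because the locked line reduces the capacity available to program 2 to a single line.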