The present invention relates to computer systems and, more particularly, to cache-coherent distributed-memory multiprocessor systems. A major objective of the present invention is to provide for faster average memory access.
Much of modern progress is associated with the rapid advance of computer technology. However, no sooner are more powerful and feature-laden computers introduced than are appetites whetted for more power and more features.
Computers typically include one or more processors and memory. Computer memory stores information in the form of binary data, the basic unit of which is referred to as a "bit". Most of the data stored in memory is "user data", which includes program instructions and program data. Processors process data many bits at a time; the number of bits handled at a time defines the word size for the incorporating system. Early processors manipulated 8-bit words (one byte) at a time. 32-bit-word systems are now prevalent, and 64-bit-word systems are becoming more widely used.
A processor executes instructions, which can involve performing operations on program data. Multiprocessor systems achieve higher performance by concurrently performing tasks that a single-processor system would perform sequentially. Like single-processor systems, some multiprocessor systems address a unified main memory. However, the gains achieved by adding processors are partially offset by the latencies incurred as the processors contend for access to the unified memory.
To reduce memory contention, main memory can be distributed among two or more memory cells. Each cell contains its own memory and one or more processors. To provide compatibility with programs assuming unified memory, each processor can access not only the local memory, but also the memories of other cells via cell communications link circuitry. While access to local memory is faster than access to remote memory, all main-memory accesses are slow compared to processor speeds.
Caching can ameliorate the performance limitations associated with memory accesses. Caching involves storing a subset of the contents of main memory in a cache memory that is smaller and faster than main memory. Various strategies are used to increase the probability that cache contents anticipate requests for data. For example, since data near a requested word in memory address space is relatively likely to be requested near in time to the requested word, most caches fetch and store multi-word lines. The number of words stored in a single cache line defines the line size for a system; for example, a line can be eight words long.
Since caches typically have far fewer line storage locations than main memory, many main-memory line addresses are associated with each cache location. Accordingly, a tag is stored at each cache location along with data to indicate uniquely the main-memory line address owning the cached data. While there are several types of caches, direct-mapped caches are the fastest since only one cache location needs to be examined for each data request.
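By way of illustration only, the tag comparison performed by a direct-mapped cache can be sketched as follows. The eight-word line size comes from the example above; the four-location cache size, the word-addressed memory modeled as a Python list, and the names `split_address` and `DirectMappedCache` are assumptions made for this sketch and do not describe any particular implementation.

```python
WORDS_PER_LINE = 8   # line size taken from the example in the text
NUM_LOCATIONS = 4    # assumed: a deliberately tiny cache for illustration


def split_address(word_address):
    """Split a word address into (tag, index, offset)."""
    line_address = word_address // WORDS_PER_LINE
    offset = word_address % WORDS_PER_LINE
    index = line_address % NUM_LOCATIONS   # the single location to examine
    tag = line_address // NUM_LOCATIONS    # identifies which line owns it
    return tag, index, offset


class DirectMappedCache:
    def __init__(self):
        # Each location holds (tag, line_data), or None when empty.
        self.locations = [None] * NUM_LOCATIONS

    def read(self, word_address, main_memory):
        """Return (word, hit) where hit is True if the cache held the line."""
        tag, index, offset = split_address(word_address)
        entry = self.locations[index]
        if entry is not None and entry[0] == tag:
            return entry[1][offset], True
        # Miss: fetch the whole multi-word line and replace the old contents.
        start = (word_address // WORDS_PER_LINE) * WORDS_PER_LINE
        line = main_memory[start:start + WORDS_PER_LINE]
        self.locations[index] = (tag, line)
        return line[offset], False
```

In this sketch, two line addresses that differ by a multiple of `NUM_LOCATIONS` compete for the same location, which is why the stored tag is needed to tell them apart.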
In both single-processor and multiprocessor systems, there is a challenge of ensuring "coherency" between the cache and main memory. For example, if a processor modifies data stored in a cache, the modification must be reflected in main memory. Typically, there is some latency between the time the data is modified in the cache and the time the modification is reflected in main memory. During this latency, the yet-to-be-modified data in main memory is invalid. Steps must be taken to ensure that the main-memory data is not read while it is invalid.
Maintaining coherency in multiprocessor systems can be especially complex since data can be stored concurrently in multiple caches. When a replica of data in one cache is modified, the corresponding data in the other caches is rendered invalid. Thus, some means is required to track which caches hold what data and to indicate when cached data is rendered invalid due to a modification of a replica of that data in another cache.
Typically, "permission" is required to modify cached data. That permission is only granted if the data is stored in exactly one cache. Data stored in multiple caches is treated as read only. Each cache line can include one or more state bits indicating whether permission is granted to modify data stored at that line. While the exact nature of the states is system dependent, there is typically a "privacy" state bit used to indicate permission to modify. If the privacy bit indicates "private", only one cache holds the data and the associated processor has permission to modify the data. If the privacy bit indicates "public", any number of caches can hold the data, but no processor can modify it.
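The permission rule described above can be sketched, for illustration only, as a cache line that refuses modification unless its privacy state is "private". The names `Privacy` and `CacheLine` and the use of an exception to signal a refused write are assumptions of this sketch; an actual system would typically respond to such a request by seeking permission rather than failing.

```python
from enum import Enum


class Privacy(Enum):
    PUBLIC = 0    # any number of caches may hold the line; read only
    PRIVATE = 1   # exactly one cache holds the line; may be modified


class CacheLine:
    def __init__(self, data, privacy):
        self.data = list(data)
        self.privacy = privacy

    def read(self, offset):
        # Reading is permitted in either state.
        return self.data[offset]

    def write(self, offset, value):
        # Modification is permitted only while the line is held privately.
        if self.privacy is not Privacy.PRIVATE:
            raise PermissionError("line is public; modification not permitted")
        self.data[offset] = value
```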
In a multiprocessor system, for a processor to read or modify data, there must be a way to determine which caches, if any, have copies of the data and whether permission is given for modification of the data. "Snooping" involves examining the contents of multiple caches to make the determination. If the requested data is not found in the local cache, remote caches can be "snooped". Recalls can be issued to request that private data be made public so that another processor can read it, or recalls can be issued to invalidate public data in some caches so that another cache can modify it.
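The two kinds of recall described above can be sketched as follows, for illustration only. The class `SnoopedCache`, the functions `snoop`, `request_read`, and `request_write`, and the representation of each cache as a mapping from line address to a "public" or "private" state are all assumptions of this sketch; data transfer and write-back to main memory are omitted.

```python
class SnoopedCache:
    def __init__(self, name):
        self.name = name
        self.lines = {}   # line address -> "public" or "private"


def snoop(caches, line_address):
    """Return the caches that currently hold a copy of the line."""
    return [c for c in caches if line_address in c.lines]


def request_read(requester, caches, line_address):
    # Recall: a private copy held elsewhere is made public so it can be shared.
    for holder in snoop(caches, line_address):
        if holder is not requester and holder.lines[line_address] == "private":
            holder.lines[line_address] = "public"
    # A copy the requester already holds keeps its current state.
    requester.lines.setdefault(line_address, "public")


def request_write(requester, caches, line_address):
    # Recall: copies held elsewhere are invalidated so one cache holds the line.
    for holder in snoop(caches, line_address):
        if holder is not requester:
            del holder.lines[line_address]
    requester.lines[line_address] = "private"
```

Because `snoop` examines every cache on every request, the sketch also suggests why exhaustive snooping becomes costly as the number of caches grows.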
The communications bandwidth involved in snooping scales more than linearly with the number of caches to be snooped. For large numbers of processors and caches, exhaustive snooping impairs performance. For this reason, some distributed-memory multiprocessor systems snoop within cells and rely on directory-based cache coherency for intercell coherency.
In a distributed-memory system employing directory-based cache coherency, the main memory of each cell associates a directory entry with each line of memory. Each directory entry identifies the cells caching the line and whether the line of data is public or private. Snooping is used to determine which cache within a cell has the data. Thus, each cell contains a directory indicating the location of cached copies of data stored in its main memory.
For example, in an eight-cell system, each directory entry would be nine bits long. For each of the cells, a respective "site" bit indicates whether or not that cell contains a cached copy of the line. The ninth, "privacy", bit indicates whether the data is held privately or publicly. A change of state to "private" is indicated first in the coherency directory for the cell owning (storing in main memory) the data; a change of state to "public" is indicated in the cache first. At other times, for a given line of data, its privacy state as indicated in a cache matches its privacy state as indicated in the coherency directory. To avoid coherency problems, the cache privacy bit is precluded from indicating "private" while the corresponding privacy bit in a coherency directory indicates "public".
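The nine-bit directory entry of the eight-cell example can be sketched, for illustration only, as a small integer. The bit ordering (site bits in the low eight positions, privacy bit in the ninth) and the function names `make_entry`, `cells_caching`, and `is_private` are assumptions of this sketch; the text above does not specify a layout.

```python
NUM_CELLS = 8
PRIVACY_BIT = 1 << NUM_CELLS   # assumed position of the ninth, "privacy", bit


def make_entry(cells, private):
    """Build a nine-bit entry from the cells caching the line."""
    entry = 0
    for cell in cells:
        entry |= 1 << cell     # set the "site" bit for that cell
    if private:
        entry |= PRIVACY_BIT
    return entry


def cells_caching(entry):
    """List the cells whose site bit is set."""
    return [cell for cell in range(NUM_CELLS) if entry & (1 << cell)]


def is_private(entry):
    """Report whether the line is held privately."""
    return bool(entry & PRIVACY_BIT)
```

Under these assumptions, a line cached publicly by cells 0 and 3 is represented by `make_entry([0, 3], False)`, and a line held privately by cell 5 by `make_entry([5], True)`.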
When data is requested from main memory, the associated coherency directory must be examined to determine whether a recall is necessary. Since the recall must be completed after main memory is accessed and before the data request is met, some memory accesses are slower than they would be in a cacheless system. Because the caches reduce the number of main-memory accesses, overall performance is generally improved. However, with the insatiable demand for computing power, further improvements in performance are desired.
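The directory examination described above can be sketched, for illustration only, as a function that decides which cells, if any, must be recalled before a request is met. The function name `recall_targets` and its parameters are assumptions of this sketch, and the decision rule is inferred from the public and private states described earlier rather than taken from a specific protocol.

```python
def recall_targets(site_cells, private, requester, wants_to_modify):
    """Return the cells that must be recalled before the request is met.

    site_cells:      cells whose site bit is set for the requested line
    private:         True if the directory marks the line as private
    requester:       the cell making the request
    wants_to_modify: True for a request to modify, False for a read
    """
    others = sorted(cell for cell in site_cells if cell != requester)
    if private:
        # Another cell may hold modified data; it must give the line up first.
        return others
    if wants_to_modify:
        # Public copies elsewhere must be invalidated before modification.
        return others
    # A read of a public line needs no recall.
    return []
```

An empty result corresponds to the fast case in which main memory can satisfy the request directly; a non-empty result corresponds to the slower case in which a recall must complete first.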