Multiprocessor computers by definition contain multiple processors that can execute multiple parts of a computer program or multiple programs simultaneously. In general, multiprocessor computers execute programs faster than conventional single-processor computers, such as personal computers (PCs), that execute the parts of a program sequentially. The actual performance advantage depends on a number of factors, including the degree to which parts of a program can be executed in parallel and the architecture of the particular multiprocessor computer at hand.
Multiprocessor computers may be classified by how they share information among the processors. Shared-memory multiprocessor computers offer a common memory address space that all processors can access. Processes within a program communicate through shared variables in memory that allow them to read or write to the same memory location in the computer. Message passing multiprocessor computers, on the other hand, have a separate memory space for each processor. Processes communicate through messages to each other.
Multiprocessor computers may also be classified by how the memory is physically organized. In distributed memory computers, the memory is divided into modules physically placed near each processor. This placement provides each processor with faster access time to its local memory. By contrast, in centralized memory computers, the memory is physically located in just one location, generally equally distant in time and space from each of the processors. Both forms of memory organization use high-speed cache memory in conjunction with main memory to reduce execution time.
Multiprocessor computers with distributed shared memory are often organized into nodes with one or more processors per node. Also included in the node is local memory for the processors, a remote cache for caching lines obtained from memory in other nodes, and logic for linking the node with other nodes in the computer. A processor in a node communicates directly with the local memory and through the remote cache to obtain data. For example, if the desired data is in the local memory, a processor obtains the data directly from local memory. But if the desired data is stored in memory in another node, the processor must access its remote cache to obtain the data. A cache hit occurs if the data has been obtained recently and is presently stored in the cache. Otherwise a cache miss occurs, and the processor must obtain the desired data from the local memory in another node through the linking logic. Accessing local memory is faster than obtaining data located in another node. Consequently, such distributed shared memory systems are termed Non-Uniform Memory Access systems (NUMA).
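The read path described above can be illustrated with a minimal sketch. The class `Node`, the function `read`, and the three relative cost constants are hypothetical names and values introduced only for illustration; they are not taken from any particular system.

```python
from dataclasses import dataclass, field

# Hypothetical relative access costs (assumed for illustration only).
LOCAL_COST = 1
REMOTE_CACHE_COST = 2
REMOTE_NODE_COST = 10


@dataclass
class Node:
    node_id: int
    # Lines whose home is this node: address -> data.
    local_memory: dict = field(default_factory=dict)
    # Lines obtained from memory in other nodes: address -> data.
    remote_cache: dict = field(default_factory=dict)


def read(node, address, nodes):
    """Return (data, cost) for a read issued by a processor in `node`."""
    # Data homed in this node is read directly from local memory.
    if address in node.local_memory:
        return node.local_memory[address], LOCAL_COST
    # Cache hit: the line was fetched recently and is still cached.
    if address in node.remote_cache:
        return node.remote_cache[address], REMOTE_CACHE_COST
    # Cache miss: fetch from the local memory of another node
    # (standing in for the linking logic) and cache the line.
    for other in nodes:
        if other is not node and address in other.local_memory:
            data = other.local_memory[address]
            node.remote_cache[address] = data
            return data, REMOTE_NODE_COST
    raise KeyError(address)
```

The differing costs returned for the same address on a first and a second read reflect the non-uniform access times that give NUMA systems their name.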
Data coherency is maintained among the multiple caches and memories of a multiprocessor computer through a cache coherency protocol, such as the protocol described in the Scalable Coherent Interface (SCI) (IEEE 1596). Multinode computer systems require a local cache directory (also called a directory) that holds information relating to whether other nodes have copies of lines stored in cache. There are several types of directories used in different multinode systems, including full-mapped and chained directories. In a full-mapped directory, information about all lines in all caches resides in a directory. There are two main approaches for implementing the full-mapped directory. Either a central directory contains duplicates of all cache directories, or a bit vector called the present flag vector is associated with each cache line. With the central directory, all cache directories have to be searched for each memory access. With the bit vector, each node of the system is associated with one bit of the vector for a given cache line. If a bit is set, the node corresponding to this bit has a copy of the line.
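The bit-vector form of the full-mapped directory can be sketched as follows. The class `PresentFlagVector` and its method names are hypothetical, chosen only to illustrate the one-bit-per-node idea; an actual implementation would hold the vector in hardware alongside each line.

```python
class PresentFlagVector:
    """One bit per node for a single cache line (illustrative sketch)."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.bits = 0  # bit i set means node i holds a copy of the line

    def _check(self, node_id):
        if not 0 <= node_id < self.num_nodes:
            raise ValueError(f"node id {node_id} out of range")

    def add_sharer(self, node_id):
        self._check(node_id)
        self.bits |= 1 << node_id

    def remove_sharer(self, node_id):
        self._check(node_id)
        self.bits &= ~(1 << node_id)

    def has_copy(self, node_id):
        self._check(node_id)
        return bool((self.bits >> node_id) & 1)

    def sharers(self):
        # Every node whose bit is set has a copy of the line.
        return [n for n in range(self.num_nodes) if self.has_copy(n)]
```

Under this scheme, finding every node that holds a line requires only a scan of one vector, at the cost of one bit per node for every line tracked.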
In the chained directory, instead of having a centralized directory containing information about all cached lines, the directory information is distributed over the system as a linked list. Linked lists are commonly used to maintain cache coherency by identifying each node that contains a copy of the cache line of interest. Thus, the directory entry for a cache line is a common reference point containing the state and a pointer to the head of a sharing list. Likewise, each node on the sharing list contains a pointer field used for maintaining the list. This pointer field either holds a reference to the next cache that has a copy of the line or contains a link terminator indicating that this node is the tail of the list for the line in question. Because the sharing list is distributed across the nodes, this approach reduces the storage space needed for the directory. SCI is an example of a cache protocol that uses a chained directory.
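A chained directory of this kind can be sketched as a singly linked sharing list. The names `ChainedDirectory`, `head`, `next_ptr`, and `TERMINATOR` are hypothetical, and the sketch assumes that a new sharer is linked in at the head of the list; it is a simplified illustration, not the SCI protocol itself.

```python
TERMINATOR = None  # link terminator marking the tail of a sharing list


class ChainedDirectory:
    def __init__(self):
        # Directory entry at the home node: line -> node at the list head.
        self.head = {}
        # Pointer field kept by each sharing node:
        # (node id, line) -> next node id, or TERMINATOR at the tail.
        self.next_ptr = {}

    def add_sharer(self, line, node_id):
        if (node_id, line) in self.next_ptr:
            return  # already on the list; avoid creating a cycle
        # Assumption of this sketch: the new sharer becomes the head
        # and points at the previous head.
        self.next_ptr[(node_id, line)] = self.head.get(line, TERMINATOR)
        self.head[line] = node_id

    def sharers(self, line):
        # Walk the list from the head until the link terminator.
        result = []
        node = self.head.get(line, TERMINATOR)
        while node is not TERMINATOR:
            result.append(node)
            node = self.next_ptr[(node, line)]
        return result
```

Here the home directory stores only one pointer per line, while the remaining list state lives in the sharing nodes; identifying all sharers requires traversing the list node by node.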
Regardless of the type of directory used, a remote node may have a “capacity miss.” A capacity miss occurs when the remote node's cache is full and cannot store more lines. To make additional room in the cache, a cache line may be overwritten (also called a “roll out” or “eviction”). When rolling out a line, the remote node informs the home node of the rollout so that the local directory on the home node can be updated. If the remote node contains the only copy of the line, then the remote node transfers the line to the home node so that it is stored in main memory (often called a “write back”). Some Symmetric Multiprocessor (SMP) systems have silent rollouts of lines. With silent rollouts, the remote node rolls out cache lines that are shared without reporting the rollout to the home node. Silent rollouts are possible in SMP systems because all nodes share a common bus and, as a result, modified data responses can be “snarfed,” meaning that the memory also reads a copy of the cache line and writes it back to main memory.
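The reported (non-silent) rollout described above can be sketched as follows. The classes `HomeNode` and `RemoteCache` are hypothetical. The sketch assumes that the oldest cached line is chosen as the victim and treats "the only copy" as "the only node recorded in the home directory for that line"; both are simplifications made for illustration.

```python
class HomeNode:
    def __init__(self):
        self.main_memory = {}  # line -> data
        self.directory = {}    # line -> set of remote node ids with a copy


class RemoteCache:
    def __init__(self, node_id, capacity):
        self.node_id = node_id
        self.capacity = capacity
        self.lines = {}  # line -> data, in insertion order (oldest first)

    def insert(self, line, data, home):
        # Capacity miss: the cache is full, so roll out a line first.
        if line not in self.lines and len(self.lines) >= self.capacity:
            victim = next(iter(self.lines))  # assumed policy: oldest line
            self.roll_out(victim, home)
        self.lines[line] = data
        home.directory.setdefault(line, set()).add(self.node_id)

    def roll_out(self, line, home):
        data = self.lines.pop(line)
        sharers = home.directory.get(line, set())
        # Write back: this node holds the only copy, so the data is
        # transferred to the home node and stored in main memory.
        if sharers == {self.node_id}:
            home.main_memory[line] = data
        # Inform the home node so that its directory can be updated.
        sharers.discard(self.node_id)
```

A silent rollout would correspond to omitting the final directory update, which is why the home directory in such systems can list sharers that no longer hold the line.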
In order to increase the speed and reduce costs of multinode systems, it is desirable to simplify existing architectures and the protocol used for inter-node communication. In particular, for applications that only require a two-node system, the hardware and software can be simplified while maintaining a high degree of performance. However, it is not readily apparent what modifications can be made to multinode systems to reduce costs.
An object of the invention, therefore, is to reduce costs in a multinode computer system by reducing the complexity of the protocol and/or hardware used to communicate between the nodes. Another object of the invention is to ensure that the multinode computer system maintains forward progress of requests for lines.