Multiprocessor computers by definition contain multiple processors that typically can execute multiple parts of a computer program, or multiple programs, simultaneously in a manner known as parallel computing. In general, multiprocessor computers execute computer programs faster than conventional single-processor computers, such as personal computers (PCs), that must execute the parts of a program sequentially. The actual performance advantage is a function of a number of factors, including the degree to which parts of a program can be executed in parallel and the architecture of the particular multiprocessor computer at hand.
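The dependence of the performance advantage on the degree of parallelism can be illustrated with Amdahl's law, a standard textbook relation that the text above does not itself name. The sketch below is a hedged illustration only: the function name `amdahl_speedup` and the sample fractions are assumptions chosen for the example, and the model ignores the architectural factors (memory latency, interconnect contention) that the paragraph also mentions.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Ideal speedup when `parallel_fraction` of the run time parallelizes perfectly."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part runs at full cost; the parallel part is split evenly.
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Illustrative fractions only; they are not measurements from any machine.
    for fraction in (0.5, 0.9, 0.99):
        speedups = [round(amdahl_speedup(fraction, n), 2) for n in (2, 8, 32)]
        print(fraction, speedups)
```

Under this idealized model, a program that is only half parallelizable can never run more than twice as fast, however many processors are added.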
Multiprocessor computers may be classified by how they share information among the processors. Shared-memory multiprocessor computers offer a common memory address space that all processors can access. Processes within a program communicate through shared variables in memory which allow them to read or write to the same memory location in the computer. Message passing multiprocessor computers, on the other hand, have a separate memory space for each processor. Processes communicate through messages to each other.
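The two communication styles can be contrasted in a small sketch. It is only an analogy under stated assumptions: Python threads within one process stand in for processors, a dictionary guarded by a lock stands in for a shared memory location, and a `queue.Queue` stands in for a message channel. The function names are hypothetical.

```python
import queue
import threading


def shared_variable_sum(values):
    """Workers communicate by updating one shared memory location."""
    total = {"value": 0}          # the shared variable
    lock = threading.Lock()       # serializes the read-modify-write

    def worker(v):
        with lock:
            total["value"] += v

    threads = [threading.Thread(target=worker, args=(v,)) for v in values]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total["value"]


def message_passing_sum(values):
    """Workers keep private state and communicate only by sending messages."""
    mailbox = queue.Queue()

    def worker(v):
        mailbox.put(v)            # send a message instead of touching shared data

    threads = [threading.Thread(target=worker, args=(v,)) for v in values]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The receiver combines the messages it was sent.
    return sum(mailbox.get() for _ in values)


if __name__ == "__main__":
    print(shared_variable_sum([1, 2, 3, 4]), message_passing_sum([1, 2, 3, 4]))
```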
Multiprocessor computers may also be classified by how the memory is physically organized. In distributed memory computers, the memory is divided into modules physically placed near each processor. This placement provides each processor with faster access time to its local memory. By contrast, in centralized memory computers, the memory is physically located in just one location, generally equally distant in time and space from each of the processors. Both forms of memory organization use high-speed cache memory in conjunction with main memory to reduce execution time.
Multiprocessor computers with distributed shared memory are often organized into nodes with one or more processors per node. Also included in each node are local memory for the processors, a remote cache for caching data obtained from memory in other nodes, and logic for linking the node with other nodes in the computer. A processor in a node communicates directly with the local memory and communicates indirectly with memory in other nodes through the remote cache. For example, if the desired data is in local memory, a processor obtains the data directly from local memory. But if the desired data is stored in memory in another node, the processor must access its remote cache to obtain the data. A cache hit occurs if the data has been obtained recently and is presently stored in the cache. Otherwise a cache miss occurs, and the processor must obtain the desired data from the local memory in another node through the linking logic.
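The read path just described (local memory, then remote cache, then the home node through the linking logic) can be sketched as follows. This is a simplified model, not the design of any particular machine: the class names `Node` and `Machine`, the `home_of` mapping from address to home node, and the returned labels are all assumptions for illustration, and the sketch deliberately omits cache coherence, eviction, and writes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    local_memory: dict = field(default_factory=dict)  # address -> value homed here
    remote_cache: dict = field(default_factory=dict)  # address -> copy of remote data


class Machine:
    def __init__(self, nodes, home_of):
        self.nodes = {n.node_id: n for n in nodes}
        self.home_of = home_of  # hypothetical mapping: address -> home node id

    def read(self, node_id, address):
        """Return (value, path) for a read issued by a processor in `node_id`."""
        node = self.nodes[node_id]
        home = self.home_of(address)
        if home == node_id:
            return node.local_memory[address], "local"
        if address in node.remote_cache:
            return node.remote_cache[address], "remote-cache hit"
        # Miss: fetch from the home node; this lookup stands in for the linking logic.
        value = self.nodes[home].local_memory[address]
        node.remote_cache[address] = value
        return value, "remote-cache miss"


if __name__ == "__main__":
    # Assumed layout: addresses 0-99 live in node 0, addresses 100-199 in node 1.
    machine = Machine([Node(0, {0: "a"}), Node(1, {100: "b"})], lambda a: a // 100)
    print(machine.read(0, 0))
    print(machine.read(0, 100))
    print(machine.read(0, 100))
```

The second remote read is served from the remote cache because the first one filled it.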
An important factor in the design of a multiprocessor computer system is the memory consistency model that is supported by the computer system. The memory consistency model defines the apparent order of the reads and writes (i.e., loads and stores) from all of the processors in the computer system and how that apparent order relates to the order of the reads and writes specified in the program being executed. The strongest consistency model, sequential consistency, requires that all processors see the same order of operations (reads and writes) and that all operations are seen in the order specified in the program. In weaker consistency models, different processors can see different orders, and operations do not have to appear in the order specified in the program. A "fence" operation is provided to the programmer as a means for the programmer to indicate in a program where the program order must be maintained. Many commercial multiprocessor computers have implemented a consistency model called processor consistency. In this model the order of writes from a single processor must be observed in that order by all other processors, but reads may be observed out of order from writes. There are no ordering requirements placed on writes from different processors. The processor consistency model is described in U.S. Pat. No. 5,420,991.
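The difference between sequential consistency and a model in which reads may be observed out of order with respect to writes can be made concrete with a two-processor test. The sketch below is a toy model under stated assumptions: each processor writes one variable and then reads the other, the relaxed case simply lets a read become visible before an earlier write to a different address by the same processor, and all names (`P0`, `P1`, `outcomes`, and so on) are hypothetical. It is not a model of any specific commercial implementation.

```python
from itertools import product

# Each operation is (kind, variable, value-to-write or register-to-fill).
P0 = [("write", "x", 1), ("read", "y", "r0")]
P1 = [("write", "y", 1), ("read", "x", "r1")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each sequence's internal order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one global order against memory initialized to zero."""
    memory = {"x": 0, "y": 0}
    regs = {}
    for kind, var, arg in schedule:
        if kind == "write":
            memory[var] = arg
        else:
            regs[arg] = memory[var]
    return regs["r0"], regs["r1"]


def visible_orders(program, reads_may_pass_writes):
    """Orders in which one processor's two operations may become visible."""
    orders = [program]
    first, second = program
    if (reads_may_pass_writes and first[0] == "write"
            and second[0] == "read" and first[1] != second[1]):
        orders.append([second, first])
    return orders


def outcomes(reads_may_pass_writes):
    """Set of (r0, r1) results reachable under the chosen toy model."""
    results = set()
    for a, b in product(visible_orders(P0, reads_may_pass_writes),
                        visible_orders(P1, reads_may_pass_writes)):
        for schedule in interleavings(a, b):
            results.add(run(schedule))
    return results


if __name__ == "__main__":
    print(sorted(outcomes(False)))
    print(sorted(outcomes(True)))
```

In this model the result in which both reads return 0 is unreachable when program order is preserved and becomes reachable once reads may pass earlier writes; a fence between the write and the read of each processor would be the programmer's way to rule it out again.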
It is generally recognized in the literature that weaker consistency models can lead to higher performance. See, for example, "Performance Evaluation of Memory Consistency Models for Shared-Memory Multiprocessors" by Gharachorloo et al., Proc. Fourth Int. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 245-257, April 1991. However, many programs have been written assuming the stronger processor consistency model is followed, and therefore maintaining compatibility with these programs is extremely important in the design of multiprocessor computers.
Most commercial multiprocessor computer systems use a set of signal conductors known as a common bus to electrically interconnect the various parts of the computer including the processors, caches, I/O devices and memory modules. In a bus-based multiprocessor computer, operations taken by each processor are necessarily seen by the other processors in the order in which they are emitted. If a processor emits reads and writes in an order that maintains processor consistency, then the bus ensures that all other processors see the reads and writes in that order. Examples of bus-based interconnection architectures for multiprocessors are Sequent Symmetry computers from Sequent Computer Systems, Inc. of Beaverton, Oregon (described by Lovett et al. in Proceedings '88 Int'l Conference on Parallel Processing, Vol. 1, Penn State University Press, 1988, pp. 303 et seq.) and the Futurebus+ standard (IEEE 896.1-1990).
Bus-based architectures, however, suffer from performance limitations inherent in their designs. The bus has a maximum bandwidth for transferring data between processors. As processor frequencies have increased, each processor requires more bandwidth to run at maximum speed. Given a maximum bus bandwidth, fewer processors can run at the faster speeds. On the other hand, to increase the bus bandwidth, the physical length of the bus must be shortened to reduce electrical propagation delay from one end of the bus to the other. But a shorter bus length reduces the space available for connecting processors and thus the number of processors that can be connected to the bus. Typical bus-based multiprocessor architectures have an upper limit of 10-30 processors.
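The trade-off between per-processor bandwidth demand and processor count on a fixed-bandwidth bus reduces to a simple division. The sketch below uses purely illustrative figures: the 800 MB/s bus and the 40 and 80 MB/s per-processor demands are assumptions, not values from any real system, and `max_processors` is a hypothetical helper.

```python
def max_processors(bus_bandwidth_mb_s: float, per_processor_mb_s: float) -> int:
    """Number of processors a bus can feed if each needs a fixed share of bandwidth."""
    if bus_bandwidth_mb_s < 0 or per_processor_mb_s <= 0:
        raise ValueError("bandwidths must be positive")
    return int(bus_bandwidth_mb_s // per_processor_mb_s)


if __name__ == "__main__":
    # Doubling each processor's demand halves how many the same bus can serve.
    print(max_processors(800, 40))
    print(max_processors(800, 80))
```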
One proposed solution to this problem of limited data throughput is to use a point-to-point network to interconnect multiple processor nodes. Each node includes within it a bus limited to a length and number of processors that allow the processors to perform at the maximum processor frequency. Data communication between the nodes of the computer is handled through an interface, or protocol, such as the Scalable Coherent Interface (SCI) set forth in IEEE Standard 1596. The interface between network nodes may be realized with a number of topologies such as rings, meshes, multistage networks and crossbars. The current SCI standard supports up to 65,520 nodes and specifies the supported interface to run at 500 MHz over 16 parallel lines, yielding a raw data throughput of one gigabyte/second. Furthermore, an SCI-based interconnection network can send a symbol stream from one point to another within the network without having to wait for the signals to propagate through the entire interface. In this way, a sequence of symbols may simultaneously reside on the transmission medium. Thus an SCI-based network can provide the data throughput needed to meet the demands of advanced processors now being designed into multiprocessor computers.
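The raw throughput figure follows directly from the clock rate and link width quoted above. The short calculation below assumes one bit per line per clock and a decimal gigabyte of 10^9 bytes; the constant and function names are chosen for this example.

```python
CLOCK_HZ = 500_000_000   # 500 MHz, as stated above
PARALLEL_LINES = 16      # 16 parallel lines, as stated above
BITS_PER_BYTE = 8


def raw_throughput_bytes_per_second(clock_hz: int, lines: int) -> int:
    """Bytes per second, assuming each line carries one bit per clock."""
    return clock_hz * lines // BITS_PER_BYTE


if __name__ == "__main__":
    print(raw_throughput_bytes_per_second(CLOCK_HZ, PARALLEL_LINES))
```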
While solving the data throughput problem, the point-to-point approach creates another problem: it provides no guarantee that operations will be performed in the order they are emitted. Even if a processor emits operations in an order that maintains processor consistency, some networks do not guarantee that the operations reach all destinations in the order they were emitted. SCI communication, for example, is through message packets that contain, among other things, the address of the source and destination nodes and the data to be communicated. These messages may traverse various nodes in their journey from the source to destination nodes, and may arrive at the destination in an order different from that in which they were sent. For example, processor A in one node may send an invalidate message to processor C in another node to alert processor C that certain data the processors share in their respective caches is no longer valid in the cache of processor C. Processor A may send this message because it is changing its copy of the data through a write operation. When processor A is finished changing the data, it then sends a second message indicating that the data in its node is now available to other processors. For the messages to work correctly, processor C must receive the first message before it receives the second message. Otherwise, upon reading the second message, processor C assumes that the data presently in its cache is still valid (that no change has occurred) and may access that invalid data rather than the changed data in the cache of processor A.
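The hazard in the processor A and processor C example can be reproduced with a very small simulation. It is a sketch under stated assumptions: the message names "invalidate" and "available", the function `processor_c_reads`, and the stale and fresh values 0 and 1 are hypothetical labels, and processor C is assumed to read the data as soon as it sees the second message.

```python
def processor_c_reads(messages, stale=0, fresh=1):
    """Return the value processor C reads when the 'available' message arrives."""
    cache_valid = True  # C starts with a cached copy holding the old (stale) value
    for message in messages:
        if message == "invalidate":
            cache_valid = False
        elif message == "available":
            # C reads now: from its own cache if that copy is still marked valid,
            # otherwise from processor A's node, which holds the changed data.
            return stale if cache_valid else fresh
    raise ValueError("no 'available' message was delivered")


if __name__ == "__main__":
    print(processor_c_reads(["invalidate", "available"]))  # delivered in order
    print(processor_c_reads(["available", "invalidate"]))  # reordered in transit
```

When the two messages are swapped in transit, processor C reads its stale copy even though processor A emitted the messages in the correct order.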
A simple approach for maintaining processor consistency in an unordered networked multiprocessor computer is to delay completion of a write operation until all other processors that share the data acknowledge that they have invalidated their copies. But this approach stalls further program execution until all of the acknowledgments are received, creating delays that significantly degrade overall processor performance.
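The cost of this simple approach can be sketched by tracking outstanding acknowledgments for one write. The model is illustrative only: the class `PendingWrite`, the function `cycles_stalled`, the address `0x40`, and the arrival cycles are all assumptions, and time is counted in abstract cycles rather than in units of any real machine.

```python
class PendingWrite:
    """A write that may not complete until every sharer has acknowledged."""

    def __init__(self, address, sharers):
        self.address = address
        self.awaiting = set(sharers)  # nodes that have not yet acknowledged

    def acknowledge(self, node_id):
        self.awaiting.discard(node_id)

    @property
    def complete(self):
        return not self.awaiting


def cycles_stalled(ack_arrival_cycle):
    """Cycles the writer waits, given {sharer node: cycle its ack arrives}."""
    pending = PendingWrite(0x40, ack_arrival_cycle)  # address is arbitrary here
    cycle = 0
    while not pending.complete:
        cycle += 1
        for node, arrives in ack_arrival_cycle.items():
            if arrives <= cycle:
                pending.acknowledge(node)
    return cycle


if __name__ == "__main__":
    # Hypothetical arrival times: the writer is held up by the slowest sharer.
    print(cycles_stalled({1: 3, 2: 7, 3: 5}))
```

In this model the stall always equals the arrival time of the slowest acknowledgment, which is why the delay grows with network distance and with the number of sharers.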
A higher-performance approach is to allow the processor to proceed, when possible, with another write operation before all acknowledgments are received, thereby avoiding the stalls described above. Some mechanism must then ensure that write operations are received by processors in other nodes in the order in which the processor issued them. However, no commercially viable approach that ensures this result is presently known to exist.
An objective of the invention, therefore, is to provide a method and apparatus for maintaining a desired memory consistency in an unordered networked multiprocessor computer system. Another objective of the invention is to allow operations following a write operation to complete as soon as possible while maintaining memory consistency. Still another objective is to maintain such consistency while allowing write operations to complete before all other processors that share the data have acknowledged that they have invalidated their copies.