The invention relates generally to distributed, shared-memory multiprocessor systems.
Many multiprocessor systems employ a shared bus to connect multiple processors, memory modules, and I/O devices. Such multiprocessors are usually called symmetric multiprocessors (SMPs) since the latency for any processor accessing any portion of memory is uniform. An SMP system usually equips each processor with a cache and provides a snoopy cache coherence protocol to reduce the traffic on the bus. It has been shown that the shared bus is a cost-effective interconnect for attaching more processors, owing to the simplicity of the SMP architecture. See, for example, W. K. Dawson et al., "A Framework for Computer Design," IEEE Spectrum, pp. 49-54, October 1986; and D. B. Gustavson, "Computer Buses--A Tutorial," IEEE Micro, pp. 7-22, August 1984.
Although the shared bus architecture is simple and effective, the system's performance does not scale well. As the number of processors increases beyond a certain point, the shared bus, due to its limited bandwidth, becomes the major performance bottleneck. Furthermore, given the pace at which processor technology is progressing, it will be even more difficult in the future for shared bus architectures to provide adequate bandwidth in a multiprocessor system.
Lately, scalable networks have been proposed as an interconnect for multiprocessor systems. Scalable networks, such as rings, meshes, and trees, provide a multiprocessor system with higher bandwidth as the number of processors increases. With scalable networks, a large-scale parallel machine can be built from a number of nodes. Each node can be either a single-processor system or an SMP system.
FIG. 1 shows an example of a scalable multiprocessor system based on the ring network. In general, each node 10 on the ring network 12 includes a CPU 14 with cache memory 16, local memory 18, and a local bus 20, over which CPU 14 can access local memory 18. The node is connected to the ring network 12 through an interface module 22.
For multiprocessors based on scalable networks, it is important to be able to run a wide variety of applications without excessive programming difficulty. A single address space greatly aids in the programmability of multiprocessors by reducing the problems of data partitioning and dynamic load distribution, two of the more difficult problems in programming parallel machines. The shared address space also provides better support for parallelizing compilers, standard operating systems, multiprogramming, and incremental tuning of parallel machines. For further details see D. Lenoski et al., "The DASH Prototype: Logic Overhead and Performance," IEEE Transactions on Parallel and Distributed Systems, pp. 41-61, January 1993.
In multiprocessor systems, it is feasible to physically partition the shared memory into several portions and distribute those portions among the nodes. These shared memory portions can be accessed by all the nodes in the system as described by D. Lenoski et al. in the above-identified reference and as further described by D. Kuck et al., in "Parallel Supercomputing Today and the Cedar Approach," Science, vol. 231, pp. 967-974, February 1986 and by G. Pfister, et al., in "The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture," International Conference on Parallel Processing, pp. 764-771, August 1985. Such systems are called distributed shared memory multiprocessors. In the following, the terms shared memory and global memory are used interchangeably. A cache coherence protocol can be included in distributed shared memory multiprocessors in order to improve the performance of shared memory accesses (see D. Lenoski, above).
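By way of illustration only, the partitioning of the shared memory among the nodes can be sketched as a mapping from a global address to the node holding that address. The sketch assumes a simple layout in which each node holds one contiguous, equal-sized portion; this layout, the function name `home_of`, and the parameters `num_nodes` and `portion_size` are hypothetical and are not taken from the referenced systems, which may distribute memory differently.

```python
from typing import Tuple


def home_of(global_addr: int, num_nodes: int, portion_size: int) -> Tuple[int, int]:
    """Return (node_id, local_offset) for a global memory address.

    Assumed layout (hypothetical): node i holds the contiguous range
    [i * portion_size, (i + 1) * portion_size) of the global memory.
    """
    if not 0 <= global_addr < num_nodes * portion_size:
        raise ValueError("address lies outside the global memory")
    # The quotient selects the node; the remainder is the offset
    # within that node's portion.
    return divmod(global_addr, portion_size)


# Example: four nodes, each contributing a 4 KB portion.
node_id, offset = home_of(0x2400, num_nodes=4, portion_size=0x1000)
```

Under these assumptions, any node can compute which node holds a given address, which is the property that allows every portion to be accessed by all the nodes in the system.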
Although many distributed shared memory multiprocessor systems can use off-the-shelf microprocessors directly, the interconnect is typically proprietary and cannot connect existing workstations (or personal computers) in a "pay-as-you-go" fashion. Alternatively, multiprocessor clustering has been proposed as an approach to constructing scalable multiprocessors efficiently. With multiprocessor clustering, commodity workstations or personal computers can be grouped together to form a distributed shared memory machine.
As noted above, the performance of multiprocessor systems with a shared bus is limited due to the bottleneck of the bus. Hence, to provide higher bandwidth in such systems, hierarchical buses have been proposed in many machines. See, for example, the above-referenced article by D. Kuck and also U.S. Pat. No. 5,237,673. FIG. 2 shows a representative shared memory multiprocessor system with hierarchical buses. In such systems, each local bus 30 connects its associated processors 32 and local memory 34. Separate shared memory 36 resides on a global bus 38, which can be accessed by the processors 32 on any local bus 30. Although hierarchical buses help improve the bandwidth of shared memory multiprocessors, they still do not scale very well since the global bus causes a performance bottleneck.
Many shared memory multiprocessors based on scalable networks have been proposed or implemented. In addition to the above-referenced articles by D. Lenoski and G. Pfister, also refer to A. Gottlieb et al., "The NYU Ultracomputer--Designing an MIMD Shared Memory Parallel Computer," IEEE Transactions on Computers, pp. 175-189, February 1983; KSR, "KSR-1 Overview," Internal Report, Kendall Square Research Corporation, 1991; and U.S. Pat. No. 5,297,265. Some of these machines include a cache coherence protocol to improve their performance on accessing shared memory. However, these machine architectures are not open enough in the sense that their interconnects are proprietary designs, although they may use off-the-shelf processors. With proprietary architectures, the system cost is usually high due to the limited sales volume. Alternatively, multiprocessor clustering has been proposed to overcome this problem by connecting a group of existing workstations. In a clustered shared memory multiprocessor system, each cluster node has one or more processors, and the global shared memory is partitioned and distributed among the nodes. With multiprocessor clustering, systems can be expanded in a "pay-as-you-go" fashion.
FIG. 3 shows the configuration of a typical cluster node. In such a cluster node, there can be more than one processor 42 (each with its local cache memory) on a shared bus 40. There exist two types of memory on the bus: private memory 50 and global memory 52. The private memory 50 can only be accessed by the processors 42 in the local cluster node, while the global memory 52 can be accessed by any processor in the whole system. A memory control unit 43 controls accesses to private memory 50. A cluster cache and directory 54, which is usually associated with each node, contains the line copies from remote clusters and keeps track of the states of the memory lines and the cache lines in the associated node. A cluster interface 56 is a controller which is in charge of translating transactions between the shared bus 40 and the intercluster network. It also maintains cache coherence over the intercluster network. A routing switch 58 is used to transmit and receive transaction packets to and from other cluster nodes.
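The access rule for the two types of memory in FIG. 3 can be illustrated with a minimal sketch. The names `Region`, `Access`, `is_permitted`, and `is_remote` are hypothetical and chosen only for illustration; the sketch models the visibility rule stated above and does not model the cluster cache and directory 54 or the coherence protocol.

```python
from dataclasses import dataclass
from enum import Enum


class Region(Enum):
    PRIVATE = "private"  # corresponds to private memory 50
    GLOBAL = "global"    # corresponds to global memory 52


@dataclass(frozen=True)
class Access:
    requester_node: int  # cluster node whose processor issues the access
    target_node: int     # cluster node whose memory is addressed
    region: Region       # type of memory being addressed


def is_permitted(access: Access) -> bool:
    """Global memory is visible to every processor in the system;
    private memory only to processors in the same cluster node."""
    if access.region is Region.GLOBAL:
        return True
    return access.requester_node == access.target_node


def is_remote(access: Access) -> bool:
    """True when the addressed memory resides in another cluster node,
    so the access cannot be satisfied by local memory alone."""
    return access.requester_node != access.target_node
```

In this simplified model, a permitted access for which `is_remote` returns true is one that would involve the cluster interface 56 and routing switch 58, unless a copy of the line is already held in the local cluster cache.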
The major problem with the architecture in FIG. 3 is that the additional memory required for the global memory in the system increases the cost of clustering. A separate global memory in the cluster node makes memory utilization inefficient. As described below, a solution to this problem is to borrow a portion of the existing memory in the cluster node for the global memory.