One of the problems associated with increasing performance in multiprocessor parallel processing systems is the efficient accessing of data or instructions from memory. Having adequate memory bandwidth for sharing of data between processors is another problem associated with parallel processing systems. These problems are related to the organization of the processors and memory modules and the processor architecture used for communication between a processor and memory and between processors. Various approaches to solving these problems have been attempted in the past, for example, array processors and shared memory processors.
Multiprocessor systems can be classified generally in terms of coupling strength for communication between processors. Multiprocessor systems in which the processors communicate through a shared memory facility, reached over an interconnection network, are generally considered tightly coupled. Loosely coupled multiprocessor systems generally use an input/output (I/O) communication mechanism in each processor, such as message passing, for communicating between the processors over an interconnection network. A wide variety of interconnection networks have been utilized in multiprocessing systems. For example, ring, bus, crossbar, tree, shuffle, omega, butterfly, mesh, hypercube, and ManArray networks have been used in prior multiprocessor systems. From an application or user perspective, specific networks have been chosen primarily based upon tradeoffs between performance characteristics and implementation cost.
A network for an application of a multiprocessor system is evaluated based on a number of characteristics. Parameters considered include, for example, a network size of N nodes, where each node has L connection links including input and output paths, a diameter D, which is the maximum shortest path between any pair of nodes, and an indication of the cost C in terms of the number of connection paths in the network. A ring network, for example, provides connections between adjacent processors in a linear organization with L=2, D=N/2, and C=N. In another example, a crossbar switch network provides complete connectivity among the nodes with L=N, D=1, and C=N^2. Table 1 lists these characteristics for a number of networks where N is a power of 2.
TABLE 1
Network of N nodes (N a power of 2)           Links (L)   Diameter (D)   Cost (C)
Ring                                          2           N/2            N
B×B torus for N = 2^K, K even, B = 2^(K/2)    4           B = 2^(K/2)    2N
XD hypercube for X = log2 N                   log2 N      log2 N         (X/2)N
XD ManArray hypercube for X = 2k, X even      4           2              2^(2k−1)((4 + 3^(k−1)) − 1)
Crossbar                                      N           1              N^2
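As an illustration only, the closed-form entries for the ring, torus, hypercube, and crossbar networks can be evaluated for a specific N. The following Python sketch is not part of the prior art described here; the function name network_characteristics and the dictionary layout are hypothetical, and the formulas are taken directly from the parameters stated above.

```python
def network_characteristics(n):
    """Return {network: (links L, diameter D, cost C)} for n nodes.

    Hypothetical helper: it only evaluates the closed-form entries
    quoted in the text for ring, torus, hypercube, and crossbar.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of 2")
    x = n.bit_length() - 1              # x = log2(n), exact for powers of 2
    table = {
        "ring": (2, n // 2, n),
        "hypercube": (x, x, (x * n) // 2),
        "crossbar": (n, 1, n * n),
    }
    if x % 2 == 0:                      # the B x B torus entry requires K even
        b = 2 ** (x // 2)               # B = 2^(K/2)
        table["torus"] = (4, b, 2 * n)
    return table


if __name__ == "__main__":
    for name, (links, diam, cost) in network_characteristics(16).items():
        print(f"{name:10s} L={links:<3d} D={diam:<3d} C={cost}")
```

For N = 16 the torus entry evaluates to L=4, D=4, and C=32, and the crossbar entry to L=16, D=1, and C=256.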
FIG. 1A illustrates a prior art 4×4 torus network 100 having sixteen processor (P) elements (PEs). Each PE supports four links in the regular nearest neighbor connection pattern shown. The diameter is four, which is the maximum shortest path between any two nodes, such as, for example, P00 and P22. The cost is thirty-two, representing the thirty-two connections used to interconnect the PEs.
FIG. 1B illustrates a connectivity matrix 150 for the 4×4 torus network 100 of FIG. 1A. Each of the sixteen PEs represents a column and a row of the matrix. A “1” in a cell of the connectivity matrix 150 indicates that the row PE connects to the column PE. For example, four “1”s populate P21 row 154, indicating that P21 connects to P11, P20, P22, and P31. The connectivity matrix 150 is populated only with the nearest neighbor connections.
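The connectivity matrix and the torus characteristics can be reproduced with a short Python sketch, offered here as an illustration under stated assumptions rather than as the method of the figures. It assumes that PEs are numbered in row-major order, so that Prc sits at row r and column c, and that each bidirectional connection is counted once toward the cost. All function names are hypothetical.

```python
from collections import deque


def torus_connectivity(b):
    """Build the (b*b) x (b*b) connectivity matrix of a b x b torus.

    Assumption: PE Prc is numbered r*b + c (row-major order).
    """
    n = b * b
    m = [[0] * n for _ in range(n)]
    for r in range(b):
        for c in range(b):
            i = r * b + c
            # Nearest neighbors: north, south, west, east, with wraparound.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                j = ((r + dr) % b) * b + (c + dc) % b
                m[i][j] = 1
    return m


def neighbors(m, r, c, b):
    """Names of the PEs marked with a 1 in the row for Prc."""
    return [f"P{j // b}{j % b}" for j, v in enumerate(m[r * b + c]) if v]


def diameter(m):
    """Maximum shortest path over all node pairs, by breadth-first search."""
    n = len(m)
    worst = 0
    for s in range(n):
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if m[u][v] and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst


def cost(m):
    """Number of connections, counting each bidirectional link once."""
    return sum(map(sum, m)) // 2


if __name__ == "__main__":
    m = torus_connectivity(4)
    print(neighbors(m, 2, 1, 4), diameter(m), cost(m))
```

Under these assumptions, the row for P21 contains exactly four ones, for P11, P20, P22, and P31, and the computed diameter and cost are four and thirty-two, in agreement with the values given for the 4×4 torus network 100.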
FIG. 2 illustrates a prior art 4×4 ManArray network 200, as described in U.S. Pat. No. 6,167,502. The 4×4 ManArray network 200 has sixteen processors, such as processor 1,3 (0110) 204. Each processor is connected to a local cluster switch, such as local cluster switch 208, which is associated with a 2×2 processor cluster, such as 2×2 processor cluster 212. Each cluster switch contains a number of multiplexers that are connected to the processors to provide the interconnecting network for the sixteen processors. For example, each of the four processors in the 2×2 processor cluster 212 connects to four multiplexers in the associated local cluster switch 208. The 4×4 ManArray network 200 has a cost C of 88 and a diameter of 2.
FIG. 3 illustrates a prior art shared memory processor 300 having processor nodes P0-Pp−1 304, memory nodes M0-Mm−1 306, and input/output (I/O) nodes I/O0-I/Od−1 308 interconnected by a crossbar switch 310. The crossbar switch 310 provides general data accessing between the processors, memory, and I/O. The processors typically interface to memory over a memory hierarchy, which typically locates instruction and data caches local to the processors. The memories M0-Mm−1 typically represent higher levels of the memory hierarchy above the local caches.
The prior techniques of interconnecting memory and processors have to contend with multiple levels of communication mechanisms and complex organizations of control and networks.