In parallel computers, a plurality of processors are interconnected in a manner which is adapted for performing a particular computing task. In such parallel computers, and in other systems, it is desirable to be able to quickly switch or change the interconnection configuration. A particular other system might be a telephone system. The telephone system is an example of a blocking network, in which any two entities or terminals may be interconnected, but not all possible interconnections can be made simultaneously.
The practical difficulties of constructing interconnection networks tend to become dominant as the number of devices to be interconnected increases. While a custom-made interconnection can be fabricated for each separate configuration in the context of a parallel processing computer, a separate, custom-made back plane is required for each different configuration. In a telephone context, a new system of telephone lines would be required for each different pair of parties who wished to speak together. Many types of interconnection schemes have been proposed for rapid reconfiguration of the network extending between communication devices. While most of these schemes are effective for interconnection of small numbers of communication devices, some of the schemes grow in size and complexity more rapidly than others as the number of devices increases.
Various types of networks, such as common bus, tree, multistage switch network, ring, N-dimensional hypercube, and 2-dimensional mesh, have been proposed for interconnecting communication devices. Crossbar switched systems allow complete interconnectability, but practical considerations limit the number of interconnections in a complete crossbar switch.
FIG. 1a illustrates a common bus interconnection arrangement, and FIG. 1b represents a star network, its topological equivalent. These systems are not switched. A common bus 8a and a common point 8b of FIGS. 1a and 1b, respectively, allow information to be transmitted by a plurality of devices or entities 10a, 10b, 10c . . . 10n, each of which is connected to bus 8 by a data link 9, and received by others of the entities. Each communication device or entity 10 is identified by the letter P, representing a processor. Bus 8a or point 8b must carry all the information, and it can be active 100% of the time. Each device 10 connected to the bus must share the available bus time. Thus, a device can transmit onto the bus only if no other device is transmitting, which requires time-multiplexing. The common-bus arrangement is not satisfactory for systems or processors which must be expandable, or which must have greater capability by virtue of additional devices, because it fails to increase in capability as it is enlarged. An increase of system size by addition of further communication devices 10 reduces the per-device share of the available time. In general, this may be viewed as reducing the bandwidth available to each device. It should be noted that processors 10 of FIGS. 1a and 1b are connected to bus 8 by bidirectional links, and that the connection of a link 9 to a processor 10a or 10b is at a port (not separately designated) which is an input port when the processor receives data and an output port when the processor transmits data. Thus, the nature of the port (input or output) depends on the direction of signal flow.
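The reduction in per-device share described above can be sketched numerically. The following is a minimal illustration only, assuming ideal time-multiplexing with equal shares and no arbitration overhead; the function name per_device_share and the bandwidth figure of 100 units are hypothetical and do not appear in the arrangement of FIGS. 1a and 1b.

```python
def per_device_share(bus_bandwidth: float, num_devices: int) -> float:
    """Average bandwidth available to each device on a shared bus.

    Assumes ideal, equal time-multiplexing with no arbitration overhead
    (an illustrative assumption, not a property stated for bus 8a).
    """
    if num_devices < 1:
        raise ValueError("at least one device is required")
    return bus_bandwidth / num_devices


if __name__ == "__main__":
    # The total bus bandwidth stays fixed while the device count grows,
    # so the share of each device shrinks in inverse proportion.
    for n in (2, 8, 64):
        print(n, per_device_share(100.0, n))
```

Under these assumptions the total capability of the bus is constant, so each added device only divides the same bandwidth more finely.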
FIG. 2 represents a tree network, as described, for example, in U.S. Pat. No. 5,123,011, issued Jun. 16, 1992 in the name of Hein et al. Such a system is not switched to change configurations. As described therein, a total of 63 out of 64 processing units (P) 10 are arrayed in tree fashion in ranks numbered 1 to 6. Processor 10 of rank 1 (the "root" node) is connected by two data paths or links 12, 14 to inputs (or outputs, depending upon the direction of flow of signal or information) of two further processors 10 of the second rank. Outputs of the two processors 10 of the second rank are connected by links 16, 18, 20 and 22 to four further processors 10 of the third rank. Rank 4 includes eight processors, rank 5 includes 16 processors, and rank 6 includes 32 processors. As described in the aforementioned Hein et al. patent, the inter-processor links may be serial or parallel bit paths. Tree structures may be optimal for specific applications. In many cases, however, the tree structure results in concentration or "funneling" of traffic volume around the root node. This in turn results in the practical limitation that the network performance may be limited by the bandwidth of the root node, or of those nodes near the root node, regardless of the size of the system. In some systems, "expansion" of capability is achieved by increasing the throughput of links near the root node, but this mode of expansion is limited by the available technology.
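The rank sizes and the funneling effect described above can be sketched as follows. This is a hedged illustration, assuming a complete binary tree and uniformly distributed traffic between pairs of processors; the function names rank_sizes and root_traffic_fraction are hypothetical and are not taken from the Hein et al. patent.

```python
def rank_sizes(ranks: int) -> list[int]:
    """Number of processors in each rank of a complete binary tree."""
    return [2 ** (r - 1) for r in range(1, ranks + 1)]


def root_traffic_fraction(ranks: int) -> float:
    """Fraction of processor pairs whose connecting path touches the root.

    Assumes uniform traffic between all unordered pairs (an illustrative
    assumption). A path touches the root when the two processors lie in
    different subtrees of the root, or when one of them is the root itself.
    """
    if ranks < 2:
        raise ValueError("at least two ranks are required")
    total = 2 ** ranks - 1            # all processors in the tree
    subtree = (total - 1) // 2        # processors below each root link
    through_root = subtree * subtree + (total - 1)
    all_pairs = total * (total - 1) // 2
    return through_root / all_pairs


if __name__ == "__main__":
    print(rank_sizes(6), sum(rank_sizes(6)))
    print(root_traffic_fraction(6))
```

For six ranks the rank sizes sum to 63 processors, and under the stated assumptions slightly more than half of all processor pairs communicate through the root node.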
A conceptually desirable type of interconnection network is the multistage switch network (MSSN), of which the shuffle-exchange based network, Claus, Omega, and banyan are variations. FIG. 3a illustrates a three-stage full banyan network as described in the above-mentioned Hein et al. patent. In FIG. 3a, three stages 213, 217 and 221 of bidirectional crossbar switches (S) are arranged in columns at the left, center and right, respectively. Sixty-four data paths 212 at the left of FIG. 3a are adapted to be connected to the output ports of a corresponding number of processors (not illustrated). Input data paths 212 are grouped into sets of eight, and each set is coupled to the inputs of one of eight 8×8 crossbar switches, designated 214, of first stage 213. Each first stage switch 214 accepts eight inputs 212, and can controllably couple any of its input data paths to any of its output data paths. The output data paths of uppermost crossbar switch 214 in FIG. 3a are designated 216. Each of the eight output paths 216 of uppermost crossbar switch 214 is connected to an input data path of a different one of eight 8×8 crossbar switches 218. Each of the seven other crossbar switches of first stage 213 likewise has each of its output data paths coupled to an input port of a different one of the crossbar switches 218 of second stage 217. This allows an input signal which is applied to any one of the input stage 213 crossbar switches 214 to reach any crossbar switch 218 of the second stage. The eight output data paths 220 of uppermost crossbar switch 218 of second stage 217 are each coupled to an input port of a different one of crossbar switches 222 of output stage 221. Each of the other seven crossbar switches 218 of second stage 217 likewise has each of its eight output data paths 220 connected to an input port of a different one of the crossbar switches 222 of output stage 221.
Since the interconnection structure is symmetrical about an axis of symmetry 200, a signal path can be established between any output port 224 of an output stage 221 crossbar switch 222 and any crossbar switch 218 of second stage 217.
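The first-to-second-stage wiring of FIG. 3a described above can be modeled in a few lines. This is a sketch under stated assumptions: the description specifies only that each output of a first-stage switch goes to a different second-stage switch, so the choice of which input port receives each link (here, the port numbered after the originating switch) is hypothetical, as are the names stage_links and reachable_second_stage.

```python
SWITCHES = 8  # crossbar switches per stage, as in FIG. 3a
PORTS = 8     # input and output data paths per 8x8 crossbar switch


def stage_links() -> dict[tuple[int, int], tuple[int, int]]:
    """Map (first-stage switch, output port) to (second-stage switch, input port).

    Output o of switch s is wired to switch o of the next stage. Landing
    on input port s is an illustrative assumption; the description does
    not fix the input port numbering.
    """
    return {(s, o): (o, s) for s in range(SWITCHES) for o in range(PORTS)}


def reachable_second_stage(first_switch: int) -> set[int]:
    """Second-stage switches reachable from one first-stage switch."""
    links = stage_links()
    return {links[(first_switch, o)][0] for o in range(PORTS)}


if __name__ == "__main__":
    # Every first-stage switch reaches all eight second-stage switches.
    print(all(reachable_second_stage(s) == set(range(SWITCHES))
              for s in range(SWITCHES)))
```

With this wiring each of the 64 inter-stage links ends on a distinct input port, and a signal entering any first-stage switch can reach any second-stage switch.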
In general, multistage switch networks (MSSNs) provide good performance in most applications, and also tend to require less hardware in terms of linkages and switch nodes. However, MSSNs are not easily expanded due to the apparent irregularity of the link connections for different network sizes. For example, if the network of FIG. 3a were to be reduced to one-half its illustrated size, by cutting and discarding that portion lying below horizontal line of symmetry 202, the connections between stages would clearly have to change, since half the connections would no longer be connected. Also, addition of further stages to an existing network pattern requires different connections, more readily seen in the network of FIG. 3b. In FIG. 3b, the link pattern joining first stage 223 and second stage 227 differs from the link pattern joining second stage 227 and third stage 231, and each differs from the link pattern extending between third stage 231 and fourth stage 233. It appears that the "back-plane" connections, i.e. the link connections, have patterns and repetition dimensions which depend upon both the size of the network and the particular stages being joined. A backplane pattern useful for first to second stage interconnect in FIG. 3a cannot be used in FIG. 3b, nor can the first-to-second, second-to-third, or third-to-fourth stage interconnect patterns of FIG. 3b be interchanged. Consequently, the MSSN pattern, despite its other advantages, requires a custom interconnection for each different size system, and cannot readily be enlarged.
Ring networks, hypercubes, and 2-d mesh networks are generally similar, in that each network switch node corresponds to a network input/output node, which is different from the above described MSSN, in which there may be fewer switch nodes than network ports.
FIG. 4 illustrates a ring network, including a plurality of communication devices 410a, 410b, 410c . . . 410h, each of which is connected by a corresponding 3-port switch 414a, 414b, 414c . . . 414h into a closed loop or ring 416. Such an arrangement in effect partitions the bus of FIG. 1a into different portions, which different portions may be used simultaneously for communication between different devices 410. For a uniform communications distribution, about half of the messages must travel less than a quarter of the way around the ring, while the other half must travel more than one-quarter of the way. This means that, on average, each message must consume bandwidth on one-fourth of the available inter-switch links in the network. The total bandwidth of the network, on average, is equal to that of only four links. This surprising result arises because any single message occupies one-fourth of the total network bandwidth, regardless of the network size. The network can only "carry" four messages at a time, corresponding to four links. Since this is invariant regardless of the network size, the per-entity share of bandwidth decreases as additional entities are added. Thus, the ring network is not expandable. An additional disadvantage of the ring network is that, as the network grows, each message must, on average, make more link-hops, as a result of which the latency (delay between initial transmission and reception at the intended receiver) increases. The increase in latency occurs because the switch at each node assumes a state which depends upon the destination of the message, which requires that at least the address portion of the message be decoded at each node.
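The four-link result described above can be checked with a short calculation. This is a minimal sketch, assuming a bidirectional ring with shortest-path routing, uniformly distributed destinations, and links that each carry one message at a time; the function names mean_ring_hops and concurrent_messages are hypothetical.

```python
def mean_ring_hops(n: int) -> float:
    """Mean number of link-hops between distinct nodes of an n-node ring.

    Assumes a message takes the shorter way around the ring and that
    every other node is an equally likely destination.
    """
    if n < 2:
        raise ValueError("a ring needs at least two nodes")
    total = sum(min(d, n - d) for d in range(1, n))
    return total / (n - 1)


def concurrent_messages(n: int) -> float:
    """Average number of messages the n inter-switch links can carry at once."""
    return n / mean_ring_hops(n)


if __name__ == "__main__":
    for n in (8, 64, 1024):
        print(n, mean_ring_hops(n), concurrent_messages(n))
```

Under these assumptions the mean hop count grows in proportion to the ring size, while the number of messages carried simultaneously stays just below four however many devices are added.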
FIG. 5 illustrates an N-dimensional hypercube network, in which N=4. In FIG. 5, each processor (or communication device or entity) 510 is connected by a data link to one port of a corresponding M-port switch 512, where M=9. In such a hypercube, each M-port switch is connected to its own processor and to each immediate neighbor. More particularly, eight processors 510a1, 510a2, . . . 510a8 of FIG. 5 are associated with the vertices of the outermost of the illustrated cubes, and each is connected to a corresponding one of eight nodes 512a1, 512a2, . . . 512a8, each of which represents a six-port switch. Only the front, upper left corner processors 510b1, 510c1 of the center and innermost cubes, respectively, are illustrated, but each vertex or corner of every hypercube is associated with a corresponding processor 510 and a six-port switch 512. The number of ports per switch can be understood by examining the connections of central-cube node 512b1. A first link 514b1 connects a port of switch 512b1 to its associated processor 510b1. Three additional links designated 516b1/b2, 516b1/b4, and 516b1/b5 connect node 512b1 to adjacent nodes 512b2, 512b4, and 512b5 of the same cube. Two additional data links 516b1+ and 516b1- connect node 512b1 to adjacent nodes 512a1 and 512c1, respectively, in the next larger and next smaller cubes. Node 512b1 is also connected to three additional nodes (not illustrated) by extensions of links 516b1/b2, 516b1/b4, and 516b1/b5 designated 516'b1/b2, 516'b1/b4, and 516'b1/b5, which connect to other adjacent structures (not illustrated) equivalent to that illustrated in FIG. 5. Thus, each node or M-port switch is connected to eight adjacent nodes and to the associated processor, for a total of nine data links or switch ports. While a hypercube of dimension 4 has been described, corresponding structures using four and five-sided pyramids are possible.
These types of communication network architectures are very effective, but present considerable construction difficulties for moderate or large networks.
FIG. 6 illustrates a 2-d mesh architecture 600. All the communication devices 610x,y and five-port switches 612x,y of network 600, where x represents the column and y represents the row, lie in a plane, and the network is therefore termed "two-dimensional" (2-d). One port of each 5-port switch 612x,y is connected to the associated communication device or entity 610x,y. The other four ports of each 5-port switch are coupled by links, some of which are designated 614, to the four nearest neighbors. The two-dimensional structure allows straightforward design, layout and expansion of the network. The average communication latency of a 2-d mesh network can be improved by connecting each edge of the 2-d mesh to the opposite edge, which is accomplished by "toroidal" connections illustrated in FIG. 6 as 616. The toroidal connections 616 do not conform to the equal-length connection links 614 lying in the 2-d plane. These toroidal connections therefore present an implementation challenge which adds complexity to the 2-d mesh communications network.
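The latency benefit of the toroidal connections described above can be estimated by comparing mean hop counts. The following is a sketch only, assuming a square k-by-k mesh, shortest-path routing along rows and columns, and uniformly distributed traffic; the function name mean_hops and the parameter wrap are hypothetical.

```python
from itertools import product


def mean_hops(k: int, wrap: bool) -> float:
    """Mean link-hops between distinct nodes of a k-by-k 2-d mesh.

    With wrap=True, each edge is joined to the opposite edge, modeling
    toroidal connections such as 616; with wrap=False only nearest
    neighbor links such as 614 exist.
    """
    if k < 2:
        raise ValueError("the mesh needs at least two nodes per side")

    def axis_distance(a: int, b: int) -> int:
        d = abs(a - b)
        return min(d, k - d) if wrap else d

    nodes = list(product(range(k), repeat=2))
    total = 0
    pairs = 0
    for (x1, y1) in nodes:
        for (x2, y2) in nodes:
            if (x1, y1) != (x2, y2):
                total += axis_distance(x1, x2) + axis_distance(y1, y2)
                pairs += 1
    return total / pairs


if __name__ == "__main__":
    for k in (4, 8):
        print(k, mean_hops(k, wrap=False), mean_hops(k, wrap=True))
```

Under these assumptions the wrapped mesh always shows a lower mean hop count than the plain mesh of the same size, which is the latency improvement that the toroidal connections buy at the cost of unequal link lengths.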
It appears, from the foregoing considerations, that the 2-d mesh is the most reasonable network to consider for a communications system which must be expandable and realizable at reasonable cost.
Many parallel processing systems are currently available, some of which are described above, but they are not widely applied to the many uses for which they are well adapted. It is widely recognized that the single greatest barrier to usage of parallel processing is the difficulty of efficiently mapping applications onto the parallel system. The programming or mapping task for efficient use of parallel processors is so cumbersome and complex, and therefore time consuming, costly and risky, that few applications have been parallelized. Part of this complexity arises from the dependence of the performance on the match between the processing system's architectural limitations and the application's communication requirements.