The present invention relates to an interconnection scheme of processor elements of a parallel computer, and in particular to a switch configuration suitable for the case where high interconnection capability is needed but all processors cannot be connected by a full crossbar switch because the number of processors is large.
In a typical method of the prior art, respective processor elements are connected to one bus or several buses. Other representative schemes include a scheme in which adjacent processor elements among processor elements arranged in a lattice form are connected as described in JP-A-60-151776, a scheme in which all processor elements are connected by one or several crossbar switches as described in JP-A-59-109966 and "Toward a parallel processing system for AI", T. Suzuoka, S. Nakamura and S. Oyanagi, 35th (the last half year of 1987) National Conference of Information Processing Society of Japan, Sep. 28, 1987, pp. 135-136, a scheme in which all processor elements are connected by a multistage switch as described in JP-A-57-111654, and a scheme in which hypercube connection is used as described in reference 1.
Reference 1: C. L. Seitz, "The Cosmic Cube", Communications of the ACM, vol. 28, no. 1, pp. 22-33, 1985.
Among the above described conventional techniques, the bus connection scheme has the advantage that only a small amount of hardware is required, but has the problem that performance is lowered by competition for the buses when the number of connected processor elements is large. The practical limit is said to be somewhat more than ten processor elements.
In the lattice connection (also called mesh connection), the amount of hardware is similarly small, and a large number of processor elements can be connected. On the other hand, a processor element can communicate only with adjacent processor elements, and hence the overall communication performance depends largely upon the nature of the problem to be dealt with. The communication performance is good when solving a partial differential equation and in picture processing suited to neighborhood calculation. In the case of the finite element method, the fast Fourier transform (FFT), and logic/circuit simulation, the overhead for communication becomes significant.
In the full crossbar switch connection, all processor elements are completely connected by a matrix switch. Therefore, the full crossbar switch connection has the highest performance among all connections. Since the amount of hardware is in proportion to the square of the number of processor elements, however, there is typically a connection limit of several tens of processor elements.
In the case of a multistage switch, the amount of hardware is limited to approximately L log.sub.2 L, where L is the number of processor elements, and complete connection is possible. Therefore, the multistage switch has been regarded as a connection scheme suited for highly parallel computers including a large number of processor elements. However, there is a problem that the length of the communication path (i.e., the number of relaying stages) becomes approximately log.sub.2 L and hence the transfer delay is correspondingly large. There is also the problem that when a large number of processor elements access an identical shared variable, a plurality of access paths must contend for a communication path on the way, and a general paralysis of the network, called hot spot contention, can occur (the paralysis extends to all accesses, not only those to the shared variable). Yet another problem for the multistage switch is that when access competition is significant, sufficient performance is not obtained even if hot spot contention does not occur.
A hypercube connection is known as a connection through which relatively efficient communication can be performed. In this case, however, the other party of communication must be specified in the program and hence programming becomes complicated. If an automatic relaying mechanism is provided for each processor element in order to avoid this complication of programming, the amount of hardware increases. Further, there is a problem that mounting is troublesome because the wiring crosses over itself.
It is known that specific interprocessor communication patterns often appear in parallel processing of large-scale numerical calculation. The lattice connection, the ring connection and the butterfly connection can be mentioned as representative communication patterns. If communication in these specific patterns can be processed at high speed, therefore, the network can be said to be highly effective. Among the above described conventional techniques, only the full crossbar switch and the hypercube contain the lattice connection, the ring connection and the butterfly connection within their own connection topologies, which enables communication in these patterns without requiring the relaying function. Neither the bus connection, nor the lattice connection, nor the multistage switch is capable of processing all communication of these specific patterns. Further, as a special example, a spanning bus hypercube, which is obtained by expanding a binary hypercube based upon the connection of two processor elements into a configuration based upon the connection of a plurality of processor elements, is described in Reference 2. Since a plurality of processor elements are connected via a bus, however, only two processors can communicate at one time, and hence the spanning bus hypercube is not considered to contain the above described connection topologies.
Reference 2: Dharma P. Agrawal et al., "Evaluating the Performance of Multicomputer Configurations", May 1986, pp. 28-29.
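The statement that a binary hypercube contains the ring, lattice and butterfly patterns can be checked with a short sketch. It assumes the standard binary hypercube, in which two nodes are directly linked when their addresses differ in exactly one bit, and it uses a reflected Gray code to place ring and lattice positions on hypercube nodes. The Gray-code placement and all function names are illustrative assumptions, not something specified in the text above.

```python
def gray(i: int) -> int:
    """Reflected binary Gray code of i (assumed placement scheme)."""
    return i ^ (i >> 1)


def is_hypercube_link(a: int, b: int) -> bool:
    """True if node addresses a and b differ in exactly one bit."""
    d = a ^ b
    return d != 0 and (d & (d - 1)) == 0


def ring_is_embedded(n_bits: int) -> bool:
    """Every ring neighbour pair (with wrap-around) is a direct link."""
    size = 1 << n_bits
    return all(
        is_hypercube_link(gray(i), gray((i + 1) % size)) for i in range(size)
    )


def lattice_is_embedded(row_bits: int, col_bits: int) -> bool:
    """Every neighbour pair of a 2-D lattice (no wrap-around) is a direct link."""
    rows, cols = 1 << row_bits, 1 << col_bits

    def addr(r: int, c: int) -> int:
        # Row and column positions are Gray-coded into separate bit fields.
        return (gray(r) << col_bits) | gray(c)

    horizontal = all(
        is_hypercube_link(addr(r, c), addr(r, c + 1))
        for r in range(rows) for c in range(cols - 1)
    )
    vertical = all(
        is_hypercube_link(addr(r, c), addr(r + 1, c))
        for r in range(rows - 1) for c in range(cols)
    )
    return horizontal and vertical


def butterfly_is_embedded(n_bits: int) -> bool:
    """Every butterfly partner (address with bit k flipped) is a direct link."""
    size = 1 << n_bits
    return all(
        is_hypercube_link(i, i ^ (1 << k))
        for i in range(size) for k in range(n_bits)
    )
```

With these definitions, all three checks hold for any number of address bits, which is why no relaying is needed for these patterns on a hypercube.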
Among the above described problems, the problem that the number of processor elements connected in the bus connection is limited has not been solved when the number of processor elements is large. Further, both the problem that the performance of the lattice connection largely depends upon the property of the problem dealt with and the problem of hot spot contention in the multistage switch are basic and essential problems and are not solved under the present art. Further, these connections, together with the spanning bus hypercube, have a problem of degraded performance in principal applications caused by the fact that these connections do not contain all of the lattice connection, the ring connection and the butterfly connection.
The two remaining networks, i.e., the (full) crossbar switch and the hypercube, are free in principle from the above described difficulties. On the other hand, in the (full) crossbar switch, the amount of hardware is too large, and hence a large number of processor elements cannot be connected. In the hypercube, a large number of processor elements can be connected, but programming and mounting are troublesome and the performance is also degraded when the number of connected processor elements is increased. Further, if communication is performed between two processor elements which are not directly connected in a hypercube, another processor element must perform the relaying function. Such a communication method, in which an information packet is taken temporarily into a processor element and then transferred to a different processor element, is called a store and forward scheme. Not only the hypercube but also any other store and forward scheme has the problem that a deadlock state may be caused. That is to say, when a loop communication path is formed by a plurality of processor elements P.sub.1, P.sub.2, P.sub.3, ... performing the relaying function, P.sub.1 cannot finish its transmission operation until P.sub.2 finishes its transmission operation and is ready to receive the information, P.sub.2 cannot finish its transmission operation until P.sub.3 finishes its transmission operation and is ready to receive the information, and so on. In this way, the processor elements block one another and none is able to operate, resulting in the deadlock state.
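The circular wait described above can be illustrated with a minimal sketch. It assumes that each relaying processor element waits on exactly one other element at a time, and represents this as a mapping from each element to the element it is waiting for. The function name and the string labels are hypothetical and used only for illustration.

```python
def find_wait_cycle(waits_for: dict) -> list:
    """Return the elements forming a circular wait, or [] if there is none.

    waits_for maps each processor element to the single element whose
    transmission it is waiting on (an assumed, simplified model).
    """
    for start in waits_for:
        path = []
        seen = set()
        node = start
        # Follow the chain of waits until it ends or revisits an element.
        while node in waits_for and node not in seen:
            seen.add(node)
            path.append(node)
            node = waits_for[node]
        if node in seen:
            # The chain closed on itself: everything from that element on
            # is part of the loop and can never make progress.
            return path[path.index(node):]
    return []
```

For the loop in the text, `find_wait_cycle({"P1": "P2", "P2": "P3", "P3": "P1"})` reports all three elements, whereas an open chain such as `{"P1": "P2", "P2": "P3"}` yields an empty list because P3 is free to proceed.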
The performance is evaluated by means of the number of basic changeover switches (cross points) that one unit of transmitted information passes through until it reaches its final destination. The amount of hardware is evaluated by means of the total number of cross points constituting the network. In general, the amount of hardware and the performance are in a trade-off relationship: as the total number of cross points is increased, the number of cross points through which one unit of transmitted information passes is decreased.
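The trade-off can be made concrete with a rough sketch that applies the two measures to the full crossbar switch and the multistage switch. It uses the approximations given earlier (hardware proportional to the square of L for the crossbar, approximately L log.sub.2 L hardware and log.sub.2 L relaying stages for the multistage switch). Two points are assumptions rather than statements from the text: a transfer through a full crossbar is taken to pass through a single cross point, and a transfer through the multistage switch is counted as one cross point per stage. L is restricted to powers of two to keep the arithmetic exact.

```python
def _log2_exact(num_pe: int) -> int:
    """Base-2 logarithm of num_pe, which must be a power of two."""
    if num_pe < 2 or num_pe & (num_pe - 1):
        raise ValueError("num_pe must be a power of two and at least 2")
    return num_pe.bit_length() - 1


def crossbar_metrics(num_pe: int) -> dict:
    """Approximate measures for a full crossbar switch of num_pe elements."""
    return {
        "total_cross_points": num_pe * num_pe,  # hardware ~ L squared
        "cross_points_per_transfer": 1,         # assumed: one matrix cross point
    }


def multistage_metrics(num_pe: int) -> dict:
    """Approximate measures for a multistage switch of num_pe elements."""
    stages = _log2_exact(num_pe)
    return {
        "total_cross_points": num_pe * stages,  # hardware ~ L log2 L
        "cross_points_per_transfer": stages,    # assumed: one per stage
    }


if __name__ == "__main__":
    for num_pe in (16, 256, 1024):
        print(num_pe, crossbar_metrics(num_pe), multistage_metrics(num_pe))
```

Under these assumptions, the crossbar spends far more cross points in total to keep each transfer short, while the multistage switch saves hardware at the cost of a longer path.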