1. Field of the Invention
The invention relates generally to interconnection networks for parallel computers. More particularly, the invention relates to techniques for interconnecting and packaging processors in a parallel computer such that (1) global communication between processors is supported efficiently, (2) the parallel computer can be partitioned into identical components (i.e., chips, boards, or racks) that can be used to create parallel computers with arbitrarily large numbers of processors, and (3) the parallel computer can be customized to match the packaging constraints imposed by each level of the packaging hierarchy.
In more specific terms, given N, the desired number of processors, and given a set of packaging constraints which state (a) the maximum number of components in each level of the packaging hierarchy that can be placed in a single component in the next level of the packaging hierarchy; and (b) the maximum number of wires that can leave a single component in each level of the packaging hierarchy, one aspect of the invention teaches how the processors can be interconnected and packaged such that (1) all of the packaging constraints are satisfied, (2) each component in each level of the packaging hierarchy is identical to every other component in the same level, (3) the components at each level of the packaging hierarchy can be re-used to create parallel machines with more processors, and (4) the resulting parallel computer efficiently supports global communication such as that required by the Fast Fourier Transform (FFT), the bitonic sort, the Benes permutation algorithm, and all algorithms in the classes Ascend and Descend.
The Fast Fourier Transform is described by Preparata et al in an article entitled "The Cube-Connected Cycles: A Versatile Network For Parallel Computation", published in the Communications of the ACM, 24(5):300-309, May, 1981. The bitonic sort is described in the aforementioned Preparata et al publication and in an article by K. E. Batcher entitled "Sorting Networks and Their Applications", published in the Proceedings of the AFIPS Spring Joint Computer Conference, pages 307-314, 1968. The Benes permutation algorithm is described in the aforementioned Preparata et al publication; in articles by V. E. Benes entitled "Mathematical Theory Of Connecting Networks and Telephone Traffic", published by the Academic Press, 1965, and "Optimal Rearrangeable Multistage Connection Networks", published in the Bell System Technical Journal, 43:1641-1656, 1964; and in an article by A. Waksman entitled "A Permutation Network", published in the Journal of the ACM, 15(1):159-163, January, 1968. Algorithms in the classes Ascend and Descend are described in the aforementioned Preparata et al publication. All of the above identified publications are hereby incorporated herein by reference.
According to a further aspect of the invention, an efficient technique is taught for implementing a wide class of parallel algorithms, including all of those algorithms in the classes Ascend and Descend, on the parallel computers as described.
2. Description of the Related Art
Many parallel computers consist of multiple processors, each with its own associated memory, and communication links that connect certain pairs of processors. A key issue in the design of such a parallel computer is the arrangement of the communication links, which are referred to collectively as the "interconnection network". The design of the interconnection network represents a trade-off between the communication requirements of the algorithms which will be implemented on the parallel machine and the packaging constraints imposed by technological limitations.
More specifically, many algorithms require global communication patterns in which each processor sends messages to a large number of other processors, some of which may be far away in the physical implementation of the parallel machine. The FFT, the bitonic sort, and the algorithms in the classes Ascend and Descend (referred to hereinabove) are examples of algorithms which require such global communication. Thus, these algorithms could best be supported by providing a high-bandwidth connection between each processor and all (or a large number) of the other processors.
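The global communication pattern of an Ascend algorithm can be illustrated with a short sequential sketch. It assumes the usual formulation from the Preparata et al publication: N = 2^n data items are held at positions 0 through N-1, and for each bit j = 0, 1, ..., n-1 in turn, every pair of positions whose indices differ only in bit j is combined. The names `ascend` and `all_sum` are hypothetical and chosen only for illustration; a Descend algorithm would visit the bits in the opposite order.

```python
def ascend(data, op):
    """Sequentially simulate an Ascend pass over a power-of-two list.

    For bit = 1, 2, 4, ... every pair of positions (i, i | bit) whose
    indices differ only in that bit is replaced by op(low, high).
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(data)
    bit = 1
    while bit < n:
        for i in range(n):
            if not i & bit:  # i is the lower index of its pair
                out[i], out[i | bit] = op(out[i], out[i | bit])
        bit <<= 1
    return out


def all_sum(low, high):
    """Illustrative pairwise operation: both partners keep the pair's sum."""
    total = low + high
    return total, total


# After log2(8) = 3 steps every position holds the global sum.
print(ascend([1, 2, 3, 4, 5, 6, 7, 8], all_sum))  # prints [36, 36, 36, 36, 36, 36, 36, 36]
```

In the final step each position exchanges data with a partner N/2 positions away, which is why such algorithms stress networks whose processors are physically far apart.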
On the other hand, technological constraints make it impossible to provide a high-bandwidth connection between each processor and all of the remaining processors. In particular, parallel computers are typically implemented using a packaging hierarchy consisting of two or more levels. For example, each processor may occupy a single chip, while multiple chips are placed on a single board, multiple boards are combined to create modules, multiple modules are combined to create racks, and multiple racks are combined to create the complete parallel computer. Each level of this packaging hierarchy imposes bandwidth constraints, called pin limitations, that limit the number of wires that can leave each component in the given level of the packaging hierarchy.
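As a minimal sketch of how such a packaging hierarchy and its pin limitations might be represented, the following example models each level by the maximum number of lower-level components it can hold and the maximum number of wires that can leave it. The class `Level`, the functions `components_needed` and `over_pin_limit`, and all numeric limits are hypothetical values chosen for illustration; they are not taken from any particular machine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    name: str
    max_children: int  # most lower-level components one component can hold
    max_wires: int     # pin limitation: most wires that can leave one component


def components_needed(num_processors, levels):
    """Return the minimum number of components required at each level."""
    counts = {}
    remaining = num_processors
    for level in levels:
        # Integer ceiling division: components needed to hold `remaining` items.
        remaining = -(-remaining // level.max_children)
        counts[level.name] = remaining
    return counts


def over_pin_limit(wires_leaving, levels):
    """Return the names of levels whose required wire count exceeds the limit."""
    return [lv.name for lv in levels
            if wires_leaving.get(lv.name, 0) > lv.max_wires]


# Hypothetical hierarchy: 16 processors per chip, 64 chips per board,
# 32 boards per rack.
levels = [Level("chip", 16, 256), Level("board", 64, 1024), Level("rack", 32, 4096)]
print(components_needed(65536, levels))  # prints {'chip': 4096, 'board': 64, 'rack': 2}
print(over_pin_limit({"chip": 300, "board": 512}, levels))  # prints ['chip']
```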
In addition to pin limitations, the packaging hierarchy places a number of other constraints on cost-effective implementations of parallel computers. Due to the costs of designing and manufacturing different components, it is preferable to have all components in each level of the packaging hierarchy be identical to all other components in the same level. Such an implementation will be referred to as a uniform implementation. Also, parallel computers are typically manufactured in a range of sizes. Even if the implementation for any given number of processors is uniform, it is possible that different components are needed for different size machines. A parallel computer architecture which can be implemented uniformly using the same components in machines with different numbers of processors will be referred to herein as "scalable".
A large number of different interconnection networks have been proposed for parallel computers. However, all of the previously proposed networks fail to provide one or more of the following desirable features: (1) efficient support of global communication, (2) small pin requirements which match the pin limitations of each level in the packaging hierarchy, and (3) a regular structure which allows a uniform and scalable implementation of parallel computers which utilize the given interconnection network.
For example, many parallel computers use a 2-dimensional or 3-dimensional mesh interconnection network. Examples of parallel computers with 2-dimensional mesh interconnection networks include the "MPP" manufactured by Goodyear Aerospace, the "MP-1" manufactured by MASPAR, and the "Paragon" manufactured by Intel. The "J-Machine", which is under development at MIT, has a 3-dimensional mesh interconnection network. Although parallel computers with mesh interconnection networks can be packaged efficiently, they cannot support global communication efficiently due to their large diameter. In particular, an N processor parallel computer with a 2-dimensional mesh network has a diameter that is proportional to N^(1/2), while such a computer with a 3-dimensional mesh network has a diameter that is proportional to N^(1/3).
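To make these growth rates concrete, the following sketch computes the diameter of a square or cubic mesh, assuming no wraparound links, so that the longest shortest path runs between opposite corners. For comparison it also computes log2 N, the diameter of a hypercube with the same number of processors. The function names are hypothetical.

```python
def mesh_diameter(side, dims):
    """Diameter of a dims-dimensional mesh with `side` processors per edge.

    Without wraparound links, the farthest pair of processors sits at
    opposite corners, (side - 1) hops apart along each dimension.
    """
    return dims * (side - 1)


def hypercube_diameter(num_processors):
    """Diameter of a hypercube: log2(N) for N a power of two."""
    if num_processors < 1 or num_processors & (num_processors - 1):
        raise ValueError("number of processors must be a power of two")
    return num_processors.bit_length() - 1


# N = 4096 processors arranged three ways.
print(mesh_diameter(64, 2))      # 64 x 64 mesh, prints 126
print(mesh_diameter(16, 3))      # 16 x 16 x 16 mesh, prints 45
print(hypercube_diameter(4096))  # prints 12
```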
U.S. Pat. No. 4,843,540, to Stolfo, U.S. Pat. No. 4,591,981, to Kassabov, and U.S. Pat. No. 4,583,164, to Tolle, all describe tree-structured interconnection networks. Although trees have small pin requirements, they cannot support global communication effectively because the root of the tree becomes a bottleneck through which a large number of messages are forced to pass.
Another important type of interconnection network is the hypercube. Commercial parallel computers based on the hypercube topology include the "NCUBE/10" from NCUBE, Inc., the "iPSC/2" from Intel, and the "CM-2" from Thinking Machines. U.S. Pat. No. 4,805,091, to Thiel et al, describes a technique for packaging parallel computers with the hypercube topology. Although parallel computers based on the hypercube topology having a few thousand processors have been built, pin limitations have forced the connections to be very narrow (such as one bit wide), thus limiting communication performance. Furthermore, hypercube computers (i.e., those based on a hypercube topology) with more processors require more pins per packaging component, so pin limitations prevent hypercube computers with arbitrarily large numbers of processors from being constructed. Finally, hypercube computers are not scalable, as different components must be used in parallel computers with different numbers of processors.
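The growth in pin requirements can be seen with a small calculation. The sketch below assumes that a hypercube of 2^n processors is packaged so that each component holds a subcube of 2^k processors, and that each link is one wire wide; both are illustrative assumptions, and the function name is hypothetical. Each processor then has k links that stay inside its component and n - k links that leave it.

```python
def hypercube_offcomponent_links(n, k):
    """Links leaving one component that holds a 2**k-processor subcube
    of a hypercube with 2**n processors (each link counted as one wire)."""
    if not 0 <= k <= n:
        raise ValueError("require 0 <= k <= n")
    return (2 ** k) * (n - k)


# 16 processors per component (k = 4) in machines of increasing size.
print(hypercube_offcomponent_links(10, 4))  # 1,024 processors, prints 96
print(hypercube_offcomponent_links(20, 4))  # 1,048,576 processors, prints 256
```

Under these assumptions the number of wires leaving a fixed-size component grows with log2 N, so a component designed for one machine size lacks the pins needed for a larger one.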
Several interconnection networks which are related to the hypercube have been proposed for use in parallel computers. These include the shuffle-exchange as described in (1) an article by Nassimi et al entitled "Data Broadcasting In SIMD Computers" published in the IEEE Transactions On Computers, C-36(12):1450-1466, December, 1987, (2) an article by J. T. Schwartz entitled "Ultracomputers", published in the ACM Transactions On Programming Languages and Systems, 2(4):484-521, October, 1980, and (3) an article by H. S. Stone entitled "Parallel Processing With The Perfect Shuffle" published in the IEEE Transactions On Computers, C-20(2):153-161, February, 1971; the de Bruijn network, described in an article by Bermond et al entitled "de Bruijn and Kautz Networks: A Competitor For The Hypercube?", published in Hypercube and Distributed Computers, pages 279-293 by Elsevier Science Publishers B.V. (North Holland), 1989, and an article by Samatham et al entitled "The de Bruijn Multiprocessor Network: A Versatile Parallel Processing and Sorting Network For VLSI" published in IEEE Transactions On Computers, 38(4):567-581, April, 1989; and the cube-connected cycles described in the aforementioned Preparata et al publication.
Both the shuffle-exchange and de Bruijn networks have irregular structures, and as a result, there is no known uniform implementation for parallel computers based on either of these networks which has small pin requirements. Parallel computers with the cube-connected cycles network can be implemented in a uniform manner with small pin requirements, but this implementation is not scalable. Also, when pin limitations are taken into account, all of these networks are less efficient in supporting algorithms in the classes Ascend and Descend than are the new hierarchical networks presented herein.
Finally, a number of computers with hierarchical interconnection networks have been proposed. As indicated hereinabove, Schwartz proposed the layered shuffle-exchange computer, which has a two-level network that consists of a number of identical components, such as chips or boards. Although the layered shuffle-exchange computer is uniform and scalable, its diameter is proportional to the number of packaging components (e.g., chips or boards) that are used, so it is not efficient when implementing global communication in a large parallel machine. The shuffle-shift shuffle-exchange computers defined by R. Cypher in an article entitled "Theoretical Aspects of VLSI Pin Limitations", Technical Report T.R. 89-02-01, published by the University of Washington, Department of Computer Science, February, 1989, are not uniform, as different processors have different degrees. Furthermore, neither the layered shuffle-exchange computers nor the shuffle-shift shuffle-exchange computers can be customized to match the constraints imposed by three or more levels of the packaging hierarchy.
The Hierarchical Interconnection Networks proposed by Dandamudi et al in an article entitled "Hierarchical Interconnection Networks For Multicomputer Systems", published in the IEEE Transactions On Computers, 39(6):786-797, June, 1990, are not uniform because different processors have different degrees, and they are not optimized for implementing algorithms with global communication patterns such as those in the classes Ascend and Descend. Parallel computers which use the hierarchical cubic networks described by K. Ghose et al in an article entitled "The Design and Evaluation Of the Hierarchical Cubic Network", published in the Proceedings of the International Conference On Parallel Processing, pages 355-562, 1990 (Volume 1), are not scalable, as the degree of each node grows with the number of processors. The hypernet networks proposed by J. Ghosh et al in an article entitled "Hypernet: A Communication-Efficient Architecture For Constructing Massively Parallel Computers", published in the IEEE Transactions On Computers, C-36(12):1450-1466, December, 1987, have a fixed number of connections with identical bandwidth at each level of the packaging hierarchy, so they cannot be tuned to match arbitrary packaging constraints.
Thus, none of the previously known parallel architectures are simultaneously uniform, scalable, adjustable to arbitrary packaging constraints, and efficient in implementing algorithms with global communication, such as those algorithms in the classes Ascend and Descend.