Algorithms recently proposed for applications in Embedded Signal Processing (ESP) systems, e.g. in radar and sonar systems, demand sustained performance in the range of 1 GFLOPS to 50 TFLOPS. As a consequence, many processing elements (PEs) must work together, and thus the required interconnect bandwidth increases. Other requirements that typically must be fulfilled in ESP-systems are real-time processing, small physical size, and multimode operation. To be able to handle all these constraints at the same time, new parallel computer architectures are required.
Several such parallel and distributed computer systems for embedded real-time applications have been proposed, including systems that use fiber optics in the interconnection network to achieve high bandwidth. See, for instance, M. Jonsson, High Performance Fiber-Optic Interconnection Networks for Real-Time Computing Systems, Doctoral Thesis, Department of Computer Engineering, Chalmers University of Technology, Goteborg, Sweden, November 1999, ISBN 91-7197-852-6.
Actually, by introducing optical technologies in ESP-systems, many uncompromising requirements can be met. The physical size, for example, can be reduced, and the bandwidth over the cross section that divides a network into two halves, usually referred to as bisection bandwidth, can be improved; see, for example, K. Teitelbaum, “Crossbar tree networks for embedded signal processing applications”, Proceedings of Massively Parallel Processing using Optical Interconnections, MPPOI'98, Las Vegas, Nev., USA, Jun. 15-17, 1998, pp. 200-207. This document also discloses that high Bisection Bandwidth (BB) reduces the time it takes to redistribute data between computational units that process information in different dimensions, and this property is of high importance in ESP-systems.
However, to make the best use of optics in inter-processing computing systems, all optical and opto-electronic properties must be taken into consideration. These properties include, among others, transmission in all spatial dimensions, light coherence, and high fan-out.
In fact, it has been shown that optical free-space interconnected 3D-systems (systems using all three spatial dimensions for communication), with globally and regularly interconnected nodes arrayed on planes, are best suited for parallel computer architectures using optics; see, for example, H. M. Ozaktas, “Towards an optimal foundation architecture for optoelectronic computing”, Proceedings of Massively Parallel Processing using Optical Interconnections, MPPOI'96, Maui, Hi., USA, Oct. 27-29, 1996, pp. 8-15. Folding optically connected 3D-systems into planes will also offer precise alignment, mechanical robustness, and temperature stability at a relatively low cost; see J. Jahns, “Planar packaging of free-space optical interconnections”, Proceedings of the IEEE, vol. 82, no. 11, November 1994, pp. 1623-1631.
The hypercube is a topology that has been investigated extensively. One reason for its popularity is that many other well-known topologies, such as lower-dimensional meshes, butterflies, and shuffle-exchange networks, can be embedded into the hypercube structure. Another reason is that this topology can be used to implement several algorithms requiring all-to-all communication, e.g. matrix transposition, vector reduction, and sorting, as described, for example, in I. Foster, Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering, Addison Wesley Publishing Company, Inc., Reading, Mass., USA, 1995.
Geometrically, a hypercube can be defined recursively as follows: The zero-dimensional hypercube is a single processor. An n-dimensional hypercube with $N = 2^n$ Processing Elements (PEs) is built of two hypercubes with $2^{n-1}$ PEs each, where all PEs in one half are connected to the corresponding PEs in the other half. In FIG. 1, a 6-dimensional hypercube is shown. This hypercube is built of two 5D-hypercubes, which in turn are built of 4D-hypercubes, FIG. 1b. The 4D-hypercube is further subdivided into 3D-hypercubes, FIG. 1c. The thick lines in FIG. 1c correspond to eight interconnections each.
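The recursive definition can be sketched in a few lines of Python. This is only an illustration of the construction; the function name `build_hypercube` and the integer labelling of the PEs (the second half receives labels shifted by the size of one half) are assumptions made for the example.

```python
def build_hypercube(n):
    """Return {pe: set(neighbours)} for an n-dimensional hypercube (hypothetical helper)."""
    if n == 0:
        return {0: set()}              # zero-dimensional hypercube: a single PE
    half = build_hypercube(n - 1)      # one (n-1)-dimensional half
    offset = 1 << (n - 1)              # number of PEs in each half
    cube = {}
    for pe, nbrs in half.items():
        # First half keeps its links and gains one link to its counterpart.
        cube[pe] = set(nbrs) | {pe + offset}
        # Second half is a shifted copy, linked back to the corresponding PE.
        cube[pe + offset] = {m + offset for m in nbrs} | {pe}
    return cube


cube6 = build_hypercube(6)
num_pes = len(cube6)
num_links = sum(len(nbrs) for nbrs in cube6.values()) // 2
print(num_pes, num_links)  # 64 192
```

With this labelling, two PEs are neighbours exactly when their labels differ in a single bit, which is the property used by the one-dimension-at-a-time communication schemes.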
A disadvantage of the hypercube is its complexity. It requires more and longer wires than a mesh, since not only the nearest neighbors but also more distant neighbors are connected to each other when the dimension is greater than three, i.e. when the topology has more dimensions than physical space. In fact, the required number of electrical wires (of different lengths) in even a relatively small hypercube will be enormous. Consider, for instance, an implementation of a 6D-hypercube on a circuit board, where the transfer rate of a unidirectional link between two processing elements must be on the order of 10 Gbit/s. This implementation requires 12,288 electrical wires of different lengths, each clocked at a frequency of 312.5 MHz (32-bit wide links assumed). Since the wires are not allowed to cross each other physically, numerous layers are required.
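The text does not spell out how the two figures are obtained. The following sketch shows one counting that reproduces them, assuming one wire per bit per direction and two unidirectional links between every pair of neighbouring PEs; the function name `hypercube_wire_budget` is hypothetical.

```python
def hypercube_wire_budget(dim, link_rate_bps, link_width_bits):
    """Wire count and per-wire clock for an electrical hypercube (assumed counting)."""
    nodes = 2 ** dim
    bidirectional_links = dim * nodes // 2          # each PE has `dim` neighbours
    unidirectional_links = 2 * bidirectional_links  # one link per direction
    wires = unidirectional_links * link_width_bits  # assumption: one wire per bit
    clock_hz = link_rate_bps / link_width_bits      # bits are sent in parallel
    return wires, clock_hz


wires, clock_hz = hypercube_wire_budget(dim=6, link_rate_bps=10e9, link_width_bits=32)
print(wires, clock_hz / 1e6)  # 12288 312.5
```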
Above, it was stated that interconnection networks in, for example, ESP-systems must be able to efficiently redistribute data between computational units that process information in different dimensions. In FIG. 2, this reorganization process is shown. Here, the first cluster of processing elements, the left cube, computes data in one dimension (marked with an arrow). The next working unit, the right cube, computes data in another dimension, and thus redistribution must be performed.
This redistribution of data, referred to as corner turning, accounts for almost all of the inter-processor communication in ESP-systems. Note also that corner turning requires all-to-all communication.
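Why corner turning is all-to-all can be seen from a small sketch in which the data set is viewed as a matrix. As a simplifying assumption made only for this example, there are P PEs and a P x P matrix, PE i holds row i before the corner turn, and PE j holds column j afterwards. The function name `corner_turn` is hypothetical.

```python
def corner_turn(matrix):
    """Redistribute a row-wise distributed square matrix column-wise.

    Returns the new distribution and the set of (source PE, destination PE)
    pairs that had to exchange data.
    """
    p = len(matrix)
    after = [[None] * p for _ in range(p)]
    transfers = set()
    for src in range(p):              # PE `src` holds row `src`
        for col in range(p):
            dst = col                 # element in column `col` belongs to PE `col`
            after[dst][src] = matrix[src][col]
            if src != dst:
                transfers.add((src, dst))
    return after, transfers


p = 4
matrix = [[10 * r + c for c in range(p)] for r in range(p)]
after, transfers = corner_turn(matrix)
print(len(transfers))  # 12, i.e. every PE sends to every other PE
```

The result is the transpose of the original matrix, and the number of distinct source/destination pairs is P(P-1).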
In hypercubes, a corner turn is actually, from a mathematical point of view, a matrix transposition. Therefore, as stated above, algorithms exist for this interconnection topology. Also, since the BB scales linearly with the number of processors in hypercubes, higher dimensions lead to very high BB.
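The linear scaling can be checked by counting the links that cross a cut separating the two halves of the hypercube. The sketch below assumes the usual binary labelling, in which neighbouring PEs differ in exactly one address bit, and cuts along the highest dimension; the function name `bisection_links` is hypothetical.

```python
def bisection_links(dim):
    """Count bidirectional links crossing the cut on the top address bit."""
    top = 1 << (dim - 1)
    crossing = 0
    for pe in range(1 << dim):
        for d in range(dim):
            nbr = pe ^ (1 << d)                       # neighbour along dimension d
            if pe < nbr and (pe & top) != (nbr & top):
                crossing += 1                         # count each link once
    return crossing


for dim in range(1, 7):
    print(dim, 1 << dim, bisection_links(dim))        # crossing links equal P/2
```

Under these assumptions the cut is crossed by P/2 links, so the bisection bandwidth is P/2 times the link rate per direction.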
A full corner turn takes:

$$\frac{1}{2}\,\frac{D_{size}\,\log_2(P)}{P\,R_{link,eff}} \qquad (1)$$

seconds. $D_{size}$ is the total size of the chunk of data to be redistributed, $P$ is the number of processors in the hypercube, and $R_{link,eff}$ is the effective transfer rate of a single link in one direction when overhead, e.g. message startup time, is excluded. The equation above is based on the hypercube transpose algorithm described in I. Foster, Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering, Addison Wesley Publishing Company, Inc., Reading, Mass., USA, 1995. In this algorithm, data is only exchanged in one dimension at a time. Using this one-dimension-at-a-time procedure is a direct result of the cost saving “single-port” behavior. This is an extra feature compared to single-port communication, where a node can only send and receive on one of its ports at a time. In addition, each node is also capable of receiving different data from different neighbors at the same time, i.e. similar to a multi-port behavior. However, the algorithm chosen here is the same as the SBT-routing scheme described by S. L. Johnsson and C-T. Ho, “Optimum broadcasting and personalized communication in hypercubes”, IEEE Transactions on Computers, vol. 38, no. 9, September 1989, pp. 1249-1268. SBT-routing is within a factor of two of the lower bound of one-port all-to-all personalized communication.
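Equation (1) can be read as log2(P) exchange steps, one per dimension, in each of which a node transfers half of its D_size/P share over a single link. The sketch below evaluates the expression under that reading. The numbers are purely illustrative assumptions: P = 64 and 10 Gbit/s are borrowed from the 6D-hypercube example above, while the data size of 8 Gbit is hypothetical, as is the function name `corner_turn_time`.

```python
import math


def corner_turn_time(d_size_bits, p, r_link_eff_bps):
    """Full corner turn time in seconds, equation (1), overhead excluded."""
    steps = math.log2(p)                        # one exchange per dimension
    per_step_bits = d_size_bits / (2 * p)       # half of a node's share per step
    return steps * per_step_bits / r_link_eff_bps


t = corner_turn_time(d_size_bits=8e9, p=64, r_link_eff_bps=10e9)
print(f"{t * 1e3:.1f} ms")  # 37.5 ms
```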
In broadcasting, the data transfer time for one-port communication is minimized if one dimension is routed at a time, i.e. the same principle as above, and all nodes use the same scheduling discipline. Using this principle, each node copies its own amount of data $M$ to its first neighbor (along the first dimension) and simultaneously receives an amount $M$ of data from the same neighbor. In the next step, each node copies its own data and the data just received from the first neighbor to the second neighbor (along the second dimension) and simultaneously receives an amount $2M$ of data. This procedure is repeated over all dimensions in the hypercube. Thus the amount of data each node has to send and receive is:

$$\sum_{l=0}^{\log_2(P)-1} 2^{l} M = (P-1)\,M \qquad (2)$$

where $M$ is the data size in each node that has to be copied to all other nodes in the hypercube, and $P$ is the number of processors (nodes). Since each node has an effective transfer rate of $R_{link,eff}$, broadcasting will take:

$$\frac{(P-1)\,M}{R_{link,eff}} \qquad (3)$$

seconds.
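The one-dimension-at-a-time broadcast can be simulated to check the amount of data sent per node. In this sketch each node starts with one block of size M, nodes are labelled so that neighbours along dimension d differ in address bit d, and the function name `all_to_all_broadcast` is hypothetical; none of these details are taken from the text beyond the procedure itself.

```python
def all_to_all_broadcast(dim, m=1):
    """Simulate the broadcast; return the blocks held and data volume sent per node."""
    p = 1 << dim
    held = [{node} for node in range(p)]       # block ids each node holds
    sent = [0] * p                             # data volume sent by each node
    for d in range(dim):                       # route one dimension at a time
        snapshot = [set(blocks) for blocks in held]
        for node in range(p):
            nbr = node ^ (1 << d)              # neighbour along dimension d
            sent[node] += len(snapshot[node]) * m
            held[nbr] |= snapshot[node]        # neighbour receives everything held so far
    return held, sent


held, sent = all_to_all_broadcast(dim=6, m=1)
print(sent[0])  # 63, i.e. (P - 1) * M for P = 64
```

Dividing the sent volume by the effective link rate gives the broadcast time for single-port nodes.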
However, this equation is only valid if the nodes are considered as single-port. In reality, as described above, one copy of data from one node can be distributed to all $\log_2(P)$ neighbors at the same time, and each node can receive data from all its neighbors at the same time. The equation above should therefore not be regarded as optimal for this architecture, but it is good enough for its purpose.