Typical parallel multiprocessor systems include a plurality of processor nodes interconnected by a communication network over which the processor nodes exchange information. In general, the processor nodes cooperate to perform computationally intensive applications, such as signal processing. Recently, the computational throughput of processor nodes has increased significantly because of increased processor speeds and the use of multiple processing cores within a processor node. For some applications, however, the communication bandwidth cannot keep up with the processing throughput of the processor nodes.
Although some applications can run effectively with limited communication bandwidth, other applications experience greatly reduced processor efficiency. For example, applications running large graph algorithms on parallel processors often suffer significant performance reduction because of limited communication bandwidth. Another example of communication-intensive processing is a corner turn operation, which is often conducted as part of a signal-processing application. For these signal-processing applications, beyond a certain point, adding more processor nodes to the parallel multiprocessor system may not improve the total computational throughput, again because of limitations in data communication.
Some commercial parallel multiprocessors utilize multiple vector processing units, multiple cores, or both to achieve very high computational throughput on a processor node, but support relatively little communication bandwidth. For example, Cell processor nodes, each performing 410 GFLOPS at peak operation, may be connected by only two 10 Gbps communication ports on each of the nodes. This amounts to approximately 0.05 bits of communication for every 32-bit operation at the peak processing rate. Because some applications can require communication rates that are significantly higher than 0.05 bits per operation, new types of communication networks are needed to support communication-intensive applications running on parallel multiprocessors.
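The ratio in the example above follows from dividing the aggregate port bandwidth of a node by its peak operation rate. A minimal sketch of that arithmetic is given here; the figures are those quoted in the text, while the constant and function names are illustrative assumptions only.

```python
# Figures quoted in the text; the names below are illustrative only.
PORTS_PER_NODE = 2
PORT_RATE_BPS = 10e9   # 10 Gbps per communication port
PEAK_FLOPS = 410e9     # 410 GFLOPS per processor node at peak


def bits_per_operation(ports, port_rate_bps, peak_flops):
    """Return communication bits available per operation at the peak rate."""
    return ports * port_rate_bps / peak_flops


ratio = bits_per_operation(PORTS_PER_NODE, PORT_RATE_BPS, PEAK_FLOPS)
print(f"{ratio:.3f} bits of communication per operation")
```

The result is a dimensionless ratio of bits communicated per operation performed, since the per-second terms in the numerator and denominator cancel.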
Many types of communication networks have been proposed to support communications between multiple processor nodes. These types of networks include 1-D ring, 2-D grid, 3-D grid, 2-D toroidal grid, 3-D toroidal grid, hypercube, tree, fat tree, FFT (Fast Fourier Transform) butterfly, and omega networks. However, making efficient use of the network, regardless of type, still poses a challenge.
For many applications, each processor node needs to send messages to a number of other processor nodes. A conventional communication optimization algorithm collects all the messages that a source processor node needs to send to a particular destination processor node, and sends the collection as a single communication message, thus attempting to minimize the associated communication overhead. To convey this message from source to destination, however, the communication network often needs to dedicate the communication paths involved in the message communication, potentially preventing other messages from traversing these same occupied paths.
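The collection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular system: messages are modeled as hypothetical (destination, payload) pairs, and the function name `aggregate_by_destination` is invented for the example.

```python
from collections import defaultdict


def aggregate_by_destination(outgoing):
    """Group (destination, payload) pairs into one collection per destination.

    Each resulting collection would be sent as a single communication
    message, so per-message overhead is paid once per destination.
    """
    batches = defaultdict(list)
    for dest, payload in outgoing:
        batches[dest].append(payload)  # preserves the original send order
    return dict(batches)


# Hypothetical traffic from one source node to destination nodes 3 and 7.
outgoing = [(3, b"ab"), (7, b"cd"), (3, b"ef")]
combined = aggregate_by_destination(outgoing)
```

With this grouping, the source sends two messages instead of three, but each combined message is longer and therefore occupies its communication path for a longer time.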
To use the network resources efficiently in this type of messaging system, the communications between multiple pairs of source and destination processor nodes require careful planning and management in order to maximize the simultaneous use of all communication paths within the network. Implementing such planning and management, however, can be difficult because optimizing network utilization requires simultaneously considering the message lengths and every communication path between all possible source and destination pairs. Therefore, in practice, the achievable network utilization is often low.
Because packet-switching communication networks do not typically require careful global message planning and management, parallel multiprocessor systems are adopting their use. In a packet-switching communication network, switching nodes make all routing decisions locally. In addition, long messages can be divided and transmitted as multiple short messages and reconstructed at the receiving end. However, if one source processor node needs to send many short messages to a destination processor node, these messages can monopolize certain communication paths between these two processor nodes, and prevent other messages from traversing these same paths. When other pairs of sources and destinations add their communications to the network, the congestion can worsen and result in poor overall throughput.
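The division of a long message into short messages, and its reconstruction at the receiving end, can be sketched as follows. This is a simplified illustration under stated assumptions: the packet format, a hypothetical (sequence number, chunk) pair, and the function names are invented for the example and do not describe any specific network protocol.

```python
def split_message(message: bytes, max_payload: int):
    """Split a message into (sequence_number, chunk) packets."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    offsets = range(0, len(message), max_payload)
    return [(seq, message[off:off + max_payload])
            for seq, off in enumerate(offsets)]


def reassemble(packets):
    """Rebuild the message from packets that may arrive in any order."""
    ordered = sorted(packets, key=lambda packet: packet[0])
    return b"".join(chunk for _, chunk in ordered)


packets = split_message(b"abcdefghij", max_payload=4)
restored = reassemble(reversed(packets))  # simulate out-of-order arrival
```

Because every packet carries its own sequence number, the receiver does not depend on arrival order, which is what allows switching nodes to route each packet using only local decisions.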