1. Field of the Invention
This invention relates generally to the field of distributed-memory message-passing parallel computer design and system software, and more particularly, to a novel method and apparatus for interconnecting individual processors for use in a massively-parallel, distributed-memory computer, for example.
2. Discussion of the Prior Art
Massively parallel computing structures (also referred to as “ultra-scale computers” or “supercomputers”) interconnect large numbers of processing nodes, generally in the form of very regular structures, such as grids, lattices or tori. One problem commonly faced on such massively parallel systems is the efficient computation of a collective arithmetic or logical operation involving many nodes. One example of a common computation involving collective arithmetic operations over many processing nodes is iterative sparse linear equation solving, which requires a global inner product based on a global summation. Such a collective computation is not implemented in the hardware of conventional networks. Instead, the collective computation involves software on each processing node to treat each packet in the computation, the latency of which can be on the order of 100 times that of an equivalent hardware treatment. Furthermore, there may be insufficient processing power for the software treatment to keep up with the network bandwidth. In addition, the topology of a conventional, multi-purpose network may reduce the efficiency of a collective computation, which is based on the longest path from any processing node involved in the computation to the processing node where the final result is produced.
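By way of illustration only, the global inner product mentioned above can be sketched as a local partial product on each node followed by a global summation. The sketch below simulates the nodes in a single process; the function names, the pairwise combining order, and the block partitioning of the vectors are assumptions made for this example and are not taken from any particular system.

```python
def local_dot(x_part, y_part):
    """Partial inner product computed by one node on its own slice."""
    return sum(a * b for a, b in zip(x_part, y_part))


def tree_reduce_sum(values):
    """Sum per-node values by pairwise combining.

    Returns (total, rounds), where rounds is the number of combining
    steps on the longest path to the node holding the final result.
    """
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Combine neighbours two at a time; an odd value is carried up.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        rounds += 1
    return level[0], rounds


def global_inner_product(x, y, num_nodes):
    """Split x and y into contiguous blocks (hypothetical layout),
    compute one partial product per simulated node, then reduce."""
    chunk = -(-len(x) // num_nodes)  # ceiling division
    partials = [
        local_dot(x[i * chunk:(i + 1) * chunk], y[i * chunk:(i + 1) * chunk])
        for i in range(num_nodes)
    ]
    return tree_reduce_sum(partials)


if __name__ == "__main__":
    total, rounds = global_inner_product(list(range(8)), [1] * 8, num_nodes=4)
    print(total, rounds)
```

In a software-only treatment, each combining round in this sketch corresponds to packets being handled by software on a processing node, which is where the latency penalty described above arises.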
A second problem commonly faced on massively-parallel systems is the efficient sharing of a limited number of external I/O connections by all of the processing nodes. Typically, this sharing is handled by assigning processing nodes to act as middlemen between the external connections and other processing nodes. These nodes can either be dedicated to the job of handling input/output (I/O), or they can perform application computations as well. In either case, the network traffic caused by I/O can be disruptive because it is often asynchronous with respect to the application's communication. For example, massively parallel systems often output checkpoints or partial results while a computation is in progress. A second drawback of sharing a single network between I/O and application communication is that the I/O bandwidth is limited to the bandwidth of the shared network. A third drawback is that the topology of a shared network may restrict the freedom to use an optimal number of dedicated I/O processing nodes or locate them optimally. For example, many massively-parallel systems use a grid interconnect because it is a regular and scalable topology. In order not to disrupt the regularity, which is good for applications, dedicated I/O processing nodes are usually located at the edges of the grid, which is relatively far from processing nodes at the center of the grid. In a torus interconnect, dedicated I/O processing nodes may need to occupy an entire row of the torus in order not to affect the regularity of the structure.
While the three-dimensional torus interconnect computing structure 10 shown in FIG. 1, comprising a simple 3-dimensional nearest-neighbor interconnect that is “wrapped” at the edges, works well for most types of inter-processor communication, it does not perform as well for collective operations such as reductions, where a single result is computed from operands provided by each of the compute nodes 12, or for efficient sharing of limited resources such as external I/O connections (not shown).
It would thus be highly desirable to provide a network architecture that comprises a unique interconnection of processing nodes optimized for efficiently and reliably performing many classes of operations including those requiring global arithmetic operations such as global reduction computations, data distribution, synchronization, and limited resource sharing. A dedicated network that efficiently supports collective communication patterns serves these needs well.
The normal connectivity of high-speed networks such as the torus is simply not fully suited for this purpose because of longer latencies and because of the disruptive nature of I/O. That is, merely mapping a collective communication pattern onto the physical torus interconnect results either in a tree-shaped pattern of greater depth than is necessary, if adjacent nodes of the tree-shaped pattern are required to be adjacent on the torus, or in a tree with longer latency between nodes, when those nodes are not adjacent in the torus. In order to compute collective operations most efficiently and support simultaneous application messaging and I/O transfers, a dedicated collective network is required.
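The depth penalty described above can be illustrated with a small calculation. The sketch below measures the depth of a breadth-first spanning tree whose edges are all torus links, and compares it with the minimum depth of a tree that is free of that adjacency constraint. The 8x8x8 dimensions and the fan-in of two children per node are assumptions chosen purely for illustration; the text above does not specify either.

```python
from collections import deque


def torus_tree_depth(dims):
    """Depth of a breadth-first spanning tree on a 3-D torus in which
    every tree edge is a nearest-neighbor torus link (root at origin)."""
    root = (0, 0, 0)
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for axis in range(3):
            for step in (-1, 1):
                nbr = list(node)
                # Wrap-around link at the edges of the torus.
                nbr[axis] = (nbr[axis] + step) % dims[axis]
                nbr = tuple(nbr)
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
    return max(dist.values())


def binary_tree_depth(num_nodes):
    """Minimum depth of a binary tree holding num_nodes nodes
    (assumed fan-in of two; root is at depth 0)."""
    return num_nodes.bit_length() - 1


if __name__ == "__main__":
    dims = (8, 8, 8)  # hypothetical machine size
    n = dims[0] * dims[1] * dims[2]
    print("torus-constrained tree depth:", torus_tree_depth(dims))
    print("unconstrained binary tree depth:", binary_tree_depth(n))
```

Under these assumptions the torus-constrained tree is several levels deeper than the unconstrained one, and a larger fan-in would widen the gap further.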