1. Field of the Invention
This invention relates generally to the field of distributed-memory, message-passing parallel computer design and system software, and more particularly to a novel method and apparatus for interconnecting individual processors for use in, for example, a massively parallel, distributed-memory computer.
2. Discussion of the Prior Art
Massively parallel computing structures (also referred to as “ultra-scale computers” or “supercomputers”) interconnect large numbers of compute nodes, generally in the form of very regular structures such as grids, lattices, or tori.
One problem commonly faced on such massively parallel systems is the efficient computation of a collective arithmetic or logical operation involving many nodes. A second problem commonly faced on such systems is the efficient sharing of a limited number of external I/O connections by all of the nodes. One example of a common computation involving collective arithmetic operations over many compute nodes is found in iterative sparse linear equation solving techniques, which require a global inner product based on a global summation.
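By way of illustration only, a global inner product based on a global summation may be sketched as follows. The sketch simulates the compute nodes within a single process; the function names (`local_dot`, `global_sum`, `distributed_inner_product`), the block partitioning of the vectors, and the pairwise combining order are assumptions made for this example and are not taken from the disclosure.

```python
def local_dot(x_chunk, y_chunk):
    """Partial inner product computed by one node from its own operands."""
    return sum(a * b for a, b in zip(x_chunk, y_chunk))


def global_sum(partials):
    """Combine one partial result per node into a single value.

    Assumption: values are combined pairwise, level by level, the way a
    tree-shaped reduction would combine them.
    """
    values = list(partials) or [0]
    while len(values) > 1:
        combined = [values[i] + values[i + 1]
                    for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:          # odd node out is carried to the next level
            combined.append(values[-1])
        values = combined
    return values[0]


def distributed_inner_product(x, y, num_nodes):
    """Inner product of x and y with the data split across num_nodes nodes."""
    chunk = max(1, -(-len(x) // num_nodes))   # ceiling division, at least 1
    partials = [local_dot(x[i:i + chunk], y[i:i + chunk])
                for i in range(0, len(x), chunk)]
    return global_sum(partials)


if __name__ == "__main__":
    print(distributed_inner_product([1, 2, 3, 4, 5], [5, 4, 3, 2, 1], 3))
```

Each simulated node contributes exactly one operand to the summation, which is the collective step whose latency depends on how the nodes are interconnected.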
While the three-dimensional torus interconnect computing structure 10 shown in FIG. 1, comprising a simple 3-dimensional nearest neighbor interconnect which is “wrapped” at the edges, works well for most types of inter-processor communication, it does not perform as well for collective operations such as reductions, where a single result is computed from operands provided by each of the compute nodes 12, or for the efficient sharing of limited resources such as external I/O connections (not shown).
It would thus be highly desirable to provide an ultra-scale supercomputing architecture that comprises a unique interconnection of processing nodes optimized for efficiently and reliably performing many classes of operations including those requiring global arithmetic operations such as global reduction computations, data distribution, synchronization, and limited resource sharing.
The normal connectivity of high-speed networks such as the torus is simply not fully suited for this purpose because of longer latencies.
That is, mere mapping of a tree communication pattern onto the physical torus interconnect results in a tree of greater depth than is necessary if adjacent tree nodes are required to be adjacent on the torus, or a tree with longer latency between nodes when those nodes are not adjacent in the torus. In order to compute collective operations most efficiently when interconnect resources are limited, a true tree network is required, i.e., a network where the physical interconnections between nodes form the nodes into a tree.
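The depth penalty described above may be illustrated, by way of example only, with the following sketch. It compares the minimum depth of a tree whose edges are all torus links against the depth of a complete tree built from dedicated links. The 8×8×8 torus dimensions, the fanout of 2, and the function names are assumptions chosen for this example and do not appear in the disclosure.

```python
def torus_embedded_tree_depth(dims):
    """Minimum depth of a spanning tree whose edges are all torus links.

    Every tree edge is one torus hop, so the depth cannot be less than the
    largest hop distance from the root, and a breadth-first tree achieves
    that bound. On a wrapped axis of length d the farthest node is d // 2
    hops away, and the torus looks the same from every root.
    """
    return sum(d // 2 for d in dims)


def dedicated_tree_depth(num_nodes, fanout=2):
    """Depth of a complete fanout-ary tree holding num_nodes nodes.

    The root is at depth 0; levels are filled until all nodes fit.
    """
    depth, capacity, level_size = 0, 1, 1
    while capacity < num_nodes:
        level_size *= fanout
        capacity += level_size
        depth += 1
    return depth


if __name__ == "__main__":
    dims = (8, 8, 8)                      # hypothetical 512-node torus
    nodes = dims[0] * dims[1] * dims[2]
    print(torus_embedded_tree_depth(dims))   # 12
    print(dedicated_tree_depth(nodes))       # 9
```

Under these assumptions the gap widens as the machine grows, since the torus-constrained depth scales with the sum of the half-dimensions while the dedicated tree depth scales with the logarithm of the node count.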