This invention relates generally to networks and, more specifically, relates to switch architectures for networks.
This section is intended to provide a background or context to the invention disclosed below. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived, implemented or described. Therefore, unless otherwise explicitly indicated herein, what is described in this section is not prior art to the description in this application and is not admitted to be prior art by inclusion in this section.
Collective communication involves more than one process participating in a single communication operation in a network of compute nodes. Collective communication operations aim to reduce both latency and network traffic relative to implementing the same operations as a sequence of unicast messages. The significance of collective communication operations for scalable parallel systems is reflected in their inclusion in widely used parallel programming models, such as the Message Passing Interface (MPI).
As such, collective reduction and broadcast operations are commonly used in High Performance Computing (HPC) applications. An example is the MPI_Allreduce( ) function supported in the MPI library. For this function, in a cluster of compute nodes, each node contributes one or more numbers, and the result of MPI_Allreduce( ) is one sum or a vector of sums of all corresponding numbers from each node. The final result is then broadcast to all participating nodes.
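As an illustration of the reduce-then-broadcast semantics described above, the following minimal sketch simulates a sum all-reduce in plain Python. It does not use an MPI library; the function name allreduce_sum and the list-of-lists representation of per-node contributions are assumptions made only for this example.

```python
from typing import List, Sequence


def allreduce_sum(contributions: Sequence[Sequence[float]]) -> List[List[float]]:
    """Simulate a sum all-reduce; contributions[n] is the vector supplied by node n.

    Hypothetical helper for illustration only, not an MPI API.
    """
    if not contributions:
        return []
    length = len(contributions[0])
    if any(len(vec) != length for vec in contributions):
        raise ValueError("every node must contribute a vector of the same length")
    # Reduction step: sum the corresponding numbers from each node.
    sums = [sum(vec[i] for vec in contributions) for i in range(length)]
    # Broadcast step: every participating node receives its own copy of the result.
    return [list(sums) for _ in contributions]


# Three nodes, each contributing a vector of two numbers.
per_node = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
results = allreduce_sum(per_node)
```

In this sketch every entry of results holds the same vector of sums, mirroring the final broadcast to all participating nodes. A real implementation would exchange messages over the network rather than gather all contributions in one place.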
Collective operations are typically classified as short or long. As examples, a short operation can involve a single double precision number (8 bytes) per node, and a long operation can involve at least one network packet's worth of data (e.g., 256 bytes or more) per node. Exact definitions of these terms depend on the implementation. In short collective operations (collective operations are often called “collectives”), each node contributes only a few numbers, and the latency of the operation is very important. In long collectives, where each node supplies a long vector of numbers, the overall collective reduction bandwidth is an important measure. For floating point reductions, the order of operations matters, because floating point addition is not associative. A fixed order of operations can generate reproducible results, whereas an order that is not fixed may produce results that differ from one run to another.
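The dependence of a floating point reduction on operation order can be seen in a small sketch. The helper reduce_in_order and the sample values are assumptions chosen for illustration; the code simply sums the same three double precision numbers in two different orders.

```python
from functools import reduce
import operator


def reduce_in_order(values, order):
    """Sum values[i] for i in order, strictly left to right (hypothetical helper)."""
    return reduce(operator.add, (values[i] for i in order))


# The same three contributions, reduced in two different orders.
values = [1e16, 1.0, -1e16]

# (1e16 + 1.0) + (-1e16): the 1.0 is lost to rounding in the first addition.
left_to_right = reduce_in_order(values, [0, 1, 2])

# (1e16 + (-1e16)) + 1.0: the large terms cancel first, so the 1.0 survives.
reordered = reduce_in_order(values, [0, 2, 1])
```

Here left_to_right evaluates to 0.0 while reordered evaluates to 1.0, even though both sums use identical inputs. A reduction that always combines contributions in the same order returns the same value on every run, which is the sense in which a fixed order gives reproducible results.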
Direct hardware support for collectives in the network can reduce collective reduction latency for short collectives and improve bandwidth for long collectives. The IBM BLUE GENE family of supercomputers supports one collective reduction operation (short or long) at a time per node in the embedded network logic, with reproducible floating point results. The torrent network of the IBM POWER 7IH (P7IH), a high performance computer in which the IBM torrent chip serves as the network hub chip, supports multiple short collectives in hardware, but may not guarantee reproducibility for floating point operations. The associated project for the P7IH is PERCS (Productive, Easy-to-use, Reliable Computing System), as described in, e.g., G. Tanase et al., “Composable, non-Blocking Collective Operations on Power7 IH”, ICS'12, Jun. 25-29, 2012. As HPC systems evolve, it is imperative for the network hardware to support multiple collective operations at the same time, e.g., with low latency for short collectives and high bandwidth for long collectives, and to generate reproducible results for floating point reductions.