By way of background, HPC (high performance computing) clusters are groups of computational units connected in a specific network architecture by a high performance network. Each individual computational unit is called a node, and each node can include multiple processors. Some commonly used network architectures are 2D mesh, 3D torus, and InfiniBand, among others. HPC applications are mostly scientific applications (e.g., PDE [partial differential equation] computations, computational fluid dynamics) which can be run on massively parallel architectures. Each application includes a number of tasks, where each task performs some computation and different tasks communicate with one another.
By way of further background, collective operations denote communication operations involving multiple nodes (>=3). The set of nodes/processors on which the operation is performed is called a communicator in MPI (message passing interface) terminology, and the communicator denoting all nodes in the system is referred to as MPI_COMM_WORLD. Common collective operations include, by way of example, Broadcast, Reduce, Allreduce and Alltoall. The performance of MPI collectives is often critical in determining the performance of parallel scientific applications. Several algorithms are used for performing collective operations, for instance binomial tree, bucket, recursive doubling and ring algorithms.
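By way of illustration only, the recursive doubling algorithm named above can be sketched as a plain Python simulation of an Allreduce (sum). The sketch does not use an MPI library; the function name `recursive_doubling_allreduce` is hypothetical, and the restriction to a power-of-two number of ranks is an assumption made to keep the example short.

```python
def recursive_doubling_allreduce(values):
    """Simulate an Allreduce (sum) over len(values) ranks by recursive doubling.

    values[r] is the local contribution of rank r. Returns the per-rank
    results and the number of exchange steps performed.
    Assumption: the number of ranks is a power of two.
    """
    p = len(values)
    if p == 0 or p & (p - 1):
        raise ValueError("this sketch assumes a power-of-two number of ranks")

    current = list(values)
    distance = 1
    steps = 0
    while distance < p:
        # In each step, rank r exchanges its partial sum with rank r XOR distance,
        # and both ranks keep the combined value.
        current = [current[r] + current[r ^ distance] for r in range(p)]
        distance *= 2
        steps += 1
    return current, steps


result, steps = recursive_doubling_allreduce([1, 2, 3, 4, 5, 6, 7, 8])
```

In this simulation every rank ends up holding the full sum after log2(p) exchange steps, which is the property that makes recursive doubling attractive for Allreduce-type collectives.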
Many HPC applications involve synchronized collective operations over multiple processor groups. That is, many scientific applications involve collective operations not only on the entire processor partition, but also on sub-partitions (i.e., sub-communicators in MPI). For example, in linear algebra operations, the processor partition is decomposed into row and column partitions to compute matrix multiplication. In 3D FFT via 2D decomposition, there are row and column transposes that result in row and column all-to-all operations. In molecular dynamics, each processor communicates with the other processors that contain atoms interacting with its own atoms, resulting in an arbitrary subset of processors forming a sub-communicator.
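By way of illustration only, the row and column decomposition described above can be sketched as follows. The sketch assumes that ranks are numbered in row-major order on a logical grid; the function name `grid_subcommunicators` is hypothetical. In an actual MPI program, the row index and column index computed here would typically serve as the `color` argument to `MPI_Comm_split`.

```python
def grid_subcommunicators(n_rows, n_cols):
    """Group ranks 0 .. n_rows*n_cols-1 into row and column sub-communicators.

    Assumption: ranks are laid out in row-major order on the logical grid.
    Returns two dicts mapping a row (or column) index to its list of ranks.
    """
    row_groups = {}
    col_groups = {}
    for rank in range(n_rows * n_cols):
        row, col = divmod(rank, n_cols)
        # Ranks sharing a row index form one row sub-communicator;
        # ranks sharing a column index form one column sub-communicator.
        row_groups.setdefault(row, []).append(rank)
        col_groups.setdefault(col, []).append(rank)
    return row_groups, col_groups


rows, cols = grid_subcommunicators(2, 3)
```

Within either decomposition, each rank belongs to exactly one group, so all row groups (or all column groups) can run their collectives in the same phase.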
Some salient features of problems such as those discussed above are as follows. There is generally more than one processor group, and each node occurs in exactly one processor group. The applications are generally bandwidth-bound. The sub-communicator communication should generally be synchronous, i.e., all processors should enter and leave the communication phase at the same time. In conventional implementations, independent processor groups communicate simultaneously. However, these communications can interfere with each other on shared network links and create a bottleneck.
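By way of illustration only, the interference noted above can be shown with a toy model. The model is an assumption made for this sketch and is not taken from any particular system: eight nodes sit on a 1D line, two processor groups are interleaved on it, each member sends one message to the next member of its own group, and every message crosses all links between its source and destination. The helper names `link_loads` and `neighbor_messages` are hypothetical.

```python
from collections import Counter


def link_loads(messages):
    """Count messages crossing each link of a 1D line of nodes.

    messages is a list of (src, dst) pairs with src < dst; a message crosses
    the links (src, src+1), ..., (dst-1, dst).
    """
    loads = Counter()
    for src, dst in messages:
        for node in range(src, dst):
            loads[(node, node + 1)] += 1
    return loads


def neighbor_messages(group):
    """Each group member sends one message to the next member of its group."""
    return list(zip(group, group[1:]))


# Assumed placement: two groups interleaved on nodes 0..7.
group_a = [0, 2, 4, 6]
group_b = [1, 3, 5, 7]

# Worst-case link load when group A communicates alone.
alone = max(link_loads(neighbor_messages(group_a)).values())

# Worst-case link load when both groups communicate at the same time.
together = max(
    link_loads(neighbor_messages(group_a) + neighbor_messages(group_b)).values()
)
```

In this toy model the most heavily used link carries one message when a group communicates alone and two messages when both groups communicate at once, so a bandwidth-bound exchange would see roughly half the per-message bandwidth on that link.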