A symmetric multi-processor (SMP) refers to an aspect of hardware in a computing system, and more particularly, relates to the physical layout and design of the processor planar itself. Such multiple processor units have, as one characteristic, the sharing of global memory as well as equal access to input/output (I/O) of the SMP system. An SMP cluster refers to an environment wherein multiple SMP systems/nodes are coupled together for parallel computing. SMP clusters continue to become more popular, and are now widely deployed in the area of scientific and engineering parallel computing. These cluster environments typically include hundreds of SMP nodes connected by low latency, high bandwidth switch networks, such as the High Performance Switch (HPS) offered by International Business Machines (IBM) Corporation of Armonk, N.Y. Each SMP node has, for example, two to sixty-four CPUs and often has more than one switch adapter to bridge the gap between switch and single adapter capability. For instance, two switch adapters can be installed on an IBM eServer pSeries 655, which has eight IBM Power4 CPUs.
As further background, the message passing interface (MPI) standard defines the following scheme: processes in a parallel job exchange messages within a communication domain (or “communicator”), which guarantees the integrity of messages within that domain. Messages issued in one domain do not interfere with messages issued in another. Once a parallel job begins, subsets of the processes may collaborate to form separate communication domains as needed.
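By way of illustration only, the isolation between communication domains can be modeled as message matching on a per-domain identifier. The following is a minimal pure-Python sketch, not MPI code: the `Mailbox` class, its methods, and the integer `context_id` are hypothetical names introduced here, and the assumption that each communicator carries a distinct identifier is a simplification of how an MPI library might enforce the guarantee.

```python
from collections import defaultdict, deque


class Mailbox:
    """Toy per-process mailbox (hypothetical) that matches by communicator."""

    def __init__(self):
        # One FIFO queue per communication domain, keyed by an assumed
        # integer identifier for that domain.
        self._queues = defaultdict(deque)

    def deliver(self, context_id, source, data):
        """Queue a message that was issued within the given domain."""
        self._queues[context_id].append((source, data))

    def receive(self, context_id):
        """Return the oldest message of this domain, or None if none waits."""
        queue = self._queues[context_id]
        return queue.popleft() if queue else None
```

In this model, a receive posted in one domain can never consume a message issued in another, regardless of arrival order.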
The MPI standard defines a set of collective communication operations. Some of the MPI collectives are “rooted”, meaning that either the source or the sink of the message is a single MPI process. These collectives are for one-to-many or many-to-one communication patterns. The most commonly used are MPI_Bcast and MPI_Reduce. Non-rooted collectives, such as MPI_Barrier, MPI_Allreduce and MPI_Alltoall, are for many-to-many communication patterns.
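The difference between rooted and non-rooted collectives can be seen in a small model of their results. The sketch below is pure Python and does not call any MPI library; each list holds one entry per process, indexed by rank, and the function names are hypothetical stand-ins for the semantics of MPI_Bcast, MPI_Reduce, MPI_Allreduce and MPI_Alltoall, assuming an associative combining operation.

```python
import operator
from functools import reduce as fold


def bcast(values, root):
    """Rooted, one-to-many: every rank ends up with the root's value."""
    return [values[root] for _ in values]


def reduce_to_root(values, root, op=operator.add):
    """Rooted, many-to-one: only the root receives the combined result."""
    combined = fold(op, values)
    return [combined if rank == root else None for rank in range(len(values))]


def allreduce(values, op=operator.add):
    """Non-rooted, many-to-many: every rank receives the combined result."""
    combined = fold(op, values)
    return [combined for _ in values]


def alltoall(send):
    """Non-rooted: rank src sends send[src][dst] to rank dst (a transpose)."""
    size = len(send)
    return [[send[src][dst] for src in range(size)] for dst in range(size)]
```

For example, `reduce_to_root([1, 2, 3, 4], root=2)` leaves the sum only at rank 2, whereas `allreduce([1, 2, 3, 4])` leaves it at every rank.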
On SMP clusters, the collectives (e.g., occurring within the context of MPI communicators) usually follow a hierarchical message distribution model to take advantage of the fast shared memory communication channel on each SMP node. With the rapid development of switch technology, however, a single MPI process often cannot fully utilize the available switch network capacity. Striping techniques have been used to achieve higher bandwidth than one adapter can deliver, but do not help meet latency requirements.
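The hierarchical model described above can be sketched for a broadcast as two stages: one process per SMP node exchanges the message over the switch, and the remaining processes on each node obtain it through shared memory. The following pure-Python simulation is illustrative only. The function name, the `node_of_rank` layout, the choice of the lowest rank on each node as its leader, and the flat fan-out from the root to the leaders are all assumptions made for brevity; the sketch only counts transfers and does not model timing.

```python
def hierarchical_bcast(node_of_rank, root, payload):
    """Simulate a two-stage broadcast; return (received, switch_msgs, shm_copies).

    node_of_rank[r] is the SMP node hosting rank r (assumed layout).
    """
    # Pick one leader per node: the lowest rank, except on the root's node,
    # where the root itself acts as leader.
    leaders = {}
    for rank, node in enumerate(node_of_rank):
        leaders.setdefault(node, rank)
    leaders[node_of_rank[root]] = root

    received = {root: payload}

    # Stage 1: the root sends to the leader of every other node (switch).
    switch_msgs = 0
    for leader in leaders.values():
        if leader != root:
            received[leader] = payload
            switch_msgs += 1

    # Stage 2: each leader shares the message with the other ranks on its
    # node (shared memory).
    shm_copies = 0
    for rank, node in enumerate(node_of_rank):
        if rank not in received:
            received[rank] = received[leaders[node]]
            shm_copies += 1

    return received, switch_msgs, shm_copies
```

With eight ranks split evenly across two nodes, this sketch uses a single switch message and six shared memory copies. It also shows the limitation noted above: only one process per node drives the switch during the inter-node stage.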
Thus, a new communication approach for collectives of an SMP cluster environment is desirable wherein the switch/adapter capacity is fully utilized, and shared memory facilitates the inter-SMP communication portion of the collective operations.