This invention relates generally to all-to-all message exchange operations in parallel computing systems. More specifically, the present invention relates to methods for all-to-all message exchange between program tasks connected by a hierarchical interconnection network.
In parallel computing systems, programs can be executed by a plurality of compute nodes operating in parallel. The compute nodes might in general be separate machines (e.g. workstations, servers), processors, cores, etc., depending on the hardware level at which parallelism is implemented. An individual compute node can execute one or more of the parallel program entities, or tasks, of a parallel algorithm (where the term “task” herein refers to such a program entity in general without implying any particular level of granularity). Most parallel algorithms alternate between phases of computation and communication, wherein data is exchanged by the program tasks. The set of compute nodes which collectively implement the parallel algorithm are commonly interconnected via a network to permit this data exchange.
The way in which data is exchanged among a group of parallel tasks can vary widely, but, in practice, most data exchanges involving more than a pair of tasks can be mapped to a small set of typical exchanges. One of the most widely used collective communication operations is the all-to-all exchange (sometimes referred to as an all-exchange, index operation or personalized all-to-all exchange). In an all-to-all message exchange, each task in a given set must send one distinct message to every other task in that set (and in some cases also to itself). The exchange operation is typically organized in a succession of phases, the number of which equals the number of messages to be sent by each task, such that each task sends one message in each phase of the exchange. The overall exchange pattern, i.e. the pattern according to which source (sending) tasks communicate with destination (receiving) tasks in the successive phases of the exchange, is fundamental to the overall efficiency of the exchange. A simple way to verify this is to consider the extreme case where all sending tasks choose the same destination task simultaneously in a given phase. All senders will experience congestion because of the serialization of messages at the input port of the receiver. All these blocked messages in the interconnection network can create even more congestion and severely impact performance. This extreme example is easily circumvented, and most all-to-all exchange proposals address this particular scenario.
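The extreme case described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the eight-task example are assumptions made for this sketch and do not come from any particular communication library. It counts how many messages arrive at the busiest receiver in a given phase.

```python
from collections import Counter


def max_receiver_load(dest, num_tasks, phase):
    """Largest number of messages arriving at a single task in one phase.

    `dest(s, p)` is a hypothetical callable returning the destination of
    source task `s` in phase `p`.
    """
    loads = Counter(dest(s, phase) for s in range(num_tasks))
    return max(loads.values())


def same_target(s, p):
    # Degenerate pattern: every sender picks task p in phase p.
    return p


def shifted(s, p, num_tasks=8):
    # Each sender picks a different destination in every phase.
    return (s + p) % num_tasks


print(max_receiver_load(same_target, 8, 3))  # all 8 messages hit one input port
print(max_receiver_load(shifted, 8, 3))      # one message per receiver
```

Both patterns deliver one message from each task to every task over eight phases, yet the first serializes all messages of a phase at a single receiver while the second spreads them evenly.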
Formally, an all-to-all exchange pattern is completely characterized by the function ƒ: ℕ×ℕ→ℕ that takes a source task index (s) and a phase index (p) and maps them to a destination task index (d), such that each s sends one (and only one) message to each d, and each d receives one (and only one) message from each s. Two very common exchanges, present in most communication libraries, are:
(a) the linear shift (or “strided”) exchange represented by:
(s, p)→(s+p+shift) modulo X, where “shift” is a fixed integer value and X is the total number of communicating tasks; and
(b) the XOR (“binary XOR” or “recursive halving”) exchange represented by:
(s, p)→s XOR p.
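The two exchange functions and the defining property of an all-to-all pattern can be expressed as the following Python sketch. The function names and the `is_valid_all_to_all` checker are illustrative assumptions, not part of any library; the checker verifies that every phase is a permutation of the tasks and that every source reaches every destination exactly once.

```python
def linear_shift(s, p, num_tasks, shift=0):
    """Linear shift exchange: (s, p) -> (s + p + shift) modulo X."""
    return (s + p + shift) % num_tasks


def xor_exchange(s, p):
    """XOR exchange: (s, p) -> s XOR p."""
    return s ^ p


def is_valid_all_to_all(dest, num_tasks):
    """Check that dest(s, p) defines a valid all-to-all exchange pattern."""
    everyone = set(range(num_tasks))
    # In each phase, every task must receive exactly one message.
    for p in range(num_tasks):
        if {dest(s, p) for s in range(num_tasks)} != everyone:
            return False
    # Over all phases, each source must reach every destination once.
    for s in range(num_tasks):
        if {dest(s, p) for p in range(num_tasks)} != everyone:
            return False
    return True


print(is_valid_all_to_all(lambda s, p: linear_shift(s, p, 16), 16))
print(is_valid_all_to_all(xor_exchange, 16))
```

For sixteen tasks both checks succeed. The linear shift remains valid for any number of tasks and any fixed shift value, since adding a constant modulo X is a bijection.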
The exact structure of the interconnection network via which tasks are connected has a strong impact on message exchange operations. Such networks commonly have some form of hierarchical network topology. Hierarchical networks include explicitly hierarchical topologies, such as dragonfly networks, as well as tree-shaped topologies such as tree networks and fat tree networks (including extended generalized fat trees, slimmed fat trees, etc.). These are explained briefly in the following.
In tree-shaped topologies, the hierarchy is defined by the series of levels from the leaves (level 0) to the root(s) (level N) of the tree. Tasks are considered to be placed on the leaf nodes, whereas all other nodes are used for message routing. FIG. 1 of the accompanying drawings shows a simple example of a three-level tree interconnect with one task per compute node. The compute nodes, represented by circles, form the leaves in level 0 of the tree here. The higher levels are made up of switches (represented by squares) each of which is connected via links (represented by lines) to a group of descendants, or “children”, in the immediately preceding level. In the example shown, level 1 switches are each connected to a group of three compute nodes in level 0. Level 2 switches are each connected to a group of three level 1 switches. Level 3, the highest level in this example, consists here of a single switch, again connected to a group of three switches in the preceding level. The well-known fat tree network topology is similar to such a standard tree topology, having N levels above the leaves in level 0 with each node on a level l having exactly M_l descendants, with the difference that the connection between any given node and its parent is made up of multiple links. In the original design, the link capacity available from a node to its parent was equal to the aggregate link capacity from that node's children to itself. Consequently, the total capacity of each upward link at level l equals the total number of leaves reachable from the node where the upward link originates (which is equal to M_1·M_2· . . . ·M_l, where “·” denotes multiplication) times the injection capacity per node. As this number grows exponentially with the height of the tree, the concept of extended generalized fat trees (XGFTs) was introduced.
This class of topologies achieves a design that is functionally similar to that of basic fat tree networks, without requiring switches with capacity increasing exponentially towards the roots of the network. XGFTs are currently one of the most popular options for interconnect design in high performance computing.
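The capacity growth that motivated XGFTs can be made concrete with a short Python sketch. The function names, the list representation of the per-level numbers of descendants, and the unit injection capacity are assumptions made for illustration only.

```python
from math import prod


def leaves_below(children_per_level, level):
    """Number of leaves reachable from a node at the given level.

    `children_per_level[i]` holds the number of descendants of each node
    at level i + 1, so the result is the product over levels 1..level.
    """
    return prod(children_per_level[:level])


def upward_capacity(children_per_level, level, injection_capacity=1.0):
    """Capacity of the upward link of a level-`level` node in a basic fat tree."""
    return leaves_below(children_per_level, level) * injection_capacity


# Three levels with three descendants per node, as in the tree of FIG. 1.
fanout = [3, 3, 3]
for level in range(len(fanout)):
    print(level, upward_capacity(fanout, level))
```

With three descendants per node, the upward capacity triples at each level (1, 3 and 9 injection units for levels 0, 1 and 2), which illustrates the exponential growth in required link capacity towards the root.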
Dragonfly networks are another well-known type of hierarchical network, of which FIG. 2 shows a simple example. This example has one task per compute node (again represented by circles), and the compute nodes are connected in groups of two to respective switches (again represented by squares) in level 1 of the hierarchy. The higher levels are indicated by the broken lines in the figure. Level 2 includes four groups of level 1 switches, each level 2 group being a “local group” of three interconnected level 1 switches here. Level 3, the highest level here, includes a single group containing all four level 2 local groups.
While a given all-to-all exchange operation can complete successfully regardless of the underlying topology, the exchange pattern can result in sub-optimal performance. By way of illustration, FIG. 3 shows an exchange pattern for a linear shift exchange between sixteen tasks connected by an interconnection network with the hierarchical network topology shown in FIG. 4. The particular network implementation here can be an (N=2)-level tree-shaped network with the sixteen communicating tasks placed in respective compute nodes at the leaf level, four first-level switches and a single second-level switch. A first level, labelled l1 here, of the topology hierarchy includes four l1 groups, each of four tasks. The next (here highest) level, labelled l2, includes a single l2 group being the group of all four level l1 groups. The exchange pattern of FIG. 3 is the linear shift exchange pattern in this topology with a shift value of 0. The tasks are denoted by the circles, numbered 0 to 15 on the left of the figure. The lines in successive columns of the figure show the pairing of sending and destination tasks in the sixteen successive phases of the exchange. In phase 0, as indicated by the dotted lines here, each task sends a message to itself.
It is apparent from a consideration of the FIG. 3 exchange pattern that the linear shift exchange function completely ignores the layout of the network topology. This exchange pattern is thus oblivious to the hierarchical structure of the topology, taking no account of hierarchical distance (i.e. number of hierarchy levels which must be traversed for communication) between sending and receiving nodes. This concept of hierarchical distance is, however, fundamental to hierarchical networks, providing a notion of locality/remoteness which is inherent in these topologies. The fewer hierarchy levels that separate a pair of tasks, the “closer” the tasks are, i.e., the shorter the path between them, and the lower the latency to reach each other.
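The notion of hierarchical distance can be sketched in Python for the sixteen-task topology discussed above. The sketch assumes that tasks are numbered consecutively within each group (tasks 0 to 3 in the first l1 group, 4 to 7 in the second, and so on); the function names are hypothetical and chosen for this illustration.

```python
def hierarchical_distance(a, b, group_sizes=(4, 4)):
    """Number of hierarchy levels that must be traversed between tasks a and b.

    Assumes consecutive numbering of tasks within each group;
    `group_sizes[k]` is the number of members of each level k + 1 group.
    """
    span, level = 1, 0
    while a // span != b // span:
        span *= group_sizes[level]
        level += 1
    return level


def linear_shift(s, p, num_tasks=16, shift=0):
    return (s + p + shift) % num_tasks


def distances_in_phase(p, num_tasks=16):
    """Sorted list of distinct hierarchical distances occurring in phase p."""
    return sorted({hierarchical_distance(s, linear_shift(s, p, num_tasks))
                   for s in range(num_tasks)})


mixed = [p for p in range(16) if len(distances_in_phase(p)) > 1]
print(distances_in_phase(1))  # phase 1 mixes local and remote transfers
print(mixed)                  # phases in which distances differ between pairs
```

Under these assumptions, phases 1, 2, 3, 13, 14 and 15 of the linear shift exchange each contain both intra-group and inter-group transfers, so messages sent in the same phase experience different path lengths.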
FIG. 5 shows the exchange pattern obtained with the XOR exchange algorithm for the hierarchical network topology of FIG. 4. It can be seen that the first four phases handle local exchanges between tasks in the same level l1 group. The subsequent phases handle exchanges between nodes in different level l1 groups. As illustrated by this simple example, application of the XOR exchange pattern in this network topology results in the message exchange being performed in increasing order of remoteness. Thus, tasks collocated on the same node first perform exchanges among themselves (if there is more than one task per compute node), then tasks in neighboring nodes (in the hierarchical sense) perform exchanges exclusively among themselves (so intra-node exchanges are excluded), and so on, progressing through the hierarchy. This has the advantage of ensuring that traffic is contained as much as possible at lower levels in a majority of phases, as well as ensuring that communication latency is constant (when no contention is present) between all pairs in a given phase. This provides much better synchronization within each of the phases. Synchronization is a critical factor in optimizing the overall performance of the exchange, as desynchronization implies either that subsequent phases will overlap, thus causing additional contention, or that gaps in between subsequent phases will emerge.
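The "increasing remoteness" property of the XOR exchange in this topology can be checked with a short Python sketch. As before, consecutive numbering of tasks within each l1 group is assumed, and the helper names are hypothetical.

```python
def hierarchical_distance(a, b, group_sizes=(4, 4)):
    """Hierarchy levels separating tasks a and b (consecutive numbering assumed)."""
    span, level = 1, 0
    while a // span != b // span:
        span *= group_sizes[level]
        level += 1
    return level


def xor_phase_distances(num_tasks=16, group_sizes=(4, 4)):
    """For each phase p, the distinct distances between s and s XOR p."""
    return [sorted({hierarchical_distance(s, s ^ p, group_sizes)
                    for s in range(num_tasks)})
            for p in range(num_tasks)]


per_phase = xor_phase_distances()
print(per_phase[:4])   # phases 0 to 3 stay within an l1 group
print(per_phase[4:])   # phases 4 to 15 all cross to another l1 group
```

Every phase yields a single distance value, and that value never decreases from one phase to the next: 0 in phase 0, 1 in phases 1 to 3, and 2 in phases 4 to 15. This is the per-phase uniformity that gives the XOR exchange its constant latency within a phase when no contention is present.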
A severe limitation of the XOR exchange is that it is usable only when the number of interconnected nodes where tasks can be placed is an integer power of two. This excludes many networks of practical interest, including dragonfly networks. Furthermore, the XOR exchange is usable only if the application can be partitioned into a power-of-two number of tasks. Indeed, the algorithm only achieves the “increasing remoteness” feature described above through restriction of its application to this limited class of networks. The algorithm itself is still oblivious to the true network topology, simply performing bit-wise modulo-2 addition of the binary representations of the source task index s and phase index p irrespective of the physical network hierarchy. In the example of FIGS. 4 and 5, for instance, the XOR exchange behaves as if operating on an overlaid topology having four levels, with two tasks in each level l1 group, and each level l(n>1) group containing two level l(n−1) groups. Each real hierarchy level in the network topology of FIG. 4 is thus effectively separated into two overlaid levels. Consequently, we see a differentiation of the local and remote exchanges into two separate categories, one involving only one overlaid sub-level and one involving the other. It can therefore be seen that the way in which the XOR function determines destination tasks for messages is not dependent on the actual network topology, but rather on a simple overlaid structure which can be imposed on only a limited class of network topologies.
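Both points can be illustrated with a brief Python sketch. The function names are hypothetical, and the six-task case is an arbitrary example of a task count that is not a power of two.

```python
def out_of_range_sources(num_tasks, phase):
    """Source tasks whose XOR destination is not a valid task index."""
    return [s for s in range(num_tasks) if (s ^ phase) >= num_tasks]


def overlaid_level(p):
    """Level of the binary overlay hierarchy crossed in phase p.

    The pair (s, s XOR p) first differs at the highest set bit of p, so the
    level depends on p alone and not on the physical topology.
    """
    return p.bit_length()


# Six tasks: in phase 2, tasks 4 and 5 would address tasks 6 and 7.
print(out_of_range_sources(6, 2))

# Sixteen tasks: four overlaid levels, independent of the real hierarchy.
print([overlaid_level(p) for p in range(16)])
```

For sixteen tasks the overlaid level sequence is 0, 1, 2, 2, 3, 3, 3, 3 followed by eight phases at level 4. In the topology of FIG. 4, overlaid levels 1 and 2 both fall inside an l1 group and overlaid levels 3 and 4 both cross between l1 groups, which matches the splitting of each real level into two overlaid levels described above.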
In practice, parallel computing systems rarely comply with the limitations necessary for use of the XOR exchange function, especially as regards the underlying network topology. Because of this, one often resorts to using the less effective, but more generic, linear shift exchange discussed above. Unbalanced and suboptimal application of the XOR pattern in power-of-two sub-partitions of the original number of nodes has also been proposed in “Optimization of Collective Communication Operations in MPICH”, Thakur et al., International Journal of High Performance Computing Applications, Vol. 19, No. 1, Spring 2005, pp. 49-66. In “The Hierarchical Factor Algorithm for All-to-All Communication”, Sanders et al., Proceedings of the 8th International Euro-Par Conference on Parallel Processing, 2002, LNCS 2400, pp. 799-803, an approach is described for hierarchical systems with nodes having different numbers of processors whereby messages are exchanged in order of node size (number of processors in a node).
Improvements in all-to-all exchange operations for hierarchical networks would be highly desirable.