Multiprocessor computer systems having parallel architectures employ a plurality of processors that operate in parallel to perform a single task. On parallel multiprocessor systems, a problem is broken up into several smaller sub-problems, which are then distributed to the individual processors, each of which operates on the sub-problems allocated to it. With such a parallel architecture, a complex task can be solved by all processors simultaneously, producing a solution in much less time than a single processor would need to perform the task alone.
In order for many processors to collectively solve a single task, they must communicate with each other by sending and receiving messages through a network which interconnects all the processors. In many cases, collective communications are required among different subsets of processors. An efficient and portable communication library consisting of a set of frequently used collective communication primitives is crucial to successfully programming a parallel processing system. The communication library provides not only ease in parallel programming, debugging, and portability but also efficiency in the communication processes of the application programs.
A frequently used collective communication primitive is the multi-packet routing operation, wherein each processor in a set has a collection of datablocks of various sizes and each datablock needs to be sent to its respective destination processor in the set over the parallel network. A few frequently used collective communication primitives, which are special cases of the multi-packet routing operation, are the scatter, gather, and index operations. In the scatter operation among a set of n nodes, there is a source node in the set which has n blocks of data which need to be distributed among the n nodes in the set at one datablock per node. In the gather operation among n nodes, wherein each node has one datablock initially, there is one destination node in the set which collects all n blocks from all n nodes, i.e., the datablocks are concatenated at that node without any reduction being performed. In the index operation among a set of n nodes, each node in the set has n blocks of data initially. Each node i in the set needs to send its j-th datablock to node j and receive the i-th datablock of node j.
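The semantics of the three operations described above can be illustrated with a minimal sequential sketch. It models only which node ends up holding which datablock, not the network or any routing schedule. The function names, the use of Python lists and dictionaries to stand in for node memories, and the convention that a block received from node j is stored in slot j are assumptions made for illustration.

```python
def scatter(source_blocks):
    """The source node holds n blocks; node i receives source_blocks[i]."""
    return {node: block for node, block in enumerate(source_blocks)}


def gather(node_blocks):
    """Each node holds one block; the destination node concatenates them
    in node order, with no reduction applied."""
    return [node_blocks[node] for node in sorted(node_blocks)]


def index(blocks):
    """blocks[i][j] is the j-th datablock initially held by node i.

    Node i sends its j-th block to node j. Assumption: the receiver stores
    the block that came from node i in its slot i.
    """
    n = len(blocks)
    received = [[None] * n for _ in range(n)]
    for i in range(n):          # sending node
        for j in range(n):      # destination node
            received[j][i] = blocks[i][j]
    return received
```

For three nodes holding `[["a0", "a1", "a2"], ["b0", "b1", "b2"], ["c0", "c1", "c2"]]`, `index` leaves node 0 with `["a0", "b0", "c0"]`, node 1 with `["a1", "b1", "c1"]`, and node 2 with `["a2", "b2", "c2"]`, i.e., the block layout is transposed across the nodes.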
The scatter, gather, and index operations and the multi-packet routing operation are important collective communication primitives in the communication library and find use in many real-world scientific, numeric, and commercial applications. Examples of such applications include the matrix transpose, which is one of the most basic linear algebra operations, bitonic sorting, the Alternating Direction Implicit (ADI) method for solving partial differential equations, and the solution of Poisson's problem by either the Fourier Analysis Cyclic Reduction (FACR) method or the two-dimensional FFT method. The multi-packet routing operation can also serve as run-time support for the general communication required in a network of workstations and a network of disks or I/O nodes. Thus, a fundamental problem in the art has been to devise efficient algorithms for the scatter, gather, index, and multi-packet routing operations.
Most prior art for collective communication primitives is dependent on a fixed parallel system topology such as a mesh or hypercube. Although these prior art algorithms are well-suited to the topology for which they are specifically designed, they cannot be easily ported to other parallel topologies. Many of these algorithms are also restricted to certain forms because of the number of processors involved. For instance, algorithms for collective communication primitives on hypercubes typically assume that the number of processors is a power of two and thus cannot be easily extended to an arbitrary number of processors without a loss of efficiency. Thus, what is needed are algorithms which are independent of the underlying topology without loss of efficiency, and algorithms for collective communication primitives which can run on an arbitrary set of processors, wherein the number of processors in the set is not necessarily a power of two and the processors in the set do not necessarily form a structure such as a subcube or a submesh.
There are many other advantages in having topology-independent algorithms. One is that they accurately reflect certain parallel architectures where processors are interconnected through multiple stages of switches, crossbar switches, and buses. Another is that, with the availability of more advanced routing techniques such as circuit-switched, wormhole, and virtual cut-through routing, the distance (i.e., number of hops) between processors becomes irrelevant. Still another is that topology-independent algorithms can be helpful for creating algorithms for more specific topologies.
The efficiency of topology-independent algorithms for collective communication primitives is important for the performance of most application programs. Efficiency most often depends on two important parameters of the underlying machine: one is the communication start-up time and the other is the transmission time. The communication start-up time is the overhead, e.g., caused by software and hardware, that is associated with each communication call, namely a send or a receive operation, while the transmission time is the time required to transmit each data element, e.g., one byte, on the communication network.
Due to the different characteristics of different parallel machines, it has been desirable to design routing algorithms which not only are portable but which also remain efficient across various parallel machines. For instance, a multi-packet routing algorithm which is optimized for the start-up time may have poor performance on a machine which has a large transmission time relative to the start-up time. On the other hand, a multi-packet routing algorithm which is optimized for the transmission time may perform poorly on a machine which has a large start-up time relative to the transmission time. Therefore, what is needed in the art is a routing algorithm which is tunable according to machine parameters such as the start-up time and transmission time, and a class of algorithms which can be parameterized so as to provide a balance between the start-up time and transmission time of the specific architecture.
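The trade-off between start-up time and transmission time can be made concrete with a rough sketch, assuming a linear cost model in which the total time is the number of communication calls multiplied by the start-up time plus the number of bytes sent multiplied by the per-byte transmission time. The two index schedules compared below are hypothetical and are not taken from the text: a "direct" schedule with n - 1 steps of one block each, and a "combining" schedule with about log2(n) steps of n/2 blocks each. The step counts, function names, and parameter values are illustrative assumptions.

```python
import math


def linear_cost(num_calls, bytes_sent, t_startup, t_byte):
    """Assumed linear model: per-call overhead plus per-byte transmission."""
    return num_calls * t_startup + bytes_sent * t_byte


def direct_index_cost(n, block_bytes, t_startup, t_byte):
    """Hypothetical schedule: n - 1 steps, one block sent per step."""
    steps = n - 1
    return linear_cost(steps, steps * block_bytes, t_startup, t_byte)


def combining_index_cost(n, block_bytes, t_startup, t_byte):
    """Hypothetical schedule: ceil(log2 n) steps, n // 2 blocks per step."""
    steps = math.ceil(math.log2(n))
    return linear_cost(steps, steps * (n // 2) * block_bytes, t_startup, t_byte)


def cheaper_schedule(n, block_bytes, t_startup, t_byte):
    """Return the name of the schedule with the lower modeled cost."""
    direct = direct_index_cost(n, block_bytes, t_startup, t_byte)
    combining = combining_index_cost(n, block_bytes, t_startup, t_byte)
    return "direct" if direct <= combining else "combining"
```

Under this model, with 64 nodes and 1024-byte blocks, a machine whose start-up time is large relative to its transmission time (for example, 100000 versus 1 time units) favors the schedule with fewer calls, while a machine with the opposite balance (for example, 1000 versus 100 time units) favors the schedule that sends fewer bytes. No single fixed schedule is best on both machines, which is the motivation for a parameterized class of algorithms.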