1. Technical Field
The present invention relates in general to data processing systems and, in particular, to improvements in multiprocessor data processing systems.
2. Description of the Related Art
In parallel computing environments, the Reduce operation is frequently used to combine multiple input vectors of equal length utilizing a desired mathematical or logical operation to obtain a result vector of the same length. As an example, the reduction operation can be a vector summation or an element-wise comparison, such as a maximum. A long-term profiling study has indicated that a conventional application having parallel processing capabilities may spend 40% or more of the cycles used for parallel processing performing Reduce operations.
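As an illustration of the element-by-element combination described above, the following minimal Python sketch reduces several equal-length vectors with a caller-supplied binary operation. The function name reduce_vectors and the sample data are hypothetical and are used here only for illustration.

```python
from functools import reduce
import operator


def reduce_vectors(vectors, op):
    """Combine equal-length vectors element by element with a binary op."""
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all input vectors must have the same length")
    # zip(*vectors) groups the k-th elements of every input vector together.
    return [reduce(op, column) for column in zip(*vectors)]


inputs = [[1, 5, 2], [4, 0, 7], [3, 3, 3]]
vector_sum = reduce_vectors(inputs, operator.add)  # element-wise sum
vector_max = reduce_vectors(inputs, max)           # element-wise maximum
```

In both cases the result vector has the same length as each input vector.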
The Message Passing Interface (MPI) is a language-independent communications protocol commonly used to program parallel computers. MPI defines several collective communication interfaces for Reduce operations, such as MPI_REDUCE and MPI_ALLREDUCE. In MPI, the MPI_REDUCE operation is a global operation across all members of a process group defined by the MPI communicator. The MPI Reduce operation returns the result vector at one process specified by the user, known as the root. In the following discussion, it will be assumed that a commutative Reduce operation is employed and the N participating processes are ranked from 0 to N−1, with the root having a rank of 0.
To reduce vectors having short lengths, the best known MPI_REDUCE algorithm is the Minimum Spanning Tree (MST) algorithm, in which the root process is the root of an MST consisting of all participating processes. A process in the tree first receives vectors from all of its children, combines the received vectors with its own input vector, and then sends the result vector to its parent. This bottom-up approach continues until the root produces the result vector by combining its own input vector with the partial results received from its children.
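The bottom-up flow described above can be sketched as a single-process simulation. The sketch assumes a binomial tree rooted at rank 0, which is one common way to build such a tree; the tree shape is not specified in the text, and the function name mst_reduce is hypothetical. A commutative operation is assumed, as in the discussion above.

```python
def mst_reduce(inputs, op):
    """Simulate a tree reduce to rank 0; returns (result, rounds)."""
    n = len(inputs)
    partial = [list(v) for v in inputs]  # running partial result of each rank
    mask, rounds = 1, 0
    while mask < n:
        for child in range(n):
            # Ranks that are multiples of mask are still active; those with
            # the mask bit set send their partial result to their parent.
            if child % mask == 0 and child & mask:
                parent = child ^ mask
                partial[parent] = [
                    op(a, b) for a, b in zip(partial[parent], partial[child])
                ]
        mask <<= 1
        rounds += 1
    return partial[0], rounds


result, rounds = mst_reduce([[i, 2 * i] for i in range(5)], lambda a, b: a + b)
```

Each round of the loop corresponds to one level of the tree, so the number of rounds grows with the logarithm of the number of processes, while every message carries a full-length vector.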
The MST algorithm does not work as well, however, if the vectors are long. Efficient long vector algorithms for MPI_REDUCE focus on better exploitation of available network bandwidth and processor computing power, rather than on minimizing message start-up and latency overhead. One of the algorithms widely adopted for reduction of long vectors is the Recursive Halving Recursive Doubling (RHRD) algorithm, in which each of the N participating processes first takes log(N) steps of computation and communication, where the logarithm is taken to base 2.
During each step k, process i and process j exchange half of their intermediate reduction results of step (k−1), where j = i ^ mask, the caret (^) denotes a bitwise exclusive-OR operation, and mask is the binary representation of 1 left-shifted by (k−1) bits. If i<j, process i sends the second half of the intermediate result, receives the first half of the intermediate result, and combines the received half with the half that was not sent. This procedure continues recursively, halving the size of exchanged/combined data at each step, for a total of log(N) steps. At the end of the steps, each process owns 1/N of the resulting vector: process 0 owns the first 1/N, process 1 owns the second 1/N, process i owns the (i+1)th 1/N and so on.
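The exchange pattern of the halving phase can be sketched as a single-process simulation. The sketch assumes that N is a power of two, that the vector length is divisible by N, and that the operation is commutative; the function name recursive_halving and the returned layout are hypothetical. It records which slice each process ends up holding rather than assuming a particular assignment of slices to ranks.

```python
def recursive_halving(inputs, op):
    """Simulate the halving phase; returns ((start, end), values) per rank."""
    n = len(inputs)
    length = len(inputs[0])
    if n & (n - 1) or length % n:
        raise ValueError("sketch assumes N = 2**m and length divisible by N")
    # state[i] = (start, end, values): slice rank i is still reducing
    state = [(0, length, list(v)) for v in inputs]
    steps = n.bit_length() - 1  # log2(N)
    for k in range(1, steps + 1):
        mask = 1 << (k - 1)
        new_state = []
        for i in range(n):
            j = i ^ mask                     # exchange partner in step k
            start, end, mine = state[i]
            theirs = state[j][2]             # partner holds the same slice
            mid = (end - start) // 2
            if i < j:
                # Send second half, keep first half combined with partner's.
                kept = [op(a, b) for a, b in zip(mine[:mid], theirs[:mid])]
                new_state.append((start, start + mid, kept))
            else:
                # Send first half, keep second half combined with partner's.
                kept = [op(a, b) for a, b in zip(mine[mid:], theirs[mid:])]
                new_state.append((start + mid, end, kept))
        state = new_state
    return [((s, e), vals) for s, e, vals in state]


demo = recursive_halving(
    [[10 * r + e for e in range(8)] for r in range(4)], lambda a, b: a + b
)
```

After the last step each simulated process holds a distinct slice of length 1/N of the vector, and that slice already contains the fully reduced values for its positions.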
All processes then perform a gather operation in which the root process collects the pieces from the other processes. The gather operation also consists of log(N) steps, for a total Reduce operation length of 2*log(N) steps. During step k of the gather, process i sends the partial results it has to process j, where both i and j are less than N/2^k, j<i, j = i ^ mask, and mask is the binary representation of 1 left-shifted by (k−1) bits. These steps continue recursively, doubling the size of the data passed at each step. Finally, the root process 0 obtains the final result of the reduce operation of the entire vector.
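The halving and doubling of message sizes across the 2*log(N) steps can be tabulated with a short calculation. The sketch assumes that N is a power of two and that the vector length in bytes is divisible by N; the function name rhrd_message_sizes is hypothetical. For the gather phase, the sizes listed are those sent by a process that sends in the given step.

```python
def rhrd_message_sizes(vector_bytes, n):
    """Message size in bytes at each step of the two phases."""
    steps = n.bit_length() - 1  # log2(N) when N is a power of two
    # Halving phase: L/2, L/4, ..., L/N
    halving = [vector_bytes >> k for k in range(1, steps + 1)]
    # Gather phase: L/N, 2L/N, ..., L/2
    gather = [(vector_bytes // n) << (k - 1) for k in range(1, steps + 1)]
    return halving, gather


halving_sizes, gather_sizes = rhrd_message_sizes(1024, 8)
total_steps = len(halving_sizes) + len(gather_sizes)
```

For a 1024-byte vector on 8 processes, the smallest message in either phase is 128 bytes, that is, 1/N of the vector.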
The above description of the RHRD algorithm applies to cases where N is an integer power of two. When N is not an integer power of two (as is common), the RHRD algorithm includes an extra preparation step prior to the exchange of intermediate results. In the preparation step, processes from N′ to N−1 send their input vectors to processes from 0 to r−1, where r=N−N′ and N′ is the largest integer power of two less than N. Specifically, process i sends its vector to process i−N′ if i≥N′. Processes from 0 to r−1 then perform local reduce operations on the received vectors and their own input vectors and use the results as their input vectors in the above-described algorithm. Processes from N′ to N−1 do not participate in the remainder of the MPI_REDUCE operation, and processing by the rest of the processes remains the same as in the case in which N is an integer power of two.
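The preparation step can be sketched as follows. The function name preparation_step is hypothetical, and the sketch simulates all processes in a single Python list, returning the N′ input vectors that would then be fed to the power-of-two algorithm.

```python
def preparation_step(inputs, op):
    """Fold the vectors of ranks N'..N-1 into ranks 0..r-1."""
    n = len(inputs)
    n_prime = 1 << (n.bit_length() - 1)  # largest power of two <= N
    if n_prime == n:
        # N is already a power of two, so no preparation step is needed.
        return [list(v) for v in inputs]
    folded = [list(v) for v in inputs[:n_prime]]
    for i in range(n_prime, n):
        target = i - n_prime  # rank i sends its vector to rank i - N'
        folded[target] = [op(a, b) for a, b in zip(folded[target], inputs[i])]
    return folded


prepared = preparation_step(
    [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]], lambda a, b: a + b
)
```

With five processes, N′ is 4 and r is 1, so only process 4 sends its vector, and process 0 enters the remaining steps with the combined vector.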
FIG. 6 depicts an example of MPI_REDUCE operation on five processes using the conventional RHRD algorithm. The example assumes a vector size of 4, with each vector containing elements ABCD. Partial reduce operation results are represented by element and rank number (in subscript), e.g., A-B₀₋₃ represents reduce results of elements (i.e., vector halves) A and B of processes 0, 1, 2 and 3.
The conventional RHRD algorithm does not scale well. Modern parallel high performance computing applications often require MPI_REDUCE operations on tens of thousands or even hundreds of thousands of processes. Even when the vectors are long (e.g., 1 MB), the RHRD algorithm breaks the vector into very small chunks, e.g., 64 bytes when N=16K, since each of the N′ processes gets 1/N′th of the final result vector at the end of the recursive halving and before the recursive doubling. With a large number of short messages each carrying a small chunk of the vector, message start-up overhead is high and available network link bandwidth is not fully utilized.
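The chunk size quoted above can be checked with a short calculation. The sketch assumes that 1 MB means 2^20 bytes and that 16K means 16,384 processes; the function name final_chunk_bytes is hypothetical.

```python
def final_chunk_bytes(vector_bytes, n):
    """Bytes of the result vector held by each process after the halving."""
    n_prime = 1 << (n.bit_length() - 1)  # largest power of two <= N
    return vector_bytes // n_prime


ONE_MB = 1 << 20                                   # assumption: 1 MB = 2**20 bytes
chunk_16k = final_chunk_bytes(ONE_MB, 16 * 1024)   # N = 16K processes
chunk_100k = final_chunk_bytes(ONE_MB, 100_000)    # N' = 65,536 in this case
```

Under these assumptions the first call yields the 64-byte chunk mentioned in the text, and the chunk shrinks further as the process count grows.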