Parallel computing is the distribution of a computing operation among a plurality of processors and/or a plurality of computing systems. Parallel computing is advantageous because a computationally expensive processing task may take less time to complete if more processors are used. For example, scientific and technological research frequently comprises computational tasks which, due to their complexity, require years to complete on a single processor. However, such tasks can frequently be completed in a manageable amount of time if divided among a large number of processors. Weather forecasting and computerized simulations of real-world phenomena also frequently comprise complex computational tasks which may benefit from parallel computing. Parallel computing is particularly advantageous for high performance computing, a term used in the art to denote computing tasks requiring very large amounts of computational resources.
Historically, computationally expensive computing tasks have been executed by supercomputers, specialized systems having very fast processors. However, a parallel computing system may solve such tasks in a more cost-effective manner than a supercomputer. A primary reason is that adding processors through parallel computing may improve performance more effectively and/or at a lower cost than increasing the speed of an individual processor. This is because there are diminishing returns to increasing the speed and performance of a single processor. By contrast, there is often virtually no limit to the number of processors which may contribute to a computing task. The overhead required to couple together multiple processors is a far less significant factor than the diminishing returns of increasing the speed of a single processor. Moreover, parallel computing may beneficially reduce the power consumption required to complete a computing task. This is because performance derived from parallel execution is generally more power efficient than performance derived from increased processing speed.
One of the operations in parallel computing is the global reduce operation. In a global reduce operation, a plurality of processes collaborate to complete a computing operation. The processes are located at different processors and/or different computing systems. Each process initially has a quantity of data known in the art as an input vector. The global reduce operation combines all the input vectors using a specified computing operation or set of computing operations. When the vectors are large, this may be achieved by each processor performing the computing operation or set of computing operations on a subset of the vector. It is emphasized that a wide variety of computing operations known in the art may be applied in conjunction with global reduce. The performance of the global reduce operation is essential to many high performance parallel applications.
For parallel computing to succeed, it is important for the processes sharing responsibility for a computing task to interact effectively with each other. It is highly advantageous for the computing task to be divided among the participating processes in predefined ways. To achieve this goal, the processes should ideally communicate with each other according to predefined communication protocols. The Message Passing Interface, or MPI, is a protocol known in the art for facilitating communication between a plurality of processors cooperating on a computing task. MPI defines semantics of various types of communications.
To facilitate communication, MPI defines a plurality of primitives. Some MPI primitives perform point-to-point communication. Among these primitives are a one-way sending operation and a one-way receiving operation. Other MPI primitives facilitate collective communication. These primitives include MPI_BARRIER and MPI_BCAST. A subset of the collective communication primitives are noteworthy because they distribute a computing operation across multiple processes. Specifically, the primitives combine both communication and computing within the same operation. This subset comprises the primitives MPI_REDUCE and MPI_ALLREDUCE, both of which perform a global reduce operation.
In an MPI_REDUCE operation, each process has an input vector. The output of the MPI_REDUCE operation is the result of applying the global combining or reducing operation on all the input vectors. Certain core computing operations, including summation and determining a minimum or maximum value, are defined by MPI. Additionally, customized computing operations not envisioned by MPI may be implemented. When the reduction is complete, the result of the reduce operation is available at a single process, known as the root. It is possible to specify which process shall serve as the root.
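The semantics described above can be sketched on a single machine. The following is a conceptual model only, not actual MPI; the function name global_reduce and the sample vectors are illustrative, and a real program would invoke the MPI_Reduce primitive instead.

```python
# Conceptual model of MPI_REDUCE semantics (not actual MPI).
# Each of N processes holds an input vector; the reduce operation
# combines all vectors elementwise, and only the designated root
# process receives the combined result.

def global_reduce(input_vectors, op, root=0):
    """Combine all input vectors elementwise with `op`; only the root
    observes the result, mirroring MPI_REDUCE semantics."""
    result = list(input_vectors[0])
    for vec in input_vectors[1:]:
        result = [op(a, b) for a, b in zip(result, vec)]
    # Only the root process receives the combined result.
    return {root: result}

# Three "processes", each with a 4-element input vector; summation
# is one of the core operations defined by MPI (MPI_SUM).
vectors = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
out = global_reduce(vectors, lambda a, b: a + b, root=0)
# out[0] == [111, 222, 333, 444]
```

As the text notes, the combining operation is configurable: passing `max` instead of addition models MPI's predefined maximum reduction, and a custom callable models a user-defined operation.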
The input and output of an MPI_ALLREDUCE operation are similar to those of an MPI_REDUCE operation. However, at the conclusion of an MPI_ALLREDUCE operation, the combined result of the reduce operation across all processes is available at each process.
The performance of the MPI_REDUCE and MPI_ALLREDUCE primitives is important for the performance of parallel computing applications based on MPI. One long-term profiling study determined that for parallel computing applications using the Message Passing Interface, the amount of time spent within MPI_REDUCE and MPI_ALLREDUCE accounted for more than 40% of the total time spent by the profiled applications in any MPI function. See Rolf Rabenseifner, Optimization of Collective Reduction Operations, International Conference on Computational Science 2004, Lecture Notes in Computer Science (LNCS), Volume 3036/2004, Springer.
The computational cost of MPI_REDUCE or MPI_ALLREDUCE is at least (N−1)Lγ, where N is the number of processes, L is the length of the vector in bytes and γ is the reduce operation cost per byte. If distributed evenly across the processes, the computational cost at any particular process is at least
((N−1)/N)·L·γ.
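The two cost expressions above can be checked with a small worked example. The numbers and helper names (total_reduce_cost, per_process_cost) below are illustrative, not part of any MPI implementation.

```python
# Illustrative arithmetic for the reduce-cost lower bounds, using the
# symbols from the text: N processes, vector length L in bytes, and
# per-byte reduce cost γ (written `gamma` here).

def total_reduce_cost(N, L, gamma):
    # Combining N input vectors requires at least N-1 vector combinations,
    # i.e. (N-1)·L·γ in total.
    return (N - 1) * L * gamma

def per_process_cost(N, L, gamma):
    # If the work is spread evenly, each process bears ((N-1)/N)·L·γ.
    return total_reduce_cost(N, L, gamma) / N

# Example: 8 processes, a 1 MB vector, gamma = 1 cost unit per byte.
total = total_reduce_cost(8, 1_000_000, 1.0)     # 7,000,000 units
per_proc = per_process_cost(8, 1_000_000, 1.0)   # 875,000 units
```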
The interconnection means of computing systems may include Remote Direct Memory Access (RDMA) capability. RDMA is a method by which data in memory belonging to one computer system can be transmitted to memory belonging to another computer system concurrently with processors of both systems performing distinct operations and without interfering with those operations. Techniques exist in the prior art for using RDMA to improve the performance of MPI_REDUCE and MPI_ALLREDUCE for large messages. Specifically, the overlapping of processing with data transfer provided by RDMA can be combined with the pipelining provided by MPI_REDUCE and MPI_ALLREDUCE. This prior art technique will be referred to herein as the Pipelined RDMA Tree (PRT) algorithm.
For MPI_REDUCE, the PRT algorithm splits each input vector into q slices and pipelines the handling of those slices. Communications are organized along edges connecting nodes of a process tree. Nodes of the tree represent the participating processes, and the root of the tree is the root of the MPI_REDUCE operation. Each process requires q steps of communication and computation. At step i, a process first waits for all of its child processes to deliver slice i of their vectors via RDMA. The parent process then combines the received slices with slice i of its own input vector. The reduce operation performed on slice i at the process is overlapped with the receiving of slice (i+1) from its child processes. This is possible because the reduce operation is performed by the processor while data transfer is handled by an RDMA adapter. Finally, the parent process sends the combining result to its parent. Here again, the sending of slice i can be overlapped with the reduce operation on slice (i+1).
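The slice schedule described above can be sketched as a single-machine simulation. This is a sketch only: the real PRT algorithm overlaps the reduce of slice i with the RDMA delivery of slice (i+1), whereas the steps below run sequentially, and the function names are hypothetical.

```python
# Single-machine sketch of the PRT slice schedule: each input vector is
# split into q slices; at pipeline step i, a parent combines slice i
# received from its children with slice i of its own vector, then the
# combined slice moves toward the root of the process tree.

def split(vec, q):
    """Split vec into q contiguous slices (length assumed divisible by q)."""
    step = len(vec) // q
    return [vec[i * step:(i + 1) * step] for i in range(q)]

def prt_reduce(vectors, tree_children, root, q):
    """Elementwise-sum reduce over a process tree, one slice at a time."""
    slices = {p: split(v, q) for p, v in enumerate(vectors)}
    result = []
    for i in range(q):                  # pipeline step i handles slice i
        def combine(p):
            acc = list(slices[p][i])    # the process's own slice i
            for child in tree_children.get(p, []):
                # In PRT the child delivers its combined slice i via RDMA;
                # here we just recurse down the tree.
                for k, x in enumerate(combine(child)):
                    acc[k] += x         # reduce the received slice into acc
            return acc
        result.extend(combine(root))    # slice i is now complete at the root
    return result

# Three processes: root 0 with children 1 and 2, and q = 2 slices.
vecs = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
reduced = prt_reduce(vecs, {0: [1, 2]}, root=0, q=2)
# reduced == [111, 222, 333, 444]
```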
While the PRT algorithm offers improved performance compared to non-RDMA implementations of the MPI_REDUCE and MPI_ALLREDUCE primitives, it is nonetheless suboptimal. To see why, consider two cases: either the computational cost to reduce a single byte is greater than or equal to the communication cost to transmit a single byte, or it is less than that communication cost. These two cases are exhaustive, since for any two defined numbers, the first must be either greater than, equal to, or less than the second.
If the reduction cost is greater than or equal to the communication cost, most of the communication cost is overlapped by computation. In this case, the computational cost of the PRT algorithm when implemented using binary trees is approximately
(2 + 2(log(N) − 1)/q)·L·γ.
This is more than double the lower bound of the distributed computational cost inherently required by MPI_REDUCE and MPI_ALLREDUCE. Implementing the PRT algorithm with other process tree structures would result in an even higher cost.
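The "more than double" claim can be verified numerically. The sample values of N and q below are illustrative, and the helper names are hypothetical.

```python
import math

# Comparing the PRT computational cost (binary tree, reduction-bound case)
# against the distributed lower bound, using the expressions in the text.

def prt_compute_cost(N, q, L, gamma):
    # (2 + 2(log(N) - 1)/q) · L · γ
    return (2 + 2 * (math.log2(N) - 1) / q) * L * gamma

def lower_bound_per_process(N, L, gamma):
    # ((N - 1)/N) · L · γ
    return (N - 1) / N * L * gamma

# Example: N = 64 processes, q = 16 slices, unit vector length and cost.
prt = prt_compute_cost(64, 16, 1, 1)        # 2 + 2*(6-1)/16 = 2.625
bound = lower_bound_per_process(64, 1, 1)   # 63/64 = 0.984375
# prt / bound ≈ 2.67, i.e. more than double the lower bound.
```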
If instead the reduction cost is less than the communication cost, the communication cost then becomes the predominant factor in the total cost required to apply the PRT algorithm. The communication cost can be approximated by
(1 + (log(N) − 1)/q)·L·β,
where β is the communication cost to transmit a single byte. The total amount of data communicated by each process which is neither a leaf node nor the root node is 3L. This is because each such process receives a vector L bytes in length from each of two child processes and sends a vector L bytes in length to its parent process. Although RDMA removes the serialization bottleneck at the processor and memory for transferring this data, the adapter bandwidth requirement is significantly increased. This disadvantage is especially pronounced when implementing the MPI_ALLREDUCE primitive, because each process which is neither a leaf nor the root must transfer 6L bytes of data.
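The 3L and 6L traffic figures follow from a simple per-node tally, sketched below with hypothetical helper names and an illustrative vector size.

```python
# Bytes moved by an internal (non-leaf, non-root) process in a binary
# process tree under the PRT schedule.

def reduce_traffic(L, num_children=2):
    # Receive L bytes from each child, send L bytes of combined result
    # to the parent: 2L + L = 3L for a binary tree.
    return num_children * L + L

def allreduce_traffic(L, num_children=2):
    # MPI_ALLREDUCE must also deliver the final result to every process,
    # doubling the traffic at each internal node: 6L for a binary tree.
    return 2 * reduce_traffic(L, num_children)

L = 1_000_000                            # a 1 MB input vector
bytes_reduce = reduce_traffic(L)         # 3,000,000 bytes (3L)
bytes_allreduce = allreduce_traffic(L)   # 6,000,000 bytes (6L)
```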