In order to provide useful background information to better frame the environment in which the present invention is employed, consideration is given to what is most likely viewed as the paradigm application in which collective operations are performed. Accordingly, attention is directed to the calculation of the largest one of a large set of numbers. The set of numbers is divided up and parceled out to a number of independent data processing units, each one of which is capable of determining a maximum value for the set of numbers that it has been given (either directly or by the passage of a boundary address for a subset of the whole set of numbers). Clearly, this is the type of operation that can be parceled out again to another set of data processing nodes. The net result is that operations of this sort, referred to as collective operations, are ones that are parceled out with data moving down a structured tree of processing elements and with resulting data being passed back up the tree. In the basic example illustrated herein, each node in the tree computes a maximum value for the set that it has been assigned and returns that result to the node from which it received the data, further up the tree (a picture with the tree root at the top being assumed so as to make sense of the use of the word “up”). That node then picks a maximum value from the set of data returned to it and, in turn, passes that result further up the tree. Such is the basic nature and function of so-called collective operations. While this example illustrates the general principles and justifications for the use of collective operations, it should be noted that, in general, the data that is passed up and down the branches of a tree structure is often of a significant size and is not limited to a single number. It is often numeric, however, and often has a specific structure. The present invention is directed to structures and processes which underlie collective operations.
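By way of illustration only, the maximum-value example above can be sketched as a recursive routine in which each call stands in for one node of the tree. The function name tree_max and the parameters fanout and leaf_size are hypothetical and chosen solely for this sketch; a real system would distribute the subsets to separate processing units rather than recurse within one process.

```python
def tree_max(values, fanout=2, leaf_size=4):
    """Return the maximum of `values` by parceling the work down a tree.

    Each call models one node: a leaf computes the maximum of its subset
    directly; an interior node splits its subset among up to `fanout`
    children and picks the maximum of the results passed back up.
    """
    if not values:
        raise ValueError("tree_max() requires a non-empty set of numbers")
    if fanout < 2 or leaf_size < 1:
        raise ValueError("fanout must be >= 2 and leaf_size must be >= 1")
    if len(values) <= leaf_size:
        return max(values)                       # leaf node: direct computation
    chunk = -(-len(values) // fanout)            # ceiling division
    partial_results = [
        tree_max(values[start:start + chunk], fanout, leaf_size)
        for start in range(0, len(values), chunk)
    ]
    return max(partial_results)                  # result returned up the tree
```

Only a single partial result travels up each branch in this sketch, which is the property that makes the operation suitable for distribution over a tree.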
Collective communication operations play a very important role in high performance computing. In collective communication, data are redistributed cooperatively among a group of processes. Sometimes the redistribution is accompanied by various types of computation on the data, and it is the results of the computation that are redistributed. The de facto message passing programming model standard, namely the Message Passing Interface (MPI), defines a set of collective communication interfaces, including MPI_BARRIER, MPI_BCAST, MPI_REDUCE, MPI_ALLREDUCE, MPI_ALLGATHER, MPI_ALLTOALL, etc. These are application level interfaces and are more generally referred to as APIs. In MPI, collective communications are carried out on communicators, which define the participating processes and a unique communication context.
Functionally, each collective communication is equivalent to a sequence of point-to-point communications, for which MPI defines MPI_SEND, MPI_RECV and MPI_WAIT interfaces (and variants). MPI collective communication operations are implemented with a layered approach; that is, the collective communication routines handle semantic requirements and translate the collective communication function call into a sequence of SEND/RECV/WAIT operations according to the algorithms used. The point-to-point communication protocol layer guarantees reliable communication. A communication protocol stack often consists of several layers, each of which provides certain functionality and service to a higher layer. The MPI point-to-point layer itself is sometimes built on other point-to-point communication layers, some of which do not follow the two-sided communication model. One such example is the IBM Parallel Environment/MPI (IBM PE/MPI) and IBM LAPI (a Low-level Application Programming Interface set of definitions and functions). The MPI point-to-point communication layer in IBM PE/MPI is called the Message Passing Client Interface (MPCI), which is built on top of the point-to-point active message functionality provided by LAPI. LAPI consists of interfaces through which the source side of the data makes transfer requests, and handler interfaces through which upper layer functionality is carried out by LAPI on the upper layer's behalf. There are three types of handlers in LAPI: the send side completion handler, the receive side header handler and the receive side completion handler. The collective communication operations in IBM PE/MPI interface with MPCI, which in turn interfaces with IBM LAPI.
Despite its advantages, the layered approach has performance issues, one of which is the locking overhead in a threaded environment. Each layer requires a lock to protect its internal data structures, so multiple lock/unlock costs are paid when control passes through multiple layers. In the above example, to complete a collective communication operation, the MPI layer of IBM PE/MPI may make multiple calls to the MPCI layer, each one resulting in a LAPI function call and each requiring the following sequence:
        MPI processing;
        releasing the MPI lock;
        acquiring the MPCI lock;
        MPCI processing;
        releasing the MPCI lock;
        acquiring the LAPI lock;
        LAPI processing;
        releasing the LAPI lock;
        reacquiring the MPCI lock;
        MPCI processing;
        releasing the MPCI lock; and
        reacquiring the MPI lock.
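The cost of the above sequence can be made concrete with a small model that logs every lock operation. This is a sketch only: the three locks are ordinary Python locks standing in for the per-layer locks, the helper names are hypothetical, the processing steps are left as comments, and it is assumed that the MPI lock is already held when the sequence begins.

```python
import threading

LAYERS = ("MPI", "MPCI", "LAPI")


def _acquire(locks, name, log):
    locks[name].acquire()
    log.append(("acquire", name))


def _release(locks, name, log):
    locks[name].release()
    log.append(("release", name))


def call_through_layers(locks, log):
    """Model one MPI -> MPCI -> LAPI call; the MPI lock is held on entry and exit."""
    # MPI processing would happen here
    _release(locks, "MPI", log)
    _acquire(locks, "MPCI", log)
    # MPCI processing
    _release(locks, "MPCI", log)
    _acquire(locks, "LAPI", log)
    # LAPI processing
    _release(locks, "LAPI", log)
    _acquire(locks, "MPCI", log)
    # MPCI processing
    _release(locks, "MPCI", log)
    _acquire(locks, "MPI", log)


def lock_operations(num_calls):
    """Return the log of lock operations for `num_calls` calls into the lower layers."""
    locks = {name: threading.Lock() for name in LAYERS}
    log = []
    locks["MPI"].acquire()          # assumed: held on entry, not counted in the log
    for _ in range(num_calls):
        call_through_layers(locks, log)
    locks["MPI"].release()
    return log
```

Under these assumptions every call into the lower layers costs eight lock operations, so a collective operation that makes k such calls pays 8k lock operations before any communication work is counted.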
Another issue is that the interfaces provided by a two-sided, point-to-point communication lower layer (for example, MPCI) are generic and may not conveniently serve certain special requirements of a particular upper layer. The MPCI protocol complies with the MPI point-to-point communication semantics, which sometimes complicates matters more than necessary. One such case is the transfer of a large message in a collective communication operation. To send a large message in standard mode, MPCI implements the rendezvous protocol, in which the sender sends a “message envelope” to the receiver and waits for the receiver's signal before sending the data. This can add substantial overhead and increase implementation complexity. In collective communication, the message envelope is not required to be delivered to the receiver for message matching purposes when the “send” is posted before the receive. The message matching semantics enforced by the two-sided point-to-point communication interface are not necessary for collective communication operations. Another example is the implementation of MPI_Reduce. MPI_Reduce combines inputs from all participating processes, using a combine operation specified through the interface, and returns the results in a receive buffer of one process (called the root of the reduce operation). The task that performs the reduce operation needs to receive some inputs from other tasks. With the point-to-point send/receive protocol, temporary buffers are allocated and extra copies are used to store those inputs at the MPI layer before the reduce operation is carried out. A third example is a small message MPI_Bcast, where the message available at one process (referred to as the root) is transferred to all of the other participating processes. Implementation of MPI_Bcast is often based on tree algorithms in which the message is sent from the root of the tree to internal nodes of the tree and then forwarded along the branches.
With the point-to-point send/receive protocol, a node can only receive the message after all nodes along the branches from the root to that node have made the bcast call. A delay at any internal node in calling bcast delays the completion of the bcast at the downstream nodes along the branch.
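The way a late bcast call propagates down a branch can be sketched with a simple timing model. The tree shape used here (a binary tree in which the parent of rank r is rank (r - 1) // 2), the fixed per-hop latency, and the function name are all assumptions made for illustration; actual implementations may use other trees.

```python
def bcast_completion_times(call_times, latency=1.0):
    """Return the time at which each rank holds the broadcast message.

    call_times[r] is the time at which rank r calls bcast; rank 0 is the root.
    A rank obtains the message only after it has called bcast AND its parent
    has obtained the message and forwarded it (one `latency` later).
    """
    done = [0.0] * len(call_times)
    done[0] = call_times[0]                       # the root already has the data
    for rank in range(1, len(call_times)):
        parent = (rank - 1) // 2                  # assumed binary tree layout
        done[rank] = max(call_times[rank], done[parent] + latency)
    return done
```

For example, with seven ranks that all call bcast at time 0 except rank 1, which calls at time 10, ranks 3 and 4 (the children of rank 1) cannot complete before time 11, while the ranks on the other branch are unaffected.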
FIG. 1 illustrates this using the layered implementation of MPI_Barrier as an example. The algorithm for BARRIER requires log N rounds of communication by each process (logarithms are assumed to be base 2). During round j, process i sends a 0 byte message to process ((i+2^j) mod N) and receives a 0 byte message from process ((i+N−2^j) mod N). A message consists of an envelope and a payload of user data. Messages without user data payloads are referred to as “0 byte messages.” These may also be referred to as “control messages” or “zero payload messages.” The “send” in a new round cannot start until the receive of the previous round completes. At the MPI level, the algorithm is implemented by a loop of nonblocking mpci_recv and blocking mpci_send calls followed by mpci_wait to make sure that the mpci_recv completes. MPCI calls LAPI_Xfer to send messages and loops on calling LAPI_Msgpoll for message completion. MPCI registers the generic header handler, the completion handler and the send completion handler for message matching and other two-sided point-to-point communication semantics.
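The communication pattern of the barrier algorithm can be checked with a short simulation. This sketch does not use MPCI or LAPI; it only tracks, for each process, the set of processes whose arrival it has learned of. Rounds are assumed to be numbered from 0, the round count is rounded up to the next integer when N is not a power of two, and all function names are hypothetical.

```python
def barrier_rounds(n):
    """Number of rounds: ceil(log2(n)), computed exactly with integers."""
    return (n - 1).bit_length()


def partners(i, j, n):
    """Return (send_to, recv_from) for process i in round j among n processes."""
    step = 2 ** j
    return (i + step) % n, (i + n - step) % n


def simulate_barrier(n):
    """Return, per process, the set of processes it has heard from (directly or not)."""
    known = [{i} for i in range(n)]
    for j in range(barrier_rounds(n)):
        snapshot = [set(s) for s in known]        # state at the start of the round
        for i in range(n):
            _, src = partners(i, j, n)
            known[i] |= snapshot[src]             # effect of the 0 byte message
    return known
```

After the final round each process has heard, directly or transitively, from every other process, which is why no process can leave the barrier before all processes have entered it.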