Message Passing Interface (“MPI”) defines a standard application programming interface (“API”) for using several processes at one time to solve a single large problem, or “job,” on a multiprocessor and often multi-node computer (commonly one process per CPU across one or more multi-CPU nodes). Each job can include multiple processes. A process is also commonly referred to as a task. Each process or task can compute independently except when it needs to exchange data with another task. The program passes the data from one task to another as a “message.” Examples of multiprocessor computers include the IBM eServer Cluster 1600 available from IBM Corporation, Armonk, N.Y., and supercomputers available from Cray, Silicon Graphics, Hewlett Packard and the like.
The most commonly used portions of the MPI standard use a two-sided protocol. In a two-sided protocol, a message transfer occurs when one task initiates a send request and another task initiates a receive request. No data transfer occurs until both the sending and receiving tasks have made a call into the API. It is often convenient to implement a two-sided protocol on top of a one-sided protocol. For example, IBM offers a library implementing the MPI API which in turn is implemented using the one-sided LAPI (Low-level API) programming model. In the one-sided programming model, the sender may initiate data transfer even if the receiving task has not made a call requesting data transfer.
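The difference between the two models can be illustrated with a minimal Python sketch. It is a toy model only and uses neither MPI nor LAPI; the class names `TwoSidedEndpoint` and `OneSidedWindow` and their methods are hypothetical names chosen for this illustration.

```python
from collections import deque


class TwoSidedEndpoint:
    """Toy two-sided model: data moves only once a send AND a receive are posted."""

    def __init__(self):
        self.pending_sends = deque()  # payloads posted before any receive
        self.pending_recvs = deque()  # receive buffers posted before any send
        self.transfers = 0            # number of completed transfers

    def post_send(self, payload):
        if self.pending_recvs:
            self._transfer(payload, self.pending_recvs.popleft())
        else:
            self.pending_sends.append(payload)  # no receive yet: nothing moves

    def post_recv(self, buffer):
        if self.pending_sends:
            self._transfer(self.pending_sends.popleft(), buffer)
        else:
            self.pending_recvs.append(buffer)   # no send yet: nothing moves

    def _transfer(self, payload, buffer):
        buffer[:len(payload)] = payload
        self.transfers += 1


class OneSidedWindow:
    """Toy one-sided model: the origin writes into target memory without a target call."""

    def __init__(self, size):
        self.memory = bytearray(size)

    def put(self, offset, payload):
        self.memory[offset:offset + len(payload)] = payload


# Two-sided: the send alone transfers nothing; the later receive completes it.
endpoint = TwoSidedEndpoint()
endpoint.post_send(b"abcd")
transfers_before_recv = endpoint.transfers
receive_buffer = bytearray(4)
endpoint.post_recv(receive_buffer)

# One-sided: the origin's put lands in target memory immediately.
window = OneSidedWindow(8)
window.put(2, b"xy")
```

In the two-sided case the payload sits in a pending queue until the receive is posted, whereas in the one-sided case the origin alone decides where the data lands, which is why the origin must know the target layout.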
Data transfer in parallel programming applications can involve contiguous or non-contiguous data structures. FIG. 1 illustrates the simplest form of data transfer: the transfer of a message from a contiguous data structure 104 in an originating task 102 to another contiguous data structure 108 in a target task 106. FIG. 2 depicts the more complex scenario of transferring a message contained in a non-contiguous data structure 204 in the originating task 202 to another non-contiguous data structure 208 in the target task 206. It is also sometimes necessary to transfer from a contiguous data structure to a non-contiguous data structure and vice versa.
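As a rough illustration of a non-contiguous transfer, the following Python sketch packs scattered blocks of an origin buffer into one contiguous buffer and then unpacks them into a differently laid out target buffer. The representation of a layout as a list of (offset, length) pairs and the function names `gather` and `scatter` are assumptions made for this example, not a format taken from the figures.

```python
def gather(source, blocks):
    """Pack the (offset, length) blocks of `source` into one contiguous bytes object."""
    packed = bytearray()
    for offset, length in blocks:
        packed += source[offset:offset + length]
    return bytes(packed)


def scatter(packed, target, blocks):
    """Unpack contiguous `packed` bytes into the (offset, length) blocks of `target`.

    Returns the number of bytes consumed from `packed`.
    """
    position = 0
    for offset, length in blocks:
        target[offset:offset + length] = packed[position:position + length]
        position += length
    return position


# Origin task: three 2-byte blocks scattered through a 16-byte buffer.
origin_buffer = bytes(range(16))
origin_blocks = [(0, 2), (6, 2), (12, 2)]

# Target task: same total size, but a different non-contiguous layout.
target_buffer = bytearray(12)
target_blocks = [(1, 2), (5, 2), (9, 2)]

message = gather(origin_buffer, origin_blocks)
consumed = scatter(message, target_buffer, target_blocks)
```

Passing a single block that covers the whole buffer on either side gives the contiguous-to-non-contiguous and non-contiguous-to-contiguous cases.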
Transferring a message from a source 302 to a destination 304 based on a non-contiguous data structure presents significant challenges in cases where the network adapter is incapable of directly handling non-contiguous data structures. It is more difficult to transfer such a message in a one-sided model because the target data structure is provided on the origin side. A one-sided communication is a communication in which the receiver is not expecting or waiting to receive the data. U.S. Pat. No. 6,389,478, “Efficient Non-contiguous Vector and Strided Data Transfer in One-sided Communication on Multiprocessor Computers,” which is hereby incorporated by reference in its entirety, describes a mechanism to transfer vector-type non-contiguous data in a one-sided communication model. In that invention, as shown in FIG. 3, data is transferred in such a way that each packet 306 contains description(s) 308 of the data structure and data block(s) 310.
This prior art mechanism requires that a complete description of the user data structure, together with the data contained within the structure, be transferred from the send side to the target side for each non-contiguous message. However, there are cases in which a block-by-block description of the data is longer than the data itself. For example, a four-byte data block needs two values in its description (e.g., its address and its length). Thus, the total number of bytes transferred by this method (including the data and the description of the data structure) could be significantly larger than the number of data bytes alone.
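The overhead can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that each of the two description values occupies 8 bytes; the actual sizes are not stated here, and the function name `transfer_cost` is hypothetical.

```python
def transfer_cost(num_blocks, block_bytes, address_bytes=8, length_bytes=8):
    """Return (data_bytes, description_bytes, total_bytes) for a block-by-block description.

    address_bytes and length_bytes are assumed sizes for the two description
    values (address and length) that accompany every data block.
    """
    data_bytes = num_blocks * block_bytes
    description_bytes = num_blocks * (address_bytes + length_bytes)
    return data_bytes, description_bytes, data_bytes + description_bytes


# 1000 four-byte blocks, each described by an 8-byte address and an 8-byte length.
data_bytes, description_bytes, total_bytes = transfer_cost(num_blocks=1000, block_bytes=4)
overhead_ratio = total_bytes / data_bytes
```

Under these assumed sizes, 4,000 bytes of data are accompanied by 16,000 bytes of description, so five times as many bytes cross the network as the data alone would require.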
Another commonly owned prior art invention, U.S. patent application Ser. No. 09/517,167, entitled “Data Gather/Scatter Machine,” which is hereby incorporated by reference in its entirety, describes a method of constructing a compact data description and packing/unpacking data from/to a non-contiguous data structure to/from a contiguous data structure based on the description. A Data Gather/Scatter Program (“DGSP”) is a message-level description that potentially eliminates unnecessary individual data block descriptions. However, the original DGSP and Data Gather/Scatter Machine (“DGSM”) concept does not address communication protocol problems. Using a DGSP description of non-contiguous data can improve the efficiency of non-contiguous message transfer by reducing the overhead of transmitting the target data description. However, there are still problems in transferring data in this manner, such as:
1. Poor pipelining. Before any message data can actually be stored into target user addresses, the complete data description has to be available at the target. This is because the data description includes the target structure layout and data size, and the data packets cannot be processed into a user data address at the target without that information. One method of ensuring this is to have the description transferred in its entirety before starting to transfer data packets. However, such a scheme suffers from poor pipeline performance, since the data cannot be shipped before it is known at the origin that the data description has arrived at the target. The sender is notified that all data description packets have reached the destination by an acknowledgement from the target; therefore, the sender must wait for this acknowledgement before sending actual message data. This round-trip rendezvous handshake with the receiver is very expensive, especially if there are only a few data packets in the message.
One solution to this problem of poor pipelining is to send the data description and the data together in a pipelined fashion. However, such a pipelined transmission leads to at least two other issues, described next.
2. Data may arrive earlier than the data description. Since networks can have multiple disjoint routes between any source-destination pair, packets injected from a source end point to a destination end point could arrive at the destination out of order, depending on the routes the individual packets take. This race-based reordering typically spans a small range of packets and involves brief delays. Rarer but more substantial out-of-order arrival of packets can be caused by packet loss in the network and the ensuing retransmission by the communication protocol. It is therefore possible for some data packets to arrive before all data description packets have arrived. Without knowing the layout of the data in the user address space, which is contained in the message description, it is not possible to process these incoming data packets as they arrive from the network and assemble them into their final destination data structure. Because out-of-order arrival is not extremely rare, discarding data packets that arrive before their description packets causes performance problems, which result from unnecessary retransmissions and from delays due to the time-outs that trigger retransmissions at the sending side.
3. The difficulty of deciding message completion. In this method of transferring non-contiguous data, the whole message consists of a data description and the data itself. The data description alone could span multiple packets, and even for the same number of data bytes, the number of data description packets varies with the complexity of the data structure layout. The total number of packets is therefore variable. It is not possible to use the traditional approach of a single expected total packet count to decide whether the message is complete, and a new mechanism must be devised instead.
4. The limited ability to send a user header or data description across multiple packets. In previous implementations of vector or active messages, there are restrictions on how large the user header or data description can be. The user header is the portion of user data that is processed by client code, called the header handler, which returns the location at which to start unpacking the rest of the data. In effect, the description must fit within one user packet and cannot span multiple packets. This restriction severely limits the functionality of these API calls and of the overall vector implementation.
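Problems 2 and 3 above can be illustrated with a small Python sketch. It is a toy model, not any actual protocol implementation: the packet payload size of 1024 bytes, the description sizes, the ("desc" or "data", index) packet representation, and the function names are all assumptions made for this example.

```python
import math


def packets_in_message(data_bytes, description_bytes, payload_bytes):
    """Return (description_packets, data_packets, total_packets) for one message."""
    description_packets = math.ceil(description_bytes / payload_bytes)
    data_packets = math.ceil(data_bytes / payload_bytes)
    return description_packets, data_packets, description_packets + data_packets


def early_data_packets(arrival_order, description_packets):
    """Return indices of data packets that arrive before the full description.

    arrival_order is a list of (kind, index) pairs, kind being "desc" or "data".
    Assumes each packet arrives exactly once (no duplicates from retransmission).
    """
    seen_description = 0
    early = []
    for kind, index in arrival_order:
        if kind == "desc":
            seen_description += 1
        elif seen_description < description_packets:
            early.append(index)  # layout still unknown: cannot be placed yet
    return early


PAYLOAD_BYTES = 1024  # assumed usable payload per packet

# Same 4096 data bytes, two layouts of different complexity (assumed sizes).
simple_layout = packets_in_message(4096, 64, PAYLOAD_BYTES)
complex_layout = packets_in_message(4096, 3000, PAYLOAD_BYTES)

# Description and data sent in a pipelined fashion and interleaved on arrival.
arrivals = [("desc", 0), ("data", 0), ("desc", 1), ("data", 1),
            ("desc", 2), ("data", 2), ("data", 3)]
early = early_data_packets(arrivals, description_packets=complex_layout[0])
```

With these assumed numbers, the same 4096 data bytes produce 5 packets for the simple layout and 7 for the complex one, so a receiver cannot infer the expected total from the data size alone, and two of the four data packets arrive before the three-packet description is complete.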
Therefore, a need exists to overcome the problems with the prior art as described above.