The invention relates to multiprocessor computers and more particularly to a message passing interface (MPI) application programming interface (API) for passing messages between multiple tasks or processes.
S/390 and IBM are registered trademarks of International Business Machines Corporation, Armonk, N.Y., U.S.A. and Lotus is a registered trademark of its subsidiary Lotus Development Corporation, an independent subsidiary of International Business Machines Corporation, Armonk, N.Y. Other names may be registered trademarks or product names of International Business Machines Corporation or other companies.
Message Passing Interface (MPI) defines a standard application programming interface (API) for using several processes at one time to solve a single large problem called a "job" on a multiprocessor and often multi-node computer (i.e., commonly one process per node). Each job can include multiple processes. A process can also commonly be referred to as a task. Each process or task can compute independently except when it needs to exchange data with another task. The program passes the data from one task to another as a "message." Examples of multiprocessor computers are, e.g., an IBM RISC System 6000/SP available from IBM Corporation, Armonk, N.Y., and supercomputers available from Cray, Silicon Graphics, Hewlett Packard, Thinking Machines, and the like.
Specifically, a programmer can use an explicit MPI_SEND to identify what data from the memory of the source task is to be sent as a given message. The programmer can also use an explicit MPI_RECV at the destination task to identify where the data is to be placed in the receiver memory.
In conventional message passing, a send or receive call would identify a memory address and byte count. This is restrictive because it is common for the content which logically comprises a message to be discontiguous in memory.
The conventional approach is, however, a neat fit to the lower level transport model which treats data to be moved between tasks as byte streams.
The conventional solutions have been to send a distinct message for each contiguous unit, or to allocate a scratch buffer and copy or pack the discontiguous data into the scratch buffer before sending. These techniques add programmer burden and execution time overhead.
For example, in a 10×10 matrix of integers, M, stored row major, a row is 10 contiguous integers but a column is every 10th integer. The programmer with a row to send could exploit the fact that the data was already contiguous and could use a simple send. Conventionally, the programmer with a column to send would need to use one of the more awkward techniques. Similar considerations apply to a receive where the eventual destination of the data may not be contiguous.
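By way of illustration only, the conventional scratch-buffer technique for sending one column of the matrix M can be sketched as follows. The function name pack_column and the flat-array layout are assumptions made for this sketch and do not appear in the MPI standard.

```c
#include <assert.h>
#include <stddef.h>

enum { N = 10 };  /* matrix dimension from the example in the text */

/* Copy column `col` of a row-major N x N matrix (stored as a flat array)
 * into a contiguous scratch buffer, so that a byte-stream send can be used.
 * pack_column is a hypothetical helper, not an MPI call. */
void pack_column(const int *m, size_t col, int *scratch)
{
    assert(col < N);
    for (size_t row = 0; row < N; row++)
        scratch[row] = m[row * N + col];  /* column elements are N ints apart */
}
```

The extra copy loop and the scratch buffer are the programmer burden and execution time overhead referred to above; a row, being contiguous, would need neither.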
To simplify the description which follows, the sending of messages is focused upon, although the same considerations apply to the receiving of messages. To send a message, data is gathered from memory and fed to the transport layer at the rate that the transport layer is able to accept. Bytes of a message are forwarded in chunks and the transport layer dictates the size of each chunk. When the transport layer is ready to accept N bytes, N bytes are copied from the proper memory locations into the transport (pipe) buffer. The data gather logic delivers a specific number of bytes at each activation and then, at the next activation, picks up where it left off to deliver more bytes.
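A minimal sketch of such resumable gather logic is given below, assuming for illustration that the discontiguous message is described by a list of (address, length) segments. The names Segment, GatherState and gather_next are hypothetical and are introduced only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One contiguous piece of a discontiguous message (hypothetical layout). */
typedef struct { const char *base; size_t len; } Segment;

/* State preserved between activations of the gather logic. */
typedef struct {
    const Segment *segs;
    size_t nsegs;
    size_t seg;  /* index of the segment currently being read */
    size_t off;  /* bytes already taken from that segment */
} GatherState;

/* Copy up to n bytes into the pipe buffer, resuming where the previous
 * call stopped. Returns the number of bytes actually delivered. */
size_t gather_next(GatherState *st, char *pipe, size_t n)
{
    assert(st != NULL && pipe != NULL);
    size_t copied = 0;
    while (copied < n && st->seg < st->nsegs) {
        const Segment *s = &st->segs[st->seg];
        size_t avail = s->len - st->off;
        size_t take = (n - copied < avail) ? n - copied : avail;
        memcpy(pipe + copied, s->base + st->off, take);
        copied += take;
        st->off += take;
        if (st->off == s->len) {  /* segment exhausted, move to the next */
            st->seg++;
            st->off = 0;
        }
    }
    return copied;
}
```

Each call corresponds to one activation: the transport layer supplies n, and the state structure records where the next activation is to resume.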
Receiving a message is a mirror image of the sending of one. Some number of bytes becomes available from a pipe and must be distributed. It would be apparent to those skilled in the art that the concepts involved in sending and receiving are so closely related that to understand one is to understand the other.
The MPI standard addresses the problem of dealing with discontiguous memory data by defining a set of calls which enable the programmer to describe any possible layout of data in memory. It then falls to the MPI implementation to gather data and feed it to the transport layer or receive incoming data and scatter it to user task memory. The description is called an MPI_Datatype and can be visualized as a template with a stride and one or more tuples, each tuple representing a data unit and its offset within the template. For the 10×10 integer matrix, M, mentioned above, assume it is desirable to send a single message taking the first and sixth integer of each row. In conventional message passing a 20 integer long buffer could be allocated and a loop could be coded to copy these 20 integers from their locations in M to the scratch buffer. Then 80 bytes could be sent from the scratch buffer. In MPI, an MPI_Datatype called, for example, "newtype" can be defined to indicate the first and sixth integer of each row: {(0,integer) (20,integer) stride=40}. The programmer then calls MPI_SEND(M,10,newtype, . . . ). The MPI implementation interprets the template 10 times to gather and transmit the 80 bytes.
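The template view of a datatype can be illustrated with the following sketch, which interprets a list of (offset, size) tuples a given number of times, advancing by the stride each time. The structures Tuple and Template and the function gather_template are hypothetical names for this illustration and are not MPI calls; offsets are written in terms of sizeof(int) so that the 4-byte integers assumed in the text are not hard-coded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One data unit of the template: where it sits and how many bytes it has. */
typedef struct { size_t offset; size_t size; } Tuple;

/* A template: a list of tuples plus the stride between repetitions. */
typedef struct {
    const Tuple *tuples;
    size_t ntuples;
    size_t stride;  /* bytes from one template instance to the next */
} Template;

/* Interpret the template `count` times starting at `base`, packing the
 * selected bytes contiguously into `out`. Returns the bytes gathered. */
size_t gather_template(const void *base, size_t count,
                       const Template *t, void *out)
{
    assert(base != NULL && t != NULL && out != NULL);
    const char *src = base;
    char *dst = out;
    size_t written = 0;
    for (size_t i = 0; i < count; i++) {
        for (size_t k = 0; k < t->ntuples; k++) {
            memcpy(dst + written,
                   src + i * t->stride + t->tuples[k].offset,
                   t->tuples[k].size);
            written += t->tuples[k].size;
        }
    }
    return written;
}
```

For "newtype", the tuple list would hold offsets 0 and 5*sizeof(int) with a stride of 10*sizeof(int), and a count of 10 would gather the 20 integers of the example.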
MPI offers a set of predefined datatypes and a set of constructor calls which allow user-defined datatypes to be built based on the predefined types plus any previously defined user types. Since any new datatype is defined in terms of previously defined types, the natural representation to flow from a sequence of type constructor calls is a tree with predefined types as leaves and user defined types as internal nodes. MPI implementations use such trees to record the datatypes created in a user's code. To gather data to MPI_SEND 10 of "newtype", an MPI implementation would traverse the tree representing "newtype" 10 times. Most implementations of MPI allocate a temporary buffer large enough for the entire message and do an entire gather at one time and then send from the temporary buffer. A different implementation uses an approach which gathers in increments and which preserves the state of the gather operation from step to step. Like the former implementations, the latter implementation has depended on traversing the tree as many times as needed.
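For illustration, a tree-encoded datatype and a gather performed by recursive traversal might look like the sketch below. The node layout (TypeNode) and the function gather_tree are assumptions for this example; actual MPI implementations record considerably more information per node.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tree node: a leaf carries a byte size; an internal node
 * repeats its list of children `count` times, `stride` bytes apart. */
typedef struct TypeNode TypeNode;
struct TypeNode {
    size_t leaf_size;                 /* used only when nchildren == 0 */
    size_t count;                     /* repetitions of the child list */
    size_t stride;                    /* bytes between repetitions */
    size_t nchildren;
    const TypeNode *const *children;  /* edges are pointers: local only */
    const size_t *offsets;            /* byte offset of each child */
};

/* Gather the bytes described by `t`, starting at `src`, into `dst`.
 * Returns the number of bytes written. */
size_t gather_tree(const TypeNode *t, const char *src, char *dst)
{
    assert(t != NULL);
    if (t->nchildren == 0) {          /* predefined type: copy the unit */
        memcpy(dst, src, t->leaf_size);
        return t->leaf_size;
    }
    size_t written = 0;
    for (size_t i = 0; i < t->count; i++)
        for (size_t k = 0; k < t->nchildren; k++)
            written += gather_tree(t->children[k],
                                   src + i * t->stride + t->offsets[k],
                                   dst + written);
    return written;
}
```

Every gather revisits the nodes through their pointers, which is the repeated traversal described above.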
The current approach has several limitations. The MPI standard (MPI-1) was initially defined in a way which allowed all MPI_Datatype information to be local. If two tasks wish to communicate, each task makes its own type constructor calls and each task produces its own tree-encoded description of a datatype. The sending task would "gather" based on the MPI_SEND type description and the data would flow to the destination, which would "scatter" according to the MPI_RECV type description. The programmer constructs compatible datatypes for the MPI_SEND and the matching MPI_RECV, but neither task has any access to the description used at the other end. Describing a datatype with a tree is adequate (though not ideal) when the description can remain local.
With MPI-2, the MPI standard was extended with features which depend on it being possible for a datatype which is constructed at one task to be sent to another for interpretation at the remote task.
One of the extended features is a one sided communication in which an origin task uses a call, such as, e.g., an MPI_PUT which specifies a first datatype to be used for the local "gather" as well as a second datatype to be used for the "scatter" at the target task. Both datatypes used in the MPI_PUT call are local to the task which calls MPI_PUT but the semantic of the call is as if the origin did an MPI_SEND with the first datatype and the target did a matching MPI_RECV with the second. To produce this semantic, the type description which is available at the origin is packaged and sent to the target in a form which the target task can interpret. One sided communication can include an additional complication in the MPI_ACCUMULATE function. An accumulate is like an MPI_PUT except that at the target, each unit of data which arrives (such as, e.g., integer, 4 byte floating point, 8 byte floating point etc.) can be combined with the data already present by some reduction function (such as, e.g., add, multiply and bit-and).
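The combining step of an accumulate can be sketched as follows for integer data that has already been placed contiguously. The enumeration ReduceOp and the function accumulate_int are hypothetical names for this illustration; in an actual accumulate the target locations would be selected by the second datatype rather than being contiguous.

```c
#include <assert.h>
#include <stddef.h>

/* Reduction functions named in the text: add, multiply and bit-and. */
typedef enum { OP_SUM, OP_PROD, OP_BAND } ReduceOp;

/* Combine each arriving integer with the integer already present at the
 * target. A put would simply overwrite target[i] instead. */
void accumulate_int(int *target, const int *incoming, size_t n, ReduceOp op)
{
    assert(target != NULL && incoming != NULL);
    for (size_t i = 0; i < n; i++) {
        switch (op) {
        case OP_SUM:  target[i] += incoming[i]; break;
        case OP_PROD: target[i] *= incoming[i]; break;
        case OP_BAND: target[i] &= incoming[i]; break;
        }
    }
}
```

Because the reduction is applied per data unit, the target must know the type of each arriving unit, not merely the byte count.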
Another extended feature is MPI-IO which allows MPI programs to treat files as if they were organized using MPI_Datatypes. Like one sided communication, MPI-IO uses encapsulation of the description of an MPI_Datatype at one task and sends it to another for interpretation.
A tree structure is inherently local because its nodes are each represented by some unit of memory and the edges between the nodes are pointers. It is not practical to copy a tree structure from one task's memory to another task's memory. Even when the tree is local, it is likely to be an inefficient use of processor data cache to traverse the tree, loading type description data from widely scattered tree nodes. It would be desirable for the essential information to be abstracted into a compact and portable form.
Participants in the MPI Forum (i.e., the standards body that defined the MPI standard) are aware of the problem sought to be solved by the present invention. A conventional solution to the problem exists but the conventional solution becomes impractical upon scaling. The MPI standard indicates that any message can be fully described by its "type map". The type map for a message has a "tuple" (i.e., {offset, type}) for each data item in the message. The type map for an array of 3 integers can be, e.g., ({0,int} {4,int} {8,int}). While a type map can be fully expressive, fully accurate and portable, the type map can quickly become useless because it can become too big when, e.g., a message of 1,000,000 integers is considered. The message of 1,000,000 integers would require a 1,000,000 tuple type map. If this 1,000,000 tuple type map needs to be sent to another task before the message can be sent, the cost can become prohibitive. Real but complex datatypes are often expressible in an affordable type map. Large but simple types can be expressible by a simple, compact formula. However, no one has devised a scheme which matches the expressive power of the MPI datatype constructor facility. Attempts to recognize common patterns and to use a different encoding for each different common pattern have fallen short. Alternative implementations to the solution of the present invention fall back to using flat type maps when the type does not fit a neat category. It is desired that an improved scheme, matching the expressive power of the MPI datatype constructor facility, be provided.
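The growth of a flat type map can be made concrete with the sketch below, which builds the type map for an array of contiguous integers. The names BaseType, MapEntry and build_int_type_map are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { T_INT } BaseType;  /* hypothetical type tag */

/* One {offset, type} tuple of a flat type map. */
typedef struct { size_t offset; BaseType type; } MapEntry;

/* Fill `map` with the flat type map of `count` contiguous ints and
 * return the number of bytes occupied by the map itself. */
size_t build_int_type_map(MapEntry *map, size_t count)
{
    assert(map != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        map[i].offset = i * sizeof(int);  /* 0, 4, 8, ... for 4-byte ints */
        map[i].type = T_INT;
    }
    return count * sizeof(MapEntry);
}
```

The map needs one entry per data item, so for a message of 1,000,000 integers it has 1,000,000 entries; on typical platforms, where a MapEntry is larger than an int, the description would exceed the size of the message it describes.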
An embodiment of the present invention is directed to a method for compiling, storing, and interpreting, as often as needed, a representation of any MPI datatype, including the steps of compiling a tree representation of an MPI datatype into a compact, linear data gather scatter program (DGSP) wherein the DGSP is of a form general enough to encode an arbitrarily complex datatype, registering the compact linear DGSP with a communications subsystem for later interpretation by the subsystem for at least one of sends, receives, packs and unpacks, creating a registered DGSP, and interpreting the registered DGSP.
The invention briefly involves taking each datatype encoded in a tree format and compiling the datatype to a linear format, interpreting that linear format to gather data according to a pattern, concatenating the data, and pushing the data out over a communication link; the contiguous data can then come off the link and can be distributed or scattered to destination buffers. Both the send (gather) and receive (scatter) tasks can use the same datatypes. A tree can be analyzed and a DGSP can be created to provide a portable representation of the tree. The present invention uses an MPI_TYPE_COMMIT call to compile and register the DGSP for later use. After compilation, the DGSP can be executed by the DGSM interpreter. Calls which use datatypes, including, for example, MPI_SEND and MPI_PACK, identify the type by its handle. The handle is created by MPCI when the DGSP is registered as part of MPI_TYPE_COMMIT. The DGSP is saved by MPI and passed to MPCI for any of the calls which use the datatype. A subsystem can execute the program as many times as directed in the MPI_SEND and can do this more efficiently than tree traversal. The subsystem does not need any information not encoded in the DGSP, so it does not care where the DGSP was created.
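The idea of interpreting a compact, linear program in place of a tree can be illustrated by the sketch below. The instruction set shown (DG_COPY, DG_ITERATE, DG_END), the structure DgInstr and the function dgsp_gather are purely hypothetical and are not the DGSP instruction format of the invention, which is not set out in this section; the sketch only shows that a flat array of instructions with no pointers suffices to drive a gather.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical opcodes for a linear gather program. */
typedef enum { DG_COPY, DG_ITERATE, DG_END } DgOp;

typedef struct {
    DgOp op;
    long a;    /* COPY: byte offset     ITERATE: repeat count */
    long b;    /* COPY: byte length     ITERATE: stride in bytes */
    long rel;  /* ITERATE: relative branch, in instructions */
} DgInstr;

enum { DG_MAX = 32 };  /* assumed upper bound on program length */

/* Run the program against `base`, packing bytes into `out`.
 * Returns the number of bytes gathered. */
size_t dgsp_gather(const DgInstr *prog, const char *base, char *out)
{
    long counters[DG_MAX] = { 0 };  /* one loop counter per instruction slot */
    size_t written = 0;
    long pc = 0;
    for (;;) {
        assert(pc >= 0 && pc < DG_MAX);
        const DgInstr *in = &prog[pc];
        switch (in->op) {
        case DG_COPY:
            memcpy(out + written, base + in->a, (size_t)in->b);
            written += (size_t)in->b;
            pc++;
            break;
        case DG_ITERATE:
            if (++counters[pc] < in->a) {     /* more passes remain */
                base += in->b;                /* advance by the stride */
                pc += in->rel;                /* branch back, relative */
            } else {                          /* loop finished */
                base -= (in->a - 1) * in->b;  /* restore the base address */
                counters[pc] = 0;
                pc++;
            }
            break;
        case DG_END:
            return written;
        }
    }
}
```

Under these assumptions, the "newtype" example sent with a count of 10 would be the four-instruction program COPY(0, 4), COPY(20, 4), ITERATE(10, 40, -2), END. Because the array holds no addresses, it could be copied byte for byte to another task and interpreted there.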
In one embodiment of the present invention, the form of the DGSP uses a single generalized representation. In another embodiment the single generalized representation covers any of the arbitrarily complex datatype patterns that can arise in this context. In yet another embodiment, the single generalized representation provides that any datatype that can be constructed using an application programming interface (API) in MPI can be converted into the form.
In another embodiment of the present invention, the compiling step obviates a need for a set of multiple representations for simple cases together with an inefficient representation for all others.
In one embodiment of the present invention, the DGSP is constructed using relative branch addresses. In an embodiment of the invention, the DGSP can be relocated without need to modify the DGSP. Two or more DGSPs can be concatenated to form a new DGSP, or DGSP fragments can be concatenated without rewrite.
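The benefit of relative branch addresses can be sketched as follows, using a deliberately reduced, hypothetical instruction record (Instr) that keeps only the branch information. The functions concat_fragments and branch_target are likewise assumptions for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reduced, hypothetical instruction: only the branch fields are kept. */
typedef struct {
    int is_branch;
    long rel;  /* branch distance in instructions, relative to itself */
} Instr;

/* Append fragment b after fragment a. No instruction is rewritten. */
size_t concat_fragments(Instr *dst, const Instr *a, size_t na,
                        const Instr *b, size_t nb)
{
    assert(dst != NULL);
    memcpy(dst, a, na * sizeof *a);
    memcpy(dst + na, b, nb * sizeof *b);
    return na + nb;
}

/* Index reached by the branch at position pc of `prog`. */
size_t branch_target(const Instr *prog, size_t pc)
{
    assert(prog[pc].is_branch);
    return (size_t)((long)pc + prog[pc].rel);
}
```

A branch with rel = -2 at the end of a three-instruction fragment lands on that fragment's first instruction whether the fragment stands alone (index 2 to index 0) or follows another three-instruction fragment (index 5 to index 3). With absolute addresses, the second fragment's branch would have to be patched after concatenation.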
In another embodiment of the present invention, the DGSP is constructed using absolute branch addresses.
In another embodiment of the present invention, the arbitrarily complex datatype is any datatype created by any sequence of calls to MPI datatype constructors and represents any possible layout of data in storage. In another embodiment, the storage includes memory. In another, the storage includes files.
In another embodiment of the present invention, the registering step includes returning a handle identifier for the registered DGSP and wherein the registered DGSP is identified by the handle.
Another embodiment of the present invention is directed to a method for enabling MPI datatype portability including the steps of compiling a tree representation of an MPI datatype into a compact, linear data gather scatter program (DGSP) wherein the DGSP is of a form general enough to encode an arbitrarily complex datatype, sending the form from a first task to a second task, receiving the form at the second task from the first task, and registering the form for later interpretation.
Further features and advantages of the invention, as well as the structure and operation of various embodiments of the invention, are described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers generally indicate identical, functionally similar, and/or structurally similar elements. The drawing in which an element first appears is indicated by the leftmost digits in the corresponding reference number.