Message Passing Interface (MPI) defines a standard application programming interface (API) for using several processes at one time to solve a single large problem, or “job,” on a multiprocessor and often multi-node computer (commonly one process per CPU across one or more multi-CPU nodes). Each job can include multiple processes. A process is also commonly referred to as a task. Each task can compute independently except when it needs to exchange data with another task. The program passes the data from one task to another as a “message.” Examples of multiprocessor computers include the IBM eServer Cluster 1600 available from IBM Corporation, Armonk, N.Y., and supercomputers available from Cray, Silicon Graphics, Hewlett Packard and the like.
The primary purpose of a Data Gather Scatter Program/Data Gather Scatter Machine (hereinafter referred to as DGSP/DGSM or DGSM) is to provide an efficient, portable solution for handling MPI datatypes in MPI operations. DGSM is described in the co-pending, commonly-assigned herewith, U.S. patent application Ser. No. 09/517,167, entitled “Data Gather/Scatter Machine,” filed on Mar. 2, 2000, the entire disclosure of which is herein incorporated by reference. It is important to note that any API that allows data of arbitrarily complex discontiguity to be described for the purpose of copying it could also use the DGSP/DGSM techniques. All MPI message passing is performance sensitive, so efficient treatment of MPI datatypes is crucial. Some MPI operations, such as MPI one-sided communication and MPI-IO, require that the representation of a datatype, defined by the application at one task of an MPI job, be interpreted at another task. Thus, portability of the representation is also required.
The DGSP/DGSM mechanism involves compiling a low level “program” (DGSP) for each MPI datatype the application constructs. This DGSP can be used in the task where it is created or delivered to another task. When delivered to another task, that other task can run the DGSP. A DGSP is run on a stack “machine” (DGSM) implemented in software to interpret the layout of a datatype and handle the data transfer.
The DGSP is a compact, portable and general representation of any MPI datatype, consisting of five basic DGSM instructions: COPY, MCOPY, GOSUB, ITERATE and CONTROL. Since the underlying communication subsystem for which the original DGSM was developed provides a buffered contiguous data stream between tasks, a basic DGSM assumes that at least one of the two buffers it operates on is contiguous. The other “buffer”, the one that can be discontiguous, is normally an application-defined data structure rather than a conventional communication buffer. The DGSM executes a gather from a discontiguous buffer (an application data structure) to a contiguous staging buffer, or a scatter from a contiguous staging buffer to an application data structure.
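The gather and scatter directions described above can be illustrated with a minimal sketch in C. The sketch does not reproduce the DGSP instruction encoding, which is not specified here; as an assumption made purely for illustration, a flat list of (offset, length) pairs stands in for the layout that a compiled DGSP would describe. The names chunk_t, gather_all and scatter_all are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a compiled datatype: one contiguous chunk
 * of an otherwise discontiguous application data structure. */
typedef struct {
    size_t offset;  /* byte offset of the chunk within the structure */
    size_t length;  /* number of contiguous bytes in the chunk */
} chunk_t;

/* Gather: pack every chunk of the application structure into the
 * contiguous staging buffer. Returns the number of bytes packed. */
size_t gather_all(const char *app, const chunk_t *chunks, size_t nchunks,
                  char *staging, size_t capacity)
{
    size_t packed = 0;
    for (size_t i = 0; i < nchunks; i++) {
        assert(packed + chunks[i].length <= capacity);
        memcpy(staging + packed, app + chunks[i].offset, chunks[i].length);
        packed += chunks[i].length;
    }
    return packed;
}

/* Scatter: unpack the contiguous staging buffer into the chunks of the
 * application structure. Returns the number of bytes unpacked. */
size_t scatter_all(char *app, const chunk_t *chunks, size_t nchunks,
                   const char *staging, size_t available)
{
    size_t unpacked = 0;
    for (size_t i = 0; i < nchunks; i++) {
        assert(unpacked + chunks[i].length <= available);
        memcpy(app + chunks[i].offset, staging + unpacked, chunks[i].length);
        unpacked += chunks[i].length;
    }
    return unpacked;
}
```

The two routines differ only in the direction of each copy, so the same layout description can serve either role. The source and destination layouts need not match chunk for chunk; they only need to describe the same total number of bytes.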
In certain shared memory optimizations, the data of a data structure belonging to one task, defined by a datatype created at that task, needs to be copied into a data structure at some other task on the same OSI (Operating System Image). The datatype defining the data structure belonging to the second task has been created only at the second task. Each datatype is available only within the address space of the task where it was created. The datatype created by each task will have been compiled into a DGSP, also available only within the address space of the task where it was created. A common solution for a data transfer between two discontiguous application data structures having non-uniform data layouts is to stage the transfer through a contiguous intermediate buffer. When the data structures belong to different tasks, the intermediate buffer must be in shared memory that is addressable by both tasks.
In this approach, data is first gathered from the source application buffer into the intermediate buffer and then scattered from the intermediate buffer to the destination application buffer. The DGSP/DGSM mechanism can be applied to both the gather and the scatter. Since the intermediate buffer size is always limited, buffer reuse has to be supported in a fill/drain cycle. The basic DGSM is designed to be activated repeatedly, each time being told how many bytes to process. The state of the machine at the end of one activation is retained so the next activation can resume, with respect to the application buffer, where the previous activation left off. To allow reuse of a single staging buffer or of a pool of staging buffers, each activation of the DGSM is provided with a new pointer to space in the intermediate buffer along with the number of bytes to be copied. At the data source side, one DGSM gathers the requested number of bytes into the intermediate buffer. Then, at the data destination side, another DGSM scatters a specified number of bytes from the intermediate buffer to a discontiguous application data structure. One DGSM runs in each process, and each DGSM depends on the DGSP and other information created within the address space where that DGSM runs. A few variables in shared memory allow the gather DGSM to tell the scatter DGSM that the staging buffer has been filled so the scatter can begin. Later, the scatter DGSM tells the gather DGSM that all data from the staging buffer has been scattered and the buffer may be overwritten with more data. The cycle can continue until all data has been transferred via the staging buffer. There is no distinction between a DGSP to be interpreted for source (gather) DGSM runs and one to be interpreted for destination (scatter) DGSM runs. The DGSP “machine instruction” COPY differs for gather or scatter only in which direction it copies the specified number of bytes.
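The fill/drain cycle described above can be sketched in C as follows. This is an illustrative simplification under stated assumptions, not the DGSM itself: a hypothetical list of (offset, length) chunks replaces the DGSP, a small hypothetical state record (machine_state_t) replaces the state cache, and both machines run in turn within a single process, so the shared memory variables used for turn taking between two tasks are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a compiled datatype layout. */
typedef struct { size_t offset, length; } chunk_t;

typedef enum { GATHER, SCATTER } direction_t;

/* State retained between activations so the next one can resume,
 * with respect to the application buffer, where the last one stopped. */
typedef struct {
    const chunk_t *chunks;
    size_t nchunks;
    size_t index;  /* chunk currently being processed */
    size_t done;   /* bytes already processed within that chunk */
} machine_state_t;

/* One activation: move up to nbytes between the application structure
 * and the staging space. Returns the number of bytes actually moved. */
size_t activate(machine_state_t *st, direction_t dir,
                char *app, char *staging, size_t nbytes)
{
    size_t moved = 0;
    while (moved < nbytes && st->index < st->nchunks) {
        const chunk_t *c = &st->chunks[st->index];
        size_t n = c->length - st->done;
        if (n > nbytes - moved)
            n = nbytes - moved;
        char *p = app + c->offset + st->done;
        if (dir == GATHER)
            memcpy(staging + moved, p, n);  /* application -> staging */
        else
            memcpy(p, staging + moved, n);  /* staging -> application */
        moved += n;
        st->done += n;
        if (st->done == c->length) {        /* chunk finished, advance */
            st->index++;
            st->done = 0;
        }
    }
    return moved;
}

/* Fill/drain cycle: gather fills the staging buffer, scatter drains it,
 * and the buffer is reused until the source layout is exhausted. */
size_t transfer(char *dst, const chunk_t *dchunks, size_t dn,
                char *src, const chunk_t *schunks, size_t sn,
                char *staging, size_t staging_size)
{
    machine_state_t g = { schunks, sn, 0, 0 };
    machine_state_t s = { dchunks, dn, 0, 0 };
    size_t total = 0;
    for (;;) {
        size_t filled = activate(&g, GATHER, src, staging, staging_size);
        if (filled == 0)
            break;
        size_t drained = activate(&s, SCATTER, dst, staging, filled);
        assert(drained == filled);  /* layouts must hold equal byte counts */
        total += drained;
    }
    return total;
}
```

Note that every byte passes through memcpy twice, once in each direction, regardless of how small the staging buffer is; the staging buffer size only affects how many fill/drain rounds are needed.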
FIG. 1 illustrates the structure of this prior art staging buffer approach. A discontiguous source application data structure 100 containing chunks of contiguous data with gaps in between is copied to a contiguous intermediate buffer 102 by way of a gather DGSM 103. The gather function uses a state cache 105 and a source DGSP 107 to control its progress. Next, a scatter DGSM 104 is used to write the data contained in the contiguous intermediate buffer 102 to a discontiguous destination application data structure 101. A state cache 106 and destination DGSP 108 are used to control the progress of the scatter function. A set of variables 109 in shared memory is used by both DGSMs 103 and 104 to allow them to take turns with fill/drain of the contiguous intermediate buffer 102.
Since each byte is copied twice in going from the source to the destination, this solution is inefficient in terms of both the time taken for the data transfer and CPU demand, especially when the size of the data to be transferred is large. More seriously, any memory system bandwidth bottleneck is stressed twice by this two-stage approach. Therefore, a need exists to overcome the problems with the prior art as discussed above.