The present invention relates to a file I/O control method, a file management server, and a parallel computing system, and more particularly, to a file I/O control method, a file management server, and a parallel computing system for use in a parallel computer which operates a plurality of processes in concert for performing computations.
Parallel computing, which initiates a plurality of processes on a parallel computer and operates the plurality of processes in concert for performing computations, often employs an SPMD (Single Program Multiple Data stream) model in which all processes execute the same program code to operate on data different from one another. Since all the processes execute the same program code in this model, I/O requests are often issued from all the processes at substantially the same timing. Particularly, a parallel program for scientific computations processes array data stored in files, so that the same array data is often divided into sub-regions among a plurality of processes. In this event, a certain process will access data in a particular row or column. Such data in a particular row or column is typically arranged on a file in a noncontiguous manner. For this reason, a file access pattern of each process in a parallel program as mentioned above includes an access to a noncontiguous region.
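The noncontiguous layout described above can be illustrated with a minimal sketch. It assumes a two-dimensional array stored in row-major order with fixed-size elements; the function names, parameters, and the 8-byte element size are hypothetical and chosen only for illustration.

```python
def column_byte_ranges(n_rows, n_cols, col, elem_size=8):
    """Return the (offset, length) byte ranges that hold one column
    of a row-major array stored in a file (assumed layout)."""
    return [((r * n_cols + col) * elem_size, elem_size) for r in range(n_rows)]


def row_byte_ranges(n_cols, row, elem_size=8):
    """A full row of a row-major array is one contiguous byte range."""
    return [(row * n_cols * elem_size, n_cols * elem_size)]


# Column 1 of a 4 x 4 array of 8-byte elements: four separate small ranges.
col_ranges = column_byte_ranges(4, 4, col=1)
# Row 2 of the same array: a single 32-byte range.
row_ranges = row_byte_ranges(4, row=2)
```

Under these assumptions, a process that owns a column has to touch one small region per row, whereas a process that owns a row touches a single region.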
"Dynamic file-access characteristics of a production parallel scientific workload", David Kots, Song Bac Tho, and Sriram Radhakrishanan, Proceedings of Supercomputing '94, pp. 640-649, November 1994 (hereinafter, Kots-94) points out the following two problems which are experienced by a program that has a file access pattern as mentioned above, when it issues an I/O request using a conventional UNIX interface.
(1) Since each process issues I/O requests corresponding to the number of noncontiguous regions, overhead caused by system calls increases.
(2) Since the system cannot recognize the relationship among I/O requests issued by a plurality of processes, random disk accesses are generated in response to the respective I/O requests.
Generally, a disk access is slower than a memory access. Also, the performance of a random disk access is lower than the performance of an access to a contiguous region because of the movements of a disk head involved in the random disk access. If respective processes that execute a parallel program access noncontiguous regions in a file as mentioned above, noncontiguous disk regions will be accessed, and because the I/O requests issued from the respective processes arrive in an inconsistent order, the file I/O performance is significantly degraded.
Kots-94 further points out that the foregoing problems result from the fact that a UNIX type file I/O interface does not have a noncontiguous region access function and a function for specifying the relationship of file I/O among a plurality of processes. Thus, as a useful means for addressing these causes, Kots-94 suggests the use of a stride I/O interface, and a collective I/O interface through which all associated processes issue I/O requests to the same file.
The stride I/O interface enables noncontiguous regions on a file to be accessed with a single I/O request, while the collective I/O interface allows all of a plurality of processes to issue I/O requests to the same file.
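As a rough illustration of how a stride I/O request can describe many noncontiguous regions at once, the following sketch models such a request as a start offset, a block length, a stride, and a block count. The class and field names are hypothetical and are not taken from MPI-IO or any other real interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StridedRequest:
    """Hypothetical descriptor for one stride I/O request."""
    offset: int     # byte offset of the first block in the file
    block_len: int  # bytes in each block
    stride: int     # distance in bytes between the starts of successive blocks
    count: int      # number of blocks

    def byte_ranges(self):
        """Expand the descriptor into the (offset, length) ranges it covers."""
        return [(self.offset + i * self.stride, self.block_len)
                for i in range(self.count)]


# One request covering four 8-byte blocks that are 32 bytes apart.
# With a plain read/write interface this would take four separate calls.
request = StridedRequest(offset=8, block_len=8, stride=32, count=4)
ranges = request.byte_ranges()
```

A single descriptor of this kind addresses problem (1) above, since the number of requests no longer grows with the number of noncontiguous regions.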
A typical example of the foregoing interface may be MPI-IO which is defined in Section 9 of "MPI-2: Extensions to the Message-Passing Interface", Message Passing Interface Forum, http://www.mpi-forum.org/docs/docs.html (hereinafter called "MPI-2"). Also, "Data Sieving and Collective I/O in ROMIO", Rajeev Thakur, William Gropp, Ewing Lusk, Proceedings of the 7th Symposium on the Frontiers of Massively Parallel Computation, February 1999, pp. 182-189 (hereinafter called "ROMIO") is known. This is an exemplary implementation of MPI-IO.
According to ROMIO, the I/O requests of a parallel program as mentioned above can be converted to accesses to a contiguous region by merging all I/O requests issued by the respective processes through the collective I/O, even if the individual I/O requests involve accesses to noncontiguous regions, thereby improving the file I/O performance. ROMIO waits until all associated processes have issued their collective I/O requests, merges all the I/O requests once they have all been issued to convert them to accesses to a contiguous region, and then performs disk I/O operations, the results of which are notified to the respective processes. As described above, the file I/O performance from a plurality of processes can be improved by using the stride I/O interface and collective I/O interface.
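The merging step can be sketched as follows. This is only an illustration of combining byte ranges from several processes into contiguous ranges; it is not ROMIO's actual algorithm, and the function name and the example layout (four processes, each owning one column of a 4 x 4 row-major array of 8-byte elements) are assumptions.

```python
def merge_ranges(requests):
    """Merge (offset, length) ranges into maximal contiguous ranges."""
    merged = []  # list of [start, end) pairs, kept sorted by start
    for off, length in sorted(requests):
        end = off + length
        if merged and off <= merged[-1][1]:
            # Range touches or overlaps the previous one: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([off, end])
    return [(start, end - start) for start, end in merged]


# Each of four processes requests one column: four 8-byte pieces per process.
per_process = [
    [((row * 4 + rank) * 8, 8) for row in range(4)]
    for rank in range(4)
]
all_requests = [r for ranges in per_process for r in ranges]

# Sixteen small noncontiguous pieces collapse into one 128-byte region.
merged = merge_ranges(all_requests)
```

In this example no single process accesses a contiguous region, yet the union of all requests is one contiguous region that can be serviced with a single disk access.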
However, the foregoing collective I/O waits until all associated processes have issued collective I/O requests. For this reason, if the processes issue collective I/O requests at different times, the process which has first issued a collective I/O request is kept waiting until collective I/O requests are issued from the remaining processes. This prevents the process which has issued the collective I/O request from executing other processing during the waiting time, resulting in a problem that the processor activity is reduced.
To solve the foregoing problem, it is an object of the present invention to provide a file I/O control method, a file management server, and a parallel computing system which are capable of reducing a file I/O waiting time of each process while maintaining the file I/O performance associated with file I/O operations from a plurality of processes at a level equivalent to that of the collective I/O.
It is another object of the present invention to provide a file I/O control system, a file management server, and a parallel computing system which are capable of reducing a waiting time for a collective I/O request from each process without modifying the conventional collective I/O interface.
According to the present invention, the foregoing objects are achieved by a file I/O control method for use by a plurality of processes to access the same file in a shared manner, wherein each of the plurality of processes notifies a file management server of a file I/O request of the process, and hint information including information on a file region accessed by all of the plurality of processes, and the file management server provides a buffer for performing an I/O operation to and from the file region notified by the hint information. When a file I/O request from each of the plurality of processes is a file read request, data in the file region specified by the hint information is read from a disk into the provided buffer, and after reading the data from the disk, data in the provided buffer is copied between memories from a memory region corresponding to a file I/O request first issued by a process into a data region specified by the process. When the file I/O request from each of the plurality of processes is a file write request, data is copied between memories from a data region specified by each process by the file I/O request into a region in the provided buffer corresponding to the request of the process to complete a file write for each process. After the intermemory-copying by all of the plurality of processes, the data in the provided buffer is written into a file region in a disk notified by the hint information.
In the present invention, which provides the foregoing processing, each user process specifies a file region read by all user processes, as hint information, in addition to the region it accesses itself, upon issuing a collective I/O request. Since the file management server can recognize the I/O requests of all the user processes from the hint information as soon as any one user process issues a collective I/O request, the file management server can collectively read data for all the user processes from a disk using the hint information. As a result, the present invention can improve the processor activity, because it eliminates the need to wait for the issuance of collective I/O requests from the remaining processes and can promptly return control to the user process.
Also, in the present invention, data once read from a disk is held in the buffer in the file management server until all the user processes have issued the collective I/O requests. When another user process issues a collective I/O request, only pertinent data is copied from the buffer to a user space. Therefore, the present invention can limit the required disk access to only once, when the first process issues the collective I/O request, thereby attaining the performance equivalent to that provided by a conventional collective I/O method.
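The READ behavior described above can be sketched with a small in-memory model. This is a hedged illustration only: the class name, method signature, the use of a `bytes` object as the "disk", and the representation of the hint as an (offset, length) pair are all assumptions, not the interface of the invention.

```python
class HintedReadServer:
    """Toy model of a file management server handling collective READs."""

    def __init__(self, disk: bytes, n_procs: int):
        self.disk = disk          # stands in for the file on disk
        self.n_procs = n_procs    # processes taking part in the collective I/O
        self.buffer = None        # server-side buffer for the hinted region
        self.hint = None
        self.served = set()
        self.disk_reads = 0       # counts disk accesses, for illustration

    def read(self, rank, ranges, hint):
        """Serve one process's request immediately, without waiting for others."""
        if self.buffer is None:
            # First request: read the whole hinted region with one disk access.
            hint_off, hint_len = hint
            self.buffer = self.disk[hint_off:hint_off + hint_len]
            self.hint = hint
            self.disk_reads += 1
        base = self.hint[0]
        # Memory-to-memory copy of only the pieces this process asked for.
        data = b"".join(self.buffer[off - base:off - base + length]
                        for off, length in ranges)
        self.served.add(rank)
        if len(self.served) == self.n_procs:
            # Every process has been served: release the buffer.
            self.buffer, self.hint = None, None
            self.served.clear()
        return data


disk = bytes(range(16))
server = HintedReadServer(disk, n_procs=2)
hint = (0, 16)  # region accessed by all processes together
first = server.read(0, [(0, 1), (4, 1), (8, 1), (12, 1)], hint)
second = server.read(1, [(1, 1), (5, 1), (9, 1), (13, 1)], hint)
```

In this model the first caller triggers the only disk read and returns at once; the second caller is served purely from the buffer.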
The present invention can also be applied to a WRITE access to the same parallel file from a plurality of user processes, in a manner similar to the foregoing. Specifically, when each user process issues a WRITE collective I/O request, the user process specifies a file region into which all the user processes write data, as hint information. The file management server can find a buffer size required for storing merged I/O data, using this hint information, thereby immediately reserving a buffer within a file server. For this reason, the present invention can return the control to the user process immediately after collective I/O data of the user process is copied into a pertinent region in the buffer, so that the processor activity can be improved.
Further, in the present invention, the buffer reserved in the file server is held until all the user processes have completely issued the collective I/O requests, such that collective I/O request data has to be written only once into a disk at the time the collective I/O request data is all ready, thereby making it possible to attain the performance equivalent to that provided by the conventional collective I/O method.
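The WRITE behavior can be sketched in the same spirit. Again this is only an in-memory model under stated assumptions: the names are hypothetical, the "disk" is a `bytearray`, the hint is an (offset, length) pair, and the regions written by the processes are assumed to cover the hinted region completely.

```python
class HintedWriteServer:
    """Toy model of a file management server handling collective WRITEs."""

    def __init__(self, disk: bytearray, n_procs: int):
        self.disk = disk          # stands in for the file on disk
        self.n_procs = n_procs
        self.buffer = None        # server-side buffer sized from the hint
        self.hint = None
        self.done = set()
        self.disk_writes = 0      # counts disk accesses, for illustration

    def write(self, rank, ranges, data, hint):
        """Copy one process's data into the buffer and return immediately."""
        if self.buffer is None:
            # The hint gives the merged size, so the buffer can be reserved
            # as soon as the first request arrives.
            self.buffer = bytearray(hint[1])
            self.hint = hint
        base = self.hint[0]
        pos = 0
        for off, length in ranges:
            # Memory-to-memory copy into the matching part of the buffer.
            self.buffer[off - base:off - base + length] = data[pos:pos + length]
            pos += length
        self.done.add(rank)
        if len(self.done) == self.n_procs:
            # All data is ready: write the hinted region to disk exactly once.
            self.disk[base:base + len(self.buffer)] = self.buffer
            self.disk_writes += 1
            self.buffer, self.hint = None, None
            self.done.clear()


disk = bytearray(8)
server = HintedWriteServer(disk, n_procs=2)
hint = (0, 8)
server.write(0, [(0, 1), (2, 1), (4, 1), (6, 1)], b"ACEG", hint)
writes_after_first = server.disk_writes
server.write(1, [(1, 1), (3, 1), (5, 1), (7, 1)], b"BDFH", hint)
```

Here the first process returns after a buffer copy with no disk access, and the single disk write happens only when the last process has contributed its data.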
According to the present invention, in the collective I/O processing which permits a plurality of processes to access the same file in a shared manner, an I/O wait time for each process can be reduced to improve the processor activity.