1. Field of the Invention
The present invention relates to a file control method for a file system (parallel file system) adapted for a parallel program running on a parallel computer system in which multiple computers are interconnected by a high-speed network, a parallel file system, and a program storage medium for implementing that parallel file system.
2. Description of the Related Art
As a technique of increasing the speed with which a file on I/O node computers (called server nodes) is accessed by multiple computers on which a user program runs (called client nodes) over a network, the client cache system is well known in which a cache is installed on the client side to minimize the amount of data transferred between the client and the server.
However, the client cache system has a problem: in an environment where “write share,” in which multiple nodes update a shared file concurrently, is in general use, the overhead of acquiring the right of access to the client cache and of invalidating cache data increases.
To solve this problem, the client cacheless system is frequently used, in which the client node communicates with the server node each time a read/write request is made by a user program. However, this system has a drawback that, when the read/write data used by the user program is small in length, the communication overhead increases sharply and, moreover, the effectiveness of I/O node striping is lost.
As a technique of allowing the data length of read/write request used by the user program to be large in the client cacheless environment, a stride access interface has been proposed. The stride access interface services access to more than one portion of a file in a single read/write request by declaring a discrete sequential access pattern for the file. For example, when the user program desires to make access to discrete data in a certain file, such as 200 bytes of data from the 1000th byte, 200 bytes of data from the 2000th byte, and 200 bytes of data from the 3000th byte, the stride access interface services that access in a single read/write request by declaring a pattern in which data to be accessed are placed.
As compared with the case where read/write requests are issued individually, the stride access interface provides optimization of a file system and higher utilization of the network.
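The stride access example above (200 bytes from each of the 1000th, 2000th, and 3000th bytes) can be sketched as follows. This is a minimal illustration only, not the interface of any actual file system: the names `StridePattern`, `extents`, and `stride_read` are hypothetical, the "Nth byte" is treated as a zero-based offset N, and an in-memory byte string stands in for the file.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class StridePattern:
    """Hypothetical declaration of a discrete sequential access pattern."""
    start: int   # offset of the first block (assumed zero-based)
    length: int  # bytes in each block
    stride: int  # distance between the starts of consecutive blocks
    count: int   # number of blocks

    def extents(self) -> List[Tuple[int, int]]:
        """Expand the pattern into (offset, length) pairs."""
        return [(self.start + i * self.stride, self.length)
                for i in range(self.count)]


def stride_read(file_image: bytes, pattern: StridePattern) -> bytes:
    """Service every block of the pattern in one request."""
    return b"".join(file_image[off:off + n] for off, n in pattern.extents())


# The pattern from the text: three 200-byte blocks, 1000 bytes apart.
pattern = StridePattern(start=1000, length=200, stride=1000, count=3)
file_image = bytes(4000)          # stand-in for the file contents
result = stride_read(file_image, pattern)
```

Declaring the pattern once lets a single request replace three individual read requests.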
A disk storage unit (hereinafter referred to as a disk), which is an important ingredient of a parallel file system, can exhibit the highest performance when accessed sequentially. When accessed randomly, on the other hand, the disk suffers from considerable degradation in performance due to the seek time and rotational latency time. As a technique of evaluating stride access interface requests issued by two or more client nodes and declared to be related to each other and converting disk accesses to a sequential one by taking advantage of such characteristics of the disk, the collective I/O system is known.
In the collective I/O system, the server node schedules disk accesses and data transfers and handles related access requests issued by all related client nodes collectively to carry out input/output (I/O) operations, thereby minimizing the number of disk accesses and the time required for data transfers.
Conventional parallel programs for write share of a file contain logic to assure data consistency without fail. FIG. 1 shows a general operation of a parallel program adapted for write share of a file. Process 2A (process 1) and process 2B (process 2) are sub-programs that make up a parallel program and run on different compute nodes 1A and 1B. Process 2A sends a message to process 2B on the other compute node 1B to notify it that a file 8 is ready for processing.
That is, the parallel program for write share should contain a process (notify node B) of notifying the parallel program running on the other compute node of a file having been updated and does not rely on only timing-dependent sequential consistency.
The stride access interface and the collective I/O techniques are useful in significantly improving the input/output operations of a parallel program. However, they have the drawback of requiring considerable amendments to an existing program, because they differ greatly from an existing file access interface, for example, that of the UNIX system, which supposes a parallel program, and because the user program needs a large number of I/O buffers.
The client cache system, although having advantages of the capability of servicing small size read/write requests in an efficient manner and moreover permitting an existing program to be used without being amended, has a drawback of requiring a significant overhead for keeping cache consistency.
It is therefore an object of the present invention to improve the performance of a parallel file system without the need of considerable amendment to an existing program.
According to an aspect of the present invention, in a network file system in which multiple compute nodes share files over a network, each compute node comprises a file update notification facility for, when a file update is made by a compute node, notifying other compute nodes of the file update, and each compute node stores read data or write data for the file in a buffer in the compute node. A program that runs on a node calls the file update notification facility when consistency for file update data is needed, and the file update notification facility invalidates the data corresponding to the file update data that each compute node stores in its buffer.
The buffering in each node is performed only for data that has been actually written. Unlike cache control, therefore, there is no need of exclusive control for preventing multiple nodes from simultaneously updating the same cache line, allowing multiple nodes to operate in parallel. Since only modified portions are held, a file will not be destroyed even if an I/O node merges two or more write requests made by the compute nodes. Thus, the concurrent updating of a file by two or more nodes can be made fast without destroying the file.
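As a rough sketch of why holding only the modified portions keeps a merge safe, the following models each node's buffer as a map from offset to written bytes. The class `WriteBuffer` and the function `apply_writes` are hypothetical names introduced for illustration, and a `bytearray` stands in for the file on the I/O node.

```python
from typing import Dict


class WriteBuffer:
    """Per-node buffer that records only bytes actually written (assumed model)."""

    def __init__(self) -> None:
        self.dirty: Dict[int, bytes] = {}  # offset -> data written at that offset

    def write(self, offset: int, data: bytes) -> None:
        self.dirty[offset] = data


def apply_writes(file_image: bytearray, *buffers: WriteBuffer) -> None:
    """Merge write requests from several nodes into the file.

    Only the recorded ranges are overwritten, so bytes a node never
    touched cannot clobber another node's update.
    """
    for buf in buffers:
        for offset, data in sorted(buf.dirty.items()):
            end = offset + len(data)
            if end > len(file_image):
                file_image.extend(bytes(end - len(file_image)))
            file_image[offset:end] = data


# Two nodes update different parts of the same 16-byte region concurrently.
file_image = bytearray(b"." * 16)
node_a, node_b = WriteBuffer(), WriteBuffer()
node_a.write(0, b"AAAA")
node_b.write(8, b"BBBB")
apply_writes(file_image, node_a, node_b)
```

Had each node written back a whole cached region instead, the later write-back would have overwritten the earlier node's bytes with stale data.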
In particular, an existing parallel program which carries out write share of a file simply declares the inventive control to be put into effect at the time of opening a file and adds a statement to call (propagate) the file update notification facility to statements of the program that notify the other nodes that a file update has been performed. That is, minimal modifications to the existing parallel program can improve its performance.
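The interplay between node-local buffering and the file update notification can be sketched as below. This is an in-process model under stated assumptions: `ComputeNode`, `propagate`, and `invalidate` are hypothetical names, a dictionary of `bytearray` objects stands in for the I/O node, and writes are assumed to fall within the existing file length.

```python
from typing import Dict, List


class ComputeNode:
    """Toy compute node with read and write buffers (assumed model)."""

    def __init__(self, store: Dict[str, bytearray]) -> None:
        self.store = store                      # stand-in for the I/O node
        self.peers: List["ComputeNode"] = []
        self.read_buf: Dict[str, bytes] = {}    # file name -> buffered contents
        self.write_buf: Dict[str, Dict[int, bytes]] = {}

    def read(self, name: str, offset: int, length: int) -> bytes:
        if name not in self.read_buf:           # fill the buffer on first use
            self.read_buf[name] = bytes(self.store[name])
        return self.read_buf[name][offset:offset + length]

    def write(self, name: str, offset: int, data: bytes) -> None:
        self.write_buf.setdefault(name, {})[offset] = data

    def propagate(self, name: str) -> None:
        """Called by the program where it already notifies the other nodes."""
        for offset, data in self.write_buf.pop(name, {}).items():
            self.store[name][offset:offset + len(data)] = data
        for peer in self.peers:
            peer.invalidate(name)

    def invalidate(self, name: str) -> None:
        self.read_buf.pop(name, None)


store = {"f": bytearray(b"old data")}
a, b = ComputeNode(store), ComputeNode(store)
a.peers, b.peers = [b], [a]

before = b.read("f", 0, 3)       # b buffers the file
a.write("f", 0, b"new")          # held in a's buffer only
stale = b.read("f", 0, 3)        # b still sees its buffered copy
a.propagate("f")                 # flush, then invalidate b's buffer
after = b.read("f", 0, 3)
```

No consistency traffic occurs between the write and the call to `propagate`; the program decides when consistency is needed.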
According to another aspect of the present invention, in a network file system in which multiple client nodes share a file striped on multiple server nodes over a network, each client node temporarily stores data for which a write request is issued by a user program into buffers and passes the data on to multiple server nodes collectively at the time when the buffers become full or the buffer contents reach a predetermined amount, and, for a read request issued by the user program, each client node reads-ahead data from the server nodes into the buffers collectively, and, for subsequent read requests, copies data read-ahead into the buffers into a user buffer in the user program.
Even with a user program involving many write requests for data of small length, since data are temporarily stored in the buffers and, when the buffers become full, the data are sent to the multiple server nodes in parallel, the effectiveness of I/O node striping can be displayed fully irrespective of data length. Even in the environment in which read requests for data of small length are frequently made, since data are read-ahead by the amount equal to the size of the buffer into the buffers collectively from all server nodes, throughput proportional to the number of server nodes can be attained irrespective of data length.
Each client node has a buffer for each of the server nodes on which a file is striped. If, when one of the buffers is filled with data for which a write request is issued by the user program, a predetermined amount of data is stored in the other buffers, data in all these buffers are sent to the server nodes simultaneously.
Thus, the buffers can be used as buffers for communication with the server nodes without modification. This reduces the number of useless memory copy processes, such as from user buffers to system buffers and from system buffers to communication buffers. Moreover, because data in the buffers for the other server nodes are sent simultaneously when one buffer becomes full, high throughput proportional to the number of the server nodes can be attained even in an environment in which there are many read/write requests for data whose length is small compared with the stripe width.
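The per-server write buffering described above might be sketched as follows. All names (`StripedWriteBuffers`, `fill`, `sent`) are hypothetical, and two points are assumptions rather than statements from the text: stripes are assigned to servers round-robin, and a buffer holding less than the predetermined amount is left unsent when another buffer fills.

```python
from typing import Dict, List, Tuple

Extent = Tuple[int, bytes]  # (file offset, data)


class StripedWriteBuffers:
    """One write buffer per server node on which the file is striped."""

    def __init__(self, num_servers: int, stripe_width: int,
                 capacity: int, threshold: int) -> None:
        self.num_servers = num_servers
        self.stripe_width = stripe_width
        self.capacity = capacity      # buffer size in bytes
        self.threshold = threshold    # the "predetermined amount"
        self.buffers: List[List[Extent]] = [[] for _ in range(num_servers)]
        self.sent: List[Dict[int, List[Extent]]] = []  # log of transfers

    def fill(self, server: int) -> int:
        return sum(len(data) for _, data in self.buffers[server])

    def write(self, offset: int, data: bytes) -> None:
        pos = 0
        while pos < len(data):
            off = offset + pos
            stripe = off // self.stripe_width
            server = stripe % self.num_servers        # assumed round-robin
            stripe_end = (stripe + 1) * self.stripe_width
            room = self.capacity - self.fill(server)
            n = min(len(data) - pos, stripe_end - off, room)
            self.buffers[server].append((off, data[pos:pos + n]))
            pos += n
            if self.fill(server) >= self.capacity:
                self._flush(server)

    def _flush(self, full_server: int) -> None:
        # Send the full buffer together with every other buffer that
        # already holds at least the predetermined amount.
        batch = {s: list(self.buffers[s]) for s in range(self.num_servers)
                 if s == full_server or self.fill(s) >= self.threshold}
        for s in batch:
            self.buffers[s].clear()
        self.sent.append(batch)


client = StripedWriteBuffers(num_servers=2, stripe_width=4,
                             capacity=8, threshold=4)
client.write(0, b"abcdefghijklmnop")
```

In this run the buffer for server 0 fills first; the buffer for server 1 already holds the threshold amount, so both are sent in one transfer and the last four bytes remain buffered for server 1.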
According to still another aspect of the present invention, in a network file system in which multiple client nodes share a file placed on one or more server nodes over a network, upon receiving from one of the client nodes a request for access to an area of a storage medium which is contiguous to the area to which the immediately preceding access was made, each of the server nodes checks whether requests issued by the other client nodes have been received and, when they have been received, arranges the requests so that storage medium accesses are made in the order of ascending addresses.
When the user programs running on the client nodes all make requests for access to a file on a disk in the order of ascending addresses but noncontiguously, the access requests made by all the client nodes are evaluated and arranged so that the disk accesses are made in sequential order. This reduces the latency time and the seek time for the disk in which the file is stored.
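The effect of arranging requests in ascending address order can be illustrated with a small sketch. `Request`, `arrange`, and `head_travel` are hypothetical names, and the distance the head moves between requests is used only as a crude stand-in for seek cost.

```python
from typing import List, NamedTuple


class Request(NamedTuple):
    client: int
    address: int   # starting address on the storage medium
    length: int


def arrange(requests: List[Request]) -> List[Request]:
    """Order the pending requests of all clients by ascending address."""
    return sorted(requests, key=lambda r: r.address)


def head_travel(requests: List[Request], start: int = 0) -> int:
    """Total distance the head moves between requests (crude seek model)."""
    pos, total = start, 0
    for r in requests:
        total += abs(r.address - pos)
        pos = r.address + r.length
    return total


# Each client issues ascending but noncontiguous requests; the server
# receives them interleaved.
arrival = [
    Request(0, 0, 100), Request(0, 200, 100),
    Request(1, 100, 100), Request(1, 300, 100),
    Request(0, 400, 100), Request(1, 500, 100),
]
ordered = arrange(arrival)
travel_as_received = head_travel(arrival)
travel_arranged = head_travel(ordered)
```

After arrangement the six requests form one contiguous sweep, so the modeled head travel drops to zero.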
According to a further aspect of the present invention, in a network file system in which multiple client nodes share a file placed on one or more server nodes over a network, the server node comprises switching means for switching the storage medium access mode between the sequential mode, in which access requests are arranged in the ascending order of storage medium addresses, and the non-sequential mode, in which access requests are processed in the order in which they were accepted. The server node monitors, for each of the client nodes, whether the access requests made by that client node are in the ascending order of storage medium addresses. The server node switches from the non-sequential mode to the sequential mode when a predetermined number of access requests in the ascending order of storage medium addresses is received in succession from each of the client nodes, and switches from the sequential mode to the non-sequential mode when an access request that is not in the ascending order of storage medium addresses is received from any of the client nodes.
The automatic switching between the sequential mode and the non-sequential mode allows contiguous access consisting of multiple access requests to be serviced in an efficient manner. For noncontiguous access requests, waiting time and arranging time for the requests can be reduced.
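The switching rule can be sketched as a small state machine. The class `AccessModeSwitch` and its method `observe` are hypothetical; treating a repeated address as out of order and resetting every client's count on a switch back to the non-sequential mode are assumptions made for this illustration.

```python
from typing import Dict, Iterable, Optional


class AccessModeSwitch:
    """Tracks per-client ordering and selects the access mode."""

    def __init__(self, clients: Iterable[int], required_run: int) -> None:
        self.required_run = required_run   # the "predetermined number"
        self.last_addr: Dict[int, Optional[int]] = {c: None for c in clients}
        self.run: Dict[int, int] = {c: 0 for c in self.last_addr}
        self.sequential = False            # start in non-sequential mode

    def observe(self, client: int, address: int) -> bool:
        """Record one request; return True while in sequential mode."""
        last = self.last_addr[client]
        self.last_addr[client] = address
        if last is not None and address <= last:
            # Out-of-order request: fall back and restart counting
            # for every client (assumption).
            self.run = {c: 0 for c in self.run}
            self.sequential = False
        else:
            self.run[client] += 1
            if all(n >= self.required_run for n in self.run.values()):
                self.sequential = True
        return self.sequential


switch = AccessModeSwitch(clients=[0, 1], required_run=2)
trace = [(0, 0), (1, 100), (0, 200), (1, 300), (0, 400), (1, 50)]
modes = [switch.observe(client, address) for client, address in trace]
```

The mode becomes sequential only once both clients have each issued two ascending requests, and it reverts as soon as client 1 issues a request at a lower address.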
The program that is run on the client node or server node to implement the above processing can be stored in a computer-readable storage medium.