In recent years, the amount of I/O data handled by computer systems has increased along with increases in CPU speed, memory capacity and memory bandwidth. Furthermore, various large-scale file systems, such as high-speed file sharing systems based on a large-scale SAN (Storage Area Network) environment and cluster file systems that use multiple file servers, are now commonly available, especially in HPC (High Performance Computing). Moreover, MPI-2 (Message Passing Interface-2) supports MPI-IO as a user interface for parallel I/O, improving user convenience. These developments make the I/O architecture increasingly important in HPC systems, and call for the I/O resources required by each job executed on a parallel computer system to be managed from the standpoint of job management.
Typical job schedulers, however, mainly manage the number of CPUs, the number of compute nodes and the memory capacity, and I/O resources are often outside the scope of job scheduling. This may cause inefficient use of I/O resources. In addition, a job scheduler is usually designed to start a job only after securing all the resources the job requires. As a result, the start of a job is delayed until all the required resources have been secured, even when the job could actually start without all of them.
Computer systems adapted to dynamic resource allocation are disclosed in Japanese Laid Open Patent Application JP-A 2007-115246 (hereinafter, patent document D1) and Japanese Laid Open Patent Application JP-A 2003-316752 (hereinafter, patent document D2). The system disclosed in patent document D1 addresses the problem of allocating a resource to an application of interest before another application has released that resource. This system, however, has the drawback that a process must wait when the required I/O resources are not available.
The system disclosed in patent document D2 changes the allocation of resources among nodes during operation, based on a judgment of whether CPUs, memories and I/O bridges can be allocated to the respective nodes. However, this system does not secure resources in accordance with the amounts of resources required by individual jobs. It also has the drawback that dynamic allocation can be performed only in units of hardware resources, for example, in units of I/O bridges. Furthermore, it does not support time-division use of a hardware resource, which would allow finer-grained resource allocation. Moreover, a process in this system must wait when the required I/O resources are not available.
An example of an I/O node control method is disclosed in Jose Moreira et al., “Designing a Highly-Scalable Operating System: The Blue Gene/L Story”, In Proceedings of IEEE/ACM Supercomputing SC'06, Tampa, November 2006 (hereinafter, Moreira). In Moreira's system, an I/O process required on a compute node is forwarded to a destination I/O node, and an I/O request is issued to a file system on that I/O node.
Moreira's system implements I/O node control as follows. As described in Section 3.3 of Moreira, a file system of interest is first made available on an I/O node. Then, when a system call requesting an I/O operation is invoked on a compute node, the corresponding I/O request is transmitted through a network to the destination I/O node. A daemon called CIOD on the I/O node then performs the requested I/O operation on the requested file system. When the I/O operation on the I/O node is completed, CIOD returns its result to the requesting compute node.
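The forwarding flow described above can be illustrated with a minimal, in-process sketch. It assumes a toy in-memory file system and replaces the network with a direct method call; the names `IORequest`, `IONodeDaemon` and `ComputeNode` are hypothetical and do not reflect the actual CIOD protocol or message format.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class IORequest:
    # Hypothetical message: the system call name and its arguments,
    # as captured on the compute node.
    op: str
    args: Tuple[Any, ...]


@dataclass
class IONodeDaemon:
    """Toy stand-in for a CIOD-like daemon on an I/O node.

    The 'file system' is an in-memory dict (path -> bytes), for illustration only.
    """
    files: Dict[str, bytearray] = field(default_factory=dict)

    def handle(self, req: IORequest) -> Any:
        # Perform the requested operation on the I/O node's file system
        # and return the result to the requesting compute node.
        if req.op == "write":
            path, data = req.args
            self.files.setdefault(path, bytearray()).extend(data)
            return len(data)
        if req.op == "read":
            path, nbytes = req.args
            return bytes(self.files.get(path, b""))[:nbytes]
        raise ValueError(f"unsupported op: {req.op}")


class ComputeNode:
    """Forwards I/O system calls to its I/O node instead of executing them locally."""

    def __init__(self, io_node: IONodeDaemon):
        # Fixed compute-node-to-I/O-node mapping (cf. psets).
        self.io_node = io_node

    def syscall(self, op: str, *args: Any) -> Any:
        # In the real system this request travels over a network;
        # in this sketch it is a direct method call.
        return self.io_node.handle(IORequest(op, args))


ion = IONodeDaemon()
cn0, cn1 = ComputeNode(ion), ComputeNode(ion)
cn0.syscall("write", "/out.dat", b"hello ")
cn1.syscall("write", "/out.dat", b"world")
print(cn0.syscall("read", "/out.dat", 100))  # b'hello world'
```

Both compute nodes are bound to the same I/O node at construction time, which mirrors the static grouping that limits how I/O resources can be shared.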
Moreira's system, however, has the drawback that the amount of allocated I/O resources is fixed, regardless of the amount of I/O resources required by the job being executed, which prevents efficient allocation of I/O resources. This stems from the system architecture, in which I/O nodes are not dynamically allocated to compute nodes; instead, one I/O node and multiple compute nodes are defined as a logical entity called a processing set, or pset, as described in Section 3.1 of Moreira.
Japanese Laid Open Patent Application JP-A 2001-195268 discloses a computer system that allocates resources based on service levels. This computer system is provided with AP service level specifying means, resource amount determination means, resource allocation means, AP execution means, resource release means and AP execution waiting means. The AP service level specifying means specifies a service level for each application program and requests execution of each application program. The resource amount determination means determines the amount of each resource to be allocated to an application program to be executed, in accordance with the service level specified for that application program by the AP service level specifying means. The resource allocation means judges whether there is room for the amount of each resource determined by the resource amount determination means and, if the resources are available, allocates the required amounts to the application program. The AP execution means executes the application program with the amount of each resource allocated by the resource allocation means. The resource release means releases the resources allocated to the application program when the application program completes. The AP execution waiting means places the application program into a waiting state if there is no room for the amount of each resource determined by the resource amount determination means.
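The allocate-or-wait cycle of this service-level scheme can be sketched as follows. The service-level table, resource names and the strict FIFO retry on release are illustrative assumptions, not details taken from JP-A 2001-195268.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

# Hypothetical mapping from service level to the amount of each resource.
LEVEL_TABLE: Dict[str, Dict[str, int]] = {
    "gold": {"cpu": 4, "mem_gb": 8},
    "silver": {"cpu": 2, "mem_gb": 4},
    "bronze": {"cpu": 1, "mem_gb": 2},
}


class ServiceLevelAllocator:
    def __init__(self, capacity: Dict[str, int]):
        self.free = dict(capacity)
        self.running: Dict[str, Dict[str, int]] = {}
        self.waiting: Deque[Tuple[str, Dict[str, int]]] = deque()

    def determine_amount(self, level: str) -> Dict[str, int]:
        # Resource amount determination: look up the amounts for the service level.
        return dict(LEVEL_TABLE[level])

    def _fits(self, need: Dict[str, int]) -> bool:
        # Is there room for every required resource amount?
        return all(self.free[r] >= n for r, n in need.items())

    def _allocate(self, app: str, need: Dict[str, int]) -> None:
        for r, n in need.items():
            self.free[r] -= n
        self.running[app] = need

    def submit(self, app: str, level: str) -> str:
        need = self.determine_amount(level)
        if self._fits(need):
            self._allocate(app, need)  # allocate and execute
            return "running"
        self.waiting.append((app, need))  # execution waiting
        return "waiting"

    def release(self, app: str) -> List[str]:
        # Release the resources of a completed application ...
        for r, n in self.running.pop(app).items():
            self.free[r] += n
        # ... then start waiting applications in FIFO order while the head fits.
        started = []
        while self.waiting and self._fits(self.waiting[0][1]):
            nxt, need = self.waiting.popleft()
            self._allocate(nxt, need)
            started.append(nxt)
        return started


alloc = ServiceLevelAllocator({"cpu": 6, "mem_gb": 12})
print(alloc.submit("A", "gold"))    # running
print(alloc.submit("B", "silver"))  # running
print(alloc.submit("C", "bronze"))  # waiting
print(alloc.release("A"))           # ['C']
```

Note that application C waits even though it needs only a small share: the scheme either grants the full determined amount or nothing.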
Japanese Laid Open Patent Application JP-A H08-147234 discloses a stream processing apparatus. This stream processing apparatus is provided with an external input/output device, a storage device, a buffer memory, a schedule generator and a stream allocator. The external input/output device receives stream input/output requests from or to external devices and interfaces the requested streams in parallel at a predetermined standard speed or at a predetermined multiple of that speed. The storage device stores the streams and provides access to the stored streams in units of blocks of a predetermined data length, at an access speed equal to N times the standard speed. The buffer memory is provided between the external input/output device and the storage device and temporarily stores the streams in units of blocks. The schedule generator determines multiple unit streams that can be supplied simultaneously at the standard speed and allocates the storage device, the input/output device and the buffer memory to the respective unit streams. When input/output of a stream is requested, the stream allocator allocates the required number of unused unit streams in accordance with the requested speed, and supplies the requested stream according to the schedule of the allocated unit streams.
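The unit-stream allocation above amounts to dividing the storage bandwidth into N slots at the standard speed and granting a stream as many unused slots as its requested speed needs. The minimal sketch below assumes integer speeds and a simple lowest-slot-first policy; `StreamAllocator` and its parameters are hypothetical and not taken from JP-A H08-147234.

```python
import math
from typing import Dict, List


class StreamAllocator:
    """Sketch assuming the storage device runs at N times the standard speed,
    so the schedule yields N unit streams (slots 0..N-1), each carrying data
    at the standard speed."""

    def __init__(self, n: int, standard_speed: int):
        self.standard_speed = standard_speed
        self.unused: List[int] = list(range(n))
        self.assigned: Dict[str, List[int]] = {}

    def allocate(self, stream_id: str, required_speed: int) -> List[int]:
        # Number of unit streams needed to reach the requested speed.
        k = math.ceil(required_speed / self.standard_speed)
        if k > len(self.unused):
            raise RuntimeError("not enough unused unit streams")
        # Take the k lowest-numbered unused slots.
        slots, self.unused = self.unused[:k], self.unused[k:]
        self.assigned[stream_id] = slots
        return slots

    def release(self, stream_id: str) -> None:
        # Return the stream's unit streams to the unused pool.
        self.unused.extend(self.assigned.pop(stream_id))
        self.unused.sort()


sa = StreamAllocator(n=8, standard_speed=2)
print(sa.allocate("s1", 4))  # [0, 1]
print(sa.allocate("s2", 5))  # [2, 3, 4]
sa.release("s1")
print(sa.allocate("s3", 6))  # [0, 1, 5]
```

Because each request is rounded up to whole unit streams, a request of 5 at a standard speed of 2 consumes three slots.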
Japanese Laid Open Patent Application JP-A H10-303932 discloses a communication apparatus including transfer means, judgment means and allocation means. The transfer means transfers communication data to a communication path. The judgment means judges the contents of the transferred communication data. The allocation means allocates the bandwidth of the communication path in accordance with the judged contents of the transferred communication data.
FIGS. 1A and 1B show examples of inefficient use of I/O resources, in which job #2, requesting an occupation ratio of 75%, is submitted while job #1 is running with an occupation ratio of 66%. In the job scheduling shown in FIG. 1A, execution of job #2 is not started until the resource required by job #2 can be secured, that is, until job #1 completes. In the job scheduling shown in FIG. 1B, on the other hand, the I/O resource is secured with an occupation ratio of 33% and job #2 is started, even though the required occupation ratio of 75% is not secured. In the example of FIG. 1A, job #2 waits until job #1 completes, although 33% of the I/O resource is available. In the example of FIG. 1B, job #2 starts immediately with the 33% occupation ratio that can be secured at its start, but the resource that becomes available after job #1 completes is not used, so job #2 continues with only 33%.
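The trade-off between FIGS. 1A and 1B can be made concrete with a small calculation. This sketch assumes that job #1's 66% is exactly 2/3 (so the free share is 1/3, about 33%), that job #1 completes at a hypothetical time `t1 = 10`, and that job #2 needs `job2_work = 3/4 * 10`, i.e. 10 time units at its requested 75%. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction as F


def fig1a(t1, job1_share, job2_share, job2_work):
    """FIG. 1A: job #2 waits until job #1 completes, then runs at its full share."""
    finish = t1 + job2_work / job2_share
    # Capacity left idle while job #2 waits: (1 - job1_share) over [0, t1].
    idle_while_waiting = (1 - job1_share) * t1
    return finish, idle_while_waiting


def fig1b(t1, job1_share, job2_work):
    """FIG. 1B: job #2 starts at once with the free share (1 - job1_share)
    and keeps that share even after job #1 completes."""
    share = 1 - job1_share
    finish = job2_work / share
    # Capacity left idle after job #1 completes, while job #2 still runs at `share`.
    idle_after_job1 = (1 - share) * max(finish - t1, 0)
    return finish, idle_after_job1


t1, job1_share, job2_share = 10, F(2, 3), F(3, 4)
job2_work = job2_share * 10  # hypothetical: 10 time units at 75%
print(fig1a(t1, job1_share, job2_share, job2_work))  # (Fraction(20, 1), Fraction(10, 3))
print(fig1b(t1, job1_share, job2_work))              # (Fraction(45, 2), Fraction(25, 3))
```

Under these assumptions, job #2 finishes at time 20 in FIG. 1A and at time 22.5 in FIG. 1B, and each schedule leaves part of the I/O resource idle, either before job #2 starts or after job #1 completes.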