1. Field of the Invention
The present invention relates to a distributed resource management system and, more particularly, to a distributed resource management system that performs resource reservation on a per-job basis.
The present invention also relates to a method for managing resource reservation and a computer program for defining the procedure of the method.
2. Description of the Related Art
A distributed computing system (distributed resource system) includes a variety of resources such as computers, storage units, and networks. Hence, to obtain the quality of service (QoS) necessary for a specific job, a resource management scheme is needed that guarantees quality of service across a plurality of resources. In particular, in a grid computing scheme where a plurality of sub-systems (hereinafter, each called a “domain”) managed independently of each other share resources, integral management of the resources of the entire system is often not possible. This kind of system needs a QoS guarantee function that uses a plurality of independent resource management units.
Advance reservation of resources is a known conventional technology for guaranteeing QoS in a distributed resource system (hereinafter, “reservation” refers to advance reservation unless stated otherwise). The term “advance reservation” refers to an operation that guarantees the QoS of individual resources necessary for the execution of a job during a given time period.
With reference to FIGS. 1A and 1B, an example of the advance reservation of resources will be described. Here, a process is considered wherein the data input from a storage unit B is analyzed by a computer A and the result is stored in a storage unit C (refer to FIG. 1A).
To perform the series of processes without delay, the resources to be used, that is, the computer A, the storage unit B, and the storage unit C, are reserved in advance. The reservation for the data input process in the computer A and the reservation for the data read-out process in the storage unit B should be made for the same time period (see FIG. 1B). The same applies to the data output process in the computer A and the data write-in process in the storage unit C.
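The time alignment of FIG. 1B can be sketched as follows. This is only an illustration under stated assumptions: the `Reservation` record, the `plan_pipeline` helper, and the hour-valued times are hypothetical and do not come from any system described here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Reservation:
    """Hypothetical record: one resource held for the interval [start, end)."""
    resource: str
    operation: str
    start: int
    end: int


def plan_pipeline(t0: int, read_len: int, analyze_len: int, write_len: int) -> List[Reservation]:
    """Build reservations for: read from storage B -> analyze on A -> write to storage C."""
    t1 = t0 + read_len       # end of the read/input phase
    t2 = t1 + analyze_len    # end of the analysis phase
    t3 = t2 + write_len      # end of the output/write phase
    return [
        Reservation("storage unit B", "read-out", t0, t1),
        Reservation("computer A", "input", t0, t1),     # same period as the read-out
        Reservation("computer A", "analysis", t1, t2),
        Reservation("computer A", "output", t2, t3),
        Reservation("storage unit C", "write-in", t2, t3),  # same period as the output
    ]


def same_period(a: Reservation, b: Reservation) -> bool:
    """True when two reservations cover exactly the same time period."""
    return (a.start, a.end) == (b.start, b.end)


plan = plan_pipeline(t0=9, read_len=1, analyze_len=2, write_len=1)
```

In this sketch, the paired reservations share their start and end times by construction, which is the property that FIG. 1B requires.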
An example of a conventional resource management system having an advance reservation function is known (refer to, for example, Japanese Patent Laid-Open Publication No. 2000-259537 (p. 5, FIG. 1)). As shown in FIG. 2, this conventional resource management system includes a plurality of user terminals 400, a plurality of resources 300, a plurality of connection control apparatuses 700, and a resource management unit 800. A conventional resource reservation system having such a configuration operates as follows:
(1) A user terminal 400 issues a reservation request via one of the plurality of connection control apparatuses 700 to the resource management unit 800.
(2) As a result of the reservation request of the above step (1), a reservation certificate is passed to the user terminal 400.
(3) When the time for which the reservation has been made arrives, the user terminal 400 issues a resource-use request, including the reservation certificate issued in the step (2), to the reserved resource 300.
In the system described in Patent Publication 2000-259537, the single resource management unit 800 integrally manages advance reservations of the resources. Therefore, the system described in the above patent publication cannot be applied to a distributed resource system wherein a plurality of resource management units manage the resources.
Another resource management technology is known which is based on the premise that the resources are managed in a distributed manner (refer to, for example, a first non-patent literature: Ian Foster et al., “A Distributed Resource Management Architecture that Supports Advance Reservations and Co-Allocation”, International Workshop on Quality of Service 99, 1999).
As shown in FIG. 3, the resource reservation system described in the above literature includes a plurality of user terminals 400, a plurality of resources 300, a plurality of job schedulers 500 (each referred to as “co-reservation agent” in the literature), and a resource information service 600. Each resource 300 has a resource management section incorporated therein. A conventional resource reservation system having such a configuration operates as follows:
(1) A user terminal 400 issues a reservation request to one of the plurality of job schedulers 500.
(2) The job scheduler 500 queries the resource information service 600 to find out the state of reservations of the resources.
(3) The job scheduler 500 determines a resource to be reserved based on the reservation state found in the above step (2) and issues a resource reservation request to the resource management section of the resource to be reserved. When there are a plurality of resources to be reserved, a resource reservation request is issued to the resource management section of each of the resources 300 likewise.
(4) As a result of the resource reservation request in the above step (3), a reservation ID for the reservation is issued and passed to the user terminal 400 of the above step (1).
(5) The user terminal 400 of the above step (1) issues a resource-use request including a corresponding reservation ID obtained in the above step (4) to the resource management section of each resource 300 reserved in the above step (3).
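Steps (1) to (5) above can be sketched as a small in-memory model. This is a minimal sketch, not the cited architecture's implementation: the class names `ResourceManager`, `InformationService`, and `JobScheduler`, their methods, and the integer time slots are all assumptions made for illustration.

```python
import itertools
from typing import Dict, List, Optional, Tuple

Slot = Tuple[int, int]  # half-open interval [start, end)


def overlaps(a: Slot, b: Slot) -> bool:
    return a[0] < b[1] and b[0] < a[1]


class ResourceManager:
    """Hypothetical resource management section; it owns its own reservation table."""
    _ids = itertools.count(1)

    def __init__(self, name: str) -> None:
        self.name = name
        self.reserved: Dict[int, Slot] = {}

    def reserve(self, slot: Slot) -> Optional[int]:
        """Return a reservation ID, or None if the slot conflicts with an existing one."""
        if any(overlaps(slot, held) for held in self.reserved.values()):
            return None
        rid = next(ResourceManager._ids)
        self.reserved[rid] = slot
        return rid

    def use(self, rid: Optional[int]) -> bool:
        """Step (5): accept a resource-use request only with a known reservation ID."""
        return rid in self.reserved


class InformationService:
    """Hypothetical directory holding a snapshot of reservation states."""

    def __init__(self) -> None:
        self.snapshot: Dict[str, List[Slot]] = {}

    def publish(self, managers: List[ResourceManager]) -> None:
        self.snapshot = {m.name: list(m.reserved.values()) for m in managers}

    def free_at(self, slot: Slot) -> List[str]:
        return [name for name, held in self.snapshot.items()
                if not any(overlaps(slot, h) for h in held)]


class JobScheduler:
    def __init__(self, info: InformationService, managers: List[ResourceManager]) -> None:
        self.info = info
        self.managers = {m.name: m for m in managers}

    def request(self, slot: Slot, count: int) -> Dict[str, Optional[int]]:
        chosen = self.info.free_at(slot)[:count]               # step (2): query the service
        return {name: self.managers[name].reserve(slot)        # step (3): reserve each resource
                for name in chosen}                            # step (4): IDs go back to the terminal


managers = [ResourceManager("r1"), ResourceManager("r2")]
info = InformationService()
info.publish(managers)
scheduler = JobScheduler(info, managers)

ids = scheduler.request((10, 12), count=2)                     # step (1): terminal's request
used = all(scheduler.managers[name].use(rid) for name, rid in ids.items())  # step (5)
```

Note that the scheduler in this sketch decides from the information service's snapshot, while the authoritative reservation table stays inside each resource management section.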
There are known architectures where a plurality of client terminals reserve shared resources. In such an architecture, the client terminals send reservation entries, each including a start time, a time period, and a repetition time sequence, as resource requests to an AV/C bulletin board. The resource requests are organized into a resource calendar, which is used to notify the resources and the users of any reservation conflicts that arise. The system includes a resource schedule controller that allocates the above resources to the client terminals (refer to, for example, Japanese Patent Laid-Open Publication No. 2003-500961 (pp. 14-17, FIG. 1)).
Furthermore, there are known systems wherein tentative reservations of resources are made. In such a system, a first terminal issues a request message for a resource reservation. On receiving the request message, each node apparatus determines whether the reservation is possible. If it is possible, the node apparatus makes a tentative reservation and forwards the request message to the next apparatus; if it is not possible, the node apparatus issues a response message to the previous apparatus indicating that the reservation was denied. On receiving the request message, a second terminal determines whether it can respond to the data communication and issues a response message to the previous apparatus indicating whether the reservation is possible. On receiving a response message indicating that the reservation is possible, each node apparatus changes the tentative reservation to a real reservation and issues a response message to the previous apparatus indicating that the reservation is possible (refer to, for example, Japanese Patent Laid-Open Publication No. 2002-185491 (pp. 3-5, FIG. 1)).
It is to be noted that the job scheduling described later follows a known algorithm (refer to, for example, second non-patent literature: “Heuristic Algorithms for Scheduling Independent Tasks on Nonidentical Processors”, The Journal of the ACM, Volume 24 Issue 2, 1977). The description thereof is incorporated herein by reference.
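The hop-by-hop tentative reservation described above can be sketched as a forward pass followed by a backward pass. This is a simplified, single-process sketch: the `Node` class, the `reserve_path` function, and the capacity model are hypothetical, and releasing tentative reservations on denial is an assumption, since the description above does not state what happens to them.

```python
from typing import List


class Node:
    """Hypothetical node apparatus with a fixed amount of free capacity."""

    def __init__(self, name: str, free: int) -> None:
        self.name = name
        self.free = free
        self.state = "none"  # "none" | "tentative" | "reserved"

    def tentative(self, amount: int) -> bool:
        """Make a tentative reservation if enough capacity is free."""
        if self.free < amount:
            return False
        self.free -= amount
        self.state = "tentative"
        return True

    def confirm(self) -> None:
        """Change the tentative reservation to a real reservation."""
        self.state = "reserved"

    def release(self, amount: int) -> None:
        """Assumption: a denied request frees the tentatively held capacity."""
        self.free += amount
        self.state = "none"


def reserve_path(nodes: List[Node], amount: int, second_terminal_accepts: bool) -> bool:
    held: List[Node] = []
    # Forward pass: the request message travels from the first terminal, node by node.
    for node in nodes:
        if not node.tentative(amount):
            accepted = False          # this node denies; the request goes no further
            break
        held.append(node)
    else:
        accepted = second_terminal_accepts  # the request reached the second terminal
    # Backward pass: the response message travels back through the nodes already visited.
    for node in reversed(held):
        if accepted:
            node.confirm()
        else:
            node.release(amount)
    return accepted


path = [Node("n1", 5), Node("n2", 5), Node("n3", 5)]
accepted = reserve_path(path, amount=3, second_terminal_accepts=True)
```

When every node and the second terminal accept, each node ends in the `reserved` state; a denial anywhere along the path leaves no node holding capacity in this sketch.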
In a distributed resource system having a plurality of resource management units, there is the problem that the usability of the resources decreases when multiple jobs are executed at once. The reason for that will be detailed below.
There are the following two kinds of methods for executing a plurality of jobs by using advance reservations of resources:
(i) In the first method, advance reservations are first made for the resources that the jobs may possibly use; the resources to which the jobs are to be assigned, and the time periods, are then selected from among the resources successfully reserved.
(ii) In the second method, the resources to which the jobs are to be assigned, and the time periods, are determined first; advance reservations of the determined individual resources are then made for those time periods.
In the above method (i), resources are reserved for time intervals during which the jobs are not actually executed, and thus the usability of the resources decreases compared with the above method (ii). Moreover, there is the problem that, when a failure occurs in the user terminal that is to issue use-requests to the reserved resources, no one can use the reserved resources for a long time interval.
On the other hand, in the above method (ii), resources that are about to be reserved may already have been reserved by another user. In this case, a reservation failure occurs, and thus an appropriate combination of resources may not be available.
A reservation failure occurs when the job scheduler (500 in FIG. 3) that determines the assignment of jobs does not hold the latest resource-reservation state. The resource-reservation state information can be obtained by querying the resource information service (600 in FIG. 3); however, the information from the resource information service does not necessarily reflect the latest state. Even if the reservation states of individual resources can be queried directly, querying the states of multiple resources takes a long time, during which the states may change. Furthermore, the reservation states may change during the job scheduling that determines the assignment of jobs to the resources, and during execution of the reservations.
The failure of a reservation becomes a problem for a job that uses a plurality of resources simultaneously and for a plurality of jobs that have dependencies between them. For those jobs, when reservations of some of the resources fail, rescheduling must be performed to reserve the remaining resources. In this case, the combination of the resources successfully reserved earlier and the resources successfully reserved later may be less appropriate than a combination that would have been possible at an earlier stage.
The decrease in the usability of resources due to the failure of a reservation will be described below with reference to FIG. 4. In this example, two computer clusters A, B are connected via a wide area network (WAN). Each cluster consists of eight nodes. Consider a job scheduler in this system that makes reservations of resources for a parallel job-1 using four nodes and a parallel job-2 using eight nodes.
The reservation state that the job scheduler grasps is that no reservation exists in either cluster. In this situation, a possible job assignment by the job scheduler is that job-1 is assigned to four nodes of cluster A and job-2 is assigned to eight nodes of cluster B (FIG. 4(A)). Meanwhile, it is assumed that another job-3 has already been assigned to four nodes of cluster B (FIG. 4(B)). In this case, all reservations for job-1 are successful, whereas for job-2 only the reservations of four nodes of cluster B are successful. It is assumed here that the job scheduler reschedules the part of job-2 for which the reservation failed and that, as a result, four nodes of cluster A are selected (FIG. 4(C)). As a result, job-2 is assigned over clusters A and B. If the reservation state shown in FIG. 4(B) had been known in advance, job-2 could have been assigned to cluster A only, and there would have been no need for communication through the WAN. That is, the usability of the network resources decreases due to the failure of proper reservations.
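The FIG. 4 scenario can be reproduced with a small simulation. This is a sketch under stated assumptions: the best-fit placement policy in `plan` and the rescheduling order in `reserve` are hypothetical choices made so that the example is deterministic, not the algorithm of any scheduler described here.

```python
from typing import Dict, List, Tuple

Jobs = List[Tuple[str, int]]  # (job name, number of nodes needed)


def plan(view: Dict[str, int], jobs: Jobs) -> Dict[str, str]:
    """Assumed policy: place each job whole on the believed-free cluster that fits
    it most tightly, breaking ties by cluster name."""
    free = dict(view)
    placement = {}
    for job, need in jobs:
        fits = [c for c in sorted(free) if free[c] >= need]
        best = min(fits, key=lambda c: free[c])
        free[best] -= need
        placement[job] = best
    return placement


def reserve(actual: Dict[str, int], placement: Dict[str, str], jobs: Jobs) -> Dict[str, Dict[str, int]]:
    """Reserve against the real state; nodes that cannot be reserved on the planned
    cluster are rescheduled onto any cluster that still has free nodes."""
    free = dict(actual)
    result: Dict[str, Dict[str, int]] = {}
    for job, need in jobs:
        target = placement[job]
        got = min(need, free[target])
        free[target] -= got
        result[job] = {target: got} if got else {}
        missing = need - got
        for c in sorted(free):  # rescheduling of the part that failed
            if missing == 0:
                break
            extra = min(missing, free[c])
            if extra:
                free[c] -= extra
                result[job][c] = result[job].get(c, 0) + extra
                missing -= extra
    return result


jobs = [("job-1", 4), ("job-2", 8)]
actual = {"A": 8, "B": 4}       # job-3 already holds four nodes of cluster B
stale_view = {"A": 8, "B": 8}   # what the scheduler believes

with_stale_view = reserve(actual, plan(stale_view, jobs), jobs)
with_true_view = reserve(actual, plan(actual, jobs), jobs)
```

With the stale view, job-2 ends up split across both clusters, whereas planning from the true state keeps job-2 entirely inside cluster A and avoids WAN traffic.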
Although the above example is the case where the usability of network resources decreases, there may be a case where the usability of another kind of resources decreases. For example, a specific parallel job, to which computers having an equal performance are preferably allocated, may have computers unequal in performance allocated thereto.
Furthermore, not only for a parallel job but also for a plurality of jobs having interdependencies, the usability of resources may decrease. For example, in the case where the output of a job is an input of another job, these jobs are preferably assigned to the same resource or resources connected via the same local area network (LAN). However, when a reservation fails for one job, a resource which can be allocated to the job having failed in the reservation may be disposed on a LAN different from the LAN where a successfully reserved resource exists.
In short, because the reservation state of resources cannot be grasped exactly, a job that simultaneously uses a plurality of resources, or a plurality of jobs having interdependencies, may be allocated an inefficient combination of resources instead of a combination that was inherently possible.