1. Field of the Invention
The present invention relates to job scheduling for a parallel or grid computing system comprising a plurality of nodes and, more particularly, to a method for determining the priority of jobs so as to increase the overall system efficiency and a communication scheme for implementing such a priority determination method.
2. Description of the Related Art
A system in which nodes, each having one or more CPUs and a main memory, are interconnected via a network to exchange jobs and data, or operate in coordination with each other to perform large-scale calculations, is called a parallel or grid computing system. The parallel or grid computing system is an essential configuration for establishing a large-scale system where more than 1000 CPUs are used. The difference between the grid computing system and the parallel computing system lies in the system's physical extent and in its network performance and specifications. However, the present invention does not distinguish between these two systems.
When large-scale calculations are performed by simultaneously using a plurality of nodes in a parallel or grid computing system (these calculations are hereinafter referred to as “multi-node calculations”), it is important that the nodes be mapped effectively. The basic rule of mapping is that each node is occupied exclusively by a single job. Here, the term “job” refers to a part of a multi-node calculation, which is divided into a number of assignments for various nodes. Each node is dedicated to a single job because high processing performance is needed. If a single node performs multiple jobs at a time, that is, if the same node is involved in two or more multi-node calculations, calculation efficiency drops and no advantage is gained.
However, if the data originally stored by the nodes differ in nature and the data owned by a specific node is required for two or more multi-node calculations, the system may increase its overall efficiency when the node having such required data is involved in two or more multi-node calculations. If, for instance, only a certain node (hereinafter referred to as “node A”) has raw data (hereinafter referred to as “data L”) required for calculations in its main memory or on its disk and two multi-node calculations are to be simultaneously performed while repeatedly referring to data L, the following two execution methods may work.
Execution method 1: Data L owned by node A is copied to another node so that two nodes have data L. Each multi-node calculation is performed by one of these two nodes.
Execution method 2: Node A having data L is used for both of the two multi-node calculations. In this instance, node A executes two jobs while switching between them.
In a common calculation process, the raw data is huge but the data used for calculations is only a part of the raw data. At the beginning of the calculation process, however, the system does not know what part of the raw data should be used. In most cases, the data to be used is decided during the calculation process. When execution method 1 is used, therefore, the system copies data L entirely because it does not know what part of data L should be used. In this case, an extra amount of data is copied. In reality, execution method 2 is used in most cases, that is, node A is used for both of the two multi-node calculations.
When node A is engaged in two multi-node calculations in this manner, the system's overall efficiency depends on how node A executes the two jobs. That is, the key factor in increasing overall system efficiency is how the two jobs are scheduled for execution at node A (e.g., node A can alternately execute the two jobs at 1-second intervals so that both are executed equally, or execute one job for 0.5 second, then the other job for 1 second, and repeat this execution cycle to give higher priority to the latter job). Scheduling performed so as to increase the overall system efficiency is referred to as “job scheduling optimization”.
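The two example schedules for node A can be illustrated with a minimal sketch. The function names `time_slice_schedule` and `cpu_share` and the job labels are hypothetical and introduced only for illustration; the sketch assumes idealized switching with no context-switch overhead.

```python
from itertools import cycle


def time_slice_schedule(slices, total_time):
    """Build a repeating time-slice schedule for one node.

    slices: list of (job_name, slice_length_in_seconds) pairs, executed in order
            and repeated cyclically (hypothetical representation).
    Returns a list of (start, end, job_name) intervals covering [0, total_time].
    """
    if not slices or any(length <= 0 for _, length in slices):
        raise ValueError("slices must be non-empty with positive lengths")
    timeline = []
    now = 0.0
    for job, length in cycle(slices):
        if now >= total_time:
            break
        end = min(now + length, total_time)
        timeline.append((now, end, job))
        now = end
    return timeline


def cpu_share(slices):
    """Fraction of node time each job receives over one full cycle."""
    total = sum(length for _, length in slices)
    return {job: length / total for job, length in slices}


# Equal execution: alternate the two jobs at 1-second intervals.
equal = [("job1", 1.0), ("job2", 1.0)]
# Prioritized execution: 0.5 second for job1, then 1 second for job2.
prioritized = [("job1", 0.5), ("job2", 1.0)]

print(cpu_share(equal))
print(cpu_share(prioritized))
print(time_slice_schedule(prioritized, 3.0))
```

In the prioritized cycle the latter job receives two thirds of node A's time, which is the sense in which it has the higher priority.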
A means for job scheduling optimization at the beginning of a calculation process is disclosed in JP-A No. 137910/1996. More specifically, the invention described therein provides job scheduling optimization by calculating the scheduled termination time of each node at the beginning of a calculation process, causing the slowest node, that is, the node determining the multi-node calculation speed, to execute a multi-node calculation job with highest priority given to it, and permitting the other nodes to execute the other jobs without exceeding the scheduled termination time limit.
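The general idea of this kind of up-front scheduling can be sketched as follows. This is only an illustration of the description above, not the procedure of JP-A No. 137910/1996 itself; the function name `plan_priorities`, the node labels, and the estimated times are hypothetical.

```python
def plan_priorities(estimated_finish):
    """Identify the rate-determining node from up-front estimates.

    estimated_finish: dict mapping node name -> scheduled termination time
                      (seconds) of its part of the multi-node calculation,
                      as estimated before execution starts (assumed input).
    Returns (critical_node, deadline, slack), where slack[node] is the time
    the node could spend on other jobs without exceeding the deadline.
    """
    if not estimated_finish:
        raise ValueError("at least one node is required")
    # The slowest node determines the speed of the whole calculation.
    critical_node = max(estimated_finish, key=estimated_finish.get)
    deadline = estimated_finish[critical_node]
    # Every other node finishes early and may use the difference for other jobs.
    slack = {node: deadline - t for node, t in estimated_finish.items()}
    return critical_node, deadline, slack


critical, deadline, slack = plan_priorities({"A": 120.0, "B": 90.0, "C": 100.0})
print(critical, deadline, slack)
```

The sketch depends entirely on the estimates being available before execution begins.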
The invention described in JP-A No. 137910/1996 is effective for a process whose scheduled termination time can be determined at the beginning of job execution. However, it cannot effectively be used for a convergence calculation process (which is repeatedly performed until the calculation results converge) or other process whose final processing volume will be determined according to intermediate calculation results.
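A small convergence calculation shows why the processing volume cannot be fixed in advance. The sketch below uses the well-known Babylonian iteration for a square root purely as a stand-in for a convergence calculation; the function name `iterate_until_converged` and its parameters are hypothetical.

```python
import math


def iterate_until_converged(a, x0, tol=1e-12, max_iter=1000):
    """Iterate x <- (x + a / x) / 2 until successive values differ by less than tol.

    Returns (result, iterations). The number of iterations is known only
    after the loop ends, because it depends on the intermediate values.
    """
    x = x0
    for iterations in range(1, max_iter + 1):
        new_x = 0.5 * (x + a / x)
        if abs(new_x - x) < tol:
            return new_x, iterations
        x = new_x
    raise RuntimeError("did not converge within max_iter iterations")


# Same problem, different starting data: the amount of work differs.
near_value, near_iters = iterate_until_converged(2.0, 1.0)
far_value, far_iters = iterate_until_converged(2.0, 1000.0)
print(near_iters, far_iters, math.isclose(near_value, far_value))
```

Because the iteration count emerges only during execution, a termination time scheduled at the start of the job would be a guess rather than a computed value.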
Further, a processing method for exercising centralized control over calculations at individual nodes is disclosed in JP-A No. 137910/1996. However, this method is not suitable for scalable job scheduling optimization applicable to large-scale multi-node calculations involving hundreds of nodes because it may incur table access conflicts. Each node must therefore provide job scheduling optimization autonomously.