Today, networks or clusters of computers are used for all types of applications. For these clusters to perform efficiently and to be utilized to their maximum capacity, it is important not only that the jobs are scheduled efficiently, but also that the nodes are selected in an order that allows each job to finish on time. It is also desirable that the nodes are utilized optimally. If any individual computer (or processor) in the cluster fails unexpectedly, the effect on the job can be catastrophic and cascading. Similarly, if jobs are assigned without optimizing the allocation of computer resources to jobs, they might run several times longer than their usual running time. Given the speed of modern business and the importance of computer cluster job assignment, even a small delay in job execution or a short period of machine downtime can prove extremely costly. Therefore, it would be advantageous not only to optimize job scheduling so that each job finishes in the minimum possible time, but also to maximize processor utilization while minimizing downtime for the individual computers.
A generalized scheduler for a cluster of computers should support co-scheduling (simultaneously scheduling multiple jobs onto one or more computer nodes), process migration, and backfilling mechanisms, assuming there are no failures. However, through the introduction of intelligent prediction for optimum process migration and checkpointing, a number of interesting components can be included within the scheduling domain to make automatic fault prediction, job queuing, and the migration process more effective. For example: a) job migration is no longer needed to improve job performance, but may still have value as a way to move jobs away from predicted failures or to reduce temporal fragmentation (splitting a job or multiple jobs with respect to time); b) the scheduler can select nodes on an individual basis, as opposed to picking an entire partition (a partition is a set of nodes that satisfies the requirements for running the job); and c) a node with one or more running jobs is not necessarily excluded as a possible node for subsequent job submission.
Current job scheduling procedures for any type of large-scale computer cluster consider nodes based only on their availability (that is, whether they are busy processing other jobs). There is no mechanism or method for considering the rank of nodes in terms of providing the best job performance and/or node utilization. There is a need for a new method that includes node rank as a criterion when selecting the nodes to which jobs are submitted, which would significantly improve job performance as well as processor or node utilization.
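The difference between availability-only selection and rank-aware selection can be illustrated with a minimal sketch. The `Node` type, its `rank` field, and both function names are hypothetical and introduced only for illustration; in particular, how a rank would be computed is not specified here, so the sketch simply assumes a numeric score in which a higher value is better.

```python
from typing import List, NamedTuple


class Node(NamedTuple):
    name: str
    available: bool
    rank: float  # hypothetical score; assumption: higher is better


def select_by_availability(nodes: List[Node], count: int) -> List[str]:
    """Baseline: take the first `count` available nodes, ignoring rank."""
    return [n.name for n in nodes if n.available][:count]


def select_by_rank(nodes: List[Node], count: int) -> List[str]:
    """Rank-aware variant: among available nodes, prefer the highest rank."""
    available = [n for n in nodes if n.available]
    available.sort(key=lambda n: n.rank, reverse=True)
    return [n.name for n in available[:count]]


if __name__ == "__main__":
    cluster = [
        Node("a", True, 0.2),
        Node("b", False, 0.9),
        Node("c", True, 0.8),
        Node("d", True, 0.5),
    ]
    print(select_by_availability(cluster, 2))
    print(select_by_rank(cluster, 2))
```

With the same cluster state, the baseline picks whichever available nodes appear first, while the rank-aware variant skips a low-ranked node in favor of better-ranked ones.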
Current job scheduling procedures provide no knowledge about the behavior of the nodes when a job runs. Further, there is uncertainty as to whether a node will fail while a job is running, experience too many errors, or experience performance degradation. Thus, without knowledge of the behavior of the nodes, more redundant nodes must be provided to account for any such failures, errors, or performance degradation. For example, if a customer needs a specific job, such as weather forecasting, to be completed within a specified time, lack of knowledge of the behavior of the nodes forces the supplier of the nodes to provide redundant nodes to ensure that the customer's needs are satisfied. Therefore, there is a need to determine or predict the behavior of nodes to improve the overall utilization of the nodes and thereby reduce the need to provision redundant nodes.
A currently pending patent application Ser. No. 10/720,300, assigned to the same assignee as that of the instant application and which is hereby incorporated by reference, discloses a failure prediction mechanism for determining the probability of the occurrence of failure of the nodes. This determination is then used in the maintenance of the nodes, such as determining when to repair or replace a node that has a failure rate above a threshold.
Previously, failure prediction was envisioned as an algorithm or function. A known prediction mechanism accepts a node or partition and a time window, and returns a prediction (either as a Boolean or as a probability) of whether the node or partition will complete the job successfully or fail.
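The shape of such a prediction function can be sketched as follows. This is only an illustration of the interface described above (node or partition plus time window in, probability or Boolean out) and not the mechanism of the referenced application. All names are hypothetical, and the internal model is an assumption chosen for brevity: each node has a constant failure rate, and nodes in a partition fail independently.

```python
import math
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class TimeWindow:
    start: float  # seconds
    end: float    # seconds


class FailurePredictor:
    """Hypothetical predictor; per-node rates are illustrative inputs."""

    def __init__(self, failures_per_hour: Dict[str, float]):
        self.failures_per_hour = failures_per_hour

    def node_failure_probability(self, node: str, window: TimeWindow) -> float:
        # Assumption: constant-rate (exponential) failure model per node.
        hours = (window.end - window.start) / 3600.0
        rate = self.failures_per_hour.get(node, 0.0)
        return 1.0 - math.exp(-rate * hours)

    def partition_failure_probability(
        self, nodes: Iterable[str], window: TimeWindow
    ) -> float:
        # Assumption: independent nodes; the partition fails if any node fails.
        survive = 1.0
        for node in nodes:
            survive *= 1.0 - self.node_failure_probability(node, window)
        return 1.0 - survive

    def will_fail(self, node: str, window: TimeWindow, threshold: float = 0.5) -> bool:
        # Boolean form of the prediction: threshold the probability.
        return self.node_failure_probability(node, window) >= threshold
```

The Boolean form is derived from the probability by thresholding, so a caller can use whichever form suits its scheduling decision; the default threshold of 0.5 is arbitrary.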
Learning to recognize rare events is a difficult task. The difficulty may stem from several sources: few examples support the target class; events are described by categorical features that display uneven inter-arrival times; and time recordings only approximate the true arrival times, as occurs in computer-network logs, transaction logs, speech signals, etc. Therefore, there is a need for a system and method for scheduling jobs among a plurality of nodes that overcomes the above-discussed shortcomings.