Computing clusters are used to perform large processing jobs. The clusters contain multiple processing resources, which could be physical processors or virtual processors. A cluster also includes a scheduler that allocates the processing resources to jobs submitted to the cluster. The policy under which a scheduler operates impacts overall performance of the computing cluster.
Jobs may contain multiple computational tasks that can be assigned for execution to multiple processing resources. For most large “data-parallel jobs,” some tasks can be executed in parallel, while others depend on data generated by other tasks and therefore cannot be executed until those other tasks have completed. For data-parallel jobs, allocating multiple processing resources to a job may allow more tasks to be executed in parallel, thereby reducing the execution time of the job. However, for each job, execution time may not improve linearly with the number of processing resources allocated to the job: even when many tasks remain to be executed, only a limited number of them are ready to execute at any given time.
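The non-linear relationship between allocated resources and execution time can be illustrated with a minimal sketch. It assumes, purely for illustration, that every task takes one time step and that ready tasks are started greedily; the job graph and the names `makespan`, `MAPS`, and `JOB` are hypothetical and do not come from any particular scheduler.

```python
def makespan(deps, workers):
    """Return the number of time steps needed to run every task.

    deps maps each task name to the set of tasks it depends on.
    Assumption: every task takes exactly one time step.
    """
    remaining = {task: set(d) for task, d in deps.items()}
    done = set()
    steps = 0
    while remaining:
        # A task is ready only when all of its inputs have been produced.
        ready = sorted(t for t, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown task")
        running = ready[:workers]  # workers beyond len(ready) stay idle
        for task in running:
            del remaining[task]
        done.update(running)
        steps += 1
    return steps


# Hypothetical job: four independent tasks, two tasks that each need all
# four of them, and one final task that needs both of those.
MAPS = {"m0", "m1", "m2", "m3"}
JOB = {
    "m0": set(), "m1": set(), "m2": set(), "m3": set(),
    "r0": set(MAPS), "r1": set(MAPS),
    "out": {"r0", "r1"},
}

for workers in (1, 2, 4, 8):
    print(workers, makespan(JOB, workers))
# prints:
# 1 7
# 2 4
# 4 3
# 8 3
```

In this sketch, going from four to eight processing resources does not shorten the job at all, because no more than four tasks are ever ready at the same time.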
Moreover, in a computing cluster, multiple jobs may be pending for execution at one time. Allocating too many of the processing resources of the cluster to a single job may impact the performance of other jobs. Accordingly, a scheduler of a cluster may operate according to a policy that seeks to distribute processing resources in a reasonable fashion across jobs. As an example of a policy, some minimum amount of processing resources may be allocated to each job ready for execution. Any remaining resources may then be allocated to jobs as they have tasks ready to execute.
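The example policy described above can be sketched as follows. This is only one possible reading: the function name `allocate`, its parameters, and the round-robin handling of the remaining resources are assumptions made for illustration, not details of any specific scheduler.

```python
def allocate(total, ready_tasks, minimum=1):
    """Split `total` processing resources across jobs.

    ready_tasks maps each job name to its number of ready tasks.
    Every job first receives `minimum` resources; what is left over is
    handed out one unit at a time (round-robin, an assumed tie-breaking
    rule) to jobs that still have more ready tasks than resources.
    """
    jobs = list(ready_tasks)
    if minimum * len(jobs) > total:
        raise ValueError("not enough resources for the per-job minimum")
    alloc = {job: minimum for job in jobs}
    spare = total - minimum * len(jobs)
    while spare > 0:
        hungry = [j for j in jobs if ready_tasks[j] > alloc[j]]
        if not hungry:
            break  # nothing ready to use the remaining resources
        for job in hungry:
            if spare == 0:
                break
            alloc[job] += 1
            spare -= 1
    return alloc


print(allocate(10, {"a": 6, "b": 1, "c": 3}, minimum=2))
# prints: {'a': 5, 'b': 2, 'c': 3}
```

Job "b" keeps its minimum of two resources even though it has only one ready task, while the spare resources go to the jobs that can actually use them.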
Allocating processing resources to jobs can be particularly challenging when the operator of a computing cluster has committed, through service level agreements with customers who have agreed to buy computing services from the operator, to complete execution of certain jobs within a specified amount of time. Such a commitment may be regarded as a service “guarantee,” and the service level agreement may entail a significant financial penalty if a job is not completed in accordance with the guarantee.
A service guarantee may create a high-priority job for an operator of a computing cluster. Scheduling policies that account for high-priority jobs are known. In some scenarios, manual intervention is employed: the operator tracks progress as the job executes and, when progress appears too slow for the job to finish in time, adds more resources to the job. Under other policies, the scheduler is simply configured to allocate a large amount of resources to such a job, or to allocate processing resources to the high-priority job preferentially. In some approaches, a separate computing cluster, containing enough processing resources to complete the high-priority job by the guaranteed time, is dedicated to the job. Another approach is to model execution of the job to determine an amount of processing resources that seems likely to result in completion of the job before the guaranteed time, and to allocate this level of resources to the job from the outset.
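The model-based approach mentioned last can be sketched as a search for the smallest allocation whose predicted run time meets the guarantee. Everything here is hypothetical: `smallest_allocation`, the caller-supplied predictor, and the toy run-time model (a fixed serial part plus a perfectly divisible parallel part) are assumptions chosen only to make the idea concrete.

```python
def smallest_allocation(predict_runtime, deadline, max_resources):
    """Return the fewest resources whose predicted run time meets the deadline.

    predict_runtime(n) is a caller-supplied model of the job's run time
    with n resources. Returns None when no allocation up to max_resources
    is predicted to finish in time.
    """
    for n in range(1, max_resources + 1):
        if predict_runtime(n) <= deadline:
            return n
    return None


def toy_model(n):
    # Assumed model: 20 minutes that cannot be parallelized plus
    # 240 minutes of work that divides evenly across n resources.
    return 20 + 240 / n


print(smallest_allocation(toy_model, deadline=60, max_resources=100))
# prints: 6
print(smallest_allocation(toy_model, deadline=15, max_resources=100))
# prints: None
```

The second call shows a limitation of fixing the allocation from the outset: when the model says no allocation can meet the guarantee, or when the model is simply wrong, the up-front estimate gives the scheduler nothing further to act on.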