Data-intensive cluster computing is increasingly used for a wide range of applications, including web-scale data mining, machine learning, and network traffic analysis. Jobs may be scheduled to execute concurrently on a shared computing cluster in which application data is stored on the compute nodes. Scheduling computations close to their data is crucial for performance, since it avoids moving large volumes of data across the network.
A scheduler assigns tasks from the running jobs to compute nodes in a computing cluster. Cluster schedulers are typically optimized for throughput and offer poor fairness guarantees, where fairness means that each job receives a fair share of the cluster resources. As a result, a job that needs very few resources may take a long time to complete if it is queued behind other long-running jobs.
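By way of illustration only, the effect described above can be sketched with a toy model. The sketch assumes a cluster that performs a fixed amount of work per second and compares a first-in, first-out queue with an idealized equal-share policy. The function names, the `capacity` parameter, and the job sizes are hypothetical and are not taken from any particular scheduler. Data locality is ignored.

```python
from typing import List


def fifo_completion_times(work: List[float], capacity: float = 1.0) -> List[float]:
    """Completion time of each job when jobs run one at a time, in arrival order.

    `work` is the amount of work per job (hypothetical units); `capacity` is the
    work the whole cluster completes per second (an assumption of this sketch).
    """
    elapsed = 0.0
    finish = []
    for w in work:
        elapsed += w / capacity  # each job waits for all jobs ahead of it
        finish.append(elapsed)
    return finish


def fair_share_completion_times(work: List[float], capacity: float = 1.0) -> List[float]:
    """Completion time of each job when capacity is split equally among
    all unfinished jobs (an idealized equal-share policy)."""
    finish = [0.0] * len(work)
    # Jobs finish in order of increasing work, because all receive equal service.
    order = sorted(range(len(work)), key=lambda i: work[i])
    elapsed = 0.0
    served = 0.0             # work each still-active job has received so far
    active = len(work)       # number of unfinished jobs
    for i in order:
        # Time for every active job to receive the extra work this job still needs.
        elapsed += (work[i] - served) * active / capacity
        served = work[i]
        finish[i] = elapsed
        active -= 1
    return finish


if __name__ == "__main__":
    # Two long-running jobs arrive first, followed by one very small job.
    jobs = [100.0, 100.0, 2.0]
    print("FIFO:      ", fifo_completion_times(jobs))
    print("Fair share:", fair_share_completion_times(jobs))
```

Under these assumptions, the small job completes at time 202 in the first-in, first-out queue but at time 6 under equal sharing, while the first long-running job completes at time 202 rather than 100. The model shows only the queueing effect and is not a description of any specific scheduler.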