In a utility computing or cloud computing model, businesses and users are able to access application services from any location on demand and without regard to where the services are actually hosted. This provisioning of computing services is typically supported by disparately located data centers containing ensembles of networked Virtual Machines. Cloud computing delivers infrastructure, platform and software as services, which may be made available as subscription-based services wherein payment is dependent upon actual usage. Multiple types of services are encompassed within cloud computing implementations, including Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). Cloud computing application services include social networking and gaming portals, business applications, media content delivery and scientific workflows. In these instances, the amount of data can be significant, often ranging from terabytes to petabytes. As user demands are unpredictable, and data may be located across disparate nodes in the cloud infrastructure, load balancing and scheduling in this distributed environment must be accomplished dynamically and in real time.
The most prevalent distributed computing framework is MapReduce, originally designed by Google, Inc. to exploit large clusters to perform parallel computations. The MapReduce framework is used to support distributed computing on large data sets on clusters of computers, or nodes. The framework is composed of an execution runtime and a distributed file system, the Google File System (GFS). The runtime and the distributed file system provide a level of fault tolerance and reliability which is critical in a large-scale data environment. As is appreciated by those skilled in the art, there are two steps in a MapReduce framework: map and reduce. During the map step, a master node takes the input, divides it into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes its smaller problem and passes the answer back to its master node. During the reduce step, the master node takes the answers to all the sub-problems and combines them to produce the output: the answer to the problem it was originally trying to solve.
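The map and reduce steps described above can be illustrated with a minimal single-process sketch. It is not the Google or Hadoop implementation; the word-count problem and the names `map_step`, `reduce_step`, and `run_job` are assumptions chosen for illustration, and the "worker nodes" are simply function calls.

```python
from collections import defaultdict


def map_step(chunk):
    """Worker: emit a (word, 1) pair for each word in one sub-problem."""
    return [(word, 1) for word in chunk.split()]


def reduce_step(pairs):
    """Master: combine the workers' answers into the final output."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def run_job(text, num_workers=3):
    # The master divides the input into smaller sub-problems (groups of lines).
    lines = text.splitlines()
    chunks = [" ".join(lines[i::num_workers]) for i in range(num_workers)]
    # Each chunk would be sent to a worker node; here the workers run in-process.
    answers = [pair for chunk in chunks for pair in map_step(chunk)]
    # The master combines the answers to all sub-problems.
    return reduce_step(answers)
```

For example, `run_job("a b\nb c\nc c")` counts one occurrence of `a`, two of `b`, and three of `c`, regardless of how the lines were distributed among the workers.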
There are various runtime implementations of the MapReduce framework; for example, the Apache™ Hadoop™ project. Hadoop™ is an open source MapReduce runtime provided by the Apache Software Foundation. It uses the Hadoop Distributed File System (HDFS) as shared storage, enabling data to be shared among distributed processes using files. Briefly, the HDFS implementation has a master/slave (or master/worker) architecture, wherein a master process (“NameNode”) manages the global name space and controls operations on data and files. A slave process (“DataNode”) performs operations on data blocks stored locally upon instruction from the NameNode. More specifically, the Hadoop™ runtime consists of two processes: “JobTracker” and “TaskTracker”. JobTracker is a single instance process which partitions the input data (“job”) into subsets (“tasks”) as defined by the programmer. After the job has been split, JobTracker populates a local task queue based on the number of splits and distributes the tasks to TaskTrackers for computation. If a TaskTracker becomes idle, the JobTracker picks a new task from its queue for that TaskTracker to process. Thus, the granularity of the splits has an immediate and considerable influence on the balancing capability of the scheduler: the greater the number of tasks, and the greater the variance in their sizes, the more complex the balancing problem becomes. Another consideration is the location of the data blocks, as the JobTracker tries to minimize the number of remote blocks accessed by each TaskTracker.
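The effect of split granularity on load balancing can be seen in a small simulation of the queue-based dispatch described above. This is a sketch under simplifying assumptions, not Hadoop™ code: the function name `makespan` is hypothetical, task durations are assumed to be known, and data locality is ignored.

```python
import heapq
from collections import deque


def makespan(task_durations, num_trackers):
    """Time at which the last task finishes when each idle tracker
    is handed the next task from a single FIFO queue."""
    queue = deque(task_durations)
    # Heap of (time at which the tracker becomes idle, tracker id).
    trackers = [(0, t) for t in range(num_trackers)]
    heapq.heapify(trackers)
    finish = 0
    while queue:
        # The tracker that becomes idle first receives the next task.
        idle_at, tracker = heapq.heappop(trackers)
        done_at = idle_at + queue.popleft()
        finish = max(finish, done_at)
        heapq.heappush(trackers, (done_at, tracker))
    return finish
```

With twelve units of work on three trackers, two coarse splits of six units each leave one tracker idle and finish at time 6, whereas six splits of two units each keep all trackers busy and finish at time 4.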
In this framework, the runtime is responsible for assigning and dispatching tasks to worker nodes and ensuring their completion. As is commonplace in the cloud computing field, submitted jobs may have significantly varying priorities and dependencies, e.g., low priority tasks requiring hours for completion, or interactive tasks requiring input from a second task execution. Task selection/scheduling of slave nodes directly impacts job performance and overall Quality of Service (QoS) of the system. Accordingly, it is to be appreciated that scheduling algorithms play a critical role in providing increased QoS in the cloud computing environment.
Several methods are well known in the art to provide scheduling of tasks. For example, “Fair Scheduler” provides a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. Briefly, when there is a single job running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time. Fair sharing may also work with job priorities; the priorities are used as weights to determine the fraction of total compute time that each job gets. Jobs in the Fair Scheduler are organized into pools, wherein resources are divided fairly between the pools and a minimum share size may be assigned to specified pools. Another method known in the art is the “Capacity Scheduler”, which provides support for multiple job queues and guarantees a fraction of the capacity of the cluster to each queue. In this implementation, free resources can be allocated to any queue beyond its guaranteed capacity, but these excess allocated resources may be reclaimed and made available to another queue to ensure that all queues receive their capacity guarantee. This implementation further provides rules for managing greedy processes and for giving priority jobs first access to a queue's resources. Neither “Fair Scheduler” nor “Capacity Scheduler” takes into account the locality of the nodes, the local availability of relevant data on considered nodes, or the suitability of the node for the particular job.
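The two allocation policies described above can be approximated as follows. This is a simplified sketch, not the Hadoop™ scheduler code: the function names, the use of integer percentages for guarantees, and the rule of lending spare slots in queue order are all assumptions made for illustration.

```python
def fair_shares(total_slots, weights):
    """Weighted fair sharing: each job's share of the slots is
    proportional to its priority weight."""
    total_weight = sum(weights.values())
    return {job: total_slots * w / total_weight for job, w in weights.items()}


def capacity_allocation(total_slots, guarantee_pct, demands):
    """Capacity-style allocation: each queue first receives up to its
    guaranteed percentage; spare slots are then lent to queues that
    still have unmet demand (in queue order, an assumed tie-break)."""
    alloc = {
        q: min(demands[q], total_slots * pct // 100)
        for q, pct in guarantee_pct.items()
    }
    spare = total_slots - sum(alloc.values())
    for q in guarantee_pct:
        extra = min(spare, demands[q] - alloc[q])
        alloc[q] += extra
        spare -= extra
    return alloc
```

With 100 slots and guarantees of 70% and 30%, a lightly loaded first queue (demand 20) lets the second queue (demand 90) borrow up to 80 slots. Recomputing the allocation once the first queue's demand rises to 70 returns the second queue to its guaranteed 30, which corresponds to the reclaiming of excess resources.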
Alternative scheduling methods have been proposed by those skilled in the art. For example, M. Zaharia, et al., “Job Scheduling for Multi-User MapReduce Clusters,” Technical Report UCB/EECS-2009-55, University of California at Berkeley, propose two algorithms for the improvement of the Fair Scheduler: Delay Scheduling and Copy-Compute Splitting. Delay Scheduling attempts to achieve efficiency in MapReduce operations by running tasks on the nodes that contain their input. When a node requests a task and the head-of-the-line job cannot launch a local task on that node, the job is skipped and subsequent jobs are considered. In this method, if a job has been skipped for a specified length of time, it may be launched as a non-local task in order to avoid starving the job. Copy-Compute Splitting attempts to address the problem of slot hoarding in large jobs, which arises from the interdependence between reduce and map tasks: a reduce operation begins copying map outputs while the remaining maps are still running. However, in a large job having tens of thousands of map tasks, the map phase may take a long time to complete, and the reduce slots remain occupied throughout. Moreover, at any time a reduce slot is either using the network to copy map outputs or using the CPU to apply the reduce function, but not both. The Copy-Compute Splitting method splits reduce tasks into two logically distinct types of tasks, copy tasks and compute tasks, wherein the compute tasks are managed by an admission control system that limits the number of reducers computing at any time.
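The Delay Scheduling rule described above may be sketched as follows. This is an illustrative approximation rather than the algorithm as published: the job representation (a dictionary with hypothetical `pending` and `skipped_since` fields), the function name `delay_schedule`, and the single `max_wait` threshold are assumptions.

```python
def delay_schedule(node, jobs, now, max_wait):
    """Choose a task for `node`, which has just requested work.

    `jobs` is ordered with the head-of-the-line job first. Each job is a
    dict with "name", "pending" (task id -> set of nodes holding the
    task's input) and "skipped_since" (time of first skip, or None).
    """
    for job in jobs:
        if not job["pending"]:
            continue
        local = [t for t, nodes in job["pending"].items() if node in nodes]
        if local:
            # A task whose input is on this node: launch it locally.
            task = local[0]
            del job["pending"][task]
            job["skipped_since"] = None
            return job["name"], task, "local"
        if job["skipped_since"] is None:
            # First skip: start the wait clock and consider the next job.
            job["skipped_since"] = now
        elif now - job["skipped_since"] >= max_wait:
            # Skipped for long enough: launch non-locally to avoid starvation.
            task = next(iter(job["pending"]))
            del job["pending"][task]
            return job["name"], task, "non-local"
    return None  # no job can launch a task on this node right now
```

In this sketch, a job whose input resides only on other nodes is passed over in favor of a later job with local data, and is launched as a non-local task once it has waited at least `max_wait` time units.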
Also in the art, J. Polo et al., “Performance-driven task co-scheduling for MapReduce environments,” IEEE/IFIP Network Operations and Management Symposium, 2010, propose a dynamic scheduler that estimates the completion time for each MapReduce job in the system. The scheduler takes advantage of the fact that each MapReduce job is composed of a large number of tasks (maps and reduces) known in advance during the job initialization process (when the input data is split), and that the progress of the job can be observed at runtime. The scheduler takes each submitted and not yet completed job and monitors the average task length for already completed tasks. Based on these estimates, the scheduler is able to dynamically adapt the number of task slots allocated to each job. Another technique is provided by I. Stoica et al., “On the duality between resource reservation and proportional share resource allocation,” In Proc. of Multimedia Computing and Networking, 2007, proposing a scheduler that characterizes jobs both in terms of their weight, as is commonly used in proportional share allocation, and in terms of their share, as is commonly used in resource reservation methods, as opposed to either parameter individually.
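The completion-time estimate described above can be illustrated with a short sketch. It is not the scheduler of Polo et al.; the function names, the assumption that remaining tasks take the observed average time, and the use of a target time (`time_left`) to size the slot allocation are simplifications introduced here for illustration.

```python
import math


def estimate_completion(total_tasks, completed_durations, slots):
    """Estimated time to finish a job: remaining tasks times the average
    length of its completed tasks, spread over its task slots."""
    avg = sum(completed_durations) / len(completed_durations)
    remaining = total_tasks - len(completed_durations)
    return remaining * avg / slots


def slots_needed(total_tasks, completed_durations, time_left):
    """Smallest number of slots for which the estimate fits in `time_left`."""
    avg = sum(completed_durations) / len(completed_durations)
    remaining = total_tasks - len(completed_durations)
    return math.ceil(remaining * avg / time_left)
```

For a job of 100 tasks in which three tasks have completed in 10, 20, and 30 time units, the average task length is 20 and 97 tasks remain, so ten slots give an estimate of 194 time units, and finishing within 100 time units would call for 20 slots.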