1. Technical Field
The present disclosure generally relates to supporting multiple concurrent jobs, and more particularly to a flexible allocation scheme for data parallel tasks.
2. Discussion of Related Art
MapReduce is a framework for processing large datasets (e.g., terabytes of data) involving distributable problems using a large number of commodity hosts (e.g., thousands of computers or nodes). A group of hosts is referred to as a cluster. Processing can occur on data stored either in a file system (unstructured) or within a database (structured). The MapReduce framework is designed to be parallelized automatically. It can be implemented on large clusters, and it inherently scales well. Scheduling, fault tolerance and communications are all handled automatically, without direct user assistance.
MapReduce was originally aimed at large, generally periodic, production batch jobs. As such, the natural goal would be to decrease the length of time needed for a batch window. A dual goal may be to maximize the number of jobs which could be accommodated within a fixed batch window, effectively maximizing throughput. For such scenarios, a simple job scheduling scheme such as First In, First Out (FIFO) works well. FIFO was the first scheduler implemented in Google's MapReduce environment and in Hadoop, an open-source MapReduce implementation.
The use of MapReduce has evolved towards more user interaction. There are now many ad-hoc query MapReduce jobs, and these share cluster resources with the batch production work. For users who submit these queries, expecting quick results, schemes like FIFO do not work well. That is because a large job can “starve” a small, user-submitted job which arrives later. Further, if the large job was a batch submission, the exact completion time of that job might not even be regarded as particularly important.
This unfairness associated with FIFO scheduling motivated the Hadoop Fair Scheduler (HFS). To understand this, MapReduce is described in more detail below together with the goal and design of HFS itself.
MapReduce jobs include two processing phases, a Map phase and a Reduce phase. Each phase is broken into multiple independent tasks, the nature of which depends on the phase. In the Map phase, each task scans and processes (extracts information from) an equal-sized block of input data. Each block is typically replicated on disks in three separate racks of hosts (in Hadoop, for example, using the HDFS file system). The output of the Map phase is a set of key-value pairs. These intermediate results are also stored on disk. Each of the Reduce phase tasks corresponds to a partitioned subset of the keys of the intermediate results. A Reduce task includes a shuffle step, in which all relevant data from all Map phase output is transmitted across the network, a sort step, and finally a processing step, which may include transformation, aggregation, filtering and/or summarization.
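The two phases described above can be illustrated with a minimal single-process sketch. It is not how Hadoop or any production framework is implemented; the word-count workload, the function names, and the byte-sum partitioning rule are all assumptions chosen for illustration.

```python
from collections import defaultdict


def map_task(block):
    """Map task: scan one input block and emit (key, value) pairs."""
    return [(word, 1) for word in block.split()]


def partition(key, num_reducers):
    """Pick the Reduce task for a key (hypothetical, deterministic rule)."""
    return sum(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Shuffle step: route every intermediate pair to its Reduce partition."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            partitions[partition(key, num_reducers)][key].append(value)
    return partitions


def reduce_task(part):
    """Sort step followed by a processing step (here, a simple aggregation)."""
    return [(key, sum(part[key])) for key in sorted(part)]


def run_job(blocks, num_reducers=2):
    """Run all Map tasks, then shuffle, then all Reduce tasks."""
    map_outputs = [map_task(block) for block in blocks]
    results = {}
    for part in shuffle(map_outputs, num_reducers):
        results.update(reduce_task(part))
    return results
```

In a real cluster each call to `map_task` and `reduce_task` would occupy a slot on some host, and the intermediate pairs would be written to disk rather than held in memory.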
HFS can be said to include two hierarchical algorithmic layers, which will be called the allocation layer and the assignment layer.
Referring to the allocation layer, each host is assumed to be capable of simultaneously handling some maximum number of Map phase tasks and some maximum number of Reduce phase tasks. These are the number of Map slots and Reduce slots, respectively. Typically a host has two Map slots per core, and two Reduce slots per core for processing data tasks. Aggregating these slots over all the hosts in the cluster, the total number of Map slots, and similarly the total number of Reduce slots may be determined. The role of the allocation layer scheme is to partition the number of Map slots among the active Map jobs in some intelligent manner, and similarly the number of Reduce slots among the active Reduce jobs. The node that produces these allocations is known as the master. The HFS allocation layer is referred to as FAIR.
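The slot arithmetic above can be sketched as follows. The function name and the list-of-core-counts input are assumptions; the defaults of two Map slots and two Reduce slots per core follow the typical figures stated in the text.

```python
def cluster_slot_totals(cores_per_host, map_slots_per_core=2,
                        reduce_slots_per_core=2):
    """Aggregate Map and Reduce slots over all hosts in the cluster.

    cores_per_host: list with the number of cores of each host
    (a hypothetical input format). Returns (total_map, total_reduce).
    """
    total_map = sum(cores * map_slots_per_core for cores in cores_per_host)
    total_reduce = sum(cores * reduce_slots_per_core for cores in cores_per_host)
    return total_map, total_reduce
```

For example, a cluster of three hosts with 4, 8, and 8 cores has 20 cores in total, so `cluster_slot_totals([4, 8, 8])` yields 40 Map slots and 40 Reduce slots for the allocation layer to partition among the active jobs.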
Referring to the assignment layer, it is this layer that makes the actual job task assignment decisions, attempting to honor the allocation decisions made at the allocation layer to the extent possible. Host slaves report any task completions at heartbeat epochs (e.g., on the order of a few seconds). Such completions free up slots, and also incrementally affect the number of slots currently assigned to the various jobs. The current slot assignment numbers for jobs are then subtracted from the job allocation goals. This yields an effective ordering of the jobs, from most relatively under-allocated to most relatively over-allocated. For each currently unassigned slot, the HFS assignment model then finds an “appropriate” task from the most relatively under-allocated job that has one, assigns it to the slot, and performs bookkeeping. It may not find an appropriate task for a job, for example, because of rack affinity issues. In such cases, HFS relaxes fidelity to the precise dictates of the master allocation goals for a time. This is known as delay scheduling.
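The ordering and assignment steps above might be sketched as below. This is a simplified illustration, not the HFS code: the function names, the dictionary-based bookkeeping, and the `is_appropriate` predicate (standing in for rack affinity checks) are assumptions, and the waiting behavior of delay scheduling is not modeled; a job with no appropriate task is simply skipped for that slot.

```python
def order_by_deficit(allocation, assigned):
    """Order jobs from most under-allocated to most over-allocated.

    The deficit is the allocation goal minus the current slot count.
    """
    deficit = {job: allocation[job] - assigned.get(job, 0) for job in allocation}
    return sorted(deficit, key=lambda job: deficit[job], reverse=True)


def assign_free_slots(free_slots, allocation, assigned, pending, is_appropriate):
    """Give each free slot a task from the most under-allocated job that has one.

    pending: dict mapping job -> list of unassigned task ids (mutated).
    assigned: dict mapping job -> current slot count (mutated).
    is_appropriate(task, slot): hypothetical locality predicate.
    Returns a list of (slot, job, task) assignments.
    """
    assignments = []
    for slot in free_slots:
        for job in order_by_deficit(allocation, assigned):
            task = next((t for t in pending.get(job, [])
                         if is_appropriate(t, slot)), None)
            if task is not None:
                # Bookkeeping: the task leaves the queue and the job gains a slot.
                pending[job].remove(task)
                assigned[job] = assigned.get(job, 0) + 1
                assignments.append((slot, job, task))
                break
    return assignments
```

Recomputing the ordering for every slot keeps the sketch faithful to the idea that each assignment incrementally changes which job is most under-allocated.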
In the Map phase, for example, the FAIR allocation scheme is fair in the following sense: It determines, for each of J Map phase jobs j, a minimum number $m_j$ of Map slots. This minimum number is chosen so that the sum $\sum_{j=1}^{J} m_j \le S$, where S is the total number of Map slots. The minima are simply normalized if needed. This minimum number $m_j$ acts as a fairness guarantee, because FAIR will always allocate a number of Map slots $s_j \ge m_j$, thereby preventing job starvation. Slack (the difference $S - \sum_{j=1}^{J} m_j$) is allocated in FAIR according to a waterline based greedy scheme. Analogous statements hold for the Reduce phase.
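The text does not spell out the normalization rule or the waterline scheme, so the following is only one plausible reading, offered as a sketch. Assumed here: minima are scaled down proportionally (with integer flooring) when they exceed the slot total, slack is handed out one slot at a time to the job with the lowest current allocation, and a job stops receiving slack once it reaches a per-job `demand` (a hypothetical input, assumed to be at least the job's minimum). The function name is also an assumption.

```python
def fair_like_allocation(minima, demand, num_slots):
    """Allocate num_slots among jobs with minimum guarantees plus slack.

    minima: dict job -> minimum slot count m_j.
    demand: dict job -> most slots the job can use (assumed >= m_j).
    Returns dict job -> allocated slots s_j, with s_j >= normalized m_j.
    """
    jobs = list(minima)
    total_min = sum(minima.values())
    if total_min > num_slots:
        # Assumed normalization: scale minima proportionally, rounding down.
        alloc = {j: minima[j] * num_slots // total_min for j in jobs}
    else:
        alloc = dict(minima)

    slack = num_slots - sum(alloc.values())
    while slack > 0:
        # Only jobs that can still use another slot compete for slack.
        candidates = [j for j in jobs if alloc[j] < demand[j]]
        if not candidates:
            break
        # Raise the "waterline": the lowest allocation gets the next slot.
        lowest = min(candidates, key=lambda j: alloc[j])
        alloc[lowest] += 1
        slack -= 1
    return alloc
```

With minima of 2, 4, and 1 slots for three jobs and 12 slots in total, 5 slots of slack remain after the guarantees are met, and every job ends with at least its minimum regardless of how the slack is spread.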
While HFS mentions standard scheduling metrics such as throughput or average response time as a goal of the scheduler, and compares its scheduler to others in terms of these, FAIR is motivated by and oriented towards system issues. It makes no direct attempt to optimize such metrics.
Accordingly, a need exists for a flexible allocation scheme for data parallel tasks that improves the performance of a system.