The MapReduce programming model is designed to be parallelized automatically and can be implemented on large clusters of commodity hosts. Scheduling, fault tolerance and necessary communications can also be handled automatically without direct user assistance.
One issue in MapReduce environments is the design of a high-quality schedule for multiple MapReduce jobs. A simple scheduler such as First In, First Out (FIFO) is known to have starvation problems. That is, a large job can “starve” a small job that arrives even minutes later. Further, if the large job was a batch submission and the small job was an ad-hoc query, the exact completion time of the large job would not be particularly important, while the completion time of the small job would be.
This basic unfairness associated with FIFO scheduling motivated the FAIR scheduler, which is designed to be fair to jobs of various sizes. However, FAIR makes no direct attempt to optimize scheduling metrics such as maximum stretch or average response time. It is noted that schedules designed to optimize one metric will generally be quite different from those designed to optimize another.
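The starvation effect and the two metrics named above can be illustrated with a minimal sketch. It assumes a single hypothetical resource that runs one job at a time, and it takes stretch to be a job's response time divided by its processing time, a common definition that the text above does not itself spell out. The `Job` class, the `fifo_metrics` function and the job sizes are illustrative assumptions and are not taken from any actual scheduler.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    arrival: float  # arrival time, in minutes (hypothetical values)
    work: float     # processing time on the single assumed resource, in minutes


def fifo_metrics(jobs):
    """Run jobs one at a time in arrival order.

    Returns a dict mapping job name to (response_time, stretch), where
    stretch is assumed to be response_time / work.
    """
    clock = 0.0
    results = {}
    for job in sorted(jobs, key=lambda j: j.arrival):
        start = max(clock, job.arrival)   # wait until the resource is free
        clock = start + job.work          # job runs to completion
        response = clock - job.arrival
        results[job.name] = (response, response / job.work)
    return results


# A large batch job followed two minutes later by a small ad-hoc query.
jobs = [Job("batch", 0.0, 120.0), Job("adhoc", 2.0, 1.0)]
metrics = fifo_metrics(jobs)

avg_response = sum(r for r, _ in metrics.values()) / len(metrics)
max_stretch = max(s for _, s in metrics.values())
```

Under these assumed numbers, the one-minute ad-hoc query waits behind the entire batch job, so its stretch is far larger than that of the batch job even though both have similar response times. This is the sense in which a schedule that looks acceptable under one metric may look poor under another.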
The FLEX scheduler was designed to address this limitation of FAIR. A common scenario in MapReduce workloads, however, involves multiple jobs that arrive at close but distinct times and scan the same dataset or datasets. There may be many of these common datasets, each associated with one or more of the jobs. In such a scenario, most of the cost of the Map jobs can typically be traced to the scan of the data itself. This presents an opportunity for sharing scans of datasets. However, the existing approaches noted above do not amortize the costs of the scans of common data by sharing them.
Other existing approaches for amortizing the cost of scans through sharing include batching, in which an optimal batching window is determined for each dataset. However, batching gains efficiency at the expense of latency, because every scan request arriving within a batching window, except possibly the last, is delayed. Also, a larger batching window causes a longer average delay. Additionally, such approaches assume that the arrival rates of the jobs are known in advance. At best, such an assumption is a rough approximation, and it may degrade the quality of the optimization output. Further, the schedule produced by such approaches is inherently static, and therefore cannot react dynamically to changing conditions.
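The efficiency and latency tradeoff of batching can be sketched as follows. This is a simplified, hypothetical model rather than any particular batching scheme: time is divided into fixed windows of equal length for one dataset, every job arriving within a window waits until the end of that window, and one shared scan is then launched for all of them. The function name `batch_scans` and the arrival times are assumptions made for illustration only.

```python
import math


def batch_scans(arrivals, window):
    """Group job arrivals for one dataset into fixed batching windows.

    Assumed model: a job arriving at time t joins the window ending at
    ceil(t / window) * window, and one shared scan starts at that time.
    Returns (number_of_scans, average_delay_before_scan_starts).
    """
    scan_times = set()
    delays = []
    for t in arrivals:
        end = math.ceil(t / window) * window  # end of the window containing t
        scan_times.add(end)
        delays.append(end - t)                # zero only for an arrival at the window end
    return len(scan_times), sum(delays) / len(delays)


# Hypothetical arrival times (in minutes) of five jobs scanning the same dataset.
arrivals = [0.5, 2.0, 4.5, 5.0, 9.0]

small_window = batch_scans(arrivals, 2.0)   # more scans, shorter average delay
large_window = batch_scans(arrivals, 10.0)  # one scan, longer average delay
```

With these assumed arrivals, the shorter window needs three scans but delays jobs by one minute on average, while the longer window needs a single scan at the cost of a much longer average delay. Only the job arriving exactly at a window boundary is not delayed. Note that the window lengths here are fixed in advance, which reflects the static character of such schedules.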