An effective framework involves distributed parallel computing, which operates to disperse processing tasks across multiple processors operating on one or more computing devices such that parallel processing may be executed simultaneously. Important implementations of large scale distributed parallel computing systems are MapReduce by Google®, Dryad by Microsoft®, and the open source Hadoop® MapReduce implementation. Google® is a registered trademark of Google Inc. Microsoft® is a registered trademark of the Microsoft Corporation in the United States, other countries, or both. Hadoop® is a registered trademark of the Apache Software Foundation.
Generally, MapReduce has emerged as a dominant paradigm for processing large datasets in parallel on compute clusters. As an open source implementation, Hadoop has become popular in a short time for its success in a variety of applications, such as social network mining, log processing, video and image analysis, search indexing, recommendation systems, etc. In many scenarios, long batch jobs and short interactive queries are submitted to the same MapReduce cluster, sharing limited common computing resources with different performance goals. It has thus been recognized that, in order to meet these imposed challenges, an efficient scheduler can be helpful if not critical in providing a desired quality of service for the MapReduce cluster.