With the exponential growth of data, more and more companies rely on MapReduce frameworks such as Hadoop for their data processing needs. Estimating the performance of MapReduce jobs in advance yields several benefits, such as designing better scheduling policies, tuning Hadoop parameters, and optimizing the performance of MapReduce applications. MapReduce is a programming model and software framework introduced by Google in 2004 to support distributed computing and large-scale data processing on clusters of commodity machines. Predicting the performance of MapReduce applications is a challenging research problem because it requires predicting the performance of individual MapReduce jobs, each of which is composed of several sub-phases. Many tunable parameters can affect MapReduce job performance. Moreover, data skew and the networking issues inherent to distributed systems make the prediction more difficult.
There are various state-of-the-art techniques for performance prediction and resource scheduling of MapReduce applications. They rely on approaches such as modeling, benchmarking, and statistical analysis to characterize the performance of MapReduce applications. For example, Starfish applies dynamic Java instrumentation to collect run-time monitoring information about job execution at a fine-grained level. Such detailed job profiling enables the authors to predict job execution under different Hadoop configuration parameters and to automatically derive an optimized cluster configuration. However, this detailed profiling incurs overhead.
Tarazu incorporates application and cluster characteristics using online measurement and performs predictive load balancing of Reduce computations. However, this model is specifically designed to account for the heterogeneity of clusters and to provide optimal performance in such clusters, and thus does not address the long-felt need to predict job execution accurately and efficiently.
Another tool proposes a benchmarking approach that derives a performance model of Hadoop's generic execution phases once and reuses it to predict the performance of different applications. Experiments conducted using this tool show that the execution of each map (reduce) task consists of specific, well-defined data processing phases. Only the map and reduce functions are custom, and their execution is user-defined for each MapReduce job. The execution of the remaining phases is generic and depends on the amount of data processed by the phase and on the performance of the underlying Hadoop cluster. The authors perform two separate levels of profiling, to measure the generic phases and the map/reduce functions respectively. To measure the generic phases, they design a set of parameterizable synthetic microbenchmarks. To characterize the execution times of the generic phases, they run the microbenchmarks on a Hadoop cluster while varying the parameters of the MapReduce jobs. The resulting job profiles capture the inherent application properties, which are used to compute a lower bound and an upper bound on response time by applying the analytical model. In this approach, the execution time of each phase depends primarily on the amount of data processed.
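The two steps described above can be illustrated with a minimal Python sketch. It is not the tool's actual model, which is not specified here. It assumes that a generic phase's duration is fitted as a straight line against the data it processes, and that the response-time bounds take the well-known form for greedy assignment of n tasks to k slots: a lower bound of n * avg / k and an upper bound of (n - 1) * avg / k + max. The function names `fit_phase_model` and `makespan_bounds` and the sample numbers are hypothetical.

```python
from typing import Sequence, Tuple


def fit_phase_model(samples: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares fit of phase duration (s) against data processed (MB).

    `samples` holds (data_mb, seconds) pairs, e.g. from microbenchmark runs.
    Returns (slope, intercept). The linear form is an assumption of this sketch.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_x = sum(x for x, _ in samples) / len(samples)
    mean_y = sum(y for _, y in samples) / len(samples)
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must cover at least two distinct data sizes")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def makespan_bounds(task_durations: Sequence[float], slots: int) -> Tuple[float, float]:
    """Lower and upper bounds on the completion time of a set of tasks
    assigned greedily to `slots` parallel slots (assumed bound form)."""
    if slots <= 0:
        raise ValueError("slots must be positive")
    if not task_durations:
        return 0.0, 0.0
    n = len(task_durations)
    avg = sum(task_durations) / n
    lower = n * avg / slots
    upper = (n - 1) * avg / slots + max(task_durations)
    return lower, upper


if __name__ == "__main__":
    # Hypothetical microbenchmark measurements for one generic phase.
    slope, intercept = fit_phase_model([(64, 2.0), (128, 4.0), (256, 8.0)])
    print(f"phase model: {slope:.5f} s/MB + {intercept:.2f} s")
    # Four map tasks of 10 s each on two map slots.
    print(makespan_bounds([10.0, 10.0, 10.0, 10.0], slots=2))
```

Note that this kind of model uses the amount of data as its only input, which is the limitation discussed in the following paragraph.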
It is also observed that, for the same input data, the Map/Reduce processing time changes as the number of concurrent map tasks per node increases. This is because of a resource (disk) bottleneck at the I/O level: multiple map tasks wait to access data from a shared disk, and there is contention for I/O at the disk level, as shown in FIG. 1. For example, in the read phase, disk contention increases with the number of map waves, which is proportional to both the data size in the read phase and the total number of map tasks on the disk. FIG. 2 is a graphical representation that illustrates the impact of increasing the number of concurrent map tasks per node on the processing time of the read phase. As can be seen from FIGS. 1 through 2, existing tools and solutions consider the data processed in each phase as the only parameter when building and implementing such models. Moreover, a larger data size leads to more map tasks and hence more workload for the underlying local disk, which increases the execution time of a map task due to increased disk contention. In other words, such models are difficult to apply to applications that have high diversity in input data, and may yield poor prediction of the performance and execution of applications, which may directly impact resource utilization and result in system overhead.
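The relationship between data size, number of map tasks, and map waves can be made concrete with a short Python sketch. It assumes the usual reading of these terms, which is not spelled out above: one map task per input split, and one wave per round of tasks that fills all available map slots. The helper names, the 128 MB split size, and the cluster dimensions are hypothetical values chosen for illustration.

```python
import math


def map_task_count(input_mb: float, split_mb: float) -> int:
    """Number of map tasks, assuming one task per input split."""
    if split_mb <= 0:
        raise ValueError("split_mb must be positive")
    return math.ceil(input_mb / split_mb)


def map_waves(num_maps: int, nodes: int, slots_per_node: int) -> int:
    """Rounds of map tasks needed when tasks outnumber the available slots."""
    total_slots = nodes * slots_per_node
    if total_slots <= 0:
        raise ValueError("cluster must have at least one map slot")
    return math.ceil(num_maps / total_slots)


if __name__ == "__main__":
    # Assumed cluster: 2 nodes with 2 concurrent map slots each.
    for input_mb in (1024, 10240):
        maps = map_task_count(input_mb, split_mb=128)
        waves = map_waves(maps, nodes=2, slots_per_node=2)
        print(f"{input_mb} MB -> {maps} map tasks, {waves} waves")
```

Under these assumptions, growing the input from 1 GB to 10 GB raises the number of map tasks from 8 to 80 and the number of waves from 2 to 20, so each local disk serves many more reads. A model keyed on data size alone does not capture this effect.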