MapReduce is a well-known programming model and implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. MapReduce has become ubiquitous for jobs that process large volumes of data. An important aspect of MapReduce is that clusters often comprise hundreds or thousands of machines, while they are used for processing infrequent batch and interactive jobs in parallel across multiple machines. The large number of machines in a cluster consumes a high amount of power, so it is important to utilize the machines optimally for the specific task.
Known techniques for improving the energy efficiency of MapReduce projects include reducing idle periods on nodes by keeping fewer nodes active in the cluster. This is achieved by using job consolidation, data redistribution and node reconfiguration. The present techniques alter the design of one or both of the two underlying frameworks, i.e. the Hadoop Distributed File System (HDFS) cluster and the MapReduce programming model. The current techniques for improving the energy efficiency of MapReduce can be classified as MapReduce programming model modification techniques, HDFS cluster modification techniques, and node or task classification and frequency scaling techniques.
The MapReduce programming model modification techniques perform workload consolidation or distribution based on workload characteristics, hardware characteristics, or both. A simple approach is to consolidate the workload on fewer servers and put the idle servers in sleep mode. To realize this, multiple dynamic workload placement and Virtual Machine (VM) consolidation techniques have been used. The HDFS cluster modification techniques work by consolidating the data in use on fewer active nodes so that the other nodes can be put in sleep mode. For this, the data placement is altered, i.e. the data distribution strategy of the cluster is modified. The data is segregated either to ensure that one replica remains available or to ensure that critical data remains available. In node or task classification techniques, the nodes are classified based on their Central Processing Unit (CPU) speed and are assigned map or reduce tasks accordingly. Map tasks are considered to be CPU intensive and are scheduled on faster nodes, while reduce tasks are scheduled on low-speed, low-power nodes. In frequency scaling techniques, the CPU frequency is scaled based on the type of task running on the node. A high frequency is maintained for map and reduce tasks, and a low frequency for shuffle tasks and idle durations.
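By way of illustration only, the consolidation of workload on fewer servers may be sketched as a bin-packing problem. The following minimal Python sketch assumes that each workload is expressed as a percentage of one server's capacity and uses a first-fit decreasing heuristic. The function name consolidate, the capacity value and the sample loads are hypothetical and are not taken from any particular known technique.

```python
def consolidate(loads, capacity=100):
    """Pack workloads onto as few servers as a first-fit decreasing pass finds.

    loads: workload sizes as a percentage of one server's capacity (assumed unit).
    Returns one list of loads per active server.
    """
    servers = []  # each entry holds the loads placed on one active server
    for load in sorted(loads, reverse=True):  # place the largest workloads first
        for placed in servers:
            if sum(placed) + load <= capacity:
                placed.append(load)  # fits on an already active server
                break
        else:
            servers.append([load])  # no active server has room, so wake another
    return servers


# Hypothetical starting point: six servers, each running one workload.
loads = [50, 20, 40, 30, 10, 50]
active = consolidate(loads)
print(f"{len(active)} servers stay active, {len(loads) - len(active)} can sleep")
```

In this sketch the servers that receive no workload are the candidates for sleep mode. The sketch ignores data locality, which matters for MapReduce because computation is bound to the machine that holds the data.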
The above-mentioned techniques have drawbacks. The workload redirection and server shutdown techniques work well for workloads that are not data intensive and that access only a small amount of data from remote databases, and are therefore not bound to any machine. However, these techniques are often not easily applied to MapReduce applications because of their distributed nature and because the computation is bound to a machine. Thus, the existing methods described above trade off either performance or data availability to achieve energy efficiency. The frequency scaling methods take a blanket approach to scheduling map and reduce tasks and scale the frequency based on the heuristic nature of those tasks. These heuristics do not always yield optimal results. Also, they do not work well in multi-job scenarios where map, reduce and shuffle tasks of different jobs may be running in parallel. In such cases, the frequency scaling approach significantly impacts the performance of the project.
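The multi-job drawback may be illustrated with a minimal Python sketch. The sketch assumes, as a simplification, that a node runs at a single CPU frequency at any given time, and it uses the task-type heuristic described above (high frequency for map and reduce tasks, low frequency for shuffle tasks and idle durations). The names PREFERRED_FREQUENCY, preferred_frequencies and has_conflict are hypothetical.

```python
# Preferred CPU frequency per task type, following the heuristic described above.
PREFERRED_FREQUENCY = {
    "map": "high",
    "reduce": "high",
    "shuffle": "low",
    "idle": "low",
}


def preferred_frequencies(running_tasks):
    """Return the set of frequencies requested by the tasks on one node."""
    if not running_tasks:
        return {PREFERRED_FREQUENCY["idle"]}
    return {PREFERRED_FREQUENCY[task] for task in running_tasks}


def has_conflict(running_tasks):
    """True when tasks on the same node ask for different frequencies."""
    return len(preferred_frequencies(running_tasks)) > 1


# Single job in its shuffle phase: the heuristic has one clear answer.
print(has_conflict(["shuffle", "shuffle"]))

# Two jobs sharing a node, one in its map phase and one in its shuffle phase:
# a single node-wide frequency cannot satisfy both requests.
print(has_conflict(["map", "shuffle"]))
```

When a conflict exists, choosing the low frequency slows the map task, while choosing the high frequency forgoes the energy saving expected for the shuffle task.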