Processing of Big Data is typically performed by large computer clusters using parallel and distributed computing techniques. Resource allocation, i.e., determining how many task resources to provision for a job and how to divide the processing among the nodes of the cluster, is often a challenge in such processing, since performance depends on the specifics of the data and the nature of the processing performed on that data. In addition to performance, energy efficiency is an important criterion in determining resource allocation, which further complicates the problem of finding an optimal resource allocation.
U.S. Pat. No. 8,935,702 discloses a technique for parallel processing resource provisioning using collected performance data from job runs to compute an optimal number of nodes. A correlation coefficient is calculated for each performance cause in a model responsive to the cause and effect performance data.
The 2009 Marquette University MS thesis by Thomas S. Wirtz entitled “Energy Efficient Data-Intensive Computing With MapReduce” provides a metric, energy performance efficiency, to assess energy performance in MapReduce. The metric is useful for identifying the number of workers that optimizes energy efficiency.
These techniques, however, do not teach or suggest calculating an exact optimal number of tasks for a job that provides the best trade-off point between performance and energy efficiency versus task resources, where that point is located on a runtime elbow curve fitted from sampled executions on the target cluster.
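By way of illustration only, the notion of a runtime elbow curve fitted from sampled executions can be sketched as follows. The sketch is not taken from the cited references and should not be read as the claimed method. It assumes a hypothetical runtime model T(n) = a + b/n, where n is the number of tasks, and it assumes that the elbow is the point on the fitted curve farthest from the straight line joining the curve's endpoints after both axes are normalized. The function names `fit_runtime_curve` and `find_elbow` and the sample values are likewise assumptions introduced for the example.

```python
def fit_runtime_curve(samples):
    """Least-squares fit of the assumed model T(n) = a + b / n.

    samples: list of (task_count, measured_runtime) pairs from sampled runs.
    Returns the fitted coefficients (a, b).
    """
    xs = [1.0 / n for n, _ in samples]  # the model is linear in 1/n
    ys = [t for _, t in samples]
    m = len(samples)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def find_elbow(a, b, n_min, n_max):
    """Return the task count at the elbow of the fitted curve.

    Assumption: the elbow is the point farthest from the chord joining the
    curve's endpoints once task count and runtime are scaled to [0, 1].
    """
    if n_max <= n_min or b <= 0:
        raise ValueError("need n_max > n_min and a decreasing curve (b > 0)")
    t_hi = a + b / n_min  # runtime with the fewest tasks
    t_lo = a + b / n_max  # runtime with the most tasks
    best_n, best_gap = n_min, -1.0
    for n in range(n_min, n_max + 1):
        x = (n - n_min) / (n_max - n_min)
        y = (a + b / n - t_lo) / (t_hi - t_lo)
        # The normalized chord runs from (0, 1) to (1, 0), i.e. x + y = 1;
        # the gap below it is proportional to the perpendicular distance.
        gap = 1.0 - x - y
        if gap > best_gap:
            best_n, best_gap = n, gap
    return best_n


# Hypothetical sampled executions: (number of tasks, runtime in seconds).
samples = [(1, 110.0), (2, 60.0), (4, 35.0), (8, 22.5), (16, 16.25)]
a, b = fit_runtime_curve(samples)
print(find_elbow(a, b, 1, 16))
```

Under these assumptions, task counts beyond the elbow yield progressively smaller runtime reductions per additional task, which is why the elbow serves as a candidate trade-off point. A different curve model or a different elbow criterion would generally yield a different task count.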