Map-reduce is a data processing system that supports distributed processing of an input across a cluster of computing nodes. Traditionally, the map step of the map-reduce data processing system involves a root level computing node dividing an input into smaller sub-problems, and distributing the smaller sub-problems to a cluster of lower level computing nodes for processing. Each lower level computing node may process a corresponding sub-problem and return a solution to the root level computing node. In the reduce step of the map-reduce data processing system, the root level computing node may collect the solutions from the lower level computing nodes and combine the solutions to form an output that is the solution to the input. In some instances, there may be multiple lower levels of computing nodes.
However, current map-reduce data processing systems are no longer limited to performing map operations and then reduce operations. Instead, current map-reduce data processing system use logical graphs to process operations. As a result, current map-reduce data processing systems are highly dependent on an initial selection of an appropriate execution plan. The selection of an appropriate execution plan may include the selection of properties for the map-reduce code used, properties of the data to be processed, and properties related to the interactions of the data and the code. However, such properties may be difficult to estimate due to the highly distributed nature of map-reduce data processing systems and the fact that the systems offer users the ability to specify arbitrary code as operations on the data. As a result, today's map-reduce data processing systems may use fixed apriori estimates to choose execution plans. In many instances, the use of such fixed a prior estimates may result in poor data processing performance.