The present disclosure relates to scheduling reduce tasks in a MapReduce framework, and more specifically, to minimizing the cost of shuffling intermediate data after the map tasks finish.
Intermediate data shuffling is a bottleneck in the MapReduce framework, especially in systems where all of the computing nodes start shuffling the intermediate data immediately after all of the map tasks finish. In these scenarios, the intermediate data resides mostly in memory. At that time, network input/output traffic bursts, so the shuffling of the intermediate data takes a long time. For a large cluster (e.g., hundreds or thousands of nodes) that does not have high network bandwidth between the nodes, the poor shuffling performance is a pain point for users and administrators.