Distributed data processing systems have been introduced to process large numbers of large data sets that are in a form difficult to use (sometimes referred to as “messy data”) using physically distributed computing nodes. To process such messy data sets efficiently and rapidly, distributed data processing systems employ various types of parallel technologies. MapReduce is one of the parallel technologies employed to improve performance and efficiency in data processing.
MapReduce was introduced by Google to support distributed data processing. MapReduce is a software framework for processing highly distributable problems across huge data sets using a large number of computers (nodes), collectively referred to as a “cluster” if all nodes use the same hardware, or a “grid” if the nodes use different hardware. Computational processing may occur on data stored either in a file system or in a database. In the “Map” step, a master node receives an input problem, partitions the problem into smaller sub-problems, and distributes the sub-problems to worker nodes. A worker node may partition its sub-problem again in turn, leading to a multi-level tree structure. Each worker node processes its sub-problem and passes the answer back to its master node. In the “Reduce” step, the master node collects and combines the answers to all the sub-problems to form an output solution to the original problem.
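The master and worker roles described above can be illustrated with a minimal sketch. It assumes a toy problem (summing the squares of a list of numbers) and uses threads as stand-ins for worker nodes; the names `split`, `worker`, and `master` are hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split(problem, parts):
    """Master: partition the input problem into roughly equal sub-problems."""
    size = max(1, -(-len(problem) // parts))  # ceiling division, at least 1
    return [problem[i:i + size] for i in range(0, len(problem), size)]


def worker(sub_problem):
    """Worker: solve one sub-problem and return the answer to the master."""
    return sum(x * x for x in sub_problem)


def master(problem, workers=4):
    # "Map" step: partition the problem and distribute the sub-problems.
    # Threads are an assumption here; they stand in for worker nodes.
    sub_problems = split(problem, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        answers = list(pool.map(worker, sub_problems))
    # "Reduce" step: collect and combine the answers into one output solution.
    return sum(answers)


result = master(list(range(10)))
print(result)
```

A multi-level tree would follow if `worker` itself called `split` and delegated to further workers; the sketch keeps a single level for brevity.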
As a software framework for processing messy data, MapReduce uses a large number of computing nodes to process the messy data into a useful format. MapReduce may include an input reader, a map function, a partition function, a reduce function, and an output writer. The input reader may divide input data into a plurality of data blocks and assign one data block to each map function. Each map function may process the assigned data block and generate intermediate data. The partition function may allocate each map function to a particular reducer, so that the intermediate data generated by that map function is delivered to that reducer. The reducer may be a computing node performing the reduce function. The reducer may process the intermediate data of its allocated map function and generate output data.
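The five components listed above can be sketched as a single-process word count. This is an illustrative sketch under stated assumptions, not a definitive implementation: all function names are hypothetical, the data blocks are groups of text lines, and the partition function here routes intermediate data by key (a common choice in MapReduce implementations, and an assumption of this sketch).

```python
from collections import defaultdict


def input_reader(text, block_size):
    """Divide the input data into data blocks (here: groups of lines)."""
    lines = text.splitlines()
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_function(block):
    """Process one data block and generate intermediate (key, value) pairs."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def partition_function(key, num_reducers):
    """Choose the reducer that receives the intermediate data for a key."""
    # Byte sum is deterministic across runs, unlike built-in hash() on str.
    return sum(key.encode("utf-8")) % num_reducers


def reduce_function(pairs):
    """Combine the intermediate pairs given to one reducer."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def output_writer(partial_results):
    """Merge the per-reducer results into one sorted output."""
    merged = {}
    for partial in partial_results:
        merged.update(partial)  # keys never overlap: each key has one reducer
    return dict(sorted(merged.items()))


def run(text, block_size=2, num_reducers=3):
    reducer_inputs = [[] for _ in range(num_reducers)]
    for block in input_reader(text, block_size):
        for key, value in map_function(block):
            index = partition_function(key, num_reducers)
            reducer_inputs[index].append((key, value))
    return output_writer(reduce_function(pairs) for pairs in reducer_inputs)


print(run("a b\nb c\nc c"))
```

In a distributed deployment each call to `map_function` and `reduce_function` would run on a separate computing node; the sketch runs them sequentially to keep the data flow visible.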
The map function may store its intermediate data at a storage node of a distributed memory cluster that includes a plurality of storage nodes. The map function may randomly select one of the storage nodes to store the intermediate data. As a result, many intermediate data portions may become concentrated at one particular storage node. Furthermore, a reducer is allocated to a map function without considering the physical locations of the associated nodes or the network congestion between a storage node and the reducer. Storing intermediate data and allocating reducers in this manner may degrade the overall performance of the distributed data processing system.
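The concentration effect of random storage node selection can be illustrated with a small simulation. The sketch below is hypothetical: the portion and node counts are arbitrary, the function names are invented for illustration, and uniform random selection is assumed. It counts how many intermediate data portions land on each storage node and reports how far the busiest node is above the average.

```python
import random
from collections import Counter


def random_placement(num_portions, num_storage_nodes, seed=0):
    """Assign each intermediate data portion to a randomly chosen storage node."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    loads = Counter({node: 0 for node in range(num_storage_nodes)})
    for _ in range(num_portions):
        loads[rng.randrange(num_storage_nodes)] += 1
    return loads


def imbalance(loads):
    """Ratio of the busiest node's load to the mean load (1.0 means even)."""
    mean = sum(loads.values()) / len(loads)
    if mean == 0:
        return 1.0  # nothing stored, so nothing is concentrated
    return max(loads.values()) / mean


loads = random_placement(num_portions=40, num_storage_nodes=8)
print(dict(loads), round(imbalance(loads), 2))
```

A ratio noticeably above 1.0 indicates that some storage node holds more than its share of portions. The sketch does not model node locations or network congestion, which the paragraph above identifies as a further source of degradation.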