1. Field of the Invention
The present invention relates to a distributed parallel computing technology for processing a large amount of data.
2. Discussion of Related Art
MapReduce, which was developed by Google for data processing in information retrieval, is a programming model for parallel data processing in a distributed environment and provides a method for processing a large amount of data. By adopting the “map” and “reduce” function concepts of functional programming languages, MapReduce hides the distributed structure of the system and enables parallel programming in a general-purpose programming language.
MapReduce processing involves a map calculation operation, a reduce calculation operation, and a shuffle and sort operation (referred to as a “shuffle operation” below) that moves map calculation results to the reduce calculation operation. A map calculator reads a record and produces a record having a new key and value by filtering the read record or converting it into another value. The produced record is referred to as intermediate data and is stored in a local disk of the map calculator. A reduce calculator groups the values output by the map process according to the new key and then outputs the result of executing an aggregation operation on each group. The shuffle operation involves dividing the key bandwidth (that is, the range of keys) through partitioning, sorting the map calculation results and storing them in the local disk, and then transferring them via a network as input data of the reduce calculators.
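The three operations described above can be illustrated with a minimal single-process sketch. It is not the implementation of any particular MapReduce framework: the word-count job, the function names (`map_record`, `partition`, `shuffle`, `reduce_partition`, `run_job`), and the byte-sum partitioning rule are all assumptions chosen for illustration, and in-memory lists stand in for the local disk and the network.

```python
from collections import defaultdict


def map_record(record):
    """Map: convert one input record into (new key, value) pairs.

    Illustrative job: emit (word, 1) for every word in the record.
    """
    return [(word, 1) for word in record.split()]


def partition(key, num_reducers):
    """Assign a key to one reduce calculator.

    A byte sum is used only so the sketch is deterministic; it is a
    stand-in for whatever partitioning function a real system uses.
    """
    return sum(key.encode("utf-8")) % num_reducers


def shuffle(intermediate, num_reducers):
    """Shuffle: partition intermediate data by key, then sort each partition."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in intermediate:
        partitions[partition(key, num_reducers)].append((key, value))
    return [sorted(p) for p in partitions]


def reduce_partition(sorted_pairs):
    """Reduce: group values by key and aggregate them (here, by summing)."""
    totals = defaultdict(int)
    for key, value in sorted_pairs:
        totals[key] += value
    return dict(totals)


def run_job(records, num_reducers=2):
    """Run map, shuffle, and reduce in sequence and merge the reducer outputs."""
    intermediate = [pair for record in records for pair in map_record(record)]
    result = {}
    for part in shuffle(intermediate, num_reducers):
        result.update(reduce_partition(part))
    return result
```

For example, `run_job(["a b a", "b c"])` counts each word across both records. Because every key is assigned to exactly one partition, each reduce calculator can aggregate its keys independently of the others.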
In the shuffle operation, all of the reduce calculators simultaneously copy the map calculation results for their assigned key bandwidths to their own nodes, and thus the load on the network increases abruptly.
A combiner, positioned between a map calculator and a reduce calculator, is referred to as a “small reducer” and is optionally used to reduce the network load in the shuffle operation. When combiners are used, one combiner is executed for each map calculator that performs a map calculation; it receives the data calculated by the corresponding map calculator and performs a job corresponding to a designated function, thereby producing summarized intermediate data. The summarized intermediate data is transmitted to a reduce calculator and used as input data of the reduce operation. However, even when combiners are used, two problems that result from an increase in the number of nodes in a cluster cannot be avoided: the bandwidth difference between network layers and the network bottleneck between racks, that is, the inter-rack network bottleneck.
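The effect of a combiner on the amount of data moved in the shuffle operation can be shown with a small sketch. The word-count job, the function names, and the use of a record count as a proxy for network load are assumptions made for illustration; each inner list of `splits` stands for the input of one hypothetical map calculator.

```python
from collections import Counter


def map_record(record):
    """Map: emit (word, 1) for every word in the record (illustrative job)."""
    return [(word, 1) for word in record.split()]


def combine(pairs):
    """Combiner: summarize one map calculator's output before it is sent."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def reduce_pairs(pairs):
    """Reduce: aggregate all values received for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(splits, use_combiner):
    """Return (result, number of records sent to the reduce side)."""
    shuffled = []
    for split in splits:  # one split per hypothetical map calculator
        pairs = [p for record in split for p in map_record(record)]
        if use_combiner:
            pairs = combine(pairs)  # local pre-aggregation
        shuffled.extend(pairs)      # stands in for the network transfer
    return reduce_pairs(shuffled), len(shuffled)
```

With `splits = [["a a b", "a b"], ["b b c"]]`, both settings of `use_combiner` produce the same word counts, but the combiner halves the number of records passed to the reduce side (4 instead of 8). The sketch relies on the aggregation (a sum) being associative and commutative, so that applying it first per map calculator and then again in the reduce operation does not change the result. It models only the data volume, not the rack topology, so it does not address the inter-rack bottleneck described above.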