The MapReduce mechanism is a software framework for distributed computing, proposed by Google, that supports parallel computation over large-scale data sets. The concepts and main ideas of “Map” and “Reduce” originate from functional programming languages. Current MapReduce middleware implementations require an application developer to specify a map function, which maps a set of key-value pairs to new key-value pairs (referred to as intermediate key-value pairs), and a reduce function, which further processes the intermediate key-value pairs output by the map function. In the map process, the input data are automatically partitioned into M input data splits, and these splits are distributed to multiple machines for parallel processing. In the reduce process, a partition function partitions the intermediate key-value pairs into R splits based on the intermediate key names (e.g., hash(key) mod R), and the R splits are also distributed to multiple machines. The number of partitions R and the partition function may be specified by users. The MapReduce mechanism achieves scalability by distributing operations on large-scale data sets to multiple nodes in a network.
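The roles of the map function, the reduce function, and the partition function can be illustrated with a minimal single-process sketch. It is only an illustration under stated assumptions: the word-count job, the names map_fn, reduce_fn, partition, and run_mapreduce, and the use of CRC32 as the hash are all hypothetical choices, not part of any particular MapReduce middleware.

```python
import zlib
from collections import defaultdict


def map_fn(key, value):
    """Map one input pair (document name, text) to intermediate (word, 1) pairs."""
    for word in value.split():
        yield (word, 1)


def reduce_fn(key, values):
    """Reduce all values collected for one intermediate key to one output pair."""
    return (key, sum(values))


def partition(key, R):
    """hash(key) mod R. CRC32 is an assumption, chosen so results are stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % R


def run_mapreduce(inputs, R):
    """Run map, partition, and reduce in one process (no real distribution)."""
    # One dictionary per partition, mapping intermediate key -> list of values.
    partitions = [defaultdict(list) for _ in range(R)]
    for in_key, in_value in inputs:
        for mid_key, mid_value in map_fn(in_key, in_value):
            partitions[partition(mid_key, R)][mid_key].append(mid_value)

    output = {}
    for part in partitions:  # each partition would go to one reduce task
        for mid_key in sorted(part):
            out_key, out_value = reduce_fn(mid_key, part[mid_key])
            output[out_key] = out_value
    return output


result = run_mapreduce([("doc1", "a b a"), ("doc2", "b c")], R=2)
print(result)
```

Because all pairs with the same key name hash to the same partition, each key is reduced exactly once regardless of the value chosen for R.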
Currently, the MapReduce mechanism is considered an important program design specification for building data centers and has a very wide range of applications. Typical applications include distributed grep, distributed sorting, web access log analysis, inverted index building, document clustering, machine learning, statistical machine translation, and so on.
FIG. 1 shows a schematic diagram of an existing MapReduce architecture, in which a job tracker and multiple task trackers are the two most basic services. Generally, the job tracker is deployed on a master node. It receives jobs submitted by users, schedules all the jobs, manages all the task trackers, divides each submitted job into multiple tasks, including map tasks and reduce tasks, and is responsible for distributing the tasks to the corresponding task trackers. A task, as the basic unit of execution, is distributed to an appropriate task tracker for execution. The task trackers poll the job tracker to acquire tasks. A task tracker executing a map task is a map task tracker, and a task tracker executing a reduce task is a reduce task tracker. While executing tasks, each task tracker reports the task states to the job tracker, which helps the job tracker keep track of the overall job execution.
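The interaction between the job tracker and the task trackers can be sketched as a small single-threaded simulation. The class name JobTracker, its methods submit_job, poll, report, and job_done, and the tracker names are hypothetical. The sketch also assumes that a task finishes as soon as it is handed out, so the requirement that reduce tasks wait for map outputs is satisfied here only by queue order.

```python
from collections import deque


class JobTracker:
    """Toy job tracker: divides jobs into tasks and hands them to polling trackers."""

    def __init__(self):
        self.pending = deque()  # tasks not yet handed out
        self.status = {}        # task -> last reported state

    def submit_job(self, job_id, num_map, num_reduce):
        # Divide the job into map tasks followed by reduce tasks.
        for i in range(num_map):
            self._add((job_id, "map", i))
        for i in range(num_reduce):
            self._add((job_id, "reduce", i))

    def _add(self, task):
        self.pending.append(task)
        self.status[task] = "pending"

    def poll(self, tracker_name):
        """Called by a task tracker to acquire a task; returns None if none is left."""
        if not self.pending:
            return None
        task = self.pending.popleft()
        self.status[task] = "running on " + tracker_name
        return task

    def report(self, task, state):
        """Called by a task tracker to report the state of a task."""
        self.status[task] = state

    def job_done(self, job_id):
        return all(s == "done" for t, s in self.status.items() if t[0] == job_id)


jt = JobTracker()
jt.submit_job("job1", num_map=3, num_reduce=2)
executed = []
while not jt.job_done("job1"):
    for name in ["tt1", "tt2"]:      # two task trackers polling in turn
        task = jt.poll(name)
        if task is not None:
            executed.append((name, task))
            jt.report(task, "done")  # execution itself is omitted
print(executed)
```

The pull-based design means the job tracker never needs to know in advance which tracker is idle: a tracker that has capacity simply asks for work.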
Specifically, an input file is uploaded to a distributed file system deployed on the data center and is partitioned into M input data splits according to a partition rule. The size of each split is generally from 16 to 64 MB. The program files required for job execution, including job configuration files (including a map function, a combine function, a reduce function, etc.) and the like, are also uploaded to the distributed file system. When receiving a job request from a client program, the job tracker divides the job into multiple tasks, namely M map tasks and R reduce tasks, and is responsible for distributing the map tasks or reduce tasks to idle task trackers.
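As a worked example of how M follows from the split size, the sketch below uses a purely size-based partition rule. That rule and the helper names num_splits and split_input are assumptions for illustration; a real partition rule would typically also respect record boundaries.

```python
MB = 1024 * 1024


def num_splits(total_size, split_size):
    """Number of splits M needed to cover total_size bytes (ceiling division)."""
    return -(-total_size // split_size)


def split_input(data, split_size):
    """Cut data into consecutive chunks of at most split_size bytes."""
    return [data[i:i + split_size] for i in range(0, len(data), split_size)]


# A 1024 MB input with 64 MB splits yields M = 16 map tasks.
m_large = num_splits(1024 * MB, 64 * MB)

# A tiny stand-in input: 150 bytes cut into splits of at most 64 bytes.
splits = split_input(b"x" * 150, 64)
print(m_large, [len(s) for s in splits])
```

The last split is usually shorter than the others, which is why M is obtained by rounding up rather than by plain division.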
Next, the map task trackers read the corresponding input data splits based on the distributed tasks and parse them to obtain input key-value pairs. Then, the map task trackers invoke the map function (e.g., map( )) to map the input key-value pairs into intermediate key-value pairs, and the intermediate key-value pairs generated by the map function are buffered in memory. For the buffered key-value pairs, the combine function is invoked to aggregate all key values for each key name, and the partition function is invoked to partition the buffered key-value pairs into R splits; the R splits are then periodically written into R regions of the local disk. After the map tasks are completed, the map task trackers inform the job tracker of the task completion and of the position information of the intermediate key-value pairs on their local disks.
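The map-side steps described above (parse, map, buffer, combine, partition into R regions) can be sketched as follows. This is a simplified illustration under stated assumptions: the word-count functions, the name run_map_task, and the CRC32-based partition function are hypothetical, the line index stands in for the input key, and the R regions are kept as in-memory dictionaries rather than written to a local disk.

```python
import zlib
from collections import defaultdict


def map_fn(line_no, line):
    """Map one input pair (line number, line text) to intermediate (word, 1) pairs."""
    for word in line.split():
        yield (word, 1)


def combine_fn(key, values):
    """Aggregate all buffered values of one key name on the map side."""
    return sum(values)


def partition(key, R):
    """hash(key) mod R, with CRC32 as an assumed stable hash."""
    return zlib.crc32(key.encode("utf-8")) % R


def run_map_task(split_lines, R):
    """Process one input data split and return its R output regions."""
    # Parse the split into input pairs, map them, and buffer the results.
    buffer = defaultdict(list)
    for line_no, line in enumerate(split_lines):
        for key, value in map_fn(line_no, line):
            buffer[key].append(value)

    # Combine per key name, then place each key in its region.
    regions = [dict() for _ in range(R)]  # stand-in for R regions on local disk
    for key, values in buffer.items():
        regions[partition(key, R)][key] = combine_fn(key, values)
    return regions


regions = run_map_task(["a b a", "b c a"], R=3)
print(regions)
```

Combining before partitioning reduces the amount of intermediate data that the reduce side later has to read, since each map task emits at most one pair per key name.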
When the reduce task trackers receive the reduce tasks from the job tracker, they read the intermediate key-value pairs from the local disks of one or more map task trackers based on the position information, sort the intermediate key-value pairs by key name, and aggregate the key values that share the same key name. The reduce task trackers then invoke the reduce function (e.g., reduce( )) to reduce these intermediate key-value pairs and append the outputs of the reduce function to a final output file.
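The reduce-side steps (fetch, sort by key name, aggregate, reduce) can be sketched in the same style. The name run_reduce_task, the summing reduce_fn, and the nested-list layout of map_outputs are hypothetical; map_outputs[m][r] stands in for region r on the local disk of map task m.

```python
from itertools import groupby
from operator import itemgetter


def reduce_fn(key, values):
    """Reduce all values of one key name to a single output pair."""
    return (key, sum(values))


def run_reduce_task(map_outputs, r):
    """Run reduce task r over the r-th region of every map task's output."""
    # Fetch region r from each map task (a remote read in a real deployment).
    fetched = []
    for regions in map_outputs:
        fetched.extend(regions[r])

    # Sort by key name so that equal keys become adjacent.
    fetched.sort(key=itemgetter(0))

    # Aggregate the values of each key name and invoke the reduce function.
    output = []
    for key, group in groupby(fetched, key=itemgetter(0)):
        output.append(reduce_fn(key, [value for _, value in group]))
    return output


map_outputs = [
    [[("a", 2), ("c", 1)], [("b", 1)]],            # map task 0: regions 0 and 1
    [[("a", 1)],           [("b", 2), ("d", 1)]],  # map task 1: regions 0 and 1
]
out0 = run_reduce_task(map_outputs, 0)
out1 = run_reduce_task(map_outputs, 1)
print(out0, out1)
```

Sorting first is what allows a single pass with groupby to collect every value of a key name, even though the values arrive from different map tasks.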
When the existing MapReduce mechanism is used to process huge data sets, the overhead involved, e.g., data calculation overhead and data transfer overhead, is usually proportional to the size of the input data set. Therefore, as the input data set grows, these overheads grow as well. In addition, input data sets usually grow over time; for example, a Call Detail Record (CDR) data set in the telecommunication field and a web log data set of a website grow day by day. As a result, the accumulated data sets can soon reach a very large scale and continue to grow, so that MapReduce jobs over them require more and more time or resources. In the existing MapReduce mechanism, each time data are added to a data set, the whole data set is MapReduced again. However, in many cases, although the accumulated data set keeps growing, the delta added in a day or a week may be relatively small. That is, relatively little data is affected, so re-MapReducing the whole data set may waste a great deal of time and resources, and the waste increases as the data set grows.
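The scale of this waste can be made concrete with a short calculation. The figures (a 1000 GB accumulated data set growing by 10 GB per day) and the function names full_rerun_cost and new_fraction are hypothetical, and the sketch assumes, as stated above, that overhead is proportional to the size of the processed input.

```python
def full_rerun_cost(base_gb, delta_gb, day):
    """Data volume (GB) processed when the whole data set is re-run on a given day."""
    return base_gb + delta_gb * day


def new_fraction(base_gb, delta_gb, day):
    """Share of that day's processed volume that is newly added data."""
    return delta_gb / full_rerun_cost(base_gb, delta_gb, day)


base_gb, delta_gb, days = 1000, 10, 30

day30_volume = full_rerun_cost(base_gb, delta_gb, days)      # volume on day 30
day30_new = new_fraction(base_gb, delta_gb, days)            # share that is new
total_volume = sum(full_rerun_cost(base_gb, delta_gb, d) for d in range(1, days + 1))
total_new = delta_gb * days                                  # data actually added

print(day30_volume, round(day30_new, 4), total_volume, total_new)
```

With these assumed figures, the run on day 30 processes 1300 GB of which less than 1% is new, and over the 30 days a total of 34650 GB is processed in order to account for only 300 GB of added data.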