Field of the Invention
The present invention relates to the MapReduce architecture, and more specifically, to a method and apparatus for resource management in the MapReduce architecture and a MapReduce architectural system having such an apparatus.
Description of Related Art
The MapReduce architecture is a programming model for parallel computation on large-scale data sets (for example, larger than 1 TB). Under the control of a master node, MapReduce may distribute large-scale operations on data sets to computing nodes over a network for distributed processing, so as to improve execution speed and efficiency for the large-scale data sets. MapReduce may divide a MapReduce job, such as word frequency statistics on a large amount of data, into multiple Map tasks and multiple Reduce tasks, wherein the output of the Map tasks is input to the Reduce tasks.
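The division of a word frequency job into Map tasks and Reduce tasks can be illustrated with a minimal single-process sketch. It is not a distributed implementation; the function names (map_task, shuffle, reduce_task, word_frequency) are hypothetical and chosen only for illustration, and each string in the input list stands in for one input split.

```python
from collections import defaultdict


def map_task(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(map_outputs):
    """Group the values emitted by all Map tasks by key (word)."""
    grouped = defaultdict(list)
    for pairs in map_outputs:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_task(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_frequency(splits):
    """Run one Map task per split, then one Reduce call per distinct word."""
    map_outputs = [map_task(split) for split in splits]  # output of the Map tasks
    grouped = shuffle(map_outputs)                        # becomes input to the Reduce tasks
    return dict(reduce_task(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_frequency(["the map task", "the reduce task"]))
```

In an actual MapReduce system the Map tasks and Reduce tasks would run on different computing nodes under the control of the master node; here they are ordinary function calls so that the data flow from Map output to Reduce input is easy to follow.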
Currently, the MapReduce architecture has almost 200 systematic parameters. A user may set some or all of these systematic parameters to specify the resources available for processing a MapReduce job and how these resources are to be used. However, the settings of these systematic parameters are determined manually, based for example on the experience of the user, without regard to the processing capacity and/or resource situation of a node. The systematic parameters obtained in this way are usually not optimal. For example, systematic parameters set by a user may lead to issues such as low processing efficiency of a node.
For instance, suppose that the input split to be processed by a Map task in MapReduce is 1000 MB and that the corresponding output data is 300 MB. If the Map task is assigned 100 MB of memory after occupying a Map slot, then, because the amount of its output data is larger than the amount of memory, the records obtained from the Map operation must first be spilled to a disk as intermediate results. The Map task then fetches the intermediate results from the disk in three passes, sorts and merges them, and spills a final Map output result to the disk for access by the Reduce tasks.
In this case, because the input split of the Map task is too large, the Map output result (300 MB) exceeds the memory (100 MB) available for processing the Map task. As a result, the data obtained by performing the Map operation on the input data must first be spilled to the disk, and a final Map output result can be obtained only after repeated read/write operations on the disk, which may severely degrade processing efficiency.
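The arithmetic behind this example can be sketched as follows. The model is a deliberate simplification and an assumption of this sketch: the whole memory amount is treated as the spill buffer, each full buffer produces one intermediate spill, and every spilled byte is written once and read back once before the final output is written. The function name estimate_spill is hypothetical.

```python
import math


def estimate_spill(output_mb, memory_mb):
    """Estimate intermediate spills and extra disk I/O for one Map task.

    Simplified model (an assumption, not a real framework's behavior):
    if the Map output fits in memory, nothing is spilled before the final
    write; otherwise one intermediate spill is produced per full buffer,
    and each spilled megabyte is written once and read back once.
    Returns (number_of_spills, extra_disk_io_mb).
    """
    if memory_mb <= 0:
        raise ValueError("memory_mb must be positive")
    if output_mb <= memory_mb:
        return 0, 0
    spills = math.ceil(output_mb / memory_mb)
    extra_io_mb = 2 * output_mb  # intermediate write plus read-back
    return spills, extra_io_mb


if __name__ == "__main__":
    spills, extra_io = estimate_spill(output_mb=300, memory_mb=100)
    print(f"intermediate spills: {spills}, extra disk I/O: {extra_io} MB")
```

With the figures from the example (300 MB of Map output and 100 MB of memory), this model gives three intermediate spills and 600 MB of additional disk traffic on top of the 300 MB final Map output; a Map output that fits in memory avoids the intermediate spills entirely.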