With the spread of cloud computing, distributed processing systems are used in which a plurality of servers process, in a distributed manner, large volumes of data stored in a cloud system. Hadoop (registered trademark), which uses the Hadoop Distributed File System (HDFS) and MapReduce as its fundamental technologies, is known as one such distributed processing system.
HDFS is a file system that stores data across a plurality of servers in a distributed manner. MapReduce is a mechanism that performs distributed processing on data in HDFS in units of tasks and that executes Map processes, Shuffle sort processes, and Reduce processes.
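As an illustration of the three phases named above, the following is a minimal single-process sketch that counts words. It is not Hadoop code: the function names `map_phase`, `shuffle_sort`, and `reduce_phase` and the sample records are assumptions made only for this example.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (key, value) pair for every word in every record."""
    for record in records:
        for word in record.split():
            yield word, 1


def shuffle_sort(pairs):
    """Shuffle sort: group the emitted values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: fold the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped}


# Hypothetical input records; in Hadoop these would be read from HDFS.
records = ["a b a", "b c"]
counts = reduce_phase(shuffle_sort(map_phase(records)))
```

In an actual cluster the Map and Reduce functions run as separate tasks on different slave nodes, and the grouping step moves data between nodes; the sketch only shows the data flow.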
In distributed processing using MapReduce, tasks related to the Map processes or the Reduce processes are assigned to a plurality of slave nodes, and each slave node then executes its assigned tasks in a distributed manner. For example, a master server assigns Map tasks to the plurality of slave nodes, and each slave node executes the Map task assigned to it. In a Map task, a Partitioner executed in each slave node calculates a hash value of a key and, on the basis of the calculated value, decides the Reduce task to which the data is distributed.
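The hash-based decision made by the Partitioner can be sketched as follows. This is a minimal illustration, not Hadoop's own partitioner: the function name `partition` is hypothetical, and `zlib.crc32` is used here only as a stand-in hash that gives the same value on every run.

```python
import zlib


def partition(key: str, num_reduce_tasks: int) -> int:
    """Return the index of the Reduce task that receives this key.

    The key is hashed and the hash is taken modulo the number of
    Reduce tasks, so every occurrence of a key goes to the same task.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


# Hypothetical keys distributed over four Reduce tasks.
destinations = {key: partition(key, 4) for key in ["bank", "shop", "park"]}
```

Because the destination depends only on the hash of the key, the number of keys per Reduce task tends to be even, but nothing in this calculation accounts for how much data or processing each key carries.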
In this way, Reduce tasks are assigned to the slave nodes equally by using a hash function or the like; however, the throughput of the Reduce tasks is not always equal, because, for example, the amount of data to be reduced differs for each key associated with a Reduce task.
For example, even if distribution keys are distributed equally among the slave nodes, the throughput differs for each Reduce task. Namely, the processing time of a Reduce task, which is the unit of processing, differs among Reduce slots, and thus the overall processing time may be extended. Because the completion time of the process differs among the slave nodes in this way, the completion of the overall job, which is constituted by several tasks, depends on the completion of the process performed by the slave node whose performance is the lowest.
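The dependence of the overall job on the slowest slot can be made concrete with a small calculation. The sketch below assumes that each Reduce slot processes its tasks one after another and that the slots run in parallel; the function name and the processing times are hypothetical values chosen for illustration.

```python
def job_completion_time(slot_task_times):
    """Return the time at which the last Reduce slot finishes.

    slot_task_times holds one list per Reduce slot, each containing the
    processing times of the Reduce tasks assigned to that slot.
    """
    return max(sum(task_times) for task_times in slot_task_times)


# Every slot receives two tasks, but one task takes far longer.
slots = [[10, 10], [10, 40], [10, 10]]
completion = job_completion_time(slots)
average = sum(sum(task_times) for task_times in slots) / len(slots)
```

With these assumed values the slots finish after 20, 50, and 20 time units, so the job completes at 50 even though the average load per slot is 30.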
For this reason, as a technology for adjusting the Reduce tasks assigned to each slave node, there is, for example, a known technology that investigates the number of appearances of each key by sampling input data and assigns, in advance, Reduce tasks that have different throughput.
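One way such sampling-based advance assignment might look is sketched below. This is not the method of the cited documents: the systematic sampling (every n-th record) and the greedy rule that gives the most frequent key to the least loaded Reduce task are assumptions made for illustration, as are the function names.

```python
import heapq
from collections import Counter


def estimate_key_counts(keys, sample_every=10):
    """Count key appearances in a sample that takes every n-th record."""
    return Counter(keys[::sample_every])


def assign_keys(key_counts, num_reduce_tasks):
    """Assign keys to Reduce tasks, most frequent key first.

    Each key goes to the Reduce task with the smallest estimated load
    so far, which evens out the estimated number of records per task.
    """
    heap = [(0, task_id) for task_id in range(num_reduce_tasks)]
    heapq.heapify(heap)
    assignment = {}
    ordered = sorted(key_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for key, count in ordered:
        load, task_id = heapq.heappop(heap)
        assignment[key] = task_id
        heapq.heappush(heap, (load + count, task_id))
    return assignment


# Hypothetical estimated counts for four keys and two Reduce tasks.
assignment = assign_keys({"a": 6, "b": 3, "c": 2, "d": 1}, 2)
```

In this example key "a" is placed alone on one Reduce task and the three less frequent keys share the other, giving an estimated load of 6 records on each. A key that does not appear in the sample receives no entry in `assignment` and would have to be handled separately.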
Patent Document 1: Japanese Laid-open Patent Publication No. 2012-190078
Patent Document 2: International Publication Pamphlet No. WO 2010/114006
Patent Document 3: Japanese Laid-open Patent Publication No. 2013-235515
However, even if the number of appearances of each key is already known from sampling or the like and the Reduce tasks are appropriately assigned in advance, the processing time taken to process the data associated with the keys in the Reduce tasks may differ due to various factors. In this case, the processing time differs among the Reduce slots and thus the overall processing time is extended.
Furthermore, if the number of appearances of each key is estimated from a sample or from past data in order to reduce the time spent investigating the input data, the processing time is extended by imbalance resulting from estimation error or by the handling of keys that were not present at the time of estimation.